[GH-ISSUE #211] Please add the minimal VRAM for Ollama #40

Closed
opened 2026-03-02 16:47:15 +03:00 by kerem · 14 comments
Owner

Originally created by @alexislefebvre on GitHub (Dec 4, 2025).
Original GitHub issue: https://github.com/photoprism/photoprism-docs/issues/211

On this page: https://docs.photoprism.app/user-guide/ai/ollama-models/

We see:

> needs more VRAM

Could you please add a number, like 4 GB? So that users can know if their hardware will be able to handle it.

Related:

- https://github.com/photoprism/photoprism/discussions/5365
kerem 2026-03-02 16:47:15 +03:00
Author
Owner

@lastzero commented on GitHub (Dec 4, 2025):

Thanks for your note! That comment was misleading because it compared the "latest" versions of the two models, which require about 4 and 8 GB of VRAM, respectively. However, the Qwen3-VL model is also available in a smaller size that matches the "latest" Gemma 3 model (`4b`). Additionally, both models are available as Instruction Tuned variants, which are designed for instruction-following tasks. These variants should be better suited for caption and label generation, though it might also depend on your prompt and expectations.

| Model | Use Case | Notes |
|--------------|------------------------------------------------------------|-------------------------------------------------------------------------------------------|
| **Gemma 3** | Standard caption and label generation | Light, reliable JSON output; good default. |
| **Qwen3-VL** | Advanced vision and reasoning tasks (OCR, complex prompts) | Better visual grounding and multi-language support; available in many sizes and variants. |
Author
Owner

@alexislefebvre commented on GitHub (Dec 4, 2025):

The example mentions the RTX 4060, but the Ti models have 8 GB or 16 GB of VRAM: https://en.wikipedia.org/wiki/GeForce_RTX_40_series#RTX_4060_Ti_(8_and_16_GB_version)

The documentation may mention something like “(non Ti)” to avoid ambiguities.

Author
Owner

@lastzero commented on GitHub (Dec 4, 2025):

I tested it on a standard RTX 4060 with 8 GB of VRAM. If you have an RTX 4060 Ti with 16 GB, that's even better, though it likely won't make a significant difference. The prompt and options you use will have a much greater impact. For example, generating 3 labels might take 2.5 seconds, while generating 5 labels takes about 4 seconds. Therefore, I don't want to focus too much on hardware details.

Author
Owner

@alexislefebvre commented on GitHub (Dec 4, 2025):

I have a GTX 1060 with 6 GB of VRAM, I’m going to try Gemma 3 since it’s the lighter model.

Author
Owner

@lastzero commented on GitHub (Dec 4, 2025):

I suggest also trying the `qwen3-vl:4b-instruct` model, as shown in our documentation:

- https://docs.photoprism.app/user-guide/ai/ollama-models/#qwen3-vl-labels
- https://docs.photoprism.app/user-guide/ai/ollama-models/#qwen3-vl-caption

It's the same size as Gemma 3, but slightly more complicated to use - which is why we provide these examples. If you limit the number of labels to two or three and captions to one sentence, performance could be very close to Gemma 3.
Author
Owner

@alexislefebvre commented on GitHub (Dec 4, 2025):

Thanks, I will try later. Right now Ollama / Gemma 3 uses 4.7 GB of VRAM:

```shell
$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1060 6GB    Off |   00000000:01:00.0 Off |                  N/A |
| 38%   52C    P2            101W /  120W |    4717MiB /   6144MiB |     95%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     31053      C   /usr/bin/ollama                              4712MiB |
+-----------------------------------------------------------------------------------------+
```
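As a side note: headroom like this can be tracked without scanning the full table, using `nvidia-smi`'s documented `--query-gpu`/`--format=csv` flags. A minimal sketch; the helper names are illustrative and not part of PhotoPrism or Ollama:

```python
import subprocess

def parse_vram_csv(line: str) -> tuple[int, int, float]:
    """Parse one 'memory.used, memory.total' CSV line (values in MiB)."""
    used_s, total_s = line.split(",")
    used = int(used_s.strip().split()[0])
    total = int(total_s.strip().split()[0])
    return used, total, used / total

def vram_usage() -> tuple[int, int, float]:
    """Query the first GPU and return (used MiB, total MiB, fraction used)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip().splitlines()[0]
    return parse_vram_csv(out)
```

For example, the GTX 1060 reading above (`4717 MiB / 6144 MiB`) parses to roughly 77% of VRAM in use.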
Author
Owner

@alexislefebvre commented on GitHub (Dec 6, 2025):

With Qwen3 (`qwen3-vl:4b-instruct`):

```shell
$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1060 6GB    Off |   00000000:01:00.0 Off |                  N/A |
| 48%   57C    P2             38W /  120W |    5627MiB /   6144MiB |     15%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   1087364      C   /usr/bin/ollama                              5622MiB |
+-----------------------------------------------------------------------------------------+
```

This looks very close to the total VRAM (6 GB), which may be an issue.

---
I’m also testing a different setup with Ollama and the same Qwen3 model running on another computer. This is similar to this *trick* that used another, more powerful computer to index files: https://blog.alexislefebvre.com/post/2021/12/17/Run-PhotoPrism-on-another-computer

So I started Ollama on a more powerful computer with an RTX 3070:

```shell
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3070        Off |   00000000:09:00.0  On |                  N/A |
|  0%   47C    P2             63W /  280W |    7331MiB /   8192MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            3377      G   /usr/bin/gnome-shell                    291MiB |
|    0   N/A  N/A            3473      G   /usr/bin/Xwayland                        10MiB |
|    0   N/A  N/A            4355      G   /usr/bin/firefox                        256MiB |
|    0   N/A  N/A           74826      C   /usr/bin/ollama                        6702MiB |
+-----------------------------------------------------------------------------------------+
```

It uses more VRAM?!

~~This is slower than the model running on the same host as PhotoPrism (like 11-20 seconds instead of 5-10 seconds), which is very surprising since this other GPU is more powerful, and if I understand correctly, the file transfer is fast since it sends a thumbnail, so there should be no bottleneck.~~

Update: this is now faster, it takes 3 to 7 seconds per image.

Update 2: it is even faster if I pause the BOINC jobs. Ollama uses the GPU and CPU; it was slower with BOINC, even if the CPU wasn’t at 100% utilization.
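For reference, a remote setup like this talks to Ollama over its HTTP API. A minimal sketch of building such a vision request from another host, assuming Ollama's documented `/api/generate` endpoint and its default port 11434 (the host name and thumbnail path in the usage comment are placeholders):

```python
import base64
import json

def build_vision_request(host: str, model: str,
                         image_path: str, prompt: str) -> tuple[str, bytes]:
    """Build the URL and JSON body for one non-streaming vision request."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    body = {
        "model": model,
        "prompt": prompt,
        # Ollama's generate API accepts base64-encoded images for vision models.
        "images": [image_b64],
        "stream": False,
    }
    return f"http://{host}:11434/api/generate", json.dumps(body).encode("utf-8")
```

The returned URL and body can then be sent with any HTTP client, e.g. `urllib.request` with a `Content-Type: application/json` header.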

Author
Owner

@alexislefebvre commented on GitHub (Dec 9, 2025):

I see this in the Docker logs when running the Ollama container with `qwen3-vl:4b-instruct` on an RTX 3070. Is it expected to enter low VRAM mode even though 7.3 GB are available?

> level=INFO source=types.go:42 msg="inference compute" library=CUDA compute=8.6 name=CUDA0 description="NVIDIA GeForce RTX 3070" libdirs=ollama,cuda_v13 driver=13.0 type=discrete total="8.0 GiB" available="7.3 GiB"
> source=routes.go:1638 msg="entering low vram mode" "total vram"="8.0 GiB" threshold="20.0 GiB"

Author
Owner

@lastzero commented on GitHub (Dec 9, 2025):

There should be more detailed logs available that state which parts of the model run on the CPU, if any? Also note that Ollama’s effective VRAM usage depends on model size, quantization (Q3/Q4/Q5/…), context window, and other GPU workloads.

Author
Owner

@alexislefebvre commented on GitHub (Dec 9, 2025):

> There should be more detailed logs available that state which parts of the model run on the CPU, if any?

```shell
docker compose logs --timestamps | grep "CPU\|GPU"
level=INFO source=ggml.go:482 msg="offloading 29 repeating layers to GPU"
level=INFO source=ggml.go:486 msg="offloading output layer to CPU"
level=INFO source=ggml.go:494 msg="offloaded 29/37 layers to GPU"
level=INFO source=device.go:245 msg="model weights" device=CPU size="1.7 GiB"
level=INFO source=device.go:256 msg="kv cache" device=CPU size="112.0 MiB"
level=INFO source=device.go:267 msg="compute graph" device=CPU size="126.6 MiB"
```

Does this answer your question? I know nothing about Ollama, offloading, etc.

> Also note that Ollama’s effective VRAM usage depends on model size, quantization (Q3/Q4/Q5/…), context window, and other GPU workloads.

It would be nice to explain this in the documentation, or to add links with these explanations on this page: https://docs.photoprism.app/user-guide/ai/ollama-models/

Ideally, I think that it should explain basic stuff so that even people who don’t know much about Ollama can choose the model that will fit their hardware. I know that it’s a lot of work though.

Author
Owner

@lastzero commented on GitHub (Dec 9, 2025):

These logs are fine. They show that Ollama has successfully offloaded most transformer layers to the GPU and is keeping a small portion of the model and runtime data on the CPU, which is expected. Some small parts of the model almost always stay on CPU.
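The offload split reported in those `ggml.go` lines can also be checked mechanically. A small sketch; the regex and helper name are illustrative, matched against the log format quoted above:

```python
import re

# Matches Ollama's "offloaded N/M layers to GPU" summary line.
OFFLOAD_RE = re.compile(r'msg="offloaded (\d+)/(\d+) layers to GPU"')

def gpu_layer_fraction(logs: str):
    """Return the fraction of model layers running on the GPU, or None
    if no offload summary line is present in the given log text."""
    m = OFFLOAD_RE.search(logs)
    if not m:
        return None
    on_gpu, total = int(m.group(1)), int(m.group(2))
    return on_gpu / total

sample = 'level=INFO source=ggml.go:494 msg="offloaded 29/37 layers to GPU"'
```

On the log excerpt above, this reports 29/37 ≈ 78% of layers on the GPU, matching the "most layers on GPU, a small remainder on CPU" reading.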

Author
Owner

@alexislefebvre commented on GitHub (Dec 10, 2025):

I pulled all versions of the models. I updated the `vision.yml` file and restarted the PhotoPrism and Ollama containers before running `docker compose exec photoprism photoprism vision run -m labels --count 1 --force`.

Then I tried this:

- `qwen3-vl:2b-instruct` uses 6010 MiB of VRAM and about 2 GB of RAM. I expected it to use less VRAM because the file size is about 2 GB, and less RAM, but it looks like it’s not that simple
- `qwen3-vl:4b-instruct` uses 6700 MiB of VRAM and about 3 GB of RAM
- `qwen3-vl:8b-instruct` uses the same amount of VRAM as the 4b variant (see the end of my [previous test](https://github.com/photoprism/photoprism-docs/issues/211#issuecomment-3621065168)) and about 5 GB of RAM

And now I have more questions than before my test. It looks like Ollama uses all the available VRAM and loads the rest of the data into RAM. This makes it hard to add rough estimates of the required VRAM for the different models.

Maybe it should be only:

> If your GPU has 6 GB of VRAM or less, use `qwen3-vl:2b-instruct`. Otherwise, use any bigger model; Ollama will adapt to the available VRAM and offload data to RAM, which will still be faster than relying only on the CPU.

Author
Owner

@lastzero commented on GitHub (Dec 10, 2025):

Thanks for testing the models and sharing the numbers!

Since all three models are quantized and have a similar architecture, the additional VRAM usage does not seem to scale linearly with the parameter count in your setup. I would have expected the differences in memory usage to be more significant. However, the runtime structures and caches increase memory usage, in addition to the raw file size. This could explain why you see ~6 GB of VRAM usage from a ~2 GB model on disk.

Due to these structures and caches, your Ollama service configuration (e.g., environment variables) has an impact too, and should be considered. If you share your configuration with me, I can review it and suggest changes that may help.
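To make that concrete, a back-of-the-envelope sketch of why a ~2 GB model on disk can occupy far more VRAM: quantized weights take roughly parameter count × bits per weight, and the KV cache, vision encoder, and runtime buffers come on top. All constants below are illustrative guesses, not Ollama internals:

```python
def rough_vram_gib(params_b: float, bits_per_weight: float,
                   overhead_gib: float = 1.5) -> float:
    """Very rough VRAM estimate in GiB: quantized weights plus an assumed
    fixed overhead for KV cache and runtime buffers. The 1.5 GiB default
    overhead is a guess, not a measured constant."""
    weights_gib = params_b * 1e9 * bits_per_weight / 8 / 2**30
    return weights_gib + overhead_gib

# e.g. a 4B-parameter model at ~4.5 bits/weight (typical Q4 quantization):
# ~2.1 GiB of weights + 1.5 GiB overhead, i.e. roughly 3.6 GiB
```

The measured numbers in this thread (~6-6.7 GiB) exceed such a naive estimate, which is consistent with the point above: context window, vision components, and service configuration push actual usage well past the raw weight size.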

Author
Owner

@lastzero commented on GitHub (Dec 11, 2025):

@alexislefebvre If you have time, it would be great if you could also test the new Ministral 3 model:

- https://ollama.com/library/ministral-3

It looks like they added/updated it yesterday... If it's good, we can add it to our docs as an option.

**Edit:** This blog post explains VRAM usage in Ollama, so you might find it interesting:

- https://localllm.in/blog/ollama-vram-requirements-for-local-llms