Mirror of https://github.com/photoprism/photoprism-docs.git (synced 2026-04-25 02:35:50 +03:00)
[GH-ISSUE #211] Please add the minimal VRAM for Ollama #40
Originally created by @alexislefebvre on GitHub (Dec 4, 2025).
Original GitHub issue: https://github.com/photoprism/photoprism-docs/issues/211
On this page: https://docs.photoprism.app/user-guide/ai/ollama-models/
We see:
Could you please add a number, like 4 GB, so that users can know whether their hardware will be able to handle it?
Related:
@lastzero commented on GitHub (Dec 4, 2025):
Thanks for your note! That comment was misleading because it compared the "latest" versions of the two models, which require about 4 and 8 GB of VRAM, respectively. However, the Qwen3-VL model is also available in a smaller size that matches the "latest" Gemma 3 model (4b). Additionally, both models are available as Instruction Tuned variants, which are designed for instruction-following tasks. These variants should be better suited for caption and label generation, though it might also depend on your prompt and expectations.

@alexislefebvre commented on GitHub (Dec 4, 2025):
The example mentions an RTX 4060, but the Ti models have from 8 GB to 16 GB of VRAM: https://en.wikipedia.org/wiki/GeForce_RTX_40_series#RTX_4060_Ti_(8_and_16_GB_version)
The documentation could say something like "(non-Ti)" to avoid ambiguity.
@lastzero commented on GitHub (Dec 4, 2025):
I tested it on a standard RTX 4060 with 8 GB of VRAM. If you have an RTX 4060 Ti with 16 GB, that's even better, though it likely won't make a significant difference. The prompt and options you use will have a much greater impact. For example, generating 3 labels might take 2.5 seconds, while generating 5 labels takes about 4 seconds. Therefore, I don't want to focus too much on hardware details.
@alexislefebvre commented on GitHub (Dec 4, 2025):
I have a GTX 1060 with 6 GB of VRAM; I'm going to try Gemma 3 since it's the lighter model.
@lastzero commented on GitHub (Dec 4, 2025):
I suggest also trying the qwen3-vl:4b-instruct model, as shown in our documentation:
It's the same size as Gemma 3, but slightly more complicated to use, which is why we provide these examples. If you limit the number of labels to two or three and captions to one sentence, performance could be very close to Gemma 3.
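For readers following along, pulling and smoke-testing the suggested model from the command line might look like this (a sketch; the image-path prompt convention is an assumption based on how Ollama's CLI handles other multimodal models, and ./sample.jpg is a placeholder):

```shell
# Download the instruction-tuned 4b Qwen3-VL variant discussed above
ollama pull qwen3-vl:4b-instruct

# Quick local test: Ollama's CLI picks up image paths found in the prompt
# (behavior assumed from other vision models; ./sample.jpg is a placeholder)
ollama run qwen3-vl:4b-instruct "List up to 3 labels for this photo: ./sample.jpg"
```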
@alexislefebvre commented on GitHub (Dec 4, 2025):
Thanks, I will try it later. Right now Ollama/Gemma 3 uses 4.7 GB of VRAM:
@alexislefebvre commented on GitHub (Dec 6, 2025):
With Qwen3 (qwen3-vl:4b-instruct):
This looks very close to the total VRAM (6 GB), which may be an issue.
I’m also testing a different setup, with Ollama and the same Qwen3 model running on another computer. This is similar to the trick of using another, more powerful computer to index files: https://blog.alexislefebvre.com/post/2021/12/17/Run-PhotoPrism-on-another-computer
So I started Ollama on a more powerful computer with an RTX 3070:
It uses more VRAM?!
This is slower than the model running on the same host as PhotoPrism (around 11 to 20 seconds instead of 5 to 10 seconds), which is very surprising since this other GPU is more powerful, and if I understand correctly, the file transfer is fast since it only sends a thumbnail, so there should be no bottleneck.

Update: this is now faster; it takes 3 to 7 seconds per image.
Update 2: it is even faster if I pause the BOINC jobs. Ollama uses both the GPU and the CPU, and it was slower with BOINC running, even though the CPU wasn’t at 100% utilization.
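The two-machine setup described above can be sketched as follows (a sketch; the IP address and firewall details are placeholders, while OLLAMA_HOST and the /api/tags endpoint are standard Ollama):

```shell
# On the more powerful computer (RTX 3070): listen on all interfaces,
# not just localhost, so other machines can reach the API
OLLAMA_HOST=0.0.0.0 ollama serve

# On the PhotoPrism host: verify the remote Ollama API is reachable
# (replace 192.168.1.50 with the actual address of the Ollama machine)
curl http://192.168.1.50:11434/api/tags
```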
@alexislefebvre commented on GitHub (Dec 9, 2025):
I see this in the Docker logs when running the Ollama container with qwen3-vl:4b-instruct on an RTX 3070. Is it expected to enter low VRAM mode even if there are 7.3 GB available?
@lastzero commented on GitHub (Dec 9, 2025):
There should be more detailed logs available that state which parts of the model run on the CPU, if any? Also note that Ollama’s effective VRAM usage depends on model size, quantization (Q3/Q4/Q5/…), context window, and other GPU workloads.
@alexislefebvre commented on GitHub (Dec 9, 2025):
Does this answer your question? I know nothing about Ollama, offloading, etc.
It would be nice to explain this in the documentation, or to add links to such explanations on this page: https://docs.photoprism.app/user-guide/ai/ollama-models/
Ideally, I think it should explain the basics so that even people who don’t know much about Ollama can choose a model that fits their hardware. I know that’s a lot of work, though.
@lastzero commented on GitHub (Dec 9, 2025):
These logs are fine. They show that Ollama has successfully offloaded most transformer layers to the GPU and is keeping a small portion of the model and runtime data on the CPU, which is expected. Some small parts of the model almost always stay on CPU.
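A few diagnostics can make this CPU/GPU split visible (a sketch; the container name is a placeholder, and the exact log wording comes from the llama.cpp runtime embedded in Ollama and varies between versions):

```shell
# Show loaded models and their CPU/GPU split (e.g. "12%/88% CPU/GPU")
ollama ps

# Look for layer-offload and low-VRAM messages in the container logs
# ("ollama" is a placeholder container name)
docker logs ollama 2>&1 | grep -iE "offload|vram"

# Watch actual GPU memory usage while an image is being processed
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```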
@alexislefebvre commented on GitHub (Dec 10, 2025):
I pulled all versions of the models. I updated the vision.yml file and restarted the PhotoPrism and Ollama containers before running docker compose exec photoprism photoprism vision run -m labels --count 1 --force. Then I tried this:

- qwen3-vl:2b-instruct uses 6010 MiB of VRAM and about 2 GB of RAM. I expected it to use less VRAM, because the file size is about 2 GB, and less RAM, but it looks like it’s not that simple.
- qwen3-vl:4b-instruct uses 6700 MiB of VRAM and about 3 GB of RAM.
- qwen3-vl:8b-instruct uses the same amount of VRAM as the 4b variant (see the end of my previous test) and about 5 GB of RAM.

And now I have more questions than before my test. It looks like Ollama uses all the available VRAM and loads the rest of the data into RAM. This makes it hard to add rough estimates of the required VRAM for the different models.
Maybe it should be only:
@lastzero commented on GitHub (Dec 10, 2025):
Thanks for testing the models and sharing the numbers!
Since all three models are quantized and have a similar architecture, the additional VRAM usage does not seem to scale linearly with the parameter count in your setup. I would have expected the differences in memory usage to be more significant. However, the runtime structures and caches increase memory usage, in addition to the raw file size. This could explain why you see ~6 GB of VRAM usage from a ~2 GB model on disk.
Due to these structures and caches, your Ollama service configuration (e.g., environment variables) has an impact too, and should be considered. If you share your configuration with me, I can review it and suggest changes that may help.
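As a back-of-envelope illustration of why VRAM usage exceeds the file size on disk, one can estimate the weights separately from the runtime overhead (the numbers below are assumptions for illustration, not measured values):

```shell
# Rough VRAM estimate for a quantized model:
#   weights ≈ parameters × bits per parameter / 8
#   plus KV cache and runtime buffers, which grow with the context window
awk 'BEGIN {
  params_b = 4    # parameters in billions (e.g. a 4b model)
  bits     = 4    # effective bits per parameter for a Q4-style quant (assumed)
  overhead = 3.0  # KV cache + buffers in GB (assumed, depends on context size)
  weights  = params_b * bits / 8
  printf "weights ~%.1f GB, total ~%.1f GB\n", weights, weights + overhead
}'
# → weights ~2.0 GB, total ~5.0 GB
```

Changing the assumed context window or quantization level moves both numbers, which is consistent with the observation above that VRAM usage did not scale linearly across the three model sizes.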
@lastzero commented on GitHub (Dec 11, 2025):
@alexislefebvre If you have time, it would be great if you could also test the new Ministral 3 model:
It looks like they added/updated it yesterday... If it's good, we can add it to our docs as an option.
Edit: This blog post explains VRAM usage in Ollama, so you might find it interesting: