Homepage: www.llama.com/
Page: www.llama.com/llama2/
Ollama is a highly automated open source wrapper that makes it very easy to run multiple open-weight LLMs on either CPU or GPU.
Its README alone is of great value, serving as a fantastic list of the most popular open-weight LLMs in existence.
Install with:
curl https://ollama.ai/install.sh | sh
On the P14s it runs on CPU and generates a few tokens per second, which is quite usable for quick interactive play.
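For reference, a first interactive session looks something like the following, where llama2 is just an example model name and the first run downloads a few GB of weights:
ollama pull llama2
ollama run llama2
Inside the session, /? lists the available REPL commands and /bye exits.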
As mentioned at github.com/jmorganca/ollama/blob/0174665d0e7dcdd8c60390ab2dd07155ef84eb3f/docs/faq.md, models are downloaded to under:
/usr/share/ollama/.ollama/models/
and ncdu tells me:
--- /usr/share/ollama ----------------------------------
    3.6 GiB [###########################] /.ollama
    4.0 KiB [                           ]  .bashrc
    4.0 KiB [                           ]  .profile
    4.0 KiB [                           ]  .bash_logout
The file:
/usr/share/ollama/.ollama/models/manifests/hf.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/Q2_K
gives the exact model name and parameters.
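The same information can also be inspected from the CLI itself. With recent Ollama versions, something like:
ollama list
ollama show llama2 --modelfile
should list the downloaded models and dump the Modelfile of a given one, including its parameters.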
We can also do it non-interactively with:
/bin/time ollama run llama2 'What is quantum field theory?'
which gave me:
0.13user 0.17system 2:06.32elapsed 0%CPU (0avgtext+0avgdata 17280maxresident)k
0inputs+0outputs (0major+2203minor)pagefaults 0swaps
but note that there is a random seed that affects each run by default. ollama-expect is an attempt to make the output deterministic.
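A possible way around that randomness, untested here, is to go through the local REST API rather than the CLI, since the request options accept a fixed seed and a zero temperature. 11434 is Ollama's default port:
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "What is quantum field theory?",
  "stream": false,
  "options": { "seed": 42, "temperature": 0 }
}'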
Some other quick benchmarks from Amazon EC2 GPUs, on a g4dn.xlarge instance which had an Nvidia Tesla T4:
0.07user 0.05system 0:16.91elapsed 0%CPU (0avgtext+0avgdata 16896maxresident)k
0inputs+0outputs (0major+1960minor)pagefaults 0swaps
and on an Nvidia A10G in a g5.xlarge instance:
0.03user 0.05system 0:09.59elapsed 0%CPU (0avgtext+0avgdata 17312maxresident)k
8inputs+0outputs (1major+1934minor)pagefaults 0swaps
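Such timings can be reproduced on other machines by looping the same non-interactive prompt a few times, e.g. as a rough sketch:
for i in 1 2 3; do
/bin/time -f "run $i: %e s elapsed" ollama run llama2 'What is quantum field theory?' > /dev/null
done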
It tends to babble quite a lot by default, but eventually decides to stop.
askubuntu.com/questions/1461564/install-llama-cpp-locally has some tutorials for Ubuntu. There was no nicely pre-packaged one for Ubuntu 25.04, but the build worked on 79e0b68c178656bb0632cb8602d2940b755077f8. In particular, it exposed Vulkan support before Ollama did: github.com/ollama/ollama/pull/5059, and it did seem to work, making use of my AMD GPU.
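For reference, a from-source build with the Vulkan backend goes roughly as follows; GGML_VULKAN is the current upstream CMake flag name (older revisions used a different one) and the Vulkan development packages are assumed to be installed:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout 79e0b68c178656bb0632cb8602d2940b755077f8
cmake -B build -DGGML_VULKAN=ON
cmake --build build -j"$(nproc)"
which should leave llama-cli under build/bin/.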
A decent test command:
time ./llama-cli \
--no-display-prompt \
--single-turn \
--temp 0 \
-c 16384 \
-cnv \
-m Llama-3.1-Tulu-3-8B-Q8_0.gguf \
-n 1000 \
-ngl 100 \
-p 'What is quantum field theory?' \
-t 10 |
tee output.txt \
;
but it failed to be deterministic despite --temp 0. This ran 2x faster on the P14s on GPU via Vulkan, at 18 tokens/s for 1000 tokens, than on CPU, which is achievable by removing the -ngl 100.
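For the CPU comparison, the offload can also be disabled explicitly with -ngl 0, which should behave the same as dropping the flag entirely:
time ./llama-cli --no-display-prompt --single-turn --temp 0 -c 16384 -cnv -m Llama-3.1-Tulu-3-8B-Q8_0.gguf -n 1000 -ngl 0 -p 'What is quantum field theory?' -t 10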
Across hardware:
Impossible without expect? Fuck...
Attempt at: ollama-expect
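As a rough sketch of the idea (not the actual ollama-expect script), expect can drive the interactive REPL, pin the sampling parameters with /set parameter, and then send the prompt:
expect <<'EOF'
set timeout -1
spawn ollama run llama2
expect ">>> "
send "/set parameter seed 42\r"
expect ">>> "
send "/set parameter temperature 0\r"
expect ">>> "
send "What is quantum field theory?\r"
expect ">>> "
send "/bye\r"
expect eof
EOF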