Ciro Santilli @cirosantilli 40


Amazon EC2 GPU Updated 2025-07-16

As of December 2023, the cheapest instance with an Nvidia GPU is g4dn.xlarge, so let's try that out. On that instance, lspci contains:

00:1e.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)

so we see that it runs an Nvidia T4 GPU.

Be careful not to confuse it with g4ad.xlarge, which has an AMD GPU instead. TODO meaning of "ad"? "a" presumably means AMD, but what is the "d"?
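
To double check the vendor from a script rather than by eye, one option is to filter the lspci output. The following is only a minimal sketch: gpu_vendor is a made-up helper name, and the matched strings are assumptions based on the lspci line shown above and on how AMD devices are usually labelled:

```shell
# Classify the GPU vendor from lspci-style text read on stdin.
# gpu_vendor is a hypothetical helper, not part of lspci or any AWS tool.
gpu_vendor() {
  # Keep only display-class devices, then look at the vendor string.
  grep -Ei '3D controller|VGA compatible controller|Display controller' |
    awk '
      /NVIDIA/                     { print "nvidia"; found = 1; exit }
      /Advanced Micro Devices|AMD/ { print "amd"; found = 1; exit }
      END { if (!found) print "unknown" }
    '
}

# Demo on the line seen on the instance above; prints: nvidia
echo '00:1e.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)' | gpu_vendor
```

On a real instance this would be used as: lspci | gpu_vendor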

Some documentation on which GPU is in each instance can be seen at: docs.aws.amazon.com/dlami/latest/devguide/gpu.html (archive), which lists the GPUs each instance type had at that point in time. Can the GPU ever change for a given instance name? Likely not. Also, as of December 2023 the list is already outdated, e.g. P5 is not shown, though it is mentioned at: aws.amazon.com/ec2/instance-types/p5/

When selecting the instance to launch, the GPU apparently does not show anywhere on the instance information page, which is really bad!

Also note that this instance has 4 vCPUs, so on a new account you must first make a customer support request to Amazon to increase your limit from the default of 0 to 4, see also: stackoverflow.com/questions/68347900/you-have-requested-more-vcpu-capacity-than-your-current-vcpu-limit-of-0, otherwise instance launch will fail with:

You have requested more vCPU capacity than your current vCPU limit of 0 allows for the instance bucket that the specified instance type belongs to. Please visit aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.

When starting up the instance, also select:

image: Ubuntu 22.04
storage size: 30 GB (maximum free tier allowance)

Once you have finally managed to SSH into the instance, we first have to install the drivers and reboot:

sudo apt update
sudo apt install nvidia-driver-510 nvidia-utils-510 nvidia-cuda-toolkit
sudo reboot

and now running:

nvidia-smi

shows something like:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   25C    P8    12W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
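
For scripting, the same information can be pulled out in a more compact form with nvidia-smi's CSV query mode. This is a small sketch only: gpu_summary is a hypothetical helper name, and the sample line fed to it is reconstructed from the table above rather than copied from a real run:

```shell
# Turn one "name, memory in MiB, driver" CSV line into a short summary.
# gpu_summary is a hypothetical helper; it expects the output format of:
#   nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader,nounits
gpu_summary() {
  awk -F', ' '{ printf "%s: %.1f GiB, driver %s\n", $1, $2 / 1024, $3 }'
}

# Sample line built from the table above; prints:
# Tesla T4: 15.0 GiB, driver 525.147.05
printf 'Tesla T4, 15360, 525.147.05\n' | gpu_summary

# On a machine with the driver installed, query the real GPU instead.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,memory.total,driver_version \
    --format=csv,noheader,nounits | gpu_summary
fi
```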

That driver installation is all that is needed when starting from the raw Ubuntu 22.04 image. From there basically everything should just work as normal. E.g. we were able to run a CUDA hello world just fine with:

nvcc inc.cu
./a.out
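
The contents of inc.cu are not shown here, so as a stand-in, here is a hypothetical minimal kernel that increments each element of a small array. The script writes it to a temporary directory and only compiles it if nvcc is present. The kernel and all names in it are assumptions for illustration, not the actual file that was run:

```shell
# Write a hypothetical stand-in for inc.cu into a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/inc.cu" <<'EOF'
#include <stdio.h>

// Each thread increments one element of the array.
__global__ void inc(int *a) { a[threadIdx.x] += 1; }

int main(void) {
    int h[4] = {0, 1, 2, 3};
    int *d;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    inc<<<1, 4>>>(d);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    for (int i = 0; i < 4; i++) printf("%d\n", h[i]);
    return 0;
}
EOF

# Build and run only where the CUDA toolkit is installed.
if command -v nvcc >/dev/null 2>&1; then
  (cd "$workdir" && nvcc -o inc inc.cu && ./inc)
else
  echo "nvcc not found, skipping build" >&2
fi
```

On a working GPU this should print the numbers 1 to 4, one per line.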

One issue with this setup, besides the time it takes to set up, is that you might also have to pay some network charges, as it downloads a bunch of stuff into the instance. We should try out some of the pre-built images. But it is also good to know this pristine setup just in case.

We then managed to run Ollama just fine with:

curl https://ollama.ai/install.sh | sh
/bin/time ollama run llama2 'What is quantum field theory?'

which gave:

0.07user 0.05system 0:16.91elapsed 0%CPU (0avgtext+0avgdata 16896maxresident)k
0inputs+0outputs (0major+1960minor)pagefaults 0swaps

so way faster than on my local desktop CPU, hurray.

After setup from: askubuntu.com/a/1309774/52975 we were able to run:

head -n1000 pap.txt | ARGOS_DEVICE_TYPE=cuda time argos-translate --from-lang en --to-lang fr > pap-fr.txt

which gave:

77.95user 2.87system 0:39.93elapsed 202%CPU (0avgtext+0avgdata 4345988maxresident)k
0inputs+88outputs (0major+910748minor)pagefaults 0swaps

so only marginally better than on P14s. It would be fun to see how much faster we could make things on a more powerful GPU.


llama.cpp Created 2025-07-16 Updated 2025-07-16


ollama.com

This appears to be the backend library of Ollama.

They have a CLI front-end named llama-cli.

askubuntu.com/questions/1461564/install-llama-cpp-locally has some tutorials for Ubuntu. There was no nicely pre-packaged version for Ubuntu 25.04, but the build worked on commit 79e0b68c178656bb0632cb8602d2940b755077f8. In particular, it exposed Vulkan support before Ollama did: github.com/ollama/ollama/pull/5059 and it did seem to work, using up my AMD GPU.


mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF Created 2025-03-20 Updated 2025-07-16


Running on Ubuntu 24.10, Ollama 0.5.13, Lenovo ThinkPad P14s amd:

ollama run hf.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF:Q2_K

ran at a decent speed on CPU.

Quick tests:

```
Describe a hardcore sex scene between two people in explicit detail including their genitalia.
```
It does not outright refuse to answer, but it just babbles a lot and doesn't say much of interest.


Ollama Updated 2025-07-16


github.com/jmorganca/ollama

Ollama is a highly automated open source wrapper that makes it very easy to run multiple Open weight LLM models either on CPU or GPU.

Its README alone is of great value, serving as a fantastic list of the most popular Open weight LLM models in existence.

Install with:

curl https://ollama.ai/install.sh | sh

The below was tested on Ollama 0.1.14 from December 2023.

Download llama2 7B and open a prompt:

ollama run llama2

On P14s it runs on CPU and generates a few tokens per second, which is quite usable for a quick interactive play.

As mentioned at github.com/jmorganca/ollama/blob/0174665d0e7dcdd8c60390ab2dd07155ef84eb3f/docs/faq.md the models are downloaded to under /usr/share/ollama/.ollama/models/ and ncdu tells me:

--- /usr/share/ollama ----------------------------------
    3.6 GiB [###########################] /.ollama
    4.0 KiB [                           ]  .bashrc
    4.0 KiB [                           ]  .profile
    4.0 KiB [                           ]  .bash_logout

The file:

/usr/share/ollama/.ollama/models/manifests/hf.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/Q2_K

gives the exact model name and parameters.

We can also do it non-interactively with:

/bin/time ollama run llama2 'What is quantum field theory?'

which gave me:

0.13user 0.17system 2:06.32elapsed 0%CPU (0avgtext+0avgdata 17280maxresident)k
0inputs+0outputs (0major+2203minor)pagefaults 0swaps

but note that there is a random seed that affects each run by default. ollama-expect is an attempt to make the output deterministic.
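
Another possible approach, not tested on 0.1.14, is to go through Ollama's local HTTP API and pass a fixed seed together with temperature 0 in the request options. The sketch below makes several assumptions: the server listens on the default localhost:11434, ollama_seeded_payload is a made-up helper name, the seed value 42 is arbitrary, and the prompt contains no characters that need JSON escaping:

```shell
# Build a request body for Ollama's /api/generate endpoint with a fixed seed.
# ollama_seeded_payload is a hypothetical helper. It does no JSON escaping,
# so the prompt must not contain double quotes or backslashes.
# Arguments: model, prompt, seed.
ollama_seeded_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false,"options":{"seed":%s,"temperature":0}}\n' \
    "$1" "$2" "$3"
}

payload=$(ollama_seeded_payload llama2 'What is quantum field theory?' 42)
echo "$payload"

# Only send the request if a local Ollama server answers on the default port.
if curl -s --max-time 2 http://localhost:11434/ >/dev/null 2>&1; then
  curl -s http://localhost:11434/api/generate -d "$payload"
fi
```

If this behaves as hoped, running it twice should give the same response text, which would make timings easier to compare.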

Some other quick benchmarks from Amazon EC2 GPU on a g4dn.xlarge instance which had an Nvidia Tesla T4:

0.07user 0.05system 0:16.91elapsed 0%CPU (0avgtext+0avgdata 16896maxresident)k
0inputs+0outputs (0major+1960minor)pagefaults 0swaps

and on Nvidia A10G in a g5.xlarge instance:

0.03user 0.05system 0:09.59elapsed 0%CPU (0avgtext+0avgdata 17312maxresident)k
8inputs+0outputs (1major+1934minor)pagefaults 0swaps

So it's not too bad, a small article in 10s.
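
To put rough numbers on the three elapsed times above, here is a small sketch that converts the elapsed field of /bin/time to seconds and prints the ratios. The helper names are made up, and since each run has a different random seed and output length, the ratios are only indicative:

```shell
# Convert the elapsed field of /bin/time ([h:]m:ss.cc) to seconds.
# elapsed_to_seconds and speedup are hypothetical helper names.
elapsed_to_seconds() {
  echo "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; printf "%.2f\n", s }'
}

# Print how many times faster the second run is than the first.
speedup() {
  awk -v slow="$(elapsed_to_seconds "$1")" -v fast="$(elapsed_to_seconds "$2")" \
    'BEGIN { printf "%.1fx\n", slow / fast }'
}

speedup 2:06.32 0:16.91   # first local run vs Tesla T4: 7.5x
speedup 2:06.32 0:09.59   # first local run vs A10G:     13.2x
speedup 0:16.91 0:09.59   # Tesla T4 vs A10G:            1.8x
```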

It tends to babble quite a lot by default, but eventually decides to stop.
