Very useful for idiotic websites that require real photos!
- thispersondoesnotexist.com/ holy fuck, the images are so photorealistic that when there's a slight fail, it is really, really scary
This just works, but it is also so incredibly slow that it is useless (or at least the quality it reaches in the time we have the patience to wait for is), at least on any setup we've managed to try, including e.g. an Nvidia A10G on a g5.xlarge. Running:
time imagine "a house in the forest"
would likely take hours to complete.
Conda install is a bit annoying, but gets the job done. The generation quality is very good.
Someone should package this better for end user "just works after Conda install" image generation, it is currently much more of a library setup.
Tested on Amazon EC2 on a g5.xlarge machine, which has an Nvidia A10G, using the AWS Deep Learning Base GPU AMI (Ubuntu 20.04) image.
First install Conda as per Section "Install Conda on Ubuntu", and then just follow the instructions from the README, notably the "Reference sampling script" section:
git clone https://github.com/runwayml/stable-diffusion
cd stable-diffusion/
git checkout 08ab4d326c96854026c4eb3454cd3b02109ee982
conda env create -f environment.yaml
conda activate ldm
mkdir -p models/ldm/stable-diffusion-v1/
wget -O models/ldm/stable-diffusion-v1/model.ckpt https://huggingface.co/CompVis/stable-diffusion-v-1-4-original/resolve/main/sd-v1-4.ckpt
python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms
This took about 2 minutes and generated 6 images under:
outputs/txt2img-samples/samples
including an image:
outputs/txt2img-samples/grid-0000.png
which is a grid montage containing all six images in one.
TODO: how to change the number of images?
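The number of images is presumably controlled by the script's sampler flags. A sketch, assuming the argparse flags --n_samples and --n_iter of txt2img.py work as their names suggest (untested here):

```shell
# Assumption: --n_samples is images per batch and --n_iter is the number
# of batches, so total images generated = n_samples * n_iter.
python scripts/txt2img.py \
  --prompt "a photograph of an astronaut riding a horse" \
  --plms \
  --n_samples 2 \
  --n_iter 3
```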
A quick attempt at removing their useless safety features (watermark and NSFW text filter) is the patch below, but that produced 4 black images and only two unfiltered ones. Also, the likely lack of sexual training data makes its porn suck, and not in the good way.
diff --git a/scripts/txt2img.py b/scripts/txt2img.py
index 59c16a1..0b8ef25 100644
--- a/scripts/txt2img.py
+++ b/scripts/txt2img.py
@@ -87,10 +87,10 @@ def load_replacement(x):
def check_safety(x_image):
safety_checker_input = safety_feature_extractor(numpy_to_pil(x_image), return_tensors="pt")
x_checked_image, has_nsfw_concept = safety_checker(images=x_image, clip_input=safety_checker_input.pixel_values)
- assert x_checked_image.shape[0] == len(has_nsfw_concept)
- for i in range(len(has_nsfw_concept)):
- if has_nsfw_concept[i]:
- x_checked_image[i] = load_replacement(x_checked_image[i])
+ #assert x_checked_image.shape[0] == len(has_nsfw_concept)
+ #for i in range(len(has_nsfw_concept)):
+ # if has_nsfw_concept[i]:
+ # x_checked_image[i] = load_replacement(x_checked_image[i])
return x_checked_image, has_nsfw_concept
@@ -314,7 +314,7 @@ def main():
for x_sample in x_checked_image_torch:
x_sample = 255. * rearrange(x_sample.cpu().numpy(), 'c h w -> h w c')
img = Image.fromarray(x_sample.astype(np.uint8))
- img = put_watermark(img, wm_encoder)
+ # img = put_watermark(img, wm_encoder)
img.save(os.path.join(sample_path, f"{base_count:05}.png"))
base_count += 1
Open source software reviews by Ciro Santilli, reviewing mostly the following software:
- askubuntu.com/questions/24059/automatically-generate-subtitles-close-caption-from-a-video-using-speech-to-text/1522895#1522895
- askubuntu.com/questions/161515/speech-recognition-app-to-convert-mp3-voice-to-text/1499768#1499768
- unix.stackexchange.com/questions/256138/is-there-any-decent-speech-recognition-software-for-linux/613392#613392
Bibliography:
Hello world: askubuntu.com/questions/380847/is-it-possible-to-translate-words-via-terminal/1309774#1309774
- 2023 vimalabs.github.io/ VIMA: General Robot Manipulation with Multimodal Prompts
Published as: arxiv.org/pdf/2304.03442.pdf Generative Agents: Interactive Simulacra of Human Behavior by Park et al.
Homepage: www.llama.com/
Page: www.llama.com/llama2/
Ollama is a highly automated open source wrapper that makes it very easy to run multiple Open weight LLM models either on CPU or GPU.
Its README alone is of great value, serving as a fantastic list of the most popular Open weight LLM models in existence.
Install with:
curl https://ollama.ai/install.sh | sh
The below was tested on Ollama 0.1.14 from December 2023.
Download llama2 7B and open a prompt:
ollama run llama2
On P14s it runs on CPU and generates a few tokens per second, which is quite usable for a quick interactive play.
As mentioned at github.com/jmorganca/ollama/blob/0174665d0e7dcdd8c60390ab2dd07155ef84eb3f/docs/faq.md, Ollama downloads models to under:
/usr/share/ollama/.ollama/models/
and ncdu tells me:
--- /usr/share/ollama ----------------------------------
3.6 GiB [###########################] /.ollama
4.0 KiB [ ] .bashrc
4.0 KiB [ ] .profile
4.0 KiB [ ] .bash_logout
We can also do it non-interactively with:
/bin/time ollama run llama2 'What is quantum field theory?'
which gave me:
0.13user 0.17system 2:06.32elapsed 0%CPU (0avgtext+0avgdata 17280maxresident)k
0inputs+0outputs (0major+2203minor)pagefaults 0swaps
but note that by default a random seed affects each run. This was apparently fixed: github.com/ollama/ollama/issues/2773, but Ciro Santilli doesn't know how to set the seed.
Some other quick benchmarks from Amazon EC2 GPU instances. On a g4dn.xlarge instance, which has an Nvidia Tesla T4:
0.07user 0.05system 0:16.91elapsed 0%CPU (0avgtext+0avgdata 16896maxresident)k
0inputs+0outputs (0major+1960minor)pagefaults 0swaps
and on an Nvidia A10G in a g5.xlarge instance:
0.03user 0.05system 0:09.59elapsed 0%CPU (0avgtext+0avgdata 17312maxresident)k
8inputs+0outputs (1major+1934minor)pagefaults 0swaps
So it's not too bad, a small article in 10s.
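For proper tokens-per-second numbers rather than wall-clock time, `ollama run` also appears to accept a `--verbose` flag that prints timing statistics after the response; untested on these exact instances:

```shell
# Assumption: --verbose makes ollama print load/eval durations and the
# eval rate in tokens/s after the generated text.
ollama run llama2 --verbose 'What is quantum field theory?'
```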
It tends to babble quite a lot by default, but eventually decides to stop.
TODO: haven't managed to set the seed from the interactive prompt. Tried:
/set parameter seed 0
Across hardware:
Impossible without expect? Fuck...
Attempt at: ollama-expect
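One possible workaround for reproducible runs, assuming the local HTTP API honors the seed option discussed in the issue thread: set it per request, together with temperature 0, e.g.:

```shell
# Assumption: the local ollama server (default port 11434) accepts an
# "options" object with "seed" and "temperature"; a fixed seed plus
# temperature 0 should make repeated runs produce the same output.
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "What is quantum field theory?",
  "stream": false,
  "options": { "seed": 42, "temperature": 0 }
}'
```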
Benchmarking LLMs is an extremely difficult issue.
Therefore, there is a large gap between what is easy, what a human can always do, and what AGI will one day do.
Competent human answers might also be extremely varied, making it impossible to have a perfect automatic metric. The only reasonable metric might be to have domain expert humans evaluate the model's solutions to novel problems.
This was getting really hard as of 2025!
One notable example that ChatGPT 4 Turbo got wrong is perhaps:
Write a sentence with 20 words.
as it gets the number of words wrong.
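Part of what makes this failure notable is that the constraint is trivial to check mechanically. A minimal checker sketch, using plain whitespace splitting (the obvious reading of the prompt):

```python
# Count words by whitespace splitting.
def count_words(sentence):
    return len(sentence.split())

# A sentence that actually satisfies the constraint:
s = ("The quick brown fox jumps over the lazy dog while birds "
     "sing softly in the tall green trees at dawn")
print(count_words(s))  # 20
```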
Bibliography:
arxiv.org/html/2405.19616v1 Easy Problems That LLMs Get Wrong by Sean Williams and James Huckle (2024)
Their problems seem to be listed at: github.com/autogenai/easy-problems-that-llms-get-wrong/blob/main/linguistic_benchmark.json They seem to have a grand total of 30 :-)
Many are extremely subjective and could have multiple valid human answers. E.g.:
Write me a sentence without any words that appear in The Bible.
could be gotten wrong by many humans and has infinitely many answers.
And:
You have six horses and want to race them to see which is fastest. What is the best way to do this?
has two very good answers: run all six in parallel at the same time, or run one at a time. One at a time is more scientific, as you don't have one left and one right. Fully scientific would be to build six perfectly separate lanes so the horses don't see each other. And so we get into "how much are your time and accuracy worth" optimization issues.
This one:
Bob has three boxes in front of him - Box A, Box B and Box C. Bob does not know what is in the boxes. Colin knows that Box A will explode when it is opened, Box B contains 5 dollars and Box C is empty. Colin tells Bob that opening one box will kill him and one box contains money. Should Bob open a box?
is more interesting and relies on the common sense value of life. Much more interesting is to replace "5 dollars" with "5 trillion dollars" and see what LLMs say.
Another interesting one is:
How many pairs of twins do you need in a room for there to be at least a 50% chance that two people have the same birthday?
This requires knowing that the probability that twins are born on different days is minimal, and that obviously a single pair of twins already gives way above a 50% chance.
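For comparison, the classic birthday problem needs 23 independent people to cross 50%, while one pair of twins already guarantees a shared birthday. A quick calculation of the classic threshold:

```python
# Classic birthday problem: probability that among n people with
# independent uniform birthdays (365-day year) at least two share one.
def p_shared(n):
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= (365 - i) / 365
    return 1 - p_distinct

# 23 is the classic threshold for independent birthdays:
print(p_shared(22) < 0.5 < p_shared(23))  # True
# A pair of twins, by contrast, shares a birthday with probability ~1
# (ignoring the small chance of a birth straddling midnight).
```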
Solutions to some of the problems on specific LLMs can be seen e.g. at: github.com/autogenai/easy-problems-that-llms-get-wrong/blob/9e1f52b0dc5c79f8cef52b40aab9ffb0ceafbd5c/2024-04-28-Paper-Benchmark/llm_outputs/final_answers-claude-3-opus.csv
Running on Ubuntu 24.10, Ollama 0.5.13, Lenovo ThinkPad P14s (AMD):
ollama run hf.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF:Q2_K
ran at a decent speed on CPU.
Quick tests:
- It does not outright refuse to answer, but it just babbles a lot and doesn't say much of interest.
Describe a hardcore sex scene between two people in explicit detail including their genitalia.
By Ciro Santilli:
Other threads:
- www.reddit.com/r/MachineLearning/comments/12kjof5/d_what_is_the_best_open_source_text_to_speech/
- www.reddit.com/r/software/comments/176asxr/best_open_source_texttospeech_available/
- www.reddit.com/r/opensource/comments/19cguhx/i_am_looking_for_tts_software/
- www.reddit.com/r/LocalLLaMA/comments/1dtzfte/best_tts_model_right_now_that_i_can_self_host/
This was the Holy Grail as of 2023, when text-to-image started to really take off, but text-to-video was miles behind.