Source: cirosantilli/updates/generating-test-data-for-full-text-search-tests

= Generating test data for full text search tests

I've been thinking lightly about adding full text search to <OurBigBook>.

For example, at https://docs.ourbigbook.com/news/article-and-topic-id-prefix-search article search was added, but it only finds if you search something that appears right at the start of a title, e.g. for:
``
Fundamental theorem of calculus
``
you'd get a hit for:
``
fundamental
``
but not for
``
calculus
``

To do this efficiently, we need full text search, which <PostgreSQL> implements.

But finding a clean way to generate test data for testing out the speedup was not so easy and exploration into this led me to publishing a few new slightly improved methods where Googlers can now find them:

* https://unix.stackexchange.com/questions/97160/is-there-something-like-a-lorem-ipsum-generator/787733#787733 I propose a neat random "sentence" generator using common CLI tools like <grep> and <sed> and the pre-installed Ubuntu dictionary `/usr/share/dict/american-english`:
  ``
  grep -v "'" /usr/share/dict/american-english |
  shuf -r |
  paste -d ' ' $(printf "%4s" | sed 's/ /- /g') |
  sed -e 's/^\(.\)/\U\1/;s/$/./' |
  head -n10000000 \
  > lorem.txt
  ``

  * to achieve that, I also proposed two superior "join every N lines" method for the CLI: https://stackoverflow.com/questions/25973140/joining-every-group-of-n-lines-into-one-with-bash/79257780#79257780[], notably this <awk> poem:
    ``
    seq 10 | awk '{ printf("%s%s", NR  == 1 ? "" : NR % 3 == 1 ? "\n" : " ", $0 ) } END { printf("\n") }'
    ``
* https://stackoverflow.com/questions/3371503/sql-populate-table-with-random-data/79255281#79255281 I propose:
  * a clean <PostgreSQL> random string stored procedure that picks random characters from an allowed character list
    ``
    CREATE OR REPLACE FUNCTION random_string(int) RETURNS TEXT as $$
    select
    string_agg(substr(characters, (random() * length(characters) + 1)::integer, 1), '') as random_word
    from (values('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-    ')) as symbols(characters)
    join generate_series(1, $1) on 1 = 1
    $$ language sql;  
    ``
  * first generating <PostgreSQL> data as <CSV>, and then importing the CSV into PostgreSQL as a more flexible method. This can also be done in a streaming fashion from stdin which is neat.
    ``
    python generate_data.py 10 | psql mydb -c '\copy "mytable" FROM STDIN'
    ``

Finally I did a writeup summarizing PostgreSQL full text search: <PostgreSQL full-text search>{full} and also dumped it at: https://www.reddit.com/r/PostgreSQL/comments/12yld1o/is_it_worth_using_postgres_builtin_fulltext/[] for good measure.