Generating test data for full text search tests by
Ciro Santilli 35 Updated 2025-03-28 +Created 2024-12-23
I've been thinking lightly about adding full text search to OurBigBook.
For example, at docs.ourbigbook.com/news/article-and-topic-id-prefix-search article search was added, but it only finds an article if you search for something that appears right at the start of its title, e.g. for an article titled "Fundamental theorem of calculus" you'd get a hit for "fundamental" but not for "calculus".
To do this efficiently, we need full text search, which PostgreSQL implements.
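To make the limitation concrete, here is a minimal Python sketch of the two behaviours. The function names are made up for illustration, and `word_match` is only a crude stand-in for real full text search (no stemming, ranking or indexing), not what PostgreSQL actually does internally.

```python
def prefix_match(title: str, query: str) -> bool:
    """Title-prefix search: hit only if the title starts with the query."""
    return title.lower().startswith(query.lower())


def word_match(title: str, query: str) -> bool:
    """Crude stand-in for full text search: hit if any word equals the query."""
    return query.lower() in title.lower().split()


title = "Fundamental theorem of calculus"
print(prefix_match(title, "fundamental"))  # True
print(prefix_match(title, "calculus"))     # False
print(word_match(title, "calculus"))       # True
```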
Finding a clean way to generate test data for testing out the speedup was not so easy, however, and exploring this led me to publish a few new slightly improved methods where Googlers can now find them:
- unix.stackexchange.com/questions/97160/is-there-something-like-a-lorem-ipsum-generator/787733#787733 I propose a neat random "sentence" generator using common CLI tools like grep and sed and the pre-installed Ubuntu dictionary /usr/share/dict/american-english:
  grep -v "'" /usr/share/dict/american-english | shuf -r | paste -d ' ' $(printf "%4s" | sed 's/ /- /g') | sed -e 's/^\(.\)/\U\1/;s/$/./' | head -n10000000 \
    > lorem.txt
- to achieve that, I also proposed two superior "join every N lines" methods for the CLI: stackoverflow.com/questions/25973140/joining-every-group-of-n-lines-into-one-with-bash/79257780#79257780, notably this awk poem:
seq 10 | awk '{ printf("%s%s", NR == 1 ? "" : NR % 3 == 1 ? "\n" : " ", $0 ) } END { printf("\n") }'
- stackoverflow.com/questions/3371503/sql-populate-table-with-random-data/79255281#79255281 I propose:
- a clean PostgreSQL random string stored procedure that picks random characters from an allowed character list:
  CREATE OR REPLACE FUNCTION random_string(int) RETURNS TEXT as $$
    -- floor() keeps the index within 1..length(characters); rounding could overshoot by one
    select string_agg(substr(characters, floor(random() * length(characters))::integer + 1, 1), '') as random_word
    from (values('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789- ')) as symbols(characters)
    join generate_series(1, $1) on 1 = 1
  $$ language sql;
- first generating PostgreSQL data as CSV, and then importing the CSV into PostgreSQL as a more flexible method. This can also be done in a streaming fashion from stdin which is neat.
python generate_data.py 10 | psql mydb -c '\copy "mytable" FROM STDIN'
- stackoverflow.com/questions/16020164/psqlexception-error-syntax-error-in-tsquery/79437030#79437030 regarding the safe generation of prefix search tsquery from user inputs without query errors, I've learned about websearch_to_tsquery and further highlighted a possible tsquery -> text -> tsquery approach that might be correct for prefix searches
- stackoverflow.com/questions/67438575/fulltext-search-using-sequelize-postgres/79439253#79439253 I put everything together into a minimal Sequelize example, ready for usage in OurBigBook
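The generate_data.py script piped into psql above is not shown here, so the following is only a guess at what such a generator could look like, not the actual script. It mimics the shell pipeline's idea (join 4 random words, capitalize, add a period) and prints tab-separated lines, which is the default text format that \copy ... FROM STDIN expects. The two-column layout (integer id, text) and the tiny built-in word list are assumptions made so that the sketch runs anywhere.

```python
import itertools
import random
import sys

# Assumption: a tiny built-in word list so the sketch is self-contained.
# The shell version reads /usr/share/dict/american-english instead.
WORDS = ["apple", "river", "quantum", "theorem", "search", "index", "random", "table"]


def sentences(rng, words_per_sentence=4):
    """Yield endless random "sentences": N random words, capitalized, ending in a period."""
    while True:
        chosen = [rng.choice(WORDS) for _ in range(words_per_sentence)]
        yield " ".join(chosen).capitalize() + "."


def rows(n, seed=0):
    """Yield n tab-separated lines (id, sentence), one per table row."""
    rng = random.Random(seed)
    for i, sentence in enumerate(itertools.islice(sentences(rng), n), start=1):
        yield f"{i}\t{sentence}"


def main(argv):
    # Row count comes from the first CLI argument, defaulting to 10.
    n = int(argv[1]) if len(argv) > 1 and argv[1].isdigit() else 10
    for line in rows(n):
        print(line)


if __name__ == "__main__":
    main(sys.argv)
```

Because rows are produced lazily and printed one at a time, the same script should work for 10 rows or 10 million without holding everything in memory, which is the point of the streaming approach.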
Finally I did a writeup summarizing PostgreSQL full text search: Section "PostgreSQL full-text search" and also dumped it at: www.reddit.com/r/PostgreSQL/comments/12yld1o/is_it_worth_using_postgres_builtin_fulltext/ for good measure.
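On the prefix search tsquery item from the list above: the sketch below is not the websearch_to_tsquery or tsquery -> text -> tsquery approach from the linked answer, just a simpler alternative under the assumption that throwing away all punctuation is acceptable. It keeps only runs of letters and digits and turns each into a `word:*` prefix term joined with `&`, so tsquery operators typed by the user never reach the parser. The function name is hypothetical, and the result is meant to be passed as a bound parameter to something like to_tsquery('simple', $1).

```python
import re


def prefix_tsquery(user_input: str) -> str:
    """Build a to_tsquery() string where every word is a prefix match (word:*).

    Only runs of letters and digits are kept, so characters with special
    meaning in tsquery syntax such as & | ! ( ) : ' are dropped.
    """
    words = re.findall(r"[^\W_]+", user_input.lower())
    return " & ".join(f"{word}:*" for word in words)


print(prefix_tsquery("Fundamental theo"))  # fundamental:* & theo:*
print(prefix_tsquery("calc' & (oops"))     # calc:* & oops:*
```

An input with no letters or digits gives an empty string, in which case the caller would probably want to skip the search entirely.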
Meet Willow, our state-of-the-art quantum chip by Google Quantum AI. 2024 public presentation of their then new chip.
Related blog post: blog.google/technology/research/google-willow-quantum-chip/
Timeline:
- 2015: joined Google as a Google Quantum AI employee
- 2010: UCSB Physics PhD. His thesis was "Fault-tolerant superconducting qubits" and the PDF can be downloaded from: alexandria.ucsb.edu/lib/ark:/48907/f3b56gwb.
- 2006: UCSB Physics undergrad. In 2008 he joined John Martinis' lab during his undergrad itself.
He went pretty much in a straight line into the quantum computing boom! Well done.
Built 2021. TODO address. Located in Santa Barbara, which has long been the epicenter of Google's AI efforts. Apparently contains fabrication facilities.
Take a tour of Google's Quantum AI Lab by Google Quantum AI. 2023
This section is about POSIX environment variables that have special effects.
They are documented by POSIX at: pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08
Signup required for any search, bastards. But it's free. Once you have a URL however it is visible without login, so you could just Google it too.
Made possible by the Kibble balance.
Google's 2019 quantum supremacy claim by
Ciro Santilli 35 Updated 2025-03-28 +Created 2024-12-13