cirosantilli.com content uploaded to ourbigbook.com/cirosantilli Updated 2025-03-28 +Created 1970-01-01
Managed to upload the content from the static website cirosantilli.com (OurBigBook Markup source at github.com/cirosantilli/cirosantilli.github.io) to ourbigbook.com/cirosantilli.
Although most of the key requirements were already in place since the last update, as usual doing things with the complex reference content stresses the system further and exposes several new bugs.
The upload of OurBigBook Markup files to ourbigbook.com was done with the newly added OurBigBook CLI
ourbigbook --web
option. Although fully exposed to end users, the setup is not super efficient: a truly decent implementation should only upload changed files, and would basically mean reimplementing/using Git, since version diffing is what Git shines at. But I've decided not to put much emphasis on CLI upload for now, since it is expected that initially the majority of users will use the Web UI only. The functionality was added primarily to upload the reference content. This is a major milestone, as the new content can start attracting new users, and it makes the purpose of the website much clearer. Just having this more realistic content also immediately highlighted what the next development steps need to be.
Once v1.0 is reached, I will actually make all internal links of cirosantilli.com point to ourbigbook.com/cirosantilli to try and drive some more traffic.
The new content blows up by far the limit of the free Heroku PostgreSQL database of 10k rows. This meant that I needed to upgrade the Heroku Postgres plugin from the free Hobby Dev to the 9 USD/month Hobby Basic: elements.heroku.com/addons/heroku-postgresql, so hosting costs will now increase from 7 USD/month for the dyno to 7 + 9 = 16 USD/month. After this upgrade and uploading all of cirosantilli.com to ourbigbook.com, the Heroku dashboard reads:
- 30,918 rows out of 10,000,000
- 61.0 MB (out of 10 GB)
so clearly if we are ever forced to upgrade plans again, it means that a bunch of people are using the website and that things are going very very well! Happy with how this storage cost turned out so far.
One key limitation found was that Heroku RAM is quite limited at 512MB, and JavaScript is not exactly the most memory economical language out there. Started an investigation at: github.com/ourbigbook/ourbigbook/issues/230 Initially working around that by simply splitting the largest files. Luckily we were just on the verge of what could be run, so a few dozen splits were enough: it managed to handle 70 kB OurBigBook Markup inputs. So hopefully if we manage to optimize a bit more we will be able to set a maximum input size of 100 kB and still have a good safety margin.
These examples are written with the Sequelize library using raw queries.
Sequelize is used minimally, just to feed raw queries transparently into any underlying database and to get minimally parsed results out for us, which we then assert with standard JavaScript. The queries themselves are all written by hand.
By default the examples run on SQLite. Just like the examples from Sequelize example, you can set the database at runtime:
- ./index.js or ./index.js l: SQLite
- ./index.js p: PostgreSQL. You must manually create a database called tmp and ensure that peer authentication works for it
Here we list only examples which we believe are standard SQL, and should therefore work across different SQL implementations:
- nodejs/sequelize/raw/index.js: basic hello world to demonstrate the setup and very simple functionality
- nodejs/sequelize/raw/many_to_many.js: illustrates many-to-many relations with JOIN. Contains:
- SQL transaction examples:
- nodejs/sequelize/raw/commit_error.js: stackoverflow.com/questions/27245101/why-should-we-use-rollback-in-sql-explicitly/27245234#27245234 and stackoverflow.com/questions/48277519/how-to-use-commit-and-rollback-in-a-postgresql-function/48277708#48277708 suggest that on PostgreSQL, once something fails inside a transaction, all queries in the current transaction are ignored, and COMMIT simply does a ROLLBACK. Let's check. Yup, true for Postgres, but false for SQLite: SQLite just happily runs anything it can, you really need ROLLBACK for it.
- SQL isolation level example
- GROUP BY and SQL aggregate functions:
- nodejs/sequelize/raw/group_by_extra_column.js: let's see if it blows up or not on different DB systems, the sqlite3 Node.js package allows it:
- github.com/sequelize/sequelize/issues/5481#issuecomment-964387232
- dba.stackexchange.com/questions/141594/how-select-column-does-not-list-in-group-by-clause/141600 says that it was allowed in SQL:1999 when there are no ambiguities due to constraints, e.g. when grouping by unique columns
- github.com/postgres/postgres/blob/REL_13_5/src/test/regress/sql/functional_deps.sql#L27 shows that PostgreSQL wants it to work for UNIQUE NOT NULL, but they just haven't implemented it as of 13.5, where it only works if you group by PRIMARY KEY
- dba.stackexchange.com/questions/158015/why-can-i-select-all-fields-when-grouping-by-primary-key-but-not-when-grouping-b also says that UNIQUE NOT NULL doesn't work. Dan Lenski then points to a rationale mailing list thread.
- nodejs/sequelize/raw/group_by_max_full_row.js: here we try to get the full row of each group at which a given column reaches the max of the group
- Postgres: has SELECT DISTINCT ON which works perfectly if you only want one row in case of multiple rows attaining the max. ON is an extension to the standard unfortunately: www.postgresql.org/docs/9.3/sql-select.html#SQL-DISTINCT Docs specify that it always respects ORDER BY when selecting the row.
- stackoverflow.com/questions/586781/postgresql-fetch-the-row-which-has-the-max-value-for-a-column asks it without the multiple matches use case
- stackoverflow.com/questions/586781/postgresql-fetch-the-rows-which-have-the-max-value-for-a-column-in-each-group/587209#587209 also present in simpler form at stackoverflow.com/questions/121387/fetch-the-rows-which-have-the-max-value-for-a-column-for-each-distinct-value-of/123481#123481 gives a very nice OUTER JOIN only solution! Incredible, very elegant.
- dba.stackexchange.com/questions/171938/get-only-rows-with-max-group-value asks specifically the case of multiple matches to the max
- SQLite:
- stackoverflow.com/questions/48326957/row-with-max-value-per-group-sqlite
- stackoverflow.com/questions/48326957/row-with-max-value-per-group-sqlite/48328243#48328243 teaches us that in SQLite min and max are magic and guarantee that the matching row is returned
- stackoverflow.com/questions/48326957/row-with-max-value-per-group-sqlite/72996649#72996649 Ciro Santilli uses the magic of ROW_NUMBER (a portable sketch of that approach is shown right after this list)
- stackoverflow.com/questions/17277152/sqlite-select-distinct-of-one-column-and-get-the-others/71924314#71924314 get any full row without specifying which, we teach how to specify
- code.djangoproject.com/ticket/22696 WONTFIXed DISTINCT ON
- stackoverflow.com/questions/50846722/what-is-the-difference-between-postgres-distinct-vs-distinct-on/72997494#72997494 DISTINCT vs DISTINCT ON, somewhat related question
- stackoverflow.com/questions/5803032/group-by-to-return-entire-row asks how to take the top N with distinct after order limit. I don't know how to do it in Postgres
- nodejs/sequelize/raw/most_frequent.js: illustrates a few variants of finding the mode, including across GROUP BY
- nodejs/sequelize/raw/group_by_max_n.js: get the top N in each group
- order results in the same order as IN:
- LIMIT by a running total: TODO links
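To make the ROW_NUMBER idea mentioned in the list above concrete, here is a minimal sketch in standard SQL window function syntax (PostgreSQL, and SQLite since 3.25) over a hypothetical table t(group_id, name, score); the table and column names are made up for illustration, and changing rn = 1 to rn <= N also gives the top N per group:
-- Hypothetical table: t(group_id, name, score).
-- Keep the full row that attains the maximum score in each group_id.
SELECT group_id, name, score
FROM (
  SELECT
    t.*,
    ROW_NUMBER() OVER (
      PARTITION BY group_id
      ORDER BY score DESC
    ) AS rn
  FROM t
) AS ranked
WHERE rn = 1
ORDER BY group_id;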
OK, there's a billion questions:
- SQL Server
- stackoverflow.com/questions/485409/generating-a-histogram-from-column-values-in-a-database OP did not know the difference between count and histogram :-) But it's the number one Google result.
- stackoverflow.com/questions/19103991/create-range-bins-from-sql-server-table-for-histograms has a minor extra group by twist, but otherwise fine
- stackoverflow.com/questions/16268441/generate-histogram-in-sql-server
- SQLite
- stackoverflow.com/questions/67514208/how-to-optimise-creating-histogram-bins-in-sqlite perf only, benchmarking would be needed. SQLite.
- stackoverflow.com/questions/32155449/create-a-histogram-with-a-dynamic-number-of-partitions-in-sqlite variable bin size, same number of entries per bin
- stackoverflow.com/questions/60348109/histogram-for-time-periods-using-sqlite-regular-buckets-1h-wide time
- MySQL: stackoverflow.com/questions/1764881/getting-data-for-histogram-plot MySQL appears to extend ROUND to also round by integers: ROUND(numeric_value, -2), but this is not widely portable, which is a shame
- stackoverflow.com/questions/72367652/populating-empty-bins-in-a-histogram-generated-using-sql specifically asks about empty bins, which is amazing. Amazon Redshift dialect unfortunately, but the answer provided works widely, and Redshift was forked from PostgreSQL, so there's hope. Those newb open source server focused projects that don't use AGPL!
Let's try it on SQLite 3.40.1, Ubuntu 23.04. Data setup:
sqlite3 tmp.sqlite 'create table t(x integer)'
sqlite3 tmp.sqlite <<EOF
insert into t values
  (0),
  (2),
  (2),
  (3),
  (5),
  (6),
  (6),
  (8),
  (9),
  (17);
EOF
sqlite3 tmp.sqlite 'create index tx on t(x)'
For a bin size of 5, ignoring empty ranges, we can use the following query, which produces the desired output shown after it:
sqlite3 tmp.sqlite <<EOF
select floor(x/5)*5 as x,
count(*) as cnt
from t
group by 1
order by 1
EOF
0|4
5|5
15|1
And to consider empty ranges we can use SQL generate_series plus a union, as per stackoverflow.com/questions/72367652/populating-empty-bins-in-a-histogram-generated-using-sql, which outputs the desired:
sqlite3 tmp.sqlite <<EOF
select x, sum(cnt) from (
select floor(x/5)*5 as x,
count(*) as cnt
from t
group by 1
union
select *, 0 as cnt from generate_series(0, 15, 5)
)
group by x
EOF
0|4
5|5
10|0
15|1
Example under: nodejs/sequelize/raw/tree.js
Minimal example: nodejs/sequelize/raw/recursive.js
More advanced SQL tree traversal examples: nodejs/sequelize/raw/tree.js
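For context, the general shape of such traversals is the standard WITH RECURSIVE construct; the following is just a minimal sketch over a hypothetical adjacency list table nodes(id, parent_id), not the exact contents of those example files:
-- Hypothetical table: nodes(id, parent_id), parent_id NULL for the root.
-- Walk the subtree rooted at node 1, tracking the depth of each node.
WITH RECURSIVE subtree(id, depth) AS (
  SELECT id, 0 FROM nodes WHERE id = 1
  UNION ALL
  SELECT nodes.id, subtree.depth + 1
  FROM nodes
  JOIN subtree ON nodes.parent_id = subtree.id
)
SELECT id, depth FROM subtree ORDER BY depth, id;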
Vs SQL SERIALIZABLE isolation level on PostgreSQL: dba.stackexchange.com/questions/284744/postgres-repeatable-read-vs-serializable
nodejs/sequelize/raw/parallel_create_delete_empty_tag.js is an example which experimentally seems to be solved by REPEATABLE READ, although we are not sure that this is truly the case and why. What is clear is that the example is not solved by the SQL READ COMMITTED isolation level. In PostgreSQL, REPEATABLE READ is the first isolation level which can lead to PostgreSQL serialization failures; this does not happen at the SQL READ COMMITTED isolation level in that DBMS. You then have to retry the transaction.
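For concreteness, on PostgreSQL the mechanics of opting into that level and handling the resulting failures look roughly like the sketch below; this is not the code of that example, just the general pattern:
-- Run in each of the two competing sessions.
BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- ... read and modify the shared rows here ...
COMMIT;
-- If the transactions conflict, PostgreSQL aborts one of them with a
-- serialization failure (SQLSTATE 40001, an error like "could not
-- serialize access due to concurrent update"), and the application
-- must then retry the whole transaction.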
A quick look at PostgreSQL's compliance notes: www.postgresql.org/docs/13/features.html shows what a complete and utter mess this standard is. Multiple compliance levels that no one fully implements and optional features everywhere.
Each transaction isolation level specifies what can or cannot happen when two queries are run in parallel, i.e. the memory semantics of the system.
Remember that queries can affect thousands of rows, and database systems like PostgreSQL can run multiple such queries at the same time.
Good summary on the PostgreSQL page: www.postgresql.org/docs/14/transaction-iso.html
Implementation specifics:
SQL's implementation of database triggers.
This feature is really cool, as it allows you to keep caches up to date!
In particular, everything that happens in a trigger happens as if it were in a transaction. This way, you can use fewer explicit transactions when you use triggers. It is a bit like the advantages of SQL CASCADE.
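As a sketch of that cache use case, here is what it could look like in SQLite trigger syntax (PostgreSQL needs a separate trigger function); the posts and user_stats tables are made up for illustration:
-- Made-up schema: posts(id, user_id) and a cache table
-- user_stats(user_id, post_count) that the triggers keep up to date.
CREATE TRIGGER posts_count_insert AFTER INSERT ON posts
BEGIN
  UPDATE user_stats SET post_count = post_count + 1
  WHERE user_id = NEW.user_id;
END;

CREATE TRIGGER posts_count_delete AFTER DELETE ON posts
BEGIN
  UPDATE user_stats SET post_count = post_count - 1
  WHERE user_id = OLD.user_id;
END;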
DBMS:
I've been thinking lightly about adding full text search to OurBigBook.
For example, at docs.ourbigbook.com/news/article-and-topic-id-prefix-search article search was added, but it only finds a match if you search for something that appears right at the start of a title, e.g. for the title:
Fundamental theorem of calculus
you'd get a hit for:
fundamental
but not for:
calculus
To do this efficiently, we need full text search, which PostgreSQL implements.
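The rough shape of that in PostgreSQL would be something like the sketch below; the articles table and its columns are illustrative, not the actual OurBigBook schema:
-- Illustrative schema, not the real OurBigBook one.
CREATE TABLE articles (
  id serial PRIMARY KEY,
  title text,
  body text
);

-- Index the searchable text so queries don't have to scan every row.
CREATE INDEX articles_fts_idx ON articles USING GIN (
  to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
);

-- "calculus" now matches "Fundamental theorem of calculus"
-- even though it is not at the start of the title.
SELECT id, title
FROM articles
WHERE to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
  @@ websearch_to_tsquery('english', 'calculus');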
But finding a clean way to generate test data for testing out the speedup was not so easy, and exploration into this led me to publish a few new slightly improved methods where Googlers can now find them:
- unix.stackexchange.com/questions/97160/is-there-something-like-a-lorem-ipsum-generator/787733#787733 I propose a neat random "sentence" generator using common CLI tools like grep and sed and the pre-installed Ubuntu dictionary /usr/share/dict/american-english:
grep -v "'" /usr/share/dict/american-english | shuf -r |
  paste -d ' ' $(printf "%4s" | sed 's/ /- /g') |
  sed -e 's/^\(.\)/\U\1/;s/$/./' |
  head -n10000000 > lorem.txt
- to achieve that, I also proposed two superior "join every N lines" methods for the CLI: stackoverflow.com/questions/25973140/joining-every-group-of-n-lines-into-one-with-bash/79257780#79257780, notably this awk poem:
seq 10 | awk '{ printf("%s%s", NR == 1 ? "" : NR % 3 == 1 ? "\n" : " ", $0 ) } END { printf("\n") }'
- stackoverflow.com/questions/3371503/sql-populate-table-with-random-data/79255281#79255281 I propose:
- a clean PostgreSQL random string stored procedure that picks random characters from an allowed character list
CREATE OR REPLACE FUNCTION random_string(int) RETURNS TEXT as $$
  select string_agg(substr(characters, (random() * length(characters) + 1)::integer, 1), '') as random_word
  from (values('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789- ')) as symbols(characters)
  join generate_series(1, $1) on 1 = 1
$$ language sql;
- first generating PostgreSQL data as CSV, and then importing the CSV into PostgreSQL as a more flexible method. This can also be done in a streaming fashion from stdin which is neat.
python generate_data.py 10 | psql mydb -c '\copy "mytable" FROM STDIN'
- stackoverflow.com/questions/16020164/psqlexception-error-syntax-error-in-tsquery/79437030#79437030 regarding the safe generation of prefix search tsquery from user inputs without query errors, I've learned about websearch_to_tsquery and further highlighted a possible tsquery -> text -> tsquery approach that might be correct for prefix searches (a small sketch is given right after this list)
- stackoverflow.com/questions/67438575/fulltext-search-using-sequelize-postgres/79439253#79439253 I put everything together into a minimal Sequelize example, ready for usage in OurBigBook
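And the prefix search sketch referenced above, reusing the illustrative articles table from the earlier full text search sketch; this mirrors the tsquery -> text -> tsquery idea and is an untested assumption about how it could be wired up, not the exact answer code:
-- Treat the query the user typed as a prefix search by serializing the
-- tsquery to text, appending ':*' to the last lexeme, and parsing it back.
SELECT id, title
FROM articles
WHERE to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
  @@ to_tsquery('english',
       websearch_to_tsquery('english', 'fundamental theo')::text || ':*');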
Finally I did a writeup summarizing PostgreSQL full text search: Section "PostgreSQL full-text search" and also dumped it at: www.reddit.com/r/PostgreSQL/comments/12yld1o/is_it_worth_using_postgres_builtin_fulltext/ for good measure.
Dumping examples under nodejs/sequelize/raw/many_to_many.js.
Not possible without subqueries in the standard syntax, a huge shame: stackoverflow.com/questions/1293330/how-can-i-do-an-update-statement-with-join-in-sql-server
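So in portable SQL one typically falls back to a correlated subquery, along the lines of this sketch with made-up tables posts(user_id, score) and users(id, score):
-- Copy each user's current score onto their posts without UPDATE ... JOIN.
UPDATE posts
SET score = (
  SELECT users.score
  FROM users
  WHERE users.id = posts.user_id
)
WHERE EXISTS (
  SELECT 1 FROM users WHERE users.id = posts.user_id
);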