A software that implements some database system, e.g. PostgreSQL or MySQL are two (widely extended) SQL implementations.
Our WIP script: wikipedia/import-categories.sh.
Related:
- opendata.stackexchange.com/questions/1533/download-wikipedia-articles-from-a-specific-category
- webapps.stackexchange.com/questions/16359/is-there-a-way-to-download-a-list-of-all-wikipedia-categories/172480#172480
- stackoverflow.com/questions/40119322/how-to-download-all-pages-inside-a-category-in-wikipedia
- category tree on Stack Overflow
- stackoverflow.com/questions/17432254/wikipedia-category-hierarchy-from-dumps/77313490#77313490 Canon but no good answers.
- stackoverflow.com/questions/12227134/how-to-fetch-category-tree-of-wiki
- stackoverflow.com/questions/21782410/finding-subcategories-of-a-wikipedia-category-using-category-and-categorylinks-t. Actually explains it: stackoverflow.com/questions/21782410/finding-subcategories-of-a-wikipedia-category-using-category-and-categorylinks-t/21798259#21798259
- stackoverflow.com/questions/27279649/how-to-build-wikipedia-category-hierarchy
- mdkzaman.com/knowledge-graph-from-wikipedia-category-hierarchy/
Consider:
Jewish_physicists
Let's observe them in MySQL:outputs:
mysql enwiki -e "select page_id, page_namespace, page_title, page_is_redirect from page where page_namespace in (0, 14) and page_title in ('Computer_storage_devices', 'Computer_data_storage')"
+----------+----------------+--------------------------+------------------+
| page_id | page_namespace | page_title | page_is_redirect |
+----------+----------------+--------------------------+------------------+
| 5300 | 0 | Computer_data_storage | 0 |
| 42371130 | 0 | Computer_storage_devices | 1 |
| 711721 | 14 | Computer_data_storage | 0 |
| 895945 | 14 | Computer_storage_devices | 0 |
+----------+----------------+--------------------------+------------------+
mysql enwiki -e "select cl_from, cl_to from categorylinks where cl_from in (5300, 711721, 895945, 42371130)"
+----------+-----------------------------------------------------------------------+
| cl_from | cl_to |
+----------+-----------------------------------------------------------------------+
| 5300 | All_articles_containing_potentially_dated_statements |
| 5300 | Articles_containing_potentially_dated_statements_from_2009 |
| 5300 | Articles_containing_potentially_dated_statements_from_2011 |
| 5300 | Articles_with_GND_identifiers |
| 5300 | Articles_with_NKC_identifiers |
| 5300 | Articles_with_short_description |
| 5300 | Computer_architecture |
| 5300 | Computer_data_storage |
| 5300 | Short_description_matches_Wikidata |
| 5300 | Use_dmy_dates_from_June_2020 |
| 5300 | Wikipedia_articles_incorporating_text_from_the_Federal_Standard_1037C |
| 711721 | Computer_architecture |
| 711721 | Computer_data |
| 711721 | Computer_hardware_by_type |
| 711721 | Data_storage |
| 895945 | Computer_data_storage |
| 895945 | Computer_peripherals |
| 895945 | Recording_devices |
| 42371130 | Redirects_from_alternative_names |
+----------+-----------------------------------------------------------------------+
So we see that
cl_from
encodes the parent categories:- parent categories of categories:
- en.wikipedia.org/wiki/Category:Computer_data_storage, which has ID
711721
, has parent categories: "Computer hardware by type", "Computer data", "Data storage", "Computer architecture". This matches exactly on the database. These are all encoded on the source code of the page:{{DEFAULTSORT:Storage}} [[Category:Computer hardware by type]] [[Category:Computer data|Storage]] [[Category:Data storage|Computer]] [[Category:Computer architecture]]
- en.wikipedia.org/wiki/Category:Computer_storage_devices has parent categories: "Computer data storage", "Recording devices", "Computer peripherals". This matches exactly on the database.
- en.wikipedia.org/wiki/Category:Computer_data_storage, which has ID
- parent categories of pages:
- en.wikipedia.org/wiki/Computer_storage_devices whish is a redirect gets the magic category "Redirects_from_alternative_names", a humongous placeholder with many thousands of pages: en.wikipedia.org/wiki/Category:Redirects_from_alternative_names
- en.wikipedia.org/wiki/Computer_data_storage shows only two categories onthe web UI: "Computer data storage" and "Computer architecture". Both of these are present on the database and at the end of the source code:The others appear to be more magic. Two of them we can guess from the templates:
{{DEFAULTSORT:Computer Data Storage}} [[Category:Computer data storage| ]] [[Category:Computer architecture]]
are likely{{short description|Storage of digital data readable by computers}} {{Use dmy dates|date=June 2020}}
Use_dmy_dates_from_June_2020
andArticles_with_short_description
but the rest is more magic and not necessarily present in-source.
So to find all articls and categories under a given category title, say en.wikipedia.org/wiki/Category:Mathematics we can run:
mariadb enwiki -e "select cl_from, cl_to, page_namespace, page_title from categorylinks inner join page on page_namespace in (0, 14) and cl_from = page_id and cl_to = 'Mathematics'"
PostgreSQL feels good.
Its feature set is insanely large! Just look at stuff like: stackoverflow.com/questions/1986491/sql-split-string-by-space-into-table-in-postgresql/1993058#1993058
Had a look at the source tree, and also felt good.
If Oracle is the Microsoft of database, Postgres is the Linux, and MySQL (or more precisely MariaDB) is the FreeBSD (i.e. the one that got delayed by legal issues). Except that their software licenses were accidentally swapped.
The only problem with Postgres is its name. PostgreSQL is so unpronounceable and so untypeable that you should just call it "Postgres" like everyone else.
OK, there's a billion questions:
- SQL Server
- stackoverflow.com/questions/485409/generating-a-histogram-from-column-values-in-a-database OP did not know the difference between count and histogram :-) But it's the number one Google result.
- stackoverflow.com/questions/19103991/create-range-bins-from-sql-server-table-for-histograms has a minor extra group by twist, but otherwise fine
- stackoverflow.com/questions/16268441/generate-histogram-in-sql-server
- SQLite
- stackoverflow.com/questions/67514208/how-to-optimise-creating-histogram-bins-in-sqlite perf only, benchmarking would be needed. SQLite.
- stackoverflow.com/questions/32155449/create-a-histogram-with-a-dynamic-number-of-partitions-in-sqlite variable bin size, same number of entries per bin
- stackoverflow.com/questions/60348109/histogram-for-time-periods-using-sqlite-regular-buckets-1h-wide time
- MySQL: stackoverflow.com/questions/1764881/getting-data-for-histogram-plot MySQL appears to extend
ROUND
to also round by integers:ROUND(numeric_value, -2)
, but this is not widely portable which is a shame - stackoverflow.com/questions/72367652/populating-empty-bins-in-a-histogram-generated-using-sql specifically asks about empty bins, which is amazing. Amazon Redshift dialect unfortunately, but answer provided works widely, and Redshift was forked from PostgreSQL, so there's hope. Those newb open source server focused projects that don't use AGPL!
Let's try it on SQLite 3.40.1, Ubuntu 23.04. Data setup:
sqlite3 tmp.sqlite 'create table t(x integer)'
sqlite3 tmp.sqlite <<EOF
insert into t values (
0,
2,
2,
3,
5,
6,
6,
8,
9,
17,
)
EOF
sqlite3 tmp.sqlite 'create index tx on t(x)'
For a bin size of 5 ignoring empty ranges we can:which produces the desired:
sqlite3 tmp.sqlite <<EOF
select floor(x/5)*5 as x,
count(*) as cnt
from t
group by 1
order by 1
EOF
0|4
5|5
15|1
And to consider empty ranges we can use SQL which outputs the desired:
genenerate_series
+ as per stackoverflow.com/questions/72367652/populating-empty-bins-in-a-histogram-generated-using-sql:sqlite3 tmp.sqlite <<EOF
select x, sum(cnt) from (
select floor(x/5)*5 as x,
count(*) as cnt
from t
group by 1
union
select *, 0 as cnt from generate_series(0, 15, 5)
)
group by x
EOF
0|4
5|5
10|0
15|1
Example: nodejs/sequelize/raw/tree.js
- Implementation agnostic
- stackoverflow.com/questions/192220/what-is-the-most-efficient-elegant-way-to-parse-a-flat-table-into-a-tree
- stackoverflow.com/questions/5508985/recursive-query-for-adjacency-list-to-preorder-tree-traversal-in-sql DBMS agnostic specifically asking not to modify adjacenty list data structure
- Postgres
- stackoverflow.com/questions/67848017/simple-recursive-sql-query
- stackoverflow.com/questions/28688264/how-to-traverse-a-hierarchical-tree-structure-structure-backwards-using-recursiv
- stackoverflow.com/questions/51822070/how-can-postgres-represent-a-tree-of-row-ids
- depth first
- uspecified depth first variant
- preorder DFS
- breadth-first stackoverflow.com/questions/3709292/select-rows-from-table-using-tree-order
- MySQL
- stackoverflow.com/questions/8252323/mysql-closure-table-hierarchical-database-how-to-pull-information-out-in-the-c asks how to use a specific order (preorder DFS) with closure table
- Microsoft SQL Server