dumps.wikimedia.org/enwiki/latest/enwiki-latest-category.sql.gz contains a list of categories. It only contains the categories and some counts, but it doesn't contain the subcategories and pages under each category, so it is a bit pointless.
The schema is listed at: www.mediawiki.org/wiki/Manual:Category_table
The SQL first defines the table:followed by a few humongous inserts:which we can see at: en.wikipedia.org/wiki/Category:Computer_storage_devices
CREATE TABLE `category` (
`cat_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`cat_title` varbinary(255) NOT NULL DEFAULT '',
`cat_pages` int(11) NOT NULL DEFAULT 0,
`cat_subcats` int(11) NOT NULL DEFAULT 0,
`cat_files` int(11) NOT NULL DEFAULT 0,
PRIMARY KEY (`cat_id`),
UNIQUE KEY `cat_title` (`cat_title`),
KEY `cat_pages` (`cat_pages`)
) ENGINE=InnoDB AUTO_INCREMENT=249228235 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED;
INSERT INTO `category` VALUES (2,'Unprintworthy_redirects',1597224,20,0),(3,'Computer_storage_devices',88,11,0)
Se see that en.wikipedia.org/wiki/Category:Computer_storage_devices_by_companyso it contains only categories.
- en.wikipedia.org/wiki/Category:Computer_storage_devices is a subcategory of that category and it appears in that file.
- en.wikipedia.org/wiki/Acronis_Secure_Zone is a page of the category, and it does not appear
We can check this with:and it shows:There doesn't seem to be any interlink between the categories, only page and subcategory counts therefore.
sed -s 's/),/\n/g' enwiki-latest-category.sql | grep Computer_storage_devices
(3,'Computer_storage_devices',88,11,0
(521773,'Computer_storage_devices_by_company',6,6,0
The schema is listed at: www.mediawiki.org/wiki/Manual:Categorylinks_table
On the SQL:
CREATE TABLE `categorylinks` (
`cl_from` int(8) unsigned NOT NULL DEFAULT 0,
`cl_to` varbinary(255) NOT NULL DEFAULT '',
`cl_sortkey` varbinary(230) NOT NULL DEFAULT '',
`cl_timestamp` timestamp NOT NULL DEFAULT current_timestamp() ON UPDATE current_timestamp(),
`cl_sortkey_prefix` varbinary(255) NOT NULL DEFAULT '',
`cl_collation` varbinary(32) NOT NULL DEFAULT '',
`cl_type` enum('page','subcat','file') NOT NULL DEFAULT 'page',
PRIMARY KEY (`cl_from`,`cl_to`),
KEY `cl_timestamp` (`cl_to`,`cl_timestamp`),
KEY `cl_sortkey` (`cl_to`,`cl_type`,`cl_sortkey`,`cl_from`),
KEY `cl_collation_ext` (`cl_collation`,`cl_to`,`cl_type`,`cl_from`)
) ENGINE=InnoDB DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED;
TODO what is
cl_from
? We've tried:page_id
: nope, there is notpage_id
of 3
cl_to
appears to always be a category string name.The format appears to be described at: www.mediawiki.org/wiki/Manual:Categorylinks_table
A sample INSERT entry is:
(3,'Computer_storage_devices',88,11,0)
Download titles of all Wikipedia articles by
Ciro Santilli 35 Updated 2025-03-28 +Created 1970-01-01
dumps.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz Characterization:
- contains redirects, e.g. en.wikipedia.org/wiki/"Ampere_North" redirects to en.wikipedia.org/wiki/Ampere_North,_New_Jersey and both are present. Noted in this comment: stackoverflow.com/questions/24474288/how-to-obtain-a-list-of-titles-of-all-wikipedia-articles#comment136016773_24474476
Download titles of all Wikipedia articles without redirects by
Ciro Santilli 35 Updated 2025-03-28 +Created 1970-01-01
Our WIP script: wikipedia/import-categories.sh.
Related:
- opendata.stackexchange.com/questions/1533/download-wikipedia-articles-from-a-specific-category
- webapps.stackexchange.com/questions/16359/is-there-a-way-to-download-a-list-of-all-wikipedia-categories/172480#172480
- stackoverflow.com/questions/40119322/how-to-download-all-pages-inside-a-category-in-wikipedia
- category tree on Stack Overflow
- stackoverflow.com/questions/17432254/wikipedia-category-hierarchy-from-dumps/77313490#77313490 Canon but no good answers.
- stackoverflow.com/questions/12227134/how-to-fetch-category-tree-of-wiki
- stackoverflow.com/questions/21782410/finding-subcategories-of-a-wikipedia-category-using-category-and-categorylinks-t. Actually explains it: stackoverflow.com/questions/21782410/finding-subcategories-of-a-wikipedia-category-using-category-and-categorylinks-t/21798259#21798259
- stackoverflow.com/questions/27279649/how-to-build-wikipedia-category-hierarchy
- mdkzaman.com/knowledge-graph-from-wikipedia-category-hierarchy/
Consider:
Jewish_physicists
Let's observe them in MySQL:outputs:
mysql enwiki -e "select page_id, page_namespace, page_title, page_is_redirect from page where page_namespace in (0, 14) and page_title in ('Computer_storage_devices', 'Computer_data_storage')"
+----------+----------------+--------------------------+------------------+
| page_id | page_namespace | page_title | page_is_redirect |
+----------+----------------+--------------------------+------------------+
| 5300 | 0 | Computer_data_storage | 0 |
| 42371130 | 0 | Computer_storage_devices | 1 |
| 711721 | 14 | Computer_data_storage | 0 |
| 895945 | 14 | Computer_storage_devices | 0 |
+----------+----------------+--------------------------+------------------+
mysql enwiki -e "select cl_from, cl_to from categorylinks where cl_from in (5300, 711721, 895945, 42371130)"
+----------+-----------------------------------------------------------------------+
| cl_from | cl_to |
+----------+-----------------------------------------------------------------------+
| 5300 | All_articles_containing_potentially_dated_statements |
| 5300 | Articles_containing_potentially_dated_statements_from_2009 |
| 5300 | Articles_containing_potentially_dated_statements_from_2011 |
| 5300 | Articles_with_GND_identifiers |
| 5300 | Articles_with_NKC_identifiers |
| 5300 | Articles_with_short_description |
| 5300 | Computer_architecture |
| 5300 | Computer_data_storage |
| 5300 | Short_description_matches_Wikidata |
| 5300 | Use_dmy_dates_from_June_2020 |
| 5300 | Wikipedia_articles_incorporating_text_from_the_Federal_Standard_1037C |
| 711721 | Computer_architecture |
| 711721 | Computer_data |
| 711721 | Computer_hardware_by_type |
| 711721 | Data_storage |
| 895945 | Computer_data_storage |
| 895945 | Computer_peripherals |
| 895945 | Recording_devices |
| 42371130 | Redirects_from_alternative_names |
+----------+-----------------------------------------------------------------------+
So we see that
cl_from
encodes the parent categories:- parent categories of categories:
- en.wikipedia.org/wiki/Category:Computer_data_storage, which has ID
711721
, has parent categories: "Computer hardware by type", "Computer data", "Data storage", "Computer architecture". This matches exactly on the database. These are all encoded on the source code of the page:{{DEFAULTSORT:Storage}} [[Category:Computer hardware by type]] [[Category:Computer data|Storage]] [[Category:Data storage|Computer]] [[Category:Computer architecture]]
- en.wikipedia.org/wiki/Category:Computer_storage_devices has parent categories: "Computer data storage", "Recording devices", "Computer peripherals". This matches exactly on the database.
- en.wikipedia.org/wiki/Category:Computer_data_storage, which has ID
- parent categories of pages:
- en.wikipedia.org/wiki/Computer_storage_devices whish is a redirect gets the magic category "Redirects_from_alternative_names", a humongous placeholder with many thousands of pages: en.wikipedia.org/wiki/Category:Redirects_from_alternative_names
- en.wikipedia.org/wiki/Computer_data_storage shows only two categories onthe web UI: "Computer data storage" and "Computer architecture". Both of these are present on the database and at the end of the source code:The others appear to be more magic. Two of them we can guess from the templates:
{{DEFAULTSORT:Computer Data Storage}} [[Category:Computer data storage| ]] [[Category:Computer architecture]]
are likely{{short description|Storage of digital data readable by computers}} {{Use dmy dates|date=June 2020}}
Use_dmy_dates_from_June_2020
andArticles_with_short_description
but the rest is more magic and not necessarily present in-source.
So to find all articls and categories under a given category title, say en.wikipedia.org/wiki/Category:Mathematics we can run:
mariadb enwiki -e "select cl_from, cl_to, page_namespace, page_title from categorylinks inner join page on page_namespace in (0, 14) and cl_from = page_id and cl_to = 'Mathematics'"
Lots of features, but slow because written in Python. A faster version may be csvtools. Also some annoyances like obtuse header handing and missing features like grep + cut in one go: csvgrep and select column in csvkit.
A compiled executable under
/usr/bin/csvtool
, has an Ubuntu 23.04 package: manpages.ubuntu.com/manpages/lunar/en/man1/csvtool.1.htmlThere seems to be no sane filtering mechanism however: stackoverflow.com/questions/46540752/using-csvtool-call-to-filter-csv-in-bash
Build failed with
undefined reference to pcre_config
on Ubuntu 23.04: github.com/DavyLandman/csvtools/issues/18Unfortunately it is lacking some basic options, like optional header + selecting column by index on
csvgrep
(though csvcut
has it). The project seems kind of dead.Also unclear if it allows to filter + print only selected columns.
Is this the one?
Convert bytes to hex from Linux CLI by
Ciro Santilli 35 Updated 2025-03-28 +Created 1970-01-01
- no formatting;
- stackoverflow.com/questions/2614764/how-to-create-a-hex-dump-of-file-containing-only-the-hex-characters-without-spac
- unix.stackexchange.com/questions/10826/shell-how-to-read-the-bytes-of-a-binary-file-and-print-as-hexadecimal/758531#758531
- stackoverflow.com/questions/2003803/show-hexadecimal-numbers-of-a-file/77262369#77262369
- stackoverflow.com/questions/9515007/linux-script-to-convert-byte-data-into-a-hex-string/77262375#77262375
Extract certain pages of a PDF:
pdftk input.pdf cat 2-4 output out1.pdf
Simple example:output:
printf '00,11,22\n33,44,55\n' | csvgrep -H -c2 -r '^11$' | tail -n+2
00,11,22
csvgrep and select column in csvkit by
Ciro Santilli 35 Updated 2025-03-28 +Created 1970-01-01
There seems to be no way without a pipe, you seem to need to reparse the columns, e.g. the tutorial at: csvkit.readthedocs.io/en/latest/tutorial/2_examining_the_data.html#csvgrep-find-the-data-you-need does:
csvcut -c county,item_name,total_cost data.csv | csvgrep -c county -m LANCASTER
This is a weak point of grep, it can't handle large lines that don't fit fully into memory:
- superuser.com/questions/1703029/is-there-a-limit-for-a-line-length-for-grep-command-to-process-correctly what is the grep line limit?
- unix.stackexchange.com/questions/223078/best-way-to-grep-big-binary-file/758528#758528 Ciro's
bgrep
canon - large not required but mentioning bgrep anyways:
- superuser.com/questions/1368263/use-grep-for-a-long-line-to-get-the-part-of-the-line/1811969#1811969
- unix.stackexchange.com/questions/217936/equivalent-command-to-grep-binary-files/758544#758544
- stackoverflow.com/questions/2034799/how-to-truncate-long-matching-lines-returned-by-grep-or-ack/77263826#77263826
- stackoverflow.com/questions/9988379/how-to-grep-a-text-file-which-contains-some-binary-data leaving this one alone for now
- stackoverflow.com/questions/65674717/how-to-check-if-a-binary-file-is-contained-inside-another-binary-from-the-linux search pattern from file
From Machine Learning to Autonomous Intelligence by Yann LeCun (2023)
Source. After a bunch of B.S., LeCun goes on to describe his AGI architecture. Nothing ground breaking, but not bad either.Minimal example: nodejs/sequelize/raw/recursive.js
More advanced SQL tree traversal examples: nodejs/sequelize/raw/tree.js
Unlisted articles are being shown, click here to show only listed articles.