Our WIP script: wikipedia/import-categories.sh.
Related:
- opendata.stackexchange.com/questions/1533/download-wikipedia-articles-from-a-specific-category
- webapps.stackexchange.com/questions/16359/is-there-a-way-to-download-a-list-of-all-wikipedia-categories/172480#172480
- stackoverflow.com/questions/40119322/how-to-download-all-pages-inside-a-category-in-wikipedia
- category tree on Stack Overflow
- stackoverflow.com/questions/17432254/wikipedia-category-hierarchy-from-dumps/77313490#77313490 Canon but no good answers.
- stackoverflow.com/questions/12227134/how-to-fetch-category-tree-of-wiki
- stackoverflow.com/questions/21782410/finding-subcategories-of-a-wikipedia-category-using-category-and-categorylinks-t. Actually explains it: stackoverflow.com/questions/21782410/finding-subcategories-of-a-wikipedia-category-using-category-and-categorylinks-t/21798259#21798259
- stackoverflow.com/questions/27279649/how-to-build-wikipedia-category-hierarchy
- mdkzaman.com/knowledge-graph-from-wikipedia-category-hierarchy/
Consider:
Jewish_physicists
Let's observe them in MySQL:outputs:
mysql enwiki -e "select page_id, page_namespace, page_title, page_is_redirect from page where page_namespace in (0, 14) and page_title in ('Computer_storage_devices', 'Computer_data_storage')"
+----------+----------------+--------------------------+------------------+
| page_id | page_namespace | page_title | page_is_redirect |
+----------+----------------+--------------------------+------------------+
| 5300 | 0 | Computer_data_storage | 0 |
| 42371130 | 0 | Computer_storage_devices | 1 |
| 711721 | 14 | Computer_data_storage | 0 |
| 895945 | 14 | Computer_storage_devices | 0 |
+----------+----------------+--------------------------+------------------+
mysql enwiki -e "select cl_from, cl_to from categorylinks where cl_from in (5300, 711721, 895945, 42371130)"
+----------+-----------------------------------------------------------------------+
| cl_from | cl_to |
+----------+-----------------------------------------------------------------------+
| 5300 | All_articles_containing_potentially_dated_statements |
| 5300 | Articles_containing_potentially_dated_statements_from_2009 |
| 5300 | Articles_containing_potentially_dated_statements_from_2011 |
| 5300 | Articles_with_GND_identifiers |
| 5300 | Articles_with_NKC_identifiers |
| 5300 | Articles_with_short_description |
| 5300 | Computer_architecture |
| 5300 | Computer_data_storage |
| 5300 | Short_description_matches_Wikidata |
| 5300 | Use_dmy_dates_from_June_2020 |
| 5300 | Wikipedia_articles_incorporating_text_from_the_Federal_Standard_1037C |
| 711721 | Computer_architecture |
| 711721 | Computer_data |
| 711721 | Computer_hardware_by_type |
| 711721 | Data_storage |
| 895945 | Computer_data_storage |
| 895945 | Computer_peripherals |
| 895945 | Recording_devices |
| 42371130 | Redirects_from_alternative_names |
+----------+-----------------------------------------------------------------------+
So we see that
cl_from
encodes the parent categories:- parent categories of categories:
- en.wikipedia.org/wiki/Category:Computer_data_storage, which has ID
711721
, has parent categories: "Computer hardware by type", "Computer data", "Data storage", "Computer architecture". This matches exactly on the database. These are all encoded on the source code of the page:{{DEFAULTSORT:Storage}} [[Category:Computer hardware by type]] [[Category:Computer data|Storage]] [[Category:Data storage|Computer]] [[Category:Computer architecture]]
- en.wikipedia.org/wiki/Category:Computer_storage_devices has parent categories: "Computer data storage", "Recording devices", "Computer peripherals". This matches exactly on the database.
- en.wikipedia.org/wiki/Category:Computer_data_storage, which has ID
- parent categories of pages:
- en.wikipedia.org/wiki/Computer_storage_devices whish is a redirect gets the magic category "Redirects_from_alternative_names", a humongous placeholder with many thousands of pages: en.wikipedia.org/wiki/Category:Redirects_from_alternative_names
- en.wikipedia.org/wiki/Computer_data_storage shows only two categories onthe web UI: "Computer data storage" and "Computer architecture". Both of these are present on the database and at the end of the source code:The others appear to be more magic. Two of them we can guess from the templates:
{{DEFAULTSORT:Computer Data Storage}} [[Category:Computer data storage| ]] [[Category:Computer architecture]]
are likely{{short description|Storage of digital data readable by computers}} {{Use dmy dates|date=June 2020}}
Use_dmy_dates_from_June_2020
andArticles_with_short_description
but the rest is more magic and not necessarily present in-source.
So to find all articls and categories under a given category title, say en.wikipedia.org/wiki/Category:Mathematics we can run:
mariadb enwiki -e "select cl_from, cl_to, page_namespace, page_title from categorylinks inner join page on page_namespace in (0, 14) and cl_from = page_id and cl_to = 'Mathematics'"
Lots of features, but slow because written in Python. A faster version may be csvtools. Also some annoyances like obtuse header handing and missing features like grep + cut in one go: csvgrep and select column in csvkit.
A compiled executable under
/usr/bin/csvtool
, has an Ubuntu 23.04 package: manpages.ubuntu.com/manpages/lunar/en/man1/csvtool.1.htmlThere seems to be no sane filtering mechanism however: stackoverflow.com/questions/46540752/using-csvtool-call-to-filter-csv-in-bash
Build failed with
undefined reference to pcre_config
on Ubuntu 23.04: github.com/DavyLandman/csvtools/issues/18Unfortunately it is lacking some basic options, like optional header + selecting column by index on
csvgrep
(though csvcut
has it). The project seems kind of dead.Also unclear if it allows to filter + print only selected columns.
Is this the one?
Convert bytes to hex from Linux CLI by Ciro Santilli 35 Updated 2025-01-10 +Created 1970-01-01
- no formatting;
- stackoverflow.com/questions/2614764/how-to-create-a-hex-dump-of-file-containing-only-the-hex-characters-without-spac
- unix.stackexchange.com/questions/10826/shell-how-to-read-the-bytes-of-a-binary-file-and-print-as-hexadecimal/758531#758531
- stackoverflow.com/questions/2003803/show-hexadecimal-numbers-of-a-file/77262369#77262369
- stackoverflow.com/questions/9515007/linux-script-to-convert-byte-data-into-a-hex-string/77262375#77262375
Extract certain pages of a PDF:
pdftk input.pdf cat 2-4 output out1.pdf
Simple example:output:
printf '00,11,22\n33,44,55\n' | csvgrep -H -c2 -r '^11$' | tail -n+2
00,11,22
csvgrep and select column in csvkit by Ciro Santilli 35 Updated 2025-01-10 +Created 1970-01-01
There seems to be no way without a pipe, you seem to need to reparse the columns, e.g. the tutorial at: csvkit.readthedocs.io/en/latest/tutorial/2_examining_the_data.html#csvgrep-find-the-data-you-need does:
csvcut -c county,item_name,total_cost data.csv | csvgrep -c county -m LANCASTER
This is a weak point of grep, it can't handle large lines that don't fit fully into memory:
- superuser.com/questions/1703029/is-there-a-limit-for-a-line-length-for-grep-command-to-process-correctly what is the grep line limit?
- unix.stackexchange.com/questions/223078/best-way-to-grep-big-binary-file/758528#758528 Ciro's
bgrep
canon - large not required but mentioning bgrep anyways:
- superuser.com/questions/1368263/use-grep-for-a-long-line-to-get-the-part-of-the-line/1811969#1811969
- unix.stackexchange.com/questions/217936/equivalent-command-to-grep-binary-files/758544#758544
- stackoverflow.com/questions/2034799/how-to-truncate-long-matching-lines-returned-by-grep-or-ack/77263826#77263826
- stackoverflow.com/questions/9988379/how-to-grep-a-text-file-which-contains-some-binary-data leaving this one alone for now
- stackoverflow.com/questions/65674717/how-to-check-if-a-binary-file-is-contained-inside-another-binary-from-the-linux search pattern from file
Minimal example: nodejs/sequelize/raw/recursive.js
More advanced SQL tree traversal examples: nodejs/sequelize/raw/tree.js
Often known simply as SQL Server, a terrible thing that makes it impossible to find portable SQL answers on Google! You just have to Google by specific SQL implementation unfortunately to find anything about the open source ones.
SQL RECURSIVE prevent infinite recursion by Ciro Santilli 35 Updated 2025-01-10 +Created 1970-01-01
Example under: nodejs/sequelize/raw/tree.js
Unlisted articles are being shown, click here to show only listed articles.