Linux CLI HOWTO by Ciro Santilli 37 Updated +Created
PDF tool by Ciro Santilli 37 Updated +Created
Project Gutenberg by Ciro Santilli 37 Updated +Created
enwiki-latest-category.sql by Ciro Santilli 37 Updated +Created
dumps.wikimedia.org/enwiki/latest/enwiki-latest-category.sql.gz contains a list of categories. It only contains the categories and some counts, but it doesn't contain the subcategories and pages under each category, so it is a bit pointless.
The SQL first defines the table:
CREATE TABLE `category` (
  `cat_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `cat_title` varbinary(255) NOT NULL DEFAULT '',
  `cat_pages` int(11) NOT NULL DEFAULT 0,
  `cat_subcats` int(11) NOT NULL DEFAULT 0,
  `cat_files` int(11) NOT NULL DEFAULT 0,
  PRIMARY KEY (`cat_id`),
  UNIQUE KEY `cat_title` (`cat_title`),
  KEY `cat_pages` (`cat_pages`)
) ENGINE=InnoDB AUTO_INCREMENT=249228235 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED;
followed by a few humongous inserts:
INSERT INTO `category` VALUES (2,'Unprintworthy_redirects',1597224,20,0),(3,'Computer_storage_devices',88,11,0)
which we can see at: en.wikipedia.org/wiki/Category:Computer_storage_devices
Se see that en.wikipedia.org/wiki/Category:Computer_storage_devices_by_company
so it contains only categories.
We can check this with:
sed -s 's/),/\n/g' enwiki-latest-category.sql | grep Computer_storage_devices
and it shows:
(3,'Computer_storage_devices',88,11,0
(521773,'Computer_storage_devices_by_company',6,6,0
There doesn't seem to be any interlink between the categories, only page and subcategory counts therefore.
enwiki-latest-categorylinks.sql by Ciro Santilli 37 Updated +Created
Download all Wikipedia categories by Ciro Santilli 37 Updated +Created
Let's observe them in MySQL:
mysql enwiki -e "select page_id, page_namespace, page_title, page_is_redirect from page where page_namespace in (0, 14) and page_title in ('Computer_storage_devices', 'Computer_data_storage')"
outputs:
+----------+----------------+--------------------------+------------------+
| page_id  | page_namespace | page_title               | page_is_redirect |
+----------+----------------+--------------------------+------------------+
|     5300 |              0 | Computer_data_storage    |                0 |
| 42371130 |              0 | Computer_storage_devices |                1 |
|   711721 |             14 | Computer_data_storage    |                0 |
|   895945 |             14 | Computer_storage_devices |                0 |
+----------+----------------+--------------------------+------------------+
mysql enwiki -e "select cl_from, cl_to from categorylinks where cl_from in (5300, 711721, 895945, 42371130)"
gives:
+----------+-----------------------------------------------------------------------+
| cl_from  | cl_to                                                                 |
+----------+-----------------------------------------------------------------------+
|     5300 | All_articles_containing_potentially_dated_statements                  |
|     5300 | Articles_containing_potentially_dated_statements_from_2009            |
|     5300 | Articles_containing_potentially_dated_statements_from_2011            |
|     5300 | Articles_with_GND_identifiers                                         |
|     5300 | Articles_with_NKC_identifiers                                         |
|     5300 | Articles_with_short_description                                       |
|     5300 | Computer_architecture                                                 |
|     5300 | Computer_data_storage                                                 |
|     5300 | Short_description_matches_Wikidata                                    |
|     5300 | Use_dmy_dates_from_June_2020                                          |
|     5300 | Wikipedia_articles_incorporating_text_from_the_Federal_Standard_1037C |
|   711721 | Computer_architecture                                                 |
|   711721 | Computer_data                                                         |
|   711721 | Computer_hardware_by_type                                             |
|   711721 | Data_storage                                                          |
|   895945 | Computer_data_storage                                                 |
|   895945 | Computer_peripherals                                                  |
|   895945 | Recording_devices                                                     |
| 42371130 | Redirects_from_alternative_names                                      |
+----------+-----------------------------------------------------------------------+
So we see that cl_from encodes the parent categories:
So to find all articls and categories under a given category title, say en.wikipedia.org/wiki/Category:Mathematics we can run:
mariadb enwiki -e "select cl_from, cl_to, page_namespace, page_title from categorylinks inner join page on page_namespace in (0, 14) and cl_from = page_id and cl_to = 'Mathematics'"
csvkit by Ciro Santilli 37 Updated +Created
Lots of features, but slow because written in Python. A faster version may be csvtools. Also some annoyances like obtuse header handing and missing features like grep + cut in one go: csvgrep and select column in csvkit.
csvtool by Ciro Santilli 37 Updated +Created
A compiled executable under /usr/bin/csvtool, has an Ubuntu 23.04 package: manpages.ubuntu.com/manpages/lunar/en/man1/csvtool.1.html
There seems to be no sane filtering mechanism however: stackoverflow.com/questions/46540752/using-csvtool-call-to-filter-csv-in-bash
csvtools by Ciro Santilli 37 Updated +Created
A fast version of a somewhat subset of csvkit, written in C.
Build failed with undefined reference to pcre_config on Ubuntu 23.04: github.com/DavyLandman/csvtools/issues/18
Unfortunately it is lacking some basic options, like optional header + selecting column by index on csvgrep (though csvcut has it). The project seems kind of dead.
Also unclear if it allows to filter + print only selected columns.
xsv by Ciro Santilli 37 Updated +Created
pdftk by Ciro Santilli 37 Updated +Created
Extract certain pages of a PDF:
pdftk input.pdf cat 2-4 output out1.pdf
Rust library by Ciro Santilli 37 Updated +Created
csvgrep and select column in csvkit by Ciro Santilli 37 Updated +Created
There seems to be no way without a pipe, you seem to need to reparse the columns, e.g. the tutorial at: csvkit.readthedocs.io/en/latest/tutorial/2_examining_the_data.html#csvgrep-find-the-data-you-need does:
csvcut -c county,item_name,total_cost data.csv | csvgrep -c county -m LANCASTER

There are unlisted articles, also show them or only show them.