Per-table dumps created with mysqldump and listed at: dumps.wikimedia.org/. Most notably, for the English Wikipedia: dumps.wikimedia.org/enwiki/latest/
A few of the files are not actual tables but derived data, notably dumps.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz from Download titles of all Wikipedia articles
The tables are "documented" under: www.mediawiki.org/wiki/Manual:Database_layout, e.g. the central "page" table: www.mediawiki.org/wiki/Manual:Page_table. But in many cases it is impossible to deduce what fields are from those docs.
dumps.wikimedia.org/enwiki/latest/enwiki-latest-category.sql.gz contains a list of categories. It only contains the categories and some counts, but it doesn't contain the subcategories and pages under each category, so it is a bit pointless.
The schema is listed at: www.mediawiki.org/wiki/Manual:Category_table
The SQL first defines the table:
followed by a few humongous inserts:
which we can see at: en.wikipedia.org/wiki/Category:Computer_storage_devices
CREATE TABLE `category` (
`cat_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`cat_title` varbinary(255) NOT NULL DEFAULT '',
`cat_pages` int(11) NOT NULL DEFAULT 0,
`cat_subcats` int(11) NOT NULL DEFAULT 0,
`cat_files` int(11) NOT NULL DEFAULT 0,
PRIMARY KEY (`cat_id`),
UNIQUE KEY `cat_title` (`cat_title`),
KEY `cat_pages` (`cat_pages`)
) ENGINE=InnoDB AUTO_INCREMENT=249228235 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED;
INSERT INTO `category` VALUES (2,'Unprintworthy_redirects',1597224,20,0),(3,'Computer_storage_devices',88,11,0)
Se see that en.wikipedia.org/wiki/Category:Computer_storage_devices_by_companyso it contains only categories.
- en.wikipedia.org/wiki/Category:Computer_storage_devices is a subcategory of that category and it appears in that file.
- en.wikipedia.org/wiki/Acronis_Secure_Zone is a page of the category, and it does not appear
We can check this with:
and it shows:
There doesn't seem to be any interlink between the categories, only page and subcategory counts therefore.
sed -s 's/),/\n/g' enwiki-latest-category.sql | grep Computer_storage_devices
(3,'Computer_storage_devices',88,11,0
(521773,'Computer_storage_devices_by_company',6,6,0
The schema is listed at: www.mediawiki.org/wiki/Manual:Categorylinks_table
On the SQL:
CREATE TABLE `categorylinks` (
`cl_from` int(8) unsigned NOT NULL DEFAULT 0,
`cl_to` varbinary(255) NOT NULL DEFAULT '',
`cl_sortkey` varbinary(230) NOT NULL DEFAULT '',
`cl_timestamp` timestamp NOT NULL DEFAULT current_timestamp() ON UPDATE current_timestamp(),
`cl_sortkey_prefix` varbinary(255) NOT NULL DEFAULT '',
`cl_collation` varbinary(32) NOT NULL DEFAULT '',
`cl_type` enum('page','subcat','file') NOT NULL DEFAULT 'page',
PRIMARY KEY (`cl_from`,`cl_to`),
KEY `cl_timestamp` (`cl_to`,`cl_timestamp`),
KEY `cl_sortkey` (`cl_to`,`cl_type`,`cl_sortkey`,`cl_from`),
KEY `cl_collation_ext` (`cl_collation`,`cl_to`,`cl_type`,`cl_from`)
) ENGINE=InnoDB DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED;
TODO what is
cl_from
? We've tried:page_id
: nope, there is notpage_id
of 3
cl_to
appears to always be a category string name.The format appears to be described at: www.mediawiki.org/wiki/Manual:Categorylinks_table
A sample INSERT entry is:
(3,'Computer_storage_devices',88,11,0)
Articles by others on the same topic
There are currently no matching articles.