Cosmopedia is a dataset of synthetic textbooks, blogposts, stories, posts and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1. The dataset contains over 30 million files and 25 billion tokens, making it the largest open synthetic dataset to date.
App-only as of 2023, i.e. for children.
Humans make the table of contents, and then AI fills it. Ciro was thinking about doing the exact same thing at some point, maybe starting from Wikipedia categories.