The dataset has the following structure:
- Training Set: 1000 tasks spanning a range of difficulty levels, from easy to hard, designed to contain all the "primitives" needed for the eval sets
- Public Eval Set: 120 tasks
- Semi-Private Eval Set: 120 hard tasks, which may have been exposed to a limited number of third parties, e.g. via API
- Private Eval Set: 120 tasks, never exposed to third parties
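
As a quick sanity check on the split sizes listed above, the counts can be written down as a small lookup table. This is only an illustrative sketch: the key names (`training`, `public_eval`, and so on) and the helper `total_tasks` are hypothetical and do not correspond to any official file names or API.

```python
# Task counts per split, copied from the list above.
# The key names are hypothetical labels chosen for this sketch.
SPLIT_SIZES = {
    "training": 1000,
    "public_eval": 120,
    "semi_private_eval": 120,
    "private_eval": 120,
}


def total_tasks(sizes, prefix=""):
    """Sum the task counts of all splits whose name starts with `prefix`."""
    return sum(n for name, n in sizes.items() if name.startswith(prefix))


if __name__ == "__main__":
    print("all splits:", total_tasks(SPLIT_SIZES))
    # Eval splits are the ones whose (hypothetical) label ends in "_eval".
    eval_total = sum(n for name, n in SPLIT_SIZES.items() if name.endswith("_eval"))
    print("eval splits:", eval_total)
```

Under these counts, the three evaluation sets together hold 360 tasks against 1000 training tasks.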