From cocodataset.org/:
  • 330K images (>200K labeled)
  • 1.5 million object instances
  • 80 object categories
  • 91 stuff categories
  • 5 captions per image. A caption is a short textual description of the image.
So they have relatively few object labels, but their focus seems to be putting a bunch of objects on the same image. E.g. they have 13 cat plus pizza photos. Searching for such weird combinations is kind of fun.
Their official dataset explorer is actually good: cocodataset.org/#explore
And the objects don't just have bounding boxes, but detailed polygons.
Also, images have captions describing the relation between objects:
a black and white cat standing on a table next to a pizza.
Epic.
This dataset is kind of cool.
This is the one used on MLperf v2.1 ResNet, likely one of the most popular choices out there.
2017 challenge subset:
  • train: 118k images, 18GB
  • validation: 5k images, 1GB
  • test: 41k images, 6GB