Data Collections

Accessing data is a fundamental component of Machine Learning. In recognition of this need, HPC sites are beginning to host core datasets for the Machine Learning community.

If you would like to see more data collections, let us know via the Contact Us button.

UCI Machine Learning Dataset Repository

The UCI Machine Learning Repository maintains 588 data sets as a service to the machine learning community.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources.

Learn more on the AWS Open Data Registry Webpage.

Instructions for Downloading Datasets on MASSIVE

Documentation on how each dataset was downloaded onto MASSIVE is available in this GitHub repository: https://github.com/ML4AU/Data_Collections

ImageNet 2012 (ILSVRC2012) - MASSIVE M3

ImageNet is an image database organised according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. MASSIVE M3 hosts the most recent update of this data collection, from 2012.

Find out more here: https://docs.massive.org.au/M3/data-collection/data-collection.html#

International Skin Imaging Collaboration 2019 (ISIC 2019) - MASSIVE M3

ISIC 2019 is a repository of dermoscopic images developed by the International Skin Imaging Collaboration (ISIC), intended both for clinical training and for supporting technical research toward automated algorithmic analysis. 25,331 images are available for training across 8 different categories.

Find out more here: https://docs.massive.org.au/M3/data-collection/data-collection.html#

NIH Chest X-ray Dataset (NIH CXR-14) - MASSIVE M3

The NIH Chest X-ray Dataset includes 112,120 frontal-view X-ray images from 30,805 unique patients, with image labels text-mined from radiological reports using natural language processing. There are 14 labels, and a single image may carry more than one label.

Find out more here: https://docs.massive.org.au/M3/data-collection/data-collection.html#
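
Because an image in this dataset can carry several of the 14 labels at once, it is usually treated as a multi-label problem, with each image's labels encoded as a 14-element multi-hot vector. The sketch below is illustrative only. It assumes the labels arrive as a single string joined by "|" (for example "Cardiomegaly|Emphysema"), and it assumes the label names listed in LABELS; both should be checked against the label file in the copy of the dataset you are using. The function name encode_findings is hypothetical.

```python
# Assumed label names for the 14 findings; verify against the dataset's own label file.
LABELS = [
    "Atelectasis", "Cardiomegaly", "Effusion", "Infiltration", "Mass",
    "Nodule", "Pneumonia", "Pneumothorax", "Consolidation", "Edema",
    "Emphysema", "Fibrosis", "Pleural_Thickening", "Hernia",
]
LABEL_INDEX = {name: i for i, name in enumerate(LABELS)}


def encode_findings(findings: str, sep: str = "|") -> list[int]:
    """Turn a separator-joined label string into a multi-hot vector."""
    vector = [0] * len(LABELS)
    for name in findings.split(sep):
        name = name.strip()
        # Names outside LABELS (for example a "no finding" marker) are ignored,
        # so such an image encodes to an all-zero vector.
        if name in LABEL_INDEX:
            vector[LABEL_INDEX[name]] = 1
    return vector


# Assumed input format: two findings joined by "|".
example = encode_findings("Cardiomegaly|Emphysema")
```

A vector of this form can be paired with a per-label sigmoid output and a binary cross-entropy loss, rather than the softmax used for single-label classification.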

Stanford Natural Language Inference Corpus (SNLI) - MASSIVE M3

The SNLI corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE). It is intended to serve both as a benchmark for evaluating representational systems for text, especially those induced by representation learning methods, and as a resource for developing NLP models of any kind.

Find out more here: https://docs.massive.org.au/M3/data-collection/data-collection.html#stanford-natural-language-inference-snli-corpus
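
As a rough illustration of working with labeled sentence pairs like these, the sketch below reads records in JSON-lines form and keeps only pairs that carry one of the three labels. It assumes each record has the fields gold_label, sentence1 and sentence2; confirm the field names and file layout against the copy hosted on M3 before relying on them. The function name parse_snli_lines is hypothetical, and the two sample records are hand-written for illustration, not taken from the corpus.

```python
import json
from collections import Counter

VALID_LABELS = {"entailment", "contradiction", "neutral"}


def parse_snli_lines(lines):
    """Yield (premise, hypothesis, label) tuples from JSON-lines records."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        label = record.get("gold_label")  # assumed field name
        if label not in VALID_LABELS:
            continue  # skip pairs that lack one of the three labels
        yield record["sentence1"], record["sentence2"], label


# Hand-written sample records (not from the corpus).
sample = [
    '{"gold_label": "entailment", "sentence1": "A dog runs in a park.", '
    '"sentence2": "An animal is outside."}',
    '{"gold_label": "-", "sentence1": "A man reads.", "sentence2": "A man sleeps."}',
]

pairs = list(parse_snli_lines(sample))
label_counts = Counter(label for _, _, label in pairs)
```

To process a real file, pass an open file handle in place of the sample list, since iterating over a text file yields one line at a time.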