Dataset - Tumor Proliferation Assessment Challenge

The participants of this challenge are provided with a training dataset of whole slide images with known tumor proliferation scores. The goal of the challenge is to assess algorithms that predict the tumor proliferation scores from the whole slide images. Check out the Tasks and Evaluation page for more details on the challenge tasks.

In addition, two auxiliary datasets are provided: 1) a dataset with annotated mitotic figures that can be used to train a mitosis detection method, and 2) a dataset with annotations of regions of interest that can be used to train a region of interest detection method.

Note that the use of the auxiliary datasets is optional. The whole slide image → region of interest → mitosis detection → tumor proliferation score pipeline is only one approach to designing a tumor proliferation assessment method. We encourage the participants to explore other approaches. For example, tumor proliferation is associated with morphological features other than mitotic figures, such as the general appearance of the cell nuclei and tissue, which can be used to improve the estimation.

Training dataset

The training dataset consists of 500 breast cancer cases from The Cancer Genome Atlas. Each case is represented with one whole-slide image and is annotated with a proliferation score based on mitosis counting by pathologists, and a molecular proliferation score.


Image data	Available from Google Drive (490 GB)
Ground truth data	training_ground_truth.csv

Format: The image data is provided in the form of whole-slide images. Whole-slide images are stored in the Aperio .svs file format as multi-resolution pyramid structures (the size of the highest resolution image can easily exceed 50,000 by 50,000 pixels). The files contain multiple downsampled versions of the original image. Each image in the pyramid is stored as a series of tiles, to facilitate rapid retrieval of subregions of the image. Libraries and software that can open this file format are listed in the Software page.
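To give a feel for the pyramid structure described above, here is a minimal sketch in plain Python. It does not read .svs files (use one of the libraries on the Software page for that). The downsampling factors 1, 4, 16 and 32 and the helper names are assumptions chosen for illustration, not values taken from the dataset.

```python
def pyramid_level_sizes(base_width, base_height, downsamples):
    """Return the approximate (width, height) of each pyramid level.

    `downsamples` lists each level's downsampling factor relative to the
    full-resolution image (1 means full resolution).
    """
    return [(base_width // d, base_height // d) for d in downsamples]


def best_level_for_downsample(downsamples, target):
    """Index of the most downsampled level whose factor does not exceed `target`."""
    candidates = [i for i, d in enumerate(downsamples) if d <= target]
    if not candidates:
        return 0  # fall back to the full-resolution level
    return max(candidates, key=lambda i: downsamples[i])


def uncompressed_rgb_bytes(width, height):
    """Memory needed to hold an image as uncompressed 8-bit RGB."""
    return width * height * 3


# Hypothetical pyramid for a 50,000 by 50,000 pixel slide (factors are assumed).
DOWNSAMPLES = [1, 4, 16, 32]
sizes = pyramid_level_sizes(50_000, 50_000, DOWNSAMPLES)
for factor, (w, h) in zip(DOWNSAMPLES, sizes):
    gigabytes = uncompressed_rgb_bytes(w, h) / 1e9
    print(f"downsample {factor:>2}: {w} x {h} pixels, about {gigabytes:.3f} GB as raw RGB")
```

Under these assumptions the full-resolution level alone would need several gigabytes if decoded at once, which is why reading individual tiles or a lower-resolution level is usually the practical choice.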

The CSV file containing the ground truth has 500 rows (one for each patient) and two columns. The first column corresponds to the tumor proliferation score based on mitosis counting. The second column contains the molecular proliferation score.
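A two-column file like this can be read with Python's standard csv module. The sketch below assumes a comma delimiter, no header row and numeric values in both columns; none of these details are stated here, so check them against the actual file. The sample values are made up for illustration.

```python
import csv
import io


def read_ground_truth(file_obj):
    """Parse rows of (mitosis-based score, molecular score).

    Assumes comma-separated numeric values and no header row.
    """
    scores = []
    for row in csv.reader(file_obj):
        if not row:
            continue  # skip blank lines
        scores.append((float(row[0]), float(row[1])))
    return scores


# Made-up sample rows standing in for the real file contents.
sample = io.StringIO("1,-0.25\n3,0.70\n")
parsed = read_ground_truth(sample)
print(parsed)
```

For the real data, pass an open file instead of the sample, for example `open("training_ground_truth.csv", newline="")`.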

Expanding the training dataset with external data or using other auxiliary datasets is allowed, provided that they are publicly available. The one exception is other data from The Cancer Genome Atlas.

Auxiliary dataset: mitoses

The first auxiliary dataset consists of images from 73 breast cancer cases from three pathology centers. The first 23 cases are the dataset that was previously released as part of the AMIDA13 challenge. These cases were collected from the Department of Pathology at the University Medical Center in Utrecht, The Netherlands.

The remaining 50 cases are from two different pathology centers in The Netherlands (cases 24-48 are from one center and cases 48-73 are from another center). Each case is represented with one image region with an area of 2 mm². The whole-slide images from which the image regions were extracted were produced with the Leica SCN400 whole-slide image scanner (×40 magnification and spatial resolution of 0.25 μm/pixel). The annotated mitotic figures are the consensus of at least two pathologists, similar to the AMIDA13 challenge.
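As a rough sanity check on image sizes, the stated area and spatial resolution can be converted to a pixel count. The sketch below is plain arithmetic. The shape of the regions is not stated, so the square side length is only an illustrative assumption, and the function name is hypothetical.

```python
import math


def region_pixel_count(area_mm2, microns_per_pixel):
    """Number of pixels covering `area_mm2` at the given resolution."""
    area_um2 = area_mm2 * 1_000_000  # 1 mm^2 = 10^6 um^2
    pixel_area_um2 = microns_per_pixel ** 2
    return area_um2 / pixel_area_um2


pixels = region_pixel_count(2, 0.25)
side = math.sqrt(pixels)  # side length if the region were square (an assumption)
print(f"{pixels:.0f} pixels, about {side:.0f} pixels per side if square")
```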


Image data	Available from Google Drive (10 GB)
Ground truth data	mitoses_ground_truth.zip

Format: Each case is represented by a number of image regions stored as TIFF images (cases 23-73 have only one large image region). Regions that contain mitotic figures have an accompanying .csv file that contains the locations of the mitotic figures in the format (row, column) (for example, mitoses_ground_truth/01/01.csv corresponds to mitoses_image_data/01/01.tif). The absence of a .csv file indicates that the region contains no mitotic figures. A side-by-side view of all mitotic figures in the dataset is available here.
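The convention that a missing .csv file means "no mitotic figures" is easy to get wrong when loading annotations, so here is a minimal sketch of one way to handle it. The helper names are hypothetical, and the code assumes comma-separated values with no header row and coordinates that can be rounded down to integers. The demo writes made-up coordinates to a temporary directory instead of touching the real data.

```python
import csv
import os
import tempfile


def annotation_path(ground_truth_dir, case_id, region_id):
    """Build the path of the annotation file for one image region."""
    return os.path.join(ground_truth_dir, case_id, region_id + ".csv")


def load_mitoses(csv_path):
    """Return a list of (row, column) locations; empty if no file exists."""
    if not os.path.exists(csv_path):
        return []  # no .csv file means the region has no mitotic figures
    with open(csv_path, newline="") as f:
        return [(int(float(r[0])), int(float(r[1]))) for r in csv.reader(f) if r]


# Demo on a temporary directory with made-up coordinates.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "01"))
    with open(annotation_path(root, "01", "01"), "w", newline="") as f:
        csv.writer(f).writerows([[120, 345], [980, 77]])
    annotated = load_mitoses(annotation_path(root, "01", "01"))
    unannotated = load_mitoses(annotation_path(root, "01", "02"))

print(annotated, unannotated)
```

Note that the order is (row, column), so the first value indexes the vertical axis of the image.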

Auxiliary dataset: regions of interest

For 148 cases from the training dataset we provide annotations of regions of interest where pathologists would perform mitosis counting.


Ground truth data	ROIs.zip

Format: The regions of interest are provided as .csv files whose names correspond to the whole-slide image files (for example, TUPAC-TR-032-ROI.csv corresponds to TUPAC-TR-032.svs). Each row in the .csv file represents one rectangular region of interest in the format (x, y, width, height), where x and y are the coordinates of the top-left corner of the rectangle.
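The naming rule and the row format above can be sketched as follows. The helper names are hypothetical, and the code assumes comma-separated numeric values with no header row. The sample row is made up. Which pyramid level the coordinates refer to is not stated here, so verify that before cropping.

```python
import csv
import io
import os


def roi_csv_name(slide_filename):
    """Map a slide file name to the name of its ROI annotation file."""
    stem, _ = os.path.splitext(os.path.basename(slide_filename))
    return stem + "-ROI.csv"


def read_rois(file_obj):
    """Parse rows of (x, y, width, height) into dictionaries.

    (x, y) is the top-left corner; `right` and `bottom` are derived.
    """
    rois = []
    for row in csv.reader(file_obj):
        if not row:
            continue  # skip blank lines
        x, y, w, h = (int(float(v)) for v in row[:4])
        rois.append({"x": x, "y": y, "width": w, "height": h,
                     "right": x + w, "bottom": y + h})
    return rois


print(roi_csv_name("TUPAC-TR-032.svs"))
example = read_rois(io.StringIO("100,200,50,60\n"))  # made-up rectangle
print(example)
```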

Testing dataset

The testing dataset consists of 321 breast cancer cases. Each case is represented with one whole-slide image. The ground truth for this dataset is not publicly available. The evaluation will be done by the challenge organizers after submission of results.


Image data	Available from Google Drive (345 GB)

Mitosis detection testing dataset

This dataset consists of images from 34 breast cancer cases from two pathology labs (the same pathology labs as for cases 24-73 from the auxiliary mitosis dataset). Each case is represented with one image region with an area of 2 mm². The whole-slide images from which the image regions were extracted were produced with the Leica SCN400 whole-slide image scanner (×40 magnification and spatial resolution of 0.25 μm/pixel).

The ground truth for this dataset is not publicly available. The evaluation will be done by the challenge organizers after submission of results.


Image data	Available from Google Drive (3 GB)

Format: This dataset has the same format as the auxiliary mitosis dataset.