
minhanhto09/NuCLS_dataset

The NuCLS dataset comprises over 220,000 annotated nuclei from breast cancer images, primarily for developing and validating nucleus detection, classification, and segmentation algorithms. Annotations were performed by pathologists, pathology residents, and medical students, covering both single‑observer and multi‑observer evaluations. The current release contains the corrected single‑observer subset: 1,744 entries, each with a high‑resolution RGB image, a mask image, a visualization image, and nucleus annotation coordinates. The entries are split into six folds (five regular folds plus a small debug fold), each with separate training and test subsets, to assess cross‑institution generalization. The dataset is suitable for image classification, detection, and segmentation tasks.

Updated 7/13/2024

Description

NuCLS Dataset

Overview

NuCLS contains over 220,000 labeled nuclei from breast cancer images sourced from TCGA, making it one of the largest datasets for nucleus detection, classification, and segmentation. Annotations were created by pathologists, pathology residents, and medical students using the Digital Slide Archive. The dataset supports development and validation of nucleus detection, classification, and segmentation algorithms, as well as multi‑observer analysis studies. The current version includes only the corrected single‑observer subset, approximately 59,500 nuclei.

Data Access

The dataset can be loaded with the Python datasets library, using either the default configuration (full dataset) or the debug configuration (a much smaller subset).

from datasets import load_dataset

# Full dataset
dataset = load_dataset("minhanhto09/NuCLS_dataset", name="default")

# Small subset for quick experiments
debug_dataset = load_dataset("minhanhto09/NuCLS_dataset", name="debug")

Data Structure

Data Schema

The corrected single‑observer subset contains 1,744 entries, each with a field‑of‑view image, mask image, visualization image, and a list of nucleus annotation coordinates, totaling 59,485 nucleus annotations. Image resolution is 0.2 µm/pixel; annotation coordinates are provided in pixel units.
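Because coordinates are given in pixels and the stated resolution is 0.2 µm/pixel, physical sizes follow from a simple multiplication. The sketch below illustrates that arithmetic; the function names are hypothetical helpers, not part of the dataset or of any library.

```python
# Resolution stated in the dataset description.
MICRONS_PER_PIXEL = 0.2


def pixels_to_microns(length_px: float) -> float:
    """Convert a length measured in pixels to micrometres."""
    return length_px * MICRONS_PER_PIXEL


def area_square_microns(width_px: float, height_px: float) -> float:
    """Area in square micrometres of a region given in pixels."""
    return pixels_to_microns(width_px) * pixels_to_microns(height_px)


if __name__ == "__main__":
    # A hypothetical nucleus bounding box of 40 x 30 pixels.
    print(pixels_to_microns(40), pixels_to_microns(30))
    print(area_square_microns(40, 30))
```

Areas scale with the square of the resolution, so one pixel covers 0.04 µm².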

Each entry includes:

  • file_name: Unique file name encoding the most relevant information for the sample.
  • rgb_image: High‑resolution RGB image of breast cancer tissue.
  • mask_image: Mask image for each labeled nucleus. Class labels are encoded in the first channel; the second and third channels generate a unique identifier for each nucleus. Gray‑colored regions mark the field of view.
  • visualization_image: Overlay of RGB and mask images for visual inspection.
  • annotation_coordinates: List of nucleus annotations per instance, each containing:
    • raw_classification: Basic nucleus class (13 possible classes, e.g., tumor or lymphocyte).
    • main_classification: Higher‑level class (7 classes, e.g., tumor_mitotic, nonTILnonMQ_stromal).
    • super_classification: Broadest class label (4 options, e.g., sTIL, nonTIL_stromal).
    • type: Annotation format used, rectangle or polyline.
    • xmin, ymin, xmax, ymax: Bounding box coordinates.
    • coords_x, coords_y: Specific boundary coordinates.
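As a rough illustration of working with these fields, the sketch below tallies nuclei per class and computes bounding box sizes for one entry. It makes two assumptions that are not confirmed by the dataset description: the sample values are invented, and annotation_coordinates may arrive either as a list of per‑nucleus dicts or as a dict of parallel lists (the datasets library converts a Sequence of dict features into the latter), so the helper accepts both. All function names are hypothetical.

```python
from collections import Counter


def as_records(annotations):
    """Return annotations as a list of per-nucleus dicts.

    Accepts either a list of dicts or a dict of parallel lists.
    """
    if isinstance(annotations, dict):
        keys = list(annotations)
        count = len(annotations[keys[0]]) if keys else 0
        return [{key: annotations[key][i] for key in keys} for i in range(count)]
    return list(annotations)


def count_classes(annotations, level="main_classification"):
    """Count nuclei per class at the chosen classification level."""
    return Counter(record[level] for record in as_records(annotations))


def box_size(record):
    """Width and height of a bounding box, in pixels."""
    return record["xmax"] - record["xmin"], record["ymax"] - record["ymin"]


# Invented sample in the dict-of-lists layout (assumption, see lead-in).
sample = {
    "raw_classification": ["tumor", "lymphocyte", "tumor"],
    "xmin": [10, 50, 80],
    "ymin": [12, 40, 90],
    "xmax": [40, 70, 115],
    "ymax": [45, 62, 120],
}

if __name__ == "__main__":
    print(count_classes(sample, level="raw_classification"))
    print([box_size(record) for record in as_records(sample)])
```

With a loaded entry, the same helpers would be called on entry["annotation_coordinates"], provided the field layout matches one of the two assumed shapes.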

Data Splits

The dataset is divided into six folds: five regular folds (1 to 5) and a small debug fold (999). Each fold has its own training and test sets, split by source hospital, so that evaluation reflects how well models generalize across institutions and their imaging practices.

Split                     Training Samples    Test Samples
train_fold_1              1,481               263
train_fold_2              1,239               505
train_fold_3              1,339               405
train_fold_4              1,450               294
train_fold_5              1,467               277
train_fold_999 (debug)    21                  7

The debug configuration contains only the small train_fold_999 and test_fold_999 splits.

Usage Examples

The dataset is applicable to several computer‑vision tasks, including image classification, detection, and segmentation. Example notebooks cover exploratory data analysis (EDA) and a pipeline for a detection task.

License

The dataset is released under the CC0 1.0 license.

Limitations

Currently, the dataset only includes the corrected single‑observer subset; future releases will extend to uncorrected single‑observer and multi‑observer subsets.

Topics

Breast Cancer
Computer Vision

Source

Organization: hugging_face

Created: Unknown
