Tags: Dataset asset, Open Source Community, Medical Image Analysis, Machine Learning Competition

PatchCamelyon (PCam) benchmark dataset

The dataset contains small pathology images, each with a label indicating whether tumor tissue is present. Images are 96 × 96 pixels and were provided as part of a Kaggle competition.

Source: GitHub
Created: Jul 30, 2024
Updated: Jul 30, 2024

Histopathologic Cancer Detection Dataset Overview

Dataset Description

  • Number of Training Images: 220,000
  • Number of Validation Images: 57,000
  • Image Size: 96 × 96 pixels

Project Structure

  • train.py: Script for training a CNN model.
  • infer.py: Script for inference using a trained model.
  • HCDNetwork.py: Definition of the CNN architecture.
  • utils.py: Utility functions for data processing and visualization.
  • data/: Directory containing the dataset.
  • model/: Directory for saved model weights and results.

Model Architecture

The CNN model, HCDNetwork, can be configured with different numbers of convolutional layers and different dropout rates. The architecture includes:

  • Convolutional layers followed by ReLU activation and max‑pooling
  • Fully‑connected layers with dropout for regularization
  • Softmax output layer for classification
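
To make the first building block concrete, here is a minimal pure-Python sketch of ReLU followed by 2 × 2 max-pooling on a toy feature map. The helper names and the input values are illustrative assumptions; HCDNetwork.py presumably relies on a deep-learning framework for these operations rather than hand-written loops.

```python
def relu(feature_map):
    """Replace negative activations with zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2 x 2 window."""
    pooled = []
    for r in range(0, len(feature_map) - 1, 2):
        row = []
        for c in range(0, len(feature_map[r]) - 1, 2):
            window = (
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            row.append(max(window))
        pooled.append(row)
    return pooled


# Toy 4 x 4 feature map (made-up values, not taken from the dataset).
toy_map = [
    [-1.0, 2.0, 0.5, -3.0],
    [4.0, -2.0, -0.5, 1.0],
    [0.0, -1.0, -2.0, -4.0],
    [3.0, 1.5, -6.0, -0.1],
]

pooled = max_pool_2x2(relu(toy_map))
print(pooled)  # [[4.0, 1.0], [3.0, 0.0]]
```

Each pooling step halves the height and width, which is why stacking several convolution and pooling stages quickly shrinks a 96 × 96 input.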

Example Model Configuration

params_model = {
    "shape_in": (3, 96, 96),
    "initial_filters": 8,
    "num_fc1": 100,
    "num_classes": 2,
    "dropout_rate": 0.75,  # Dropout rate
    "num_conv_layers": 4   # Number of convolutional layers
}
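
The configuration above determines how many features reach the first fully-connected layer. The sketch below traces the feature-map shape under assumptions that the source does not state: 3 × 3 unpadded convolutions with stride 1, 2 × 2 max-pooling after every convolution, and a filter count that doubles at each layer. The function name flattened_features is hypothetical.

```python
params_model = {
    "shape_in": (3, 96, 96),
    "initial_filters": 8,
    "num_fc1": 100,
    "num_classes": 2,
    "dropout_rate": 0.75,
    "num_conv_layers": 4,
}


def flattened_features(params, kernel_size=3, pool_size=2):
    """Trace the feature-map shape through an assumed conv/pool stack."""
    _, height, width = params["shape_in"]
    channels = params["initial_filters"]
    for layer in range(params["num_conv_layers"]):
        if layer > 0:
            channels *= 2  # assumption: filters double at every layer
        # An unpadded, stride-1 convolution shrinks each side by kernel_size - 1.
        height = height - kernel_size + 1
        width = width - kernel_size + 1
        # Max-pooling (floor division) roughly halves each side.
        height //= pool_size
        width //= pool_size
    return channels, height, width, channels * height * width


channels, height, width, flat = flattened_features(params_model)
print(channels, height, width, flat)  # 64 4 4 1024
```

Under these assumptions the four-layer configuration hands 1,024 features to a fully-connected layer with num_fc1 = 100 units; a different kernel size or padding scheme in HCDNetwork.py would change that number.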

Training and Evaluation

Training involves tuning hyperparameters, comparing architectures of different depth, and adjusting regularization such as the dropout rate to improve performance. Model performance is evaluated using the area under the ROC curve (AUC).
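
As an illustration of the metric, AUC equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as half. The sketch below computes it directly from that definition; the function name roc_auc and the sample values are illustrative and are not taken from train.py.

```python
def roc_auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = tumor, 0 = no tumor (made-up scores).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

This pairwise loop is quadratic in the number of samples, so for a validation set of tens of thousands of images a sort-based implementation such as scikit-learn's roc_auc_score is the more practical choice.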

Training Results

Model  Dropout Rate  Conv Layers  Train Loss  Train Accuracy  Train AUC  Val Loss  Val Accuracy  Val AUC
A      0.10          4            0.2042      0.9300          0.9759     0.4512    0.8087        0.8842
B      0.50          4            0.2447      0.9097          0.9638     0.4784    0.8000        0.8736
C      0.90          4            0.4314      0.8034          0.8833     0.4483    0.8125        0.8780
D      0.75          3            0.3515      0.8478          0.9238     0.3888    0.8400        0.9003
E      0.75          4            0.3862      0.8356          0.9077     0.3794    0.8450        0.9064
F      0.75          5            0.0881      0.9794          0.9958     0.6120    0.8113        0.8746
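
One way to read these results is to compare training and validation AUC for each model: a large gap suggests overfitting. The sketch below copies the AUC columns from the results above and ranks the models; the dictionary layout and variable names are illustrative.

```python
# Dropout rate, conv layers, train AUC, and validation AUC per model,
# copied from the training results.
results = {
    "A": {"dropout": 0.10, "conv_layers": 4, "train_auc": 0.9759, "val_auc": 0.8842},
    "B": {"dropout": 0.50, "conv_layers": 4, "train_auc": 0.9638, "val_auc": 0.8736},
    "C": {"dropout": 0.90, "conv_layers": 4, "train_auc": 0.8833, "val_auc": 0.8780},
    "D": {"dropout": 0.75, "conv_layers": 3, "train_auc": 0.9238, "val_auc": 0.9003},
    "E": {"dropout": 0.75, "conv_layers": 4, "train_auc": 0.9077, "val_auc": 0.9064},
    "F": {"dropout": 0.75, "conv_layers": 5, "train_auc": 0.9958, "val_auc": 0.8746},
}

# Train-minus-validation AUC gap for each model.
gaps = {
    name: round(row["train_auc"] - row["val_auc"], 4)
    for name, row in results.items()
}

best_val = max(results, key=lambda name: results[name]["val_auc"])
largest_gap = max(gaps, key=gaps.get)

for name in sorted(gaps, key=gaps.get):
    print(f"{name}: val AUC {results[name]['val_auc']:.4f}, gap {gaps[name]:.4f}")
print("Best validation AUC:", best_val)  # E
print("Largest gap:", largest_gap)       # F
```

On these numbers, model E (dropout 0.75, four convolutional layers) has both the highest validation AUC and the smallest gap, while model F fits the training set almost perfectly but generalizes worse.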

Inference

The infer.py script allows inference on new images using a trained model. The script loads the trained model, preprocesses the input image, and outputs predicted labels and class probabilities.
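
The last step, turning raw network outputs into a label and class probabilities, can be sketched in pure Python. This is an assumption about what happens inside infer.py, not its actual code; the logit values are made up, and treating class index 1 as "tumor" is hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Return the index of the most probable class and all class probabilities."""
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs


# Made-up logits for the two classes (index 1 assumed to mean "tumor").
pred_label, pred_probs = predict([0.3, 1.7])
print(f"Predicted Label: {pred_label}")
print(f"Class Probabilities: {[round(p, 4) for p in pred_probs]}")
```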

Example Usage

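from infer import infer

# Paths to the saved weights and to the image to classify.
model_path = "model/trained_hcd_model.pth"
image_path = "test/sample_image.tif"

# Assumption: infer() accepts the path to the saved weights and loads the model itself.
# Use device="cpu" if no CUDA-capable GPU is available.
pred_label, pred_probs = infer(model_path, image_path, device="cuda")

print(f"Predicted Label: {pred_label}")
print(f"Class Probabilities: {pred_probs}")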