TCGA-PAAD

The TCGA‑PAAD clinical dataset contains clinical data associated with pancreatic adenocarcinoma patients. It is part of the TCGA project, which aims to provide comprehensive genomic and clinical data for various cancers. Clinical data include patient demographics, treatment history, survival information, and other attributes relevant to pancreatic cancer research. The dataset is primarily intended for research and analysis, especially for building machine‑learning models that predict outcomes such as survival, recurrence, or treatment response based on clinical attributes.

Updated 10/19/2024
huggingface

Description

TCGA‑PAAD Clinical Data Dataset Overview

Dataset Introduction

The TCGA‑PAAD (The Cancer Genome Atlas – Pancreatic Adenocarcinoma) clinical dataset contains clinical data related to pancreatic adenocarcinoma patients. It is part of the TCGA project, which provides comprehensive genomic and clinical data for various cancers. Clinical data include patient demographics, treatment history, survival data, and other attributes pertinent to pancreatic cancer research. The dataset is primarily used for research and analysis, especially for building machine‑learning models that predict survival, recurrence, or treatment response based on clinical attributes.

Objective

Develop a machine‑learning model that uses clinical data to predict survival of PAAD patients. Model performance will be evaluated with the C‑Index (Concordance Index), a metric especially suited for assessing survival model accuracy.

Background

Pancreatic cancer has one of the highest mortality rates among cancers. Accurate survival prediction from clinical data can aid treatment planning and patient care. The dataset provides a variety of clinical variables (demographics, treatment history, survival time) that serve as predictive factors.

Task Details

  • Model Type: Regression on time‑to‑event data (survival analysis)
  • Target Variable: Patient survival time (usually measured in days or months)
  • Evaluation Metric: C‑Index (Concordance Index), the fraction of comparable patient pairs in which the patient predicted to survive longer actually did survive longer.
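
For reference, one common way to write the metric is Harrell's formulation, sketched below. The notation is an assumption introduced here and does not come from the dataset card: T_i is the observed time for patient i, \hat{T}_i is the predicted survival time, and \delta_i is 1 if the event was observed and 0 if the record is censored. Tied predictions are given half credit, which is a frequent convention but not the only one.

```latex
\[
C = \frac{\displaystyle\sum_{i \neq j} \delta_i \, \mathbb{1}[T_i < T_j]
          \left( \mathbb{1}[\hat{T}_i < \hat{T}_j]
               + \tfrac{1}{2}\, \mathbb{1}[\hat{T}_i = \hat{T}_j] \right)}
         {\displaystyle\sum_{i \neq j} \delta_i \, \mathbb{1}[T_i < T_j]}
\]
```

The denominator counts the comparable pairs: a pair is only usable when the patient with the shorter observed time actually experienced the event.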

Why Choose C‑Index?

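The C‑Index is commonly used in survival analysis because it measures how well the model ranks patients by survival time rather than how closely it matches the exact values. Unlike typical regression metrics such as MSE, it can make use of censored records (patients whose event was not observed during follow‑up), because it only compares pairs whose true ordering is known. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.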
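
To make the pairwise idea concrete, here is a minimal pure‑Python sketch of Harrell's C‑Index. It is a simplified illustration, not the evaluation code used for this dataset: the function name, argument names, and toy values are all assumptions, tied predictions get half credit, and pairs with tied observed times are ignored.

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float],
    events: Sequence[int],
    predicted_times: Sequence[float],
) -> float:
    """Simplified Harrell's C-index; larger predictions mean longer survival."""
    if not (len(times) == len(events) == len(predicted_times)):
        raise ValueError("all inputs must have the same length")

    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            # A censored patient cannot be the shorter-lived member of a pair,
            # because the true survival time is only known to exceed times[i].
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if predicted_times[i] < predicted_times[j]:
                    concordant += 1.0
                elif predicted_times[i] == predicted_times[j]:
                    concordant += 0.5  # half credit for tied predictions
    if comparable == 0:
        raise ValueError("no comparable pairs; cannot compute the C-index")
    return concordant / comparable


# Toy data (made up): patient 2 is censored at day 10.
toy_times = [5, 10, 15, 20]
toy_events = [1, 0, 1, 1]
toy_predictions = [2, 1, 3, 4]
print(concordance_index(toy_times, toy_events, toy_predictions))  # 0.75
```

In the toy example, four pairs are comparable and three of them are ranked correctly, which gives 0.75. For real work, an established survival‑analysis library is likely a better choice than this quadratic‑time loop.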

Dataset Considerations

  • Data Split: Randomly divided into training (70%), validation (15%), and test (15%). All clinical variables are present in the training and validation sets, while the test set has the clinical variables removed to simulate a real‑world test scenario.
  • Privacy: The data are anonymized, in line with privacy standards.
  • Data Imbalance: Survival times are often skewed, with most patients having short survival. Appropriate techniques should be used during training so the model is not biased toward the short‑survival majority.
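
One simple option for the skew mentioned above is to train on a log‑transformed target and map predictions back afterwards. The sketch below assumes survival time is a non‑negative number in days; the function names and sample values are hypothetical, and the dataset card does not prescribe this technique.

```python
import math
from typing import List, Sequence


def log_transform_times(times: Sequence[float]) -> List[float]:
    """Compress a right-skewed survival-time target with log(1 + t)."""
    if any(t < 0 for t in times):
        raise ValueError("survival times must be non-negative")
    return [math.log1p(t) for t in times]


def inverse_log_transform(values: Sequence[float]) -> List[float]:
    """Map values on the log scale back to the original time unit."""
    return [math.expm1(v) for v in values]


# Made-up survival times in days, with one long survivor.
days = [30.0, 90.0, 200.0, 400.0, 2000.0]
log_days = log_transform_times(days)
restored = inverse_log_transform(log_days)
```

Because the transform is strictly increasing, it preserves the ordering of patients, so a C‑Index computed on the log scale equals the one computed on the original scale.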

Usage Example

import pandas as pd

# Reading hf:// paths with pandas requires the huggingface_hub package.
BASE_PATH = "hf://datasets/HLMCC/TCGA-PAAD/"

# Parquet file for each split of the dataset.
splits = {
    "train": "train_data.parquet",
    "validation": "val_data.parquet",
    "test": "test_data.parquet",
}

train_df = pd.read_parquet(BASE_PATH + splits["train"])
validation_df = pd.read_parquet(BASE_PATH + splits["validation"])
test_df = pd.read_parquet(BASE_PATH + splits["test"])

Citation

Data used in this research originates from The Cancer Genome Atlas (TCGA) Research Network:

The Cancer Genome Atlas Research Network. (2017). Integrated Genomic Characterization of Pancreatic Ductal Adenocarcinoma. *Cancer Cell*, 32(2), 185‑203.e13. https://doi.org/10.1016/j.ccell.2017.07.007

Dataset Administrators

The dataset was curated by the TCGA consortium and prepared for machine‑learning use by the Moffitt Cancer Center.

Topics

Cancer Research
Clinical Data Analysis

Source

Organization: huggingface

Created: 10/15/2024
