TCGA-PAAD
The TCGA‑PAAD clinical dataset contains clinical data associated with pancreatic adenocarcinoma patients. It is part of the TCGA project, which aims to provide comprehensive genomic and clinical data for various cancers. Clinical data include patient demographics, treatment history, survival information, and other attributes relevant to pancreatic cancer research. The dataset is primarily intended for research and analysis, especially for building machine‑learning models that predict outcomes such as survival, recurrence, or treatment response based on clinical attributes.
Description
TCGA‑PAAD Clinical Data Dataset Overview
Dataset Introduction
TCGA‑PAAD stands for The Cancer Genome Atlas – Pancreatic Adenocarcinoma. The dataset holds the clinical records of the TCGA pancreatic adenocarcinoma cohort: patient demographics, treatment history, survival data, and other attributes pertinent to pancreatic cancer research. It is mainly used to build and evaluate machine‑learning models that predict survival, recurrence, or treatment response from these clinical attributes.
Objective
Develop a machine‑learning model that uses clinical data to predict survival of PAAD patients. Model performance will be evaluated with the C‑Index (Concordance Index), a metric especially suited for assessing survival model accuracy.
Background
Pancreatic cancer has one of the highest mortality rates of any cancer. Accurate survival prediction based on clinical data can aid treatment planning and patient care. The dataset provides clinical variables such as demographics and treatment history to serve as predictive factors, with survival time as the prediction target.
Task Details
- Model Type: Regression (time‑to‑event, since survival data can include censored records)
- Target Variable: Patient survival time (usually measured in days or months)
- Evaluation Metric: C‑Index (Concordance Index), the fraction of comparable patient pairs that the model orders correctly, that is, pairs in which the patient predicted to survive longer actually did.
Why Choose C‑Index?
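The C‑Index is commonly used in survival analysis because it reflects the model's ability to correctly rank patients by survival time. A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking. Unlike typical regression metrics such as MSE, it can make use of censored records, because it only scores pairs whose true ordering is known.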
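As a rough illustration of the ranking idea, the sketch below computes Harrell's C‑Index in plain Python. The function name and argument names are made up for this example, the toy values are not taken from the dataset, and tied survival times are simply skipped, which is a simplification that library implementations may handle differently.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Simplified Harrell's C-index (illustrative sketch).

    times       -- observed follow-up time per patient
    events      -- 1 if death was observed, 0 if the record is censored
    risk_scores -- model output; higher means shorter predicted survival
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # tied times are skipped in this sketch
        # Let `a` be the patient with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the true order is unknown
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # higher risk went with shorter survival
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): the second patient is censored at day 10.
times = [5, 10, 15, 20]
events = [1, 0, 1, 1]
risk = [0.9, 0.8, 0.1, 0.3]
print(concordance_index(times, events, risk))  # 0.75
```

Of the six possible pairs, only four are comparable here, because the censored patient cannot be ranked against anyone with a longer follow-up time.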
Dataset Considerations
- Data Split: Randomly divided into training (70%), validation (15%), and test (15%). All clinical variables are present in the training and validation sets, while the test set has the clinical variables removed to simulate a real‑world test scenario.
- Privacy: The dataset follows privacy standards as it is anonymized.
- Data Imbalance: Survival times are often skewed toward shorter survival; appropriate techniques should be employed during training so the model is not biased toward the more common short‑survival cases.
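The published files are already split, so the following is only a sketch of how a random 70/15/15 partition like the one described above could be produced with pandas. The helper name, the seed, and the toy columns (`patient_id`, `survival_days`) are assumptions for illustration, not the curators' actual procedure or schema.

```python
import numpy as np
import pandas as pd


def random_split(df, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle the rows once, then cut them into train/validation/test."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_train = int(round(train_frac * len(shuffled)))
    n_val = int(round(val_frac * len(shuffled)))
    train = shuffled.iloc[:n_train]
    val = shuffled.iloc[n_train:n_train + n_val]
    test = shuffled.iloc[n_train + n_val:]  # whatever remains
    return train, val, test


# Toy frame with hypothetical column names.
toy = pd.DataFrame({
    "patient_id": np.arange(100),
    "survival_days": np.arange(100) * 10,
})
train, val, test = random_split(toy)
print(len(train), len(val), len(test))  # 70 15 15
```

Fixing the seed keeps the partition reproducible, and taking the test set as the remainder guarantees that every row lands in exactly one split.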
Usage Example
# Reading hf:// paths requires the huggingface_hub package;
# read_parquet also needs a Parquet engine such as pyarrow.
import pandas as pd

BASE = "hf://datasets/HLMCC/TCGA-PAAD/"
splits = {
    "train": "train_data.parquet",
    "validation": "val_data.parquet",
    "test": "test_data.parquet",
}

# Load each split into its own DataFrame.
train_df = pd.read_parquet(BASE + splits["train"])
validation_df = pd.read_parquet(BASE + splits["validation"])
test_df = pd.read_parquet(BASE + splits["test"])
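After loading, it can help to confirm the split sizes and see which columns each split lacks relative to the training set. The helper below is a hypothetical convenience function, shown on toy frames with invented column names so that it runs without network access; it makes no claim about the dataset's real schema.

```python
import pandas as pd


def summarize_splits(frames):
    """Report row counts, row fractions, and columns absent relative to train."""
    total = sum(len(df) for df in frames.values())
    reference = set(frames["train"].columns)
    rows = []
    for name, df in frames.items():
        rows.append({
            "split": name,
            "rows": len(df),
            "fraction": len(df) / total,
            "missing_vs_train": sorted(reference - set(df.columns)),
        })
    return pd.DataFrame(rows)


# Toy frames; "age" and "survival_days" are invented column names.
frames = {
    "train": pd.DataFrame({"age": [60, 71, 55, 64],
                           "survival_days": [300, 120, 800, 45]}),
    "validation": pd.DataFrame({"age": [66], "survival_days": [210]}),
    "test": pd.DataFrame({"age": [58]}),
}
summary = summarize_splits(frames)
print(summary)
```

With the real data, the same call would be `summarize_splits({"train": train_df, "validation": validation_df, "test": test_df})`.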
Citation
Data used in this research originates from The Cancer Genome Atlas (TCGA) Research Network:
The Cancer Genome Atlas Research Network. (2017). Integrated Genomic Characterization of Pancreatic Ductal Adenocarcinoma. *Cancer Cell*, 32(2), 185‑203.e13. https://doi.org/10.1016/j.ccell.2017.07.007
Dataset Administrators
The dataset was curated by the TCGA consortium and prepared for machine‑learning use by the Moffitt Cancer Center.
Source
Organization: huggingface
Created: 10/15/2024