Dataset asset | Open Source Community | Bioinformatics | Drug Discovery

CHEN11, ASTEX, metapocket2 datasets, FPTRAIN, HOLO4K

CHEN11: 251 proteins with 476 ligands for ligand binding site (LBS) prediction benchmarks. ASTEX: the Astex diverse dataset. metapocket2: comprises U/B48 (48 proteins in bound and unbound states), DT198 (198 drug‑target complexes), and B210 (210 bound‑state proteins). FPTRAIN: dataset for training the Fpocket pocket‑scoring function. HOLO4K: large set of protein‑ligand complexes consisting of multi‑chain structures downloaded directly from the PDB.

Source
github
Created
May 18, 2018
Updated
Apr 11, 2024
Dataset Overview

Main Protein Datasets

  1. CHEN11: Contains 251 proteins with a total of 476 ligands for LBS prediction benchmarking.
  2. ASTEX: Astex diverse collection.
  3. metapocket2 dataset series:
    • U/B48: 48 proteins in both bound and unbound states.
    • DT198: 198 drug‑target complexes.
    • B210: Benchmark dataset of 210 bound‑state proteins.
  4. FPTRAIN: Dataset used for training the Fpocket pocket‑scoring function.
  5. HOLO4K: Large dataset of protein‑ligand complexes containing multi‑chain structures; it does not overlap with CHEN11 or JOINED.

Dataset Variants

  • "standard" datasets: Contain a single column listing ligand‑bound proteins.
  • *(mlig)* datasets: Explicitly specify the associated ligands; ligand codes are taken from the MOAD 2013 database.
  • Prediction‑included datasets: Contain predictions from other ligand‑binding site prediction methods.
  • *-XXsubset-* datasets: Subsets of the original datasets limited to proteins for which a specific method ran successfully and produced predictions.

Dataset Caveats

  • *.ds files may list only a subset of the PDB files in the corresponding directory. For example, the holo4k/ directory holds 4,543 PDB files, but holo4k.ds has 4,009 lines, which is the correct protein count used in the P2Rank/PrankWeb paper for the HOLO4K dataset.
  • 1xgf.pdb has been removed from the holo4k dataset (it contains only UNK groups and no ligand).
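
Because a *.ds file can list fewer structures than its directory holds, it may help to check which files on disk are actually part of a dataset before computing statistics. The following is a minimal sketch, not part of the original repository. It assumes that each data line of a .ds file starts with a path to a structure file, and that blank lines, lines starting with `#`, and lines starting with `HEADER:` or `PARAM.` are not protein entries. The function names `read_ds_entries` and `unlisted_structures` are hypothetical.

```python
from pathlib import Path


def read_ds_entries(ds_path):
    """Return the first column of every data line in a .ds file.

    Assumption (not confirmed by the dataset description): blank lines,
    '#' comments, and lines starting with 'HEADER:' or 'PARAM.' carry
    no protein entry and are skipped.
    """
    entries = []
    for raw in Path(ds_path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(("HEADER:", "PARAM.")):
            continue
        # In multi-column datasets only the first column is taken as the path.
        entries.append(line.split()[0])
    return entries


def unlisted_structures(ds_path, structure_dir, suffix=".pdb"):
    """Return sorted names of files in structure_dir that the .ds file does not list."""
    listed = {Path(entry).name for entry in read_ds_entries(ds_path)}
    on_disk = {p.name for p in Path(structure_dir).glob(f"*{suffix}")}
    return sorted(on_disk - listed)
```

Calling `unlisted_structures("holo4k.ds", "holo4k")` from the dataset root would, under these assumptions, return the PDB files that are present in the directory but excluded from the dataset definition.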