CHEN11, ASTEX, metapocket2 datasets, FPTRAIN, HOLO4K

CHEN11: 251 proteins with 476 ligands, used for benchmarking ligand‑binding site (LBS) prediction. ASTEX: the Astex Diverse Set. metapocket2: a series of three datasets, U/B48 (48 proteins in bound and unbound states), DT198 (198 drug‑target complexes), and B210 (210 bound‑state proteins). FPTRAIN: the dataset used to train the Fpocket pocket‑scoring function. HOLO4K: a large set of protein‑ligand complexes made up of large multi‑chain structures downloaded directly from the PDB.

Updated 4/11/2024

Dataset Overview

Main Protein Datasets

CHEN11: Contains 251 proteins with a total of 476 ligands for LBS prediction benchmarking.
ASTEX: The Astex Diverse Set.
metapocket2 dataset series:
- U/B48: 48 proteins in both bound and unbound states.
- DT198: 198 drug‑target complexes.
- B210: Benchmark dataset of 210 bound‑state proteins.
FPTRAIN: Dataset used for training the Fpocket pocket‑scoring function.
HOLO4K: Large protein‑ligand complex dataset of multi‑chain structures downloaded directly from the PDB; it does not overlap with CHEN11 or JOINED.

Dataset Variants

"standard": Contains a single column of ligand‑bound proteins.
*(mlig)* dataset: Explicitly specifies associated ligands; ligand codes are sourced from the MOAD 2013 database.
Prediction‑included datasets: Contain predictions from other ligand‑binding site prediction methods.
*-XXsubset-* datasets: Subsets of the original datasets restricted to the proteins on which a specific method (XX) ran successfully and produced predictions.
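
The difference between the single‑column and multi‑column variants can be illustrated with a small reader. This is only a sketch under stated assumptions: it supposes that a dataset file is plain text with one entry per line, that columns are separated by whitespace, that an optional line starting with "HEADER:" names the columns, and that lines starting with "#" are comments. None of these layout details are stated above, and the function name parse_ds and the column name ligand_codes in the usage example are hypothetical.

```python
def parse_ds(text):
    """Parse dataset-list text into a list of row dicts.

    Assumed layout (not confirmed by the dataset description):
    one entry per line, whitespace-separated columns, an optional
    'HEADER:' line naming the columns, '#' starting a comment line.
    """
    columns = ["protein"]  # assumed default for single-column files
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and assumed comment lines
        if line.startswith("HEADER:"):
            # Column names follow the assumed 'HEADER:' prefix.
            columns = line[len("HEADER:"):].split()
            continue
        # Pair each value with its column name, left to right.
        rows.append(dict(zip(columns, line.split())))
    return rows
```

For a single‑column file, parse_ds("a/1abc.pdb\n") yields one row with only a "protein" key; for a file with a header such as "HEADER: protein ligand_codes", each row also carries the ligand codes listed for that protein.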

Dataset Caveats

*.ds files may list only a subset of the PDB files in the corresponding directory. For example, the holo4k/ directory holds 4,543 PDB files, but holo4k.ds has 4,009 lines; 4,009 is the protein count used for the HOLO4K dataset in the P2Rank/PrankWeb paper.
1xgf.pdb has been removed from the holo4k dataset because it contains only UNK groups and no ligand.
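
To see which PDB files in a directory are not referenced by the matching .ds file, the two sets of file names can be compared. The following is a minimal sketch, assuming that each non‑comment line of the .ds file begins with a path to a PDB file and that lines starting with "#" or "HEADER:" are not entries; those layout details and the function names are assumptions made for illustration.

```python
from pathlib import Path


def listed_names(ds_text):
    """Return the set of file names referenced in dataset-list text.

    Assumes the first whitespace-separated field of each entry line
    is a path to a structure file (an assumption, not a documented rule).
    """
    names = set()
    for raw in ds_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line.startswith("HEADER:"):
            continue  # skip blanks, assumed comments, assumed header
        names.add(Path(line.split()[0]).name)
    return names


def compare(ds_text, present_files):
    """Compare files listed in the .ds text with files actually present."""
    listed = listed_names(ds_text)
    present = {Path(p).name for p in present_files}
    return {
        "unlisted": sorted(present - listed),  # on disk, not in the list
        "missing": sorted(listed - present),   # in the list, not on disk
    }
```

With the files on disk, a call such as compare(Path("holo4k.ds").read_text(), Path("holo4k").glob("*.pdb")) would report the PDB files that the list does not use.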
