wwPDB dataset
The wwPDB dataset is a large collection of protein structure data used for protein-structure prediction.
Source
github
Created
Nov 8, 2024
Updated
Nov 27, 2024
Overview
Dataset description and usage context
Protenix: Protein + X Dataset Overview
Dataset Content
- Data Source: wwPDB dataset
- Data Type: Protein structure data
- Data Size: Requires at least 1 TB of disk space
- Data Structure:
  ├── components.v20240608.cif [408M]                # CCD source file
  ├── components.v20240608.cif.rdkit_mol.pkl [121M]  # RDKit Mol objects generated from the CCD source file
  ├── indices [33M]                                  # chain or interface entries
  ├── mmcif [283G]                                   # original mmCIF data
  ├── mmcif_bioassembly [36G]                        # pre-processed wwPDB structural data
  ├── mmcif_msa [450G]                               # MSA files
  ├── posebusters_bioassembly [42M]                  # pre-processed PoseBusters structural data
  ├── posebusters_mmcif [361M]                       # original mmCIF data
  ├── recentPDB_bioassembly [1.5G]                   # pre-processed recentPDB structural data
  └── seq_to_pdb_index.json [45M]                    # sequence-to-PDB-ID mapping file
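The listing above can be turned into a quick sanity check after download. The following is a minimal sketch, not part of the official tooling: `check_release_layout` is a hypothetical helper name, and it assumes the ten entries sit directly under the directory you pass in (the archive may instead unpack into a subdirectory, in which case pass that path).

```shell
# Hypothetical helper: print every expected entry that is missing under ROOT.
# Returns non-zero if at least one entry is missing.
check_release_layout() {
    local root="$1" missing=0 entry
    for entry in \
        components.v20240608.cif \
        components.v20240608.cif.rdkit_mol.pkl \
        indices mmcif mmcif_bioassembly mmcif_msa \
        posebusters_bioassembly posebusters_mmcif \
        recentPDB_bioassembly seq_to_pdb_index.json
    do
        # -e accepts both files and directories; sizes are not checked.
        if [ ! -e "$root/$entry" ]; then
            echo "missing: $entry"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_release_layout /af3-dev/release_data` prints nothing and succeeds when every entry is present. It only checks that the names exist, so a truncated download would still pass.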
Data Download
- Pre‑processed data download:
  wget -P /af3-dev/release_data/ https://af3-dev.tos-cn-beijing.volces.com/release_data.tar.gz
  tar -xzvf /af3-dev/release_data/release_data.tar.gz -C /af3-dev/release_data/
  rm /af3-dev/release_data/release_data.tar.gz
- Inference‑only data download:
  wget -P /af3-dev/release_data/ https://af3-dev.tos-cn-beijing.volces.com/release_data/components.v20240608.cif
  wget -P /af3-dev/release_data/ https://af3-dev.tos-cn-beijing.volces.com/release_data/components.v20240608.cif.rdkit_mol.pkl
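Because the full archive is large, it may help to check free space first and to stop at the first failed step so that `tar` never runs on an incomplete file. The sketch below wraps the same `wget`, `tar`, and `rm` steps in two helpers. Both function names are hypothetical, the `-c` (resume) flag is an addition not present in the original commands, and the threshold reads the stated size requirement as 1 TB (10^12 bytes).

```shell
# Hypothetical helper: succeed only if the filesystem holding DIR
# has at least REQUIRED_KB KiB available. DIR must already exist.
has_free_space() {
    local dir="$1" required_kb="$2" avail_kb
    # df -P prints one POSIX-format line per filesystem;
    # with -k, column 4 is the available space in KiB.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$required_kb" ]
}

# Hypothetical helper: download URL into DEST, extract it there,
# then delete the archive. Each step runs only if the previous one succeeded.
download_and_extract() {
    local url="$1" dest="$2" archive
    archive="$dest/$(basename "$url")"
    mkdir -p "$dest" &&
        wget -c -P "$dest" "$url" &&      # -c resumes a partial download
        tar -xzf "$archive" -C "$dest" &&
        rm "$archive"
}

# Example (not executed here); 976562500 KiB is 10^12 bytes:
#   has_free_space /af3-dev 976562500 &&
#       download_and_extract \
#           https://af3-dev.tos-cn-beijing.volces.com/release_data.tar.gz \
#           /af3-dev/release_data
```

The archive and its extracted contents briefly coexist on disk, so the real peak usage may exceed the threshold used in the example.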
Model Checkpoints
- Pre‑trained model download:
wget -P /af3-dev/release_model/ https://af3-dev.tos-cn-beijing.volces.com/release_model/model_v1.pt
Dataset Uses
- Training: From‑scratch model training
- Inference: Model inference and prediction
Dataset Processing
- Data processing scripts: Still being prepared for release; distilled data will also be released in the future.
Related Documentation
License
- Non‑commercial use: Creative Commons Attribution‑NonCommercial 4.0 International License
- Commercial use: Contact ai4s-bio@bytedance.com for a commercial license