wwPDB dataset
The wwPDB dataset is a collection of protein structure data for training and running protein‑structure prediction models. The full release needs at least 1 TB of disk space.
Updated 11/27/2024
Description
Protenix: Protein + X Dataset Overview
Dataset Content
- Data Source: wwPDB dataset
- Data Type: Protein structure data
- Data Size: Requires at least 1 TB of disk space
- Data Structure:
  ├── components.v20240608.cif [408M] # ccd source file
  ├── components.v20240608.cif.rdkit_mol.pkl [121M] # rdkit Mol objects generated from the ccd source file
  ├── indices [33M] # chain or interface entries
  ├── mmcif [283G] # original mmcif data
  ├── mmcif_bioassembly [36G] # pre‑processed wwPDB structural data
  ├── mmcif_msa [450G] # msa files
  ├── posebusters_bioassembly [42M] # pre‑processed posebusters structural data
  ├── posebusters_mmcif [361M] # original posebusters mmcif data
  ├── recentPDB_bioassembly [1.5G] # pre‑processed recentPDB structural data
  └── seq_to_pdb_index.json [45M] # sequence‑to‑pdb‑id mapping file
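After unpacking, it can help to confirm that every top-level entry in the listing above is present. The following is a minimal sketch, not part of the released tooling: the function name `check_release_data` is hypothetical, and it assumes the ten entries sit directly under the directory you pass in, which may not match how the archive actually unpacks.

```shell
# check_release_data DIR (hypothetical helper)
# Reports which of the ten top-level entries from the listing are missing
# under DIR. Assumption: the entries sit directly under DIR.
# The body runs in a subshell, so its variables do not leak to the caller.
check_release_data() (
  root="$1"
  missing=0
  for entry in \
    components.v20240608.cif \
    components.v20240608.cif.rdkit_mol.pkl \
    indices \
    mmcif \
    mmcif_bioassembly \
    mmcif_msa \
    posebusters_bioassembly \
    posebusters_mmcif \
    recentPDB_bioassembly \
    seq_to_pdb_index.json
  do
    # -e accepts both files and directories
    if [ ! -e "$root/$entry" ]; then
      echo "missing: $entry"
      missing=$((missing + 1))
    fi
  done
  echo "$missing of 10 expected entries missing under $root"
  # Succeed only when nothing is missing
  [ "$missing" -eq 0 ]
)
```

For example, `check_release_data /af3-dev/release_data` prints one `missing:` line per absent entry and returns a non-zero status if any are absent. It checks names only, not sizes or contents.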
Data Download
- Pre‑processed data download:
    wget -P /af3-dev/release_data/ https://af3-dev.tos-cn-beijing.volces.com/release_data.tar.gz
    tar -xzvf /af3-dev/release_data/release_data.tar.gz -C /af3-dev/release_data/
    rm /af3-dev/release_data/release_data.tar.gz
- Inference‑only data download:
    wget -P /af3-dev/release_data/ https://af3-dev.tos-cn-beijing.volces.com/release_data/components.v20240608.cif
    wget -P /af3-dev/release_data/ https://af3-dev.tos-cn-beijing.volces.com/release_data/components.v20240608.cif.rdkit_mol.pkl
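Because the full data set needs roughly 1 TB of disk space, a free-space check before starting the pre-processed download can save a failed transfer. The sketch below is illustrative only: `free_kib` and `has_free_space` are hypothetical helper names, and the 1024 GiB threshold in the usage note is an assumed reading of the stated size requirement. It relies on the POSIX `df -P -k` output format, where the fourth column of the second line is the available space in 1024-byte blocks.

```shell
# free_kib DIR (hypothetical helper)
# Prints the space available on the filesystem holding DIR, in KiB.
# Caveat: a filesystem name containing spaces would shift the columns.
free_kib() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# has_free_space DIR MIN_GIB (hypothetical helper)
# Succeeds when DIR's filesystem has at least MIN_GIB GiB available.
# The body runs in a subshell, so its variables do not leak to the caller.
has_free_space() (
  avail_kib="$(free_kib "$1")"
  need_kib=$(( $2 * 1024 * 1024 ))
  [ "$avail_kib" -ge "$need_kib" ]
)
```

A possible use is `has_free_space /af3-dev 1024 || echo "not enough free space"`. The directory passed in must already exist, so point it at an existing parent directory if the download target has not been created yet.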
Model Checkpoints
- Pre‑trained model download:
wget -P /af3-dev/release_model/ https://af3-dev.tos-cn-beijing.volces.com/release_model/model_v1.pt
Dataset Uses
- Training: From‑scratch model training
- Inference: Model inference and prediction
Dataset Processing
- Data processing scripts: Still being prepared for release. Distilled data will also be released in the future.
Related Documentation
License
- Non‑commercial use: Creative Commons Attribution‑NonCommercial 4.0 International License
- Commercial use: Contact ai4s-bio@bytedance.com for a commercial license
Topics
Protein Structure
Structure Prediction
Source
Organization: github
Created: 11/8/2024