wwPDB dataset

The wwPDB dataset provides protein structural data for training protein‑structure prediction models and running inference with them.

Updated 11/27/2024

Description

Protenix: Protein + X Dataset Overview

Dataset Content

  • Data Source: wwPDB dataset
  • Data Type: Protein structure data
  • Data Size: Requires at least 1 TB of disk space
  • Data Structure:
    ├── components.v20240608.cif [408M] # CCD source file
    ├── components.v20240608.cif.rdkit_mol.pkl [121M] # RDKit Mol objects generated from the CCD source file
    ├── indices [33M] # chain or interface entries
    ├── mmcif [283G]  # original mmCIF data
    ├── mmcif_bioassembly [36G] # pre‑processed wwPDB structural data
    ├── mmcif_msa [450G] # MSA files
    ├── posebusters_bioassembly [42M] # pre‑processed PoseBusters structural data
    ├── posebusters_mmcif [361M] # original mmCIF data for PoseBusters
    ├── recentPDB_bioassembly [1.5G] # pre‑processed recentPDB structural data
    └── seq_to_pdb_index.json [45M] # sequence‑to‑PDB‑ID mapping file
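
As a rough check on the disk requirement, the sizes in the listing above can be summed and compared with the free space on the target volume. This is only a sketch. It assumes the M and G suffixes are binary units (MiB, GiB) as printed by tools such as du -h, which the source does not state. The helper names to_mib, total_mib, and has_free_space are hypothetical and not part of Protenix.

```shell
#!/usr/bin/env bash

# Convert a size such as "408M" or "1.5G" to whole MiB.
# Assumption: suffixes are binary units (M = MiB, G = GiB, T = TiB).
to_mib() {
  awk -v s="$1" 'BEGIN {
    unit = substr(s, length(s), 1)
    value = substr(s, 1, length(s) - 1) + 0
    if (unit == "G") value *= 1024
    else if (unit == "T") value *= 1024 * 1024
    printf "%d\n", value
  }'
}

# Sum the ten sizes shown in the data structure listing; prints MiB.
total_mib() {
  local total=0 size
  for size in 408M 121M 33M 283G 36G 450G 42M 361M 1.5G 45M; do
    total=$(( total + $(to_mib "$size") ))
  done
  echo "$total"
}

# Succeed if the filesystem holding DIR has at least REQUIRED_MIB free.
# Uses the POSIX output format of df (-P) in 1024-byte blocks (-k),
# where the fourth column of the second line is the available space.
has_free_space() {
  local dir=$1 required_mib=$2 avail_kib
  avail_kib=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  [ "$avail_kib" -ge $(( required_mib * 1024 )) ]
}
```

Under that assumption the listing adds up to 790002 MiB, about 771 GiB, so the extracted data fits within the stated requirement. The downloaded archive also occupies space until it is deleted. A call such as `has_free_space /af3-dev "$(total_mib)"` could gate the download; it fails if the directory does not exist yet.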
    

Data Download

  • Pre‑processed data download:

    wget -P /af3-dev/release_data/ https://af3-dev.tos-cn-beijing.volces.com/release_data.tar.gz
    tar -xzvf /af3-dev/release_data/release_data.tar.gz -C /af3-dev/release_data/
    rm /af3-dev/release_data/release_data.tar.gz
    
  • Inference‑only data download:

    wget -P /af3-dev/release_data/ https://af3-dev.tos-cn-beijing.volces.com/release_data/components.v20240608.cif
    wget -P /af3-dev/release_data/ https://af3-dev.tos-cn-beijing.volces.com/release_data/components.v20240608.cif.rdkit_mol.pkl
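
When the inference-only download is rerun, it can help to fetch only the files that are not already present. The sketch below wraps the two wget commands above in that check. The URLs, file names, and default directory are taken from the commands above. The DEST_DIR override and the function names are hypothetical additions for illustration.

```shell
#!/usr/bin/env bash

BASE_URL="https://af3-dev.tos-cn-beijing.volces.com/release_data"
DEST_DIR="${DEST_DIR:-/af3-dev/release_data}"
INFERENCE_FILES=(
  components.v20240608.cif
  components.v20240608.cif.rdkit_mol.pkl
)

# Print the name of every inference file that is absent or empty in DEST_DIR.
# Limitation: a truncated but non-empty file is treated as present.
missing_inference_files() {
  local name
  for name in "${INFERENCE_FILES[@]}"; do
    [ -s "$DEST_DIR/$name" ] || echo "$name"
  done
}

# Download only the files reported as missing (requires network access).
fetch_inference_files() {
  local name
  mkdir -p "$DEST_DIR"
  missing_inference_files | while read -r name; do
    wget -P "$DEST_DIR" "$BASE_URL/$name"
  done
}
```

Running `fetch_inference_files` after sourcing the script downloads whatever `missing_inference_files` lists. On a complete installation it does nothing.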
    

Model Checkpoints

  • Pre‑trained model download:
    wget -P /af3-dev/release_model/ https://af3-dev.tos-cn-beijing.volces.com/release_model/model_v1.pt
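
The source does not publish a checksum for model_v1.pt, so a download cannot be verified against an official value. One option is to record a local SHA-256 digest right after downloading, so that later copies or moves of the checkpoint can be checked against it. This sketch assumes a sha256sum command is available, as on most Linux systems. The function names are hypothetical.

```shell
#!/usr/bin/env bash

# Write FILE.sha256 next to FILE, holding the digest of its current contents.
record_checksum() {
  local file=$1
  ( cd "$(dirname "$file")" &&
    sha256sum "$(basename "$file")" > "$(basename "$file").sha256" )
}

# Succeed only if FILE still matches the digest stored in FILE.sha256.
verify_checksum() {
  local file=$1
  ( cd "$(dirname "$file")" &&
    sha256sum -c "$(basename "$file").sha256" > /dev/null )
}
```

For example, `record_checksum /af3-dev/release_model/model_v1.pt` could be run once after the download, and `verify_checksum` with the same path before each training or inference run. This detects later corruption only, not a download that was already corrupt.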
    

Dataset Uses

  • Training: From‑scratch model training
  • Inference: Model inference and prediction

Dataset Processing

  • Data processing scripts: Still being prepared. Distilled data will be released in the future.

Related Documentation

  • Input JSON file format: Details
  • Training and fine‑tuning settings: Details

License

  • Non‑commercial use: Creative Commons Attribution‑NonCommercial 4.0 International License
  • Commercial use: Contact ai4s‑bio@bytedance.com for a commercial license

Topics

Protein Structure
Structure Prediction

Source

Organization: github

Created: 11/8/2024
