---
dataset_info:
  features:
  - name: input_id_x
    sequence: int64
  - name: input_id_y
    sequence: int64
  splits:
  - name: test
    num_bytes: 1087504
    num_examples: 474
  - name: valid
    num_bytes: 1124160
    num_examples: 474
  - name: train
    num_bytes: 65391887792
    num_examples: 17070828
  download_size: 810671738
  dataset_size: 65394099456
license: mit
task_categories:
- text-generation
tags:
- biology
size_categories:
- 10M<n<100M
---

# Dataset Card for "ProstT5Dataset"

* **Contributors:** Michael Heinzinger and Konstantin Weissenow, Joaquin Gomez Sanchez and Adrian Henkel, Martin Steinegger and Burkhard Rost
* **Licence:** MIT

## Table of Contents

- [Overview](#overview)
- [Dataset Description](#dataset-description)
- [Data Collection and Annotation](#data-collection-and-annotation)
- [Data Splits](#data-splits)
- [Dataset Structure](#dataset-structure)
- [Data Fields](#data-fields)
- [Data Instances](#data-instances)
- [Data Considerations](#data-considerations)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)

## Overview

The ProstT5Dataset is a curated collection of *tokenized* protein sequences and their corresponding structure sequences (3Di). It is derived from the [AlphaFold Protein Structure Database](https://alphafold.ebi.ac.uk/) through several steps of clustering and quality filtering. To capture the 3D information of each protein, the dataset uses the [3Di structure string representation](https://www.nature.com/articles/s41587-023-01773-0#Sec2). This format describes the spatial relationship of each residue to its neighbors in 3D space, so the 3D structure is encoded as a one-dimensional string with one 3Di state per residue. The sequence tokens are generated using the [ProstT5 Tokenizer](https://huggingface.co/Rostlab/ProstT5).

## Data Fields

- **input_id_x** (3Di Tokens): Tokenized 3Di structure sequences of the proteins.
- **input_id_y** (Amino Acid Tokens): Tokenized amino acid sequences of the proteins.

## Dataset Description

We compare basic protein properties (sequence length, amino acid composition, 3Di distribution) between our dataset (training, validation, and test sets) and proteins obtained from the [Protein Data Bank (PDB)](https://www.rcsb.org/). Key findings include similar amino acid distributions across datasets, an overrepresentation of certain 3Di tokens (d, v, p) and of helical structures in AlphaFold2 predictions compared to PDB, and a tendency toward shorter proteins in this dataset (average length 206-238) relative to PDB proteins (average length 255). The analysis also highlights the relationship between 3Di states and secondary structures, with a notable difference in strand-related tokens between datasets.

## Data Collection and Annotation

The dataset starts from the AlphaFold Protein Structure Database, which underwent a two-step clustering process and one step of quality filtering:

1. *First Clustering:* 214M UniProtKB protein sequences were clustered using MMseqs2, resulting in 52M clusters based on pairwise sequence identity.
2. *Second Clustering:* Foldseek further clustered these proteins into 18.8M clusters, expanded to 18.6M proteins by adding diverse members.
3. *Quality Filtering:* Proteins with low pLDDT scores, short lengths, or highly repetitive 3Di strings were removed. The final training split contains 17M proteins.

## Data Splits

The train, test, and validation splits were created by moving whole clusters (after quality filtering, see above) into one of the three sets. For validation and test, we kept only the cluster representatives to avoid bias towards large families. This resulted in 474 proteins for test, 474 proteins for validation, and around 17M proteins for training.

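The cluster-level splitting described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `split_by_cluster`, the cluster dictionary layout, and the random choice of held-out clusters are all assumptions made for illustration. The card only states that whole clusters were moved and that validation and test keep representatives only.

```python
import random


def split_by_cluster(clusters, n_valid, n_test, seed=0):
    """Assign whole clusters to train/valid/test (hypothetical helper).

    clusters: dict mapping a representative ID to the list of all member
              IDs in that cluster (representative included).
    Random selection of held-out clusters is an assumption of this sketch.
    """
    reps = sorted(clusters)            # deterministic order before shuffling
    random.Random(seed).shuffle(reps)

    valid_reps = reps[:n_valid]
    test_reps = reps[n_valid:n_valid + n_test]
    train_reps = reps[n_valid + n_test:]

    return {
        # Training keeps every member of its clusters.
        "train": [m for r in train_reps for m in clusters[r]],
        # Validation and test keep only the representatives,
        # so large families are not overrepresented.
        "valid": valid_reps,
        "test": test_reps,
    }


# Toy clusters with made-up IDs.
toy = {
    "r1": ["r1", "a", "b"],
    "r2": ["r2"],
    "r3": ["r3", "c"],
    "r4": ["r4", "d", "e", "f"],
}
splits = split_by_cluster(toy, n_valid=1, n_test=1)
```

Because a cluster is never divided between sets, no held-out representative shares a cluster with any training protein.
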
## Citation

```
@article{heinzinger2023prostt5,
  title={ProstT5: Bilingual language model for protein sequence and structure},
  author={Heinzinger, Michael and Weissenow, Konstantin and Sanchez, Joaquin Gomez and Henkel, Adrian and Steinegger, Martin and Rost, Burkhard},
  journal={bioRxiv},
  pages={2023--07},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}
```

## Tokens to Character Mapping

| Amino Acid Representation | 3Di       | Special Tokens     |
|---------------------------|-----------|--------------------|
| 3: A                      | 128: a    | 0: \<pad\>         |
| 4: L                      | 129: l    | 1: \</s\>          |
| 5: G                      | 130: g    | 2: \<unk\>         |
| 6: V                      | 131: v    | 148: \<fold2AA\>   |
| 7: S                      | 132: s    | 149: \<AA2fold\>   |
| 8: R                      | 133: r    |                    |
| 9: E                      | 134: e    |                    |
| 10: D                     | 135: d    |                    |
| 11: T                     | 136: t    |                    |
| 12: I                     | 137: i    |                    |
| 13: P                     | 138: p    |                    |
| 14: K                     | 139: k    |                    |
| 15: F                     | 140: f    |                    |
| 16: Q                     | 141: q    |                    |
| 17: N                     | 142: n    |                    |
| 18: Y                     | 143: y    |                    |
| 19: M                     | 144: m    |                    |
| 20: H                     | 145: h    |                    |
| 21: W                     | 146: w    |                    |
| 22: C                     | 147: c    |                    |
| 23: X                     |           |                    |
| 24: B                     |           |                    |
| 25: O                     |           |                    |
| 26: U                     |           |                    |
| 27: Z                     |           |                    |
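
The mapping above can be turned into a small lookup table for reading `input_id_x` and `input_id_y` back as strings without loading the tokenizer. This is a minimal sketch: the helper name `decode` and the example ID lists are hypothetical, and whether the stored sequences include the prefix or end-of-sequence tokens is not stated in this card, so the function simply skips special tokens when asked to.

```python
# Token IDs copied from the mapping table; letters are listed in ID order.
AA_LETTERS = "ALGVSREDTIPKFQNYMHWCXBOUZ"   # IDs 3..27
TDI_LETTERS = "algvsredtipkfqnymhwc"       # IDs 128..147
SPECIAL = {0: "<pad>", 1: "</s>", 2: "<unk>", 148: "<fold2AA>", 149: "<AA2fold>"}

ID_TO_TOKEN = dict(SPECIAL)
ID_TO_TOKEN.update({3 + i: c for i, c in enumerate(AA_LETTERS)})
ID_TO_TOKEN.update({128 + i: c for i, c in enumerate(TDI_LETTERS)})


def decode(ids, skip_special=True):
    """Map a list of token IDs back to characters (hypothetical helper).

    IDs missing from the table are rendered as "<unk>".
    """
    tokens = []
    for i in ids:
        if skip_special and i in SPECIAL:
            continue
        tokens.append(ID_TO_TOKEN.get(i, "<unk>"))
    return "".join(tokens)


# Made-up ID lists, not taken from the dataset.
aa_seq = decode([19, 9, 1])               # amino acid tokens
tdi_seq = decode([135, 131, 138, 1])      # 3Di tokens
raw = decode([149, 19, 9, 1], skip_special=False)
```

Here `aa_seq` is `"ME"`, `tdi_seq` is `"dvp"`, and `raw` is `"<AA2fold>ME</s>"`. For real use, the ProstT5 tokenizer linked in the Overview remains the authoritative source of the vocabulary.
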