Rostlab/ProstT5Dataset
---
dataset_info:
  features:
  - name: input_id_x
    sequence: int64
  - name: input_id_y
    sequence: int64
  splits:
  - name: test
    num_bytes: 1087504
    num_examples: 474
  - name: valid
    num_bytes: 1124160
    num_examples: 474
  - name: train
    num_bytes: 65391887792
    num_examples: 17070828
  download_size: 810671738
  dataset_size: 65394099456
license: mit
task_categories:
- text-generation
tags:
- biology
size_categories:
- 10M<n<100M
---

# Dataset Card for "ProstT5Dataset"

* **Contributors:** Michael Heinzinger and Konstantin Weissenow, Joaquin Gomez Sanchez and Adrian Henkel, Martin Steinegger and Burkhard Rost
* **Licence:** MIT

## Table of Contents

- [Overview](#overview)
- [Dataset Description](#dataset-description)
- [Data Collection and Annotation](#data-collection-and-annotation)
- [Data Splits](#data-splits)
- [Dataset Structure](#dataset-structure)
- [Data Fields](#data-fields)
- [Data Instances](#data-instances)
- [Data Considerations](#data-considerations)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)

## Overview

The ProstT5Dataset is a curated collection of *tokenized* protein sequences and their corresponding structure sequences (3Di). It is derived from the [AlphaFold Protein Structure Database](https://alphafold.ebi.ac.uk/) and includes various steps of clustering and quality filtering. To capture 3D information of the sequence, the [3Di structure string representation](https://www.nature.com/articles/s41587-023-01773-0#Sec2) is leveraged. This format captures the spatial relationship of each residue to its neighbors in 3D space, effectively translating the 3D information of the sequence. The sequence tokens are generated using the [ProstT5 Tokenizer](https://huggingface.co/Rostlab/ProstT5).
## Data Fields

- **input_id_x** (3Di Tokens): Tokenized 3Di structure sequences corresponding to the proteins.
- **input_id_y** (Amino Acid Tokens): Tokenized amino acid sequences of the proteins.

## Dataset Description

We compare basic protein properties (sequence length, amino acid composition, 3Di distribution) between our dataset (training, validation, and test sets) and proteins obtained from the [Protein Data Bank (PDB)](https://www.rcsb.org/). Key findings include similar amino acid distributions across datasets, an overrepresentation of certain 3Di tokens (d, v, p) and helical structures in AlphaFold2 predictions compared to PDB, and a tendency for shorter protein lengths in this dataset (average 206-238) relative to PDB proteins (average 255). The analysis also highlights the relationship between 3Di states and secondary structures, with a notable distinction in strand-related tokens between datasets.

## Data Collection and Annotation

The dataset starts from the AlphaFold Protein Structure Database and passes through a two-step clustering process followed by one step of quality filtering:

1. *First Clustering:* 214M UniProtKB protein sequences were clustered using MMseqs2, resulting in 52M clusters based on pairwise sequence identity.
2. *Second Clustering:* Foldseek further clustered these proteins into 18.8M clusters, expanded to 18.6M proteins by adding diverse members.
3. *Quality Filtering:* Proteins with low pLDDT scores, short lengths, or highly repetitive 3Di strings were removed.

The final training split contains 17M proteins.

## Data Splits

The train, test, and validation splits were created by moving whole clusters (after quality filtering, see above) into exactly one of the sets. For validation and test, only cluster representatives were kept to avoid bias towards large families. This resulted in 474 proteins for test, 474 proteins for validation, and around 17M proteins for training.
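The cluster-level split described above can be illustrated with a minimal sketch. The card does not say how clusters were assigned to each set, so the random assignment, the function name `split_by_cluster`, and the toy cluster IDs below are all assumptions made for illustration only. What the sketch does preserve from the text is that a cluster never spans two sets and that validation and test keep only the cluster representative.

```python
import random

def split_by_cluster(clusters, n_valid, n_test, seed=0):
    """Assign whole clusters to valid/test/train.

    clusters: dict mapping a representative ID to the list of all member IDs
              (representative included). Hypothetical input layout.
    Random assignment is an assumption; the card does not state the procedure.
    """
    reps = sorted(clusters)            # deterministic starting order
    random.Random(seed).shuffle(reps)  # seeded shuffle for reproducibility

    valid_reps = reps[:n_valid]
    test_reps = reps[n_valid:n_valid + n_test]
    train_reps = reps[n_valid + n_test:]

    return {
        # Held-out sets keep only representatives to avoid bias towards large families.
        "valid": valid_reps,
        "test": test_reps,
        # Training keeps every member of its clusters.
        "train": [member for rep in train_reps for member in clusters[rep]],
    }

# Toy example with made-up IDs.
toy_clusters = {
    "c1": ["c1", "c1_a", "c1_b"],
    "c2": ["c2"],
    "c3": ["c3", "c3_a"],
    "c4": ["c4", "c4_a"],
}
toy_splits = split_by_cluster(toy_clusters, n_valid=1, n_test=1)
```

Because assignment happens at the cluster level, no member of a held-out cluster can leak into training, which is the property the split is designed to guarantee.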
## Citation

```
@article{heinzinger2023prostt5,
  title={ProstT5: Bilingual language model for protein sequence and structure},
  author={Heinzinger, Michael and Weissenow, Konstantin and Sanchez, Joaquin Gomez and Henkel, Adrian and Steinegger, Martin and Rost, Burkhard},
  journal={bioRxiv},
  pages={2023--07},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}
```

## Tokens to Character Mapping

| Amino Acid Representation | 3Di       | Special Tokens     |
|---------------------------|-----------|--------------------|
| 3: A                      | 128: a    | 0: \<pad\>         |
| 4: L                      | 129: l    | 1: \</s\>          |
| 5: G                      | 130: g    | 2: \<unk\>         |
| 6: V                      | 131: v    | 148: \<fold2AA\>   |
| 7: S                      | 132: s    | 149: \<AA2fold\>   |
| 8: R                      | 133: r    |                    |
| 9: E                      | 134: e    |                    |
| 10: D                     | 135: d    |                    |
| 11: T                     | 136: t    |                    |
| 12: I                     | 137: i    |                    |
| 13: P                     | 138: p    |                    |
| 14: K                     | 139: k    |                    |
| 15: F                     | 140: f    |                    |
| 16: Q                     | 141: q    |                    |
| 17: N                     | 142: n    |                    |
| 18: Y                     | 143: y    |                    |
| 19: M                     | 144: m    |                    |
| 20: H                     | 145: h    |                    |
| 21: W                     | 146: w    |                    |
| 22: C                     | 147: c    |                    |
| 23: X                     |           |                    |
| 24: B                     |           |                    |
| 25: O                     |           |                    |
| 26: U                     |           |                    |
| 27: Z                     |           |                    |
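As a rough illustration of how the mapping table can be used, the sketch below turns a list of token IDs back into characters. The ID ranges and characters are transcribed from the table; the `decode` helper and its `skip_special` parameter are hypothetical names introduced here and are not part of the ProstT5 tokenizer API.

```python
# Characters in table order; index 0 of each string corresponds to the first ID.
AA_CHARS = "ALGVSREDTIPKFQNYMHWCXBOUZ"  # amino acid token IDs 3..27
TDI_CHARS = "algvsredtipkfqnymhwc"      # 3Di token IDs 128..147
SPECIAL = {0: "<pad>", 1: "</s>", 2: "<unk>", 148: "<fold2AA>", 149: "<AA2fold>"}

ID_TO_TOKEN = dict(SPECIAL)
ID_TO_TOKEN.update({3 + i: c for i, c in enumerate(AA_CHARS)})
ID_TO_TOKEN.update({128 + i: c for i, c in enumerate(TDI_CHARS)})

def decode(ids, skip_special=True):
    """Map token IDs to a string (hypothetical helper for illustration).

    Raises KeyError for IDs that do not appear in the table.
    """
    pieces = []
    for token_id in ids:
        if skip_special and token_id in SPECIAL:
            continue
        pieces.append(ID_TO_TOKEN[token_id])
    return "".join(pieces)

print(decode([149, 3, 4, 5, 1]))        # ALG
print(decode([148, 135, 131, 138, 1]))  # dvp
```

Upper-case output indicates an amino acid sequence and lower-case output a 3Di sequence, which matches the casing used in the table.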
Dataset Description
ProstT5Dataset is a carefully curated collection of protein sequences and their corresponding structural sequences (3Di) that have been tokenized. The dataset originates from the AlphaFold Protein Structure Database and incorporates multiple clustering and quality‑filtering steps. To capture 3‑D information of the sequences, the 3Di structural string representation is used, effectively translating the 3‑D information into a linear format. Sequence tokenization is performed with the ProstT5 Tokenizer.
Data Fields
- input_id_x (3Di Tokens): Tokenized 3Di structural representation derived from the protein.
- input_id_y (Amino‑acid Tokens): Tokenized amino‑acid sequence of the protein.
Data Collection and Annotation
The dataset starts from the AlphaFold Protein Structure Database and undergoes two clustering steps plus one quality‑filtering step:
- First clustering: MMseqs2 clusters 214 M UniProtKB protein sequences, yielding 52 M clusters based on pairwise similarity.
- Second clustering: Foldseek further groups these proteins into 18.8 M clusters, which are expanded to 18.6 M proteins by adding diverse members.
- Quality filtering: Proteins with low pLDDT scores, short length, or highly repetitive 3Di strings are removed. The final training set contains 17 M proteins.
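The quality-filtering step above could look roughly like the following sketch. This card does not state the cutoffs, so the three threshold constants are placeholder assumptions, and the function name `passes_quality_filter` and its per-residue pLDDT input are hypothetical choices made for illustration.

```python
from collections import Counter

# Placeholder thresholds: assumptions for illustration, not values stated in this card.
MIN_MEAN_PLDDT = 70.0          # minimum mean per-residue pLDDT
MIN_LENGTH = 30                # minimum number of residues
MAX_TOP_TOKEN_FRACTION = 0.95  # maximum share of the most frequent 3Di token

def passes_quality_filter(plddt, three_di):
    """Return True if a protein survives all three filters.

    plddt:    list of per-residue pLDDT scores (same length as three_di)
    three_di: 3Di string, one lower-case character per residue
    """
    if len(three_di) < MIN_LENGTH:
        return False  # too short
    if sum(plddt) / len(plddt) < MIN_MEAN_PLDDT:
        return False  # low-confidence prediction
    top_count = Counter(three_di).most_common(1)[0][1]
    # Reject strings dominated by a single 3Di token (highly repetitive).
    return top_count / len(three_di) <= MAX_TOP_TOKEN_FRACTION
```

The length check runs first so that the later divisions never see an empty sequence.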
Data Splits
The data are split into training, test, and validation sets by moving entire (quality‑filtered) clusters into one of the subsets. To avoid bias toward large families, the validation and test sets retain only representative proteins, resulting in 474 proteins for testing, 474 for validation, and ~17 M for training.
Dataset Information
- Features:
  - input_id_x: int64 sequence
  - input_id_y: int64 sequence
- Splits:
  - test: 1 087 504 bytes, 474 samples
  - valid: 1 124 160 bytes, 474 samples
  - train: 65 391 887 792 bytes, 17 070 828 samples
- Download size: 810 671 738 bytes
- Total size: 65 394 099 456 bytes
- License: MIT
- Task category:
- Text generation
- Tag:
- Biology
- Size category:
- 10M < size < 100M
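The figures listed above are internally consistent, which a short arithmetic check makes explicit. The numbers are copied from the listing; the variable names are arbitrary, and reading download size over total size as a compression ratio is an interpretation, not something the card states.

```python
# Values copied from the dataset information listed above.
splits = {
    "test": {"num_bytes": 1_087_504, "num_examples": 474},
    "valid": {"num_bytes": 1_124_160, "num_examples": 474},
    "train": {"num_bytes": 65_391_887_792, "num_examples": 17_070_828},
}
DATASET_SIZE = 65_394_099_456
DOWNLOAD_SIZE = 810_671_738

total_bytes = sum(s["num_bytes"] for s in splits.values())
total_examples = sum(s["num_examples"] for s in splits.values())

# The per-split byte counts add up exactly to the reported total size.
print(total_bytes == DATASET_SIZE)  # True

# The example count falls inside the declared 10M-100M size category.
print(10_000_000 < total_examples < 100_000_000)  # True

# Share of the total size that has to be downloaded.
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE
print(f"{download_ratio:.2%}")
```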