Rostlab/ProstT5Dataset

---
dataset_info:
  features:
  - name: input_id_x
    sequence: int64
  - name: input_id_y
    sequence: int64
  splits:
  - name: test
    num_bytes: 1087504
    num_examples: 474
  - name: valid
    num_bytes: 1124160
    num_examples: 474
  - name: train
    num_bytes: 65391887792
    num_examples: 17070828
  download_size: 810671738
  dataset_size: 65394099456
license: mit
task_categories:
- text-generation
tags:
- biology
size_categories:
- 10M<n<100M
---

# Dataset Card for "ProstT5Dataset"

* **Contributors:** Michael Heinzinger and Konstantin Weissenow, Joaquin Gomez Sanchez and Adrian Henkel, Martin Steinegger and Burkhard Rost
* **Licence:** MIT

## Table of Contents

- [Overview](#overview)
- [Dataset Description](#dataset-description)
- [Data Collection and Annotation](#data-collection-and-annotation)
- [Data Splits](#data-splits)
- [Dataset Structure](#dataset-structure)
- [Data Fields](#data-fields)
- [Data Instances](#data-instances)
- [Data Considerations](#data-considerations)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)

## Overview

The ProstT5Dataset is a curated collection of *tokenized* protein sequences and their corresponding structure sequences (3Di). It is derived from the [AlphaFold Protein Structure Database](https://alphafold.ebi.ac.uk/) and includes various steps of clustering and quality filtering. To capture 3D information of the sequence, the [3Di structure string representation](https://www.nature.com/articles/s41587-023-01773-0#Sec2) is leveraged. This format captures the spatial relationship of each residue to its neighbors in 3D space, effectively translating the 3D information of the sequence. The sequence tokens are generated using the [ProstT5 Tokenizer](https://huggingface.co/Rostlab/ProstT5).

## Data Fields

- **input_id_x** (3Di Tokens): Tokenized 3Di structure representation sequences derived from the proteins.
- **input_id_y** (Amino Acid Tokens): Tokenized amino acid sequences of the proteins.

## Dataset Description

![image/png](https://cdn-uploads.huggingface.co/production/uploads/62c412251f45e8bdb2b05855/BgiKOoFUGjlHDPjbxJWOX.png)

We compare basic protein properties (sequence length, amino acid composition, 3Di-distribution) between our dataset (training, validation, test sets) and proteins obtained from the [Protein Data Bank (PDB)](https://www.rcsb.org/). Key findings include similar amino acid distributions across datasets, an overrepresentation of certain 3Di-tokens (d, v, p) and helical structures in AlphaFold2 predictions compared to PDB, and a tendency for shorter protein lengths in this dataset (average 206-238) relative to PDB proteins (average 255). The analysis also highlights the relationship between 3Di states and secondary structures, with a notable distinction in strand-related tokens between datasets.

## Data Collection and Annotation

The dataset began with the AlphaFold Protein Structure Database and underwent a two-step clustering process followed by one step of quality filtering:

1. *First Clustering:* 214M UniProtKB protein sequences were clustered using MMseqs2, resulting in 52M clusters based on pairwise sequence identity.
2. *Second Clustering:* Foldseek further clustered these proteins into 18.8M clusters, expanded to 18.6M proteins by adding diverse members.
3. *Quality Filtering:* Removed proteins with low pLDDT scores, short lengths, and highly repetitive 3Di-strings. The final training split contains 17M proteins.

## Data Splits

The train, test, and validation splits were created by moving whole clusters (after quality filtering, see above) into one of the sets. For validation and test, we only kept representatives to avoid bias towards large families.
This resulted in 474 proteins for test, 474 proteins for validation and around 17M proteins for training.

## Citation

```
@article{heinzinger2023prostt5,
  title={ProstT5: Bilingual language model for protein sequence and structure},
  author={Heinzinger, Michael and Weissenow, Konstantin and Sanchez, Joaquin Gomez and Henkel, Adrian and Steinegger, Martin and Rost, Burkhard},
  journal={bioRxiv},
  pages={2023--07},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}
```

## Tokens to Character Mapping

| Amino Acid Representation | 3Di       | Special Tokens     |
|---------------------------|-----------|--------------------|
| 3: A                      | 128: a    | 0: \<pad\>         |
| 4: L                      | 129: l    | 1: \</s\>          |
| 5: G                      | 130: g    | 2: \<unk\>         |
| 6: V                      | 131: v    | 148: \<fold2AA\>   |
| 7: S                      | 132: s    | 149: \<AA2fold\>   |
| 8: R                      | 133: r    |                    |
| 9: E                      | 134: e    |                    |
| 10: D                     | 135: d    |                    |
| 11: T                     | 136: t    |                    |
| 12: I                     | 137: i    |                    |
| 13: P                     | 138: p    |                    |
| 14: K                     | 139: k    |                    |
| 15: F                     | 140: f    |                    |
| 16: Q                     | 141: q    |                    |
| 17: N                     | 142: n    |                    |
| 18: Y                     | 143: y    |                    |
| 19: M                     | 144: m    |                    |
| 20: H                     | 145: h    |                    |
| 21: W                     | 146: w    |                    |
| 22: C                     | 147: c    |                    |
| 23: X                     |           |                    |
| 24: B                     |           |                    |
| 25: O                     |           |                    |
| 26: U                     |           |                    |
| 27: Z                     |           |                    |
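
The mapping table above can be turned into a small lookup for inspecting records by eye. The following is a minimal sketch, not the ProstT5 tokenizer itself: the names `ID_TO_CHAR`, `SPECIAL_TOKENS`, and `decode` are illustrative, and the card does not say whether stored sequences contain special tokens, so the sketch simply skips them unless asked to keep them.

```python
# Lookup tables transcribed from the "Tokens to Character Mapping" table.
AA_ALPHABET = "ALGVSREDTIPKFQNYMHWCXBOUZ"  # amino acid ids 3..27
TDI_ALPHABET = "algvsredtipkfqnymhwc"      # 3Di ids 128..147

SPECIAL_TOKENS = {0: "<pad>", 1: "</s>", 2: "<unk>", 148: "<fold2AA>", 149: "<AA2fold>"}

ID_TO_CHAR = {offset + 3: char for offset, char in enumerate(AA_ALPHABET)}
ID_TO_CHAR.update({offset + 128: char for offset, char in enumerate(TDI_ALPHABET)})


def decode(token_ids, keep_special=False):
    """Map a list of token ids back to characters (illustrative helper)."""
    chars = []
    for token_id in token_ids:
        if token_id in ID_TO_CHAR:
            chars.append(ID_TO_CHAR[token_id])
        elif token_id in SPECIAL_TOKENS:
            # Special tokens are dropped by default so the result is a plain string.
            if keep_special:
                chars.append(SPECIAL_TOKENS[token_id])
        else:
            raise ValueError(f"token id {token_id} is not in the mapping table")
    return "".join(chars)
```

For example, `decode([3, 4, 5])` yields the amino acid string `ALG`, and `decode([135, 131, 138])` yields the 3Di string `dvp`. Upper-case output indicates amino acids and lower-case output indicates 3Di states, matching the table.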

Source

hugging_face

Created

Nov 28, 2025

Updated

Dec 4, 2023

Overview

Dataset description and usage context

Dataset Overview

Dataset Description

ProstT5Dataset is a curated collection of tokenized protein sequences and their corresponding structure sequences (3Di). It is derived from the AlphaFold Protein Structure Database through several clustering and quality‑filtering steps. The 3Di structure string encodes the spatial relationship of each residue to its neighbors in 3D space, so the 3D information is carried in a linear sequence. Tokens are produced with the ProstT5 Tokenizer.

Data Fields

input_id_x (3Di Tokens): Tokenized 3Di structural representation derived from the protein.
input_id_y (Amino‑acid Tokens): Tokenized amino‑acid sequence of the protein.
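
As a rough illustration of how the two fields might be accessed, the sketch below uses the Hugging Face `datasets` library and the repository id `Rostlab/ProstT5Dataset` shown on this page. The helper names `summarize_example` and `load_test_split` are hypothetical. Only the small `test` split is requested, since the training split is listed at roughly 65 GB.

```python
def summarize_example(example):
    """Return the token counts of one record's two fields."""
    return {
        "n_3di_tokens": len(example["input_id_x"]),  # 3Di structure tokens
        "n_aa_tokens": len(example["input_id_y"]),   # amino acid tokens
    }


def load_test_split():
    """Download the test split (needs network access and the `datasets` package)."""
    # Imported lazily so summarize_example can be used without the dependency.
    from datasets import load_dataset

    return load_dataset("Rostlab/ProstT5Dataset", split="test")
```

With network access, `summarize_example(load_test_split()[0])` would report the lengths of the first test record. The other split names listed in the dataset information are `valid` and `train`.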

Data Collection and Annotation

The dataset starts from the AlphaFold Protein Structure Database and undergoes two clustering steps plus one quality‑filtering step:

1. First clustering: MMseqs2 clustered 214 M UniProtKB protein sequences into 52 M clusters based on pairwise sequence identity.
2. Second clustering: Foldseek further grouped these proteins into 18.8 M clusters, which were expanded to 18.6 M proteins by adding diverse members.
3. Quality filtering: Proteins with low pLDDT scores, short length, or highly repetitive 3Di strings were removed. The final training set contains 17 M proteins.
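
The quality-filtering step could be sketched as below. This is an illustration only: the card gives no thresholds, so `min_plddt`, `min_length`, and `max_repeat_fraction` are hypothetical parameters that the caller must supply, and measuring repetitiveness as the share of the most frequent 3Di character is an assumption, not the authors' stated criterion.

```python
from collections import Counter


def most_common_fraction(tdi_string):
    """Share of the string taken up by its single most frequent character."""
    if not tdi_string:
        return 1.0  # treat an empty string as maximally repetitive
    most_common_count = Counter(tdi_string).most_common(1)[0][1]
    return most_common_count / len(tdi_string)


def passes_quality_filter(mean_plddt, tdi_string, *, min_plddt, min_length, max_repeat_fraction):
    """Apply the three filters named in the text with caller-chosen thresholds."""
    if mean_plddt < min_plddt:  # low-confidence prediction
        return False
    if len(tdi_string) < min_length:  # too short
        return False
    if most_common_fraction(tdi_string) > max_repeat_fraction:  # highly repetitive
        return False
    return True
```

The thresholds are keyword-only and have no defaults so that no particular cut-off is implied.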

Data Splits

The data are split into training, test, and validation sets by moving entire (quality‑filtered) clusters into one of the subsets. To avoid bias toward large families, the validation and test sets retain only representative proteins, resulting in 474 proteins for testing, 474 for validation, and ~17 M for training.
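
A cluster-level split of this kind might look like the following sketch. It assumes clusters are given as a dictionary from a cluster id to a member list whose first entry is the representative, and it picks held-out clusters at random. Both choices are assumptions for illustration; the card does not describe how clusters were selected.

```python
import random


def split_by_cluster(clusters, n_valid, n_test, seed=0):
    """Assign whole clusters to splits; held-out splits keep representatives only.

    clusters: dict mapping cluster id -> list of member ids, representative first.
    """
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)

    valid_ids = cluster_ids[:n_valid]
    test_ids = cluster_ids[n_valid:n_valid + n_test]
    train_ids = cluster_ids[n_valid + n_test:]

    return {
        # One protein per cluster avoids over-weighting large families.
        "valid": [clusters[cid][0] for cid in valid_ids],
        "test": [clusters[cid][0] for cid in test_ids],
        # Training keeps every member of its clusters.
        "train": [member for cid in train_ids for member in clusters[cid]],
    }
```

Because a cluster is never divided, no protein in the validation or test set shares a cluster with a training protein.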

Dataset Information

Features:
- input_id_x: int64 sequence
- input_id_y: int64 sequence
Splits:
- test: 1 087 504 bytes, 474 samples
- valid: 1 124 160 bytes, 474 samples
- train: 65 391 887 792 bytes, 17 070 828 samples
Download size: 810 671 738 bytes
Total size: 65 394 099 456 bytes
License: MIT
Task category:
- Text generation
Tag:
- Biology
Size category:
- 10M < size < 100M
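
The figures above can be cross-checked with a little arithmetic. The sketch below copies the numbers from the listing; the variable names are illustrative. It confirms that the three split sizes add up to the stated total size and derives average bytes per example and the ratio of total size to download size.

```python
# (bytes, examples) per split, copied from the dataset information above.
SPLITS = {
    "test": (1_087_504, 474),
    "valid": (1_124_160, 474),
    "train": (65_391_887_792, 17_070_828),
}
DATASET_SIZE = 65_394_099_456
DOWNLOAD_SIZE = 810_671_738

total_bytes = sum(num_bytes for num_bytes, _ in SPLITS.values())
total_examples = sum(num_examples for _, num_examples in SPLITS.values())

# The per-split byte counts should sum to the reported total size.
assert total_bytes == DATASET_SIZE

for name, (num_bytes, num_examples) in SPLITS.items():
    print(f"{name}: {num_bytes / num_examples:.1f} bytes per example")

print(f"total examples: {total_examples}")
print(f"total size / download size: {DATASET_SIZE / DOWNLOAD_SIZE:.1f}")
```

The total of 17,071,776 examples falls inside the listed size category of 10M to 100M.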
