Atlas.Y Dataset
The Atlas.Y dataset comprises two main components: a signal peptide dataset and a linker dataset. The signal peptide dataset is intended to facilitate research on protein subcellular localization and transport, while the linker dataset is used to study linkers between signal peptides and target proteins, aiding the design and optimization of fusion proteins.
Dataset Overview
The Atlas.Y dataset is a collection for studying protein subcellular localization and transport, consisting of a signal peptide dataset and a linker dataset. It is released under the Creative Commons Attribution‑NonCommercial 4.0 International (CC BY‑NC 4.0) license, which permits non‑commercial sharing and adaptation with appropriate attribution. For commercial use, contact tongji_china2019@163.com to request permission.
Signal Peptide Dataset
- Design Purpose: Facilitates research on protein subcellular localization and transport.
- Source: Derived from the dataset used to train the DeepLoc 2.1 deep‑learning model by Marius Thrane Ødum et al.
- Selection Criteria: Only eukaryotic proteins are included. Their signal peptides are extracted, classified, and assigned unique identifiers for efficient querying.
- Applicable Domains: Bioinformatics research, protein design, cell‑biology experiments, especially subcellular location prediction.
- File: Signal_Peptide.csv
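Because each signal peptide carries a unique identifier, the table can be loaded once and indexed for direct lookup. The sketch below is one minimal way to do that with Python's standard library. The column layout of Signal_Peptide.csv is not documented here, so the identifier column name is passed in as a parameter, and the name `index_by_id` and the column name in the usage comment are hypothetical.

```python
import csv
from typing import Dict, TextIO


def index_by_id(handle: TextIO, id_column: str) -> Dict[str, dict]:
    """Read a CSV table and return its rows keyed by the identifier column.

    `id_column` is an assumption supplied by the caller; check the actual
    header of the CSV file before using it.
    """
    reader = csv.DictReader(handle)
    if id_column not in (reader.fieldnames or []):
        raise ValueError(f"column {id_column!r} not found in CSV header")
    index = {}
    for row in reader:
        # Later rows overwrite earlier ones if an identifier repeats.
        index[row[id_column]] = row
    return index


# Usage (the column name "ID" is a placeholder, not the documented header):
# with open("Signal_Peptide.csv", newline="", encoding="utf-8") as f:
#     peptides = index_by_id(f, "ID")
```

Keying rows by identifier makes each lookup a dictionary access instead of a scan over the file.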
Linker Dataset
- Design Purpose: Supports investigation of linkers between signal peptides and target proteins, assisting the design and optimization of fusion proteins, particularly in subcellular localization and transport studies.
- Data Classification: Divided into a classical linker table and a natural linker table.
Classical Linker Table
- Content: Contains linkers extensively reviewed and classified in the literature, categorized by rigidity and flexibility.
- Applicable Domains: Protein design, molecular biology, synthetic biology engineering projects.
- File: Classical_Linker.csv
Natural Linker Table
- Content: Short peptides extracted from natural protein sequences without artificial optimization.
- Generation Method: Produced by removing signal peptides and conserved regions following the method of the 2021 Sun Yat‑sen University iGEM team.
- Source: Utilizes protein sequences from the DeepLoc 2.1 dataset, with conserved domains identified using NCBI's Conserved Domain Database (CDD) and the batch CD‑Search tool.
- File: Natural_Linker.csv
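The generation step described above (remove the signal peptide and the conserved regions, keep what remains) can be sketched as interval masking. This is an illustration under stated assumptions, not the iGEM team's actual pipeline: the function name `extract_candidate_linkers`, the 0-based half-open coordinates, the `min_length` filter, and the toy sequence are all hypothetical. In practice the conserved-region coordinates would come from a CD‑Search result, whose parsing is not shown.

```python
from typing import List, Tuple


def extract_candidate_linkers(
    sequence: str,
    signal_peptide_length: int,
    conserved_regions: List[Tuple[int, int]],
    min_length: int = 1,
) -> List[str]:
    """Return the segments left after masking the signal peptide and conserved regions.

    `conserved_regions` holds 0-based, half-open (start, end) intervals.
    """
    # Treat the N-terminal signal peptide as one more region to remove.
    masked = sorted([(0, signal_peptide_length)] + list(conserved_regions))

    segments = []
    cursor = 0  # first position not yet covered by a masked region
    for start, end in masked:
        if start > cursor:
            segments.append(sequence[cursor:start])
        cursor = max(cursor, end)  # max() handles overlapping regions
    if cursor < len(sequence):
        segments.append(sequence[cursor:])

    return [s for s in segments if len(s) >= min_length]


# Made-up toy sequence: 6-residue signal peptide, two "conserved" blocks.
toy = "MKTLLA" + "GSGS" + "DDDDD" + "PAPAP" + "EEEE" + "KK"
print(extract_candidate_linkers(toy, 6, [(10, 15), (20, 24)]))
# prints ['GSGS', 'PAPAP', 'KK']
```

Whether terminal segments such as the trailing `KK` count as linkers depends on the original method. Raising `min_length` or dropping the final segment are simple ways to adjust the sketch.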
Application Areas
The dataset supports protein engineering, molecular design, signal peptide functional studies, and bioinformatics analyses. The two linker tables give researchers a foundational resource for querying and reusing linker sequences.