Atlas.Y Dataset
The Atlas.Y dataset comprises two main components: a signal peptide dataset and a linker dataset. The signal peptide dataset is intended to facilitate research on protein subcellular localization and transport, while the linker dataset is used to study linkers between signal peptides and target proteins, aiding the design and optimization of fusion proteins.
Description
Dataset Overview
The Atlas.Y dataset is a collection for studying protein subcellular localization and transport, consisting of a signal peptide dataset and a linker dataset. It is released under the Creative Commons Attribution‑NonCommercial 4.0 International (CC BY‑NC 4.0) license, which permits non‑commercial sharing and adaptation with appropriate attribution. For commercial use, contact tongji_china2019@163.com to request permission.
Signal Peptide Dataset
- Design Purpose: Facilitates research on protein subcellular localization and transport.
- Source: Derived from the dataset used to train the DeepLoc 2.1 deep‑learning model by Marius Thrane Ødum et al.
- Selection Criteria: Only eukaryotic proteins are included. Their signal peptides are extracted, classified, and assigned unique identifiers for efficient querying.
- Applicable Domains: Bioinformatics research, protein design, cell‑biology experiments, especially subcellular location prediction.
- File: Signal_Peptide.csv
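Because each signal peptide carries a unique identifier, lookups by identifier can be sketched as a simple in-memory index. The example below is a minimal sketch only: the column names (`id`, `category`, `sequence`) and the two rows are hypothetical placeholders, since the actual layout of Signal_Peptide.csv is not described here.

```python
import csv
import io


def index_by_id(csv_file, id_column):
    """Read CSV rows from an open text file and key them by `id_column`."""
    index = {}
    for row in csv.DictReader(csv_file):
        index[row[id_column]] = row
    return index


# Toy stand-in for Signal_Peptide.csv. The column names and rows are
# placeholders for illustration, not real entries from the dataset.
sample = io.StringIO(
    "id,category,sequence\n"
    "SP0001,placeholder_class_a,MKKAAA\n"
    "SP0002,placeholder_class_b,MRRLLL\n"
)
peptides = index_by_id(sample, "id")
```

With the real file, the same function could be called as `index_by_id(open("Signal_Peptide.csv", newline=""), ...)`, passing whatever the identifier column is actually named.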
Linker Dataset
- Design Purpose: Supports investigation of linkers between signal peptides and target proteins, assisting the design and optimization of fusion proteins, particularly in subcellular localization and transport studies.
- Data Classification: Divided into a classical linker table and a natural linker table.
Classical Linker Table
- Content: Contains linkers extensively reviewed and classified in the literature, categorized by rigidity and flexibility.
- Applicable Domains: Protein design, molecular biology, synthetic biology engineering projects.
- File: Classical_Linker.csv
Natural Linker Table
- Content: Short peptides extracted from natural protein sequences without artificial optimization.
- Generation Method: Produced by removing signal peptides and conserved regions following the method of the 2021 Sun Yat‑sen University iGEM team.
- Source: Utilizes protein sequences from the DeepLoc 2.1 dataset, with conserved domains identified using NCBI's Conserved Domain Database (CDD) and the batch CD‑Search tool.
- File: Natural_Linker.csv
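The generation method above (remove the signal peptide and the conserved regions, keep what remains) can be illustrated with a short sketch. This is not the iGEM team's actual pipeline: the function name, the 1-based inclusive interval convention, the `min_len` filter, and the toy sequence are all assumptions made for illustration. In practice the domain intervals would come from CD‑Search results.

```python
def extract_candidate_linkers(sequence, signal_peptide_len, domains, min_len=1):
    """Return the segments of `sequence` left after masking the signal
    peptide and every conserved-domain interval.

    domains: iterable of (start, end) pairs, assumed 1-based and inclusive.
    """
    # Treat the signal peptide as one more masked interval at the N-terminus.
    masked = [(1, signal_peptide_len)] if signal_peptide_len > 0 else []
    masked.extend(domains)
    masked.sort()

    segments = []
    cursor = 1  # next 1-based position not yet covered by a masked interval
    for start, end in masked:
        if start > cursor:
            # Unmasked gap between the previous interval and this one.
            segments.append(sequence[cursor - 1:start - 1])
        cursor = max(cursor, end + 1)  # also handles overlapping intervals
    if cursor <= len(sequence):
        segments.append(sequence[cursor - 1:])  # C-terminal remainder
    return [s for s in segments if len(s) >= min_len]


# Toy sequence: 5-residue signal peptide, domains at 6-8 and 15-19.
toy = "MKKLL" + "AAA" + "GGSGGS" + "DDDDD" + "EE"
linkers = extract_candidate_linkers(toy, 5, [(6, 8), (15, 19)], min_len=3)
```

Here the C-terminal remainder is also kept as a candidate unless `min_len` filters it out. Whether the real method keeps only inter-domain segments is not stated in this description.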
Application Areas
The dataset applies to protein engineering, molecular design, signal peptide functional studies, and bioinformatics analyses. The classical and natural linker tables give researchers a basis for querying and reusing linker sequences efficiently.
Source
Organization: github
Created: 9/26/2024