DATASET-CAPE-RhlA-seqlabel
The CAPE dataset contains mutant sequences of the RhlA enzyme and their functional evaluation metrics. The training set comprises 1,593 sequences, each paired with an enzyme activity metric for model training. The test set includes 925 unlabeled sequences for model evaluation; participants must predict their activity. The goal is to improve rhamnolipid production and application potential by engineering RhlA mutations.
Dataset description and usage context
CAPE Dataset: RhlA Enzyme Mutations
Dataset Introduction and Use Cases
RhlA (UniProt ID: Q51559, PDB ID: 8IK2) is a key enzyme in the synthesis of the hydrophobic component of rhamnolipids. It determines fatty‑acid chain length and degree of unsaturation, which in turn influence the physicochemical properties and bioactivity of rhamnolipids.
Why Engineer RhlA?
Modifying RhlA enables precise control over fatty‑acid chain structure, thereby increasing rhamnolipid yield and enhancing its industrial and pharmaceutical applicability.
Dataset Description
Training Set: Saprot_CAPE_dataset_train.csv
- File format: CSV
- Number of sequences: 1,593
- Columns:
  - protein: Represents the mutation combination at six critical residues (positions 74, 101, 143, 148, 173, 176).
  - label: Enzyme activity metric indicating overall productivity.
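As a minimal sketch of how the training file described above might be read, the helper below parses the `protein` and `label` columns using only the Python standard library. The function name `load_training_rows` is hypothetical, and the sketch assumes that the CSV has a header row with exactly those column names and that `label` is numeric; none of this is confirmed beyond the column list above.

```python
import csv
from typing import Iterable, List, Tuple


def load_training_rows(lines: Iterable[str]) -> List[Tuple[str, float]]:
    """Parse (protein, label) pairs from CSV lines.

    `lines` can be an open text file or any iterable of CSV lines.
    Assumes a header row naming the `protein` and `label` columns.
    """
    reader = csv.DictReader(lines)
    # Fail early if the expected columns are not present in the header.
    missing = {"protein", "label"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    # Keep the sequence as text and convert the activity metric to float.
    return [(row["protein"], float(row["label"])) for row in reader]
```

To use it on the real file, open `Saprot_CAPE_dataset_train.csv` with `open(path, newline="")` and pass the file handle to `load_training_rows`; if the columns are as described, the result should contain 1,593 pairs.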
Test Set: Saprot_CAPE_dataset_test.csv
- Number of sequences: 925
- Description: Contains only the sequence information. Participants must predict the activity of these sequences for model evaluation. Predictions are submitted to Kaggle for performance feedback.
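To illustrate the prediction task, the sketch below builds a trivial baseline that assigns the mean training activity to every test sequence. It is a reference point for sanity-checking a real model, not a recommended method. The function name `mean_baseline` is hypothetical, and the submission format expected by Kaggle is not specified here, so the sketch returns plain (sequence, prediction) pairs rather than writing a file.

```python
from statistics import mean
from typing import Iterable, List, Sequence, Tuple


def mean_baseline(
    train_labels: Sequence[float], test_proteins: Iterable[str]
) -> List[Tuple[str, float]]:
    """Predict the mean training label for every test sequence."""
    if not train_labels:
        raise ValueError("need at least one training label")
    baseline = mean(train_labels)
    # Every test sequence receives the same constant prediction.
    return [(protein, baseline) for protein in test_proteins]
```

A learned model would be expected to score better than this constant predictor on whatever metric the competition uses.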