damlab/HIV_FLT
The dataset originates from the Los Alamos National Laboratory (LANL) HIV sequence database and contains 1,609 high‑quality full‑genome HIV sequences from the 2016 release. Sequences were processed with the GeneCutter tool and translated to amino‑acid sequences using Biopython's `Seq.translate` method. The dataset is intended to train an HIV‑BERT model for predicting various HIV‑related features. It includes the fields ID, gag, pol, env, nef, tat, rev, and proteome; apart from ID, each holds the amino‑acid sequence of the corresponding HIV protein. The dataset can be used for research on HIV sequence characteristics. HIV has caused millions of deaths globally over the past decades.
Dataset Description
Overview
This dataset is sourced from the Los Alamos National Laboratory (LANL) HIV sequence database and contains 1,609 high‑quality full‑genome HIV sequences from the 2016 release. Gene sequences were extracted with the GeneCutter tool and translated to amino‑acid sequences using Biopython's `Seq.translate` method.
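The translation step mentioned above can be illustrated with a minimal sketch. It assumes Biopython is installed and uses a short, made-up in-frame nucleotide fragment rather than a sequence from the dataset; the exact options the curator passed to `Seq.translate` are not stated in the card, so the `to_stop=True` variant is shown only as one plausible choice.

```python
from Bio.Seq import Seq  # requires Biopython (pip install biopython)

# Hypothetical input: a short in-frame coding fragment ending in a stop codon.
# It is illustrative only and is not taken from the dataset.
nucleotides = Seq("ATGGGTGCGAGAGCGTCATAA")

# Default behaviour: translate every codon with the standard genetic code;
# stop codons are written as "*".
protein_with_stop = str(nucleotides.translate())

# to_stop=True ends translation at the first in-frame stop codon
# and leaves the stop symbol out of the result.
protein = str(nucleotides.translate(to_stop=True))

print(protein_with_stop)
print(protein)
```

Whether the stored sequences keep or drop stop symbols should be checked against the data itself before relying on either form.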
Supported tasks and leaderboards: None
Language: English
Structure
Data Instances
Each column other than ID holds the amino‑acid sequence of one protein from the HIV genome. The ID field gives the GenBank accession of the source genome, for future cross‑referencing. The dataset comprises 1,609 full‑length HIV genomes.
Fields: ID, gag, pol, env, nef, tat, rev, proteome
Splits: None
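As a rough sketch of how a record with the fields listed above might be sanity‑checked before use, the snippet below verifies that every field is present and that each protein field contains only expected residue letters. The helper `check_record`, the allowed alphabet (20 standard amino acids plus `X` and `*`), the handling of empty fields, and the example records are all assumptions made for illustration; none of them come from the dataset card.

```python
# Field names as listed in the dataset card.
FIELDS = ("ID", "gag", "pol", "env", "nef", "tat", "rev", "proteome")
PROTEIN_FIELDS = FIELDS[1:]

# Assumption: 20 standard amino acids, plus X (unknown residue) and * (stop).
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYX*")


def check_record(record):
    """Return a list of human-readable problems found in one record (hypothetical helper)."""
    problems = []
    for name in FIELDS:
        if name not in record:
            problems.append(f"missing field: {name}")
    for name in PROTEIN_FIELDS:
        seq = record.get(name)
        if not seq:
            continue  # assumption: an empty or absent protein is reported only as missing
        bad = sorted(set(seq.upper()) - VALID_RESIDUES)
        if bad:
            problems.append(f"{name}: unexpected characters {bad}")
    return problems


# Made-up records with placeholder IDs and short placeholder sequences.
good = {
    "ID": "EXAMPLE01", "gag": "MGARAS", "pol": "FFREDL", "env": "MRVKEK",
    "nef": "MGGKWS", "tat": "MEPVDP", "rev": "MAGRSG", "proteome": "MGARASFFREDL",
}
bad = {
    "ID": "EXAMPLE02", "gag": "MGARAS", "pol": "FFREDL", "env": "MRVKEK",
    "nef": "MGGKWS", "tat": "MEPV1", "rev": "MAGRSG",
}

print(check_record(good))
print(check_record(bad))
```

A check like this is cheap to run over all rows and can surface truncated or mis‑encoded sequences before they reach model training.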
Creation
Rationale
The dataset was assembled to train an HIV‑BERT model aimed at predicting various HIV‑related features.
Collection and Normalization
The dataset was downloaded and curated on 2021‑12‑21.
Considerations
Societal Impact
The dataset can facilitate research on HIV sequence‑dependent characteristics; HIV has claimed millions of lives worldwide over recent decades.
Discussion of Bias
The dataset originates from the LANL full‑genome database and includes representative samples from each subtype and geographic region.
Additional Information
- Curator: Will Dampier
- Citation: To be determined