damlab/HIV_FLT
The dataset originates from the Los Alamos National Laboratory (LANL) HIV sequence database and contains 1,609 high‑quality full‑genome HIV sequences from the 2016 release. Sequences were processed with the GeneCutter tool and translated to amino‑acid sequences using Biopython's `Seq.translate`. The dataset is intended to train an HIV‑BERT model for predicting various HIV‑related features. Its fields are ID, gag, pol, env, nef, tat, rev, and proteome; apart from ID, each holds a protein amino‑acid sequence derived from the HIV genome. The dataset can support research on the sequence characteristics of HIV, a virus that has caused millions of deaths globally over past decades.
Description
Dataset Description
Overview
This dataset is sourced from the Los Alamos National Laboratory (LANL) HIV sequence database and contains 1,609 high‑quality full‑genome HIV sequences from the 2016 release. Gene sequences were processed with the GeneCutter tool and translated to amino‑acid sequences using Biopython's `Seq.translate` method.
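The translation step can be illustrated with a minimal, standard-library sketch. This is not the curator's pipeline: it is a simplified stand-in for Biopython's `Seq.translate` that uses the standard genetic code. The function name `translate`, the choice to map unrecognized codons to `X`, and the choice to drop a trailing partial codon are all assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a nucleotide string to amino acids, codon by codon.

    Assumptions of this sketch (not Biopython behaviour):
    - codons containing ambiguous bases or gaps become "X";
    - a trailing partial codon is silently dropped;
    - stop codons are emitted as "*" and translation continues.
    """
    dna = dna.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )


# Toy input for illustration only, not a record from the dataset.
print(translate("ATGGGTGCGAGAGCGTCA"))  # MGARAS
```

With Biopython installed, the comparable call would be `Seq("ATGGGTGCGAGAGCGTCA").translate()` after `from Bio.Seq import Seq`; its handling of gaps, ambiguous bases, and partial codons differs from this sketch.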
Supported tasks and leaderboards: None
Language: English
Structure
Data Instances
Apart from ID, each column holds a protein amino‑acid sequence derived from the HIV genome. The ID field gives the GenBank accession, retained for future cross‑referencing. The dataset comprises 1,609 full‑length HIV genomes.
Fields: ID, gag, pol, env, nef, tat, rev, proteome
Splits: None
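A record with the fields listed above can be modelled as follows. This is a hedged sketch rather than the dataset's official schema: the class name `HivRecord`, the helper `invalid_fields`, the allowed-character set, and the example values are all hypothetical, and every field is assumed to be a plain string.

```python
from dataclasses import dataclass, replace

# Protein columns named in the card; ID is treated separately.
PROTEIN_FIELDS = ("gag", "pol", "env", "nef", "tat", "rev", "proteome")

# Assumed alphabet: the 20 standard residues plus "*" (stop) and "X" (unknown).
ALLOWED = set("ACDEFGHIKLMNPQRSTVWY*X")


@dataclass
class HivRecord:
    ID: str
    gag: str
    pol: str
    env: str
    nef: str
    tat: str
    rev: str
    proteome: str


def invalid_fields(record: HivRecord) -> list:
    """Return the protein fields containing characters outside ALLOWED."""
    return [
        name for name in PROTEIN_FIELDS
        if not set(getattr(record, name)) <= ALLOWED
    ]


# Made-up values for illustration; not taken from the dataset.
example = HivRecord(
    ID="HYPOTHETICAL_001",
    gag="MGARAS", pol="FFREDLAF", env="MRVKEK",
    nef="MGGKWS", tat="MEPVDP", rev="MAGRSG",
    proteome="MGARASFFREDLAF",
)
corrupted = replace(example, env="MRVK3K")  # "3" is not a residue code

print(invalid_fields(example))    # []
print(invalid_fields(corrupted))  # ['env']
```

A check of this kind is one way to catch extraction or encoding damage before sequences are tokenized for model training.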
Creation
Rationale
The dataset was assembled to train an HIV‑BERT model aimed at predicting various HIV‑related features.
Collection and Normalization
The dataset was downloaded and curated on 2021‑12‑21.
Considerations
Societal Impact
The dataset can facilitate research on HIV sequence‑dependent characteristics; HIV has claimed millions of lives worldwide over recent decades.
Discussion of Bias
The dataset originates from the LANL full‑genome database and includes representative samples from each subtype and geographic region.
Additional Information
- Curator: Will Dampier
- Citation: To be determined
Source
Organization: hugging_face
Created: Unknown