Tags: Open Source Community, Bioinformatics, Viral Genomics

damlab/HIV_FLT

The dataset originates from the Los Alamos National Laboratory (LANL) HIV sequence database and contains 1,609 high‑quality full‑genome HIV sequences from the 2016 release. Sequences were processed with the GeneCutter tool and translated to amino‑acid sequences using BioPython's `Seq.translate`. The dataset is intended to train an HIV‑BERT model for predicting various HIV‑related features. It includes the fields ID, gag, pol, env, nef, tat, rev, and proteome; apart from ID, each holds a protein amino‑acid sequence derived from the HIV genome. The dataset can support research on HIV sequence characteristics; HIV has caused millions of deaths globally over past decades.

Source: hugging_face
Created: Nov 28, 2025
Updated: Feb 8, 2022
Signals: 175 views
Availability: Linked source ready

Dataset Description

Overview

This dataset is sourced from the Los Alamos National Laboratory (LANL) HIV sequence database and contains 1,609 high‑quality full‑genome HIV sequences from the 2016 release. Gene sequences were processed with the GeneCutter tool and translated to amino‑acid sequences using BioPython's `Seq.translate` function.
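
The translation step can be illustrated without installing anything. The sketch below is a dependency-free stand-in for what BioPython's `Seq.translate` does with the standard genetic code; it is not the curator's pipeline. The helper name `translate_dna` and the toy input are assumptions made for illustration, and mapping ambiguous codons to `X` is a simplification.

```python
from itertools import product

# Standard genetic code (NCBI table 1), with bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_dna(dna: str, to_stop: bool = False) -> str:
    """Translate a coding DNA sequence in frame 1 (hypothetical helper)."""
    dna = dna.upper().replace("U", "T")
    protein = []
    # Walk complete codons only; a trailing 1-2 bases are ignored.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # unknown codon -> X
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)


# Toy input, not taken from the dataset.
print(translate_dna("ATGGGTGCGAGAGCGTCATAA"))                # MGARAS*
print(translate_dna("ATGGGTGCGAGAGCGTCATAA", to_stop=True))  # MGARAS
```

With Biopython installed, `Seq(dna).translate()` should give the same result for unambiguous input like this.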

Supported tasks and leaderboards: None

Language: English

Structure

Data Instances

Apart from ID, each column holds a protein amino‑acid sequence from the HIV genome. The ID field gives the GenBank accession, kept for future cross‑referencing. The dataset comprises 1,609 full‑length HIV genomes.

Fields: ID, gag, pol, env, nef, tat, rev, proteome

Splits: None
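
To make the record layout concrete, here is a minimal sketch of one row as a Python dataclass, assuming every field is stored as a plain string. The class name `HIVRecord`, the helper `protein_lengths`, the placeholder ID, and the short toy sequences are all hypothetical; only the field names come from the list above.

```python
from dataclasses import dataclass, fields
from typing import Dict


@dataclass(frozen=True)
class HIVRecord:
    """One dataset row (hypothetical type; field names from the card)."""
    ID: str
    gag: str
    pol: str
    env: str
    nef: str
    tat: str
    rev: str
    proteome: str


def protein_lengths(record: HIVRecord) -> Dict[str, int]:
    """Return the amino-acid length of every sequence field except ID."""
    return {
        f.name: len(getattr(record, f.name))
        for f in fields(record)
        if f.name != "ID"
    }


# Placeholder values for illustration only, not real dataset content.
toy = HIVRecord(
    ID="EXAMPLE_ID",
    gag="MGARAS",
    pol="FFRE",
    env="MRVK",
    nef="MGGK",
    tat="MEPV",
    rev="MAGR",
    proteome="MGARASFFRE",
)
print(protein_lengths(toy))
```

A length summary like this is a quick sanity check for truncated or empty sequences before feeding rows to a model.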

Creation

Rationale

The dataset was assembled to train an HIV‑BERT model aimed at predicting various HIV‑related features.

Collection and Normalization

The dataset was downloaded and curated on 2021‑12‑21.

Considerations

Societal Impact

The dataset can facilitate research on HIV sequence‑dependent characteristics; HIV has claimed millions of lives worldwide over recent decades.

Discussion of Bias

The dataset originates from the LANL full‑genome database and includes representative samples from each subtype and geographic region.

Additional Information

  • Curator: Will Dampier
  • Citation: To be determined