damlab/HIV_FLT

The dataset originates from the Los Alamos National Laboratory (LANL) HIV sequence database and contains 1,609 high‑quality full‑genome HIV sequences from the 2016 release. Sequences were processed with the GeneCutter tool and translated to amino‑acid sequences using BioPython's `Seq.translate`. The dataset is intended to train an HIV‑BERT model for predicting various HIV‑related features. Its fields are ID, gag, pol, env, nef, tat, rev, and proteome: ID identifies the genome, and the remaining fields hold protein amino‑acid sequences derived from it. The dataset supports research on the sequence characteristics of HIV, a virus that has caused millions of deaths worldwide over recent decades.

Updated 2/8/2022

hugging_face


Dataset Description

Overview

This dataset is sourced from the Los Alamos National Laboratory (LANL) HIV sequence database and contains 1,609 high‑quality full‑genome HIV sequences from the 2016 release. Gene sequences were processed with the GeneCutter tool and translated to amino‑acid sequences using BioPython's `Seq.translate` function.
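The translation step can be illustrated with a minimal standard-library sketch. This is not the curator's code: the card only states that BioPython's `Seq.translate` was used. The `translate` helper, the codon-table construction, and the example fragment below are assumptions made for illustration, using the standard genetic code (NCBI translation table 1).

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), with codons enumerated
# in T, C, A, G order for each of the three positions.
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), _AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA string codon by codon; '*' marks a stop codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    # Codons containing gaps or ambiguity codes map to 'X'
    # (a choice made for this sketch, not documented behavior of the dataset).
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


# Short illustrative fragment; it is not a row taken from the dataset.
fragment = "ATGGGTGCGAGAGCGTCAGTATTAAGCTAA"
protein = translate(fragment)
print(protein)
```

With Biopython installed, `Seq(fragment).translate()` should give the same string for unambiguous input such as this fragment.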

Supported tasks and leaderboards: None

Language: English

Structure

Data Instances

Each column other than ID holds a protein amino‑acid sequence derived from the HIV genome. The ID field gives the GenBank accession, kept for future cross‑referencing. The dataset comprises 1,609 full‑length HIV genomes.

Fields: ID, gag, pol, env, nef, tat, rev, proteome

Splits: None
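As a rough sketch of how one row with the fields listed above might be handled in Python, the snippet below checks that every field is present and reports the length of each sequence. The `check_row` and `protein_lengths` helpers and all values in `example_row` are hypothetical placeholders, not data from the dataset, and the sketch makes no assumption about how the proteome field is built.

```python
# Field names as listed in the dataset card.
FIELDS = ("ID", "gag", "pol", "env", "nef", "tat", "rev", "proteome")
SEQUENCE_FIELDS = FIELDS[1:]  # every field except ID holds a sequence


def check_row(row: dict) -> list:
    """Return a list of problems found in one row; an empty list means none."""
    problems = [f"missing field: {name}" for name in FIELDS if name not in row]
    for name in SEQUENCE_FIELDS:
        value = row.get(name)
        if value is not None and not isinstance(value, str):
            problems.append(f"field {name} is not a string")
    return problems


def protein_lengths(row: dict) -> dict:
    """Map each non-empty sequence field to its length in residues."""
    return {name: len(row[name]) for name in SEQUENCE_FIELDS if row.get(name)}


# Hypothetical row with placeholder values (not a real accession or sequence).
example_row = {
    "ID": "EXAMPLE0001",
    "gag": "MGARASVLS",
    "pol": "FFREDLAF",
    "env": "MRVKEK",
    "nef": "MGGKWSK",
    "tat": "MEPVDPRLEPW",
    "rev": "MAGRSGDSDE",
    "proteome": "PLACEHOLDER",
}

print(check_row(example_row))
print(protein_lengths(example_row))
```

A check like this can flag rows where a gene is absent before the sequences are passed to a tokenizer or model.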

Creation

Rationale

The dataset was assembled to train an HIV‑BERT model aimed at predicting various HIV‑related features.

Collection and Normalization

The dataset was downloaded and curated on 2021‑12‑21.

Considerations

Societal Impact

The dataset can facilitate research on HIV sequence‑dependent characteristics; HIV has claimed millions of lives worldwide over recent decades.

Discussion of Bias

The dataset originates from the LANL full‑genome database and includes representative samples from each subtype and geographic region.

Additional Information

Curator: Will Dampier
Citation: To be determined


Topics

Bioinformatics

Viral Genomics

Source

Organization: hugging_face

Created: Unknown
