
Gharaee/BIOSCAN_1M_Insect_Dataset

The BIOSCAN_1M insect dataset provides information about insects. Each record includes four primary attributes: a DNA barcode sequence, a Barcode Index Number (BIN), a taxonomic rank annotation, and an RGB image. The DNA barcode sequence gives the nucleotide arrangement. The BIN serves as an alternative to Linnaean names and provides a gene‑centered taxonomy. The taxonomic rank annotation classifies organisms hierarchically based on evolutionary relationships. The RGB images are raw images drawn from the 16 most densely sampled insect orders. The dataset is also strongly class-imbalanced, which is an inherent characteristic of insect communities.

Source
hugging_face
Created
Nov 28, 2025
Updated
Jun 20, 2024
Overview

Dataset description and usage context

BIOSCAN_1M Insect Dataset

Dataset Overview

The BIOSCAN‑1M Insect Dataset provides information about insects. Each record contains the following four main attributes:

  1. DNA Barcode Sequence
  2. Barcode Index Number (BIN)
  3. Taxonomic Rank Annotation
  4. RGB Image

I. DNA Barcode Sequence

The DNA barcode sequence gives the arrangement of nucleotides, with each base assigned a display color:

  • Adenine (A): Red
  • Thymine (T): Blue
  • Cytosine (C): Green
  • Guanine (G): Yellow

Example sequence:

TTTATATTTTATTTTTGGAGCATGATCAGGAATAGTTGGAACTTCAATAAGTTTATTAATTCGAACAGAATTAAGCCAACCAGGAATTTTTA …
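
As a rough illustration of working with a barcode string, the sketch below computes the base composition of the visible part of the example sequence. The constant `BARCODE_PREFIX` and the function `base_composition` are hypothetical names introduced here, not part of the dataset. The example sequence above is truncated, so the numbers describe only the shown prefix, and the code assumes the sequence contains nothing but A, C, G, and T.

```python
from collections import Counter

# Visible prefix of the example sequence above. The original is truncated
# with an ellipsis, so this is NOT the full barcode.
BARCODE_PREFIX = (
    "TTTATATTTTATTTTTGGAGCATGATCAGGAATAGTTGGAACTTCAATAAGTTTATTAATTCGAACAGAATTAAGCCAACCAGGAATTTTTA"
)


def base_composition(seq: str) -> dict[str, float]:
    """Return the fraction of each base (A, C, G, T) in seq.

    Assumption: only the four standard bases are allowed; any other
    symbol (for example N or a gap character) raises ValueError.
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    unexpected = set(seq) - set("ACGT")
    if unexpected:
        raise ValueError(f"unexpected symbols: {sorted(unexpected)}")
    counts = Counter(seq)
    return {base: counts[base] / len(seq) for base in "ACGT"}


composition = base_composition(BARCODE_PREFIX)
gc_content = composition["G"] + composition["C"]
print(composition)
print(f"GC content of the visible prefix: {gc_content:.3f}")
```

Rejecting unexpected symbols is a deliberate choice for this sketch; a pipeline that has to accept ambiguous bases would need a different policy.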

II. Barcode Index Number (BIN)

The BIN serves as an alternative to Linnaean names, offering a gene‑centered classification.

Example BIN:

BOLD:AER5166
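
A BIN string like the one above can be validated before it is used as a class label. The following is a minimal sketch that assumes BIN identifiers follow the shape of the example, namely the prefix `BOLD:` followed by three uppercase letters and four digits. That pattern is an assumption inferred from the example, and `BIN_PATTERN` and `parse_bin` are hypothetical names.

```python
import re

# Assumed format, inferred from the example "BOLD:AER5166":
# "BOLD:" + three uppercase letters + four digits.
BIN_PATTERN = re.compile(r"BOLD:([A-Z]{3})(\d{4})")


def parse_bin(bin_id: str) -> tuple[str, int]:
    """Split a BIN identifier into its letter part and numeric part."""
    match = BIN_PATTERN.fullmatch(bin_id.strip())
    if match is None:
        raise ValueError(f"not a BIN identifier in the assumed format: {bin_id!r}")
    letters, digits = match.groups()
    return letters, int(digits)


print(parse_bin("BOLD:AER5166"))  # ('AER', 5166)
```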

III. Taxonomic Rank Annotation

Annotations are organized hierarchically based on evolutionary relationships, grouping species that share common features and genetic similarity.

IV. RGB Image

The example images come from the 16 most densely sampled orders in the BIOSCAN‑1M Insect Dataset. The number given for each order is the count of images in that class, which makes the class imbalance within the dataset clear.

  • Diptera: 896,234
  • Hymenoptera: 89,311
  • Coleoptera: 47,328
  • Hemiptera: 46,970
  • Lepidoptera: 32,538
  • Psocodea: 9,635
  • Thysanoptera: 2,088
  • Trichoptera: 1,296
  • Orthoptera: 1,057
  • Blattodea: 824
  • Neuroptera: 676
  • Ephemeroptera: 96
  • Dermaptera: 66
  • Archaeognatha: 63
  • Plecoptera: 30
  • Embioptera: 6
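
To put the per-order counts above into numbers, the sketch below computes each order's share of the listed images, the ratio between the largest and smallest class, and inverse-frequency class weights. The weighting scheme is one common way to compensate for imbalance during training and is offered here as an illustration, not as the dataset authors' method. `ORDER_COUNTS` and `inverse_frequency_weights` are hypothetical names.

```python
# Image counts per order, copied from the figures listed above.
ORDER_COUNTS = {
    "Diptera": 896_234, "Hymenoptera": 89_311, "Coleoptera": 47_328,
    "Hemiptera": 46_970, "Lepidoptera": 32_538, "Psocodea": 9_635,
    "Thysanoptera": 2_088, "Trichoptera": 1_296, "Orthoptera": 1_057,
    "Blattodea": 824, "Neuroptera": 676, "Ephemeroptera": 96,
    "Dermaptera": 66, "Archaeognatha": 63, "Plecoptera": 30,
    "Embioptera": 6,
}


def inverse_frequency_weights(counts: dict[str, int]) -> dict[str, float]:
    """Weight each class by total / (num_classes * class_count).

    Rare classes get large weights and frequent classes small ones;
    a perfectly balanced dataset would give every class weight 1.0.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * n) for name, n in counts.items()}


total_images = sum(ORDER_COUNTS.values())
shares = {name: n / total_images for name, n in ORDER_COUNTS.items()}
imbalance_ratio = max(ORDER_COUNTS.values()) / min(ORDER_COUNTS.values())
weights = inverse_frequency_weights(ORDER_COUNTS)

print(f"total images in the 16 orders: {total_images}")
print(f"Diptera share: {shares['Diptera']:.1%}")
print(f"largest / smallest class: {imbalance_ratio:.0f}")
print(f"weight for Embioptera: {weights['Embioptera']:.1f}")
```

With a ratio this large between the most and least frequent orders, raw inverse-frequency weights can become very large for the rarest classes, so in practice they are often clipped or smoothed.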

Class Distribution

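The class distribution is heavily imbalanced, reflecting an inherent characteristic of insect communities. Among the 16 orders listed above, Diptera alone accounts for roughly 79% of the images, while the rarest order, Embioptera, has only 6.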
