MikeTrizna/bees

The USNM Bumblebee dataset is a natural‑history collection containing single‑view images and occurrence data for 73,497 bumblebee specimens (family Apidae). The data conform to the Darwin Core standard, including taxonomy, collection date, location, and other metadata; most specimen locations are georeferenced. The dataset is global in scope but limited to specimens held by the Smithsonian Institution’s USNM collection. Image metadata follow the Audiovisual Core standard. Collection and digitization involved specimen gathering, imaging, data transcription, and quality control. The dataset can be used for evolutionary biology, ecology, climate change studies, and related research fields.

Source: Hugging Face
Created: Nov 28, 2025
Updated: Sep 22, 2023

Dataset Card – Bee Dataset

Dataset Overview

The USNM Bumblebee dataset is a natural-history collection from the Smithsonian National Museum of Natural History. For each of 73,497 bumblebee specimens in the family Apidae, it provides a single image in lateral (side) or dorsal view, together with a tab-separated values file of occurrence data. The occurrence data include taxonomic classification, collection date, location information, and other metadata compliant with the Darwin Core standard (https://dwc.tdwg.org). A total of 11,421 specimens are not identified to species and are listed as Bombus sp. or Xylocopa sp. Most specimens (55,301) have georeferenced locations. The dataset is global in geographic scope but limited to specimens housed in the Smithsonian USNM collection.
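The counts quoted above can be turned into proportions with a few lines of Python. This is only an arithmetic sketch over the figures stated in the overview; the constant and function names are illustrative and not part of the dataset.

```python
# Counts quoted in the dataset overview.
TOTAL_SPECIMENS = 73_497
NOT_IDENTIFIED_TO_SPECIES = 11_421  # listed as "Bombus sp." or "Xylocopa sp."
GEOREFERENCED = 55_301


def share(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# Specimens that do carry a species-level identification.
identified_to_species = TOTAL_SPECIMENS - NOT_IDENTIFIED_TO_SPECIES  # 62,076

print(f"identified to species: {identified_to_species:,} "
      f"({share(identified_to_species, TOTAL_SPECIMENS):.1f}%)")
print(f"georeferenced: {GEOREFERENCED:,} "
      f"({share(GEOREFERENCED, TOTAL_SPECIMENS):.1f}%)")
```

Roughly three quarters of the specimens are georeferenced, which matters for any analysis that depends on coordinates.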

Language

English

Data Example

A typical data point includes specimen metadata and image information.

An example from the dataset:

{
  "occurrenceID": "http://n2t.net/ark:/65665/30042e2d8-669d-4520-b456-e3c64203eff8",
  "catalogNumber": "USNMENT01732649",
  "recordedBy": "R. Craig",
  "year": "1949",
  "month": "4",
  "day": "13",
  "country": "United States",
  "stateProvince": "California",
  "county": "Fresno",
  "locality": "Auberry",
  "decimalLatitude": "37.0808",
  "decimalLongitude": "-119.485",
  "identifiedBy": "OBrien, L. R.",
  "scientificName": "Xylocopa (Notoxylocopa) tabaniformis orpifex",
  "genus": "Xylocopa",
  "subgenus": "Notoxylocopa",
  "specificEpithet": "tabaniformis",
  "infraspecificEpithet": "orpifex",
  "scientificNameAuthorship": "Smith",
  "accessURI": "https://ids.si.edu/ids/deliveryService?id=NMNH-USNMENT01732649",
  "PixelXDimension": 2000,
  "PixelYDimension": 1212
}
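Note that in the example record the date parts and coordinates are strings, while the pixel dimensions are integers. A minimal sketch of converting those fields into typed values is shown here. The helper names are hypothetical, and the handling of missing or empty values is an assumption, since the card does not say how absent data are encoded.

```python
from datetime import date
from typing import Optional, Tuple

# Subset of the example record shown above.
record = {
    "catalogNumber": "USNMENT01732649",
    "year": "1949",
    "month": "4",
    "day": "13",
    "decimalLatitude": "37.0808",
    "decimalLongitude": "-119.485",
    "genus": "Xylocopa",
    "specificEpithet": "tabaniformis",
    "infraspecificEpithet": "orpifex",
}


def collection_date(rec: dict) -> Optional[date]:
    """Build a date from year/month/day; None if any part is missing or invalid."""
    try:
        return date(int(rec["year"]), int(rec["month"]), int(rec["day"]))
    except (KeyError, TypeError, ValueError):
        return None


def coordinates(rec: dict) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) as floats; None if not georeferenced."""
    try:
        return float(rec["decimalLatitude"]), float(rec["decimalLongitude"])
    except (KeyError, TypeError, ValueError):
        return None


def taxon_name(rec: dict) -> str:
    """Join genus, species, and subspecies epithets, skipping empty parts."""
    parts = (rec.get("genus"), rec.get("specificEpithet"), rec.get("infraspecificEpithet"))
    return " ".join(p for p in parts if p)


print(collection_date(record), coordinates(record), taxon_name(record))
```

Returning None instead of raising keeps the helpers usable on records that lack a full date or coordinates, such as the specimens that are not georeferenced.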

Data Fields

Specimen metadata fields follow the Darwin Core standard; see https://dwc.tdwg.org for details. Image metadata fields follow the Audiovisual Core standard; see https://ac.tdwg.org/.

Dataset Size

  • Training set: 73,387 samples, 3,672,202,733.82 bytes
  • Download size: 3,659,907,058 bytes
  • Total dataset size: 3,672,202,733.82 bytes
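As a quick sanity check, the byte counts above can be converted to more readable units and to an average size per sample. The sketch uses only the figures listed; the names are illustrative.

```python
# Figures from the "Dataset Size" list.
TRAIN_SAMPLES = 73_387
DATASET_BYTES = 3_672_202_733.82
DOWNLOAD_BYTES = 3_659_907_058


def to_gib(n_bytes: float) -> float:
    """Convert a byte count to gibibytes (1 GiB = 2**30 bytes)."""
    return n_bytes / 2**30


avg_bytes_per_sample = DATASET_BYTES / TRAIN_SAMPLES

print(f"dataset:  {to_gib(DATASET_BYTES):.2f} GiB")
print(f"download: {to_gib(DOWNLOAD_BYTES):.2f} GiB")
print(f"average per sample: {avg_bytes_per_sample / 1000:.1f} kB")
```

The average works out to about 50 kB per sample, image and metadata combined.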

Configuration

  • Configuration name: default
  • Data files:
    • Split: train
    • Path: data/train-*
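Given this configuration, one plausible way to read the data is through the Hugging Face `datasets` library. The sketch below rests on several assumptions: the library is installed, the repository ID is `MikeTrizna/bees` (taken from the page title), the split is named `train` (inferred from the `data/train-*` path), and the hosted records expose the same field names as the example above. Streaming is used so the full download of roughly 3.66 GB is not needed just to inspect one record.

```python
REPO_ID = "MikeTrizna/bees"  # assumed from the page title
SPLIT = "train"              # assumed from the data/train-* path


def first_record(repo_id: str = REPO_ID, split: str = SPLIT) -> dict:
    """Stream the dataset and return its first record without a full download."""
    # Imported here so the module can be loaded even if `datasets` is absent.
    from datasets import load_dataset

    stream = load_dataset(repo_id, split=split, streaming=True)
    return next(iter(stream))


# Example usage (requires network access and the `datasets` package):
# rec = first_record()
# print(sorted(rec.keys()))
```

Dropping `streaming=True` would download and cache the whole split locally, which may be preferable for repeated training runs.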

Dataset Curators

Smithsonian National Museum of Natural History, Department of Entomology. Jessica Bird (Entomology Data Manager) is the primary contact.

License

Released into the public domain under Creative Commons CC0.

Citation

Orrell T, Informatics Office (2023). NMNH Extant Specimen Records (USNM, US). Version 1.72. National Museum of Natural History, Smithsonian Institution. Occurrence dataset. https://collections.nmnh.si.edu/ipt/resource?r=nmnh_extant_dwc-a&v=1.72
