
Wild Bee Dataset

The Wild Bee Dataset was created by Berlin University of Applied Sciences and contains approximately 30,000 images of wild bees sourced from the iNaturalist database. It is primarily intended to support research on insect monitoring and species classification. The dataset covers 25 common German wild bee species; four visually similar species were merged into a single class. A subset of the images was manually annotated with segmentation masks for body parts. The goal is to help biologists annotate rare species with deep‑learning techniques and thereby improve the understanding and protection of biodiversity.

Source

arXiv

Created

Jun 15, 2022

Updated

Jun 15, 2022

Dataset Introduction

The dataset is intended to support the development of automatic insect‑monitoring systems that identify insect species without capturing or killing the insects. Building a high‑quality insect‑image dataset is challenging because insect species are highly diverse and many of them are rare. The dataset was built by downloading insect images from iNaturalist with the script webscraper_inat.py and annotating them manually.

Data Acquisition

Images were downloaded from iNaturalist using the script webscraper_inat.py. Users must specify the target folder, the maximum number of images, and the species index used in iNaturalist URLs. For example, the index for Anthidium manicatum can be obtained by searching for the species name on iNaturalist and copying the number at the end of the resulting URL.
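The step of copying the number at the end of the URL can be sketched as follows. This is a minimal illustration, not part of webscraper_inat.py: the function name trailing_number is hypothetical, and the example URL and the number 12345 are placeholders rather than the real index of any species.

```python
import re


def trailing_number(url: str) -> int:
    """Return the run of digits at the very end of a URL."""
    match = re.search(r"(\d+)/?$", url.strip())
    if match is None:
        raise ValueError(f"no trailing number in URL: {url!r}")
    return int(match.group(1))


# Placeholder URL: the number 12345 is made up for illustration only.
example_url = "https://www.inaturalist.org/observations?taxon_id=12345"
species_index = trailing_number(example_url)
```

The resulting integer is the value that would be passed to the download script as the species index.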

Data Annotation

From the downloaded images, about 30 samples per species were selected to form the mini dataset and annotated further in Label Studio. The final mini dataset contains 726 images covering 25 bee species. The annotations include segmentation masks for major body parts such as head, thorax, and abdomen.

Data Pre‑processing

The scripts create_metafiles_mini.py and create_metafiles_all.py generate CUB200‑style metadata files from the Label Studio JSON exports. These files list the class names and image files and map each image to its class label and to its body parts and their locations.
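To make the CUB200-style layout concrete, here is a minimal sketch of how such metadata lines could be assembled. It does not reproduce create_metafiles_mini.py or create_metafiles_all.py. The input record shape, the function name build_cub_metadata, the file paths, and the second species name are all assumptions for illustration; a real Label Studio export would first have to be parsed into this simplified form.

```python
from typing import Dict, List, Tuple

# Hypothetical, simplified record: (image file, species name, {part: (x, y)}).
Record = Tuple[str, str, Dict[str, Tuple[float, float]]]


def build_cub_metadata(records: List[Record]) -> Dict[str, List[str]]:
    """Build the lines of CUB-200-2011-style metadata files (ids are 1-based)."""
    def clean(name: str) -> str:
        # Files are space-delimited, so names must not contain spaces.
        return name.replace(" ", "_")

    class_names = sorted({clean(species) for _, species, _ in records})
    part_names = sorted({part for _, _, parts in records for part in parts})
    class_id = {name: i for i, name in enumerate(class_names, start=1)}
    part_id = {name: i for i, name in enumerate(part_names, start=1)}

    files: Dict[str, List[str]] = {
        "classes.txt": [f"{i} {name}" for name, i in class_id.items()],
        "parts.txt": [f"{i} {name}" for name, i in part_id.items()],
        "images.txt": [],
        "image_class_labels.txt": [],
        "part_locs.txt": [],
    }
    for image_id, (path, species, parts) in enumerate(records, start=1):
        files["images.txt"].append(f"{image_id} {path}")
        files["image_class_labels.txt"].append(f"{image_id} {class_id[clean(species)]}")
        for name in part_names:
            # Columns: image_id part_id x y visible (0 marks a missing part).
            x, y = parts.get(name, (0.0, 0.0))
            visible = 1 if name in parts else 0
            files["part_locs.txt"].append(
                f"{image_id} {part_id[name]} {x:.1f} {y:.1f} {visible}"
            )
    return files


# Made-up example records; paths and coordinates are placeholders.
records: List[Record] = [
    ("Anthidium_manicatum/img_001.jpg", "Anthidium manicatum",
     {"head": (120.0, 80.0), "thorax": (150.0, 95.0)}),
    ("Placeholder_species/img_002.jpg", "Placeholder species",
     {"head": (60.0, 40.0), "abdomen": (110.0, 70.0)}),
]
metadata = build_cub_metadata(records)
```

Each dictionary entry holds the lines of one metadata file, so writing them to disk is a matter of joining each list with newlines.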

Training and Validation

A pretrained ResNet50 model was fine‑tuned and cross‑validated on the full dataset, with the mini dataset serving as the test set. The reported test accuracies were 0.78 (top‑1) and 0.95 (top‑3), which is competitive with state‑of‑the‑art fine‑grained classification models.
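For readers unfamiliar with the two metrics, the sketch below shows how top-1 and top-3 accuracy are computed from per-class scores. The scores and labels are toy values invented for illustration and are unrelated to the reported results; the function name top_k_accuracy is likewise an assumption.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels) or len(labels) == 0:
        raise ValueError("scores and labels must be non-empty and equally long")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy scores for 4 samples over 4 classes (made-up numbers).
scores = [
    [0.70, 0.10, 0.15, 0.05],  # true class 0 is ranked first
    [0.20, 0.50, 0.25, 0.05],  # true class 2 is ranked second
    [0.40, 0.30, 0.20, 0.10],  # true class 3 is ranked last
    [0.10, 0.20, 0.60, 0.10],  # true class 2 is ranked first
]
labels = [0, 2, 3, 2]
top1 = top_k_accuracy(scores, labels, k=1)
top3 = top_k_accuracy(scores, labels, k=3)
```

Top-3 accuracy can never be lower than top-1 accuracy, which is why it is often reported alongside it for fine-grained tasks with visually similar classes.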

Preliminary XAI Experiments

In initial experiments without human involvement, several XAI methods (e.g., saliency maps) were applied to assess how interpretable the model is. The segmentation masks served as a reference for the explanations, and fidelity was evaluated via pixel‑flipping and Monte Carlo dropout.
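The idea behind pixel-flipping can be sketched as follows: pixels are replaced by a baseline value in order of decreasing relevance, and the model score is recorded after each step. A faithful explanation should make the score drop quickly. This is a generic toy illustration, not the setup used in the experiments; the linear toy_model, the flattened 4-pixel image, and the baseline value of 0.0 are all assumptions.

```python
from typing import Callable, List


def pixel_flipping_curve(
    pixels: List[float],
    relevance: List[float],
    model: Callable[[List[float]], float],
    baseline: float = 0.0,
) -> List[float]:
    """Flip pixels from most to least relevant, recording the score after each flip."""
    order = sorted(range(len(pixels)), key=lambda i: relevance[i], reverse=True)
    perturbed = list(pixels)
    curve = [model(perturbed)]  # score of the unperturbed input
    for i in order:
        perturbed[i] = baseline
        curve.append(model(perturbed))
    return curve


# Toy stand-in for a classifier: a fixed linear score over 4 pixels.
weights = [0.5, 2.0, 0.0, 1.0]


def toy_model(x: List[float]) -> float:
    return sum(w * v for w, v in zip(weights, x))


image = [1.0, 1.0, 1.0, 1.0]
# A relevance map that matches the model, and one that ranks pixels in reverse.
faithful = pixel_flipping_curve(image, relevance=weights, model=toy_model)
unfaithful = pixel_flipping_curve(image, relevance=[-w for w in weights], model=toy_model)
```

Comparing the two curves, for example by their sums, shows that the relevance map matching the model yields the steeper decline.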

Concept‑Based Prototype Nearest Neighbor (CoProNN)

A new concept‑based post‑hoc XAI method was developed. It uses text‑to‑image models (e.g., Stable Diffusion) to generate images of high‑level concepts, which are then used with k‑NN to explain model predictions. User studies showed that the method helped users classify bees more accurately and spot erroneous model predictions more easily.
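The k-NN step can be illustrated with a small sketch: given a feature vector for a query image and feature vectors for concept images, the concepts of the k nearest neighbors are counted and reported as an explanation. This is only a rough illustration under stated assumptions, not the CoProNN implementation. The concept names, the 2-dimensional features, the Euclidean distance, and the function name concept_votes are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def concept_votes(
    query: Sequence[float],
    prototypes: List[Tuple[str, Sequence[float]]],
    k: int,
) -> Dict[str, float]:
    """Share of each concept among the k prototypes nearest to the query."""
    nearest = sorted(prototypes, key=lambda p: math.dist(query, p[1]))[:k]
    counts = Counter(name for name, _ in nearest)
    return {name: n / len(nearest) for name, n in counts.items()}


# Made-up features of generated concept images (hypothetical concepts).
prototypes = [
    ("striped abdomen", [1.0, 0.0]),
    ("striped abdomen", [0.9, 0.1]),
    ("fuzzy thorax", [0.0, 1.0]),
    ("fuzzy thorax", [0.1, 0.9]),
    ("fuzzy thorax", [0.6, 0.4]),
]
query_features = [0.8, 0.2]  # made-up features of the image being explained
votes = concept_votes(query_features, prototypes, k=3)
```

Here two of the three nearest prototypes belong to the first concept, so it would be presented as the dominant concept behind the prediction.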
