Dataset asset, Open Source Community, Image Dataset, Mineral Recognition

MinDat-Mineral-Image-Dataset

A dataset of over 500,000 labeled mineral images sourced from mindat.org. It is distributed as two CSV files that store the image URLs together with the original and cleaned labels.

Source: GitHub
Created: Jun 25, 2017
Updated: Sep 22, 2023

MinDat-Mineral-Image-Dataset Overview

Basic Dataset Information

  • Dataset Name: MinDat-Mineral-Image-Dataset
  • Volume: Over 500,000 mineral images
  • Format: Contains two CSV files
    • img_url_list.csv: Contains image URLs and their original labels
    • img_url_list_converted.csv: Contains image URLs with cleaned labels; entries for unlabeled images have been removed
  • Source: Scraped from mindat.org
  • Processing Time:
    • CSV file generation takes ~10 hours
    • Image download takes ~24 hours (assuming network speed >10 Mbps)
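
As a rough sanity check on these timing figures, the sketch below converts a data size and a sustained bandwidth into a download time. The 400 GB input is the total size quoted under Dataset Characteristics; the function names and the decimal unit convention (1 GB = 10^9 bytes, 1 Mbps = 10^6 bits per second) are assumptions made here, and real downloads are usually slower because of per-request latency and server-side limits.

```python
def download_hours(total_bytes: float, mbps: float) -> float:
    """Hours needed to transfer total_bytes at a sustained rate of mbps."""
    total_bits = total_bytes * 8
    seconds = total_bits / (mbps * 1_000_000)
    return seconds / 3600


def required_mbps(total_bytes: float, hours: float) -> float:
    """Sustained rate in Mbps needed to transfer total_bytes within hours."""
    total_bits = total_bytes * 8
    return total_bits / (hours * 3600) / 1_000_000


if __name__ == "__main__":
    size = 400e9  # ~400 GB, decimal gigabytes (assumption)
    print(f"{download_hours(size, 10):.1f} hours at 10 Mbps")
    print(f"{required_mbps(size, 24):.1f} Mbps needed for a 24 hour download")
```

Under these assumptions, 400 GB at exactly 10 Mbps would take about 89 hours, and finishing within 24 hours needs roughly 37 Mbps sustained, so the ~24 hour estimate is best read as applying to connections well above the 10 Mbps floor.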

Dataset Generation Process

  1. Run make_url_list.py to fetch all image URLs and save them to the img_urls directory.
  2. Run the concat_url_files script to merge URL files into img_url_list.csv.
  3. Run convert_img_url_list.py to clean labels and generate img_url_list_converted.csv.
  4. Run download_images.py to download all images to the specified directory.
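
The final download step can be sketched as follows. This is not the repository's download_images.py, whose options and output layout are not described here; it is a minimal illustration that assumes the CSV rows have already been parsed into (url, label) pairs. The names download_images, fetch_url, and safe_name, the one-folder-per-label layout, and the index-prefixed file names are all hypothetical.

```python
import os
import re
import urllib.request
from typing import Callable, Iterable, Tuple
from urllib.parse import urlparse


def safe_name(text: str) -> str:
    """Reduce a label or file name to characters that are safe in a path."""
    return re.sub(r"[^A-Za-z0-9._-]+", "_", text).strip("_")


def fetch_url(url: str, timeout: float = 30.0) -> bytes:
    """Fetch the raw bytes behind a URL using the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_images(
    rows: Iterable[Tuple[str, str]],
    out_dir: str,
    fetch: Callable[[str], bytes] = fetch_url,
) -> Tuple[int, int, int]:
    """Save each (url, label) row under out_dir/<label>/.

    Returns (downloaded, skipped, failed). Files that already exist are
    skipped, so an interrupted run can be restarted without refetching.
    """
    downloaded = skipped = failed = 0
    for index, (url, label) in enumerate(rows):
        label_dir = os.path.join(out_dir, safe_name(label) or "unlabeled")
        os.makedirs(label_dir, exist_ok=True)
        base = safe_name(os.path.basename(urlparse(url).path)) or "image"
        path = os.path.join(label_dir, f"{index:07d}_{base}")
        if os.path.exists(path):
            skipped += 1
            continue
        try:
            data = fetch(url)
        except OSError:  # urllib's URLError and HTTPError derive from OSError
            failed += 1
            continue
        # Write to a temporary name first so a crash never leaves a
        # truncated file that a later run would mistake for a finished one.
        tmp_path = path + ".part"
        with open(tmp_path, "wb") as handle:
            handle.write(data)
        os.replace(tmp_path, path)
        downloaded += 1
    return downloaded, skipped, failed
```

Passing the fetch function as a parameter keeps the loop testable without network access. With roughly 400 GB to transfer over about a day, the ability to resume after an interruption is likely to matter more than raw speed.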

Dataset Characteristics

  • Some images have extremely high resolution; the full image set totals around 400 GB.
  • During label cleaning, variant labels such as “Capped Quartz, Chalcedony Quartz” were simplified to “Quartz”.
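
One way such a simplification could work is sketched below. The actual rules in convert_img_url_list.py are not described here, so this is only an illustration: it assumes a label is a comma-separated string of variants and that a vocabulary of base mineral names is available. BASE_MINERALS and clean_label are hypothetical names, and the three-entry vocabulary is a placeholder.

```python
import re
from typing import Optional, Sequence

# Hypothetical vocabulary of base mineral names; a real list would be far longer.
BASE_MINERALS = ("Quartz", "Calcite", "Pyrite")


def clean_label(raw: str, base_minerals: Sequence[str] = BASE_MINERALS) -> Optional[str]:
    """Collapse a raw label to a single base mineral name.

    Returns the base name when every recognized variant in the label points
    to the same mineral, and None when the label is empty, unrecognized, or
    names more than one mineral.
    """
    found = []
    for part in raw.split(","):
        # Compare whole words so that "Quartzite" does not match "Quartz".
        words = set(re.findall(r"[a-z]+", part.lower()))
        for base in base_minerals:
            if base.lower() in words and base not in found:
                found.append(base)
    return found[0] if len(found) == 1 else None


if __name__ == "__main__":
    print(clean_label("Capped Quartz, Chalcedony Quartz"))
```

Returning None for ambiguous or unrecognized labels lets the caller drop those rows, which matches the idea of removing images that lack a usable label, although whether the original script treats ambiguous labels this way is an assumption.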
