Dataset asset, Open Source Community, Image Dataset, Mineral Recognition
MinDat-Mineral-Image-Dataset
A dataset containing over 500,000 mineral images, each labeled, sourced from mindat.org. The dataset includes two CSV files that store image URLs and cleaned label information.
Source: GitHub
Created: Jun 25, 2017
Updated: Sep 22, 2023
MinDat-Mineral-Image-Dataset Overview
Basic Dataset Information
- Dataset Name: MinDat-Mineral-Image-Dataset
- Volume: Over 500,000 mineral images
- Format: Contains two CSV files
  - img_url_list.csv: Contains image URLs and their original labels
  - img_url_list_converted.csv: Contains image URLs and cleaned labels, with entries for unlabeled images removed
- Source: Scraped from mindat.org
- Processing Time:
  - CSV file generation takes ~10 hours
  - Image download takes ~24 hours (assuming network speed >10 Mbps)
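The download estimate can be sanity-checked with simple arithmetic. The sketch below is not part of the dataset's scripts; the function names are hypothetical, and it assumes the roughly 400 GB total size listed under Dataset Characteristics, decimal units (1 GB = 1000 MB), and a sustained transfer rate with no per-request overhead.

```python
def estimate_download_hours(total_gb: float, mbps: float) -> float:
    """Hours needed to transfer total_gb (decimal gigabytes) at a sustained rate of mbps."""
    if mbps <= 0:
        raise ValueError("mbps must be positive")
    total_megabits = total_gb * 1000 * 8  # 1 GB = 1000 MB, 1 byte = 8 bits
    return total_megabits / mbps / 3600   # seconds -> hours


def required_mbps(total_gb: float, hours: float) -> float:
    """Sustained throughput in Mbps needed to transfer total_gb within the given hours."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return total_gb * 1000 * 8 / (hours * 3600)


if __name__ == "__main__":
    # 400 GB is the approximate total size quoted in this document.
    print(f"At 10 Mbps: {estimate_download_hours(400, 10):.1f} hours")
    print(f"For 24 hours: {required_mbps(400, 24):.1f} Mbps")
```

Under these assumptions, finishing in about 24 hours corresponds to a sustained rate of roughly 37 Mbps, while a connection running at exactly 10 Mbps would need closer to 89 hours. The ~24 hour figure therefore fits connections well above the stated 10 Mbps minimum.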
Dataset Generation Process
- Run make_url_list.py to fetch all image URLs and save them to the img_urls directory.
- Run the concat_url_files script to merge URL files into img_url_list.csv.
- Run convert_img_url_list.py to clean labels and generate img_url_list_converted.csv.
- Run download_images.py to download all images to the specified directory.
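The final download step can be sketched roughly as follows. This is not the repository's download_images.py. It is a minimal illustration that assumes the converted CSV has a header row with columns named label and url (the real column layout may differ), and the helper names read_url_rows, fetch_bytes, and download_all are hypothetical. Files that already exist are skipped, so an interrupted run can be resumed.

```python
import csv
import os
from typing import Callable, Iterator, Tuple
from urllib.request import urlopen


def read_url_rows(csv_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (label, url) pairs. ASSUMPTION: header row with 'label' and 'url' columns."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield row["label"], row["url"]


def fetch_bytes(url: str) -> bytes:
    """Fetch the raw bytes of one image over HTTP."""
    with urlopen(url, timeout=30) as resp:
        return resp.read()


def download_all(csv_path: str, out_dir: str,
                 fetch: Callable[[str], bytes] = fetch_bytes) -> int:
    """Save each image to out_dir/<label>/<filename>; return the number of new files."""
    saved = 0
    for label, url in read_url_rows(csv_path):
        label_dir = os.path.join(out_dir, label.replace("/", "_"))
        os.makedirs(label_dir, exist_ok=True)
        # Drop any query string before taking the file name from the URL.
        target = os.path.join(label_dir, os.path.basename(url.split("?")[0]))
        if os.path.exists(target):
            continue  # already downloaded, so reruns resume where they stopped
        try:
            data = fetch(url)
        except OSError:
            continue  # network errors (URLError, timeouts) are skipped in this sketch
        with open(target, "wb") as f:
            f.write(data)
        saved += 1
    return saved
```

Passing the fetch function as a parameter keeps the loop testable without network access. A real run would likely also want retries and rate limiting out of courtesy to the source site.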
Dataset Characteristics
- Some images have extremely high resolution; the full image set totals around 400 GB.
- During label cleaning, variant labels such as “Capped Quartz, Chalcedony Quartz” were simplified to “Quartz”.
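The label simplification described above might look something like the sketch below. The actual rule used by convert_img_url_list.py is not documented here, so this rests on an assumption: a variant label ends with its base mineral name, and a list of known base names is available. The function name simplify_label and the base_names list are hypothetical.

```python
from typing import Iterable, Optional


def simplify_label(label: str, base_names: Iterable[str]) -> Optional[str]:
    """Reduce variant labels to base mineral names; return None for an empty label.

    ASSUMPTION: a variant such as "Capped Quartz" ends with its base name.
    """
    # Check longer base names first so multi-word names win over shorter suffixes.
    bases = sorted(base_names, key=len, reverse=True)
    simplified = []
    for part in label.split(","):
        cleaned = " ".join(part.split())  # trim and collapse whitespace
        if not cleaned:
            continue
        lowered = cleaned.lower()
        match = cleaned  # unknown labels pass through unchanged
        for base in bases:
            if lowered == base.lower() or lowered.endswith(" " + base.lower()):
                match = base
                break
        if match not in simplified:
            simplified.append(match)
    return ", ".join(simplified) if simplified else None


if __name__ == "__main__":
    known = ["Quartz", "Calcite"]  # illustrative base names only
    print(simplify_label("Capped Quartz, Chalcedony Quartz", known))
```

Returning None for empty labels lets a caller drop unlabeled rows, mirroring the removal of unlabeled images in img_url_list_converted.csv.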