Dataset asset, Open Source Community, Water Quality Monitoring, Plastic Pollution
Microplastics in Drinking Water
This dataset records microplastic presence in drinking water. Each row represents a water‑sample record, containing microplastic material and type, color, water source type (tap or bottled), and the sampling location's latitude and longitude. The dataset focuses on polyethylene (PE) material for predicting PE levels across different geographic locations.
Source
github
Created
Feb 22, 2024
Updated
Feb 24, 2024
Signals
230 views
Availability
Linked source ready
Dataset Overview
Dataset Name
- The dataset is named “Microplastics in Drinking Water,” with the specific file called “Microplastics Sample Data (wide).”
Dataset Source
- Released by the California State Water Resources Control Board, accessible via: Microplastics in Drinking Water.
Dataset Content
- Each row represents a water‑sample record with associated information.
- Key columns include microplastic material and type (content per sample), color, tap vs. bottled water, sampling location, and approximate coordinates.
- The project focuses on PE (polyethylene); other “material” columns are removed.
Data Processing
- The original dataset had over 100 columns; columns with fewer than 40 non-missing values were removed.
- Additional cleaning removed unnecessary columns such as Sample_ID and handled all NAN or Present values.
- Samples from Chinese reservoirs with extreme values were excluded.
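The column-level cleaning steps above can be sketched with pandas. This is a minimal illustration, not the project's actual code: the function name clean_samples is hypothetical, and treating the NAN and Present placeholders as missing values is an assumption, since the description does not say how they were handled. The geographic exclusion step is omitted because its criteria are not given.

```python
import numpy as np
import pandas as pd


def clean_samples(df: pd.DataFrame, min_values: int = 40,
                  drop_cols: tuple = ("Sample_ID",)) -> pd.DataFrame:
    """Hypothetical helper mirroring the described cleaning steps."""
    # Assumption: the textual placeholders carry no numeric information,
    # so they are converted to missing values.
    cleaned = df.replace(["NAN", "Present"], np.nan)
    # Drop identifier columns that are not useful as features.
    cleaned = cleaned.drop(columns=[c for c in drop_cols if c in cleaned.columns])
    # Keep only columns with at least `min_values` non-missing entries.
    return cleaned.dropna(axis=1, thresh=min_values)
```

Calling clean_samples(raw_df) on the wide sample file would apply the 40-value threshold by default; a smaller min_values is convenient when trying the function on a toy table.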
Dataset Usage
- Random Forest, k‑NN regression, and Decision Tree regression models were employed for prediction.
- Model evaluation indicated Decision Tree regression performed best, though its predictive power is limited by sample size and data quality.
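A comparison of the three model families could look like the following scikit-learn sketch. Several things here are assumptions: the function name compare_models is hypothetical, the hyperparameters are library defaults or arbitrary choices, and cross-validated mean absolute error is used only because the description does not name the evaluation metric. The feature matrix is random placeholder data sized to roughly match the 60 usable samples, not the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor


def compare_models(X, y, cv=5, seed=0):
    """Return the cross-validated mean absolute error for each model."""
    models = {
        "random_forest": RandomForestRegressor(n_estimators=100, random_state=seed),
        "knn": KNeighborsRegressor(n_neighbors=5),
        "decision_tree": DecisionTreeRegressor(random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        # scikit-learn reports negated MAE so that higher is better.
        neg_mae = cross_val_score(model, X, y, cv=cv,
                                  scoring="neg_mean_absolute_error")
        scores[name] = -neg_mae.mean()
    return scores


# Placeholder data only: latitude, longitude, and a tap/bottled flag,
# with a random stand-in for the PE level.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(32, 42, 60),
    rng.uniform(-124, -114, 60),
    rng.integers(0, 2, 60),
])
y = rng.gamma(2.0, 1.0, 60)
scores = compare_models(X, y)
```

With only about 60 rows, each fold holds roughly a dozen samples, so the scores vary considerably between splits; this is consistent with the limited predictive power noted above.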
Dataset Limitations
- The dataset suffers from many missing values and mismatched data types; after cleaning, only about 60 samples remain usable.
- The dataset has been updated continuously since 21 July 2022, but its current reliability and standardization are insufficient for robust predictive modeling.
Conclusion
- Despite testing multiple models, the dataset’s quality prevents reliable predictions of drinking‑water safety based on microplastic content. Further data collection and standardization are required.