Dataset asset | Open Source Community | Water Quality Monitoring | Plastic Pollution

Microplastics in Drinking Water

This dataset records microplastic presence in drinking water. Each row represents a water‑sample record, containing microplastic material and type, color, water source type (tap or bottled), and the sampling location's latitude and longitude. The dataset focuses on polyethylene (PE) material for predicting PE levels across different geographic locations.

Source: GitHub
Created: Feb 22, 2024
Updated: Feb 24, 2024
Dataset Overview

Dataset Name

  • The dataset is named “Microplastics in Drinking Water,” with the specific file called “Microplastics Sample Data (wide).”

Dataset Source

  • The dataset is sourced from GitHub.

Dataset Content

  • Each row represents a water‑sample record with associated information.
  • Key columns include microplastic material and type (content per sample), color, tap vs. bottled water, sampling location, and approximate coordinates.
  • The project focuses on PE (polyethylene); other “material” columns are removed.

Data Processing

  • The original dataset had over 100 columns; columns with fewer than 40 non‑missing values were removed.
  • Additional cleaning removed unnecessary columns such as Sample_ID and handled all NaN and non‑numeric “Present” entries.
  • Samples from Chinese reservoirs with extreme values were excluded.
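
The column-level cleaning steps above can be sketched in pandas as follows. This is a minimal illustration, not the project's actual code: the function name `clean_samples`, its parameters, and the choice to treat "Present" as a missing value are assumptions. The exclusion of reservoir samples is left out because the criteria for "extreme values" are not stated.

```python
import pandas as pd


def clean_samples(df: pd.DataFrame, min_values: int = 40,
                  drop_cols: tuple = ("Sample_ID",)) -> pd.DataFrame:
    """Apply the column-level cleaning steps to a wide sample table."""
    # Assumption: the marker "Present" records presence without a numeric
    # amount, so it is treated as missing here.
    out = df.mask(df.eq("Present"))
    # Drop identifier columns that carry no predictive information.
    out = out.drop(columns=[c for c in drop_cols if c in out.columns])
    # Keep only columns with at least `min_values` non-missing entries.
    out = out.dropna(axis=1, thresh=min_values)
    return out
```

With the default `min_values=40`, a column survives only if at least 40 rows hold a usable value, which matches the threshold described above.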

Dataset Usage

  • Random Forest, k‑NN regression, and Decision Tree regression models were employed for prediction.
  • Model evaluation indicated Decision Tree regression performed best, though its predictive power is limited by sample size and data quality.
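
A comparison of the three model families named above might look like the scikit-learn sketch below. Several details are assumptions rather than the project's setup: the function name `compare_models`, the use of latitude and longitude as the only features, the hyperparameters, 5-fold cross-validation, and mean absolute error as the metric. The demo data is synthetic and stands in for roughly 60 cleaned samples, so its scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor


def compare_models(X, y, n_splits=5, seed=0):
    """Return the mean cross-validated MAE per model (lower is better)."""
    models = {
        "random_forest": RandomForestRegressor(n_estimators=100, random_state=seed),
        "knn": KNeighborsRegressor(n_neighbors=5),
        "decision_tree": DecisionTreeRegressor(random_state=seed),
    }
    # Shuffled folds, since sample order may reflect collection location.
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = {}
    for name, model in models.items():
        neg_mae = cross_val_score(model, X, y, cv=cv,
                                  scoring="neg_mean_absolute_error")
        scores[name] = float(-neg_mae.mean())
    return scores


# Synthetic stand-in data: (latitude, longitude) -> hypothetical PE level.
rng = np.random.default_rng(0)
X_demo = rng.uniform([-60.0, -180.0], [70.0, 180.0], size=(60, 2))
y_demo = rng.gamma(shape=2.0, scale=1.0, size=60)
demo_scores = compare_models(X_demo, y_demo)
```

With only about 60 samples, each fold holds roughly 12 test points, so differences between models can easily fall within fold-to-fold noise.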

Dataset Limitations

  • The dataset suffers from many missing values and mismatched data types; after cleaning, only about 60 samples remain usable.
  • The source data has been updated continuously since 21 July 2022, but its reliability and standardization are currently insufficient for robust predictive modeling.

Conclusion

  • Despite testing multiple models, the dataset’s quality prevents reliable predictions of drinking‑water safety based on microplastic content. Further data collection and standardization are required.