Synthetic Water Quality and Potability Data
About
This synthetic dataset contains water quality metrics and safety attributes and is designed for educational and classification purposes. It describes the chemical and physical properties of water samples to determine their suitability for human consumption. With approximately 100,000 records, it supports the development of machine learning models for binary classification: predicting whether water is potable (safe to drink) based on factors such as pH, hardness, and chemical composition.
Columns
- ph: Indicates the acid-base balance of the water (Range: 0–14).
- hardness: Measures the capacity of water to precipitate soap, typically caused by calcium and magnesium (Range: 0–1100).
- tds: Represents Total Dissolved Solids, measuring the combined content of all inorganic and organic substances (Range: 0–1100).
- chlorine: The amount of chlorine present, often used as a disinfectant (Range: 0–9).
- sulfate: Concentration of dissolved sulfates (Range: 0–800).
- conductivity: Electrical conductivity of the water, related to dissolved ion concentration (Range: 0–40,000).
- organic_carbon: Measurement of the amount of organic carbon in the water (Range: 0–18).
- trihalomethanes: Chemicals found in water treated with chlorine (Range: 0–230).
- turbidity: A measure of the cloudiness of water caused by suspended particles, which scatter light (Range: 0–14).
- potability: The target variable indicating if water is safe for human consumption (1 = Potable, 0 = Not Potable).
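The documented ranges above can double as a simple sanity check when loading the data. The following is a minimal sketch, assuming the bounds are inclusive and that a missing value is represented as None; the names COLUMN_RANGES and out_of_range_columns are hypothetical and not part of the dataset.

```python
# Value ranges copied from the column list (inclusive bounds are an assumption).
COLUMN_RANGES = {
    "ph": (0, 14),
    "hardness": (0, 1100),
    "tds": (0, 1100),
    "chlorine": (0, 9),
    "sulfate": (0, 800),
    "conductivity": (0, 40000),
    "organic_carbon": (0, 18),
    "trihalomethanes": (0, 230),
    "turbidity": (0, 14),
}


def out_of_range_columns(row):
    """Return the feature columns whose value lies outside its documented range.

    `row` maps column names to floats; None marks a missing value and is skipped.
    """
    bad = []
    for name, (low, high) in COLUMN_RANGES.items():
        value = row.get(name)
        if value is None:
            continue  # missing values are handled separately, not flagged here
        if not (low <= value <= high):
            bad.append(name)
    return bad


# Illustrative record (made up for this example, not taken from the dataset).
sample = {"ph": 7.2, "hardness": 180.0, "tds": 1500.0, "chlorine": 2.1,
          "sulfate": None, "conductivity": 420.0, "organic_carbon": 9.5,
          "trihalomethanes": 60.0, "turbidity": 3.8}
print(out_of_range_columns(sample))  # ['tds']
```

A check like this flags records for inspection rather than dropping them, which leaves the decision about outliers to the analyst.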
Distribution
- Format: CSV
- Size: Approximately 100,000 records (rows).
- Structure: Tabular data with 10 columns.
- Class Balance: The dataset is imbalanced, with a significantly higher proportion of non-potable samples (approx. 92.4%) compared to potable samples (approx. 7.6%).
- Data Quality: Missing values occur in the ph (3%), sulfate (1%), and conductivity (2%) columns.
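The class balance and missing-value shares stated above can be verified after download. The sketch below uses only the Python standard library and a tiny in-memory CSV as a stand-in for the real file. It assumes missing values appear as empty fields, which the listing does not confirm; the function name summarize is hypothetical.

```python
import csv
import io


def summarize(csv_file, target="potability"):
    """Report row count, share of potable rows, and share of empty cells per feature."""
    reader = csv.DictReader(csv_file)
    rows = list(reader)
    total = len(rows)
    if total == 0:
        raise ValueError("no data rows found")
    potable = sum(1 for r in rows if r[target] == "1")
    missing = {
        col: sum(1 for r in rows if r[col].strip() == "") / total
        for col in reader.fieldnames
        if col != target
    }
    return {"rows": total, "potable_share": potable / total, "missing_share": missing}


# Four made-up rows with only a subset of the columns, for illustration.
demo = io.StringIO(
    "ph,sulfate,conductivity,potability\n"
    "7.1,250,400,0\n"
    ",310,520,0\n"
    "6.8,,380,1\n"
    "8.0,290,,0\n"
)
report = summarize(demo)
print(report["rows"], report["potable_share"])  # 4 0.25
print(report["missing_share"])
```

For the real file, replace the StringIO object with an open file handle, for example open(path, newline=""), where path is wherever the CSV was saved.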
Usage
- Machine Learning Classification: Developing binary classification models to predict water safety.
- Data Pre-processing Practice: Handling imbalanced datasets and imputing missing values (e.g., in pH and Sulfate columns).
- Exploratory Data Analysis (EDA): Analysing correlations between chemical properties like conductivity and dissolved solids.
- Educational Demonstrations: Teaching data science concepts using synthetic environmental data.
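Two of the pre-processing tasks listed above, imputing missing values and compensating for class imbalance, can be sketched in a few lines. This is one possible approach under stated assumptions, not a method prescribed by the dataset: median imputation is chosen for simplicity, and the weights follow the common "balanced" heuristic n_samples / (n_classes * class_count). The function names are hypothetical.

```python
from collections import Counter
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Made-up values for illustration only.
ph = [7.0, None, 6.5, 8.1, None]
print(impute_median(ph))  # [7.0, 7.0, 6.5, 8.1, 7.0]

labels = [0] * 9 + [1]  # nine non-potable samples, one potable
weights = balanced_class_weights(labels)
print(weights[1])  # 5.0
```

The resulting weights give the rare potable class more influence during training; many classifiers accept such per-class weights directly.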
Coverage
- Scope: The data is synthetic and generated specifically for educational purposes; it does not represent a specific geographic location or real-time monitoring system.
- Demographic/Time: Not applicable as the data is simulated.
License
CC BY-NC-SA 4.0
Who Can Use It
- Data Science Students: For practising classification algorithms and data cleaning techniques.
- Machine Learning Educators: As a resource for assignments regarding imbalanced class handling.
- Environmental Researchers: For simulating water quality assessment workflows.
Dataset Name Suggestions
- Synthetic Water Quality and Potability Data
- Water Safety Classification Set
- Educational Water Drinkability Indicators
- Large-Scale Water Chemical Properties Dataset
Attributes
Original Data Source: Synthetic Water Quality and Potability Data
