Water Potability Classification and Quality Clustering

A data analysis experiment using the water potability dataset from Kadiwal, A. (2020).

Methodology

Data Source Collection
Data Cleaning and Pre-processing
- Train/test split of 80/20
- Creating 3 sets: Original, MinMax scaled, and Standard scaled
Modelling
- Nearest Neighbors
- Decision Tree
- K-means Clustering
Result analysis
- Best classification model
  - Metrics: Accuracy, Precision, Recall, F1 Score
- Highest Scoring cluster
  - Silhouette Score
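
The classification steps above could be sketched with scikit-learn roughly as follows. This is a minimal illustration, not the notebook's actual code: the data is a synthetic stand-in generated with `make_classification`, and the 9-feature shape, the `random_state` values, the stratified split, and the Decision Tree defaults are all assumptions. Only k=10 for Nearest Neighbors comes from the results reported in this README. The scalers are fitted on the training split only, which is common practice assumed here rather than something the README states.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned data (assumption: 9 numeric features, binary label).
X, y = make_classification(n_samples=500, n_features=9, n_informative=5, random_state=0)

# 80/20 train/test split (stratification is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)


def make_sets(X_train, X_test):
    """Build the three feature sets: Original, MinMax scaled, Standard scaled."""
    sets = {"Original": (X_train, X_test)}
    for name, scaler in (("MinMax", MinMaxScaler()), ("Standard", StandardScaler())):
        # Fit on the training split only, then apply the same transform to the test split.
        sets[name] = (scaler.fit_transform(X_train), scaler.transform(X_test))
    return sets


# Factories so that every feature set gets a fresh, unfitted model.
models = {
    "Nearest Neighbors": lambda: KNeighborsClassifier(n_neighbors=10),
    "Decision Tree": lambda: DecisionTreeClassifier(random_state=0),
}

results = {}
for set_name, (X_tr, X_te) in make_sets(X_train, X_test).items():
    for model_name, make_model in models.items():
        pred = make_model().fit(X_tr, y_train).predict(X_te)
        results[(set_name, model_name)] = {
            "accuracy": accuracy_score(y_test, pred),
            "precision": precision_score(y_test, pred, zero_division=0),
            "recall": recall_score(y_test, pred, zero_division=0),
            "f1": f1_score(y_test, pred, zero_division=0),
        }

for key, scores in results.items():
    print(key, {name: round(value, 3) for name, value in scores.items()})
```

Each entry in `results` corresponds to one (feature set, model) pair, so the best classifier can be picked by comparing the metric of interest across the six entries.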

Tech stack: Jupyter for the modelling and charting, Canva for the images

The best-performing classification model is Nearest Neighbors (k=10) on the Standard scaled set, which achieved an accuracy of 69.65% and a precision of 71.43%.
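
For readers unfamiliar with the metrics, the sketch below shows how accuracy, precision, recall, and F1 score are derived from confusion-matrix counts. The helper `classification_metrics` and the tiny label lists are hypothetical and purely illustrative; they are not the project's data and do not reproduce the figures above.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 4 positive samples and 6 negative samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case there are 2 true positives, 1 false positive, 2 false negatives, and 5 true negatives, giving an accuracy of 7/10 and a precision of 2/3. Precision can exceed accuracy, as it does in the reported results, when the model is more reliable on the samples it labels positive than on the dataset as a whole.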

All features were used to train the models, but only two of them (solids vs. turbidity) are plotted in the visualisations.

K-means clustering with K=2 on the Original dataset performed best, with a silhouette score of 0.571.
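
A comparison of this kind could be sketched as below, scoring several values of K by silhouette score. This is an assumption-laden illustration rather than the notebook's code: the data is a synthetic two-blob set from `make_blobs` (chosen so that K=2 is the obvious answer), and the K range, `n_init`, and `random_state` values are arbitrary choices. It does not reproduce the 0.571 score.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in: two well-separated groups in 2D (not the water dataset).
X, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [10, 10]], cluster_std=1.0, random_state=0
)

# Fit K-means for each candidate K and record the mean silhouette score.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The highest-scoring K is the one whose clusters are tightest and best separated.
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"K={k}: silhouette={score:.3f}")
print("best K:", best_k)
```

Repeating the loop over the Original, MinMax, and Standard scaled sets would give one score per (set, K) pair. Since the silhouette score is distance-based, scores from differently scaled sets should probably be compared with some caution.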
