Data analysis experimentation using the dataset from Kadiwal, A. (2020)
- Data Source Collection
- Data Cleaning and Pre-processing
- Train Test Split of 80/20
- Creating 3 sets: Original, MinMax scaled, and Standard scaled
- Modelling
  - Nearest Neighbors
  - Decision Tree
  - K-means Clustering
- Result analysis
  - Best classification model
    - Metrics: Accuracy, Precision, Recall, F1 Score
  - Highest Scoring cluster
    - Silhouette Score
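
The steps listed above could be sketched roughly as follows. This is a minimal illustration with scikit-learn, not the notebook's actual code: the synthetic data from `make_classification` stands in for the cleaned dataset, and the feature count, the `random_state` values, the stratified split, and the choice to fit each scaler on the training rows only are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned dataset (assumption: numeric features, binary target).
X, y = make_classification(n_samples=500, n_features=9, n_informative=5, random_state=0)

# 80/20 train/test split (stratification is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)


def make_feature_sets(X_train, X_test):
    """Build the three variants: Original, MinMax scaled, Standard scaled."""
    sets = {"Original": (X_train, X_test)}
    for name, scaler in (("MinMax", MinMaxScaler()), ("Standard", StandardScaler())):
        # Fit on the training rows only, then apply the same transform to the test rows.
        sets[name] = (scaler.fit_transform(X_train), scaler.transform(X_test))
    return sets


feature_sets = make_feature_sets(X_train, X_test)

# Factories so that every (model, feature set) pair gets a fresh estimator.
models = {
    "Nearest Neighbors (k=10)": lambda: KNeighborsClassifier(n_neighbors=10),
    "Decision Tree": lambda: DecisionTreeClassifier(random_state=0),
}

# Test-set accuracy for each classifier on each feature set.
results = {}
for set_name, (X_tr, X_te) in feature_sets.items():
    for model_name, build in models.items():
        model = build().fit(X_tr, y_train)
        results[(model_name, set_name)] = model.score(X_te, y_test)

for (model_name, set_name), acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{model_name:<26} {set_name:<9} accuracy={acc:.4f}")
```

The numbers printed here come from synthetic data and say nothing about the real results. The point is only the shape of the comparison: two classifiers times three feature sets.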
Tech stack: Jupyter for the modelling and charting, Canva for the images
The best-performing classifier was Nearest Neighbors (k=10) on the Standard-scaled set, with an accuracy of 69.65% and a precision of 71.43%.
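
For reference, the four classification metrics listed in the outline can be computed from the confusion counts as in the sketch below. The helper name `classification_metrics` and the toy labels are hypothetical and only illustrate the arithmetic. They are not the study's predictions.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels (hypothetical): 3 true positives, 1 false positive,
# 1 false negative, 3 true negatives.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)  # every metric is 0.75 for these toy labels
```

Precision only counts how many predicted positives were correct, so it can be higher than accuracy, as it is in the reported result.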
The visualisation plots only two features (solids vs turbidity), but the models were trained on all features.
K-means clustering with K=2 using the Original dataset performed best with a silhouette score of 0.571.
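
One way such a comparison might be run is sketched below with scikit-learn. The data from `make_blobs` is a synthetic placeholder, and the range of K values, `n_init`, and `random_state` are assumptions. The source only reports that K=2 on the Original set scored highest.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic placeholder for the unscaled ("Original") feature matrix.
X, _ = make_blobs(n_samples=300, centers=2, n_features=9, random_state=0)

# Silhouette score for each candidate K (the candidate range is an assumption).
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"K={k}: silhouette={score:.3f}")
print(f"Highest-scoring K: {best_k}")
```

The silhouette score ranges from -1 to 1, with higher values indicating tighter, better-separated clusters. Repeating the loop over the MinMax and Standard scaled sets would give the full comparison across feature sets.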