This project builds a classification model that predicts a song's genre from its audio features.
- Removed rows with missing `instance_id`, `artist_name`, and `track_name`.
- Replaced missing values in `duration_ms` and `tempo` with medians.
- One-hot encoded the `key` column; label encoded the `mode` and `music_genre` columns.
- Standardized numerical features.
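A minimal preprocessing sketch with pandas and scikit-learn is shown below. The CSV file name is a placeholder, the numeric feature columns are selected programmatically rather than listed explicitly, and the encoded categorical columns are assumed to have no missing values; only the columns named above are documented here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical file name; columns follow the dataset described above.
df = pd.read_csv("music_genre.csv")

# 1. Drop rows missing the identifier columns.
df = df.dropna(subset=["instance_id", "artist_name", "track_name"])

# 2. Median-impute duration_ms and tempo (coerce stray strings first).
for col in ["duration_ms", "tempo"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")
    df[col] = df[col].fillna(df[col].median())

# Remember which columns are numeric features before encoding.
numeric_cols = [c for c in df.select_dtypes(include="number").columns
                if c != "instance_id"]

# 3. One-hot encode `key`; label encode `mode` and the target `music_genre`
#    (assumes these columns have no missing values).
df = pd.get_dummies(df, columns=["key"], prefix="key")
for col in ["mode", "music_genre"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# 4. Standardize the numerical feature columns.
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
```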
- Split dataset: 500 songs per genre for the test set, remaining for training.
- Used t-SNE and UMAP for visualization.
- Applied HDBSCAN for clustering analysis.
- Added cluster labels as features.
- Used XGBoost for classification.
- Evaluated with ROC AUC and accuracy metrics.
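The sketch below strings these steps together, continuing from the preprocessed `df` above. Holding out 500 songs per genre, clustering on a 2-D UMAP embedding, and the specific UMAP, HDBSCAN, and XGBoost parameters are illustrative assumptions rather than the exact configuration; report.pdf documents the actual setup.

```python
import hdbscan
import numpy as np
import umap
import xgboost as xgb
from sklearn.metrics import accuracy_score, roc_auc_score

# --- Split: hold out 500 songs per genre for the test set ---
test_df = df.groupby("music_genre").sample(n=500, random_state=42)
train_df = df.drop(test_df.index)

drop_cols = ["music_genre", "instance_id", "artist_name", "track_name"]
feature_cols = [c for c in df.columns if c not in drop_cols]
X_train = train_df[feature_cols].to_numpy(dtype=float)
X_test = test_df[feature_cols].to_numpy(dtype=float)
y_train = train_df["music_genre"].to_numpy()
y_test = test_df["music_genre"].to_numpy()

# --- Unsupervised structure: UMAP embedding + HDBSCAN clusters ---
reducer = umap.UMAP(n_components=2, random_state=42)
emb_train = reducer.fit_transform(X_train)   # also usable for 2-D visualization
emb_test = reducer.transform(X_test)

clusterer = hdbscan.HDBSCAN(min_cluster_size=50, prediction_data=True).fit(emb_train)
train_clusters = clusterer.labels_
test_clusters, _ = hdbscan.approximate_predict(clusterer, emb_test)

# Append the cluster label (-1 = noise) as an extra feature column.
X_train = np.column_stack([X_train, train_clusters])
X_test = np.column_stack([X_test, test_clusters])

# --- Classification with XGBoost (illustrative hyperparameters) ---
model = xgb.XGBClassifier(
    objective="multi:softprob",
    n_estimators=300,
    max_depth=6,
    learning_rate=0.1,
    eval_metric="mlogloss",
)
model.fit(X_train, y_train)

# --- Evaluation: macro one-vs-rest ROC AUC and accuracy ---
proba_test = model.predict_proba(X_test)
print("Test ROC AUC:", roc_auc_score(y_test, proba_test, multi_class="ovr", average="macro"))
print("Train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```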
- Mean ROC AUC: 0.9345 (test set).
- Accuracy: 64.59% (training), 58.82% (test).
- Observed overfitting (training accuracy above test accuracy); suggested cross-validation, stronger regularization, and further feature engineering (see the sketch below).
- Feature engineering and data preprocessing were crucial.
- Visualizing the audio features highlighted the complexity of genre classification.
- Further improvements are needed for practical application.
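One way to act on these suggestions is stratified k-fold cross-validation with a more strongly regularized XGBoost configuration. The parameter values below are illustrative, not tuned, and `X_train`/`y_train` are assumed to come from the pipeline sketch above.

```python
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold, cross_val_score

# More conservative XGBoost settings (illustrative values, not tuned).
model = xgb.XGBClassifier(
    objective="multi:softprob",
    n_estimators=300,
    max_depth=4,            # shallower trees
    learning_rate=0.05,
    subsample=0.8,          # row subsampling
    colsample_bytree=0.8,   # feature subsampling
    reg_lambda=2.0,         # L2 regularization
    eval_metric="mlogloss",
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```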
For a detailed description, refer to report.pdf.