This project applies machine learning to predict bankruptcy from the financial data of Taiwanese companies. It compares Logistic Regression and Balanced Bootstrap Aggregating (balanced bagging) to build reliable bankruptcy prediction models.
The dataset used in this project is sourced from the Taiwanese Bankruptcy Prediction Dataset on the UCI Machine Learning Repository. It includes financial ratios from 6819 companies collected between 1999 and 2009.
Accurate bankruptcy prediction models are crucial for governments, financial institutions, and small-scale lenders to assess economic activity and minimize credit risks. This project aims to develop robust models using machine learning techniques to improve upon traditional statistical methods.
- data/: Contains the dataset and any preprocessed data files.
- README.md: Project overview and instructions.
- Report.pdf: Detailed analyses of the project.
To run this project, you will need Python 3.8 and the following libraries:
pip install numpy pandas scikit-learn matplotlib seaborn imbalanced-learn
- Data Preprocessing:
  - Load the dataset and preprocess it by handling missing values, scaling features, and performing feature selection.
- Model Training:
  - Train a Logistic Regression model with Stratified Cross Validation (SCV).
  - Train a Balanced Bootstrap Aggregating (BB) model using Logistic Regression as the base learner.
- Model Evaluation:
  - Evaluate the models using F1-score, ROC-AUC, precision, and recall metrics.
  - Compare the performance of the SCV Logistic Regression and Balanced Bagging models.
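The steps above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: it runs on synthetic data from `make_classification` (the roughly 3% positive rate is an assumed stand-in for the real class ratio), skips missing-value handling and feature selection, and hand-rolls the balanced bootstrap with NumPy instead of using a packaged implementation such as imbalanced-learn's `BalancedBaggingClassifier`. The function names, `class_weight="balanced"`, the 0.5 threshold, and the hyperparameters are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the financial-ratio data (assumption).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    n_clusters_per_class=1, weights=[0.97, 0.03], random_state=0,
)

def logistic_fit_predict(X_train, y_train, X_test):
    """Scaled logistic regression; class weighting is an assumed choice."""
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    )
    model.fit(X_train, y_train)
    return model.predict_proba(X_test)[:, 1]

def balanced_bagging_fit_predict(X_train, y_train, X_test, n_estimators=10, seed=0):
    """Average logistic models, each fit on a class-balanced bootstrap sample."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y_train == 1)
    neg = np.flatnonzero(y_train == 0)
    probas = []
    for _ in range(n_estimators):
        # Resample the minority class with replacement and draw an
        # equally sized sample from the majority class.
        idx = np.concatenate([
            rng.choice(pos, size=len(pos), replace=True),
            rng.choice(neg, size=len(pos), replace=True),
        ])
        model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
        model.fit(X_train[idx], y_train[idx])
        probas.append(model.predict_proba(X_test)[:, 1])
    return np.mean(probas, axis=0)

def out_of_fold_proba(fit_predict, X, y, n_splits=5, seed=0):
    """Stratified k-fold: each fold keeps the overall class ratio."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    proba = np.zeros(len(y), dtype=float)
    for train_idx, test_idx in cv.split(X, y):
        proba[test_idx] = fit_predict(X[train_idx], y[train_idx], X[test_idx])
    return proba

def report(y_true, proba, threshold=0.5):
    """Compute the four metrics used in this project."""
    pred = (proba >= threshold).astype(int)
    return {
        "f1": f1_score(y_true, pred, zero_division=0),
        "roc_auc": roc_auc_score(y_true, proba),
        "precision": precision_score(y_true, pred, zero_division=0),
        "recall": recall_score(y_true, pred, zero_division=0),
    }

scv_scores = report(y, out_of_fold_proba(logistic_fit_predict, X, y))
bb_scores = report(y, out_of_fold_proba(balanced_bagging_fit_predict, X, y))
print("Logistic Regression (SCV):", scv_scores)
print("Balanced bagging (BB):    ", bb_scores)
```

Scoring both models on the same stratified out-of-fold predictions keeps the comparison like-for-like. The numbers printed for this synthetic data are not expected to match the results reported for the real dataset.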
The models were evaluated on their ability to predict bankruptcy. Here are some key findings:
- Logistic Regression (SCV):
  - F1 Score: 0.27
  - ROC-AUC: 0.91
  - Precision: 0.16
  - Recall: 0.82
- Balanced Bootstrap Aggregating (BB):
  - F1 Score: 0.27
  - ROC-AUC: 0.92
  - Precision: 0.16
  - Recall: 0.82
Both models show good discrimination between classes, with high recall indicating effective capture of positive bankruptcy cases. However, precision is low, suggesting positive predictions are less reliable.
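As a quick consistency check, the reported F1 score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The short sketch below uses only the rounded figures listed above; the helper name `f1_from` is made up for this example.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

precision, recall = 0.16, 0.82  # rounded values reported above
f1 = f1_from(precision, recall)
print(round(f1, 2))  # 0.27

# False alarms per correctly flagged company at this precision.
false_alarms_per_hit = (1 - precision) / precision
print(round(false_alarms_per_hit, 2))  # 5.25
```

At a precision of 0.16, each correctly flagged bankruptcy comes with about five false alarms, which is the trade-off behind the high recall.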
This project demonstrates the challenges of predicting bankruptcy in an imbalanced dataset where successful businesses significantly outnumber unsuccessful ones. On the metrics reported above, the two models performed almost identically: F1 score, precision, and recall match, and the Balanced Bagging model's ROC-AUC is marginally higher (0.92 versus 0.91). Future improvements could include incorporating more diverse data and creating sector-specific models.
- Altman, E.I. (1968). Financial Ratios, Discriminant Analysis and the Prediction of Corporate Bankruptcy. The Journal of Finance.
- Gnip, P., and Drotar, P. (2019). Ensemble methods for strongly imbalanced data: bankruptcy prediction. IEEE Conference Publication.
- Hackeling, G. (2017). Mastering Machine Learning with scikit-learn - Second Edition. Packt Publishing.