During the three-month internship, the goal was to solve a range of tasks, from classical data science methods to newer ones such as sparse SVD: working with both large and small datasets, preprocessing data, and cross-validating, training and testing different models.
The task is to perform binary classification on the Titanic dataset, given passenger data that includes:
- `Name`: the passenger's surname, title and name (full name of spouse),
- `Sex`: the passenger's sex,
- `PClass`: the passenger's class,
- `Age`: the passenger's age.
Using this data, we predict whether or not the passenger survived the sinking of the Titanic.
The original dataset was parsed, and bad or missing data was fixed or estimated.
We extracted passenger titles from their full names; missing titles were estimated from sex, age and passenger class into one of four categories: `Mr`, `Mrs`, `Master`, `Ms`.
Different names for the same title were merged into one category, e.g. `Ms` = (`Ms`, `Miss`, `Mlle`).
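As an illustration of this step, here is a minimal sketch; the `TITLE_MAP` table and `extract_title` helper are hypothetical names rather than the package's actual API, and assume names follow the dataset's `'Surname, Title Firstname'` pattern:

```python
import re
import pandas as pd

# Illustrative mapping of variant spellings onto the four canonical titles.
TITLE_MAP = {
    'Mr': 'Mr', 'Master': 'Master',
    'Mrs': 'Mrs', 'Mme': 'Mrs',
    'Ms': 'Ms', 'Miss': 'Ms', 'Mlle': 'Ms',
}

def extract_title(name):
    """Pull the title out of a 'Surname, Title Firstname' string."""
    match = re.search(r',\s*([A-Za-z]+)', name)
    raw = match.group(1) if match else ''
    return TITLE_MAP.get(raw, '')  # '' flags a missing/unknown title

names = pd.Series(['Allen, Miss Elisabeth Walton', 'Anderson, Mr Harry'])
print(names.apply(extract_title))  # -> Ms, Mr
```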
Age was estimated based on passenger class and title. Several estimation methods were tried, using the mean and the median age of each category, and adding a small deviation from the mean/median age to preserve the per-class age histograms.
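A minimal sketch of this estimation step, assuming a dataframe with `PClass`, `Title` and `Age` columns; the function name and the exact jitter scheme are illustrative, with `std_dev_scale` playing the role of the standard-deviation scaler described in the cross-validation section below:

```python
import numpy as np
import pandas as pd

def estimate_missing_ages(df, use_mean=True, std_dev_scale=0.5, seed=0):
    """Fill missing ages from per-(PClass, Title) statistics, adding a small
    jitter (a fraction of each group's std) so the per-class age histograms
    keep their shape instead of collapsing onto a single value."""
    rng = np.random.default_rng(seed)
    grouped = df.groupby(['PClass', 'Title'])['Age']
    centre = grouped.transform('mean' if use_mean else 'median')
    spread = grouped.transform('std').fillna(0.0)
    jitter = rng.normal(0.0, 1.0, len(df)) * spread * std_dev_scale
    df['Age'] = df['Age'].fillna((centre + jitter).clip(lower=0))
    return df
```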
We also added a new feature, `family_size`, which represents the number of passengers on board the Titanic with the same surname and the same passenger class, the idea being that families always travel together. (This may not always hold, especially in 3rd class, but the models performed better with this naive implementation.)
This module performs basic dataset visualisations, listed below:
- `plot_age_histograms`: create age histograms for the different `p_class` and `title` categories.
- `plot_age_title`: create 2D plots of those who survived vs. those less fortunate, with age on the x axis and passenger title on the y axis.
- `plot_age_title_class`: create a 3D scatter graph of survivors, with `p_class` on an additional z axis and point size determined by family size.
- `group_survivors_by_params`: create multi-index pandas dataframes that group survivors by the list of attributes passed to the function (see the sketch below).
- `create_survivor_groups`: a parent function that uses `group_survivors_by_params` to create groupings by different combinations of parameters.
- `create_visualisations`: a single function that performs the following:
  - create age histograms,
  - create survivor-group dataframes,
  - 2D plot of survivors by age and title,
  - 3D plot of survivors by age, title, class and family_size.

Files generated by this function will appear in the `/Data/Images`, `/Data/Pickles` and `/Data/Dataframes` folders.
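As referenced in the list above, here is a sketch of the idea behind `group_survivors_by_params`; it illustrates the grouping, not the package's exact implementation, and assumes a `Survived` column holding 0/1 labels:

```python
import pandas as pd

def group_survivors_by_params(df, params):
    """Multi-index dataframe of passenger and survivor counts
    for every combination of the attributes in `params`."""
    return df.groupby(params)['Survived'].agg(total='count', survived='sum')

# e.g. survival broken down by class and title:
df = pd.DataFrame({'PClass': ['1st', '1st', '3rd', '3rd'],
                   'Title': ['Ms', 'Mr', 'Mr', 'Mr'],
                   'Survived': [1, 1, 0, 1]})
print(group_survivors_by_params(df, ['PClass', 'Title']))
```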
After using this module, we get the following results (red represents survivors):
This module performs cross-validation on four different classifiers, searching for the best values of the following hyperparameters:
- Dataset parameters:
  - `mean`: bool, whether age is estimated using the mean or the median age,
  - `std_dev_scale`: float in the range (0, 1), the standard-deviation scaler used for age estimation,
  - `better_title`: bool, whether age is estimated naively, based only on passenger sex, or with passenger title taken into consideration as well.
- Data format parameters:
  - `scale_data`: bool, whether we scale the data,
  - `delete_fam_size`: bool, whether we take the induced family-size feature into consideration,
  - `group_by_age`: bool, whether we convert the continuous `age` feature to a categorical variable.
- Model parameters, which vary from model to model (a cross-validation sketch follows this list).
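For the model-parameter leg of the search, a minimal scikit-learn cross-validation sketch might look like the following; the features and labels here are stand-ins, since the real search runs over the datasets generated with the parameters above:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in features/labels; in the package these come from the generated datasets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={'max_depth': [2, 3, 4, 5, 6]},
                    cv=5, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```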
The classifiers that we use are listed below:
- Decision tree,
- Gaussian Naive Bayes,
- Support Vector Machine,
- Logistic Regression.
And finally we use a Voting Classifier that ensembles the previously trained classifiers, using their achieved validation accuracies as weights.
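A minimal sketch of such an ensemble in scikit-learn, using the best model parameters from the table below; the weight values are illustrative placeholders, not the project's measured validation accuracies:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Validation accuracies of the individual models act as soft-voting weights
# (the numbers here are placeholders, not the report's actual scores).
val_acc = {'dt': 0.80, 'gnb': 0.78, 'svm': 0.82, 'lr': 0.79}

ensemble = VotingClassifier(
    estimators=[('dt', DecisionTreeClassifier(max_depth=4)),
                ('gnb', GaussianNB()),
                ('svm', SVC(C=1, gamma=0.03, probability=True)),
                ('lr', LogisticRegression(C=1, max_iter=1000))],
    voting='soft',
    weights=[val_acc['dt'], val_acc['gnb'], val_acc['svm'], val_acc['lr']],
)
# ensemble.fit(X_train, y_train); ensemble.predict_proba(X_test)
```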
In the table below, `Scale data`, `Delete fam_size` and `Group by age` are the data-format parameters; `Mean`, `std_dev scaler` and `Better title est.` are the dataset parameters.

| Model name | Model params | Scale data | Delete fam_size | Group by age | Mean | std_dev scaler | Better title est. |
|---|---|---|---|---|---|---|---|
| Decision tree | max_depth: 4 | False | False | False | True | 0.6 | False |
| Gaussian Naive Bayes | None | False | False | True | True | 0.8 | False |
| Support Vector Machine | C: 1, gamma: 0.03 | True | False | False | False | 0.2 | True |
| Logistic Regression | C: 1 | False | False | False | False | 0.4 | False |
| Voting Classifier | None | True | False | True | True | 0.2 | False |
**Decision Tree**

| Label | Estimated label | Probability |
|---|---|---|
| 0 | 0 | 0.90740741 |
| 0 | 0 | 0.90740741 |
| 1 | 1 | 0.91570881 |
| 0 | 0 | 0.90740741 |
| 0 | 0 | 0.89775561 |
| 0 | 0 | 0.89775561 |
| 1 | 0 | 0.67647059 |
| 1 | 1 | 0.91570881 |
| 1 | 1 | 0.91570881 |
| 1 | 0 | 0.67647059 |
**Gaussian Naive Bayes**

| Label | Estimated label | Probability |
|---|---|---|
| 0 | 0 | 0.999999390 |
| 0 | 0 | 0.999999390 |
| 1 | 0 | 0.569434327 |
| 0 | 0 | 0.841778072 |
| 0 | 0 | 0.835817807 |
| 0 | 0 | 0.835817807 |
| 1 | 1 | 0.612528817 |
| 1 | 1 | 0.883369126 |
| 1 | 1 | 0.831691072 |
| 1 | 0 | 0.521778528 |
**Support Vector Machine**

| Label | Estimated label | Probability |
|---|---|---|
| 0 | 0 | 0.92314078 |
| 0 | 0 | 0.92694052 |
| 1 | 1 | 0.65021272 |
| 0 | 0 | 0.81390982 |
| 0 | 0 | 0.86003766 |
| 0 | 0 | 0.8605255 |
| 1 | 0 | 0.83425303 |
| 1 | 1 | 0.97584139 |
| 1 | 1 | 0.82403036 |
| 1 | 0 | 0.80041372 |
**Logistic Regression**

| Label | Estimated label | Probability |
|---|---|---|
| 0 | 0 | 0.79770699 |
| 0 | 0 | 0.78271488 |
| 1 | 0 | 0.56420469 |
| 0 | 0 | 0.56946768 |
| 0 | 0 | 0.86886883 |
| 0 | 0 | 0.8673849 |
| 1 | 0 | 0.5175392 |
| 1 | 1 | 0.66775676 |
| 1 | 1 | 0.55726023 |
| 1 | 0 | 0.52991957 |
**Voting Classifier**

| Label | Estimated label | Probability |
|---|---|---|
| 0 | 0 | 0.91029576 |
| 0 | 0 | 0.91029576 |
| 1 | 1 | 0.59435627 |
| 0 | 0 | 0.69313934 |
| 0 | 0 | 0.86604043 |
| 0 | 0 | 0.86604043 |
| 1 | 0 | 0.59705601 |
| 1 | 1 | 0.89258578 |
| 1 | 1 | 0.81703188 |
| 1 | 0 | 0.7196419 |
- numpy
- pandas
- scikit-learn
- matplotlib
To install the package, download it from GitHub, navigate to the Task2 folder, and install it using the setup.py command:
$ python setup.py install
To use the installed package, you can either import it into your module with:
from ds_internship_task2 import create_new_datasets
from ds_internship_task2 import visualise_dataset
from ds_internship_task2 import titanic_classification
Or use the command line script as follows:
$ run-all /full/path/to/the/data/folder -cdata -cvis -tmodels -print -splt
This will create the different datasets (`-cdata`), create the visualisations (`-cvis`), train the models (`-tmodels`), print the results and current status (`-print`), and show the plots created by the `-cvis` argument.
Of course, any of these arguments may be omitted, but be careful not to exclude the `-tmodels` argument if there are no pretrained models available in the `/Data/Pickles` folder.
Amar Civgin
This project is licensed under the MIT License - see the LICENSE.md file for details
- Hat tip to anyone whose code was used
- Inspiration
- etc