akshu15/Classification-using-Machine-Learning


Classification-using-Machine-Learning

The task is to classify whether a patient has diabetes (class 1) or not (class 0), based on the diagnostic measurements provided in the dataset, using logistic regression and a neural network as classifiers. The dataset in use is the Pima Indians Diabetes Database (diabetes.csv). The code is written in Python.

1.1 Data processing

Extract feature values from the data: process the original CSV data file into a NumPy matrix or a pandas DataFrame. We first import the libraries and then use pandas to load the CSV data into a DataFrame.
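A minimal sketch of this loading step follows. It assumes the label column is named `Outcome` (as in the commonly distributed copy of diabetes.csv); the helper name `load_diabetes` and the three-row in-memory sample are hypothetical and only there so the example runs without the real file.

```python
import io

import pandas as pd


def load_diabetes(source):
    """Read the CSV and return the DataFrame plus feature and target arrays."""
    df = pd.read_csv(source)
    # Assumption: the target column is called "Outcome"; adjust if your copy differs.
    X = df.drop(columns=["Outcome"]).to_numpy(dtype=float)
    y = df["Outcome"].to_numpy(dtype=int)
    return df, X, y


# Hypothetical stand-in for diabetes.csv with only three feature columns.
sample_csv = io.StringIO(
    "Glucose,BMI,Age,Outcome\n"
    "148,33.6,50,1\n"
    "85,26.6,31,0\n"
    "183,23.3,32,1\n"
)
df, X, y = load_diabetes(sample_csv)
print(df.shape, X.shape, y.shape)
```

With the real file, the call would be `load_diabetes("diabetes.csv")`.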


Data Partitioning:

We first separate the features from the target and then normalize the features. Using scikit-learn's train_test_split, we partition the data into training, validation and testing sets: 60% of the data is randomly chosen for training, 20% for validation and the remaining 20% for testing.
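The 60/20/20 partition can be sketched as below. This is a NumPy-only stand-in, not the scikit-learn call the project uses, and the function name `split_and_normalize` and the random demo data are assumptions. It also differs in one design choice: the mean and standard deviation are computed from the training rows only, so that no information from the validation or testing rows enters the scaling.

```python
import numpy as np


def split_and_normalize(X, y, seed=0):
    """Shuffle the rows, split them 60/20/20 and standardize with training statistics."""
    n = len(X)
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    n_train = (6 * n) // 10
    n_val = (2 * n) // 10
    train_idx, val_idx, test_idx = np.split(order, [n_train, n_train + n_val])

    # Statistics come from the training rows only.
    mean = X[train_idx].mean(axis=0)
    std = X[train_idx].std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant columns

    def prepare(idx):
        return (X[idx] - mean) / std, y[idx]

    return prepare(train_idx), prepare(val_idx), prepare(test_idx)


# Random demo data standing in for the 8 diagnostic features.
demo_rng = np.random.default_rng(42)
X_demo = demo_rng.normal(loc=5.0, scale=2.0, size=(100, 8))
y_demo = demo_rng.integers(0, 2, size=100)

(X_train, y_train), (X_val, y_val), (X_test, y_test) = split_and_normalize(X_demo, y_demo)
print(X_train.shape, X_val.shape, X_test.shape)
```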

1.2 Implementing Logistic Regression

Train using Logistic Regression:

We will then define a sigmoid function.
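One possible NumPy definition is sketched here. The clipping bound of 500 is an assumption added only to avoid overflow warnings for extreme inputs.

```python
import numpy as np


def sigmoid(z):
    """Map any real input to the open interval (0, 1)."""
    z = np.clip(np.asarray(z, dtype=float), -500.0, 500.0)
    return 1.0 / (1.0 + np.exp(-z))


values = sigmoid(np.array([-4.0, 0.0, 4.0]))
print(values)
```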


A sigmoid function is an activation function whose output always lies between 0 and 1. Next we define a function for training the model. It computes a cost (loss) from the sigmoid output and uses gradient descent for logistic regression to update the parameters. We then call this function with the training set, the learning rate and the number of iterations. Finally, we measure the performance of the model on the validation set and the testing set, which shows how well the learned model generalizes to data it was not trained on.
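The training procedure described above might look like the following sketch. The function names, the default learning rate and iteration count, and the synthetic two-feature data are all assumptions made for illustration; the loss is the usual binary cross-entropy.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_logistic_regression(X, y, learning_rate=0.1, iterations=1000):
    """Fit weights and bias by batch gradient descent on the cross-entropy loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    losses = []
    eps = 1e-12  # keeps log() away from zero
    for _ in range(iterations):
        p = sigmoid(X @ w + b)
        loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
        losses.append(loss)
        error = p - y  # gradient of the loss with respect to the logits
        w -= learning_rate * (X.T @ error) / n_samples
        b -= learning_rate * error.mean()
    return w, b, losses


def accuracy(X, y, w, b):
    predictions = (sigmoid(X @ w + b) >= 0.5).astype(int)
    return float(np.mean(predictions == y))


# Synthetic, linearly separable data standing in for the real features.
rng = np.random.default_rng(0)
X_all = rng.normal(size=(300, 2))
y_all = (X_all[:, 0] + X_all[:, 1] > 0).astype(int)
X_fit, y_fit = X_all[:200], y_all[:200]
X_held, y_held = X_all[200:], y_all[200:]

w, b, losses = train_logistic_regression(X_fit, y_fit)
held_out_accuracy = accuracy(X_held, y_held, w, b)
print(losses[0], losses[-1], held_out_accuracy)
```

Evaluating on rows that were held out of training mirrors the validation and testing step in the text.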


1.3 Implementing Neural Networks

Train using Neural networks:

To train the neural network model we used 3 hidden layers and compared different regularization methods (L1 and L2). As model complexity increases, overfitting becomes more likely. One way to control overfitting is to add a regularization term to the error function. The added penalty discourages large weights and so improves the model's ability to generalize to unseen data.

L1 regularization adds the absolute value of each coefficient to the loss as a penalty term.

$$E(\mathbf{w}) = E_D(\mathbf{w}) + \lambda \sum_j |w_j|$$

where $\lambda$ is the regularization coefficient that controls the relative importance of the data-dependent error $E_D(\mathbf{w})$ and the regularization term. After training, the model reaches an accuracy of about 88% on evaluation.
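As a small numeric illustration of how the penalty enters the loss, the sketch below adds an L1 term to a given data loss. The function name, the example weights and the coefficient value are hypothetical.

```python
import numpy as np


def l1_regularized_loss(data_loss, weights, lam):
    """Return data_loss plus lam times the sum of absolute weights over all layers."""
    penalty = sum(np.abs(w).sum() for w in weights)
    return data_loss + lam * penalty


# Hypothetical weights of a tiny two-layer network.
layer_weights = [
    np.array([[1.0, -2.0], [0.0, 3.0]]),
    np.array([-0.5, 0.5]),
]
total = l1_regularized_loss(data_loss=0.4, weights=layer_weights, lam=0.01)
print(total)
```

A larger regularization coefficient makes the penalty dominate the data-dependent error, while a coefficient of 0 recovers the unregularized loss.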


We will then plot the accuracy and loss.

Figure: accuracy and loss curves for the L1-regularized network.

L2 regularization adds the squared magnitude of each coefficient to the loss as a penalty term.

$$E(\mathbf{w}) = E_D(\mathbf{w}) + \lambda \sum_j w_j^2$$

After training, this model reaches an accuracy of about 98% on evaluation. This is better than L1 regularization, which shrinks the coefficients of unimportant features to zero. L1 is the better choice when there is a very large number of features.
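The difference in how the two penalties shrink weights can be illustrated with one update step on the penalty alone. This is a sketch under stated assumptions: the L1 step uses soft-thresholding (the proximal form of the L1 update), which is not necessarily what the training code in this project does, and the weights, coefficient and learning rate are made up.

```python
import numpy as np


def l2_shrink(w, lam, lr):
    """Gradient step on lam * sum(w**2): every weight is scaled by the same factor."""
    return w * (1.0 - 2.0 * lr * lam)


def l1_shrink(w, lam, lr):
    """Soft-thresholding step for lam * sum(|w|): small weights become exactly zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)


w = np.array([2.0, -0.03, 0.5, 0.01])
lam, lr = 0.5, 0.1

w_l1 = l1_shrink(w, lam, lr)  # threshold is lr * lam = 0.05
w_l2 = l2_shrink(w, lam, lr)  # scaling factor is 1 - 2 * lr * lam = 0.9

print("L1:", w_l1, "non-zero:", np.count_nonzero(w_l1))
print("L2:", w_l2, "non-zero:", np.count_nonzero(w_l2))
```

The two smallest weights are removed entirely by the L1 step, while the L2 step keeps all four and only scales them down.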


We will then plot the accuracy and the loss for the training and validation data.

Figure: training and validation accuracy and loss curves for the L2-regularized network.

In the dropout regularization technique, neurons are randomly dropped out during training. Here I have applied dropout between two hidden layers. After training, the model reaches an accuracy of about 93% on evaluation.
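What dropping neurons means can be sketched in a few lines of NumPy. This shows the "inverted dropout" variant, which rescales the kept activations so their expected value is unchanged; the function name, the rate of 0.2 and the all-ones activations are assumptions for illustration, not the settings used in this project.

```python
import numpy as np


def dropout(activations, rate, rng, training=True):
    """Zero a random fraction `rate` of activations and rescale the rest."""
    if not training or rate == 0.0:
        return activations  # dropout is disabled at evaluation time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
hidden = np.ones((4, 1000))  # stand-in for the output of a hidden layer

dropped = dropout(hidden, rate=0.2, rng=rng)
zero_fraction = float(np.mean(dropped == 0.0))
print(zero_fraction, dropped.mean())
```

At evaluation time the same function is called with `training=False`, so all neurons contribute to the prediction.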


We will then plot the accuracy and the loss for the training and validation data.

Figure: training and validation accuracy and loss curves for the network with dropout.

For a small number of hidden neurons, we observe that L2 regularization gives better accuracy than dropout.
