Classification on Diabetes Data set

Shonit Gangoly
16 min read · May 1, 2021


Using Classification models on Pima Indians Diabetes Data set

The goal of my project is to predict whether a patient has diabetes using supervised classification. Diabetes is a chronic disease that occurs when the pancreas is unable to produce enough insulin, so sugar builds up in the blood instead of being used by the body, eventually damaging organs and tissues.

The data set used is the Pima Indians Diabetes data set. It has 9 columns: 8 features that will be used to train the model, plus the target class:

  1. Pregnancies: Number of times pregnant
  2. Glucose: Concentration of glucose in the blood
  3. Blood Pressure: The diastolic blood pressure (mm Hg)
  4. Skin Thickness: Skin fold thickness (mm)
  5. Insulin: Insulin level in the blood (mu U/ml)
  6. BMI: Body Mass Index (Weight (kg) / Height (m)²)
  7. Diabetes Pedigree Function: Genetic history of diabetes in the family tree
  8. Age: Age (years)
  9. Final Class: Diabetes (1: Diabetes, 0: No diabetes)

Link to data set: https://www.kaggle.com/uciml/pima-indians-diabetes-database

Exploratory Data Analysis

We begin by importing the required libraries and loading the data CSV file into a data frame:

import pandas as pd
import numpy as np
import sklearn
import imblearn
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
data = pd.read_csv("./data/diabetes.csv")
col = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Diabetes"]
data.columns = col
data.info()
data.head()
Head of Data set

We can observe that all values are numerical; Diabetes is the binary target, with 1 for a positive and 0 for a negative diagnosis. We now use histograms to visualize the different columns and the values they contain.

data.hist(bins=50, figsize = (8.0, 6.0))
plt.tight_layout()
plt.show()
Features in data set

Several columns contain many values at 0; apart from columns like DiabetesPedigreeFunction and Age, which have none, the only column where zero is an acceptable value is Pregnancies. So I will change those zeros to NaN to give us a better visual idea.

col_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
for c in col_missing:
    data[c] = data[c].replace(to_replace=0, value=np.NaN)

data.hist(bins=50, figsize=(8.0, 6.0))
plt.tight_layout()
plt.show()

The plots show that the Diabetes class is skewed towards 0, i.e. there are far more entries for people without diabetes than for people with it.

num_diabetes = data["Diabetes"].sum()
num_no_diabetes = data.shape[0] - num_diabetes
perc_diabetes = num_diabetes / data.shape[0] * 100
perc_no_diabetes = num_no_diabetes / data.shape[0] * 100
print("There are %d (%.2f%%) people who have diabetes and the remaining %d (%.2f%%) who have not been diagnosed with the disease."
      % (num_diabetes, perc_diabetes, num_no_diabetes, perc_no_diabetes))

def plot_diabetes(normalize):
    plt.grid(False)
    data.Diabetes.value_counts(normalize=normalize).plot(
        kind="bar", grid=False,
        color=[sns.color_palette()[0], sns.colors.xkcd_rgb.get('orange')])
    plt.xticks([0, 1], ['No', 'Yes'], rotation=0)
    plt.xlabel("Diabetes")
    if normalize:
        plt.ylabel("Percentage")
    else:
        plt.ylabel("Count")

plt.subplot(1, 2, 1)
plot_diabetes(False)
plt.subplot(1, 2, 2)
plot_diabetes(True)
plt.tight_layout()
plt.show()
Chart to visualize how many have diabetes

Statistical Analysis

We will perform statistical analysis to get an idea of the mean, standard deviation, and outliers in the data set. This will also help us build correlation matrices to determine which features factor heavily into the prediction of diabetes.

We will be using:

  • Count: Number of observations
  • Mean: Mean of the values
  • Std: Standard deviation of the values
  • Min: Minimum value
  • Max: Maximum value
  • Q1: Lower quartile (25% of values fall below it)
  • Median: Center value
  • Q3: Upper quartile (75% of values fall below it)

data.describe().round(2)

So we can see that columns like Skin Thickness and Insulin have values that are much further from the mean, which indicates the presence of outliers. We will use a series of box plots to visualize them.

plt.figure(figsize=(7.0, 5.0))
for i in range(8):
    plt.subplot(2, 4, i + 1)
    plt.grid(True)
    sns.boxplot(x='Diabetes', y=data.columns[i], data=data)
    plt.xticks([0, 1], ['No', 'Yes'], rotation=0)
plt.tight_layout()
plt.show()
Box Plot for features

Correlation Analysis

Correlation Analysis is a statistical method for investigating the relationship between two numerical variables. We can do that by computing the Pearson’s correlation coefficient, which measures the strength of the linear relationship between two variables.

It is defined as follows:

ρ(X, Y) = cov(X, Y) / (σ_X σ_Y)

The coefficient can only take values between -1 and 1:

  • the closer the value to 1, the higher the positive linear relationship.
  • the closer the value to 0, the lower the linear relationship.
  • the closer the value to -1, the higher the negative linear relationship.
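
As a quick sanity check on the formula, the coefficient for a single pair of features can be computed by hand with NumPy. This is a minimal sketch using our data frame; the column pair is just an example, and pandas' built-in .corr() gives the same result:

import numpy as np

# Pearson's r for Glucose vs. Insulin, computed straight from the definition.
# Rows with missing values are dropped first so the covariance is well defined.
pair = data[["Glucose", "Insulin"]].dropna()
x, y = pair["Glucose"].values, pair["Insulin"].values

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))   # cov(X, Y)
rho = cov_xy / (x.std() * y.std())                  # cov(X, Y) / (sigma_X * sigma_Y)

print(round(rho, 2))
print(round(pair.corr().iloc[0, 1], 2))             # same value from pandas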

In order to visually investigate the correlation among all the features of our data set, I will display the heat map of the correlation matrix

plt.figure(figsize = (6.0, 5.0))
plt.grid(False)
plt.xticks(range(data.shape[1]), data.columns[0:], rotation=0)
plt.yticks(range(data.shape[1]), data.columns[0:], rotation=0)
sns.heatmap(data.corr(), cbar=True, annot=True, square=False, fmt='.2f', cmap=plt.cm.Reds, robust=False, vmin=0)
plt.show()
Heat Map for correlation

We observe little correlation between most features, except for a few pairs with coefficients greater than 0.5. These pairs are:

  • Age-Pregnancies: the number of pregnancies tends to increase with age and stop after a certain age
  • Glucose-Diabetes: a higher glucose level comes with a higher probability of being diagnosed with diabetes
  • Glucose-Insulin: a higher glucose level means more insulin
  • BMI-SkinThickness: the higher the BMI, the higher the body fat and hence the skin fold thickness

Observing these correlations using a scatter plot:

sns.pairplot(data.dropna(), vars=['Glucose', 'Insulin', 'BMI', 'SkinThickness'], height=2.0, diag_kind='kde', hue='Diabetes')
plt.tight_layout()
plt.show()
Scatter plot for correlation

As we can see, there is a positive linear relationship between Glucose and Insulin, and between BMI and SkinThickness. These features will therefore play an important role in predicting diabetes in a person.

Data Pre-Processing

We will start by splitting our data into an 80% training set and a 20% testing set using sklearn's train_test_split. We only have to specify the test size (0.2, i.e. 20%) and a fixed random_state, which sets the random seed so the split is reproducible across runs; stratifying on the target keeps the class ratio the same in both sets.

from sklearn.model_selection import train_test_split

X = data.drop(["Diabetes"], axis=1)
Y = data.Diabetes
X_Train, X_Test, y_train, y_test = train_test_split(X, Y, test_size=0.2,
                                                    random_state=42, shuffle=True,
                                                    stratify=Y)

As we have seen during the EDA, our data set contains many missing (NaN) values. The two most frequently used ways to deal with them are:

  • Removing the affected rows from the data set
  • Imputing the missing values with an estimate such as the median

Since the data set is small, I choose to replace the missing values with the median of each column.

from sklearn.impute import SimpleImputer

impute = SimpleImputer(missing_values=np.nan, strategy='median')
X_train_Impute = impute.fit_transform(X_Train)
# Use the medians learned on the training set to transform the test set
X_test_Impute = impute.transform(X_Test)

Normalizing the data

Standardizing data consists of centering the data and scaling each feature to unit variance:

x'_ij = (x_ij − μ_j) / σ_j, where μ_j and σ_j are the mean and standard deviation of feature j.

This way, all the features are brought onto the same scale.

from sklearn.preprocessing import StandardScaler

s = StandardScaler()
X_train_normal = s.fit_transform(X_train_Impute)
# Again, scale the test set with the statistics learned on the training set
X_test_normal = s.transform(X_test_Impute)
data_X_train_normal = pd.DataFrame(X_train_normal, columns=col[0:8], index=y_train.index)
data_y_train_normal = pd.DataFrame(y_train, columns=[col[8]])
data_train_normalized = data_X_train_normal.join(data_y_train_normal)

Our data set contains only 768 rows and, as the earlier charts showed, it is imbalanced: there are far more people without diabetes than with it. While this may reflect the real world, it is a problem for model training, because the imbalance can skew the classifier towards the majority class, and a small data set combined with a complex model makes over fitting more likely. I will therefore compare the accuracy obtained on the base data set with the accuracy obtained after applying two additional preprocessing techniques:

  • Principal Component Analysis
  • SMOTE (Synthetic Minority Over-sampling)

Principal Component Analysis

Principal Component Analysis is an unsupervised machine learning algorithm for reducing the dimensionality of the dataset.

It works by projecting the original dataset into a lower-dimensional space. The basis of this space is composed of a set of Principal Components that represent the directions of most variance in the data. All the Principal Components are also orthogonal to each other, so the correlation among the reduced features is minimized and, as a consequence, so is the redundancy of the information.

The algorithm consists of the following steps:

  • standardizing the dataset (as we have done previously), which is represented by the matrix X of size n×d, where n is the number of samples and d is the original number of features.
  • computing the sample covariance matrix Σ of the data of size d×d.
  • performing the eigen decomposition of Σ in order to obtain the corresponding set of eigenvalues and eigenvectors.
  • building the projection matrix W of size d×l containing on its columns the l eigenvectors of Σ corresponding to the eigenvalues of largest magnitude, where l is the desired number of features after the PCA transformation.
  • performing the PCA transformation in order to obtain the reduced dataset Z=X×W of size n×l.
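
To make these steps concrete, here is a minimal NumPy sketch of the same procedure applied to a standardized matrix. It is an illustration of the algorithm only; the actual transformation later in the post uses scikit-learn's PCA:

import numpy as np

def pca_project(X, l):
    """Project the standardized matrix X (n x d) onto its top-l principal components."""
    # Sample covariance matrix of size d x d; X is assumed already standardized.
    sigma = np.cov(X, rowvar=False)
    # Eigen decomposition; eigh is appropriate because sigma is symmetric.
    eigvals, eigvecs = np.linalg.eigh(sigma)
    # Sort eigenvectors by decreasing eigenvalue and keep the first l as the columns of W (d x l).
    order = np.argsort(eigvals)[::-1]
    W = eigvecs[:, order[:l]]
    # Reduced dataset Z = X W of size n x l.
    return X @ W

# e.g. Z = pca_project(X_train_normal, l=6)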

In order to choose the optimal number of reduced features, let's first check the amount of variance explained by each of the Principal Components.

from sklearn.decomposition import PCA

pca = PCA(whiten=True)
pca.fit(X_train_normal)
pca_evr = pca.explained_variance_ratio_
pca_evr_cum = np.cumsum(pca_evr)
x = np.arange(1, len(pca_evr) + 1)
y = np.linspace(0.1, 1, 10)
plt.bar(x, pca_evr, alpha=1, align='center', label='Individual')
plt.step(x, pca_evr_cum, where='mid', label='Cumulative', color=sns.colors.xkcd_rgb.get('dusty orange'))
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend()
plt.xticks(x)
plt.yticks(y)
plt.show()
PCA Step graph

We can see that the more Principal Components we retain, the more variance we preserve, but also the more features remain after applying PCA. I will keep 6 components.

pca = PCA(n_components=6)
X_train_pca = pca.fit_transform(X_train_normal)
# Project the test set with the components learned on the training set
X_test_pca = pca.transform(X_test_normal)
print("Train Set columns: ", X_train_pca.shape[1])
print("Test Set Columns: ", X_test_pca.shape[1])

So we reduced the dimensionality of our data set from 8 columns to 6.

SMOTE: Synthetic Minority Oversampling Technique

The Synthetic Minority Oversampling TEchnique is an oversampling approach in which the minority class is oversampled by creating synthetic data samples. In particular, it works by taking each minority class sample and introducing synthetic data samples along the line segments joining any/all of the k minority class nearest neighbors.

Synthetic samples are generated in the following way:

  • Take the difference between the minority class sample under consideration and one of its k nearest neighbors.
  • Multiply this difference by a random number between 0 and 1, and add it to the minority class sample under consideration.
  • This selects a random point along the line segment between the two data points, where a new synthetic minority class sample is created.
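
The core interpolation step can be written in a couple of lines of NumPy. This is a minimal sketch of the idea only, with a hypothetical helper name; the actual oversampling is done with imblearn's SMOTE:

import numpy as np

rng = np.random.default_rng(42)

def smote_sample(x, neighbor):
    """Create one synthetic sample on the segment between a minority sample and one of its neighbors."""
    gap = rng.random()                 # random number in [0, 1)
    return x + gap * (neighbor - x)    # point along the line segment joining the two samples

# e.g. synthetic = smote_sample(minority_point, nearest_neighbor_point)

imblearn's SMOTE applies this to every minority class sample, using its k nearest neighbors:
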
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train_normal, y_train)
data_X_train_smote = pd.DataFrame(X_train_smote, columns=col[0:8])
data_y_train_smote = pd.DataFrame(y_train_smote, columns=[col[8]])
data_train_smote = data_X_train_smote.join(data_y_train_smote)
num_diabetes_smote = data_train_smote["Diabetes"].sum()
num_no_diabetes_smote = data_train_smote.shape[0] - num_diabetes_smote
perc_diabetes_smote = num_diabetes_smote / data_train_smote.shape[0] * 100
perc_no_diabetes_smote = num_no_diabetes_smote / data_train_smote.shape[0] * 100
print("There are %d (%.2f%%) people with diabetes and %d (%.2f%%) people without diabetes."
      % (num_diabetes_smote, perc_diabetes_smote, num_no_diabetes_smote, perc_no_diabetes_smote))

def plot_diabetes_value_counts(normalize):
    plt.grid(False)
    data_train_smote['Diabetes'].value_counts(normalize=normalize).plot(
        kind="bar", grid=False,
        color=[sns.color_palette()[0], sns.colors.xkcd_rgb.get('orange')])
    plt.xticks([0, 1], ['No', 'Yes'], rotation=0)
    plt.xlabel("Diabetes")
    if normalize:
        plt.ylabel("Percentage")
    else:
        plt.ylabel("Count")

plt.subplot(1, 2, 1)
plot_diabetes_value_counts(False)
plt.subplot(1, 2, 2)
plot_diabetes_value_counts(True)
plt.tight_layout()
plt.show()
Balanced Classes

Classification

I will experiment with two classifiers and decide which one would be best to use.

I used the following classifiers:

  • k-Nearest Neighbors
  • Decision Tree

For each classifier I will fit three different models: one using the original dataset, one using the dataset balanced by SMOTE and one using the dataset reduced with PCA.

In order to properly select the hyperparameters for each classifier I will use a GridSearch 5-fold Cross-Validation approach, which allows us to estimate which hyperparameters give the best generalized results in a more reliable way than just trying several configurations on a single training/validation split.

The 5-fold Cross-Validation consists of the following steps:

  • splitting the dataset into 5 folds of equal size.
  • building 5 different models using the 5 different possible combinations of 4 of the 5 folds as training data.
  • evaluating each model using the remaining fold that was not used to train that model as validation data and obtaining its score value.
  • averaging all the score values obtained in order to provide a good estimation of the generalization performance of the model.

The GridSearch 5-fold Cross-Validation consists of applying the 5-fold Cross-Validation algorithm explained before for each of the possible configurations of hyperparameters that we want to test. As a result, the mean validation score is obtained for each parameter setting and the one that gave the best results is chosen.

Given the class imbalance, I decided to use different validation scores depending on the type of dataset used:

  • for the original dataset and the dataset reduced by PCA I selected the F1 score, which takes into account the class imbalance and the Precision-Recall trade-off.
  • for the dataset oversampled by SMOTE I selected the Accuracy because the samples of both classes are balanced by the algorithm.
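
The grid_search_cross_validation helper used later in the post is not shown in full. A minimal sketch of what it might do with scikit-learn's GridSearchCV follows; the mapping of estimator names to dataset variants, and the helper's exact signature, are assumptions on my part, while the scoring choices follow the bullet points above:

from sklearn.model_selection import GridSearchCV

def grid_search_cross_validation(estimator, param_grid, estimator_names):
    """Fit one GridSearchCV per dataset variant and return the best estimator of each."""
    # Assumed variants: original, SMOTE-balanced and PCA-reduced training sets.
    datasets = {
        estimator_names[0]: (X_train_normal, y_train, 'f1'),            # original: F1 score
        estimator_names[1]: (X_train_smote, y_train_smote, 'accuracy'),  # SMOTE: Accuracy
        estimator_names[2]: (X_train_pca, y_train, 'f1'),                # PCA: F1 score
    }
    best_estimators = {}
    for name, (X_tr, y_tr, scoring) in datasets.items():
        grid = GridSearchCV(estimator, param_grid, scoring=scoring, cv=5)
        grid.fit(X_tr, y_tr)
        print(name, grid.best_params_, round(grid.best_score_, 3))
        best_estimators[name] = grid.best_estimator_
    return best_estimators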

Finally, after having fitted the models using the hyperparameters' configurations suggested by the GridSearch 5-fold Cross-Validation, I will evaluate them by having them make predictions on the data samples of the test set.

In order to properly evaluate each model I will use several metrics, which are defined through the following values:

  • TP: the number of positive samples that have been correctly classified as positive. In our case the positive samples are those with a positive diagnosis of diabetes.
  • TN: the number of negative samples that have been correctly classified as negative. In our case the negative samples are those who have not been diagnosed with diabetes.
  • FP: the number of negative samples that have been wrongly classified as positive.
  • FN: the number of positive samples that have been wrongly classified as negative.

The metrics that I will use are the following:

  • Accuracy: (TP+TN)/(TP+TN+FP+FN) (How accurate the model is)
  • Precision: TP/(TP+FP) (Quantifies the number of positive class predictions that actually belong to the positive class)
  • Recall: TP/(TP+FN) (Quantifies the number of positive class predictions made out of all positive examples in the dataset)
  • F1 Score: 2TP/(2TP+FP+FN) (F1 score provides a single score that balances both the concerns of precision and recall in one number)
  • Confusion matrix: It is a square matrix that visually reports the counts of the true positive, true negative, false positive, and false negative predictions of a classifier.

Each of these metrics provides different information about the performance of a classifier: Accuracy is usually a good metric when dealing with balanced datasets, while the others are more suitable when there is a class skew or when there are differential misclassification costs, as in our case.
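
The evaluate_test_results helper used later is not shown in the post; computing these metrics for one model takes only a few lines with scikit-learn. This is a minimal sketch, where y_pred stands for the vector of test-set predictions:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

def evaluate(y_true, y_pred):
    """Print the four scores and the confusion matrix for one set of predictions."""
    print("Accuracy :", round(accuracy_score(y_true, y_pred), 3))
    print("Precision:", round(precision_score(y_true, y_pred), 3))
    print("Recall   :", round(recall_score(y_true, y_pred), 3))
    print("F1 score :", round(f1_score(y_true, y_pred), 3))
    # Rows are the true classes, columns the predicted classes: [[TN, FP], [FN, TP]]
    print(confusion_matrix(y_true, y_pred))

# e.g. evaluate(y_test, y_pred)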

1. K-Nearest Neighbors

k-Nearest Neighbors is a simple Machine Learning algorithm for classification.

The learning phase consists only of storing the training set.

Then, in order to predict the class of a new data point, the algorithm finds the closest data points to it in the training set, its nearest neighbors, and assigns it the class to which the majority of them belong.

The number of nearest neighbors that the algorithm considers in order to predict a class is chosen by the user through the hyper-parameter k.

Choosing the right value of k is crucial to find a good balance between over fitting and under fitting:

  • too low a value of k results in very good performance when classifying samples of the training set, but worse performance on samples never seen during training. This happens because the classifier does not generalize well: it is over fitting the training data.
  • too high a value of k results in a model that is too simple to discriminate properly between the classes. This causes under fitting.
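
One quick way to see this trade-off is to compare training and cross-validation accuracy for a range of k values. This is a minimal sketch on our standardized training set; the systematic tuning is done with the grid search below:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

for k in [1, 5, 15, 51]:
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_score = cross_val_score(knn, X_train_normal, y_train, cv=5).mean()
    train_score = knn.fit(X_train_normal, y_train).score(X_train_normal, y_train)
    # A large gap between the two scores (typical for very small k) suggests over fitting;
    # both scores dropping together (very large k) suggests under fitting.
    print("k=%2d  train=%.2f  cv=%.2f" % (k, train_score, cv_score))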

The hyper-parameters that I will test using the Grid-Search Cross-Validation are the following:

  • n_neighbors: it is the value of k. It represents the number of neighbors that the algorithm considers in order to classify data points.
  • weights: whether to consider all the neighbors equally or to give more weight to the nearest ones.
from sklearn.neighbors import KNeighborsClassifier

knn_param_grid = {
    'n_neighbors': [5, 9, 15, 21],
    'weights': ['uniform', 'distance']
}
knn_estimator_names = get_estimator_names("kNN")
knn_best_estimators = grid_search_cross_validation(KNeighborsClassifier(), knn_param_grid, knn_estimator_names)
kNN with hyper-parameters

Evaluating the model

knn_test_predictions = test_predictions(knn_best_estimators)
print_compared_cofusion_matrices(knn_test_predictions, knn_estimator_names)
Confusion Matrix kNN
df_knn_overall_results = evaluate_test_results( knn_test_predictions, knn_estimator_names )
df_knn_overall_results
Performance of kNN
plot_learning_curve( knn_best_estimator, X_train_normal, knn_best_estimator_name, 'lower right' )
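
The plot_learning_curve helper is not defined in the post. A minimal sketch of what it might look like, built on sklearn.model_selection.learning_curve, is given below; the function name and argument order are assumptions that mirror the call above, and y_train is taken from the notebook scope:

from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, X, title, legend_loc='lower right'):
    """Plot mean training and cross-validation scores against the training-set size."""
    # y_train comes from the surrounding notebook scope, matching how the helper is called above.
    sizes, train_scores, val_scores = learning_curve(estimator, X, y_train, cv=5,
                                                     train_sizes=np.linspace(0.1, 1.0, 8))
    plt.plot(sizes, train_scores.mean(axis=1), 'o-', label='Training score')
    plt.plot(sizes, val_scores.mean(axis=1), 'o-', label='Cross-validation score')
    plt.xlabel('Training samples')
    plt.ylabel('Score')
    plt.title(title)
    plt.legend(loc=legend_loc)
    plt.show()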

kNN trained with the original dataset and the one reduced with PCA provided similar results: they both scored acceptable overall Accuracy and Precision. Their downside is that, with a very low Recall score, they miss over 40% of the people with a positive diagnosis, which is not acceptable in our case.

Instead, kNN using SMOTE provides a lower overall Accuracy and a lower Precision score but, with the highest Recall score, it is the one that misses the lowest number of people with diabetes, which makes this classifier the most suitable for our purposes among the three.

The learning curve shows that both the training score and the validation score increase with the number of training samples, but using more than about 350 samples does not seem to improve the classifier further, as both scores remain almost steady.

Learning Curve for kNN

Decision Tree

The Decision Tree is a particular kind of classifier which is represented by a tree of finite depth. Every node of the tree specifies a test involving an attribute and every branch descending from that node matches one of the possible outcomes of the test.

Classifying an instance means performing a sequence of tests, starting at the root node and terminating at a leaf node, which represents a class.

Decision Trees are induced from a training set using a top-down approach, from the root to the leaves, by recursively binary splitting the predictor space. The attribute selected for each split is the one that partitions the training set into subsets that are as pure as possible.

The most popular measures of purity are:

  • Entropy
  • Gini
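
For a node whose samples have class proportions p, the two impurity measures can be computed directly. This is a minimal sketch of the formulas used by the classifier below:

import numpy as np

def gini(p):
    """Gini impurity: 1 - sum(p_k^2); 0 for a pure node, maximal for a uniform class mix."""
    p = np.asarray(p)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy: -sum(p_k * log2(p_k)); also 0 for a pure node."""
    p = np.asarray(p)
    p = p[p > 0]                      # skip zero proportions to avoid log(0)
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))   # most impure two-class node: 0.5 and 1.0
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))   # pure node: both 0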

The hyper-parameters that I will test with the Grid-Search Cross-Validation are the following:

  • criterion: the split criterion to be used.
  • max_depth: the maximum depth of the Tree.
  • class_weight: whether or not to take into account the class imbalance.
from sklearn.tree import DecisionTreeClassifier

decision_tree_param_grid = {
    'max_depth': [10, 15, 20, None],
    'criterion': ['gini', 'entropy'],
    'class_weight': [None, 'balanced']
}
decision_tree_estimator_names = get_estimator_names("Decision Tree")
decision_tree_best_estimators = grid_search_cross_validation(DecisionTreeClassifier(random_state=42), decision_tree_param_grid, decision_tree_estimator_names)
Decision tree with hyper parameters
decision_tree_test_predictions = test_predictions(decision_tree_best_estimators)
print_compared_cofusion_matrices(decision_tree_test_predictions, decision_tree_estimator_names)
Confusion Matrix Decision Tree
df_decision_tree_overall_results = evaluate_test_results( decision_tree_test_predictions, decision_tree_estimator_names )
df_decision_tree_overall_results
plot_learning_curve( decision_tree_best_estimator, X_train_normal, decision_tree_best_estimator_name, 'center left' )

As we can see, the only classifier that scored acceptable results is the one trained with the original dataset, which is the best among the three.

Learning Curve for DT

Final Comparison between the two models

df_results = evaluate_best_estimators_results( best_estimators )
df_results

From the final results we can see that reducing the dataset using PCA did not improve the performance of either classifier. On the other hand, balancing the dataset using SMOTE provided better results with kNN.

We can conclude that kNN with SMOTE yields better Accuracy and Recall scores, so it is more effective at predicting diabetes in patients.

Experiments

  • Used Principal Component Analysis (PCA), which reduces the dimensionality of the data set. Since I was dealing with 8 feature dimensions, I tried reducing them to 6 so that model computations would become cheaper and the runtime shorter. However, this was not effective: nearly all of the columns have a strong effect on the prediction of diabetes, so accuracy dropped.
  • Included hyper-parameter tuning via Grid Search 5-fold cross-validation, which estimates the hyper-parameters that best suit the data. For k-Nearest Neighbors these were k, which decides how many neighbors to consider, and weights, which decides whether all neighbors count equally or closer neighbors count more. For Decision Trees they were the split criterion (Entropy or Gini index, which measure node purity), the maximum depth, and the class weighting.

Challenges Faced

  • The original data set was quite imbalanced, so I used SMOTE (Synthetic Minority Oversampling Technique) to balance the classes. It does this by creating synthetic samples of the minority class, interpolated between existing minority samples and their nearest neighbors, until the classes are balanced.
  • Another issue with the data set was that the features were scaled very differently. This is a hindrance for kNN, which relies solely on distances between feature vectors. It was overcome by using sklearn's standardization, which brings all features onto the same scale.
  • The next challenge was the large amount of missing data. Since missing values are useless when computing distances, I used sklearn's imputer to fill them with the median value of each column.
