First Project – Predicting Diabetes in Pima People

In [1]:
#import pandas for data manipulation
import pandas as pd

#import data
data = pd.read_csv('diabetes.csv')

#change float64 values to integers for comparisons to work correctly
data.loc[:, ['BMI','DiabetesPedigreeFunction']] = data.loc[:, ['BMI','DiabetesPedigreeFunction']].astype(int)

After reviewing the dataset, there were no missing datapoints recorded as None or NaN. However, many of the features are physical measurements, and those columns contain a surprising number of 0 values. Looking at the columns for glucose (blood concentration), blood pressure, skin thickness, insulin (blood concentration), BMI (computed from height and weight), and age, a 0 in any of them almost certainly represents a missing measurement, so I decided that any row with 3 or more 0s across those columns should be left out.

In [2]:
#for columns Glucose through Age, remove rows with 3 or more 0's across those columns
data = data.loc[(data.loc[:, 'Glucose':'Age'] == 0).sum(axis=1) < 3, :]

#split into independent (feature) and dependent (outcome) variables
x = data.iloc[:, 0:8].values
y = data.iloc[:, 8].values
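
As a quick sanity check (an added cell, not part of the original run), printing the shape shows how many of the 768 rows in the raw CSV survive the filter:

#sanity check: rows and columns remaining after the filter
print(data.shape)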

To prevent a conversion warning during imputing and scaling, the x array will be cast to float.

In [3]:
#change all x values to float for imputing and scaling
x = x.astype(float)

Now that the rows with 3 or more 0 values across the aforementioned columns are gone, I assumed it was safe to replace the remaining few 0 values with the respective column means.

In [4]:
#replace remaining glucose and skin thickness 0's with the respective column means
from sklearn.preprocessing import Imputer
xputer = Imputer(missing_values=0, strategy='mean', axis=0)
xputer = xputer.fit(x[:,[1,3]])
x[:,[1,3]] = xputer.transform(x[:,[1,3]])
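
A note if you are on a newer scikit-learn: Imputer was removed in version 0.22 in favor of SimpleImputer, which drops the axis argument. A minimal equivalent of the cell above would be:

#modern equivalent of the Imputer cell above (scikit-learn >= 0.20)
from sklearn.impute import SimpleImputer
xputer = SimpleImputer(missing_values=0, strategy='mean')
x[:, [1, 3]] = xputer.fit_transform(x[:, [1, 3]])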

The data is now ready to be split into training and test sets; the features are then scaled before fitting the model.

In [5]:
#train test split
from sklearn.cross_validation import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=1)


#feature scaling x
from sklearn.preprocessing import StandardScaler
xscaler = StandardScaler()
xtrain = xscaler.fit_transform(xtrain)
xtest = xscaler.transform(xtest)

#fit model
from sklearn.svm import SVC
classifier = SVC(kernel='linear')
classifier.fit(xtrain,ytrain)
Out[5]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
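
One caveat if you run this notebook today: the sklearn.cross_validation module imported above (and again for cross_val_score below) was removed in scikit-learn 0.20, and the same functions now live in sklearn.model_selection:

#modern import locations (scikit-learn >= 0.20)
from sklearn.model_selection import train_test_split, cross_val_score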

With the classifier trained, we are ready to make predictions on the test set!

In [6]:
#predict
ypred = classifier.predict(xtest)

Using the confusion matrix, we can see how well it performed.

In [7]:
#confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(ytest,ypred)
print(cm)
[[69  7]
 [14 19]]

The diagonal holds the correct predictions: 69 true negatives and 19 true positives out of 109 test samples, i.e. (69 + 19) / 109 ≈ 80.7% accuracy. Let’s confirm this as a single score.

In [8]:
#single score
score = classifier.score(xtest,ytest)
print('single score: {0:.2f}%'.format(score*100))
single score: 80.73%

In this instance the classifier does a decent job of predicting the outcomes correctly. Now we will get a more realistic estimate with 5-fold cross-validation.

In [9]:
#cross-validation score
from sklearn.cross_validation import cross_val_score
scores = cross_val_score(classifier, x, y, cv=5)
print('cv score: {0:.2f}% +/- {1:.2f}%'.format(scores.mean()*100,scores.std()*200))
cv score: 77.48% +/- 6.04%
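
One methodological caveat: cross_val_score is run on the unscaled x here, while the single score above used scaled features. To apply scaling inside each fold (and avoid leaking validation-fold statistics into training), the scaler and classifier can be wrapped in a Pipeline. A minimal sketch, assuming a current scikit-learn:

#cross-validation with scaling fit separately on each training fold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC(kernel='linear'))])
scores = cross_val_score(pipe, x, y, cv=5)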

This result is lower than the single score, but it gives more insight into how well the classifier performs, which is still quite decent. The model could likely be improved with better hyperparameters, but the purpose here was to demonstrate the process of building the model. I also ran logistic regression, k-nearest neighbors, and random forest classifiers, but they did not perform as well as the linear support vector machine classifier; a sketch of that comparison follows.
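
For reference, the comparison could look like the following sketch (default hyperparameters are assumed here, so the scores may differ from the runs mentioned above):

#cross-validate the other classifiers mentioned above
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

for name, clf in [('logistic regression', LogisticRegression()),
                  ('k-nearest neighbors', KNeighborsClassifier()),
                  ('random forest', RandomForestClassifier())]:
    #scale within each fold, as with the SVC pipeline above
    pipe = Pipeline([('scale', StandardScaler()), ('clf', clf)])
    scores = cross_val_score(pipe, x, y, cv=5)
    print('{0}: {1:.2f}%'.format(name, scores.mean() * 100))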

Dataset – https://www.kaggle.com/uciml/pima-indians-diabetes-database
