Analyzing Pima-Indians-Diabetes-Data using Python

  1. Pregnancies: Number of times pregnant
  2. Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  3. BloodPressure: Diastolic blood pressure (mm Hg)
  4. SkinThickness: Triceps skin-fold thickness (mm)
  5. Insulin: 2-hour serum insulin (mu U/ml)
  6. BMI: Body mass index (weight in kg/(height in m)²)
  7. DiabetesPedigreeFunction: Diabetes pedigree function
  8. Age: Age in years
  9. Outcome: Class variable (0 or 1)
import numpy as np
import pandas as pd
from pandas import read_csv
# Specify the file name
filename = 'diabetes.csv'
# Read the data
data = read_csv(filename)
# Determine the type of 'data'
type(data)
pandas.core.frame.DataFrame
# Determine the shape of the DataFrame
data.shape
(768, 9)
data.head()
Table 1: Output of data.head()
# Get the column names 
col_idx = data.columns
print(col_idx)
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')
# Get row indices 
row_idx = data.index
print(row_idx)
RangeIndex(start=0, stop=768, step=1)
# Find data type for each attribute 
print(data.dtypes)
Pregnancies int64
Glucose int64
BloodPressure int64
SkinThickness int64
Insulin int64
BMI float64
DiabetesPedigreeFunction float64
Age int64
Outcome int64
dtype: object
# Generate statistical summary 
description = data.describe()
print(description)
Table 2: Output of data.describe()
# Count the number of records in each class
class_counts = data.groupby('Outcome').size()
print("Class breakdown of the data:\n")
print(class_counts)
Class breakdown of the data:

Outcome
0 500
1 268
dtype: int64
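The counts above show that the two classes are not balanced. A minimal sketch of expressing the same breakdown as proportions follows. It assumes a hypothetical stand-in frame, `toy`, rebuilt from the class counts printed above rather than read from diabetes.csv.

```python
import pandas as pd

# Stand-in for the real data (an assumption for illustration): an Outcome
# column rebuilt from the class counts printed above (500 zeros, 268 ones).
toy = pd.DataFrame({"Outcome": [0] * 500 + [1] * 268})

# Same breakdown as groupby('Outcome').size(), as fractions of all rows
class_share = toy["Outcome"].value_counts(normalize=True).sort_index()
print(class_share.round(3))
```

On the real DataFrame, the same call would be `data['Outcome'].value_counts(normalize=True)`.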
# Compute the correlation matrix
correlations = data.corr(method='pearson')
print(correlations)
Table 3: Correlation matrix
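To make clear what each entry of the matrix represents, here is a rough sketch that computes one Pearson coefficient by hand and compares it with the pandas result. The `toy` frame and its `x` and `y` columns are hypothetical values, not taken from diabetes.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from diabetes.csv
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                    "y": [2.0, 1.0, 4.0, 3.0, 6.0]})

# Deviations of each column from its mean
dx = toy["x"] - toy["x"].mean()
dy = toy["y"] - toy["y"].mean()

# Pearson r: sum of cross-deviations divided by the product of the
# root sums of squared deviations
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same entry as computed by pandas
r_pandas = toy.corr(method="pearson").loc["x", "y"]
print(r_manual, r_pandas)
```

The coefficient always lies between -1 and 1, which is why the correlation plot later fixes the color scale with `vmin=-1, vmax=1`.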
# Compute the skew of each attribute
skew = data.skew()
print("Skew of attribute distributions in the data:\n")
print(skew)
Skew of attribute distributions in the data:

Pregnancies 0.901674
Glucose 0.173754
BloodPressure -1.843608
SkinThickness 0.109372
Insulin 2.272251
BMI -0.428982
DiabetesPedigreeFunction 1.919911
Age 1.129597
Outcome 0.635017
dtype: float64
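A positive value above indicates a longer right tail and a negative value a longer left tail. As a sketch of where these numbers come from, the snippet below reproduces the bias-corrected sample skewness that `skew()` reports. The series `s` is hypothetical toy data, not taken from diabetes.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed toy sample, not taken from diabetes.csv
s = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 12.0])
n = len(s)

# Second and third central moments (biased: divided by n)
d = s - s.mean()
m2 = (d ** 2).mean()
m3 = (d ** 3).mean()

# Adjusted Fisher-Pearson coefficient (bias-corrected sample skewness)
skew_manual = np.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5

print(skew_manual, s.skew())
```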
  • Histograms
  • Density Plots
  • Box and Whisker Plots
# Import required package 
from matplotlib import pyplot
# Set the figure size
pyplot.rcParams['figure.figsize'] = [20, 10]
# Draw histograms for all attributes
data.hist()
pyplot.show()
Fig. 1: Histograms for all attributes
# Density plots for all attributes (kind='density' requires SciPy)
data.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
pyplot.show()
Fig. 2: Density plots for all attributes
# Draw box and whisker plots for all attributes
data.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False)
pyplot.show()
Fig. 3: Box and whisker plots for all attributes
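The points drawn beyond the whiskers in a box plot are, with matplotlib's default setting (`whis=1.5`), the values lying more than 1.5 times the interquartile range outside the quartiles. A minimal sketch of that rule follows, using a hypothetical toy series `s` rather than a column from diabetes.csv.

```python
import pandas as pd

# Hypothetical toy sample, not taken from diabetes.csv
s = pd.Series([1, 2, 3, 4, 5, 100])

# Quartiles and interquartile range
q1 = s.quantile(0.25)
q3 = s.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are the ones plotted as individual points
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # → [100]
```

Applied to a real column, for example `data['Insulin']`, the same filter would list the records shown as individual points in Fig. 3.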
  • Scatter Plot Matrix
  • Correlation Matrix Plot
# Compute the correlation matrix between all pairs of attributes
correlations = data.corr(method='pearson')
# Import required package
import numpy as np
# Plot the correlation matrix
fig = pyplot.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,9,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
names = data.columns
# Rotate x-tick labels by 90 degrees
ax.set_xticklabels(names,rotation=90)
ax.set_yticklabels(names)
pyplot.show()
Fig. 4: Plot of the correlation matrix
# Import required package 
from pandas.plotting import scatter_matrix
pyplot.rcParams['figure.figsize'] = [20, 20]
# Plot the scatterplot matrix
scatter_matrix(data)
pyplot.show()
Fig. 5: Scatterplot Matrix
