# Analyzing Pima-Indians-Diabetes-Data using Python

In this tutorial, we will learn how to analyze Pima-Indians-Diabetes-Data (in .csv format) using Python’s Pandas. This dataset consists of several medical predictor (independent) variables and one target (dependent) variable, Outcome. Independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

The columns of this dataset are as follows:

- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin-fold thickness (mm)
- Insulin: 2-hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg/(height in m)²)
- DiabetesPedigreeFunction: Diabetes pedigree function
- Age: Age in years
- Outcome: Class variable (0 or 1)

The first eight columns represent the independent variables, and the last column denotes the binary dependent variable. There are a total of 768 entries in the dataset. The outcome variable is set to 1 for 268 entries, and the rest are set to 0.

The dataset used in this tutorial can be downloaded from *here*.

**Analyzing Pima-Indians-Diabetes-Data**

First, we import the required packages.

    import numpy as np
    import pandas as pd
    from pandas import read_csv

Now, we load the input CSV file. We need to specify the file name.

    # Specify the file name
    filename = 'diabetes.csv'

We use the Pandas *read_csv()* function to read the input file.

    # Read the data
    data = read_csv(filename)
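To see what *read_csv()* does without needing the file on disk, here is a minimal sketch that parses a small in-memory CSV. The two data rows are hypothetical sample values, not taken from the dataset; only the header line mirrors the column names used in this tutorial.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row CSV; only the header mirrors diabetes.csv
csv_text = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
2,120,70,30,0,32.5,0.45,41,1
0,95,64,25,0,27.1,0.30,29,0
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(StringIO(csv_text))

# The first line is used as the header, so two data rows remain
print(sample.shape)
```

By default the first line of the file is treated as the header, which is why the column names do not count towards the number of rows.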

We can check the type of the variable ‘data’ using the *type()* function available in Python.

    # Determine the type of 'data'
    type(data)

The output is as follows:

`pandas.core.frame.DataFrame`

Therefore, it is a Pandas DataFrame.

We can determine the shape of the DataFrame, that is, its number of rows and columns, using the *shape* attribute.

    # Determine the shape of the DataFrame
    data.shape

The output is shown below:

`(768, 9)`

Using the Pandas *head()* method, we can print the first five rows of a DataFrame.

    data.head()

The output is a table showing the first five rows with all nine columns.

We can determine the column names using the *columns* attribute of DataFrame. The column names are also known as the attribute names.

    # Get the column names
    col_idx = data.columns
    print(col_idx)

The output of the above print statement is shown below:

    Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
           'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
          dtype='object')

We can also determine the row indices using the *index* attribute of DataFrame.

    # Get row indices
    row_idx = data.index
    print(row_idx)

The output of the above print statement is shown below:

`RangeIndex(start=0, stop=768, step=1)`

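Here, the indices start at 0 and stop before 768, because the stop value is exclusive. The last index is therefore 767, and the input CSV file contains 768 data rows, which matches the shape reported earlier.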
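A RangeIndex follows the same half-open convention as Python's built-in *range()*: the start is included and the stop is not. The sketch below illustrates this on a hypothetical three-row frame that stands in for the diabetes data.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the diabetes data
toy = pd.DataFrame({"Glucose": [148, 85, 183]})

idx = toy.index          # a RangeIndex with start=0, stop=3, step=1
n_rows = len(idx)        # number of rows
last_label = idx[-1]     # largest index label

print(idx)
print(n_rows, last_label)
```

The number of rows equals the stop value, while the last label is one less than the stop value.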

The *dtypes* attribute of DataFrame can be used to find the data type of each attribute.

    # Find data type for each attribute
    print(data.dtypes)

The output of the above *print* statement is as follows:

    Pregnancies                   int64
    Glucose                       int64
    BloodPressure                 int64
    SkinThickness                 int64
    Insulin                       int64
    BMI                         float64
    DiabetesPedigreeFunction    float64
    Age                           int64
    Outcome                       int64
    dtype: object

**Descriptive Statistics using Pandas**

The *describe()* function on the Pandas DataFrame lists 8 statistical properties of each attribute. They are Count, Mean, Standard Deviation, Minimum Value, 25th Percentile, 50th Percentile (Median), 75th Percentile, Maximum Value.

    # Generate statistical summary
    description = data.describe()
    print(description)
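The result of *describe()* is itself a DataFrame, so individual statistics can be looked up by label. The following minimal sketch uses hypothetical values for two of the attributes to show the layout of the summary.

```python
import pandas as pd

# Hypothetical values for two of the attributes
toy = pd.DataFrame({
    "Glucose": [90, 100, 110, 120, 130],
    "BMI": [22.0, 25.0, 28.0, 31.0, 34.0],
})

description = toy.describe()

# Rows of the summary are the statistics; columns are the attributes
stat_names = list(description.index)
glucose_median = description.loc["50%", "Glucose"]

print(stat_names)
print(glucose_median)
```

Each of the eight statistics is one row of the summary, so a single value is addressed by a row label such as "50%" and a column name.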

**Distribution of the ‘Outcome’ attribute**

The data considered here is an example of classification data. We can get an idea of the distribution of the ‘Outcome’ attribute in Pandas.

    class_counts = data.groupby('Outcome').size()
    print("Class breakdown of the data:\n")
    print(class_counts)

The output of the code segment is mentioned below:

    Class breakdown of the data:

    Outcome
    0    500
    1    268
    dtype: int64

Therefore, of the 768 entries in the dataset, 268 have the outcome variable set to 1 and the remaining 500 have it set to 0.
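The same breakdown can be expressed as proportions, which makes the class imbalance easier to judge. This is a minimal sketch that rebuilds only the Outcome column from the counts above rather than reading the file; the variable names are illustrative.

```python
import pandas as pd

# Rebuild only the Outcome column from the counts reported above
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")

# normalize=True returns fractions instead of raw counts
class_share = outcome.value_counts(normalize=True).sort_index()

print(class_share.round(3))
```

Roughly 65% of the entries belong to class 0 and 35% to class 1, so the two classes are not evenly represented.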

**Correlation between all pairs of attributes**

We can use the *corr()* method of the Pandas DataFrame to calculate a correlation matrix. Pearson's correlation coefficient, which measures the strength of the linear relationship between two attributes, is used here. It is easiest to interpret when the attributes involved are roughly normally distributed. A correlation of -1 or 1 indicates a perfect negative or positive linear correlation, respectively, while a value of 0 indicates no linear correlation at all.

    # Compute correlation matrix
    correlations = data.corr(method='pearson')
    print(correlations)

The output is a 9 × 9 matrix, with one row and one column for each attribute.
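To make the meaning of a single entry of the matrix concrete, the sketch below computes Pearson's coefficient by hand for two hypothetical attributes and compares it with the value returned by *corr()*. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values for two attributes
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 5.0, 4.0, 5.0],
})

# Deviations of each value from its column mean
dx = toy["x"] - toy["x"].mean()
dy = toy["y"] - toy["y"].mean()

# Pearson's r: sum of products of deviations, divided by the square root
# of the product of the summed squared deviations
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same entry taken from the pandas correlation matrix
r_pandas = toy.corr(method="pearson").loc["x", "y"]

print(r_manual, r_pandas)
```

Both values agree up to floating-point rounding, and the diagonal of the matrix is always 1 because every attribute is perfectly correlated with itself.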

**Skew of attribute distributions**

The skew of each attribute can be calculated using the *skew()* function on the Pandas DataFrame.

    skew = data.skew()
    print("Skew of attribute distributions in the data:\n")
    print(skew)

The outputs of the above code segment are shown below:

    Skew of attribute distributions in the data:

    Pregnancies                 0.901674
    Glucose                     0.173754
    BloodPressure              -1.843608
    SkinThickness               0.109372
    Insulin                     2.272251
    BMI                        -0.428982
    DiabetesPedigreeFunction    1.919911
    Age                         1.129597
    Outcome                     0.635017
    dtype: float64

A positive value represents a right-skewed distribution, and a negative value denotes a left-skewed distribution. Values closer to zero correspond to a less skewed distribution.
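The sign convention can be checked on small samples. The sketch below uses hypothetical values: one sample with a long right tail, its mirror image, and a symmetric sample.

```python
import pandas as pd

# Hypothetical samples: a long right tail, its mirror image, and a symmetric one
right_tailed = pd.Series([1, 1, 2, 2, 3, 10])
left_tailed = -right_tailed
symmetric = pd.Series([1, 2, 3, 4, 5])

print(right_tailed.skew())   # positive
print(left_tailed.skew())    # negative, same magnitude
print(symmetric.skew())      # zero
```

Mirroring a sample flips the sign of its skew without changing the magnitude. In the output above, Insulin is strongly right-skewed and BloodPressure is strongly left-skewed.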

**Visualizing data using Pandas**

Now, we visualize data using Python’s Pandas library. We discuss both univariate plots and multivariate plots using Pandas.

**Univariate plots:**

- Histograms
- Density Plots
- Box and Whisker Plots

**Histograms**

The distribution of each attribute can easily be visualized by plotting histograms.

    # Import required package
    from matplotlib import pyplot

    # Set the figure size
    pyplot.rcParams['figure.figsize'] = [20, 10]

    # Draw histograms for all attributes
    data.hist()
    pyplot.show()
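A histogram is a count of how many values fall into each of a set of equal-width bins; *DataFrame.hist()* uses 10 bins per attribute by default. As a rough sketch of the counting behind each panel, the example below bins a hypothetical set of ages with NumPy instead of drawing them.

```python
import numpy as np

# Hypothetical ages, not taken from the dataset
ages = np.array([21, 22, 22, 25, 29, 31, 33, 41, 50, 60, 81])

# Ten equal-width bins between the minimum and the maximum
counts, edges = np.histogram(ages, bins=10)

print(counts)
print(edges)
```

Ten bins are described by eleven edges, and the counts add up to the number of values. Passing a different *bins* value to *data.hist()* changes how coarse or fine the picture is.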

**Density Plots**

Another way to visualize the distribution of each attribute is density plots.

    # Density plots for all attributes
    data.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
    pyplot.show()

**Box and Whisker Plots**

Box and Whisker Plots (or simply, boxplots) are used to visualize the distribution of each attribute, drawing a line for the median (middle value) and a box around the 25th and 75th percentiles (the middle 50% of the data). The whiskers give an idea of the spread of the data, and dots outside of the whiskers show candidate outlier values (values that lie more than 1.5 times the spread of the middle 50% of the data beyond the edges of the box).

    # Draw box and whisker plots for all attributes
    data.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False)
    pyplot.show()
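The outlier rule behind the dots can also be applied directly to the data. The following sketch assumes the common 1.5 × IQR convention (the matplotlib default for whiskers) and uses hypothetical values. The exact whisker positions in a plot may differ slightly, since whiskers are drawn at the most extreme data points inside the fences.

```python
import pandas as pd

# Hypothetical values with one extreme reading
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

# Box edges: 25th and 75th percentiles
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1  # spread of the middle 50% of the data

# Fences 1.5 * IQR beyond the box edges
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are candidate outliers
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers.tolist())
```

Only the extreme reading falls outside the fences here; every other value lies within 1.5 × IQR of the box.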

**Multivariate Plots**

- Correlation Matrix Plot
- Scatter Plot Matrix

**Correlation Matrix Plot**

As before, we use the *corr()* method of the Pandas DataFrame to calculate the correlation matrix with Pearson's correlation coefficient.

    # Compute the correlation matrix
    correlations = data.corr(method='pearson')
    # Correlations between all pairs of attributes

The correlation matrix is shown in Table 3 above. Now, we will learn how to plot the correlation matrix.

    # Import required package
    import numpy as np

    # Plot correlation matrix
    fig = pyplot.figure()
    ax = fig.add_subplot(111)
    cax = ax.matshow(correlations, vmin=-1, vmax=1)
    fig.colorbar(cax)
    ticks = np.arange(0, 9, 1)
    ax.set_xticks(ticks)
    ax.set_yticks(ticks)
    names = data.columns
    # Rotate x-tick labels by 90 degrees
    ax.set_xticklabels(names, rotation=90)
    ax.set_yticklabels(names)
    pyplot.show()

**Scatter Plot Matrix**

A scatter plot shows the relationship between two variables as dots in two dimensions, one axis for each attribute. Drawing all these scatter plots together is called a scatter plot matrix. Scatter plots are useful for spotting structured relationships between variables. Attributes with structured relationships may also be correlated, which makes them good candidates for removal from your dataset.

    # Import required package
    from pandas.plotting import scatter_matrix

    # Plotting Scatterplot Matrix
    pyplot.rcParams['figure.figsize'] = [20, 20]
    scatter_matrix(data)
    pyplot.show()
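Candidate attributes for removal can also be shortlisted numerically rather than by eye. The sketch below is one possible approach, not part of the original analysis: it lists attribute pairs whose absolute Pearson correlation exceeds a threshold. The toy columns and the 0.9 threshold are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical attributes: "b" is nearly a multiple of "a", "c" is unrelated
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
})

# Absolute correlations, since strong negative links matter as well
corr = toy.corr(method="pearson").abs()

# Keep only the upper triangle so each pair appears once
# and the diagonal (self-correlation) is excluded
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Hypothetical threshold for "strongly correlated"
strong = pairs[pairs > 0.9]
print(strong)
```

Only the pair ("a", "b") passes the threshold in this toy example. On the diabetes data, the same steps could be applied to *data* in place of *toy*, with a threshold chosen to suit the analysis.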

This tutorial was initially published *here*.