Exploratory Data Analysis using Python Pandas: A Tutorial

Load the input data

import pandas as pd
df = pd.read_csv("pima-indians-diabetes-data.csv", header=None)
type(df)
pandas.core.frame.DataFrame

Determine the shape of the input data

df.shape
(768, 9)
df.head()  # first five rows
df.tail()  # last five rows

Assigning names to attributes

attributes = ['Pregnancies','GlucosePlasma','BloodPressure','SkinThickness','Insulin','BMI','DPF','Age','Group']
df.columns = attributes
df.head()
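
The loading and renaming steps above can also be combined by passing the column names straight to read_csv. The following is a minimal sketch under that assumption. The two data rows are made-up placeholder values (not rows from the diabetes dataset), read from an in-memory string so that the snippet runs without the CSV file; sample_csv and sample_df are illustrative names only.

```python
import io
import pandas as pd

attributes = ['Pregnancies', 'GlucosePlasma', 'BloodPressure', 'SkinThickness',
              'Insulin', 'BMI', 'DPF', 'Age', 'Group']

# Hypothetical two-row sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "1,100,70,20,80,25.5,0.3,30,0\n"
    "2,150,80,30,120,32.1,0.6,45,1\n"
)

# header=None says the file has no header row; names= supplies the labels.
sample_df = pd.read_csv(sample_csv, header=None, names=attributes)

print(sample_df.shape)
print(sample_df.columns.tolist())
```

With the real file, the same call would take the file name in place of sample_csv.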

Retrieve basic information of the input data

For each numeric attribute, df.describe() reports the following statistics:

  1. Count,
  2. Mean,
  3. Standard Deviation,
  4. Minimum Value,
  5. 25th Percentile,
  6. 50th Percentile (Median),
  7. 75th Percentile,
  8. Maximum Value.
df.describe()
from pandas import set_option
set_option('display.width', 100)
# The full option name avoids an ambiguous-key error in recent pandas versions.
set_option('display.precision', 3)
df.describe()
df.dtypes
Pregnancies int64
GlucosePlasma int64
BloodPressure int64
SkinThickness int64
Insulin int64
BMI float64
DPF float64
Age int64
Group int64
dtype: object
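
Each statistic listed at the start of this section can also be computed on its own, which is a quick way to check what describe() reports. The sketch below uses a small hypothetical Series (values chosen for easy arithmetic, not taken from the dataset) and compares a few entries of the summary with the corresponding individual methods.

```python
import pandas as pd

# Hypothetical values chosen for easy arithmetic; not from the dataset.
s = pd.Series([1, 2, 3, 4, 100])
summary = s.describe()

# Each entry of the summary matches a dedicated Series method.
print(summary['count'], s.count())
print(summary['mean'], s.mean())
print(summary['50%'], s.median())
print(summary['75%'], s.quantile(0.75))
```

Note how the single large value pulls the mean far above the median, which is one reason to look at both.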

Basic query in the DataFrame

df.query('Age > 65')
df.query('Age > 65 & Group == 1')
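
The same filter can be written with boolean masks instead of a query string. This is a minimal sketch on a hypothetical four-row frame (toy is an illustrative name, and the values are not from the dataset) showing that the two forms select the same rows.

```python
import pandas as pd

# Hypothetical toy data, not rows from the diabetes dataset.
toy = pd.DataFrame({'Age': [70, 40, 68, 66], 'Group': [1, 0, 0, 1]})

# Query-string form, as used above.
by_query = toy.query('Age > 65 & Group == 1')

# Boolean-mask form; each condition needs its own parentheses.
by_mask = toy[(toy['Age'] > 65) & (toy['Group'] == 1)]

print(by_query.equals(by_mask))
```

The mask form is more verbose, but it works with column names that are not valid Python identifiers.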

Find the distribution of the ‘Group’ attribute

df.groupby('Group').size()
Group
0 500
1 268
dtype: int64
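
The counts above are easier to compare as proportions. The sketch below rebuilds the two counts shown in the output as a Series and divides by the total; on the full DataFrame, df['Group'].value_counts(normalize=True) should give the same proportions directly.

```python
import pandas as pd

# Class counts copied from the groupby output above.
counts = pd.Series({0: 500, 1: 268}, name='Group')

# Share of each class among all 768 records.
proportions = counts / counts.sum()

print(proportions.round(3))
```

Roughly 65% of the records fall in group 0 and 35% in group 1, so the two classes are not balanced.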

Correlation between all pairs of attributes

df.corr(method='pearson')
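
To read a correlation matrix, it often helps to pull out one column and sort it. The sketch below does this on a hypothetical three-column frame (toy, x, y and z are illustrative names) where the expected correlations are known in advance. On the full DataFrame, the same pattern with 'Group' in place of 'x' would rank the attributes by their linear association with the group label.

```python
import pandas as pd

# Hypothetical data with known relationships; not from the dataset.
toy = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [2, 4, 6, 8, 10],  # increases linearly with x
    'z': [5, 4, 3, 2, 1],   # decreases linearly with x
})

corr = toy.corr(method='pearson')

# Correlations of every other column with 'x', strongest positive first.
ranked = corr['x'].drop('x').sort_values(ascending=False)
print(ranked)
```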

Skew of attribute distribution

df.skew()
Pregnancies 0.902
GlucosePlasma 0.174
BloodPressure -1.844
SkinThickness 0.109
Insulin 2.272
BMI -0.429
DPF 1.920
Age 1.130
Group 0.635
dtype: float64
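
The sign of the skew indicates which tail of a distribution is longer: positive values (such as Insulin above) point to a long right tail, negative values (such as BloodPressure) to a long left tail. The sketch below illustrates this with three small hypothetical Series; the values are invented for the example.

```python
import pandas as pd

# Hypothetical values, invented to show the sign of the skew.
right_tailed = pd.Series([1, 1, 1, 2, 10])   # one large value on the right
left_tailed = pd.Series([1, 9, 10, 10, 10])  # one small value on the left
symmetric = pd.Series([1, 2, 3, 4, 5])

print(right_tailed.skew())  # positive
print(left_tailed.skew())   # negative
print(symmetric.skew())     # zero for a symmetric sample
```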

Univariate analysis

Box-and-Whisker plots

df.boxplot(column=attributes, figsize=(15, 8))
df.boxplot(column=['Age'], figsize=(10, 5))
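
The points drawn beyond the whiskers of a box plot can also be found numerically. The sketch below assumes the default whisker rule of 1.5 times the interquartile range and applies it to a handful of hypothetical ages (not taken from the dataset); ages, lower and upper are illustrative names.

```python
import pandas as pd

# Hypothetical ages, invented for the example.
ages = pd.Series([21, 22, 25, 30, 33, 40, 81])

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Values outside these limits would be drawn as individual points.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]

print(lower, upper)
print(outliers.tolist())
```

On the full DataFrame, the same steps applied to df['Age'] would list the ages that the box plot marks as outliers.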

Density plots

df.plot(kind='density', subplots=True, layout=(3,3), figsize=(15, 10));

Histogram

df.hist(figsize=(15, 10))

Multivariate analysis

Scatter plots

df.plot.scatter(x='Age', y='Pregnancies', figsize=(15, 10));
df.plot.scatter(x='Age', y='BMI', figsize=(15, 10));

Dr. Soumen Atta, Ph.D.

Postdoctoral Researcher at Laboratoire des Sciences du Numérique de Nantes (LS2N), Université de Nantes, IMT Atlantique, Nantes, France.