Exploratory Data Analysis using Python Pandas: A Tutorial

In this tutorial, we will learn about exploratory data analysis using Python Pandas. In exploratory data analysis, we analyze the input dataset to summarize its main characteristics. Sometimes, we examine the main features of the input dataset visually using different standard plots.

This is a beginner-friendly tutorial. We assume that readers are familiar with the basics of the Python programming language. If you are new to Python programming and have never used Pandas before, you can read the following beginner-friendly tutorial on Python Pandas.

In this tutorial, we will use a well-known dataset, the Pima Indians Diabetes data. This dataset can be downloaded from this link. It consists of several medical predictor (independent) variables and one target (dependent) variable. The input data is available as a CSV (comma-separated values) file. In this tutorial, we will only be using the Python Pandas package.

Load the input data

The input CSV file must be loaded first. This can be done using Python Pandas. We import Python Pandas with the alias ‘pd’.

import pandas as pd

Now, we load the input CSV file as follows:

df = pd.read_csv("pima-indians-diabetes-data.csv", header=None)

Note that the input file does not have a header row, which is why we pass header=None. The read_csv() function reads the input CSV file into a DataFrame ‘df’. We can confirm the type of the variable ‘df’ using the type() function, as shown below:

type(df)

The output of the above command is:

pandas.core.frame.DataFrame
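
To see what header=None changes without downloading anything, here is a minimal sketch that reads a tiny in-memory CSV. The two rows are made up for illustration and are not taken from the Pima dataset.

```python
import io

import pandas as pd

# Two made-up rows with no header line (hypothetical values, not the real dataset).
csv_text = "6,148,72\n1,85,66\n"

# header=None: the first line is treated as data and the columns are numbered 0, 1, 2.
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# Default behaviour: the first line is consumed as the column names.
with_header = pd.read_csv(io.StringIO(csv_text))

print(no_header.shape)    # both rows are kept as data
print(with_header.shape)  # one row is lost to the header
```

If the row count of a loaded file is one less than expected, a missing header=None is a likely cause.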

Determine the shape of the input data

We can use the shape attribute of DataFrame to determine the number of rows and number of columns present in DataFrame ‘df’.

df.shape

The output of the above statement is:

(768, 9)

Therefore, the input CSV file contains 768 rows and 9 columns. These columns denote the attributes of the input data.

We can easily inspect the initial few rows of the DataFrame ‘df’ using the head() function, as follows:

df.head()

The output is shown below:

[Figure: output of df.head()]

We can observe that the default column names run from 0 to 8. Similarly, the row indices are set automatically.

We can also retrieve the last few rows of the DataFrame ‘df’ using the tail() function, as shown below:

df.tail()

The output of the tail() function is shown below:

[Figure: output of df.tail()]
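
Both head() and tail() return five rows by default and accept the number of rows as an argument. The sketch below uses a small made-up DataFrame (not the Pima data) to show this together with the shape attribute.

```python
import pandas as pd

# A small made-up frame standing in for the real data.
demo = pd.DataFrame({"a": range(10), "b": range(10, 20)})

n_rows, n_cols = demo.shape  # shape is a (rows, columns) tuple

first_three = demo.head(3)   # rows with index 0, 1, 2
last_two = demo.tail(2)      # rows with index 8, 9

print(n_rows, n_cols)
print(first_three)
print(last_two)
```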

Assigning names to attributes

We can change the default column (i.e., attribute) names. This will help us to easily understand the data.

attributes = ['Pregnancies','GlucosePlasma','BloodPressure','SkinThickness','Insulin','BMI','DPF','Age','Group']

Here, the attribute names are set based on the original dataset. The first eight attributes are the independent variables and the last one is the dependent variable in the original dataset.

Now, we can update the column names of the DataFrame ‘df’ using the command mentioned below:

df.columns = attributes

We can confirm the change by calling df.head() again. The output is shown below:

[Figure: output of df.head() with the new column names]

As we can see now, the column names are modified. This helps us to easily identify each column.
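
As an alternative, the column names can be supplied while loading the file through the names parameter of read_csv(). The sketch below assumes a shortened, hypothetical column list and made-up rows so that it runs without the real file.

```python
import io

import pandas as pd

# Hypothetical shortened column list and made-up rows, for illustration only.
names = ["Pregnancies", "GlucosePlasma", "Group"]
csv_text = "6,148,1\n1,85,0\n"

# names= supplies the column labels; header=None keeps the first line as data.
named = pd.read_csv(io.StringIO(csv_text), header=None, names=names)

print(named)
```

This saves the separate assignment to df.columns, at the cost of a longer read_csv() call.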

Retrieve basic information of the input data

We can even compute some basic statistics of the DataFrame ‘df’ using the describe() function. The describe() function of the Pandas DataFrame lists eight statistical properties of each attribute. They are:

  1. Count,
  2. Mean,
  3. Standard Deviation,
  4. Minimum Value,
  5. 25th Percentile,
  6. 50th Percentile (Median),
  7. 75th Percentile,
  8. Maximum Value.

We can apply the describe() function as shown below:

df.describe()

The output of the above code is shown below:

[Figure: output of df.describe()]
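
The result of describe() is itself a DataFrame, so individual statistics can be looked up by label. The following sketch uses a single made-up ‘Age’ column to show the eight row labels and how to read one value.

```python
import pandas as pd

demo = pd.DataFrame({"Age": [21, 30, 40, 50, 59]})  # made-up values

summary = demo.describe()
print(summary)

# Rows of the summary are the eight statistics; columns are the attributes.
median_age = summary.loc["50%", "Age"]
mean_age = summary.loc["mean", "Age"]
```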

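We can also adjust display options, such as the output width and the numeric precision, before calling the describe() function. Note that recent versions of Pandas require the full option name ‘display.precision’.

from pandas import set_option
set_option('display.width', 100)
set_option('display.precision', 3)
df.describe()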

The output is shown below:

[Figure: output of df.describe() with a precision of 3]
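
Display options set with set_option() stay in effect for the rest of the session. If the setting is only needed temporarily, one possible approach is the option_context() context manager, sketched below on made-up values. Display options only change how numbers are printed; the stored values are not rounded.

```python
import pandas as pd

demo = pd.DataFrame({"x": [1.23456789, 2.3456789]})  # made-up values

before = pd.get_option("display.precision")

# Inside the block, floats are displayed with 3 decimal places.
with pd.option_context("display.precision", 3):
    inside = pd.get_option("display.precision")
    short_text = demo.to_string()

after = pd.get_option("display.precision")  # restored on exit

print(short_text)
```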

We can easily determine the datatype of each attribute in ‘df’, as follows:

df.dtypes

The output is shown below:

Pregnancies int64
GlucosePlasma int64
BloodPressure int64
SkinThickness int64
Insulin int64
BMI float64
DPF float64
Age int64
Group int64
dtype: object

Basic query in the DataFrame

We can use the query() method of DataFrame for filtering purposes.

Example 1: Suppose we want to find the rows of the DataFrame ‘df’ where the attribute ‘Age’ has values greater than 65.

This can be done using the following code:

df.query('Age > 65')

The output of the above code is shown below:

[Figure: output of df.query('Age > 65')]

Example 2: Suppose we want to find the rows of the DataFrame ‘df’ where the attribute ‘Age’ is greater than 65 and the attribute ‘Group’ is equal to 1.

This is a multi-criteria query. This can be achieved using the following command:

df.query('Age > 65 & Group == 1')

The output of the above code is shown below:

[Figure: output of df.query('Age > 65 & Group == 1')]
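
The same filter can also be written as a boolean mask. The sketch below, on made-up rows, compares the two forms and shows how a Python variable (here the hypothetical name min_age) can be referenced inside query() with an @ prefix.

```python
import pandas as pd

# Made-up rows, for illustration only.
demo = pd.DataFrame({"Age": [70, 66, 40, 68], "Group": [1, 0, 1, 1]})

by_query = demo.query("Age > 65 and Group == 1")

# Equivalent boolean mask; the parentheses are required because & binds
# more tightly than the comparison operators in plain Python.
by_mask = demo[(demo["Age"] > 65) & (demo["Group"] == 1)]

# Python variables can be referenced inside query() with an @ prefix.
min_age = 65
by_variable = demo.query("Age > @min_age and Group == 1")

print(by_query)
```

The query() form is usually easier to read, while the mask form works with column names that are not valid Python identifiers.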

Find the distribution of the ‘Group’ attribute

We can determine the distribution of the attribute ‘Group’, as follows:

df.groupby('Group').size()

The output is shown below:

Group
0 500
1 268
dtype: int64

This means that there are two groups in the input data. One group has 500 rows and the other has 268 rows.
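
The same counts can be obtained with value_counts(), which can also report each group as a fraction of the total. With the counts above, group 1 makes up 268 / 768, or roughly 34.9%, of the rows. A minimal sketch on made-up labels:

```python
import pandas as pd

group = pd.Series([0, 0, 0, 1, 1], name="Group")  # made-up labels

# Absolute counts per group, ordered by group label.
counts = group.value_counts().sort_index()

# Fractions per group; they sum to 1.
shares = group.value_counts(normalize=True).sort_index()

print(counts)
print(shares)
```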

Correlation between all pairs of attributes

Pearson’s correlation coefficient can be computed between all pairs of attributes of the DataFrame ‘df’ using the Pandas corr() function. The coefficient measures the strength of the linear relationship between two attributes; here, it is assumed that each attribute follows a normal distribution. The values -1 and 1 denote a perfect negative and a perfect positive correlation, respectively. On the other hand, a value of 0 indicates no linear correlation at all.

df.corr(method='pearson')

The output of the above code is shown below:

[Figure: output of df.corr(method='pearson')]
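
To make the coefficient less of a black box, the sketch below recomputes it from its definition, r = cov(x, y) / (std(x) * std(y)), and compares the result with corr(). The two columns ‘x’ and ‘y’ are made up for illustration.

```python
import pandas as pd

# Made-up values, for illustration only.
demo = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                     "y": [2.0, 4.0, 5.0, 4.0, 5.0]})

# Pearson's r = cov(x, y) / (std(x) * std(y))
manual_r = demo["x"].cov(demo["y"]) / (demo["x"].std() * demo["y"].std())

# corr() returns a symmetric matrix with 1.0 on the diagonal.
matrix = demo.corr(method="pearson")
pandas_r = matrix.loc["x", "y"]

print(manual_r, pandas_r)
```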

Skew of attribute distribution

We can use the skew() function to compute the skew of each attribute in a Pandas DataFrame.

df.skew()

The output is shown below:

Pregnancies 0.902
GlucosePlasma 0.174
BloodPressure -1.844
SkinThickness 0.109
Insulin 2.272
BMI -0.429
DPF 1.920
Age 1.130
Group 0.635
dtype: float64

A positive value indicates a right-skewed distribution (a longer tail on the right), and a negative value indicates a left-skewed distribution. Values closer to zero denote a less skewed distribution.
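
The sign convention is easy to check on small made-up series: one with a long right tail, its mirror image, and a symmetric one.

```python
import pandas as pd

right_tail = pd.Series([1, 1, 2, 2, 3, 10])  # long tail to the right
left_tail = -right_tail                      # mirror image: long tail to the left
symmetric = pd.Series([1, 2, 3, 4, 5])

print(right_tail.skew())  # positive
print(left_tail.skew())   # negative, same magnitude
print(symmetric.skew())   # zero
```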

Univariate analysis

Here, we discuss three univariate analyses based on the Box-and-Whisker plots, Density plots, and Histograms.

Box-and-Whisker plots

Box-and-Whisker plots can be drawn from a DataFrame.

df.boxplot(column = attributes, figsize=(15, 8))

The above command will create a box-and-whisker plot for all the attributes present in the DataFrame ‘df’ (shown below). The parameter ‘figsize’ is optional.

[Figure: box-and-whisker plots of all attributes]

We can also select attributes for which we want to draw box-and-whisker plots.

df.boxplot(column = ['Age'], figsize=(10, 5))

The above command will create a box-and-whisker plot for the attribute ‘Age’ in the DataFrame ‘df’.

[Figure: box-and-whisker plot of the attribute 'Age']
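
The numbers behind a box-and-whisker plot can also be computed directly. The sketch below assumes the common 1.5 × IQR rule for the whiskers, which is also the Matplotlib default, and uses made-up ages rather than the real column.

```python
import pandas as pd

age = pd.Series([21, 22, 25, 29, 33, 41, 50, 81])  # made-up values

# The box spans the first to the third quartile, with a line at the median.
q1, median, q3 = age.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = age[(age < lower_fence) | (age > upper_fence)]

print(q1, median, q3, iqr)
print(outliers.tolist())
```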

Density plots

Density plots for all attributes of a DataFrame can be drawn.

df.plot(kind='density', subplots=True, layout=(3,3), figsize=(15, 10));

The output of the above code is shown below:

The density plot helps us to visualize the distribution of an attribute.

Histogram

We can also draw histograms of all the attributes in a DataFrame.

df.hist(figsize=(15, 10))

The histograms of the attributes in the DataFrame ‘df’ are shown below:

The histogram also helps us to understand the distribution of the attribute.
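
A histogram is a count of values per bin, so the bar heights can be reproduced without plotting. The sketch below uses made-up ages and hypothetical bin edges; hist() uses 10 equal-width bins by default, and the handling of values that fall exactly on a bin edge may differ from pd.cut(), which closes intervals on the right by default.

```python
import pandas as pd

age = pd.Series([21, 25, 29, 33, 41, 45, 58])  # made-up values
edges = [20, 30, 40, 50, 60]                   # hypothetical bin edges

# pd.cut labels each value with the interval it falls into.
bins = pd.cut(age, bins=edges)

# Count the values per interval, ordered from the lowest to the highest bin.
counts = bins.value_counts().sort_index()

print(counts)
```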

Multivariate analysis

Here, we discuss scatter plots for multivariate analyses.

Scatter plots

We can draw a scatter plot between the attributes ‘Age’ and ‘Pregnancies’, as follows:

df.plot.scatter(x = 'Age', y = 'Pregnancies', figsize=(15, 10));

Here, the parameter ‘figsize’ is also optional. The output of the above command is shown below:

[Figure: scatter plot of 'Age' versus 'Pregnancies']

Similarly, a scatter plot between the attributes ‘Age’ and ‘BMI’ can be drawn using the following code:

df.plot.scatter(x = 'Age', y = 'BMI', figsize=(15, 10));

The output is shown below:

[Figure: scatter plot of 'Age' versus 'BMI']

This is the end of this tutorial. In this tutorial, we have learned how to use the Python Pandas package for exploratory data analyses.

Interested readers can read the following introductory tutorial on Descriptive Statistics using Python Pandas.

Assistant Professor at Indian Institute of Information Technology (IIIT) Vadodara, India, and Postdoctoral researcher at Masaryk University, Czech Republic.
