Categorical Data Encoding Techniques in Python: A Complete Guide

Dr. Soumen Atta, Ph.D.
9 min read · Apr 27, 2023

Categorical data encoding is an important step in preparing data for machine learning algorithms. Categorical data takes its values from a set of labels rather than numbers, such as colors, types of fruit, or gender. Most machine learning models expect numerical input, so categorical values need to be encoded as numbers before they can be used. In this tutorial, we will explore various techniques for categorical data encoding in Python.
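As a rough illustration of what "encoding" means, the sketch below maps each distinct string to an integer using plain Python. The `colors` list and the `mapping` name are made up for this illustration and are not part of the dataset used later; it is only meant to show the idea, not a recommended implementation.

```python
# Hypothetical toy data, used only to illustrate the idea of encoding.
colors = ["red", "green", "blue", "green", "red"]

# Assign each distinct category an integer code.
# Sorting the categories keeps the assignment reproducible between runs.
mapping = {category: code for code, category in enumerate(sorted(set(colors)))}

# Replace every label with its integer code.
encoded = [mapping[c] for c in colors]

print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}
print(encoded)  # [2, 1, 0, 1, 2]
```

Note that the integers here carry no meaning beyond identifying the category; whether an implied ordering is acceptable depends on the model and the encoding technique chosen.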

We will use scikit-learn, a popular machine learning library for Python, for our examples, along with pandas to hold the data.

Let’s start by importing the necessary libraries:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer

We will be using the following dataset for our examples. This dataset contains information about different types of fruits and their characteristics.

data = {
    'Fruit': ['Apple', 'Banana', 'Orange', 'Apple', 'Banana', 'Orange'],
    'Color': ['Red', 'Yellow', 'Orange', 'Green', 'Yellow', 'Orange'],
    'Price': [0.5, 0.25, 0.3, 0.6, 0.35, 0.4],
    'Weight': [100, 120, 80, 110, 130, 90],
}

df = pd.DataFrame(data)

print(df)
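Before applying any technique, it can help to see which columns actually need encoding. The following is a minimal sketch, assuming the same fruit dataset as above: it picks out the non-numeric columns, then applies scikit-learn's `LabelEncoder` to `Fruit` and `OneHotEncoder` to `Color`. The variable names (`categorical_cols`, `fruit_codes`, `color_onehot`) are chosen for this illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Same fruit dataset as in the tutorial, rebuilt so this sketch runs on its own.
df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Orange', 'Apple', 'Banana', 'Orange'],
    'Color': ['Red', 'Yellow', 'Orange', 'Green', 'Yellow', 'Orange'],
    'Price': [0.5, 0.25, 0.3, 0.6, 0.35, 0.4],
    'Weight': [100, 120, 80, 110, 130, 90],
})

# Non-numeric columns are the ones that need encoding.
categorical_cols = [c for c in df.columns
                    if not pd.api.types.is_numeric_dtype(df[c])]
print(categorical_cols)  # ['Fruit', 'Color']

# Label encoding: each distinct fruit name becomes one integer.
# LabelEncoder sorts the classes, so Apple=0, Banana=1, Orange=2.
label_encoder = LabelEncoder()
fruit_codes = label_encoder.fit_transform(df['Fruit'])
print(list(label_encoder.classes_))
print(fruit_codes)

# One-hot encoding: each distinct color becomes its own 0/1 column.
# The encoder expects 2-D input, hence the double brackets.
onehot_encoder = OneHotEncoder()
color_onehot = onehot_encoder.fit_transform(df[['Color']]).toarray()
print(list(onehot_encoder.categories_[0]))
print(color_onehot)
```

`Price` and `Weight` are already numeric and are left untouched. Label encoding yields a single integer column, while one-hot encoding yields one column per distinct color, four in this dataset.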


Dr. Soumen Atta, Ph.D.

I am a Postdoctoral Researcher at the Faculty of IT, University of Jyväskylä, Finland. You can find more about me on my homepage: https://www.soumenatta.com/