Categorical Data Encoding Techniques in Python: A Complete Guide
Categorical data encoding is an important step in preparing data for machine learning. Categorical data takes its values from a fixed set of labels rather than from numbers, such as colors, types of fruit, or gender. Most machine learning models operate on numbers, so categorical values must be encoded numerically before they can be used as model inputs. In this tutorial, we will explore several techniques for encoding categorical data in Python.
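To make the idea concrete, here is a minimal sketch in plain Python that assigns an integer to each distinct category. The `colors` list and the `to_codes` helper are hypothetical names chosen for illustration only. The library encoders covered in this tutorial handle the same job more robustly.

```python
def to_codes(values):
    """Map each distinct value to an integer, numbering categories in sorted order."""
    categories = sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    return [mapping[value] for value in values], mapping


# Hypothetical sample values, used only to illustrate the mapping
colors = ["Red", "Yellow", "Orange", "Green", "Yellow", "Orange"]
codes, mapping = to_codes(colors)

print(mapping)  # category -> integer code
print(codes)    # the original values expressed as codes
```

Note that the integers here are arbitrary identifiers. They imply an ordering (Green < Orange < Red < Yellow) that the colors do not actually have, which is one reason more than one encoding technique exists.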
We will use scikit-learn, a popular Python library for machine learning, for the encoders in our examples, and pandas to hold the data.
Let’s start by importing the necessary libraries:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
We will use the following small dataset for our examples. It describes six fruits using two categorical columns (Fruit and Color) and two numeric columns (Price and Weight).
data = {
    'Fruit': ['Apple', 'Banana', 'Orange', 'Apple', 'Banana', 'Orange'],
    'Color': ['Red', 'Yellow', 'Orange', 'Green', 'Yellow', 'Orange'],
    'Price': [0.5, 0.25, 0.3, 0.6, 0.35, 0.4],
    'Weight': [100, 120, 80, 110, 130, 90],
}
df = pd.DataFrame(data)
print(df)
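Before encoding anything, it helps to confirm which columns are actually categorical. The following is a minimal sketch that rebuilds the same DataFrame and separates the columns by dtype using pandas' `select_dtypes`. The variable names `categorical_cols` and `numeric_cols` are our own and not part of any library.

```python
import pandas as pd

# Same dataset as above, repeated so this snippet runs on its own
data = {
    'Fruit': ['Apple', 'Banana', 'Orange', 'Apple', 'Banana', 'Orange'],
    'Color': ['Red', 'Yellow', 'Orange', 'Green', 'Yellow', 'Orange'],
    'Price': [0.5, 0.25, 0.3, 0.6, 0.35, 0.4],
    'Weight': [100, 120, 80, 110, 130, 90],
}
df = pd.DataFrame(data)

# Treat every non-numeric column as categorical (an assumption that holds for this dataset)
categorical_cols = df.select_dtypes(exclude='number').columns.tolist()
numeric_cols = df.select_dtypes(include='number').columns.tolist()

print(categorical_cols)
print(numeric_cols)

# List the distinct categories in each categorical column
for col in categorical_cols:
    print(col, sorted(df[col].unique()))
```

Here only Fruit and Color need encoding. Price and Weight are already numeric and can be passed to a model as they are.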