DBSCAN Clustering with HDBSCAN: A Python Tutorial with Iris Dataset

4 min read · Mar 29, 2023

In this tutorial, we will cover how to perform density-based clustering in Python with DBSCAN and HDBSCAN. DBSCAN is a popular clustering algorithm that groups points lying in dense regions of the feature space and labels points in sparse regions as noise. HDBSCAN is a hierarchical extension of DBSCAN that does not require the number of clusters or a single eps value to be fixed in advance and can handle clusters of varying densities.

Before we start, make sure you have the necessary Python libraries installed. You will need numpy, pandas, matplotlib, and scikit-learn. You will also need to install HDBSCAN, which can be done by running the following command in the terminal: pip install hdbscan.

Now let’s get started!

Step 1: Load the Data

For this tutorial, we will use the iris dataset, which is a popular dataset for machine learning. It ships with scikit-learn, so no separate download is needed; you can load it using the following code:

from sklearn.datasets import load_iris
iris = load_iris()
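Before going further, it can help to check what load_iris() returned. The sketch below only inspects attributes of the returned object; the variable name X is introduced here for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data  # feature matrix: one row per flower, one column per measurement

print(X.shape)             # (150, 4)
print(iris.feature_names)  # names of the four measurement columns
print(iris.target_names)   # the three species; not used for clustering itself
```

The species labels in iris.target are ignored by the clustering step, but they are handy later for judging how well the clusters line up with the known species.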

Step 2: Preprocess the Data

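Before we can apply DBSCAN clustering, we need to preprocess the data. DBSCAN relies on distances between points, so features measured on larger scales would otherwise dominate. In this case, we will use the StandardScaler from scikit-learn to scale each feature to zero mean and unit variance: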

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)
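A short check can confirm that the scaling did what was described. This sketch recomputes the per-column mean and standard deviation of the scaled matrix; the names col_means and col_stds are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

col_means = X_scaled.mean(axis=0)
# np.std defaults to the population standard deviation (ddof=0),
# which is the convention StandardScaler uses
col_stds = X_scaled.std(axis=0)

print(np.allclose(col_means, 0.0))  # True
print(np.allclose(col_stds, 1.0))   # True
```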

Step 3: Apply DBSCAN Clustering

Now that we have preprocessed the data, we can apply DBSCAN clustering. We will use the DBSCAN implementation from scikit-learn:

from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(X_scaled)
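After fitting, the result lives on the estimator itself. The sketch below reads the labels_ and core_sample_indices_ attributes of the fitted DBSCAN object and summarizes them; the helper names n_clusters, n_noise, and n_core are introduced here for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)
dbscan = DBSCAN(eps=0.5, min_samples=5).fit(X_scaled)

labels = dbscan.labels_  # one label per sample; -1 means noise
n_noise = int(np.sum(labels == -1))
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_core = len(dbscan.core_sample_indices_)  # samples with a dense neighborhood

print("clusters:", n_clusters)
print("noise points:", n_noise)
print("core points:", n_core)

# Size of each cluster, excluding noise
for label in sorted(set(labels) - {-1}):
    print(f"cluster {label}: {int(np.sum(labels == label))} points")
```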

In this example, we set eps to 0.5 and min_samples to 5. These are hyperparameters that you can adjust to control the sensitivity of the clustering algorithm. eps is the maximum distance between two points for one to be considered a neighbor of the other, while…
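To see how sensitive the result is to eps, one option is to refit DBSCAN over a few values and compare the outcomes. This is a rough sketch: the candidate values in eps_values are assumptions picked for illustration, and min_samples is held at 5 as in the example above.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

eps_values = (0.3, 0.5, 0.8, 1.2)  # illustrative candidates, not recommendations
results = {}
for eps in eps_values:
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X_scaled)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

With min_samples fixed, a larger eps can only enlarge each point's neighborhood, so the number of noise points does not increase as eps grows. The number of clusters, by contrast, can go up or down, because clusters may both appear and merge.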

Written by Dr. Soumen Atta, Ph.D.
