DBSCAN Clustering with HDBSCAN: A Python Tutorial with Iris Dataset

Dr. Soumen Atta, Ph.D.
4 min read · Mar 29, 2023

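In this tutorial, we will cover how to perform density-based clustering with HDBSCAN in Python. DBSCAN is a popular clustering algorithm that groups together data points lying in dense regions and labels points in sparse regions as noise. HDBSCAN is a hierarchical extension of DBSCAN: instead of requiring a single global distance threshold (eps), it builds a hierarchy of clusterings across density levels and keeps the most stable clusters. As a result, it determines the number of clusters from the data and can handle clusters of varying densities.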

Before we start, make sure you have the necessary Python libraries installed. You will need numpy, pandas, matplotlib, and scikit-learn. You will also need to install HDBSCAN, which can be done by running the following command in the terminal: pip install hdbscan.
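If you are unsure whether everything is installed, a small check like the following can help. This is only a sketch: the helper name `installed_versions` is hypothetical, and it relies on the standard-library `importlib.metadata` module (Python 3.8 or later).

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as used by pip; note "scikit-learn", not "sklearn".
REQUIRED = ["numpy", "pandas", "matplotlib", "scikit-learn", "hdbscan"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```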

Now let’s get started!

Step 1: Load the Data

For this tutorial, we will use the iris dataset, a popular machine learning dataset of 150 flower samples, each described by four measurements. It ships with scikit-learn, so no separate download is needed. You can load it using the following code:

from sklearn.datasets import load_iris
iris = load_iris()
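As a quick sanity check, the loaded object can be inspected before going further. The sketch below repeats the loading step so it runs on its own; the variable names `X` and `df` are assumptions introduced for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# iris.data is a NumPy array with 150 samples and 4 features.
X = iris.data
print(X.shape)             # (150, 4)
print(iris.feature_names)  # sepal and petal length/width, in cm

# A DataFrame makes the features easier to inspect.
df = pd.DataFrame(X, columns=iris.feature_names)
print(df.describe())
```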

Step 2: Preprocess the Data
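One common preprocessing choice for distance-based clustering is to standardize the features so that each has zero mean and unit variance. The sketch below assumes that approach using scikit-learn's `StandardScaler`; it is an illustrative assumption, not necessarily the exact preprocessing this tutorial applies.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Assumption: standardize each feature to zero mean and unit variance so
# that no single measurement dominates the distance computations.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0).round(6))  # approximately 0 for every feature
print(X_scaled.std(axis=0).round(6))   # approximately 1 for every feature
```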


Dr. Soumen Atta, Ph.D.

I am a Postdoctoral Researcher at the Faculty of IT, University of Jyväskylä, Finland. You can find more about me on my homepage: https://www.soumenatta.com/