DBSCAN Clustering with HDBSCAN: A Python Tutorial with Iris Dataset

Dr. Soumen Atta, Ph.D.
4 min read · Mar 29, 2023

In this tutorial, we will cover how to perform density-based clustering in Python with DBSCAN and HDBSCAN. DBSCAN is a popular clustering algorithm that groups together points lying in dense regions and labels points in sparse regions as noise. HDBSCAN is a hierarchical extension of DBSCAN that does not require a fixed distance threshold (eps), determines the number of clusters from the data, and can handle clusters of varying densities.

Before we start, make sure you have the necessary Python libraries installed. You will need numpy, pandas, matplotlib, and scikit-learn. You will also need to install HDBSCAN, which can be done by running pip install hdbscan in the terminal.

Now let’s get started!

Step 1: Load the Data

For this tutorial, we will use the iris dataset, which is a popular dataset for machine learning. It ships with scikit-learn, so no separate download is needed. You can load it using the following code:

from sklearn.datasets import load_iris
iris = load_iris()
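
As a quick sanity check, it can help to look at what load_iris() returned before going further. The snippet below is a minimal sketch that only prints the shape of the feature matrix and the feature and class names; the printing itself is an addition for illustration and is not part of the original steps.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# iris.data is a NumPy array with one row per flower and one column per feature
print(iris.data.shape)       # (150, 4)

# Names of the four measured features and of the three species
print(iris.feature_names)
print(iris.target_names)
```

The species labels in iris.target are not used for clustering, since DBSCAN is unsupervised, but they are handy later if you want to compare the clusters with the known classes.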

Step 2: Preprocess the Data

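Before we can apply DBSCAN clustering, we need to preprocess the data. DBSCAN relies on distances between points, so features measured on larger scales would otherwise dominate the result. In this case, we will use the StandardScaler from scikit-learn to scale each feature to zero mean and unit variance: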

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)
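
To confirm that the scaling did what we expect, we can check the per-feature mean and standard deviation of the scaled data. This is a small sketch added for illustration; the variable names col_means and col_stds are hypothetical and not part of the original code.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

# Per-feature statistics after scaling (hypothetical helper names)
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)

# Means should be approximately 0 and standard deviations approximately 1,
# up to floating-point rounding
print(np.round(col_means, 6))
print(np.round(col_stds, 6))
```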

Step 3: Apply DBSCAN Clustering

Now that we have preprocessed the data, we can apply DBSCAN clustering. We will use the DBSCAN implementation from scikit-learn:

from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(X_scaled)
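
After fitting, scikit-learn stores the cluster assignment of each sample in dbscan.labels_, where the label -1 marks points treated as noise. The following sketch shows one way to count the clusters and noise points; the names n_clusters and n_noise are chosen here for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(X_scaled)

# One label per sample; -1 means the sample was classified as noise
labels = dbscan.labels_

# Count distinct cluster labels, excluding the noise label
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print("clusters found:", n_clusters)
print("noise points:", n_noise)
```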

In this example, we set eps to 0.5 and min_samples to 5. These are hyperparameters that you can adjust to control the sensitivity of the clustering algorithm. eps is the maximum distance between two points for one to be considered in the neighborhood of the other (it is not a bound on the distance between all points in a cluster), while…
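
To get a feel for how sensitive the result is to eps, one option is to refit DBSCAN for a few values and compare the number of clusters and noise points. The sketch below is an illustration only: the candidate values 0.3, 0.5, 0.8, and 1.2 and the results dictionary are assumptions made for this example, not recommendations from the original text.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

# Hypothetical eps values to compare; min_samples is held fixed at 5
results = {}
for eps in (0.3, 0.5, 0.8, 1.2):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X_scaled)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

With min_samples fixed, a larger eps can only enlarge each point's neighborhood, so the number of noise points should not increase as eps grows. The number of clusters, on the other hand, can go up or down.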

Written by Dr. Soumen Atta, Ph.D.

I am a Postdoctoral Researcher at the Faculty of IT, University of Jyväskylä, Finland. You can find more about me on my homepage: https://www.soumenatta.com/

Responses (2)

Great article,
In the sub plots for dbscan and hdbscan, both the plots are the same.

Nicely done, thank you for sharing!