*Density-based spatial clustering of applications with noise* (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996 (Ester et al., 1996). It uses notions of *connectivity* and *density* of points in the data set to compute the connected components of dense enough regions of the data set. But let’s cut the intro and dive right into it.

# Definitions

Let $(X,d)$ be a finite discrete metric space, i.e. let $X$ be a data set on which we have a notion of similarity expressed in terms of a suitable distance function $d$. Assume we have fixed a pair of parameters $\varepsilon > 0$ and $m \in \mathbb{N}$. We say that two points $x,x' \in X$ are **$\varepsilon$-connected** if there is a sequence of points $x=x_0,\ldots,x_n=x'$ such that
\[d(x_i,x_{i+1}) < \varepsilon, \quad \text{for } i=0,\ldots,n-1.\]
Note that **$\varepsilon$-connectivity** defines an equivalence relation on $X$ which we denote by $\sim_\varepsilon$.
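To make the definition concrete, the equivalence classes of $\sim_\varepsilon$ can be computed by a plain breadth-first search over the points. A brute-force $O(n^2)$ sketch (the function name and signature are my own, not from the paper):

```python
from collections import deque

def epsilon_components(points, d, eps):
    """Partition `points` into the equivalence classes of ~_eps:
    x ~ x' iff there is a chain x = x_0, ..., x_n = x' with
    d(x_i, x_{i+1}) < eps for all i.  Brute-force BFS, O(n^2)."""
    unvisited = set(range(len(points)))
    components = []
    while unvisited:
        start = unvisited.pop()
        comp, queue = {start}, deque([start])
        while queue:
            i = queue.popleft()
            # collect all still-unvisited points within strict distance eps
            for j in [j for j in unvisited if d(points[i], points[j]) < eps]:
                unvisited.discard(j)
                comp.add(j)
                queue.append(j)
        components.append(comp)
    return components
```

Note that transitivity comes for free: the BFS keeps growing a component as long as some chain of $\varepsilon$-close points exists.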

For a subset $A \subset X$ we define its **$\varepsilon$-boundary $\Delta_\varepsilon A$** to be the set of points outside $A$ whose distance to $A$ is at most $\varepsilon$, i.e. we set

\[\Delta_\varepsilon A := \big\{\, x \in X \setminus A : d(A,x) \leq \varepsilon \,\big\}.\]
Here $d(A,x)$ denotes the minimum $\min_{a \in A} d(a,x)$ over all points of $A$. This notion of *boundary* is borrowed from graph theory and slightly adapted to our situation. Don’t worry if you don’t like this notion; it could also be omitted, and I will elaborate on that later on. The choice of the parameters above allows us to define a discrete notion of **density $\rho(x)$ of a point $x$** by

\[\rho(x) := \big\vert N_\varepsilon(x) \big\vert,\]
where $\vert.\vert$ counts the number of elements of a set and $N_\varepsilon(x)$ denotes the $\varepsilon$-neighborhood of $x$ (i.e. the set of points whose distance to $x$ is at most $\varepsilon$). A point is called **$m$-dense** if its $\varepsilon$-density is greater than or equal to $m$, i.e. its $\varepsilon$-neighborhood $N_\varepsilon(x)$ contains at least $m$ points. In the literature these points are usually referred to as **core points**.
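Both definitions translate directly into code. A brute-force sketch (again, the names are mine and purely illustrative):

```python
def density(points, d, eps, x):
    """rho(x) = |N_eps(x)|: number of points within distance <= eps
    of x. Note x itself always counts, since d(x, x) = 0."""
    return sum(1 for p in points if d(x, p) <= eps)

def core_points(points, d, eps, m):
    """The m-dense points, i.e. those with rho(x) >= m
    ('core points' in the DBSCAN literature)."""
    return [x for x in points if density(points, d, eps, x) >= m]
```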

# Sketch of the DBSCAN-algorithm

The idea behind DBSCAN can be formulated as follows. Choose a pair of parameters $(\varepsilon, m) \in \mathbb{R}_{> 0} \times \mathbb{N}$. The first parameter, $\varepsilon > 0$, gives rise to the notions of connectivity, density, and the boundary of a subset as described above. Let

\[X_{m \leq \rho} := \rho^{-1}\big( [m,\infty) \big) \subset X\]
denote the points of density at least $m$ and let $C_1,\ldots,C_n$ denote its connected components with respect to the notion of $\varepsilon$-connectivity defined above, i.e.

\[\pi_0^\varepsilon(X_{m \leq \rho}) := \big( X_{m \leq \rho} \big)/_{x \sim_\varepsilon x'} = \big\{ C_1,\ldots,C_n \big\}.\]
The components $C_1,\ldots,C_n$ can be understood as **dense cores** of the final clusters. Note that these sets are mutually disjoint. The final clusters $C^\prime_1,\ldots,C^\prime_n$ are obtained from $C_1,\ldots,C_n$ by adding their respective boundaries, i.e. for $i=1,\ldots,n$ we define $C^\prime_i := C_i \cup \Delta_\varepsilon C_i.$

Actually $C^\prime_i$ is just the $\varepsilon$-neighborhood $N_\varepsilon(C_i)$ of $C_i$, but I like the formulation in terms of $\Delta_\varepsilon$. Note that $C^\prime_1,\ldots,C^\prime_n$ are not necessarily pairwise disjoint. To make them disjoint we need to make a choice (usually following the order in which the points are visited), and in that sense an implementation of DBSCAN is usually not deterministic.
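The whole sketch — find the $m$-dense points, take their $\varepsilon$-connected components, attach the $\varepsilon$-boundaries, and resolve overlapping boundary points by visiting order — fits in a few lines. This is a minimal brute-force illustration of the formulation above, not the original paper's (index-accelerated) implementation; all names are mine:

```python
from collections import deque

def dbscan_sketch(points, d, eps, m):
    """1. find the m-dense (core) points X_{m <= rho};
    2. split them into eps-connected components C_1, ..., C_n;
    3. grow each C_i to C'_i by attaching its eps-boundary, assigning
       a contested boundary point to the first cluster that reaches it.
    Returns (clusters, noise) as index sets."""
    n = len(points)
    rho = [sum(1 for q in points if d(p, q) <= eps) for p in points]
    core = [i for i in range(n) if rho[i] >= m]

    # eps-connected components of the core points (BFS over cores only)
    unseen, clusters = set(core), []
    for s in core:
        if s not in unseen:
            continue
        unseen.discard(s)
        comp, queue = {s}, deque([s])
        while queue:
            i = queue.popleft()
            for j in [j for j in unseen if d(points[i], points[j]) < eps]:
                unseen.discard(j)
                comp.add(j)
                queue.append(j)
        clusters.append(comp)

    # attach eps-boundaries; overlaps go to the first cluster in order
    assigned = set()
    for comp in clusters:
        assigned |= comp
    for comp in clusters:
        dense_core = frozenset(comp)  # boundary is measured against the core
        for i in range(n):
            if i not in assigned and any(d(points[i], points[j]) <= eps
                                         for j in dense_core):
                comp.add(i)
                assigned.add(i)
    return clusters, set(range(n)) - assigned
```

Everything not absorbed into some $C^\prime_i$ is returned as noise, which is exactly the role of the leftover low-density points in DBSCAN.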

# References

- Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.
*Proceedings of the Second International Conference on Knowledge Discovery and Data Mining*, 226–231.