Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996 (Ester et al., 1996). It uses notions of connectivity and density of points in the data set to compute the connected components of dense enough regions of the data set. But let’s cut the intro and dive right into it.
Definitions
Let $(X,d)$ be a finite discrete metric space, i.e. let $X$ be a data set on which we have a notion of similarity expressed in terms of a suitable distance function $d$. Assume we have fixed a pair of parameters $\varepsilon > 0$ and $m \in \mathbb{N}$. We say that two points $x,x' \in X$ are $\varepsilon$-connected if there is a sequence of points $x=x_0,\ldots,x_n=x'$ such that \(d(x_i,x_{i+1}) < \varepsilon, \text{ for $i=0,\ldots,n-1$}.\) Note that $\varepsilon$-connectivity defines an equivalence relation on $X$ which we denote by $\sim_\varepsilon$.
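The equivalence classes of $\sim_\varepsilon$ can be computed with a breadth-first search. Here is a minimal sketch; the names `points`, `dist`, and `epsilon_components` are my own, and `dist` stands for the metric $d$:

```python
from collections import deque

def epsilon_components(points, dist, eps):
    """Partition the index set of `points` into the equivalence
    classes of eps-connectivity (strict inequality d < eps, as in
    the definition above), via breadth-first search."""
    unvisited = set(range(len(points)))
    components = []
    while unvisited:
        seed = unvisited.pop()
        component = {seed}
        queue = deque([seed])
        while queue:
            i = queue.popleft()
            # all still-unvisited points strictly within eps of point i
            neighbors = {j for j in unvisited if dist(points[i], points[j]) < eps}
            unvisited -= neighbors
            component |= neighbors
            queue.extend(neighbors)
        components.append(sorted(component))
    return components
```

For instance, on the real line with `dist = lambda a, b: abs(a - b)`, the points `[0.0, 0.5, 1.0, 5.0]` split into two classes for `eps = 0.6`: the chain `{0.0, 0.5, 1.0}` and the isolated point `{5.0}`.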
For a subset $A \subset X$ we define its $\varepsilon$-boundary $\Delta_\varepsilon A$ to be those points in the complement of $A$ whose distance to $A$ is at most $\varepsilon$, i.e. we set
\[\Delta A := \Delta_\varepsilon A := \{ x \in X \setminus A \ | \ d(A,x) \leq \varepsilon \}.\]Here $d(A,x)$ denotes the minimum $\min_{a \in A} d(a,x)$ over all points in $A$. This notion of boundary is borrowed from graph theory and slightly adapted to our situation. Don’t worry if you don’t like that notion; it could also be omitted, and I will elaborate on that later on. The choice of the parameters above allows us to define a discrete notion of density $\rho(x)$ of a point $x$ by
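The $\varepsilon$-boundary translates directly into code. A sketch (the helper name `eps_boundary` is mine; `X` and `A` are collections of points, `dist` the metric):

```python
def eps_boundary(A, X, dist, eps):
    """Delta_eps A: the points of X outside A whose distance
    to A, i.e. min over a in A of d(a, x), is at most eps."""
    return [x for x in X if x not in A
            and min(dist(a, x) for a in A) <= eps]
```

Note the non-strict inequality $\leq \varepsilon$ here, in contrast to the strict inequality used for $\varepsilon$-connectivity.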
\[\rho(x) = \rho_{\varepsilon}(x) := \vert N_\varepsilon(x) \vert,\]where $\vert.\vert$ counts the number of elements of a set and $N_\varepsilon(x)$ denotes the $\varepsilon$-neighborhood of $x$ (i.e. the set of points whose distance to $x$ is at most $\varepsilon$). A point is called $m$-dense if its $\varepsilon$-density is greater than or equal to $m$, i.e. its $\varepsilon$-neighborhood $N_\varepsilon(x)$ contains at least $m$ points. In the literature these points usually are referred to as core points.
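In code, the density and the core points are one-liners. Again a sketch with hypothetical names (`density`, `core_points`):

```python
def density(points, dist, eps, x):
    """rho_eps(x) = |N_eps(x)|: the number of points within
    distance <= eps of x. Note that x itself is counted."""
    return sum(1 for p in points if dist(p, x) <= eps)

def core_points(points, dist, eps, m):
    """The m-dense (core) points: those whose eps-neighborhood
    contains at least m points."""
    return [x for x in points if density(points, dist, eps, x) >= m]
```

On `[0.0, 0.5, 1.0, 5.0]` with `eps = 0.6` and `m = 3`, only the middle point `0.5` is a core point: its neighborhood contains `0.0`, `0.5`, and `1.0`.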
Sketch of the DBSCAN-algorithm
The idea behind DBSCAN can be formulated as follows. Choose a pair of parameters $(\varepsilon, m) \in \mathbb{R}_{> 0} \times \mathbb{N}$. The first parameter, $\varepsilon > 0$, gives rise to the notions of connectivity, density, and the boundary of a subset as described above; the second parameter, $m$, sets the density threshold for core points. Let
\[X_{m \leq \rho} := \rho^{-1}\big( [m,\infty) \big) \subset X\]denote the points of density at least $m$ and let $C_1,\ldots,C_n$ denote its connected components with respect to the notion of $\varepsilon$-connectivity defined above, i.e.
\[\pi_0^\varepsilon(X_{m \leq \rho}) := \big( X_{m \leq \rho} \big)/_{x \sim_\varepsilon x'} = \big\{ C_1,\ldots,C_n \big\}.\]The components $C_1,\ldots,C_n$ can be understood as dense cores of the final clusters. Note that these sets are mutually disjoint. The final clusters $C^\prime_1,\ldots,C^\prime_n$ are obtained from $C_1,\ldots,C_n$ by adding their respective boundaries, i.e. for $i=1,\ldots,n$ we define $C^\prime_i := C_i \cup \Delta C_i.$
Actually $C^\prime_i$ is just the $\varepsilon$-neighborhood $N_\varepsilon(C_i)$ of $C_i$, but I like the formulation in terms of $\Delta_\varepsilon$. Note that $C^\prime_1,\ldots,C^\prime_n$ are not necessarily pairwise disjoint: a non-core point may lie in the boundary of several cores. To make the clusters disjoint we need to make a choice (usually following the order in which the points are visited).
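The whole sketch above can be assembled into a short function: compute the densities, restrict to the core points, take $\varepsilon$-connected components, then attach each component's $\varepsilon$-boundary. This is a toy $O(n^2)$ illustration under the definitions of this post (the name `dbscan_sketch` is mine, not the reference implementation); shared boundary points are assigned to the first cluster that claims them, which is exactly the choice mentioned above:

```python
def dbscan_sketch(points, dist, eps, m):
    """Clusters C'_i = C_i union Delta_eps(C_i), returned as sorted
    lists of indices into `points`. Unassigned points are noise."""
    n = len(points)
    # densities rho(i) = |N_eps(points[i])| and the core index set
    rho = [sum(1 for q in points if dist(points[i], q) <= eps) for i in range(n)]
    unvisited = {i for i in range(n) if rho[i] >= m}
    # eps-connected components of the core set (strict d < eps)
    cores = []
    while unvisited:
        seed = unvisited.pop()
        comp, queue = {seed}, [seed]
        while queue:
            i = queue.pop()
            nbrs = {j for j in unvisited if dist(points[i], points[j]) < eps}
            unvisited -= nbrs
            comp |= nbrs
            queue.extend(nbrs)
        cores.append(comp)
    # attach each eps-boundary; shared boundary points go to the
    # first cluster encountered (the "choice" from the text)
    assigned = set().union(*cores) if cores else set()
    clusters = []
    for comp in cores:
        boundary = {j for j in range(n) if j not in assigned
                    and min(dist(points[i], points[j]) for i in comp) <= eps}
        assigned |= boundary
        clusters.append(sorted(comp | boundary))
    return clusters
```

On the line with `eps = 0.6`, `m = 3`, the points `[0.0, 0.5, 1.0, 2.0]` yield a single core `{0.5}`, the boundary points `0.0` and `1.0` are attached to it, and `2.0` remains noise.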
References
- Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 226–231.