The $k$-Means algorithm computes a Voronoi partition of the data set such that each landmark is given by the centroid of the corresponding cell. Let me quickly quote Wikipedia on the history of the algorithm before I explain what it is about:

“The term “$k$-means” was first used by James MacQueen in 1967 (Macqueen, 1967), though the idea goes back to Hugo Steinhaus in 1957. The standard algorithm was first proposed by Stuart Lloyd in 1957 (Lloyd, 1982).” – www.wikipedia.org/k-means

The $k$-Means Problem

Let $X \subset \mathbb{R}^m$ be a finite collection of data points. Fix a positive integer $k$. Then our aim is to find a partition $\boldsymbol S = \{S_1, \ldots, S_k\}$ of $X$ into $k$ subsets such that it minimizes the following function

\[J(\boldsymbol S) = \sum_{i=1}^k \sum_{x \in S_i} \| x - \mu(S_i) \|^2,\]

where $\mu(S)$ denotes the mean of the points in $S$, i.e.

\[\mu(S) = \frac{1}{|S|}\sum_{x \in S} x.\]

We denote by $\mu_* \big( \boldsymbol S \big)$ the collection of means of the sets in $\boldsymbol S$.

As a rule of thumb: in most of my posts, the $*$-functor applied to some construction (or function) $f$ can in functional-programming terms be translated to $f_*(Z) := \verb+map+ \ f \ Z.$

Voronoi cells

Let $(Y,d)$ be a metric space. Let $\Lambda \subset Y$ be a finite subset called the landmarks. Given a landmark $\lambda \in \Lambda$ we define its associated Voronoi cell $V_\lambda$ by

\[V_\lambda := \{ y \in Y \ | \ d(y,\lambda) \leq d(y, \Lambda) \}.\]

That means $V_\lambda$ consists of all points that are closer to $\lambda$ than to any of the other landmarks in $\Lambda$. Suppose we are given a subset $X \subset Y$ then we introduce the following shorthand notation for a realtive version of a Voronoi cell

\[V_{X, \lambda} := V_\lambda \cap X.\]

When it is clear whether we are dealing with the relative or ordinary version we may omit the extra index. We write $V_*(\Lambda)$ resp. $(V_X)_*(\Lambda)$ for the whole collection of Voronoi cells associated to the landmarks $\Lambda$, i.e. for the relative version we have

\[(V_X)_*(\Lambda) := \{ V_{X,\lambda} \ | \ \lambda \in \Lambda \}.\]

Partitions and Voronoi cells

Suppose we have a discrete set $X$ embedded in $m$-dimensional Euclidean space $\mathbb{R}^m$ endowed with the Euclidean metric $d$. Suppose further we have chosen a family $\Lambda = \{\lambda_1,\ldots,\lambda_k\}$ of landmarks. We would like to produce a partition of $X$, i.e. a decomposition of $X$ into mutually disjoint sets, based on the Voronoi cells associated to $\Lambda$. However we are facing an ambiguity for points $x \in X$ with

\[d(x, \lambda_i) = d(x, \lambda_j), \text{ for some $i \neq j$}.\]

We have to make a choice to which set we are assigning $x$ (and from which cell we are removing $x$). For the remaining part of this post we will:

Assign $x$ to the cell with the lower index, and remove it from the other.

There is no particular reason to go with this rule other than it is the easiest I could come up with. After reassigning all problematic points we end up with an honest partition of $X$. We will continue to denote these sets by $V_\lambda$ resp. $V_{X,\lambda}$, and continue to refer to them as Voronoi cells.

Lloyd’s algorithm

Computing the minimum of the function $J$ described above is usually too expensive. Instead one uses a heuristic algorithm to compute a local minimum. The most common algorithm is Lloyd’s algorithm which I will sketch in the following: Suppose $X$ is a finite discrete set in $m$-dimensional Euclidean space $\mathbb{R}^m$ endowed with the Euclidean metric $d$. Suppose further we fixed a positive integer $k$. Then choose an arbitrary partition $\boldsymbol S$, i.e. a decomposition of $X$ into a family mutually disjoint sets $S_1,\ldots,S_k \subset X$. Then define a sequence $(C_n)_{n \in \mathbb{N}}$ of partitions as follows

\[C_n := L^n(\boldsymbol S) \ \ \ \text{, where } L:= (V_X \circ \mu)_*.\]

It is not hard to show that this sequence converges (see the section below). Hence one can define the result of Lloyd’s algorithm applied to the initial partition $\boldsymbol S$ as follows

\[\mathscr{L}_{V,\mu}\big(\boldsymbol S \big) := \lim_{n \to \infty} C_{n} .\]

Convergence of $C_n$

Observe that for any partition $\boldsymbol S$ of $X$ we have

\[J(\boldsymbol S) \geq J \big( L(\boldsymbol S) \big).\]

Applying $L$ to $S$ results in two changes the objective $J = \sum_i\sum_{x \in S_i} \| x - \mu_i \|^2$: First, replace $S_i$ by $V_{\mu_i}$ which makes $J$ smaller by definition of a Voronoi cell. Second, replace $\mu_i$ by $\mu(V_{\mu_i})$. Note that $\mu(V)$ minimizes $\sum_{x\in V} \| x - \mu \|^2$ and thus the second change makes the sum smaller as well.

Furthermore equality holds if and only if $\boldsymbol S = L(\boldsymbol S)$. Therefore $J(C_n)$ defines a descending sequence in $\mathbb{R}$ which is bounded below by zero, hence it converges. Since the set of partitions of $X$ is finite $J$ only takes a finite number of values. This implies that $J(C_n)$ takes a constant value for $n$ sufficiently large. By the observation above this implies that $C_n$ is constant for sufficiently large $n$ as well.

Drawbacks

A solution to the $k$-Means problem will always be a partition of $X$ into $k$ subsets. A solution to the problem is not always suited to be interpreted as partition into “clusters”. Imagine a point cloud that is distributed according to a probability distribution centered along a straight line. Intuitively one would suggest a single “connected” cluster. However $k$-means by definition would suggest otherwise. Without further analysis we couldn’t tell the difference of that particular data set and another one scattered around $k$ different centers. So one should really see the solution for what it is. A partition, or “cover”, of $X$. Luckily there are further directions to go from here and to build on top of $k$-Means. We briefly sketch one of those possible extensions in the section below.

Where to go from here - Witness complexes

Let $\boldsymbol S = \{ S_1,\ldots,S_k \} $ be a partition of $X$ obtained by the Lloyd’s algorithm say. We would like to associate a simplicial complex to $\boldsymbol S$. In a previous post on the Mapper construction I explained how to construct the nerve of a covering of $X$. However, since $\boldsymbol S$ is a partition, i.e. the sets are mutually disjoint, this construction will only yield a trivial zero-dimensional complex. All we have to do is to slightly enlarge the sets in the partition $\boldsymbol S$. For $\varepsilon > 0$ we define

\[\boldsymbol S_\varepsilon := \big\{ N_\varepsilon(S_1), \ldots, N_\varepsilon(S_k) \big\},\]

where $N_\varepsilon(S)$ denotes the epsilon neighbourhood of $S$ in $X$, i.e. the set of points in $x$ whose distance to $S$ is at most $\varepsilon$. We can compute the nerve $\check{N}(\boldsymbol S_\varepsilon)$ of the enlarged cover $\boldsymbol S_\varepsilon$. For the “right” choice of $\varepsilon$ we are now able to distinguish the two data sets given in the previous section.

A construction that is closely related (almost similar) to the above is the following. Let $\Lambda$ be a family of landmarks. The strong Witness complex $\mathcal{W}^s(X,\Lambda ,\varepsilon)$ (cf. (Carlsson, 2009)) is defined to be the complex whose vertex set is $\Lambda$, and where a collection $(\lambda_{1},\ldots, \lambda_{i})$ defines an $l$-simplex if and only if there is a witness $x \in X$ for $(\lambda_{1},\ldots, \lambda_{i})$, i.e.

\[d(x,\lambda_{j}) \leq d(x,\Lambda) + \varepsilon, \text{ for $j=1,...,i$}.\]

References

Macqueen, J. (1967). Some methods for classification and analysis of multivariate observations. In 5-Th Berkeley Symposium on Mathematical Statistics and Probability, 281–297.
Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137. https://doi.org/10.1109/TIT.1982.1056489
Carlsson, G. (2009). Topology and Data. Bull. Amer. Math. Soc., 46, 255–308.

K-Means clustering and Lloyd's algorithm