Clustering for Edge-Cost Minimization, L. J. Schulman. Proceedings of the 32'nd STOC, 2000.

We address the problem of partitioning a set of $n$ points into clusters, so as to minimize the sum, over all intracluster pairs of points, of the cost associated with each pair. We obtain a randomized approximation algorithm for this problem, for the cost functions $\ell_2^2,\ell_1$ and $\ell_2$, as well as any cost function isometrically embeddable in $\ell_2^2$.

The maximization problem (maximize the costs of intercluster edges) is approximated with high probability to within multiplicative factor $(1-\ep)$; while the minimization problem either receives a multiplicative $1+\ep$ approximation, or else the optimal clustering is correctly identified except for a mislabelling of an $\ep$ fraction of the point set. Given a fixed approximation parameter $\ep$, the runtime is linear in $n$ for $\ell_2^2$ problems of dimension $o(\log n / \log \log n)$; and $n^{O(\log \log n)}$ in the general case.

The case $\ell_2^2$ is addressed by combining three elements: (a) Variable-probability sampling of the given points, to reduce the size of the data set. (b) Near-isometric dimension reduction. (c) A deterministic exact algorithm which runs in time exponential in the dimension (rather than the number of points). The remaining cases are addressed by reduction to $\ell_2^2$.