Given a set of points \(P \subset \mathbb{R}^{d}\), the \(k\)-means clustering problem is to find a set of \(k\) centers \(C = \{c_{1},\ldots,c_{k}\}\), \(c_{i} \in \mathbb{R}^{d}\), such that the objective function \(\sum_{x \in P} e(x, C)^{2}\), where \(e(x, C)\) denotes the Euclidean distance between \(x\) and the closest center in \(C\), is minimized. This is one of the most prominent objective functions that have been studied with respect to clustering.
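As an illustration of the objective, the following minimal Python sketch evaluates the sum of squared distances to the nearest center for a given set of centers. The function names `sq_dist` and `kmeans_cost` and the toy data are assumptions made for this example and do not come from the paper.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans_cost(points, centers):
    """Sum over all points of the squared distance to the closest center."""
    return sum(min(sq_dist(x, c) for c in centers) for x in points)


# Hypothetical toy instance: three points in the plane and two centers.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]
centers = [(0.0, 1.0), (10.0, 0.0)]
print(kmeans_cost(points, centers))  # 1 + 1 + 0 = 2.0
```

The sketch only evaluates the objective for fixed centers; finding centers that minimize it is the hard part of the problem.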
\(D^{2}\)-sampling (Arthur and Vassilvitskii, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’07, pp. 1027–1035, SIAM, Philadelphia, 2007) is a simple non-uniform sampling technique for choosing points from a set of points. It works as follows: given a set of points \(P \subset \mathbb{R}^{d}\), the first point is chosen uniformly at random from \(P\). Subsequently, a point from \(P\) is chosen as the next sample with probability proportional to the square of the distance of this point to the nearest previously sampled point.
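The sampling procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `d2_sample`, the optional `rng` parameter, and the uniform fallback used when every point already coincides with a sampled point are choices made for this example.

```python
import random


def sq_dist(x, y):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def d2_sample(points, k, rng=None):
    """Pick k points from `points` by D^2-sampling."""
    rng = rng or random.Random()
    # The first point is chosen uniformly at random.
    samples = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest sample.
    d2 = [sq_dist(p, samples[0]) for p in points]
    while len(samples) < k:
        if sum(d2) == 0:
            # Assumption: if all weights are zero, fall back to uniform choice.
            nxt = rng.choice(points)
        else:
            # Probability proportional to squared distance to nearest sample.
            nxt = rng.choices(points, weights=d2, k=1)[0]
        samples.append(nxt)
        d2 = [min(old, sq_dist(p, nxt)) for old, p in zip(d2, points)]
    return samples


# Hypothetical toy data: two far-apart locations, one of them duplicated.
pts = [(0.0, 0.0), (0.0, 0.0), (10.0, 10.0)]
print(d2_sample(pts, 2, random.Random(0)))
```

On this toy input, whichever location is picked first leaves all of the sampling weight on the other one, so the two samples always cover both locations. This is the behavior that makes the technique useful for seeding centers.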
\(D^{2}\)-sampling has been shown to have nice properties with respect to the \(k\)-means clustering problem. Arthur and Vassilvitskii (2007) show that \(k\) points chosen as centers from \(P\) using \(D^{2}\)-sampling give an \(O(\log k)\) approximation in expectation. Ailon et al. (NIPS, pp. 10–18, 2009) and Aggarwal et al. (Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pp. 15–28, Springer, Berlin, 2009) extended the results of Arthur and Vassilvitskii (2007) to show that \(O(k)\) points chosen as centers using \(D^{2}\)-sampling give an \(O(1)\) approximation to the \(k\)-means objective function with high probability. In this paper, we further demonstrate the power of \(D^{2}\)-sampling by giving a simple randomized \((1+\epsilon)\)-approximation algorithm that uses \(D^{2}\)-sampling at its core.