Clustering Affine Subspaces: Hardness and Algorithms

February 06, 2013

We study a generalization of the famous $k$-center problem
where each object is an affine subspace of dimension
$\Delta$, and give either the first or significantly
improved algorithms and hardness results for many
combinations of parameters. This generalization from points
($\Delta=0$) is motivated
by the analysis of incomplete data, a pervasive challenge in
statistics: incomplete data
objects in $\mathbb{R}^d$ can be modeled as affine
subspaces. We give three algorithmic results for different
values of $k$, under the assumption that all subspaces are
axis-parallel, the main case of interest because of the
correspondence to missing entries in data tables.

1) $k=1$: Two polynomial time approximation schemes which runs in $poly(\Delta, 1/\epsilon) nd$.

2) $k=2$: $O(\Delta^{1/4})$-approximation algorithm which runs in $poly(n, d, \Delta)$

3) General $k$: Polynomial time approximation scheme which runs in $2^{O(\Delta k \log k (1 + 1/\epsilon^2))}nd$

We also prove nearly matching hardness results; in both the general (not necessarily axis-parallel) case (for $k \geq 2$) and in the axis-parallel case (for $k \geq 3$), the running time of an approximation algorithm with any approximation ratio cannot be polynomial in even one of $k$ and $\Delta$, unless P = NP. Furthermore, assuming that the 3-SAT problem cannot be solved subexponentially, the dependence on both $k$ and $\Delta$ must be exponential in the general case (in the axis-parallel case, only the dependence on $k$ drops to $2^{\Omega(\sqrt{k})}$). The simplicity of the first and the third algorithm suggests that they might be actually used in statistical applications. The second algorithm, which demonstrates a theoretical gap between the axis-parallel and general case for $k=2$, displays a strong connection between geometric clustering and classical coloring problems on graphs and hypergraphs, via a new Helly-type theorem.

1) $k=1$: Two polynomial time approximation schemes which runs in $poly(\Delta, 1/\epsilon) nd$.

2) $k=2$: $O(\Delta^{1/4})$-approximation algorithm which runs in $poly(n, d, \Delta)$

3) General $k$: Polynomial time approximation scheme which runs in $2^{O(\Delta k \log k (1 + 1/\epsilon^2))}nd$

We also prove nearly matching hardness results; in both the general (not necessarily axis-parallel) case (for $k \geq 2$) and in the axis-parallel case (for $k \geq 3$), the running time of an approximation algorithm with any approximation ratio cannot be polynomial in even one of $k$ and $\Delta$, unless P = NP. Furthermore, assuming that the 3-SAT problem cannot be solved subexponentially, the dependence on both $k$ and $\Delta$ must be exponential in the general case (in the axis-parallel case, only the dependence on $k$ drops to $2^{\Omega(\sqrt{k})}$). The simplicity of the first and the third algorithm suggests that they might be actually used in statistical applications. The second algorithm, which demonstrates a theoretical gap between the axis-parallel and general case for $k=2$, displays a strong connection between geometric clustering and classical coloring problems on graphs and hypergraphs, via a new Helly-type theorem.

*Based on joint work with Leonard Schulman.*