Iterative Approaches to Row Sampling

October 24, 2012

We show faster algorithms for solving regression problems based on estimating statistical leverage scores. A growing number of applications involve large, sparse, n * d matrices where n >> d. For many of these applications the more expensive operations involve the smaller, d * d product A^TA. Recent works by Clarkson, Drineas, Magdon-Ismail, Mahoney and Woodruff led to algorithms that approximate this product in O(ndlogn + poly(d)) and O(nnz(A)+poly(d)) time respectively. When the size of the matrix is much larger than the cost of more expensive operations involving the smaller d*d matrix, these algorithms give significant savings in running time.

We use techniques originally developed for spectral sparsification of graphs to design algorithms for approximating A^TA. Our algorithm finds a matrix B consisting of a small number of scaled rows of A such that B^TB is close to A^TA. It runs in time close to O(nnz(A) + d^{w} \epsilon^{-2}), which is optimal for this type of algorithms. The key to our approach is using statistical leverage scores in an iterative fashion, which allows the conversion of crude approximations to higher quality ones.

We use techniques originally developed for spectral sparsification of graphs to design algorithms for approximating A^TA. Our algorithm finds a matrix B consisting of a small number of scaled rows of A such that B^TB is close to A^TA. It runs in time close to O(nnz(A) + d^{w} \epsilon^{-2}), which is optimal for this type of algorithms. The key to our approach is using statistical leverage scores in an iterative fashion, which allows the conversion of crude approximations to higher quality ones.