Sparse regression (Lasso, sparse logistic regression, etc.) is a workhorse of modern machine learning. I am interested in parallel optimization methods which can take advantage of properties of sparse regression to allow parallelization, reduce computation, or reduce communication. For example, correlation between features can make optimization difficult when partitioning datasets by features, but algorithms can adapt to this correlation to allow greater parallelization. Another key property is sparsity in the optimal solution, which can permit more efficient iterations and reduced communication.
Our Shotgun algorithm was one of the first parallel coordinate descent methods for sparse regression. It comes with both theoretical convergence guarantees and strong empirical performance. Since we proposed it, many variants have followed.
The actual algorithm is a very simple generalization of coordinate descent, allowing efficient implementation and easy modifications. Intuitively, it allows the amount of parallelization to be chosen based on correlation between features. Uncorrelated problems allow lots of parallel coordinate updates, while very correlated features permit only a few parallel coordinate updates.
Joseph K. Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin.
Parallel Coordinate Descent for L1-Regularized Loss Minimization.
International Conference on Machine Learning (ICML), 2011.
Walsh-Hadamard Transforms (WHTs) are special cases of multi-dimensional Discrete Fourier Transforms (DFTs). A DFT takes a length-N vector indexed by 1,...,N and transforms it to another length-N vector. A WHT does the same thing, but it has special structure. A WHT operates on an n-dimensional binary finite field, so the vector is of length N=2^n.
Comparing with "machine learning" regression: For WHTs, features are columns of the WHT matrix, and the sparsity in the solution refers to the WHT signal having a small number of non-zero frequency coefficients. Since the WHT has special structure, we can locate and estimate these non-zero coefficients in sub-linear time. Suppose the length-N signal has K non-zero coefficients. Lasso must look at the full N-by-N WHT matrix, taking at least N^2 time. Our methods can subsample cleverly, taking O(K*log^2(N)) samples and O(K*log^3(N)) time.
Basic idea: We subsample the signal, take the WHT of this shorter sequence, and use aliasing properties of the WHT to show that non-zero coefficients are separated out into (mostly) separate bins. We examine each bin, decide how many non-zero coefficients it contains, and decode those coefficients.
Xiao Li, Joseph K. Bradley, Sameer Pawar, and Kannan Ramchandran.
Robustifying the Sparse Walsh-Hadamard Transform without Increasing the Sample Complexity of O(K log N).
IEEE International Symposium on Information Theory (ISIT), 2014.