# Project Presentations: Advanced Optimization and Randomized Algorithms

- Gates&Hillman Centers
- 4307

## Session Two

Please join us for the 801 project presentations tomorrow and Wednesday. The students have selected challenging projects which, we hope, will spark some really interesting discussions.

**3:00-3:20 - Veeranjaneyulu Sadhanala, Willie Neiswanger, Dai Wei, Yuxiang Wang***"Asynchronous Distributed Minibatch Frank-Wolfe Optimization"*

Frank-Wolfe algorithms, also known as conditional gradient algorithms, are first-order, iterative methods designed for constrained convex optimization. Frank-Wolfe algorithms and related extensions have received much renewed interest very recently due to their relevance to machine learning applications, the structural implications induced by the algorithm itself (such as low-rank sparsity), and most importantly, the necessity for simple methods in the Big Data Era. In this project, we aspire to make Frank-Wolfe algorithms more suitable for web-scale machine learning. In particular, we develop an asynchronous distributed minibatch Frank-Wolfe procedure that satisfies the following two criteria: (1) data points are distributed over multiple machines, which each perform stochastic gradient updates and (2) each machine sends updates to a central server in an asynchronous parallel manner (i.e. without waiting for other machines to complete an iteration before continuing). These two goals are essential for the successful deployment of Frank-Wolfe algorithms in real-world machine learning systems such as GraphLab and parameter servers. We first define an asynchronous parallel Frank-Wolfe algorithm, analyze the convergence of our algorithm, and then demonstrate its performance empirically.

**3:20-3:40 - Hsiao-Yu Tung, Manzil Zaheer, Chao-Yuan Wu**

*"A Spectral Algorithm on Hierarchical Dirichlet Processes"*

In this talk, we present our proposed spectral algorithm for estimating the parameters in Hierarchical Dirichlet process (HDP) [2]. HDP mixture model solves problems involving groups of data, where each observation within a group is a draw from a mixture model, and where it is desirable to share mixture components between groups . In previous work, inference on the HDP is carried out using Markov chain Monte Carlo techniques such as Gibbs sampling or by collapsed sampler as Chinese restaurant franchise (CRF). However, the computational complexity is usually high. In this project, we solve the parameter estimation of HDP by using tensor method. Tensor method has been applied to different models, such as spherical Gaussian mixtures, hidden Markov model, independent component analysis, and LDA successfully [1]. The basic idea is essentially the method of moments: (i) compute the moments or other statistics of data, and (ii) find the parameters that generate the same quantity. Specifically, we estimate the parameters in HDP by decomposing a symmetric tensor derived from the moments. The decomposition can be seen as a generalization of singular value decomposition for matrices. By using the defined distribution of priors in [2], we derive special tensor structure from lower-order moments such that the hidden latent features are obtained by power tensor method [1]. We test our algorithm on HDP mixture model, and compare the performance with CRF. The documents that we used for these experiments consist of 1740 articles from the proceedings of the Neural Information Processing Systems (NIPS). We show that our method produce very similar results as those produced by CRF, while being over 100 times faster.

References:

[1] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. CoRR, (abs/1210.7559), 2012.

[2] Y. W. The, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical dirichlet processes. Journal of the American Statistical Association, (101):1566–1581, 2005.

** 3:40-4:00 Zichao Yan, Jun Woo Park****" Kernel learning based on Fastfood representation"**

Learning kernels directly from the data is important in many practical applications. Previous work on kernel learning mainly focused on improving accuracy by combining several kernel functions. However, such methods can not scale to large dataset. In this project, we aim to learn the kernel while maintain scalability. We make use of Fastfood, which has shown how to do fast basis function expansion based on frequency sampling of kernels using random projection methods. We first represent the kernel using the basis function of the spectrum samples and then optimize the log-likelihood objection function over the hyper parameters of the spectrum to learn the kernel. This method can beat RBF kernel in some dataset of specific patterns. However, it suffers from overfitting problem easily. Instead, we then learn the kernel function whose spectrum is a mixture of gaussians using Fastfood representation. We demonstrate the effective of this methods. Further, we use the bayesian regression on the basis function to maintain scalability. Extensive experiments are done to show the effectiveness of our methods.