October 12, 2016

Unsupervised learning is widely recognized as one of the most important challenges facing machine learning nowadays. However, unlike supervised learning, our current theoretical understanding of those tasks, and in particular of clustering, is very rudimentary. Although hundreds of clustering papers are being published every year, there is hardly any work reasoning about clustering independently of any particular algorithm, objective function, or generative data model. My talk will focus on such clustering research. I will discuss two aspects in which theory could play a significant role in guiding the use of clustering tools. The first is model selection - how should a user pick an appropriate clustering tool for a given clustering problem, and how should the parameters of such an algorithmic tool be tuned? In contrast with other common computational tasks, in clustering, different algorithms often yield drastically different outcomes. Therefore, the choice of a clustering algorithm may play a crucial role in the usefulness of an output clustering solution. However, there currently exist no methodical guidance for clustering tool selection for a given clustering task. I will describe some recent proposals aiming to address this crucial lacuna. The second aspect of clustering that I will address is the complexity of computing a cost minimizing clustering (given some clustering objective function). Once a clustering model (or objective) has been picked, the task becomes an optimization problem. While most of the clustering objective optimization problems are computationally infeasible, they are being carried out routinely in practice. This theory-practice gap has attracted significant research attention recently. I will describe some of the theoretical attempts to address this gap and discuss how close they bring us to a satisfactory understanding of the computational resources needed for achieving good clustering solutions.