· Large-scale Hierarchical Classification: Using classification to provide organizational views of large data is becoming increasingly important in the Big Data era. For instance, Wikipedia articles are indexed using over 600,000 categories in a dependency graph. Jointly optimizing all the classifiers (one per node) in such a large graph or hierarchy presents significant challenges for structured learning. We have developed new statistical learning frameworks and scalable algorithms that solved joint optimization problems with over one trillion (4 TB) model parameters in 37 hours and produced the best results in the international PASCAL benchmark evaluations for large-scale classification (Gopal; Gopal & Yang, KDD 2013; Gopal et al., NIPS 2012).
· Mining the Web for Customized Curriculum Planning (On-going NSF project): With massive quantities of educational materials freely available on the web, the vision of universal education appears within our grasp. General-purpose search engines are insufficient, as they do not focus on educational materials, objectives, pre-requisite relations, etc., nor do they stitch together multiple sources to create curricula customized to students’ goals and current knowledge. The project focuses on: 1) extracting educational units from diverse web sites and representing them in a large directed graph, whose nodes are content descriptors and whose edges encode pre-requisite and other relations; 2) conducting multi-field topic inference via a new family of graphical models to infer relations among educational units, enriching the graph; and 3) automated curricular planning, which provides sequences of lessons, courses, exercises and other educational units for a student to achieve his or her educational goals, conditioned on current skills. The curriculum planner enriches a graph traversal path with alternate paths, reinforcement options, and conditional branches.
· Hierarchical, Dynamic and Multi-field Topic Modeling (On-going NSF project): Modeling information dynamics at different levels of granularity is an open challenge. We are developing new Bayesian von Mises-Fisher topical clustering techniques, including hierarchical and dynamic models that outperform existing methods and scale to large data. Our approach consists of multi-field graphical models for correlated latent topics, semi-supervised topology learning, metric learning, transfer learning and temporal trend modeling. We evaluate on large datasets of scientific literature (Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, Statistics, etc.), as well as news-story collections.
· Multi-Task Active Learning: Active learning selects the most informative instances to label in the process of iterative retraining of classification or regression models. Multi-task active learning (MTAL) extends this idea by leveraging inter-task dependencies when estimating the impact of newly selected instances, instead of selecting instances for each task in isolation. We have developed a family of MTAL methods, called Benevolent Active Learning, that explicitly estimate the impact of supervision across tasks and leverage various dependence structures (hierarchies, networks, latent-factor correlations). We have also pursued Personalized Active Learning, which optimizes the learning curve of the system not only by selecting informative instances to label, but also by selecting the most knowledgeable labelers for the selected instances. (A Harpale, PhD Thesis; J Zhang, PhD Thesis)
· Novelty-based Information Retrieval: Jointly optimizing the utility of Information Retrieval (IR) systems based on both the relevance and the novelty of retrieved documents to the user is an open challenge. We have developed a unified theoretical framework and scalable algorithms for multi-session retrieval, adaptive filtering and online recommendation over document streams. Semi-supervised learning is used to identify informative “nuggets” in documents, and to optimize ranked lists so that they maximize the coverage of informative nuggets and minimize redundancy. Users’ tendency to abandon search at varying points in ranked lists and their tolerance of redundancy are also modeled in a stochastic process. By learning user-specific parameters in the model, the system can be personalized for each user or user group. We have also proposed new metrics for evaluating the expected utility of multi-session ranked lists based on both relevance and novelty. (A Lad, PhD Thesis; Yang & Lad, ICTIR 2009; Lad & Yang, CIKM 2010)
· Large-scale Optimization for Online Advertising: Sponsored search is an important means of Internet monetization, and is the driving force of major search engines today. Placing advertisements so as to maximize revenue for search engines while satisfying the needs of both users and advertisers is a hard problem. In collaboration with Microsoft Research Asia, we have developed a new (and the first) probabilistic optimization framework based on joint modeling of per-click auctions and campaign-level guaranteed delivery of advertisements. We also developed a hierarchical divide-and-conquer strategy for solving the very large optimization problem with millions of users/queries (demands) and massive campaigns (supplies) in the ever-evolving Internet. (K Salomatin, PhD Thesis)
· Large-scale Optimization for Wind Farm Planning: We are developing a new probabilistic framework to model the process of wind-energy power generation, and to maximize the power output by optimizing the placement of turbines based on a large number of variables (observed or latent), including turbine locations, wind directions and speeds at all locations, and non-linear interactions among wind and turbines (e.g., wake interference). Conventional solutions (e.g., Integer Linear Programming) do not scale to problems with a very large number of turbines and complex interactions. We have developed a new probabilistic optimization framework with a hierarchical divide-and-conquer strategy for solving very large problems in wind farm planning. (K Salomatin, PhD Thesis)
· Personalized Email Prioritization based on Content and Social Network Analysis: Statistical learning research in personalized email prioritization (PEP) has been relatively sparse due to privacy issues, since people are reluctant to share personal messages and importance judgments with the research community. We have developed PEP methods under the assumption that the system can only access the personal email of each user during the training and testing of the model for that user. Specifically, our focus is on the analysis of personal email networks for discovering user groups and inducing social importance features for email senders and receivers from the viewpoint of each particular user. Using a classification framework to model the mapping from email messages to the appropriate personal priority levels, the system leverages both standard features of email messages and induced social features of senders and receivers in an enriched vector space.
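To make the idea of an "enriched vector space" in the email prioritization item concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the project's actual method: the mailbox format, the function names `contact_importance` and `enriched_vector`, and the choice of "how often the user writes to a contact" as the social importance score are all hypothetical stand-ins. Only one user's own mailbox is used, in line with the privacy assumption above.

```python
from collections import Counter

def contact_importance(mailbox, user):
    """Hypothetical social feature: the share of the user's outgoing
    messages that are addressed to each contact."""
    sent = Counter()
    for msg in mailbox:
        if msg["sender"] == user:
            sent.update(msg["recipients"])
    total = sum(sent.values())
    return {c: n / total for c, n in sent.items()} if total else {}

def enriched_vector(msg, vocabulary, importance):
    """Concatenate standard content features (word counts) with the
    induced social feature of the message's sender."""
    words = Counter(msg["body"].lower().split())
    content = [float(words[w]) for w in vocabulary]
    social = [importance.get(msg["sender"], 0.0)]
    return content + social

# Toy personal mailbox for a single user, "me" (made-up data).
mailbox = [
    {"sender": "me", "recipients": ["advisor"], "body": "draft attached"},
    {"sender": "me", "recipients": ["advisor", "labmate"], "body": "meeting notes"},
    {"sender": "me", "recipients": ["advisor"], "body": "revised draft"},
    {"sender": "advisor", "recipients": ["me"], "body": "deadline for draft"},
    {"sender": "newsletter", "recipients": ["me"], "body": "weekly deals"},
]

importance = contact_importance(mailbox, "me")
vocabulary = ["deadline", "draft", "deals"]
vec_advisor = enriched_vector(mailbox[3], vocabulary, importance)
vec_news = enriched_vector(mailbox[4], vocabulary, importance)
```

Vectors built this way could then be passed, together with the user's priority labels, to any standard classifier. The single social dimension here is only a placeholder for richer features induced from the personal email network.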