Amr Ahmed
Amr Ahmed is a Senior Staff Research Scientist at Google. He received his M.Sc and PhD degrees from the School of Computer Science, Carnegie Mellon University in 2009 and 2011, respectively. He received the best paper award at KDD 2014 , the best Paper Award at WSDM 2014, the 2012 ACM SIGKDD Doctoral Dissertation Award, and a best paper award (runner-up) at WSDM 2012. He co-chaired the WWW'18 track on Web Content Analysis and served as an Area Chair for IJCAI 2019, SIGIR 2019, SIGIR 2018, ICML 2018, ICML 2017, KDD 2016, WSDM 2015, ICML 2014, and ICDM 2014. His research interests include large-scale machine learning, data/web mining, user modeling, personalization, social networks and content analysis.
Authored Publications
Google Publications
Other Publications
Sort By
Non-Stationary Off-policy Optimization
Joey Hong
Branislav Kveton
International Conference on Artificial Intelligence and Statistics (AISTATS) (2021)
Preview abstract
Off-policy learning is a framework for estimating the value of and optimizing policies offline from logged data without deploying them. Real-world environments are nonstationary, and the optimized policies should be able to adapt to these changes. To address this challenge, we study the novel problem of off-policy optimization in piecewise-stationary environments. Our key idea is to use a change-point detector to partition the logged data into categorical latent states, then find a near-optimal policy conditioned on latent state. We derive high-probability bounds on our off-policy estimates and optimization. Furthermore, we also propose a practical approach to deploy our policy online and evaluate our approach comprehensively on a real-world clickstream dataset.
View details
Exact and Approximate Hierarchical Clustering Using A*
Craig Greenberg
Sebastian Macaluso
Nicholas Monath
Avinava Dubey
Patrick Flaherty
Kyle Cranmer
Andrew McCallum
Uncertainty in Artificial Intelligence (2021)
Preview abstract
Hierarchical clustering is known to be broadly applicable in myriad domains. Despite its extensive use, existing approximate inference methods are insufficient for applications that require either exact or high-quality approximate solutions.For example, in high energy physics, we are interested in discovering high-quality jet structures, which are hierarchical clusterings of particles.In this paper we view inference as a search problem and focus on inferring high-quality hierarchies for a given cost function, rather than using ad hoc (e.g., greedy or beam) methods. This leads naturally to the use of A*, which has seldom been used for clustering (with the notable exception of \citep{daume2007fast}). Unlike ad hoc search methods, A* carries with it optimality guarantees. However, applying A* search naively leads to a large space and time complexity. To address this challenge, we develop a novel augmented trellis data structure and dynamic programming algorithm for A* that result in substantially improved time and space complexity bounds while still computing the globally optimal hierarchical clustering. We demonstrate that our proposed method increases the number of points for which an exact solution can be found by 25$\%$ compared with previous work \cite{greenberg2020compact}. Furthermore, our approach yields a natural approximation that scales to larger datasets and achieves substantially higher quality results than ad hoc search baselines, motivating its use in applications demanding exact or high-quality approximations.
View details
DAG-structured Clustering by Nearest-Neighbors
Nicholas Monath
Avinava Dubey
Andrew McCallum
International Conference on Artificial Intelligence and Statistics (2021)
Preview abstract
Hierarchical clusterings compactly encode multiple granularities of clusters within a tree structure. Hierarchies, by definition, fail to capture different flat partitions that are not subsumed in one another. In this paper, we advocate for an alternative structure for representing multiple alternative clusterings, a directed acyclic graph (DAG). By allowing nodes to have multiple parents, DAG structure is not only more flexible than a tree but also allows for points to be members of multiple clusters. We describe a large-scale, map-reduce-based algorithm to infer these DAGs. Our algorithm works by simply merging nearest neighbor substructures to form a DAG structure. Our algorithm is supported with theoretical guarantees showing its representational capacity over tree-based algorithms. Further, we provide comprehensive empirical experiments on large-scale clustering benchmarks and entity resolution datasets. Our results show that our method is as scalable as and more accurate than state-of-the-art tree-based techniques.
View details
Scalable Hierarchical Agglomerative Clustering
Nick Monath
Avinava Dubey
Guru Prashanth Guruganesh
Andrew McCallum
Gokhan Mergen
Mert Terzihan
Bryon Tjanaka
Yuchen Wu
Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2021), 1245–1255
Preview abstract
The applicability of agglomerative clustering, for inferring both hierarchical and flat clustering, is limited by its scalability. Existing scalable hierarchical clustering methods sacrifice quality for speed and often lead to over-merging of clusters. In this paper, we present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points. We perform a detailed theoretical analysis, showing that under mild separability conditions our algorithm can not only recover the optimal flat partition but also provide a two-approximation to non-parametric DP-Means objective [32]. This introduces a novel application of hierarchical clustering as an approximation algorithm for the non-parametric clustering objective. We additionally relate our algorithm to the classic hierarchical agglomerative clustering method. We perform extensive empirical experiments in both hierarchical and flat clustering settings and show that our proposed approach achieves state-of-the-art results on publicly available clustering benchmarks. Finally, we demonstrate our method’s scalability by applying it to a dataset of 30 billion queries. Human evaluation of the discovered clusters show that our method finds better quality of clusters than the current state-of-the-art.
View details
Big Bird: Transformers for Longer Sequences
Guru Prashanth Guruganesh
Avinava Dubey
Joshua Ainslie
Anirudh Ravula
Qifan Wang
NeurIPS (2020)
Preview abstract
Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (in terms of memory mainly) on the sequence length due to their full attention mechanism. To remedy this, we propose, \emph{BigBird}, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that \emph{BigBird} is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis demonstrates the need for having an O(1) global tokens, such as CLS, that attend to the entire sequence as part of the sparse attentions. We show that the proposed sparse attention can handle sequences of length up to 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context, \emph{BigBird} drastically improves performance on various NLP tasks such as question answering.
View details
Latent Bandits Revisited
Joey Hong
Branislav Kveton
Advances in Neural Information Processing Systems 33 (NeurIPS 2020), pp. 13423-13433
Preview abstract
A latent bandit is a bandit problem where the learning agent knows reward distributions of arms conditioned on an unknown discrete latent state. The goal of the agent is to identify the latent state, after which it can act optimally. This setting is a natural midpoint between online and offline learning, where complex models can be learned offline and the agent identifies the latent state online. This is of high practical relevance, for instance in recommender systems. In this work, we propose general algorithms for latent bandits, based on both upper confidence bounds and Thompson sampling. The algorithms are contextual, and aware of model uncertainty and misspecification. We provide a unified theoretical analysis of our algorithms, which have lower regret than classic bandit policies when the number of latent states is smaller than actions. A comprehensive empirical study showcases the advantages of our approach.
View details
Preview abstract
Learning continuous representations of discrete objects such as text, sentences, users, and movies lies at the heart of many applications including involving text and user modeling. Unfortunately, traditional methods that embed all objects do not scale to large vocabulary sizes and embedding dimensions. In this paper, we propose a general method, Anchor & Transform (ANT) that learns sparse representations of discrete objects by jointly learning a small set of \textit{anchor embeddings} and a \textit{sparse transformation} from anchor objects to all objects. ANT is scalable, flexible, end-to-end trainable, and allows the user to easily incorporate domain knowledge about object relationships. ANT also recovers several task-specific baselines under certain structural assumptions on the anchor embeddings and transformation matrices. On several benchmarks involving text and user modeling, ANT demonstrates strong performance with respect to accuracy and sparsity.
View details
Gradient-based Hierarchical Clustering using Continuous Representations of Trees in Hyperbolic Space
Nick Monath
Daniel Silva
Andrew McCallum
The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’19) (2019)
Preview abstract
Hierarchical clustering is typically performed using algorithmic-based optimization searching over the discrete space of trees. While these optimization methods are often effective, their discreteness restricts them from many of the benefits of their continuous counterparts, such as scalable stochastic optimization and the joint optimization of multiple objectives or components of a model (e.g.end-to-end training). In this paper, we present an approach for hierarchical clustering that searches over continuous representations of trees in hyperbolic space by running gradient descent. We compactly represent uncertainty over tree structures with vectors in the Poincaré ball. We show how the vectors can be optimized using an objective related to recently proposed cost functions for hierarchical clustering. Using our method with a mini-batch stochastic gradient descent inference procedure, we are able to outperform prior work on clustering millions of ImageNet images by 15 points of dendrogram purity. Further, our continuous tree representation can be jointly optimized in multi-task learning applications offering a 9 point improvement over baseline methods
View details
Uncovering Hidden Structure in Sequence Data via Threading Recurrent Models
Daniel Silva
Yuchen Wu
Shibani Sanan
Surojit Chatterjee
Proceedings of the 12 ACM International Conference on Web Search and Data Mining (2019), pp. 186-194
Preview abstract
Long Short-Term Memory (LSTM) is one of the most powerful sequence models for user browsing history [17, 22] or natural language text [19]. Despite the strong performance, it has not gained popularity for user-facing applications, mainly owing to a large number of parameters and lack of interpretability. Recently Zaheer et al. [25] introduced latent LSTM Allocation (LLA) to address these problems by incorporating topic models with LSTM, where the topic model maps observed words in each sequence to topics that evolve using an LSTM model. In our experiments, we found the resulting model, although powerful and interpretable, to show shortcomings when applied to sequence data that exhibit multi-modes of behaviors with abrupt dynamic changes. To address this problem we introduce thLLA: a threading LLA model. thLLA has the ability to break each sequence into a set of segments and then model the dynamic in each segment using an LSTM mixture. In that way, thLLA can model abrupt changes in sequence dynamics and provides a better fit for sequence data while still being interpretable and requiring fewer parameters. In addition, thLLA uncovers hidden themes in the data via its dynamic mixture components. However, such generalization and interpretability come at a cost of complex dependence structure, for which inference would be extremely non-trivial. To remedy this, we present an efficient sampler based on particle MCMC method for inference that can draw from the joint posterior directly. Experimental results confirm the superiority of thLLA and the stability of the new inference algorithm on a variety of domains.
View details
Recurrent Recommender Networks
Chao-Yuan Wu
Alex Beutel
Alexander J. Smola
How Jing
Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (2017), pp. 495-503
Preview abstract
Recommender systems traditionally assume that user profiles and movie attributes are static. Temporal dynamics are purely reactive, that is, they are inferred after they are observed, e.g. after a user's taste has changed or based on hand-engineered temporal bias corrections for movies. We propose Recurrent Recommender Networks (RRN) that are able to predict future behavioral trajectories. This is achieved by endowing both users and movies with a Long Short-Term Memory (LSTM) autoregressive model that captures dynamics, in addition to a more traditional low-rank factorization. On multiple real-world datasets, our model offers excellent prediction accuracy and it is very compact, since we need not learn latent state but rather just the state transition function.
View details
Preview abstract
Recurrent neural network, such as Long-short term memory
(LSTM), are powerful tools for modeling sequential data,
however, they lack interpretability and requires large num-
ber of parameters. On the other hand, topic models, such
as Latent Dirichlet Allocation (LDA), are powerful tools for
uncovering the hidden structure in a document collection,
however, they lack the same strong predictive power as deep
models. In this paper we bridge the gap between such mod-
els and propose Latent LSTM Allocation (LLA). In LLA
each document is modeled as a sequence of words, and the
model jointly groups words into topics and learns the tempo-
ral dynamics over the sequence. Our model is interpretable,
concise and can capture intricate dynamics. We give an ef-
ficient MCMC-EM inference algorithm for our model that
scales to millions of documents. Our experimental evalu-
ations shows that the proposed model compares favorably
with several state-of-the-art baselines.
View details
State Space LSTM Models with Particle MCMC Inference
Xun Zheng
Alex J. Smola
Eric Xing
The Thirty-first Annual Conference on Neural Information Processing Systems (NIPS) workshop on Bayesian Deep Learning. (2017)
Preview abstract
Long Short-Term Memory (LSTM) is one of the most powerful sequence models.
Despite the strong performance, however, it lacks the nice interpretability as in
state space models. In this paper, we present a way to combine the best of both
worlds by introducing State Space LSTM (SSL) models that generalizes the earlier
work Zaheer et al. (2017) of combining topic models with LSTM. However,
unlike Zaheer et al. (2017), we do not make any factorization assumptions in our
inference algorithm. We present an efficient sampler based on sequential Monte
Carlo (SMC) method that draws from the joint posterior directly. Experimental
results confirms the superiority and stability of this SMC inference algorithm on
a variety of domains.
View details
A Practical Algorithm for Solving the Incoherence Problem of Topic Models In Industrial Applications
Preview abstract
Topic models are often applied in industrial settings to discover
user profiles from activity logs where documents correspond
to users and words to complex objects such as
web sites and installed apps. Standard topic models ignore
the content-based similarity structure between these objects
largely because of the inability of the Dirichlet prior to capture
such side information of word-word correlation. Several
approaches were proposed to replace the Dirichlet prior
with more expressive alternatives. However, this added expressivity
comes with a heavy premium: inference becomes
intractable and sparsity is lost which renders these alternatives
not suitable for industrial scale applications. In this
paper we take a radically different approach to incorporating
word-word correlation in topic models by applying this
side information at the posterior level rather than at the
prior level. We show that this choice preserves sparsity and
results in a graph-based sampler for LDA whose computational
complexity is asymptotically on bar with the state of
the art Alias base sampler for LDA. We illustrate the
efficacy of our approach over real industrial datasets that
span up to billion of users, tens of millions of words and
thousands of topics. To the best of our knowledge, our approach
provides the first practical and scalable solution to
this important problem
View details
Preview abstract
Hierarchical Bayesian models often capture distributions over a very large number of distinct atoms.
The need for this arises when organizing huge amount of unsupervised data, for instance, features extracted using deep convnets can be exploited to organize abundant unlabeled images.
Inference for hierarchical Bayesian models in such cases can be rather nontrivial, leading to approximate approaches.
In this work, we propose a sampler based on Cover Trees that is exact and that has guaranteed runtime logarithmic in the number of atoms and is polynomial in the inherent dimensionality of the underlying parameter space.
In other words, the algorithm is as fast as search over a hierarchical data structure and we demonstrate the effectiveness on both synthetic and real datasets, consisting of over 100 million images.
View details
Preview abstract
In online shopping, users usually express their intent through
search queries. However, these queries are usually in a rather
ambiguous form instead of being accurate. For example, it is
more likely (and easier) for users to write a query like “high end
bike” than “21 speed carbon frames jamis road bike”. It
is challenging to interpret these ambiguous queries and thus
search result accuracy suffers. A user oftentimes needs to go
through the frustrating process of refining search queries or
self-teaching from possibly unstructured information. However,
shopping is indeed a structured domain, that is composed
of category hierarchy, brands, product lines, features,
etc. It would have been much better if a shopping site could
understand users’ intent through this structure, present organized/structured
information, or even find items with the
right categories, brands or features for them.
In this paper we study the problem of inferring the latent
intent from unstructured queries and mapping them to
structured attributes. We present a novel framework that
jointly learns this knowledge from user consumption behaviors
and product metadata. We present a hybrid Long Short term
Memory (LSTM) joint model that is accurate and robust,
even though user queries are noisy and product catalog
is rapidly growing. Our study is conducted on a large-scale
dataset from Google Shopping, that is composed of millions
of items and user queries along with their click responses.
Extensive qualitative and quantitative evaluation shows that
the proposed model is more accurate, concise, and robust
than multiple possible alternatives.
View details
Preview abstract
Business-to-consumer (B2C) emails are usually generated by
filling structured user data (e.g. purchase, event) into tem-
plates. Extracting structured data from B2C emails allows
users to track important information on various devices.
However, it also poses several challenges, due to the re-
quirement of short response time for massive data volume,
the diversity and complexity of templates, and the privacy
and legal constraints. Most notably, email data is legally
protected content, which means no one except the receiver
can review the messages or derived information.
In this paper we first introduce a system which can extract
structured information automatically without requiring hu-
man review of any personal content. Then we focus on how
to annotate product names from the extracted texts, which
is one of the most difficult problems in the system. Nei-
ther general learning methods, such as binary classifiers, nor
more specific structure learning methods, such as Condition-
al Random Field (CRF), can solve this problem well.
To accomplish this task, we propose a hybrid approach,
which basically trains a CRF model using the labels pre-
dicted by binary classifiers (weak learners). However, the
performance of weak learners can be low, therefore we use
Expectation Maximization (EM) algorithm on CRF to re-
move the noise and improve the accuracy, without the need
to label and inspect specific emails. In our experiments, the
EM-CRF model can significantly improve the product name
annotations over the weak learners and plain CRFs.
View details
Dirichlet-Hawkes Processes with Applications to Clustering Continuous-Time Document Streams
Nan Du
Mehrdad Farajtabar
Alexander J. Smola
Le Song
Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2015), pp. 219-228
Preview abstract
Clusters in document streams, such as online news articles, can be induced by their textual contents, as well as by the temporal dynamics of their arriving patterns. Can we leverage both sources of information to obtain a better clustering of the documents, and distill information that is not possible to extract using contents only? In this paper, we propose a novel random process, referred to as the Dirichlet-Hawkes process, to take into account both information in a unified framework. A distinctive feature of the proposed model is that the preferential attachment of items to clusters according to cluster sizes, present in Dirichlet processes, is now driven according to the intensities of cluster-wise self-exciting temporal point processes, the Hawkes processes. This new model establishes a previously unexplored connection between Bayesian Nonparametrics and temporal Point Processes, which makes the number of clusters grow to accommodate the increasing complexity of online streaming contents, while at the same time adapts to the ever changing dynamics of the respective continuous arrival time. We conducted large-scale experiments on both synthetic and real world news articles, and show that Dirichlet-Hawkes processes can recover both meaningful topics and temporal
dynamics, which leads to better predictive performance in terms of content perplexity and arrival time of future documents.
View details
Taxonomy Discovery for Personalized Recommendation
Yuchen Zhang
Vanja Josifovski
Alexander J Smola
ACM International Conference on Web Search And Data Mining (WSDM) (2014)
Preview abstract
Personalized recommender systems based on latent factor models are widely used to increase sales in e-commerce. Such systems use the past behavior of users to recommend new items that are likely to be of interest to them. However, latent factor model suffer from sparse user-item interaction in online shopping data: for a large portion of items that do not have sufficient purchase records, their latent factors cannot be estimated accurately.
In this paper, we propose a novel approach that automatically discovers the taxonomies from online shopping data and jointly learns a taxonomy-based recommendation system. Out model is non-parametric and can learn the taxonomy structure automatically from the data. Since the taxonomy allows purchase data to be shared between item- s, it effectively improves the accuracy of recommending tail items by sharing strength with the more frequent items. Ex- periments on a large-scale online shopping dataset confirm that our proposed model improves significantly over state-of- the-art latent factor models. Moreover, our model generates high-quality and human readable taxonomies. Finally, us- ing the algorithm-generated taxonomy, our model even out- performs latent factor models based on the human-induced taxonomy, thus alleviating the need for costly manual taxonomy generation.
View details
Reducing the Sampling Complexity of Topic Models
Aaron Li
Sujith Ravi
Alexander J Smola
ACM Conference on Knowledge Discovery and Data Mining (KDD) (2014)
Preview abstract
Inference in topic models typically involves a sampling step to associate latent variables with observations. Unfortunately, the generative model loses sparsity with the increase in data, requiring O(k) operations per word for k latent states, such as topics. In this paper we propose an algorithm which requires only O(kd) operations per word, where kd is the number of actually instantiated topics in the document. For large document collections and structured hierarchical models kd ≪ k, thus yielding an order of magnitude speedup. Our method is general and it applies to a wide variety of statistical models.
At its core is the idea that dense, rapidly changing distributions can be approximated efficiently by the combination of a Metropolis-Hastings step, judicious use of sparsity, and amortized preprocessing via the alias method.
View details
Scaling Distributed Machine Learning with the Parameter Server
Mu Li
David G. Anderson
Jun Woo Park
Alexander J. Smola
Vanja Josifovski
James Long
Eugene J. Shekita
Bor-Yiing Su
Operating Systems Design and Implementation (OSDI), USENIX (2014), pp. 583-598
Preview abstract
We propose a parameter server framework for distributed
machine learning problems. Both data and workloads
are distributed over worker nodes, while the server nodes
maintain globally shared parameters, represented as dense
or sparse vectors and matrices. The framework manages
asynchronous data communication between nodes, and
supports flexible consistency models, elastic scalability,
and continuous fault tolerance.
To demonstrate the scalability of the proposed framework,
we show experimental results on petabytes of real
data with billions of examples and parameters on problems
ranging from Sparse Logistic Regression to Latent
Dirichlet Allocation and Distributed Sketching.
View details
Scalable Hierarchical Multitask Learning Algorithms for Conversion Optimization in Display Advertising
Abhimanyu Das
Alexander J. Smola
ACM International Conference on Web Search And Data Mining (WSDM) (2014)
Preview abstract
Many estimation tasks come in groups and hierarchies of related problems. In this paper we propose a hierarchical model and a scalable algorithm to perform inference for multitask learning. It infers task correlation and subtask structure in a joint sparse setting. Implementation is achieved by a distributed subgradient oracle and the successive application of prox-operators pertaining to groups and sub-groups of variables. We apply this algorithm to conversion optimization in display advertising. Experimental results on over 1TB data for up to 1 billion observations and 1 million attributes show that the algorithm provides significantly better prediction accuracy while simultaneously being efficiently scalable by distributed parameter synchronization.
View details
Measurement and modeling of eye-mouse behavior
Preview
LaDawn Jentzsch
Rory Sayres
Sujith Ravi
Alex J. Smola
Proceedings of the 22nd International World Wide Web Conference (2013)
Scalable Dynamic Nonparametric Bayesian Models of Content and Users
Preview
Eric P. Xing
International Joint Conference on Artificial Intelligence (IJCAI -- Best paper track) (2013)
Latent Factor Models with Additive Hierarchically-smoothed User Preferences
Sandeep Pandey
Vanja Josifovski
Lluis Garcia-Pueyo
Proceedings of The 6th ACM International Conference on Web Search and Data Mining (WSDM) (2013)
Preview abstract
Items in recommender systems are usually associated with annotated attributes such as brand and price for products; agency for news articles, etc. These attributes are highly informative and must be exploited for accurate recommendation. While learning a user preference model over these attributes can result in an interpretable recommender system and can hands the cold start problem, it suffers from two major drawbacks: data sparsity and the inability to model random effects. On the other hand, latent-factor collaborative filtering models have shown great promise in recommender systems; however, its performance on rare items is poor. In this paper we propose a novel model LFUM, which provides the advantages of both of the above models. We learn user preferences (over the attributes) using a personalized Bayesian hierarchical model that uses a combination (additive model) of a globally learned preference model along with user-specific preferences. To combat we smooth these preferences over the item-taxonomy an efficient forward-filtering and backward-smoothing algorithm. Our inference algorithms can handle both attributes (e.g., item brands) and continuous attributes (e.g., prices). We combine the user preferences with the latent- models and train the resulting collaborative filtering system end- using the successful BPR ranking algorithm. In our experimental analysis, we show that our proposed model several commonly used baselines and we carry out an ablation study showing the benefits of each component of our model.
View details
Distributed Large-scale Natural Graph Factorization
Preview
Nino Shervashidze
Shravan Narayanamurthy,
Vanja Josifovski
Alexander J Smola
Proceedings of the 22nd International World Wide Web Conference (WWW 2013)
The Nested Chinese Restaurant Franchise Process: User Tracking and Document Modeling
Preview
Liangjie Hong
Alexander J Smola
International Conference on Machine Learning (ICML) (2013)
Hierarchical Geographical Modeling of User locations from Social Media Posts
Preview
Liangjie Hong
Alexander J Smola
Proceedings of the 22nd International World Wide Web Conference (WWW 2013)
Focused Marix Factorization for Audience Selection in Display Advertising
Sandeep Pandey
Vanja Josifovski
Lluis Garcia-Pueyo
Jeff Yuan
Proceedings of the 29th International Conference on Data Engineering (ICDE) (2013)
Preview abstract
Audience selection is a key problem in display advertising systems in which we need to select a list of users who are interested (i.e., most likely to buy) in an advertising campaign. The users’ past feedback on this campaign can be leveraged to construct such a list using collaborative filtering techniques such as matrix factorization. However, the user-campaign interaction is typically extremely sparse, hence the conventional matrix factorization does not perform well. Moreover, simply combining the users feedback from all campaigns does not address this since it dilutes the focus on target campaign in consideration. To resolve these issues, we propose a novel focused matrix factorization model (FMF) which learns users’ preferences towards the specific campaign products, while also exploiting the information about related products. We exploit the product taxonomy to discover related campaigns, and design models to discriminate between the users’ interest towards campaign products and non-campaign products. We develop a parallel multi-core implementation of the FMF model and evaluate its performance over a real-world advertising dataset spanning more than a million products. Our experiments demonstrate the benefits of using our models over existing approaches.
View details
Hokusai | Sketching Streams in Real Time
Sergiy Matusevych
Alex Smola
Proceedings of the 28th International Conference on Conference on Uncertainty in Artificial Intelligence (UAI) (2012)
Preview abstract
We describe 北斎 Hokusai, a real time system which is able to capture frequency information for streams of arbitrary sequences of symbols. The algorithm uses the Count-Min sketch as its basis and exploits the fact that sketching is linear. It provides real time statistics of arbitrary events, e.g. streams of queries as a function of time. We use a factorizing approximation to provide point estimates at arbitrary (time, item) combinations.
View details
FastEx: Hash Clustering with Exponential Families
Preview
Sujith Ravi
Shravan Narayanamurthy
Alex Smola
Proceedings of the 26th Conference on Neural Information Processing Systems. (NIPS) (2012)
Preview abstract
A supervised topic model can utilize side information such as ratings or labels associated with documents or images to discover more predictive low dimensional topical
representations of the data. However, existing supervised topic models predominantly employ likelihood-driven objective functions for learning and inference, leaving the popular
and potentially powerful max-margin principle unexploited for seeking predictive representations of data and more discriminative topic bases for the corpus. In this paper, we
propose the maximum entropy discrimination latent Dirichlet allocation (MedLDA) model,
which integrates the mechanism behind the max-margin prediction models (e.g., SVMs)
with the mechanism behind the hierarchical Bayesian topic models (e.g., LDA) under a uni-
fied constrained optimization framework, and yields latent topical representations that are
more discriminative and more suitable for prediction tasks such as document classification
or regression. The principle underlying the MedLDA formalism is quite general and can be
applied for jointly max-margin and maximum likelihood learning of directed or undirected
topic models when supervising side information is available. Efficient variational methods
for posterior inference and parameter estimation are derived and extensive empirical studies
on several real data sets are also provided. Our experimental results demonstrate qualitatively and quantitatively that MedLDA could: 1) discover sparse and highly discriminative
topical representations; 2) achieve state of the art prediction performance;
View details
Web-Scale Multi-Task Feature Selection for Behavioral Targeting
Mohamed Aly
Abhimanyu Das
Alex Smola
Tasos Anastasakos
Proceedings of The 21st ACM International Conference on Information and Knowledge Management (CIKM), ACM (2012)
Preview abstract
A typical behavioral targeting system optimizing purchase
activities, called conversions, faces two main challenges: the web-scale amounts of user histories to process on a daily basis, and the relative sparsity of conversions. In this paper, we try to address these challenges through feature selection. We formulate a multi-task (or group) feature-selection problem among a set of related tasks (sharing a common set of features), namely advertising campaigns. We apply a group-sparse penalty consisting of a combination of an `1 and `2 penalty and an associated fast optimization algorithm for distributed parameter estimation. Our algorithm relies on a variant of the well known Fast Iterative Thresholding Algorithm (FISTA), a closed-form solution for mixed norm programming and a distributed subgradient oracle. To eciently handle web-scale user histories, we present a distributed inference algorithm for the problem that scales to billions of instances and millions of attributes. We show the superiority of our algorithm in terms of both sparsity and ROC performance over baseline feature selection methods (both single-task L1-regularization and multi-task mutual-information gain).
View details
Tutorial: New Templates for Scalable Data Analysis
Hokusai - Sketching Streams in Real Time
Supercharging Recommender Systems using Taxonomies for Learning User Purchase Behavior
Sandeep Pandey
Vanja Josifovski
Jeff Yuan
Lluis Garcia Pueyo
Proceedings of the 38th International Conference on Very Large Databases (VLDB) (2012)
Discovering Geographical Topics In The Twitter Stream
Liangjie Hong
Siva Gurumurthy
Kostas Tsioutsiouliklis
Alex Smola
Proceedings of The 21st International World Wide Web conference (WWW) (2012)
Scalable Inference in Latent Variable Models
Mohamed Aly
Joseph Gonzalez
Shravan Narayanamurthy
Alex Smola
Proceedings of The 5th ACM International Conference on Web Search and Data Mining (WSDM) (2012)
Fair and Balanced: Learning to Present News Stories
Choon-hui Teo
S.V.N Vishwanathan
Alex Smola
Proceedings of The 5th ACM International Conference on Web Search and Data Mining (WSDM) (2012)
Unified analysis of streaming news
Qirong Ho
Jacob Eisenstein
Eric P. Xing
Alexander J. Smola
Choon Hui Teo
WWW (2011), pp. 267-276
Tutorial: Graphical Models for the Internet
Online Inference for the Infinite Cluster-topic Model: Storylines from Streaming Text
Qirong Ho
Choon-hui Teo
Jacobe Eisenstein
Alex Smola
Eric Xing
Proceedings of the 14th Conference on Artificial Intelligence and Statistics (AISTATS) (2011)
Sparse Additive Generative Models of Text
Jacob Eisenstein
Eric Xing
Proceedings of the 28th International Conference on Machine Learning (ICML) (2011)
Online Inference for the Infinite Topic-Cluster Model: Storylines from Streaming Text
Qirong Ho
Choon Hui Teo
Jacob Eisenstein
Alexander J. Smola
Eric P. Xing
Journal of Machine Learning Research - Proceedings Track, vol. 15 (2011), pp. 101-109
Modeling Users and Content: Structured Probabilistic Representation, and Scalable Online Inference Algorithms
Ph.D. Thesis (2011)
Scalable Distributed Inference of Dynamic User Interests for Behavioral Targeting
Yucheng Low
Mohamed Aly
Vanja Josifovski
Alex Smola
Proceedings of The 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2011., ACM
Unified Analysis of Streaming News
Qirong Ho
Jacob Eisenstein
Eric P. Xing
Alex Smola
Choon-hui Teo
Proceedings of the 20th International World Wide Web conference (WWW) (2011)
WWW 2011 invited tutorial overview: latent variable models on the internet
Tutorial: Latent Variable Models for the Internet
Recovering Time-Varying Networks of Dependencies in Social and Biological Studies
Timeline: A Dynamic Hierarchical Dirichlet Process Model for Recovering Birth/Death and Evolution of Topics in Text Stream
Eric P. Xing
Proceedings of the 26th International Conference on Conference on Uncertainty in Artificial Intelligence (UAI) (2010)
Structured Literature Image Finder: Parsing Text and Figures in Biomedical Literature
Andrew Arnold
Luis Pedro Coelho
Joshua Kangas
Saboor Sheikh
Eric P. Xing
Robert Murphy
Journal of Web Semantics (2010)
Staying Informed: Supervised and Semi-Supervised Multi-view Topical Analysis of Ideological Perspective
Eric P. Xing
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) (2010)
Estimating Time-Varying Networks
Structured Literature Image Finder: Extracting Information from Text and Images in Biomedical Literature.
Luis Pedro Coelho
Andrew Arnold
Joshua Kangas
Saboor Sheikh
Eric P. Xing
Robert Murphey
Lecture Notes in Bioinformatics, Springer (2010)
MedLDA: Maximum Margin Supervised Topic Models for Regression and Classification
SLIF: Structure Literature Image Finer
Andrew Arnold
Luis Pedro Coelho
Joshua Kangas
Saboor Sheikh
Eric P Xing
Robert Murphy
The Annual Meeting of The ISMB BioLINK Special Interest Group : Text Mining, Image Analysis and the future of Scientific Publishing In Association with ISMB/ECCB (2009)
Collapsed Variational Inference for Time-Varying Dirichlet Process Mixture Models
Structured Correspondence Topic Models for Mining Captioned Figures in Biological Literature
Eric P. Xing
William W. Cohen
Robert Murphy
Proceedings of The Fifteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) (2009)
Dynamic Non-Parametric Mixture Models and The Recurrent Chinese Restaurant Process : with Applications to Evolutionary Clustering
Eric P Xing
Proceedings of The Eighth SIAM International Conference on Data Mining (SDM) (2008)
Time-Varying Networks: Recovering Temporally Rewiring Genetic Networks During the Life Cycle of Drosophila melanogaster
Joint Latent Topic Models for Text and Citations
Ramesh Nallapati
Eric P. Xing
William W. Cohen
Proceedings of The Fourteen ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) (2008)
Training Hierarchical Feed-forward Visual Recognition Models Using Transfer Learning from Pseudo-Tasks
Kai Yu
Wei Xu
Yihong Gong
Eric P. Xing
Proceeding of the 10th European Conference of Computer Vision (ECCV) (2008)
Sparse Word Graphs: A scalable algorithm for capturing word correlations in topic models
Ramesh Nallapati
William W Cohen
Eric P. Xing
In International Conference of Data Mining, Workshop on High performance data mining (ICDM WS) (2007)
On Tight Approximate Inference of the Logistic Normal Topic Admixture Model
Eric P Xing
The Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS) (2007)
Modeling Topic Evolution in Text Stream
Task Assignment for a Physical Agent Team via a Dynamic Forward/Reverse Auction Mechanism.
Abhilash Patel
Tom Brown
MyungJoo Ham
Myeong-Wuk Jang
Gul Agha
The International Conference of Integration of Knowledge Intensive Multi-Agent Systems KIMAS (2005)
Efficient Communication in Multi-Agent Systems
Myeong-Wuk Jang
Gul Agha
Software Engineering for Multi-Agent Systems III: Research Issues and Practical Applications, LNCS 3390, Springer-Verlag (2005)