# Brian Ziebart

Postdoctoral Fellow
Human-Computer Interaction Institute
Carnegie Mellon University

8208 Gates Hillman Center

Email: bzie...@cs.cmu.edu

I have moved to the Univ. of Illinois-Chicago. Please visit my new homepage.

## Overview

I am a postdoctoral fellow at Carnegie Mellon University. I was awarded a PhD from CMU's Machine Learning Department in December 2010.

I am interested in machine learning techniques for structured data and artificial intelligence applications. My recent research focuses on learning and forecasting decisions and strategies in sequential decision and game settings for assistive technology and robotics applications. My dissertation introduced the principle of maximum causal entropy, a general framework for process-conditioned probabilistic learning and prediction (with robust log-loss minimization guarantees).

I investigate learning and prediction of single-agent control with Anind Dey and prediction of multi-agent behavior with Geoff Gordon and Katia Sycara. My graduate studies were advised by Drew Bagnell and Anind Dey. My undergraduate research was supervised by Roy Campbell and Dan Roth at the University of Illinois at Urbana-Champaign.

## Research

### Theory:

• Maximum Causal Entropy
Conditional random fields can be viewed as maximizing the conditional entropy of predicted variables given provided variables, and they provide state-of-the-art predictive accuracy on a number of tasks. What distribution is appropriate when the provided variables are not available up front, but are instead revealed sequentially according to a known process? Our maximum causal entropy paper introduces an answer to this question. It was the best student paper runner-up at ICML 2010.
Perspectives: statistical estimation, decision theory
#### Statistical Estimation Perspective of Maximum Causal Entropy

Consider a joint distribution, $P(Y,X)$. It factors sequentially over time as a product of conditional probabilities:

$$P(Y,X) = \prod_t P(Y_t, X_t \mid Y_{1:t-1}, X_{1:t-1}) = \prod_t P(Y_t \mid Y_{1:t-1}, X_{1:t}) \prod_t P(X_t \mid Y_{1:t-1}, X_{1:t-1}).$$

These two products are each causally conditioned probabilities, denoted:

$$P(Y^T \| X^T) = \prod_t P(Y_t \mid Y_{1:t-1}, X_{1:t})$$
$$P(X^T \| Y^{T-1}) = \prod_t P(X_t \mid Y_{1:t-1}, X_{1:t-1}).$$

We are interested in estimating the first process, $\hat P(Y^T \| X^T)$, given the second process, $P(X^T \| Y^{T-1})$, and samples from the joint distribution, $\tilde P(Y,X)$. The log-loss, $-\sum_{y,x} P(y,x) \log \hat P(y \| x)$, is a natural measure to minimize. Unfortunately, $P(Y,X)$ is unknown, and substituting $\tilde P(Y,X)$ for it will often overfit to the available data. However, statistics of the form $E_{P(Y,X)}[F_k(y,x)]$ characterizing $P(Y,X)$ may exist that are easier to estimate from $\tilde P(Y,X)$. We therefore assume that the (unknown) $P(Y,X)$ is adversarially chosen, but must: (1) factor according to the known $P(X^T \| Y^{T-1})$; and (2) match a set of statistics of the empirical distribution $\tilde P(Y,X)$. The objective in this adversarial log-loss minimization setting (subject to the constraints $\forall k,\ E_{P(Y,X)}[F_k(y,x)] = E_{\tilde P(Y,X)}[F_k(y,x)]$) is:

$$\min_{\hat P(Y^T \| X^T)} \max_{P(Y^T \| X^T)} -\sum_{y,x} P(y,x) \log \hat P(y \| x)$$
$$= \max_{P(Y^T \| X^T)} \min_{\hat P(Y^T \| X^T)} -\sum_{y,x} P(y,x) \log \hat P(y \| x)$$
$$= \max_{P(Y^T \| X^T)} -\sum_{y,x} P(y,x) \log P(y \| x).$$

This last measure is the causal entropy, $H(Y \| X) = -\sum_{y,x} P(y,x) \log P(y \| x)$. Thus, robustly minimizing the log-loss of a predictor is equivalent to obtaining a probability distribution that maximizes the causal entropy.
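The causal factorization underlying this derivation, $P(Y,X) = P(Y^T \| X^T)\, P(X^T \| Y^{T-1})$, is easy to verify numerically. A minimal sketch on a two-step binary process (all probability tables below are made-up illustration values, not from the paper) that checks the factorization and computes the causal entropy:

```python
import itertools
import math

# Made-up two-step binary process: Y = (Y1, Y2), X = (X1, X2).
Px1 = [0.6, 0.4]                    # P(X1)
Py1 = [[0.7, 0.3], [0.2, 0.8]]      # Py1[x1][y1] = P(Y1 | X1)

def Px2(x2, y1, x1):
    # P(X2 | Y1, X1): the known side-information process.
    p1 = 0.2 + 0.3 * y1 + 0.4 * x1
    return p1 if x2 == 1 else 1.0 - p1

def Py2(y2, y1, x1, x2):
    # P(Y2 | Y1, X1, X2): the process we would like to estimate.
    p1 = 0.1 + 0.2 * y1 + 0.3 * x1 + 0.3 * x2
    return p1 if y2 == 1 else 1.0 - p1

def causal_y(y1, y2, x1, x2):       # P(y || x) = P(y1|x1) P(y2|y1,x1,x2)
    return Py1[x1][y1] * Py2(y2, y1, x1, x2)

def causal_x(y1, x1, x2):           # P(x || y) = P(x1) P(x2|y1,x1)
    return Px1[x1] * Px2(x2, y1, x1)

def joint(y1, y2, x1, x2):          # P(Y,X) = P(y||x) * P(x||y)
    return causal_y(y1, y2, x1, x2) * causal_x(y1, x1, x2)

total = 0.0
causal_entropy = 0.0                # H(Y||X) = -sum P(y,x) log P(y||x)
for y1, y2, x1, x2 in itertools.product([0, 1], repeat=4):
    p = joint(y1, y2, x1, x2)
    total += p
    causal_entropy -= p * math.log(causal_y(y1, y2, x1, x2))
```

Because each conditional table is normalized, the causally conditioned factors multiply back to a proper joint distribution, and the resulting causal entropy is bounded by the entropy of two binary variables.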
#### Decision Theoretic Perspective of Maximum Causal Entropy

In a sequential decision making task, the optimal decision problem is to find the best action, $a$, to employ in each state, $s$. The optimal expected future values of states, $V(s)$, and actions, $Q(s,a)$, are specified by the Bellman equations:

$$V(s) = \max_a Q(s,a)$$
$$Q(s,a) = \text{Reward}(s,a) + \sum_{s'} P(s'|s,a)\, V(s'),$$

with optimal policy $\pi(s) = \operatorname{argmax}_a Q(s,a)$.

When maximizing the causal entropy, $H(A \| S)$, subject to constraints of the form $E_{P(A,S)}[\sum_t f(a_t, s_t)] = E_{\tilde P(A,S)}[\sum_t f(a_t, s_t)]$ (matching the empirical distribution $\tilde P$), the resulting distribution is very similar to the Bellman-optimal solution. It has the following form:

$$V(s) = \operatorname{softmax}_a Q(s,a)$$
$$Q(s,a) = \text{Reward}(s,a) + \sum_{s'} P(s'|s,a)\, V(s'),$$

with reward function $\text{Reward}(s,a) = \theta^\top f_{s,a}$. The softmax here is a smooth interpolation of the max function, $\operatorname{softmax}_x f(x) = \log \sum_x e^{f(x)}$. The corresponding policy is stochastic and distributed as $\pi(a|s) = e^{Q(s,a) - V(s)}$.
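This soft-Bellman recursion is straightforward to implement by backward recursion over a horizon. A minimal sketch on a made-up 3-state, 2-action MDP (rewards and transition dynamics are illustrative, not from any paper):

```python
import math

# Soft(max) value iteration on a tiny made-up MDP.
S, A, HORIZON = 3, 2, 50
reward = [[0.0, 1.0], [0.5, 0.0], [1.0, -1.0]]          # reward[s][a]
trans = [[[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],            # trans[s][a][s2]
         [[0.0, 0.5, 0.5], [0.3, 0.3, 0.4]],
         [[0.2, 0.2, 0.6], [0.5, 0.5, 0.0]]]

def softmax(xs):
    # Smoothed maximum: log-sum-exp, shifted by the max for stability.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

V = [0.0] * S
for _ in range(HORIZON):                                # backward recursion
    Q = [[reward[s][a] + sum(trans[s][a][s2] * V[s2] for s2 in range(S))
          for a in range(A)] for s in range(S)]
    V = [softmax(Q[s]) for s in range(S)]

# Stochastic maximum causal entropy policy: pi(a|s) = exp(Q(s,a) - V(s)).
policy = [[math.exp(Q[s][a] - V[s]) for a in range(A)] for s in range(S)]
```

By construction each row of the policy is a proper distribution over actions, and the soft value upper-bounds the hard Bellman backup since softmax dominates max.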

• Behavior Imitation Learning
Behaving like an observed actor given only a small amount of observed behavior is an important artificial intelligence task. My approach is to learn good predictive distributions over actor behavior. Maximum causal entropy provides the right log-loss guarantees when conditioning on a Markov decision process: it learns a feature-based reward function so that a probabilistic relaxation of optimal control matches statistics of observed behavior. When the state transition dynamics are deterministic, this reduces to the distribution we employed in our AAAI 2008 paper, Maximum Entropy Inverse Reinforcement Learning.
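In the deterministic-dynamics case, the distribution over paths takes the form $P(\text{path}) \propto e^{\theta^\top f(\text{path})}$, and gradient ascent on the data likelihood drives the model's expected feature counts toward the empirical ones. A toy sketch (the paths, features, and demonstration counts below are all made up for illustration):

```python
import math

# Three enumerable paths with two features each, plus demonstration counts.
path_features = {"a": [1.0, 0.0], "b": [0.5, 0.5], "c": [0.0, 1.0]}
demo_counts = {"a": 6, "b": 3, "c": 1}   # how often each path was observed

n_demos = sum(demo_counts.values())
f_emp = [sum(demo_counts[p] * path_features[p][k] for p in path_features)
         / n_demos for k in range(2)]    # empirical feature expectations

theta = [0.0, 0.0]
for _ in range(2000):                    # plain gradient ascent on likelihood
    w = {p: math.exp(sum(t * f for t, f in zip(theta, path_features[p])))
         for p in path_features}
    Z = sum(w.values())                  # partition function
    f_model = [sum(w[p] * path_features[p][k] for p in path_features) / Z
               for k in range(2)]        # model feature expectations
    # Gradient of the log-likelihood: empirical minus model expectations.
    theta = [t + 0.5 * (e - m) for t, e, m in zip(theta, f_emp, f_model)]
```

At convergence the model matches the empirical feature expectations, which is exactly the constraint the maximum entropy formulation imposes.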

• Strategy Learning
In multi-agent games, rationality is defined in terms of regret rather than maximal utility. How can predictive imitation learning techniques be extended to this setting? Our computational rationalization paper at ICML 2011, which won the best paper award, introduces such an extension that learns feature-based payoffs that rationalize observed strategies. Our maximum causal entropy correlated equilibria paper at AAMAS 2011 applies the principle of maximum causal entropy to Markov games with known payoffs to obtain robust predictions of rational behavior.

• Bayesian Network Structure Learning
In structure learning, the edges of a Bayesian network relating variables are learned from data. We present polynomial-time inference algorithms (MAP structure estimation and Bayesian model averaging) for selectively conditioned forests, a combination of tree structures and ordered variable sets, enabling generalized naïve Bayes classifier learning in our UAI 2007 paper on selectively conditioned forests.
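The tree-structure half of this can be illustrated with a standard Chow-Liu-style construction (only a sketch of the classic algorithm, not the paper's full selectively-conditioned-forest method): a maximum-weight spanning tree over empirical pairwise mutual information gives a MAP tree structure. The binary samples below are made up:

```python
import itertools
import math
from collections import Counter

# Made-up samples over three binary variables.
samples = [(0, 0, 0), (1, 1, 0), (1, 1, 1), (0, 0, 1),
           (1, 1, 1), (0, 1, 0), (0, 0, 0), (1, 1, 1)]
n_vars, N = 3, len(samples)

def mutual_info(i, j):
    # Empirical mutual information between variables i and j.
    pij = Counter((s[i], s[j]) for s in samples)
    pi = Counter(s[i] for s in samples)
    pj = Counter(s[j] for s in samples)
    return sum((c / N) * math.log(c * N / (pi[a] * pj[b]))
               for (a, b), c in pij.items())

# Kruskal's algorithm on edges sorted by decreasing mutual information.
edges = sorted(((mutual_info(i, j), i, j)
                for i, j in itertools.combinations(range(n_vars), 2)),
               reverse=True)
parent = list(range(n_vars))
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x
tree = []
for w, i, j in edges:
    ri, rj = find(i), find(j)
    if ri != rj:                         # keep edge if it joins two components
        parent[ri] = rj
        tree.append((i, j))
```

The resulting tree has `n_vars - 1` edges and greedily keeps the most strongly dependent variable pairs.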

### Applications:

• Pedestrian Trajectory Forecasting
We improve robots' movements around people by endowing them with capabilities like those you and I have: mental models of where people intend to go and how they intend to get there.
Using these purposeful predictive models, robots can better balance efficiently accomplishing their tasks against hindering people moving in the same environment, as shown in our paper, Planning-based Pedestrian Prediction, at IROS 2009.

Existing personal navigation devices (PNDs) are largely oblivious - they "ride along" with drivers, but ignore everything about how the drivers actually drive: where they go and how they prefer to get there.
We develop navigation technology that reasons about observed driving to learn a driver's preferences and predict their future paths (to warn about unanticipated hazards). We applied these learning techniques to GPS data gathered from Pittsburgh taxi drivers in our Ubicomp 2008 paper, Navigate Like a Cabbie, and conducted additional experiments on this dataset in our papers at AAAI 2008 and AISTATS 2009. We also investigated techniques for reusing plans from planning tasks with different cost function weights to speed up planning on new tasks with previously unseen cost function weights in our paper, Fast Planning for Dynamic Preferences, at ICAPS 2008.

• Pervasive Computing Environments
As an undergraduate, I developed distributed applications for a middleware-based meta-operating system. Our papers discuss composing applications, rapid development, and evaluation metrics.
I became interested in how machine learning could be used to automatically configure these systems. My paper, Learning Automation Policies for Pervasive Computing Environments, from ICAC 2005 discusses the latter topic.

## Publications

## Miscellany
I helped organize the New Developments in Imitation Learning Workshop at ICML 2011.

I am a co-founder of NavPrescience, a CMU spin-off company developing personalization and prediction technologies for next generation GPS devices.

Gates-Hillman Predictive Market: I finished in 1st place (of 210) with 15.09% of the tickets by maximizing the (relative) entropy (i.e., buying low and selling high).

Bid-Tac-Toe: Exponentiated stochastic gradient ascent + logistic regression over pools of (approximate) equilibria strategies = \$2500. However, winning a hedge fund's recruitment programming contest + market meltdown ≠ lucrative quant career (yet..?).