Yet, two key shortcomings limit the scalability of these BDI-based theories and implementations. First, there are no techniques for quantitatively evaluating the degree of optimality of their coordination behavior. While optimal teamwork may be impractical in real-world domains, such analysis would aid us in comparing different theories and models and in identifying feasible improvements. One key reason why quantitative evaluation of most existing teamwork theories is difficult is that they ignore the various uncertainties and costs of real-world environments. For instance, joint intentions theory [Cohen & Levesque, 1991b] prescribes that team members attain mutual beliefs in key circumstances, but it ignores the cost of attaining mutual belief (e.g., via communication). Implementations that blindly follow such prescriptions could engage in highly suboptimal coordination. On the other hand, practical systems have addressed the costs and uncertainties of real-world environments. For instance, STEAM [Tambe, 1997, Tambe & Zhang, 1998] extends joint intentions with decision-theoretic communication selectivity. Unfortunately, the very pragmatism of such approaches often leads to a lack of theoretical rigor, so it remains unanswered whether STEAM's selectivity is the best an agent can do, or whether it is even necessary at all. The second key shortcoming of existing teamwork research is the lack of a characterization of the computational complexity of various aspects of teamwork decisions. Understanding the computational advantages of a practical coordination prescription could justify its use as an approximation to optimality in particular domains.
To address these shortcomings, we propose a new complementary framework, the COMmunicative Multiagent Team Decision Problem (COM-MTDP), inspired by work in economic team theory [Marschak & Radner, 1971, Yoshikawa, 1978, Ho, 1980]. While our COM-MTDP model borrows from a theory developed in another field, we make several contributions in applying and extending the original theory, most notably adding explicit models of communication and system dynamics. With these extensions, the COM-MTDP generalizes other recently developed multiagent decision frameworks, such as decentralized POMDPs [Bernstein, Zilberstein, & Immerman, 2000].
Our definition of a team (like that in economic team theory) assumes only that team members have a common goal and that they work selflessly towards that goal (i.e., they have no other private goals of their own). In terms of our decision-theoretic framework, we assume that all of the team members share the same joint utility function; that is, each team member's individual preferences are exactly the preferences of the other members and, thus, of the team as a whole. Our definition may appear to be a ``bare-bones'' definition of a team, since it does not include common concepts and assumptions from the literature on what constitutes a team (e.g., the teammates form a joint commitment [Cohen & Levesque, 1991b], attain mutual belief upon termination of a joint goal, intend that teammates succeed in their tasks [Grosz & Kraus, 1996], etc.). From our COM-MTDP perspective, we view these as intermediate concepts, as the means by which agents improve their team's overall performance, rather than ends in themselves. Our hypothesis in this investigation is that our COM-MTDP-based analysis can provide concrete justifications for these concepts. For example, while mutual belief has no inherent value, our COM-MTDP model can quantify the improved performance that we would expect from a team that attains mutual belief about important aspects of its execution.
More generally, this paper demonstrates three new types of teamwork analyses made possible by the COM-MTDP model. First, we analyze the computational complexity of teamwork within subclasses of problem domains. For instance, some researchers have advocated teamwork without communication [Goldberg & Mataric, 1997]. We use the COM-MTDP model to show that, in general, the problem of constructing optimal teams without communication is NEXP-complete, but allowing free communication reduces the problem to be PSPACE-complete. This paper presents a breakdown of the complexity of optimal teamwork over problem domains classified along the dimensions of observability and communication cost.
Second, the COM-MTDP model provides a powerful tool for comparing the optimality of different coordination prescriptions across classes of domains. Indeed, we illustrate that we can encode existing team coordination strategies within a COM-MTDP for evaluation. For our analysis, we selected two joint intentions-based approaches from the literature: one using the approach realized within GRATE* and the joint responsibility model [Jennings, 1995], and another based on STEAM [Tambe, 1997]. Through this encoding, we derive the conditions under which these team coordination strategies generate optimal team behavior, and the complexity of the decision problems addressed by them. Furthermore, we also derive a novel team coordination algorithm that outperforms these existing strategies in optimality, though not in efficiency. The end result is a well-grounded characterization of the complexity-optimality trade-off among various means of team coordination.
Third, we can use the COM-MTDP model to empirically analyze a specific domain of interest. We have implemented reusable, domain-independent algorithms that allow one to evaluate the optimality of the behavior generated by different prescriptive policies within a problem domain represented as a COM-MTDP. We apply these algorithms in an example domain to empirically evaluate the aforementioned team coordination strategies, characterizing the optimality of each strategy as a function of the properties of the underlying domain. For instance, Jennings reports experimental results [Jennings, 1995] indicating that his joint responsibility teamwork model leads to lower waste of community effort than competing methods of accomplishing teamwork. With our COM-MTDP model, we were able to demonstrate the benefits of Jennings' approach under many configurations of our example domain. However, in precisely characterizing the types of domains that showed such benefits, we also identified domains where these competing methods may actually perform better. In addition, we can use our COM-MTDP model to recreate and explain previous work that noted an instance of suboptimality in a STEAM-based, real-world implementation [Tambe, 1997]. While this previous work treated that suboptimality as anomalous, our COM-MTDP reevaluation of the domain demonstrated that the observed suboptimality was a symptom of STEAM's general propensity towards extraneous communication in a significant range of domain types. Both the algorithms and the example domain model are available for public use in an Online Appendix 1.
Section 2 presents the COM-MTDP model's representation and places it in the context of related multiagent models from the literature. Section 3 uses the COM-MTDP model to define and characterize the complexity of designing optimal agent teams. Section 4 analyzes the optimality of existing team coordination algorithms and derives a novel coordination algorithm. Section 5 presents empirical results from applying our COM-MTDP algorithms to an example domain. Section 6 summarizes our results, and Section 7 identifies some promising future directions.
* Extension to Dynamic Problem: P. The original team decision problem focused on a one-shot, static problem. We extend the original concept so that each component is a time series of random variables. The effects of domain-level actions (e.g., a flying action changes a helicopter's position) obey a probabilistic distribution, given by a function P: S × A_a × S → [0,1]. In other words, for each initial state s at time t, combined action a taken at time t, and final state s′ at time t+1, Pr(S^{t+1} = s′ | S^t = s, A_a^t = a) = P(s, a, s′). This definition of P assumes that the world dynamics obey the Markov assumption.
* Extension of Allowable Information Structures: O_a. We extend the information structure representation to allow for uncertain observations. We use a general stochastic model, borrowed from the partially observable Markov decision process model [Smallwood & Sondik, 1973], with a joint observation function: O_a(s, a, ω) = Pr(Ω_a^t = ω | S^t = s, A_a^{t−1} = a). This function models the sensors, representing any errors, noise, etc. In some cases, we can separate this joint distribution into individual observation functions: O_a ≡ ∏_{i∈a} O_i, where O_i(s, a, ω) = Pr(Ω_i^t = ω | S^t = s, A_a^{t−1} = a). Thus, the probability distribution specified by O_a forms the richer information structure used in our model. We can make useful distinctions between different classes of information structures:
* Extension to Richer Belief State Space: B_a. We generalize the set of possible strategies to capture the more complex mental states of the agents. Each agent, i ∈ a, forms a belief state, b_i^t ∈ B_i, based on its observations seen through time t, where B_i circumscribes the set of possible belief states for the agent. Thus, we define the set of possible domain-level policies as mappings from belief states to actions, π_{iA}: B_i → A_i. We define the set of possible combined belief states over all agents to be B_a ≡ ∏_{i∈a} B_i. The corresponding random variable, b_a^t, represents the agents' combined belief state at time t. We elaborate on different types of belief states and the mapping of observations to belief states (i.e., the state estimator function) in Section 2.2.1.
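To make these extensions concrete, the following sketch represents a transition function P and an individual observation function O_i as probability tables, and samples one Markovian step. The states, actions, observations, and probabilities are toy values of our own invention (loosely echoing the helicopter example), not part of the model definition; the observation table is keyed here only by state for brevity, whereas the full model also conditions on the previous combined action.

```python
import random

# Toy COM-MTDP components (all values below are illustrative assumptions).

# Transition function P(s, a, s') = Pr(S^{t+1}=s' | S^t=s, A^t=a).
P = {
    ("at_base", "fly"):  {"at_base": 0.1, "en_route": 0.9},
    ("en_route", "fly"): {"en_route": 0.3, "at_goal": 0.7},
    ("at_goal", "wait"): {"at_goal": 1.0},
}

# Individual observation function O_i, modeling a noisy position sensor
# (keyed only by state here; the paper's O_i also conditions on A^{t-1}).
O_i = {
    "at_base":  {"see_base": 0.9, "see_nothing": 0.1},
    "en_route": {"see_route": 0.8, "see_nothing": 0.2},
    "at_goal":  {"see_goal": 0.95, "see_nothing": 0.05},
}

def sample(dist, rng):
    """Draw one outcome from a {value: probability} distribution."""
    r, cum = rng.random(), 0.0
    for value, p in dist.items():
        cum += p
        if r < cum:
            return value
    return value  # guard against floating-point round-off

def step(s, a, rng):
    """One Markovian step: sample S^{t+1} from P(s, a, .), then sample
    agent i's observation of the new state from O_i."""
    s_next = sample(P[(s, a)], rng)
    w = sample(O_i[s_next], rng)
    return s_next, w
```

Because of the Markov assumption, `step` needs only the current state and action; a belief state b_i^t would then be updated from each sampled observation (Section 2.2.1).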
With communication, we divide each decision epoch into two phases: the pre-communication and post-communication phases, denoted by the subscripts ·Σ and Σ·, respectively. In particular, the agents update their belief states at two distinct points within each decision epoch: once upon receiving observation Ω_i^t (producing the pre-communication belief state b_{i·Σ}^t), and again upon receiving the other agents' messages (producing the post-communication belief state b_{iΣ·}^t). The distinction allows us to differentiate between the belief state used by the agents in selecting their communication actions and the more ``up-to-date'' belief state used in selecting their domain-level actions. We also distinguish between the separate state-estimator functions used in each update phase:

In this paper, as a first step, we assume that the agents have perfect recall. In other words, the agents recall all of their observations, as well as all communication of the other agents. Thus, their belief states can represent their entire histories as sequences of observations and received messages: B_i = Ω_i^* × Σ_a^*, where X^* denotes the set of all possible sequences (of any length) of elements of X. The agents realize perfect recall through the following state estimator functions:

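Under perfect recall, each state estimator simply appends the newest observation or joint message to the agent's history. A minimal sketch (function and variable names are our own, not the paper's):

```python
# Perfect-recall state estimators (sketch). A belief state is the full
# history: a tuple of observations and received joint messages.

def se_pre(belief_post_prev, observation):
    """Pre-communication update: fold the new observation into the
    history carried over from the previous epoch."""
    return belief_post_prev + (("obs", observation),)

def se_post(belief_pre, joint_message):
    """Post-communication update: fold in the combined messages received
    from all teammates this epoch."""
    return belief_pre + (("msgs", joint_message),)

# One epoch for agent i, starting from an empty history:
b0_pre = se_pre((), "see_base")
b0_post = se_post(b0_pre, {"agent_j": "sigma_G"})
```

Representing beliefs as immutable tuples keeps every intermediate belief state (pre- and post-communication) distinct, which matters later when the two phases select different actions.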
We extend our definition of a policy of behavior to include a communication policy, π_{iΣ}: B_i → Σ_i, analogous to Section 2.1.4's domain-level policy. We define the joint policies, π_{aΣ} and π_{aA}, as the combined policies across all agents in a.
As with the observability function, we parameterize the communication costs associated with message transmissions:
The free-communication case appears in the literature when researchers wish to focus on issues other than communication cost. Although real-world domains rarely exhibit such ideal conditions, we may be able to model some domains as having approximately free communication to a sufficient degree. In addition, analyzing this extreme case gives us some understanding of the benefit of communication, even if the results do not apply across all domains. We also identify the no-communication case because such decision problems have been of interest to researchers as well [Goldberg & Mataric, 1997]. Of course, even if Σ_a = ∅, it is possible that there are domain-level actions in A_a that have implicit communicative value by acting as signals that convey information to the other agents. However, we still label such agent teams as having no communication for the purposes of the work here, since many of our results exploit an explicit separation between domain- and communication-level actions.
Next, each agent i ∈ a selects a message according to its communication policy, Σ_i^0 = π_{iΣ}(b_{i·Σ}^0), defining a combined communication, Σ_a^0. Each agent interprets the communications of all of the others by updating its belief state, b_{iΣ·}^0 = SE_{iΣ·}(b_{i·Σ}^0, Σ_a^0). Each then selects an action according to its domain-level policy, A_i^0 = π_{iA}(b_{iΣ·}^0), defining a combined action A_a^0. By our central assumption of teamwork, each agent receives the same joint reward, R^0 = R(S^0, A_a^0, Σ_a^0). The world then moves into a new state, S^1, according to the distribution, P(S^0, A_a^0). Again, each agent i receives an observation Ω_i^1 drawn from Ω_i according to the distribution O_a(S^1, A_a^0, Ω_a^1), and it updates its belief state, b_{i·Σ}^1 = SE_{i·Σ}(b_{iΣ·}^0, Ω_i^1).
The process continues, with agents choosing communication and domain-level actions, observing the effects, and updating their beliefs. Thus, in addition to the time series of world states, S^0, S^1, …, S^t, the agents themselves determine a time series of communication-level and domain-level actions, Σ_a^0, Σ_a^1, …, Σ_a^t and A_a^0, A_a^1, …, A_a^t, respectively. We also have a time series of observations for each agent i, Ω_i^0, Ω_i^1, …, Ω_i^t. Likewise, we can treat the combined observations, Ω_a^0, Ω_a^1, …, Ω_a^t, as a similar time series of random variables.
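The decision epoch just described can be rendered as a simulation loop. The sketch below (our own naming; the model functions and policies are assumed to be supplied by the caller) follows the order of events in the text: observe, pre-communication update, communicate, post-communication update, act, receive the joint reward, transition.

```python
def run_epochs(T, s0, agents, P, O, R, sample, rng):
    """Execute T decision epochs of a COM-MTDP (sketch).

    `agents` maps agent id -> dict with keys:
      'se_pre', 'se_post' : state estimators,
      'pi_comm', 'pi_act' : communication and domain-level policies,
      'belief'            : current belief state.
    `P(s, acts)` and `O(s, prev_acts, i)` return {value: prob} dicts;
    `R(s, acts, msgs)` is the joint reward; `sample(dist, rng)` draws
    from a distribution.
    """
    s, total_reward, prev_acts = s0, 0.0, None
    for t in range(T):
        # Each agent observes and does its pre-communication update.
        for i, ag in agents.items():
            obs = sample(O(s, prev_acts, i), rng)
            ag["belief"] = ag["se_pre"](ag["belief"], obs)
        # Communication phase: everyone speaks, then everyone hears.
        msgs = {i: ag["pi_comm"](ag["belief"]) for i, ag in agents.items()}
        for i, ag in agents.items():
            ag["belief"] = ag["se_post"](ag["belief"], msgs)
        # Domain-level actions and the shared joint reward.
        acts = {i: ag["pi_act"](ag["belief"]) for i, ag in agents.items()}
        total_reward += R(s, acts, msgs)
        # World transition.
        s = sample(P(s, acts), rng)
        prev_acts = acts
    return total_reward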
Finally, the agents receive a series of rewards, R^0, R^1, …, R^t. We can define the value, V, of the policies, π_{aA} and π_{aΣ}, as the expected reward received when executing those policies. Over a finite horizon, T, this value is equivalent to the following:
V^T(π_{aA}, π_{aΣ}) = E[∑_{t=0}^{T} R(S^t, A_a^t, Σ_a^t)]    (7)
Model         Σ_a                    O_a
DEC-POMDP     no communication       collective partial observability
POIPSG        no communication       collective partial observability
MMDP          no communication       individual observability
Xuan-Lesser   general communication  collective observability
Theorem 1 The decision problem of whether there exist policies, π_{aΣ} and π_{aA}, for a given COM-MTDP, under general communication and collective partial observability, that yield a total reward at least K over some finite horizon T is NEXP-complete if |a| ≥ 2 (i.e., more than one agent).
Proof: To prove that the COM-MTDP decision problem is NEXP-hard, we reduce a DEC-POMDP [Bernstein, Zilberstein, & Immerman, 2000] to a COM-MTDP with no communication by copying all of the other model features from the given DEC-POMDP. In other words, if we are given a DEC-POMDP, ⟨S, {A^i}_{i=1}^m, P, {Ω^i}_{i=1}^m, O, R⟩, we can construct a COM-MTDP, ⟨S′, {A_i′}_{i=1}^m, Σ_a′, P′, {Ω_i′}_{i=1}^m, O_a′, B_a′, R′⟩, as follows:
S′ = S,  A_i′ = A^i,  Σ_a′ = ∅,  P′ = P,  Ω_i′ = Ω^i,  O_a′ = O,  B_i′ = (Ω_i′)^*,  R′(s, a, σ) = R(s, a)    (8)
To show that the COM-MTDP problem is in NEXP, our proof proceeds similarly to that of the DEC-POMDP. In other words, we guess the joint policy, π_a, and write it down in exponential time (we assume that T ≤ |S|). We can take the COM-MTDP plus the policy and generate (in exponential time) a corresponding MDP where the state space is the space of all possible combined belief states of the agents. We can then use dynamic programming to determine (in exponential time) whether π_a generates an expected reward of at least K. □
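The membership argument's final step — evaluating a fixed joint policy by dynamic programming over the induced belief-state MDP — can be sketched as follows. This is our own minimal formulation: the guessed policy is assumed to be already folded into the transition function `trans`, so each (world state, combined belief state) pair has a fixed successor distribution and reward.

```python
from functools import lru_cache

def policy_value(T, states, beliefs, trans, reward):
    """Evaluate a fixed joint policy by dynamic programming.

    `trans(s, b)` returns a list of ((s', b'), probability) successors
    under the policy; `reward(s, b)` is the joint reward earned in
    (s, b). Returns the expected total reward over horizon T from each
    (state, belief) pair.
    """
    @lru_cache(maxsize=None)
    def V(t, s, b):
        if t == T:
            return 0.0
        future = sum(p * V(t + 1, s2, b2) for (s2, b2), p in trans(s, b))
        return reward(s, b) + future
    return {(s, b): V(0, s, b) for s in states for b in beliefs}
```

Comparing `policy_value(...)` at the initial state against the threshold K completes the check; the exponential cost comes from the size of the combined belief space, not from this recurrence itself.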
In the remainder of this section, we examine the effect of communication on the complexity of constructing team policies that generate optimal behavior. We start by examining the case under the condition of free communication, where we would expect the benefit of communication to be the greatest. To begin with, suppose that each agent is capable of communicating its entire observation (i.e., Σ_i ⊇ Ω_i). Before we analyze the complexity of the team decision problem, we first prove that the agents should exploit this capability and communicate their true observations, as long as they incur no cost in doing so:
Theorem 2 Under free communication, consider a team of agents using a communication policy: π_{iΣ}(b_{i·Σ}^t) ≡ Ω_i^t. If the domain-level policy π_{aA} maximizes V^T(π_{aA}, π_{aΣ}), then this combined policy is dominant over any other policies. In other words, for all policies, π_{aA}′ and π_{aΣ}′, V^T(π_{aA}, π_{aΣ}) ≥ V^T(π_{aA}′, π_{aΣ}′).
Proof: Suppose we have some other communication policy, π_{aΣ}′, that specifies something other than complete communication (e.g., keeping quiet, lying). Suppose that there is some domain-level policy, π_{aA}′, that allows the team to attain some expected reward, K, when used in combination with π_{aΣ}′. Then, we can construct a domain-level policy, π_{aA}, such that the team attains the same expected reward, K, when used in conjunction with the complete-communication policy, π_{aΣ}, as defined in the statement of Theorem 2.
The communication policy, π_{aΣ}′, produces a different set of belief states (denoted b′_{i·Σ}^t and b′_{iΣ·}^t) than those for π_{aΣ} (denoted b_{i·Σ}^t and b_{iΣ·}^t). In particular, we use state estimator functions, SE_{i·Σ}′ and SE_{iΣ·}′, as defined in Equations 5 and 6, to generate b′_{i·Σ}^t and b′_{iΣ·}^t. Each belief state is a complete history of observation and communication pairs for each agent. On the other hand, under the complete communication of π_{aΣ}, the state estimator functions of Equations 5 and 6 reduce to:


Given this dominance of the complete-communication policy, we can prove that the problem of constructing teams that coordinate optimally is simpler when communication is free.
Theorem 3 The decision problem of determining whether there exist policies, π_{aΣ} and π_{aA}, for a given COM-MTDP with free communication under collective partial observability, that yield a total reward at least K over some finite horizon T is PSPACE-complete.
Proof: To prove that the problem is PSPACE-hard, we reduce the single-agent POMDP to a COM-MTDP. In particular, if we are given a POMDP, ⟨S, A, P, Ω, O, R⟩, we can construct a COM-MTDP, ⟨S′, A_1′, Σ_1′, P′, Ω_1′, O_1′, B_1′, R′⟩, for a single-agent team (i.e., a = {1}):
To show that the problem is in PSPACE, we take a COM-MTDP under free communication and reduce it to a single-agent POMDP. In particular, if we are given a COM-MTDP, ⟨S, A_a, Σ_a, P, Ω_a, O_a, B_a, R⟩, we can construct a single-agent POMDP, ⟨S′, A′, P′, Ω′, O′, R′⟩, as follows:
S′ = S,  A′ = A_a,  P′ = P,  Ω′ = Ω_a,  O′ = O_a,  R′(s, a) = R(s, a, σ) for any σ (communication is free)    (12)
Theorem 4 The decision problem of determining whether there exist policies, π_{aΣ} and π_{aA}, for a given COM-MTDP with free communication and collective observability, that yield a total reward at least K over some finite horizon T is P-complete.
Proof: The proof follows that of Theorem 3, but with a reduction to and from the MDP decision problem, rather than the POMDP. The MDP decision problem is P-complete [Papadimitriou & Tsitsiklis, 1987]. □
Theorem 5 The decision problem of determining whether there exist policies, π_{aΣ} and π_{aA}, for a given COM-MTDP with individual observability, that yield a total reward at least K over some finite horizon T (given integers K and T) is P-complete.
Proof: The proof follows that of Theorem 4, except that we can reduce the problem to and from an MDP regardless of what communication policy the team uses. □
Theorem 6 The decision problem of determining whether there exist policies, π_{aΣ} and π_{aA}, for a given COM-MTDP with non-observability, that yield a total reward at least K over some finite horizon T (given integers K and T) is NP-complete.
Proof: The proof follows that of Theorem 4, except that we can reduce the problem to and from a single-agent nonobservable MDP (NOMDP) regardless of what communication policy the team uses. In particular, because the agents are all equally ignorant of the state, communication has no effect. The NOMDP decision problem is NP-complete [Papadimitriou & Tsitsiklis, 1987]. □
Thus, we have used the COM-MTDP framework to characterize the difficulty of problem domains in agent teamwork along the dimensions of communication cost and observability. Table 2 summarizes our results, which we can use in deciding where to concentrate our energies in attacking teamwork problems. We can use these results to draw some conclusions about the challenges to designers of multiagent teams:
                Individually   Collectively    Collectively           Non-
                Observable     Observable      Partially Observable   Observable
No Comm.        P-complete     NEXP-complete   NEXP-complete          NP-complete
General Comm.   P-complete     NEXP-complete   NEXP-complete          NP-complete
Free Comm.      P-complete     P-complete      PSPACE-complete        NP-complete
We demonstrate this methodology by using our COM-MTDP framework to analyze joint intentions theory [Cohen & Levesque, 1991b, Cohen & Levesque, 1991a, Levesque, Cohen, & Nunes, 1990], which provides a common basis for many existing approaches to team coordination. Section 4.1 models two key instantiations of joint intentions taken from the literature [Jennings, 1995, Tambe, 1997] as COM-MTDP communication policies. Section 4.2 analyzes the conditions under which these policies generate optimal behavior and provides a third candidate policy that makes communication decisions that are locally optimal within the context of joint intentions. In addition to providing the results for the particular team coordination strategies investigated, this section also illustrates a general methodology by which one can use our COM-MTDP framework to encode and evaluate coordination strategies proposed by existing multiagent research.
Joint intentions theory requires that team members jointly commit to a joint persistent goal, G. It also requires that when any team member privately believes that G is achieved (or unachievable or irrelevant), it must then attain mutual belief throughout the team about this achievement (or unachievability or irrelevance). To encode this prescription of joint intentions theory within our COM-MTDP model, we first specify the joint goal, G, as a subset of states, G ⊆ S, where the desired goal is achieved (or unachievable or irrelevant).
Presumably, such a prescription indicates that joint intentions are not specifically intended for individually observable environments. Upon achieving the goal in an individually observable environment, each agent would simultaneously observe that S^t ∈ G. Because of our assumption that the COM-MTDP model components (including O_a) are common knowledge to the team, each agent would also simultaneously come to believe that its teammates have observed that S^t ∈ G, and that its teammates believe that it believes that all of the team members have observed that S^t ∈ G, and so on. Thus, the team immediately attains mutual belief in the achievement of the goal under individual observability without any additional communication necessary by the team.
Instead, the joint intention framework aims at domains with some degree of unobservability. In such domains, the agents must signal the other agents, either through communication or some informative domain-level action, to attain mutual belief. However, we can also assume that joint intention theory does not focus on domains with free communication, where Theorem 2 shows that we can simply have the agents communicate everything, all the time, without the need for more complex prescriptions.
The joint intention framework does not specify a precise communication policy for the attainment of mutual belief. In this paper, we focus on communication only in the case of goal achievement, but our methodology extends to handle unachievability and irrelevance as well. One well-known approach [Jennings, 1995] applied joint intentions theory by having the agents communicate the achievement of the joint goal, G, as soon as they believe G to be true. To instantiate the behavior of Jennings' agents within a COM-MTDP, we construct a communication policy, π_{aΣ}^J, that specifies that an agent sends the special message, σ_G, when it first believes that G holds. Following joint intentions' assumption of sincerity [Smith & Cohen, 1996], we require that the agents never select the special σ_G message in a belief state unless they believe G to be true with certainty. With this requirement and with our assumption of the team's common knowledge of the communication model, we can assume that all of the other agents immediately accept the special message, σ_G, as true, and that the agents know that all their team members accept the message as true, and so on. Thus, the team attains mutual belief that G is true immediately upon receiving the message, σ_G. We can construct the communication policy, π_{aΣ}^J, in constant time.
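Jennings' instantiation reduces to a one-line rule: announce the achievement message the first time you believe G holds. A sketch, with our own encoding of belief states as dicts exposing a hypothetical `believes_G_achieved` flag:

```python
def jennings_policy(belief, already_sent):
    """Communication policy pi^J (sketch): send the special message
    sigma_G as soon as the agent first believes the joint goal G is
    achieved; stay silent otherwise. The sincerity assumption means the
    flag is set only when G is believed with certainty; `already_sent`
    prevents repeat announcements."""
    if belief.get("believes_G_achieved") and not already_sent:
        return "sigma_G"
    return None  # null message
```

The constant construction time claimed in the text is visible here: the rule inspects a single flag, independent of the size of the state or belief spaces.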
The STEAM algorithm is another instantiation of joint intentions that has had success in several real-world domains [Tambe, 1997, Pynadath, Tambe, Chauvat, & Cavedon, 1999, Tambe, Pynadath, Chauvat, Das, & Kaminka, 2000, Pynadath & Tambe, 2002]. Unlike Jennings' instantiation, the STEAM teamwork model includes decision-theoretic communication selectivity. A domain specification includes two parameters for each joint commitment, G: τ, the probability of miscoordinated termination of G; and C_mt, the cost of miscoordinated termination of G. In this context, ``miscoordinated termination'' means that some agents immediately observe that the team has achieved G while the rest do not. STEAM's domain specification also includes a third parameter, C_c, to represent the cost of communication of a fact (e.g., the achievement of G). Using these parameters, the STEAM algorithm evaluates whether the expected cost of miscoordination outweighs the cost of communication. STEAM expresses this criterion as the following inequality: τ·C_mt > C_c. We can define a communication policy, π_{aΣ}^S, based on this criterion: if the inequality holds, then an agent that has observed the achievement of G will send the message, σ_G; otherwise, it will not. We can construct π_{aΣ}^S in constant time.
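STEAM's selectivity is likewise a constant-time test of its three domain parameters. A sketch of the criterion τ·C_mt > C_c as a communication policy (the belief encoding and parameter values are illustrative assumptions, not from STEAM's implementation):

```python
def steam_policy(belief, tau, c_mt, c_c):
    """Communication policy pi^S (sketch): an agent that has observed
    the achievement of G sends sigma_G only if the expected cost of
    miscoordinated termination, tau * c_mt, exceeds the communication
    cost c_c.
    tau:  probability of miscoordinated termination of G
    c_mt: cost of miscoordinated termination of G
    c_c:  cost of communicating a fact"""
    if belief.get("believes_G_achieved") and tau * c_mt > c_c:
        return "sigma_G"
    return None  # null message

# Likely, costly miscoordination justifies the message...
assert steam_policy({"believes_G_achieved": True}, 0.8, 10.0, 1.0) == "sigma_G"
# ...while unlikely, cheap miscoordination does not.
assert steam_policy({"believes_G_achieved": True}, 0.1, 2.0, 1.0) is None
```

The contrast with π^J is just the extra threshold test: both policies fire only on first belief in G's achievement, but π^S can decline to communicate when the expected miscoordination cost is low.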
Both policies, π_{aΣ}^J and π_{aΣ}^S, consider sending σ_G only when an agent first believes that G has been achieved. Once an agent holds the relevant belief, the two policies make different choices, and we consider here what the optimal decision is at this point. The domain is not individually observable, so some agents may be unaware of the achievement of G. If the σ_G message is not sent, these unaware agents may unnecessarily continue performing actions in pursuit of achieving G. The performance of these extraneous actions could incur costs and lead to a lower utility than one would expect when sending the σ_G message.
The decision to send σ_G or not matters only if the team achieves G and some agent comes to know this fact. We define the random variable, T_G, to be the earliest time at which an agent knows this fact. We denote by K_G the agent who knows of the achievement at time T_G. If K_G = i, for some agent, i, and T_G = t_0, then agent i has some pre-communication belief state, b_{i·Σ}^{t_0} = b, that indicates that G has been achieved. To more precisely quantify the difference between agent i sending the σ_G message at time T_G vs. never sending it, we define the following value:
Δ^T(t_0, i, b) ≡ E[∑_{t=t_0}^{T} R^t | Σ_i^{t_0} = σ_G, b_{i·Σ}^{t_0} = b] − E[∑_{t=t_0}^{T} R^t | Σ_i^{t_0} = null, b_{i·Σ}^{t_0} = b]    (13)
Unfortunately, while Equation 13 identifies an exact criterion for locally optimal communication, this criterion is not yet operational. In other words, we cannot directly implement it as a communication policy for the agents. Furthermore, Equation 13 hides the underlying complexity of the computation involved, which is one of the key goals of our analysis. Therefore, we use the COM-MTDP model to derive an operational expression of Δ^T ≥ 0. For simplicity, we define notational shorthand for various sequences and combinations of values. We define a partial sequence of random variables, X^{<t}, to be the sequence of random variables for all times before t: X^0, X^1, …, X^{t−1}. We make similar definitions for the other relational operators (i.e., X^{>t}, X^{≥t}, etc.). The expression, (S)^T, denotes the cross product over states of the world, ∏_{t=0}^{T} S, as distinguished from the time-indexed random variable, S^T, which denotes the value of the state at time T. The notation, s^{≥t_0}[t], specifies the element in slot t within the vector s^{≥t_0}. We define the function, U, as shorthand within our probability expressions. It allows us to compactly represent a particular subsequence of world and agent belief states occurring, conditioned on the current situation, as follows:
 (14) 
 (15) 
Theorem 7 If we assume that, upon achievement of G, no communication other than σ_G is possible, then the condition Δ^T(t_0, i, b) ≥ 0 holds if and only if:
∑_{s^{≤t_0} ∈ (S)^{t_0}} ∑_{b_{·Σ}^{≤t_0} ∈ B_a^{t_0}} Pr(U(⟨0, t_0⟩, s^{≤t_0}, b_{·Σ}^{≤t_0})) ·
  ( ∑_{s^{≥t_0} ∈ (S)^{T−t_0+1}} ∑_{b_{·Σ}^{≥t_0} ∈ B_a^{T−t_0+1}} Pr(U(⟨t_0, T⟩, s^{≥t_0}, b_{·Σ}^{≥t_0}) | Σ_i^{t_0} = σ_G, U(⟨0, t_0⟩, s^{≤t_0}, b_{·Σ}^{≤t_0})) · ∑_{t=t_0}^{T} R_A(s^{≥t_0}[t], π_{aA}(b_{Σ·}(b_{·Σ}^{≥t_0}[t], π_{aΣ+σ_G})))
  − ∑_{s^{≥t_0} ∈ (S)^{T−t_0+1}} ∑_{b_{·Σ}^{≥t_0} ∈ B_a^{T−t_0+1}} Pr(U(⟨t_0, T⟩, s^{≥t_0}, b_{·Σ}^{≥t_0}) | Σ_i^{t_0} = null, U(⟨0, t_0⟩, s^{≤t_0}, b_{·Σ}^{≤t_0})) · ∑_{t=t_0}^{T} R_A(s^{≥t_0}[t], π_{aA}(b_{Σ·}(b_{·Σ}^{≥t_0}[t], π_{aΣ+null}))) )
≥ − ∑_{s ∈ G} ∑_{b ∈ B_a} Pr(U(⟨t_0, t_0⟩, s, b)) · R_Σ(s, σ_G)    (16)
Proof: The complete proof of this theorem appears in Online Appendix 1. The definition of Δ^T in Equation 13 is the difference between two expectations, where each expectation is a sum over the possible trajectories of the agent team. Each trajectory must include a sequence of possible world states, since the agents' reward at each point in time depends on the particular state of the world at that time. The agents' reward also depends on their actions (both domain- and communication-level). These actions are deterministic, given the agents' policies, π_{aA} and π_{aΣ}, and their belief states. Thus, in addition to summing over the possible states of the world, we must also sum over the possible states of the agents' beliefs (both pre- and post-communication):

We can rewrite these summations more simply using our various
shorthand notations:

Theorem 7 states, informally, that we prefer sending σ_G whenever the cost of execution after achieving G outweighs the cost of communicating the fact that G has been achieved. More precisely, the outer summations on the left-hand side of the inequality iterate over all possible past histories of world and belief states, producing a probability distribution over the possible states the team can be in at time t_0. For each such state, the expression inside the parentheses computes the difference in domain-level reward, over all possible future sequences of world and belief states, between sending and not sending σ_G. By our theorem's assumption that no communication other than σ_G is possible after G has been achieved, we can ignore any communication costs in the future. However, if we relax this assumption, we can extend the left-hand side in a straightforward manner into a longer expression that accounts for the difference in future communication costs as well. Thus, the left-hand side captures our intuition that, when not communicating, the team will incur a cost if the agents other than i are unaware of G's achievement. The right-hand side of the inequality is a summation of the cost of sending the σ_G message over possible current states and belief states.
We can use Theorem 7 to derive the locally optimal communication decision across various classes of problem domains. Under no communication, we cannot send σ_G. Under free communication, the right-hand side is 0, so the inequality is always true, and we know to prefer sending σ_G. Under no assumptions about communication, the determination is more complicated. When the domain is individually observable, the left-hand side becomes 0, because all of the agents know that G has been achieved (and thus there is no difference in execution when sending σ_G). Therefore, the inequality is always false (unless communication is free), and we prefer not sending σ_G. When the environment is not individually observable and communication is available but not free, then, to be locally optimal at time t_0, agent i must evaluate Inequality 16 in its full complexity. Since the inequality sums rewards over all possible sequences of states and observations, the time complexity of the corresponding algorithm is O((|S|·|Ω_a|)^T). While this complexity is unacceptable for most real-world problems, it still provides an exponential savings over searching the entire policy space for the globally optimal policy, where any agent could potentially send σ_G at times other than T_G. Table 3 summarizes the complexity required to determine the locally optimal policy under the various domain properties.
                 Individually   Collectively       Collectively            Non-
                 Observable     Observable         Partially Observable    Observable
No Comm.         Ω(1)           Ω(1)               Ω(1)                    Ω(1)
General Comm.    Ω(1)           O((|S|·|Ω_α|)^T)   O((|S|·|Ω_α|)^T)        Ω(1)
Free Comm.       Ω(1)           Ω(1)               Ω(1)                    Ω(1)
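The exponential-time entry of Table 3 corresponds to a brute-force enumeration of all length-T (state, observation) sequences. The sketch below illustrates that enumeration; it is a simplification, not the paper's algorithm: it assumes a stationary per-step distribution rather than the full history- and belief-conditioned distributions of Inequality 16, and all function names and arguments are ours.

```python
from itertools import product

def expected_reward(states, observations, T, prob, reward):
    """Enumerate every length-T sequence of (state, observation) pairs --
    the O((|S| * |Omega|)^T) computation -- and return the
    probability-weighted sum of accumulated domain-level reward."""
    total = 0.0
    for seq in product(product(states, observations), repeat=T):
        p, r = 1.0, 0.0
        for s, o in seq:
            p *= prob(s, o)   # per-step probability (a simplification)
            r += reward(s)
        total += p * r
    return total

def prefer_sending(states, observations, T, prob,
                   reward_send, reward_silent, comm_cost):
    """Theorem 7's criterion: send sigma_G iff the expected domain-level
    gain from communicating exceeds the expected communication cost."""
    gain = (expected_reward(states, observations, T, prob, reward_send)
            - expected_reward(states, observations, T, prob, reward_silent))
    return gain > comm_cost
```

Even in this toy form, the nested `product` makes the exponential growth in T visible, which is why the table's general-communication column is the only expensive one.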
We can now show that although Theorem 7's algorithm for locally optimal communication provides a significant computational savings over finding the global optimum, it still outperforms existing teamwork models, as exemplified by our π_{αΣ}^J and π_{αΣ}^S policies. First, we can use the criterion of Theorem 7 to evaluate the optimality of the Jennings policy, π_{αΣ}^J. If Δ^T(t_0, i, b) ≥ 0 for all possible times t_0, agents i, and belief states b that are consistent with the achievement of the goal G, then the locally optimal policy will always specify sending σ_G. In other words, π_{αΣ}^J will be identical to the locally optimal policy. However, if the inequality of Theorem 7 is ever false, then π_{αΣ}^J is not even locally, let alone globally, optimal.
Second, we can also use Theorem 7 to evaluate STEAM by viewing STEAM's inequality, τ·C_mt > C_c, as a crude approximation of Inequality 16. In fact, there is a clear correspondence between the terms in the two inequalities. The left-hand side of Inequality 16 computes an exact expected cost of miscoordination. However, unlike STEAM's monolithic τ parameter, the optimal criterion evaluates a complete probability distribution over all possible states of miscoordination by considering all possible past sequences consistent with the agent's current beliefs. Likewise, unlike STEAM's monolithic C_mt parameter, the optimal criterion looks ahead over all possible future sequences of states to determine the true expected cost of miscoordination. Furthermore, we can view STEAM's parameter, C_c, as an approximation of the communication cost computed by the right-hand side of Inequality 16. Again, STEAM uses a single parameter, while the optimal criterion computes an expected cost over all possible states of the world.
STEAM does have some flexibility in its representation, because C_mt, τ, and C_c are not necessarily fixed across the entire domain. For instance, C_mt may vary based on the specific joint plan to which the agents have jointly committed (i.e., there may be a different C_mt for each goal G). Thus, while Theorem 7 suggests significant additional flexibility in computing C_mt through explicit lookahead, the optimal criterion derived with the COM-MTDP model also provides a justification for the overall structure of STEAM's approximate criterion. Furthermore, STEAM emphasizes online computation, for which the computational complexity of evaluating Inequality 16 (as presented in Table 3) is unacceptable, so STEAM's approximation error may be a reasonable price to pay for its gains in efficiency. For a specific domain, we can use empirical evaluation (as demonstrated in the next section) to quantify the error and efficiency and thus judge this tradeoff precisely.
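STEAM's rule, by contrast with the exponential lookahead, is a constant-time comparison of fixed parameters. The sketch below (parameter values are hypothetical) makes the contrast concrete, and a per-goal table of C_mt values captures the flexibility just noted.

```python
def steam_communicate(tau, c_mt, c_c):
    """STEAM's O(1) decision rule: communicate iff the estimated
    probability of miscoordination (tau) times its estimated cost
    (c_mt) outweighs the communication cost (c_c).  All three are
    fixed, hand-tuned parameters rather than computed expectations."""
    return tau * c_mt > c_c

# C_mt may vary with the jointly committed goal G (illustrative values):
C_MT_BY_GOAL = {"G_D": 0.0, "G_R": 2.0}

def steam_communicate_for_goal(goal, tau, c_c):
    """Apply STEAM's rule with a goal-specific miscoordination cost."""
    return steam_communicate(tau, C_MT_BY_GOAL[goal], c_c)
```

Each call is a single multiplication and comparison, which is exactly why STEAM trades the optimality of Inequality 16 for speed.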
This section presents results of a COM-MTDP analysis of an example domain involving agent-piloted helicopters, where we focus on the key communication decision faced by many multiagent frameworks (as described in Section 4), but vary the cost of communication and the degree of observability to generate a space of distinct domains with different implications for the agents' performance. By evaluating communication policies over various configurations of this particular testbed domain, we demonstrate a methodology by which one can use the COM-MTDP framework to model any problem domain and to evaluate candidate communication policies for it.
The two agents form a top-level joint commitment, G_D, to reach their destination. There is no incentive for the agents to communicate the achievement of this goal, since they will both eventually reach their destination with certainty. However, in the service of their top-level goal, G_D, the two agents also adopt a joint commitment, G_R, of destroying the radar unit. We consider here the problem facing Escort with respect to communicating the achievement of goal G_R. If Escort communicates the achievement of G_R, then Transport knows that it is safe to fly at its normal altitude (thus reaching the destination sooner). If Escort does not communicate the achievement of G_R, there is still some chance that Transport will observe the event anyway. If Transport does not observe the achievement of G_R, then it must fly nap-of-the-earth the whole distance, and the team receives a lower reward because of the later arrival. Therefore, Escort must weigh the increase in expected reward against the cost of communication.
In the COM-MTDP model of this scenario (presented in Figures 2, 3 and 4), the world state is the position (along a straight line between origin and destination) of Transport, Escort, and the enemy radar. The enemy is at a randomly selected position somewhere between the agents' initial position and their destination. Transport has no possible communication actions, but it can choose between two domain-level actions: flying nap-of-the-earth and flying at its normal speed and altitude. Escort has two domain-level actions: flying at its normal speed and destroying the radar. Escort also has the option of communicating the special message, σ_{G_R}, indicating that the radar has been destroyed. In the tables of Figures 2, 3 and 4, the ``·'' symbol represents a wildcard (or ``don't care'') entry.
α = {Escort (E), Transport (T)}
S = Ξ_E × Ξ_T × Ξ_R
    Position of Escort: Ξ_E = {0, 1, …, 8, 9, Destination}
    Position of Transport: Ξ_T = {0, 0.5, …, 9, 9.5, Destination, Destroyed}
    Position of Radar: Ξ_R = {1, 2, …, 8, Destroyed}
A_α = A_E × A_T = {fly, destroy, wait} × {fly-NOE, fly-normal, wait}
Σ_α = Σ_E × Σ_T = {clear (σ_{G_R}), null} × {null}
R_A(⟨x_E, x_T, x_R⟩, a) = [domain-level reward, specified case by case in Figure 2]
R_Σ(s, ⟨null, null⟩) = 0
R_Σ(s, ⟨σ_{G_R}, null⟩) = −ρ_Σ, where ρ_Σ ∈ [0, 1]
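The specification above can be transcribed almost directly into code. The following sketch enumerates the component spaces; identifier names are ours, and the negative sign on the communication reward reflects our reading of ρ_Σ as a cost.

```python
from itertools import product

AGENTS = ("Escort", "Transport")

# Component state spaces, per the specification above
XI_E = list(range(10)) + ["Destination"]          # Escort: 0..9, Destination
XI_T = [i / 2 for i in range(20)] + ["Destination", "Destroyed"]  # 0, 0.5, ..., 9.5
XI_R = list(range(1, 9)) + ["Destroyed"]          # Radar: 1..8, Destroyed

# Joint state space S = Xi_E x Xi_T x Xi_R
STATES = list(product(XI_E, XI_T, XI_R))

# Domain-level actions A_alpha = A_E x A_T
A_E = ("fly", "destroy", "wait")
A_T = ("fly-NOE", "fly-normal", "wait")

# Communication: only Escort has a message, sigma_GR ("clear")
SIGMA_E = ("clear", None)
SIGMA_T = (None,)

def comm_reward(escort_message, rho_sigma):
    """R_Sigma: 0 for silence, -rho_sigma for sending sigma_GR
    (treating rho_sigma in [0, 1] as a cost)."""
    return -rho_sigma if escort_message == "clear" else 0.0
```

Enumerating the joint state space like this (11 × 22 × 9 = 2,178 states) is what makes the exact expected-reward computations of the next subsection tractable for this small testbed.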
If Escort arrives at the radar, then it observes its presence with certainty and can destroy it to achieve G_R. The likelihood of Transport's observing the radar's destruction is a function of its distance from the radar. We can vary this function's observability parameter (λ in Figure 4) within the range [0,1] to generate distinct domain configurations (0 means that Transport will never observe the radar's destruction; 1 means that Transport will always observe it). If the observability is 1, then the agents achieve mutual belief in the achievement of G_R as soon as it occurs (following the argument presented in Section 4.1). However, for any observability less than 1, there is a chance that the agents will not achieve mutual belief simply by common observation. The helicopters receive a fixed reward for each time step spent at their destination. Thus, for a fixed time horizon, the earlier the helicopters reach their destination, the greater the team's reward. Since flying nap-of-the-earth is slower than normal flight, Transport switches to normal flying as soon as it either observes that G_R has been achieved or Escort sends the message σ_{G_R}. Sending the message is not free, so we impose a variable communication cost (ρ_Σ in Figure 2), also within the range [0,1].
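The text specifies only the endpoints of the observability function (λ = 0: never observed; λ = 1: always observed) and its dependence on distance; the actual function is given in the paper's Figure 4. One hypothetical form consistent with those endpoints is sketched below, purely for illustration.

```python
def p_observe_destruction(lam, distance):
    """Hypothetical observability model: the probability that
    Transport observes the radar's destruction decays with its
    distance from the radar, with lam = 1 -> always observed and
    lam = 0 -> never observed (for distance > 0).  Illustrative
    only; the paper's Figure 4 defines the real function."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("observability must lie in [0, 1]")
    return lam ** distance
```

Any monotone family with these endpoint properties would serve the same role in the experiments: sweeping λ from 0 to 1 interpolates between the silent-friendly and communication-friendly regimes.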
We constructed COM-MTDP models of this scenario for each combination of observability and communication cost within the range [0,1] at 0.1 increments. For each combination, we applied the Jennings and STEAM policies, as well as a completely silent policy. For this domain, the policy π_{αΣ}^J dictates that Escort always communicate σ_{G_R} upon destroying the radar. For STEAM, we vary the τ and C_c parameters with the observability and communication cost parameters, respectively. We used two different settings (low and medium) for the cost of miscoordination, C_mt. Following the published STEAM algorithm [Tambe, 1997], Escort sends message σ_{G_R} if and only if STEAM's inequality, τ·C_mt > C_c, holds. Thus, the two settings, low and medium, for C_mt generate two distinct communication policies; the high setting is strictly dominated by the other two in this domain. We also constructed and evaluated locally and globally optimal policies. In applying each of these policies, we used our COM-MTDP model to compute the expected reward received by the team when following the selected policy. We can uniquely determine this expected reward given the candidate communication policy and the particular observability and communication cost parameters, as well as the COM-MTDP model specified in Figures 2, 3, and 4.
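The experimental grid just described can be generated as below. Here `evaluate` stands in for the COM-MTDP expected-reward computation; its name and signature are our own, not the paper's API.

```python
def sweep_configurations(policies, evaluate, step=0.1):
    """Evaluate each candidate communication policy on every
    (observability, communication-cost) pair in [0,1] x [0,1]
    at `step` increments, mirroring the experimental setup.
    `evaluate(policy, obs, cost)` is assumed to return the
    team's expected reward for that configuration."""
    n = round(1.0 / step) + 1
    grid = [round(i * step, 10) for i in range(n)]  # 0.0, 0.1, ..., 1.0
    return {
        name: {(obs, cost): evaluate(policy, obs, cost)
               for obs in grid for cost in grid}
        for name, policy in policies.items()
    }
```

With a 0.1 step this yields 11 × 11 = 121 configurations per policy, the grid over which the suboptimality surfaces of Figures 5 and 6 are plotted.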
Figures 5 and 6 plot how much utility the team can expect to lose by following the Jennings, silent, and two STEAM policies instead of the locally optimal communication policy (thus, higher values mean worse performance). We can immediately see that the Jennings and silent policies are significantly suboptimal for many possible domain configurations. For example, not surprisingly, the surface for the policy π_{αΣ}^J peaks (i.e., it does most poorly) when both the communication cost and the observability are high, while the silent policy does poorly under exactly the opposite conditions.
Previously published results [Jennings, 1995] demonstrated that the Jennings policy led to better team performance by reducing waste of effort produced by alternate policies like our silent one. These earlier results focused on a single domain, and Figure 5 partially confirms their conclusion and shows that the superiority of the Jennings policy over the silent policy extends over a broad range of possible domain configurations. On the other hand, our COMMTDP results also show that there is a significant subclass of domains (e.g., when communication cost and observability are high) where the Jennings policy is actually inferior to the silent policy. Thus, with our COMMTDP model, we can characterize the types of domains where the Jennings policy outperforms the silent policy and vice versa.
Figure 6 shows the expected value lost by following the two STEAM policies. We can view STEAM as trying to intelligently interpolate between the Jennings and silent policies based on the particular domain properties. In fact, under a low setting for C_mt, we see two thresholds, one along each dimension, at which STEAM switches between following the Jennings and silent policies, and its suboptimality is highest at these thresholds. Under a medium setting for C_mt, STEAM does not exhibit a threshold along the communication-cost dimension, due to the increased cost of miscoordination. Under both settings, STEAM's performance generally follows the better of those two fixed policies, so its maximum suboptimality (0.587 under both settings) is significantly lower than that of the silent (0.700) and Jennings (1.000) policies. Furthermore, STEAM outperforms the two policies on average, across the space of domain configurations, as evidenced by its mean suboptimality of 0.063 under low C_mt and 0.083 under medium C_mt. Both values are significantly lower than the silent policy's mean of 0.160 and the Jennings policy's mean of 0.161. Thus, we have been able to quantify the savings provided by STEAM over less selective policies within this example domain.
However, within a given domain configuration, STEAM must either always or never communicate, and this inflexibility leads to significant suboptimality across a wide range of domain configurations. On the other hand, Figure 6 also shows that there are domain configurations where STEAM is locally optimal. In this relatively small-scale experimental testbed, there is no need to incur STEAM's suboptimality, because the agents can compute the superior locally optimal policy in under 5 seconds. In larger-scale domains, on the other hand, the increased complexity of computing the locally optimal policy may render it infeasible. In such domains, STEAM's constant-time execution would potentially make it a preferable alternative. This analysis suggests a possible spectrum of algorithms that make different optimality-efficiency tradeoffs.
To understand the cause of STEAM's suboptimality, we can examine its performance more deeply in Figures 7 and 8, which plot the expected number of messages sent using STEAM (with both low and medium C_{mt}) vs. the locally optimal policy, at observability values of 0.3 and 0.7. STEAM's expected number of messages is either 0 or 1, so STEAM can make at most two (instantaneous) transitions between them: one threshold value each along the observability and communication cost dimensions.
From Figures 7 and 8, we see that the optimal policy can be more flexible than STEAM by making communication contingent on Escort's beliefs beyond simply the achievement of G_R. For example, consider the messages sent under low C_mt in Figure 7, where STEAM matches the locally optimal policy at the extremes of the communication-cost dimension. Even if the communication cost is high, it is still worth sending message σ_{G_R} in states where Transport is still very far from the destination. Thus, the surface for the optimal policy makes a more gradual transition from always communicating to never communicating. We can thus view STEAM's surface as a crude approximation to the optimal surface, subject to STEAM's fewer degrees of freedom.
We can also use Figures 7 and 8 to identify the domain conditions under which joint intentions theory's prescription of attaining mutual belief is or is not optimal. In particular, for any domain where the observability is less than 1, the agents will not attain mutual belief without communication. In both Figures 7 and 8, there are many domain configurations where the locally optimal policy is expected to send fewer than one σ_{G_R} message. Each of these configurations represents a domain where the locally optimal policy will fail to attain mutual belief in at least one case. Therefore, attaining mutual belief is suboptimal in those configurations!
These experiments illustrate that STEAM, despite its decision-theoretic communication selectivity, may communicate suboptimally under a significant class of domain configurations. Previous work on STEAM-based, real-world, agent-team implementations informally noted suboptimality in an isolated configuration within a more realistic helicopter transport domain [Tambe, 1997]. Unfortunately, this previous work treated that suboptimality (where the agents communicated more than necessary) as an isolated aberration, so there was no investigation of the degree of such suboptimality, nor of the conditions under which it may occur in practice. We recreated these conditions within the experimental testbed of this section by using a medium C_mt. The resulting experiments (as shown in Figure 7) illustrated that the observed suboptimality was not an isolated phenomenon; in fact, STEAM has a general propensity toward extraneous communication in situations involving low observability (i.e., low likelihood of mutual belief) and high communication costs. This result matches the situation in which the ``aberration'' occurred in the more realistic domain.
The locally optimal policy is itself suboptimal with respect to the globally optimal policy, as we can see from Figure 9. Under domain configurations with high observability, the globally optimal policy has the escort wait an additional time step after destroying the radar and then communicate only if the transport continues flying nap-of-the-earth. The escort cannot directly observe which method of flight the transport has chosen, but it can measure the change in the transport's position (since it maintains a history of its past observations) and thus infer the method of flight with complete accuracy. In a sense, the escort following the globally optimal policy is performing plan recognition to analyze the transport's possible beliefs. It is particularly noteworthy that our domain specification does not explicitly encode this recognition capability. In fact, our algorithm for finding the globally optimal policy does not even make any of the assumptions behind our locally optimal policy (i.e., a single agent deciding whether or not to communicate a single message at a single point in time); rather, our general-purpose search algorithm traverses the policy space and ``discovers'' this possible means of inference on its own. We expect that such COM-MTDP analysis can provide an automatic method for discovering novel communication policies of this type in other domains, even those modeling real-world problems.
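The inference the search algorithm ``discovers'' can be illustrated as follows. The per-step speeds here are our assumption, suggested by the 0.5-unit granularity of Transport's position grid, not values stated in the model.

```python
def infer_flight_mode(prev_pos, curr_pos, noe_speed=0.5, normal_speed=1.0):
    """From the observed change in Transport's position, the escort
    can deduce Transport's flight mode with complete accuracy
    (speeds are illustrative assumptions)."""
    delta = curr_pos - prev_pos
    if delta == normal_speed:
        return "fly-normal"
    if delta == noe_speed:
        return "fly-NOE"
    return "wait"

def should_communicate_after_wait(prev_pos, curr_pos):
    """The globally optimal rule described above: wait one step after
    destroying the radar, then send sigma_GR only if Transport is
    still flying nap-of-the-earth (i.e., has not yet observed G_R)."""
    return infer_flight_mode(prev_pos, curr_pos) == "fly-NOE"
```

Nothing in this sketch is encoded in the domain specification itself; it is the kind of history-based plan recognition that emerges from searching the policy space.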
Indeed, by exploiting this discovery capability within our example domain, the globally optimal policy gains a slight advantage in expected utility over the locally optimal policy, with a mean difference of 0.011, a standard deviation of 0.027, and a maximum of 0.120. On the other hand, our domain-independent code never requires more than 5 seconds to compute the locally optimal policy in this testbed, while our domain-independent search algorithm always required more than 150 minutes to find the globally optimal policy. Thus, through Theorem 7, we have used the COM-MTDP model to construct a communication policy that, for this testbed domain, performs almost optimally and outperforms existing teamwork theories, with a substantial computational savings over finding the globally optimal policy. Although these results hold for an isolated communication decision, we expect the relative performance of the policies to stay the same even with multiple decisions, where the inflexibility of the suboptimal policies will only exacerbate their losses (i.e., the shapes of the graphs would stay roughly the same, but the suboptimality magnitudes would increase).
The COM-MTDP framework provides a general methodology for analysis across both general domain subclasses and specific domain instantiations. As demonstrated in Section 4, we can express important existing teamwork theories within the COM-MTDP framework and derive broadly applicable theoretical results about their optimality. Section 5 demonstrates our methodology for the analysis of a specific domain. By encoding a teamwork problem as a COM-MTDP, we can use the leverage of our general-purpose software tools (available in Online Appendix 1) to evaluate the optimality of teamwork based on potentially any other existing theory, as demonstrated in this paper using two leading instantiations of joint intentions theory. In combining theory and practice, we can use the theoretical results derived within the COM-MTDP framework as the basis for new algorithms that extend our software tools, just as we did in translating Theorem 7 from Section 4 into an implemented algorithm for locally optimal communication in Section 5. We expect that the COM-MTDP framework, the theorems and complexity results, and the reusable software will form a basis for further analysis of teamwork, both by ourselves and by others in the field.
While our initial COM-MTDP results are promising, there remain at least three key areas where future progress is critical. First, analysis using COM-MTDPs (such as that presented in Section 5) requires knowledge of the rewards, transition probabilities, and observation probabilities, as well as of the competing policies governing agent behavior. It may not always be possible to have such a model of the domain and the agents' policies readily available. Indeed, other proposed team-analysis techniques [Nair, Tambe, Marsella, & Raines, 2002b,Raines, Tambe, & Marsella, 2000] do not require a priori hand-coding of such models, but rather acquire them automatically through machine learning over large numbers of runs. Also, in the interests of combating computational complexity and improving understandability, some researchers emphasize the need for multiple models at multiple levels of abstraction, rather than a single model [Nair, Tambe, Marsella, & Raines, 2002b]. For instance, one level of the model may focus on the analysis of individual agents' actions in support of a team, while another level may focus on interactions among subteams. We can potentially extend the COM-MTDP model in both of these directions (i.e., machine learning of model parameters, and hierarchical representations of the team to provide multiple levels of abstraction).
Second, it is important to extend COM-MTDP analysis to aspects of teamwork beyond communication. For instance, team formation (where agents may be assigned specific roles within the team) and reformation (where the failure of individual agents leads to role reassignment within the team) are key problems in teamwork that appear suitable for COM-MTDP analysis. Such analysis may require extensions to the COM-MTDP framework (e.g., explicit modeling of roles). Ongoing research [Nair, Tambe, & Marsella, 2002a] has begun investigating the impact of such extensions and their applications in domains such as RoboCup Rescue [Kitano, Tadokoro, Noda, Matsubara, Takahashi, Shinjoh, & Shimada, 1999]. Analysis of more complex team behaviors may require further extensions to the COM-MTDP model to explicitly account for additional aspects of teamwork (e.g., notions of authority structure within teams).
Third, extending COM-MTDP analysis beyond teamwork to model other types of coordination may require relaxing the COM-MTDP's assumption of selfless agents receiving the same joint reward. More complex organizations may require modeling other, non-joint rewards. Indeed, enriching the COM-MTDP model in this manner may enable analysis of some of the seminal work in multiagent coordination in the tradition of PGP and GPGP [Decker & Lesser, 1995,Durfee & Lesser, 1991]. Such enriched models may first require new advances in the mathematical foundations of our COM-MTDP framework, and may ultimately contribute to the emerging sciences of agents and multiagent systems.