                          A Critique of Tileworld

Tentative Outline
I) Introduction
  What does the Tileworld paper consist of?
     claim about testbeds for planners
     an agent architecture and how it can be implemented
      and validated
     experimental methodology (what does it mean to validate)

II) Description of the implemented world
  Brief description of the world & more detailed
  description of the implementation.

III) Description of the tileworld agent
  What Ringuette says and what he does

IV) Description of the tileworld experiments
  What they say and what they do.

V) World
  What does a good world testbed have to look like?
    what are the issues 
    what, then, are the necessary features

  To what extent does tileworld exhibit these features?

MAXIMUM SEPARATION BETWEEN AGENT AND WORLD.
 bad from design standpoint:  you as agent designer have 
  to design and implement with the details of the world implementation
  in mind

 bad from an experimental standpoint:  to the extent that they're
  combined, one is never sure whether results are due to the 
  agent, the world, or the interface.

violations:  time, state, task/value/score

    V.A) Problems that prohibit comparisons across architecture families.
    V.B) Problems that prohibit comparisons within an architecture family.
    V.C) How Tileworld compares to other geometric real-time simulators.
    V.D) Proposal to add new knobs.
    V.E) What class of agent architectures are expressible in Tileworld. #
    V.F) What are the important aspects of an agent architecture. #
    V.G) How to best test the important aspects of an agent architecture.
    V.H) What the characteristics of the simulated world should have been.

VI) Implementing and validating Agent Architecture
    VI.A) Summary of IRMA architecture from [Bratman et al].
    VI.B) How good of an implementation of IRMA is the agent.
    VI.C) Think time vs. act time, and how time is implemented.
    VI.D) Description of experiments conducted.
    VI.E) Claims that have no experimental support.
    VI.F) What could have been an appropriate experimental regime.
VII) What did we learn from Tileworld?
VIII) Conclusion.



V) World:

the world is a subroutine the agent calls (and vice versa)
  the fact that the path planner can literally call the world
  to do its simulation is an abomination!!

in fact, the world is a data structure that the agent passes
to the execution-time code.

now both can be kludged:  the agent can be made responsible for 
updating time in a principled way and for not looking at the 
whole data structure or looking at it "wrong" (i.e. introducing 
its own noise).  but if the world is going to be a good testbed, 
i.e. one that forces an architecture to confront the central 
problems that a real agent will face in a realistic world, then 
these are exactly the features the world itself should 
provide---it shouldn't be left to the agent to supply those 
complications only if it feels like it.

perception, for example, can be added, but (1) not as part of the 
subroutine/data structure architecture that exists now, and (2) not 
as a simple add-on.  if the world is going to decide not to answer 
queries, or to answer them wrong, then it has to have a theory 
of how to do so, and such a theory is not trivial by any stretch
of the imagination.  the problem is as bad as the sensing problem itself.

There are two major claims here.  First, a good setting of the Tileworld
agent depends in large part on the settings of the world (i.e. Tileworld is
a good indicator of performance within an architecture family).  Second,
Tileworld is an adequate testbed for comparing agent architectures (i.e.
Tileworld is a good indicator of performance across architecture families).

V.A) Problems that prohibit comparisons across architecture families:

* "The paper involves the implementation of a system for experimentally
evaluating competing theoretical and architectural proposals."
* "The simulator is designed to minimize noise and preserve fine
distinctions in performance."
* "See the Tileworld testbed as a good basis for comparison of other
agent architectures."
###############
Since limitations on the world itself strongly limit the class of architectures
that can meaningfully be inserted in the environment, it is hard to test an
architectural proposal that is significantly different from the implemented
agent:
- It is not clear what the world's interface to an agent is, 

   this was not made clear in the paper, and not even in the code, 
   and if this is to be a universally accepted testbed for planning
   systems, the question of what *exactly* is the planner's interface 
   with the world is of paramount importance.

- the world has no sense of elapsed time, 
   
   time is a property of the world, not of the agent.  when the 
   agent has to tell the world how much time has elapsed, we have
   to conclude one of two things:  either such a responsibility is 
   trivial, in which case the world might as well determine for 
   itself how much time has elapsed, or it's not, in which case 
   the agent is both *deciding* how much time has elapsed and then 
   telling the world.  the deciding is something the agent should
   not do:  at best it requires a competence that is very strange
   to expect from the agent;  at worst it invites abuse of the 
   experiments.

- it has a limited notion of payoff, 
- it has very limited subgoal interaction,
- is affected considerably by noise introduced by randomness,
- the world lacks meaningful regularities, 
- has no separate notion of `task', 

-  does the agent have a model of the world?

  in the current implementation the agent gets a structure from the 
  world which is in fact the world itself.  this becomes the agent's
  world model.  this is unacceptable for two reasons:  first of all 
  it implies that the agent *can* maintain a full model of the complete
  world, and that it can get, for free, immediately and automatically, 
  the information necessary to do so.  (it doesn't have to do sensing or 
  inference in order to build its world model, but what is planning or 
  projection if not doing exactly that inference?)

  one might argue that this is an implementation detail, and that 
  the agent might well get this structure and limit itself only to 
  looking at certain aspects of the world, and so on.  but this betrays
  bad design on the part of the world itself.  is this an appropriate 
  breakdown of functionality between the agent and the world?  (which 
  bears on the question of whether the world itself is an appropriate
  testbed for evaluating various architectures and algorithms.)
  if sensing, incomplete information, and generally building the world
  model are central problems that an agent architecture must confront
  (and we believe they are) then it doesn't make sense to provide a 
  testbed world in which these problems don't arise, then expect 
  the agent itself to generate those problems.
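To make the complaint concrete, here is a toy sketch (Python; all names and geometry are hypothetical, not the actual Tileworld code) contrasting the current handoff---the agent receiving the world's own data structure---with the kind of restricted percept a testbed should hand out instead:

```python
import copy

class World:
    """Toy stand-in for the Tileworld state; all names here are hypothetical."""
    def __init__(self):
        self.objects = {"tiles": [(1, 2), (4, 4)], "holes": [(0, 3)]}

    def full_state(self):
        # what happens now: the agent is handed the world's own data
        # structure, so its "model" is literally the world itself
        return self.objects

    def percept(self, agent_pos, radius=3):
        # what a testbed should provide instead: a copy, restricted to a
        # sensing radius (and, given a real theory of sensing, noisy)
        seen = copy.deepcopy(self.objects)
        for kind in seen:
            seen[kind] = [p for p in seen[kind]
                          if abs(p[0] - agent_pos[0]) + abs(p[1] - agent_pos[1]) <= radius]
        return seen

w = World()
assert w.full_state() is w.objects         # the agent's "model" IS the world
assert w.percept((0, 0)) is not w.objects  # a percept is a separate, partial view
```

the point of the sketch is only the identity check in the last two lines: limiting what the agent looks at would have to be the agent's own discipline, which is exactly the wrong place for it.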

is the Tileworld a good testbed?
  1.  does it have the features that generate the important issues
      involved with building an agent?
  2.  is the interface similar to that which a "real" agent might 
      confront?
  
    In `traditional' planning, the world's interfaces to an agent are perception
or sensing, in which information about the world is passed to the agent,
and action, in which an action is passed to the world.  In Tileworld, as
implemented in the paper, the agent has complete knowledge of the
current objects in the world, so there is no perception interface.  The
acting or effecting interface, which is defined by a function called
"tw-step" that takes the next move and the duration of that move as
parameters, can easily be manipulated by the agent, since the agent is the
one that determines the duration of an act.  An important point to add here
is that the world interfaces must be well documented, a feature missing
from Tileworld, to allow incorporating other agent architectures.   
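The manipulation hazard can be shown in a few lines.  In this sketch the name tw-step comes from the paper, but the signature is reconstructed from the description above and everything else is a placeholder:

```python
class TileworldClock:
    """Toy model of the interface described above: the world's clock
    advances only by whatever duration the agent reports to tw-step."""
    def __init__(self):
        self.time = 0.0

    def tw_step(self, move, duration):
        # the world trusts whatever duration the caller reports
        self.time += duration
        return move   # a real tw-step would also update the grid

clock = TileworldClock()
clock.tw_step("north", 1.0)
clock.tw_step("east", 0.0)    # an agent can act "for free"
assert clock.time == 1.0      # the world cannot tell the difference
```

Nothing in such an interface distinguishes an honest agent from one (dishonest or merely buggy) that under-reports its durations, which is why the experiments invite abuse.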
    The notion of time is alien to the world; it is completely specified by
the agent.  The world clock ticks only when the agent tells it to.  This
makes comparing the `speeds' of alternative agent architectures at best
ambiguous.  Two extremes are an agent that estimates the number of act
cycles equivalent to its elapsed reasoning time and executes that many
'nop' acts, so that in effect it remains idle for a period equivalent to
its reasoning time, and an agent that does not account for reasoning time
at all and counts only actual acting time.  To argue for estimating
reasoning time, one must be able to find the ratio of act time versus
reasoning time, which is not clearly defined.
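The two extremes could be sketched as follows (Python; the ACT_CYCLE constant is a hypothetical act/think ratio, exactly the quantity the paper never pins down):

```python
import time

ACT_CYCLE = 0.05   # hypothetical: seconds of real reasoning time per act cycle

def charge_for_thinking(think):
    # first extreme: measure reasoning time, then prepend an equivalent
    # number of 'nop' acts so the agent stays idle for that long
    t0 = time.perf_counter()
    plan = think()
    elapsed = time.perf_counter() - t0
    nops = round(elapsed / ACT_CYCLE)   # hinges on knowing the act/think ratio
    return ["nop"] * nops + plan

def think_for_free(think):
    # second extreme: reasoning time is simply never charged
    return think()
```

Note that the first variant is only as principled as the ACT_CYCLE figure it is given; with no defined ratio, the accounting is arbitrary.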
    Since payoff is completely defined by score, and given the problem of
accounting for reasoning time, the only results that seem meaningful to
compare are total scores attained given an equal number of moves (assuming
all moves take the same duration).  Even then, total score attained is 
not a measure of how well the agent did alone; it is greatly influenced by
the randomness of the world.  The total randomness of the world, along with
using score as a performance measure, introduces so much noise that two runs
with identical settings may have significantly different results
(this point should be proved with an example). 
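A toy simulation (Python; a deliberately crude abstraction, not the actual Tileworld code) illustrates the kind of spread at issue: hole values and whether each hole is filled in time are both random, so runs that differ only in random seed produce widely varying totals.

```python
import random
import statistics

def toy_run(seed, n_events=50):
    # toy abstraction of one run: hole values and whether the agent fills
    # each hole in time are both random, as in the real simulator
    rng = random.Random(seed)
    score = 0
    for _ in range(n_events):
        value = rng.randint(1, 10)   # random hole score
        if rng.random() < 0.5:       # random chance the hole is filled in time
            score += value
    return score

# thirty runs with identical "settings" (only the seed differs)
scores = [toy_run(seed) for seed in range(30)]
spread = max(scores) - min(scores)
mean = statistics.mean(scores)
```

Even with the agent's behavior held perfectly fixed (here it has no behavior at all), the run-to-run spread is large relative to the mean, which is the noise a single-score performance measure cannot separate out.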
    Although a great deal of planning research has focused on subgoal
interaction, Tileworld seems to sidestep this issue.  The only forms of
subgoal interaction exhibited by Tileworld are resource availability
constraints, namely the limited number of moves and the limited number
of tiles.  
    Because events in Tileworld (the appearance and disappearance of tiles
and holes) are totally random, albeit at rates controlled by knobs, the
world lacks meaningful regularities.  Controllable event rates do not make
prediction and projection viable, and planning can only consider existing
aspects of the world.  (This point is related to the question of what we
want to test in an agent.)  
    Finally, the notions of world and task are closely tied together.
There are three basic entities in a planning environment: a world, an 
agent, and a task.  (research this point)
    The lack of meaningful documentation, in addition to the limitations
discussed above, may render two different agents built from scratch
incomparable, because cost and utility are not clearly defined.  
    The general question that should be asked, then, is: what are the
important aspects of an agent architecture, and how should they be tested
and compared?  To develop a testbed for evaluating agent architectures, the
first step should be to figure out the important aspects of an agent
architecture; the next step would be to design a testbed to evaluate those
aspects.

V.B) Problems that prohibit comparisons within an architecture family.
* "The appropriateness of a particular meta-level reasoning strategy will
depend in large part upon the characteristics of the environment in which
the agent incorporating that strategy is situated."

When 
    meta-level reasoning strategy := Tileworld agent setting;
    environment characteristics := world setting;
No experiments were conducted to prove this point.  A reasonable experiment
would be to test several agent reasoning strategies with several world
settings and figure out which reasoning strategy is appropriate with each
class of world settings.  
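A sketch of the cross-product experiment proposed above (Python; the knob names come from the paper, but the factor levels, the run() stub, and the repetition count are all placeholders):

```python
import itertools

# hypothetical factor levels for two of the paper's world knobs
strategies = ["cautious", "bold", "filtered"]
world_settings = [{"dynamism": d, "hostility": h}
                  for d, h in itertools.product(["low", "high"], repeat=2)]

def run(strategy, setting):
    return 0.0   # placeholder: one run would return the agent's score

REPS = 10   # repetitions, to average out the world's randomness
results = {(s, tuple(sorted(w.items()))): [run(s, w) for _ in range(REPS)]
           for s, w in itertools.product(strategies, world_settings)}
```

The point is the shape of the design: every reasoning strategy crossed with every world setting, with enough repetitions per cell that the world's randomness averages out, which is what the paper's experiments never attempt.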
    Still, even if those experiments were conducted, the limited notion of
payoff and the noise introduced by randomness cloud the fine differences in
performance.  The problem with having a single measure of performance
(score) is that we cannot assess the performance of the various `intelligence
components' of an agent, since score will tend to aggregate the performance
of all the components.

V.C) How Tileworld compares to other geometric real-time simulators.

* "Tileworld exhibits spatial complexity; and it includes tasks of varying
degrees in importance and difficulty.  It is generic, it has a wide
distribution of task values (hole scores) and task difficulty (hole size),
which differentiates it from Phoenix and MICE."

7.  it would be good to compare it to the spatial complexity 
of a REAL geometric simulator.  there's a good one from CMU, 
whose name escapes me, but i can find a reference.  did you find
out anything about phoenix or mice?
(find phoenix reference)

V.D) Proposal to add new knobs.

* "Changing other parameters will be very useful, like size of the space,
distribution of the task value and difficulty, and availability of tiles."

Before introducing new knobs (parameters), a good understanding of the
utility and effect of the existing knobs is required.  The existing knobs
should be experimented with, and the degree to which they affect difficulty
and reasoning should be explained.  In the paper, only the agent's act/think
rate and the sophistication of the deliberation mechanism have adequate
explanations of their consequences.  Results that might be learned from
such experiments include, for example: which has a stronger effect on
performance, dynamism or hostility?  And similarly for variability in
utility versus variability in difficulty.  

14.  good.  i think it's important to say what we think 
a GOOD experimental regime would have been.  it would be 
good to read and consider the drummond and langley paper
in this regard.

V.H) What the characteristics of the simulated world should have been.

16. gotta agree.  this paper has to have some constructive  
work in it too---one thing to do is to think about what 
characteristics a simulated world SHOULD have.  that 
should lead us to some very fundamental thinking about the 
necessary tradeoff between a simple, uniform, generic testbed
and (1) the ability to express a meaningful agent architecture
in it, and (2) the ability to do meaningful experiments. 
our conclusion may well be that it's going to be a tough job
to balance the two, but one shouldn't sacrifice the hard issues
just in order to get nice graphs in the paper.

-----------------------------------------------------------------
Agent Architecture:

VI.A) Summary of IRMA architecture from [Bratman et al].

* "The process of deliberation is different from means-ends reasoning."
###############
There was no evidence from the experiments to support that claim.  Even if
we assume the claim is true, in Tileworld a good answer to the
question of which goal to pursue next (deliberation) depends crucially on
the difficulty of achieving each goal (means-ends reasoning).  This aspect
is especially important in Tileworld, where goals are independent and there
is no ordering or interaction between goals (other than in the weak sense
of tile availability). 

8.  does this claim in fact come from bratman et al?  is there
any support at all for the claim?  we should note that it 
certainly is crucial to ringuette's agent's performance:  the 
latter being written in C and not open to experimentation.

important point about no real goal/subgoal interaction---that's
obviously the point of all classical planning efforts.   this 
should be grouped in the complaint about no real sense of 
value.  i want to hit hard the fact that the WORLD decides scores
etc.  value should be an attribute of the agent, not the world.

VI.B) How good of an implementation of IRMA is the agent.

* "The Tileworld agent is an implementation of IRMA (Bratman 88) that
captures its essence."
###############
It seems that the agent is a simple-minded implementation of the
architecture described in the paper, especially the filtering mechanism and
the means-ends reasoner's interface to the filter.  From reading the
Bratman et al. paper, one finds that the whole idea of an Intelligent
Resource-bounded Machine Architecture depends on the filtering mechanism.
The filtering mechanism is where "resource saving" is achieved, but since
the Tileworld agent had a fast, approximate deliberator and a nearly
useless filter, the implementation did not capture the essence of the
architecture.  In any case, the idea of an IRMA agent only makes sense when
resources are actually bounded (the deliberator is expensive).  It would
have been nice to see a comparison between an IRMA agent and an agent that
does not care about resources, with scores compared against total
reasoning time (this may be a problem, since Tileworld has no sense of
elapsed reasoning time).  One question that may be asked is: what is the
connection between Tileworld and the IRMA architecture?  It seems that they
are totally separate ideas, and they may even be incompatible, because
Tileworld lacks a sense of reasoning time.  It was a bad choice to advocate
two separate ideas in one paper without proving either thoroughly.

9.  it would be nice to summarize the important aspects of the 
irma agent---what are the real, important ideas---and to what 
extent does ringuette's architecture capture them.  in fact, what 
would it mean for an implementation to capture the ideas in 
such an architecture at all?

* "Tileworld has been shown to be a viable system for evaluating agent
architectures.  The agent was demonstrated and used to test differing
deliberation and filtering strategies as described in (Bratman 88)."

Only one agent sub-architecture was tested.  Within that architecture, only
two simple deliberation strategies were tested.  Filtering was treated very
simplistically. 

VI.C) Think time vs. act time, and how time is implemented.

* "Speed-accuracy trade-off."

The notion of time is alien to the world and is completely defined by the
agent.  So it is very hard to talk about speed when there is no precise
definition of reasoning time and its effect on total performance.
Consequently, it is hard to compare the "speeds" of alternative agent
proposals.  

5.  good.  this deserves emphasis.  at some point we should
say that what he's REALLY trying to get at is a tradeoff
between deliberation and action (at least that's what 
everybody else is getting at) and the way the world is 
set up makes it difficult to talk about time.

---------------------------------------------------------------------
Experiments:

* "Can experimentally investigate the behavior of various meta-level
reasoning strategies by tuning the parameters of the agent, and can assess
the success of alternative strategies in different environments by tuning
the environmental parameters."
* "The parameters of the world are dynamism (the rate at which new holes
appear), hostility (the rate at which obstacles appear), variability of
utility (differences in hole scores), variability of difficulty
(differences in hole sizes and distances from tiles), and hard/soft bounds
(holes having either hard timeouts or gradually decaying in value).  Agent
parameters are: act/think rate (the relative speeds of acting and
thinking), the filter's threshold level, and sophistication of the
deliberation mechanism."

Of all those parameters, and given all the claims in the paper, only two
agent parameters were varied in the first experiment (speed and
deliberation mechanism) and one parameter in the second experiment
(threshold).  Other than matching world speed with agent speed (which
corresponds to dynamism and possibly hostility), the world parameters were
not varied at all, which makes one question the utility of such parameters,
especially given the claims about how different world settings suit some
agent settings better than others.     

10.  at this point it will be important to be very precise 
about what the Tileworld actually does.  the point we want 
to make is that ringuette is using some very suggestive words
and terms, and the reality of the world doesn't match them. 
this is best done by taking one and explaining exactly how 
it's implemented in the Tileworld.

* "Experiment 2, usefulness of the filtering mechanism:  filtering does
not help for the LV deliberator because it does not cost much, it may be good 
for better deliberator."  

Agreed: since the agent implementation is simplistic, the filter serves no
useful purpose.  But in general, how good is the idea of having different
levels of planning, each with different capabilities?  And what experiments
are necessary to prove that point?  In any case, such an experiment would
not be conclusive if reasoning time were not taken into account.  The
function of the filtering mechanism is to conserve reasoning time while
not performing much worse than the unfiltered case.  It is suspected that
the inconclusiveness of experiment 2 is partially due to noise introduced
by the randomness in the world.  

* "Experiment 3, application of  bias towards existing intentions: bias
does not have an effect on performance. Reasons may be: domain is too
small, try larger worlds, or too many opportunities (holes and tiles)." 

It seems that the deliberators to which bias is applied do not account for
path-planning time, so applying a percentage of the value returned by the
deliberator does not make the estimate of path-planning time any better.
Still, this problem is strongly connected to Tileworld's inability to
account for time in general and reasoning time in particular. 
-----------------------------------------------------------------------
Conclusions:

15. "Intend to add limited perception, learning, and intention
coordination(?).  Expect that, as deliberation and planning get more
expensive, filtering will be more important and the intention structure
will involve more complex interactions among intentions."

Filtering will not make sense until then.  

17. The overall goal of the project is an improved understanding of the
relation between agent design and environmental factors.  

Still need to experiment with environmental factors.

-------------------------------------------------
Agenda:
1- summarize Bratman's architecture.
2- find a paper on Phoenix (AAAI90 jscott).
3- give a detailed description of Tileworld.
4- talk about how the agent simulated think time, parallel acting/thinking.
5- what agent architectures could be expressed in Tileworld?
