I) Introduction
 
Tileworld, as proposed in [Pollack & Ringuette 90], is a testbed for
experimentally evaluating agent architectures.  Tileworld consists of a
simulated robot agent and a simulated environment which is both dynamic and
unpredictable.  The paper makes many claims about the appropriateness of
Tileworld as a testbed for planners, and reports experiments intended to
support those claims.  Our work consisted of bringing up the Tileworld
code, testing some of the claims in the paper, and studying and
identifying the important issues involved in building an intelligent
agent.  This paper includes a brief description of the Tileworld
environment and agent, a discussion of the implementation of Tileworld, a
discussion of the experimental methodology followed in the original paper,
some thoughts on the features necessary in a good testbed for evaluating
agent architectures, and an assessment of the extent to which Tileworld
exhibits those features.

II) Description of the implemented world

II.a) The world as presented in the paper
[figure 1 a diagram of Tileworld]
Tileworld consists of a simulated robot agent and a simulated environment
which is both dynamic and unpredictable.  The Tileworld is a
chessboard-like  grid on which there are agents, tiles, obstacles and
holes.  An agent is a unit square which is able to move up, down, left, or
right, one cell at a time, and can, in so doing, move tiles.  A tile is a
unit square which "slides": rows of tiles can be pushed by the agent.  An
obstacle is an immovable group of grid cells.  A hole is a
group of grid cells, each of which can be "filled in" by a tile when the
tile is moved on top of the hole cell; the tile and particular hole cell
disappear, leaving a blank cell.  When all the cells in a hole are filled
in, the agent gets points for filling the hole.  The agent knows ahead of
time how valuable the hole is; its overall goal is to get as many points
as possible by filling in holes.  A Tileworld simulation takes place
dynamically: it begins in a state which is randomly generated by the
simulator according to a set of parameters, and changes continually over
time.  Objects (holes, tiles, and obstacles) appear and disappear at rates
determined by parameters set by the experimenter, while at the same time
the agent moves around and pushes tiles into holes.  Both the environment
and the agent are parameterizable by "knobs".  The knob settings control
the evolution of a Tileworld simulation.  Some knobs control the frequency
of appearance and disappearance of each object type.  Other knobs control
the number and average size of each object type.  Still other knobs
control factors such as the shape of the distribution of scores associated
with holes, or the choice between the instantaneous disappearance of a
hole and a slow decrease in its value.

II.b) The implementation of the Tileworld environment (world)

An agent's view of the world, as implemented, is a Lisp structure with a
set of associated routines.  The structure contains lists of all objects
currently in the world (tiles, agents, holes, obstacles) and information
relevant to those objects, such as the knob settings for each object type
and, for individual objects, their location, size (for obstacles and
holes), score (for holes), and timeout.
    Different knobs control different aspects of the world: one knob
controls the appearance and disappearance rates of an object type; other
knobs, which control the number of objects, the number of cells per
object, the score of an object, and the timeout, have subknobs that set
the initial configuration and limit the possible values those aspects of
the world can take.  For example, by setting the subknobs of the
number-of-holes knob to x, y, and z, the initial configuration of the
world would have x holes, and throughout the game the number of holes
would never fall below y or exceed z.
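To make the subknob scheme concrete, here is a hypothetical Python sketch;
the actual knobs live in a Lisp structure, and all names here are
illustrative assumptions, not taken from the Tileworld code:

```python
# Hypothetical sketch of one knob with its subknobs.  The field names
# ("initial", "minimum", "maximum") are assumptions for illustration.
num_holes_knob = {
    "initial": 10,   # x: holes in the randomly generated initial state
    "minimum": 5,    # y: the world adds holes if the count drops below this
    "maximum": 20,   # z: the world removes holes if the count exceeds this
}

def clamp_hole_count(current):
    """Return how many holes to add (positive) or remove (negative)
    to keep the world within the knob's limits."""
    if current < num_holes_knob["minimum"]:
        return num_holes_knob["minimum"] - current
    if current > num_holes_knob["maximum"]:
        return num_holes_knob["maximum"] - current
    return 0
```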
    In addition to knob settings, the world structure includes
the world time slice (the duration between clock ticks) and several arrays
that have the same dimensions as the world and represent the grid shaped
world.  Initialization of the world includes setting the world time slice
(in milliseconds), dimensions (number of rows and columns of the world
grid), and the knob settings for each object type.  The initial state of
the world is randomly generated according to the knob settings.  The
routines of the world, relevant to an agent, are the simulation routine,
which is called to simulate an action, and utility routines, one of which
allows an agent to copy a world structure.    

To query information about the world, an agent has full access to the world
structure and it is assumed that it knows the organization of the world
structure.  To simulate an action (a move) taken by the agent, the agent
calls a world simulation subroutine with the current world structure, the
agent structure, the current move action decided by the agent, and the
amount of time this action is supposed to take, 'delta-t', as parameters. 
    The subroutine updates the world structure after incrementing elapsed
time by 'delta-t' and then simulates the agent's move in the world.
The world structure is updated by decreasing the timeouts of all objects in
the world by the number of clock ticks encompassed by delta-t; objects
that expire (whose timeouts reach zero) are removed from the world.  At
the same time, the world keeps the number of objects within the knobs'
limits by removing or adding randomly selected objects as necessary.  The
agent's move is then simulated by changing the locations of the agent and
of all pushable objects (tiles) on its move path, according to the
direction of the move.  If a tile is pushed into a hole cell, both the
tile and the hole cell disappear.  At the end of the subroutine, failure
is reported if the agent, because of its move, fell into a hole; success
is reported otherwise.
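The update step just described can be sketched as follows.  This is an
illustrative Python rendering, not the actual Lisp routine; the structure
field names are assumptions, and the random add/remove step is elided:

```python
# Illustrative sketch of the world-update step: age all objects, drop the
# expired ones, move the agent, and report failure if it lands in a hole.
def simulate(world, agent, move, delta_t):
    ticks = delta_t // world["time_slice"]
    # Decrease every object's timeout; expired objects leave the world.
    for kind in ("tiles", "holes", "obstacles"):
        for obj in world[kind]:
            obj["timeout"] -= ticks
        world[kind] = [o for o in world[kind] if o["timeout"] > 0]
    # (The real routine would also add/remove random objects here to stay
    # within the knobs' limits, and push any tiles on the move path.)
    agent["pos"] = (agent["pos"][0] + move[0], agent["pos"][1] + move[1])
    # Success unless the move left the agent inside a hole cell.
    return not any(agent["pos"] in h["cells"] for h in world["holes"])
```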

III) Description of the Tileworld agent

III.a) What is an IRMA?

(should I summarize Bratman paper here? )

III.b) The agent architecture as presented in the paper

[figure 2 agent architecture diagram][copied from the paper more or less]
    The agent implemented is an instantiation of IRMA, the Intelligent
Resource-Bounded Machine Architecture [Bratman et al., 1988].  The implementation
models a robot with two processors.  One processor executes a reasoning
cycle and the other executes an act cycle.  The act cycle executes the acts
(moves) formulated during the previous reasoning cycle and monitors for
limited kinds of failures.  Perception occurs during an act cycle,
when the agent accesses a global map of the world that indicates the
locations of all objects, as well as the score and time remaining to
timeout for all objects.  
    The reasoning cycle decides which goals to pursue next and how to achieve
them.  Its job is to maintain the intention structure, a time-ordered set
of tree-structured plans that represent the agent's current intentions.
During a reasoning cycle, an agent does either filtering and deliberation
or means-ends reasoning.  During filtering and deliberation, potential
additions to the intention structure, called options, are considered.
Options may be suggested either by perception of an environmental change,
as when the agent detects the appearance of a new hole or tile, or by the
means-ends reasoner, which can produce new options that serve as means to
current intentions.  The bulk of the means-ends reasoner is a
special-purpose route planner.
    The filtering and deliberation processes are confined to top-level
options, i.e., options to fill a particular hole.  The filter has two
parts: the compatibility filter and the filter override.  An option passes
the filter if it is either compatible with all existing intentions, or if
it triggers an override.  Since a top-level option is to fill a hole now or
later, compatibility, as implemented, is straightforward: if the agent has
a current intention to fill a hole NOW, no other option to fill another
hole NOW is compatible.  All intentions to fill a hole LATER are compatible
with each other.  The override mechanism compares the score of a hole
being considered as an option to that of the hole currently being filled.
If the difference between them equals or exceeds some threshold value V,
the new hole passes the filter.  The threshold value is set by a Tileworld
parameter.
    Options that pass the filter are considered by the deliberator.
Currently, there are two implemented deliberation strategies.  The simpler
deliberation module evaluates competing top-level options by selecting the
one with the higher score.  The more sophisticated deliberation strategy
computes the likely value (LV) of a hole.  LV is an estimate of expected
utility, combining information about score with information about
likelihood of success.  FORMULA. 
    
III.c) The implementation of the Tileworld agent

The agent's main loop consists of a think cycle followed by a number of act
cycles.  Throughout a game, the think and act cycles operate on the agent
data structure, the agent's mental state, which includes fields for
act-time (the length of time an atomic action takes, in milliseconds),
wait-time (the length of time to act when no work was done during
reasoning), deliberation strategy, threshold, options (holes not yet
considered by the deliberator), and intentions (holes considered by the
deliberator, possibly including one hole designated to be filled now).
    To model a robot with parallel thinking and acting, the agent figures
out the amount of time used in thinking, think-time, and acts for the same
amount of time.  Both the think and act cycles begin from copies of the
same agent structure and world structure.  The last step in the agent's
main loop is to merge the two mental states resulting from the think cycle
and the act cycles.  The agent's knobs include: speed (more precisely, the
think-time/act-time ratio, which is modified by setting act-time),
deliberation strategy, and threshold.

Think Cycle:
    Recall that, during a reasoning cycle, the agent either does filtering
and deliberation or means-ends reasoning.  Filtering decides whether or not
a new option (hole) is admitted to the intention structure and deliberation
decides which of the holes in the intention structure to pursue next.  On
the other hand, means-ends reasoning, or path planning, is concerned with
devising a detailed path plan to fill a hole.  A complete path plan is a
series of moves intended to take the agent from its present location to a
tile and then push that tile into a hole cell; filling a hole requires
such a plan for each of its cells.
    At the beginning of a think cycle, if the agent detects a new option
(the appearance of a new hole) or if it does not have a current intention
(e.g. it has just finished filling a hole), the think cycle is spent on
filtering and deliberation.  Otherwise, if there is a current intention
(a hole selected to be filled now) that does not have a complete path plan,
the think cycle is spent on path planning; during a single think cycle, a
path plan for filling only one cell of a hole is devised.  If none of the
above conditions apply, the think cycle does nothing and think-time is set
to wait-time to allow the agent to act.  Otherwise, the internal processor
time is noted before and after thinking, and the difference between those
times is labeled think-time.
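The think-cycle dispatch just described might be sketched as follows.  This
is a hedged Python illustration: the field names and the use of process
time are assumptions, and the real cycle performs the work rather than
merely labeling the chosen branch:

```python
import time

# Sketch of the think-cycle dispatch.  Returns a label for the branch
# taken, or None when there is nothing to think about.
def think_cycle(agent, new_option_detected):
    start = time.process_time()
    if new_option_detected or agent["current_intention"] is None:
        did_work = "filter-and-deliberate"
    elif agent["path_plan"] is None:
        did_work = "plan-one-hole-cell"   # only one cell per think cycle
    else:
        # Nothing to do: charge wait-time so the agent still gets to act.
        agent["think_time"] = agent["wait_time"]
        return None
    # Think-time is measured as elapsed processor time.
    agent["think_time"] = time.process_time() - start
    return did_work
```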
    Filtering, as implemented, is very simple.  If there is no current
intention, all new options pass the filter.  Otherwise, only options whose
score is at least equal to the score of the current intention plus the
threshold pass the filter.  The simple deliberation strategy compares the
scores of competing options and selects the option with the highest score
to be filled next.  The more sophisticated deliberation strategy computes
the likely value of filling a hole, implementing the LV formula mentioned
above.  Computing LV for filling a hole is quite easy since the agent has
complete access to the world structure.
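The threshold filter and the simple deliberation strategy described above
can be sketched like this (illustrative Python only; the real code
operates on the Lisp agent structure, and the function names are ours):

```python
# Sketch of the threshold filter: with no current intention everything
# passes; otherwise an option must beat the current score by 'threshold'.
def passes_filter(option_score, current_intention_score, threshold):
    if current_intention_score is None:   # no current intention
        return True
    return option_score >= current_intention_score + threshold

# Sketch of the simple deliberation strategy: pick the competing
# option (hole) with the highest score.
def deliberate_simple(options):
    return max(options, key=lambda hole: hole["score"])
```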
    The path planner (means-ends reasoner) is the most complicated module of
the agent.  Its heart consists of C code that does breadth-first search to
find the shortest path the agent may take to move from its present
location to a tile and then to push that tile into one of the cells of the
hole selected by the deliberator.
    The path planner is passed the hole to be filled and a copy of the world
structure.  It starts by finding the tile closest to the agent: it
searches the squares nearest the agent for tiles first, then advances to
farther and farther squares in a circular-wave fashion.  Once a tile is
found, a piece of C code is called to attempt to generate a plan, called a
move plan, from the agent's location to a square adjacent to the tile.  If
no plan can be generated (e.g. obstacles block the way), the search
resumes to find the next closest tile, and the procedure repeats until
there is a path plan from the agent to a tile.  Before the next portion of
the plan, a plan to push the tile into a hole cell, is generated, a copy
of the world structure is updated by running the world simulation routine
on that copy with each atomic move in the generated move plan.
    Using the updated world structure, another piece of C code attempts to
generate a plan, called a push plan, for pushing the tile into an empty
hole cell.  As before, if no plan can be generated (e.g. the agent cannot
position itself behind the tile because the tile is adjacent to an
obstacle), the tile is discarded and the search resumes to find the next
closest tile; the procedure repeats until there is a complete path plan to
fill a hole cell.
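The shortest-path search at the core of move- and push-plan generation is,
as described, a breadth-first search.  A minimal Python sketch follows,
assuming a set of blocked cells and unit moves; the real planner is C code
with a richer world representation:

```python
from collections import deque

# Breadth-first search for a shortest sequence of unit moves from 'start'
# to 'goal' on a rows x cols grid, avoiding the cells in 'blocked'.
def move_plan(start, goal, blocked, rows, cols):
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        (r, c), path = frontier.popleft()
        if (r, c) == goal:
            return path            # list of (dr, dc) unit moves
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and nxt not in blocked and nxt not in seen):
                seen.add(nxt)
                frontier.append((nxt, path + [(dr, dc)]))
    return None                    # no plan: caller tries the next tile
```

Returning None mirrors the behavior described above: when no plan exists,
the planner falls back to the next closest tile.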
    If the path planner is unable to fill any hole cell (for lack of
tiles, for example), it reports failure.  In the case of success, the
world structure is updated again by running the world simulation routine
using the push plan.  The two updated world structures, one from running
the move plan and the other from running the push plan, are stored in the
agent structure to facilitate further path planning.  The generated path
plan is also stored in the agent structure, along with the hole (current
intention) it is intended to fill.

Act cycles:
      After the think cycle is over, the number of act cycles to execute is
calculated by dividing think-time by the agent's act-time (cycle length).
By adjusting the agent's act-time, one can vary the think-time to act-time
ratio: since think-time depends only on the time taken by the think cycle,
increasing act-time reduces the average number of act cycles per think
cycle, and vice versa.
    Within a main loop iteration, the first act cycle starts with the same
starting world the think cycle started with.  Furthermore, the mental state
generated by the last think cycle will not be used before the next
iteration, so a newly generated path plan, for instance, will not be
available to the actor until the next iteration through the loop.  
    During each act cycle, atomic acts are taken from the current intention
of the agent structure and are simulated in the world by calling the world
simulation routine.  If the world simulation reports failure (the agent
fell into a newly appearing hole for example), the complete path plan to
fill a hole is killed.  
    Perceiving, limited to the detection of newly appearing or disappearing
holes, is done at the end of an act cycle.  Generally, the results of
perceiving affect the agent structure (mental state) processed at the next
think cycle; the immediately following act cycles are not affected.  If
the number of act cycles to be executed is greater than the number of
atomic acts, 'NOP's are simulated: the passage of an amount of time equal
to act-time while the agent remains stationary.
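The act-cycle accounting described above (think-time divided by act-time,
with NOP padding when atomic acts run out) can be sketched as follows, with
hypothetical names:

```python
# Sketch of the act phase: the number of act cycles is think-time divided
# by act-time, and missing atomic acts are padded out with NOPs.
def act_phase(think_time, act_time, atomic_acts):
    n_cycles = int(think_time // act_time)
    executed = []
    for i in range(n_cycles):
        if i < len(atomic_acts):
            executed.append(atomic_acts[i])   # simulate a real move
        else:
            executed.append("NOP")            # stay put for one act-time
    return executed
```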
    Finally, since the agent is simulating concurrent thinking and acting,
the last step in its main loop is to merge the mental states (agent
structures) resulting from thinking and acting.  Thinking removes options
and adds intentions; acting removes intentions, which may or may not have
succeeded; and perceiving, which occurs during the act cycle, removes
intentions and options and also adds options.

IV) Description of the Tileworld experiments
[later]
Experiment 1 successfully demonstrates the following points: acting in
parallel with reasoning is advantageous, the LV deliberation strategy
outperforms the simple strategy, and a more rapidly changing environment
(given a more or less constant think time) results in performance
degradation.   
    Experiment 2 attempts to clarify some of the design trade-offs in the
agent: it tests the usefulness of the implemented filtering mechanism,
using the LV deliberation strategy.  During the experiment, threshold is
varied from -100 to 100 and three agent speeds are tested.  The results
showed "that filtering is harmful at slow speeds, and even at high speeds
does not give a net benefit".  The reasons for this unexpected behavior, as
explained in the paper, stem from noise in the data, due in large part to
the decision to use actual CPU time as the reasoning-time metric;
alternatively, there may be faster speeds at which filtering is useful, or
filtering may be more useful when the deliberation strategy is more
accurate and costly.
    Experiment 3 attempts to test a suspected deficiency of the LV
evaluator: it does not take into account the time cost of means-ends
reasoning already performed.  After biasing the deliberator in favor of 
existing intentions, however, there did not appear to be a clear effect on
total performance.  Two hypotheses were conjectured to explain this
behavior: (1) the test environment had so many opportunities that a few
missed ones had no net effect on performance, or (2) means-ends reasoning
was too inexpensive in the existing environment.  The proposed corrections
were to reduce the number of opportunities in the environment (e.g.
provide fewer tiles), and to increase the size of the environment or add
more complex planning routines to increase means-ends reasoning effort.

V) Planning benchmarks

For a planning benchmark to be indicative of a planner's performance in the
real world, it must present the planner with the important issues an
intelligent agent might have to deal with in a realistic world.  To
facilitate comparisons among agent architectures, and for the tests to
give useful results, planning benchmarks must also have well-defined
interfaces and indicative performance measures.

V.a) Some of the important issues an intelligent agent might have to deal
with in a realistic world: 

    The effect of elapsed planning time in a real-time world: opportunities
may be lost, the world may change in a manner that renders the plan being
formed worthless, or, in `anytime planning' algorithms, the quality of the
solution is directly affected by planning time.  Equally important, the
utility of a goal may vary over time, or a deadline might be imposed for
satisfying it.

    Incomplete knowledge of the world: the world may be complex and
dynamic, acquiring information about the world may not be easy (sensing),
or the exact effects of various events (including the agent's own actions)
on the state of the world may not be known.  

    Richness of the environment (the sensing problem): which aspects of the
world an agent may be interested in, how an agent might learn about them,
the difficulty and cost of doing so, and the reliability of such
information.

    Complexity of effectors (multiple freedom movement): the number of
possible actions an agent may take in a world.  Usually, the greater the
number of possible actions, the more difficult decision making becomes
(increase in branching factor).  

    Execution uncertainty: the exact effect of an agent's action may not be
known, the world may change in a way that makes an action fail, or an
urgent situation may arise that requires immediate attention.  All these
cases call for execution monitoring on the agent's part.

    The representation problem: in a complex and dynamic world, an agent may
need to maintain a world model that captures the agent's beliefs, goals,
needs, and desires concerning that world.  Such a world model can also be
used to draw inferences about the world [Hanks and Firby 90].  If the
problem involves extensive geometry or reasoning about activities over
time, specialized representations may be needed.

    The control problem (acting vs. deliberation): should decisions be made
as far ahead of time as possible, or should they be deferred as long as
possible, acting at the last possible moment?

    Prediction and projection: can an agent look into the future to predict
future world states?  Although a world may be dynamic and not completely
predictable, if it possesses a minimal set of meaningful regularities, an
agent may be able to make sense of what may happen in the future.

    Goal interaction: can the agent get away with planning each
goal/subgoal independently? If so, this simplifies the planning task
drastically.

    Resource management: the problem may take on the character of a classical
optimization problem if the agent must deal with metric time and continuous
quantities.

     A priori knowledge requirement: some domains require an agent to
acquire a vast amount of knowledge a priori, while in other domains an
agent can get away with minimal knowledge and refine it iteratively.

    Concurrency: If a task to accomplish requires a group of interacting
agents, then reasoning about the states of other agents may be required and
communication protocols must be established.

    Variability of utility and difficulty of task in the domain: (just to
give credit to the only aspect mentioned by Ringuette).

V.b) Features to look for in a testbed world

    Interface, sensing, and effecting: a benchmark in which to compare
agent architectures must have a clean, well-defined interface.
Classically, an agent is interfaced to a world through sensing and
effecting: knowledge about the world is acquired through sensing, and the
agent's decisions are conveyed through effecting.  A well-defined
interface should allow for maximum separation between agent and world.  It
should abstract away the world's implementation details and should not be
specific to any agent architecture.

    Simplicity vs. usefulness: in the design of a testbed world, there is a
necessary tradeoff between a simple, uniform, and generic testbed and (1)
the ability to express a meaningful agent architecture in it, and (2) the
ability to do meaningful experiments.  For a generic testbed to be useful,
it should try to balance those factors, even though this may not be an
easy job.

    Measure of behavior: "To evaluate any planning system, one needs some
measure of its behavior.  In most experiments, these are the dependent
variables that one would like to predict.  There are two obvious classes of
metrics for planning algorithms - the quality of the generated plans and
the effort required to generate them.  There exist many variations on the
notion of plan quality.  In a classical planning framework, one might
simply measure the length of the solution path or the total number of
actions.  More sophisticated dependent variables involve the time taken to
execute a plan, the energy required, or the use of other resources.
Alternatively, one can examine the robustness of a plan, as would be
characterized by its ability to respond well under changing or uncertain
conditions." [Langley and Drummond]

    Performance indicators: when using a testbed to evaluate a planner, the
outcome should give a hint as to the reason (part, module, or strategy)
the planner exhibited a specific behavior (e.g. failure to achieve a goal,
or extreme slowness) (the credit assignment problem?).  This makes the
performance indicators of a testbed more useful than a mere score, because
they point out what may need to be improved in a planner, or which
component performed exceptionally well.

    Controllable parameters: for the benefit of experimental design, it
would be very useful to control parameters that correspond to the
important modules of an agent.  Some proposed parameters are: maximum
planning time, ratio of deliberation to execution, amount of goal
interaction, dynamism, reliability of effectors, reliability of sensors,
number of available actions, and complexity of the environment (geometry).
METRICS

    Single testbed vs. set of benchmarks: ultimately, given the vast
differences between agent architectures and the large number of aspects
that come into play in the design of an intelligent agent, one might ask
whether it makes sense to have a universal planning testbed.  A better
approach, proposed in [Drummond], may be a group of benchmarks that
together test the wide set of possible attributes.

    
    Benchmark domains: benchmarks may be specified in different ways and run
in different domains.  Possible specification methods include: natural
language descriptions, which are easy to generate but hard to formalize;
simulations, which are precise but may be oversimplified; formal
specifications, which are hard to come up with for complex tasks but
provide insight into the underlying problem complexity; or an actual
physical environment, which may be hard to control but provides valuable
practical insight.  An advantage of artificial domains is that it is
possible to establish upper bounds on performance, and performance in the
domain can be compared to optimal.  (continue from Langley and
Drummond)......


V.c) The extent to which Tileworld exhibits these features

One of the biggest problems with Tileworld, as it stands now, is that it is
not clear what the world's interface to an agent is.  The interface is not
made clear in the paper, nor even in the code; if Tileworld is to be a
universally accepted testbed for planning systems, the question of what
exactly the planner's interface with the world is becomes of paramount
importance.
    Another, closely related problem is that the agent does not have a model
of the world.  In Tileworld, the world is a subroutine the agent calls.
In fact, the world is a data structure that the agent passes to the
execution-time code; the path planner can literally call the world to do
its simulation!  (This is an abomination!)  Thus the Tileworld agent has
complete knowledge of the current objects in the world (by getting a
structure from the world which is in fact the world itself), and the world
structure becomes the agent's `world model'.  This is unacceptable for two
reasons: first, it implies that the agent can maintain a full model of the
complete world, and that it can get the information necessary to do so for
free, immediately, and automatically.  This means the agent can build its
world model without having to do any sensing, a task to which a great deal
of research in planning and projection has been devoted.
    Regarding the lack of a perception interface, one might argue that it is
an implementation detail: the agent might well get this structure and
limit itself to looking at certain aspects of the world, introduce its own
noise into them, or genuine perception could be added.  However,
perception, if added, (1) cannot be part of the subroutine/data-structure
architecture that exists now, and (2) cannot be a simple add-on.  If the
world is going to decide not to answer queries, or to answer them
incorrectly, then it must have a theory of how to do so, and such a theory
is not trivial by any stretch of the imagination.  The problem is as hard
as the sensing problem itself.
    If sensing, incomplete information, and building the world model
generally are central problems that an agent architecture must confront in
a realistic world (and we believe they are), then it does not make sense
to provide a testbed world in which these problems do not arise and then
expect the agent itself to generate those problems.
    [The notion of time is alien to the world, and is completely specified by
the agent.]  The agent is responsible for updating time; the `world clock'
ticks only when the agent tells it to.  This makes comparing the `speeds'
of alternative agent architectures at best ambiguous.  At one extreme is
an agent that estimates the number of act cycles equivalent to its elapsed
reasoning time and executes that many 'nop' acts, so that in effect it
remains idle for a period equivalent to its reasoning time; at the other
is an agent that does not account for reasoning time at all and counts
only actual acting time.  Time is a property of the world, not of the
agent.  When the agent has to tell the world how much time has elapsed, we
must conclude one of two things: either that responsibility is trivial, in
which case the world might as well determine for itself how much time has
elapsed, or it is not, in which case the agent is both *deciding* how much
time has elapsed and then telling the world.  The deciding is something
the agent should not do: at best it requires a competence that is very
strange to expect of an agent; at worst it invites abuse of the
experiments.
    This problem seems to arise because the world is not running in real
time.  Two ways to account for reasoning time or effort are (1) actual CPU
time, which is the simplest measure but depends on the machine and the
implementation, or (2) internal measures such as the number of nodes
considered in a search tree, the number of unifications required, or the
number of subgoals expanded.  However, such internal measures are not very
interesting for agents that interact with an external world; in such
cases, measures of the overall external time for reasoning and execution
are relevant, despite possible differences in hardware (Langley and
Drummond).
    The notion of payoff is totally specified by the world, not the agent.
Furthermore, it coincides exactly with the performance measure.  Since
payoff is completely defined by score and given the problem of accounting
for reasoning time, the only results that seem meaningful to compare are
total scores attained given an equal number of moves (assuming all moves
take the same duration).  Even then, total score is not a measure of how
well the agent alone did; it is greatly influenced by the randomness of
the world.  The total randomness of the world, together with the use of
score as a performance measure, introduces so much noise that two runs
with identical settings may have significantly different results (this
point should be proved with an example).
    The inappropriate breakdown of functionality between the agent and the
world has two downsides: (1) it is bad from a design standpoint, since the
agent designer must design and implement with the details of the world
implementation in mind; (2) it is bad from an experimental standpoint,
since, to the extent that agent and world are combined, one is never sure
whether results are due to the agent, the world, or the interface.  This
reveals bad design on the part of the world itself, which bears on the
question of whether the world is an appropriate testbed for evaluating
various architectures and algorithms.
    The world allows only a limited type and number of actions.  The only
type of action possible is movement around the grid, and since the grid is
two-dimensional, there are four possible directions in which to move (a
move may also push a tile).  Furthermore, the outcome of every action is
known in advance, and in the event of failure (falling into a hole) the
agent is informed immediately.
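The entire action model can be captured in a few lines.  The following sketch uses invented names and grid conventions (it is not drawn from the Tileworld code) to show how an agent can predict every outcome before acting:

```python
# The four grid moves, as (dx, dy) offsets; these conventions are
# hypothetical, chosen only to illustrate the small action space.
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def predict(pos, move, obstacles):
    """Deterministically predict a move's outcome: the agent either
    advances one cell or stays put when blocked by an obstacle."""
    dx, dy = MOVES[move]
    target = (pos[0] + dx, pos[1] + dy)
    return pos if target in obstacles else target

print(predict((2, 2), "right", {(3, 2)}))  # blocked: stays at (2, 2)
print(predict((2, 2), "up", set()))        # advances to (2, 1)
```

Because `predict` is total and deterministic, there is no genuine uncertainty for an agent's projection machinery to cope with.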
     Although a great deal of planning research has focused on subgoal
interaction, Tileworld seems to sidestep this issue.  The only forms of
subgoal interaction exhibited by Tileworld are resource availability
constraints: the limited number of moves and the limited number of tiles.
    Because events in Tileworld (the appearance and disappearance of tiles
and holes) are totally random, even though their rates are controlled by
knobs, the world lacks meaningful regularities.  Controllable event rates
do not make prediction and projection viable, and planning can only
consider existing aspects of the world.  (This point is related to the
question of what we want to test in an agent.)
    The lack of meaningful documentation, in addition to the limitations
discussed above, may render two different agents built from scratch
incomparable, because cost and utility are not clearly defined.  
    In general, Tileworld is too easy a planning problem.  One can easily
design an agent specifically to have excellent performance in Tileworld,
yet such an agent may be useless for solving any "real planning problem".
[compare to phoenix and mice]

VI) Implementing and Validating an Agent Architecture

VI.A) On implementing IRMA in Tileworld

If one looks closely at the implemented Tileworld agent, one finds that key
components of IRMA appear to be missing or overly simplified.  Perception,
for instance, was not implemented, nor was the means-ends reasoner's
interface to the filter.  Even though perception may not be essential to
IRMA, it appears from reading the Bratman et al. paper that the whole idea
of an Intelligent Resource-bounded Machine Architecture depends on the
filtering mechanism.  The filtering mechanism is where "resource saving"
is achieved; but since the Tileworld agent had a fast approximate
deliberator and a very simplistic filter, trying to save resources
probably did little good.  The idea of an IRMA agent might make more sense
when resources are more tightly bounded (i.e., when the deliberator is
expensive).
    The IRMA architecture pushes to the extreme the distinction between
deliberation (selecting which goal to pursue next) and means-ends reasoning
(deciding how to achieve a goal).  This distinction is valid in an
environment like Tileworld, where goals (filling holes) are independent
and can be considered in isolation.  When planning for conjunctive goals
(the area on which the bulk of planning research has concentrated),
however, goals cannot be considered in isolation, and the separation
becomes less apparent.  To claim that two components are distinct, they
must either have different inputs and outputs, or their processing must be
independent so that, for example, they can operate in parallel.  [elaborate]

VI.B) What the experiments say

In terms of experimental methodology, it appears that the first experiment
was appropriately designed and conducted.  The hypotheses to be proved
were fundamental to the claims that the LV evaluator outperforms the
simple evaluator and that there is a speed-accuracy trade-off (more
concretely, that the opportunity cost of reasoning is greater in more
rapidly changing environments).  The results of the experiment did,
indeed, support the hypotheses.

Experiments 2 and 3 attempted to prove hypotheses about various
components of the implemented IRMA architecture.  Quite apart from the
fact that both experiments were admittedly inconclusive, the choice of
experiments appears inappropriate.  Even though testing the components of
the architecture is valuable, experimental support for more fundamental
claims about the world was badly needed.  For instance, one of the claims
about Tileworld is that it is "a system for experimentally evaluating
competing theoretical and architectural proposals"; testing at least one
more agent architecture was therefore crucial.  Before Tileworld can be
adopted as a testbed for planners, Tileworld itself must be tested and
validated.  One validation would be to implement and evaluate a different
agent architecture with known performance characteristics (slow,
ineffective, fast, ...), and then to see whether the performance
indicators from Tileworld agree with the known relative performance
characteristics of the competing architectural proposals.

    Moreover, there was a lack of experimental support for claims about the
utility of having a parameterizable world and agent, such as: "Can
experimentally investigate the behavior of various meta-level reasoning
strategies by tuning the parameters of the agent, and can assess the
success of alternative strategies in different environments by tuning the
environmental parameters.", and  "The parameters of the world are dynamism
(the rate at which new holes appear), hostility (the rate at which
obstacles appear), variability of utility (differences in hole scores),
variability of difficulty (differences in hole sizes and distances from
tiles), and hard/soft bounds (holes having either hard timeouts or
gradually decaying in value).  Agent parameters are: act/think rate (the
relative speeds of acting and thinking), the filter's threshold level, and
sophistication of the deliberation mechanism."   
Of all those parameters, and given all the claims in the paper, only two
agent parameters were varied in the first experiment (speed and
deliberation mechanism) and one in the second (threshold).  Other than
matching world speed with agent speed (which corresponds to dynamism and
possibly hostility), world parameters were not varied at all, which makes
one question the utility of such parameters, especially in view of the
claims that different world settings suit some agent settings better than
others.

In view of the claim that "Tileworld is not tightly coupled to any
particular application domain.  But instead allows an experimenter to study
key characteristics of whatever domain he or she is interested in, by
varying parameter settings.", it must be shown that the space encompassed
by varying parameter settings does indeed exhibit key characteristics of
domains interesting to agent architecture designers.  An analysis similar
to ours of the issues an agent must deal with in a realistic world was
called for.  Alternatively, experimental support could have been drawn by
showing how specific parameter settings --- simulating key characteristics
of a domain --- influence the behavior of an agent, and that the agent's
behavior is indeed what would be expected of an agent in the original
domain.

It is hard to disagree with the claim that "The appropriateness of a
particular meta-level reasoning strategy will depend in large part upon the
characteristics of the environment in which the agent incorporating that
strategy is situated."  However, taking
    meta-level reasoning strategy := Tileworld agent setting;
    environment characteristics := world setting;
no experiments were conducted to support this point.  A reasonable
experiment would be to test several agent reasoning strategies against
several world settings and determine which reasoning strategy is
appropriate for each class of world settings.
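Such an experiment can be sketched in a few lines.  The strategies, world classes, and scoring model below are invented placeholders, not the actual Tileworld parameters; they serve only to show the factorial shape of the proposed design:

```python
import random

# Invented mean payoffs: "bold" excels in slow worlds, "cautious" in fast
# ones.  Real values would come from Tileworld runs, not a lookup table.
MEAN_SCORE = {("cautious", "slow"): 5, ("cautious", "fast"): 4,
              ("bold", "slow"): 8, ("bold", "fast"): 2}

def run(strategy, world, trial):
    """Stand-in for one Tileworld run; returns the mean plus noise."""
    rng = random.Random(f"{strategy}/{world}/{trial}")
    return MEAN_SCORE[(strategy, world)] + rng.random()

strategies = ["cautious", "bold"]
worlds = ["slow", "fast"]
trials = 30

# Full factorial design: average each strategy over repeated runs in each
# world class, then record the best strategy per class.
best = {}
for world in worlds:
    means = {s: sum(run(s, world, t) for t in range(trials)) / trials
             for s in strategies}
    best[world] = max(means, key=means.get)
print(best)
```

Only a design of this shape can support the claim that particular reasoning strategies suit particular classes of environments.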
    Still, even if those experiments were conducted, the limited notion of
payoff and the noise introduced by randomness would cloud the fine
differences in performance.  The problem with having a single measure of
performance (score) is that we cannot assess the performance of the
various `intelligence components' of an agent, since score aggregates the
performance of all the components.
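One remedy is to record a separate metric per component alongside the aggregate score.  The following is a sketch only; the component names are our own, not part of the Tileworld instrumentation:

```python
from dataclasses import dataclass

@dataclass
class RunMetrics:
    """Per-run metrics, one per 'intelligence component', so that an
    experiment can attribute performance to a component instead of
    reading everything off a single aggregate score."""
    score: int = 0                  # the world's single payoff measure
    deliberation_time: float = 0.0  # time spent choosing which hole to fill
    filter_overrides: int = 0       # options the override mechanism let past the filter
    replans: int = 0                # means-ends reasoner invocations

    def report(self):
        return {"score": self.score,
                "deliberation_time": self.deliberation_time,
                "filter_overrides": self.filter_overrides,
                "replans": self.replans}

m = RunMetrics(score=42, deliberation_time=1.5, replans=3)
print(m.report())
```

With such a breakdown, two agents that attain the same score can still be distinguished by where they spent their effort.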

Given the limited experiments conducted, it is hard to conclude that
"Tileworld has been shown to be a viable system for evaluating agent
architectures.  The agent was demonstrated and used to test differing
deliberation and filtering strategies as described in (Bratman 88)."
Only one agent architecture was tested, and within that architecture only
two simple deliberation strategies were tested.

VII) Conclusions

As one can see, the Tileworld paper has been torn to pieces, the IRMA
architecture has been shown to be useless, and the issue of planning
benchmarks remains an open question.