One of the basic problems of machine learning is deciding how to act in
an uncertain world. For example, if I want my robot to bring me a cup
of coffee, it must be able to compute the correct sequence of
electrical impulses to send to its motors to navigate from the coffee
pot to my office. In fact, since the results of its actions are not
completely predictable, it is not enough just to compute the correct
sequence; instead, the robot must sense and correct for deviations from
its intended path.
In order for any machine learner to act reasonably in an uncertain
environment, it must solve problems like the one above quickly and
reliably. Unfortunately, the world is often so complicated that it is
difficult or impossible to find the optimal sequence of actions to
achieve a given goal. So, in order to scale our learners up to
real-world problems, we usually must settle for approximate solutions.
One representation for a learner's environment and goals is a Markov
decision process or MDP. MDPs allow us to represent actions that have
probabilistic outcomes, and to plan for complicated,
temporally-extended goals. An MDP consists of a set of states that the
environment can be in, together with rules for how the environment can
change state and for what the learner is supposed to do.
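As a concrete sketch of this representation (a toy example for illustration only; the states, actions, probabilities, and rewards below are invented and are not the models studied later in this thesis), a small MDP for the coffee-fetching robot might be written down as follows:

    import random

    # Toy MDP: illustrative states, actions, transition probabilities,
    # and rewards for a coffee-fetching robot.
    states = ["at_coffee_pot", "in_hallway", "at_office"]
    actions = ["move_forward", "stay"]

    # transitions[state][action] is a list of (next_state, probability)
    # pairs, capturing the fact that actions have probabilistic outcomes.
    transitions = {
        "at_coffee_pot": {
            "move_forward": [("in_hallway", 0.9), ("at_coffee_pot", 0.1)],
            "stay":         [("at_coffee_pot", 1.0)],
        },
        "in_hallway": {
            "move_forward": [("at_office", 0.8), ("in_hallway", 0.2)],
            "stay":         [("in_hallway", 1.0)],
        },
        "at_office": {
            "move_forward": [("at_office", 1.0)],
            "stay":         [("at_office", 1.0)],
        },
    }

    # The learner is rewarded only when it reaches the goal state.
    reward = {"at_coffee_pot": 0.0, "in_hallway": 0.0, "at_office": 1.0}

    def step(state, action):
        """Sample a successor state from the transition distribution."""
        next_states, probs = zip(*transitions[state][action])
        return random.choices(next_states, probs)[0]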
One way to approach a large MDP is to try to compute an approximation
to its optimal state evaluation function, the function that tells us
how much reward the learner can expect to accumulate if the world is
in a particular state. If the approximation is good enough, we can use
a shallow search to find a good action from most states. Researchers
have tried many different ways to approximate evaluation functions.
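Continuing the toy MDP sketched above, the shallow search can be as simple as a one-step lookahead under an approximate evaluation function (here a plain table, purely for illustration; the discount factor and function names are assumptions, not a particular algorithm from this thesis):

    def greedy_action(state, V_hat, gamma=0.9):
        """Pick the action whose expected one-step backup looks best
        under an approximate evaluation function V_hat."""
        def backup(action):
            return sum(p * (reward[s2] + gamma * V_hat[s2])
                       for s2, p in transitions[state][action])
        return max(actions, key=backup)

    # Example: even a rough table of state values is enough to steer the
    # robot toward the office from the hallway.
    V_hat = {"at_coffee_pot": 0.5, "in_hallway": 0.8, "at_office": 1.0}
    print(greedy_action("in_hallway", V_hat))  # -> "move_forward"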
This thesis aims for a middle ground between algorithms that don't
scale well because they use an impoverished representation for the
evaluation function and algorithms that we can't analyze because they
use too complicated a representation.