Exploiting Tree Search in Model-Learning Reinforcement Learning Algorithms

Scott Davies, Andrew Ng, Andrew Moore

Reinforcement learning in continuous state spaces requires function approximators to represent dynamic programming solutions such as value functions or Q-functions. Unfortunately, the computational expense of using these approximators is often prohibitive, particularly when dynamic programming is performed repeatedly on-line as the learner's model of the state space changes. Thus, relatively crude approximators are typically used to represent the DP solutions, which in turn degrades on-line performance. We investigate the possibility of using on-line model-based search trees to alleviate this problem in continuous-state reinforcement learning domains. In particular, we examine the following possibilities:

- Microsearch: We allow the learner to use its model of the state space to perform a short search over limited trajectories from its current position; the approximated value function serves as the heuristic evaluation function at the leaves of the search. This search gives the controller more robust performance in the face of crudely approximated value functions. (A minimal sketch appears after this abstract.)

- Uninformed Macrosearch: In this breadth-first-search technique borrowed from robot motion planning, we break the state space into a uniform grid. From its current position, the learner uses its model of the state space to search for a complete, continuous trajectory to the goal. A search procedure similar to the one used by Microsearch finds paths from one grid element to another; multiple trajectories entering the same grid element are pruned. In this manner, the learner can build complete trajectories that can be trusted to the same extent that its model can be trusted; failures to reach the goal while following trajectories found in this way can be attributed directly to inaccuracies in the learner's model. Using these trajectories can significantly improve the learner's on-line performance. (See the second sketch below.)

- Informed Macrosearch: The uninformed macrosearch procedure outlined above suffers from the fact that its search often requires too much time and memory to be practical. However, we can use an approximated value function as the heuristic for an A*-style search from grid element to grid element; even when the approximated value function is relatively crude, this can dramatically reduce the time and memory the search requires. (See the third sketch below.)

We investigate the strengths and feasibility of these approaches both in situations in which the learner already has a perfect model of the state space, and in situations in which the learner must learn a model from scratch on-line.
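The following Python sketch illustrates the microsearch idea under stated assumptions: a depth-limited lookahead expands a small set of candidate actions with the learned one-step model, and the approximated value function scores the leaves. The names model, V_hat, reward, ACTIONS, and DISCOUNT are hypothetical placeholders, not interfaces from the paper.

    # Minimal microsearch sketch.  Assumed interfaces (hypothetical):
    #   model(state, action) -> next_state, V_hat(state) -> float,
    #   reward(state, action) -> float.
    DISCOUNT = 0.99
    ACTIONS = [-1.0, 0.0, 1.0]           # illustrative discrete action set

    def microsearch(state, model, V_hat, reward, depth=3):
        """Depth-limited lookahead; returns (estimated value, best first action)."""
        if depth == 0:
            return V_hat(state), None    # heuristic evaluation at the leaf
        best_value, best_action = float("-inf"), None
        for a in ACTIONS:
            next_state = model(state, a)                   # one-step model prediction
            future_value, _ = microsearch(next_state, model, V_hat, reward, depth - 1)
            value = reward(state, a) + DISCOUNT * future_value
            if value > best_value:
                best_value, best_action = value, a
        return best_value, best_action

In use, the controller would execute only the first action of the best trajectory and re-run the search from the next observed state, so the lookahead compensates for local errors in the approximated value function.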
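Under the same assumptions, the uninformed macrosearch can be sketched as a breadth-first search over a uniform grid laid on the state space, with the learned model simulating each step and later trajectories pruned when they enter an already-claimed grid cell. The grid resolution, goal test, and action set below are again illustrative assumptions.

    from collections import deque

    def grid_cell(state, resolution=0.1):
        """Map a continuous state (a tuple of floats) to a discrete grid index."""
        return tuple(int(x // resolution) for x in state)

    def uninformed_macrosearch(start, model, actions, is_goal, max_expansions=100000):
        """Breadth-first search for a complete action sequence reaching the goal."""
        frontier = deque([(start, [])])
        visited = {grid_cell(start)}
        expansions = 0
        while frontier and expansions < max_expansions:
            state, plan = frontier.popleft()
            expansions += 1
            for a in actions:
                next_state = model(state, a)        # simulate with the learned model
                cell = grid_cell(next_state)
                if cell in visited:
                    continue                        # prune: this grid cell is already claimed
                visited.add(cell)
                if is_goal(next_state):
                    return plan + [a]               # complete trajectory to the goal
                frontier.append((next_state, plan + [a]))
        return None                                 # no trajectory found within the budget

A returned plan can be trusted exactly as far as the model can: if executing it fails to reach the goal, the discrepancy points directly at model error.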
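Replacing the FIFO frontier with a priority queue ordered by accumulated cost plus a heuristic derived from the approximated value function yields the A*-style informed macrosearch. The sketch below assumes a unit cost per simulated step and treats -V_hat(state) as an estimated cost-to-go; both choices are illustrative, not taken from the paper.

    import heapq
    import itertools

    def informed_macrosearch(start, model, actions, V_hat, is_goal,
                             resolution=0.1, max_expansions=100000):
        """A*-style grid-to-grid search guided by the approximate value function."""
        def cell(state):
            return tuple(int(x // resolution) for x in state)

        counter = itertools.count()                 # tie-breaker keeps heap entries comparable
        frontier = [(-V_hat(start), next(counter), 0.0, start, [])]
        visited = {cell(start)}
        expansions = 0
        while frontier and expansions < max_expansions:
            _, _, cost, state, plan = heapq.heappop(frontier)
            expansions += 1
            if is_goal(state):
                return plan
            for a in actions:
                next_state = model(state, a)
                c = cell(next_state)
                if c in visited:
                    continue                        # prune on grid-cell re-entry
                visited.add(c)
                new_cost = cost + 1.0               # assumed unit cost per simulated step
                priority = new_cost - V_hat(next_state)   # f = g + h, with h = -V_hat
                heapq.heappush(frontier, (priority, next(counter), new_cost,
                                          next_state, plan + [a]))
        return None

Even a crude V_hat can concentrate expansions toward the goal region, which is the source of the time and memory savings described above.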