
STAGE

 

In their simulated-annealing approach to the channel-routing problem (described in Section 5.1), Wong et al. say:

Clearly, the objective function to be minimized is the channel width $w$. However, $w$ is too crude a measure of the quality of intermediate solutions. Instead, for any valid partition $P$, the following cost function is used:

$$ C \;=\; w^2 \;+\; \lambda_p \, p^2 \;+\; \lambda_U \, U \eqno(7) $$

where $p$ is the longest path length of $G_P$ [a graph induced by the partitioning], both $\lambda_p$ and $\lambda_U$ are constants, and ... $U = \sum_{i=1}^{w} u_i^2$, where $u_i$ is the fraction of track $i$ that is unoccupied. [Wong et al. 1988]

They hand-tuned the coefficients $\lambda_p$ and $\lambda_U$.

To apply STAGE to this problem, we began not with the contrived function $C$ but with the natural objective function $f(x) = w$. The additional terms used in Eq. 7, $p$ and $U$, along with $w$ itself, were given as the three input features to STAGE's function approximator.
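To make this setup concrete, here is a minimal Python sketch of the cost terms and the feature map, under an assumed interface in which a routing state exposes its width $w$, longest path length $p$, and per-track unoccupied fractions $u_i$; the coefficient values below are placeholders, not Wong et al.'s tuned constants:

    def wong_cost(w, p, u, lam_p=1.0, lam_U=1.0):
        # Contrived cost of Eq. 7: C = w^2 + lam_p*p^2 + lam_U*U,
        # where U = sum_i u_i^2 over the unoccupied track fractions.
        # lam_p and lam_U are placeholders for the hand-tuned constants.
        U = sum(ui ** 2 for ui in u)
        return w ** 2 + lam_p * p ** 2 + lam_U * U

    def stage_features(w, p, u):
        # The three input features given to STAGE's function approximator:
        # the true objective w plus the auxiliary terms p and U from Eq. 7.
        U = sum(ui ** 2 for ui in u)
        return [w, p, U]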

Table 2 summarizes the results we have obtained thus far, using STAGE to predict and improve upon the results of stochastic hillclimbing optimization on a large channel-routing problem (YK4). For prediction, we have used simple models: linear regression in the batch-update case, and backprop on MLPs (2 hidden units) in the TD($\lambda$) case.
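For concreteness, the batch-update variant can be sketched as follows. This is a simplified rendering under assumed interfaces, not the exact implementation: hillclimb(start, evaluate) is a search routine returning the trajectory of accepted states (one possible sketch, with its move generator bound in, appears after Table 2), f(x) = w is the true objective, and features(x) maps a state to the three-feature list [w, p, U]:

    import numpy as np

    def stage(start, hillclimb, f, features, n_iterations=10):
        # Alternate hillclimbing on the true objective f with hillclimbing
        # on a learned evaluation function V that predicts, from any state,
        # the outcome f-hillclimbing will reach if started there.
        X, y = [], []
        x = start
        for _ in range(n_iterations):
            traj = hillclimb(x, evaluate=f)        # stage 1: optimize f
            outcome = f(traj[-1])
            for s in traj:                         # label every visited state
                X.append(features(s) + [1.0])      # with the eventual outcome
                y.append(outcome)                  # ([1.0] is a bias input)
            # Batch fit: V(x) = theta . [features(x), 1] by least squares.
            theta, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
            V = lambda s, th=theta: float(np.dot(th, features(s) + [1.0]))
            x = hillclimb(traj[-1], evaluate=V)[-1]  # stage 2: optimize V to
        return x                                     # pick the next restart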

 

    Algorithm                                        Performance     Typical traj. len
                                                     (best, mean)    (taken / considered)
    -----------------------------------------------  --------------  --------------------
    (*) Global optimum for problem YK4               10              --
    (A) Hillclimbing, patience=100 (N=30)            44, 52.4        93 / 696
    (B) Hillclimbing, patience=1000 (N=30)           43, 47.9        99 / 2,591
    (C) Simulated annealing, f(x)=C (N=3)            35, 35          3,836 / 126,202
    (D) Simulated annealing, f(x)=w (N=3)            14, 14.7        163,401 / 279,073
    (E) Modified hillclimbing, patience=1000 (N=5)   19, 19.8        7,765 / 11,342
    (F) STAGE, TD(0.6) + backprop (N=3)              14, 22.7        388 / 2,779
    (G) STAGE, batch linear regression (N=28)        13, 17.1        340 / 2,293
    (H) STAGE, random fit (for validation)           50, 55          190 / 1,704

Table 2: Results on routing problem YK4. The first pair of columns gives the best and mean channel width attained over N runs of each algorithm. The second pair counts the moves taken versus moves considered in a typical search trajectory from the starting state to a local optimum (in the case of STAGE, after learning).

 

Our results provide some interesting insights. Experiments (A) and (B) show that hillclimbing easily gets stuck in very poor local optima, even when the ``patience'' parameter is set high (allowing it to evaluate more random moves before terminating). Experiment (C) shows that simulated annealing, as used in [Wong et al. 1988], does considerably better. Surprisingly, when given plenty of computing time, the annealer of experiment (D) does better still. It seems that the ``crude'' evaluation function f(x)=w lets simulated annealing perform what is effectively a random walk along the ridge of all solutions of equal cost w; given enough time, it fortuitously finds a hole in the ridge. Hillclimbing modified to accept equi-cost moves (E) performed nearly as well.
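A sketch of the patience-limited hillclimber behind experiments (A), (B), and (E), again under an assumed problem-specific move generator random_neighbor; setting accept_equal=True gives the modified hillclimber of (E), which wanders along equi-cost ridges:

    def hillclimb(start, evaluate, random_neighbor, patience=100,
                  accept_equal=False):
        # Stochastic hillclimbing: sample random moves, accept improvements,
        # give up after `patience` consecutive non-accepted evaluations.
        x, fx = start, evaluate(start)
        trajectory = [x]
        rejections = 0
        while rejections < patience:
            y = random_neighbor(x)
            fy = evaluate(y)
            if fy < fx:
                x, fx, rejections = y, fy, 0   # improvement: reset patience
                trajectory.append(x)
            elif accept_equal and fy == fx:
                x = y                          # equi-cost move: walk the ridge,
                rejections += 1                # but keep counting so the walk
                trajectory.append(x)           # still terminates
            else:
                rejections += 1                # rejected move
        return trajectory

In this sketch, accepted equi-cost moves still consume patience so that the ridge walk of (E) eventually terminates; whether such moves should instead reset the counter is a design choice the experiments above do not pin down.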

The hillclimber we gave to STAGE for learning was the fast but weakly performing one of experiment (A). The results show that STAGE learned to optimize superbly, not only bettering the performance of (A) but quickly finding solutions as good as those found by the long simulated-annealing runs. The batch fits with linear regression (G) were especially impressive, often producing evaluation functions that performed well after only two or three training trajectories. This seems too good to be true; did STAGE work according to its design, or did it somehow ``cheat''?

We considered several hypotheses. (1) Since STAGE concatenates two hillclimbing trajectories, perhaps it simply has twice as much time to explore. Experiment (B) rejects this explanation. (2) Perhaps the function approximator simply smooths f(x), helping eliminate local minima and plateaus. No: we tried training V(x) to smooth f(x) directly and running A(V);A(f), but this failed to produce improvement. We also trained V(x) to fit randomly generated training values from the same range as those generated by STAGE; again, this failed to improve over plain one-stage hillclimbing (Experiment H). (3) The input features happen to be excellent, providing an easy answer. This explanation turns out to be correct; I discuss how at the end of the next section.
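The control of experiment (H) amounts to the following sketch: fit V by the same least-squares machinery STAGE uses, but to uniformly random targets drawn from the range [lo, hi] spanned by STAGE's real training values (both bounds are placeholders here), and then run A(V);A(f):

    import random
    import numpy as np

    def random_fit_V(traj, features, lo, hi):
        # Experiment (H): same regression machinery as STAGE, but the
        # targets carry no information about where hillclimbing ends up.
        X = np.array([features(s) + [1.0] for s in traj])     # bias appended
        y = np.array([random.uniform(lo, hi) for _ in traj])  # random targets
        theta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return lambda s: float(np.dot(theta, features(s) + [1.0]))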

