Linear Function Approximation

A linear function approximator is a function y=f(x,w) that is linear in the weights, though not necessarily linear in the input x:
```    y = w_1 * f_1(x) + w_2 * f_2(x) + ... + w_n * f_n(x)
```
where x, y, and w can be vectors, the f_i() functions can be linear or nonlinear, and w_i is the ith element of the w vector. Examples of linear function approximators include:
• Lookup table
There is a separate weight for each possible value of x. There are only n possible values for x, and f_i(x)=1 when x=i and f_i(x)=0 otherwise.
• Linear
The output is just the dot product of w and x. The individual functions are just f_i(x)=x_i, where x_i is the ith element of vector x.
Each f_i(x) function looks like a smooth bump. Each f_i() function has a "center" location, and f_i(x) is a monotonic function of the distance from x to the center. The "distance" may be Euclidean distance (circular bumps), or there may be a diagonal covariance matrix (ellipsoidal bumps parallel to the axes), or there may be a full covariance matrix (general ellipsoidal bumps). To be a linear function approximator, the bumps must not move or change shape.
• Wavelets
Each f_i(x) is a wavelet, typically the product of a cosine and a Gaussian. This is particularly useful in image applications, because the human visual system seems to use similar functions.
• CMAC
Each f_i(x) function has a value of 1 inside of k square regions in input space, and 0 everywhere else. A hash function is used to make sure that the k squares are randomly scattered. The functions are chosen so that for any give x, there will be exactly c different f_i(x) functions active, and their square regions will be offset from one another slightly. Because of the hash function, the CMAC has the nice property that it makes good use of its weights even when all the training examples are in one small region of input space which wasn't known beforehand. One particularly interesting example of this is when the input space is very high dimensional, but all training examples come from a simple, low-dimensional manifold. For some reason, people often implement CMACs with k=1, which destroys this useful property. The hash function was originally proposed to be reminiscent of random neuron wiring in the brain.
Linear function approximators have several nice properties. For supervised learning, the weights can be found immediately at the cost of a single matrix inversion, without any gradient-descent or incremental learning. For reinforcement learning, the weights can be found at the cost of solving a single linear program.

For incremental reinforcement learning algorithms there are also a few useful properties. For a lookup table, almost all the incremental algorithms are guaranteed to converge to optimality. For other linear function approximators, TD(lambda) is guaranteed to converge when doing on-policy training (transitions are trained on with a frequency proportional to their frequency in the Markov chain).

Unfortunately, very few other convergence results are true. TD(lambda) can diverge for off-policy training. Q-learning can diverge even with on-policy training. SARSA can oscillate wildly and periodically forget everything useful it had learned so far. For incremental algorithms, limiting the function approximator to linear function approximators does not help convergence very much.

Fortunately there are ways to ensure that all these algorithms will converge: use the the residual form of each algorithm. In that case, they will converge for both linear and nonlinear function approximators.

There are also results indicating that nonlinear function approximators may be more powerful in general than linear function approximators for learning high-dimensional functions. For example, if the target function is fairly smooth (has little energy in the high requencies), and the function approximator is nonlinear, then it is known that the number of weights needed for a good fit grows only polynomially with the dimensionality of the input. This is true for such diverse function approximators as sigmoidal neural networks and linear combinations of sine waves. It may even be true for all of the popular nonlinear function approximators. But it has been shown that it is not true for any linear function approximator. This result suggests that linear function approximators may be most useful in cases where the input space is low dimensional, or where the training examples all come from a low-dimensional manifold in a high-dimensional space.