\slidehead{Instance-Based Learning \red \ } \bk
\centerline{[Read Ch. 8]}
\bi
\item $k$-Nearest Neighbor
\item Locally weighted regression
\item Radial basis functions
\item Case-based reasoning
\item Lazy and eager learning
\ei
\slidehead{Instance-Based Learning \red \ } \bk
Key idea: just store all training examples $\langle x_i, f(x_i) \rangle$
\bigskip
Nearest neighbor:
\bi
\item
Given query instance $x_q$, first locate nearest training example $x_n$, then
estimate $\hat{f}(x_q) \la f(x_n)$
\ei
$k$-Nearest neighbor:
\bi
\item Given $x_q$, take vote among its $k$ nearest nbrs (if discrete-valued
target function)
\item take mean of $f$ values of $k$ nearest nbrs (if real-valued)
\[ \hat{f}(x_{q}) \la \frac{\sum_{i=1}^{k}f(x_{i})}{k} \]
\ei
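The two rules above can be sketched in a few lines of Python. This is a minimal illustration, not from the text: Euclidean distance and the toy training set are assumptions.

```python
import math
from collections import Counter

def knn_predict(query, examples, k=3, discrete=True):
    """Predict f(query) from stored <x, f(x)> pairs.

    discrete=True  -> majority vote among the k nearest neighbors
    discrete=False -> mean of the f values of the k nearest neighbors
    """
    def dist(a, b):
        # Euclidean distance (an assumption; any metric could be used)
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    nearest = sorted(examples, key=lambda ex: dist(ex[0], query))[:k]
    values = [fx for _, fx in nearest]
    if discrete:
        return Counter(values).most_common(1)[0][0]
    return sum(values) / k

# Toy training set: two clusters labeled 0 and 1
train = [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((1.0, 1.0), 1), ((0.9, 1.1), 1)]
print(knn_predict((0.95, 1.0), train, k=3))  # -> 1
```

With `k=3` the vote is taken over the three closest stored points, so a single noisy neighbor cannot flip the prediction.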
\slidehead{When To Consider Nearest Neighbor \red \ } \bk
\bi
\item Instances map to points in $\Re^n$
\item Less than 20 attributes per instance
\item Lots of training data
\ei
Advantages:
\bi
\item Training is very fast
\item Learn complex target functions
\item Don't lose information
\ei
Disadvantages:
\bi
\item Slow at query time
\item Easily fooled by irrelevant attributes
\ei
\slidehead{Voronoi Diagram \red \ } \bk
\centerline{\hbox{\psfig{file=./bookps/knn-f1.ps,width=6.5in}}}
\slidehead{Behavior in the Limit \red \ } \bk
Suppose $p(x)$ defines the probability that instance $x$ will be labeled 1
(positive) versus 0 (negative).
\bigskip
Nearest neighbor:
\bi
\item As number of training examples $\ra \infty$,
approaches Gibbs Algorithm
\item[] Gibbs: with probability $p(x)$ predict 1, else 0
\ei
\bigskip
$k$-Nearest neighbor:
\bi
\item As number of training examples $\ra \infty$ and $k$ gets large, approaches Bayes optimal
\item[] Bayes optimal: if $p(x)>.5$ then predict 1, else 0
\ei
\bigskip
Note Gibbs has at most twice the expected error of Bayes optimal
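The factor-of-two bound can be checked directly: for a single instance with $p = p(x)$, Gibbs errs with probability $2p(1-p)$ and Bayes optimal with probability $\min(p, 1-p)$. A quick numerical sweep (an illustration, not a proof):

```python
def gibbs_error(p):
    # Gibbs predicts 1 with probability p; the truth is 1 with probability p
    return 2 * p * (1 - p)

def bayes_error(p):
    # Bayes optimal always predicts the more probable label
    return min(p, 1 - p)

# Check the bound on a grid over [0, 1]
for i in range(101):
    p = i / 100
    assert gibbs_error(p) <= 2 * bayes_error(p) + 1e-12
print("Gibbs error <= 2 * Bayes-optimal error for all p")
```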
\slidehead{Distance-Weighted $k$NN \red \ } \bk
Might want to weight nearer neighbors more heavily\ldots
\[ \hat{f}(x_{q}) \la \frac{\sum_{i=1}^{k} w_{i} f(x_{i})}{\sum_{i=1}^{k} w_{i}}
\]
where
\[ w_{i} \equiv \frac{1}{d(x_{q}, x_{i})^{2}} \]
and $d(x_{q}, x_{i})$ is distance between $x_{q}$ and $x_{i}$
\bigskip
\bigskip
Note that it now makes sense to use {\em all} training examples instead of just $k$
\bi \item[$\ra$]Shepard's method \ei
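A sketch of the distance-weighted rule, with $w_i = 1/d(x_q,x_i)^2$ and `k=None` meaning "use all training examples" (Shepard's method). The exact-match guard is needed because the weight would be infinite when $d = 0$:

```python
import math

def dw_knn(query, examples, k=None):
    """Distance-weighted kNN: fhat(xq) = sum(w_i f(x_i)) / sum(w_i),
    with w_i = 1 / d(xq, x_i)^2.  k=None uses every stored example."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    pool = sorted(examples, key=lambda ex: dist(ex[0], query))
    if k is not None:
        pool = pool[:k]
    num = den = 0.0
    for x, fx in pool:
        d = dist(x, query)
        if d == 0.0:
            return fx  # query equals a stored instance: return f(x_i) directly
        w = 1.0 / d ** 2
        num += w * fx
        den += w
    return num / den

# Midway between two stored points with equal weights -> their mean
print(dw_knn((0.5,), [((0.0,), 0.0), ((1.0,), 1.0)]))  # -> 0.5
```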
\slidehead{Curse of Dimensionality \red \ } \bk
Imagine instances described by 20 attributes, but only 2 are relevant to
target function
\bigskip
{\em Curse of dimensionality}: nearest nbr is easily misled when
$X$ is high-dimensional
\bigskip
One approach:
\bi
\item Stretch $j$th axis by weight $z_j$, where $z_1, \ldots, z_n$
chosen to minimize prediction error
\item Use cross-validation to automatically choose weights $z_1, \ldots, z_n$
\item Note setting $z_j$ to zero eliminates this dimension altogether
\ei
\bigskip
\centerline{ see [Moore and Lee, 1994]}
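The axis-stretching idea is easy to state in code. Below, a toy sketch (the data values are assumptions): attribute 2 is irrelevant noise that dominates the plain Euclidean distance, and setting $z_2 = 0$ removes it.

```python
import math

def weighted_dist(a, b, z):
    """Euclidean distance after stretching axis j by weight z_j.
    Setting z_j = 0 eliminates dimension j altogether."""
    return math.sqrt(sum((zj * (aj - bj)) ** 2
                         for aj, bj, zj in zip(a, b, z)))

# Attribute 1 is relevant; attribute 2 is irrelevant noise.
a, b = (0.0, 5.0), (0.1, 0.0)
print(weighted_dist(a, b, (1.0, 1.0)))  # noise dominates: ~5.0
print(weighted_dist(a, b, (1.0, 0.0)))  # noise axis dropped: ~0.1
```

In practice the weights $z_1,\ldots,z_n$ would be chosen by cross-validation rather than by hand, as the slide notes.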
\slidehead{Locally Weighted Regression \red \ } \bk
Note $k$NN forms local approximation to $f$ for each query point $x_q$
\bigskip
Why not form an explicit approximation $\hat{f}(x)$ for the region surrounding
$x_q$?
\bi
\item Fit linear function to $k$ nearest neighbors
\item Fit quadratic, ...
\item Produces ``piecewise approximation'' to $f$
\ei
Several choices of error to minimize:
\bi
\item Squared error over $k$ nearest neighbors
\[E_{1}(x_q) \equiv \frac{1}{2} \sum_{x \in\ k\ nearest\ nbrs\ of\ x_q} (f(x)
- \hat{f}(x))^2 \]
\item Distance-weighted squared error over all nbrs
\[E_{2}(x_q) \equiv \frac{1}{2} \sum_{x \in D} (f(x) - \hat{f}(x))^2\
K(d(x_{q}, x)) \]
\item $\ldots$
\ei
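A one-dimensional sketch of the second criterion: minimize the distance-weighted squared error over all examples, with a Gaussian kernel $K(d) = e^{-d^2/2\tau^2}$ (the kernel choice and bandwidth $\tau$ are assumptions), solved in closed form by weighted least squares for a linear $\hat{f}$:

```python
import math

def lwr_predict(xq, data, tau=0.5):
    """Locally weighted linear regression at query xq (1-D sketch).

    Minimizes E2(xq) = 1/2 * sum_x K(d(xq,x)) (f(x) - fhat(x))^2
    with fhat(x) = a + b*x, via weighted least squares."""
    w = [math.exp(-(x - xq) ** 2 / (2 * tau ** 2)) for x, _ in data]
    sw   = sum(w)
    swx  = sum(wi * x     for wi, (x, _) in zip(w, data))
    swy  = sum(wi * y     for wi, (_, y) in zip(w, data))
    swxx = sum(wi * x * x for wi, (x, _) in zip(w, data))
    swxy = sum(wi * x * y for wi, (x, y) in zip(w, data))
    # Closed-form weighted least-squares slope and intercept
    b = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
    a = (swy - b * swx) / sw
    return a + b * xq

# On noiseless linear data the local fit recovers the line exactly
data = [(x / 10, 2 * x / 10 + 1) for x in range(11)]
print(lwr_predict(0.55, data))  # ~2.1
```

Refitting at every query is what makes this "lazy": each $x_q$ gets its own local linear model, producing the piecewise approximation to $f$ mentioned above.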
\slidehead{Radial Basis Function Networks \red \ } \bk
\bi
\item Global approximation to target function, in terms of linear combination
of local approximations
\item Used, e.g., for image classification
\item A different kind of neural network
\item Closely related to distance-weighted regression, but ``eager'' instead
of ``lazy''
\ei
\slidehead{Radial Basis Function Networks \red \ } \bk
\centerline{\hbox{\psfig{file=./bookps/rbf2.ps,height=3in}}}
where $a_i(x)$ are the attributes describing instance $x$, and
\[ f(x) = w_0 + \sum_{u=1}^{k} w_u K_u(d(x_u,x)) \]
One common choice for $K_u(d(x_u,x))$ is
\[ K_u(d(x_u,x)) = e^{- \frac{1}{2 \sigma_u^2}d^2(x_u,x)} \]
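The network's forward pass is a direct transcription of these two formulas (centers, widths, and weights below are illustrative placeholders):

```python
import math

def rbf_output(x, centers, sigmas, weights, w0):
    """f(x) = w0 + sum_u w_u * exp(-d^2(x_u, x) / (2 sigma_u^2))."""
    def d2(a, b):
        # squared Euclidean distance between instance x and center x_u
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return w0 + sum(wu * math.exp(-d2(xu, x) / (2 * su ** 2))
                    for xu, su, wu in zip(centers, sigmas, weights))

# At a kernel's center the Gaussian equals 1, so f = w0 + w_u there
print(rbf_output((0.0, 0.0), [(0.0, 0.0)], [1.0], [2.0], 0.5))  # -> 2.5
```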
\slidehead{Training Radial Basis Function Networks \red \ } \bk
Q1: What $x_u$ to use for each kernel function $K_u(d(x_u,x))$?
\bi
\item Scatter uniformly throughout instance space
\item Or use training instances (reflects instance distribution)
\ei
Q2: How to train the weights (assume here Gaussian $K_u$)?
\bi
\item First choose variance (and perhaps mean) for each $K_u$
\bi \item e.g., use EM \ei
\item Then hold $K_u$ fixed, and train linear output layer
\bi \item efficient methods to fit linear function \ei
\ei
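Once the kernels are fixed, the second step is ordinary linear least squares: each training instance gives one row of kernel activations, and the output weights solve the normal equations. A sketch under simplifying assumptions (a shared $\sigma$ and hand-picked centers, rather than EM-fitted ones):

```python
import math

def train_rbf_weights(data, centers, sigma):
    """Fit w0, w_1..w_k of f(x) = w0 + sum_u w_u K_u(d(x_u, x)) by
    least squares, with kernels held fixed (shared Gaussian width)."""
    def phi(x):
        row = [1.0]  # constant feature for the bias weight w0
        for c in centers:
            d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
            row.append(math.exp(-d2 / (2 * sigma ** 2)))
        return row

    X = [phi(x) for x, _ in data]
    y = [fx for _, fx in data]
    n = len(X[0])
    # Normal equations  (X^T X) w = X^T y
    A = [[sum(X[r][i] * X[r][j] for r in range(len(X))) for j in range(n)]
         for i in range(n)]
    b = [sum(X[r][i] * y[r] for r in range(len(X))) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, n):
            m = A[r][i] / A[i][i]
            for j in range(i, n):
                A[r][j] -= m * A[i][j]
            b[r] -= m * b[i]
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, n))) / A[i][i]
    return w  # [w0, w1, ..., wk]
```

Because the output layer is linear in the weights, this step is fast and has a unique optimum; all the nonlinearity lives in the fixed kernels.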
\slidehead{Case-Based Reasoning \red \ } \bk
Can apply instance-based learning even when $X \neq \Re^n$
\bi \item[$\ra$] need different ``distance'' metric \ei
\bigskip
Case-Based Reasoning is instance-based learning applied to instances with
symbolic logic descriptions
\bigskip
\begin{verbatim}
((user-complaint error53-on-shutdown)
(cpu-model PowerPC)
(operating-system Windows)
(network-connection PCIA)
(memory 48meg)
(installed-applications Excel Netscape VirusScan)
(disk 1gig)
(likely-cause ???))
\end{verbatim}
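For symbolic cases like the one above, a "distance" metric might simply count disagreements over shared attributes. A toy sketch (this scheme is an illustration, far simpler than the structural matching real CBR systems use):

```python
def case_distance(query, case):
    """Fraction of shared attributes whose values disagree.
    Cases are dicts of attribute -> value; 0.0 means a perfect match."""
    shared = set(query) & set(case)
    if not shared:
        return 1.0  # nothing in common: treat as maximally distant
    mismatches = sum(query[a] != case[a] for a in shared)
    return mismatches / len(shared)

stored = {"cpu-model": "PowerPC", "operating-system": "Windows",
          "memory": "48meg", "likely-cause": "bad-driver"}
query  = {"cpu-model": "PowerPC", "operating-system": "Windows",
          "memory": "64meg"}
print(case_distance(query, stored))  # 1 of 3 shared attrs differ
```

The nearest stored case's `likely-cause` slot would then be proposed as the answer for the query.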
\slidehead{Case-Based Reasoning in CADET \red \ } \bk
CADET: 75 stored examples of mechanical devices
\bi
\item each training example: $\langle$ qualitative function, mechanical structure$\rangle$
\item new query: desired function,
\item target value: mechanical structure for this function
\ei
Distance metric: match qualitative function descriptions
\slidehead{Case-Based Reasoning in CADET \red \ } \bk
\centerline{\hbox{\psfig{file=./bookps/cbr.ps,width=6.5in}}}
\slidehead{Case-Based Reasoning in CADET \red \ } \bk
\bi
\item Instances represented by rich structural descriptions
\item Multiple cases retrieved (and combined) to form solution to new problem
\item Tight coupling between case retrieval and problem solving
\ei
Bottom line:
\bi
\item Simple matching of cases useful for tasks such as
answering help-desk queries
\item Area of ongoing research
\ei
\slidehead{Lazy and Eager Learning \red \ } \bk
Lazy: wait for query before generalizing
\bi \item \pgm{$k$-Nearest Neighbor}, Case-based reasoning \ei
\bigskip
Eager: generalize before seeing query
\bi \item Radial basis function networks, ID3, Backpropagation, NaiveBayes, $\ldots$ \ei
Does it matter?
\bi
\item Eager learner must create global approximation
\item Lazy learner can create many local approximations
\item If they use the same $H$, the lazy learner can represent more complex
functions (e.g., consider $H$ = linear functions)
\ei