\slidehead{Computational Learning Theory \red \ } \bk
\centerline{[read Chapter 7]}
\centerline{[Suggested exercises: 7.1, 7.2, 7.5, 7.8]}
\bigskip
\bi
\item Computational learning theory
\item Setting 1: learner poses queries to teacher
\item Setting 2: teacher chooses examples
\item Setting 3: randomly generated instances, labeled by teacher
\item Probably approximately correct (PAC) learning
\item Vapnik-Chervonenkis Dimension
\item Mistake bounds
\ei
\slidehead{Computational Learning Theory \red \ } \bk
What general laws constrain inductive learning?
\bigskip
We seek theory to relate:
\bi
\item Probability of successful learning
\item Number of training examples
\item Complexity of hypothesis space
\item Accuracy to which target concept is approximated
\item Manner in which training examples presented
\ei
\slidehead{Prototypical Concept Learning Task \red \ } \bk
\bi
\item
{\bf Given:}
\bi
\item
Instances $X$: Possible days, each described by the attributes
{\em Sky, AirTemp, Humidity, Wind, Water, Forecast}
\item
Target function $c$: $EnjoySport: X \rightarrow \{0,1 \}$
\item
Hypotheses $H$: Conjunctions of literals. E.g.
\[\langle ?, Cold, High, ?, ?, ? \rangle. \]
\item
Training examples $D$: Positive and negative examples of the target function
\[\langle x_1, c(x_1) \rangle , \ldots \langle x_m, c(x_m) \rangle\]
\ei
\item {\bf Determine:}
\bi
\item A hypothesis $h$ in $H$ such that $h(x)=c(x)$ for all $x$ in $D$?
\item A hypothesis $h$ in $H$ such that $h(x)=c(x)$ for all $x$ in $X$?
\ei
\ei
\slidehead{Sample Complexity \red \ } \bk
How many training examples are sufficient to learn the target concept?
\be
\item
If learner proposes instances, as queries to teacher
\bi
\item Learner proposes instance $x$, teacher provides $c(x)$
\ei
\item
If teacher (who knows $c$) provides training examples
\bi
\item teacher provides sequence of examples of form $\langle x, c(x) \rangle$
\ei
\item
If some random process (e.g., nature) proposes instances
\bi
\item instance $x$ generated randomly, teacher provides $c(x)$
\ei
\ee
\slidehead{Sample Complexity: 1 \red \ } \bk
Learner proposes instance $x$, teacher provides $c(x)$
(assume $c$ is in learner's hypothesis space $H$)
\bigskip
Optimal query strategy: play 20 questions
\bi \item pick instance $x$ such that half of hypotheses in $VS$ classify $x$
positive, half classify $x$ negative
\item When this is possible, need $\lceil \log_2 |H| \rceil $ queries to learn $c$
\item when not possible, need even more
\ei
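The query strategy above can be sketched for a finite hypothesis space. The threshold hypotheses and instance set below are a hypothetical example (not from the text), chosen so that near-balanced splits exist:

```python
# Minimal sketch of the optimal query strategy: repeatedly query the
# instance that splits the current version space most evenly, and keep
# only the hypotheses consistent with the teacher's answer.

def learn_by_queries(X, H, target):
    """Query until one hypothesis remains; return it and the query count."""
    vs = list(H)
    queries = 0
    while len(vs) > 1:
        # pick x whose positive/negative split of vs is most balanced
        x = min(X, key=lambda x: abs(sum(h(x) for h in vs) - len(vs) / 2))
        label = target(x)              # teacher provides c(x)
        queries += 1
        vs = [h for h in vs if h(x) == label]
    return vs[0], queries

# Hypothetical example: threshold hypotheses h_t(x) = (x >= t) on 0..7
X = list(range(8))
H = [lambda x, t=t: x >= t for t in range(9)]    # |H| = 9
h, q = learn_by_queries(X, H, target=H[5])
print(q)  # needs at most ceil(log2 9) = 4 queries
```

With perfectly balanced splits each answer halves the version space, which is where the $\lceil \log_2 |H| \rceil$ bound comes from.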
\slidehead{Sample Complexity: 2 \red \ } \bk
Teacher (who knows $c$) provides training examples
(assume $c$ is in learner's hypothesis space $H$)
\bigskip
Optimal teaching strategy: depends on $H$ used by learner
\bigskip
Consider the case $H = $ conjunctions of up to $n$ boolean literals and
their negations
\bi
\item[] e.g., $(AirTemp = Warm) \tand (Wind = Strong)$, where $AirTemp, Wind,
\ldots$ each have 2 possible values.
\ei
\bi
\item if there are $n$ possible boolean attributes in $H$, then $n + 1$ examples suffice
\item why?
\ei
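One way to see why $n+1$ examples suffice (a sketch, under the assumption that the teacher knows which literals the target uses): show one positive example, then that same example with each attribute flipped in turn. Flipping a relevant attribute turns the label negative; flipping an irrelevant one does not. The target below is hypothetical:

```python
# Hedged illustration: a teacher for conjunctions of boolean literals
# needs only n + 1 examples -- one positive example plus its n
# single-bit flips, each labeled by the target concept.

n = 4
target = {0: 1, 2: 0}            # hypothetical target: x0=1 AND x2=0

def c(x):                        # target concept over n boolean attributes
    return all(x[i] == v for i, v in target.items())

x_pos = [target.get(i, 0) for i in range(n)]     # any positive example

# Teacher's n + 1 examples: x_pos, plus each single-bit flip of it
examples = [(x_pos, c(x_pos))]
for i in range(n):
    x = list(x_pos)
    x[i] ^= 1
    examples.append((x, c(x)))

# Learner recovers the target: attribute i is constrained exactly
# when flipping it turned the positive example negative
learned = {i: x_pos[i] for i in range(n) if not examples[i + 1][1]}
print(learned == target)  # True
```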
\slidehead{Sample Complexity: 3 \red \ } \bk
Given:
\bi
\item set of instances $X$
\item set of hypotheses $H$
\item set of possible target concepts $C$
\item training instances generated by a fixed, unknown probability
distribution $\cal{D}$ over $X$
\ei
Learner observes a sequence $D$ of training examples of form $\langle x, c(x)
\rangle$, for some target concept $c \in C$
\bi
\item instances $x$ are drawn from distribution $\cal{D}$
\item teacher provides target value $c(x)$ for each
\ei
Learner must output a hypothesis $h$ estimating $c$
\bi
\item $h$ is evaluated by its performance on subsequent instances drawn
according to $\cal{D}$
\ei
Note: randomly drawn instances, noise-free classifications
\slidehead{True Error of a Hypothesis \red \ } \bk
\centerline{\hbox{\psfig{file=./bookps/pac-err.ps,width=5.0in}}}
\bigskip
\begin{quote}
{\bf Definition:} The {\bf true error} (denoted $error_{\cal{D}}(h)$) of
hypothesis $h$ with respect to target concept $c$ and distribution $\cal{D}$
is the probability that $h$ will misclassify an instance drawn at random
according to $\cal{D}$.
\[error_{\cal{D}}(h) \equiv \Pr_{x \in \cal{D}}[c(x) \neq h(x)] \]
\end{quote}
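When $\cal D$ is available as a sampler, the true error can be estimated by Monte Carlo: draw random instances and count disagreements. The concept, hypothesis, and uniform distribution below are hypothetical:

```python
# Sketch: estimate error_D(h) = Pr_{x ~ D}[c(x) != h(x)] by sampling.

import random

def estimate_true_error(h, c, sample_D, trials=100_000):
    """Fraction of random draws from D on which h disagrees with c."""
    errors = 0
    for _ in range(trials):
        x = sample_D()
        errors += h(x) != c(x)
    return errors / trials

random.seed(0)
c = lambda x: x >= 0.3           # hypothetical target concept
h = lambda x: x >= 0.5           # hypothesis: errs on x in [0.3, 0.5)
err = estimate_true_error(h, c, lambda: random.random())
print(err)  # close to 0.2 under the uniform distribution on [0, 1)
```

Note the estimate depends on $\cal D$: a distribution that rarely generates $x \in [0.3, 0.5)$ would make the same $h$ look nearly perfect.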
\slidehead{Two Notions of Error \red \ } \bk
{\em Training error} of hypothesis $h$ with respect to target concept $c$
\bi
\item How often $h(x) \neq c(x)$ over training instances
\ei
\bigskip
{\em True error} of hypothesis $h$ with respect to $c$
\bi
\item How often $h(x) \neq c(x)$ over future random instances
\ei
\bigskip
\bigskip
Our concern:
\bi
\item Can we bound the true error of $h$ given the training error of
$h$?
\item First consider when training error of $h$ is zero (i.e., $h \in VS_{H,D}$)
\ei
\slidehead{Exhausting the Version Space \red \ } \bk
\centerline{\hbox{\psfig{file=./bookps/pac-vs-exhausted.ps,width=5.0in}}}
\centerline{($r =$ training error, $error = $ true error)}
\bigskip
\begin{quote}
{\bf Definition:} The version space $VS_{H,D}$ is said to be $\epsilon$-{\bf
exhausted} with respect to $c$ and $\cal D$, if every hypothesis $h$ in
$VS_{H,D}$ has error less than $\epsilon$ with respect to $c$ and $\cal D$.
\[(\forall h \in VS_{H,D})\ error_{\cal D}(h) < \epsilon \]
\end{quote}
\slidehead{\normalsize How many examples will $\epsilon$-exhaust the VS?
\red \ } \bk
{\bf Theorem:} [Haussler, 1988].
\begin{quote}
If the hypothesis space $H$ is finite, and $D$ is a sequence of $m \geq 1$
independent random examples of some target concept $c$, then for any $0 \leq
\epsilon \leq 1$, the probability that the version space with respect to $H$
and $D$ is not $\epsilon$-exhausted (with respect to $c$) is less than \[|H|
e^{-\epsilon m} \]
\end{quote}
\bigskip
Interesting! This bounds the probability that any consistent learner will
output a hypothesis $h$ with $error(h) \geq \epsilon$
\bigskip
If we want this probability to be below $\delta$
\[|H|e^{- \epsilon m} \leq \delta \]
then
\[m \geq \frac{1}{\epsilon}(\ln|H| + \ln(1/\delta)) \]
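The bound can be wrapped in a small helper (a sketch; the usage line anticipates the $EnjoySport$ slide, where $|H| = 973$):

```python
# Sketch: number of examples sufficient to epsilon-exhaust the version
# space with probability at least 1 - delta, for finite H.

from math import ceil, log

def sample_bound(H_size, eps, delta):
    """m >= (1/eps)(ln|H| + ln(1/delta)), rounded up to an integer."""
    return ceil((log(H_size) + log(1 / delta)) / eps)

print(sample_bound(973, 0.1, 0.05))  # 99 examples suffice
```

Note the bound is only logarithmic in $|H|$ and $1/\delta$, but linear in $1/\epsilon$.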
\slidehead{Learning Conjunctions of Boolean Literals \red \ } \bk
How many examples are sufficient to assure with probability at
least $(1 - \delta)$ that
\bi
\item[] every $h$ in $VS_{H,D}$ satisfies $error_{\cal D}(h) \leq \epsilon$
\ei
\bigskip
Use our theorem:
\[m \geq \frac{1}{\epsilon}(\ln|H| + \ln(1/\delta)) \]
Suppose $H$ contains conjunctions of constraints on up to $n$ boolean
attributes (i.e., $n$ boolean literals). Then $|H| = 3^n$, and
\[m \geq \frac{1}{\epsilon}(\ln 3^n + \ln(1/\delta)) \]
or
\[m \geq \frac{1}{\epsilon}(n \ln 3 + \ln(1/\delta)) \]
\slidehead{How About $EnjoySport$? \red \ } \bk
\[m \geq \frac{1}{\epsilon}(\ln|H| + \ln(1/\delta)) \]
If $H$ is as given in $EnjoySport$ then $|H| = 973$, and
\[m \geq \frac{1}{\epsilon}(\ln 973 + \ln(1/\delta)) \]
\bigskip
... if we want to assure with probability 95\% that $VS$ contains only
hypotheses with $error_{\cal D}(h) \leq .1$, then it is sufficient to have
$m$ examples, where
\[m \geq \frac{1}{.1}(\ln 973 + \ln(1/.05)) \]
\[m \geq 10 (\ln 973 + \ln 20) \]
\[m \geq 10 (6.88 + 3.00) \]
\[m \geq 98.8 \]
\slidehead{PAC Learning \red \ } \bk
Consider a class $C$ of possible target concepts defined over a set of
instances $X$ of length $n$, and a learner $L$ using hypothesis space $H$.
\begin{quote}
{\em Definition:} $C$ is {\bf PAC-learnable} by $L$ using $H$ if for all $c
\in C$, distributions $\cal D$ over $X$, $\epsilon$ such that $0 < \epsilon <
1/2$, and $\delta$ such that $0 < \delta < 1/2$,
learner $L$ will with probability at least $(1-\delta)$ output a hypothesis $h
\in H$ such that $error_{\cal D}(h) \leq \epsilon$, in time that is polynomial
in $1/\epsilon$, $1/\delta$, $n$ and $size(c)$.
\end{quote}
\slidehead{Agnostic Learning \red \ } \bk
So far, assumed $c \in H$
\vspace*{.1in}
Agnostic learning setting: don't assume $c \in H$
\bi
\item What do we want then?
\bi \item The hypothesis $h$ that makes fewest errors on training data \ei
\item What is sample complexity in this case?
\ei
\[ m \geq \frac{1}{2 \epsilon^{2}}(\ln|H| + \ln(1/\delta)) \]
derived from Hoeffding bounds:
\[ \Pr[error_{\cal D}(h) > error_D(h) + \epsilon] \leq e^{-2m\epsilon^{2}} \]
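The agnostic bound can be computed the same way (a minimal sketch):

```python
# Sketch: agnostic sample complexity -- examples sufficient so that,
# with probability at least 1 - delta, every h in H has training error
# within eps of its true error.

from math import ceil, log

def agnostic_sample_bound(H_size, eps, delta):
    """m >= (1/(2 eps^2))(ln|H| + ln(1/delta)), rounded up."""
    return ceil((log(H_size) + log(1 / delta)) / (2 * eps ** 2))

print(agnostic_sample_bound(973, 0.1, 0.05))  # 494
```

Compare with the consistent-learner case: dropping the $c \in H$ assumption changes the dependence on $1/\epsilon$ from linear to quadratic (494 vs. 99 examples for the same $|H|$, $\epsilon$, $\delta$).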
\slidehead{Shattering a Set of Instances \red \ } \bk
\begin{quote}
{\em Definition:} a {\bf dichotomy} of a set $S$ is a partition of $S$ into
two disjoint subsets.
\end{quote}
\bigskip
\begin{quote}
{\em Definition:} a set of instances $S$ is {\bf shattered} by hypothesis
space $H$ if and only if for every dichotomy of $S$ there exists some
hypothesis in $H$ consistent with this dichotomy.
\end{quote}
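For a finite collection of hypotheses and a finite set of instances, the definition can be checked by brute force. The integer-endpoint interval hypotheses below are a hypothetical example:

```python
# Sketch: S is shattered by H iff every labeling (dichotomy) of S
# is realized by some hypothesis in H.

def shatters(H, S):
    """True iff the hypotheses in H produce all 2^|S| labelings of S."""
    realized = {tuple(h(x) for x in S) for h in H}
    return len(realized) == 2 ** len(S)

# Hypothetical H: interval hypotheses h_{a,b}(x) = (a <= x <= b)
H = [lambda x, a=a, b=b: a <= x <= b
     for a in range(5) for b in range(a, 5)]

print(shatters(H, [1, 3]))     # True: intervals shatter any 2 points
print(shatters(H, [1, 2, 3]))  # False: no interval labels 1, 3 positive
                               # and 2 negative
```

This brute-force check is also how one argues intervals have VC dimension 2: some 2-point set is shattered, but no 3-point set can be.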
\slidehead{Three Instances Shattered \red \ } \bk
\centerline{\hbox{\psfig{file=./bookps/pac-shatter.ps,width=6.5in}}}
\slidehead{The Vapnik-Chervonenkis Dimension \red \ } \bk
\begin{quote}
{\em Definition:} The {\bf Vapnik-Chervonenkis dimension}, $VC(H)$, of
hypothesis space $H$ defined over instance space $X$ is the size of the
largest finite subset of $X$ shattered by $H$. If arbitrarily large finite
sets of $X$ can be shattered by $H$, then $VC(H) \equiv \infty$.
\end{quote}
\slidehead{VC Dim. of Linear Decision Surfaces \red \ } \bk
\centerline{\hbox{\psfig{file=./bookps/pac-pts.ps,width=6.5in}}}
\slidehead{Sample Complexity from VC Dimension \red \ } \bk
How many randomly drawn examples suffice to $\epsilon$-exhaust $VS_{H,D}$ with
probability at least $(1 - \delta)$?
\[
m \geq \frac{1}{\epsilon}(4\log_2(2/\delta) + 8 VC(H) \log_2 (13/\epsilon))
\]
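As a sketch, this bound as a helper function; the usage line assumes the standard fact that intervals on the real line have VC dimension 2:

```python
# Sketch: VC-dimension-based sample complexity bound.

from math import ceil, log2

def vc_sample_bound(vc, eps, delta):
    """m >= (1/eps)(4 log2(2/delta) + 8 VC(H) log2(13/eps))."""
    return ceil((4 * log2(2 / delta) + 8 * vc * log2(13 / eps)) / eps)

print(vc_sample_bound(2, 0.1, 0.05))  # 1337
```

Unlike the earlier bound, this one applies even when $H$ is infinite, since it depends on $VC(H)$ rather than $\ln|H|$.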
\slidehead{Mistake Bounds \red \ } \bk
So far: how many examples needed to learn?
\vspace*{.1in}
What about: how many mistakes before convergence?
\vspace*{.2in}
Let's consider a setting similar to PAC learning:
\bi
\item Instances drawn at random from $X$ according to distribution ${\cal D}$
\item Learner must classify each instance before receiving correct
classification from teacher
\item Can we bound the number of mistakes learner makes before converging?
\ei
\slidehead{Mistake Bounds: Find-S \red \ } \bk
Consider Find-S when $H=$ conjunction of boolean literals
\begin{quote}
\pgm{Find-S}:
\bi
\item Initialize $h$ to the most specific hypothesis $l_{1} \tand \neg l_{1}
\tand l_{2} \tand \neg l_{2} \tand \ldots \tand l_{n} \tand \neg l_{n}$
\item For each positive training instance $x$
\bi
\item Remove from $h$ any literal that is not satisfied by $x$
\ei
\item Output hypothesis $h$.
\ei
\end{quote}
\vspace*{.2in}
How many mistakes before converging to correct $h$?
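A runnable sketch of Find-S with mistake counting, over a hypothetical target; literals are represented as (attribute, value) pairs:

```python
# Sketch of Find-S over conjunctions of boolean literals, counting
# mistakes in the on-line setting. The start hypothesis contains every
# literal and its negation, so it classifies everything negative; it is
# then generalized on each positive example.

from itertools import product

n = 4
target = {(0, 1), (2, 0)}            # hypothetical c: x0=1 AND x2=0

def satisfies(h, x):
    """Does instance x satisfy every literal (i, v) in h?"""
    return all(x[i] == v for i, v in h)

h = {(i, v) for i in range(n) for v in (0, 1)}   # most specific
mistakes = 0
for x in product((0, 1), repeat=n):              # an on-line stream
    label = satisfies(target, x)                 # teacher reveals c(x)
    if satisfies(h, x) != label:
        mistakes += 1
    if label:                                    # generalize on positives
        h = {(i, v) for i, v in h if x[i] == v}

print(h == target, mistakes)  # True, with at most n + 1 mistakes
```

Since $h$ always contains the target's literals, it never errs on a negative example; each mistake on a positive example strictly shrinks $h$, giving at most $n + 1$ mistakes.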
\slidehead{Mistake Bounds: Halving Algorithm \red \ } \bk
Consider the Halving Algorithm:
\bi
\item Learn concept using version space \pgm{Candidate-Elimination} algorithm
\item Classify new instances by majority vote of version space members
\ei
\vspace*{.2in}
How many mistakes before converging to correct $h$?
\bi
\item ... in worst case?
\item ... in best case?
\ei
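The algorithm can be sketched for a finite hypothesis space; the threshold hypotheses and target below are a hypothetical example:

```python
# Sketch of the Halving Algorithm: predict by majority vote of the
# version space, then eliminate inconsistent hypotheses. Every mistake
# at least halves the version space, so mistakes <= floor(log2 |H|).

def halving(stream, H):
    vs, mistakes = list(H), 0
    for x, label in stream:                 # teacher reveals c(x)
        votes = sum(h(x) for h in vs)
        prediction = votes * 2 >= len(vs)   # majority vote
        if prediction != label:
            mistakes += 1
        vs = [h for h in vs if h(x) == label]
    return vs, mistakes

# Hypothetical example: threshold hypotheses on 0..7, target t = 5
H = [lambda x, t=t: x >= t for t in range(9)]
c = lambda x: x >= 5
stream = [(x, c(x)) for x in range(8)]
vs, m = halving(stream, H)
print(m)  # at most floor(log2 9) = 3 mistakes
```

In the best case the version space shrinks without the majority ever being wrong, so zero mistakes are possible even though $|VS| > 1$ for a while.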
\slidehead{Optimal Mistake Bounds \red \ } \bk
Let $M_{A}(C)$ be the maximum number of mistakes made by algorithm $A$ to
learn concepts in $C$ (the maximum over all possible $c \in C$, and all
possible training sequences)
\[M_{A}(C) \equiv \max_{c \in C} M_{A}(c)\]
\vspace*{.3in}
{\em Definition:} Let $C$ be an arbitrary non-empty concept class. The {\bf
optimal mistake bound} for $C$, denoted $Opt(C)$, is the minimum over all
possible learning algorithms $A$ of $M_{A}(C)$.
\[Opt(C) \equiv \min_{A \in learning\ algorithms} M_{A}(C) \]
\vspace*{.3in}
\[ VC(C) \leq Opt(C) \leq M_{Halving}(C) \leq \log_{2}(|C|). \]