\slidehead{\large Learning Sets of Rules \red \ } \bk
\centerline{[Read Ch. 10]}
\centerline{[Recommended exercises 10.1, 10.2, 10.5, 10.7, 10.8]}
\bi
\item Sequential covering algorithms
\item FOIL
\item Induction as inverse of deduction
\item Inductive Logic Programming
\ei
\slidehead{\large Learning Disjunctive Sets of Rules \red \ } \bk
Method 1: Learn decision tree, convert to rules
\bigskip
\bigskip
Method 2: Sequential covering algorithm:
\be
\item {\em Learn one rule} with high accuracy, any coverage
\item Remove positive examples covered by this rule
\item Repeat
\ee
\slidehead{\large Sequential Covering Algorithm \red \ } \bk
\pgm{Sequential-covering}($Target\_attribute, Attributes, Examples, Threshold$)
\bi
\item
$Learned\_rules \la \{\}$
\item
$Rule \la$ \pgm{learn-one-rule}$(Target\_attribute, Attributes, Examples)$
\item
while \pgm{performance}($Rule, Examples$) $> Threshold$, do
\bi
\item $Learned\_rules \la Learned\_rules + Rule$
\item $Examples \la Examples \ -$ \{examples correctly classified by $Rule$\}
\item $Rule \la$ \pgm{learn-one-rule}$(Target\_attribute, Attributes, Examples)$
\ei
\item $Learned\_rules \la$ sort $Learned\_rules$ accord to \pgm{performance}
over $Examples$
\item return $Learned\_rules$
\ei
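The loop above can be sketched in Python. This is a minimal illustration, not the book's implementation: the learn-one-rule and performance subroutines are supplied by the caller, and each learned rule is assumed to be an object that can report which examples it correctly classifies.

```python
def sequential_covering(learn_one_rule, performance, examples, threshold):
    """Greedy covering: learn one accurate rule, remove the examples it
    correctly classifies, and repeat while performance stays above threshold."""
    all_examples = list(examples)      # keep the full set for the final sort
    learned_rules = []
    examples = list(examples)
    rule = learn_one_rule(examples)
    while rule is not None and performance(rule, examples) > threshold:
        learned_rules.append(rule)
        # drop the examples the new rule already handles correctly
        examples = [e for e in examples if not rule.correctly_classifies(e)]
        rule = learn_one_rule(examples)
    # sort best-first by performance over the full training set
    learned_rules.sort(key=lambda r: performance(r, all_examples), reverse=True)
    return learned_rules
```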
\slidehead{\large Learn-One-Rule \red \ } \bk
\centerline{\hbox{\psfig{file=./bookps/learn-one-rule.ps,width=7in}}}
\bigskip
\newpage
\pgm{Learn-One-Rule}
\bi
\item
$Pos \la$ positive $Examples$
\item
$Neg \la$ negative $Examples$
\item
while $Pos$ is nonempty, do
\bi
\item[]\hspace*{-.18in} {\em Learn a $NewRule$}
\item $NewRule \la$ most general rule possible
\item $NewRuleNeg \la Neg$
\item
while $NewRuleNeg$ is nonempty, do
\be
\item[]\hspace*{-.18in} {\em Add a new literal to specialize $NewRule$}
\item $Candidate\_literals \la$ generate candidates
\item $Best\_literal \la \argmax_{L \in Candidate\_literals}$ \\
$Performance(SpecializeRule(NewRule,L))$
\item add $Best\_literal$ to $NewRule$ preconditions
\item $NewRuleNeg \la$ subset of $NewRuleNeg$ that satisfies $NewRule$
preconditions
\ee
\item
$Learned\_rules \la Learned\_rules + NewRule$
\item
$Pos \la Pos \ - $ \{members of $Pos$ covered by $NewRule$\}
\ei
\item Return $Learned\_rules$
\ei
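A minimal Python sketch of the inner greedy general-to-specific search, assuming candidate-literal generation, the performance measure, and the coverage test are supplied by the caller (all names here are illustrative, and the sketch assumes some candidate literal always removes negatives):

```python
def learn_one_rule(pos, neg, candidate_literals, performance, satisfies):
    """Start from the most general rule (empty preconditions) and greedily
    add the best-scoring literal until no negative examples remain covered."""
    preconditions = []                 # most general rule possible
    covered_neg = list(neg)
    while covered_neg:
        # pick the literal whose specialization scores best
        best = max(candidate_literals(preconditions),
                   key=lambda lit: performance(preconditions + [lit], pos, neg))
        preconditions.append(best)
        covered_neg = [e for e in covered_neg if satisfies(e, preconditions)]
    return preconditions
```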
\slidehead{Subtleties: Learn One Rule \red \ } \bk
\be
\item May use {\em beam search}
\item Easily generalizes to multi-valued target functions
\item Choose evaluation function to guide search:
\bi
\item Entropy (i.e., information gain)
\item Sample accuracy:
\[ \frac{n_c}{n} \]
where $n_c =$ correct rule predictions, $n = $ all predictions
\item $m$ estimate:
\[ \frac{n_c + mp}{n + m}\]
where $p =$ prior probability of the predicted class, $m = $ weight given
to the prior
\ei
\ee
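These evaluation functions are easy to compute directly; a Python sketch, with the $m$-estimate's prior $p$ and weight $m$ passed in by the caller:

```python
from math import log2

def sample_accuracy(n_c, n):
    """Fraction of the rule's predictions that are correct: n_c / n."""
    return n_c / n

def m_estimate(n_c, n, p, m):
    """Accuracy smoothed toward the prior p; m weights the prior."""
    return (n_c + m * p) / (n + m)

def entropy(pos, neg):
    """Entropy of the set of examples a rule covers (lower is purer)."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            q = count / total
            result -= q * log2(q)
    return result
```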
\slidehead{\large Variants of Rule Learning Programs \red \ } \bk
\bi
\item
{\em Sequential} or {\em simultaneous} covering of data?
\item
General $\ra$ specific, or specific $\ra$ general?
\item
Generate-and-test, or example-driven?
\item
Whether and how to post-prune?
\item
What statistical evaluation function?
\ei
\slidehead{\large Learning First Order Rules \red \ } \bk
Why do that?
\bi
\item Can learn sets of rules such as
\begin{tabbing}
$Ancestor(x,y) \la Parent(x,y)$ \\
$Ancestor(x,y) \la Parent(x,z) \tand Ancestor(z,y) $
\end{tabbing}
\item The general-purpose programming language \pgm{Prolog}: programs are sets
of such rules
\ei
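Evaluating this pair of rules amounts to forward chaining to a fixed point. A minimal Python sketch, treating the Parent facts as a set of pairs (the particular facts in the usage below are illustrative):

```python
def ancestors(parent_facts):
    """Compute Ancestor(x,y) as the transitive closure of Parent(x,y),
    mirroring the two rules above by forward chaining to a fixed point."""
    ancestor = set(parent_facts)       # Ancestor(x,y) <- Parent(x,y)
    changed = True
    while changed:
        changed = False
        # Ancestor(x,y) <- Parent(x,z) and Ancestor(z,y)
        for (x, z) in parent_facts:
            for (z2, y) in list(ancestor):
                if z == z2 and (x, y) not in ancestor:
                    ancestor.add((x, y))
                    changed = True
    return ancestor
```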
\slidehead{First Order Rule for Classifying Web Pages \red } \bk
%
\centerline{[Slattery, 1997]}
\begin{tabbing}
course(\=A) $\la$\\
\>has-word(A, instructor),\\
\>Not has-word(A, good),\\
\>link-from(A, B),\\
\>has-word(B, assign),\\
\>Not link-from(B, C)\\[0.2in]
Train: 31/31, Test: 31/34
\end{tabbing}
\newpage
\pgm{FOIL}($Target\_predicate, Predicates, Examples$)
\bi
\item
$Pos \la$ positive $Examples$
\item
$Neg \la$ negative $Examples$
\item
while $Pos$ is nonempty, do
\bi
\item[]\hspace*{-.18in} {\em Learn a $NewRule$}
\item $NewRule \la$ most general rule possible
\item $NewRuleNeg \la Neg$
\item
while $NewRuleNeg$ is nonempty, do
\be
\item[]\hspace*{-.18in} {\em Add a new literal to specialize $NewRule$}
\item $Candidate\_literals \la$ generate candidates
\item $Best\_literal \la \argmax_{L \in Candidate\_literals} Foil\_Gain(L,
NewRule)$
\item add $Best\_literal$ to $NewRule$ preconditions
\item $NewRuleNeg \la$ subset of $NewRuleNeg$ that satisfies $NewRule$
preconditions
\ee
\item
$Learned\_rules \la Learned\_rules + NewRule$
\item
$Pos \la Pos \ - $ \{members of $Pos$ covered by $NewRule$\}
\ei
\item Return $Learned\_rules$
\ei
\slidehead{\large Specializing Rules in FOIL \red \ } \bk
Learning rule: $P(x_{1},x_{2}, \ldots, x_{k}) \la L_{1} \ldots L_{n}$
\bigskip

Candidate specializations add a new literal of one of the following forms:
\bi
\item
$Q(v_{1},\ldots,v_{r})$, where at least one of the $v_{i}$ in the created
literal must already exist as a variable in the rule.
\item
$Equal(x_{j},x_{k})$, where $x_{j}$ and $x_{k}$ are variables already present
in the rule
\item
The negation of either of the above forms of literals
\ei
\slidehead{\large Information Gain in FOIL \red \ } \bk
\[Foil\_Gain(L,R) \equiv t \left( \log_{2}\frac{p_{1}}{p_{1}+n_{1}} -
\log_{2}\frac{p_{0}}{p_{0}+n_{0}} \right) \]
Where
\bi
\item $L$ is the candidate literal to add to rule $R$
\item $p_0$ = number of positive bindings of $R$
\item $n_0$ = number of negative bindings of $R$
\item $p_1$ = number of positive bindings of $R+L$
\item $n_1$ = number of negative bindings of $R+L$
\item $t$ is the number of positive bindings of $R$ also covered by $R+L$
\ei
Note
\bi
\item $-\log_{2}\frac{p_{0}}{p_{0}+n_{0}}$ is optimal number of bits to
indicate the class of a positive binding covered by $R$
\ei
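A direct transcription of the gain formula in Python, returning 0 when $R+L$ covers no positive bindings (in that case $t=0$ as well, and the specialization is worthless anyway):

```python
from math import log2

def foil_gain(p0, n0, p1, n1, t):
    """FOIL's information gain for adding literal L to rule R.

    p0, n0: positive/negative bindings of R
    p1, n1: positive/negative bindings of R + L
    t:      positive bindings of R still covered by R + L
    """
    if p1 == 0:
        return 0.0                     # R + L covers no positives
    return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))
```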
\slidehead{\large FOIL Example \red \ } \bk
\centerline{\hbox{\psfig{file=./bookps/foil.ps,width=4in}}}
\bigskip
Instances:
\bi \item
pairs of nodes, e.g.\ $\langle 1,5 \rangle$, with the graph described by
literals {\em LinkedTo(0,1)}, $\neg${\em LinkedTo(0,8)}, etc.
\ei
Target function:
\bi \item {\em CanReach(x,y)} true iff directed path from $x$ to $y$ \ei
Hypothesis space:
\bi \item Each $h \in H$ is a set of Horn clauses using predicates $LinkedTo$
(and $CanReach$)
\ei
\slidehead{\large Induction as Inverted Deduction \red \ } \bk
Induction is finding $h$ such that
\[(\forall \langle x_{i}, f(x_i) \rangle \in D)\ B \tand h \tand x_{i} \entails f(x_{i}) \]
where
\bi
\item $x_i$ is $i$th training instance
\item $f(x_i)$ is the target function value for $x_i$
\item $B$ is other background knowledge
\ei
\vspace*{1in}
So let's design inductive algorithm by inverting operators for automated
deduction!
\slidehead{\large Induction as Inverted Deduction \red \ } \bk
``pairs of people $\langle u,v \rangle$ such that the child of $u$ is $v$'':
\begin{eqnarray}
f(x_{i}): & Child(Bob,Sharon) \nonumber \\
x_{i}: & Male(Bob), Female(Sharon), Father(Sharon, Bob) \nonumber \\
B: & Parent(u,v) \la Father(u,v) \nonumber
\end{eqnarray}
\bigskip
What satisfies $(\forall \langle x_{i}, f(x_i) \rangle \in D)\ B \tand h \tand x_{i} \entails f(x_{i})$?
\begin{eqnarray}
h_{1}: & Child(u,v) \la Father(v,u) \nonumber \\
h_{2}: & Child(u,v) \la Parent(v,u) \nonumber
\end{eqnarray}
\newpage
\bigskip
\begin{quote}
Induction is, in fact, the inverse operation of deduction, and cannot be
conceived to exist without the corresponding operation, so that the question
of relative importance cannot arise. Who thinks of asking whether addition or
subtraction is the more important process in arithmetic? But at the same time
much difference in difficulty may exist between a direct and inverse
operation; $\ldots$ it must be allowed that inductive investigations are of a
far higher degree of difficulty and complexity than any questions of
deduction$.\ldots$
\end{quote}
\indent
(Jevons 1874)
\slidehead{\large Induction as Inverted Deduction \red \ } \bk
We have mechanical {\em deductive} operators $F(A,B)=C$, where $A \tand B
\entails C$
\bigskip
We need {\em inductive} operators
\[ O(B,D) = h \mbox{ where } (\forall\langle x_{i},f(x_{i}) \rangle \in
D)\ (B \tand h \tand x_{i}) \entails f(x_{i}) \]
\slidehead{Induction as Inverted Deduction \red \ } \bk
Positives:
\bi
\item Subsumes earlier idea of finding $h$ that ``fits'' training data
\item Domain theory $B$ helps define meaning of ``fit'' the data
\[ B \tand h \tand x_{i} \entails f(x_{i}) \]
\item Suggests algorithms that search $H$ guided by $B$
\ei
\slidehead{Induction as Inverted Deduction \red \ } \bk
Negatives:
\bi
\item Doesn't allow for noisy data. Consider
\[ (\forall\langle x_{i},f(x_{i}) \rangle \in D)\ (B \tand h
\tand x_{i}) \entails f(x_{i}) \]
\item First order logic gives a {\em huge} hypothesis space $H$
\bi \item[$\ra$] overfitting...
\item[$\ra$] intractability of calculating all acceptable $h$'s
\ei
\ei
\slidehead{\large Deduction: Resolution Rule \red \ } \bk
\begin{tabular}{ccc}
\\
$P$ & $\tor$ & $L$ \\ $\neg L$ & $\tor$ & $R$ \\ \hline $P$ & $\tor$ & $R$ \\
\\
\end{tabular}
\be
\item
Given initial clauses $C_{1}$ and $C_{2}$, find a literal $L$ from clause
$C_{1}$ such that $\neg L$ occurs in clause $C_{2}$
\item
Form the resolvent $C$ by including all literals from $C_{1}$ and $C_{2}$,
except for $L$ and $\neg L$. More precisely, the set of literals occurring in
the conclusion $C$ is
\[ C = (C_{1} - \{L\}) \union (C_{2} - \{\neg L\}) \]
where $\union$ denotes set union, and ``$-$'' denotes set difference.
\ee
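The two steps above can be sketched in Python, with clauses represented as frozensets of literals and a negated literal written as a tagged pair (this encoding is an illustrative choice, not from the text):

```python
def resolve(c1, c2):
    """Propositional resolution: for each literal L of C1 whose negation
    occurs in C2, form the resolvent (C1 - {L}) union (C2 - {not L})."""
    def neg(lit):
        # negation as a ('not', P) pair; negating twice strips the tag
        return lit[1] if isinstance(lit, tuple) and lit[0] == 'not' else ('not', lit)
    resolvents = set()
    for lit in c1:
        if neg(lit) in c2:
            resolvents.add(frozenset((c1 - {lit}) | (c2 - {neg(lit)})))
    return resolvents
```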
\slidehead{\large Inverting Resolution \red \ } \bk
\centerline{\hbox{\psfig{file=./bookps/res-rule.ps,width=7in}}}
\slidehead{\large Inverted Resolution (Propositional) \red \ } \bk
\be
\item[\bf 1.]
Given initial clauses $C_{1}$ and $C$, find a literal $L$ that occurs in
clause $C_{1}$, but not in clause $C$.
\item[\bf 2.]
Form the second clause $C_{2}$ by including the following literals
\[\hskip-6pt C_{2} = (C - (C_{1} - \{L\})) \union \{\neg L\} \]
\ee
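A Python sketch of this inverse step, again with clauses as frozensets of literals and negation as a tagged pair (an illustrative encoding): for each literal $L$ of $C_1$ absent from $C$, one candidate parent $C_2$ results.

```python
def invert_resolution(c, c1):
    """Inverse propositional resolution: given resolvent C and parent C1,
    build each candidate C2 = (C - (C1 - {L})) union {not L} for a
    literal L occurring in C1 but not in C."""
    def neg(lit):
        return lit[1] if isinstance(lit, tuple) and lit[0] == 'not' else ('not', lit)
    candidates = []
    for lit in c1:
        if lit not in c:
            candidates.append(frozenset((c - (c1 - {lit})) | {neg(lit)}))
    return candidates
```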
\slidehead{\large First order resolution \red \ } \bk
First order resolution:
\be
\item
Find a literal $L_{1}$ from clause $C_{1}$, literal $L_{2}$ from clause
$C_{2}$, and substitution $\theta$ such that $L_{1} \theta = \neg L_{2}
\theta$
\item
Form the resolvent $C$ by including all literals from $C_{1}\theta$ and
$C_{2}\theta$, except for $L_{1} \theta$ and $\neg L_{2} \theta$. More
precisely, the set of literals occurring in the conclusion $C$ is
\[ C = (C_{1} - \{L_{1}\})\theta \union (C_{2} - \{L_{2}\})\theta \]
\ee
\slidehead{\large Inverting First order resolution \red \ } \bk
\[ C_{2} = (C - (C_{1} - \{L_{1}\})\theta_{1})\theta_{2}^{-1} \union \{\neg L_{1}
\theta_{1}\theta_{2}^{-1}\} \]
\slidehead{\large Cigol \red \ } \bk
\centerline{\hbox{\psfig{file=./bookps/res-rule2.ps,width=7in}}}
\slidehead{\large Progol \red \ } \bk
\pgm{Progol}: reduce combinatorial explosion by generating the most specific acceptable $h$
\be
\item
User specifies $H$ by stating predicates, functions, and forms of arguments
allowed for each
\item \pgm{Progol} uses sequential covering algorithm. \\
For each $\langle x_i, f(x_i) \rangle$
\bi \item Find most specific hypothesis $h_i$ s.t. $B \tand h_i \tand x_{i} \entails
f(x_{i})$
\bi \item actually, considers only $k$-step entailment \ei \ei
\item
Conduct general-to-specific search bounded by specific hypothesis $h_i$,
choosing hypothesis with minimum description length
\ee