\documentstyle[apalike]{article}
\makeindex

\newcommand{\beq}{\begin{equation}}
\newcommand{\eeq}{\end{equation}}
\newcommand{\SUM}{\displaystyle\sum}

\title{FORTRAN Programs for Discriminant Analysis}
\author{Linear, Quadratic and Logistic Discriminants}
\date{Tuesday 17th August, 1993}

\begin{document}

\begin{titlepage}
\maketitle
\end{titlepage}

\centerline{\Large Contact Name}
\bigskip
{\it Dr. R.J. Henery}
{\obeylines
Dept. of Statistics and Modelling Science
University of Strathclyde
Richmond Street
Glasgow G1 1XH
Tel: +44 41 552 4400
Fax: +44 41 552 4711
e-mail: bob@uk.ac.strathclyde.stams
or e-mail: cais05@uk.ac.strathclyde.vaxb
}


\section{INTRODUCTION}

\index{Fisher's linear discriminant}\index{Classical discrimination techniques}\index{Logistic discriminant}\index{Quadratic discriminant}\index{Misclassification costs}
\index{Bayes rule}
This section provides a very brief
introduction to the classical statistical discrimination techniques.  For 
more complete details
of the statistical theory involved the reader should consult 
McLachlan (1992)\nocite{McL92}.\\

\subsection{Notation}
\index{Training set}\index{Test set}\index{Categorical variables}\index{Missing values}
Throughout this paper, we adopt the following notation.   The training set consists
of examples drawn from $q$ known classes.   (Often
$q$ will be~2.)  There are $ n_i$ examples in the i'th class, and the total
number of examples in all classes is $ N$.
The values of $p$ numerical-valued attributes are
 known for each example, and these form the 
{\em attribute vector}\index{Attribute vector}.    The attribute vector {\bf x}
has components $(x_1,x_2,\ldots,x_p)$, although it is often convenient to adjoin
a constant term of unity to the attribute vector, $(1,x_1,x_2,\ldots,x_p)$, giving
an extended vector of length $p+1$.   Sample means are denoted by a bar:  for
example, the sample mean vector is $ {\bf \bar x}$.   The sample covariance
matrix for the i'th class is $ S_i$, which estimates the i'th class
population covariance matrix $ \Sigma_i$.

Linear discrimination is based on a 
linear combination $ \beta_i^T {\bf x}$ of the attributes, with each class $A_i$ having
its own vector of coefficients $ \beta_i$.

It should be noted that classical statistical methods require numerical attribute
vectors, and also require that none of the values is missing.  Where an 
attribute is categorical with two values, an indicator must be used, i.e.\ an
attribute which takes the value 1 for one category, and 0 for the other.
Where there are more than two categorical values, indicators are normally set up 
for each of the values.  However there is then redundancy among these new
 attributes and the normal procedure is to drop one of them.  In this way a 
single categorical attribute with $j$ values is replaced by $j\!-\!1$ 
attributes whose values are 0 or~1.   Where the
attribute values are ordered, it may be acceptable to use a single 
numerical-valued attribute.  Care has to be taken that the numbers used
reflect the spacing of the categories in an appropriate fashion.
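The indicator coding described above can be sketched as follows. This is an illustrative Python fragment, not part of the original FORTRAN programs; the function name and the colour example are hypothetical.

```python
# Replace a categorical attribute with j values by j-1 binary indicator
# attributes, dropping the last category to remove the redundancy.
def indicator_code(values, categories):
    """Map each categorical value to a list of j-1 0/1 indicators."""
    # One indicator per category except the last (the dropped reference level).
    return [[1 if v == c else 0 for c in categories[:-1]] for v in values]

colours = ["red", "green", "blue"]          # j = 3 categories
data = ["red", "blue", "green", "red"]
print(indicator_code(data, colours))
# "red" -> [1, 0], "green" -> [0, 1], "blue" -> [0, 0]
```

The dropped category ("blue" here) is represented by all indicators being zero.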


\section{LINEAR DISCRIMINANTS}\index{Maximum likelihood}
\label{section:ml}

The justification of the linear discriminant procedure depends on the 
assumption of normal
probability distributions, 
although there is a separate justification 
by least squares (see McLachlan (1992)\nocite{McL92} for example).  
It is assumed that 
the attribute vectors for examples of class~$A_i$ are independent and follow a 
certain probability 
distribution with probability density function (pdf)~$f_i$.  A new point with
attribute vector $\bf x$ is then assigned to that class for which the probability 
density function
 $f_i(\bf x)$ is greatest.
This is a {\em maximum likelihood} method.   The distributions are assumed
{\em normal} 
(or {\em Gauss\-ian})\index{Normal distribution}\index{Gaussian distribution}
 with different means but the same covariance matrix.  The probability density
 function of the normal distribution is
\beq{1\over \sqrt{ \det({2\pi\Sigma})}}
\exp{(-{1\over2}{(\bf x - \mu)}^T\Sigma^{-1}{(\bf x - \mu)})} ,\label{eq:norm}
\eeq
where ${\bf \mu}$ is a $p$-dimensional vector denoting the (theoretical) mean 
for a class and $\Sigma$, the (theoretical) covariance matrix\index{Covariance
matrix}, is a $p\times p$ (necessarily positive definite) matrix.  The (sample)
covariance matrix that we saw earlier is the sample analogue of this
covariance matrix, which is best thought of as a set of coefficients in the
pdf or a set of parameters for the distribution.
 This means that the points for the class  are distributed 
in a cluster centered at ${\bf \mu}$ of ellipsoidal shape described by $\Sigma$. 
 Each cluster has the same orientation and spread though their means will of
course be different.   (It should be noted that there is in theory no absolute 
boundary for the 
clusters but the contours for the probability density function have ellipsoidal
 shape.  In practice occurrences of examples outside a certain ellipsoid will
be extremely rare.)  In this case it can be shown that the boundary separating 
two classes, defined by equality of the two pdfs, is indeed a hyperplane and
it passes through the mid-point of the two centres. 
 Its equation is
$$ {\bf x}^T\Sigma^{-1}({\bf\mu}_1 - {\bf\mu}_2) -
 {1\over 2}({\bf\mu}_1 + {\bf\mu}_2)^T \Sigma^{-1} ({\bf\mu}_1 - {\bf\mu}_2)  =  0 ,$$
where ${\bf\mu}_i$ denotes the population mean for class $A_i$.
However in class\-ifi\-cation the exact
distribution is usually not known, and it becomes necessary to estimate the
 parameters for the distributions.  With two classes, if the sample means are
substituted for $\mu_i$ and the pooled sample covariance matrix for $\Sigma$, then
Fisher's linear discriminant is obtained. 

\index{Covariance matrix}\index{Pooled covariance matrix}

A practical complication is that for the algorithm to work the pooled sample
covariance matrix must be invertible.\index{Covariance matrix}
The {\em covariance matrix} for a dataset with $n_i$ examples from class~$A_i$,
 is 
$$S_i={1\over {n_i}}X^TX - {\bf \bar x}^T{\bf \bar x} ,$$
 where $X$ is the $n_i\times p$ matrix of attribute values, 
and ${\bf \bar x} $ is the $p$-dimensional
 row-vector of attribute means.   
The {\em pooled covariance matrix} is $\sum n_iS_i/(N-q)$
where the summation is 
over all the classes, and the divisor $N-q$ is chosen to make the pooled 
covariance matrix unbiased.  
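The two estimates just defined can be sketched in Python (the original programs are in FORTRAN; the function names and the toy samples below are purely illustrative):

```python
# Class covariance S_i = (1/n_i) X'X - xbar' xbar, and the unbiased
# pooled covariance sum(n_i * S_i) / (N - q), as defined in the text.
def class_cov(X):
    """Biased covariance matrix for one class, X given as a list of rows."""
    n, p = len(X), len(X[0])
    xbar = [sum(row[j] for row in X) / n for j in range(p)]
    return [[sum(row[j] * row[k] for row in X) / n - xbar[j] * xbar[k]
             for k in range(p)] for j in range(p)]

def pooled_cov(classes):
    """Pooled covariance over a list of class samples."""
    q = len(classes)
    N = sum(len(X) for X in classes)
    p = len(classes[0][0])
    S = [[0.0] * p for _ in range(p)]
    for X in classes:
        Si = class_cov(X)
        for j in range(p):
            for k in range(p):
                S[j][k] += len(X) * Si[j][k]
    return [[S[j][k] / (N - q) for k in range(p)] for j in range(p)]

A = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]   # toy class-1 sample
B = [[0.0, 1.0], [1.0, 0.0]]               # toy class-2 sample
print(pooled_cov([A, B]))
```

Note the $1/n_i$ (biased) form inside each class; the unbiasedness is recovered by the $N-q$ divisor in the pooling step.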


\subsection{More than two classes}
\index{Plug-in estimates}
When there are more than two classes, it is no longer possible to use a single
linear discriminant score to separate the classes.  The 
simplest procedure is to calculate a linear 
discriminant for each class, this discriminant being just the logarithm of 
the estimated probability density function for the appropriate class,
with constant terms dropped.   Sample values are substituted for population
values where these are unknown (this gives the ``plug-in'' estimates).    
Where the prior class proportions are unknown, they would be estimated by the
relative frequencies in the training set.   Similarly, the sample means and pooled
covariance matrix are substituted for the population means and covariance matrix.\\

Suppose the prior probability of class $A_i$ is $\pi_i$, and that $f_i({\bf x})$ is
the probability density of ${\bf x}$ in class $A_i$, here taken to be the normal
density given in equation~\ref{eq:norm}.   The joint probability of observing
class $A_i$ and attribute ${\bf x}$ is $ \pi_i f_i({\bf x}) $,
and the logarithm of this joint probability is

$$ \log \pi_i + {\bf x}^T\Sigma^{-1}{\bf\mu}_i -
{1\over 2}{\bf\mu}_i^T \Sigma^{-1} {\bf\mu}_i $$
to within an additive constant.  So the coefficients $\beta_i$ are given by 
the coefficients of $\bf x$

$$ \beta_i ~=~ \Sigma^{-1}{\bf\mu}_i $$
and the additive constant $\alpha_i$ by
 
$$ \alpha_i ~=~ \log \pi_i - {1\over 2}{\bf\mu}_i^T \Sigma^{-1} {\bf\mu}_i $$
though these can be simplified by subtracting the coefficients for the last class.\\

The above formulae are stated in terms of the (generally unknown) population 
parameters $\Sigma$, $\bf\mu_i$ and $\pi_i$.   To obtain the corresponding
``plug-in'' formulae, substitute the corresponding sample estimators:  $S$ for $\Sigma$;
${\bf \bar x}_i$ for $\bf\mu_i$;  and $p_i$ for $\pi_i$, where $p_i$ is the sample
proportion of class $A_i$ examples. 
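The plug-in rule above can be sketched as a short Python fragment (a stand-in illustration, not the FORTRAN program itself; the toy means, covariance and priors are hypothetical):

```python
# Plug-in linear discriminant score alpha_i + beta_i' x, with
# beta_i = S^{-1} xbar_i and alpha_i = log p_i - (1/2) xbar_i' S^{-1} xbar_i.
import math

def solve(S, b):
    """Solve S y = b by Gauss-Jordan elimination (S assumed invertible)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(S)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def lda_score(x, xbar_i, S, prior_i):
    """Linear discriminant score for one class."""
    beta = solve(S, xbar_i)                       # S^{-1} xbar_i
    alpha = math.log(prior_i) - 0.5 * sum(m * b for m, b in zip(xbar_i, beta))
    return alpha + sum(b * xi for b, xi in zip(beta, x))

S = [[1.0, 0.0], [0.0, 1.0]]                      # toy pooled covariance
means = [[0.0, 0.0], [2.0, 2.0]]                  # sample class means
priors = [0.5, 0.5]
x = [1.6, 1.6]
scores = [lda_score(x, m, S, p) for m, p in zip(means, priors)]
print(scores.index(max(scores)))                  # class with largest score
```

The point nearer the second class mean is duly assigned to class 2 (index 1).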

\subsection{FORTRAN program {\it discrim}}

This is the basis of the FORTRAN linear discriminant program {\it discrim}.
For notational simplicity, the constant terms are included in the coefficients
$ \beta$, with a dummy variable $x_0=1$ as the first attribute, and the 
old attribute vector $x~=~(x_1,...,x_p)$ replaced by the
attribute vector $x'~=~(1,x_1,...,x_p)$.   This allows us to write the linear
discriminant for the i'th class as
$$ \beta_i^T x' $$
where the first term in $ \beta_i$ is now the coefficient $ \alpha_i$.

Furthermore, to save storage, the discriminants are relative to the reference class, 
i.e. the coefficients $ \beta_q$ for the reference class are subtracted from
the $ \beta_i$ for the other classes.   In practice, only the constant terms 
are calculated for class $ q$, and these are subtracted from the other constant
terms.   The coefficients for the non-constant attributes $ \beta_i$ are found from
$$ \beta_i ~=~ \Sigma^{-1}({\bf\mu}_i-{\bf\mu}_q) $$
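A minimal sketch of the reference-class form of the coefficients (hypothetical Python, with a diagonal pooled covariance so the inverse is elementwise):

```python
# beta_i = S^{-1}(xbar_i - xbar_q): coefficients relative to reference class q.
xbar_1 = [3.0, 1.0]                      # mean of class 1
xbar_q = [1.0, 1.0]                      # reference-class mean
S_diag = [2.0, 4.0]                      # diagonal of a toy pooled covariance
beta_1 = [(m - r) / s for m, r, s in zip(xbar_1, xbar_q, S_diag)]
print(beta_1)                            # -> [1.0, 0.0]
```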

\subsection{Standard Errors of Coefficients}\index{Variance of linear discriminant coefficients}

The effect of removing an attribute from the discriminant procedure can
be judged from the standard error of the associated coefficient.   There are
two formulae suggested for this:  we adopted that proposed in 
Kendall {\it et al.}\ (1983)\nocite{kendall:1983}.   These formulae allow us to judge not only if
individual coefficients are significant, but, more usefully, if an attribute
contributes usefully to the discrimination as a whole by checking
the complete set of coefficients for that attribute in all classes.
Individual coefficients are tested by reference to the standard normal
distribution:  the effect of removing an attribute wholly is tested by
a chi-square statistic based on $q-1$ degrees of freedom.   The test statistics
for individual coefficients and complete attributes are written to the log file.\\

In passing we may note that the alternative formula for standard errors 
proposed by Rao (1946)\nocite{rao} is more often quoted in textbooks:  see for example
McLachlan (1992)\nocite{McL92} and Mardia {\it et al.}\ (1979)\nocite{mardia}.
In the two-class problem,
this formula is equivalent to the corresponding tests for the significance
of the attribute in a regression of class on attributes.  Generally, the 
Rao (1946)\nocite{rao} formula gives smaller variances.

\subsection{Singular Covariance Matrix}

For invertibility the attributes must
be linearly independent, which means that no attribute may be an exact linear 
combination of other attributes.   In order to
achieve this, some attributes may have to be dropped.   A singular value
decomposition of the pooled covariance matrix could be used to solve the
requisite equations, but it is generally more instructive to seek out
the offending attribute and remove it from the problem altogether.\\
 
Also, no attribute can be constant within each class, as this would give
a pooled variance of zero for that attribute.  Of
course an attribute which is constant within each class but not overall may be
an excellent discriminator.  However it will cause the linear discriminant 
algorithm to fail. 
This situation  can be treated 
by adding a small positive constant to the corresponding diagonal
element of the
pooled covariance matrix (which would require altering the FORTRAN program), 
or by adding random noise to the attribute before 
applying the algorithm (which would require altering the dataset).   See also
section \ref{regular:section}.

\section{QUADRATIC DISCRIMINANT}\index{Quadratic discriminant}\index{Quadiscr}\label{section:Quadisc}\index{Kurtosis}
\index{Quadratic functions of attributes}

Quadratic discrimination is based on the maximum likelihood argument with 
normal distributions and {\it differing} class covariance matrices.
Dropping the assumption of equal covariance matrices leads to a quadratic 
surface (e.g.\ ellipsoid, hyperboloid, etc.).  This type of
discrimination can deal with classifications where the set of attribute values
for one class to some extent surrounds that for another. 


\subsection{Comparison of Linear and Quadratic Discriminants}

 Clarke 
{\it et al.}\ (1979)\nocite{Clarke78} find that the 
quadratic discriminant procedure is robust to small departures from normality 
and that heavy kurtosis does not substantially reduce accuracy.
However, the number of parameters to be estimated becomes \hbox{$qp(p\!+\!1)/2$,}
and the difference between the variances would need to be considerable to justify
the use of this method, especially for small or moderate sized datasets 
(Marks \& Dunn, 1974)\nocite{Marks74}.   Occasionally, differences in the
covariances are of
scale only and some simplification may occur (Kendall {\it et al}., 1983)
\nocite{kendall:1983}.  
The linear discriminant is thought to remain effective if the departure from 
equality is small (Gilbert, 1969)\nocite{Gilb69}.   Some aspects of quadratic 
dependence may 
be included in the linear or logistic form (see below) by adjoining 
new attributes that are quadratic functions of the given attributes.  


\subsection{Quadratic discriminant - programming details}
\index{Quadratic discriminant}

The quadratic discriminant function is most simply defined as the logarithm 
of the appropriate probability density function, so that one quadratic discriminant is 
calculated for each class.   The procedure used is to take the logarithm of the
probability density function~\ref{eq:norm} and to substitute the sample means 
and covariance matrices in place of the population values, giving the 
so-called ``plug-in'' estimates.   Taking the logarithm of
equation~(\ref{eq:norm}), and allowing for differing prior class probabilities $\pi_i$,
we obtain 

$$ \log (\pi_i) ~-~ {1\over 2} \log ({\det(\Sigma_i)} )
 ~-~ {1\over2}{(\bf x - \mu_i)}^T\Sigma_i^{-1}{(\bf x - \mu_i)} $$
as the quadratic discriminant for class $A_i$.   Here it is understood that the suffix $i$ refers to the sample of
values from class $A_i$.

In classification, the quadratic discriminant is calculated for
each class and the class with the largest discriminant is chosen.  To find the
a posteriori class probabilities explicitly, the exponential of each
discriminant is taken and the resulting quantities are normalised to sum to
unity.  Thus the a posteriori class probabilities
$P(A_i|{\bf x})$ are given by 

$$P(A_i|{\bf x}) ~=~ \exp ( ~ \log (\pi_i) ~-~ {1\over 2} \log ({\det(\Sigma_i)} )
 ~-~ {1\over2}{(\bf x - \mu_i)}^T\Sigma_i^{-1}{(\bf x - \mu_i)} ) $$
apart from a normalising factor.

If there is a cost matrix, then,  
no matter the number of classes, the simplest procedure is to calculate 
the class probabilities $P(A_i|{\bf x})$ and associated expected costs 
explicitly, using the formulae of section (\ref{section:bayesx}).    
\index{Zero variance}
The most frequent problem with quadratic discriminants is caused when some
attribute has zero variance in one class, for then the covariance matrix
cannot be inverted.   One way of avoiding this problem is to add a small positive
constant term to the diagonal terms in the covariance matrix (this corresponds
to adding random noise to the attributes).   Another way, adopted in the
StatLog realisation, is to use some combination of the class covariance and
the pooled covariance. \\ 

Once again, the above formulae are stated in terms of the unknown population 
parameters $\Sigma_i$, $\bf\mu_i$ and $\pi_i$.   To obtain the corresponding
``plug-in'' formulae, substitute the corresponding sample estimators:  $S_i$ for $\Sigma_i$;
${\bf \bar x}_i$ for $\bf\mu_i$;  and $p_i$ for $\pi_i$, where $p_i$ is the sample
proportion of class $A_i$ examples. \\
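The plug-in quadratic discriminant and the normalised a posteriori probabilities can be sketched as follows. This is a hedged illustration, not the StatLog FORTRAN implementation: for brevity the class covariance matrices are taken diagonal, so the determinant and inverse are elementwise, and all numerical values are toy inputs.

```python
# Quadratic discriminant: log pi_i - (1/2) log det(Sigma_i)
#                         - (1/2)(x - mu_i)' Sigma_i^{-1} (x - mu_i),
# followed by exponentiation and normalisation to get P(A_i|x).
import math

def quad_score(x, mean, var_diag, prior):
    logdet = sum(math.log(v) for v in var_diag)
    maha = sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var_diag))
    return math.log(prior) - 0.5 * logdet - 0.5 * maha

def posteriors(x, means, vars_, priors):
    scores = [quad_score(x, m, v, p) for m, v, p in zip(means, vars_, priors)]
    mx = max(scores)                      # subtract max before exp, for safety
    w = [math.exp(s - mx) for s in scores]
    tot = sum(w)
    return [wi / tot for wi in w]

means = [[0.0], [3.0]]
vars_ = [[1.0], [4.0]]                    # differing class variances
priors = [0.5, 0.5]
print(posteriors([2.5], means, vars_, priors))
```

Subtracting the largest score before exponentiating leaves the normalised probabilities unchanged but avoids overflow.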

\subsection{Regularisation and Smoothed Estimates}\label{regular:section}\index{Regularisation}

The main problem with quadratic discriminants is the large number of parameters
that need to be estimated and the resulting large variance of the estimated
discriminants.   A related problem is the presence of zero or near zero 
eigenvalues of the sample covariance matrices.   Attempts to alleviate
this problem are known as regularisation methods, and the most practically useful
of these was put forward by 
Friedman (1989)\nocite{friedman89}, who proposed a compromise between linear and 
quadratic discriminants via a two-parameter family of estimates.   One parameter 
controls
the smoothing of the class covariance matrix estimates.   The smoothed
estimate of the class i covariance matrix is
$$ (1- \delta_i) S_i~+~ \delta_i S $$
where $ S_i$ is the class i sample covariance matrix and $ S$ is the pooled
covariance matrix.
When $ \delta_i$ is zero, there is no smoothing and the estimated class i covariance
matrix is just the i'th sample covariance matrix $ S_i$. 
When the $ \delta_i$
are unity, all classes have the same covariance matrix, namely the pooled
covariance matrix $ S$.   Friedman (1989)\nocite{friedman89} makes
the value of $ \delta_i$ smaller for classes with larger numbers.  For
the i'th sample with $ n_i$ observations:
$$ \delta_i ~=~ \delta (N-q) / \{ \delta (N-q) ~+~ (1- \delta) (n_i-1) \} $$
where $ N = n_1 + n_2 + \cdots + n_q $.
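Friedman's smoothed estimate can be sketched with scalar ``covariances'' to keep the illustration short (a hypothetical Python fragment, not the {\it quadiscr} code):

```python
# Smoothed class covariance (1 - delta_i) S_i + delta_i S, with
# delta_i = delta (N - q) / {delta (N - q) + (1 - delta)(n_i - 1)}.
def delta_i(delta, N, q, n_i):
    return delta * (N - q) / (delta * (N - q) + (1 - delta) * (n_i - 1))

def smoothed_cov(S_i, S, d_i):
    return (1 - d_i) * S_i + d_i * S

N, q = 100, 2
d_big_class = delta_i(0.5, N, q, n_i=50)    # larger class: smaller delta_i
d_small_class = delta_i(0.5, N, q, n_i=5)   # smaller class: more shrinkage
print(d_big_class, d_small_class)
print(smoothed_cov(2.0, 1.0, d_big_class))  # blend of S_i = 2 and S = 1
```

Note that $\delta=0$ gives $\delta_i=0$ (pure quadratic) and $\delta=1$ gives $\delta_i=1$ (pooled covariance, i.e.\ linear), matching the special cases listed below.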

The other parameter $\lambda$ is a (small) constant term that is added to 
the diagonals of
the covariance matrices:  this is done to make the covariance matrix non-singular,
and also has the effect of smoothing out the covariance matrices.   As we have already
mentioned in connection  
with linear discriminants, any singularity of the covariance matrix will cause
problems, and as there is now one covariance matrix for each class the likelihood
of such a problem is much greater, especially for the classes with small sample sizes.\\

This two-parameter family of procedures is described by 
Friedman (1989)\nocite{friedman89} as ``regularized discriminant analysis''.   
Various simple
procedures are included as special cases:   ordinary linear 
discriminants ($ \delta=1, \lambda=0$);
quadratic discriminants ($ \delta=0, \lambda=0$);  and the values $ \delta=1, \lambda=1$ correspond to a minimum Euclidean distance rule.

This type of regularisation has been
incorporated in the Strathclyde version of {\it quadiscr}.   Very little
extra programming effort is required.   However, it is up to the
user, by trial and error, to choose the values of $ \delta$ and $ \lambda$.\\

\subsection{Choice of Regularization Parameters}

The default values of $ \delta$ = 0 and $ \lambda$=0 were adopted for
the majority of StatLog datasets, the
philosophy being to keep the procedure ``pure'' quadratic.

The exceptions were those cases where
a covariance matrix was not invertible.   Non-default values were used
for the head injury dataset\index{head injury dataset} ($ \lambda$=0.05)
and the DNA dataset\index{DNA dataset} ($ \delta$=0.5 approx.).   
In practice, great improvements in the performance of quadratic discriminants
may result from the use of regularization, especially in the smaller datasets.

\section{LOGISTIC DISCRIMINANT}\index{Logistic discriminant}\index{Logdiscr}\label{section:Logdiscr}

Exactly as in Section~\ref{section:ml}, logistic regression operates by choosing a
hyperplane to separate the classes as well as possible, but the criterion
for a good separation is changed.  Fisher's linear
discriminant optimises a quadratic cost function whereas in
logistic discrimination it is a conditional likelihood that is maximised.
However, in practice, there is often very little difference between the two,
and the linear discriminants provide good starting values for the logistic.
Logistic discrimination is identical, in theory, to linear
discrimination for normal distributions with equal covariances, and also
for independent binary attributes, so the greatest differences between the
two are to be expected when we are far from these two cases, for example
when the attributes have very non-normal
distributions with very dissimilar covariances. \\

The method is only partially parametric, as
the actual pdfs for the classes are not modelled, but rather the ratios between
them.

Here the logarithms of the ratios of the probability density functions for
the classes  (or ``log odds'') are modelled as 
linear functions of the attributes.    Thus, for two classes,

$$\log{f_1({\bf x})\over f_2({\bf x})} = \alpha + {\bf\beta}'{\bf x} ,$$
where $\alpha$ and the $p$-dimensional vector ${\bf\beta}$ are the parameters 
of the model that are to be estimated.
The case of normal distributions with
equal covariance is a special case of this, for which the  parameters
 are functions of the class means and the common covariance matrix.
However the model covers other cases too, such as that where the attributes
are independent with values 0 or~1.
One of the attractions of considering the log odds is that the scale covers all
real numbers.  A large positive value indicates that class $A_1$ is likely, 
while a large negative value indicates that class $A_2$ is likely.


In practice the parameters are estimated by maximum {\em conditional} likelihood.  
The model implies that, given attribute values $\bf x$, the conditional class 
probabilities for classes $A_1$ and $A_2$ take the forms:

$$P(A_1|{\bf x}) ~=~ {\exp(\alpha+{\bf\beta}'{\bf x})\over1+\exp(\alpha+{\bf\beta}'{\bf x})}$$

$$P(A_2|{\bf x}) ~=~ { 1 \over1+\exp(\alpha+{\bf\beta}'{\bf x})}$$

respectively.
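The two conditional class probabilities above take one line each in Python (a sketch; the values of $\alpha$ and $\beta$ below are arbitrary illustrations, not fitted coefficients):

```python
# P(A_1|x) = exp(a + b'x) / (1 + exp(a + b'x)), and P(A_2|x) = 1 - P(A_1|x).
import math

def p_class1(x, alpha, beta):
    t = alpha + sum(b * xi for b, xi in zip(beta, x))
    return math.exp(t) / (1.0 + math.exp(t))

alpha, beta = -1.0, [2.0, 0.5]
x = [0.5, 0.0]
p1 = p_class1(x, alpha, beta)
p2 = 1.0 - p1                           # P(A_2|x); the two sum to one
print(p1, p2)
```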

Given independent samples from the two classes, 
the conditional likelihood for the parameters $\alpha$ and $\beta$ is defined to
be

$$
L(\alpha,\beta) ~=~ 
\prod\limits_{\{A_1 \hbox{ sample}\}} P(A_1|{\bf x}) 
\prod\limits_{\{A_2 \hbox{ sample}\}} P(A_2|{\bf x}) ,
$$
and the parameter estimates are the values that maximise
this likelihood.
They are found by iterative methods, as
proposed by Cox (1966)\nocite{Cox66}
and Day \& Kerridge (1967)\nocite{Day67}.   Logistic models belong to the class of
generalised linear models (GLMs), which generalise the use of linear regression 
models
to deal with non-normal random variables, and in particular to deal with
binomial variables.   In this context, the binomial variable is an indicator
variable that counts whether an example is class $A_1$ or not.  Many 
statistical packages (GLIM, Splus, ...) now include a GLIM function, 
enabling logistic
regression to be programmed easily,
in two or three lines of code. 
\index{GLIM}\index{Splus}\index{Generalised linear models (GLM)}\index{Linear regression}\index{Binomial}\index{Link function}\index{Reference class}\index{Indicator variables} 
The procedure is to define an indicator variable for class $A_1$ occurrences. 
The indicator variable is then declared to
be a ``binomial'' variable with the ``logit'' link function, and 
generalised regression performed on the attributes.   We used the package
Splus for this purpose.   This is fine for two classes, and has the merit 
of requiring little extra programming effort.   For more than two classes,
the complexity of the problem increases substantially, and, although it is
technically still possible to use GLIM procedures, the programming effort is
substantially greater and much less efficient.  
\index{Odds}
When there are more than two classes, one
class is taken as a reference class, and there are $q\!-\!1$ sets of parameters
for the odds of each class relative to the reference class.   To discuss this
case, we abbreviate the notation for $ \alpha+{\bf\beta}'{\bf x} $ to the simpler
$ {\bf\beta}'{\bf x} $.   For the remainder of this section, therefore, {\bf x}
is a $(p+1)$-dimensional vector with leading term unity, and the leading term in
$\beta$ corresponds to the constant $\alpha$.   


\index{Maximum likelihood}\index{Maximum conditional likelihood}
Again, the parameters are estimated by maximum conditional likelihood.  
Given attribute values $\bf x$, the conditional class probability for class 
$A_i$, where $i\ne q$, and the conditional class probability for $A_q$ take the forms:

$$P(A_i|{\bf x}) ~=~ {\exp({\bf\beta}_i'{\bf x})\over\SUM_{j=1,\ldots,q}\exp({\bf\beta}_j'{\bf x})}$$

$$P(A_q|{\bf x}) ~=~ { 1 \over\SUM_{j=1,\ldots,q}\exp({\bf\beta}_j'{\bf x})}$$
respectively, where the coefficients ${\bf\beta}_q$ for the reference class are fixed at zero.
Given independent samples from the $q$ classes, 
the conditional likelihood for the parameters $\beta_i$ is defined to
be
$$
L(\beta_1,\ldots,\beta_{q-1}) ~=~ 
\prod\limits_{\{A_1 \hbox{ sample}\}} P(A_1|{\bf x}) ~
\prod\limits_{\{A_2 \hbox{ sample}\}} P(A_2|{\bf x}) ~\cdots~
\prod\limits_{\{A_q \hbox{ sample}\}} P(A_q|{\bf x}).
$$
Once again, the parameter estimates are the values that maximise
this likelihood.
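The multi-class probabilities can be sketched as follows, with the reference-class coefficients fixed at zero (hypothetical Python; the coefficient values are arbitrary illustrations):

```python
# P(A_i|x) = exp(beta_i'x) / sum_j exp(beta_j'x), with beta_q = 0 for the
# reference class A_q.
import math

def class_probs(x, betas):
    """betas holds q-1 coefficient vectors; the reference class gets zeros."""
    linpreds = [sum(b * xi for b, xi in zip(beta, x)) for beta in betas]
    linpreds.append(0.0)                       # reference class A_q
    w = [math.exp(t) for t in linpreds]
    tot = sum(w)
    return [wi / tot for wi in w]

betas = [[1.0, 0.0], [0.0, 1.0]]               # q = 3 classes, p + 1 = 2
x = [1.0, 2.0]                                 # leading term unity
probs = class_probs(x, betas)
print(probs, sum(probs))
```

An example is then assigned to the class with the largest probability, which is equivalent to the log-odds rule described below.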

In the basic form of the algorithm an example is 
assigned to the class for which the estimated log odds  is greatest if that is
greater than~0, or to the 
reference class if all estimated log odds are negative. 

\index{Transformation of attributes}\index{Binary attributes}\index{Ordered categories}

More complicated models can be accommodated by adding transformations of the
given attributes.  
When categorical attributes with $r$~($>2$) values
occur, it may be necessary to convert them into \hbox{$r\!-\!1$} binary 
attributes before
using the algorithm, especially if the categories are not ordered.

\index{Products of attributes}
Anderson (1984)\nocite{And84} points out that it may be appropriate to include transformations
or products of the attributes in the linear function, but for large datasets
this may involve much computation.  See McLachlan (1992)\nocite{McL92}
for useful hints.
One way to increase 
complexity of model, without sacrificing intelligibility, is to add parameters
in a {\em hierarchical} fashion, and there are then links with {\em graphical 
models} and {\em Polytrees}.

\subsection{Logistic Discriminant - Programming details}
\index{Logistic discrimination - programming}

The maximum likelihood solution can be found via a Newton-Raphson iterative procedure,
as it is quite easy to write down the necessary derivatives of the likelihood
(or, equivalently, the log-likelihood).   The simplest starting procedure is to set the
$\beta_i$ coefficients to zero except for the leading coefficients ($\alpha_i$) 
which are set to the logarithms of the numbers in the various classes:
$$\alpha_i ~=~ \log n_i $$
where $n_i$ is the number of class $A_i$ examples.   This ensures that the
values of $\beta_i$ are those of the linear discriminant after the first 
iteration.   Of course, an alternative would be to use the linear
discriminant parameters as starting values.   In subsequent
iterations, the step size may occasionally have to 
be reduced, but usually the procedure converges in about 10 iterations.   This
is the procedure we adopted where possible.  
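The Newton-Raphson iteration can be sketched for the two-class case with a single attribute plus the constant term, so the Hessian is $2\times2$ and can be inverted directly. This is a hedged Python illustration of the iteration just described, not the StatLog FORTRAN code, and the toy data are hypothetical:

```python
# Newton-Raphson maximisation of the conditional log-likelihood for
# P(y=1|x) = 1 / (1 + exp(-(a + b x))).
import math

def fit_logistic(xs, ys, niter=10):
    a, b = 0.0, 0.0
    for _ in range(niter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g0 += y - p                          # gradient components
            g1 += (y - p) * x
            h00 += w                             # Hessian components
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det         # Newton step: H^{-1} g
        b += (h00 * g1 - h01 * g0) / det
    return a, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 1, 0, 1, 1]                          # overlapping toy classes
a, b = fit_logistic(xs, ys, niter=8)
print(a, b)                                      # b > 0: class 1 favoured for large x
```

With $a=b=0$ as starting values the first step reproduces a least-squares-like fit, echoing the remark above that the linear discriminant emerges after one iteration.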

However, each iteration requires a separate calculation of the
Hessian, and it is here that the bulk of the computational work is required.
The Hessian is a square matrix with $(q-1)(p+1)$ rows, and each term requires
a summation over all the observations in the whole dataset (although some
saving can be achieved using the symmetries of the Hessian).   Thus there
are of order $q^2p^2N$ computations required to find the Hessian matrix
at each iteration.   In the KL digits dataset, for example, $q=10$, $p=40$, and
$N=9000$, so the number of operations is of order $10^9$ in each iteration.   
In such cases, it is preferable to use a purely numerical search
procedure, or, as we did when the Newton-Raphson procedure was too
time-consuming, to use an approximate method based on an
approximate Hessian.   The approximation uses the fact that the Hessian
for the zero'th order iteration is simply a replicate of the design matrix
(cf. covariance matrix) used by the linear discriminant rule.   This
zero-order Hessian is used for all iterations.   In situations where there is
little difference between the linear and logistic parameters, the approximation
is very good and convergence is fairly fast (although a few more iterations
are generally required).   However, in the more interesting case that the
linear and logistic parameters are very different, convergence using this
procedure is very slow, and it may still be quite far from convergence
after, say, 100 iterations.   We generally stopped after 50 iterations:  
although
the parameter values were generally not stable, the predicted classes for
the data were reasonably stable, so the predictive power of the resulting
rule may not be seriously affected.   This aspect of logistic regression
has not been explored.


\subsection{Choice of convergence parameters}

There are two main parameters governing the iterative procedure:  {\it Niter}= 
maximum number of iterations;  and {\it B}= smallest change in deviance that 
will force the procedure to stop.   In principle, if {\it Niter} is infinite
and {\it B} is zero, the solution will converge to the Maximum Likelihood
solution.   In practice, a finite number of iterations is used:  we took
{\it Niter}=10.  Also, we deliberately chose a value of {\it B} which
ensures that the process can stop well short of the Maximum Likelihood
solution, and so adopted a value of {\it B} which, by normal standards, 
is very large:
$$ B~=~ d\,p(q-1) ,$$
where $p$ is the number of attributes, $q$ the number of classes and $d$ is
a constant (usually we took $d=1.0$).   Thus the default values were:
$$ Niter~=~10; ~~~~~~~~~~ d~=~1.0 .$$

There are two reasons for stopping short of the Maximum Likelihood solution:
(i) the iterative process is often very time consuming;  (ii) the
solution is often better in terms of accuracy of predictions when tested on
independent data.
   
\subsection{One-parameter family of linear discriminants}\index{Bias in logistic regression}

As we have observed empirically, the Maximum Likelihood solution of the
logistic regression equations may even give a rule that has lower accuracy
than the linear discriminant rule.   This paradoxical result arises for
two main reasons:  (i) the Maximum Likelihood estimates of the coefficients
are biased;  and (ii) from the viewpoint of minimising
misclassification error, too much emphasis is placed on points (e.g. outliers)
far from the decision boundary.   This is another way of saying that the Maximum Likelihood criterion results in overfitting the data.   In turn,
this suggests that we should be minimising
a quantity more directly related to misclassification error, and in this section
we propose a one-parameter family of estimates which does exactly that,
though the minimisation is over a single parameter only.   In the following,
we use the subscript LS to denote Least Squares (meaning the linear 
discriminant) and ML to denote Maximum Likelihood (meaning logistic regression).\\

Let $ \beta_{LS}$ and $\beta_{ML} $ denote the coefficients from linear and
logistic regression.   The suggestion is to use a linear combination of 
these two sets of coefficients:
$$ \beta(\gamma )~=~ (1- \gamma ). \beta_{LS}~+~ \gamma . \beta_{ML} $$
as linear predictor, the parameter $ \gamma $ being chosen to minimise
the error rate on an independent test set.    When $ \gamma =0$, we get
the LS solution, and when $ \gamma =1$ we get the ML solution. 
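The one-parameter family can be sketched as follows (illustrative Python; the two coefficient vectors and the toy data are arbitrary placeholders, not values fitted to any dataset):

```python
# beta(gamma) = (1 - gamma) beta_LS + gamma beta_ML, with gamma chosen by
# the error rate on held-out data.
def blend(beta_ls, beta_ml, gamma):
    return [(1 - gamma) * ls + gamma * ml for ls, ml in zip(beta_ls, beta_ml)]

def error_rate(beta, X, y):
    """Fraction misclassified by the sign of beta'x (class 1 if positive)."""
    errs = 0
    for x, yi in zip(X, y):
        pred = 1 if sum(b * xi for b, xi in zip(beta, x)) > 0 else 0
        errs += (pred != yi)
    return errs / len(y)

beta_ls, beta_ml = [-1.0, 1.0], [-3.0, 2.0]            # placeholder fits
X = [[1.0, 0.5], [1.0, 0.9], [1.0, 1.2], [1.0, 2.0]]   # leading 1 = constant
y = [0, 0, 1, 1]
for gamma in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(gamma, error_rate(blend(beta_ls, beta_ml, gamma), X, y))
```

In practice the grid of $\gamma$ values would be evaluated on an independent test set, as in the DNA example below.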

\subsubsection{Example on DNA data}\index{DNA dataset}

As an example of the phenomenon of overfitting in logistic regression, 
we ran the procedures {\it discrim} and {\it logdiscr} on the DNA data.
With the {\it logdiscr} parameters set to stop at eight iterations,
this resulted in a very low residual deviance but much overfitting.
Ordinary linear discriminants obtain an error rate of
5.9\% on the test data, whereas logistic regression gets an error rate of
about 7.0\%.   We say ``about'' because the iterative procedure was 
terminated before complete convergence.   Somewhere between the two solutions,
there is a more accurate rule, as is clear from table \ref{logist.tab},
which tabulates the error rate as a function of the mixing parameter $ \gamma $,
the relative distance from the linear towards the logistic solution.

% latex.table(x = logd.tab, cdec = c(2, 4, 4)) 
%
\begin{table}[hptb]
\begin{center}
\begin{tabular}{|c|c|c|} \hline
\multicolumn{1}{|c|}{$ \gamma $}&\multicolumn{1}{c|}{Training}&\multicolumn{1}{c|}{Test}\\ \hline
0.00&0.0340&0.0590\\ 
0.10&0.0260&0.0514\\ 
0.25&0.0165&0.0506\\ 
0.50&0.0050&0.0548\\ 
0.75&0.0000&0.0616\\ 
1.00&0.0000&0.0700\\ 
\hline
\end{tabular}
\caption{\label{logist.tab} Error rate for a linear combination of ordinary
linear and logistic discriminants, as tested on the training set of 2000 examples and a test set of 1186 examples.} 
\end{center}
\end{table}

With the DNA data, there are 2000 examples in the training set and 
1186 examples in the test set.   The above differences are not so significant
therefore, but they do illustrate the general point that the logistic
regression rule ($ \gamma =1$) is overtraining, with no errors in the training
data but 7\% errors in the test data.   This also suggests that a better
error rate is obtained if the iterative procedure is terminated earlier
rather than later.

\subsection{Standard errors of coefficients}\index{Variance of logistic regression coefficients}

One of the main advantages of the Maximum Likelihood method is that 
the variance of the parameters can be estimated, and this could be used
as a basis for eliminating variables that did not contribute usefully
to the regression.   One common way of dropping variables is through
some kind of stepwise selection procedure.   See McLachlan (1992)\nocite{McL92} for
more details.  \\

The variance formulae allow us to judge not only if
individual coefficients are significant, but, more usefully, if an attribute
contributes usefully to the discrimination as a whole by checking
the complete set of coefficients for that attribute in all classes.
Individual coefficients are tested by reference to the standard normal
distribution:  the effect of removing an attribute wholly is tested by
a chi-square statistic based on $q-1$ degrees of freedom.   The test statistics
for individual coefficients and complete attributes are written to the log file.\\
 


\bibliographystyle{apalike}
\bibliography{whole}
\input{programs.ind}

\end{document}
