\documentstyle{article}
\title{Discriminant Programs and Evaluation Assistant}
\author{Guide to classification trials}
\date{Tuesday 17th August, 1993}

\begin{document}
\maketitle
\centerline{\Large Contact Names}
\bigskip
{\it Dr. R.J. Henery}
{\obeylines
Dept. of Statistics and Modelling Science
University of Strathclyde
Richmond Street
Glasgow G1 1XH
Tel: +44 41 552 4400
Fax: +44 41 552 4711
e-mail: bob@uk.ac.strathclyde.stams
or e-mail: cais05@uk.ac.strathclyde.vaxb
}

\bigskip
{\sl Dr. J. Gama} 
{\obeylines
University of Porto 
Tel: +351 26001672
Fax: +351 26003654
e-mail: jgama@nccup.ctt.pt
}

\section{Introduction}

This document has two main purposes:  firstly to describe some FORTRAN programs
for discrimination, and secondly to describe how these programs may be embedded
in Evaluation Assistant to perform standard types of classification trials.

The programs and scripts are accompanied by a disclaimer:  the programs are not 
guaranteed to work, and neither the University of Strathclyde nor
the University of Porto can accept any responsibility for any errors
or omissions.

The first author would, nonetheless, be pleased to hear of any problems
in the running of these programs, although specialised questions relating
to Evaluation Assistant would be better aimed at Dr. J. Gama at the University
of Porto.  

\pagebreak

\section{Linear, Logistic and Quadratic Discrimination}

Firstly this document describes the operation of 
FORTRAN programs for Linear, 
Logistic and Quadratic Discrimination, namely {\it discrim.f}, {\it logdiscr.f}
and {\it quadiscr.f}.   In addition, there is a variant of Logistic
discrimination, {\it logxx.f}, that is recommended for extremely large datasets, 
so there are four FORTRAN programs that are capable of
giving discriminants:  {\it discrim.f}, {\it logdiscr.f}, {\it logxx.f} and 
{\it quadiscr.f}.   To try out these discriminating procedures on data with 
known classifications, two more FORTRAN programs are provided:  {\it lintest.f} 
for testing Linear and Logistic discriminants and {\it quadtest.f} for 
Quadratic discriminants.   
All FORTRAN programs are written in FORTRAN77 for Sun workstations. 
There are two ways of running them.  
\begin{itemize}
\item {\bf Manually}:  all necessary files 
are created by the user, and the training and test phases are treated 
separately;
\item {\bf Evaluation Assistant}:  a script file controls 
the creation of all files, and summary statistics
are produced automatically.   
\end{itemize}
Whichever method is adopted, the user 
must edit the appropriate FORTRAN files to enter various parameters:   
usually the number of attributes {\it nattrs}, number of classes 
{\it klass} and number of data {\it ndata}.   Occasionally, it may be necessary
to alter some default parameters controlling the number of iterations
in the Logistic programs, or change the parameter {\it delta} in {\it quadiscr}
from its default value.
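The edit itself is mechanical.  Assuming the parameters appear as FORTRAN
parameter statements of the form shown below (a hypothetical layout for
illustration only; consult the actual source files), a sed one-liner can set
them, e.g. for the vehicle dataset (18 attributes, 4 classes):

```shell
# Hypothetical excerpt of discrim.f - the real source layout may differ.
cat > params.f << 'EOF'
      parameter (nattrs=4)
      parameter (klass=2)
EOF

# Set nattrs=18 and klass=4 before compiling:
sed -e 's/nattrs=[0-9]*/nattrs=18/' \
    -e 's/klass=[0-9]*/klass=4/' params.f
```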

\subsection{Manual Operation - Learning Phase}
The program {\it discrim.f} must be edited to enter the two parameters:
 number of 
attributes {\it nattrs}, and number of classes {\it klass}.   The program 
must then be compiled via:
\begin{verbatim}
f77 -o discrim discrim.f
\end{verbatim}
This produces an executable program 'discrim', and, if the datafile
'disc.tra' contains a suitable dataset, the program may be executed via: 
\begin{verbatim}
discrim
\end{verbatim}
This produces a set of coefficients for the linear 
discriminant stored in a file called 'disc.beta'.   In addition a log file
is produced (discrim.log) containing miscellaneous information about
the running of the program.

\subsection{Manual Operation - Testing Phase}

The discriminant coefficients produced by {\it discrim} can be tested on
another suitable dataset 'disc.tes' containing test data with known classes using
another FORTRAN program 'lintest.f'.    As with {\it discrim.f}, {\it lintest.f}
must first be edited to enter the two parameters:
 number of 
attributes {\it nattrs}, and number of classes {\it klass}, and the program 
compiled via:
\begin{verbatim}
f77 -o tests lintest.f
\end{verbatim}
This produces an executable program 'tests', and, if the datafile
'disc.tes' contains the test dataset, the program may be executed via: 
\begin{verbatim}
tests
\end{verbatim}

\subsection{Learning and Testing Phases Together}

The most common trial of an algorithm consists of three steps:
\begin{itemize}
\item Learn the discriminant coefficients from the training dataset 'disc.tra'
\item Test these coefficients on the training dataset 'disc.tra'
\item Test these coefficients on an independent test dataset 'disc.tes'
\end{itemize}

The first and third steps were described in the previous two sections.   The second
step is the same as the third, with the training dataset copied to
the file 'disc.tes'.

Later, in the section on Evaluation Assistant, we will see that all three
operations can be accomplished with the single command:

\begin{verbatim}
tt discrim dataset > discrim.dataset.log
\end{verbatim}
and this is the method that we recommend.   Especially in trials involving
cross-validation, the Evaluation Assistant will be the simplest and most reliable
way of running a trial.   To perform 10-fold cross-validation, for example,
the Evaluation Assistant command is:
\begin{verbatim}
cv discrim dataset 10 > discrim.dataset.log
\end{verbatim}
Before describing Evaluation Assistant, we give a few more details of the 
other FORTRAN programs.

\subsection{Specimen commands for {\it logdiscr.f} and {\it logxx.f} - Manual}

The commands for Logistic discriminants are almost identical with those for 
Linear discriminants,  
with {\it discrim} replaced throughout by {\it logdiscr} or {\it logxx}.   The
sole exception is that the number of examples must be entered as an
additional parameter.   The format of commands for {\it logdiscr.f}
and {\it logxx.f} is identical, so we give only a summary of the commands for 
{\it logdiscr.f}.   As before, we assume that the training and test datasets
are contained in files 'disc.tra' and 'disc.tes' respectively.

\begin{itemize}
\item Edit {\it logdiscr.f}~ to change the {\bf three} parameters:  
number of examples {\it ndata} in the training set, number of 
attributes {\it nattrs}, and number of classes {\it klass}. 
\begin{quote}
{\bf WARNING}.   When deciding what to enter for the parameter {\it ndata} 
in logdiscr.f and logxx.f, it is the {\it number of data in the 
training dataset} disc.tra
that is relevant.   For cross-validation trials, this is {\bf NOT} the
number of data in the complete dataset.  For example, with 9-fold
cross-validation on a dataset with 846 examples, the training and test datasets
have 752 and 94 examples respectively, and the parameter {\it ndata} is 752.
\end{quote}
\item Compile the program:
\begin{verbatim}
f77 -o logdiscr logdiscr.f
\end{verbatim}

\item Execute the learning program:
\begin{verbatim}
logdiscr
\end{verbatim}

\item Edit {\it lintest.f}~ to change the two parameters:  number of 
attributes {\it nattrs}, and number of classes {\it klass};
\item Compile:
\begin{verbatim}
f77 -o tests lintest.f
\end{verbatim}

\item Execute the test program:
\begin{verbatim}
tests
\end{verbatim}

\end{itemize}
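The arithmetic behind the warning above is easily checked:  for N-fold
cross-validation on n examples, each cycle trains on roughly n - n/N examples.
A one-line awk check for the 846-example, 9-fold case quoted in the warning:

```shell
# Training and test set sizes per cross-validation cycle
# (n = 846 examples, N = 9 folds, as in the warning above).
awk 'BEGIN { n = 846; N = 9
             tes = int(n / N)    # examples held out for testing
             tra = n - tes       # examples used for training (= ndata)
             print "ndata =", tra, "test =", tes }'
```

which confirms that {\it ndata} must be set to 752, not 846.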

\pagebreak

\subsection{Specimen commands for {\it quadiscr.f} - Manual}

The commands for Quadratic discriminants are also very similar to those for 
Linear discriminants,  
with {\it discrim} replaced throughout by {\it quadiscr}.   The
principal difference is that the test program is now {\it quadtest.f}. 
The sequence of steps for Quadratic discriminants is:

\begin{itemize}
\item Edit {\it quadiscr.f~} to change the {\bf two} parameters:  number of 
attributes {\it nattrs}, and number of classes {\it klass}. 
\item Compile the program:
\begin{verbatim}
f77 -o quadiscr quadiscr.f
\end{verbatim}

\item Execute the learning program:
\begin{verbatim}
quadiscr
\end{verbatim}

\item Edit {\it quadtest.f} to change the two parameters:  number of 
attributes {\it nattrs}, and number of classes {\it klass}, and compile:
\begin{verbatim}
f77 -o tests quadtest.f
\end{verbatim}

\item Execute the test program:
\begin{verbatim}
tests
\end{verbatim}

\end{itemize}


\subsection{File conventions}

By and large, the programs that produce discriminant functions are written in such a way
that all necessary input and output files have fixed names, so that it is only
necessary, when running the programs, to copy datafiles to standard files. 
The output is also to be found in files with fixed names.   For example, the
training data must always be in the file 'disc.tra' and the test data
in the file 'disc.tes'.   Also, the cost matrix, if supplied, must be in the
file 'cost.mtx'.   Table \ref{files.learn} gives a list of all files that
are required by each of the four FORTRAN discriminant programs.   In normal
operation, the user need not take any action on these output files except
to avoid giving these names to other files.
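Because all the file names are fixed, a small pre-flight check before a manual
run can save a wasted compile.  The sketch below is a convenience for the user,
not part of the distributed scripts:

```shell
# Check the fixed-name input files for a manual linear-discriminant run:
# disc.tra (training) and disc.tes (test) are read by the learning and
# test programs; cost.mtx is optional, so its absence is merely reported.
for f in disc.tra disc.tes; do
    [ -f "$f" ] || echo "missing required file: $f"
done
[ -f cost.mtx ] || echo "no cost.mtx - default cost matrix will be used"
```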

% latex.table(x = t1) 
%
\begin{table}[hptb]
\begin{center}
\begin{tabular}{|l|l|l|} \hline
\multicolumn{1}{|c|}{Procedure}&\multicolumn{1}{c|}{Input files}&\multicolumn{1}{c|}{Output files}\\ \hline
discrim.f~~~~~~~~~~~~~~&disc.tra~~~~~~~~~~~~~~~&disc.beta~~~~~~~~~~~~~~\\ 
~~~~~~~~~~~~~~~~~~~~~~~&~~~~~~~~~~~~~~~~~~~~~~~&discrim.log~~~~~~~~~~~~\\ \hline
logdiscr.f~~~~~~~~~~~~~&disc.tra~~~~~~~~~~~~~~~&disc.beta~~~~~~~~~~~~~~\\ 
~~~~~~~~~~~~~~~~~~~~~~~&discrim.beta~(optional)&logdiscr.log~~~~~~~~~~~\\ \hline
logxx.f~~~~~~~~~~~~~~~~&disc.tra~~~~~~~~~~~~~~~&disc.beta~~~~~~~~~~~~~~\\ 
~~~~~~~~~~~~~~~~~~~~~~~&~~~~~~~~~~~~~~~~~~~~~~~&logxx.log~~~~~~~~~~~~~~\\ \hline
quadiscr.f~~~~~~~~~~~~~&disc.tra~~~~~~~~~~~~~~~&quad.freq~~~~~~~~~~~~~~\\ 
~~~~~~~~~~~~~~~~~~~~~~~&quad.delta~(optional)~~&quad.mean~~~~~~~~~~~~~~\\ 
~~~~~~~~~~~~~~~~~~~~~~~&~~~~~~~~~~~~~~~~~~~~~~~&quad.covar~~~~~~~~~~~~~\\ 
~~~~~~~~~~~~~~~~~~~~~~~&~~~~~~~~~~~~~~~~~~~~~~~&quad.covinv~~~~~~~~~~~~\\ 
~~~~~~~~~~~~~~~~~~~~~~~&~~~~~~~~~~~~~~~~~~~~~~~&quad.determ~~~~~~~~~~~~\\ 
~~~~~~~~~~~~~~~~~~~~~~~&~~~~~~~~~~~~~~~~~~~~~~~&quad.alpha~~~~~~~~~~~~~\\ 
\hline
\end{tabular}
\caption{\label{files.learn} Input and output files for the 
discriminant programs.}
\end{center}
\end{table}
These programs produce sets of coefficients that form the basis of their
respective discriminant procedures.   To test the procedure on a test dataset,
two FORTRAN programs are supplied:  {\it lintest.f} for discrimination based
on linear combinations of attributes (i.e. {\it discrim}, {\it logdiscr} and {\it logxx});  and {\it quadtest.f} for quadratic discrimination (i.e. for {\it quadiscr}).
These programs take the respective output files of the discrimination programs
and use them to construct the discriminants that are to be tested on the
test data.   The test data must be in the file 'disc.tes'.   A cost
matrix, if supplied, must be in the file 'cost.mtx':  if no cost.mtx file exists,
the default cost matrix is created and written into 'cost.mtx'.   Both
{\it lintest} and {quadtest} produce a file 'disc.conf' containing
the confusion matrix for the test set.   Input files
required for these programs are listed in table \ref{files.test}:  these
are all produced automatically by the preceding discriminant procedures.

% latex.table(x = t2) 
%
\begin{table}[hptb]
\begin{center}
\begin{tabular}{|l|l|l|} \hline
\multicolumn{1}{|c|}{Testing Program}&\multicolumn{1}{c|}{Input files}&\multicolumn{1}{c|}{Output files}\\ \hline
lintest.f~~~~~~~~~~&disc.tes~~~~~~~~~~~&disc.conf~~~~~~~~~~\\ 
~~~~~~~~~~~~~~~~~~~&disc.beta~~~~~~~~~~&~~~~~~~~~~~~~~~~~~~\\ 
~~~~~~~~~~~~~~~~~~~&cost.mtx~(optional)&~~~~~~~~~~~~~~~~~~~\\ \hline
quadtest.f~~~~~~~~~&disc.tes~~~~~~~~~~~&disc.conf~~~~~~~~~~\\ 
~~~~~~~~~~~~~~~~~~~&quad.freq~~~~~~~~~~&~~~~~~~~~~~~~~~~~~~\\ 
~~~~~~~~~~~~~~~~~~~&quad.mean~~~~~~~~~~&~~~~~~~~~~~~~~~~~~~\\ 
~~~~~~~~~~~~~~~~~~~&quad.covinv~~~~~~~~&~~~~~~~~~~~~~~~~~~~\\ 
~~~~~~~~~~~~~~~~~~~&quad.determ~~~~~~~~&~~~~~~~~~~~~~~~~~~~\\ 
~~~~~~~~~~~~~~~~~~~&quad.alpha~~~~~~~~~&~~~~~~~~~~~~~~~~~~~\\ 
~~~~~~~~~~~~~~~~~~~&cost.mtx~(optional)&~~~~~~~~~~~~~~~~~~~\\ 
\hline
\end{tabular}
\caption{\label{files.test} Input and output files for the 
test programs.}
\end{center}
\end{table}

\section{Evaluation Assistant}\index{Evaluation Assistant}\index{EA}

Manual operation of the discriminant procedures is possible for simple
datasets, but for the sake of standardisation and reliability it is
recommended that these procedures be operated automatically via
Evaluation Assistant, a suite of shell scripts and tools that facilitates 
the testing of learning 
algorithms and provides standardized performance measures. 
Two versions of Evaluation Assistant exist:
\begin{itemize}
\item Command version (EAC)
\item Interactive version (EAI)
\end{itemize}
The command version of Evaluation Assistant (EAC) consists of a set of basic 
commands that enable the user to test learning algorithms. This version is 
implemented as a set of Cshell scripts and C programs.   The version
of EAC that is
available from the Strathclyde ftp site has been modified to run with the
four procedures {\it discrim}, {\it logdiscr}, {\it logxx} and {\it quadiscr}.

The interactive version of Evaluation Assistant (EAI) provides an interactive 
interface that enables the user to customize the various EAC scripts via
an interactive menu-based interface.    The various scripts
can be examined and modified before execution.   The interactive 
version is implemented in C and the interactive interface exploits X 
windows, and will  
run on SUN SPARCstation IPC and other compatible workstations.    
The interactive version is not discussed further in this document:  further
details are available from Dr. J. Gama of the University of Porto.   

This section provides more details about the Strathclyde variant of the
Command Version of Evaluation Assistant (EAC).   
This is set up to provide a common environment for three basic types
of trial: 

\begin{enumerate}
\item Learn-and-test ({\it lt}).   Learn the rules from the training dataset
'disc.tra' and test these rules on an independent dataset 'disc.tes'.
\item Train-and-test ({\it tt}).   Learn the rules from the training dataset
'disc.tra' and test these rules on BOTH the training dataset 'disc.tra'
and an independent dataset 'disc.tes'.
\item N-fold Cross-validation ({\it cv}).   Split the provided dataset into 
N equal portions, and use (N-1) of the portions as the training data and
the remaining portion as test data in a Train-and-test trial, using
each of the N portions as test set in one of the N cross-validation cycles.
\end{enumerate}

\subsection{Learn-and-test {\it lt}}\label{lt.sec}

The procedure {\it learn-and-test} or {\it lt}
is required on only some datasets, and therefore may be omitted on first reading. 
In the StatLog project, variant {\it lt} was used as a prelude to the main 
trials where some parameter had to be tuned by testing on a test set.  This
is illustrated by the {\it quadiscr} trial on the DNA dataset, in which
a parameter {\it delta} has to be chosen to optimise the error rate. \\

The {\it quadiscr} procedure needs the parameter {\it delta} to be specified
before it can run.   Unfortunately, on the DNA dataset, the
default value of {\it delta}=0 produces a singularity, and some other
value of {\it delta} must be chosen, more or less by trial and error.
An unbiased method for choosing a value for {\it delta} is to divide
the {\it training} dataset into two parts:  one for learning the rule assuming
{\it delta} is known, and the other for assessing the accuracy for this
value of {\it delta}.   For example the DNA dataset has a training dataset
of 2000 examples and a test set of 1186 examples.   The training set is
further split into a learning set of 1800 examples and what we will describe as
a 'proof' set of 200 examples.   The error rate in the 'proof' set is determined
for a range of values of {\it delta} applied to the learning set, and 
a value chosen to give the lowest error rate in the 'proof' set.
   Before giving
a specimen output from the {\it lt} command, let us see the end result 
of this procedure.   This is shown in table \ref{quad.delta} where
the error rate in the test set is tabulated against the parameter
{\it delta}.   From this table, it appears that {\it delta} should be chosen
to be {\it delta}=0.700, giving an error rate of 0.025 in the 'proof' set.

% latex.table(x = quad.tab) 
%
\begin{table}[hptb]
\begin{center}
\begin{tabular}{|c|c|c|} \hline
\multicolumn{1}{|c|}{delta}&\multicolumn{1}{c|}{Error rate}&\multicolumn{1}{c|}{No. examples}\\ \hline
~~0.000&~~NA~~~&200\\ 
~~0.010&~~0.305&200\\ 
~~0.100&~~0.210&200\\ 
~~0.200&~~0.120&200\\ 
~~0.400&~~0.035&200\\ 
~~0.500&~~0.030&200\\ 
~~0.600&~~0.035&200\\ 
~~0.700&~~0.025&200\\ 
\hline
\end{tabular}
\caption{\label{quad.delta} Variation of error rate as the parameter {\it delta}
is varied.   Optimal choice of {\it delta} is near 0.700.}
\end{center}
\end{table}
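The selection step in this tuning procedure is simply an argmin over the
'proof'-set error rates.  A one-line awk illustration, using the delta/error
pairs from the table above together with the {\it delta}=0.700 value quoted in
the text ({\it delta}=0.000 is excluded as it produces a singularity):

```shell
# Pick the delta with the lowest 'proof'-set error rate.
awk 'NR == 1 || $2 < best { best = $2; d = $1 }
     END { printf "best delta = %.3f (error %.3f)\n", d, best }' << 'EOF'
0.010 0.305
0.100 0.210
0.200 0.120
0.400 0.035
0.500 0.030
0.600 0.035
0.700 0.025
EOF
```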

The first step in the {\it lt} procedure is to generate the 'learning' 
and 'proof'
sets using the 'splitc' command from EA (see later for further details of 
splitc):
\begin{verbatim}
splitc dna.tra 10 1
\end{verbatim}
which generates two files:  dna.tra.tra, which is the 'learning' file;  and
dna.tra.tes, which is the 'proof' file.
Each line in table \ref{quad.delta} was obtained
by a single command of the form
\begin{verbatim}
lt quadiscr dna.tra > quadiscr.dna.tra.log
\end{verbatim}
which does the learn-and-test procedure.   The following subsection shows the output for a typical run with {\it delta}=0.500.   

\subsubsection{Specimen output from {\it lt} command}

This represents the output from a run of {\it quadiscr} on the 'learning'
dataset 'dna.tra.tra' with the accuracy tested on the 'proof' dataset
'dna.tra.tes'.   The workstation used was ``cochran''.

\begin{verbatim}
cochran

Tue Aug 17 15:28:12 BST 1993
Train then classify test set only
Algorithm         quadiscr
Dataset           dna.tra
Algorithm quadiscr  - Learning phase
Dataset   dna.tra.tra
 Number of data             1800
 Number of attributes       180
 Number of classes          3
  Delta =     0.500
Number of data        1800
Number of attributes  180
Number of classes     3
305.2 2.6 263
Learn_Time:
time(secs)  memory
cpu  user   kBytes
305.6 3.8 263
Algorithm quadiscr - testing phase
Dataset   dna.tra.tes
 Total costs:       6.0000     average cost:      0.03000
213.1 0.8 163
      Test_time:                      TEST_DATA:
  cpu time   user time    memory
   seconds    seconds     kBytes
213.5 1.5 163
MATRIX_TEST:
	[1]	[2]	[3]
[1]	47	0	1
[2]	0	41	1
[3]	2	2	106
SUCCESS_RATE: 0.970000  (194/200)
LOSS: 6
AVERAGE_LOSS: 0.030000
_____________________________
Tue Aug 17 15:37:08 BST 1993

\end{verbatim}

From this output, the key point is that the average loss (error rate) is
0.030 when the parameter {\it delta} is 0.500.   This gives us one line in
table \ref{quad.delta}.

\subsection{Train-and-test {\it tt}}

The command {\it tt} may be used to conduct a trial in which
\begin{itemize}
\item the rules are learnt from the training data 'dataset.tra'
\item the rules are tested on the training data 'dataset.tra'
\item the rules are tested on a second dataset 'dataset.tes'
\end{itemize}

The main point here is that the three phases above are timed in a standard way,
and memory usage also recorded in a standard way.   As an example,
suppose that {\it quadiscr} is to be tested via the {\it train-and-test} 
procedure on the DNA dataset, with the
training data in 'dna.tra' and the test data in 'dna.tes'.   This is done by
the EAC command
\begin{verbatim}
tt quadiscr dna > quadiscr.dna.log
\end{verbatim}
The output is given below.

\subsubsection{Specimen output from {\it tt}}

\begin{verbatim}

cochran

Tue Aug 17 15:38:35 BST 1993
Simple train then classify training and test sets
Algorithm         quadiscr
Dataset           dna
Algorithm quadiscr  - Learning phase
Dataset   dna.tra
 Number of data             2000
 Number of attributes       180
 Number of classes          3
  Delta =     0.500
Number of data        2000
Number of attributes  180
Number of classes     3
334.7 4.1 262
Learn_Time:
time(secs)  memory
cpu  user   kBytes
335.2 5.1 262
Algorithm quadiscr - testing phase
Dataset   dna.tra
 Total costs:       5.0000     average cost:      0.00250
2014.2 5.5 165
      Test_time:                      TRAIN_DATA:
  cpu time   user time    memory
   seconds    seconds     kBytes
2014.7 6.6 165
MATRIX_TRAIN:
	[1]	[2]	[3]
[1]	464	0	0
[2]	0	484	1
[3]	0	4	1047
SUCCESS_RATE: 0.997500  (1995/2000)
LOSS: 5
AVERAGE_LOSS: 0.002500
Algorithm quadiscr - testing phase
Dataset   dna.tes
 Total costs:      55.0000     average cost:      0.04637
1217.1 7.7 165
      Test_time:                      TEST_DATA:
  cpu time   user time    memory
   seconds    seconds     kBytes
1217.6 8.7 165
MATRIX_TEST:
	[1]	[2]	[3]
[1]	283	7	13
[2]	4	263	13
[3]	8	10	585
SUCCESS_RATE: 0.953626  (1131/1186)
LOSS: 55
AVERAGE_LOSS: 0.046374
_____________________________
Tue Aug 17 16:41:26 BST 1993

\end{verbatim}

Some key points in the above output are now mentioned.   
For example, the time to
learn the rules for {\it quadiscr} on the DNA dataset 'dna.tra' is 335.2
seconds (cpu time) and 5.1 seconds (user time), giving an approximate
total time to learn of 340.3 seconds.   The maximum memory usage during
the learning phase was 262 kBytes.   Classifying the training data took
2014.7 seconds (cpu) and the test data 1217.6 seconds (cpu).
The error rate (average loss) of {\it quadiscr} was
0.0025 on the training data 'dna.tra' and 0.046374 on the test data 'dna.tes'.
The final confusion matrix is fairly symmetric.   For example, there 
were 4 examples 
in which
a true class [2] was classified as class [1], and 7 examples in which
a true class [1] was classified as class [2].

\subsection{Cross-validation {\it cv}}

The main command of the Evaluation Assistant is ``cv''. It calls the 
specified learning algorithm and performs cross-validation tests on the given 
data set. To invoke it the user needs to type:

\begin{verbatim}
     cv   Algorithm_name   Dataset   Nr_Splits  >  cv.log 
\end{verbatim}
where ``cv'' is the name of the main script file which takes 
``Algorithm\_name'', ``Dataset'' and ``Nr\_Splits'' as parameters. The additional 
parameter ``Nr\_Splits'' determines the required number of cycles for cross 
validation.  
The output of the ``cv'' script may be redirected to the output file ``cv.log''. 
The form of 
the output is described later in more detail.
 
The user may adjust this script (and other sub-scripts) to suit the 
particular requirements of the learning algorithm used, for example by providing
the value of an additional parameter at run time.

Evaluation Assistant provides not only cross-validation, but also
a script file ``results'' for generating the required statistics such as average 
error-rate, time
to learn, etc.  This processes the log-file produced by ``cv'' and produces
a summary file containing the essential information describing the overall results 
of the cross-validation trial.   It is necessary to specify the cost matrix to be used
when running ``results'', so ``results'' is invoked by

\begin{verbatim}
     results cv.log cost.mtx > cv.results
\end{verbatim}
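The summary statistics produced by ``results'' are ordinary per-column means
and (sample) standard deviations over the cross-validation rounds.  As an
illustration of the computation only (the three per-round cpu times below are
invented, not taken from any real log):

```shell
# Mean and sample standard deviation of a column of per-round cpu times.
awk '{ s += $1; ss += $1 * $1; n++ }
     END { m = s / n
           sd = sqrt((ss - n * m * m) / (n - 1))   # sample std deviation
           printf "mean %.3f  sd %.3f\n", m, sd }' << 'EOF'
2.9
3.0
3.0
EOF
```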

\subsection{Example}

As an example, the linear discriminant program ``discrim''
can be run on the vehicle dataset using 9-fold cross validation, and the
overall statistics generated by the following commands: 

\begin{verbatim}
   cv discrim vehicle 9 > discrim.vehicle.log
   results discrim.vehicle.log cost.mtx > discrim.vehicle.results
\end{verbatim}
   
\subsection{Data Formats and Other Prerequisites}

The dataset supplied must conform to the StatLog format, that is, each line must contain
all the attributes, plus the class value in the last column, with 
all values separated by a space. 
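Conformance to this format is easy to verify before a run.  For example, for a
dataset with 18 attributes (so 19 space-separated fields per line, the last
being the class), a one-line awk filter reports any offending lines:

```shell
# Report lines of a StatLog-format file that do not have
# nattrs+1 = 19 whitespace-separated fields (18 attributes + class).
awk 'NF != 19 { print "line " NR ": " NF " fields" }' << 'EOF'
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1
1 2 3 4
EOF
```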

The dataset may be 
accompanied by a cost matrix. The value in row i and column j 
specifies the cost of 
misclassifying a true class i as class j. 
The cost matrix supplied must be stored in a file cost.mtx.  If the file cost.mtx
is not supplied, the discriminant test procedures will generate a file cost.mtx 
containing the default matrix:  a matrix 
of size $q \times q$ containing 1's in all places except the positions on 
the diagonal, which contain 0's.  The file cost.mtx would then contain the
following entries:

\begin{verbatim}
        0    1    ..    1
        1    0    ..    1
        ..   ..   ..    ..
        1    1    ..    0
\end{verbatim}
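Generating this default matrix in a script is straightforward.  The sketch
below mimics what the test programs write when no cost.mtx is supplied (it is
an illustration, not the actual FORTRAN code), here for $q=4$:

```shell
# Write the default q x q cost matrix: 0 on the diagonal, 1 elsewhere.
q=4
awk -v q="$q" 'BEGIN {
    for (i = 1; i <= q; i++) {
        for (j = 1; j <= q; j++)
            printf "%s%d", (j > 1 ? " " : ""), (i == j ? 0 : 1)
        printf "\n"
    }
}' > cost.mtx
cat cost.mtx
```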
The learning algorithm must generate a confusion matrix in the standard 
StatLog format, with the true classification given in the row and 
the allocated classification by the column: 
\begin{verbatim}

      [1]   [2]    ..   [q]
[1]    xx    xx    ..    xx
[2]    xx    xx    ..    xx
       ..    ..    ..    ..
[q]    xx    xx    ..    xx
\end{verbatim}
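Given a confusion matrix in this format, the summary figures follow directly:
the correct classifications are the diagonal entries, and with the default
0/1 cost matrix the loss is simply the off-diagonal total.  A short awk
illustration, using the 3-class MATRIX\_TEST from the {\it lt} specimen output
earlier (194 of 200 'proof'-set examples correct):

```shell
# SUCCESS_RATE and LOSS from a confusion matrix (true class = row,
# allocated class = column), assuming the default 0/1 cost matrix.
awk '{ for (j = 1; j <= NF; j++) { tot += $j; if (j == NR) diag += $j } }
     END { printf "SUCCESS_RATE: %.6f  (%d/%d)\nLOSS: %d\n",
                  diag / tot, diag, tot, tot - diag }' << 'EOF'
47 0 1
0 41 1
2 2 106
EOF
```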

\subsection{Output of cv}

The cross-validation program creates a LOG file containing relevant information 
concerning every round in the cross-validation procedure:  time to learn the rules;
memory required in learning;  time and memory required to classify the training set;
time and memory required to classify the test set;  and the confusion matrix for
both training and test sets.  This LOG file
may be processed by the script file ``results'' to create a summary file
containing statistics describing the overall result of the trial.
As an example, we give a portion of the log file generated by the command
\begin{verbatim}
cv discrim vehicle 9 > discrim.vehicle.log
\end{verbatim}
In the log file 'discrim.vehicle.log', the first entry
is the hostname of the Sun workstation (here ``cochran'').

\subsubsection{Specimen output (extract) of log file generated by ``cv''}

\begin{verbatim}

cochran

Algorithm  quadiscr
Dataset    vehicle
Nr. splits 9
Fri Aug 13 11:43:10 BST 1993
Cross Validation Round: 1
Algorithm quadiscr  - Learning phase
Dataset   vehicle.tra
  Delta =     0.000
Number of data        752
Number of attributes  18
Number of classes     4
2.5 0.2 74
Learn_Time:
2.9 0.9 74
2.9 1.0 74
TIME_LEARN:
3.3 1.5 74
Algorithm quadiscr - testing phase
Dataset   vehicle.tra
 Total costs:      65.0000     average cost:      0.08644
11.3 0.2 68
Test_time:
11.7 0.8 68
11.7 0.9 68
TIME_TRAIN:
12.1 1.5 68
MATRIX_TRAIN:
       [1]   [2]  [3]  [4]
[1]    160    27    0    5
[2]     28   166    1    2
[3]      0     0  190    2
[4]      0     0    0  171
SUCCESS_RATE: 0.913564  (687/752)
LOSS: 65
AVERAGE_LOSS: 0.086436
Algorithm quadiscr - testing phase
Dataset   vehicle.tes
 Total costs:      13.0000     average cost:      0.13830
1.6 0.1 70
Test_time:
2.1 0.8 70
2.1 0.9 70
TIME_TEST:
2.5 1.4 70
MATRIX_TEST:
      [1]   [2]   [3]   [4]
[1]    14     6    0    0
[2]     3    15    1    1
[3]     0     0   25    1
[4]     1     0    0    27
SUCCESS_RATE: 0.861702  (81/94)
LOSS: 13
AVERAGE_LOSS: 0.138298
_____________________________
Cross Validation Round: 2
\end{verbatim} 
etc. etc.

From this log file we can extract the following information.   
\begin{enumerate}
\item For the first round of cross-validation, the total time to learn the
rules was 3.3+1.5 = 4.8 seconds, and the maximum memory used was 74 kbytes.
\item An error rate of 0.086436 was achieved on the training data ``vehicle.tra''.
\item An error rate of 0.138298 was achieved on the test data ``vehicle.tes''.
\item The total time to classify the test set data was 2.5+1.4 = 3.9 seconds.
\end{enumerate}
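Pulling such figures out of the log file is a matter of pattern matching, and
``results'' does essentially this.  A grep/awk-style sketch over a fragment of
log (the line after ``TIME\_LEARN:'' holds the cpu time, user time and memory
columns):

```shell
# Print the line following "TIME_LEARN:" as a summed learning time.
awk 'grab { printf "learn time %.1f s, memory %d kBytes\n", $1 + $2, $3
            grab = 0 }
     /^TIME_LEARN:/ { grab = 1 }' << 'EOF'
Learn_Time:
2.9 0.9 74
TIME_LEARN:
3.3 1.5 74
EOF
```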

\subsection{Output of ``results''}

To obtain a summary covering all rounds of the cross-validation trial, the
script file ``results'' may be applied to the cross-validation log file
(in the example above the log file was ``discrim.vehicle.log'') via:
\begin{verbatim}
results discrim.vehicle.log cost.mtx > discrim.vehicle.res
\end{verbatim} 
with the output stored in the file ``discrim.vehicle.res''.
We give below the output from the ``results'' script for a 9-fold
cross-validation trial on the vehicle dataset, i.e. a partial
listing of the file ``discrim.vehicle.res''.

\subsubsection{Specimen ``results'' log file}

\begin{verbatim}
*__________________________________________________________
cochran
Logfile:   discrim.vehicle.log
LEARNING PHASE

MEAN OF TIME_LEARN:
  2.966666667  1.411111111  69.11111111

STANDARD_DEVIATION OF TIME_LEARN:
  0.05  0.06009252126  0.3333333333
*----------------------------------------------------------
TEST PHASE ON TRAINING DATA

MEAN OF TIME_TRAIN:
  2.433333333  1.366666667  65.88888889

STANDARD_DEVIATION OF TIME_TRAIN:
  0.08660254038  0.08660254038  0.3333333333
*__________________________________________________________
TEST PHASE ON TEST DATA

MEAN OF TIME_TEST:
  1.066666667  1.411111111  66.33333333

STANDARD_DEVIATION OF TIME_TEST:
  0.1  0.06009252126  0.5
*__________________________________________________________
OVERALL RESULTS:
TEST PHASE ON TRAINING DATA
CONFUSION MATRIX
       [1]     [2]     [3]     [4]
[1]   1085     504      56      51
[2]    472    1097      85      82
[3]     32       8    1680      24
[4]     16      20      16    1540
SUCCESS_RATE: 0.798168  (5402/6768)
LOSS: 1366
AVERAGE_LOSS: 0.201832
*__________________________________________________________
TEST PHASE ON TEST DATA
CONFUSION MATRIX
      [1]    [2]    [3]    [4]
[1]   128     69     8      7
[2]    61    134     11    11
[3]     4      1    210     3
[4]     4      2      2   191
SUCCESS_RATE: 0.783688  (663/846)
LOSS: 183
AVERAGE_LOSS: 0.216312
-----------------------------
Cost matrix used
 0 1 1 1
 1 0 1 1
 1 1 0 1
 1 1 1 0
   
0.9u 1.3s 0:03 73% 0+300k 11+3io 15pf+0w
Thu Aug 12 15:15:00 BST 1993

\end{verbatim} 

From the ``results'' log file we can extract the following information.   
\begin{enumerate}
\item Over all 9 cross-validation runs, the average total time to learn the
rules was 2.967+1.411 = 4.378 seconds, and the average memory used 
was 69.11 kbytes.
\item An average error rate of 0.201832 was achieved on all the training sets.
\item An average error rate of 0.216312 was achieved on all the test sets (and 
so on the original dataset).
\item The average time to classify the test sets was 1.067+1.411=2.478 seconds.
\item The confusion matrix for the total test sets shows that classes 1 and 2
tend to be confused (the entries [1,2] and [2,1] are both very high at 69
and 61 respectively).
\end{enumerate}


\subsection{Preprocessing of Data}

Evaluation Assistant includes several data preprocessing routines that can be 
invoked before running N-cross validation.    These 
routines are not normally used on their own:  they are called repeatedly
by other script files.   However they can be invoked as commands in their
own right and are often useful.   The two most important are ``eaperm'',
which permutes the order of the examples in the dataset and should precede
any cross-validation trial, and ``splitc'' which splits the dataset into
parts for training and testing.
\begin{verbatim}
    eaperm   Dataset1   Dataset2  
\end{verbatim}
will generate a random permutation of all the examples in Dataset1 and store 
the result in Dataset2, which is the dataset to be used in cross-validation. 
\begin{verbatim}
    splitc   Dataset 10 3   
\end{verbatim}
divides Dataset into 10 parts:  the 3rd part is used for test purposes (and
is stored in a file named Dataset.tes) with
the remaining 9 parts used for training and stored in a file named Dataset.tra.
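The effect of splitc can be mimicked in plain sh/awk.  The sketch below
illustrates the splitting rule only, and is not the actual splitc code (how
splitc numbers the parts and distributes remainder lines may differ):

```shell
# Mimic 'splitc Dataset 10 3': every line whose (1-based) position falls
# in the 3rd of 10 consecutive blocks goes to Dataset.tes, rest to Dataset.tra.
seq 1 20 > Dataset                      # a toy 20-line dataset
awk -v N=10 -v k=3 'BEGIN { n = 20 }    # n must match the dataset size
     { blk = int((NR - 1) * N / n) + 1
       print > (blk == k ? "Dataset.tes" : "Dataset.tra") }' Dataset
wc -l < Dataset.tes                     # 2 of the 20 lines are held out
```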

\subsection{Different Test Methods}

Evaluation Assistant allows the user to adopt one of the following test 
methods:

\begin{enumerate}
\item N-Cross Validation
\item Leave One Out
\item Train and Test Files
\end{enumerate}
    
    \index{Cross validation} 
The cv command described earlier can be used to run both the N-cross validation
and Leave One Out test methods.    This is because the 
latter can be considered a 
special case of N-cross validation in which the number of cycles (Nr.Splits) 
is equal to the number of cases in the dataset. 
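This equivalence can be sketched in Python (a hypothetical illustration, not
part of the EA scripts):

```python
data = ['a', 'b', 'c', 'd', 'e']
n = len(data)

# Leave One Out = n-cross validation with Nr.Splits equal to the number
# of cases: in cycle k the k-th example forms the entire test set, and
# the remaining n-1 examples form the training set.
folds = [(data[:k] + data[k + 1:], [data[k]]) for k in range(n)]
```

Each example appears in exactly one test set, so the combined test results
cover the whole dataset.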

\index{Train and Test}
When the dataset is very large, it may be split into two parts, with one part 
used to train the algorithm and the other for testing.   If these two parts
are Dataset.tra and Dataset.tes respectively, a train-and-test trial may be 
invoked by 

\begin{verbatim}
     tt   Algorithm_name   Dataset   >  Output_file    
\end{verbatim}

\subsection{Description of Commands (Routines)}

This section describes various subsidiary commands of the Evaluation 
Assistant (EAC).   This description is useful whenever it is necessary to run 
the tests in a piecemeal fashion, or to make alterations to the existing code.  

\begin{table}[hptb]
\begin{center}
\begin{tabular}{|l|l|l|} \hline
\multicolumn{1}{|c|}{Routine}&\multicolumn{1}{c|}{Scripts called}&\multicolumn{1}{c|}{Purpose}\\ \hline
cv~~~~~&~~~~~~~~~&Perform cross validation~~~~~~~~~~\\ 
~~~~~~~&splitc~~~&Split data\_set~~~~~~~~~~~~~~~~~~~\\ 
~~~~~~~&learn.scr&Call algorithm (learning phase)~~~\\ 
~~~~~~~&test.scr~&Call algorithm (testing phase)~~~~\\ 
~~~~~~~&eares~~~~&Process confusion matrix~~~~~~~~~~\\ \hline
results&~~~~~~~~~&Summarise cross-validation results\\
~~~~~~~&stat~~~~~&Statistics (mean, std. deviation)~\\
~~~~~~~&sumatrix~&Overall confusion matrix~~~~~~~~~~\\
~~~~~~~&eares~~~~&Process overall confusion matrix~~\\ \hline
tt~~~~~&~~~~~~~~~&Perform Train-and-Test trial~~~~~~\\
~~~~~~~&learning~&Call algorithm (learning phase)~~~\\ 
~~~~~~~&testing~~&Call algorithm (testing phase)~~~~\\ 
~~~~~~~&eares~~~~&Process confusion matrix~~~~~~~~~~\\ \hline
\hline
\end{tabular}
\end{center}
\end{table}
                 
Some scripts are documented using man pages.   For example, cv.man 
is the man page for cv.   However, these may not be very
informative.   


\section{Installation} 

It is suggested that the EA script files be kept in one directory, the FORTRAN
programs in another, and that
all files associated with a particular dataset
be kept in a directory of their own.   The shell `path' variable must be modified 
so that the EA script files and FORTRAN programs can be called from any of
the dataset directories.

\subsection{Installation of EA} 

The interactive version of Evaluation Assistant can be installed as 
follows. 
\begin{enumerate}
\item Create a local evaluation assistant directory (ea) with the executable scripts
 and EA commands.   The ea directory should contain the following files:
\begin{verbatim}
README      eares.c     lt*         stat*       test.scr*   tt*
cv*         learn.scr*  results*    stat.c      testing*
eares*      learning*   splitc*     sumatrix*   testonly*
\end{verbatim} 

\item Modify the .cshrc file (if you are using 
C shell) so that the ea directory is in the path.
\end{enumerate}

\subsection{Installation of FORTRAN programs} 

The FORTRAN programs are most conveniently kept in a separate directory,
say ``programs''.   Editing and compiling
the *.f programs should be done in the ``programs'' directory. 
\begin{enumerate}
\item Create a local ``programs'' directory with the FORTRAN programs
\begin{verbatim}
discrim.f   logdiscr.f  quadiscr.f
lintest.f   logxx.f     quadtest.f
\end{verbatim} 

\item Modify the .cshrc file (if you are using 
C shell) so that the ``programs'' directory is in the path.
\item Log out and log in again
\item Edit and compile the appropriate files in ``programs''.   For
instance, if running the Linear Discriminant procedure,
this directory should now contain the additional files:
\begin{verbatim}
discrim     tests
\end{verbatim}
\end{enumerate}

\subsection{Running a trial on a dataset}

For the simple train-and-test procedure, two datasets must exist:  the training
and test datasets must have extensions .tra and .tes respectively.   A cost
matrix, if supplied, must exist in the file cost.mtx.   Among the contents of the
'vehicle' dataset directory, for example, should be the two datasets:
\begin{verbatim}
vehicle.tes  vehicle.tra
\end{verbatim}
The appropriate FORTRAN files, discrim.f and lintest.f,
should have been edited and compiled as described previously, with the
number of 
attributes {\it nattrs} and number of classes {\it klass}
entered as parameters.   Then
the train-and-test procedure can be instigated via
\begin{verbatim}
tt discrim vehicle > discrim.vehicle.log
\end{verbatim}
The only requirement for a cross-validation study is that the dataset should
be in the current directory, and a cost matrix, if supplied, should be
in the file cost.mtx.   Cross validation can be run on the dataset 'vehicle'
for example (note there is no extension to the dataset name) by:
\begin{verbatim}
cv discrim vehicle > discrim.vehicle.log
results discrim.vehicle.log cost.mtx > discrim.vehicle.res
\end{verbatim}
Note that the summary produced by results must be redirected to a file other
than the log file it reads, otherwise the log is overwritten before it can
be processed.

\section{FORTRAN error messages}

With a modicum of luck, the FORTRAN programs will run on all the StatLog
datasets.   They may fail to run for a variety of reasons, however, and this
section gives some likely explanations for such failure with some suggestions
for rectifying the fault.   Before going through this list, a check should
be made that the learning and testing programs (discrim.f and lintest.f for
example) have both been edited and compiled successfully.

\subsection{Segmentation fault}

The most likely explanation for a segmentation fault is that one of the 
parameters (probably the number of attributes) has been 
wrongly entered, with the result that non-integer values are read for the classes.

\subsection{log: domain error}

Check that the parameter {\it ndata} is correctly entered 
(equal to the number of data in the {\bf training dataset} disc.tra).
\begin{quote}
{\bf Reminder about {\it ndata}}.   When entering 
the parameter {\it ndata} in logdiscr.f and
logxx.f, it is the {\it number of data in the training dataset} disc.tra
that is relevant.   For cross-validation trials, this is {\bf NOT} the
number of data in the complete dataset.  For example, with 9-fold
cross-validation on a dataset with 846 examples, the training and test datasets
have 752 and 94 examples respectively, and the parameter {\it ndata} is 752.
\end{quote}
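The arithmetic in this reminder can be checked directly:

```python
n_total, n_splits = 846, 9       # dataset size and Nr.Splits in the example
n_test = n_total // n_splits     # examples in each test set
n_train = n_total - n_test       # examples in each training set: this is ndata
```

Here 846 divides exactly by 9; when it does not, the parts produced by splitc
differ in size by at most one example.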

\subsection{log: zero error}

The class labels must be consecutive integers 1,2,\ldots,q.
Gaps are not allowed, so it is not permissible, for example, to use labels 1,2,4,5
in a four-class problem.    In the case of class labels 0,1,\ldots,q-1,
the FORTRAN programs recode class 0 as class q, so that
the ordering of the classes is altered.   Consequently, in this latter case, care
must be taken when entering a cost matrix, or when interpreting the output confusion matrix.
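The recoding can be sketched as follows (a hypothetical reimplementation for
illustration; the FORTRAN programs apply it internally):

```python
def recode_labels(labels, q):
    """Sketch of the recoding applied when class labels run 0..q-1:
    class 0 becomes class q, all other labels are unchanged, so the
    labels again run over 1..q but in a different order."""
    return [q if c == 0 else c for c in labels]
```

Thus in a four-class problem labelled 0,1,2,3, class 0 is reported as class 4
in the confusion matrix.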


\subsection{bus error}

Did you re-compile the learning and testing programs?

\subsection{sqrt error}

Is one attribute a linear combination of the others?   If so, remove it
entirely from the dataset.

\subsection{Failure to converge (logdiscr only)}

The criteria governing convergence in {\it logdiscr.f}~ are very wide, and
it may be felt that the program should be rerun with stricter criteria,
to make the final estimates closer to the maximum likelihood values.
If the logdiscr.log file shows a failure to converge, judged by the
magnitude of the last recorded change in deviance,
the program may be re-run,
with no further ado, by using the last recorded coefficients stored in
the file 'disc.beta' as starting values.   This is done by copying
these coefficients to the file 'discrim.beta' and re-running the program.
This device is likely to be most effective for the train-and-test datasets.
After running a train-and-test trial, say
\begin{verbatim}
tt logdiscr satimage > logdiscr.satimage.log
\end{verbatim}
the logdiscr.log file may show that the iterative procedure stopped before
the deviance had achieved its minimum.   A further set of iterations
starting from the last values can be made via
\begin{verbatim}
cp disc.beta discrim.beta
tt logdiscr satimage > logdiscr.satimage.log2
\end{verbatim}
This way of improving convergence is only to be recommended if it is thought
that just a few more iterations would be required (and the time taken
to re-run the whole program from scratch is very large).   
The alternative is to re-compile the logdiscr.f program with the maximum number
of iterations parameter {\it niter} set at a much larger value, and
the minimum tolerance parameter {\it smalld} set at a much smaller value.
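The stopping rule implied above can be sketched as follows (the default
values of {\it smalld} and {\it niter} shown here are illustrative only, not
those compiled into logdiscr.f):

```python
def converged(deviance_change, iteration, smalld=1e-4, niter=30):
    """Sketch of the stopping rule: iteration ends when the change in
    deviance falls below smalld, or when niter iterations are reached.
    Only the first condition indicates genuine convergence."""
    return abs(deviance_change) < smalld or iteration >= niter
```

A run that stops because {\it niter} was reached, rather than because the
deviance change fell below {\it smalld}, is the failure to converge discussed
above.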


\subsection{log: SING error (quadiscr only)}

It often happens, notably for the StatLog dna dataset,
that the variance of an attribute is zero for one class but not for others.
If this is the case, {\it quadiscr.f} will fail with a sqrt error.   
The file quad.covar, containing the covariance matrices for all the classes,
can be inspected to find the class/attribute in question: the whole row and
column of the covariance matrix will vanish.   In this case, a file
containing the value of the {\it delta} parameter can be created, and
the program rerun (without re-compiling).   An example follows.
\begin{verbatim}
tt quadiscr dna > quadiscr.dna.log
\end{verbatim}
fails with a square root error, so enter the single
number 0.2 in a file, save it as 'quad.delta' and rerun.
\begin{verbatim}
tt quadiscr dna > quadiscr.dna.log2
\end{verbatim}
This time the program should run to its normal conclusion. 

An alternative and often very successful way of avoiding singularities 
in the covariance matrix is by 
adding a constant term to the diagonal.   This method has not been programmed,
although it would be very easy to do so.   

\subsubsection{Meaning of {\it delta}}

The effect of the parameter {\it delta} is to replace the sample covariance
by a linear combination of the sample and pooled covariance matrices.
A value of {\it delta} = 1 is equivalent to using linear discrimination,
while {\it delta} = 0 gives the pure quadratic discriminant.   Quadratic discriminants
need a very large number of parameters to be specified, so it is at least
plausible that {\it delta} should be close to unity for small datasets.
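The blend can be sketched in Python (a hypothetical reimplementation for
illustration; quadiscr.f operates on its own internal arrays):

```python
def shrunk_covariance(sample_cov, pooled_cov, delta):
    """Sketch of the delta regularisation: replace the per-class sample
    covariance by (1 - delta)*sample + delta*pooled.  delta = 0 gives the
    pure quadratic discriminant, delta = 1 the linear (pooled) one."""
    return [[(1.0 - delta) * s + delta * p
             for s, p in zip(srow, prow)]
            for srow, prow in zip(sample_cov, pooled_cov)]
```

Any intermediate value of {\it delta} (such as the 0.2 used above) shrinks
each class covariance towards the pooled matrix, which removes the
singularity caused by a zero within-class variance.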

\end{document}
