Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!rochester!udel!news.mathworks.com!newsfeed.internetmci.com!in1.uu.net!news.interpath.net!sas!mozart.unx.sas.com!saswss
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: some questions on classification
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <DowDAz.D2E@unx.sas.com>
Date: Tue, 26 Mar 1996 23:01:47 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
References: <md88-msa.827791277@nada.kth.se> <4j8plr$k5a$1@mhadg.production.compuserve.com>
Organization: SAS Institute Inc.
Lines: 174


In article <4j8plr$k5a$1@mhadg.production.compuserve.com>, Will Dwinnell <76743.1740@CompuServe.COM> writes:
|> >> The problem I am interested in is that of two-way classification
|> and its solution with a feed-forward network (MLP). Reading the
|> literature, there seem to be two general ideas of architecture:
|> one with a single output, presumably with some thresholding device,
|> the other with two outputs, one for each class.  <<
|> 
|> It is possible, of course, to use a single output for this 
|> problem, but one nice thing about using 2 outputs (or even 2 
|> single output models together) is that when evidence is weak for 
|> any conclusion, no output will register as very strong. 

This is a very common misconception. I have just added the question,
"How should categories be coded?" to the FAQ at 
ftp://ftp.sas.com/pub/neural/FAQ2.html, but since it won't show up 
on the server until Wednesday, I will include the answer here.

Subject: How should categories be coded?

First, consider unordered categories.  If you want to classify cases
into one of C categories (i.e. you have a categorical target variable),
use 1-of-C coding. That means that you code C binary (0/1) target
variables corresponding to the C categories. Statisticians call these
"dummy" variables. Each dummy variable is given the value zero except
for the one corresponding to the correct category, which is given the
value one. Then use a softmax output activation function (see
"What is a softmax activation function?")
so that the net, if properly trained, will produce valid posterior
probability estimates. If the categories are Red, Green, and Blue,
then the data would look like this:

   Category  Dummy variables
   --------  ---------------
    Red        1   0   0
    Green      0   1   0
    Blue       0   0   1

When there are only two categories, it is simpler to use just one
dummy variable with a logistic output activation function; this is
equivalent to using softmax with two dummy variables. 
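As a sketch of the above (in Python with NumPy; the category names are the ones from the table, and the logits fed to softmax are made up for illustration):

```python
import numpy as np

def one_of_c(labels, categories):
    """Encode each label as a 1-of-C (one-hot) dummy vector."""
    index = {c: i for i, c in enumerate(categories)}
    codes = np.zeros((len(labels), len(categories)))
    for row, label in enumerate(labels):
        codes[row, index[label]] = 1.0   # 1 for the correct category, 0 elsewhere
    return codes

def softmax(z):
    """Softmax over the last axis; outputs are positive and sum to one."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

categories = ["Red", "Green", "Blue"]
print(one_of_c(["Red", "Green", "Blue", "Green"], categories))
# Hypothetical output-layer activations for one case, turned into
# posterior probability estimates that sum to one:
print(softmax(np.array([2.0, 1.0, 0.1])))
```

Because the softmax outputs are nonnegative and sum to one, they can be read directly as posterior probability estimates when the net is properly trained.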

The common practice of using target values of .1 and .9 instead of
0 and 1 prevents the outputs of the network from being directly
interpretable as posterior probabilities. 

Another common practice is to use a logistic activation function for
each output. Thus, the outputs are not constrained to sum to one, so
they are not valid posterior probability estimates.  The usual
justification advanced for this procedure is that if a test case is not
similar to any of the training cases, all of the outputs will be small,
indicating that the case cannot be classified reliably.  This claim is
incorrect, since a test case that is not similar to any of the training
cases will require the net to extrapolate, and extrapolation is
thoroughly unreliable; such a test case may produce all small outputs,
all large outputs, or any combination of large and small outputs.  If
you want a classification method that detects novel cases for which the
classification may not be reliable, you need a method based on
probability density estimation. For example, see "What is PNN?". 
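To see why small outputs are not guaranteed, compare independent logistic outputs with softmax on the same output-layer activations (the activation values below are hypothetical, standing in for what an extrapolating net might produce):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical activations for a test case far from the training data:
z = np.array([3.0, 2.5, 2.8])

print(logistic(z))        # every output near 1 -- all "confident" at once
print(logistic(z).sum())  # sums to well over 1: not valid posteriors
print(softmax(z))         # softmax outputs are constrained to sum to one
```

Nothing forces an extrapolating net's activations to be small, so the independent-logistic outputs can all be large (or all small, or mixed) for a novel case.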

It is very important not to use a single variable for an
unordered categorical target. Suppose you used a single variable with
values 1, 2, and 3 for red, green, and blue, and the training data
with two inputs looked like this:

      |    1    1
      |   1   1
      |       1   1
      |     1   1
      | 
      |      X
      | 
      |    3   3           2   2
      |     3     3      2
      |  3   3            2    2
      |     3   3       2    2
      | 
      +----------------------------

Consider a test point located at the X. The correct output would
be that X has about a 50-50 chance of being a 1 or a 3. But if
you train with a single target variable with values of 1, 2, and 3,
the output for X will be the average of 1 and 3, so the net will
say that X is definitely a 2! 
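The failure is just the arithmetic of least squares: a net trained with squared error approximates the conditional mean of the target. A minimal sketch (the four targets are hypothetical, standing in for an equal mix of 1s and 3s near X):

```python
import numpy as np

# Near the X, class 1 and class 3 are equally likely, so the optimal
# least-squares prediction is the conditional mean of the coded targets:
targets_near_x = np.array([1.0, 1.0, 3.0, 3.0])   # equal mix of 1s and 3s
prediction = targets_near_x.mean()
print(prediction)   # 2.0 -- the code for green, though X is never green
```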

For an input with categorical values, you can use 1-of-(C-1) coding.
This is just like 1-of-C coding, except that you omit one of the dummy
variables (it doesn't much matter which one). Using all C of the dummy
variables creates a linear dependency on the bias unit, which is not
advisable. 1-of-(C-1) coding looks like this:

   Category  Dummy variables
   --------  ---------------
    Red        1   0
    Green      0   1
    Blue       0   0

Another possible coding is called "effects" coding or "deviations from
means" coding in statistics. It is like 1-of-(C-1) coding, except that
when a case belongs to the category for the omitted dummy variable, all
of the dummy variables are set to -1, like this:

   Category  Dummy variables
   --------  ---------------
    Red        1   0
    Green      0   1
    Blue      -1  -1

As long as a bias unit is used, any network with effects coding can be
transformed into an equivalent network with 1-of-(C-1) coding by a
linear transformation of the weights.  So the only advantage of effects
coding is that the dummy variables require no standardizing (see
"Should I normalize/standardize/rescale the data?").
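Both codings can be sketched together (Python; Blue is arbitrarily chosen as the omitted category, matching the tables above):

```python
import numpy as np

categories = ["Red", "Green", "Blue"]   # "Blue" is the omitted category

def one_of_c_minus_1(label):
    """1-of-(C-1) coding: one dummy per category except the omitted one."""
    return np.array([1.0 if label == c else 0.0 for c in categories[:-1]])

def effects(label):
    """Effects coding: like 1-of-(C-1), but the omitted category is all -1."""
    if label == categories[-1]:
        return -np.ones(len(categories) - 1)
    return one_of_c_minus_1(label)

for c in categories:
    print(c, one_of_c_minus_1(c), effects(c))
```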

Now consider ordered categories. For inputs, some people recommend a
"thermometer code" like this:

   Category  Dummy variables
   --------  ---------------
    Red        1   1   1
    Green      0   1   1
    Blue       0   0   1

However, thermometer coding is equivalent to 1-of-C coding, in that for
any network using 1-of-C coding, there exists a network with thermometer
coding that produces identical outputs; the weights in the
thermometer-coded network are just the differences of successive weights
in the 1-of-C-coded network.  To get a genuinely ordinal representation,
you must constrain the weights connecting the dummy variables to the
hidden units to be nonnegative (except for the first dummy variable).
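A sketch of the thermometer code from the table above (Python; the ordering Red < Green < Blue is assumed, as in the table):

```python
import numpy as np

categories = ["Red", "Green", "Blue"]   # assumed ordered Red < Green < Blue

def thermometer(label):
    """Thermometer coding: for category i, dummies i through C-1 are one,
    so each code differs from the next in exactly one position."""
    i = categories.index(label)
    return np.array([1.0 if j >= i else 0.0 for j in range(len(categories))])

for c in categories:
    print(c, thermometer(c))
```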

It is often effective to represent an ordinal input as a single
variable like this:

   Category  Input
   --------  -----
    Red        1
    Green      2
    Blue       3

Although this representation involves only a single quantitative input,
given enough hidden units, the net is capable of computing nonlinear
transformations of that input that will produce results equivalent to
any of the dummy coding schemes. But using a single quantitative input
makes it easier for the net to use the order of the categories to
generalize when that is appropriate. 

B-splines provide a way of coding ordinal inputs into fewer than C
variables while retaining information about the order of the
categories. See Gifi (1990, 365-370). 

Target variables with ordered categories require thermometer coding.
The outputs are thus cumulative probabilities, so to obtain the
posterior probability of any category except the first, you must take
the difference between successive outputs. It is often useful to use a
proportional-odds model, which ensures that these differences are
positive. For more details on ordered categorical targets, see McCullagh
and Nelder (1989, chapter 5). 
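The differencing step can be sketched as follows (Python; the cumulative output values are hypothetical, standing in for a trained net's outputs on the ordered categories Red < Green < Blue):

```python
import numpy as np

# With thermometer-coded targets, output j estimates the cumulative
# probability P(category <= j), so the last output should be near 1:
cumulative = np.array([0.2, 0.7, 1.0])

# Successive differences recover one posterior probability per category;
# the first category's probability is just the first output.
probs = np.diff(cumulative, prepend=0.0)
print(probs)   # [0.2, 0.5, 0.3] -- P(Red), P(Green), P(Blue)
```

A proportional-odds model guarantees the cumulative outputs are nondecreasing, so these differences come out nonnegative.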

References:

   Gifi, A. (1990), Nonlinear Multivariate Analysis,
   NY: John Wiley & Sons, ISBN 0-471-92620-5.

   McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models,
   2nd ed., London: Chapman & Hall.

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
