Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!das-news2.harvard.edu!news2.near.net!news.mathworks.com!uunet!gatech!howland.reston.ans.net!news.sprintlink.net!redstone.interpath.net!sas!mozart.unx.sas.com!saswss
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: Questions: Softmax and MLP class membership probability outputs
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <D7Au3H.16s@unx.sas.com>
Date: Wed, 19 Apr 1995 20:26:05 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
References:  <7471.9504190829@thor.cf.ac.uk>
Organization: SAS Institute Inc.
Lines: 107


In article <7471.9504190829@thor.cf.ac.uk>, spebjp@thor.cf.ac.uk (Bernard Peat) writes:
|> I am seeking a definite, workable (not necessarily perfect) method
|> for estimating class membership probability outputs for an MLP.
|> Softmax seems to be a reasonable approach.  However, having read all
|> the literature I can find, and previous posts here, some points still
|> seem unclear to me.  I include my present understanding (which may be
|> wrong).
|>
|> Standard MLPs with sigmoid output activation functions which minimize
|> the mean square error apparently approximate posterior probabilities
|> for each class output where the target outputs are 1 and 0 (Ruck et
|> al. 1990, Wan 1990).  So one possibility is to take these outputs as
|> they are.
|>
|> A problem with this approach may be that the outputs do not sum to 1
|> as they should, where output classes are encoded one class per output
|> neuron.  Therefore, the outputs may need to be adjusted.
|>
|> Q1.  How should the outputs be adjusted where they do not sum to 1
|> (maybe using the softmax activation function for recall only)?

Dividing each output by the sum of the outputs (I assume that's what
you mean by "using the softmax activation function for recall only")
should work fairly well if you have an adequate number of training
cases in the regions where you want to make predictions. However,
if you have to extrapolate, the sum may not be anywhere close to one,
and using a post-training adjustment like this may be problematic.
Of course, extrapolation is problematic anyway.
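As a concrete sketch of that post-training adjustment, here is a minimal NumPy version (the `normalize_outputs` helper name is mine, not from the post):

```python
import numpy as np

def normalize_outputs(y):
    """Rescale raw per-class outputs so each case's outputs sum to 1.

    y: array of shape (n_cases, n_classes), e.g. the sigmoid outputs
    of an MLP trained with 1/0 targets.
    """
    y = np.asarray(y, dtype=float)
    return y / y.sum(axis=1, keepdims=True)

# Three sigmoid outputs that sum to 1.1 rather than 1:
raw = np.array([[0.7, 0.2, 0.2]])
probs = normalize_outputs(raw)  # each row now sums to 1
```

Note that this only rescales; if the raw sum is far from 1, as in the extrapolation case above, the rescaled values may be poor probability estimates.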

|> Furthermore, the suggested target outputs for MLPs are often say 0.9
|> and 0.1, rather than 1 and 0.
|>
|> Q2.  Is it OK to use 1 and 0 as target outputs?

Yes. That's what statisticians always do. I have never figured out
exactly why so many neural netters prefer .9 and .1, but I suspect
it may have to do with the fact that standard backprop is so
painfully slow.

|> ...
|> A modification which seems to overcome Q1 is softmax (Bridle
|> 1990a,b), which uses the following activation function for the output
|> neurons:
|>
|>     yk = exp(xk) / sumj(exp(xj))
|> ...
|> This can be associated with an entropy measure of fit:
|>
|>     E = - sumk(tk*log(tk / yk))

Yes, but the use of entropy is optional. Least squares will work, too.
There are certain natural associations between training criteria and
output activation functions, but one can mix and match when the
application calls for it.
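To make the mix-and-match point concrete, here is a small sketch of softmax together with both criteria (function names are mine; with 1/0 targets the quoted entropy criterion differs from the cross-entropy below only by the vanishing t*log(t) terms and a sign convention):

```python
import numpy as np

def softmax(x):
    # yk = exp(xk) / sum_j exp(xj); subtracting the max first
    # avoids overflow and leaves the result unchanged.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cross_entropy(t, y):
    # The "natural" partner of softmax: -sum_k t_k * log(y_k).
    eps = 1e-12  # guard against log(0)
    return -np.sum(t * np.log(y + eps))

def squared_error(t, y):
    # Least squares works with softmax outputs too.
    return np.sum((t - y) ** 2)

x = np.array([2.0, 1.0, 0.1])   # net inputs to the output layer
y = softmax(x)                  # sums to 1 by construction
t = np.array([1.0, 0.0, 0.0])   # 1/0 targets
e_entropy = cross_entropy(t, y)
e_squares = squared_error(t, y)
```

Either criterion can be minimized by backprop or any other gradient-based method; only the error derivatives passed back through the output layer change.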

|> Q5.  The above approach seems to be more principled than the
|> standard method.  Is it in practice better (Richard and Lippmann
|> 1991)?

Yes. If you are estimating quantities subject to some constraint
(such as the constraint that the sum is 1), you should incorporate
that constraint in the estimation process. Ignoring useful information
rarely accomplishes much.

|> Q6.  Would the approach for a single output neuron two-class problem
|> be to use a sigmoid activation function together with the above error
|> signal?

Yes, softmax reduces to the usual logistic activation function in the
two-class case, although the error function does not have to be
entropy, as mentioned above.
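That reduction is easy to verify numerically: with two net inputs x1 and x2, the softmax output for the first class equals the logistic function of the difference x1 - x2 (helper names below are mine):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# softmax([x1, x2])[0] = exp(x1) / (exp(x1) + exp(x2))
#                      = 1 / (1 + exp(-(x1 - x2)))
x1, x2 = 1.3, -0.4
p_softmax = softmax(np.array([x1, x2]))[0]
p_logistic = logistic(x1 - x2)
# The two values are identical.
```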

|> Q7.  I presume softmax can be applied to training algorithms such as
|> conjugate gradient as well as back propagation?

Yes.

|> Q8.  Questions 2-4 may still need to be addressed?

The answer to question 2 is still Yes.

|> Q9.  Are there any other practical items to consider?

You can choose any one of the output units and omit its bias and its
connections to the hidden units (and to the input units if you have
direct connections); just fix that unit's net input, before the
softmax, to a constant value such as 1.
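This works because softmax is unchanged when the same constant is subtracted from every net input, so one unit's input carries no independent information. A quick check (variable names are mine):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Full parameterization: three freely estimated net inputs.
x_full = np.array([1.0, -0.5, 0.3])

# Equivalent reduced parameterization: shift all net inputs so the
# last unit's input is a fixed constant (0 here); only the other
# units need weights.
x_reduced = x_full - x_full[-1]

p_full = softmax(x_full)
p_reduced = softmax(x_reduced)
# Identical class probabilities: one output unit's weights are redundant.
```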

|> Q10.  Denker and Le Cun (1991) mention problems using softmax, and
|> another technique seems to be to post-process the trained network
|> (Denker and Le Cun 1991, MacKay 1992, Masters 1993) to obtain
|> probability distributions using the individual observations.  Any
|> comments?

Regarding these alleged difficulties, Denker and Le Cun cite a
"Technical Memorandum" which I have been unable to obtain. Since
statisticians use softmax as a standard methodology without any
particular difficulties, I would guess that Denker and Le Cun had
a bug in their program.


-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
