Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!rochester!udel!gatech!swrinde!pipex!warwick!yama.mcc.ac.uk!cf-cm!thor.cf.ac.uk!spebjp
From: spebjp@thor.cf.ac.uk (Bernard Peat)
Subject: Questions: Softmax and MLP class membership probability outputs
Message-ID: <7471.9504190829@thor.cf.ac.uk>
Sender: spebjp@thor.cf.ac.uk (Bernard Peat)
Organization: University of Wales College of Cardiff, Cardiff, Wales, UK
Date: Wed, 19 Apr 1995 09:29:47 +0100
X-Mailer: Cardiff Computing Maths PP Mail Open News Gateway
Lines: 121

I am seeking a definite, workable (not necessarily perfect) method 
for estimating class membership probability outputs for an MLP.  
Softmax seems to be a reasonable approach.  However, having read all 
the literature I can find, and previous posts here, some points still 
seem unclear to me.  I include my present understanding (which may be 
wrong).

Standard MLPs with sigmoid output activation functions, trained to 
minimize mean squared error with targets of 1 (for the true class) 
and 0 (for all other classes), apparently approximate the posterior 
probability of each class (Ruck et al. 1990, Wan 1990).  So one 
possibility is to take these outputs as they are.

A problem with this approach may be that the outputs do not sum to 1 
as they should, where output classes are encoded one class per output 
neuron.  Therefore, the outputs may need to be adjusted.

Q1.  How should the outputs be adjusted where they do not sum to 1 
(maybe using the softmax activation function for recall only)?
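To make Q1 concrete, the crudest adjustment I can think of is simply 
to renormalize the recalled outputs so they sum to 1.  A small Python 
sketch, purely illustrative (the example output values are made up):

```python
# Illustrative only: renormalize a trained net's sigmoid outputs so
# they sum to 1 at recall time.  The raw values below are invented.
def normalize(outputs):
    total = sum(outputs)
    return [o / total for o in outputs]

raw = [0.70, 0.20, 0.15]      # sigmoid outputs, sum = 1.05
probs = normalize(raw)
print(probs)                  # rescaled so the values sum to 1
```

This preserves the rank order of the outputs but is ad hoc; softmax 
at recall would instead exponentiate the net inputs before 
normalizing.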

Furthermore, the suggested target outputs for MLPs are often, say, 
0.9 and 0.1, rather than 1 and 0.

Q2.  Is it OK to use 1 and 0 as target outputs?

Q3.  If Answer 2 is No, what are the appropriate values to use?

Q4.  If other values are used, what adjustment needs to be applied?

A modification which seems to overcome Q1 is softmax (Bridle 
1990a,b), which uses the following activation function for the output 
neurons:

    yk = exp(xk) / sumj(exp(xj))

as opposed to the normal sigmoid function:

    yk = 1 / (1 + exp(-xk))     [ = exp(xk) / (1 + exp(xk)) ]

where

yk = output of the kth output neuron
xk = weighted sum of inputs to the kth output neuron
xj = weighted sum of inputs to the jth output neuron (the sum in the 
     softmax denominator runs over all output neurons)
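To make the two definitions above concrete, a small Python sketch 
(function names and example inputs are mine; subtracting max(xs) is 
only a standard numerical safeguard and does not change the result):

```python
import math

def sigmoid(x):
    """Standard logistic function: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    """exp(xk) / sum_j exp(xj); the max-shift avoids overflow."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

x = [2.0, 1.0, 0.1]           # invented net inputs to 3 output units
y = softmax(x)
print(y)                      # outputs sum to 1 by construction
```

Note that softmax outputs sum to 1 by construction, whereas 
independent sigmoid outputs need not.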

This can be associated with a relative entropy measure of fit:

    E = sumk(tk*log(tk / yk))

where tk is the target output for the kth output neuron.  Together 
these lead to a simple error signal for training the network:

    delta(k) = yk - tk      [ = dE/dxk ]
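As a sanity check on that error signal, a finite-difference sketch in 
Python (entirely my own, assuming E = sumk(tk*log(tk/yk)) with 
softmax outputs and skipping terms where tk = 0, since t*log(t) -> 0):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def entropy_error(xs, ts):
    """Relative entropy E = sum_k t_k * log(t_k / y_k), y = softmax(x)."""
    ys = softmax(xs)
    return sum(t * math.log(t / y) for t, y in zip(ts, ys) if t > 0)

xs = [0.5, -1.2, 0.3]         # invented net inputs
ts = [1.0, 0.0, 0.0]          # one-of-N target
ys = softmax(xs)
h = 1e-6
for k in range(len(xs)):
    bumped = list(xs)
    bumped[k] += h
    numeric = (entropy_error(bumped, ts) - entropy_error(xs, ts)) / h
    analytic = ys[k] - ts[k]
    print(k, numeric, analytic)   # columns agree to several decimals
```

The numerical derivative of E with respect to each xk matches 
yk - tk, as the error signal claims.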

Q5.  The above approach seems more principled than the standard 
method.  Is it better in practice (Richard and Lippmann 1991)?

Q6.  Would the approach for a single output neuron two-class problem 
be to use a sigmoid activation function together with the above error 
signal?
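My reasoning behind Q6, sketched in Python (entirely my own): a 
single sigmoid unit produces the same value as a two-unit softmax 
with the second input pinned at 0, which would suggest the same 
delta = y - t applies to the single output:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax2(x0, x1):
    """First output of a two-unit softmax."""
    e0, e1 = math.exp(x0), math.exp(x1)
    return e0 / (e0 + e1)

# sigmoid(x) = exp(x)/(1+exp(x)) = softmax over inputs (x, 0)
for x in (-2.0, 0.0, 1.5):
    print(x, sigmoid(x), softmax2(x, 0.0))   # the two values coincide
```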

Q7.  I presume softmax can be applied to training algorithms such as 
conjugate gradient as well as back propagation?

Q8.  Do Questions 2-4 still need to be addressed when softmax is 
used?

Q9.  Are there any other practical items to consider?

Q10.  Denker and Le Cun (1991) mention problems using softmax, and 
another technique seems to be to post-process the trained network 
(Denker and Le Cun 1991, MacKay 1992, Masters 1993) to obtain 
probability distributions using the individual observations.  Any 
comments?


References:

Bridle, J.S. (1990a).  Probabilistic Interpretation of Feedforward 
Classification Network Outputs, with Relationships to Statistical 
Pattern Recognition.  In: F. Fogelman Soulie and J. Herault (eds.), 
Neurocomputing: Algorithms, Architectures and Applications, Berlin: 
Springer-Verlag, pp. 227-236.

Bridle, J.S. (1990b).  Training Stochastic Model Recognition 
Algorithms as Networks can lead to Maximum Mutual Information 
Estimation of Parameters.  In: D.S. Touretzky (ed.), Advances in 
Neural Information Processing Systems 2, San Mateo: Morgan Kaufmann, 
pp. 211-217.

Denker, J.S. and Le Cun, Y. (1991).  Transforming Neural-Net Output 
Levels to Probability Distributions.  In: R.P. Lippmann, J.E. Moody 
and D.S. Touretzky (eds.), Advances in Neural Information Processing 
Systems 3, San Mateo: Morgan Kaufmann, pp. 853-859.

MacKay, D.J.C. (1992).  A Practical Bayesian Framework for 
Backpropagation Networks.  Neural Computation, Vol. 4, pp. 448-472.

Masters, T. (1993).  Practical Neural Network Recipes in C++.  
London: Academic Press.

Richard, M.D. and Lippmann, R.P. (1991).  Neural Network Classifiers 
Estimate Bayesian a posteriori Probabilities.  Neural Computation, 
Vol. 3, pp. 461-483.

Ruck, D.W., Rogers, S.K., Kabrisky, M., Oxley, M.E. and Suter, B.W. 
(1990).  The Multilayer Perceptron as an Approximation to a Bayes 
Optimal Discriminant Function.  IEEE Transactions on Neural Networks, 
Vol. 1, No. 4, December, pp. 296-298.

Wan, E.A. (1990).  Neural Network Classification: A Bayesian 
Interpretation.  IEEE Transactions on Neural Networks, Vol. 1, No. 4, 
December, pp. 303-305.

-
Bernard Peat,  University of Wales College of Cardiff, UK.
Email: spebjp@thor.cf.ac.uk
