Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!bb3.andrew.cmu.edu!newsfeed.pitt.edu!gatech!newsfeed.internetmci.com!in1.uu.net!world!mv!barney.gvi.net!redstone.interpath.net!sas!mozart.unx.sas.com!saswss
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: "softmax" activation function?
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <DLB2DG.E9q@unx.sas.com>
Date: Wed, 17 Jan 1996 03:15:16 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
References: <4d3oo8$nk1@news.halcyon.com> <4dc6p2$8hk@newsbf02.news.aol.com> <4ddcjk$39k@fbi-news.Informatik.Uni-Dortmund.DE>
Organization: SAS Institute Inc.
Lines: 55


In article <4ddcjk$39k@fbi-news.Informatik.Uni-Dortmund.DE>, mechow@ls12se.informatik.uni-dortmund.de (Boris Mechow (PG 260)) writes:
|>
|> Just tell me, what a "softmax" activation function is!

The purpose of the softmax activation function is to make the sum of the
outputs equal one, so that the outputs are interpretable as posterior
probabilities.  Let the net input to each output unit be q_i, i=1,...,c
where c is the number of categories. Then the softmax output p_i is:

           exp(q_i)
   p_i = ------------
          c
         sum exp(q_j)
         j=1

You can choose any one of the output units and leave it completely
unconnected--just fix its net input at 0. Connecting all of the output
units gives you redundant weights (adding a constant to every q_i
leaves the softmax outputs unchanged) and will slow down training.
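
The formula above can be sketched in a few lines of Python (not part of
the original post; just an illustration of the definition):

```python
import math

def softmax(q):
    """Exponentiate each net input q_i and normalize so the
    outputs sum to one, making them interpretable as posterior
    probabilities."""
    # Subtracting the max is a standard numerical-stability trick;
    # it does not change the result, since softmax is invariant to
    # adding a constant to every q_i.
    m = max(q)
    e = [math.exp(x - m) for x in q]
    s = sum(e)
    return [x / s for x in e]

p = softmax([2.0, 1.0, 0.0])
# p sums to 1.0, and larger net inputs get larger probabilities
```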

Statisticians usually call softmax a "multiple logistic" function.  It
reduces to the simple logistic function when there are only two
categories. Suppose you choose to set q_2 to 0. Then

           exp(q_1)         exp(q_1)              1      
   p_1 = ------------ = ----------------- = -------------
          c             exp(q_1) + exp(0)   1 + exp(-q_1)
         sum exp(q_j)
         j=1

and p_2, of course, is 1-p_1.
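
The two-category reduction is easy to check numerically (again a small
Python sketch, not from the original post):

```python
import math

def logistic(q):
    """Simple logistic function."""
    return 1.0 / (1.0 + math.exp(-q))

def softmax_two(q1):
    """Softmax over two categories with q_2 fixed at 0,
    returning p_1."""
    e1 = math.exp(q1)
    e2 = math.exp(0.0)
    return e1 / (e1 + e2)

# softmax with one net input pinned to 0 matches the simple logistic,
# and p_2 = 1 - p_1 = logistic(-q_1)
```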

References:

   Bridle, J.S. (1990a).  Probabilistic Interpretation of Feedforward
   Classification Network Outputs, with Relationships to Statistical
   Pattern Recognition.  In: F.Fogleman Soulie and J.Herault (eds.),
   Neurocomputing: Algorithms, Architectures and Applications, Berlin:
   Springer-Verlag, pp. 227-236.

   Bridle, J.S. (1990b).  Training Stochastic Model Recognition
   Algorithms as Networks can lead to Maximum Mutual Information
   Estimation of Parameters.  In: D.S.Touretzky (ed.), Advances in
   Neural Information Processing Systems 2, San Mateo: Morgan Kaufmann,
   pp. 211-217.

   McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models,
   2nd ed., Chapman & Hall: London. See Chapter 5 for statistical
   applications.

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
