Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!bb3.andrew.cmu.edu!newsfeed.pitt.edu!gatech!udel!news.mathworks.com!newsfeed.internetmci.com!portal.gmu.edu!hearst.acc.Virginia.EDU!news-server.ncren.net!redstone.interpath.net!sas!mozart.unx.sas.com!saswss
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: "softmax", multivar logits, MaxEntropy
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <DMvv3v.Dqn@unx.sas.com>
Date: Fri, 16 Feb 1996 19:21:30 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
References: <4d3oo8$nk1@news.halcyon.com> <4dc6p2$8hk@newsbf02.news.aol.com> <4ddcjk$39k@fbi-news.Informatik.Uni-Dortmund.DE> <DLB2DG.E9q@unx.sas.com> <rcjanh.2525.823117109@urc.tue.nl>
Organization: SAS Institute Inc.
Lines: 61


In article <rcjanh.2525.823117109@urc.tue.nl>, rcjanh@urc.tue.nl (J. Hajek) writes:
|> >From: saswss@hotellng.unx.sas.com (Warren Sarle)
|> >|> Just tell me, what a "softmax" activation function is!
|> >
|> >The purpose of the softmax activation function is to make the sum of the
|> >outputs equal one, so that the outputs are interpretable as posterior
|> >probabilities.  Let the net input to each output unit be q_i, i=1,...,c
|> >where c is the number of categories. Then the softmax output p_i is:
|> >
|> >           exp(q_i)
|> >   p_i = ------------
|> >          c
|> >         sum exp(q_j)
|> >         j=1
...
|> Could you kindly answer ( I hope I'm not the only stupid one here :-)
|> the following Qs:
|> 
|> Q1: There certainly are much simpler and faster formulas which will
|>     make the sum of the outputs = 1.  Therefore: why exp ? what extra
|>     advantages does it have ??  Bridle said in his paper that softmax
|>     has many nice/attractive properties; what are they ??

The exp function derives naturally from log-linear models and leads to
convenient interpretations of the weights in terms of odds ratios.  You
could, however, use a variety of other nonnegative functions on the real
line. Or you could constrain the net inputs to the output units to be
nonnegative, and just divide by the sum--that's called the
Bradley-Terry-Luce model.
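To make the two normalizations concrete, here is a minimal sketch in plain Python (the input values are made up for illustration):

```python
import math

def softmax(q):
    """Softmax: exponentiate the net inputs, then normalize to sum to one."""
    m = max(q)                         # subtract the max for numerical stability
    e = [math.exp(x - m) for x in q]
    s = sum(e)
    return [x / s for x in e]

def btl(q):
    """Bradley-Terry-Luce: net inputs must be nonnegative; divide by the sum."""
    assert all(x >= 0 for x in q)
    s = sum(q)
    return [x / s for x in q]

q = [2.0, 1.0, 0.1]                    # hypothetical net inputs to the output units
print(softmax(q))                      # nonnegative, sums to one
print(btl(q))                          # also sums to one, but no exp involved
```

Note that softmax is invariant to adding a constant to all the net inputs (only differences q_i - q_j matter), whereas the Bradley-Terry-Luce form is invariant to multiplying them by a positive constant.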

|> Q2: For the purpose of point estimation of probabilities softmax was
|>     proposed by Epstein & Fienberg. Unfortunately their papers are so
|>     theoretically specialized that it is not clear to me at all HOW to
|>     compute the necessary values of the exponents. Can you give us REFs
|>     & hints how to do that ?? Maybe from the logit equations ?? ( I've
|>     never done that, sorry :-)

What's the reference for Epstein & Fienberg?  What exponents do you have
to compute?  A network using softmax and trained by least squares or by
maximum likelihood as typically used in logistic regression will give
you point estimates of probabilities directly.
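As a rough illustration of the maximum-likelihood route, here is a sketch of training a single softmax output layer by stochastic gradient descent on the negative log-likelihood (the toy data, learning rate, and epoch count are all made up; with softmax, the gradient of the negative log-likelihood with respect to net input q_i is simply p_i - t_i):

```python
import math

def softmax(q):
    m = max(q)
    e = [math.exp(x - m) for x in q]
    s = sum(e)
    return [x / s for x in e]

# Hypothetical toy data: 2 inputs, 3 categories.
data = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 1.0], 2)] * 50
W = [[0.0, 0.0] for _ in range(3)]     # one weight row per category
b = [0.0, 0.0, 0.0]                    # one bias per category

lr = 0.5
for epoch in range(200):
    for x, y in data:
        q = [sum(W[i][j] * x[j] for j in range(2)) + b[i] for i in range(3)]
        p = softmax(q)
        # d(-log L)/dq_i = p_i - t_i, where t is the 0/1 target vector.
        for i in range(3):
            g = p[i] - (1.0 if i == y else 0.0)
            for j in range(2):
                W[i][j] -= lr * g * x[j]
            b[i] -= lr * g

# The trained network's outputs are direct point estimates of the
# posterior probabilities -- no extra computation of exponents needed.
for x, y in [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 1.0], 2)]:
    q = [sum(W[i][j] * x[j] for j in range(2)) + b[i] for i in range(3)]
    print(y, [round(v, 2) for v in softmax(q)])
```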

|> Q3: Softmax formulas roll out from the MaxEnt(ropy) approach a la Jaynes.
|>     There must be a direct link between them.   What more can be said 
|>     about that ??

Reference, please?

|> Q4: To boil it all down: 
|>  A: What are the pros vs cons of the general softmax form ??

The critical part is having the outputs sum to one if you are
estimating probabilities; I don't think there are any cons in
that regard.

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
