Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!nntp.club.cc.cmu.edu!goldenapple.srv.cs.cmu.edu!rochester!cornellcs!newsstand.cit.cornell.edu!portc01.blue.aol.com!newsxfer3.itd.umich.edu!worldnet.att.net!news.mathworks.com!fu-berlin.de!news-ber1.dfn.de!news-ham1.dfn.de!news-han1.dfn.de!news-koe1.dfn.de!news.dfn.de!newsjunkie.ans.net!newsfeeds.ans.net!PITTS-NEWS!not-for-mail
From: "Thomas A. Dean" <thomas.dean.b@bayer.com>
Subject: Re: Threshold function (sigmoid)
Message-ID: <333808C5.3CAF@bayer.com>
Date: Tue, 25 Mar 1997 09:17:57 -0800
References: <01bc387f$63bd8ca0$447201cf@sol.racsa.co.cr> <ychgi0azqc.fsf@avoi.idiap.ch>
Reply-To: thomas.dean.b@bayer.com
Organization: Bayer
X-Mailer: Mozilla 3.0 (WinNT; I)
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Lines: 60

Georg Thimm wrote:
> 
> "Ricardo J. Mndez" <ricardo@tecapro.com> writes:
> 
> Hi!
> 
> > Hi.  This is a newbie question, I know, but I'm afraid I
> > don't know the answer.  Here it goes:
> 
> Although it is a newbie question, the gurus have not yet found the final
> answer ;-)
> 
> > I've been reading some things about backpropagation. In the
> > examples I've seen the sigmoid function is used as a
> > threshold function.  Why is this specific function used in
> > most cases?  Or, maybe even better, what criteria are used
> > for choosing the threshold function?
> 
> Basically, you need a non-linear, differentiable, not-too-weird
> function. Furthermore, as the targets are scaled into the interval
> [0,1], the function should take its values in this interval too.  You
> see, there are not many "simple" functions that fulfill these
> constraints.
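To make those constraints concrete, here is a minimal sketch (my own, not from
the quoted papers) of the logistic sigmoid and the property that makes it so
convenient for back-propagation: its derivative can be computed directly from
its own output.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: smooth, monotone, and bounded to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Derivative written in terms of the output itself,
    sigma'(x) = sigma(x) * (1 - sigma(x)), which is why the
    backprop delta rule is so cheap to evaluate."""
    s = sigmoid(x)
    return s * (1.0 - s)
```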
> 
> Other choices are the hyperbolic tangent (if you scale to [-1,1]) or a
> Gaussian function. The latter is supposed to yield worse
> generalization than the sigmoid function, but counterexamples
> confirm the rule.
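For reference, the two alternatives Georg mentions look like this (again my own
sketch, not taken from his papers). Note that tanh is really the same function
as the sigmoid up to rescaling, whereas the Gaussian is non-monotone, which
changes what a single unit can represent.

```python
import math

def tanh_act(x):
    # Hyperbolic tangent: same S-shape as the sigmoid but valued in
    # (-1, 1); in fact tanh(x) = 2/(1 + exp(-2x)) - 1, so the two are
    # equivalent up to a rescaling of weights and targets.
    return math.tanh(x)

def gaussian_act(x):
    # Gaussian "bump": smooth and bounded in (0, 1], but non-monotone,
    # peaking at x = 0 and vanishing in both directions.
    return math.exp(-x * x)
```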
> 
> You will find some more info in the papers "High Order and Multilayer
> Perceptron Initialization", and "The Interchangeability of Learning
> Rate and Gain in Backpropagation Neural Networks" which are accessible
> via my home page http://www.idiap.ch/~thimm.
> 
> Hope this helps.
> 
>         Georg


While I was doing my graduate work, another student in my group did a
comparison of different neural network architectures.  What he found was
that a combination of different activation functions could do much
better than a network with a single kind of activation function. 
Specifically, in his case he had an equation that estimated the actual
underlying non-linearity in the data, but that was not accurate enough
for prediction on its own.  By simplifying the non-linear function to
make it amenable to back-propagation training, and by supplementing it
with a small sigmoidal network to correct the residual non-linearity,
he achieved a quickly converging, generalizable neural-network model
that outperformed a significantly more complex network built from
sigmoid functions alone.
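For what it's worth, the idea can be sketched in a few lines.  This is a toy
illustration entirely of my own making: the function names and data are
invented, and here the residual happens to be linear, so a single
least-squares-fitted linear unit stands in for the small sigmoid correction
network he actually used.

```python
import math

def analytic_model(x):
    # Simplified first-principles approximation of the process
    # (an assumed form, for illustration only).
    return math.sin(x)

def true_process(x):
    # Toy "true" process: the analytic model plus a small distortion.
    return math.sin(x) + 0.1 * x

# The correction model only has to learn the residual 0.1 * x,
# a far easier job than learning sin(x) + 0.1 * x from scratch.
xs = [i * 0.1 for i in range(-30, 31)]
residuals = [true_process(x) - analytic_model(x) for x in xs]

# Fit the residual's slope by ordinary least squares (no intercept).
slope = sum(x * r for x, r in zip(xs, residuals)) / sum(x * x for x in xs)

def hybrid(x):
    # Analytic backbone plus the fitted correction term.
    return analytic_model(x) + slope * x
```

The same division of labor applies when the residual is non-linear; the
correction network just needs a hidden layer, but it can stay small because the
hard part of the mapping is already handled analytically.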

The point of this is that parsimony is still important in neural-network
applications and that a good approximation of the underlying function
can be invaluable in reducing the number of parameters to be fit and
improving prediction (particularly extrapolation outside the training
data space).

I hope this addresses part of the original question.


Thomas Dean

