Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!das-news2.harvard.edu!news2.near.net!news.mathworks.com!news.alpha.net!uwm.edu!psuvax1!news.ecn.bgu.edu!siemens!flake
From: flake@scr.siemens.com (Gary William Flake)
Subject: Re: Gaussian transfer function ???
Message-ID: <D6KH96.JB2@scr.siemens.com>
Sender: news@scr.siemens.com (NeTnEwS)
Nntp-Posting-Host: gull.scr.siemens.com
Organization: Siemens Corporate Research, Princeton NJ
References: <1995Mar30.113145.1831@cathy.ijs.si> <3lrs37$69c@cantaloupe.srv.cs.cmu.edu> <mike.797036733@motion> <3lt9nv$cbq@cantaloupe.srv.cs.cmu.edu>
Date: Wed, 5 Apr 1995 14:51:06 GMT
Lines: 62

Scott Fahlman <sef@CS.CMU.EDU> wrote:
> In article <mike.797036733@motion> mike@psych.ualberta.ca (Mike Dawson) writes:
>    The fact that the nonmonotonic activation function poses no problem
>    for cascade correlation is interesting me, and will probably lead me
>    to take a look at some papers describing it once again (as per usual,
>    my grasp of the literature waxes and wanes).
> 
>    However, the issue might be the role of the Gaussian in the output
>    layer.  For instance, we have had no problems using vanilla backprop
>    (if my memory serves me correctly) training Gaussian hidden units, as
>    long as the output unit is monotonic.  Our problems emerged for Gaussian
>    output units.
> 
> OK, you only saw problems with Gaussian output units.  I never looked
> at that case.  In Cascor nets with Gaussian-ridge hidden units, I
> never saw a problem.  The same holds for Quickprop nets with Gaussian
> hidden units, though I haven't looked at as many of those.  Offhand, I
> don't see why Gaussian output units should be a problem if your target
> output values are one and zero, but if there is a problem, that's
> where it is.  (You would get the usual stuck-unit or early saturation
> problem with Gaussian output units, and perhaps would be more likely
> to get stuck than with sigmoids, but you could use any of the standard
> techniques to prevent this problem.)

I've read both of your papers and I've done a little work of my own,
so let me give you my spin on the situation.

Mike's augmented error function is equivalent to:

  E = 1/2 sum_i (t_i - y_i)^2 + 1/2 sum_j t_j h_j^2

where t_i = target, y_i = output, and h_j = net input.  If you compute
dE/dh you get:

  dE/dh_i = (dE/dy_i) (dy_i/dh_i) + d/dh_i 1/2 t_i h_i^2
          = (y_i - t_i) g'(h_i) + t_i h_i

where g() is the activation function.  Note that Mike only uses
targets equal to 1 or 0.  Thus the modified error function results in
an update equation that is equivalent to adding a linear term to the
first derivative of the activation function for the output nodes (when
t = 1).
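To make the algebra concrete, here is a small sketch (my own notation
and function names, not code from either paper) of that modified
gradient for one output node with a Gaussian activation g(h) =
exp(-h^2):

```python
import math

def gaussian(h):
    # Gaussian activation: peaks at 1 when the net input h is 0
    return math.exp(-h * h)

def gaussian_prime(h):
    # First derivative of the Gaussian activation
    return -2.0 * h * math.exp(-h * h)

def modified_gradient(t, h):
    """dE/dh for one output node under the augmented error
    E = 1/2 (t - y)^2 + 1/2 t h^2, with y = g(h)."""
    y = gaussian(h)
    return (y - t) * gaussian_prime(h) + t * h
```

Note that the extra t*h term only kicks in for t = 1, exactly as
described above; for t = 0 the gradient reduces to the plain
backprop term (y - t) g'(h).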

The result: You guys are doing nearly the same thing, since Scott has
always been a fan of what he calls the "sigmoid-prime offset" (?), or
adding a constant or a linear term to the first derivative to make it
non-zero.

As to the question concerning the use of Gaussian activation
functions, think of it this way: Threshold activation functions are
great when you need to ask questions like "Is this net input greater
than my threshold?"  However, if your problems look more like "Is
this net input within a range that I like?" then a Gaussian will do
better.  XOR, in any dimension, is a problem of the second type.
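For instance, here is a toy sketch (the weights and center are my own
choice for illustration): a single Gaussian unit whose net input is
centered at x1 + x2 = 1 asks "is exactly one input on?", which is XOR
on binary inputs -- something no single threshold unit can compute:

```python
import math

def gaussian_xor(x1, x2):
    # Net input centered so that h ~= 0 exactly when one input is on
    h = x1 + x2 - 1.0
    y = math.exp(-h * h)   # Gaussian activation
    return 1 if y > 0.5 else 0
```

With inputs in {0, 1}, both (0,1) and (1,0) land at the peak of the
Gaussian (y = 1), while (0,0) and (1,1) fall into the tails
(y = e^-1 ~= 0.37), so a 0.5 cutoff recovers XOR.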

See Mike's paper or my dissertation for further details.

Regards,
Gary
-- 
Gary W. Flake,  flake@scr.siemens.com,  Phone: 609-734-3676,  Fax: 609-734-6565
Siemens Corporate Research,  755 College Road East,  Princeton, NJ  08540,  USA
