Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!das-news2.harvard.edu!fas-news.harvard.edu!newspump.wustl.edu!news.ecn.bgu.edu!psuvax1!uwm.edu!spool.mu.edu!howland.reston.ans.net!news.sprintlink.net!redstone.interpath.net!sas!mozart.unx.sas.com!saswss
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: Backprop on %error
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <D9sDDq.669@unx.sas.com>
Date: Wed, 7 Jun 1995 04:49:02 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
References: <NSANDHU.95Jun2110405@grizzly.water.ca.gov> <3qqfq5$5n@uuneo.neosoft.com>
Organization: SAS Institute Inc.
Lines: 36


In article <3qqfq5$5n@uuneo.neosoft.com>, hav@neosoft.com writes:
|> ...
|> I guess, if you use RMS, you are getting a better(?) reading of the worst cases
|> than you might get by using % error - maybe this could lead to faster convergence
|> (less impact from poor cases).  I wonder if it can provide better generalization (especially
|> over a large consultation set)?

If you are training on % error, you are giving greater weight to cases
with small target values. If those cases have less noise than cases with
large target values, then it will help generalization to give them
(small targets) more weight. If those cases have more noise, then it
will hurt generalization to give them more weight. If all cases have
about the same amount of noise, then they should be given equal weight,
hence using % error will hurt generalization somewhat, depending on the
% variability in the target values. This is a matter of well-known
statistical theory.
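To make the weighting concrete, here is a minimal sketch (mine, not from
the post) fitting a single constant to two targets. Minimizing squared
% error is the same as weighted least squares with weights 1/y^2, so the
fit is pulled toward the small target:

```python
# Sketch: plain squared error vs. % (relative) squared error.
# % error on target y amounts to weighting that case by 1/y^2,
# so cases with small targets dominate the fit.

def fit_constant(y, weights, steps=2000, lr=0.01):
    """Gradient descent on sum_i w_i * (y_i - c)^2 over a constant c."""
    c = 0.0
    for _ in range(steps):
        grad = sum(-2.0 * w * (t - c) for t, w in zip(y, weights))
        c -= lr * grad
    return c

y = [1.0, 10.0]                               # one small, one large target
c_abs = fit_constant(y, [1.0, 1.0])           # plain squared error
c_pct = fit_constant(y, [1.0 / t**2 for t in y])  # % error weighting

print(c_abs)  # the ordinary mean, 5.5
print(c_pct)  # pulled near the small target, about 1.09
```

If the small-target cases really are less noisy, this pull helps; if
they are noisier, it hurts, exactly as argued above.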

|> p.s. - has anyone ever actually used e^3 for training?  I had a request to add it
|> as an option - but derned if I can find a data set on which it helps!

Using abs(e^3) for training will help if the noise distribution has
slightly shorter tails than a Gaussian distribution. In general,
training on abs(e^p) is best for a noise density proportional to
exp(-abs(e)^p/(p*s)), where s is a scale parameter. So p>2 is good for
short-tailed distributions and p<2 is good for long-tailed distributions
compared to the Gaussian (p=2). However, p<1.5 can cause problems for
many gradient-based training algorithms because of the
nondifferentiability at e=0. I would be interested to know whether
on-line backprop works reliably with p=1.
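For anyone who wants to experiment, a sketch (mine, not from the post)
of the per-case loss abs(e)^p and the derivative a gradient-based
trainer would use. The trouble spot for small p is visible at e=0,
where the derivative jumps for p=1 and blows up for p<1:

```python
# Sketch: |e|^p loss and its derivative for gradient-based training.
# d/de |e|^p = p * |e|^(p-1) * sign(e) for e != 0; at e = 0 the loss
# is nondifferentiable for p <= 1, which is what can trip up
# gradient methods when p < 1.5 or so.

def lp_loss(e, p):
    return abs(e) ** p

def lp_grad(e, p):
    if e == 0.0:
        return 0.0  # an arbitrary subgradient choice at the kink
    return p * abs(e) ** (p - 1) * (1.0 if e > 0 else -1.0)

print(lp_grad(0.5, 2.0))   # p=2: ordinary squared-error gradient, 1.0
print(lp_grad(-0.5, 1.0))  # p=1: gradient is just sign(e), here -1.0
print(lp_grad(-0.5, 3.0))  # p=3: the short-tail case asked about above
```

With p=1 every nonzero error contributes a gradient of constant
magnitude, so on-line updates never shrink as the fit improves unless
the learning rate is annealed.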

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
