Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!bb3.andrew.cmu.edu!nntp.sei.cmu.edu!news.cis.ohio-state.edu!math.ohio-state.edu!howland.reston.ans.net!nntp.coast.net!news.kei.com!newsfeed.internetmci.com!news.sprintlink.net!new-news.sprintlink.net!news.interpath.net!sas!newshost.unx.sas.com!saswss
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: Q: Transformation of data - why does it work better.
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <DsyvDL.973@unx.sas.com>
Date: Fri, 14 Jun 1996 01:41:45 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
References:  <4pmmec$k74@wmwap1.math.uni-wuppertal.de>
Organization: SAS Institute Inc.
Lines: 71


In article <4pmmec$k74@wmwap1.math.uni-wuppertal.de>, Jens van Mahnen <mahnen> writes:
|> As a non statistician I'd like to know why the
|> transformation of input data works better with a
|> multilayer feedforward net than the raw data.
|> 
|> Is there a mathematical reason for this?
|> 
|> I tried a log10 transformation and a (x)^1/n, n=(2,3,4).
|> Both transformations worked better.
|> 
|> I have a three layer 10-5-2 network with sigmoid activation
|> function and identity for the output function.
|> 
|> Is there a good reference on this?
|> The FAQ could not answer my questions satisfactorily.

Most importantly, nonlinear transformations of the targets matter with
noisy data because of their effect on the error function.  Many
commonly used error functions are functions of the difference
abs(target-output).  Nonlinear transformations (unlike linear
transformations) change the relative sizes of these differences. With
most error functions, the net will expend more effort, so to speak,
trying to learn target values for which abs(target-output) is large.

For example, suppose you are trying to predict the price of a stock. If
the price of the stock is 10 (in whatever currency unit) and the output
of the net is 5 or 15, yielding a difference of 5, that is a huge error.
If the price of the stock is 1000 and the output of the net is 995 or
1005, yielding the same difference of 5, that is a tiny error. You don't
want the net to treat those two differences as equally important.  By
taking logarithms, you are effectively measuring errors in terms of
ratios rather than differences, since a difference between two logs
corresponds to the ratio of the original values. This has approximately
the same effect as looking at percentage differences rather than simple
differences.
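The stock-price example above can be sketched numerically (my numbers
follow the example; the code is just an illustration, not part of any
training procedure):

```python
import math

# A difference of logs equals the log of the ratio, so errors
# measured on the log scale are relative rather than absolute.
for price, output in [(10.0, 15.0), (1000.0, 1005.0)]:
    diff = abs(output - price)                         # 5.0 in both cases
    log_diff = abs(math.log(output) - math.log(price))
    # difference of logs == log of the ratio of the original values
    assert math.isclose(log_diff, abs(math.log(output / price)))
    print(f"raw diff = {diff}, log-scale error = {log_diff:.4f}")
```

Both cases have a raw difference of 5, but the 10-versus-15 case gives
a log-scale error roughly 80 times larger than the 1000-versus-1005
case, which is exactly the weighting you want here.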

Less importantly, smooth functions are usually easier to learn than
rough functions.  Generalization is also usually better for smooth
functions. So nonlinear transformations of either inputs or targets that
make the input-output function smoother are usually beneficial.
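As a tiny illustration of the smoothness point, assuming a hypothetical
exponential input-output relationship (my example, not the original
poster's data): a target spanning several orders of magnitude becomes an
exactly linear, maximally smooth function of the input after a log
transformation.

```python
import numpy as np

x = np.linspace(0.0, 5.0, 6)
y = np.exp(2.0 * x)          # raw target: ranges from 1 up to about 22000

# After a log transform the target is exactly linear in x --
# about as smooth as a function can be, and far easier to learn.
assert np.allclose(np.log(y), 2.0 * x)
```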

If the above considerations do not provide any compelling reason to
choose a particular transformation, and if your software does not
provide a sufficient variety of error functions, then it is advisable
to transform the target so that the noise distribution conforms to
whatever error function you are using. For example, if you have to use
least-(mean-)squares training, you will get the best results if the
noise distribution is approximately normal with constant variance, 
since least-(mean-)squares is maximum likelihood in that case.
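A sketch of that variance-stabilization idea, assuming multiplicative
lognormal noise (my assumption for illustration; the post does not
specify a noise model): y = f(x) * exp(eps) with eps normal and of
constant variance.

```python
import numpy as np

# On the raw scale the noise spread grows with the level of f(x);
# after a log transform it is constant, which is what
# least-(mean-)squares implicitly assumes.
rng = np.random.default_rng(0)
f = np.array([10.0, 100.0, 1000.0])             # true underlying values
eps = rng.normal(0.0, 0.1, size=(100_000, 3))   # constant-variance noise
y = f * np.exp(eps)                             # multiplicative noise

raw_sd = y.std(axis=0)            # grows roughly in proportion to f
log_sd = np.log(y).std(axis=0)    # roughly 0.1 at every level of f
```

Training on log(y) with least squares then matches the maximum-likelihood
condition described above; training on raw y would let the large-valued
cases dominate the fit.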

   Atkinson, A.C. (1985), Plots, Transformations and Regression,
   Oxford: Clarendon Press.

   Carroll, R.J. and Ruppert, D. (1988), Transformation and Weighting in
   Regression, London: Chapman and Hall.

   Huber, P.J. (1981), Robust Statistics, New York: Wiley.

   McCullagh, P. and Nelder, J.A. (1989), Generalized Linear Models,
   2nd ed., London: Chapman and Hall.





-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
