Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!bb3.andrew.cmu.edu!nntp.sei.cmu.edu!news.cis.ohio-state.edu!math.ohio-state.edu!cs.utexas.edu!swrinde!news-res.gsl.net!news.gsl.net!news.mathworks.com!zombie.ncsc.mil!newsgate.duke.edu!interpath!news.interpath.net!sas!newshost.unx.sas.com!hotellng.unx.sas.com!saswss
From: saswss@unx.sas.com (Warren Sarle)
Subject: changes to "comp.ai.neural-nets FAQ" -- monthly posting
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <nn.changes.posting_838609239@hotellng.unx.sas.com>
Supersedes: <nn.changes.posting_836017240@hotellng.unx.sas.com>
Date: Mon, 29 Jul 1996 03:00:41 GMT
Expires: Mon, 2 Sep 1996 03:00:39 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
Reply-To: saswss@unx.sas.com (Warren Sarle)
Organization: SAS Institute Inc., Cary, NC, USA
Keywords: modifications, new, additions, deletions
Followup-To: comp.ai.neural-nets
Lines: 412

==> nn1.changes.body <==

==> nn2.changes.body <==
*** nn2.oldbody	Fri Jun 28 23:00:18 1996
--- nn2.body	Sun Jul 28 23:00:17 1996
***************
*** 1,4 ****
  Archive-name: ai-faq/neural-nets/part2
! Last-modified: 1996-06-27
  URL: ftp://ftp.sas.com/pub/neural/FAQ2.html
  Maintainer: saswss@unx.sas.com (Warren S. Sarle)
--- 1,4 ----
  Archive-name: ai-faq/neural-nets/part2
! Last-modified: 1996-07-13
  URL: ftp://ftp.sas.com/pub/neural/FAQ2.html
  Maintainer: saswss@unx.sas.com (Warren S. Sarle)
***************
*** 1153,1157 ****
  methods in linear models (Frank and Friedman 1993). Orr (1995) has proposed
  combining regularization with subset selection for RBF training (see also
! Orr 199?). 
  
  References: 
--- 1153,1157 ----
  methods in linear models (Frank and Friedman 1993). Orr (1995) has proposed
  combining regularization with subset selection for RBF training (see also
! Orr 1996). 
  
  References: 
***************
*** 1191,1195 ****
     function centres," Neural Computation, 7, 606-623. 
  
!    Orr, M.J.L. (199?), "Introduction to radial basis function networks,"
     http://www.cns.ed.ac.uk/people/mark/intro.ps or
     http://www.cns.ed.ac.uk/people/mark/intro/intro.html . 
--- 1191,1195 ----
     function centres," Neural Computation, 7, 606-623. 
  
!    Orr, M.J.L. (1996), "Introduction to radial basis function networks,"
     http://www.cns.ed.ac.uk/people/mark/intro.ps or
     http://www.cns.ed.ac.uk/people/mark/intro/intro.html . 
***************
*** 1249,1302 ****
  
  Now for some of the gory details: note that the training data form a matrix.
! You could set up this matrix so that each case forms a row, and the inputs
! and target values form columns. You could conceivably standardize the rows
! or the columns or both or various other things, and these different ways of
  choosing vectors to standardize will have quite different effects on
  training. 
  
! Subquestion: Should I standardize the input column vectors?
! +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  
! That depends primarily on how the network combines inputs to compute the net
! input to the next (hidden or output) layer. If the inputs are combined
! linearly, as in a multilayer perceptron, then it is rarely strictly
! necessary to standardize the inputs, at least in theory. The reason is that
! any rescaling of an input vector can be effectively undone by changing the
! corresponding weights and biases, leaving you with the exact same outputs as
! you had before. However, there are a variety of practical reasons why
! standardizing the inputs can make training faster and reduce the chances of
! getting stuck in local optima. 
! 
! If the inputs are combined via some distance function, such as Euclidean
! distance as in a radial basis-function network, standardizing inputs can be
! crucial. Rescaling an input cannot be undone by adjusting the weights. The
! contribution of an input will depend heavily on its variability relative to
! other inputs. If one input has a range of 0 to 1, while another input has a
! range of 0 to 1,000,000, then the contribution of the first input to the
! distance will be swamped by the second input. So it is essential to rescale
! the inputs so that their variability reflects their importance, or at least
! is not in inverse relation to their importance. For lack of better prior
! information, it is common to standardize each input to the same range or the
! same standard deviation. 
! 
! For more details, see: ftp://ftp.sas.com/pub/neural/tnn3.html. 
! 
! Subquestion: Should I standardize the target column vectors?
! ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  
! Standardizing targets is typically more a convenience for getting good
! initial weights than a necessity. However, if you have two or more target
! variables and your error function is scale-sensitive like the usual least
! (mean) squares error function, then the variability of each target relative
! to the others can affect how well the net learns that target. If one target
! has a range of 0 to 1, while another target has a range of 0 to 1,000,000,
! the net will expend most of its effort learning the second target to the
! possible exclusion of the first. So it is essential to rescale the targets
! so that their variability reflects their importance, or at least is not in
! inverse relation to their importance. If the targets are of equal
  importance, they should typically be standardized to the same range or the
  same standard deviation. 
  
! For more details, see: ftp://ftp.sas.com/pub/neural/tnn3.html. 
  
  Subquestion: Should I standardize the input cases (row vectors)?
--- 1249,1363 ----
  
  Now for some of the gory details: note that the training data form a matrix.
! Let's set up this matrix so that each case forms a row, and the inputs and
! target variables form columns. You could conceivably standardize the rows or
! the columns or both or various other things, and these different ways of
  choosing vectors to standardize will have quite different effects on
  training. 
  
! Standardizing either input or target variables tends to make the training
! process better behaved by improving the numerical condition of the
! optimization problem and ensuring that various default values involved in
! initialization and termination are appropriate. Standardizing targets can
! also affect the objective function. 
  
! Standardization of cases should be approached with caution because it
! discards information. If that information is irrelevant, then standardizing
! cases can be quite helpful. If that information is important, then
! standardizing cases can be disastrous. 
  
! Subquestion: Should I standardize the input variables (column
! +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
! vectors)?
! +++++++++
! 
! That depends primarily on how the network combines input variables to
! compute the net input to the next (hidden or output) layer. If the input
! variables are combined via a distance function (such as Euclidean distance)
! in an RBF network, standardizing inputs can be crucial. The contribution of
! an input will depend heavily on its variability relative to other inputs. If
! one input has a range of 0 to 1, while another input has a range of 0 to
! 1,000,000, then the contribution of the first input to the distance will be
! swamped by the second input. So it is essential to rescale the inputs so
! that their variability reflects their importance, or at least is not in
! inverse relation to their importance. For lack of better prior information,
! it is common to standardize each input to the same range or the same
! standard deviation. If you know that some inputs are more important than
! others, it may help to scale the inputs such that the more important ones
! have larger variances and/or ranges. 
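The swamping effect is easy to see numerically. Here is a minimal Python
sketch (the two cases and their ranges are hypothetical, chosen to match the
0-to-1 versus 0-to-1,000,000 example above):

```python
import math
import statistics

def euclidean(a, b):
    """Euclidean distance between two input vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical cases: input 1 ranges over [0, 1],
# input 2 ranges over [0, 1_000_000].
case_a = [0.9, 250_000.0]
case_b = [0.1, 251_000.0]

# Raw distance is dominated entirely by the second input, even though
# the first input differs by 80% of its range.
raw = euclidean(case_a, case_b)

# Standardize each input column to mean 0, standard deviation 1.
def standardize(columns):
    stats = [(statistics.mean(c), statistics.stdev(c)) for c in columns]
    return [[(x - m) / s for x in c] for c, (m, s) in zip(columns, stats)]

col1, col2 = standardize([[0.9, 0.1], [250_000.0, 251_000.0]])
std = euclidean([col1[0], col2[0]], [col1[1], col2[1]])
# After standardization both inputs contribute comparably to the distance.
```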
! 
! If the input variables are combined linearly, as in an MLP, then it is
! rarely strictly necessary to standardize the inputs, at least in theory. The
! reason is that any rescaling of an input vector can be effectively undone by
! changing the corresponding weights and biases, leaving you with the exact
! same outputs as you had before. However, there are a variety of practical
! reasons why standardizing the inputs can make training faster and reduce the
! chances of getting stuck in local optima. 
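The "rescaling can be undone" argument can be sketched for a single linear
net input (the weight, bias, and rescaling constants here are made up):

```python
# Hypothetical single unit with a linear net input: net = w*x + b.
w, b = 2.0, -1.0
x = 3.7

# Rescale the input: x_scaled = (x - shift) / scale.
shift, scale = 5.0, 10.0
x_scaled = (x - shift) / scale

# Undo the rescaling by adjusting the weight and bias:
# w*scale*(x - shift)/scale + (b + w*shift) = w*x + b.
w_new = w * scale
b_new = b + w * shift

net_original = w * x + b
net_rescaled = w_new * x_scaled + b_new
# The two net inputs are identical, so the unit's output is unchanged.
```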
! 
! The main emphasis in the NN literature on initial values has been on the
! avoidance of saturation, hence the desire to use small random values. How
! small these random values should be depends on the scale of the inputs as
! well as the number of inputs and their correlations. Standardizing inputs
! removes the problem of scale dependence of the initial weights. 
! 
! But standardizing input variables can have far more important effects on
! initialization of the weights than simply avoiding saturation. Assume we
! have an MLP with one hidden layer applied to a classification problem and
! are therefore interested in the hyperplanes defined by each hidden unit.
! Each hyperplane is the locus of points where the net-input to the hidden
! unit is zero and is thus the classification boundary generated by that
! hidden unit considered in isolation. The connection weights from the inputs
! to a hidden unit determine the orientation of the hyperplane. The bias
! determines the distance of the hyperplane from the origin. If the bias terms
! are all small random numbers, then all the hyperplanes will pass close to
! the origin. Hence, if the data are not centered at the origin, the
! hyperplane may fail to pass through the data cloud. If all the inputs have a
! small coefficient of variation, it is quite possible that all the initial
! hyperplanes will miss the data entirely. With such a poor initialization,
! local minima are very likely to occur. It is therefore important to center
! the inputs to get good random initializations. 
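A rough Monte Carlo sketch of the effect (the data distribution, weight
ranges, and sizes are illustrative assumptions, not prescriptions): count how
many randomly initialized hyperplanes actually pass through the data cloud,
before and after centering.

```python
import random

random.seed(0)

def hyperplane_hits_cloud(points, w, b):
    """True if the hyperplane w.x + b = 0 separates some pair of points,
    i.e. actually passes through the data cloud."""
    signs = {(sum(wi * xi for wi, xi in zip(w, x)) + b) > 0 for x in points}
    return len(signs) == 2

dim, n_points, n_units = 2, 200, 100

# Data cloud centered far from the origin, with small spread
# (small coefficient of variation).
raw = [[random.gauss(100.0, 1.0) for _ in range(dim)] for _ in range(n_points)]

# The same data centered at the origin.
means = [sum(x[i] for x in raw) / n_points for i in range(dim)]
centered = [[x[i] - means[i] for i in range(dim)] for x in raw]

# Typical initialization: small random weights and biases.
units = [([random.uniform(-0.5, 0.5) for _ in range(dim)],
          random.uniform(-0.5, 0.5)) for _ in range(n_units)]

hits_raw = sum(hyperplane_hits_cloud(raw, w, b) for w, b in units)
hits_centered = sum(hyperplane_hits_cloud(centered, w, b) for w, b in units)
# hits_raw is near 0; hits_centered is much larger.
```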
! 
! Standardizing input variables also has different effects on different
! training algorithms for MLPs. For example: 
! 
!  o Steepest descent is very sensitive to scaling. The more ill-conditioned
!    the Hessian is, the slower the convergence. Hence, scaling is an
!    important consideration for gradient descent methods such as standard
!    backprop. 
!  o Quasi-Newton and conjugate gradient methods begin with a steepest descent
!    step and therefore are scale sensitive. However, they accumulate
!    second-order information as training proceeds and hence are less scale
!    sensitive than pure gradient descent. 
!  o Newton-Raphson and Gauss-Newton, if implemented correctly, are
!    theoretically invariant under scale changes as long as none of the
!    scaling is so extreme as to produce underflow or overflow. 
!  o Levenberg-Marquardt is scale invariant as long as no ridging is required.
!    There are several different ways to implement ridging; some are scale
!    invariant and some are not. Performance under bad scaling will depend on
!    details of the implementation. 
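The sensitivity of steepest descent to conditioning can be sketched on a
simple quadratic (the function, learning rates, and tolerance below are
illustrative; the learning rate is set near the stability limit 2/lambda_max
in each case):

```python
def gradient_descent_steps(scales, lr, tol=1e-6, max_steps=100_000):
    """Minimize f(x) = 0.5 * sum((s_i * x_i)**2) by steepest descent;
    return the number of steps until the gradient norm falls below tol."""
    x = [1.0 for _ in scales]
    for step in range(max_steps):
        grad = [s * s * xi for s, xi in zip(scales, x)]
        if sum(g * g for g in grad) ** 0.5 < tol:
            return step
        x = [xi - lr * g for xi, g in zip(x, grad)]
    return max_steps

# Well-conditioned problem: both inputs on the same scale.
steps_good = gradient_descent_steps([1.0, 1.0], lr=0.9)

# Ill-conditioned problem: one input on a 10x larger scale, so the
# Hessian has condition number 100 and the step size must shrink by 100.
steps_bad = gradient_descent_steps([1.0, 10.0], lr=0.9 / 100.0)
# steps_bad is roughly two orders of magnitude larger than steps_good.
```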
! 
! Subquestion: Should I standardize the target variables (column
! ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
! vectors)?
! +++++++++
! 
! Standardizing target variables is typically more a convenience for getting
! good initial weights than a necessity. However, if you have two or more
! target variables and your error function is scale-sensitive like the usual
! least (mean) squares error function, then the variability of each target
! relative to the others can affect how well the net learns that target. If
! one target has a range of 0 to 1, while another target has a range of 0 to
! 1,000,000, the net will expend most of its effort learning the second target
! to the possible exclusion of the first. So it is essential to rescale the
! targets so that their variability reflects their importance, or at least is
! not in inverse relation to their importance. If the targets are of equal
  importance, they should typically be standardized to the same range or the
  same standard deviation. 
  
! The scaling of the targets does not affect their importance in training if
! you use maximum likelihood estimation and estimate a separate scale
! parameter (such as a standard deviation) for each target variable. In this
! case, the importance of each target is inversely related to its estimated
! scale parameter. In other words, noisier targets will be given less
! importance. 
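A minimal sketch of standardizing target columns, keeping the parameters
needed to map network outputs back to the original units (the target values
are made up):

```python
import statistics

def standardize_targets(rows):
    """Rescale each target column to mean 0, standard deviation 1, and
    return the scaled rows plus the (mean, std) pairs needed to map
    network outputs back to the original units."""
    cols = list(zip(*rows))
    params = [(statistics.mean(c), statistics.stdev(c)) for c in cols]
    scaled = [[(v - m) / s for v, (m, s) in zip(row, params)]
              for row in rows]
    return scaled, params

def unscale(row, params):
    """Map a standardized output row back to the original units."""
    return [v * s + m for v, (m, s) in zip(row, params)]

# Two targets on wildly different scales.
targets = [[0.2, 150_000.0], [0.9, 960_000.0], [0.5, 420_000.0]]
scaled, params = standardize_targets(targets)
# Both columns now have unit standard deviation, so a least-squares
# error function weights them equally during training.
```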
  
  Subquestion: Should I standardize the input cases (row vectors)?
***************
*** 1321,1325 ****
  rules of thumb that apply to all applications. 
  
! For more details, see: ftp://ftp.sas.com/pub/neural/tnn3.html. 
  
  ------------------------------------------------------------------------
--- 1382,1432 ----
  rules of thumb that apply to all applications. 
  
! You may want to standardize each case if there is extraneous variability
! between cases. Consider the common situation in which each input variable
! represents a pixel in an image. If the images vary in exposure, and exposure
! is irrelevant to the target values, then it would usually help to subtract
! the mean of each case to equate the exposures of different cases. If the
! images vary in contrast, and contrast is irrelevant to the target values,
! then it would usually help to divide each case by its standard deviation to
! equate the contrasts of different cases. Given sufficient data, a NN could
! learn to ignore exposure and contrast. However, training will be easier and
! generalization better if you can remove the extraneous exposure and contrast
! information before training the network. 
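A minimal sketch of the row standardization just described, using two
hypothetical 4-pixel "images" that differ only in exposure and contrast:

```python
import statistics

def standardize_case(pixels):
    """Remove per-image exposure (the case mean) and contrast (the case
    standard deviation) by standardizing the row."""
    m = statistics.mean(pixels)
    s = statistics.stdev(pixels)
    return [(p - m) / s for p in pixels]

# Two hypothetical images of the same scene; the second is brighter
# (higher exposure) and has stronger contrast.
image1 = [10.0, 20.0, 30.0, 40.0]
image2 = [120.0, 140.0, 160.0, 180.0]   # = 2*image1 + 100

z1 = standardize_case(image1)
z2 = standardize_case(image2)
# After row standardization the extraneous exposure and contrast
# differences vanish: z1 and z2 agree up to floating-point rounding.
print(all(abs(a - b) < 1e-12 for a, b in zip(z1, z2)))  # True
```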
! 
! As another example, suppose you want to classify plant specimens according
! to species but the specimens are at different stages of growth. You have
! measurements such as stem length, leaf length, and leaf width. However, the
! over-all size of the specimen is determined by age or growing conditions,
! not by species. Given sufficient data, a NN could learn to ignore the size
! of the specimens and classify them by shape instead. However, training will
! be easier and generalization better if you can remove the extraneous size
! information before training the network. Size in the plant example
! corresponds to exposure in the image example. 
! 
! If the input data are measured on an interval scale, you can control for
! size by subtracting a measure of the over-all size of each case from each
! datum. For example, if no other direct measure of size is available, you
! could subtract the mean of each row of the input matrix, producing a
! row-centered input matrix. 
! 
! If the data are measured on a ratio scale, you can control for size by
! dividing each datum by a measure of over-all size. It is common to divide by
! the sum or by the arithmetic mean. For positive ratio data, however, the
! geometric mean is often a more natural measure of size than the arithmetic
! mean. It may also be more meaningful to analyze the logarithms of positive
! ratio-scaled data, in which case you can subtract the arithmetic mean after
! taking logarithms. You must also consider the dimensions of measurement. For
! example, if you have measures of both length and weight, you may need to
! cube the measures of length or take the cube root of the weights. 
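A sketch of the geometric-mean and log-scale versions of this size
correction (the specimen measurements are invented):

```python
import math

def divide_by_geometric_mean(row):
    """Control for over-all size in positive ratio-scaled data by dividing
    each measurement by the geometric mean of its row."""
    g = math.exp(sum(math.log(v) for v in row) / len(row))
    return [v / g for v in row]

def log_then_center(row):
    """The equivalent operation on the log scale: take logarithms, then
    subtract the arithmetic mean of the logs."""
    logs = [math.log(v) for v in row]
    m = sum(logs) / len(logs)
    return [x - m for x in logs]

# Two specimens with the same shape but different over-all size.
small = [2.0, 8.0, 4.0]
large = [6.0, 24.0, 12.0]   # 3x the small specimen

# After dividing by the geometric mean, the size difference is gone,
# and exponentiating the log-centered values gives the same result.
```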
! 
! In NN applications with ratio-level data, it is common to divide by the
! Euclidean length of each row. If the data are positive, dividing by the
! Euclidean length has properties similar to dividing by the sum or arithmetic
! mean, since the former projects the data points onto the surface of a
! hypersphere while the latter projects the points onto a hyperplane. If the
! dimensionality is not too high, the resulting configurations of points on
! the hypersphere and hyperplane are usually quite similar. If the data
! contain negative values, then the hypersphere and hyperplane can diverge
! widely. 
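A sketch contrasting the two projections (the data values are invented):
for positive data both remove over-all size, but near-cancelling negative
values make the hyperplane projection blow up while the hypersphere
projection stays bounded.

```python
import math

def normalize_l2(row):
    """Project a case onto the unit hypersphere (divide by Euclidean length)."""
    n = math.sqrt(sum(v * v for v in row))
    return [v / n for v in row]

def normalize_sum(row):
    """Project a case onto the hyperplane where the components sum to 1."""
    s = sum(row)
    return [v / s for v in row]

# For positive data the two projections behave similarly: both remove
# the size difference between these two proportional cases entirely.
a = [3.0, 1.0, 2.0]
b = [6.0, 2.0, 4.0]   # same direction, twice the size

# With negative values the row sum can approach zero: the hyperplane
# projection blows up while the hypersphere projection stays bounded.
c = [1.0, -0.999]
print(max(abs(v) for v in normalize_l2(c)))   # bounded by 1
print(max(abs(v) for v in normalize_sum(c)))  # huge: the sum is ~0.001
```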
  
  ------------------------------------------------------------------------
***************
*** 1638,1643 ****
  principal components (Pearson 1901; Hotelling 1933; Rao 1964; Jolliffe 1986;
  Jackson 1991). There are variations of Hebbian learning that explicitly
! produce the principal components (Hertz, Krogh, and Palmer 1991; Deco and
! Obradovic 1996). 
  
  Perhaps the most novel form of unsupervised learning in the NN literature is
--- 1745,1750 ----
  principal components (Pearson 1901; Hotelling 1933; Rao 1964; Jolliffe 1986;
  Jackson 1991). There are variations of Hebbian learning that explicitly
! produce the principal components (Hertz, Krogh, and Palmer 1991; Karhunen
! 1994; Deco and Obradovic 1996). 
  
  Perhaps the most novel form of unsupervised learning in the NN literature is
***************
*** 1675,1678 ****
--- 1782,1788 ----
     Jolliffe, I.T. (1986), Principal Component Analysis, Springer-Verlag. 
  
+    Karhunen, J. (1994), "Stability of Oja's PCA subspace rule," Neural
+    Computation, 6, 739-747. 
+ 
     Kohonen, T. (1995), Self-Organizing Maps, Berlin: Springer-Verlag. 
  
***************
*** 1772,1778 ****
     have fuzzy outputs. 
   o The net can be interpretable as an adaptive fuzzy system. For example,
!    Gaussian RBF nets and B-spline regression models are fuzzy systems with
!    adaptive weights (Brown and Harris 1994) and so can legitimately be
!    called neurofuzzy systems. 
   o The net can be a conventional NN architecture that operates on fuzzy
     numbers instead of real numbers (Lippe, Feuring and Mischke 1995). 
--- 1882,1888 ----
     have fuzzy outputs. 
   o The net can be interpretable as an adaptive fuzzy system. For example,
!    Gaussian RBF nets and B-spline regression models (Dierckx 1995) are fuzzy
!    systems with adaptive weights (Brown and Harris 1994) and can
!    legitimately be called neurofuzzy systems. 
   o The net can be a conventional NN architecture that operates on fuzzy
     numbers instead of real numbers (Lippe, Feuring and Mischke 1995). 
***************
*** 1789,1792 ****
--- 1899,1904 ----
     of Technology, Tampere, Finland: http://www.cs.tut.fi/~tpo/group.html and
     http://dmiwww.cs.tut.fi/nfs/Welcome_uk.html 
+  o Marcello Chiaberge's Neuro-Fuzzy page at 
+    http://polimage.polito.it/~marcello. 
  
  References: 
***************
*** 1807,1810 ****
--- 1919,1925 ----
     Chen, C.H., ed. (1996) Fuzzy Logic and Neural Network Handbook, NY:
     McGraw-Hill, ISBN 0-07-011189-8. 
+ 
+    Dierckx, P. (1995), Curve and Surface Fitting with Splines, Oxford:
+    Clarendon Press. 
  
     Hecht-Nielsen, R. (1990), Neurocomputing, Reading, MA: Addison-Wesley. 

==> nn3.changes.body <==
*** nn3.oldbody	Fri Jun 28 23:00:23 1996
--- nn3.body	Sun Jul 28 23:00:21 1996
***************
*** 1061,1064 ****
  ------------------------------------------------------------------------
  
! Next part is part 4 (of 7). Previous part is part 2. 
  
--- 1061,1064 ----
  ------------------------------------------------------------------------
  
! Next part is part 4 (of 7). Previous part is part 2. @
  

==> nn4.changes.body <==
*** nn4.oldbody	Fri Jun 28 23:00:28 1996
--- nn4.body	Sun Jul 28 23:00:25 1996
***************
*** 1,4 ****
  Archive-name: ai-faq/neural-nets/part4
! Last-modified: 1996-06-27
  URL: ftp://ftp.sas.com/pub/neural/FAQ4.html
  Maintainer: saswss@unx.sas.com (Warren S. Sarle)
--- 1,4 ----
  Archive-name: ai-faq/neural-nets/part4
! Last-modified: 1996-07-05
  URL: ftp://ftp.sas.com/pub/neural/FAQ4.html
  Maintainer: saswss@unx.sas.com (Warren S. Sarle)
***************
*** 373,376 ****
--- 373,383 ----
  Comments: "They cover a broad area"; "Introductory with suggested
  applications implementation".
+ 
+ Masters, T. (1995) Advanced Algorithms for Neural Networks: A C++
+ Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0
+ Clear explanations of conjugate gradient and Levenberg-Marquardt
+ optimization algorithms, simulated annealing, kernel regression (GRNN) and
+ discriminant analysis (PNN), Gram-Charlier networks, dimensionality
+ reduction, cross-validation, and bootstrapping. 
  
  Pao, Y. H. (1989). Adaptive Pattern Recognition and Neural Networks

==> nn5.changes.body <==

==> nn6.changes.body <==
*** nn6.oldbody	Fri Jun 28 23:00:36 1996
--- nn6.body	Sun Jul 28 23:00:33 1996
***************
*** 34,42 ****
  address, etc.), you need not count the header in the 60 line maximum. Please
  confine your HTML to features that are supported by most browsers,
! especially NCSA Mosaic 2.0; avoid tables for example--use <pre> instead. Try
! to make the descriptions objective, and avoid making implicit or explicit
! assertions about competing products, such as "Our product is the *only* one
! that does so-and-so." The FAQ maintainer reserves the right to remove
! excessive marketing hype and to edit submissions to conform to size
  requirements; if he is in a good mood, he may also correct your spelling and
  punctuation. 
--- 34,42 ----
  address, etc.), you need not count the header in the 60 line maximum. Please
  confine your HTML to features that are supported by most browsers,
! especially NCSA Mosaic 2.0; avoid tables, for example--use <pre> instead.
! Try to make the descriptions objective, and avoid making implicit or
! explicit assertions about competing products, such as "Our product is the
! *only* one that does so-and-so." The FAQ maintainer reserves the right to
! remove excessive marketing hype and to edit submissions to conform to size
  requirements; if he is in a good mood, he may also correct your spelling and
  punctuation. 

==> nn7.changes.body <==
-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
