Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!rochester!cornellcs!newsstand.cit.cornell.edu!news.acsu.buffalo.edu!news.sunydutchess.edu!zombie.ncsc.mil!newsgate.duke.edu!interpath!news.interpath.net!news.interpath.net!sas!newshost.unx.sas.com!hotellng.unx.sas.com!saswss
From: saswss@unx.sas.com (Warren Sarle)
Subject: changes to "comp.ai.neural-nets FAQ" -- monthly posting
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <nn.changes.posting_851832034@hotellng.unx.sas.com>
Supersedes: <nn.changes.posting_849240042@hotellng.unx.sas.com>
Date: Sun, 29 Dec 1996 04:00:35 GMT
Expires: Sun, 2 Feb 1997 04:00:34 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
Reply-To: saswss@unx.sas.com (Warren Sarle)
Organization: SAS Institute Inc., Cary, NC, USA
Keywords: modifications, new, additions, deletions
Followup-To: comp.ai.neural-nets
Lines: 967

==> nn1.changes.body <==
*** nn1.oldbody	Thu Nov 28 23:00:15 1996
--- nn1.body	Sat Dec 28 23:00:07 1996
***************
*** 1,4 ****
  Archive-name: ai-faq/neural-nets/part1
! Last-modified: 1996-11-27
  URL: ftp://ftp.sas.com/pub/neural/FAQ.html
  Maintainer: saswss@unx.sas.com (Warren S. Sarle)
--- 1,4 ----
  Archive-name: ai-faq/neural-nets/part1
! Last-modified: 1996-12-13
  URL: ftp://ftp.sas.com/pub/neural/FAQ.html
  Maintainer: saswss@unx.sas.com (Warren S. Sarle)
***************
*** 41,44 ****
--- 41,50 ----
  Usenet access, too--try the URL news:comp.ai.neural-nets. 
  
+ The FAQ is posted to comp.ai.neural-nets on the 28th of every month. It
+ is also sent to the groups comp.answers and news.answers, where it should
+ be available at any time (ask your news manager). The FAQ posting, like any
+ other posting, may take a few days to find its way over Usenet to your
+ site. Such delays are especially common outside of North America. 
+ 
  This FAQ is not meant to discuss any topic exhaustively.
  
***************
*** 95,99 ****
     What is early stopping?
     What is weight decay?
!    What is Bayesian estimation?
     How many hidden layers should I use?
     How many hidden units should I use?
--- 101,105 ----
     What is early stopping?
     What is weight decay?
!    What is Bayesian learning?
     How many hidden layers should I use?
     How many hidden units should I use?
***************
*** 103,107 ****
  Part 4: Books, data, etc.
  
!    Good literature about Neural Networks?
     Journals and magazines about Neural Networks?
     The most important conferences concerned with Neural Networks?
--- 109,113 ----
  Part 4: Books, data, etc.
  
!    Books and articles about Neural Networks?
     Journals and magazines about Neural Networks?
     The most important conferences concerned with Neural Networks?
***************
*** 283,288 ****
  
  The intent in providing a FAQ is to make the information freely available to
! whomever needs it. You may copy all or part of the FAQ, but please be sure
! to include a reference to the URL of the master copy,
  ftp://ftp.sas.com/pub/neural/FAQ.html, and do not sell copies of the FAQ. If
  you want to include information from the FAQ in your own web site, it is
--- 289,294 ----
  
  The intent in providing a FAQ is to make the information freely available to
! whoever needs it. You may copy all or part of the FAQ, but please be sure to
! include a reference to the URL of the master copy,
  ftp://ftp.sas.com/pub/neural/FAQ.html, and do not sell copies of the FAQ. If
  you want to include information from the FAQ in your own web site, it is
***************
*** 634,637 ****
  ------------------------------------------------------------------------
  
! Next part is part 2 (of 7). 
  
--- 640,643 ----
  ------------------------------------------------------------------------
  
! Next part is part 2 (of 7). @
  

==> nn2.changes.body <==
*** nn2.oldbody	Thu Nov 28 23:00:22 1996
--- nn2.body	Sat Dec 28 23:00:12 1996
***************
*** 1,4 ****
  Archive-name: ai-faq/neural-nets/part2
! Last-modified: 1996-11-27
  URL: ftp://ftp.sas.com/pub/neural/FAQ2.html
  Maintainer: saswss@unx.sas.com (Warren S. Sarle)
--- 1,4 ----
  Archive-name: ai-faq/neural-nets/part2
! Last-modified: 1996-12-23
  URL: ftp://ftp.sas.com/pub/neural/FAQ2.html
  Maintainer: saswss@unx.sas.com (Warren S. Sarle)
***************
*** 517,525 ****
  range. But if the target values have no known bounded range, it is better to
  use an unbounded activation function, most often the identity function
! (which amounts to no activation function). There are certain natural
! associations between output activation functions and various noise
! distributions which have been studied by statisticians in the context of
! generalized linear models. The output activation function is the inverse of
! what statisticians call the "link function". See: 
  
     McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd
--- 517,528 ----
  range. But if the target values have no known bounded range, it is better to
  use an unbounded activation function, most often the identity function
! (which amounts to no activation function). If the target values are positive
! but have no known upper bound, you can use an exponential output activation
! function (but beware of overflow if you are writing your own code). 
! 
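To illustrate the overflow warning above, here is a minimal Python sketch of
an exponential output activation (the function name and the clipping
threshold of 80 are my own choices, not part of the FAQ):

```python
import math

def exp_output(net_input, max_arg=80.0):
    """Exponential output activation for positive, unbounded targets.

    Clips the argument so math.exp cannot overflow a double:
    exp(80) is about 5.5e34, far below the ~1.8e308 limit of
    IEEE doubles, yet large enough for most regression targets.
    """
    return math.exp(min(net_input, max_arg))

print(exp_output(2.0))     # about 7.389
print(exp_output(1000.0))  # clipped to exp(80): no OverflowError
```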
! There are certain natural associations between output activation functions
! and various noise distributions which have been studied by statisticians in
! the context of generalized linear models. The output activation function is
! the inverse of what statisticians call the "link function". See: 
  
     McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd
***************
*** 791,796 ****
  HUs  MLP   ORBFEQ  ORBFUN  NRBFEQ  NRBFEW  NRBFEV  NRBFEH  NRBFUN
                                                             
!  2  0.218   0.247   0.247   0.230   0.230   0.230   0.230   0.230
!  3  0.192   0.244   0.143   0.218   0.218   0.036   0.012   0.001
   4  0.174   0.216   0.096   0.193   0.193   0.036   0.007
   5  0.160   0.188   0.083   0.086   0.051   0.003
--- 794,799 ----
  HUs  MLP   ORBFEQ  ORBFUN  NRBFEQ  NRBFEW  NRBFEV  NRBFEH  NRBFUN
                                                             
!  2  0.218   0.247   0.247   0.230   0.230   0.230   0.230   0.230 
!  3  0.192   0.244   0.143   0.218   0.218   0.036   0.012   0.001 
   4  0.174   0.216   0.096   0.193   0.193   0.036   0.007
   5  0.160   0.188   0.083   0.086   0.051   0.003
***************
*** 1293,1297 ****
  must be in the interval [0,1]. There is in fact no such requirement,
  although there often are benefits to standardizing the inputs as discussed
! below. 
  
  If your output activation function has a range of [0,1], then obviously you
--- 1296,1301 ----
  must be in the interval [0,1]. There is in fact no such requirement,
  although there often are benefits to standardizing the inputs as discussed
! below. But it is better to have the input values centered around zero, so
! scaling the inputs to the interval [0,1] is usually a bad choice. 
  
  If your output activation function has a range of [0,1], then obviously you
***************
*** 1353,1357 ****
  same outputs as you had before. However, there are a variety of practical
  reasons why standardizing the inputs can make training faster and reduce the
! chances of getting stuck in local optima. 
  
  The main emphasis in the NN literature on initial values has been on the
--- 1357,1362 ----
  same outputs as you had before. However, there are a variety of practical
  reasons why standardizing the inputs can make training faster and reduce the
! chances of getting stuck in local optima. Also, weight decay and Bayesian
! estimation can be done more conveniently with standardized inputs. 
  
  The main emphasis in the NN literature on initial values has been on the
***************
*** 1376,1380 ****
  hyperplanes will miss the data entirely. With such a poor initialization,
  local minima are very likely to occur. It is therefore important to center
! the inputs to get good random initializations. 
  
  Standardizing input variables also has different effects on different
--- 1381,1388 ----
  hyperplanes will miss the data entirely. With such a poor initialization,
  local minima are very likely to occur. It is therefore important to center
! the inputs to get good random initializations. In particular, scaling the
! inputs to [-1,1] will work better than [0,1], although any scaling that sets
! to zero the mean or median or other measure of central tendency is likely to
! be as good or better. 
  
  Standardizing input variables also has different effects on different
***************
*** 1397,1400 ****
--- 1405,1475 ----
     details of the implementation. 
  
+ Two of the most useful ways to standardize inputs are: 
+ 
+  o Mean 0 and standard deviation 1 
+  o Midrange 0 and range 2 (i.e., minimum -1 and maximum 1) 
+ 
+ Formulas are as follows: 
+ 
+ Notation:
+ 
+    X = value of the raw input variable X for the ith training case
+     i
+    
+    S = standardized value corresponding to X
+     i                                       i
+    
+    N = number of training cases
+ 
+                            
+ Standardize X  to mean 0 and standard deviation 1:
+              i   
+ 
+           sum X
+            i   i   
+    mean = ------
+              N
+    
+                         2
+           sum( X - mean)
+            i    i
+    var  = ---------------
+                N - 1
+ 
+    std  = sqrt(var)
+                            
+ 
+        X  - mean
+         i
+    S = ---------
+     i     std
+ 
+                            
+ Standardize X  to midrange 0 and range 2:
+              i   
+ 
+               max X  +  min X
+                i   i     i   i
+    midrange = ----------------
+                      2
+ 
+ 
+    range = max X  -  min X
+             i   i     i   i
+ 
+ 
+        X  - midrange
+         i
+    S = -------------
+     i     range / 2
+        
+ 
+ Various other pairs of location and scale estimators can be used besides the
+ mean and standard deviation, or midrange and range. Robust estimates of
+ location and scale are desirable if the inputs contain outliers. For
+ example, see: 
+ 
+    Iglewicz, B. (1983), "Robust scale estimators and confidence intervals
+    for location", in Hoaglin, D.C., Mosteller, M. and Tukey, J.W., eds., 
+    Understanding Robust and Exploratory Data Analysis, NY: Wiley. 
+ 
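The two standardizations above can be sketched in Python as follows (the
helper names are invented for illustration):

```python
def standardize_mean_std(x):
    """Rescale to mean 0 and standard deviation 1 (sample std, N-1)."""
    n = len(x)
    mean = sum(x) / n
    std = (sum((v - mean) ** 2 for v in x) / (n - 1)) ** 0.5
    return [(v - mean) / std for v in x]

def standardize_midrange(x):
    """Rescale to midrange 0 and range 2, i.e. minimum -1, maximum 1."""
    midrange = (max(x) + min(x)) / 2
    half_range = (max(x) - min(x)) / 2
    return [(v - midrange) / half_range for v in x]

raw = [2.0, 4.0, 6.0, 8.0]
print(standardize_midrange(raw))  # minimum maps to -1, maximum to 1
```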
  Subquestion: Should I standardize the target variables (column
  ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
***************
*** 1422,1425 ****
--- 1497,1544 ----
  importance. 
  
+ For weight decay and Bayesian estimation, the scaling of the targets affects
+ the decay values and prior distributions. Hence it is usually most
+ convenient to work with standardized targets. 
+ 
+ If you are standardizing targets to equalize their importance, then you
+ should probably standardize to mean 0 and standard deviation 1, or use
+ related robust estimators, as discussed under Should I standardize the input
+ variables (column vectors)? If you are standardizing targets to force the
+ values into the range of the output activation function, it is important to
+ use lower and upper bounds for the values, rather than the minimum and
+ maximum values in the training set. For example, if the output activation
+ function has range [-1,1], you can use the following formulas: 
+ 
+    Y = value of the raw target variable Y for the ith training case
+     i
+    
+    Z = standardized value corresponding to Y
+     i                                       i
+    
+               upper bound of Y  +  lower bound of Y
+    midrange = -------------------------------------
+                                 2
+ 
+ 
+    range = upper bound of Y  -  lower bound of Y
+ 
+ 
+        Y  - midrange
+         i
+    Z = -------------
+     i    range / 2
+ 
+ For a range of [0,1], you can use the following formula: 
+ 
+                Y  - lower bound of Y
+                 i
+    Z = -------------------------------------
+     i  upper bound of Y  -  lower bound of Y  
+ 
+ If the target variable does not have known upper and lower bounds, it is not
+ advisable to use an output activation function with a bounded range. You can
+ use an identity output activation function or other unbounded output
+ activation function instead; see Why use activation functions? 
+ 
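The two target-scaling formulas can be transcribed directly into Python
(the example bounds 0 and 100 are made up; the function names are invented
for illustration):

```python
def scale_to_pm1(y, lower, upper):
    """Scale a target with known bounds to [-1, 1] (tanh-type output)."""
    midrange = (upper + lower) / 2
    return (y - midrange) / ((upper - lower) / 2)

def scale_to_01(y, lower, upper):
    """Scale a target with known bounds to [0, 1] (logistic output)."""
    return (y - lower) / (upper - lower)

print(scale_to_pm1(75.0, 0.0, 100.0))  # 0.5
print(scale_to_01(75.0, 0.0, 100.0))   # 0.75
```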
  Subquestion: Should I standardize the input cases (row vectors)?
  ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
***************
*** 1465,1473 ****
  corresponds to exposure in the image example. 
  
! If the input data are measured on an interval scale, you can control for
! size by subtracting a measure of the over-all size of each case from each
! datum. For example, if no other direct measure of size is available, you
! could subtract the mean of each row of the input matrix, producing a
! row-centered input matrix. 
  
  If the data are measured on a ratio scale, you can control for size by
--- 1584,1594 ----
  corresponds to exposure in the image example. 
  
! If the input data are measured on an interval scale (for information on
! scales of measurement, see "Measurement theory: Frequently asked questions",
! at ftp://ftp.sas.com/pub/neural/measurement.html), you can control for size
! by subtracting a measure of the over-all size of each case from each datum.
! For example, if no other direct measure of size is available, you could
! subtract the mean of each row of the input matrix, producing a row-centered
! input matrix. 
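For example, row-centering can be sketched in a couple of lines of Python
(the 2-by-3 matrix is invented for illustration):

```python
def row_center(matrix):
    """Subtract each row's mean from that row's entries."""
    return [[v - sum(row) / len(row) for v in row] for row in matrix]

print(row_center([[1.0, 2.0, 3.0],
                  [10.0, 20.0, 30.0]]))
# each row of the result sums to zero
```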
  
  If the data are measured on a ratio scale, you can control for size by
***************
*** 1557,1561 ****
  
  where c is a constant that controls how far the extreme values are brought
! in towards the mean. 
  
  References: 
--- 1678,1684 ----
  
  where c is a constant that controls how far the extreme values are brought
! in towards the mean. Using robust estimates of location and scale (Iglewicz
! 1983) instead of the mean and standard deviation will work even better for
! pathological distributions. 
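One concrete pair of robust estimators is the median for location and the
median absolute deviation (MAD) for scale. The following Python sketch is
only an illustration of the idea, not a formula from the FAQ (it also
assumes the MAD is nonzero):

```python
def median(x):
    s = sorted(x)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def robust_standardize(x):
    """Center by the median and scale by the MAD; a single wild
    outlier barely influences either estimate."""
    med = median(x)
    mad = median([abs(v - med) for v in x])
    return [(v - med) / mad for v in x]

# The outlier 1000 leaves the scaling of the other values sane.
print(robust_standardize([1.0, 2.0, 3.0, 4.0, 1000.0]))
# [-2.0, -1.0, 0.0, 1.0, 997.0]
```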
  
  References: 
***************
*** 1569,1572 ****
--- 1692,1699 ----
     Huber, P.J. (1981), Robust Statistics, NY: Wiley. 
  
+    Iglewicz, B. (1983), "Robust scale estimators and confidence intervals
+    for location", in Hoaglin, D.C., Mosteller, M. and Tukey, J.W., eds., 
+    Understanding Robust and Exploratory Data Analysis, NY: Wiley. 
+ 
     McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd
     ed., London: Chapman and Hall. 
***************
*** 1843,1846 ****
--- 1970,1976 ----
     Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
     Oxford University Press. 
+ 
+    K. I. Diamantaras, S. Y. Kung (1996) Principal Component Neural
+    Networks: Theory and Applications, NY: Wiley. 
  
     Deco, G. and Obradovic, D. (1996), An Information-Theoretic Approach to

==> nn3.changes.body <==
*** nn3.oldbody	Thu Nov 28 23:00:27 1996
--- nn3.body	Sat Dec 28 23:00:17 1996
***************
*** 1,4 ****
  Archive-name: ai-faq/neural-nets/part3
! Last-modified: 1996-09-19
  URL: ftp://ftp.sas.com/pub/neural/FAQ3.html
  Maintainer: saswss@unx.sas.com (Warren S. Sarle)
--- 1,4 ----
  Archive-name: ai-faq/neural-nets/part3
! Last-modified: 1996-12-19
  URL: ftp://ftp.sas.com/pub/neural/FAQ3.html
  Maintainer: saswss@unx.sas.com (Warren S. Sarle)
***************
*** 21,25 ****
     What is early stopping?
     What is weight decay?
!    What is Bayesian estimation?
     How many hidden layers should I use?
     How many hidden units should I use?
--- 21,25 ----
     What is early stopping?
     What is weight decay?
!    What is Bayesian learning?
     How many hidden layers should I use?
     How many hidden units should I use?
***************
*** 172,177 ****
  leading to overfitting. Overfitting is especially dangerous because it can
  easily lead to predictions that are far beyond the range of the training
! data with many of the common types of NNs. But underfitting can also produce
! wild predictions in multilayer perceptrons, even with noise-free data. 
  
  For an elementary discussion of overfitting, see Smith (1993). For a more
--- 172,177 ----
  leading to overfitting. Overfitting is especially dangerous because it can
  easily lead to predictions that are far beyond the range of the training
! data with many of the common types of NNs. Overfitting can also produce wild
! predictions in multilayer perceptrons even with noise-free data. 
  
  For an elementary discussion of overfitting, see Smith (1993). For a more
***************
*** 196,200 ****
   o Weight decay 
   o Early stopping 
!  o Bayesian estimation 
  
  These approaches are discussed in more detail under subsequent questions. 
--- 196,200 ----
   o Weight decay 
   o Early stopping 
!  o Bayesian learning 
  
  These approaches are discussed in more detail under subsequent questions. 
***************
*** 583,587 ****
  
  Fortunately, there is a superior alternative to weight decay: hierarchical 
! Bayesian estimation. Bayesian estimation makes it possible to estimate
  efficiently numerous decay constants. 
  
--- 583,587 ----
  
  Fortunately, there is a superior alternative to weight decay: hierarchical 
! Bayesian learning. Bayesian learning makes it possible to estimate
  efficiently numerous decay constants. 
  
***************
*** 604,612 ****
  ------------------------------------------------------------------------
  
! Subject: What is Bayesian estimation? 
! ======================================
  
! I haven't written an answer for this yet, but here are some references: 
  
     Bernardo, J.M., DeGroot, M.H., Lindley, D.V. and Smith, A.F.M., eds.,
     (1985), Bayesian Statistics 2, Amsterdam: Elsevier Science Publishers B.V.
--- 604,766 ----
  ------------------------------------------------------------------------
  
! Subject: What is Bayesian Learning?
! ===================================
  
! By Radford Neal. 
  
+ Conventional training methods for multilayer perceptrons ("backprop" nets)
+ can be interpreted in statistical terms as variations on maximum likelihood
+ estimation. The idea is to find a single set of weights for the network that
+ maximize the fit to the training data, perhaps modified by some sort of
+ weight penalty to prevent overfitting. 
+ 
+ The Bayesian school of statistics is based on a different view of what it
+ means to learn from data, in which probability is used to represent
+ uncertainty about the relationship being learned (a use that is shunned in
+ conventional--i.e., frequentist--statistics). Before we have seen any data,
+ our prior opinions about what the true relationship might be can be
+ expressed in a probability distribution over the network weights that
+ define this relationship. After we look at the data (or after our program
+ looks at the data), our revised opinions are captured by a posterior
+ distribution over network weights. Network weights that seemed plausible
+ before, but which don't match the data very well, will now be seen as being
+ much less likely, while the probability for values of the weights that do
+ fit the data well will have increased. 
+ 
+ Typically, the purpose of training is to make predictions for future cases
+ in which only the inputs to the network are known. The result of
+ conventional network training is a single set of weights that can be used to
+ make such predictions. In contrast, the result of Bayesian training is a
+ posterior distribution over network weights. If the inputs of the network
+ are set to the values for some new case, the posterior distribution over
+ network weights will give rise to a distribution over the outputs of the
+ network, which is known as the predictive distribution for this new case. If
+ a single-valued prediction is needed, one might use the mean of the
+ predictive distribution, but the full predictive distribution also tells you
+ how uncertain this prediction is. 
+ 
+ Why bother with all this? The hope is that Bayesian methods will provide
+ solutions to such fundamental problems as: 
+ 
+  o How to judge the uncertainty of predictions. This can be solved by
+    looking at the predictive distribution, as described above. 
+  o How to choose an appropriate network architecture (eg, the number of
+    hidden layers, the number of hidden units in each layer). 
+  o How to adapt to the characteristics of the data (eg, the smoothness of
+    the function, the degree to which different inputs are relevant). 
+ 
+ Good solutions to these problems, especially the last two, depend on using
+ the right prior distribution, one that properly represents the uncertainty
+ that you probably have about which inputs are relevant, how smooth the
+ function is, how much noise there is in the observations, etc. Such
+ carefully vague prior distributions are usually defined in a hierarchical
+ fashion, using hyperparameters, some of which are analogous to the weight
+ decay constants of more conventional training procedures. The use of
+ hyperparameters is discussed by MacKay (1992a, 1992b, 1995) and Neal (1993a,
+ 1996), who in particular use an "Automatic Relevance Determination" scheme
+ that aims to allow many possibly-relevant inputs to be included without
+ damaging effects. 
+ 
+ Selection of an appropriate network architecture is another place where
+ prior knowledge plays a role. One approach is to use a very general
+ architecture, with lots of hidden units, maybe in several layers or groups,
+ controlled using hyperparameters. This approach is emphasized by Neal
+ (1996), who argues that there is no statistical need to limit the complexity
+ of the network architecture when using well-designed Bayesian methods. It is
+ also possible to choose between architectures in a Bayesian fashion, using
+ the "evidence" for an architecture, as discussed by Mackay (1992a, 1992b). 
+ 
+ Implementing all this is one of the biggest problems with Bayesian methods.
+ Dealing with a distribution over weights (and perhaps hyperparameters) is
+ not as simple as finding a single "best" value for the weights. Exact
+ analytical methods for models as complex as neural networks are out of the
+ question. Two approaches have been tried: 
+ 
+ 1. Find the weights/hyperparameters that are most probable, using methods
+    similar to conventional training (with regularization), and then
+    approximate the distribution over weights using information available at
+    this maximum. 
+ 2. Use a Monte Carlo method to sample from the distribution over weights.
+    The most efficient implementations of this use dynamical Monte Carlo
+    methods whose operation resembles that of backprop with momentum. 
+ 
+ The first method comes in two flavours. Buntine and Weigend (1991) describe
+ a procedure in which the hyperparameters are first integrated out
+ analytically, and numerical methods are then used to find the most probable
+ weights. MacKay (1992a, 1992b) instead finds the values for the
+ hyperparameters that are most likely, integrating over the weights (using an
+ approximation around the most probable weights, conditional on the
+ hyperparameter values). There has been some controversy regarding the merits
+ of these two procedures, with Wolpert (1993) claiming that analytically
+ integrating over the hyperparameters is preferable because it is "exact".
+ This criticism has been rebutted by MacKay (1993). It would be inappropriate
+ to get into the details of this controversy here, but it is important to
+ realize that the procedures based on analytical integration over the
+ hyperparameters do not provide exact solutions to any of the problems of
+ practical interest. The discussion of an analogous situation in a different
+ statistical context by O'Hagan (1985) may be illuminating. 
+ 
+ Monte Carlo methods for Bayesian neural networks have been developed by Neal
+ (1993a, 1996). In this approach, the posterior distribution is represented
+ by a sample of perhaps a few dozen sets of network weights. The sample is
+ obtained by simulating a Markov chain whose equilibrium distribution is the
+ posterior distribution for weights and hyperparameters. This technique is
+ known as "Markov chain Monte Carlo (MCMC)"; see Neal (1993b) for a review.
+ The method is exact in the limit as the size of the sample and the length of
+ time for which the Markov chain is run increase, but convergence can
+ sometimes be slow in practice, as for any network training method. 
+ 
+ Work on Bayesian network learning has so far concentrated on multilayer
+ perceptron networks, but Bayesian methods can in principle be applied to
+ other network models, as long as they can be interpreted in statistical
+ terms. For some models (eg, RBF networks), this should be a fairly simple
+ matter; for others (eg, Boltzmann Machines), substantial computational
+ problems would need to be solved. 
+ 
+ Software implementing Bayesian neural network models (intended for research
+ use) is available from the home pages of David MacKay and Radford Neal. 
+ 
+ There are many books that discuss the general concepts of Bayesian
+ inference, though they mostly deal with models that are simpler than neural
+ networks. Here are some recent ones: 
+ 
+    Bernardo, J. M. and Smith, A. F. M. (1994) Bayesian Theory, New York:
+    John Wiley. 
+ 
+    Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (1995) Bayesian
+    Data Analysis, London: Chapman & Hall, ISBN 0-412-03991-5. 
+ 
+    O'Hagan, A. (1994) Bayesian Inference (Volume 2B in Kendall's Advanced
+    Theory of Statistics), ISBN 0-340-52922-9. 
+ 
+    Robert, C. P. (1995) The Bayesian Choice, New York: Springer-Verlag. 
+ 
+ The following books and papers have tutorial material on Bayesian learning
+ as applied to neural network models: 
+ 
+    Bishop, C. M. (1995) Neural Networks for Pattern Recognition, Oxford:
+    Oxford University Press. 
+ 
+    MacKay, D. J. C. (1995) "Probable networks and plausible predictions - a
+    review of practical Bayesian methods for supervised neural networks",
+    available at ftp://wol.ra.phy.cam.ac.uk/pub/www/mackay/network.ps.gz. 
+ 
+    Mueller, P. and Insua, D.R. (1995) "Issues in Bayesian Analysis of Neural
+    Network Models," Institute of Statistics and Decision Sciences Working
+    Paper 95-31, available at 
+    ftp://ftp.isds.duke.edu/pub/WorkingPapers/95-31.ps 
+ 
+    Neal, R. M. (1996) Bayesian Learning for Neural Networks, New York:
+    Springer-Verlag, ISBN 0-387-94724-8. 
+ 
+    Ripley, B. D. (1996) Pattern Recognition and Neural Networks,
+    Cambridge: Cambridge University Press. 
+ 
+    Thodberg, H. H. (1996) "A review of Bayesian neural networks with an
+    application to near infrared spectroscopy", IEEE Transactions on Neural
+    Networks, 7, 56-72. 
+ 
+ Some other references: 
+ 
     Bernardo, J.M., DeGroot, M.H., Lindley, D.V. and Smith, A.F.M., eds.,
     (1985), Bayesian Statistics 2, Amsterdam: Elsevier Science Publishers B.V.
***************
*** 613,639 ****
     (North-Holland). 
  
!    Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
!    Oxford University Press. 
  
!    Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (1995), Bayesian
!    Data Analysis, London: Chapman & Hall, ISBN 0-412-03991-5. 
  
!    MacKay, D.J.C. (1992), "A practical Bayesian framework for
     backpropagation networks," Neural Computation, 4, 448-472. 
  
!    MacKay, D.J.C. (199?), "Probable networks and plausible predictions--a
!    review of practical Bayesian methods for supervised neural networks," 
!    ftp://mraos.ra.phy.cam.ac.uk/pub/mackay/network.ps.Z. 
! 
!    Neal, R.M. (1995), Bayesian Learning for Neural Networks, Ph.D. thesis,
!    University of Toronto, ftp://ftp.cs.toronto.edu/pub/radford/thesis.ps.Z. 
! 
!    O'Hagan, A. (1985), "Shoulders in hierarchical models," in Bernardo et
!    al. (1985), 697-710. 
  
!    Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge:
!    Cambridge University Press. 
  
!    Sarle, W.S. (1995), "Stopped Training and Other Remedies for
     Overfitting," Proceedings of the 27th Symposium on the Interface of
     Computing Science and Statistics, 352-360, 
--- 767,797 ----
     (North-Holland). 
  
!    Buntine, W. L. and Weigend, A. S. (1991) "Bayesian back-propagation", 
!    Complex Systems, 5, 603-643. 
  
!    MacKay, D. J. C. (1992a) "Bayesian interpolation", Neural Computation,
!    4, 415-447. 
  
!    MacKay, D. J. C. (1992b) "A practical Bayesian framework for
     backpropagation networks," Neural Computation, 4, 448-472. 
  
!    MacKay, D. J. C. (1993) "Hyperparameters: Optimize or Integrate Out?",
!    available at ftp://wol.ra.phy.cam.ac.uk/pub/www/mackay/alpha.ps.gz. 
  
!    Neal, R. M. (1993a) "Bayesian learning via stochastic dynamics", in C. L.
!    Giles, S. J. Hanson, and J. D. Cowan (editors), Advances in Neural
!    Information Processing Systems 5, San Mateo, California: Morgan
!    Kaufmann, 475-482. 
! 
!    Neal, R. M. (1993b) Probabilistic Inference Using Markov Chain Monte
!    Carlo Methods, available at 
!    ftp://ftp.cs.utoronto.ca/pub/radford/review.ps.Z. 
! 
!    O'Hagan, A. (1985) "Shoulders in hierarchical models", in J. M. Bernardo,
!    M. H. DeGroot, D. V. Lindley, and A. F. M. Smith (editors), Bayesian
!    Statistics 2, Amsterdam: Elsevier Science Publishers B.V. (North-Holland),
!    697-710. 
  
!    Sarle, W. S. (1995) "Stopped Training and Other Remedies for
     Overfitting," Proceedings of the 27th Symposium on the Interface of
     Computing Science and Statistics, 352-360, 
***************
*** 641,644 ****
--- 799,828 ----
     compressed postscript file, 747K, 10 pages) 
  
+    Wolpert, D. H. (1993) "On the use of evidence in neural networks", in C.
+    L. Giles, S. J. Hanson, and J. D. Cowan (editors), Advances in Neural
+    Information Processing Systems 5, San Mateo, California: Morgan
+    Kaufmann, 539-546. 
+ 
+ Finally, David MacKay maintains a FAQ about Bayesian methods for neural
+ networks, at http://wol.ra.phy.cam.ac.uk/mackay/Bayes_FAQ.html . 
+ 
+ Comments on Bayesian learning
+ +++++++++++++++++++++++++++++
+ 
+ By Warren Sarle. 
+ 
+ Bayesian purists may argue over the proper way to do a Bayesian analysis,
+ but even the crudest Bayesian computation (maximizing over both parameters
+ and hyperparameters) is shown by Sarle (1995) to generalize better than
+ early stopping when learning nonlinear functions. This approach requires the
+ use of slightly informative hyperpriors and at least twice as many training
+ cases as weights in the network. A full Bayesian analysis by MCMC can be
+ expected to work even better under even broader conditions. Bayesian
+ learning works well by frequentist standards--what MacKay calls the
+ "evidence framework" is used by frequentist statisticians under the name
+ "empirical Bayes." Although considerable research remains to be done,
+ Bayesian learning seems to be the most promising approach to training neural
+ networks. 
+ 
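As a caricature of the Monte Carlo approach described above, the following
Python sketch averages the outputs of a toy one-hidden-unit network over a
handful of weight sets. In real use the weight sets would be drawn from the
posterior by an MCMC sampler; here they are invented numbers, so only the
mechanics of forming the predictive mean and its uncertainty are shown:

```python
import math

def net_output(weights, x):
    """A toy one-hidden-unit network: weights = (w1, b1, w2, b2)."""
    w1, b1, w2, b2 = weights
    return w2 * math.tanh(w1 * x + b1) + b2

# Pretend these weight sets were sampled from the posterior by MCMC
# (invented numbers, for illustration only).
posterior_sample = [(1.0, 0.0, 2.0, 0.1),
                    (1.2, -0.1, 1.8, 0.0),
                    (0.9, 0.1, 2.1, 0.2)]

def predictive_mean(x):
    outs = [net_output(w, x) for w in posterior_sample]
    return sum(outs) / len(outs)

def predictive_std(x):
    """Spread of the sampled outputs: how uncertain the prediction is."""
    outs = [net_output(w, x) for w in posterior_sample]
    m = sum(outs) / len(outs)
    return (sum((o - m) ** 2 for o in outs) / (len(outs) - 1)) ** 0.5

print(predictive_mean(0.5), predictive_std(0.5))
```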
  ------------------------------------------------------------------------
  
***************
*** 1073,1076 ****
  ------------------------------------------------------------------------
  
! Next part is part 4 (of 7). Previous part is part 2. @
  
--- 1257,1260 ----
  ------------------------------------------------------------------------
  
! Next part is part 4 (of 7). Previous part is part 2. 
  

==> nn4.changes.body <==
*** nn4.oldbody	Thu Nov 28 23:00:31 1996
--- nn4.body	Sat Dec 28 23:00:21 1996
***************
*** 1,4 ****
  Archive-name: ai-faq/neural-nets/part4
! Last-modified: 1996-11-07
  URL: ftp://ftp.sas.com/pub/neural/FAQ4.html
  Maintainer: saswss@unx.sas.com (Warren S. Sarle)
--- 1,4 ----
  Archive-name: ai-faq/neural-nets/part4
! Last-modified: 1996-12-18
  URL: ftp://ftp.sas.com/pub/neural/FAQ4.html
  Maintainer: saswss@unx.sas.com (Warren S. Sarle)
***************
*** 16,20 ****
  Part 4: Books, data, etc.
  
!    Good literature about Neural Networks?
     Journals and magazines about Neural Networks?
     The most important conferences concerned with Neural Networks?
--- 16,20 ----
  Part 4: Books, data, etc.
  
!    Books and articles about Neural Networks?
     Journals and magazines about Neural Networks?
     The most important conferences concerned with Neural Networks?
***************
*** 29,34 ****
  ------------------------------------------------------------------------
  
! Subject: Good literature about Neural Networks?
! ===============================================
  
  The Best
--- 29,34 ----
  ------------------------------------------------------------------------
  
! Subject: Books and articles about Neural Networks?
! ==================================================
  
  The Best
***************
*** 51,55 ****
  a rung above most other introductions to NNs. There are also brief chapters
  on data preparation and diagnostic plots, topics usually ignored in
! elementaty NN books. Only feedforward nets are covered in any detail. 
  
  Weiss, S.M. & Kulikowski, C.A. (1991), Computer Systems That Learn,
--- 51,55 ----
  a rung above most other introductions to NNs. There are also brief chapters
  on data preparation and diagnostic plots, topics usually ignored in
! elementary NN books. Only feedforward nets are covered in any detail. 
  
  Weiss, S.M. & Kulikowski, C.A. (1991), Computer Systems That Learn,
***************
*** 275,292 ****
  dirty!"
  
- Swingler , K. (1996), Applying Neural Networks: A Practical Guide, London:
- Academic Press. 
- This book has lots of good advice liberally sprinkled with errors, some bad
- advice, and the occasional howler. Experts will learn nothing, while
- beginners will be unable to separate the useful information from the
- dangerous. The most ludicrous thing I've found in the book is the claim that
- Hecht-Neilson used Kolmogorov's theorem to show that "you will never require
- more than twice the number of hidden units as you have inputs" (p. 53) in an
- MLP with one hidden layer. Hecht-Neilson has made an occasional published
- mistake himself, but I am sure he has never said anything this idiotic! Then
- Swingler goes on to say that Kurkova, V. (1991), "Kolmogorov's theorem is
- relevant," Neural Computation, 3, 617-622, confirmed this alleged upper
- bound on the number of hidden units--this is a gross insult to Kurkova! 
- 
  Wasserman, P. D. (1989). Neural Computing: Theory & Practice. Van Nostrand
  Reinhold: New York. (ISBN 0-442-20743-3) 
--- 275,278 ----
***************
*** 469,472 ****
--- 455,461 ----
  +++++++++
  
+ How not to use neural nets in any programming language
+ ------------------------------------------------------
+ 
     Blum, Adam (1992), Neural Networks in C++, Wiley. 
  
***************
*** 509,512 ****
--- 498,597 ----
  My comments apply only to the text of the above books. I have not examined
  or attempted to compile the code. 
+ 
+ An impractical guide to neural nets
+ -----------------------------------
+ 
+    Swingler, K. (1996), Applying Neural Networks: A Practical Guide, 
+    London: Academic Press. 
+ 
+ This book has lots of good advice liberally sprinkled with errors, incorrect
+ formulas, some bad advice, and some very serious mistakes. Experts will
+ learn nothing, while beginners will be unable to separate the useful
+ information from the dangerous. For example, there is a chapter on "Data
+ encoding and re-coding" that would be very useful to beginners if it were
+ accurate, but the formula for the standard deviation is wrong, and the
+ description of the softmax function is of something entirely different from
+ softmax (see What is a softmax activation function?). Even more dangerous is
+ the statement on p. 28 that "Any pair of variables with high covariance are
+ dependent, and one may be chosen to be discarded." Although high
+ correlations can be used to identify redundant inputs, it is incorrect to
+ use high covariances for this purpose, since a covariance can be high simply
+ because one of the inputs has a high standard deviation. 
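The scale-dependence of covariance is easy to demonstrate numerically. In
this sketch (Python with NumPy; the variables a, b, and d are my own
construction, not from the book), a weakly related pair has a far larger
covariance than a genuinely redundant pair, so ranking by covariance picks
exactly the wrong input to discard:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
a = rng.normal(size=n)
# b: weakly related to a (true correlation 0.1) but with a huge
# standard deviation (about 1000), so cov(a, b) is large anyway
b = 1000.0 * (0.1 * a + np.sqrt(0.99) * rng.normal(size=n))
# d: genuinely redundant with a (true correlation about 0.995)
d = a + 0.1 * rng.normal(size=n)

cov_ab = np.cov(a, b)[0, 1]      # large (~100) purely because of b's scale
cov_ad = np.cov(a, d)[0, 1]      # small (~1) despite near-redundancy
r_ab = np.corrcoef(a, b)[0, 1]   # ~0.1: a and b are NOT redundant
r_ad = np.corrcoef(a, d)[0, 1]   # ~0.995: a and d ARE redundant
```

Ranking pairs by |covariance| flags (a, b) as the "redundant" pair, the
mistake described above; ranking by |correlation| gets it right.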
+ 
+ The most ludicrous thing I've found in the book is the claim that
+ Hecht-Nielsen used Kolmogorov's theorem to show that "you will never require
+ more than twice the number of hidden units as you have inputs" (p. 53) in an
+ MLP with one hidden layer. Actually, Hecht-Nielsen says "the direct
+ usefulness of this result is doubtful, because no constructive method for
+ developing the [output activation] functions is known." Then Swingler
+ implies that V. Kurkova (1991, "Kolmogorov's theorem is relevant," Neural
+ Computation, 3, 617-622) confirmed this alleged upper bound on the number of
+ hidden units, saying that, "Kurkova was able to restate Kolmogorov's theorem
+ in terms of a set of sigmoidal functions." If Kolmogorov's theorem, or
+ Hecht-Nielsen's adaptation of it, could be restated in terms of known
+ sigmoid activation functions in the (single) hidden and output layers, then
+ Swingler's alleged upper bound would be correct, but in fact no such
+ restatement of Kolmogorov's theorem is possible, and Kurkova did not claim
+ to prove any such restatement. Swingler omits the crucial details that
+ Kurkova used two hidden layers, staircase-like activation functions (not
+ ordinary sigmoidal functions such as the logistic) in the first hidden
+ layer, and a potentially large number of units in the second hidden layer.
+ Kurkova later estimated the number of units required for uniform
+ approximation within an error epsilon as nm(m+1) in the first hidden
+ layer and m^2(m+1)^n in the second hidden layer, where n is the number
+ of inputs and m "depends on epsilon/||f|| as well as on the rate with
+ which f increases distances." In other words, Kurkova says nothing to
+ support Swingler's advice (repeated on p. 55), "Never choose h to be more
+ than twice the number of input units." Furthermore, constructing a
+ counterexample to Swingler's advice is trivial: use one input and one
+ output, where the output is the sine of the input, and the domain of the
+ input extends
+ over many cycles of the sine wave; it is obvious that many more than two
+ hidden units are required. For some sound information on choosing the number
+ of hidden units, see How many hidden units should I use? 
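The sine counterexample can be checked numerically. The sketch below (Python
with NumPy; the random search over weights is my own crude device, not a
proof) counts the turning points a 1-2-1 tanh network can produce and
compares that with what the target requires:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10 * np.pi, 2001)    # five full cycles of the sine wave

def count_extrema(y, tol=1e-6):
    """Count the local extrema (turning points) of a sampled function."""
    d = np.diff(y)
    s = np.sign(np.where(np.abs(d) < tol, 0.0, d))
    s = s[s != 0]                       # drop flat stretches
    return int(np.sum(s[1:] != s[:-1]))

# Output of a 1-2-1 MLP: a1*tanh(w1*x + b1) + a2*tanh(w2*x + b2) + bias.
# Each hidden unit is monotone in x, so the sum can turn around only a
# few times, no matter how the weights are chosen.
max_extrema = 0
for _ in range(2000):
    w1, w2, b1, b2, a1, a2 = rng.normal(scale=3.0, size=6)
    f = a1 * np.tanh(w1 * x + b1) + a2 * np.tanh(w2 * x + b2)
    max_extrema = max(max_extrema, count_extrema(f))

print(count_extrema(np.sin(x)))   # 10 turning points over five cycles
print(max_extrema)                # far fewer than the target requires
```

Since two hidden units cannot reproduce the ten turning points of the
target, Swingler's "twice the number of inputs" rule fails already for one
input.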
+ 
+ Choosing the number of hidden units is one important aspect of getting good
+ generalization, which is the most crucial issue in neural network training.
+ There are many other considerations involved in getting good generalization,
+ and Swingler makes several more mistakes in this area: 
+ 
+  o There is dangerous misinformation on p. 55, where Swingler says, "If a
+    data set contains no noise, then there is no risk of overfitting as there
+    is nothing to overfit." It is true that overfitting is more common with
+    noisy data, but severe overfitting can occur with noise-free data, even
+    when there are more training cases than weights. There is an example of
+    such overfitting under How many hidden layers should I use? 
+ 
+  o Regarding the use of added noise (jitter) in training, Swingler says on
+    p. 60, "The more noise you add, the more general your model becomes."
+    This statement makes no sense as it stands (it would make more sense if
+    "general" were changed to "smooth"), but it could certainly encourage a
+    beginner to use far too much jitter--see What is jitter? (Training with
+    noise). 
+ 
+  o On p. 109, Swingler describes leave-one-out cross-validation, which he
+    ascribes to Hecht-Nielsen. But Swingler concludes, "the method provides
+    you with L minus 1 networks to choose from; none of which has been
+    validated properly," completely missing the point that cross-validation
+    provides an estimate of the generalization error of a network trained on
+    the entire training set of L cases--see What are cross-validation and
+    bootstrapping? Also, there are L leave-one-out networks, not L-1. 
+ 
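To make the leave-one-out point concrete, here is a sketch (Python with
NumPy; a linear model stands in for a network, since the mechanics of
leave-one-out are the same, and the toy data are my own assumption). It
trains L models, one per held-out case, and averages the held-out squared
errors:

```python
import numpy as np

rng = np.random.default_rng(1)
L = 20                                    # number of training cases
x = rng.uniform(-1.0, 1.0, size=L)
y = 2.0 * x + rng.normal(scale=0.1, size=L)

# Leave-one-out: train L models, each on L-1 cases, and test each one on
# its single held-out case.  The average held-out error estimates the
# generalization error of a model trained on all L cases.
held_out_errors = []
for i in range(L):
    keep = np.arange(L) != i
    X = np.column_stack([x[keep], np.ones(L - 1)])
    coef, *_ = np.linalg.lstsq(X, y[keep], rcond=None)
    pred = coef[0] * x[i] + coef[1]
    held_out_errors.append((y[i] - pred) ** 2)

loo_estimate = float(np.mean(held_out_errors))
print(len(held_out_errors))   # L held-out fits were made, not L - 1
```

None of the L fitted models is the one you keep; the deliverable is the
model trained on all L cases, with loo_estimate as its estimated
generalization error.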
+ While Swingler has some knowledge of statistics, his expertise is not
+ sufficient for him to detect that certain articles on neural nets are
+ statistically nonsense. For example, on pp. 139-140 he uncritically reports
+ a method that allegedly obtains error bars by doing a simple linear
+ regression on the target vs. output scores. To a trained statistician, this
+ method is obviously wrong (and, as usual in this book, the formula for
+ variance given for this method on p. 150 is wrong). On p. 110, Swingler
+ reports an article that attempts to apply bootstrapping to neural nets, but
+ this article is also obviously wrong to anyone familiar with bootstrapping.
+ While Swingler cannot be blamed entirely for accepting these articles at
+ face value, such misinformation provides yet more hazards for beginners. 
+ 
+ Swingler addresses many important practical issues, and often provides good
+ practical advice. But the peculiar combination of much good advice with some
+ extremely bad advice, a few examples of which are provided above, could
+ easily seduce a beginner into thinking that the book as a whole is reliable.
+ It is this danger that earns the book a place in "The Worst" list. 
  
  ------------------------------------------------------------------------

==> nn5.changes.body <==

==> nn6.changes.body <==
*** nn6.oldbody	Thu Nov 28 23:00:39 1996
--- nn6.body	Sat Dec 28 23:00:28 1996
***************
*** 1,4 ****
  Archive-name: ai-faq/neural-nets/part6
! Last-modified: 1996-11-07
  URL: ftp://ftp.sas.com/pub/neural/FAQ6.html
  Maintainer: saswss@unx.sas.com (Warren S. Sarle)
--- 1,4 ----
  Archive-name: ai-faq/neural-nets/part6
! Last-modified: 1996-12-12
  URL: ftp://ftp.sas.com/pub/neural/FAQ6.html
  Maintainer: saswss@unx.sas.com (Warren S. Sarle)
***************
*** 164,180 ****
        Phone: (919) 677-8000          (49) 6221 4160
          Fax: (919) 677-4444          (49) 6221 474 850
  
     Operating systems: Windows 3.1, OS/2, HP/UX, Solaris, AIX
  
!    The SAS Neural Network Application trains a variety of neural nets and
!    includes a graphical user interface, on-site training and customisation.
!    Features include multilayer perceptrons, radial basis functions,
!    statistical versions of counterpropagation and learning vector
!    quantization, a variety of built-in activation and error functions,
!    multiple hidden layers, direct input-output connections, missing value
!    handling, categorical variables, standardization of inputs and targets,
!    and multiple preliminary optimizations from random initial values to
!    avoid local minima. Training is done by state-of-the-art numerical
!    optimization algorithms instead of tedious backprop. 
  
  3. NeuralWorks
--- 164,184 ----
        Phone: (919) 677-8000          (49) 6221 4160
          Fax: (919) 677-4444          (49) 6221 474 850
+       Email: software@sas.sas.com
  
     Operating systems: Windows 3.1, OS/2, HP/UX, Solaris, AIX
  
!    To find the addresses and telephone numbers of other SAS Institute
!    offices, including those outside the USA and Europe, connect your web
!    browser to http://www.sas.com/offices/intro.html. The SAS Neural Network
!    Application trains a variety of neural nets and includes a graphical user
!    interface, on-site training and customisation. Features include
!    multilayer perceptrons, radial basis functions, statistical versions of
!    counterpropagation and learning vector quantization, a variety of
!    built-in activation and error functions, multiple hidden layers, direct
!    input-output connections, missing value handling, categorical variables,
!    standardization of inputs and targets, and multiple preliminary
!    optimizations from random initial values to avoid local minima. Training
!    is done by state-of-the-art numerical optimization algorithms instead of
!    tedious backprop. 
  
  3. NeuralWorks

==> nn7.changes.body <==
-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
 *** Do not send me unsolicited commercial or political email! ***

