Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!das-news2.harvard.edu!news4.ner.bbnplanet.net!news.ner.bbnplanet.net!news.mathworks.com!newsfeed.internetmci.com!in1.uu.net!news.interpath.net!sas!mozart.unx.sas.com!saswss
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: some questions about FF-nets
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <DoHEHx.6E4@unx.sas.com>
Date: Mon, 18 Mar 1996 21:03:33 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
References: <x1KocEp.predictor@delphi.com> <4i7aj6$6vl@airgun.wg.waii.com> <4i7s7v$7ki@delphi.cs.ucla.edu> <4ialh8$k6@llnews.ll.mit.edu> <DoAqIr.2rB@unx.sas.com> <4idsma$pn7@llnews.ll.mit.edu>
Organization: SAS Institute Inc.
Lines: 108


Regarding the ratio "r" of training cases to trainable weights to get
useful results:

In article <4idsma$pn7@llnews.ll.mit.edu>, heath@ll.mit.edu (Greg Heath) writes:
|> In article <DoAqIr.2rB@unx.sas.com>, saswss@hotellng.unx.sas.com (Warren Sarle) 
|> writes:
|> |> The theoretical lower limit is 1 if you have no noise and the model is
|> |> specified correctly and parsimoniously. For typical NN applications,
|> |> none of those conditions hold, so the lower limit is >1, but how much
|> |> greater isn't known.
|> 
|> Hmm...I need to go to the library. Meanwhile...what do you get when you have
|> 1) no noise,
|> 2) a correct parsimonious model,
|> and
|> 3) N_d(# of design vectors) = N_p(# of model parameters)?
|> 
|> either
|> 1) a zero-error classifier for training data (unhappy sponsor),
|> or
|> 2) a minimum error classifier for non-training data (happy sponsor),
|> or
|> 3) something else?

Both (1) and (2). For example, consider fitting a straight line in the
regression case (one input, one output), which has two weights (intercept
and slope). Two training cases exactly determine the line under
assumption (1), and that line must give perfect generalization under
assumption (2).
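To make that concrete, a quick numerical check (the particular line y = 3x + 1 is just an illustration):

```python
import numpy as np

# Two noise-free training cases (x, y) from the true line y = 3x + 1.
x_train = np.array([0.0, 2.0])
y_train = 3.0 * x_train + 1.0

# Fitting a line (2 weights: slope and intercept) to 2 cases: the
# system is exactly determined, so the fit is unique and exact.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# Under assumptions (1) no noise and (2) a correct, parsimonious
# model, the fitted line reproduces the true function everywhere,
# not just at the training cases -- perfect generalization.
x_test = np.array([-5.0, 0.7, 10.0])
y_pred = slope * x_test + intercept
print(np.allclose(y_pred, 3.0 * x_test + 1.0))  # True
```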

|> I'm baaack (from the library). My original comments were based on what I thought I 
|> remembered from Nilsson, "The Mathematical Foundations of Learning Machines", 
|> Morgan-Kaufmann, 1990, ISBN 1-55860-123-6. His argument is for "PHI" functions 
|> (i.e., the output depends linearly on the N_p parameters to be estimated). This 
|> includes MLPs and RBFs for which all parameters are fixed except a bias for the 
|> linear output neuron and N_p - 1 connecting weights between the last hidden layer 
|> and the output neuron.

"PHI" functions are what statisticians call "linear models".

|> The results apply to P(r,N_p), the probability of achieving a zero-error classifier 
|> for the resubstitution of N_d = r*N_p training data vectors that are in "general 
|> position". There are several important points:
|> ...

It's not clear to me what random mechanism is behind these probabilities.
I may or may not have time to look it up soon. But anyhow, 

|> Now, given that I want to minimize the expected generalization error rate for
|> N_p = 61, what do I do? Well, looking at Nilsson's plot I see that 61 ~ infinity. 
|> Next I see from 5a that r < 2 just about guarantees me a training data lookup table.
|> Expecting no generalization properties here, I now consider r >= 2 and proceed as 
|> follows: 
|> 
|> 1. Initialize r = 2.
|> 
|> 2. Train with N_d = r*N_p and test with nontraining data.
|> 
|> 3. Increase r (I double it) and repeat step 2.
|> 
|> 4. If increasing r hasn't significantly lowered the nontraining data error rate,
|>    then I assume that the previous value is what I want. Otherwise I go back to
|>    step 3.
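The quoted procedure might be sketched like this; the generalization_error function is a hypothetical stand-in for actually training and testing a network (assumed to fall and then level off as data grows), and the 0.01 threshold is one arbitrary reading of "significantly lowered":

```python
N_p = 61  # number of trainable parameters in the example above

def generalization_error(n_train):
    # HYPOTHETICAL stand-in for "train with n_train cases, then measure
    # the error rate on nontraining data".  Real use would fit the
    # network and score a held-out test set; here the error is simply
    # assumed to decrease and plateau as the training set grows.
    return 0.05 + 0.5 / (1.0 + n_train / (4 * N_p))

r = 2                                       # step 1: initialize r = 2
prev_err = generalization_error(r * N_p)    # step 2: train and test
while True:
    r *= 2                                  # step 3: double r, repeat
    err = generalization_error(r * N_p)
    if prev_err - err < 0.01:               # step 4: no "significant"
        r //= 2                             #   drop; keep previous r
        break
    prev_err = err
print(r)  # 128
```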

That seems generally sensible, although there are questions about how
you generate the training and test data and what "significantly lowered"
means. If both training and test data are generated randomly, then this
approach should let you make valid inferences about the average
generalization error under the assumed distribution of inputs.  The one
thing I would suggest is that you compute confidence intervals for the
test-set error rate, which is easy to do for classification problems.
See:

   Weiss, S.M. & Kulikowski, C.A. (1991), Computer Systems That
   Learn, Morgan Kaufmann.
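For a classifier, errors on independently drawn test cases are binomially distributed, so such an interval takes only a few lines; a sketch using the normal approximation (the counts here are invented):

```python
import math

# Hypothetical test results: 37 errors out of 500 nontraining cases.
n_test = 500
n_errors = 37
p_hat = n_errors / n_test  # observed test-set error rate

# Approximate 95% confidence interval for the true error rate,
# treating the error count as binomial (normal approximation;
# adequate when n_test * p_hat is not too small).
z = 1.96
half_width = z * math.sqrt(p_hat * (1 - p_hat) / n_test)
lo, hi = p_hat - half_width, p_hat + half_width
print(f"error rate {p_hat:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```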

|> So...I find the following problems:
|> 
|> 1. The analysis is not valid if the input and non-last hidden layers are 
|>    trainable.

Your algorithm doesn't really depend on linearity, so I don't think
this is a problem.

|> 2. I've assumed that when P = 1, the classifier is no more than a training data 
|>    lookup table and has no redeeming generalization qualities.

With noisy data, that's a reasonable assumption.

|> 3. I've assumed that  P ~ 1 is equivalent to P = 1. So I've neglected to search 
|>    for r values in the range 1 < r < 2 where the chance of success is very low 
|>    but still nonzero.

You're probably not missing much.

|> 4. I've been doing it "my way" for 13 years and ain't gonna change.

I won't try to make you change!

|> Boy, this is interesting! Any comments? Have recent theoretical results gone beyond 
|> the old PHI function limitation? 

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
