Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!newshost!goldenapple.srv.cs.cmu.edu!das-news2.harvard.edu!fas-news.harvard.edu!oitnews.harvard.edu!cmcl2.nyu.edu!yale!newsxfer.itd.umich.edu!newsxfer3.itd.umich.edu!howland.erols.net!news.mathworks.com!newsgate.duke.edu!interpath!news.interpath.net!news.interpath.net!sas!newshost.unx.sas.com!saswss
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: Backprop NN...what ratio of training to test data?
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <E7D9tM.n15@unx.sas.com>
Date: Fri, 21 Mar 1997 00:10:34 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
References: <33247A4D.344C@ais.net> <5gja8e$mjn$1@mhadf.production.compuserve.com> <Pine.SUN.3.95.970319113345.9129A-100000@solitude>
Organization: SAS Institute Inc.
Lines: 56


In article <Pine.SUN.3.95.970319113345.9129A-100000@solitude>, Ted Heron <heron@mpd.tandem.com> writes:
|> On 17 Mar 1997, Will Dwinnell wrote:
|> 
|> > With 190 vectors, I'd go for:
|> > 
|> > 80 vectors (chosen randomly?) = training set
|> > 20 vectors (also chosen randomly?) = test set
|> > remaining 90 vectors = validation set
|> ...
|> When each of these sets is chosen, I assume that "chosen randomly"
|> means "chosen randomly without replacement" - statistics books that I
|> have read say that this (randomly without replacement) should never be
|> done unless the data set is very large (like in the 1000's). What am
|> I missing here? -

Which statistics books say this, and for what sampling purpose?

When you are collecting data for statistical analysis, sampling
without replacement from a finite population provides more reliable
estimates than does sampling with replacement. For an infinite
population, it makes no difference.

Regarding training and test sets, the ideal situation would be to sample
the training set without replacement from the population, and
independently sample the test set without replacement from the
population.  By this method, any given case could appear in both the
training and test sets, but the probability of such overlap would be low
if the population were large.  This method would give you an unbiased
estimate of generalization error from the test set.
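A small simulation can make this concrete. The sketch below (Python,
with an assumed population of 100,000 cases and the 190-case sample
sizes from the thread) draws the training and test sets independently,
each without replacement from the population, and counts the overlap:

```python
import random

random.seed(0)
population = list(range(100_000))  # assumed large finite population

# Two *independent* samples, each drawn without replacement
train = random.sample(population, 190)
test = random.sample(population, 190)

# Overlap is possible, but rare when the population is large:
# the expected overlap here is about 190*190/100000, i.e. well
# under one case on average.
overlap = set(train) & set(test)
print(len(train), len(test), len(overlap))
```

Because the two draws are independent, no negative or positive
dependence is induced between training and test error, which is why
this scheme yields an unbiased estimate.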

If you first obtain a data set by sampling from a population (with or
without replacement), and then choose a training set by subsampling with
replacement from the full data set, and also choose a test set by
subsampling with replacement from the full data set, then the training
and test sets will probably overlap much more than in the previous
scenario. The test error will therefore be positively correlated with
the training error, and hence will be optimistically biased.
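The overlap in this second scenario is easy to see by simulation. The
sketch below (an illustration, not anyone's prescribed method) treats
the 190 cases as the full data set and subsamples both sets with
replacement, as in the 80/20 split discussed above:

```python
import random

random.seed(0)
full_data = list(range(190))  # the full data set of 190 cases

# Subsample both sets *with* replacement from the same small data set
train = random.choices(full_data, k=80)
test = random.choices(full_data, k=20)

# A given test case lands in the training set with probability
# roughly 1 - (1 - 1/190)**80, about 0.34, so several of the 20
# test cases will typically also appear in the training set.
shared = set(train) & set(test)
print(len(shared))
```

Every shared case is one the network has already seen during training,
which is the source of the optimistic bias.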

If you first obtain a data set by sampling from a population (with or
without replacement), and then choose a training set by subsampling
without replacement from the full data set, and also choose a test set
by subsampling without replacement from the full data set, so that the
training and test sets do not overlap at all, then the test error will
be negatively correlated with training error, and hence will be
pessimistically biased. However, if the population is much larger than
the full data set, this pessimistic bias will be negligible. Some
people actually prefer to measure generalization error this way, i.e.,
as off-training-set error.
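This third, non-overlapping scheme is the usual shuffle-and-slice
split. A minimal sketch, using the 80/20/90 partition proposed earlier
in the thread:

```python
import random

random.seed(0)
full_data = list(range(190))  # the full data set of 190 cases

# Disjoint subsampling without replacement: shuffle once, then slice
shuffled = random.sample(full_data, len(full_data))
train = shuffled[:80]
test = shuffled[80:100]
validation = shuffled[100:]

# No case can appear in more than one set, so the test error is a
# true off-training-set error (pessimistically biased, as noted).
assert not set(train) & set(test)
assert not set(train) & set(validation)
```

The disjointness is guaranteed by construction, which is exactly what
makes the test error an off-training-set error.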

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
 *** Do not send me unsolicited commercial or political email! ***

