Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!bb3.andrew.cmu.edu!newsfeed.pitt.edu!gatech!news.mathworks.com!zombie.ncsc.mil!nntp.coast.net!news.sprintlink.net!news.interpath.net!sas!mozart.unx.sas.com!saswss
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: feature selection: bias & the need for separate train and test sets
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <DopGr5.Gv0@unx.sas.com>
Date: Sat, 23 Mar 1996 05:33:05 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
References: <4h73d9$4pf@eis.wfunet.wfu.edu> <4hrah6$ml@llnews.ll.mit.edu>
Organization: SAS Institute Inc.
Lines: 32


In article <4hrah6$ml@llnews.ll.mit.edu>, heath@ll.mit.edu (Greg Heath) writes:
|> ...
|> Second, it is well known (e.g., Fukunaga, Academic Press, 1972, 1990) that for 
|> large design sets with N_d random samples, large independent test sets with N_t  
|> random samples, and classifiers with N_p parameters that have to be estimated 
|> with the design set:
|> 
|>   1. The classification error bias is O(N_p/N_d), 

That would be training error, and is strictly correct only for linear models
under certain assumptions and with certain training methods such as maximum
likelihood. This formula does not apply to regularized or stopped training.
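A quick Monte Carlo sketch of that point (mine, not from the thread; it assumes
NumPy, and uses squared error on an ordinary-least-squares regression rather
than classification error, where the classical result is that the expected
training MSE is the noise variance shrunk by a factor of (1 - N_p/N_d)):

```python
# Sketch: training error is optimistically biased by O(N_p/N_d) for a
# linear model fit by least squares. N_d, N_p, sigma2, and trials are
# illustrative values, not anything from the original post.
import numpy as np

rng = np.random.default_rng(0)
N_d, N_p, sigma2 = 200, 20, 1.0   # design-set size, parameter count, noise variance
trials = 500

train_mse = []
for _ in range(trials):
    X = rng.standard_normal((N_d, N_p))
    beta = rng.standard_normal(N_p)
    y = X @ beta + rng.standard_normal(N_d) * np.sqrt(sigma2)
    b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # maximum-likelihood fit
    resid = y - X @ b_hat
    train_mse.append(np.mean(resid ** 2))

# Classical result: E[training MSE] = sigma2 * (1 - N_p/N_d), i.e. the
# training error underestimates sigma2 by a term of order N_p/N_d.
print(np.mean(train_mse), sigma2 * (1 - N_p / N_d))
```

Regularization or early stopping changes the effective number of parameters,
which is one way to see why the raw N_p/N_d formula stops applying there.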

|> and
|>   2. The classification error standard deviation is O(1/sqrt(N_t)).

That must be test error.

|> Therefore, provided the data set is sufficiently large, the optimal split into 
|> independent design and test sets should satisfy
|>  
|>                         N_t >> N_d >> N_p.

I don't follow this argument at all. Would Greg care to elaborate, or should
I get the book?

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
