Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!das-news2.harvard.edu!news2.near.net!howland.reston.ans.net!agate!library.ucla.edu!csulb.edu!csus.edu!netcom.com!park
From: park@netcom.com (Bill Park)
Subject: What-if analysis with sparse training data?
Message-ID: <parkD5DBHG.4DA@netcom.com>
Followup-To: comp.ai.neural-nets
Summary: When do you trust a net trained with sparse data?
Keywords: sparse data generalization what if neural network connectionism
Bcc: park@netcom.com
Organization: Netcom Online Communications Services (408-241-9760 login: guest)
Date: Mon, 13 Mar 1995 07:29:40 GMT
Lines: 72
Sender: park@netcom7.netcom.com

When doing a "what-if" experiment with a net trained on
sparse data, how can you tell in general whether the case
you are giving the net is sufficiently close to enough of
the cases in the training set to allow the net to produce a
reliable answer?

I have very sparse data -- only a couple of hundred training
cases throughout a ten-dimensional input space (the net has
ten input values).  The net produces a single output value,
and can be trained to give acceptable accuracy as measured
with a holdout set of cases.  I'm training a three-layer
feedforward net with backprop, using only 3 hidden nodes.

The training cases are not uniformly distributed throughout
the input space, but are concentrated in one or more regions
that extend throughout only a small percentage of the total
hypervolume of the input space.  For example, in one
three-dimensional projection of the 10-dimensional input
space, all the training cases lie closely along almost one
full turn of a helix.  So, if you happen to pick a "what-if"
case that lies near the axis of this helix, the net will
probably give the wrong answer, since there are no training
cases near there.

Because the number of cases is small and they are compactly
distributed, it is extremely unlikely that a "what if" case
consisting of ten random input values will fall anywhere
near the training set.  Consequently, the trained net will
give unreliable answers for the vast majority of inputs one
might give it.  Therefore it is important to make sure a
"what-if" case is representative of the training cases
before you place any confidence in the net's answer.
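Here is the kind of check I have in mind, sketched in
Python with NumPy (the function name, k, and the quantile
threshold are all placeholders I made up, not anything from
a real package): accept a "what-if" case only if its k-th
nearest training case is about as close as training cases
typically are to each other.

```python
import numpy as np

def vet_case(x, X_train, k=3, quantile=0.95):
    """Accept a what-if case x only if its k-th nearest training
    case is about as close as training cases are to each other.
    k and quantile are placeholder settings to be tuned."""
    # Standardize each input to put all ten dimensions on one scale.
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    Z = (X_train - mu) / sigma
    z = (np.asarray(x, dtype=float) - mu) / sigma
    # Distance from the what-if case to its k-th nearest training case.
    d_case = np.sort(np.linalg.norm(Z - z, axis=1))[k - 1]
    # Typical k-th-neighbor distance *within* the training set
    # (column 0 of the sorted matrix is each point's zero self-distance).
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    d_train = np.sort(D, axis=1)[:, k]
    return bool(d_case <= np.quantile(d_train, quantile))
```

A case near the training cloud would pass; a case sitting
in empty hypervolume -- such as one near the helix axis
described below -- would be rejected, however plausible its
individual input values look.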

It's not possible in this application to simply generate
more training cases algorithmically to cover the rest of the
input space and train the net some more: The data values are
completely empirical and no useful theoretical models exist
for the system that produces them.  Neither is it practical
to obtain more empirical data at this time.

Some grossly unrepresentative "what-if" cases are easily
detected.  For instance, if any input value in a case falls
outside its range of variation in the training set, I can
safely reject that case.  But deciding whether to reject a
case that falls within those bounds would be much more
difficult, as can be seen from the helical distribution of
the training cases mentioned above: The axis of the helix is
well within the range of the training set (they surround
it!), yet we have no information about what answer the net
should give for cases in that region.  We can only trust the
net's answers for cases that lie close to the helix.  I
could fit a helix, transform variables, and reduce these
three inputs to one, etc., but that would take care of only
three of the ten dimensions in this problem; the other
dimensions show no patterns as clean and simple as the
helix, and it's not a general solution anyway.
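To make the helix argument concrete, here is a small
synthetic illustration in Python with NumPy (the data are
made up, standing in for the three helical inputs): the
per-input range check happily accepts a point on the axis,
while even a crude nearest-neighbor distance exposes it.

```python
import numpy as np

rng = np.random.default_rng(1)
# ~200 synthetic cases along one full turn of a unit-radius helix,
# a stand-in for the three helical inputs described above.
t = rng.uniform(0.0, 2.0 * np.pi, 200)
X = np.column_stack([np.cos(t), np.sin(t), t / (2.0 * np.pi)])

axis_point = np.array([0.0, 0.0, 0.5])  # mid-height, on the axis

# Per-input range check: every coordinate is within the training
# set's bounds, so this test accepts the case...
in_box = bool(np.all((axis_point >= X.min(axis=0)) &
                     (axis_point <= X.max(axis=0))))

# ...yet the nearest training case is a full helix radius away.
d_nn = float(np.min(np.linalg.norm(X - axis_point, axis=1)))
```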

So, briefly, I need a general method for vetting "what-if"
cases -- maybe a cluster-analysis approach, or maybe
training another net to model the error in the first net.
I'm open to suggestions and have found nothing on this in
the literature I have at hand.  I should think this happens
frequently in real applications, and am surprised that there
seems to be no well-known solution.
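One density-based variant of the cluster-analysis idea
(again only a Python/NumPy sketch; the bandwidth h is a
knob I would have to tune, not a recommended value) would
be to score each "what-if" case by a kernel density
estimate built on the training cases, and reject cases
whose score falls below, say, the lowest few percent of
scores the training cases themselves achieve:

```python
import numpy as np

def log_density(x, X_train, h=0.25):
    """Log of a Gaussian kernel density estimate at x, built from
    the training cases.  Bandwidth h is a hypothetical setting."""
    d2 = np.sum((X_train - np.asarray(x, dtype=float)) ** 2, axis=1)
    a = -d2 / (2.0 * h ** 2)
    m = a.max()                      # log-sum-exp for stability
    return float(m + np.log(np.mean(np.exp(a - m))))
```

On the helical example, a point on the helix would score
far higher than a point on the axis, even though both pass
the per-input range check.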

Thank you in advance for any suggestion for a general
approach to this problem -- especially if there is a
commercial neural net development package that provides it!

Bill Park
=========
-- 
Grandpaw Bill's High Technology Consulting & Live Bait, Inc.
