Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!nntp.club.cc.cmu.edu!news.duq.edu!newsfeed.pitt.edu!portc02.blue.aol.com!howland.erols.net!EU.net!CERN.ch!news
From: john nigel gamble <gamble@dxcoms.cern.ch>
Subject: Re: Beginners questions
X-Nntp-Posting-Host: dxcoms.cern.ch
Content-Type: text/plain; charset=us-ascii
To: "Stefan C. Kremer" <stefan.kremer@crc.doc.ca>
Message-ID: <329C06AA.41C6@dxcoms.cern.ch>
Sender: news@news.cern.ch (USENET News System)
Content-Transfer-Encoding: 7bit
Organization: CERN European Lab for Particle Physics
References: <32998DF2.41C6@dxcoms.cern.ch> <57d1eo$pik@crc-news.doc.ca> <329AE2A8.41C6@dxcoms.cern.ch> <57euvb$fgt@crc-news.doc.ca>
Mime-Version: 1.0
Date: Wed, 27 Nov 1996 09:15:22 GMT
X-Mailer: Mozilla 3.01Gold (X11; I; OSF1 V3.2 alpha)
Lines: 154

Thanks Stefan, this exchange was very interesting, if rather
depressing.  I learnt quite a lot.  For a long time I have been trying
out various (simple) algorithms such as the common backprop and Rprop
in an attempt to make my networks learn "better".  I've made surface
plots of gain vs. momentum vs. you-name-it ... very pretty, but now I
understand why they didn't help much.
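For what it's worth, the Rprop rule I was experimenting with boils
down to something like this (my own sketch in Python, with the usual
step-size factors; treat it as a paraphrase, not a reference
implementation):

```python
import numpy as np

def rprop_step(w, grad, prev_grad, step,
               eta_plus=1.2, eta_minus=0.5,
               step_max=50.0, step_min=1e-6):
    """One Rprop update: adapt each weight's step size from the
    SIGN of its gradient, ignoring the gradient's magnitude."""
    sign_change = grad * prev_grad
    # gradient kept its sign: grow the step; sign flipped: shrink it
    step = np.where(sign_change > 0,
                    np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0,
                    np.maximum(step * eta_minus, step_min), step)
    # where the sign flipped, skip the update this time round
    grad = np.where(sign_change < 0, 0.0, grad)
    w = w - np.sign(grad) * step
    return w, grad, step
```

Nice property: only the sign of the gradient matters, so a flat
plateau doesn't slow it down the way it slows plain backprop.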

However, just a short excursion into the kangaroo analogy shows that
whether one algorithm converges faster or better than another is not
the main issue (although it is important) in solving (some) problems
with neural networks - it's finding the right starting point.
Am I right?  Gee, maybe this should go in the FAQ.
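In that spirit, the simplest thing I can think of to act on the
starting-point observation is to restart from several random points
and keep the best.  A toy sketch on a made-up two-valley error curve
(the function and numbers are mine, purely for illustration):

```python
import random

def f(w):                       # toy error curve with two valleys
    return w**4 - 3*w**2 + w    # global minimum near w = -1.30

def grad(w):
    return 4*w**3 - 6*w + 1

def descend(w, lr=0.01, steps=200):
    """Plain gradient descent from a given starting point."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# a single run lands in whichever valley the start happens to be in;
# restarting from several random points and keeping the lowest is the
# crudest possible hedge against a bad starting point
random.seed(0)
best = min((descend(random.uniform(-2, 2)) for _ in range(10)), key=f)
```

Of course for a real network each restart is a full training run, so
this gets expensive fast - which is presumably why picking a *good*
starting point matters so much.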

>                  Two specific ways of incorporating
>prior knowledge to help the network learn are:  (1) setting the initial
>weight values to an approximate solution to the problem, or (2)

How do you do this?  How do you map "prior knowledge" into weights?

>P.S.  If my answers seem more vague, it's because your questions are
>getting harder.

You're a good teacher.

=========================================== original message




Stefan C. Kremer wrote:
> 
> In article <329AE2A8.41C6@dxcoms.cern.ch>, gamble@dxcoms.cern.ch says...
> 
> >Being a programmer rather than mathematician
> 
> (You make it sound like the two are mutually exclusive.)
> 
> >, (with regret) I
> >considered that having found a valley (or peak) I have a set of
> >coordinates in weight space that is the local minimum. I could simply
> >"hop" along each of the coordinate axes in turn till I started going
> >up-hill (e.g. in two dimensions go North, South, East and West from
> >the minimum). This is just feed-forward (isn't it?).
> 
> There are definitely limitations to the method you propose.  Consider
> for example a ring-shaped valley.  Proceeding from a low point within
> the ring you will not be able to map out the whole ring in this fashion.
> I don't think this is "just feed-forward", since to make the hops,
> you need to compute the derivative of the error function with respect
> to one weight.  To do this you in turn will need to feed forward to
> compute the error and then back-prop (but only to the one weight) to
> make the change.
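[To see what this costs, I tried writing the axis-hop probe down.
This sketch uses a finite-difference version - nudging one weight and
re-running the error function - instead of back-prop, so every hop is
just one extra feed-forward pass; the names are mine:]

```python
import numpy as np

def axis_probe(error_fn, w, i, h=0.1, max_hops=50):
    """Hop along weight axis i from a (local) minimum until the error
    starts rising.  Each hop costs one evaluation of error_fn, i.e.
    one feed-forward pass.  Returns the downhill distance covered."""
    e0 = error_fn(w)
    for k in range(1, max_hops + 1):
        w_trial = w.copy()
        w_trial[i] += k * h          # hop k steps out along axis i
        e = error_fn(w_trial)
        if e > e0:
            return (k - 1) * h       # last distance that was not uphill
        e0 = e
    return max_hops * h
```

On a ring-shaped valley all four axis directions can go uphill
immediately even though the valley continues diagonally - which is
exactly the limitation Stefan points out.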
> 
> >Now, if the analogy holds, the further I go the more ground I am likely
> >to miss (the areas between the axes) - so maybe I shouldn't go too far.
> 
> Or maybe you shouldn't go ON the axes.
> 
> >There again, by looking at the local surface gradient as I hop I can
> >build up some confidence concerning the local terrain. (I don't see
> >how to do this easily).
> >If the optimum is really an isolated steep-sided hole in a hill, then
> >it is going to be difficult for any step-wise approximating algorithm
> >to find it.
> 
> Right.
> 
> >>Of course, there is the possibility that "one side" of the mountain may
> >>extend to infinity, dropping in altitude as one or more weights are
> >>made larger (or smaller).  I suspect this case would be very prevalent
> >>for binary output problems (and maybe continuous ones as well).
> >
> >But that's OK, isn't it?  I was trying to eliminate areas (volumes is
> >perhaps more correct) of weight space that I don't need to look at.
> 
> Sure.
> 
> >>arbitrary contour and even contain holes).  If my supposition about
> >>arbitrarily complex regions is true, then in order to be sure that
> >>there are no other mountains hiding in what you think might be the
> >>region of your current mountain, you would have to actually visit every point
> >>in the hypothesised region.  If the weight space is continuous then
> >>the problem of identifying the region of a mountain is impossible.
> >
> >I agree (I think) to this as a generalised mathematical statement, but
> >doesn't this observation apply to all step-wise algorithms?
> 
> I think it's even worse than that.  I believe it would apply to all
> weight optimizing algorithms.
> 
> >If we are looking for a minimum, and this happens to be the crater of
> >a volcano, then I don't see how any gradient based algorithm can
> >find it - without an exhaustive search - or good luck!
> 
> Yes, exactly!  The lesson I take from this is that finding optimal
> weights in the general case can be quite hard, and that learning
> in the general case can be very hard.  One solution to this problem
> is to introduce some sorts of assumptions about what is to be learned.
> I.e.  to do problem specific learning rather than general learning.
> 
> As a programmer (I've done my share of it too), it is often very
> tempting to "just implement a general NN algorithm and let the network
> do the work", investing as little effort in understanding the network
> or the problem to be solved as possible.  This is what I did when I
> first began working with networks.  You have clearly already progressed
> beyond that stage, though, in thinking about error functions and finding
> minima.
> 
> In actual fact, it's been my experience that forcing the
> network to do all the learning often results in failure.  If the
> programmer knows some things about the problem, then that knowledge
> can often be used to simplify the learning task given to the network
> (a sort of division of labour).  Two specific ways of incorporating
> prior knowledge to help the network learn are:  (1) setting the initial
> weight values to an approximate solution to the problem, or (2)
> encoding the input to the network in such a way that the network's
> task is simpler.  (There's obviously a lot more to it than that.)
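[As a toy illustration of point (1): if an approximate solution is
already known, it can be planted in the weights before training,
instead of starting from arbitrary values.  The task and all the
numbers below are mine, purely for illustration:]

```python
import numpy as np

# toy task: learn y = 1.8*x + 32 (Fahrenheit from Celsius)
x = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
y = 1.8 * x + 32.0

def train(w, b, lr=1e-3, epochs=100):
    """Gradient descent on mean squared error for a linear unit;
    returns the final MSE so the two starts can be compared."""
    for _ in range(epochs):
        err = (w * x + b) - y
        w -= lr * (err * x).mean()
        b -= lr * err.mean()
    return (((w * x + b) - y) ** 2).mean()

# (1) start from a rough known solution vs. an uninformed start:
# the same budget of epochs gets much further from the informed start
mse_informed = train(w=2.0, b=30.0)
mse_random = train(w=0.1, b=0.0)
```

The point being that the "approximate solution" puts the kangaroo on
the right mountain before the first hop.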
> 
> >I have (at least) two subsequent questions
> >
> >1). what other algorithms exist that don't "follow" the surface?
> 
> What do you mean by "follow"?  In the sense that the surface reflects
> the error, all algorithms that reduce error (i.e. try to learn the
> problem you present) are based on the surface.
> 
> >2). How close does the analogy hold?  For example, in the kangaroo's
> >    land I understood weights to be used as if they were equivalent
> >    to the axes.  North-South is orthogonal to East-West.
> >    But is this true in "weight-space"?  Is this a stupid question?
> >    I have difficulty picturing it, but maybe the number of "axes" (in
> >    an n-dimensional coordinate system) that represents weight space
> >    is not the same as the number of weights.
> 
> I think the analogy holds at this level.  Each weight (and bias) in
> the network represents an orthogonal direction in weight space (since
> each weight can be varied independently of every other).  The error
> value for a particular point in weight space adds one more dimension.
> I.e. if there are two weights in your network (a one-input, one-output,
> one-weight, one-bias network), then the two weight dimensions can be
> thought of as NS and EW (in a flat-earth sense).  The error function
> of the network can be represented as a third dimension--altitude.
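[This NS/EW/altitude picture can be computed directly for the
one-weight, one-bias network - essentially the surface plots I
mentioned at the top.  The squashing function and the two-point
training set here are just an illustration:]

```python
import numpy as np

# one-input, one-output net: out = sigmoid(w*x + b)
x = np.array([0.0, 1.0])     # toy training inputs
t = np.array([0.0, 1.0])     # toy targets

def error(w, b):
    """Summed squared error of the net at one point in weight space."""
    out = 1.0 / (1.0 + np.exp(-(w * x + b)))
    return ((out - t) ** 2).sum()

# sample the landscape: w is north-south, b is east-west,
# and the error is the altitude at each grid point
ws = np.linspace(-10, 10, 81)
bs = np.linspace(-10, 10, 81)
altitude = np.array([[error(w, b) for b in bs] for w in ws])
```

This grid can go straight into a surface plotter.  It only works
because there are exactly two weights; with n weights the landscape
still exists, it just can't be drawn.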
> 
> >3). I guess all this also depends on the training set being complete.
> >    Do you have an analogy for the landscape when the training set is
> >    incomplete?
> 
> I don't have a good analogy here.  In fact, the whole thing breaks down
> when you consider presenting one training example at a time.
> 
>         -Stefan
> 
> P.S.  If my answers seem more vague, it's because your questions are
> getting harder.
