From newshub.ccs.yorku.ca!ists!helios.physics.utoronto.ca!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!uwm.edu!csd4.csd.uwm.edu!markh Mon Jan  6 10:30:14 EST 1992
Article 2461 of comp.ai.philosophy:
Newsgroups: comp.ai.philosophy
Path: newshub.ccs.yorku.ca!ists!helios.physics.utoronto.ca!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!uwm.edu!csd4.csd.uwm.edu!markh
From: markh@csd4.csd.uwm.edu (Mark William Hopkins)
Subject: A learning automaton for temporal sequences
Message-ID: <1991Dec31.220604.2892@uwm.edu>
Sender: news@uwm.edu (USENET News System)
Organization: Computing Services Division, University of Wisconsin - Milwaukee
Date: Tue, 31 Dec 1991 22:06:04 GMT
Lines: 102

   Way back in 1986 I tried out a series of simple experiments to probe the
mind.  This is how it worked: you write down a random sequence of 0's and
1's, trying to make it as random as possible and going as fast as you can
without stopping to think.  Then you analyse the sequence.
   These are general facts that emerged:

       (1) The first several times a person tries this, there will be
           large numbers of repetitions and cycles in the sequence, and
           the conditional probability

                           P(x; x1 ... xN)

           (the probability that x will appear after the subsequence x1 ... xN)
           will diverge toward 0 and 1 even for N = 1(!)

       (2) This probability will ALWAYS diverge to near 0 and 1 past some
           cut-off for N (assuming you've generated a long enough sequence to
           be able to take enough samples to see this).  That's the threshold
           where cycles start appearing too.

Try it out.  You'll see it happen every time.  This is also a very good way to
get information on how the brain generates and processes temporal sequences.
   Those observations imply that a particular structure exists in the brain to
generate (and learn) temporal sequences.  So I tried formalizing what this was
and came up with this stochastic learning automaton:

             X = Input set, Q = a fixed state set
             PX = Set of distributions over X.  A distribution is defined here
                  as a function mapping X to non-negative integers.
             d: Q x X -> Q ... a fixed "random" transition function.
             s: Q, the start state
             S: Q x X x [Q->PX] -> [Q->PX] ... an update function
                [Q->PX] is defined as the space of functions mapping
                Q to PX.
             m: PX -> X ... a stochastic "search" function.
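For a finite Q and X, the signatures above can be rendered as concrete C types.
This is only a sketch to pin down the shapes; the names and set sizes are my
own placeholders, not from the original:

```c
/* Sketch only: C renderings of the signatures above, assuming Q and
   X are small finite sets coded as integers (sizes are placeholders). */
enum { NSTATES = 64, NSYMBOLS = 2 };

typedef unsigned Dist[NSYMBOLS];   /* PX: one count per symbol of X     */
typedef Dist     Table[NSTATES];   /* e in [Q->PX]: one Dist per state  */

typedef int  (*TransFn)(int q, int x);            /* d: Q x X -> Q      */
typedef void (*UpdateFn)(int q, int x, Table e);  /* S (updates e)      */
typedef int  (*SearchFn)(const Dist p);           /* m: PX -> X         */
```

Since the distributions are integer counts, the whole memory e is just a 2-D
table of non-negative integers, which is what makes the automaton so cheap.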

(Everything's written in pseudo-C)

The routine TRAIN takes a pre-existing function e: Q -> PX and updates it by
passing it through a sequence (x1, x2, x3, ...)  in X*:

               TRAIN(e:[Q->PX], (x1, x2, x3, ...) : X*) {
                  q: State ... the current state
                  q = s;
                  for (x = x1, x2, x3, ...)
                     e = S(q, x, e), q = d(q, x);
                  return e;
               }

So for example, to train on several sequences S1, S2, ..., Sm in X*, you'd
carry out the routine:

                  e = some initialized value
                  e = TRAIN(e, S1), e = TRAIN(e, S2), ..., e = TRAIN(e, Sm)

The routine GENERATE produces a random sequence based on a function e:[Q->PX].

                GENERATE(e:[Q->PX]) {
                   q: State ... the current state
                   q = s;
                   forever {
                      x = m(e(q)), q = d(q, x);
                      output x;
                   }
                }

So here I was digging through my scrap paper heap the other day and on this
sheet of paper is a description (basically) of the above that I wrote back in
'86 but simply never got around to coding.  So I sit down and I implement the
automaton above with the following definitions:

                 Q = set of substrings of length N over X.
                 d(x1 x2 ... xN, x) = x2 ... xN x
                 s = (0, 0, 0, ..., 0)    (N 0's).

Associated with each substring of length N is a distribution over X.  The
update bumps the count of the observed symbol at the current state and leaves
everything else alone:

                 S(q, x, e)(q')(y) = e(q')(y) + 1   if q' = q and y = x
                                   = e(q')(y)       otherwise

The search function treats the distribution as a probability distribution and
mimics it:
                         for p in PX,
                  m(p) = x with probability p(x)/( sum over y in X of p(y) ).

   Don't be daunted by the formalism; the program itself is only 100 lines
long!  And the first time I ran it I had to look twice: it was mimicking the
style of its input so faithfully that I thought there HAD to be a bug in it
somewhere!  For N = 6, the word formation was nearly flawless but the
syntax and cohesion were awful.  For N = 10, things started looking a whole
lot more orderly.  In both cases, there was novel word formation and novel
syntax formation (and it kept on wanting to say "eithere" instead of "either").

   Training it on about 150k of erotically written text resulted in the
most ridiculously funny output...

   One thing that is guaranteed: as N approaches infinity, this family of
algorithms converges to a 100% accurate look-up table of everything that
was used as training input.  But long before then, you'll start seeing
increasing cohesion, then increasing syntactic regularity, and then even
semantic regularity.  And for large enough N, even the Turing Test could be
passed (if conversational natural language input is used).


