15-859(A) Machine Learning Theory            04/10/06

* Learning finite state environments w/o a reset
=========================================================================
Now: what if you don't have a reset?  Extension by Rivest & Schapire.
To deal with this, use notion of a Homing sequence:

For a given sequence of actions h, and a given state q, let obs(q,h)
be the sequence of observations you see, starting at state q and
performing h.  Look at {obs(q,h) : q in Q}.  Let's think of this as a
partition of the state space Q, where two states are in the same
partition if they have the same observation sequence.  h is a Homing
Sequence if each partition has a unique *ending* state.  I.e., if
obs(q,h) = obs(q',h) then it must be that qh = q'h.

[using "qh" to mean the final state after running h from q]

E.g., "aa" in the example in the slides is a homing sequence.

Claim 1: for any two states q,q' there must exist a sequence A of
length at most n such that obs(q,A) != obs(q',A).

   Proof: Think of the following Dynamic Programming algorithm.  For
   each pair of states, say they are k-equivalent if they look
   equivalent for all strings A of length <= k.  So, at k=0 we have 2
   equivalence classes: white states and black states (or more if we
   have more observations).  To go from k to k+1, we just see, for
   each pair of k-equivalent states (q,q'), whether there exists an
   action a such that qa and q'a are not k-equivalent.  If so, then
   q,q' are not k+1-equivalent; else they are.  So, notice that if the
   number of equivalence classes does not increase going from k to
   k+1, then it *never* will increase, so that k-equivalent states are
   truly equivalent.  So, that means we are done by the time we reach
   k=n-2.  BTW, this is also an algorithm for DFA minimization.

[but there might not be a single sequence, of any length, that works
to distinguish all pairs q,q'.  Such a sequence would be called a
"distinguishing sequence"].

Claim 2: every DFA has a homing sequence.  In fact, every n-state DFA
has a homing sequence of size at most O(n^2).

   Proof: Start with h=lambda.  If there exists two states q, q' that
   violate property, append a sequence A s.t. obs(qh,A) != obs(q'h,A).
   Must exist one of length at most n. This increases # elements in
   the partition implied by h.

Can we hope to get O(n)?  o(n^2)?

   No: consider an ``L'' shaped environment with 2n states.  actions
   right, up.  topmost state is black and rest are white.  If you do
   ``up'' from a state with nothing above, then you stay put.  If you
   do ``right'' from bottom-right, it takes you to bottom-left.  If
   you go right from the vertical part of the L, then you go down to
   the bottom of the L and go right by 1. (so in terms of (x,y)
   coordinates, going right sets x = x+1 mod n and sets y=0; going up
   sets y = y+1 mod n if x=0, and does nothing if x != 0).

Note: existence of homing sequence implies existence of distinguishing
sequence if each action induces a permutation among the states.  Why?

Say you are given a homing sequence. How to use?

	Run h and record the observation.   Instantiate a new copy of
	Learn-DFA, indexed by the observation seen.  When you need to reset,
	run h instead.  If you see a new observation, start up a new copy of
	the algorithm.  If you see an old observation, go to the associated
	copy.  There will be at most n algorithms running (interleaved) in
	total and every time through we make one unit of progress on one of
	them.  So, just blows up running time by factor of n.

How to learn a homing sequence as we learn the environment?  Start off
with guess of h = lambda and run the above algorithm.  Two things
might happen.

1) Some copy (learn-DFA)_sigma might discover *inconsistent behavior*.
	I.e., it might notice that two runs of action sequence A have produced
	two different outputs.  In this case, A is a distinguishing sequence
	for two states we *thought* were the same.  So, append A onto h and
	start over.
	
	Actually, for better performance, can build a "homing tree" (also
	called an "adaptive homing sequence") which is a sequence where the
	next action depends on the observations you've seen.  In particular,
	we'll do h, and then only add A on the end if we saw sigma as our
	observation. This means we don't need to discard ALL of the
	learn-DFAs, just (Learn-DFA)_sigma.

2) Some copy grows a model with > n states.
	Here is a probabilistic way to improve h, based on the
	following "thought experiment".  (We're not going to actually
	run this).  Let's say we are in state q, and we run homing
	sequence h, and we enter this copy of Learn-DFA.  This copy
	has its states labeled with strings s_1, s_2,..., s_{n+1}
	(remember how the algorithm worked).  Since there are 
	really only n states, it must be that for some i != j,

		q h s_i = q h s_j   <- final states after these action seqs

	On the other hand, row(s_i) != row(s_j) in the algorithm's
	table.  Let's let e be the experiment responsible for that (if there
	are several, just pick one).  This means that for some starting states
	q' and q'' (these are before the homing sequence was run)

		obs(q', h) = obs(q'', h) = obs(q, h)
	but
		final-obs(q'h, s_i e) != final-obs(q''h, s_j e)

	(final-obs means just the last observation)

	But,
		final-obs(qh, s_i e) = final-obs(qh, s_j e)

	So, it can't possibly be that the three states qh, q'h, q''h
	are all the same.  In fact, it must be the case that either
	final-obs(q'h, s_i e) != final-obs(qh, s_i e), or final-obs(q''h,
	s_j e) != final-obs(qh, s_j e).  That means that either "s_i e" or
	"s_j e" is a good string to add on to h.

	The problem is we don't know what i and j are.  Even if we ran the
	experiment for real, we couldn't run both s_i and s_j from the state
	qh.  But, we can guess and pick one.  Alternatively, we can guess a
	random entry in the table and use that as "s e".  In either case we
	have at least a 1/(n+1)^2 chance of guessing right.    Now rerun the
	entire algorithm.  Since can improve the homing sequence at most n
	times, we will get here at most O(n^3) times in expectation.
	
	This gives a crazy n^6-ish bound, but Schapire's experiments point out
	that this is really a big overcount and in practice it is not so bad.
	If fact, can usually get close to a homing sequence with just a random
	string.