15-859(A) Machine Learning Theory                               03/23/04

* Learning DFAs cont.
  - recap of hardness results
  - recap of Angluin's algorithm
  - more about DFAs, DFAs and dynamic programming
  - learning without a reset
* handout: hwk5
=========================================================================
- recap of hardness results
- recap of Angluin's result

Now: what if you don't have a reset?  Extension by Rivest & Schapire.
To deal with this, use the notion of a Homing sequence: [also, see
slides]

For a given sequence of actions h, and a given state q, let obs(q,h) be
the sequence of observations you see, starting at state q and
performing h.  Look at {obs(q,h) : q in Q}.  Let's think of this as a
partition of the state space Q, where two states are in the same part
if they produce the same observation sequence.  h is a Homing Sequence
if each part has a unique *ending* state.  I.e., if obs(q,h) =
obs(q',h) then it must be that qh = q'h.  [using "qh" to mean the final
state after running h from q]

More about DFAs
---------------
Claim 1: for any two inequivalent states q,q' there must exist a
distinguishing sequence A for those two states (obs(q,A) != obs(q',A))
of length at most n.

Proof: Think of the following Dynamic Programming algorithm.  For each
pair of states, say they are k-equivalent if they look equivalent for
all strings A of length <= k.  So, at k=0 we have 2 equivalence
classes: white states and black states (or more if we have more
observations).  To go from k to k+1, we just check, for each pair of
k-equivalent states (q,q'), whether there exists an action a such that
qa and q'a are not k-equivalent.  If so, then q,q' are not
(k+1)-equivalent; else they are.  Notice that if the number of
equivalence classes does not increase in going from k to k+1, then it
*never* will increase, so at that point k-equivalent states are truly
equivalent.  Since we start with at least 2 classes and can have at
most n, that means we are done by the time we reach k = n-2.  BTW, this
is also an algorithm for DFA minimization.
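The dynamic program in the proof above can be sketched as follows.
This is a minimal sketch with hypothetical names (delta for the
transition function, obs for the per-state observation), not code from
the lecture:

```python
# A DFA is a dict delta[state][action] -> state, plus an observation
# obs[state] (e.g. 'B'/'W' for black/white).

def equivalence_classes(states, actions, delta, obs):
    """Refine the k-equivalence classes until they stop splitting.

    k = 0: states are grouped by their observation alone.
    k -> k+1: two k-equivalent states are split if some action sends
    them into different classes.  If a round produces no new split,
    the classes are final (so at most n-2 rounds are ever needed)."""
    class_of = dict(obs)                       # the k = 0 classes
    while True:
        # signature: a state's own class plus its successors' classes
        sig = {q: (class_of[q],
                   tuple(class_of[delta[q][a]] for a in actions))
               for q in states}
        ids = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        new = {q: ids[sig[q]] for q in states}
        if len(set(new.values())) == len(set(class_of.values())):
            return new          # no split => truly equivalent
        class_of = new

# Tiny example: states 2 and 3 behave identically (same color, same
# transitions), so they end up in the same class; 0 and 1 do not.
delta = {0: {'a': 1}, 1: {'a': 0}, 2: {'a': 1}, 3: {'a': 1}}
obs = {0: 'B', 1: 'W', 2: 'W', 3: 'W'}
cls = equivalence_classes([0, 1, 2, 3], ['a'], delta, obs)
```

Merging each final class into a single state is exactly the DFA
minimization mentioned above.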
[but there might not be a single sequence, of any length, that works as
a distinguishing sequence for the whole DFA].

Claim 2: every DFA has a homing sequence.  In fact, every n-state DFA
has a homing sequence of length at most O(n^2).

Proof: Start with h = lambda.  If there exist two states q, q' that
violate the homing property, append a distinguishing sequence for
(qh, q'h) onto h; i.e., a sequence A s.t. obs(qh,A) != obs(q'h,A).
By Claim 1, one of length at most n must exist.  [Each such append
increases the number of parts in the partition implied by h, so this
can happen at most n-1 times.]

Can we hope to get O(n)?  o(n^2)?  No: consider an "L"-shaped
environment with 2n states and actions right, up.  The topmost state is
black and the rest are white.  If you do "up" from a state with nothing
above, then you stay put.  If you do "right" from the bottom-right, it
takes you to the bottom-left.  If you go right from the vertical part
of the L, then you drop to the bottom of the L and move right by 1.
(So in terms of (x,y) coordinates, going right sets x = x+1 mod n and
sets y = 0; going up sets y = y+1 mod n if x = 0, and does nothing if
x != 0.)

Note: the existence of a homing sequence implies the existence of a
distinguishing sequence if each action induces a permutation among the
states.  Why?

Say you are given a homing sequence.  How do we use it?  Run h and
record the observation.  Instantiate a new copy of Learn-DFA, indexed
by the observation seen.  When you need to reset, run h instead.  If
you see a new observation, start up a new copy of the algorithm.  If
you see an old observation, go to the associated copy.  There will be
at most n algorithms running (interleaved) in total, and every time
through we make one unit of progress on one of them.  So, this just
blows up the running time by a factor of n.

How do we learn a homing sequence as we learn the environment?  Start
off with the guess h = lambda and run the above algorithm.  Two things
might happen.

1) Some copy (Learn-DFA)_sigma might discover *inconsistent behavior*.
I.e., it might notice that two runs of an action sequence A have
produced two different outputs.  In this case, A is a distinguishing
sequence for two states we *thought* were the same.  So, append A onto
h and start over.

Actually, for better performance, we can build a "homing tree" (also
called an "adaptive homing sequence"), which is a sequence where the
next action depends on the observations you've seen.  In particular,
we'll do h, and then only add A on the end if we saw sigma as our
observation.  This means we don't need to discard ALL of the
Learn-DFAs, just (Learn-DFA)_sigma.

2) Some copy grows a model with > n states.  Here is a probabilistic
way to improve h, based on the following "thought experiment".  (We're
not going to actually run this.)  Let's say we are in state q, we run
homing sequence h, and we enter this copy of Learn-DFA.  This copy has
its states labeled with strings s_1, s_2, ..., s_{n+1} (remember how
the algorithm worked).  Since there are really only n states, it must
be that for some i != j,

    q h s_i = q h s_j      <- final states after these action seqs

On the other hand, row(s_i) != row(s_j) in the algorithm's table.  Let
e be the experiment responsible for that (if there are several, just
pick one).  This means that for some starting states q' and q'' (these
are before the homing sequence was run),

    obs(q', h) = obs(q'', h) = obs(q, h)

but

    final-obs(q'h, s_i e) != final-obs(q''h, s_j e)

(final-obs means just the last observation).  But

    final-obs(qh, s_i e) = final-obs(qh, s_j e).

So, it can't possibly be that the three states qh, q'h, q''h are all
the same.  In fact, it must be the case that either

    final-obs(q'h, s_i e) != final-obs(qh, s_i e), or
    final-obs(q''h, s_j e) != final-obs(qh, s_j e).

That means that either "s_i e" or "s_j e" is a good string to add onto
h.  The problem is we don't know what i and j are.  Even if we ran the
experiment for real, we couldn't run both s_i and s_j from the state
qh.  But, we can guess and pick one.
Alternatively, we can guess a random entry in the table and use that as
"s e".  In either case we have at least a 1/(n+1)^2 chance of guessing
right.  Now rerun the entire algorithm.  Since we can enlarge the
homing sequence at most n times, we will get here at most O(n^3) times.
This gives a crazy n^6-ish bound, but Schapire's experiments point out
that this is really a big overcount, and in practice it is not so bad.
In fact, one can usually get close to a homing sequence with just a
random string.
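To make Claim 2's construction concrete, here is a brute-force sketch
with hypothetical names.  The distinguisher search below enumerates
all sequences up to length n, so it is only meant for tiny machines;
the actual Rivest-Schapire learner never does this search, since it
has no model to search over.

```python
from itertools import product

def run(q, seq, delta):
    """Final state after performing action sequence seq from q."""
    for a in seq:
        q = delta[q][a]
    return q

def obs_of(q, seq, delta, obs):
    """Observation sequence seen while performing seq from q."""
    out = [obs[q]]
    for a in seq:
        q = delta[q][a]
        out.append(obs[q])
    return tuple(out)

def distinguisher(q1, q2, states, actions, delta, obs):
    """Brute-force a sequence A with obs(q1,A) != obs(q2,A); by
    Claim 1, some A of length <= n works (for inequivalent states)."""
    for k in range(len(states) + 1):
        for A in product(actions, repeat=k):
            if obs_of(q1, A, delta, obs) != obs_of(q2, A, delta, obs):
                return A
    return None                 # q1 and q2 are truly equivalent

def homing_sequence(states, actions, delta, obs):
    """Claim 2: start with h = lambda; while some pair q, q' has
    obs(q,h) = obs(q',h) but qh != q'h, append a distinguisher for
    (qh, q'h).  Each append refines the partition {obs(.,h)}, so
    this loop runs at most n-1 times."""
    h = ()
    improved = True
    while improved:
        improved = False
        for q1, q2 in product(states, repeat=2):
            if (obs_of(q1, h, delta, obs) == obs_of(q2, h, delta, obs)
                    and run(q1, h, delta) != run(q2, h, delta)):
                A = distinguisher(run(q1, h, delta), run(q2, h, delta),
                                  states, actions, delta, obs)
                if A is not None:   # skip equivalent ending states
                    h = h + A
                    improved = True
                    break
    return h

# Tiny example: a 3-state cycle where only state 0 is black.  A single
# "a" is already homing: the three observation sequences BW, WW, WB
# each pin down a unique ending state.
delta = {0: {'a': 1}, 1: {'a': 2}, 2: {'a': 0}}
obs = {0: 'B', 1: 'W', 2: 'W'}
h = homing_sequence([0, 1, 2], ['a'], delta, obs)
```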