15-859(B) Machine Learning Theory                               03/29/10

* Recap on learning of DFAs
* Learning finite state environments w/o a reset
=========================================================================
Recap on learning of DFAs:
- The algorithm uses membership queries and counterexamples.
- It finds a DFA that either is the minimum consistent DFA or else has
  too few states.  Any counterexample causes the algorithm to find two
  states it had mistakenly been viewing as identical and to increase the
  size of its hypothesis, maintaining this invariant.
- Can't do it with MQs only (combination-lock automata).
- Can't do it with counterexamples/PAC only (cryptographic hardness).
=========================================================================
Now: what if you don't have a reset?  Extension by Rivest & Schapire.

To deal with this, use the notion of a homing sequence.  For a given
sequence of actions h and a given state q, let obs(q,h) be the sequence
of observations you see starting at state q and performing h, and let
qh be the state you end at.

* Def: q != q' means there exists h such that obs(q,h) != obs(q',h).

* Def: h is a distinguishing sequence if it works to distinguish all
  pairs of states.  I.e., if q != q' then obs(q,h) != obs(q',h).

A DFA might not have a distinguishing sequence.  But it must have a
homing sequence:

* Def: h is a homing sequence if obs(q,h) uniquely determines the state
  you *end* at.  I.e., if qh != q'h then obs(q,h) != obs(q',h).

E.g., "aa" in the example in the slides is a homing sequence.

Now we will prove that these always exist and don't have to be *too*
long.

Claim 1: For any two states q, q' with q != q', there must exist a
sequence A of length at most n such that obs(q,A) != obs(q',A).

Proof: Think of the following dynamic-programming algorithm.  For each
pair of states, say they are k-equivalent if they look equivalent for
all strings A of length <= k.  So, at k=0 we have 2 equivalence
classes: white states and black states (or more if we have more
observations).
To go from k to k+1, we just check, for each pair of k-equivalent
states (q,q'), whether there exists an action a such that qa and q'a
are not k-equivalent.  If so, then q and q' are not (k+1)-equivalent;
otherwise they are.  Notice that if the number of equivalence classes
does not increase in going from k to k+1, then it *never* will
increase, so k-equivalent states are truly equivalent.  Since we start
with at least 2 classes at k=0 and can have at most n classes, this
means we are done by the time we reach k=n-2.  (BTW, this is also an
algorithm for DFA minimization.)

Claim 2: Every DFA has a homing sequence.  In fact, every n-state DFA
has a homing sequence of length at most O(n^2).

Proof: Start with h = lambda.  If there exist two states q, q' that
violate the homing property (qh != q'h but obs(q,h) = obs(q',h)),
append a sequence A such that obs(qh,A) != obs(q'h,A).  By Claim 1, one
of length at most n must exist.  Each such append increases the number
of elements in the partition of states induced by obs(.,h), which can
happen at most n-1 times.

Can we hope to get O(n)?  o(n^2)?  No: consider an "L"-shaped
environment with 2n states and actions right and up.  The topmost state
is black and the rest are white.  If you do "up" from a state with
nothing above, you stay put.  If you do "right" from the bottom-right,
it takes you to the bottom-left.  If you go right from the vertical
part of the L, you go down to the bottom of the L and move right by 1.
(So in terms of (x,y) coordinates, going right sets x = x+1 mod n and
sets y = 0; going up sets y = y+1 mod n if x = 0, and does nothing if
x != 0.)

Note: the existence of a homing sequence implies the existence of a
distinguishing sequence if each action induces a permutation among the
states.  Why?

Say you are given a homing sequence.  How do we use it?  Run h and
record the observation.  Instantiate a new copy of Learn-DFA, indexed
by the observation seen.  When you need to reset, run h instead.  If
you see a new observation, start up a new copy of the algorithm; if you
see an old observation, go to the associated copy.  There will be at
most n copies running (interleaved) in total, and every time through we
make one unit of progress on one of them.
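To make the greedy procedure from the proof of Claim 2 concrete, here
is a minimal Python sketch, assuming the automaton is already known.
The dict-based encoding (delta for transitions, out for observations)
and the brute-force search for a distinguishing string are illustrative
choices, not part of the notes:

```python
from itertools import product

def obs_seq(delta, out, q, h):
    """Observations seen while running action sequence h from state q."""
    seen = []
    for a in h:
        q = delta[(q, a)]
        seen.append(out[q])
    return tuple(seen)

def end_state(delta, q, h):
    """The state qh reached by running h from q."""
    for a in h:
        q = delta[(q, a)]
    return q

def distinguishing_string(delta, out, q1, q2, actions, n):
    """Brute-force search for A with obs(q1,A) != obs(q2,A); by Claim 1,
    one of length <= n exists whenever q1 and q2 are distinguishable."""
    for length in range(1, n + 1):
        for A in product(actions, repeat=length):
            if obs_seq(delta, out, q1, A) != obs_seq(delta, out, q2, A):
                return A
    return None  # q1 and q2 are truly equivalent

def homing_sequence(delta, out, states, actions):
    """Greedy construction from the proof of Claim 2: while two states end
    in different places but produce the same observations under h, append
    a string that tells their end-states apart."""
    h = ()
    changed = True
    while changed:
        changed = False
        for q1, q2 in product(states, repeat=2):
            e1, e2 = end_state(delta, q1, h), end_state(delta, q2, h)
            if e1 != e2 and obs_seq(delta, out, q1, h) == obs_seq(delta, out, q2, h):
                A = distinguishing_string(delta, out, e1, e2, actions, len(states))
                if A is not None:
                    h = h + A            # refines the partition induced by h
                    changed = True
                    break
    return h
```

The brute-force search for A is exponential and just stands in for the
polynomial-time k-equivalence refinement in the proof of Claim 1; the
point of the sketch is that at most n-1 strings, each of length at most
n, ever get appended, which is where the O(n^2) bound comes from.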
Thus the whole thing blows up the running time by only a factor of n.

How can we learn a homing sequence as we learn the environment?  Start
off with the guess h = lambda and run the above algorithm.  Two things
might happen:

1) Some copy (Learn-DFA)_sigma might discover *inconsistent behavior*.
   I.e., it might notice that two runs of an action sequence A have
   produced two different outputs.  In this case, A is a distinguishing
   sequence for two states we *thought* were the same.  So, append A
   onto h and start over.

   Actually, for better performance, we can build a "homing tree" (also
   called an "adaptive homing sequence"), which is a sequence where the
   next action depends on the observations you've seen.  In particular,
   we'll do h, and then only add A on the end if we saw sigma as our
   observation.  This means we don't need to discard ALL of the
   Learn-DFA copies, just (Learn-DFA)_sigma.

2) Some copy grows a model with more than n states.  Here is a
   probabilistic way to improve h, based on the following "thought
   experiment".  (We're not going to actually run this.)  Say we are in
   state q, we run the homing sequence h, and we enter this copy of
   Learn-DFA.  This copy has its states labeled with strings
   s_1, s_2, ..., s_{n+1} (remember how the algorithm worked).  Since
   there are really only n states, it must be that for some i != j,

      q h s_i = q h s_j    <- final states after these action sequences

   On the other hand, row(s_i) != row(s_j) in the algorithm's table.
   Let e be the experiment responsible for that (if there are several,
   just pick one).  This means that for some starting states q' and q''
   (these are before the homing sequence was run),

      obs(q', h) = obs(q'', h) = obs(q, h)

   but

      final-obs(q'h, s_i e) != final-obs(q''h, s_j e)

   (final-obs means just the last observation).  But

      final-obs(qh, s_i e) = final-obs(qh, s_j e).

   So it can't possibly be that the three states qh, q'h, q''h are all
   the same.
   In fact, it must be the case that either

      final-obs(q'h, s_i e) != final-obs(qh, s_i e), or
      final-obs(q''h, s_j e) != final-obs(qh, s_j e).

   That means that either s_i e or s_j e is a good string to append to
   h.  The problem is that we don't know what i and j are.  Even if we
   ran the experiment for real, we couldn't run both s_i and s_j from
   the state qh.  But we can guess and pick one.  Alternatively, we can
   guess a random entry in the table and use that as "s e".  In either
   case we have at least a 1/(n+1)^2 chance of guessing right.  Now
   rerun the entire algorithm.  Since we can improve the homing
   sequence at most n times, we will get here at most O(n^3) times in
   expectation.  This gives a crazy n^6-ish bound, but Schapire's
   experiments point out that this is really a big overcount, and in
   practice it is not so bad.  In fact, one can usually get close to a
   homing sequence with just a random string.
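Putting the pieces together, the reset-free loop (run h in place of a
reset, dispatch to a per-observation copy of the learner, and append a
guessed table entry when some copy grows too large) might be sketched
as follows.  The learner interface (.step(), .too_big(),
.random_table_entry()) is a hypothetical stand-in for Learn-DFA and its
observation table, not code from the notes:

```python
def learn_without_reset(env_step, h, make_learner, n, episodes):
    """Sketch of the overall reset-free loop.  env_step(a) performs action
    a in the (reset-less) environment and returns the observation; h is
    the current guess at a homing sequence (initially empty)."""
    copies = {}
    for _ in range(episodes):
        sigma = tuple(env_step(a) for a in h)     # run h in place of a reset
        if sigma not in copies:
            copies[sigma] = make_learner()        # new copy for a new observation
        learner = copies[sigma]
        learner.step(env_step)                    # one unit of progress
        if learner.too_big(n):                    # case 2: model grew past n states
            h = h + learner.random_table_entry()  # guess "s e"; right w.p. >= 1/(n+1)^2
            copies = {}                           # restart with the improved h
    return h, copies
```

Since obs(q,h) takes at most n distinct values once h is a homing
sequence, at most n copies are ever live at a time, and h can be
improved at most n times, matching the bookkeeping in the analysis
above.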