SECURITY and CRYPTOGRAPHY 15-827 4 DEC 01
Lecture #20 M.B.
4615 Wean
MIDTERM EXAM 2 will be in-class on Tuesday, 11 December, one week from
today. It will have three problems:
Problem 1 will be indistinguishable from Midterm Exam 1, except that you
had better be fast at it.
Problem 2 will be to critique the proof of the Theorem below. Either find
a nontrivial bug in it (can you fix the bug?) or weaken the assumptions.
Problem 3 (extra credit) will be your choice. For example, you might want
to describe your own idea for a CAPTCHA.
FINAL EXAM: Should we have one?
TODAY: On the (im)possibility of an English-text-only CAPTCHA.
THE MODEL:
A CAPTCHA is a randomizing algorithm with access to a large public data
base (the world wide web as perceived through GOOGLE) that carries on a
conversation with an opponent -- either a bot that has access to GOOGLE or
a human. The conversation proceeds in stages, starting with stage 1. In
each stage, the CAPTCHA presents a CHALLENGE and the opponent gives a
RESPONSE. At that point, the end of a stage, the CAPTCHA either ACCEPTS
(believes its opponent to be human), REJECTS (believes its opponent to be a
bot), or continues on to the next stage. The CAPTCHA is required to make
its final decision (ACCEPT or REJECT) in a small (eg 10) number of stages.
MORE DETAILS ON THE MODEL:
The CAPTCHA initially sets stage k:=1.
In STAGE k:
*CAPTCHA generates a random #, random(k), and then uses it
[if k>1, it also uses its history(k-1) of conversation and
its state(k-1) -- EXCLUSIVE of all random #s generated in
previous stages]
to generate public challenge(k) and private state(k).
*It awaits/gets public response(k).
*It then uses its current history(k) = challenge(1) response(1), ...,
challenge(k) response(k), together with its current state(k), to
evaluate: ACCEPT, REJECT, or continue.
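The stage loop in the model above can be sketched in Python. This is a toy stand-in, not part of the notes: make_challenge and evaluate are placeholder functions, and per assumption 1 the randomness feeds only the challenge, never the decision.

```python
import random

MAX_STAGES = 10  # the CAPTCHA must decide within a small number of stages

def make_challenge(rnd, history, state):
    # Stand-in: a real CAPTCHA would build an English challenge here
    # from random(k), history(k-1), and state(k-1).
    return f"challenge-{rnd}", {"stage": len(history) + 1}

def evaluate(history, state):
    # Stand-in decision rule; per assumption 2 it depends on history
    # alone, with state used only for efficiency.
    if len(history) >= 3:
        return "ACCEPT"
    last_response = history[-1][1]
    return "REJECT" if last_response == "bad" else "CONTINUE"

def run_captcha(get_response):
    history, state = [], None
    for k in range(1, MAX_STAGES + 1):
        rnd = random.randrange(2**16)        # random(k)
        challenge, state = make_challenge(rnd, history, state)
        response = get_response(challenge)   # opponent's response(k)
        history.append((challenge, response))
        decision = evaluate(history, state)
        if decision != "CONTINUE":
            return decision
    return "REJECT"  # forced decision after MAX_STAGES stages

print(run_captcha(lambda ch: "ok"))   # prints ACCEPT
print(run_captcha(lambda ch: "bad"))  # prints REJECT
```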
We prove impossibility of an English-text-only CAPTCHA based on the following:
ASSUMPTIONS:
1. RANDOM NUMBERS are used, if at all, only to create a public challenge
(a continuation of the conversation) and some small amount of private state
information. The CAPTCHA never uses its random numbers, if any, to decide
between ACCEPT, REJECT, or continue (the conversation).
2. CONSISTENCY (HISTORY IRRESPECTIVE OF STATE DETERMINES ACCEPTANCE or
REJECTION): At the end of stage k, the decision to ACCEPT, REJECT, or
continue is completely determined by history(k). The only purpose of
state(k) is to help the CAPTCHA decide EFFICIENTLY whether to ACCEPT,
REJECT, or continue.
3. CONTINUE => A PATH TO ACCEPT EXISTS: If, at the end of a stage, the
CAPTCHA neither ACCEPTS nor REJECTS, then in the next stage the CAPTCHA is
guaranteed to present a challenge (continuation of the conversation) for
which there exists a non-rejectable response -- one that causes the CAPTCHA
to ACCEPT or continue.
4. CHALLENGE POSSIBLE => THAT (SAME) CHALLENGE PROBABLE: Given
history(k-1) and challenge(k), it is efficiently possible for a bot to find
a random # and a state(k-1) that cause its own private virtual copy of the
CAPTCHA to generate the same challenge(k), and therefore to evaluate
response(k).
5. NONREJECTABLE RESPONSE POSSIBLE => SOME (POSSIBLY DIFFERENT)
NONREJECTABLE RESPONSE IS PROBABLE: If a response is not rejected by the
CAPTCHA, then a random response has nontrivial probability (greater than 1%
say) to not be rejected.
6. ALL BRANCHES ARE FINITE: Every branch is a FINITE sequence consisting
of ROOT, CHALLENGE(1), RESPONSE(1), (continue), ..., CHALLENGE(k),
RESPONSE(k), ACCEPT or REJECT.
It follows by König's Lemma (compactness) that the entire game tree is finite.
THEOREM: In the above model and under the above assumptions,
an English-text CAPTCHA is impossible.
PROOF:
Assume to the contrary that a CAPTCHA is possible. We construct a bot that
is accepted by the CAPTCHA, thus falsifying our assumption to the contrary.
The bot works as follows:
STAGE 1. When presented with CHALLENGE(1), it searches for a random number
that causes its own private virtual copy of the (public) CAPTCHA to
generate (the same) CHALLENGE(1). This is possible by assumption 4. This
CHALLENGE(1) has a nonrejectable response, by assumption 3. The bot can
find a nonrejectable RESPONSE(1) by running its virtual CAPTCHA starting
from the state it entered after generating CHALLENGE(1) -- on randomly
chosen responses, to look for a nonrejectable response. For each rejecting
response, it reinitializes the CAPTCHA back to where it had just generated
CHALLENGE(1) and runs it again on another random response. It does this
again and again until it finds a nonrejectable RESPONSE(1). It is sure to
find such nonrejectable RESPONSE(1) by assumption 5. The bot then supplies
this RESPONSE(1) to the (actual) CAPTCHA.
STAGE k: The actual CAPTCHA now generates CHALLENGE(k). The bot looks for
this CHALLENGE(k) and a nonrejectable RESPONSE(k).
This continues until the CAPTCHA ACCEPTS or REJECTS, which must happen
since, by assumption 6, the tree is finite, and by assumption 3, "continue"
is not a leaf. By assumption 3, the branch cannot lead to REJECT, so the
branch must end in ACCEPT.
Note (a point that belongs earlier in the proof): if the actual CAPTCHA and
the virtual CAPTCHA have the same history up to but not including the final
ACCEPT or REJECT, then either both ACCEPT or both REJECT. This follows from
assumption 2.
QED
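The bot constructed in the proof can be sketched as follows. This is a toy model, not the actual construction: the virtual-copy search of assumption 4 is a brute-force loop over a small seed space, and the response search of assumption 5 is rejection sampling; all functions are illustrative stand-ins.

```python
import random

SEEDS = range(100)               # small public randomness space (assumed)
RESPONSES = ["yes", "no", "maybe"]

def virtual_challenge(seed, history):
    # Public challenge algorithm: same seed + history => same challenge.
    return f"Q{(seed * 7 + len(history)) % 10}"

def virtual_rejects(history, challenge, response):
    # Public rejection predicate; toy rule: "no" is always rejected.
    return response == "no"

def bot_respond(history, challenge):
    # Assumption 4: find a seed whose virtual CAPTCHA reproduces the
    # actual challenge.
    seed = next(s for s in SEEDS
                if virtual_challenge(s, history) == challenge)
    # Assumptions 3 and 5: a nonrejectable response exists and a random
    # response is nonrejectable with nontrivial probability, so
    # rejection-sample until one is found.
    while True:
        r = random.choice(RESPONSES)
        if not virtual_rejects(history, challenge, r):
            return r

# The bot answers a challenge the "actual" CAPTCHA produced with seed 3:
print(bot_respond([], virtual_challenge(3, [])))  # prints yes or maybe
```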
QUESTION: How does the OCR-based CAPTCHA circumvent this Theorem? The
CAPTCHA chooses one of a very large set of random numbers to select the
challenge (a distorted image) and the state information (the actual word).
The opponent cannot guess the random number (efficiently). It therefore
cannot simulate the generation (of a copy) of the actual challenge.
QUESTION: Is BRIAN'S idea for a CAPTCHA possible?
Brian's idea is to build into the CAPTCHA some model of the world, e.g.
SHRDLU, the blocks world model of the world.
The CAPTCHA describes in English whatever manipulations it performs on its
blocks world. A human can perform these manipulations on its own internal
model of the same world. This enables the human to answer the CAPTCHA's
questions about the state of the world resulting from these manipulations.
If a human can do this, why can't a bot? A bot has SHRDLU's model of the
world. The bot gets an English statement of how to manipulate that world
(which is exactly the kind of statement that SHRDLU works so well with, but
no matter). The bot must find a random number (manipulation) that causes
its private virtual CAPTCHA to generate the same challenge. If such a
(random) manipulation were highly improbable, then the actual CAPTCHA itself
would most probably not have chosen it in the first place; this highly
unlikely event can therefore be disregarded. Note that this argument depends
fiercely on the fact that the possible challenges are small in number.
QUESTION: Is the Brighten GODFREY / Roni ROSENFELD idea for a CAPTCHA
possible? Their CAPTCHA works as follows: it selects a semantically
meaningful syntactically correct sentence from the web. It also generates a
syntactically correct but semantically meaningless sentence using a Markov
Model of sentence generation. Then it replaces key words in both sentences
by synonyms, randomizes the order of the two sentences, and finally asks
the opponent to decide which sentence is derived from (is ss(*) of) the
semantically meaningful one.
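The BG/RR challenge generation can be sketched like this. The word lists and the "Markov model" are tiny illustrative stand-ins for the web and a trained model, not part of the proposal itself.

```python
import random

WEB_SENTENCES = ["the cat sat on the mat"]   # stand-in for the web corpus
SYNONYMS = {"cat": ["feline"], "sat": ["rested"], "mat": ["rug"]}

def markov_sentence(words, length=6):
    # Stand-in for the Markov Model: plausible-looking but meaningless.
    return " ".join(random.choice(words) for _ in range(length))

def ss(sentence):
    # Random synonym substitution of key words (ss(S) in the notes).
    return " ".join(random.choice(SYNONYMS.get(w, [w]))
                    for w in sentence.split())

def make_challenge():
    meaningful = random.choice(WEB_SENTENCES)          # from the "web"
    meaningless = markov_sentence(meaningful.split())  # from the "model"
    pair = [(ss(meaningful), True), (ss(meaningless), False)]
    random.shuffle(pair)                               # randomize the order
    challenge = [s for s, _ in pair]
    # State remembers which sentence came from the web (see point 1 below).
    state = {"meaningful_index": [m for _, m in pair].index(True)}
    return challenge, state

challenge, state = make_challenge()
print(challenge, state)
```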
To show that such a CAPTCHA is impossible, we show that if it exists,
then it is possible to construct a CAPTCHA' that is at least as powerful as
CAPTCHA -- in the sense that it distinguishes bots from humans at least as
well as CAPTCHA -- and satisfies the assumptions of the Theorem:
Let S denote an English sentence. Let ss(S) be a random variable
representing the sentence S with key words substituted by synonyms.
1. RANDOMNESS is used to generate the Markov Model sentence and to select
a random sentence from the world wide web. It is also used to select
random synonyms to replace key words, and to decide which synonym
substituted sentence comes first. Nothing else. The state keeps info on
which of the two sentences was taken from the web (defined to be the
semantically meaningful one) and how many of the responses in a
conversation have been answered correctly. So 1 is satisfied.
2. CONSISTENCY (HISTORY IRRESPECTIVE OF STATE DETERMINES ALL). This means
that at the end of stage k, the decision to ACCEPT, REJECT, or to continue
is completely determined by history(k).
An INCONSISTENCY would imply that there exist two sentences S and S',
one generated by Markov Model, the other selected from the web, and there
exist random synonym substitutions such that ss(S) = ss(S'). Actually, this
is possible, though we believe that the CAPTCHA designer would try to avoid
it. In any case, we construct a CAPTCHA' that is at least as powerful as
CAPTCHA in the sense that with the same information CAPTCHA' makes at least
as good a decision as CAPTCHA. CAPTCHA' has an oracle that tells it, for
any given sentence ss(S), the relative probability that S was generated by
the Markov Model or by random selection from the web. Since the bot can fool
this more powerful CAPTCHA', it can also fool the less powerful CAPTCHA.
3. CONTINUE => A PATH TO ACCEPT EXISTS: If, at the end of a stage, the
CAPTCHA neither ACCEPTS nor REJECTS, then in the next stage the CAPTCHA is
guaranteed to present a challenge (continuation of the conversation) for
which there exists a non-rejectable response.
The BG/RR CAPTCHA may or may not have this property, but it can be
modified to have it without changing the decisions it makes. If the
modified CAPTCHA cannot exist, then neither can the original.
4. CHALLENGE POSSIBLE => (SAME) CHALLENGE PROBABLE: Given history(k-1) and
challenge(k), it is efficiently possible for a bot to find a random # and a
state(k-1) that cause its own private virtual copy of the CAPTCHA to
generate the same challenge(k), and therefore to evaluate response(k).
At any stage, the number of challenges is small, but maybe not THAT
small: 1.5 bits of entropy per character means about 90 bits, i.e. 2^90
possibilities, for a 60-character sentence. Fortunately, the bot knows that
the challenge it must create is a synonym-substituted replacement of the
two sentences in the given
challenge. It can try substituting synonyms for half a dozen initial key
words and see if GOOGLE continues the sentence on the web. It can do this
for both sentences.
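The arithmetic in the paragraph above can be checked with a short sketch; the synonym count per key word is an illustrative assumption, not a figure from the notes.

```python
import math

# Brute-forcing a 60-character sentence at ~1.5 bits/char is hopeless,
# while searching synonym substitutions for half a dozen key words is tiny.
bits_per_char, sentence_len = 1.5, 60
brute_force = 2 ** (bits_per_char * sentence_len)  # 2^90 candidate sentences

synonyms_per_word = 10   # assumed: roughly 10 synonyms per key word
key_words = 6            # "half a dozen initial key words"
guided_search = synonyms_per_word ** key_words     # 10^6 GOOGLE probes

print(f"brute force: 2^{math.log2(brute_force):.0f} sentences")
print(f"guided search: {guided_search:,} candidates")
```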
5. NONREJECTABLE RESPONSE POSSIBLE => SOME (POSSIBLY DIFFERENT)
NONREJECTABLE RESPONSE IS PROBABLE: If a response is not rejected by the
CAPTCHA, then a random response has nontrivial probability (greater than 1%
say) to not be rejected.
No problem: the response is just a single bit, so a random response has a
50% probability of not being rejected.
6. ALL BRANCHES ARE FINITE: Every branch is a FINITE sequence consisting of
ROOT, CHALLENGE(1), RESPONSE(1), (continue), ... CHALLENGE(k),
RESPONSE(k), ACCEPT or REJECT.
After 10 challenge-response pairs, say, the CAPTCHA must make a
decision; even a borderline decision that could go either way must be made.