SECURITY and CRYPTOGRAPHY 15-827 4 DEC 01
Lecture #20 M.B.
4615 Wean
MIDTERM EXAM 2 will be in-class on Tuesday, 11 December, one week from
today. It will have three problems:
Problem 1 will be indistinguishable from Midterm Exam 1, except that you
had better be fast at it.
Problem 2 will be to critique the proof of the Theorem below. Either find
a nontrivial bug in it (can you fix the bug?) or weaken the assumptions.
Problem 3 (extra credit) will be your choice. For example, you might want
to describe your own idea for a CAPTCHA.
FINAL EXAM: Should we have one?
TODAY: On the (im)possibility of an English-text-only CAPTCHA.
THE MODEL:
A CAPTCHA is a randomizing algorithm with access to a large public data
base (the world wide web as perceived through GOOGLE) that carries on a
conversation with an opponent -- either a bot that has access to GOOGLE or
a human. The conversation proceeds in stages, starting with stage 1. In
each stage, the CAPTCHA presents a CHALLENGE and the opponent gives a
RESPONSE. At that point, the end of a stage, the CAPTCHA either ACCEPTS
(believes its opponent to be human), REJECTS (believes its opponent to be a
bot), or continues on to the next stage. The CAPTCHA is required to make
its final decision (ACCEPT or REJECT) in a small (eg 10) number of stages.
MORE DETAILS ON THE MODEL:
The CAPTCHA initially sets stage k:=1.
In STAGE k:
*CAPTCHA generates a random #, random(k), and then uses it
[if k>1, it also uses its history(k-1) of conversation and
its state(k-1) -- EXCLUSIVE of all random #s generated in
previous stages]
to generate public challenge(k) and private state(k).
*It awaits/gets public response(k).
*It then uses its current history(k) = challenge(1) response(1), ...,
challenge(k) response(k), together with its current state(k), to
evaluate: ACCEPT, REJECT, or continue.
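The stage loop in the model above can be sketched in Python. This is a toy stand-in, not part of the notes: make_challenge and evaluate are placeholder functions, and per assumption 1 the randomness feeds only the challenge, never the decision.

```python
import random

MAX_STAGES = 10  # the CAPTCHA must decide within a small number of stages

def make_challenge(rnd, history, state):
    # Stand-in: a real CAPTCHA would build an English challenge here
    # from random(k), history(k-1), and state(k-1).
    return f"challenge-{rnd}", {"stage": len(history) + 1}

def evaluate(history, state):
    # Stand-in decision rule; per assumption 2 it depends on history
    # alone, with state used only for efficiency.
    if len(history) >= 3:
        return "ACCEPT"
    last_response = history[-1][1]
    return "REJECT" if last_response == "bad" else "CONTINUE"

def run_captcha(get_response):
    history, state = [], None
    for k in range(1, MAX_STAGES + 1):
        rnd = random.randrange(2**16)        # random(k)
        challenge, state = make_challenge(rnd, history, state)
        response = get_response(challenge)   # opponent's response(k)
        history.append((challenge, response))
        decision = evaluate(history, state)
        if decision != "CONTINUE":
            return decision
    return "REJECT"  # forced decision after MAX_STAGES stages

print(run_captcha(lambda ch: "ok"))   # prints ACCEPT
print(run_captcha(lambda ch: "bad"))  # prints REJECT
```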
We prove impossibility of an English-text-only CAPTCHA based on the following:
ASSUMPTIONS:
1. RANDOM NUMBERS are used, if at all, only to create a public challenge
(a continuation of the conversation) and some small amount of private state
information. The CAPTCHA never uses its random numbers, if any, to decide
between ACCEPT, REJECT, or continue (the conversation).
2. CONSISTENCY (HISTORY IRRESPECTIVE OF STATE DETERMINES ACCEPTANCE or
REJECTION): At the end of stage k, the decision to ACCEPT, REJECT, or
continue is completely determined by history(k). The only purpose of
state(k) is to help the CAPTCHA decide EFFICIENTLY whether to ACCEPT,
REJECT, or continue.
3. CONTINUE => A PATH TO ACCEPT EXISTS: If, at the end of a stage, the
CAPTCHA neither ACCEPTS nor REJECTS, then in the next stage the CAPTCHA is
guaranteed to present a challenge (continuation of the conversation) for
which there exists a non-rejectable response -- one that causes the CAPTCHA
to ACCEPT or continue.
4. CHALLENGE POSSIBLE => THAT (SAME) CHALLENGE PROBABLE: Given
history(k-1) and challenge(k), it is efficiently possible for a bot to find
a random # and a state(k-1) that cause its own private virtual copy of the
CAPTCHA to generate the same challenge(k), and therefore to evaluate
response(k).
5. NONREJECTABLE RESPONSE POSSIBLE => SOME (POSSIBLY DIFFERENT)
NONREJECTABLE RESPONSE IS PROBABLE: If a response is not rejected by the
CAPTCHA, then a random response has nontrivial probability (greater than 1%
say) to not be rejected.
6. ALL BRANCHES ARE FINITE: Every branch is a FINITE sequence consisting
of ROOT, CHALLENGE(1), RESPONSE(1), (continue), ..., CHALLENGE(k),
RESPONSE(k), ACCEPT or REJECT.
It follows by König's Lemma (compactness) that the entire game tree is finite.
THEOREM: In the above model and under the above assumptions,
an English-text CAPTCHA is impossible.
PROOF:
Assume to the contrary that a CAPTCHA is possible. We construct a bot that
is accepted by the CAPTCHA, thus falsifying our assumption to the contrary.
The bot works as follows:
STAGE 1. When presented with CHALLENGE(1), it searches for a random number
that causes its own private virtual copy of the (public) CAPTCHA to
generate (the same) CHALLENGE(1). This is possible by assumption 4. This
CHALLENGE(1) has a nonrejectable response, by assumption 3. The bot can
find a nonrejectable RESPONSE(1) by running its virtual CAPTCHA starting
from the state it entered after generating CHALLENGE(1) -- on randomly
chosen responses, to look for a nonrejectable response. For each rejecting
response, it reinitializes the CAPTCHA back to where it had just generated
CHALLENGE(1) and runs it again on another random response. It does this
again and again until it finds a nonrejectable RESPONSE(1). It is sure to
find such nonrejectable RESPONSE(1) by assumption 5. The bot then supplies
this RESPONSE(1) to the (actual) CAPTCHA.
STAGE k: The actual CAPTCHA now generates CHALLENGE(k). The bot looks for
this CHALLENGE(k) and a nonrejectable RESPONSE(k).
This continues until the CAPTCHA ACCEPTS or REJECTS, which must happen
since, by assumption 6, the tree is finite, and by assumption 3, "continue"
is not a leaf. By assumption 3, the branch cannot lead to REJECT, so the
branch must end in ACCEPT.
Note (a point that belongs earlier in the proof): if the actual CAPTCHA and
the virtual CAPTCHA have the same history up to but not including the final
ACCEPT or REJECT, then either both ACCEPT or both REJECT. This follows from
assumption 2.
QED
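The bot constructed in the proof can be sketched as follows. This is a toy model, not the actual construction: the virtual-copy search of assumption 4 is a brute-force loop over a small seed space, and the response search of assumption 5 is rejection sampling; all functions are illustrative stand-ins.

```python
import random

SEEDS = range(100)               # small public randomness space (assumed)
RESPONSES = ["yes", "no", "maybe"]

def virtual_challenge(seed, history):
    # Public challenge algorithm: same seed + history => same challenge.
    return f"Q{(seed * 7 + len(history)) % 10}"

def virtual_rejects(history, challenge, response):
    # Public rejection predicate; toy rule: "no" is always rejected.
    return response == "no"

def bot_respond(history, challenge):
    # Assumption 4: find a seed whose virtual CAPTCHA reproduces the
    # actual challenge.
    seed = next(s for s in SEEDS
                if virtual_challenge(s, history) == challenge)
    # Assumptions 3 and 5: a nonrejectable response exists and a random
    # response is nonrejectable with nontrivial probability, so
    # rejection-sample until one is found.
    while True:
        r = random.choice(RESPONSES)
        if not virtual_rejects(history, challenge, r):
            return r

# The bot answers a challenge the "actual" CAPTCHA produced with seed 3:
print(bot_respond([], virtual_challenge(3, [])))  # prints yes or maybe
```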
QUESTION: How does the OCR-based CAPTCHA circumvent this Theorem? The
CAPTCHA chooses one of a very large set of random numbers to select the
challenge (a distorted image) and the state information (the actual word).
The opponent cannot guess the random number (efficiently). It therefore
cannot simulate the generation (of a copy) of the actual challenge.
QUESTION: Is BRIAN'S idea for a CAPTCHA possible?
Brian's idea is to build into the CAPTCHA some model of the world, e.g.
SHRDLU, the blocks world model of the world.
The CAPTCHA describes in English whatever manipulations it performs on its
blocks world. A human can perform these manipulations on its own internal
model of the same world. This enables the human to answer the CAPTCHA's
questions about the state of the world resulting from these manipulations.
If a human can do this, why can't a bot? A bot has SHRDLU's model of the
world. The bot gets an English statement of how to manipulate that world
(which is exactly the kind of statement that SHRDLU works so well with, but
no matter). The bot must find a random number (manipulation) that causes
its private virtual CAPTCHA to generate the same challenge. If such a
(random) manipulation were highly improbable, then the actual CAPTCHA itself
would most probably not have chosen it in the first place; this highly
unlikely event can therefore be disregarded. Note that this argument depends
fiercely on the fact that the possible challenges are small in number.
QUESTION: Is the Brighten GODFREY / Roni ROSENFELD idea for a CAPTCHA
possible? Their CAPTCHA works as follows: it selects a semantically
meaningful syntactically correct sentence from the web. It also generates a
syntactically correct but semantically meaningless sentence using a Markov
Model of sentence generation. Then it replaces key words in both sentences
by synonyms, randomizes the order of the two sentences, and finally asks
the opponent to decide which sentence is derived from (is ss(*) of) the
semantically meaningful one.
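The BG/RR challenge generation can be sketched like this. The word lists and the "Markov model" are tiny illustrative stand-ins for the web and a trained model, not part of the proposal itself.

```python
import random

WEB_SENTENCES = ["the cat sat on the mat"]   # stand-in for the web corpus
SYNONYMS = {"cat": ["feline"], "sat": ["rested"], "mat": ["rug"]}

def markov_sentence(words, length=6):
    # Stand-in for the Markov Model: plausible-looking but meaningless.
    return " ".join(random.choice(words) for _ in range(length))

def ss(sentence):
    # Random synonym substitution of key words (ss(S) in the notes).
    return " ".join(random.choice(SYNONYMS.get(w, [w]))
                    for w in sentence.split())

def make_challenge():
    meaningful = random.choice(WEB_SENTENCES)          # from the "web"
    meaningless = markov_sentence(meaningful.split())  # from the "model"
    pair = [(ss(meaningful), True), (ss(meaningless), False)]
    random.shuffle(pair)                               # randomize the order
    challenge = [s for s, _ in pair]
    # State remembers which sentence came from the web (see point 1 below).
    state = {"meaningful_index": [m for _, m in pair].index(True)}
    return challenge, state

challenge, state = make_challenge()
print(challenge, state)
```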
To show that such a CAPTCHA is impossible, we show that if it exists,
then it is possible to construct a CAPTCHA' that is at least as powerful as
CAPTCHA -- in the sense that it distinguishes bots from humans at least as
well as CAPTCHA -- and satisfies the assumptions of the Theorem:
Let S denote an English sentence. Let ss(S) be a random variable
representing the sentence S with key words substituted by synonyms.
1. RANDOMNESS is used to generate the Markov Model sentence and to select
a random sentence from the world wide web. It is also used to select
random synonyms to replace key words, and to decide which synonym
substituted sentence comes first. Nothing else. The state keeps info on
which of the two sentences was taken from the web (defined to be the
semantically meaningful one) and how many of the responses in a
conversation have been answered correctly. So 1 is satisfied.
2. CONSISTENCY (HISTORY IRRESPECTIVE OF STATE DETERMINES ALL). This means
that at the end of stage k, the decision to ACCEPT, REJECT, or to continue
is completely determined by history(k).
An INCONSISTENCY would imply that there exist two sentences S and S',
one generated by Markov Model, the other selected from the web, and there
exist random synonym substitutions such that ss(S) = ss(S'). Actually, this
is possible, though we believe that the CAPTCHA designer would try to avoid
it. In any case, we construct a CAPTCHA' that is at least as powerful as
CAPTCHA in the sense that with the same information CAPTCHA' makes at least
as good a decision as CAPTCHA. CAPTCHA' has an oracle that tells it, for
any given sentence ss(S), the relative probability that S was generated by
the Markov Model or by random selection from the web. Since the bot can fool
this more powerful CAPTCHA', it can also fool the less powerful CAPTCHA.
3. CONTINUE => A PATH TO ACCEPT EXISTS: If, at the end of a stage, the
CAPTCHA neither ACCEPTS nor REJECTS, then in the next stage the CAPTCHA is
guaranteed to present a challenge (continuation of the conversation) for
which there exists a non-rejectable response.
The BG/RR CAPTCHA may or may not have this property, but it can be
modified to have it without changing the decisions it makes. If the
modified CAPTCHA cannot exist, then neither can the original.
4. CHALLENGE POSSIBLE => (SAME) CHALLENGE PROBABLE: Given history(k-1) and
challenge(k), it is efficiently possible for a bot to find a random # and a
state(k-1) that cause its own private virtual copy of the CAPTCHA to
generate the same challenge(k), and therefore to evaluate response(k).
At any stage, the number of challenges is small, but maybe not THAT
small: 1.5 bits of entropy per character means about 90 bits, i.e. 2^90
possibilities, for a 60-character sentence. Fortunately, the bot knows that
the challenge it must create is a synonym-substituted replacement of the
two sentences in the given
challenge. It can try substituting synonyms for half a dozen initial key
words and see if GOOGLE continues the sentence on the web. It can do this
for both sentences.
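The arithmetic in the paragraph above can be checked with a short sketch; the synonym count per key word is an illustrative assumption, not a figure from the notes.

```python
import math

# Brute-forcing a 60-character sentence at ~1.5 bits/char is hopeless,
# while searching synonym substitutions for half a dozen key words is tiny.
bits_per_char, sentence_len = 1.5, 60
brute_force = 2 ** (bits_per_char * sentence_len)  # 2^90 candidate sentences

synonyms_per_word = 10   # assumed: roughly 10 synonyms per key word
key_words = 6            # "half a dozen initial key words"
guided_search = synonyms_per_word ** key_words     # 10^6 GOOGLE probes

print(f"brute force: 2^{math.log2(brute_force):.0f} sentences")
print(f"guided search: {guided_search:,} candidates")
```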
5. NONREJECTABLE RESPONSE POSSIBLE => SOME (POSSIBLY DIFFERENT)
NONREJECTABLE RESPONSE IS PROBABLE: If a response is not rejected by the
CAPTCHA, then a random response has nontrivial probability (greater than 1%
say) to not be rejected.
No problem: the response is just a single bit, so a random response has a
50% probability of not being rejected.
6. ALL BRANCHES ARE FINITE: Every branch is a FINITE sequence consisting of
ROOT, CHALLENGE(1), RESPONSE(1), (continue), ... CHALLENGE(k),
RESPONSE(k), ACCEPT or REJECT.
After 10 challenge-response pairs, say, the CAPTCHA must make a
decision; even a borderline decision that could go either way must be made.