So I was thinking about the problem of the iterated prisoner's dilemma and trying to come up with a formulation in which always defecting is not the right thing to do. I think the suggestion from the beginning of last night's meeting is the right one: instead of playing for a fixed number of rounds, pick a gamma in (0,1) and, after each round, play another round with probability gamma.

For reference, the payoff matrix for PD is

                    opponent cooperates    opponent defects
    I cooperate          (-1, -1)              (-9,  0)
    I defect             ( 0, -9)              (-6, -6)

where each pair is (my payoff, opponent's payoff).

For iterated PD, a strategy is a mapping from the history of your and your opponent's past plays to a probability distribution over {cooperate, defect}.

Now, suppose that my opponent is playing tit-for-tat, and I haven't defected yet. If I plan to defect for this and all future rounds, my expected payoff is

    0 + gamma * -6 + gamma^2 * -6 + ... = -6 gamma / (1 - gamma)

On the other hand, if I plan to cooperate for this and all future rounds, my expected payoff is

    -1 + gamma * -1 + gamma^2 * -1 + ... = -1 / (1 - gamma)

So, if gamma > 1/6, always cooperating is better than always defecting. (There is a quick numerical check of this threshold a little further down.)

Of course, there are infinitely many other strategies besides always-cooperate and always-defect. Unfortunately, when playing against tit-for-tat, not all of these strategies are distinct. For example, any two strategies that differ only on what to do when the opponent defects first will play identically, since tit-for-tat never defects first. So it's hard to define the best response to tit-for-tat: both tit-for-tat and always-cooperate are equally good.

There are a couple of ways we might try to make different strategies distinct. One is to assume that the history reported to me might contain some errors. Another is to assume that I might accidentally cooperate when I meant to defect or vice versa (called the "trembling hand" assumption). I don't know what the best response to tit-for-tat is under either of these assumptions. I think always-defect is a Nash equilibrium under either assumption: if you know your opponent is always going to defect, you have no incentive ever to cooperate. Hopefully, there are also other, better equilibria; maybe we could find one by doing something like policy iteration starting from tit-for-tat.

In any case, I do know a better-than-terrible equilibrium for a game which is similar to iterated PD. So, here goes. This example is based on one I saw in a paper ("Evolution and Information in a Gift-Giving Game," by Johnson, Levine, and Pesendorfer, at http://levine.sscnet.ucla.edu/PAPERS/evob25.pdf), but I've changed it to fit our Prisoner's Dilemma framework. The idea is that we are in a society in which players are randomly selected to be matched against one another (a different opponent each round), and in which the players can observe (at least some information about) the history of their opponents. On every round, some players die and are replaced by new players, and just as before, the whole game ends with probability 1 - gamma after each round. I will assume gamma is close to 1. Also, to keep strategies distinct, I will assume that hands tremble with a small probability.

Suppose each player lives for two rounds, and on each round plays a game of PD against another player. For simplicity, suppose that every game matches one old and one young individual. Since the old individual is never going to play again, cooperating is a dominated strategy for the old player, and we can eliminate it; so the game becomes

                    OLD (always defects)
    YOUNG  C             (-9,  0)
           D             (-6, -6)

where each pair is (young's payoff, old's payoff), and it is played twice by each individual, once as a young player and once as an old player.
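Before going on with this two-round game, here is the promised numerical check of the gamma > 1/6 threshold from the tit-for-tat calculation above. It's a minimal Python sketch; the payoff numbers and the gamma-continuation model are as above, and the helper names are mine.

    # My payoff for (my play, opponent's play), from the PD matrix above.
    R = {('C', 'C'): -1, ('C', 'D'): -9, ('D', 'C'): 0, ('D', 'D'): -6}

    def discounted_vs_tit_for_tat(my_plan, gamma, horizon=10_000):
        """Discounted payoff of a fixed plan against tit-for-tat, which
        cooperates on the first round and then copies my previous play.
        The finite horizon just truncates the geometric series numerically."""
        total, opp = 0.0, 'C'
        for t in range(horizon):
            me = my_plan(t)
            total += (gamma ** t) * R[(me, opp)]
            opp = me  # tit-for-tat will copy my play on the next round
        return total

    for gamma in (0.1, 1/6, 0.2, 0.9):
        coop = discounted_vs_tit_for_tat(lambda t: 'C', gamma)  # -1/(1-gamma)
        defe = discounted_vs_tit_for_tat(lambda t: 'D', gamma)  # -6*gamma/(1-gamma)
        print(f"gamma={gamma:.3f}  always-C={coop:8.2f}  always-D={defe:8.2f}")

Always-defect wins at gamma = 0.1, the two plans tie at gamma = 1/6, and always-cooperate wins above that, matching the closed forms. OK, back to the two-round game.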
The old player has no options, so we can write a strategy as a function that takes the history of the old player and produces a play for the young player.

One interesting thing about this game is that it has infinitely many players even if the society has a finite size. If the society contains n young and n old individuals on the first step, then we need n new players on the second step, n more on the third, and so forth. In any one series of rounds, only finitely many of the players will ever get to make a choice (since gamma < 1); but we need all countably many of them just in case the game gets to the round in which they are scheduled to be born. I don't know if it is valid to do so, but I intend to use the same definition of equilibrium that I would use for finitely many players. In other words, I will say that a Nash equilibrium is a situation in which, given the strategies of the other players, no one player wants to switch.

Another interesting thing about this game is that the young player may, at a cost of 3 units, cooperate instead of defect; this action gives the old player 6 units, so it is in the common good for everyone to cooperate. But, as a one-shot game, the cooperate strategy is dominated for the young player, and so the static Nash equilibrium is to defect.

Now consider the following strategy, which I will call GREEN. It is based on labeling each player with a color, either red or green, according to his/her history. At the beginning of time, label every old player green. Now, if a young player cooperates against an older player, label the young player with the same color as the old one, and if the young player defects, label the young player with the opposite color from the old one. (A label doesn't change when a player changes from young to old.) Finally, if your opponent is labeled green, cooperate; otherwise defect. This is a valid strategy because it is a function from the history of your opponent (including the history of his/her previous opponent, and the history of that opponent's previous opponent, and so on) to actions.

The GREEN strategy is designed to reinforce greenness. It does so by punishing anyone who defects against a green player. Crucially, it is also policing-friendly: it rewards anyone who punishes a red player (since red players include everyone who defected against a green player). So, unlike tit-for-tat, it doesn't punish you for punishing someone else, as long as you were justified in punishing him/her.

Now consider my choice of a strategy as a new young player on round t: if all the other players are playing GREEN, then in particular all of the players born on round t+1 will play GREEN. So, I should play GREEN too, since I will then get a total reward of -9 + 0 = -9, instead of -6 + -6 = -12 for defecting against a green-labeled player, or -9 + -6 = -17 for cooperating against a red-labeled player. (The fact that gamma is a little less than 1 and the small probability of trembling hands change these calculations slightly, but not enough to change the ordering of payoffs.) Of course, there are strategies which don't depend on whether my opponent is labeled green or red, but any change away from GREEN only increases the probability that I will be labeled red, and so isn't worth it. Now, the same reasoning applies to all players by symmetry. So, GREEN is an equilibrium for this game, and it is different from the static Nash equilibrium.
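To double-check the -9 vs. -12 comparison, here is a minimal Python sketch of one player's lifetime in this game when everyone else plays GREEN. The function names are mine; gamma and trembling hands are left out so the arithmetic stays exact (as noted above, they only change the numbers slightly).

    # Reduced two-round game: only the young player chooses. Pairs are (young, old).
    PAYOFF = {'C': (-9, 0), 'D': (-6, -6)}

    def young_label(action, old_label):
        """Labeling rule: cooperating copies the old player's label, defecting flips it."""
        if action == 'C':
            return old_label
        return 'red' if old_label == 'green' else 'green'

    def green_play(old_label):
        """The GREEN strategy for a young player: cooperate with green, defect against red."""
        return 'C' if old_label == 'green' else 'D'

    def lifetime(my_young_play):
        """Total payoff over one player's two rounds when everyone else plays GREEN,
        so the old opponent in round 1 is green, and the young opponent in round 2
        plays GREEN against whatever label I ended up with."""
        a = my_young_play('green')
        payoff_as_young, _ = PAYOFF[a]
        my_label = young_label(a, 'green')
        b = green_play(my_label)           # what the next young player does to me
        _, payoff_as_old = PAYOFF[b]
        return payoff_as_young + payoff_as_old

    print(lifetime(green_play))            # -9 + 0  = -9   (follow GREEN)
    print(lifetime(lambda label: 'D'))     # -6 + -6 = -12  (defect while young)

The -17 case (cooperating against a red-labeled player) never comes up in this check, since when everyone plays GREEN there are no red players to meet.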
Incidentally, the static Nash equilibrium is also an equilibrium for this game, since if I know that I will never get rewarded in the second round, I have no incentive to do anything but defect in the first round. But it is a suboptimal equilibrium, since GREEN gets better payoffs.

====================

I thought some more, and I came up with an equilibrium for the original iterated Prisoner's Dilemma game. It works either one-on-one or in a population where players change opponents every round. Like the strategy I described earlier, it is based on labeling other players red or green, then cooperating with green-labeled players and defecting against red-labeled ones. A player will be labeled green in the first round and on all rounds where, for every previous round, s/he cooperated when faced with a green-labeled player and defected when faced with a red-labeled player. Otherwise the player will be labeled red.

This strategy is an equilibrium (for large enough gamma) in either the trembling-hands model or the history-may-be-inaccurate model: if my opponent(s) play(s) this strategy, it is never to my advantage to turn my own label red. The overall outcome is much better, though, in the history-may-be-inaccurate model. In that model, I will occasionally misread my opponent's label, but doing so will not turn my own label red, since my history will reflect that I saw the wrong label. In the trembling-hands model, on the other hand, I will eventually turn my own label red no matter how small the mistake probability is, and then I can never recover. But in either case the expected total reward for playing this equilibrium is better than for playing always-defect.

====================

Policy iteration starting from tit-for-tat doesn't give us a good policy for iterated PD. The best response to tit-for-tat in either the trembling-hands or the inaccurate-history model is always-cooperate. It's not tit-for-tat, since there will eventually be a defection due to some mistake, and that will degenerate into CDCDCD vs. DCDCDC, which will in turn either degenerate to DDD vs. DDD or revert to CCC vs. CCC. In fact, any intentional defection against tit-for-tat hurts my overall expected score: the D vs. C round will be followed by some number of D vs. D rounds, then possibly a C vs. D to get back on track again. The D vs. D rounds all hurt my score, and the C vs. D hurts my score more than the D vs. C helped it. But the best response to always-cooperate is always-defect, and the best response to always-defect is always-defect. So, policy iteration starting from tit-for-tat leads to always-defect, which is a Nash equilibrium but not a very nice one.

====================

Iterated PD is very popular on the web; a quick search turned up (among others) http://socrates.berkeley.edu/~jgwacker/GA/IPD.html, a web page describing a paper in which the author used GAs to evolve societies of PD players in the presence of noise. Apparently the strategies which evolved tended to be what he called "secret handshakes": players would verify that they had the same genotype by performing a complicated sequence of moves, then go into either always-cooperate or always-defect mode depending on whether the handshake was performed correctly or not. Such a strategy is clearly not an equilibrium, since its best response is the strategy which performs the secret handshake and then goes into always-defect mode.
(The web page didn't mention this best response, but I would be surprised if it hadn't evolved at some point, since its representation is very similar to the representation of the secret-handshake player.) It would be nice if the GA had found the equilibrium I described earlier. But, it turns out that the description language the author used couldn't represent this equilibrium, so I guess it's not too surprising the GA didn't find it.
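====================

To make the policy-iteration remark above a bit more concrete: against plain tit-for-tat (ignoring trembles and history noise, which only perturb the values slightly), my side of the game is a two-state MDP whose state is the move tit-for-tat is about to make, i.e. a copy of my own previous move. Here is a minimal value-iteration sketch in Python (the encoding and names are mine); for gamma = 0.9 it recovers always-cooperate as the best response, with values matching -1/(1-gamma) and -9 + gamma * -1/(1-gamma).

    GAMMA = 0.9

    # My payoff for (my play, opponent's play), from the PD matrix above.
    R = {('C', 'C'): -1, ('C', 'D'): -9, ('D', 'C'): 0, ('D', 'D'): -6}

    # States: the move tit-for-tat will make this round (= my previous move).
    states = ('C', 'D')
    V = {s: 0.0 for s in states}

    for _ in range(2000):  # value iteration; playing a leads to state a next round
        V = {s: max(R[(a, s)] + GAMMA * V[a] for a in ('C', 'D')) for s in states}

    policy = {s: max(('C', 'D'), key=lambda a: R[(a, s)] + GAMMA * V[a])
              for s in states}
    print(policy)  # {'C': 'C', 'D': 'C'}: cooperate in both states, i.e. always-cooperate
    print(V)       # roughly {'C': -10.0, 'D': -18.0}

This is consistent with the claim above that any intentional defection against tit-for-tat lowers my expected total.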