So I was thinking about the problem of the iterated prisoner's dilemma and trying to come up with a formulation in which always defecting is not the right thing to do. I think the suggestion from the beginning of last night's meeting is the right one: instead of playing for a fixed number of rounds, pick a gamma in (0,1) and, after each round, play another round with probability gamma.

For reference, the payoff matrix for PD is

                    opponent cooperates    opponent defects
    I cooperate          (-1, -1)              (-9,  0)
    I defect             ( 0, -9)              (-6, -6)

where each pair is (my payoff, opponent's payoff).

For iterated PD, a strategy is a mapping from the history of your and your opponent's past plays to a probability distribution over {cooperate, defect}.

Now, suppose that my opponent is playing tit-for-tat, and I haven't defected yet. If I plan to defect for this and all future rounds, my expected payoff is

    0 + gamma * -6 + gamma^2 * -6 + ... = -6 gamma / (1 - gamma)

On the other hand, if I plan to cooperate for this and all future rounds, my expected payoff is

    -1 + gamma * -1 + gamma^2 * -1 + ... = -1 / (1 - gamma)

So, if gamma > 1/6, always cooperating is better than always defecting. (There is a quick numerical check of this threshold a little further down.)

Of course, there are infinitely many other strategies besides always-cooperate and always-defect. Unfortunately, when playing against tit-for-tat, not all of these strategies are distinct. For example, any two strategies that differ only on what to do when the opponent defects first will play identically, since tit-for-tat never defects first. So it's hard to define the best response to tit-for-tat: both tit-for-tat and always-cooperate are equally good.

There are a couple of ways we might try to make different strategies distinct. One is to assume that the history reported to me might contain some errors. Another is to assume that I might accidentally cooperate when I meant to defect or vice versa (called the "trembling hand" assumption). I don't know what the best response to tit-for-tat is under either of these assumptions. I think always-defect is a Nash equilibrium under either assumption: if you know your opponent is always going to defect, you have no incentive ever to cooperate. Hopefully, there are also other, better equilibria; maybe we could find one by doing something like policy iteration starting from tit-for-tat.

In any case, I do know a better-than-terrible equilibrium for a game which is similar to iterated PD. So, here goes. This example is based on one I saw in a paper ("Evolution and Information in a Gift-Giving Game," by Johnson, Levine, and Pesendorfer, at http://levine.sscnet.ucla.edu/PAPERS/evob25.pdf), but I've changed it to fit our Prisoner's Dilemma framework. The idea is that we are in a society in which players are randomly selected to be matched against one another (a different opponent each round), and in which the players can observe (at least some information about) the history of their opponents. On every round, some players die and are replaced by new players, and just as before, the whole game ends with probability 1 - gamma after each round. I will assume gamma is close to 1. Also, to keep strategies distinct, I will assume that hands tremble with a small probability.

Suppose each player lives for two rounds, and on each round plays a game of PD against another player. For simplicity, suppose that every game matches one old and one young individual. Since the old individual is never going to play again, cooperating is a dominated strategy for the old player, and we can eliminate it; so the game becomes

                    OLD (always defects)
    YOUNG  C             (-9,  0)
           D             (-6, -6)

where each pair is (young's payoff, old's payoff), and it is played twice by each individual, once as a young player and once as an old player.
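Before going on with this two-round game, here is the promised numerical check of the gamma > 1/6 threshold from the tit-for-tat calculation above. It's a minimal Python sketch; the payoff numbers and the gamma-continuation model are as above, and the helper names are mine.

    # My payoff for (my play, opponent's play), from the PD matrix above.
    R = {('C', 'C'): -1, ('C', 'D'): -9, ('D', 'C'): 0, ('D', 'D'): -6}

    def discounted_vs_tit_for_tat(my_plan, gamma, horizon=10_000):
        """Discounted payoff of a fixed plan against tit-for-tat, which
        cooperates on the first round and then copies my previous play.
        The finite horizon just truncates the geometric series numerically."""
        total, opp = 0.0, 'C'
        for t in range(horizon):
            me = my_plan(t)
            total += (gamma ** t) * R[(me, opp)]
            opp = me  # tit-for-tat will copy my play on the next round
        return total

    for gamma in (0.1, 1/6, 0.2, 0.9):
        coop = discounted_vs_tit_for_tat(lambda t: 'C', gamma)  # -1/(1-gamma)
        defe = discounted_vs_tit_for_tat(lambda t: 'D', gamma)  # -6*gamma/(1-gamma)
        print(f"gamma={gamma:.3f}  always-C={coop:8.2f}  always-D={defe:8.2f}")

Always-defect wins at gamma = 0.1, the two plans tie at gamma = 1/6, and always-cooperate wins above that, matching the closed forms. OK, back to the two-round game.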
The old player has no options, so we can write a strategy as a function that takes the history of the old player and produces a play for the young player.

One interesting thing about this game is that it has infinitely many players even if the society has a finite size. If the society contains n young and n old individuals on the first step, then we need n new players on the second step, n more on the third, and so forth. In any one series of rounds, only finitely many of the players will ever get to make a choice (since gamma < 1); but we need all countably many of them just in case the game gets to the round in which they are scheduled to be born. I don't know if it is valid to do so, but I intend to use the same definition of equilibrium that I would use for finitely many players. In other words, I will say that a Nash equilibrium is a situation in which, given the strategies of the other players, no one player wants to switch.

Another interesting thing about this game is that the young player may, at a cost of 3 units, cooperate instead of defect; this action gives the old player 6 units, so it is in the common good for everyone to cooperate. But, as a one-shot game, the cooperate strategy is dominated for the young player, and so the static Nash equilibrium is to defect.

Now consider the following strategy, which I will call GREEN. It is based on labeling each player with a color, either red or green, according to his/her history. At the beginning of time, label every old player green. Now, if a young player cooperates against an older player, label the young player with the same color as the old one, and if the young player defects, label the young player with the opposite color from the old one. (A label doesn't change when a player changes from young to old.) Finally, if your opponent is labeled green, cooperate; otherwise defect. This is a valid strategy because it is a function from the history of your opponent (including the history of his/her previous opponent, and the history of that opponent's previous opponent, and so on) to actions.

The GREEN strategy is designed to reinforce greenness. It does so by punishing anyone who defects against a green player. Crucially, it is also policing-friendly: it rewards anyone who punishes a red player (since red players include everyone who defected against a green player). So, unlike tit-for-tat, it doesn't punish you for punishing someone else, as long as you were justified in punishing him/her.

Now consider my choice of a strategy as a new young player on round t: if all the other players are playing GREEN, then in particular all of the players born on round t+1 will play GREEN. So, I should play GREEN too, since I will then get a total reward of -9 + 0 = -9, instead of -6 + -6 = -12 for defecting against a green-labeled player, or -9 + -6 = -17 for cooperating against a red-labeled player. (The fact that gamma is a little less than 1 and the small probability of trembling hands change these calculations slightly, but not enough to change the ordering of payoffs.) Of course, there are strategies which don't depend on whether my opponent is labeled green or red, but any change away from GREEN only increases the probability that I will be labeled red, and so isn't worth it. Now, the same reasoning applies to all players by symmetry. So, GREEN is an equilibrium for this game, and it is different from the static Nash equilibrium.
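To double-check the -9 vs. -12 comparison, here is a minimal Python sketch of one player's lifetime in this game when everyone else plays GREEN. The function names are mine; gamma and trembling hands are left out so the arithmetic stays exact (as noted above, they only change the numbers slightly).

    # Reduced two-round game: only the young player chooses. Pairs are (young, old).
    PAYOFF = {'C': (-9, 0), 'D': (-6, -6)}

    def young_label(action, old_label):
        """Labeling rule: cooperating copies the old player's label, defecting flips it."""
        if action == 'C':
            return old_label
        return 'red' if old_label == 'green' else 'green'

    def green_play(old_label):
        """The GREEN strategy for a young player: cooperate with green, defect against red."""
        return 'C' if old_label == 'green' else 'D'

    def lifetime(my_young_play):
        """Total payoff over one player's two rounds when everyone else plays GREEN,
        so the old opponent in round 1 is green, and the young opponent in round 2
        plays GREEN against whatever label I ended up with."""
        a = my_young_play('green')
        payoff_as_young, _ = PAYOFF[a]
        my_label = young_label(a, 'green')
        b = green_play(my_label)           # what the next young player does to me
        _, payoff_as_old = PAYOFF[b]
        return payoff_as_young + payoff_as_old

    print(lifetime(green_play))            # -9 + 0  = -9   (follow GREEN)
    print(lifetime(lambda label: 'D'))     # -6 + -6 = -12  (defect while young)

The -17 case (cooperating against a red-labeled player) never comes up in this check, since when everyone plays GREEN there are no red players to meet.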
Incidentally, the static Nash equilibrium is also an equilibrium for this game, since if I know that I will never get rewarded in the second round, I have no incentive to do anything but defect in the first round. But it is a suboptimal equilibrium, since GREEN gets better payoffs.

====================

I thought some more, and I came up with an equilibrium for the original iterated Prisoner's Dilemma game. It works either one-on-one or in a population where players change opponents every round. Like the strategy I described earlier, it is based on labeling other players red or green, then cooperating with green-labeled players and defecting against red-labeled ones. A player will be labeled green in the first round and on all rounds where, for every previous round, s/he cooperated when faced with a green-labeled player and defected when faced with a red-labeled player. Otherwise the player will be labeled red.

This strategy is an equilibrium (for large enough gamma) in either the trembling-hands model or the history-may-be-inaccurate model: if my opponent(s) play(s) this strategy, it is never to my advantage to turn my own label red. The overall outcome is much better, though, in the history-may-be-inaccurate model. In that model, I will occasionally misread my opponent's label, but doing so will not turn my own label red, since my history will reflect that I saw the wrong label. In the trembling-hands model, on the other hand, I will eventually turn my own label red no matter how small the mistake probability is, and then I can never recover. But in either case the expected total reward for playing this equilibrium is better than for playing always-defect.

====================

Policy iteration starting from tit-for-tat doesn't give us a good policy for iterated PD. The best response to tit-for-tat in either the trembling-hands or the inaccurate-history model is always-cooperate. It's not tit-for-tat, since there will eventually be a defection due to some mistake, and that will degenerate into CDCDCD vs. DCDCDC, which will in turn either degenerate to DDD vs. DDD or revert to CCC vs. CCC. In fact, any intentional defection against tit-for-tat hurts my overall expected score: the D vs. C round will be followed by some number of D vs. D rounds, then possibly a C vs. D to get back on track again. The D vs. D rounds all hurt my score, and the C vs. D hurts my score more than the D vs. C helped it. But the best response to always-cooperate is always-defect, and the best response to always-defect is always-defect. So, policy iteration starting from tit-for-tat leads to always-defect, which is a Nash equilibrium but not a very nice one.

====================

Iterated PD is very popular on the web; a quick search turned up (among others) http://socrates.berkeley.edu/~jgwacker/GA/IPD.html, a web page describing a paper in which the author used GAs to evolve societies of PD players in the presence of noise. Apparently the strategies which evolved tended to be what he called "secret handshakes": players would verify that they had the same genotype by performing a complicated sequence of moves, then go into either always-cooperate or always-defect mode depending on whether the handshake was performed correctly or not. Such a strategy is clearly not an equilibrium, since its best response is the strategy which performs the secret handshake and then goes into always-defect mode.
(The web page didn't mention this best response, but I would be surprised if it hadn't evolved at some point, since its representation is very similar to the representation of the secret-handshake player.) It would be nice if the GA had found the equilibrium I described earlier. But, it turns out that the description language the author used couldn't represent this equilibrium, so I guess it's not too surprising the GA didn't find it.
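====================

To make the policy-iteration remark above a bit more concrete: against plain tit-for-tat (ignoring trembles and history noise, which only perturb the values slightly), my side of the game is a two-state MDP whose state is the move tit-for-tat is about to make, i.e. a copy of my own previous move. Here is a minimal value-iteration sketch in Python (the encoding and names are mine); for gamma = 0.9 it recovers always-cooperate as the best response, with values matching -1/(1-gamma) and -9 + gamma * -1/(1-gamma).

    GAMMA = 0.9

    # My payoff for (my play, opponent's play), from the PD matrix above.
    R = {('C', 'C'): -1, ('C', 'D'): -9, ('D', 'C'): 0, ('D', 'D'): -6}

    # States: the move tit-for-tat will make this round (= my previous move).
    states = ('C', 'D')
    V = {s: 0.0 for s in states}

    for _ in range(2000):  # value iteration; playing a leads to state a next round
        V = {s: max(R[(a, s)] + GAMMA * V[a] for a in ('C', 'D')) for s in states}

    policy = {s: max(('C', 'D'), key=lambda a: R[(a, s)] + GAMMA * V[a])
              for s in states}
    print(policy)  # {'C': 'C', 'D': 'C'}: cooperate in both states, i.e. always-cooperate
    print(V)       # roughly {'C': -10.0, 'D': -18.0}

This is consistent with the claim above that any intentional defection against tit-for-tat lowers my expected total.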