Evolving Strategies for the Prisoner’s Dilemma

JENNIFER GOLBECK
Computer Science Department
University of Maryland, College Park
College Park, MD
USA
[email protected]
http://www.cs.umd.edu/~golbeck

Abstract: This paper investigates the use of Genetic Algorithms (GAs) to evolve optimal strategies for the Prisoner’s Dilemma, a classic game theory problem. The problem was heavily studied in the 1980s; using more advanced computing techniques, this research extends the existing body of work. My hypothesis is that populations evolve to exhibit two traits: the ability to defend against defectors, and the ability to cooperate with other cooperators. Two successful, well-studied strategies that embody these traits, Pavlov and Tit for Tat, are used as controls. Populations that do not possess these traits a priori are evolved and then compared to the performance of the controls. The results presented here strongly indicate that the hypothesized traits are present in all evolved populations.

Key-Words: genetic algorithms, Prisoner’s Dilemma, game theory, evolutionary computation

1 Introduction
The Prisoner’s Dilemma is a traditional and elegant model for studying decision making and self-interest. It has been studied extensively to model behavior from petty theft to nuclear war. Much of the current body of research has focused on which strategy in the game is “best.” Different strategies have been analyzed, competed in tournaments, and even subjected to Darwinian selection. In the latter case, strategies were usually played against populations which were considered representative of the body of possible strategies.

Two strategies in particular have been found to be evolutionarily successful: Tit for Tat, which has long been touted as one of the best strategies in the game [3], and Pavlov, which has been shown to be an excellent strategy in more recent literature [7,8,9]. In looking at this problem, we were not concerned with which particular strategy was superior.
After all, to use the game as a model of behavior in complex systems, it is unlikely that an actor’s behavior can be neatly summarized as perfectly fitting the Tit for Tat model, the Pavlov model, or any other deterministic model. Even models that introduce randomness and probability do not capture the essence of the question. Indeed, the behavior of decision makers may not be accurately modeled by a deterministic strategy, but neither can their decisions accurately be explained as “random” or “probabilistic.”

Instead of searching for a single strategy to accurately capture complex and situationally dependent behavior, we are interested in the traits of those strategies which have proven to be highly effective in the past. Isolating those traits facilitates the development of a general set of strategies, all of which should perform well.

My hypothesis is that superior strategies have two traits in common. First, they can defend against defectors, and second, they can exploit the advantages of mutual cooperation with other superior players. Both the Tit for Tat and Pavlov strategies mentioned above have these traits. In this paper, a genetic algorithm evolves strategies from five different initial populations. By testing the behavior of those evolved populations in the presence of defectors and cooperators, this study shows that they perform identically to the Tit for Tat and Pavlov strategies. It follows that superior strategies, as discovered through genetic algorithms, share the two behavioral traits of the hypothesis.

2 The Prisoner’s Dilemma
The Prisoner’s Dilemma is a decision model. Anecdotally described, two people, indicated here as player 1 and player 2, are arrested for espionage and placed in separate interrogation rooms. Each player has the option to cooperate with his peer, or to defect against him. If both players cooperate (C), they are rewarded with a shorter sentence of length R, while if both defect (D), they receive a penalty P.
If one player defects and the other cooperates, the defector receives the best payoff (or temptation payoff) T for exploiting his opponent. The opponent is punished with the sucker payoff S, the worst possible outcome [11]. Here we declare [R,S,T,P] = [3,0,5,1]. The payoff matrix for this game, indicating the points received by each player in each situation, is shown in Figure 1 (scores listed as [player 1, player 2]).

                          Player 1
                    Cooperate    Defect
Player 2 Cooperate    3,3         5,0
         Defect       0,5         1,1

Fig. 1: Payoff Matrix for the Prisoner’s Dilemma

Nash Equilibrium describes what move each player will make to maximize her score based on correct assumptions about the other player’s actions. In the Prisoner’s Dilemma, regardless of what one player does, the other player is better off defecting. If player 1 cooperates, player 2 will get 5 points for defecting instead of 3 for cooperating. If player 2 defects, player 1 is still better off defecting as well, giving him a score of 1 instead of 0. Each player will use this logic, resulting in a Nash Equilibrium of (Defect, Defect). If both players were to change their moves to Cooperate, they each would triple their score.

Another approach to analyzing game theoretic problems is the concept of Pareto Efficiency. A solution is Pareto Efficient if there is no other solution that makes one player better off without making the other player worse off. In the case of the Prisoner’s Dilemma, mutual cooperation is the Pareto Efficient solution.

If players meet and play several games in a row (the iterated Prisoner’s Dilemma), the Nash Equilibrium of mutual defection becomes increasingly inefficient. If individuals are able to remember some set of prior games with their opponent, then they can each develop a more complicated strategy to maximize their score. Some strategies have no advantages over the single game. A player who cooperates regardless of previous behavior (AllC) or who always defects (AllD) will score no better than their memory-less counterparts.
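The one-shot dominance argument above can be checked mechanically. The following sketch (illustrative code, not part of the paper) encodes the payoff matrix of Figure 1 and verifies both that Defect dominates Cooperate and that mutual cooperation Pareto-dominates the equilibrium:

```python
# Payoffs from the paper: [R, S, T, P] = [3, 0, 5, 1].
# PAYOFF[(my_move, opponent_move)] gives my score for one game.
PAYOFF = {
    ("C", "C"): 3,  # R: reward for mutual cooperation
    ("C", "D"): 0,  # S: sucker payoff
    ("D", "C"): 5,  # T: temptation payoff
    ("D", "D"): 1,  # P: punishment for mutual defection
}

# Whatever the opponent plays, defecting scores strictly more than
# cooperating, so (Defect, Defect) is the Nash Equilibrium.
for opp in ("C", "D"):
    assert PAYOFF[("D", opp)] > PAYOFF[("C", opp)]

# Switching jointly to (Cooperate, Cooperate) triples each score: 1 -> 3.
assert PAYOFF[("C", "C")] == 3 * PAYOFF[("D", "D")]
print("dominance and Pareto checks pass")
```

Any payoff vector satisfying T > R > P > S would pass the same dominance check; the specific values above are the ones declared in the text.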
Much research suggests, however, that the Tit For Tat (TFT) strategy is very successful. This strategy simply states that a player should repeat the opponent’s move of the previous round. In earlier research, TFT has been shown to outperform most other strategies [2]. Another strategy shown to perform well against a wide range of opponents is the Pavlov strategy. This strategy, also known as Win Stay Lose Switch or Simpleton, cooperates when rewarded for cooperating or punished for defecting, and defects otherwise. In a memory-one system, where players can remember only their own move and their opponent’s move from the previous game, Pavlov players cooperate on a history of mutual cooperation or mutual defection. Since they are rewarded with a score of 3 for mutual cooperation, Pavlov players continue to cooperate. With a history of DD, players will also choose to cooperate in the next round, since they had been punished with a low score of 1. On the other hand, Pavlov players defect on a history of DC, since they had just been rewarded with the best score of 5 points for defection, and also defect with a history of CD, since they had just been severely punished with a score of 0 for cooperating. This strategy was shown to perform as well as or better than any other strategy in the memory-one iterated Prisoner’s Dilemma [9].

The TFT and Pavlov strategies have been widely studied. What features do they have in common that make them consistently competitive? Both are able to use mutual cooperation, the Pareto Efficient solution, to maximize their score, and both are able to defend themselves from receiving sucker payoffs from a defector. This research searches for these two traits in evolved populations.

3 The Genetic Algorithm
Genetic algorithms lend themselves well to studying strategies in the Prisoner’s Dilemma. Each player is represented by its strategy. In the memory-three game used in this study, each player’s strategy must address sixty-four possible histories.
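As an illustration only (the players in this study are 64-bit memory-three strategies, described next), the two memory-one control rules discussed above can be sketched as:

```python
def tit_for_tat(my_last, their_last):
    """TFT: repeat the opponent's move from the previous game."""
    return their_last

def pavlov(my_last, their_last):
    """Pavlov (Win Stay Lose Switch): cooperate after mutual cooperation
    (rewarded with R=3) or mutual defection (punished with P=1); defect
    after a mixed outcome (rewarded with T=5, or punished with S=0)."""
    return "C" if my_last == their_last else "D"

# Pavlov cooperates on CC and DD histories...
assert pavlov("C", "C") == "C" and pavlov("D", "D") == "C"
# ...and defects on DC (just earned T=5) and CD (just suffered S=0).
assert pavlov("D", "C") == "D" and pavlov("C", "D") == "D"
```

Note that Pavlov’s rule reduces to comparing the two previous moves, which is why it is sometimes described as the "Simpleton" strategy.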
We use the set of moves to create a 64-bit string which represents each player in the algorithm. Table 1 shows each string position, the history it represents, and a sample strategy.

After calculating fitness, which is described in the next section, this study implements roulette wheel selection, also called stochastic sampling with replacement [4]. In this stochastic algorithm, the fitness of each individual is normalized. Based on their fitness, individuals are mapped to contiguous segments of a line, such that each individual’s segment is equal in size to its normalized fitness. A random number is generated, and the individual whose segment spans that number is selected. The process is repeated until the correct number of individuals is obtained.

Table 2 shows a sample population with calculated and normalized fitness. Figure 2 shows the line with selection probability for ten individuals using a roulette wheel implementation. Individual 1 has a normalized fitness of approximately 0.20, which gives it a 1 in 5 chance of being selected. Individual 10 has the lowest fitness, with a normalized value of 0.02. If an individual had a fitness of zero, it would have no chance of being selected to propagate into the new population. Random points are selected on this line to choose individuals to reproduce. Children’s chromosomes (strategies) are produced by single-point crossover at a random point in the parents’ chromosomes.

The mutation rate was 0.001, which produced approximately one mutation in the population per generation, and the recombination rate was set at 0.8.

4 The Simulation
Simulations in this study utilized a genetic algorithm to evolve strategies for the Prisoner’s Dilemma. Each simulation began with an initial population of twenty players represented by their strategies. Several terms are used in this section.
A game refers to one “turn” in the Prisoner’s Dilemma: both players make simultaneous moves, and each is awarded points based on the outcome. A round is a set of games between two players; rounds in this study are 64 games long. A cycle is completed when every player has played one round against every other player.

To determine fitness, each player was paired with every other for one round of 64 games. Players did not compete against themselves. Since there are sixty-four possible histories, this number of games ensures that each reachable history is visited at least once. After each game, the players’ scores are tallied and their histories are updated. Players maintain a performance score, which is the sum of the points that they receive in every game against every player. The maximum possible performance score is 6,080: if a player defected in every game and his opponents all cooperated in every game, he would receive 5 points × 64 games × 19 opponents. For a player who is able to mutually cooperate in every game, the performance score would be 3,648 (3 points × 64 games × 19 opponents).

After a full cycle of play, players are ranked according to their performance score and selected to reproduce. Recombination occurs, children replace parents, and the cycle repeats. At the end of each generation, the total score for the population is tallied. This value is the sum of the scores of all members in the population.
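The selection machinery described in Section 3 can be sketched as follows. This is a minimal illustration, not the study’s implementation; it uses the sample population of Table 2 as data and checks that individual 1 is picked at roughly its 0.20 normalized-fitness rate.

```python
import random

random.seed(0)  # fixed seed so the example is repeatable

def roulette_select(fitnesses, n):
    """Stochastic sampling with replacement: each individual occupies a
    contiguous segment of a line whose length equals its fitness, and a
    uniform random point on the line picks the winner."""
    total = sum(fitnesses)
    chosen = []
    for _ in range(n):
        r = random.uniform(0, total)
        acc = 0.0
        for i, f in enumerate(fitnesses):
            acc += f               # walk the contiguous segments
            if r <= acc:
                chosen.append(i)
                break
    return chosen

def single_point_crossover(a, b):
    """Children produced by crossover at one random point in the parents'
    64-bit strategy strings."""
    point = random.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

fits = [27, 22, 18, 15, 13, 12, 9, 8, 4, 3]  # Table 2 sample population
picks = roulette_select(fits, 10000)
share = picks.count(0) / 10000
print(share)  # close to 27/131, i.e. roughly 0.206

c1, c2 = single_point_crossover("C" * 64, "D" * 64)
assert c1.startswith("C") and c1.endswith("D")  # exactly one crossover point
```

A fitness-proportionate scheme like this never selects a zero-fitness individual, matching the observation in Section 3.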
Pos  History  Move   |   Pos  History  Move
 0   CCCCCC    C     |   32   DCCCCC    D
 1   CCCCCD    D     |   33   DCCCCD    C
 2   CCCCDC    D     |   34   DCCCDC    D
 3   CCCCDD    D     |   35   DCCCDD    D
 4   CCCDCC    C     |   36   DCCDCC    C
 5   CCCDCD    C     |   37   DCCDCD    C
 6   CCCDDC    C     |   38   DCCDDC    D
 7   CCCDDD    D     |   39   DCCDDD    D
 8   CCDCCC    C     |   40   DCDCCC    D
 9   CCDCCD    D     |   41   DCDCCD    C
10   CCDCDC    D     |   42   DCDCDC    C
11   CCDCDD    D     |   43   DCDCDD    C
12   CCDDCC    D     |   44   DCDDCC    C
13   CCDDCD    D     |   45   DCDDCD    D
14   CCDDDC    C     |   46   DCDDDC    D
15   CCDDDD    C     |   47   DCDDDD    C
16   CDCCCC    C     |   48   DDCCCC    C
17   CDCCCD    C     |   49   DDCCCD    D
18   CDCCDC    D     |   50   DDCCDC    C
19   CDCCDD    C     |   51   DDCCDD    C
20   CDCDCC    C     |   52   DDCDCC    C
21   CDCDCD    D     |   53   DDCDCD    C
22   CDCDDC    D     |   54   DDCDDC    D
23   CDCDDD    C     |   55   DDCDDD    D
24   CDDCCC    C     |   56   DDDCCC    C
25   CDDCCD    C     |   57   DDDCCD    C
26   CDDCDC    D     |   58   DDDCDC    D
27   CDDCDD    C     |   59   DDDCDD    D
28   CDDDCC    D     |   60   DDDDCC    C
29   CDDDCD    C     |   61   DDDDCD    C
30   CDDDDC    C     |   62   DDDDDC    D
31   CDDDDD    D     |   63   DDDDDD    C

Table 1: The table above shows the strategy for a randomly generated player (string position, the history it represents, and the move played). The corresponding 64-bit string used to represent this strategy in the algorithm is
CDDDCCCD CDDDDDCC CCDCCDDC CCDCDCCD DCDDCCDD DCCCCDDC CDCCCCDD CCDDCCDC

Individual           1     2     3     4     5     6     7     8     9     10
Fitness             27    22    18    15    13    12     9     8     4      3
Normalized Fitness  0.20  0.17  0.14  0.11  0.10  0.10  0.07  0.06  0.03  0.02

Table 2: Sample population showing fitness and normalized fitness for each individual

Figure 2: Continuous line divided for the 10 individuals of the sample population. Random points generated along this line determine which individuals will be selected for reproduction.

While the maximum score for an individual is 6,080, the maximum score for a population of 20 cannot be 20 times that: for one individual to score the maximum, all others must score very low. The highest cumulative score achievable in an individual game is 6, when both players receive a score of 3 for mutual cooperation. Mutual defection would have a total game score of 2 (1 point each), and mixed plays, with one cooperator and one defector, have a game total of 5 (5 for the defector plus 0 for the cooperator). Thus, the highest score that a population can achieve is 72,960 (3 points × 64 games × 19 opponents gives 3,648 per player, × 20 players total).
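The sample strategy of Table 1 can be queried programmatically. In the sketch below (illustrative code, not from the paper), a six-move history is read as a binary number with C = 0 and D = 1, which is consistent with the ordering of positions in Table 1:

```python
# The 64-bit sample strategy from Table 1, with spaces removed.
STRATEGY = ("CDDDCCCD CDDDDDCC CCDCCDDC CCDCDCCD "
            "DCDDCCDD DCCCCDDC CDCCCCDD CCDDCCDC").replace(" ", "")

def history_to_index(history):
    """Map a six-letter history to its string position, treating the
    history as a binary number with C = 0 and D = 1 (as in Table 1)."""
    index = 0
    for move in history:
        index = index * 2 + (1 if move == "D" else 0)
    return index

def next_move(strategy, history):
    """Look up the move a player makes given its recent history."""
    return strategy[history_to_index(history)]

print(next_move(STRATEGY, "CCCCCC"))  # position 0  -> C
print(next_move(STRATEGY, "CCCCCD"))  # position 1  -> D
print(next_move(STRATEGY, "DDDDDD"))  # position 63 -> C
```

Since each of the 64 positions is reachable, a round of 64 games (as described in Section 4) is the minimum length that can exercise every entry of the string at least once.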
In the end, the fitness of a population is measured by the percentage of the highest possible score that it achieves. A population with a total score of 63,480, for example, would have a population fitness of 87% (63,480 / 72,960).

In these experiments, each evolutionary simulation ran for 200,000 generations. This gave plenty of time for genomic stabilization without requiring too much computing time. Results for selected, much longer simulations (up to 4,000,000 generations) were not significantly different.

5 Methodology
To test whether or not a population had each of the two traits described in the hypothesis, players’ behavior in these experiments is compared to the behavior of both the Tit for Tat and Pavlov populations. A straight comparison, however, would not yield productive results; through the evolutionary process, noise is introduced into any population, even those well evolved.

Consider a population that has evolved Tit for Tat-like behavior. That population is likely using only one quarter to one third of its encoded genes, because many of the possible histories are not reachable by a Tit for Tat player (e.g., DCDCDC, where a cooperating player never defends against the defecting opponent). This means that an individual might look very little like an unevolved Tit for Tat player. This noise in the gene can show up in simulations because each player starts with a randomly generated initial history. Thus, it is possible for a Tit for Tat player to start with and visit several histories which would not otherwise be reachable in play alone. If a noisy Tit for Tat player winds up with an initial history that visits a “corrupted” gene (which is not unlikely), play can take a very non-Tit for Tat-like path by playing through these genes, which are less affected by selection.
Thus, noise is added into the Tit for Tat and Pavlov populations when using them for comparison, to discount the effect of noise in the evolved populations and focus on the actual evolved behavior.

Five distinct populations were used to compare behavior before and after evolution. Tit For Tat and Pavlov, as discussed previously, were the two control populations for this experiment. Both have the inherent ability to exploit mutual cooperation and defend against defectors. The three other populations were composed of AllC players, of AllD players, and of independently randomly initialized players, respectively.

To measure the performance of populations, the average fitness over the last 10,000 generations of each simulation was studied. Starting with the five initial populations, each was evolved for 200,000 generations. This evolution was simulated several hundred times for each initial population.

Significance was calculated by the standard two-tailed t-test for data sets with unequal variance. Each population was compared to the Tit for Tat and Pavlov controls.

6 Results
Recall the hypothesis: evolved populations of players develop 1) the ability to defend against defectors, and 2) the ability to take advantage of mutual cooperation. Below, the results which support this hypothesis are outlined. Statistical data generated for these results is contained in Table 3.

After a period of evolution and play as described earlier, the average performances of the five populations were statistically equal. The natural question to ask is whether this equality came about as a result of “random drift” of the populations, or because they were evolutionarily driven in that direction. Random drift occurs when strategies are recombined and mutated without selection. Since there is no pressure to perform, there is no meaning associated with the genomic makeup of a population after evolution: each specific gene has occurred simply by chance mutation or recombination, and the performance of such a population is generally low.
By turning off the selection mechanism in the genetic algorithm, results for a random drift population were generated. The evolved populations all performed well above the level of the random drift population, indicating that they exhibit evolutionarily preferred traits (shown in Table 3). The next step is to test whether the improved performance is due to the presence of the traits described in the hypothesis, namely defensive and cooperative abilities.

The first experiment looks for the ability to defend against defectors. In this experiment, the five unevolved, initial populations were mixed with a small set of AllD players. Fitness of those populations was calculated over the first 10,000 generations immediately following inoculation. Both Pavlov and TFT performed well, with scores around 80%. None of the Random, AllC, or AllD populations came near this level; by the standard t-test, all were significantly lower than both Pavlov and TFT with p = .01.

The same experiment was performed with the five populations after evolution. After 200,000 generations, the populations were mixed with a small group of AllD players, and fitness was calculated over the next 10,000 generations. Looking at the average fitness of all five populations, there was no statistical difference in performance among them with p = .01. Additionally, comparing these results to the performance of unevolved Tit for Tat and Pavlov players, there was no statistical difference. Finally, there was no statistical difference between the performance of the inoculated populations and the uninoculated evolved populations, indicating that defectors had no effect on the performance of evolved populations.

The second part of the hypothesis predicts that populations evolve the ability to cooperate with other cooperators. Repeating the same experimental structure as above, the five unevolved populations were mixed with a small set of AllC players.
Tit for Tat and Pavlov again performed at nearly 80% of the maximum fitness, as did the initial population of AllC players. AllC players always cooperate by their nature. In an initial population made up entirely of AllC players, mutual cooperation is the norm, and introducing more AllC players to that population obviously does not change it. The prevalence of mutual cooperation explains the excellent performance of the unevolved AllC population.

Unevolved populations inoculated with AllC
                                      Tit For Tat  Pavlov      Cooperate   Defect      Random
Mean                                  0.7954       0.7999      0.7873      0.8784      0.4384
t-test with unevolved Tit for Tat     1            0.8423      0.7179      1.9440E-05  1.9420E-45
t-test with unevolved Pavlov          0.8423       1           0.5659      3.2140E-05  4.6290E-48

Unevolved populations inoculated with AllD
                                      Tit For Tat  Pavlov      Cooperate   Defect      Random
Mean                                  0.8138       0.8240      0.7348      0.8825      0.4444
t-test with unevolved Tit for Tat     0.3974       0.2022      0.0077      7.3307E-06  1.3276E-44
t-test with unevolved Pavlov          0.5153       0.2740      0.0035      1.1989E-05  3.3702E-47

Evolved populations with no inoculation
                                      Tit For Tat  Pavlov   Cooperate  Defect   Random   Random Drift Population
Mean                                  0.7990       0.7970   0.8189     0.8045   0.8206   0.7517
t-test with unevolved Tit for Tat     0.8768       0.9427   0.2899     0.2733   0.2733   0.0091
t-test with unevolved Pavlov          0.9667       0.8961   0.3837     0.3603   0.3603   0.0029
t-test with uninoculated Tit for Tat  1            0.9310   0.3681     0.3459   0.3459   0.0046
t-test with uninoculated Pavlov       0.9310       1        0.3122     0.2940   0.2939   0.0047

Evolved populations inoculated with AllC
                                      Tit For Tat  Pavlov   Cooperate  Defect   Random
Mean                                  0.8145       0.7980   0.7977     0.8029   0.8297
t-test with unevolved Tit for Tat     0.3975       0.9117   0.9190     0.7364   0.1147
t-test with unevolved Pavlov          0.5115       0.9285   0.9163     0.8943   0.1622
t-test with uninoculated Tit for Tat  0.4911       0.9628   0.9518     0.8618   0.1562
t-test with uninoculated Pavlov       0.4277       0.9678   0.9770     0.7876   0.1228

Evolved populations inoculated with AllD
                                      Tit For Tat  Pavlov   Cooperate  Defect   Random
Mean                                  0.7981       0.8019   0.8015     0.8084   0.8248
t-test with unevolved Tit for Tat     0.9063       0.7837   0.7904     0.5696   0.1878
t-test with unevolved Pavlov          0.9345       0.9329   0.9455     0.7087   0.2561
t-test with uninoculated Tit for Tat  0.9690       0.9019   0.9132     0.6801   0.2459
t-test with uninoculated Pavlov       0.9619       0.8335   0.8421     0.6106   0.2019

Table 3: Statistical Results

Compared to the controls and the AllC population, both the Random and AllD populations performed significantly worse, at about 74% and 44% fitness respectively.

After 200,000 generations of evolution, the populations were again mixed with a small set of AllC cooperators. As was the case when inoculated with defectors, the fitnesses of the five evolved populations were statistically indistinguishable from one another. Comparing the performance of the evolved populations with that of the unevolved Tit for Tat and Pavlov populations, there was again no difference among populations. Finally, the performance of the inoculated populations was compared to that of the uninoculated populations, and they were not statistically different.

7 Discussion
These results lead to several conclusions. Our first experiment shows that defectors affect all five of the evolved populations in the same way. They react identically, but does this necessarily indicate that they all have a defensive ability? Since the populations which were unable to defend against defectors a priori exhibit that behavior after evolution, they must have evolved that ability over time.

The second set of conclusions that can be drawn are those regarding mutual cooperation. The results here show that evolved populations are able to cooperate among themselves, since they perform the same as the control populations in the presence of cooperators.
Further, one can conclude that populations exhibit this behavior even without the experimental conditions, since there is no difference between performance in the natural, evolved environment and performance in the presence of pure cooperators.

With the results outlined above, it follows that in this experiment, evolved populations performed equivalently to Tit for Tat and Pavlov. Specifically, these experiments show that evolved populations are able to fend off defectors and to cooperate mutually with other evolved individuals. Since these populations did not have such abilities a priori, it follows that evolution introduced this behavior over time.

Some preliminary simulations have been run to study this phenomenon in probabilistic strategies. Initial results show no difference between deterministic and probabilistic populations. A more thorough study would be useful in generalizing or limiting the results described here.

8 Acknowledgements
My deepest gratitude to Prof. Stuart Kurtz of the University of Chicago and Prof. Jim Reggia of the University of Maryland at College Park. Without their support and advice this paper would not have been possible.

References:
[1] Axelrod, Robert. The Evolution of Cooperation. New York: Basic Books, 1984.
[2] Axelrod, Robert. “Laws of Life: How Standards of Behavior Evolve.” The Sciences 27 (Mar/Apr 1987): 44-51.
[3] Axelrod, Robert. The Complexity of Cooperation. Princeton: Princeton University Press, 1997.
[4] Baker, J. E. “Reducing Bias and Inefficiency in the Selection Algorithm.” Proceedings of the Second International Conference on Genetic Algorithms and their Application. Hillsdale, New Jersey, USA: Lawrence Erlbaum Associates, 1987: 14-21.
[5] Goldberg, David. Genetic Algorithms in Search, Optimization and Machine Learning. New York: Addison-Wesley Publishing Co., 1989.
[6] Holland, John. Hidden Order: How Adaptation Builds Complexity. New York: Perseus Co., 1996.
[7] Kraines, David and Vivian Kraines.
“Pavlov and the Prisoner’s Dilemma.” Theory and Decision 26: 47-49.
[8] Kraines, David and Vivian Kraines. “Evolution of Learning among Pavlov Strategies in a Competitive Environment with Noise.” Journal of Conflict Resolution 39: 439-466.
[9] Kraines, David and Vivian Kraines. “Natural Selection of Memory-one Strategies for the Iterated Prisoner’s Dilemma.” Journal of Theoretical Biology 203: 335-355.
[10] Mitchell, Melanie. An Introduction to Genetic Algorithms. Cambridge: MIT Press, 1998.
[11] Osborne, Martin J. and Ariel Rubinstein. A Course in Game Theory. Cambridge, Massachusetts: The MIT Press, 1994.
[12] Smith, John Maynard. Evolution and the Theory of Games. Cambridge: Cambridge University Press, 1982.