Munich Personal RePEc Archive

Using Genetic Algorithms to Develop Strategies for the Prisoner's Dilemma

Adnan Haider
Department of Economics, Pakistan Institute of Development Economics, Islamabad

12 November 2005

Online at http://mpra.ub.uni-muenchen.de/28574/
MPRA Paper No. 28574, posted 4 February 2011 06:40 UTC

USING GENETIC ALGORITHMS TO DEVELOP STRATEGIES FOR THE PRISONER'S DILEMMA

ADNAN HAIDER
PhD Fellow
Department of Economics
Pakistan Institute of Development Economics
Islamabad, Pakistan.

The Prisoner's Dilemma, a simple two-person game invented by Merrill Flood and Melvin Dresher in the 1950s, has been studied extensively in game theory, economics, and political science because it can be seen as an idealized model for real-world phenomena such as arms races (Axelrod 1984). In this paper, I describe a GA that searches for strategies to play the Iterated Prisoner's Dilemma, in which the fitness of a strategy is its average score over 100 games played with itself and with every other member of the population. Each strategy remembers the three previous turns with a given player. The GA uses a population of 20 strategies, fitness-proportional selection, single-point crossover with Pc = 0.7, and mutation with Pm = 0.001.

JEL Classifications: C63, C72
Keywords: GA, Crossover, Mutation and Fitness-proportional.

1. Introduction

The Prisoner's Dilemma can be formulated as follows. Two individuals (call them Mr. X and Mr. Y) are arrested for committing a crime together and are held in separate cells, with no communication possible between them. Mr. X is offered the following deal: if he confesses and agrees to testify against Mr. Y, he will receive a suspended sentence with probation, and Mr. Y will be put away for 5 years. However, if at the same time Mr. Y confesses and agrees to testify against Mr. X, his testimony will be discredited, and each will receive 4 years for pleading guilty. Mr. X is told that Mr. Y is being offered precisely the same deal. Both Mr. X and Mr.
Y know that if neither testifies against the other, they can be convicted only on a lesser charge, for which each will get 2 years in jail. Should Mr. X "defect" against Mr. Y and hope for the suspended sentence, risking a 4-year sentence if Mr. Y also defects? Or should he "cooperate" with Mr. Y (even though they cannot communicate), in the hope that Mr. Y will also cooperate so that each gets only 2 years, thereby risking a defection by Mr. Y that will send him away for 5 years?

The game can be described more abstractly. Each player independently decides which move to make, i.e., whether to cooperate or defect. A "game" consists of each player making a decision (a "move"). The possible results of a single game are summarized in a payoff matrix like the one shown in Table 1. Here the goal is to get as many points (as opposed to as few years in prison) as possible. (In Table 1, the payoff in each case can be interpreted as 5 minus the number of years in prison.) If both players cooperate, each gets 3 points. If player A defects and player B cooperates, then player A gets 5 points and player B gets 0 points, and vice versa if the situation is reversed. If both players defect, each gets 1 point. What is the best strategy to use in order to maximize one's own payoff? If you suspect that your opponent is going to cooperate, then you should surely defect. If you suspect that your opponent is going to defect, then you should defect too. No matter what the other player does, it is always better to defect. The dilemma is that if both players defect, each gets a worse score than if both cooperate. If the game is iterated (that is, if the two players play several games in a row), both players always defecting will lead to a much lower total payoff than the players would get if they cooperated.

____________________
The author wishes to thank an anonymous referee for providing useful comments.

Table 1.
The Payoff Matrix

                           Player A
                     Cooperate   Defect
Player B  Cooperate    3, 3       0, 5
          Defect       5, 0       1, 1

Assume a rational player is faced with playing a single game (known as a one-shot game) of the Prisoner's Dilemma described above, and that the player is trying to maximize his/her reward. If the player thinks his/her opponent will cooperate, the player will defect to receive a reward of 5, as opposed to cooperating, which would have earned him/her only 3 points. However, if the player thinks his/her opponent will defect, the rational choice is also to defect and receive 1 point rather than cooperate and receive the sucker's payoff of 0 points. Therefore the rational decision is always to defect.

But assuming the other player is also rational, he/she will come to the same conclusion as the first player. Thus both players will always defect, earning rewards of 1 point rather than the 3 points that mutual cooperation could have yielded. Therein lies the dilemma. In game theory the Prisoner's Dilemma can be viewed as a two-player, non-zero-sum, simultaneous game. Always defecting is the dominant strategy for this game, and mutual defection is the Nash equilibrium. This holds true as long as the payoffs follow the relationship T > R > P > S (T the temptation payoff, R the reward for mutual cooperation, P the punishment for mutual defection, and S the sucker's payoff; here T = 5, R = 3, P = 1, S = 0), and the gain from mutual cooperation is greater than the average score for defecting and cooperating, R > (S + T)/2. While this game may seem simple, it can be applied to a multitude of real-world scenarios. Problems ranging from businesses interacting in a market, personal relationships, and superpower negotiations to the trench-warfare "live and let live" system of World War I have all been studied using some form of the Prisoner's Dilemma.

2. Iterated Prisoner's Dilemma

The Iterated Prisoner's Dilemma (IPD) is an interesting variant of the above game in which two players play repeated games of the Prisoner's Dilemma against each other.
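In code, the one-shot payoffs of Table 1 and the repeated game just introduced can be sketched in a few lines. This is a minimal illustration, not the paper's actual system: `PAYOFF`, `play_ipd`, and the two example strategies are names invented here.

```python
# Payoff matrix from Table 1, indexed by (my_move, opponent_move).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

# The two conditions stated in the text: T > R > P > S and R > (S + T)/2.
T, R, P, S = 5, 3, 1, 0
assert T > R > P > S
assert R > (S + T) / 2

def tit_for_tat(opp_history):
    # Cooperate on the first move, then copy the opponent's last move.
    return opp_history[-1] if opp_history else "C"

def always_defect(opp_history):
    return "D"

def play_ipd(strat_a, strat_b, rounds=100):
    """Play `rounds` games; each strategy sees only the opponent's history."""
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a, b = strat_a(hist_b), strat_b(hist_a)
        score_a += PAYOFF[(a, b)]
        score_b += PAYOFF[(b, a)]
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b

print(play_ipd(tit_for_tat, tit_for_tat))      # (300, 300)
print(play_ipd(always_defect, always_defect))  # (100, 100)
```

With these payoffs, defection still dominates any single round, but two tit-for-tat players earn 300 points each over 100 rounds versus 100 each for mutual defection, which is precisely why iterating the game changes the analysis.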
In the above discussion of the Prisoner's Dilemma, the dominant mutual-defection strategy relies on the fact that it is a one-shot game, with no future. The key to the IPD is that the two players may play each other again; this allows the players to develop strategies based on previous game interactions. A player's move now may therefore affect how his/her opponent behaves in the future and thus affect the player's future payoffs. This removes the single dominant strategy of mutual defection, as players use more complex strategies, dependent on game histories, in order to maximize the payoffs they receive. In fact, under the correct circumstances mutual cooperation can emerge. The length of the IPD (i.e. the number of repetitions of the Prisoner's Dilemma played) must not be known to either player; if it were, the last iteration would become a one-shot play of the Prisoner's Dilemma, and since the players know they would not play each other again, both players would defect. The second-to-last game would then be a one-shot game (not influencing any future) and incur mutual defection, and so on until all games are one-shot plays of the Prisoner's Dilemma.

This paper is concerned with modeling the IPD described above and devising strategies to play it. The fundamental Prisoner's Dilemma will be used without alteration. A player may interact with many others but is assumed to be interacting with them one at a time. The players will have a memory of the previous three games only (memory-3 IPD).

3. Genetic Algorithms

Genetic Algorithms are search algorithms based on the mechanics of natural selection and natural genetics; John Holland at the University of Michigan originally developed them. They usually work by beginning with an initial population of random solutions to a given problem. The success of these solutions is then evaluated according to a specially designed fitness function.
A form of 'natural selection' is then performed, whereby solutions with higher fitness scores have a greater probability of being selected. The selected solutions are then 'mated' using genetic operators such as crossover and mutation. The children produced from this mating go on to form the next generation. The theory is that, as fitter genetic material is propagated from generation to generation, the solutions will converge towards an optimal solution. This research utilizes Genetic Algorithms to develop successful strategies for the Prisoner's Dilemma.

A simple GA works on the basis of the following steps:

Step 1. Start with a randomly generated population of n l-bit chromosomes (candidate solutions to a problem).

Step 2. Calculate the fitness f(x) of each chromosome x in the population.

Step 3. Repeat the following steps until n offspring have been created:

  o Select a pair of parent chromosomes from the current population, the probability of selection being an increasing function of fitness. Selection is done "with replacement", meaning that the same chromosome can be selected more than once to become a parent.
  o With probability Pc (the "crossover probability"), cross over the pair at a randomly chosen point to form two offspring. If no crossover takes place, form two offspring that are exact copies of their respective parents. (Note: crossover may be in "single-point" or "multi-point" versions of the GA.)
  o Mutate the two offspring at each locus with probability Pm (the mutation probability or mutation rate), and place the resulting chromosomes in the new population. (Note: if n is odd, one new population member can be discarded at random.)

Step 4. Replace the current population with the new population.

Step 5. Go to step 2.

4. Experimental Setup

Genetic Algorithms provide the means by which strategies for the Prisoner's Dilemma are developed in this paper. As this is the principal objective of the research, the genetic algorithm used is naturally one of the system's major components.
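The five steps of the simple GA outlined in Section 3 can be sketched as a short program. This is a toy illustration only: the bit-counting fitness stands in for the IPD tournament score used later in the paper, and all names here are invented.

```python
import random

# Parameters in the spirit of the paper: population 20, crossover 0.7, mutation 0.001.
N, L, PC, PM = 20, 16, 0.7, 0.001

def fitness(chrom):
    # Toy stand-in for the IPD tournament score: count of 1-bits.
    return sum(chrom)

def select(pop, fits):
    # Fitness-proportional selection "with replacement"; a tiny epsilon
    # guards the degenerate case where every fitness is zero.
    return random.choices(pop, weights=[f + 1e-9 for f in fits], k=1)[0]

def next_generation(pop):
    fits = [fitness(c) for c in pop]
    new_pop = []
    while len(new_pop) < N:
        p1, p2 = select(pop, fits), select(pop, fits)
        if random.random() < PC:
            # Single-point crossover at a randomly chosen point.
            point = random.randrange(1, L)
            c1 = p1[:point] + p2[point:]
            c2 = p2[:point] + p1[point:]
        else:
            # No crossover: offspring are exact copies of the parents.
            c1, c2 = p1[:], p2[:]
        for child in (c1, c2):
            # Mutate each locus with probability PM.
            new_pop.append([bit ^ (random.random() < PM) for bit in child])
    return new_pop[:N]  # if N were odd, the extra child is discarded

pop = [[random.randint(0, 1) for _ in range(L)] for _ in range(N)]
for _ in range(50):  # Steps 4-5: replace the population and repeat
    pop = next_generation(pop)
```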
The other system components have been designed to suit the Genetic Algorithm. As such, in describing the genetic algorithm's implementation, most of the other components will also be described. What follows is a description of how a genetic algorithm was implemented to evolve strategies to play the Iterated Prisoner's Dilemma.

4.1. Figuring out Strategies

The first issue is figuring out how to encode a strategy as a string. Suppose the memory of each player is one previous game. There are four possibilities for the previous game:

Case 1: CC
Case 2: CD
Case 3: DC
Case 4: DD

where C denotes "cooperate" and D denotes "defect". Case 1 is when both players cooperated in the previous game, case 2 is when player A cooperated and player B defected, and so on. A strategy is simply a rule that specifies an action in each of these cases:

If CC (Case 1) Then C
If CD (Case 2) Then D
If DC (Case 3) Then C
If DD (Case 4) Then D

If the cases are ordered in this canonical way, this strategy can be expressed compactly as the string CDCD. To use the string as a strategy, the player records the moves made in the previous game (e.g., CD), finds the case number i by looking up that case in a table of ordered cases like that given above (for CD, i = 2), and selects the letter in the ith position of the string as its move in the next game (for i = 2, the move is D). Since the tournament involves strategies that remember the three previous games, there are 64 possibilities for the previous three games:

CC CC CC (Case 1),
CC CC CD (Case 2),
CC CC DC (Case 3),
...
DD DD DC (Case 63),
DD DD DD (Case 64).

Thus a 64-letter string, e.g., CCDCDCDCDCDC..., can encode a strategy. Since using the strategy requires the results of the three previous games, we can use a 70-letter string, where the six extra letters encode three hypothetical previous games used by the strategy to decide how to move in the first actual game.
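For the memory-1 case above, the table lookup amounts to a few lines; `CASES` and `next_move` are names invented here, and the same scheme extends directly to the 70-letter memory-3 strings.

```python
# The four cases CC, CD, DC, DD in the canonical order given in the text.
CASES = ["CC", "CD", "DC", "DD"]

def next_move(strategy, prev_game):
    """strategy: e.g. "CDCD"; prev_game: my move + opponent's move, e.g. "CD".

    Returns the letter of the strategy string at the case's position.
    """
    i = CASES.index(prev_game)  # 0-based case number
    return strategy[i]

print(next_move("CDCD", "CD"))  # case 2 -> "D"
```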
Since each locus in the string has two possible alleles (C and D), the number of possible strategies is 2^70. The search space is thus far too big to be searched exhaustively.

The history in Table 2 is stated in the order: your first move, opponent's first move, your second move, opponent's second move, your third move, opponent's third move. The Move column indicates what move to play for the given history. Table 2 also shows how the TFT (tit-for-tat) strategy is encoded using this scheme. The resulting TFT chromosome (71 letters: C followed by CD repeated 35 times) is:

CCDCDCDCDCDCDCDCDCDCDCDCDCDCDCDCDCDCDCDCDCDCDCDCDCDCDCDCDCDCDCDCDCDCDCD

This is actually stored as a bit-set in computer memory (C = 1, D = 0) as follows:

11010101010101010101010101010101010101010101010101010101010101010101010

Table 2. Encoding Strategies.

Bit  History          Move      Bit  History          Move
 0   First move        C         36  C D D D C D      D
 1   Opponent C        C         37  C D D D D C      C
 2   Opponent D        D         38  C D D D D D      D
 3   Opponent C C      C         39  D C C C C C      C
 4   Opponent C D      D         40  D C C C C D      D
 5   Opponent D C      C         41  D C C C D C      C
 6   Opponent D D      D         42  D C C C D D      D
 7   C C C C C C      C         43  D C C D C C      C
 8   C C C C C D      D         44  D C C D C D      D
 9   C C C C D C      C         45  D C C D D C      C
10   C C C C D D      D         46  D C C D D D      D
11   C C C D C C      C         47  D C D C C C      C
12   C C C D C D      D         48  D C D C C D      D
13   C C C D D C      C         49  D C D C D C      C
14   C C C D D D      D         50  D C D C D D      D
15   C C D C C C      C         51  D C D D C C      C
16   C C D C C D      D         52  D C D D C D      D
17   C C D C D C      C         53  D C D D D C      C
18   C C D C D D      D         54  D C D D D D      D
19   C C D D C C      C         55  D D C C C C      C
20   C C D D C D      D         56  D D C C C D      D
21   C C D D D C      C         57  D D C C D C      C
22   C C D D D D      D         58  D D C C D D      D
23   C D C C C C      C         59  D D C D C C      C
24   C D C C C D      D         60  D D C D C D      D
25   C D C C D C      C         61  D D C D D C      C
26   C D C C D D      D         62  D D C D D D      D
27   C D C D C C      C         63  D D D C C C      C
28   C D C D C D      D         64  D D D C C D      D
29   C D C D D C      C         65  D D D C D C      C
30   C D C D D D      D         66  D D D C D D      D
31   C D D C C C      C         67  D D D D C C      C
32   C D D C C D      D         68  D D D D C D      D
33   C D D C D C      C         69  D D D D D C      C
34   C D D C D D      D         70  D D D D D D      D
35   C D D D C C      C

4.2. Fitness Function

The next problem faced when designing a Genetic Algorithm is how to evaluate the success of each candidate solution. The Prisoner's Dilemma provides a natural means of evaluating the success, or fitness, of each solution: the game payoffs. These payoffs are stored in Rules objects within the system. We can state that the strategy which earns the highest payoff score according to the rules of the IPD is the fittest, while the lowest-scoring strategy is the weakest. Thus fitness can be evaluated by playing the Prisoner objects in some form of IPD. The Game object was implemented to play a game of IPD for a specified number of rounds between two players. This object simply keeps track of the two Prisoners' scores and game histories while asking them for new moves until the rounds limit is met. The Tournament object uses this class to organize a round-robin IPD tournament, akin to Axelrod's (1990) computer tournaments. In such a tournament an array of Prisoners is supplied as the population, and every Prisoner plays an IPD Game against every other Prisoner and themselves. Each player's total payoff after these interactions have completed is deemed to be the player's fitness score.

4.3. Fitness Scaling

The fitness scores calculated above will serve to determine which players go on to reproduce and which players 'die off'. However, these 'raw' fitness values present some problems. The initial populations are likely to have a small number of very high-scoring individuals in a population of ordinary colleagues. If fitness-proportional selection is used, these high scorers will take over the population rapidly and cause the population to converge on one strategy. This strategy will be a mixture of the high scorers' strategies; however, as the population did not get time to develop these strategies, they may be suboptimal, and the population will have converged prematurely.
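This motivates rescaling the raw scores before selection. The following is a sketch of the linear scaling f' = af + b used in this paper, following Goldberg's formulation with c = 2; `scale_fitness` is a name invented here, and the branch condition that keeps low scorers non-negative is the standard Goldberg test, stated as an assumption since the paper does not spell it out.

```python
def scale_fitness(raw, c=2.0):
    """Linearly scale raw fitness so the best scores about c times the average.

    When the usual coefficients would drive the weakest scores negative,
    the weakest is instead mapped to exactly zero. Both branches preserve
    the population's average fitness.
    """
    f_avg = sum(raw) / len(raw)
    f_max, f_min = max(raw), min(raw)
    if f_max == f_avg:
        # Fully converged population: nothing to spread out.
        return [float(f) for f in raw]
    if f_min > (c * f_avg - f_max) / (c - 1):
        # Normal case: stretch so that scaled max = c * average.
        a = (c - 1) * f_avg / (f_max - f_avg)
        b = f_avg * (f_max - c * f_avg) / (f_max - f_avg)
    else:
        # Fallback: scale the weakest strategy to zero instead.
        a = f_avg / (f_avg - f_min)
        b = -f_min * f_avg / (f_avg - f_min)
    return [a * f + b for f in raw]
```

The normal branch gives the fittest individual roughly c offspring while keeping the average unchanged; the fallback handles the late-run situation, described next, where extreme stretching would otherwise push the few low scorers below zero.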
In the later generations of evolution the individuals should have begun to converge on a strategy. Thus they will all share very similar chromosomes, and the population's average fitness will likely be very close to the population's best fitness. In this situation average members and above-average members have a similar probability of reproduction, the natural-selection pressure has ended, and the algorithm is merely performing a random search among the strategies.

It is useful to scale the 'raw' fitness scores to help avoid the above situations. This algorithm uses linear scaling as described by Goldberg (1989). Linear scaling produces a linear relationship between raw fitness, f, and scaled fitness, f', as follows:

    f' = a*f + b                                          (1)

Coefficients a and b may be calculated as follows:

    a = (c - 1) * f_avg / (f_max - f_avg)                 (2)

    b = f_avg * (f_max - c * f_avg) / (f_max - f_avg)     (3)

where c is the number of times the fittest individual should be allowed to reproduce, and f_avg and f_max are the average and maximum raw fitness in the population. A value of c = 2 was found to produce accurate scaling in this method. The effect of this fitness scaling is shown in Fig. 1.

Fig. 1. Linear Fitness Scaling

This scaling works fine for most situations; however, in the later stages of evolution, when there are relatively few 'low'-scoring strategies, problems may arise. The average and best-scoring strategies have very close raw fitness, and extreme scaling is required to separate them. Applying this scaling to the few low scorers may result in their scaled fitness becoming negative. [Fig. 2]

Fig. 2. Linear Fitness Scaling Negative Values

This can be overcome by adjusting the scaling coefficients to scale the weakest strategies to zero and stretch the remaining strategies as much as possible. In the case of negative scaled fitness values the coefficients may be calculated as follows:

    a = f_avg / (f_avg - f_min)                           (4)

    b = -f_min * f_avg / (f_avg - f_min)                  (5)

This scaling helps prevent the early dominance of high scorers, while later on it distinguishes between mediocre and above-average strategies. It is implemented in the Genetic object and applied to all raw fitness scores (i.e.
the IPD payoffs) before performing genetic algorithm selection.

4.4. Reproduction

Having selected two strategies from the population, the Genetic Algorithm proceeds to mate these two parents and produce their two children. This reproduction mirrors sexual reproduction, in which the genetic material of the parents is combined to produce the children. In this research reproduction allows exploration of the search space and provides a means of reaching new and hopefully better strategies. Reproduction is accomplished using two simple yet effective genetic operators: crossover and mutation.

Crossover is an artificial implementation of the exchange of genetic information that occurs in real-life reproduction. The algorithm implements it by breaking both parent chromosomes at the same randomly chosen point and then rejoining the parts. [Fig. 3]

Fig. 3. Crossover

This crossover action, when applied to strategies selected proportionally to their fitness, constructs new strategies from high-scoring building blocks. The genetic algorithm implemented in this research performs crossover a large percentage of the time; occasionally, however (5% of the time by default), crossover will not be performed and simple natural selection will occur. In nature, small mutations of the genetic material exchanged during reproduction occur a very small percentage of the time. However, if these mutations produce an advantageous result, they will be propagated throughout the population by means of natural selection. The possibility of small mutations occurring was included in this system: a very small percentage of the time (0.1% of the time by default) a bit copied between the parent and the child will be flipped, representing a mutation. These mutations provide a means of exploration to the search.

4.5. Replacement

The genetic algorithm is run across the population until it has produced enough children to build a new generation. The children then replace all of the original population.
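The reproduction operators described in Section 4.4 can be sketched as follows; `reproduce`, `P_CROSS`, and `P_MUT` are names invented here, with the rates set to the defaults quoted above (crossover 95% of the time, mutation 0.1% per copied bit).

```python
import random

P_CROSS, P_MUT = 0.95, 0.001

def reproduce(parent1, parent2):
    """Single-point crossover of two strategy strings, then rare bit-flips."""
    if random.random() < P_CROSS:
        # Break both parents at the same randomly chosen point and rejoin.
        point = random.randrange(1, len(parent1))
        child1 = parent1[:point] + parent2[point:]
        child2 = parent2[:point] + parent1[point:]
    else:
        # No crossover: plain natural selection copies the parents.
        child1, child2 = parent1, parent2
    flip = {"C": "D", "D": "C"}
    def mutate(s):
        # Each copied letter is flipped with probability P_MUT.
        return "".join(flip[ch] if random.random() < P_MUT else ch for ch in s)
    return mutate(child1), mutate(child2)

c1, c2 = reproduce("C" * 70, "D" * 70)
```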
More complicated replacement techniques, such as fit-weak and child-parent replacement, were researched, but they were unsuitable for the round-robin tournament nature of the system.

4.6. Search Termination

The only termination criterion implemented is a limit on the maximum number of generations that will run; the user may set this. Other termination criteria were investigated, for example detecting when a population has converged and strategies are receiving equal payoffs; however, these criteria resulted in many false positives, and it was decided better to allow the user to judge when the algorithm had reached the end of useful evolution.

5. Conclusion

The GA often found strategies that scored substantially higher than any other algorithm. But it would be wrong to conclude that the GA discovered strategies that are "better" than any human-designed strategy. The performance of a strategy depends very much on its environment, that is, on the strategies with which it is playing. Here the environment was fixed: it did not change over the course of a run. The evolved strategies should therefore be judged only against this static (unchanged) environment.

Appendix:

The following flowchart describes the completed genetic algorithm of the Prisoner's Dilemma problem.

Fig. 4. Flow Chart of Prisoner's Dilemma Problem

References:

Axelrod, R. (1990). The Evolution of Cooperation, Penguin Books.
Routledge, B. (1993). Co-Evolution and Spatial Interaction, mimeo, University of British Columbia.
Chambers, L. (ed.) (1995). Practical Handbook of Genetic Algorithms: Applications, Volume II, CRC Press.
Ryan, C. (1995). Niche and Species Formation in Genetic Algorithms, in L. Chambers (ed.), Practical Handbook of Genetic Algorithms: Applications, Volume I, CRC Press.
Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley.
Schweitzer, F., Behera, L., and Mühlenbein, H. (2002). Evolution of Cooperation in a Spatial Prisoner's Dilemma, Advances in Complex Systems, vol. 5, no.
2-3, pp. 269-299.
Koza, J. R. (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press.
Hancock, P. J. B. (1995). Selection Methods for Evolutionary Algorithms, in L. Chambers (ed.), Practical Handbook of Genetic Algorithms: Applications, CRC Press.
Bartlett, G. (1995). Genie: A First GA, in L. Chambers (ed.), Practical Handbook of Genetic Algorithms: Applications, Volume I, CRC Press.
Hargreaves Heap, S. P. and Varoufakis, Y. (1995). Game Theory: A Critical Introduction, McGraw Hill Press.