1. The Many Roads to Cooperation
Explaining how cooperative behavior—or prosocial behavior, more generally—might emerge has become a cottage industry. Mechanisms that have been shown to work, in certain contexts, include the following: reliable signals, or the “secret handshake” (Robson 1990); costly signals, or “the handicap principle” (Zahavi 1975; Zahavi and Zahavi 1997); punishment (Boyd and Richerson 1992; Gintis 2000); compliance with social norms (Axelrod 1986; Bowles and Gintis 2004; Bicchieri 2005); correlated interactions induced by social structure (Ellison 1993; Nowak and May 1993; Skyrms 2003; Alexander 2007); reciprocal altruism (Trivers 1971); and group selection (Sober and Wilson 1998). This list is by no means exhaustive.
Some of these mechanisms support the emergence of cooperative behavior simply because the mechanism is considerably flexible in terms of what it may yield; recall, after all, that the title of Boyd and Richerson’s paper is “Punishment Allows the Evolution of Cooperation (or Anything Else) in Sizable Groups.” Other mechanisms have more limited scope. Social norms require people to know not only the behavior of others, and the underlying rule that governs their behavior, but also other people’s expectations. And local interaction models, although quite effective in supporting cooperation (Nowak and May 1993), fairness (Alexander and Skyrms 1999), and trust (Skyrms 2001, 2003), face greater difficulty explaining the behavior of agents in the ultimatum game.
Finally, the mechanism of costly signals seems to have the most limited scope of all the methods listed above. Why? Zahavi argued signals must be costly in order to ensure that they are reliable, or honest, for otherwise such signals could be easily forged by opportunistic agents. Yet forging signals would only be problematic in cases involving altruistic behavior, such as the Prisoner’s Dilemma or the Sir Philip Sidney game (Maynard Smith 1991)—games in which “cooperating” leaves the agent vulnerable to exploitation. In a pure coordination game, like the Driving Game, or an impure coordination game, like Battle of the Sexes, or even a trust game like the Stag Hunt, an honest signal need not be costly. In these games, receipt of an honest signal may increase the chance of arriving at a Pareto-optimal Nash equilibrium.
Furthermore, some have challenged whether signals need be costly in order to be effective even in cases of altruistic behavior. In a series of three articles (Bergstrom and Lachmann 1997, 1998; Lachmann and Bergstrom 1998), Carl Bergstrom and Michael Lachmann consider the effects of costly versus costless signals in the Sir Philip Sidney game. They find that, in some cases, signaling can be so costly that individuals are worse off than if they were unable to signal at all; they also show that honest cost-free signals are possible under a wide range of conditions. Similarly, Huttegger and Zollman (2010) show—again for the Sir Philip Sidney game—that the costly signaling equilibrium is less important for understanding the overall evolutionary dynamics than previously thought.
In what follows, I contribute to the critique of the importance of costly signals for the emergence of cooperation, but using a rather different approach from those previously considered. In section 2, I present a general model of reinforcement learning in network games, which builds on the work of Alexander (2007) and Skyrms (2010). Section 3 introduces the possibility of costless cheap talk into this model, as well as the possibility of conditionally responding to received signals. I then show that—in accordance with the Handicap Principle—cooperative behavior does not emerge. However, in section 4 I show that when cheap talk and reinforcement learning are combined with discounting the past, costless signaling enables individuals to learn to cooperate despite originally settling on a “norm” of defecting.
2. Reinforcement Learning in Network Games
If one were to identify one general trend in philosophical studies of evolutionary game theory over the past 20 years, it would be a movement toward ever-more-limited models of boundedly rational agents. Contrast the following: Skyrms (1990) modeled interactions between two individuals who updated their beliefs using either Bayesian or Brown–von Neumann–Nash dynamics, both fairly cognitively demanding. In his most recent book, Skyrms (2010) almost exclusively employs reinforcement learning in his simulations.
This strategy of attempting to do more with less has some advantages. For one, in real life we rarely know the actual payoff structure of the games we play. Without knowing the payoff matrix, we cannot even begin to calculate the expected best response (to say nothing of the difficulty of trying to attribute degrees of belief to our opponents). Reinforcement learning does not require that agents know the payoff matrix.Footnote 1 Second, even a relatively simple learning rule like imitate the best requires knowledge of two things: the strategy used by our opponents and the payoffs they received. Even if we set aside worries about interpersonal comparison of utilities, there is still the problem that the behavioral outcome of different strategies can be observationally the same—so which strategy does an agent adopt using imitate the best?Footnote 2
Another advantage of reinforcement learning is that several different varieties have been studied empirically, with considerable effort made to develop descriptively accurate models of human learning. Two important variants are due to Bush and Mosteller (1951, 1955) and Roth and Erev (1995). Let us consider each of these in turn.
Suppose that there are N actions available, and p_i(t) denotes the probability assigned to action i at time t. Bush-Mosteller reinforcement learning makes incremental adjustments to the probability distribution over actions, so as to move the probability of the reinforced action toward 1. The speed with which this occurs is controlled by a learning parameter a.Footnote 3 If the kth action is reinforced, the new probability is p_k(t + 1) = p_k(t) + a(1 − p_k(t)). All other probabilities are decremented by the amount a(1 − p_k(t))/(N − 1) in order to ensure that the probabilities sum to 1. One point to note is that Bush-Mosteller reinforcement learning does not take past experience into account: if you assign probability 3/4 to an action and reinforce it, the probability distribution shifts by the same amount whether that 3/4 was arrived at after a single trial or after a thousand.
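To make the rule concrete, here is a minimal sketch of the Bush-Mosteller update in Python; the function name and the worked numbers are mine, chosen purely for illustration.

```python
def bush_mosteller_update(probs, k, a):
    """Reinforce action k with learning parameter a (0 < a < 1).

    The reinforced action moves toward 1 by the fraction a of the remaining
    distance; that probability mass is taken equally from the other actions,
    so the distribution still sums to 1.
    """
    n = len(probs)
    increment = a * (1 - probs[k])
    new_probs = [p - increment / (n - 1) for p in probs]
    new_probs[k] = probs[k] + increment
    return new_probs

# The update is history-free: an action held with probability 3/4 shifts by
# the same amount whether that 3/4 was reached after one trial or a thousand.
print(bush_mosteller_update([0.75, 0.25], k=0, a=0.1))  # ~[0.775, 0.225]
```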
Roth-Erev reinforcement learning, in contrast, is a form of reinforcement learning that takes past experience into account. It can be thought of as a Pólya urn: each agent has an urn filled with an initial assortment of colored balls representing the available actions. An agent draws a ball, performs the corresponding act, and then reinforces by adding a number of similarly colored balls determined by the reward of the act. More generally, an agent may assign arbitrary positive-valued weights to each act, choosing an act with probability proportional to its weight. This latter representation drops the requirement that the weights assigned to acts be integer valued, which allows one to incorporate additional aspects of human psychology into the model, such as discounting the past.
To see how Roth-Erev takes experience into account, suppose that the reward associated with an act is always 1 and that there are exactly two available acts. If the agent initially starts with an urn containing a red ball representing act 1 and a green ball representing act 2, then the initial probability of each act is 1/2. Reinforcing act 1 will cause the urn to have two red balls and one green ball, so the new probability of act 1 is 2/3. But now suppose that after 18 trials the urn contains exactly 10 red balls and 10 green balls. Reinforcing act 1, at this point, causes the probability of act 1 to increase to only 11/21.
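The urn dynamics just described can be sketched as follows, using the more general weight representation; the class and method names are mine.

```python
import random

class RothErevLearner:
    """Roth-Erev reinforcement learner: a Pólya urn with real-valued weights."""

    def __init__(self, n_acts, initial_weight=1.0):
        self.weights = [initial_weight] * n_acts

    def choose(self):
        # Draw a ball with replacement: an act is selected with probability
        # proportional to its current weight.
        return random.choices(range(len(self.weights)), weights=self.weights)[0]

    def reinforce(self, act, payoff):
        # Add "balls" of the color drawn, in an amount given by the reward.
        self.weights[act] += payoff

# Accumulated experience damps each new reinforcement: with one ball of each
# color, reinforcing act 0 by 1 moves its probability from 1/2 to 2/3; with
# ten balls of each color, it moves only from 1/2 to 11/21.
```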
Roth-Erev reinforcement learning has some nice theoretical properties, aside from the limited epistemic requirements it imposes on agents. Consider the idealized problem of choosing a restaurant in a town where you do not speak the language. The challenge you face is the trade-off between exploration and exploitation. You do not want to settle for always eating at the first restaurant that serves you a decent meal. However, you also do not want to keep sampling indefinitely, never settling on a single restaurant.Footnote 4 How should you learn from your experience so as to avoid both of these errors? If you approach the restaurant problem as a Roth-Erev reinforcement learner, with the urn initially containing one ball for each restaurant,Footnote 5 then in the limit you will converge to eating at the best restaurant in town, always.Footnote 6 Because of these nice theoretical properties, I will concentrate exclusively on Roth-Erev reinforcement learning in what follows.
Now consider the following basic model: let P = {a_1, … , a_n} be a population of boundedly rational agents situated within a social network (P, E), where E is a set of undirected edges. This network represents the structure of the population, in the sense that two agents interact and play a game if and only if they are connected by an edge.
For simplicity, let us assume that the underlying game is symmetric. (This ensures that we do not need to worry whether a player takes the role of row or column, potentially having different strategy sets.) Each agent begins life with a single Pólya urn containing one ball of a unique color for each of her possible strategies.
Each iteration, the pair-wise interactions occur asynchronously and in a random order. When two agents interact, each reaches into his or her urn and draws a ball at random, with replacement. Each agent plays the strategy corresponding to the ball drawn from his or her urn, receiving a payoff. After the interaction, both agents reinforce by adding additional balls to their urn, the same color as the one drawn, where the number of new balls added is determined by the payoff amount.Footnote 7
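As a rough illustration, here is one iteration of this model in Python for the Prisoner’s Dilemma payoffs used in figure 1; the five-agent ring, the function names, and the initial weights are illustrative choices, not the exact implementation.

```python
import random

# Prisoner's Dilemma payoffs from figure 1: T = 4, R = 3, P = 2, S = 1.
PAYOFFS = {('C', 'C'): (3, 3), ('C', 'D'): (1, 4),
           ('D', 'C'): (4, 1), ('D', 'D'): (2, 2)}
ACTS = ['C', 'D']

def play_round(urns, edges):
    """One iteration: every edge is played once, asynchronously, in random order."""
    random.shuffle(edges)
    for i, j in edges:
        act_i = random.choices(ACTS, weights=urns[i])[0]
        act_j = random.choices(ACTS, weights=urns[j])[0]
        pay_i, pay_j = PAYOFFS[(act_i, act_j)]
        # Reinforce immediately, so later interactions in the same round
        # see the updated urns.
        urns[i][ACTS.index(act_i)] += pay_i
        urns[j][ACTS.index(act_j)] += pay_j

# Five agents on a ring, each starting with weight 10 on cooperate and defect.
urns = [[10.0, 10.0] for _ in range(5)]
edges = [(k, (k + 1) % 5) for k in range(5)]
for _ in range(10_000):
    play_round(urns, edges)
```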
Figure 1 illustrates the outcome of Roth-Erev reinforcement learners on three different social networks: a ring, a wheel, and a grid. The underlying game was the canonical Prisoner’s Dilemma with payoffs as indicated. The probability of agents choosing either cooperate or defect is displayed as a pie chart, with the white region representing the probability of cooperating and the black region representing the probability of defecting. Each action had an initial weight of 10, which prevented the outcome of the first round of play from severely skewing the probabilities of future actions.

Figure 1. Effective convergence to defect in the Prisoner’s Dilemma played on three different structures. Payoff matrix: T = 4, R = 3, P = 2, S = 1 with an initial weight of 10 on the actions cooperate and defect. Nodes are pie charts showing a player’s probability of choosing cooperate (white) or defect (black) from the urn. a, c, e, initial configuration; b, d, f, 100,000 iterations.
This result is in accordance with the results of Beggs (2005), who showed that in a 2 × 2 game the probability that a Roth-Erev reinforcement learner will play a strictly dominated strategy converges to 0. The one difference between this model and that of Beggs is that, here, the asynchronous dynamics allows two opponents to play a game with the collective urn configuration in a state not obtainable in Beggs’s framework. That is, if a player A is connected to B and C by two edges, and A first interacts with B, then A—who will have reinforced after his interaction with B—may interact with C, whose urn is in the same state as at the end of the previous iteration. However, as figure 1 illustrates, this makes no real difference to the long-term convergence behavior.
3. Cheap Talk and Reinforcement Learning in Networked Games
In game theory, “cheap talk” refers to the possibility of players exchanging meaningless signals before choosing a strategy in a noncooperative game. Since players do not have the capability to make binding agreements, signal exchange was initially thought to be irrelevant for purposes of equilibrium selection in one-shot games. However, cheap talk is more interesting than it might initially appear. In the case of evolutionary game theory, Skyrms (2003, 69–70) shows how cheap talk in the Stag Hunt creates a new evolutionarily stable state that does not exist in the absence of cheap talk.
Consider, then, an extension of the model presented in section 2 that incorporates a preplay round of cheap talk on which players may condition their response. Since there seems little reason to restrict the number of signals a player may send, let us model the cheap talk exchange using the method of signal invention from Skyrms (2010), based on Hoppe-Pólya urns. Each player begins with a signaling urn containing a single black ball, known as the mutator.Footnote 8 When the mutator is drawn, the player chooses a new ball of a unique color and sends that as the signal. On receipt of a signal, a player conditions her response on the signal as follows: if this is the first time that the signal was received, the player creates a new response urn (a Pólya urn) labeled with that signal. The new response urn initially contains one ball of a unique color for each strategy available to the player. A strategy is selected at random by sampling from the response urn with replacement. The game is played, after which reinforcement occurs; unlike the previous model, though, here both the signaling and response urns are reinforced, with the amount of reinforcement determined by the payoff. If the signal received by the player had been received previously, the player selects a strategy the same way but uses the already-existing response urn labeled with that signal. Hence, the probability of choosing any particular strategy for a received signal will vary in a path-dependent way on the basis of previous reinforcement.
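The signaling machinery can be sketched as follows, again with real-valued weights in place of balls; the class, method, and attribute names are mine, and choices the text leaves open (such as the initial weight of a newly invented signal) are flagged as assumptions in the comments.

```python
import random
from itertools import count

_fresh_signal = count()  # source of brand-new signal "colors"

class CheapTalkAgent:
    """A signaling urn containing a mutator, plus one response urn per known signal."""

    MUTATOR = 'mutator'

    def __init__(self, acts=('C', 'D')):
        self.acts = list(acts)
        self.signal_urn = {self.MUTATOR: 1.0}  # initially just the black ball
        self.response_urns = {}                # received signal -> weights over acts

    def send(self):
        signals = list(self.signal_urn)
        drawn = random.choices(signals, weights=[self.signal_urn[s] for s in signals])[0]
        if drawn == self.MUTATOR:
            # Invent a new signal; here it acquires weight only through later
            # reinforcement (an assumption: the text does not fix its initial weight).
            drawn = next(_fresh_signal)
            self.signal_urn[drawn] = 0.0
        return drawn

    def respond(self, received):
        if received not in self.response_urns:
            # First receipt of this signal: create a fresh response urn with
            # one "ball" per available strategy.
            self.response_urns[received] = {a: 1.0 for a in self.acts}
        urn = self.response_urns[received]
        return random.choices(self.acts, weights=[urn[a] for a in self.acts])[0]

    def reinforce(self, sent, received, act, payoff):
        # Both the signaling urn and the relevant response urn are reinforced
        # by an amount determined by the payoff.
        self.signal_urn[sent] += payoff
        self.response_urns[received][act] += payoff

# A single interaction between two connected agents:
PAYOFFS = {('C', 'C'): (3, 3), ('C', 'D'): (1, 4), ('D', 'C'): (4, 1), ('D', 'D'): (2, 2)}
a, b = CheapTalkAgent(), CheapTalkAgent()
sig_a, sig_b = a.send(), b.send()
act_a, act_b = a.respond(sig_b), b.respond(sig_a)
pay_a, pay_b = PAYOFFS[(act_a, act_b)]
a.reinforce(sig_a, sig_b, act_a, pay_a)
b.reinforce(sig_b, sig_a, act_b, pay_b)
```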
Figure 2 illustrates aggregate results from 100 simulations for a simple cycle graph consisting of five agents. Cheap talk, here, makes essentially no difference in the long-term behavior of the population: people still converge on defection.Footnote 9 This should come as no surprise: the method of incorporating cheap talk means that a single individual, instead of playing the Prisoner’s Dilemma with a single Pólya urn, can be thought of as being “partitioned” into several individuals, each of whom plays the Prisoner’s Dilemma with his or her own Pólya urn. Since Roth-Erev reinforcement learning (which is what the Pólya urn models) learns to avoid playing strictly dominated strategies in the limit, so too will people who use Roth-Erev reinforcement learning while conditionally responding to cheap talk.

Figure 2. Aggregate results of 100 simulations featuring conditional response to cheap talk and reinforcement learning, on a cyclic network with five agents. Payoff matrix for the Prisoner’s Dilemma had T = 4, R = 3, P = 2, and S = 1.
4. Discounting, Cheap Talk, and the Emergence of Cooperation
It has been known for some time that models of cheap talk with signal invention often benefit from including a method of pruning the number of signals created. In Lewis sender-receiver games, for example, signal invention and reinforcement learning lead to efficient signaling systems, but with the side effect of there being infinitely many signals in the limit (see Skyrms 2010). However, Alexander, Skyrms, and Zabell (2012) later showed that, if the model of signal invention and reinforcement learning is supplemented with signal “de-enforcement,” efficient—and often minimal—signaling systems are produced. In a separate paper, also concerned with Lewis sender-receiver games, Alexander (2014) showed that models of signal invention and reinforcement learning in which past information is discounted avoid excessive lock-in to particular signaling systems. This means that individuals are able to coordinate on efficient signaling systems yet, at the same time, respond rapidly to external stochastic shocks that change what the “correct” action is.
Consider, then, the model from the previous section with one final addition: each player has a discount factor δ that is applied to the weights of the signaling and response urns at the start of each iteration.Footnote 10 Since a seldom used signal will have its weight eroded over time, let us introduce a cutoff threshold τ such that, if the signal’s weight drops below τ, it is eliminated entirely.Footnote 11 Finally, although the weights in the response urns are discounted, the cutoff threshold does not apply to them.Footnote 12
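As a sketch of the per-iteration bookkeeping, the following function applies the discount factor and the cutoff threshold to an agent structured like the CheapTalkAgent sketch above; how the mutator and the response urns of eliminated signals are handled is not settled by the text, so the choices made here are assumptions.

```python
def discount_urns(agent, delta=0.95, threshold=0.01, mutator='mutator'):
    """Multiply all signal and response weights by delta; drop weak signals.

    Assumptions: the mutator is neither discounted nor removed, and the
    response urns of eliminated signals are kept (the cutoff threshold does
    not apply to response urns).
    """
    for sig in list(agent.signal_urn):
        if sig == mutator:
            continue
        agent.signal_urn[sig] *= delta
        if agent.signal_urn[sig] < threshold:
            del agent.signal_urn[sig]   # the signal is eliminated entirely
    for urn in agent.response_urns.values():
        for act in urn:
            urn[act] *= delta
```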
With these adjustments to the model, we find a striking result: individuals rapidly move to defection, in the beginning, but then learn to cooperate over time. The combination of signal invention, reinforcement learning, and discounting the past enables the population to crawl out of the collectively suboptimal state in which they initially find themselves. Figure 3 illustrates this for a population of 12 agents on a cyclic network.

Figure 3. Emergence of cooperation in the Prisoner’s Dilemma under cheap talk, reinforcement learning, and discounting the past. Discount factor used was 0.95, and the cutoff threshold was 0.01. a, 100 iterations; b, 1,100 iterations; c, 2,100 iterations; d, 10,000 iterations.
Figure 4 shows that the phenomenon of agents starting off with defection, and then learning to cooperate, occurs quite generally. Each line shows the aggregate results of 1,000 simulations, where the y-value at the xth iteration is the frequency of cooperative acts across all 1,000 simulations at that iteration.Footnote 13 The graph topology has a notable effect on the speed with which cooperation emerges, but what is striking in light of the earlier results is how often cooperation happens. Recall that it was a theorem of Beggs (2005) that Roth-Erev reinforcement learning plays a strictly dominated strategy with probability converging to 0 in the limit.

Figure 4. Emergence of cooperation, aggregate results for 1,000 simulations of 10,000 iterations each on a variety of networks.
Why does cooperation emerge in the presence of discounting but not otherwise? It seems to involve the following interaction of factors. First, discounting places a cap on the overall weight a signal or action can receive as a result of reinforcement. For modest discount factors, say 95%, this means that there is a significant chance that a new signal will be attempted at any point in time. Second, suppose that a new signal is used between two agents, both of whom cooperate. If that signal (simply through chance) is used two or three times in a row, notice what happens: the amount of reinforcement in the default Prisoner’s Dilemma adds two balls to both the signaling and response urns for the respective signal and action. If that happened, say, three times in a row, the weights attached to the other signals and responses would have decreased by nearly 15%, on top of the fact that mutual cooperation pays twice that of mutual defection.Footnote 14 Third, since the mutual punishment payoff for the standard Prisoner’s Dilemma used here awards 1 to each player, the maximum possible weight that could be attached to the previously used signal would be on the order of 20, since 1 + δ + δ² + δ³ + ⋯ = 1/(1 − δ) = 20 when δ = 0.95, whereas the maximum possible weight for signals used to coordinate cooperation would be on the order of 40.Footnote 15 Finally, when multiple signals are available to the sender, each signal will be used less often. Since discounting applies to each urn each iteration, the weights attached to actions in unused response urns are all discounted by the same amount each iteration. This does not affect the actual probability of any action in the urn being selected the next time the response urn is used, since scaling every weight by δ leaves the ratios unchanged (δw_i/∑_j δw_j = w_i/∑_j w_j), but it does mean that the next amount of reinforcement will have a greater effect than it would otherwise have. The combination of these four factors, taken together, favors the emergence of cooperation.Footnote 16
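A quick numerical check of these observations, taking δ = 0.95 and the per-round reinforcements of 1 (mutual defection) and 2 (mutual cooperation) described above:

```python
delta = 0.95

# Cap on reinforced weight: reinforcing by r every iteration while discounting
# by delta converges to r * (1 + delta + delta**2 + ...) = r / (1 - delta).
print(1 / (1 - delta))  # ~20: a signal sustained by mutual defection
print(2 / (1 - delta))  # ~40: a signal sustained by mutual cooperation

# Scaling every weight in an unused response urn by delta leaves its choice
# probabilities unchanged.
weights = [3.0, 1.0]
scaled = [w * delta for w in weights]
print(weights[0] / sum(weights), scaled[0] / sum(scaled))  # ~0.75 ~0.75
```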
5. Conclusion
Of the many roads leading to cooperation, Roth-Erev reinforcement learning has seldom been one of them in cases in which the cooperative outcome requires people to use a strictly dominated strategy. It has been shown here that if the basic mechanism of Roth-Erev reinforcement learning is supplemented with the psychologically plausible additions of discounting the past, cheap talk, and signal invention, cooperation can regularly emerge even in cases in which it requires the use of a strictly dominated strategy. Perhaps the most interesting and unexpected feature of this model is that, in the short term, individuals typically defect but then, over time, eventually learn to cooperate. We have thus identified one formal mechanism that suffices to generate the following well-known social phenomenon: people who are initially uncooperative may, by means of repeated interactions over time, eventually engage in mutually beneficial cooperative behavior even in the absence of formal institutions to establish, monitor, or police it.