1 Introduction
The production and perception of randomness has a long research history in cognitive psychology (see Nickerson, 2002, for an overview), and rightly so. The perception or judgment of randomness is a core human competency (see Oskarsson et al., 2009, for a review). There is ample evidence that humans are capable of learning patterns (both implicitly and explicitly) in sequences of events (Clegg et al., 1998; Remillard & Clark, 2001). Our ability to discover the correlations (e.g., Kareev, 1995; Kareev et al., 1997; Kareev, 2000) arising from the causal relationships in our environment allows us to adapt to and exploit the environmental structure. With respect to the production or generation of random behavior, subjects in laboratory tasks (without strategic interactions) are typically inefficient at creating serially uncorrelated sequences. Subjects tend to produce over-alternating sequences (with too many runs) and regress towards the representative frequencies of the distribution they are emulating (Kahneman & Tversky, 1972; Bar-Hillel & Wagenaar, 1991; Rapoport & Budescu, 1997). Explanations of these deviations in random generation range from cognitive bounds such as short-term memory (Kareev, 1992; Kareev, 1995; Kareev, 2000) and the complexity (or difficulty of encoding) of sequences (Falk & Konold, 1997) to the statistical properties of small samples of random behavior (Kareev et al., 1997; Sun & Wang, 2010; Sun & Wang, 2011), or the interaction of both (Hahn & Warren, 2009; Farmer et al., 2017; Warren et al., 2018).
Although the judgment of randomness is typically applicable to interactions with nature or individual decision making, the production or generation of random behavior is naturally most relevant to strategic interactions with other decision makers in our environment, i.e., in strategic games. In contrast to the above studies, which investigate random sequence generation in individual decision-making tasks, Rapoport & Budescu (1992) and Budescu & Rapoport (1994) used laboratory games where it is optimal to be unpredictable. While they found similar qualitative deviations in randomization behavior for both individual and strategic decision making, Budescu & Rapoport (1994) show that people are more efficient randomizers in the latter. Random behavior is called for in strategic interactions of conflict or competition, where one player's gain is another's loss and being unpredictable is beneficial. Such games have an equilibrium in mixed strategies, where a player randomizes over the actions at his/her disposal rather than playing one of them with certainty. Situations where mixed strategies are relevant include bluffing in poker, penalty shootouts in soccer, and serve directions in tennis, which is the environment that I study here. In repeated games, the normative prediction is that the action chosen in a round should be independent of the actions chosen in previous rounds, i.e., players should randomize perfectly. Otherwise, a player could learn the dependencies or patterns in the opponent's behavior and exploit them appropriately (Spiliopoulos, 2012, 2013a, 2013b, 2018; Ioannou & Romero, 2014).
Field data from competitive sports are particularly useful, combining the benefits of a real-world domain where randomization is important (guaranteeing high ecological validity) with a high level of incentivization and the opportunity for significant learning beyond what is feasible in the laboratory. The existing literature using field data has been primarily conducted by game theorists and economists rather than cognitive psychologists, despite its obvious relationship to pioneering work by psychologists on randomization. In this paper, I analyze a large dataset of tennis serves with the goal of resolving an open debate on whether professional players deviate from efficient randomization in their serve direction (Walker & Wooders, 2001) or not (Hsu et al., 2007). Furthermore, I extend the existing literature by exploiting the large number of within-player observations to examine whether the degree of randomization depends on a player's own rank and the rank of the opponent, experience, the round of the match (e.g., final, semi- or quarter-final), and the difficulty and length of the match. These analyses are related to existing laboratory studies investigating the impact of learning, feedback and other variables on the efficiency of randomization. Specifically, Lopes & Oden (1987) concluded that statistically sophisticated subjects performed better than average subjects, although they exhibited the same qualitative misperceptions of randomness. Regarding whether feedback induces better randomization, the evidence thus far is mixed. Feedback has been found to improve the identification of non-random sequences (Zhao et al., 2014) and the generation of random sequences (Neuringer, 1986); however, Budescu (1987) did not find a significant effect of feedback. Of course, the degree of learning that can occur in the laboratory is limited by practical ceilings on the amount of exposure and the incentives to perform well. Field data from highly-paid and competitive tennis tournaments address both of these limitations and permit the investigation of other potential mediators.
Before proceeding, I summarize the state of the art in the game theory literature. Recall that the normative solution to repeated games whose stage game has a unique mixed-strategy Nash equilibrium is perfect randomization, i.e., actions must be independent of the prior history of play. One strand of experimental studies tests the equilibrium predictions in the laboratory, finding significant deviations from them (Bloomfield, 1994; Brown & Rosenthal, 1990; Chiappori et al., 2002; Ochs, 1995; Rapoport & Budescu, 1997; O'Neill, 1987; Levitt et al., 2010; Wooders, 2010; Palacios-Huerta & Volij, 2008; Okano, 2013; Shachat, 2002). Experience can reduce the magnitude of these deviations; however, this is conditional on features of the game — see Ochs (1995), Roth & Erev (1995), Erev & Roth (1998), Binmore et al. (2001), and Nyarko & Schotter (2002). Another finding is that experience from the field does not transfer well to new tasks in the laboratory. Despite initial claims that professionals, to a large degree, transfer their experience to new laboratory tasks (Palacios-Huerta & Volij, 2008), later studies have not found evidence of this effect (Levitt et al., 2010; Wooders, 2010; Van Essen & Wooders, 2015). Finally, subjects exploit both deviations from the equilibrium marginal distributions (Shachat & Swarthout, 2004) and deviations from serially independent or random play, in ways that can be explained by learning models capable of detecting temporal patterns (Spiliopoulos, 2012, 2013a, 2013b, 2018; Ioannou & Romero, 2014).
Another strand of research utilizes field data from competitive sports. The first paper to examine the optimality of tennis serves in the field was Walker & Wooders (2001) — see also the comment by Hsu et al. (2007) (I refer to these two studies as WW and HHT respectively). Both studies concluded that mixing proportions were not statistically different from the equilibrium; however, while the former concluded that significant deviations existed from the theoretical prediction of serial independence, the latter concluded the opposite. The predictions of minimax play in the field have also been tested in other sports, such as soccer and the NFL (Footnote 1). To summarize, the majority of studies confirm equilibrium behavior in terms of mixing proportions, whereas the findings regarding serial independence are mixed. Of these different sports, tennis allows for the most powerful tests of minimax behavior for individual players rather than a population of players. In soccer, since individual players rarely take penalty kicks, the data afford low statistical power to reject the null hypothesis of equilibrium behavior at the individual level. Also, the large intervals between a player's consecutive penalties could encourage equilibrium behavior by inducing memory-less behavior, which may be conducive to the generation of serially independent sequences. In the NFL, different players are involved in each play and strategies are called by the coach; hence, tests of equilibrium behavior are essentially not a test of an individual, but a joint test of the behavior of the coach and a group of players.
This study is most similar to WW and HHT, but uses a new tennis serve dataset that is two orders of magnitude larger than those in existing published studies. I specifically address the conflicting findings regarding serial (in)dependence in WW and HHT, which remain an important open issue: while WW reject serial independence of serves, HHT do not find evidence of statistically significant correlation across serves. I extend the work in WW and HHT in three directions. First, I include analyses of behavior relative to the player's own ranking, i.e., I examine whether more highly ranked players conform more closely to equilibrium predictions. A working paper by Gauriot et al. (2016), henceforth GPW, uses another large dataset from a different source to examine minimax behavior in tennis and its relationship to player ranking; that manuscript also investigates the equilibrium prediction that winning rates for left and right serves are equal. While there is some overlap between our manuscripts in terms of the hypotheses tested, they are largely complementary.
The following hypotheses based on individual player-level analyses (rather than only population analyses) differentiate my work from WW, HHT and GPW. The first hypothesis concerns whether players strategically condition their behavior on the ranking of their opponent. Players capable of using the equilibrium strategy — but consciously choosing not to play accordingly — may in fact be rational if they hold correct beliefs that their opponent will not choose the normative solution (Plott, 1996). Consider the case where low-ranked players are imperfect randomizers. If high-ranked players are sophisticated in the sense of correctly predicting low-ranked players' deviations from the equilibrium, then rationality dictates that they exploit this. Of course, this would lead to non-equilibrium behavior by the highly-ranked players, which would nevertheless be rational given their opponent's type. The second set of hypotheses concerns whether players condition on, or are affected by, match characteristics such as: a) the tournament round of the match (e.g., whether it is a final, semi-final, etc.), which would factor in the effects of stress and the probability of winning the tournament prize given the tournament's progression, b) the difficulty of the match (e.g., how close the score is), and c) the number of points played in a match (a proxy for fatigue and difficulty). To the best of my knowledge this is the first field study in tennis to address all of these additional questions.
2 Modeling tennis serves
I briefly describe the tennis serve model introduced by WW and adopted by HHT (see WW for more details). Tennis serves alternate in terms of the area of the court (box) within which each serve must land to be valid (i.e., not declared a fault), referred to as the deuce and ad courts. The collections of points served by a player in each of these two courts are referred to as either ad or deuce point-games. For example, all points in a match where a specific player's serve was directed to the ad box are referred to as that player's ad point-game. Since there are two players and two boxes, each match has four point-games. Each point-game in the match is modeled as a 2×2 normal form game with action spaces left (L) and right (R) for both the server s and the receiver r — see Table 1. The payoffs of this game are equivalent to the probabilities $\pi_{a_s, a_r}$ of winning the point for the action profile $(a_s, a_r)$ — consequently, this game is constant-sum. The probabilities of winning differ conditional on whether serves are made to the ad or deuce court, due to differences in serving and returning abilities; this is why we must distinguish between point-games. Walker et al. (2011) show that tennis belongs to the class of Binary Markov games, which possess the property that equilibrium play for every point in the match can be solved independently of all other past points and outcomes in the match. That is, the equilibrium of the match corresponds to equilibrium play in each point of a specific point-game. Every point-game played has a unique mixed-strategy Nash equilibrium under the assumption that a mismatch between the serve direction and the receiver's guess favors the server, i.e., $\pi_{LL} < \pi_{LR}$ and $\pi_{RR} < \pi_{RL}$ (Footnote 2).
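To make the equilibrium concrete, here is a minimal Python sketch (with hypothetical win probabilities) that solves a 2×2 constant-sum point-game via the usual indifference conditions; it illustrates the game structure above and is not the paper's estimation code.

```python
# Mixed-strategy equilibrium of a 2x2 constant-sum point-game.
# pi_XY = probability the SERVER wins the point when serving in direction X
# and the receiver guesses Y. All values below are hypothetical.

def point_game_equilibrium(pi_LL, pi_LR, pi_RL, pi_RR):
    """Return (p, q, v): server's P(serve L), receiver's P(guess L),
    and the server's equilibrium probability of winning the point.

    Derived from the indifference conditions of the unique mixed
    equilibrium, which exists when pi_LL < pi_LR and pi_RR < pi_RL
    (a mismatch favors the server).
    """
    denom = pi_LL - pi_LR - pi_RL + pi_RR
    p = (pi_RR - pi_RL) / denom   # makes the receiver indifferent
    q = (pi_RR - pi_LR) / denom   # makes the server indifferent
    v = q * (p * pi_LL + (1 - p) * pi_RL) + \
        (1 - q) * (p * pi_LR + (1 - p) * pi_RR)
    return p, q, v

# Example with made-up probabilities: serving left is relatively strong.
p, q, v = point_game_equilibrium(pi_LL=0.55, pi_LR=0.70, pi_RL=0.72, pi_RR=0.58)
print(f"server mixes L with p={p:.3f}, receiver guesses L with q={q:.3f}, "
      f"server win prob={v:.3f}")
```

At the equilibrium mix, the server's probability of winning the point is the same whichever direction the receiver anticipates, which is precisely what makes unpredictability valuable.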
3 Data
The data originate from the crowd-sourced Match Charting Project, accessible at http://www.tennisabstract.com/charting/meta.html, which compiles tennis match statistics (Footnote 3). The dataset covers 391 male players whose career-high rankings ranged from Number 1 to Number 2076 in the world (mean = 123, median = 71), and includes 1,093 matches from 1975 to 2016 (Footnote 4). In total, I analyze the data from 143,743 serves resulting from 4,372 point-games. This is two orders of magnitude larger than prior published studies of tennis serves: 3,026 serves from ten matches in WW and 2,490 serves from ten men's matches in HHT (Footnote 5). The mean (median) number of matches per player is 5.6 (2); for top players, who are more likely to participate in these tournaments, the dataset holds significantly more matches, e.g., the maximum is for Federer (160 matches), followed by Nadal (142), Djokovic (126) and Murray (68). For these players there are 12,413, 8,708, 8,984, and 4,935 serve observations respectively. This amount of data permits much more powerful tests of the mixing behavior of athletes than previous studies. Furthermore, hypothesis testing targeted at the top players provides the best chance of observing equilibrium behavior, as these players are the most capable and the most highly incentivized to pursue optimal behavior.
In the dataset, tennis serve directions were encoded as either “4”, “5”, or “6”, corresponding to left, center and right respectively. I found 21,159 cases where other symbols were used in the encoding, either due to data-entry error or because the data-coders were uncertain how to categorize the serve direction. These cases were excluded from the analysis, as were serves directed to the center — the latter is standard practice in the literature, i.e., prior studies analyzed only the left and right serve directions. The complete dataset was compiled by merging point-by-point data files with player-ranking data files ranging from 22/12/1980 to 1/2/2016. I use both the career-high and current ranking (at the time of the match) in the analyses; in some cases the former may be more appropriate, as players' current rankings may misleadingly fluctuate wildly due to injuries (Footnote 6). The following notation is used throughout: let i index the players, pg index the point-games from all players' matches, and $pg_i$ index the point-games of player i.
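As an illustration of this filtering step, a minimal pandas sketch follows; the file name and column names are hypothetical, since the Match Charting Project files require some preprocessing before they take this shape.

```python
import pandas as pd

# Hypothetical layout: one row per serve with a raw direction code.
serves = pd.read_csv("serves.csv")  # columns: match_id, server, code, ...

# "4" codes a serve to the left and "6" a serve to the right; "5" (center)
# and any other symbols (possible data-entry errors) are dropped, following
# the convention in the literature of analyzing only L and R serves.
serves["code"] = serves["code"].astype(str)
lr = serves[serves["code"].isin(["4", "6"])].copy()
lr["direction"] = lr["code"].map({"4": "L", "6": "R"})
print(f"kept {len(lr)} of {len(serves)} serves")
```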
4 Results
The serial independence of tennis serve directions is tested at two different levels of aggregation: the (dis-aggregated) point-game level and the player level (aggregating over the point-games of each player). WW and HHT tested serial dependence using the distribution of point-game statistics, since they did not have enough observations per player. Testing at the player level is more desirable because it matches the expected structure of the data, particularly the heterogeneity that may exist between players (based on their ability, experience, etc.). Below, I summarize the statistical procedures — details can be found in Appendix A.
Serial dependence for each point-game in the data is examined using the two-sided exact runs test (see Equation 2 in the appendix). Testing whether a set of these point-game statistics (either for the whole population of players or for a specific player) is distributed according to the null hypothesis of no serial dependence requires randomization of the test statistics. These randomized statistics are generated according to Walker & Wooders (2001, p. 1533) — a set of these statistics can then be tested using the standard Kolmogorov-Smirnov (KS) test. The point-game level test is a KS test on the distribution of the randomized (exact runs) test statistics for all point-games of all players — this is the test in WW and HHT. The player-level analysis applies the KS test to the distribution of the point-game statistics of a single player only, permitting a test of serial independence for each player rather than for the set of players.
4.1 Analysis at the point-game level
At the point-game level, the null hypothesis of serially independent serve directions was rejected at the 5% level for 12% of the point-games (9.5% for over-alternating, 2.5% for under-alternating). Controlling for multiple comparisons using the Bonferroni-Holm correction, the hypothesis of no serial correlation is rejected for 172 individual point-games, i.e., 3.9% of the cases (3.7% for over-alternating, 0.2% for under-alternating). Alternatively, following WW and HHT, the randomized KS test on the distribution of the point-game statistics strongly rejects the null hypothesis of serial independence (K = 0.06, $p = 2.2 \times 10^{-14}$). I conclude that the hypothesis of serial independence is rejected, predominantly due to over-alternation of the serve direction. This finding corroborates the conclusions drawn by WW, but not HHT — calculations in Appendix C reveal that, given the sample sizes of these two studies, there would be roughly a 50% chance that two independent studies would arrive at opposite conclusions. The GPW working paper also rejects serial independence using a large dataset with sufficient power.
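For reference, the Holm step can be reproduced with statsmodels; the p-values below are random placeholders standing in for the 4,372 exact runs test p-values (one per point-game).

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
pvals = rng.uniform(size=4372)  # placeholders for the runs test p-values

# Holm's step-down procedure controls the family-wise error rate.
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
print(f"{reject.sum()} point-games rejected after the Bonferroni-Holm correction")
```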
4.2 Analysis at the player level
Let the marginal probabilities of each player i serving to the left and to the right be denoted by $q_i^{L}$ and $q_i^{R}$ respectively. Recall that these must be estimated separately for the ad and deuce point-games; the subscript indexing the point-games is dropped for simplicity. Similarly, the set of conditional probabilities is denoted by $q_i^{LL}, q_i^{LR}, q_i^{RL}, q_i^{RR}$, where the first letter of the superscript refers to the serve direction at t − 1 and the second to the serve direction at t. The conditional probabilities reveal whether players tend to over- or under-alternate. Since these probabilities are conditioned only on the immediately prior serve direction (for the same point-game), they can be represented as the first-order transition matrix of a two-state Markov-chain model:

$$ P_i = \begin{pmatrix} q_i^{LL} & q_i^{LR} \\ q_i^{RL} & q_i^{RR} \end{pmatrix} $$
Over-alternation implies that $q_i^{LL} < q_i^{L}$ and $q_i^{RR} < q_i^{R}$, and under-alternation implies the opposite inequalities. The maximum likelihood estimates of the marginal and conditional probabilities of serve directions for both point-games are presented in Table 2 — these are the averages of the estimates for each point-game in the dataset. As expected, there are differences in the marginal and conditional probabilities for the two point-games, arising from differences in the ability to serve and return in the ad and deuce courts. For both the ad and deuce point-games and both serve directions, the population of players exhibits over-alternation on average, as $q^{LL} < q^{L}$ and $q^{RR} < q^{R}$ for both point-games — see Table 2.
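A minimal sketch of these maximum likelihood estimates for a single point-game, assuming the serve directions are encoded as a string of 'L'/'R' characters:

```python
def serve_probabilities(seq: str):
    """Marginal and first-order conditional serve probabilities.

    seq is one point-game's serve directions in order, e.g. "LRLLRR".
    The ML estimates are empirical frequencies: q_L, and the transition
    frequencies q_LL = P(L at t | L at t-1) and q_RR = P(R at t | R at t-1).
    """
    q_L = seq.count("L") / len(seq)
    pairs = list(zip(seq, seq[1:]))          # consecutive (t-1, t) pairs
    from_L = [b for a, b in pairs if a == "L"]
    from_R = [b for a, b in pairs if a == "R"]
    q_LL = from_L.count("L") / len(from_L)
    q_RR = from_R.count("R") / len(from_R)
    return q_L, q_LL, q_RR

q_L, q_LL, q_RR = serve_probabilities("LRLRRLLRLRLRRLRL")
# Over-alternation shows up as q_LL < q_L and q_RR < 1 - q_L.
print(q_L, q_LL, q_RR)
```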
Aside from the conditional probabilities, an alternative measure of the degree of deviation from serial independence, which can be used to compare across players, can be constructed from the number of runs in a sequence. For each point-game, let $r_{pg_i}$ denote the observed number of runs and $\mathrm{Exp}(r_{pg_i})$ the expected number of runs of a serially independent sequence. The percentage deviation for each player, averaged over his point-games, is

$$ r_i^{dev} = \frac{100}{|pg_i|} \sum_{pg_i} \frac{r_{pg_i} - \mathrm{Exp}(r_{pg_i})}{\mathrm{Exp}(r_{pg_i})} $$
Figure 1 presents a histogram of the empirical distribution of $r_i^{dev}$. Over-alternation, i.e., switching too often or negative serial correlation, occurs if $r_i^{dev} > 0$, and under-alternation if $r_i^{dev} < 0$. Negative serial correlation is found for 69% of the players, and the mean of $r_i^{dev}$ is 5.55%, in line with the conclusions of over-alternation on average drawn from the empirical conditional probabilities. Furthermore, the 5th and 95th percentiles are large in magnitude, −14.6% and 23.7% respectively.
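Using the standard formula for the expected number of runs in a serially independent binary sequence, Exp(r) = 1 + 2·n_L·n_R/(n_L + n_R), a per-point-game sketch of this measure is:

```python
def runs_deviation(seq: str) -> float:
    """Percentage deviation of the observed number of runs from its
    expectation under serial independence: positive values indicate
    over-alternation (too many runs), negative values under-alternation."""
    n_L, n_R = seq.count("L"), seq.count("R")
    runs = 1 + sum(a != b for a, b in zip(seq, seq[1:]))
    expected = 1 + 2 * n_L * n_R / (n_L + n_R)
    return 100 * (runs - expected) / expected

print(runs_deviation("LRLRLRLRLR"))  # maximal alternation: large positive
print(runs_deviation("LLLLLRRRRR"))  # a single switch: large negative
```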
The power of the statistics conditional on this dataset is significantly higher than in prior investigations, but varies according to the data available per player. Detailed simulations verifying the statistical power can be found in Appendix B. Based on those calculations, I refer to subjects with at least fifty matches as the high-power group, those with between twenty and fifty matches as the moderate-power group, and those with fewer than twenty matches as the low-power group. For the high-power group, the statistics have 80% power to detect an average effect size, and for the moderate-power group, 80% power to detect a higher (yet still plausible) effect size. Table 3 presents the statistics of all players in the high-power and moderate-power groups. The high-power group consists of four players, three of whom were ranked Number 1 (Federer, Nadal, Djokovic) and the other Number 2 (Murray) in the world. The null hypothesis of serial independence is strongly rejected (p < 0.005) for all four players. The mean percentage deviations in the number of runs, $r_i^{dev}$, are −10.7%, 6.9%, −13.4%, and −3.4% respectively. Also, the probability of finding over-alternation in each player's point-game runs statistics is 0.25, 0.65, 0.2 and 0.42 respectively.
The moderate-power group consists of fourteen players, all but two of whom are Top 10 (career-high) ranked players. The null hypothesis of serial independence is rejected for nine of them. Notably, across both power groups (high and moderate), four out of the five Number 1 ranked players were found to exhibit serial dependence — the exception is Andre Agassi. These results are not sensitive to the grouping of players into high- and moderate-power. Running the KS tests on all players (including the low-power group) leads to a rejection of the null hypothesis of no serial correlation at the 5% level for 19.69% of the players. The Bonferroni-Holm multiple-comparisons correction yields a rejection rate of 2.3% (nine players) — this correction is overly conservative due to the players in the low-power group. The rejections still include very highly ranked players (including No. 1), e.g., Federer, Nadal, and Djokovic. A table reporting all the player-level statistics can be found in the supplement.
Turning to the question of the economic significance of the observed deviations, note that the mean number of runs per point-game is 16.5, i.e., roughly 33 per match. Therefore, the deviations from serial independence of the top-ranked players in the high-power group, Federer, Nadal and Djokovic, correspond to approximately 3.5 fewer runs, 2.3 more runs, and 4.4 fewer runs per match than expected, respectively. The moderate-power group also includes players whose statistically significant deviations are of approximately the same magnitude — see Berdych, Ferrer and Dimitrov. These individual deviations may be difficult for the average player to detect, unless two players are matched up often enough, which is not unreasonable for the very best players. Furthermore, the average population tendency to over-alternate will be more readily detectable by attentive players due to the large number of observations. I return to this issue later in the manuscript.
4.2.1 Are within-subject deviations from serial (in)dependence a result of strategic best response to lower-ranked players or match characteristics?
Experimental and field studies have shown that beliefs about opponents' rationality can affect the likelihood of reaching the normative solution of a game. Palacios-Huerta & Volij (2009) find that the subgame-perfect equilibrium in the Centipede game is more likely to result if both players are expert chess players, less likely if chess players are matched with students, and least likely when students play other students. Similarly, Bosch-Domenech et al. (2002) find extensive iterated belief-based reasoning about the sophistication of opponents in a guessing game — many subjects who showed an understanding of the Nash equilibrium nevertheless chose to deviate. For example, suppose that the receiver is susceptible to the representativeness bias concerning random sequences, or the law of small numbers (Tversky & Kahneman, 1971), i.e., believes that a switch in the serve direction is more likely after a sequence of serves in the same direction. Consequently, even if the server is randomizing efficiently, the receiver will expect the sequence of serves to over-alternate. This could be exploited by a server choosing to under-alternate, increasing the probability of a mismatch between the server's and receiver's directions and thereby the probability of the server winning the point.
I examine whether such strategic deviations occur in tennis using a cross-sectional regression model with fixed effects to absorb the between-subject variation, leaving the within-subject variation to be modeled, i.e., within-player strategic adaptation to an opponent and/or match characteristics. The first independent variables included are the current match rankings (not career-high) of both the server and the receiver (or opponent). The former captures within-player variation in randomization, which may occur as a result of the accumulation of experience/expertise (proxied by the player's own ranking); the latter represents the combined ability and expertise of the opponent. If lower-ranked players are more susceptible to the law of small numbers, then the lower-ranked their opponent, the more servers should deviate from serial independence in the direction of under-alternating. The current match rankings of the server and receiver ($Rank_t^s$, $Rank_t^r$) are transformed into $R_t^{own} = 8 - \log_2 Rank_t^s$, as suggested by Klaassen & Magnus (2009); the same transformation applied to $Rank_t^r$ yields $R_t^{opp}$, where the subscript t denotes the current (not career-high) ranking.
Other variables capture possibly important match characteristics. The length of a match, specifically the total number of points played, is included as the variable $N_{points}$. This variable could capture the effects of fatigue (and difficulty of the match) on the efficiency of serve randomization. The mean number of shots played per point, or the length of a rally, is also included as a variable. It could influence serve randomization in two ways. First, the greater the rally length, the more time elapses between serves — consequently, a player who incorrectly conditions on prior behavior in a biased attempt to randomize may actually benefit from greater rally lengths. Second, although the ranking of the opponent captures the expected difficulty of a match, a greater rally length might indicate that this specific match differs in difficulty (Footnote 7). Consequently, this might increase the incentives for a player to exert more effort or greater care in randomizing efficiently. The variable $W_{diff}$ also captures the difficulty of a specific match, as it is the difference between the rates at which the player won and lost points in a given match; if $W_{diff}$ is close to zero, then the match is relatively even. Finally, the variable $\ln(Round)$ denotes the round of the match and is a proxy for the expected tournament payoff (factoring in the probability of winning) and the effects of stress or pressure in the later rounds. The variable Round is coded as follows: final (Round = 1), semi-final (Round = 2), quarter-final (Round = 3), pre-quarter-final or Top 16 (Round = 4), and any lower qualifying round (Round = 5). Taking the logarithm of this scale imposes a concave relationship, i.e., qualifying for each successive round has an increasingly larger additional effect through the increase in player incentives (monetary or otherwise) (Footnote 8).
Note, however, that the causality for the variables ($N_{points}$, rally length, $W_{diff}$) may also run in the opposite direction; that is, poor serve randomization could conceivably have direct effects on these variables. For example, if a server exhibits serial correlation and the receiver exploits this, then the receiver would be more likely to return the serve, leading to longer rallies on average. Similarly, this could also affect the percentage of points won by the player or the number of points in the match. To remove this endogeneity problem, I calculate $N_{points}$, rally length, and $W_{diff}$ using only data where the player was the receiver, not the server.
The complete model is shown below in Equation 1; errors are assumed normally distributed (Footnote 9). To allow for the possibility that players' strategic adaptation may depend on whether they are, on average, players who over- or under-alternate, the model estimates the set of coefficients separately for these two groups (the distinction is made on the basis of the sign of $r_i^{dev}$). The coefficients are denoted $\beta^{+}$ for players with $r_i^{dev} > 0$ and $\beta^{-}$ for players with $r_i^{dev} < 0$:

$$ r_{pg_i}^{dev} = \alpha_i + \sum_{k=1}^{6}\left(\beta_k^{+}\,\mathbf{1}[r_i^{dev} > 0] + \beta_k^{-}\,\mathbf{1}[r_i^{dev} < 0]\right)x_{k,pg_i} + \epsilon_{pg_i} \qquad (1) $$

where $\alpha_i$ are player fixed effects and $x_1, \ldots, x_6$ denote $R_t^{own}$, $R_t^{opp}$, $N_{points}$, the mean rally length, $W_{diff}$ and $\ln(Round)$.
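A sketch of how such a fixed-effects regression could be estimated with statsmodels; the data below are synthetic stand-ins for the point-game panel, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
# Synthetic stand-in for the point-game panel: one row per point-game.
df = pd.DataFrame({
    "player": rng.integers(0, 40, n),       # player identifier
    "R_own": rng.normal(4, 2, n),           # 8 - log2(own current rank)
    "R_opp": rng.normal(4, 2, n),           # 8 - log2(opponent current rank)
    "N_points": rng.normal(150, 30, n),
    "rally_len": rng.normal(4, 1, n),
    "W_diff": rng.normal(0, 0.05, n),
    "log_round": np.log(rng.integers(1, 6, n)),
    "r_dev": rng.normal(5, 15, n),          # per point-game % run deviation
})
# Over- vs under-alternator indicator from the player's average r_dev.
df["over"] = df.groupby("player")["r_dev"].transform("mean") > 0

# Player fixed effects absorb between-subject variation; each regressor is
# interacted with the indicator, yielding the beta+ / beta- sets of Eq. 1.
model = smf.ols(
    "r_dev ~ C(player) + C(over):(R_own + R_opp + N_points"
    " + rally_len + W_diff + log_round)",
    data=df,
).fit()

# Joint F-test that all twelve slope coefficients equal zero.
idx = [i for i, name in enumerate(model.params.index) if "C(over)" in name]
R = np.zeros((len(idx), len(model.params)))
R[np.arange(len(idx)), idx] = 1.0
print(model.f_test(R))
```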
The results of the regression are displayed in Table 4 — the top half of the table presents the coefficients for players who over-alternate on average, the bottom half for those who under-alternate. A joint F-test cannot reject the null hypothesis that the coefficients on the independent variables are all equal to zero (F(12, 3941) = 0.38, p = 0.97). Similarly, tests for each individual variable cannot reject zero at the 5% level. I conclude that players do not systematically manipulate their serve randomization according to the rank of their opponent, and furthermore that there is no significant effect of match characteristics on behavior. This finding is robust to adding interactions between $R_t^{own}$ and the other variables in the regression model, which would allow sensitivity to match characteristics to depend on a player's own rank — see Table 8 (Appendix D) for the regression results. Importantly, there is significant heterogeneity in $r_i^{dev}$ between players, as captured by the estimated fixed effects (F(382, 3941) = 1.19, p = 0.01).
Table 5 presents individual regressions using the same set of regressors as above for the players in the high-power group. These regressions allow for heterogeneity not only in the fixed effects but also in the estimated coefficients of the independent variables. For example, it is possible that top-ranked players adapt strategically to their opponent or to the characteristics of each match, while lower-ranked players lack this ability; the prior regression on the whole set of players may have masked such heterogeneity. Note that the bulk of the observations of $R_t^{own}$ in these individual regressions fall within the Top 10 ranking range. Therefore, conclusions regarding the within-subject variation in randomization with rank are valid only within this range — it is possible that learning more efficient randomization may occur at much lower rankings. By contrast, there is significant variation in $R_t^{opp}$, allowing for more general conclusions.
In Table 5, * denotes significance at the 5% level.
From Table 5, none of the variables are statistically significant for Federer and Murray; however, the $\beta_1$ ($R_t^{own}$) coefficient for Nadal and the $\beta_3$ ($N_{points}$) and $\beta_5$ ($W_{diff}$) coefficients for Djokovic are significantly different from zero (p = 0.023, 0.025 and 0.036, respectively). Of course, by increasing the sample size enough, it is possible to reject any hypothesis for an arbitrarily small effect size. Therefore, the economic significance, or effect size, of the deviations is important — if they are small, then we should be cautious in concluding that players are not serving optimally even if statistical significance is found. Relatively small deviations may be either too difficult or too costly to detect and/or exploit. The economic significance, or effect size, of these coefficients is more clearly illustrated by $\omega^2$ or by converting them to standardized beta coefficients (Footnote 10). For Nadal, $\omega^2$ for $\beta_1$ ($R_t^{own}$) is equal to 0.015. For Djokovic, $\omega^2$ for $\beta_3$ ($N_{points}$) and $\beta_5$ ($W_{diff}$) is 0.016 and 0.014, respectively. Consequently, I conclude that, while statistically significant, these within-subject findings explain very little variation, particularly compared to the between-subject (unconditional) deviations from serial independence found for these players above. In conjunction with the insignificant findings in the panel regression (Table 4), I conclude that there is no significant and systematic evidence of strategic deviations conditional either on the opponent's rank or on the characteristics of a match.
4.2.2 Are between-subject deviations from serial independence dependent on a player’s own career-high ranking?
Since the previous results have ruled out any systematic within-subject variation in serve randomization, in this section I focus solely on the between-player variation after averaging the $r_{pg_i}^{dev}$ observations into the player averages $r_i^{dev}$. Figure 2 shows the estimated function for all players relating $r_i^{dev}$ to career-high rank, $r_i^{dev} = \delta_0 + \delta_1 R_{max}^{own} + \epsilon_i$, where $R_{max}^{own} = 8 - \log_2 Rank_{max}$ and the subscript max indicates the career-high rank (Footnote 11). The coefficient combination $\delta_0 + 8\delta_1$ corresponds to the mean value of $r_i^{dev}$ for No. 1 ranked players. To account for the different number of observations determining $r_i^{dev}$ for each player i, a weighted regression is employed with weights proportional to the number of point-games available for each player. Robustness tests were performed by running the same regression not only on the whole set of players, but also on the Top 100, Top 20 and Top 10 players separately — see regressions (1)–(4) in Table 6. In all regressions, the estimate of $\delta_1$ was negative and statistically different from zero, i.e., the lower ranked players were, the less prone they were to under-alternating and the more prone to over-alternating. Indicatively, the fitted values of $r_i^{dev}$ conditional on player rankings [1, 10, 20, 50, 100, 500] are [−2.8, 1.9, 3.3, 5.2, 6.6, 9.8] respectively (for the regression including all players) — see the plotted regression fit in Figure 2. In all the regressions, the value of $\delta_0 + 8\delta_1$ is negative and significantly different from zero (Footnote 12). GPW also find that more highly ranked male players' behavior is closer to equilibrium, but their estimated logit regression on serve directions does not imply under-alternation on average for No. 1 ranked players.
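A sketch of this weighted regression; the player-level data are synthetic, with coefficients chosen to be in the vicinity of the reported fitted values.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 391
players = pd.DataFrame({
    "rank_max": rng.integers(1, 1000, n),   # career-high ranking
    "n_pg": rng.integers(2, 300, n),        # point-games per player
})
players["R_max"] = 8 - np.log2(players["rank_max"])
# Synthetic r_dev with a negative slope, as in the reported estimates.
players["r_dev"] = 8.4 - 1.4 * players["R_max"] + rng.normal(0, 8, n)

# WLS with weights proportional to the number of point-games per player,
# so players with more data count more.
fit = smf.wls("r_dev ~ R_max", data=players, weights=players["n_pg"]).fit()
d0, d1 = fit.params["Intercept"], fit.params["R_max"]
print(f"mean r_dev of No. 1 ranked players: {d0 + 8 * d1:.2f}")
for rank in [1, 10, 20, 50, 100, 500]:
    print(rank, round(d0 + d1 * (8 - np.log2(rank)), 1))
```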
The group of players ranked No. 1 and No. 2 therefore exhibits an average tendency to under-alternate, although in the individual player-level analysis above, serial independence was rejected for some Top 2 players because of under-alternating and for others because of over-alternating. Although under-alternating serves is not an equilibrium strategy, if the majority of players over-alternate as receivers, then under-alternation would be consistent with a best response to the population of receivers. Recall, however, that no evidence was found of servers conditioning their randomization on the opponent's rank, so players would have to be learning deviations at the population level. Unfortunately, this cannot be directly tested because the receivers' actions are not easily observable: they would depend on the receiver's exact position on the court (further to the left or right), the grip used on the racket (i.e., whether it is more appropriate for a backhand or forehand return), and any other mental or physical preparation to receive the serve.
5 Conclusion
Using a new dataset with sufficient power to efficiently investigate the serial dependence in serve directions, I resolve the striking difference between the conclusions drawn by Walker & Wooders (2001) and Hsu et al. (2007) with respect to the serial (in)dependence of tennis serves. I corroborate the conclusion of the former study that there exist statistically significant deviations from serial independence in serves. Importantly, serial independence is rejected even for players ranked Number 1 in the world at some point in their careers, such as Federer, Nadal, and Djokovic. Over-alternation, or switching too often (negative serial correlation), was found to be more prevalent than under-alternation in the whole group of players — in line with earlier results in the literature, both in the laboratory and the field. Interestingly, Top 2 players were found to under-alternate on average — this would be a best response to a belief that the majority of tennis players tend to over-alternate in their direction as receivers. Furthermore, the lower the ranking of a player, the higher the degree of expected over-alternation.

Within-player analyses did not find evidence of strategic deviations from serial independence by higher-ranked players when competing against lower-ranked players. Consequently, the observed serial dependence cannot be explained away as a rational response to the non-equilibrium behavior of individual lower-ranked players with less experience and/or ability than the top players. These deviations might be difficult to detect and exploit at the level of each individual player, or within a single match, due to the small number of datapoints available for inference. However, learning the population-level tendency (outside of the Top 2 players) to over-alternate should be feasible and is one possible strategic explanation for the Top 2 players under-alternating on average. This is backed by extensive laboratory evidence that subjects playing repeated constant-sum games are capable of learning and exploiting the serial dependencies in their opponent's behavior given enough rounds of play (Spiliopoulos, 2012, 2013a, 2013b, 2018). Match characteristics proxying for the difficulty of a match, fatigue, induced pressure and incentives were also not found to systematically influence the randomization behavior of players.

Future work could be directed at ascertaining whether the observed magnitude of deviations from randomness is easily detectable given the sample sizes observed in tennis matches, and whether doing so would lead to an economically important advantage for a player. The latter would require a formal model associating deviations from perfect randomization with the probability of winning points and, ultimately, the whole match. Also, more data covering the whole career span of players would allow for more powerful tests of within-player learning of randomization behavior.
Appendix A: Statistical tests
Serial dependence for each point-game in the data is tested using the two-sided exact runs test. Let $n_L$ and $n_R$ denote the number of serves to the left and right in a single point-game, respectively, and let r denote the number of runs in the sequence of $n_L + n_R$ serves. The probability $q_{pg}(r)$ of finding r runs in such a sequence is given by Equation 2 below ($Q_{pg}(r)$ denotes the cumulative distribution):

$$ q_{pg}(r) = \begin{cases} \dfrac{2\binom{n_L - 1}{k - 1}\binom{n_R - 1}{k - 1}}{\binom{n_L + n_R}{n_L}}, & r = 2k, \\[2ex] \dfrac{\binom{n_L - 1}{k - 1}\binom{n_R - 1}{k} + \binom{n_L - 1}{k}\binom{n_R - 1}{k - 1}}{\binom{n_L + n_R}{n_L}}, & r = 2k + 1. \end{cases} \qquad (2) $$
Randomized test statistics are generated according to Walker & Wooders (2001, p. 1533) to satisfy the requirements of the Kolmogorov-Smirnov test, namely that they be identically and independently distributed and possess a continuous cumulative distribution function. For a point-game with r observed runs, the randomized test statistic $t_{pg}$ is a draw from the uniform distribution on the interval $[Q_{pg}(r-1), Q_{pg}(r)]$; under the null hypothesis of serial independence, $t_{pg}$ is uniformly distributed on [0, 1].
The point-game level test is a KS test on the distribution of the randomized (exact runs) test statistics for all point-games of all players — this is the test in WW and HHT. The player-level analysis applies the KS test to the distribution of the point-game statistics $t_{pg_i}$ of a single player only, permitting a test of serial independence for each player rather than for the set of players. All Kolmogorov-Smirnov tests were implemented with the kstest function in Matlab, based on the algorithm in Marsaglia et al. (2003), which computes p-values with 13-15 digit accuracy for n ≥ 2 samples. The reported critical values are $K = \max_{t} |\hat{F}_n(t) - F(t)|$, where F is the cumulative distribution function of the (randomized) exact runs test statistic $t_{pg}$ (or $t_{pg_i}$ for the player-level analysis) and $\hat{F}_n$ is its empirical distribution function; note that the reported critical values are not scaled by $\sqrt{n}$, as is often done when using the asymptotic distribution.
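A self-contained Python sketch of the procedure, with scipy's kstest playing the role of the Matlab kstest mentioned above; the two serve sequences are placeholders.

```python
import numpy as np
from math import comb
from scipy.stats import kstest

def runs_pmf(n_L, n_R):
    """Exact distribution of the number of runs in a random arrangement
    of n_L 'L's and n_R 'R's (Equation 2)."""
    total = comb(n_L + n_R, n_L)
    pmf = {}
    for r in range(2, n_L + n_R + 1):
        k = r // 2
        if r % 2 == 0:
            num = 2 * comb(n_L - 1, k - 1) * comb(n_R - 1, k - 1)
        else:
            num = (comb(n_L - 1, k - 1) * comb(n_R - 1, k)
                   + comb(n_L - 1, k) * comb(n_R - 1, k - 1))
        if num:
            pmf[r] = num / total
    return pmf

def randomized_stat(seq, rng):
    """Draw t_pg ~ U[Q(r-1), Q(r)]; uniform on [0, 1] under the null."""
    n_L, n_R = seq.count("L"), seq.count("R")
    r = 1 + sum(a != b for a, b in zip(seq, seq[1:]))
    pmf = runs_pmf(n_L, n_R)
    lo = sum(p for rr, p in pmf.items() if rr <= r - 1)  # Q(r-1)
    hi = lo + pmf.get(r, 0.0)                            # Q(r)
    return rng.uniform(lo, hi)

rng = np.random.default_rng(1)
pointgames = ["LRLRLLRRLRLRLR", "LLRRLRLLRRLLRL"]  # placeholder sequences
ts = [randomized_stat(s, rng) for s in pointgames]
print(kstest(ts, "uniform"))  # KS test against U[0, 1]
```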
Appendix B: Statistical power calculations
The amount of data for each player varies greatly in the dataset, from one match to 160 matches. Consequently, the power of the statistical tests will also vary significantly by player. I present approximate power and size calculations of the proposed player-level test below. To simplify the presentation, I perform these calculations on a single “representative” point-game. From Table 2, the probabilities of serving left and right are in the vicinity of 0.5 across the ad and deuce point-games. Therefore, I perform the calculations assuming that $q^L = q^R = 0.5$, and vary the probabilities $q^{LL} = q^{RR}$ from 0.4 to 0.6, a range that includes the empirical estimates found in the data. Figure 3 displays the results from 10,000 simulated draws for players with varying numbers of matches, ranging from 1 to 150 (the maximum in the dataset is 160 matches, for Federer), assuming 65 (left or right) serves per match (the average observed in the data). Table 7 presents the exact values.
The probability of rejecting the null hypothesis of serial independence (at the 5% significance level) is shown on the y-axis against different effect sizes on the x-axis (deviations from serial independence as determined by $q^{LL} - q^L = q^{RR} - q^R$). If $q^{LL} = q^{RR} = 0.5$, then the simulated data exhibit serial independence; therefore, the probability of (incorrectly) rejecting $H_0$ is equivalent to the size of the test, which should be 5%. As can be seen in the figure, the test has the appropriate size regardless of the amount of serve data available for a player. For all other values of $q^{LL} = q^{RR}$, the curves specify the power of the test, i.e., of correctly rejecting $H_0$. I compute an approximate expected effect size from the empirical data in the following way. Define the average effect size as the mean of the differences $|q^L - q^{LL}|$ and $|q^R - q^{RR}|$ for both point-games of the estimates presented in Table 2 — this is roughly 0.03 (Footnote 13). Therefore, treating this as the expected effect size and under the assumption that $q^L = q^R = 0.5$, the conditional probabilities associated with the expected effect size are $q^{LL} = q^{RR} = 0.47$ and 0.53. The accepted norm in the literature for adequate power is 80%; this can be achieved at $q^{LL} = q^{RR} = 0.47$ with data from 50 matches. For a larger effect size at $q^{LL} = q^{RR} = 0.45$, data from 20 matches achieve a little more than 80% power. Based on these calculations, I refer to subjects with at least fifty matches as the high-power group, those with between twenty and fifty matches as the moderate-power group, and those with fewer than twenty matches as the low-power group. Note that the findings in WW and HHT were based on twenty players from ten matches; the approximate power of these studies to detect the expected effect size (at the population level) of 0.03 is 44% (Footnote 14). Therefore, there is a significant chance that WW and HHT reached opposite conclusions due to the relatively low statistical power of their tests.
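A condensed sketch of these simulations; the exact runs machinery is redefined here so the block is self-contained, and the number of simulated draws is scaled down for illustration.

```python
import numpy as np
from math import comb
from scipy.stats import kstest

def runs_cdf_interval(n_L, n_R, r_obs):
    """Return (Q(r_obs - 1), Q(r_obs)) under the exact runs distribution."""
    if n_L == 0 or n_R == 0:
        return 0.0, 1.0  # degenerate sequence: a single run with certainty
    total = comb(n_L + n_R, n_L)
    lo = hi = 0.0
    for r in range(2, n_L + n_R + 1):
        k = r // 2
        if r % 2 == 0:
            num = 2 * comb(n_L - 1, k - 1) * comb(n_R - 1, k - 1)
        else:
            num = (comb(n_L - 1, k - 1) * comb(n_R - 1, k)
                   + comb(n_L - 1, k) * comb(n_R - 1, k - 1))
        p = num / total
        if r <= r_obs - 1:
            lo += p
        if r <= r_obs:
            hi += p
    return lo, hi

def simulate_pointgame(q_cond, n_serves, rng):
    """Markov-chain serve sequence with P(repeat direction) = q_cond."""
    seq = [rng.random() < 0.5]
    for _ in range(n_serves - 1):
        seq.append(seq[-1] if rng.random() < q_cond else not seq[-1])
    return "".join("L" if s else "R" for s in seq)

def power(q_cond, n_matches, sims=500, serves_per_match=65, seed=0):
    """Rejection rate of the player-level randomized KS test."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(sims):
        ts = []
        for _ in range(2 * n_matches):  # two point-games per match served
            seq = simulate_pointgame(q_cond, serves_per_match // 2, rng)
            n_L, n_R = seq.count("L"), seq.count("R")
            r = 1 + sum(a != b for a, b in zip(seq, seq[1:]))
            lo, hi = runs_cdf_interval(n_L, n_R, r)
            ts.append(rng.uniform(lo, hi))
        rejections += kstest(ts, "uniform").pvalue < 0.05
    return rejections / sims

# q_cond = 0.5 recovers the size of the test (~5%); q_cond = 0.47
# corresponds to the average empirical effect size of 0.03.
print(power(0.47, n_matches=50))
```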
Appendix C: Reconciling WW and HHT
The conflicting conclusions in HHT and WW with respect to the serial (in)dependence of tennis serves deserve attention, especially since both studies used the same statistical tests. I conjecture that this difference may be attributed to sampling variation (in the selection of matches to include in a study) due to the relatively small number of matches used in these studies. Both WW and HHT had 40 point-games from ten men's matches. Suppose that only 40 point-games were randomly chosen from this study's dataset to perform the runs test; what is the probability of rejecting the null hypothesis using the same number of point-games as WW and HHT? I resample 40 point-games 10,000 times from the complete set of point-games in the dataset (alternatively, only from the point-games of players ranked No. 1), and calculate the probability of rejecting serial independence in these re-sampled datasets. Based on these calculations, the probability of rejecting serial independence is 0.11 (or 0.36 for No. 1 ranked players), i.e., the null is rejected in 11% (36%) of the sub-samples. Consequently, the probability that two independent studies (of 40 point-games each) reach opposite conclusions, one rejecting and the other not rejecting the null, is approximately 2 × 0.11 × 0.89 ≈ 0.2 (or 2 × 0.36 × 0.64 ≈ 0.46 for No. 1 ranked players). Note that WW and HHT predominantly covered very highly-ranked players, including many No. 1 players; therefore, the value of 0.46 estimated from the No. 1 ranked players is likely the more accurate estimate. I conclude that the sampling variation hypothesis — lower power of the earlier datasets due to their small size — is a likely cause of the different results reached in WW and HHT. I note that the authors of both studies explicitly considered the power of their statistical tests for their datasets, but they were limited by the practical constraints of collecting and encoding data from a large number of tennis matches, which is very time-consuming.
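A sketch of the resampling exercise; here uniform draws stand in for the empirical randomized point-game statistics, so the demo rejection rate equals the 5% size of the test rather than the reported 0.11.

```python
import numpy as np
from scipy.stats import kstest

def rejection_probability(ts, n_pg=40, reps=10_000, seed=0):
    """Probability that a KS test on a random subsample of n_pg
    randomized point-game statistics rejects serial independence."""
    rng = np.random.default_rng(seed)
    ts = np.asarray(ts)
    rejects = 0
    for _ in range(reps):
        sample = rng.choice(ts, size=n_pg, replace=False)
        rejects += kstest(sample, "uniform").pvalue < 0.05
    return rejects / reps

# Placeholder statistics; the real input is the array of randomized
# exact-runs statistics for all 4,372 point-games (see Appendix A).
ts_demo = np.random.default_rng(1).uniform(size=4372)
p = rejection_probability(ts_demo)
# Two independent studies of 40 point-games reach opposite conclusions
# with probability 2 * p * (1 - p).
print(p, 2 * p * (1 - p))
```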