If we could record every international interaction in the realms of diplomacy, conflict, economics, and beyond, how much unique information would this chronicle amount to, and how surprised would we be to see something new? In other words, what is the entropy of international relations? While this record could in principle be unbounded, the central conceit of social science is that there are structural regularities that limit what actors can do, their best options, and even which actors are likely to survive (Brecher, 1999; Reiter, 2015). If so, then these events can be recorded and systematically measured by social scientists interested in these regularities.Footnote 1 A large and growing measurement literature seeks to do just that, using human coding and improving natural language processing techniques to capture unstructured streams of events from text such as international news reports.Footnote 2
We advance existing efforts to identify and structure regularized events and actors in international politics by combining human coding with natural language processing to create (1) a large, flexible ontology of international affairs and (2) a fine-grained and structured event dataset of international crises from 1918 to 2017, which we developed by applying our ontology to an unusually high-quality corpus of historical narratives of international crises (Brecher, 1999; Wilkenfeld and Brecher, 2000; Brecher et al., 2016). We then develop several methods for objectively gauging how well these event codings reconstruct the information contained in the original crisis narrative. We conclude by benchmarking our event codings against several current state-of-the-art event data collection efforts. We find that existing models produce data on historical episodes that do not contain enough information to reconstruct the underlying events; the underlying fine-grained variation in international affairs is unrecognizable through the lens of current quantification efforts. In focusing this initial effort on international crises as a proof-of-concept sample, we demonstrate our ontology and method's potential to improve upon existing empirical identifications of patterns of international interactions.
Over the next five sections, this measurement paper makes the following arguments. First, there is a real-world unobserved latent concept known as international relations that can and should be systematically measured. Second, we propose a method for systematic large-scale measurement of the actors and behaviors in international affairs and, as a proof of concept, apply that method to a well-regarded and salient sample of events known as international crises. Third, in doing so, we confirm that those measurements exhibit several desirable kinds of internal and external validity and outperform existing approaches. Fourth, this validation can be evaluated in detail via new event visualizations, with examples provided for case studies of the 1962 Cuban Missile Crisis and 2014 Crimea-Donbas Crisis. A final section concludes.
1. Identifying and measuring international relations
1.1 Motivation
Our knowledge of any historical episode, including the participants and their preferences, behaviors, and beliefs, is only indirectly observed from historical records that most often take the form of unstructured natural language text. Despite this complexity, all international interactions fundamentally involve a finite set of actors expressing their interests through at least theoretically observable behaviors. So how can we abstract and measure the discrete events that make up a historical episode in international relations? The easiest way to convey the desired product is with an example. Figure 1 shows a narrative account of the Cuban Missile Crisis (1962) in natural language sentences alongside a mapping to discrete machine-readable abstractive events. From this, scholars can identify similarities and differences across events, like what foreign policy actions deter versus inflame (Jervis, 1978; Glaser, 2000), when third parties mediate (Haffar, 2002; Quinn et al., 2006), and how actors communicate resolve (Trager, 2016; Lupton, 2018). Identifying patterns of international interactions is not just an inherently interesting enterprise; it is a necessary precondition to important efforts to predict where policymakers should turn their attention to improve global welfare (Ward et al., 2013; Beger et al., 2021).
1.2 Existing state-of-the-art measurements
We begin by drawing informative prior beliefs about the underlying process of international relations that we expect to govern behavior during historical episodes and their later transcription into the historical record. We organize our prior beliefs along two overarching axes: (1) existing efforts to identify the actors/actions of international relations; and (2) the types of behaviors and information we hope to recover. Table 1 describes these two axes as columns and rows, respectively.
The rows in Table 1 represent the types of information we expect to find in international relations and form the basis for our proposed ontology. We began the ontology by first doing a full natural language processing pass of the corpus and identifying all of the named entities and verbs mentioned in the text. To identify possible behaviors, we matched verbs to the most likely definition found in WordNet (Miller, 1995), tallied them (SI Appendix 1.2), and then aggregated them into a smaller number of behaviors balancing conceptual detail with manageable sparsity for human coding (informed by existing conceptual literature and measurement research). We used the International Crisis Behavior (ICB) project actor-level data to identify likely actors for each crisis and location options relative to each actor. For behavior, actor, and location, coders could write in a value if the given options were insufficient. The codebook lists eleven behaviors added post-coding as coders flagged events that were not captured by the initial ontology (e.g., propaganda).
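To make this first pass concrete, the sketch below extracts verbs from narrative text, maps each verb to its most likely WordNet sense, and tallies the results. The tooling choices (spaCy for parsing, NLTK's simplified Lesk disambiguation) are illustrative assumptions rather than the project's exact pipeline, which is documented in SI Appendix 1.2.

```python
# Illustrative sketch of the verb-extraction pass: parse narrative text, keep
# verbs, map each to its most likely WordNet sense, and tally senses.
# Assumes the spaCy "en_core_web_sm" model and the NLTK "wordnet" data are installed.
from collections import Counter

import spacy
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

nlp = spacy.load("en_core_web_sm")


def tally_verb_senses(narratives):
    """Count the most likely WordNet sense of every verb across a list of texts."""
    counts = Counter()
    for text in narratives:
        for sent in nlp(text).sents:
            tokens = [tok.text for tok in sent]
            for tok in sent:
                if tok.pos_ == "VERB":
                    sense = lesk(tokens, tok.lemma_, pos=wn.VERB)  # simplified Lesk
                    if sense is not None:
                        counts[sense.name()] += 1
    return counts


example = ["The United States imposed a naval blockade and demanded the missiles be removed."]
print(tally_verb_senses(example).most_common(5))
```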
As we are not the first to attempt to measure international relations in a structured manner, the columns of Table 1 compare the ontological coverage of ICBe to existing state-of-the-art systems in production and with global coverage. We choose these datasets and models as they represent frequently used and reputable efforts to structure and describe historical events of interest to scholars of international politics. The first column starts with our contribution, ICBe, alongside other event-level datasets including CAMEO dictionary lookup-based systems (Historical Phoenix (Althaus et al., 2019); ICEWS (Boschee et al., 2015); Terrier (Grant et al., 2017)), the Militarized Interstate Disputes Incidents dataset, the UCDP-GED dataset (Sundberg and Melander, 2013; Davies et al., 2022), and ACLED (Raleigh et al., 2010).Footnote 3 The final set of columns compares episode-level datasets beginning with the original ICB project (Brecher et al., 2016; Brecher and Wilkenfeld, 1982; Beardsley et al., 2020), the Militarized Interstate Disputes dataset (Gibler, 2018; Palmer et al., 2022), and the Correlates of War (Sarkees and Wayman, 2010). We include episode-level datasets as they remain a common and trusted tool for analyzing international relations, and because ICBe is unique among event-level datasets in that events are matched to crises and can be aggregated to the episode level. There is imperfect overlap between their intended depth and scope of coverage; "international crises" are similar, but not identical to, "interstate wars" and "militarized interstate disputes," which differ yet again from "individual events of organized violence" and "non-violent action." Even like concepts require care in comparison, as an "aim" in ICBe is the same as in MIDs, but an "alert" in ICBe is not the same as an "alert" in MIDs.
This comparison is not intended to fault existing data and models for not including every variable in ICBe's ontology, as some of these variables fall outside the scope of a particular dataset's intended purpose. Rather, it serves as an initial basis for identifying the heterogeneity in existing efforts to abstract and measure discrete historical events of interest. It also provides theoretical justifications from existing research for what is included in our dataset's ontology and identifies where ICBe's detail about historical events can be compared to the current state of the art.
With the exception of the large-scale CAMEO dictionary-based systems (the first grouping of columns), the existing state-of-the-art quantitative datasets omit important information about international interactions that our ontology captures.Footnote 4 We highlight two particular innovations. First, we separate the "chess pieces" from the "chess players" by distinguishing between different actors within a state. Because our ontology codes military versus civilian actors and national leaders versus bureaucrats, our data can be used to explore important questions concerning civilian-military relations (Narang and Talmadge, 2018), Track Two diplomacy, the role of sub-national actors (Hsu et al., 2020), and the evolution of which actors are engaged in crises, a topic of increasing interest as states engage in gray zone conflict by employing the coast guard or paramilitary mercenaries instead of internationally recognized state militaries (Gannon, 2022). Second, we add information about the domains in which actors behave (land, air, sea, space, or cyber), since domains differ in their technology, tactics, geography, and purpose (Gartzke and Lindsay, 2019). Doing so allows researchers to identify and explain patterns in escalation conditional on the military means states use in conflict. Recent concerns about cross-domain conflict, and the effect of new domains of conflict like space and cyber, have made this an endeavor of increased interest to practitioners (Gannon, 2022).
2. Methodology and data
2.1 Corpus
For our corpus, we select a set of unusually high-quality historical narratives from the ICB project (n = 471) with coverage spanning 1918–2017 (SI Appendix 1.1) (Brecher and Wilkenfeld, 1997; Brecher et al., 2016). ICB defines a crisis as meeting three conditions: (1) an actor perceives a threat to one or more of its core values, (2) the actor has a finite time horizon for responding to the perceived threat, and (3) the probability of military hostility has increased (Brecher and Wilkenfeld, 1982). Crises are a significant focus of detailed single case studies and case comparisons because they provide an opportunity to examine behaviors in international relations short of, or at least prior to, full conflict (Holsti, 1965; Paige, 1968; Allison and Zelikow, 1971; Brecher and Wilkenfeld, 1982; Gavin, 2014; Iakhnis and James, 2019). The corpus is also unique in that it was designed to be used in a downstream quantitative coding project, meaning each narrative was written by a small number of scholars using a uniform coding scheme in which word choice, writing style, and level of specificity were handled deliberately and consistently (Hewitt, 2001). Case selection was exhaustive, based on a survey of world news archives and region experts and cross-checked against other databases of war and conflict and against non-English sources (Brecher et al., 2016; Kang and Lin, 2019, 59).
2.2 Coding process
The ICBe ontology follows a hierarchical design philosophy in which a smaller number of significant decisions are made early on and then progressively refined into more specific details (Brust and Denzler, 2020).Footnote 5 Each coder was instructed to first thoroughly read the full crisis narrative and was then presented with a custom graphical user interface (GUI) (SI Appendix 2.1). Coders then proceeded sentence by sentence, choosing the number of events (0–3) that occurred, the highest behavior (thought, speech, or action), a set of players, whether the means were primarily armed or unarmed, whether there was an increase or decrease in aggression (uncooperative/escalating or cooperative/de-escalating), and finally one or more specific and non-mutually exclusive activities. Some additional details were always collected (e.g., location and timing) while other details were only collected if appropriate (e.g., force size, fatalities, domains, units). While each event was matched to a sentence, coders could fill in details from outside that sentence (e.g., antecedents to pronouns). We reviewed, standardized, and normalized cases where coders listed a behavior, actor, or location outside the ontology.Footnote 6
A unique feature of the ontology is that thought, speech, and do behaviors can be nested into combinations, e.g. an offer for the U.S.S.R. to remove missiles from Cuba in exchange for the U.S. removing missiles from Turkey. Through compounding, the ontology can capture what players were said to have known, learned, or said about other specific fully described actions.
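To make the nesting concrete, one hypothetical way to represent such a compound event as a data structure is sketched below; the field names are expository assumptions, not the released data schema.

```python
# Hypothetical nested ICBe-style event: a speech act (an offer) whose content is
# two conditional actions, mirroring the missile-trade example above.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Event:
    actor: str
    behavior: str                 # e.g. "speech_offer", "do_withdraw_forces"
    target: Optional[str] = None
    domain: Optional[str] = None  # land, air, sea, space, or cyber
    content: List["Event"] = field(default_factory=list)  # nested think/speak/do


offer = Event(
    actor="USSR",
    behavior="speech_offer",
    target="USA",
    content=[
        Event(actor="USSR", behavior="do_withdraw_forces", target="Cuba"),
        Event(actor="USA", behavior="do_withdraw_forces", target="Turkey"),
    ],
)
```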
No existing event dataset distinguishes thoughts, speech acts, and actions. In fact, most only try to code actions and entirely omit thoughts and speech acts despite recognition of their importance in international politics (Smith, 1998). Scholars have opted against coding thoughts and speech acts out of a lack of confidence that the full universe of such acts could be readily observed and consequently, at least in theory, included.Footnote 7 But the perfect should not be the enemy of the good, and measurement challenges are only overcome after an initial attempt to estimate difficult-to-observe concepts of interest. The ICB narratives are one of the better sources for this endeavor due to their consistent use of high-quality primary source material and of qualitative methods well suited to identifying thoughts and speech acts, such as archival work and expert interviews.
Each crisis was typically assigned to two expert coders and two novice coders, with an additional tie-breaking expert coder assigned to sentences with high disagreement.Footnote 8 For the purposes of measuring intercoder agreement and consensus, we temporarily disaggregate the unit of analysis to the Coder-Crisis-Sentence-Tag (n = 993,731), where a tag is any unique piece of information a coder can associate with a sentence, such as an actor, date, or behavior. We then aggregate those tags into final events (n = 18,783) using a consensus procedure (SI Appendix 2.2) that requires a tag to have been chosen by at least one expert coder and either a majority of expert or novice coders. This screens out noisy tags that no expert considered possible but leverages novice knowledge to tie-break between equally plausible tags chosen by experts. Requiring sentence-tag matching may underestimate agreement but minimizes the inclusion of noise and allows for additional validation. Once filtered for agreement, we find 472 actors and 119 different behaviors: 12 thought, 13 speech, and 94 action behaviors.
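The consensus rule can be sketched as follows, assuming a flat table of coder-sentence-tag records for a single crisis; the full procedure is described in SI Appendix 2.2.

```python
# Sketch of the consensus rule: keep a sentence-level tag if at least one expert
# chose it and either a majority of experts or a majority of novices chose it.
# The record layout and the per-crisis coder roster are illustrative assumptions.
from collections import defaultdict


def consensus_tags(records):
    """records: iterable of dicts with keys 'coder', 'is_expert', 'sentence_id', 'tag'."""
    votes = defaultdict(lambda: {"expert": set(), "novice": set()})
    roster = {"expert": set(), "novice": set()}
    for r in records:
        role = "expert" if r["is_expert"] else "novice"
        roster[role].add(r["coder"])
        votes[(r["sentence_id"], r["tag"])][role].add(r["coder"])

    kept = []
    for (sentence_id, tag), v in votes.items():
        majority_expert = len(v["expert"]) > len(roster["expert"]) / 2
        majority_novice = len(v["novice"]) > len(roster["novice"]) / 2
        if len(v["expert"]) >= 1 and (majority_expert or majority_novice):
            kept.append((sentence_id, tag))
    return kept
```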
3. Performance comparison
3.1 Internal consistency
We evaluate the internal validity of the coding process in several ways. For every tag applied we calculate the observed intercoder agreement as the percent of other coders who also applied that same tag (SI Appendix 2.3). Across all concepts, the Top 1 Tag Agreement was low among novices (31 percent), moderate for experts (65 percent), and high (73 percent) following the consensus screening procedure.
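A sketch of this agreement statistic, under the same assumed record layout as the consensus sketch above, is as follows.

```python
# Sketch of observed tag agreement: for each tag a coder applies to a sentence,
# the share of the other coders on that sentence who applied the same tag,
# averaged over all applied tags.
from collections import defaultdict


def mean_tag_agreement(records):
    by_sentence = defaultdict(lambda: defaultdict(set))  # sentence -> coder -> tags
    for r in records:
        by_sentence[r["sentence_id"]][r["coder"]].add(r["tag"])

    shares = []
    for coder_tags in by_sentence.values():
        for coder, tags in coder_tags.items():
            others = [c for c in coder_tags if c != coder]
            if not others:
                continue
            for tag in tags:
                shares.append(sum(tag in coder_tags[o] for o in others) / len(others))
    return sum(shares) / len(shares) if shares else float("nan")
```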
We attribute the remaining disagreement primarily to three sources. First, we required coders to rate and justify their confidence in the coding. They reported low confidence for 20 percent of sentences; 45 percent of those were due to a mismatch between the ontology and the text (“survey doesn't fit event”) and 46 percent were from a lack of information or confused writing in the source text (40 percent “more knowledge needed,” 6 percent “confusing sentence”). Observed disagreement varied predictably with self-reported confidence (SI Appendix 2.4). Second, as intended, agreement is higher (75–80 percent) for questions with fewer options near the root of the ontology compared to agreement for questions near the leaves of the ontology (50–60 percent). Third, individual coders exhibit nontrivial coding styles, e.g. some more expressive coders applied many tags per concept while others focused on only the single best match. We further observed unintended synonymity, e.g. the same information can be framed as either a threat to do something or a promise not to do something.
3.2 Improvement over existing efforts
To evaluate our coding process, we measure the recall and precision of ICBe events both in absolute terms and relative to existing systems. Recall measures the share of desired information recovered by a sequence of coded events, while precision measures the degree to which a sequence of events correctly and usefully describes the information in history. To aid in subjective evaluation of the precision and recall of ICBe for each event, we provide full ICB narratives, ICBe coding in an easy-to-read iconographic form, and a wide range of visualizations for every case on the companion website.
Recall for historical episodes is poorly defined for two reasons. First, history may or may not be written by the victors, but by virtue of being written by someone there is no genuine ground truth about what occurred, only surviving texts about it (Turberville, 1933). Second, there is no a priori guide to what information is necessary detail and what is ignorable trivia. History suffers from what is known as the Coastline Paradox (Mandelbrot, 1983): it has a fractal dimension greater than one, such that the more you zoom in, the more detail you will find about individual events as well as in between any two discrete events. The ICBe ontology is a proposal about what information is important, but we need an independent benchmark to evaluate whether that proposal is a good one, one that also allows for comparing proposals from event projects that had different goals. We need a yardstick for history.
Our strategy for dealing with both problems is a plausibly objective yardstick we call a synthetic historical narrative. We collect a large, diverse corpus of narratives spanning timelines, encyclopedia entries, journal articles, news reports, websites, and government documents. Using natural language processing (fully described in SI Appendix 3.1), we identify details that appear across multiple accounts. A detail refers to the smallest textual unit for which we can calculate similarity across corpora to identify whether sentences semantically refer to the same broader observed event (Narayan et al., 2018). The more accounts that mention a detail, the more central it is to understanding the true historical episode. The theoretical motivation is that authors face word limits which force them to pick and choose which details to include, and they choose details that serve the specific context of the document they are producing. With a sufficiently large and diverse corpus of documents, we can vary the context while holding the overall episode constant and see which details tend to be invariant to context. Sufficiently similar details were binned together and then summarized so they could be compared to the coding in ICBe. This presents a harder evaluation baseline than comparing ICBe's recall to just that of ICB, since there are non-crisis aspects of these events that may be included in other narratives but are outside the scope of our data. For example, the nationalization of businesses in Cuba may be included as important context for the Cuban Missile Crisis in documents that, unlike ICB, do not focus on the crisis dimensions. Given this hard case, a recall measure of ICBe on the synthetic narratives thus serves as a way to evaluate the breadth of ICBe's ontology and its potential application to non-crisis international events.
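A minimal sketch of this binning step is below, assuming a sentence-transformer embedding model and an agglomerative clusterer with a cosine-distance cutoff; the model name and threshold are illustrative, and the full procedure is in SI Appendix 3.1.

```python
# Sketch of synthetic-narrative construction: embed candidate detail sentences
# drawn from many independent accounts, bin near-duplicates by cosine distance,
# and count how many distinct documents mention each bin.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering


def synthetic_details(sentences, doc_ids, distance_threshold=0.4):
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    embeddings = model.encode(sentences, normalize_embeddings=True)
    clusterer = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold,
        metric="cosine",            # `affinity="cosine"` on older scikit-learn
        linkage="average",
    )
    labels = clusterer.fit_predict(embeddings)

    # A detail's centrality is the number of distinct documents mentioning it.
    docs_per_detail = {}
    for label, doc in zip(labels, doc_ids):
        docs_per_detail.setdefault(label, set()).add(doc)
    return {label: len(docs) for label, docs in docs_per_detail.items()}
```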
We find substantial variation in recall across existing state-of-the-art methods. Mentions of a detail across accounts are exponentially distributed, with context-invariant details appearing dozens to hundreds of times more than context-dependent details.Footnote 9 Furthermore, crisis start and stop dates are arbitrary, and the historical record points to many precursor events as necessary detail for understanding later events. Figure 2 compares ICBe's recall with that of existing datasets for the two case studies detailed in Section 4. ICBe strictly dominates all of the systems except ICEWS in recall, though we note that the small sample sizes mean these systems should be considered statistically indistinguishable. Across all existing datasets and ICBe, recall increases with the number of document mentions, which is an important sign of validity both for them and for our benchmark. The one outlier is Phoenix, which in the Cuban Missile Crisis case is so noisy that its recall curve is flat to decreasing as mentions increase. The two episode-level datasets (MIDs and ICM) have low coverage of contextual details. The two other dictionary systems, ICEWS and Terrier, have higher coverage, with ICEWS outperforming Terrier. Importantly, our corpus of ICB narratives has high recall of frequently mentioned details, giving us confidence in how those summaries were constructed, and ICBe lags only slightly behind, showing that it left little additional information on the table.Footnote 10
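The recall comparison in Figure 2 can be read as a curve over detail salience. A sketch follows, assuming each synthetic-narrative detail comes with its document-mention count and a flag for whether a given dataset recovered it.

```python
# Sketch of recall by salience: among details mentioned in at least k documents,
# the share a dataset recovered. The toy numbers below are illustrative only.
def recall_at_mentions(details, k):
    """details: list of (mention_count, recovered) pairs for one dataset."""
    eligible = [recovered for count, recovered in details if count >= k]
    return sum(eligible) / len(eligible) if eligible else float("nan")


toy = [(120, True), (45, True), (12, False), (7, True), (5, False)]
print([round(recall_at_mentions(toy, k), 2) for k in (5, 10, 50)])  # [0.6, 0.67, 1.0]
```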
The second component of event measurement validation is precision. It does little good to recall a historical event only vaguely (e.g., MIDs describes the Cuban Missile Crisis as a blockade, a show of force, and a stalemate) or with too much error to be useful for downstream applications (e.g., ICEWS records 263 "Detonate Nuclear Weapons" events between 1995 and 2019). ICBe's ontology and coding system are designed to strike a balance so that the most important information is recovered accurately but also abstracted to a level that is still useful and interpretable.
We demonstrate ICBe's precision in a number of different ways. First, we develop an iconography system for presenting event codings as coherent statements that can be compared side by side to the original source narrative for every case on the companion website. We further provide a stratified sample of event codings alongside their source text (SI Appendix 4.2). We find that both the visualizations of macrostructure and the head-to-head comparisons of ICBe codings to the raw text strongly support the quality of ICBe. Second, we develop a visualization we call a crisis map, a directed graph intersected with a timeline. A researcher should be able to lay out the events of a crisis on a timeline and read off the macrostructure of an episode from each individual move. A crisis map using ICBe for the Cuban Missile Crisis case study is provided in Figure 5; crisis maps for the two case studies using existing event datasets can be found in SI Appendix 4.3 and 4.4; and crisis maps for all crises using all datasets can be found on the companion website. The crisis maps reveal that episode-level datasets like MIDs or the original ICB are too sparse and vague to reconstruct the structure of the crisis (SI Appendix 4.3 and 4.4). On the other end of the spectrum, the high-recall dictionary-based event datasets like Terrier and ICEWS produce so many noisy events (several hundred thousand) that even with heavy filtering their crisis maps are completely unintelligible. Further, because of copyright issues, none of these datasets directly provides the original text spans, making event-level precision difficult to verify.
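The crisis map itself is simply a directed graph whose nodes carry dates and actors. A minimal sketch using networkx, with hypothetical node fields and a handful of Cuban Missile Crisis events, is below.

```python
# Sketch of a crisis map: a directed graph of events laid out left-to-right by
# date and stacked by actor. Node fields and dates shown here are illustrative.
import datetime as dt

import networkx as nx

G = nx.DiGraph()
G.add_node("deploy", actor="USSR", date=dt.date(1962, 9, 8), label="missiles deployed to Cuba")
G.add_node("discover", actor="USA", date=dt.date(1962, 10, 14), label="U-2 photographs missile sites")
G.add_node("blockade", actor="USA", date=dt.date(1962, 10, 22), label="naval quarantine announced")
G.add_edge("deploy", "discover")   # edges encode the narrative sequence of moves
G.add_edge("discover", "blockade")

# Timeline layout: x = date, y = one row per actor (for plotting with matplotlib).
actors = sorted({d["actor"] for _, d in G.nodes(data=True)})
pos = {n: (d["date"].toordinal(), actors.index(d["actor"])) for n, d in G.nodes(data=True)}
```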
We further want to verify the precision of individual ICBe event codings automatically, which is possible because each ICBe event is mapped to a specific span of text. Our proposed measure is a reconstruction task: can our intended ontology be recovered solely through unsupervised clustering of the sentences its tags were applied to? Figure 3 shows the location of every sentence from the ICBe corpus in semantic space, as embedded using the same large language model as before, and the median location of each ICBe event tag applied to those sentences.Footnote 11 Labels reflect the individual leaves of the ontology and colors reflect the higher-level coarse branch nodes of the ontology. If ICBe has high precision, substantively similar tags ought to have been applied to substantively similar source text, which is what we see both in two dimensions in the main plot and via hierarchical clustering on all dimensions in the dendrogram along the right-hand side.Footnote 12
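A sketch of this reconstruction check is below, assuming a sentence-transformer embedding model; Figure 3 additionally projects the embeddings to two dimensions for plotting.

```python
# Sketch of the reconstruction task: embed every coded sentence, take the median
# embedding of the sentences each tag was applied to, and hierarchically cluster
# the resulting tag centroids. The embedding model name is an assumption.
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from sentence_transformers import SentenceTransformer


def tag_dendrogram(sentences, tags_per_sentence):
    """tags_per_sentence: list of tag sets, aligned with `sentences`."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(sentences, normalize_embeddings=True)

    names, centroids = [], []
    for tag in sorted({t for tags in tags_per_sentence for t in tags}):
        idx = [i for i, tags in enumerate(tags_per_sentence) if tag in tags]
        names.append(tag)
        centroids.append(np.median(emb[idx], axis=0))

    Z = linkage(np.vstack(centroids), method="average", metric="cosine")
    return dendrogram(Z, labels=names, no_plot=True)
```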
4. Case illustrations
In this section, we focus our validation on two case studies for which we have produced synthetic narratives using the method described in Section 3.2. The first is the Cuban Missile Crisis, which took place primarily in the second half of 1962, involved the United States, the Soviet Union, and Cuba, and is widely known for bringing the world to the brink of nuclear war (Figure 1). The second is the Crimea-Donbas Crisis, which took place primarily in 2014, involved Russia, Ukraine, and NATO, and within a decade spiraled into a full-scale invasion (SI Appendix 4.1). We choose these cases because they are significant in contemporary international relations, are widely known across academic disciplines as well as among the public, and are sufficiently brief to evaluate in depth. They are similar in that both involve a superpower in crisis with a neighbor that changed from a friendly to a hostile regime, both held implications for the superpower's economic and military security by risking full-scale invasion, and both eventually invited intervention by an opposing superpower.
4.1 Cuban Missile Crisis (1962)
A synthetic historical narrative for the Cuban Missile Crisis appears in Figure 4, with 51 events drawn from 2,020 documents. Each row represents a detail that appeared in at least five documents, along with an approximate start date, a handwritten summary, the number of documents it was mentioned in, and whether it could be identified in the text of the original ICB corpus, our ICBe events, and any of the competing existing models.
ICBe's improved recall of the Cuban Missile Crisis relative to the state of the art was summarized in Section 3.2, but the events that explain that improvement can now be seen. Our ground truth ICB narrative contains 17/51 of the events from the synthetic narrative of a case that includes high-level, previously classified details. ICBe captures nearly all details included in ICB as well as more details from the synthetic narrative than any competing dataset. Phoenix includes some information earlier than ICBe, like the nationalization of businesses and back-channel negotiations, but the crisis narrative has a clean canonical end with the Soviets agreeing to withdraw the missiles. ICBe stands out in including more communicative behavior (do–speech) than existing datasets, such as US threats to attack and later promises not to invade. Given the recognized importance of threat credibility for understanding international conflict, the addition of this information is a substantively important improvement over the existing state of the art (Slantchev, 2011).
Figure 5 shows the crisis map for the Cuban Missile Crisis. Looking at the crisis on a timeline, one can now identify the structure of actors and the environment, along with its supporting details, in a way that validates the precision of ICBe. Although harder to measure objectively, this crisis map provides face validity that ICBe's account is not too vague, but also not unnecessarily detailed. We include many of the geopolitically important details, like the Soviet deployment, US discovery of that deployment, heightened alert levels, a blockade, and negotiations that ended with a formal agreement. At the same time, the crisis map indicates that ICBe does not include unnecessary nuances that would preclude useful comparison to other international events.
4.2 Crimea-Donbas (2014)
A synthetic historical narrative for the 2014 Crimea-Donbas Crisis (30 events drawn from 971 documents) appears in Figure 6. As in the earlier case, rows represent details that appeared in at least five documents, along with whether each is identified in ICBe and existing datasets.
As quantitatively summarized in Section 3.2 (Figure 2), our ground truth ICB narrative contains 23/30 of the events from the synthetic narrative. Like the gray zone precursor to the Cuban Missile Crisis (Cormac and Aldrich, 2018), Ukraine provided several security guarantees to Russia that were potentially undone, e.g. a long-term lease on naval facilities in Crimea. But unlike the Cuban Missile Crisis, the end of this crisis is unclear, with the episode ending inconclusively in a second cease-fire agreement (Minsk II) amid continued fighting. ICBe again recalls more important information about the crisis than any existing dataset, particularly information concerning the behavior of non-state separatist groups like the Donetsk People's Republic (DPR) and Luhansk People's Republic (LPR).
As this more recent case reflects primarily public reporting rather than the previously classified details relevant for the Cuban Missile Crisis, ICBe's improvement relative to the global and real-time coverage of dictionary-based event systems is still present, but less pronounced. We want to take seriously the possibility that some functional transformation of existing data could recover precision comparable to ICBe's. For example, Terechshenko (2020) attempts to correct for the mechanically increasing amount of news coverage each year by de-trending violent event counts from Phoenix using a human-coded baseline. Others have focused on verifying ICEWS's precision for specific subsets of details against known ground truths, e.g. geolocation (Cook and Weidmann, 2019), protest events (80 percent) (Wüest and Lorenzini, 2020), and anti-government protest networks (46.1 percent) (Jäger, 2018).
We take the same approach here in Figure 7, selecting four specific CAMEO event codings around key moments in the crisis and checking how often they reflect a true real-world event from the Crimea-Donbas synthetic narrative. The start of the crisis revolves around Ukraine backing out of a trade deal with the EU in favor of Russia, but "sign formal agreement" events act more like a topic detector, with dozens of events generated by discussions of a possible agreement rather than the actual agreement, which never materialized. The reversal is caught by the "reject plan, agreement to settle dispute" event type, but such events continue to be generated for Viktor Yanukovych even after he was removed from power, because articles retroactively discuss the cause of his removal. Events for "use conventional military force" capture a threshold around the start of hostilities and who the participants were, but not any particular battles or campaigns. Likewise, "impose embargo, boycott, or sanctions" captures the start of waves of sanctions and who imposed them, but is effectively constant thereafter, as news coverage does not distinguish between subtle changes or additions. In sum, dictionary-based methods on news corpora tend to have high recall because they parse everything in the news, but for the same reason, their specificity for most event types is too low to back out the individual chess-like sequencing that ICBe aims to record.
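The spot check behind Figure 7 can be sketched as a simple date-window comparison between coded events and the synthetic-narrative benchmark; the column names below are assumptions about the event table's layout.

```python
# Sketch of the precision spot check: for one CAMEO code, count how many coded
# events fall within a window of any benchmark date from the synthetic narrative.
import pandas as pd


def spot_check(events, cameo_code, benchmark_dates, window_days=7):
    """events: DataFrame with (assumed) columns 'cameo' and 'date' (datetime64)."""
    subset = events.loc[events["cameo"] == cameo_code]
    near = pd.Series(False, index=subset.index)
    for d in pd.to_datetime(benchmark_dates):
        near |= (subset["date"] - d).abs() <= pd.Timedelta(days=window_days)
    return {
        "n_events": len(subset),
        "n_near_benchmark": int(near.sum()),
        "share_near_benchmark": float(near.mean()) if len(subset) else float("nan"),
    }
```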
5. Conclusion
The scope and complexity of international politics should not discourage the identification of trends, patterns, and regularities. In undertaking event abstraction from narratives about key historical episodes in international relations, this paper has proposed a mapping between unstructured historical records and a structured ontology of these events with high coverage of concepts of interest. Multiple validity checks find the resulting codings have high internal validity (e.g., intercoder agreement) and external validity (i.e., matching source material in both micro-details at the sentence level and macro-details spanning full historical episodes). Further, these codings perform much better in terms of recall, precision, coverage, and overall coherence in capturing these historical episodes than existing event systems used in international relations.
These data, along with the open-source code, documentation, and companion website, provide several substantive and methodological contributions to the discipline. Substantively, these data are appropriate for statistical analysis of hard questions in the study of crises, like interactions between means of warfare and the preconditions for conflict escalation (Gannon, 2022). Methodologically, our mapping from codings to source text at the sentence level provides a new resource for natural language processing, with access to coder-level disaggregation that furthers the study of uncertainty in the interpretation of international events and in the quantitative coding of historical events. Finally, we provide a companion website (crisisevents.org) that incorporates detailed visualizations of all the data introduced here as a new resource for the study of international crises in a scalable, yet detailed, manner.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/psrm.2024.17. To obtain replication material for this article, visit https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi%3A10.7910%2FDVN%2FMNVUEP&version=DRAFT
Data
This article's data, supplementary appendix, replication material, and visualizations of every historical episode are available on the GitHub repository ICBEventData and through the companion website crisisevents.org.
Acknowledgements
We thank the ICB Project and its directors and contributors for their foundational work and their help with this effort. We make special acknowledgement of Michael Brecher for helping found the ICB project in 1975, creating a resource that continues to spark new insights to this day. We thank the many undergraduate coders. Thanks to the Center for Peace and Security Studies and its membership for comments. Special thanks to Rebecca Cordell, Philip Schrodt, Zachary Steinert-Threlkeld, and Zhanna Terechshenko for their generous feedback. Thank you to the cPASS research assistants: Helen Chung, Daman Heer, Syeda ShahBano Ijaz, Anthony Limon, Erin Ling, Ari Michelson, Prithviraj Pahwa, Gianna Pedro, Tobias Stodiek, Yiyi ‘Effie’ Sun, Erin Werner, Lisa Yen, and Ruixuan Zhang.
Author contributions
Conceptualization: R. W. D., E. G., J. L.; methodology: R. W. D., T. L. S.; software: R. W. D.; validation: R. W. D., T. L. S.; formal analysis: R. W. D., T. L. S.; investigation: S. C., R. W. D., J. A. G., C. A., N. L., E. M., J. M. C. N., D. P., D. M. Q., J. W.; data curation: R. W. D., D. M. Q., T. L. S., J. W.; writing — original draft: R. W. D., T. L. S.; writing — review and editing: R. W. D., J. A. G., E. G., T. L. S.; visualization: R. W. D., T. L. S.; supervision: E. G.; project administration: S. C., R. W. D., J. A. G., D. M. Q., T. L. S., J. W.; funding acquisition: E. G., J. L.
Financial support
This work was supported by a grant from the Office of Naval Research N00014-19-1-2491 and from the Charles Koch Foundation 20180481. The financial sponsors played no role in the design, execution, analysis and interpretation of data, or writing of the study.
Competing interest
The authors declare that there are no competing interests.