Among political scientists, the practice of using digitized documents from archives has become increasingly common. This article is a practical introduction to doing digital archival research. First, it explains when and why political scientists use evidence based on archival research. Second, it argues that the remote accessibility of digitized records provides new opportunities for comparative and transnational research. However, digital archival research also risks aggravating five types of biases that pose challenges for scholarship relying on qualitative, quantitative, interpretive, and mixed-methods approaches: survival, transfer, digitization, and reinforcement bias at the level of record collection and source bias at the level of record creation. Third, the article offers concrete strategies for anticipating and mitigating these biases by walking readers through the experience of entering, being in, and leaving an archive, while also underscoring the importance of learning the structure of an archive. The article concludes by addressing the ethical implications of archival research as a type of field research for political scientists.
WHEN AND WHY POLITICAL SCIENTISTS TURN TO ARCHIVES
For empirically oriented subfields, “doing archival research” often means collecting data from historical records for a quantitative or a mixed-methods approach to causal inference, or primary sources for qualitative case studies aimed at theory testing (American Political Science Association 2019, 2021). Political scientists also do archival research to gather original evidence for descriptive inference, case studies for theory building, and interpretive analyses of causal processes and concept histories. Despite methodological differences, political scientists tend to collect archival records as a means to an end. We seek records with words and numbers that may support, confirm, refine, or rule out answers to questions formulated before entering the archive. Unlike historians who explore questions in the archives, political scientists pursue theory-driven evidence.
Specifically, archives beckon political scientists for several reasons. First, the study of formal institutions occupies a privileged place in the discipline. The reasons why bureaucracies, legislatures, courts, the military, and the police behave the way they do (making decisions and enforcing them, extracting and distributing resources, and exercising coercive and symbolic power) are difficult to ascertain directly. Declassified records from the archives of relevant agencies can yield information about institutional behavior. Internal correspondence, draft policy memos, and minutes of meetings and debates are especially helpful for theorizing about why leaders make choices; how coalitions and rivalries in politics and society emerge and evolve; and the role of ideology, ideas, beliefs, and other factors relating to human agency on outcomes of interest (Blaydes 2018; Lawrence 2013; Mendoza 2021; Saunders 2011; Subotic 2019).
Second, and relatedly, the archives of formal institutions can provide indirect information about social actors who do not generate their own written records or lack resources to conserve them. For instance, reports from Truth and Reconciliation Commissions contain interview transcripts with survivors of major atrocities; police records include statements made by citizens arrested or under surveillance; and legal and court records include people’s testimonies and petitions (Hussin 2016; Leiby 2009; Luft 2020; Nako 2019). Informal archives, namely “unmapped, non-systematized collections of materials kept by individuals and groups in the spaces under study,” may yield more proximate records into the lived past of social actors (Auerbach 2018, 345; Balcells and Sullivan 2018; Davenport 2010).
Third, political scientists do archival research to find original evidence for case studies using process tracing to identify causal mechanisms and for theory-testing and theory-building purposes. Most process tracing requires precise sequencing of independent, dependent, and intervening variables, as well as careful descriptions of each step in a chronological trajectory or causal narrative (Collier 2011; Faletti 2006; Ricks and Liu 2018). Archival research helps scholars address two key challenges that arise in this line of research: confirmation bias and imperfect counterfactuals. One may inadvertently “cherry-pick” evidence that supports a hypothesis without observing evidence that contradicts it or lends credibility to a rival hypothesis. There also are risks of teleological explanations because it is difficult to reconstruct plausible alternative trajectories through which a given outcome is obtained (i.e., the “paths not taken”). Both problems are more likely when case studies rely heavily on secondary sources. Given authors’ own biases and the types of historiographical or methodological debates in which they are engaged, there is a risk of deliberately or unconsciously focusing on specific factors or events, giving them greater visibility that can be mistaken for greater causal importance (Lustick 1996; Møller and Skaaning 2018). When indebted solely to secondary sources, there also is a risk of mistaking an historical narrative for a social process, conflating “what happened” and “that which is said to have happened” (Trouillot 1995, 2). Political scientists seek to reduce these problems through archival research of primary sources.
Fourth, a turn to archives coincides with a distinctive turn to history in political science (American Political Science Association 2019; Mahoney and Thelen 2015). For those studying the long-run consequences of institutions or events, archival records can yield granular time-series data amenable to rigorous quantitative causal analyses that also are descriptively valuable for identifying empirically puzzling or theoretically surprising patterns (Guardado 2018; Suryanarayan and White 2021).
Fifth, interpretive analyses of historical events, processes, and concept histories may rely on archives for texts that capture, mediate, and represent the ideas, linguistic communities, and worldviews of actors in context (Grant 2015; Kim 2020; Mackinnon 2019).
PROMISES AND PITFALLS OF DIGITAL ARCHIVAL RESEARCH
The large-scale digitization of archival records has affected research practices in both positive and negative ways (Trivellato 2019, 8–10; Turnbull 2014). On the one hand, the online availability of documents digitized by archives has reduced economies of scale for identifying, accessing, and collecting records that once required significant time investment, resources, and country-specific and regional expertise (Putnam 2016, 389). The full-text searchability of archival records and the improved quality of optical character recognition (OCR) enable multisite, multilanguage archival research. Skimming through large quantities of records has become efficient. Neural machine translation interfaces (e.g., Google Translate) make it possible to collect sources in foreign languages without knowing the language itself. To paraphrase the historian Lara Putnam (2016, 383), political scientists also are “able to find without knowing where to look.”
On the other hand, the availability of digitized records that are too easily accessed remotely may generate problems of excess abundance. It becomes more difficult for researchers to ascertain how the documents they consult fit within an archive’s general structure. Without knowing how representative a subset of documents is of which universe of records, it is difficult to make broader inferences about the empirical reality they capture.
Excess abundance aggravates four types of biases at the level of data collection. Survival bias occurs when records are missing or destroyed in a nonrandom way. Transfer bias occurs when the records that an archive acquires (i.e., “accession”) and catalogues reflect asymmetries of power, wealth, and privilege that favor certain agencies and individuals or the archive’s own institutional interests. Digitization bias can amplify transfer bias when archives are selective about which records are digitized and made accessible remotely. Reinforcement bias occurs when researchers focus on collecting a subset of records that confirm their hypotheses without consulting other record groups. At the level of record creation, digital archival research also faces greater risks of source bias, which reflects the extent to which governments and the powerful tend to be those who write records in the first place.
To be sure, these challenges have always plagued onsite archival research. What has changed with more digitization and remote access is a disruption in the ways that researchers are able to discern and address biases through the practical experience of being in an archive and the physical tasks of requesting and accessing documents. Political scientists have always relied on humanistic solutions to problems of abundance in archival research. The repetitive act of using call/shelf numbers to order documents in a reading room is also an act of tacit learning about the record hierarchy in which a document is embedded. Locating the “right” document that serves as evidence for (or counterevidence to) one’s hypothesis requires browsing and skimming through a large quantity of seemingly irrelevant records. This serves as a quasi-forced check against reinforcement bias and generates serendipitous encounters with evidence that one does not necessarily know to look for. Finite physical capacity also forces political scientists to make deliberate choices about what to consult and collect. When doing archival research in person, there are only so many documents that one can request and copy in a day. This limit is far less binding in an “infinite archive” of digitized records.
How may political scientists doing archival research anticipate and address challenges of excess abundance from digitized records? The following subsections offer several concrete strategies that commonly center on ways to tacitly learn the structure of an archive.
Before Entering an Archive
Imagine that you are planning to travel to a new country. Learn the basic language that people at the archives use. Provenance (also known as respect des fonds) refers to the original creator of a record and its history of ownership. It is a principle for arranging records in a way that preserves their integrity based on how they originated. It also informs the practice of original order, by which archives maintain records according to how creators arranged them originally. This is why records are not necessarily found chronologically, alphabetically, or according to geography. The organizing categories are those created by the individual, family, or agency from whom the archive acquired the documents.
Many archives arrange records according to a hierarchy. Collections are a general grouping of records that do not necessarily share the same provenance. Within a collection, the highest level of description is a fonds (or “record group”) in which records share provenance. A fonds is subdivided into series (and subseries), which are further subdivided into files. The lowest level of the hierarchy is an item, which is a record that is indivisible. The item-level record is what we usually understand as an archival document: the piece(s) of paper for a surveillance file of an individual, a court-proceedings transcript, or a tax record (figure 1).
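The record hierarchy described above can be pictured as a small tree. The following is a minimal sketch in Python, offered only as an illustration and not as an archival standard: the class name RecordNode, the helper paths_to_items, and every label in the usage lines are hypothetical, and a given archive may use more or fewer levels than the five listed here.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

# Levels of description named in the text, from broadest to narrowest.
LEVELS = ["collection", "fonds", "series", "file", "item"]


@dataclass
class RecordNode:
    """One node in an archive's record hierarchy (hypothetical helper)."""
    level: str
    label: str
    children: List["RecordNode"] = field(default_factory=list)

    def add(self, child: "RecordNode") -> "RecordNode":
        parent_rank = LEVELS.index(self.level)
        child_rank = LEVELS.index(child.level)
        # A series may contain subseries; otherwise a child must sit lower.
        is_subseries = self.level == "series" and child.level == "series"
        if child_rank <= parent_rank and not is_subseries:
            raise ValueError(f"a {child.level} cannot sit under a {self.level}")
        self.children.append(child)
        return child


def paths_to_items(node: RecordNode, trail: Tuple[str, ...] = ()) -> Iterator[str]:
    """Yield the full path from the top of the hierarchy down to every item."""
    trail = trail + (f"{node.level}: {node.label}",)
    if node.level == "item":
        yield " > ".join(trail)
    for child in node.children:
        yield from paths_to_items(child, trail)


# Usage with invented labels.
collection = RecordNode("collection", "Example collection")
fonds = collection.add(RecordNode("fonds", "Example agency"))
series = fonds.add(RecordNode("series", "Correspondence"))
folder = series.add(RecordNode("file", "Letters, 1905"))
folder.add(RecordNode("item", "Letter no. 1"))
for path in paths_to_items(collection):
    print(path)
```

Printing the full path for each item makes visible the context that a keyword search result usually hides: which file, series, and fonds a single document belongs to.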
Finding aids are one of the most important tools for navigating an archive (figure 2). They are detailed inventories of the records in a collection, containing metadata of a collection’s provenance, summary of contents and organization, administrative history and biographical notes, and size (e.g., number of boxes and linear feet of records). Some finding aids may include a file-by-file, item-by-item list of the collection’s contents. An index (or catalogue) is a list of records with shelf/call numbers. Research guides provide descriptive explanations of how to explore an archive’s holdings and often are written by an archivist or subject specialist. Think of each tool as a different genre of storytelling about an archive. Finding aids and indexes are often cryptic and not necessarily meant to be read from cover to cover. Rather, consult them selectively. Often hidden within their flat prose is invaluable contextual information about a collection. Research guides are more reader friendly, with rich narratives that should be consulted discerningly, not least because they are the products of another’s interpretation of an infinite archive.
Born-digital records are items that are created originally in digital form, such as emails, social media posts, and other types of electronic records. Digitized records generally refer to scanned copies of an original analog record; they are a type of access derivative. Just as paper-based records are fragile and can experience wear over time, digitized and born-digital records also face risks of data degradation (i.e., “bit rot”), losses in the process of transcoding and compression, and obsolescent formats.
Entering an Archive
Now you are at the archive, in the (virtual) reading room (figures 3 and 4). How do you begin to find documents? For political scientists in search of theory-driven evidence, a first helpful step is to develop a list of search keywords relating to their research question and tentative argument: X causes Y; A influences B. What are your concepts, words, and proper nouns relating to X and Y, A and B?
A surprising amount of archival research in the twenty-first century, whether in person or via remote access, is time spent doing reiterative keyword searches. The search box, whether on an archive’s closed intranet terminal or public website, always mediates access.
However, digital archival research is not a Google search. To effectively use the search box, consider creating a two-layered set of keywords: (1) words that reflect how the archive labels and categorizes records relating to your X/A and Y/B; and (2) words in context—that is, what past actors and institutions would have called your X/A and Y/B. To figure out the former, consult a finding aid, research guide, or index. To ascertain the latter, consult a seminal history or empirical study relating to your research.
For instance, suppose your X/A relates to street-level bureaucrats. The concept itself is an academic term of art. Which alternative words might capture the presence of those actors in the archive? Common sense may suggest “local administrators” and “municipal officials.” Now think back to Lipsky’s (1980, 17–18) canonical study of street-level bureaucrats in the United States, which refers to them as “public service workers,” “public employees,” and “low-level workers.” Add these three keywords to your list. Perhaps urban Pakistan in the 2000s is your context; Hull’s (2013, 57–59) Government of Paper guides you toward this word: “clerks.” Perhaps the historical context of the early-twentieth-century British empire is relevant: “district officers” abound in Lugard’s (1922) The Dual Mandate in British Tropical Africa and Mamdani’s (1996) Citizen and Subject. Now turn to your Y/B. It includes the word “opium.” A digitized finding aid for the India Office Records at the British Library indicates that “opium” is cross-referenced with terms such as “Abkaree” and “Separate Revenue.” You already are starting to identify the controlled vocabulary of the archive, which has grouped together records according to their provenance of the British Indian colonial government for Burma’s Excise Department (Kim 2020).
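The two-layered keyword list from this example can be kept in a simple structure and expanded into search strings. The sketch below is one possible way to do so in Python. The function names, the dictionary layout, and the quoted "AND" query syntax are all assumptions made for illustration; every archive's search box has its own syntax, so the output would need adapting. The archive layer for X is left empty because the text does not supply those terms.

```python
from itertools import product
from typing import Dict, List

# Layer "archive": how the archive labels records (from a finding aid or index).
# Layer "context": what past actors or the scholarly literature called the concept.
keywords: Dict[str, Dict[str, List[str]]] = {
    "X": {
        "archive": [],  # to be filled in after consulting a finding aid
        "context": [
            "local administrators", "municipal officials",
            "public service workers", "public employees", "low-level workers",
            "clerks", "district officers",
        ],
    },
    "Y": {
        "archive": ["Abkaree", "Separate Revenue"],
        "context": ["opium"],
    },
}


def all_terms(concept: str) -> List[str]:
    """Merge both layers for one concept, dropping duplicates but keeping order."""
    layers = keywords[concept]
    return list(dict.fromkeys(layers["archive"] + layers["context"]))


def build_queries(x: str = "X", y: str = "Y") -> List[str]:
    """Pair every X term with every Y term (hypothetical query syntax)."""
    return [f'"{a}" AND "{b}"' for a, b in product(all_terms(x), all_terms(y))]


queries = build_queries()
print(len(queries), "candidate searches; first:", queries[0])
```

Keeping the generated list alongside a note of which searches were actually run gives a rough record of the paths explored and those left aside.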
Archival research is a reiterative process of discovery. By refining a list of search keywords rooted in both historical and archival contexts, political scientists learn tacitly about how “their” documents fit within the broader structure of an archive’s records, and they are better able to identify survival and transfer bias. Reiteration is an investigative process, not dull repetition, especially in a digital archive. Allow yourself to be distracted, especially by unexpected words, unfamiliar concepts, and odd proper nouns that pop up while scrolling and browsing. These are the moments when chance encounters that lead to new discoveries may occur.
Being in an Archive
There is something both exhilarating and disorienting about being in an archive, physically or virtually. You have just found a 140-page archival document that seems promising, and there are so many more. What do you do? Taking notes systematically and organizing them according to original order are two seemingly mundane yet powerful strategies for addressing challenges of excess abundance.
First, design a consistent template for taking notes on each item that you consult. There is no right or wrong approach to notetaking. The template only needs to be one that can be repeated over and over again. A minimalist may include only the file’s call number, title, and a brief summary of its contents. A maximalist may add the date accessed, date of original creation, copyright restrictions, a more detailed item-by-item description of its contents, and transcribed notes. Consistent and systematic notetaking at the item level is tedious and, at first, time-consuming. However, it becomes a habit that saves time in the long run. Crucially, it establishes a cumulative record of not only what types of records and information you selected to include in the final analysis but also what was not incorporated, which helps to mitigate reinforcement bias.
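A repeatable template of this kind could be kept as a structured record and saved as a spreadsheet-readable file. The sketch below is one hedged possibility in Python. The class ItemNote, the function write_notes, and the sample values are hypothetical; the first three fields follow the "minimalist" list above, the optional fields follow the "maximalist" list, and the included_in_analysis flag is an added assumption meant to track what was consulted but not used.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Iterable, Optional, TextIO


@dataclass
class ItemNote:
    # Minimalist fields.
    call_number: str
    title: str
    summary: str
    # Maximalist fields (optional).
    date_accessed: Optional[str] = None
    date_created: Optional[str] = None
    copyright_restrictions: Optional[str] = None
    item_descriptions: Optional[str] = None
    transcribed_notes: Optional[str] = None
    # Assumption: a flag recording whether the item entered the final analysis.
    included_in_analysis: bool = False


def write_notes(notes: Iterable[ItemNote], handle: TextIO) -> None:
    """Write notes as CSV rows, one row per item, with a header line."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(ItemNote)])
    writer.writeheader()
    for note in notes:
        writer.writerow(asdict(note))


# Usage with placeholder values, written to memory rather than to disk.
buffer = io.StringIO()
write_notes(
    [ItemNote(call_number="CO/885/1/20", title="(title as catalogued)",
              summary="(brief summary)", date_accessed="2021-07-01")],
    buffer,
)
print(buffer.getvalue())
```

Because every note shares the same columns, items that were read but set aside remain visible in the same table as those that were cited.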
Second, systematic notetaking goes hand in hand with systematically organizing those notes. Consider mimicking the original order of the archive when storing notes as well as digitized copies of original documents. For instance, your 140-page document is from the UK National Archives in Kew. You located a digitized copy online and the original reference number is CO/885/1/20. Begin by creating four nested folders. Label the top-level folder “UK National Archives, Kew (digital)”; under it, create a second folder labeled CO/885, followed by a third folder labeled CO/885/1, and finally a fourth labeled “20.” You have just replicated the record hierarchy for this specific document: Item No. 20 in the Miscellaneous Files (1) in the War and Colonial Department and Colonial Office Series (885), within the Colonial Office Records (CO) (figure 5).
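The nesting described above can also be created programmatically. The following is a minimal sketch in Python, assuming a four-part reference number of the form CO/885/1/20. The function names are hypothetical, and because most file systems do not allow a slash inside a folder name, the sketch substitutes an underscore (CO_885, CO_885_1), which is an assumption and not part of the archive's own notation.

```python
import tempfile
from pathlib import Path
from typing import List


def folder_levels(archive_label: str, reference: str, sep: str = "_") -> List[str]:
    """Turn a four-part reference into the four nested folder names."""
    parts = reference.split("/")
    if len(parts) != 4:
        raise ValueError("expected a four-part reference such as CO/885/1/20")
    department, series, subseries, item = parts
    return [
        archive_label,
        sep.join([department, series]),             # e.g., CO_885
        sep.join([department, series, subseries]),  # e.g., CO_885_1
        item,                                       # e.g., 20
    ]


def make_item_folder(root: str, archive_label: str, reference: str) -> Path:
    """Create the nested folders under root and return the innermost one."""
    target = Path(root).joinpath(*folder_levels(archive_label, reference))
    target.mkdir(parents=True, exist_ok=True)
    return target


# Usage in a temporary directory, so nothing is left behind on disk.
with tempfile.TemporaryDirectory() as root:
    made = make_item_folder(root, "UK National Archives, Kew (digital)", "CO/885/1/20")
    print(made.relative_to(root))
```

Notes and downloaded copies for the document can then be stored in the innermost folder, so that the path itself records where the item sits in the archive.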
Leaving an Archive
A realization hits you as you leave an archive, whether walking out of the building or closing your web browser for the day. The documents that you consulted and the records that you accessed bear traces of the lives of others. Inevitably, the archival research you have just done is an encounter with people in history.
Digital archival research is inextricably tied to ethical considerations for political scientists. Survival, transfer, digitization, and reinforcement biases are products of how archives are not impartial repositories but rather institutions shaped by power, politics, and privilege. Remote accessibility makes it easier to forget how selective and partial an archive’s holdings are, not least because it eliminates the many inconveniences that remind researchers of its arbitrariness and incompleteness. Thus, there is a greater risk of overrepresenting the coherence of past events, processes, and human experiences because it is easier to presume that the available digitized documents capture a greater share of historical reality than is actually warranted. The technology of digitization also generates new considerations that bring the ethics of historical representation squarely into the ambit of political science’s theory-driven evidence seeking. OCR errors can result in the erasure of an individual’s trace in the archives and, conversely, digitizing microlevel data may inadvertently “find” people with vexed histories unknown to their descendants. The boundaries of copyrights also are blurry for digitized records extending to photographs.
A keen sensitivity to these issues animates emerging scholarship on archival and historical research for political science (American Political Science Association 2019, 2021; Balcells and Sullivan 2018). To bring ethical considerations to the forefront, a basic and necessary question for researchers to ponder is: Why do we choose to do what we do with archival records? Although there is no right answer, there are many different types of thoughtful responses. For some scholars, there is an impulse of social justice, of advocacy on behalf of the weaker, the voiceless, not least to rescue them from “the enormous condescension of posterity” (Thompson 1963, 12). Other scholars may seek not to speak for other people. “The intention here isn’t anything as miraculous as recovering the lives of the enslaved or redeeming the dead,” Hartman (2008, 11) wrote of her approach to the archives of the eighteenth-century transatlantic slave trade. Rather, she explained, it is about “laboring to paint as full a picture of the lives of the captives as possible” (Hartman 2008, 11). Different epistemologies may counsel archival research as a way to “better understand how local history and context can be leveraged to inform the design of better policy” or, alternatively, to gain a richer menu of counterfactual empirical realities for rigorous social-scientific inquiry (Fouka 2020; Nunn 2020, 1). In doing archival research, choices are already being made, with stakes that are amplified in the process of remotely accessing digitized records. Digital archival research gains ethical import when political scientists are able to recognize and explicitly articulate these choices.
ACKNOWLEDGMENTS
I am indebted to Jennifer Cyr, Daragh Grant, Diana Kapiszewski, N. M. Kim, Ian Kumekawa, Jean Lachapelle, Lauren MacLean, Kate McNamara, Emma Rothschild, Kyle Shen, Yuhki Tajima, and Htet Thiha Zaw for illuminating conversations and constructive comments. I also am grateful to participants in the modules for Designing and Conducting Field Research and Interpretation and History at the 2021 Summer Institute for Qualitative and Multi-Method Research (IQMR).
SUPPLEMENTARY MATERIALS
To view supplementary material for this article, please visit http://doi.org/10.1017/S104909652100192X.
CONFLICTS OF INTEREST
The author declares no ethical issues or conflicts of interest in this research.