Introduction
In this article, I present two new datasets that contain individual-level data on all candidates between 1867 and 2019 in both the Canadian federal and Ontario provincial elections. In total, the data provide information on 44,462 candidates at the federal level and 15,529 in the Ontario provincial elections (footnote 1). While there are existing studies on political elites in Canada, until now there were no longitudinal datasets covering all candidates going back to 1867.
These data will allow researchers to investigate a number of important topics, including: the share of female candidates over an extended period of time; which occupations do better in politics (Sevi, Blais and Mayer, 2020); whether women get fewer votes than men (Sevi, Arel-Bundock and Blais, 2018; Sevi, Blais and Arel-Bundock, 2021); how long politicians tend to stay in politics; the advantage gained by incumbency; how well independents do; the consequences for politicians who are elected under a party banner and then switch their party affiliation and run again either at the same level of government (Sevi, Yoshinaka and Blais, 2018) or across two different levels (provincial and federal); the progressive ambition of politicians across different levels of elections; and whether by-elections are more favourable to smaller parties and/or independent candidates.
Data Collection
The original sources for these datasets are the Library of Parliament of Canada and Elections Ontario. The federal data come from the Library of Parliament and consist of the names of all the candidates, the date of the election, the number of ballots cast, the occupations of the candidates, and the name of the constituency, province and party affiliation. I manually recorded these data twice between 2014 and 2017 (footnote 2).
I then completed an extensive cleaning of the data, using candidate websites, historical newspaper archives (footnote 3) and other journalists’ summaries of the candidates; in doing so, I also added variables that record the gender of all candidates, the birth year of elected politicians (footnote 4) and whether the candidates were acclaimed or switched parties after being elected. I also included the parliament number, and I calculated the percentage of the vote obtained by each candidate, as well as whether the candidate was elected. Moreover, I assigned unique IDs to all candidates and matched the IDs of the same individuals over time. Unique IDs help mitigate several problems. First, many candidates’ names were not spelled consistently across elections in the Library of Parliament database; for example, the same name could be given as John A., J. A., Sir John, or J. Second, different candidates sometimes have the same name. Third, in earlier elections, the same candidate may have run in different ridings in different years, and sometimes even in the same year.
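The derived vote-percentage and elected variables described above can be illustrated with a minimal sketch. It is not the procedure used to build the datasets; the field names (`election`, `constituency`, `votes`, `acclaimed`) and the sample rows are hypothetical, and it assumes a single winner per contest, which would need adjusting for any multi-member constituency.

```python
from collections import defaultdict


def add_vote_share(rows):
    """Return copies of rows with 'vote_share' (per cent) and 'elected' added.

    Assumes one winner per contest. Acclaimed candidates, who face no vote,
    get a vote share of None and are marked as elected. Tied leaders would
    all be marked as elected, so ties need manual review.
    """
    totals = defaultdict(int)  # ballots cast per contest
    top = defaultdict(int)     # highest vote count per contest
    for r in rows:
        key = (r["election"], r["constituency"])
        totals[key] += r["votes"]
        top[key] = max(top[key], r["votes"])

    out = []
    for r in rows:
        key = (r["election"], r["constituency"])
        total = totals[key]
        share = round(100 * r["votes"] / total, 2) if total else None
        elected = r["acclaimed"] or (total > 0 and r["votes"] == top[key])
        out.append({**r, "vote_share": share, "elected": elected})
    return out


# Toy rows, not taken from the datasets.
sample = [
    {"election": "E1", "constituency": "Riding A", "candidate_id": 1,
     "votes": 600, "acclaimed": False},
    {"election": "E1", "constituency": "Riding A", "candidate_id": 2,
     "votes": 400, "acclaimed": False},
    {"election": "E1", "constituency": "Riding B", "candidate_id": 3,
     "votes": 0, "acclaimed": True},
]
result = add_vote_share(sample)
```

Grouping by election and constituency, rather than by constituency alone, matters because constituency names recur across elections.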
To assign unique IDs, I looked up every candidate's profile in the Library of Parliament and used alternative biographical information to triangulate their identity. Because I manually recorded the unique IDs, I was also able to create an incumbent variable that indicates whether the candidate ran in the previous election. Party names are also spelled differently across elections (for example: Liberal, Liberal Party of Canada, Liberal Progressive, Opposition, Opposition/Laurier Liberals, and so on), so a similar treatment was necessary. I give researchers the option to use either the 155 unique party names or the categories I created, which group similar parties as well as parties whose names vary across elections. All the data were independently checked at another time to ensure accuracy.
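Once every candidate carries a stable unique ID, a variable of this kind reduces to a lookup against the preceding election. The sketch below is only an illustration under stated assumptions: the field names, the election labels and the function name are hypothetical, and it follows the definition given above (the candidate ran in the previous election), not a stricter one based on having won.

```python
def flag_previous_candidacy(records, election_order):
    """Add 'ran_previous': True if the same candidate_id appears in the
    election immediately before the current one.

    records: list of dicts with 'candidate_id' and 'election'.
    election_order: election labels in chronological order.
    """
    position = {label: i for i, label in enumerate(election_order)}

    # Set of candidate IDs observed in each election.
    ran_in = {}
    for r in records:
        ran_in.setdefault(r["election"], set()).add(r["candidate_id"])

    flagged = []
    for r in records:
        i = position[r["election"]]
        previous = election_order[i - 1] if i > 0 else None
        ran_before = (
            previous is not None
            and r["candidate_id"] in ran_in.get(previous, set())
        )
        flagged.append({**r, "ran_previous": ran_before})
    return flagged


# Toy records, not taken from the datasets.
records = [
    {"candidate_id": 1, "election": "E1"},
    {"candidate_id": 2, "election": "E1"},
    {"candidate_id": 1, "election": "E2"},
    {"candidate_id": 3, "election": "E2"},
    {"candidate_id": 2, "election": "E3"},
]
flags = [r["ran_previous"]
         for r in flag_previous_candidacy(records, ["E1", "E2", "E3"])]
```

Matching on the ID rather than on the name is what makes the flag robust to spelling variants and to candidates who change ridings.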
Researchers may reasonably have concerns about the quality of my federal dataset, given that it was gathered manually. To address these concerns, I re-collected all the variables after six months of not touching the dataset and merged them with my initial data collection. This second pass was an opportunity to verify the first.
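A double-entry check of this kind can be sketched as a field-by-field comparison of the two collection passes. The example below is a hedged illustration, not the procedure actually used: the record keys, field names and values are hypothetical, and it assumes both passes can be keyed on the same identifying tuple.

```python
def compare_passes(first, second, fields):
    """List disagreements between two independent data-collection passes.

    first, second: dicts mapping a record key to a dict of field values.
    Returns tuples of (key, field, value_in_first, value_in_second).
    A record present in only one pass is reported once, under the
    pseudo-field '<missing record>'.
    """
    mismatches = []
    for key in sorted(first.keys() | second.keys()):
        a, b = first.get(key), second.get(key)
        if a is None or b is None:
            mismatches.append((key, "<missing record>", a, b))
            continue
        for field in fields:
            if a.get(field) != b.get(field):
                mismatches.append((key, field, a.get(field), b.get(field)))
    return mismatches


# Toy passes, not taken from the datasets.
first_pass = {
    ("E1", "Riding A", "Smith"): {"votes": 600, "party": "Liberal"},
    ("E1", "Riding A", "Jones"): {"votes": 400, "party": "Conservative"},
}
second_pass = {
    ("E1", "Riding A", "Smith"): {"votes": 600, "party": "Liberal"},
    ("E1", "Riding A", "Jones"): {"votes": 410, "party": "Conservative"},
    ("E1", "Riding B", "Lee"): {"votes": 0, "party": "Independent"},
}
issues = compare_passes(first_pass, second_pass, ["votes", "party"])
```

Each reported disagreement still has to be resolved against the original source; the comparison only narrows down where to look.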
Ontario is the province in Canada where the constituencies and parties are most similar to those at the federal level; therefore, in 2019, I collected similar data from 1867 to 2018 for the Ontario provincial elections. The original PDF documents retrieved from Elections Ontario contained each candidate's name, party and constituency, the date of the election and the number of ballots cast. To this dataset, I added gender by first making use of the R package genderizeR (Wais, 2016) (footnote 5), which infers the gender of candidates by analyzing first names. genderizeR is based on the genderize.io API, which is a web scraping tool (http://genderize.io). It provides a likely gender and a probability score for each candidate. I kept the probabilities for all entries, not simply those close to 100 per cent, and then verified each entry on two different occasions. Both the genderizeR probabilities and my manual check are retained in the Ontario dataset. I made a total of 2,376 corrections to the gender variable (footnote 6).
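Keeping the inferred gender, its probability and the manually verified value side by side makes it straightforward to audit how often name-based inference was overridden. The sketch below assumes hypothetical field names (`predicted`, `probability`, `verified`) and invented values; it does not call the genderize.io service and does not reproduce the counts reported above.

```python
def summarize_gender_checks(entries, threshold=0.95):
    """Count manual corrections to name-based gender predictions.

    An entry counts as a correction when the verified value differs from
    the prediction, including cases where no prediction was returned.
    'threshold' (an arbitrary illustrative cut-off) separates out
    corrections made despite a high predicted probability.
    """
    corrections = [e for e in entries if e["predicted"] != e["verified"]]
    high_confidence = [
        e for e in corrections
        if e["probability"] is not None and e["probability"] >= threshold
    ]
    return {
        "total": len(entries),
        "corrections": len(corrections),
        "high_confidence_corrections": len(high_confidence),
    }


# Toy entries with invented predictions and probabilities.
entries = [
    {"name": "Alex", "predicted": "male", "probability": 0.62, "verified": "female"},
    {"name": "Mary", "predicted": "female", "probability": 0.99, "verified": "female"},
    {"name": "Jean", "predicted": "female", "probability": 0.97, "verified": "male"},
    {"name": "Xq", "predicted": None, "probability": None, "verified": "male"},
]
summary = summarize_gender_checks(entries)
```

The high-confidence count illustrates why checking every entry, rather than only the uncertain ones, can matter: a name can be predicted confidently and still be wrong for a particular person.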
Applications
The data presented in this paper allow political scientists to better understand political phenomena in Canada. They can be used to replicate existing studies on political elites in Canada and verify whether patterns hold over an extended period. These studies cover questions such as minority representation (Black and Erickson, 2006; Black, 2013), women's representation (Blais and Gidengil, 1991; Hunter and Denton, 1984; Thomas and Bodet, 2013; Tremblay and Trimble, 2006; Tolley, 2011) and incumbency advantages (Kendall and Rekkas, 2015). Much of the existing literature on elites in Canada has examined a few elections or a few variables because longitudinally rich data were previously not collected or available. My data are the first to contain information on all candidates in all the elections since 1867 and therefore make available a unique tool for researchers focusing on political elites in Canada. As an example, there is a large body of literature on the inclusion of women in politics (Blais and Gidengil, 1991; Stockemer, 2017; Trimble and Arscott, 2003; Trimble et al., 2013; Tolley, 2011); however, my data are the first to offer longitudinal data on the number of female candidates and their success compared to their male counterparts.
Figure 1 highlights four applications of my federal dataset that are of particular relevance. Given that the federal dataset is unique not only in terms of the variables collected but also in its longitudinal nature, it can be used for all types of exploratory and descriptive questions. First, in the upper left corner, we see that the mean number of candidates per constituency has increased over time, although it has decreased since 1997, with an uptick in the 2019 election. Second, the share of incumbents has also decreased over time. This decrease can partially be explained by the increase in the number of candidates and the increasing competitiveness of elections, but it is worth exploring further in future studies. Third, the share of elected women in Parliament has increased from .004 in 1921, when the first woman was elected to Parliament, to .29 in 2019. Fourth, the federal dataset will be useful for researchers interested in the age of Members of Parliament (MPs) over time. The mean age of elected MPs varies between 45 and 53. There is a notable decline and rebound in the mean age, reaching a minimum in the 1970s, which is worth exploring further.
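Descriptive series of this kind are simple aggregations over candidate-level rows. As a minimal sketch, assuming hypothetical field names (`election`, `gender`, `elected`) and toy values that are not drawn from the datasets, the share of women among elected candidates in each election could be computed as follows.

```python
from collections import defaultdict


def share_women_elected(rows):
    """Return {election: share of elected candidates who are women}.

    Elections with no elected candidates in the input are omitted.
    """
    elected = defaultdict(int)
    women = defaultdict(int)
    for r in rows:
        if r["elected"]:
            elected[r["election"]] += 1
            if r["gender"] == "F":
                women[r["election"]] += 1
    return {e: women[e] / elected[e] for e in sorted(elected)}


# Toy rows: election 1 elects four candidates (one woman),
# election 2 elects two candidates (one woman).
rows = [
    {"election": 1, "gender": "M", "elected": True},
    {"election": 1, "gender": "M", "elected": True},
    {"election": 1, "gender": "F", "elected": True},
    {"election": 1, "gender": "M", "elected": True},
    {"election": 1, "gender": "F", "elected": False},
    {"election": 2, "gender": "F", "elected": True},
    {"election": 2, "gender": "M", "elected": True},
    {"election": 2, "gender": "M", "elected": False},
]
shares = share_women_elected(rows)
```

The same pattern, with a different numerator and denominator, would give the mean number of candidates per constituency or the share of incumbents.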
These data are flexible in that it is possible to create many more variables or to merge them with other existing data. The data could be merged with existing datasets using candidate and constituency names. For example, a researcher could merge the federal dataset with data from the Canadian Census at the level of federal electoral districts to examine the characteristics of MPs’ constituencies compared to MPs themselves. The data could also be combined with data on nomination contests to study how parties select and/or appoint candidates, or merged with administrative data to study campaign contributions across gender (Tolley et al., 2020). Comparative analyses are also possible given that similar datasets exist for other countries (see Klarner, 2018; Kollman et al., 2019; Yoshinaka, 2016). Finally, the data can be used as a teaching tool in introductory courses in Canadian politics, as well as in research methods.
Conclusion
I have introduced two new datasets that are the largest available datasets on candidates in Canadian and Ontario elections, respectively. My data cover the period 1867–2019 and contain detailed information on candidates in both federal elections and Ontario provincial elections. These are unique data that can be used by researchers as well as nonacademic stakeholders such as journalists to address substantive research questions about gender, incumbency, the careers of candidates over time, and so on. I created uniform datasets with standardized information about all candidates who ran in elections in Canada federally and in the province of Ontario.
So far, these data have been used for five peer-reviewed manuscripts (Sevi, Yoshinaka and Blais, 2018; Sevi, Arel-Bundock and Blais, 2018; Sevi, Blais and Mayer, 2020; Tolley, Besco and Sevi, 2020; Sevi, Blais and Arel-Bundock, 2021); they have also been used in a number of presentations at academic conferences, in reports by think tanks (footnote 7), by journalists (footnote 8), and in many papers currently online or in the pipeline. As such, these data have been scrutinized by different researchers.
The data are available on the Harvard Dataverse here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ABFNSQ
Acknowledgments
I am grateful to James N. Druckman for encouraging me to write this research note and to André Blais and Ruth Dassonneville for their feedback and comments. I would also like to thank all my co-authors using these datasets: Vincent Arel-Bundock, Randy Besco, André Blais, Danielle Mayer, Erin Tolley and Antoine Yoshinaka. Thank you also to Christopher Cochrane, Jean-François Godbout and Can Mekik for encouraging me to embark on the journey of collecting all this data when it was still an idea; I could not have done this without you! Finally, I would like to thank Cameron Anderson and the journal's three anonymous reviewers for their helpful comments and suggestions.