1 Introduction
The creation of majority-minority districts for underrepresented racial minorities remains a key point of contention within the field of redistricting and representation. There are the constant dangers of “packing” racial minorities into too few districts and minimizing their influence within the legislature, or “cracking” racial minorities into districts with no representatives of the same race. Striking the correct balance is a concern that is not only normative and theoretical, but also methodological. It is difficult to identify the optimal racial composition of districts that avoids wasting the votes of racial minorities.
Typically, to address this problem, researchers have turned to ecological inference (EI) methodology to estimate both the electoral turnout by race and electoral preference of those who turn out given the demographics of an area (Goodman 1953; King 1997). This allows mapmakers to calculate the racial composition needed in districts to allow minorities to elect their candidate of choice. Using Census data alone, however, does not account for differential levels of registration or turnout rates across racial groups, both of which can significantly affect racial minorities’ political influence. Even the best EI methods coupled with on-the-ground qualitative research are far from perfect, and often necessarily lead to district plans “erring” on the side of packing minorities into districts so as to avoid unintentional cracking (Hicks et al. 2018).
In some cases, race data on voters contained in states’ voter registration files can aid in the creation of majority-minority districts. Voter registration files contain the set of registered voters, and often individual-level voter history. Therefore, it is possible to estimate an individual-level likely turnout model by race using a voter registration file, replacing or supplementing the EI-produced turnout estimates used to estimate vote choice (King 1997, 92–94). However, many states do not collect individual race data in their voter files, including states like Texas, Pennsylvania, and Wisconsin, where there is often contentious redistricting litigation.
A recent development in imputing missing voter race data is Bayesian Improved Surname Geocoding (BISG) estimation. First implemented in the field of public health by Elliott et al. (2008), BISG calculates the joint probability of racial membership given surname and geographic residence. Imai and Khanna (2016) find that the joint information of surname and residence reduces the bias and errors of estimated race and turnout beyond that of even advanced EI methods by a factor of 10 (p. 270). Therefore, the relatively new BISG methodology strictly dominates the turnout-stage estimates present within EI, which is already employed within redistricting and associated litigation. However, to date, there has been no research validating the extent to which BISG can be used to construct accurate estimates of the racial composition of proposed legislative districts.
In this letter, we estimate and proffer the best practices and baseline uncertainty for BISG in the context of redistricting. We first implement BISG on voter file data using two posterior allocation methods: polygon-aggregated probability summed method (PSM) estimates versus individual-level plurality method (PM) assignment. We show that at the precinct level, using PSM results in significantly lower error rates relative to PM, particularly when estimating the Black share of precinct populations, as required by the Voting Rights Act (VRA) within North Carolina and Georgia. We then estimate district-level uncertainty around each BISG method by simulating 10,000 congressional district plans for each state and comparing BISG estimates of the racial composition of districts to actual voter file data. Using PSM, BISG district-level estimates of the share of minority voters in districts typically fall within five percentage points of self-reported voter file racial data, although the magnitude of the errors varies across states and racial groups. These findings demonstrate that summing probabilities produces better precinct- and district-level racial composition estimates relative to plurality assignment.
2 Using BISG in Redistricting
BISG uses an individual’s surname and location to estimate their race via Bayes’ rule (Elliott et al. 2008; Imai and Khanna 2016). Using individuals’ surnames matched to a surname dictionary as the prior, joined to Census geography demographics for the conditional probability, produces more accurate racial estimates relative to other methods (Imai and Khanna 2016). While the errors tend to be greatest where surnames are uninformative and geographic units heterogeneous by race (Imai and Khanna 2016; King 1997), BISG greatly reduces the number of individuals afflicted by such uncertainty. As long as subcounty units are employed as the geography, BISG racial estimates outperform alternative methods when verified against states with race in their voter files (Clark, Curiel, and Steelman 2021; Imai and Khanna 2016). The benefits of BISG have therefore earned it widespread use within political science, such as estimating the race of political donors (Alvarez, Katz, and Kim 2020; Grumbach, Sahn, and Staszak 2020), candidate emergence (Conroy and Green 2020), and minority candidate performance (Shah and Davis 2017). These developments in BISG, which were not available at the time of the 2010 redistricting cycle, offer an opportunity to efficiently incorporate voter race information to evaluate majority-minority districts in the current cycle.
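To make the Bayes' rule step concrete, the following is a minimal Python sketch, not the authors' R implementation. It assumes the standard BISG simplification that surname and location are conditionally independent given race, so the posterior is proportional to the surname-based prior times the probability of living in the geographic unit given race. The function name, the race categories, and all numbers are hypothetical and chosen only for illustration.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine a surname prior with a geographic likelihood via Bayes' rule.

    p_race_given_surname: dict race -> P(race | surname), from a surname dictionary
    p_geo_given_race:     dict race -> P(geography | race), from Census demographics
    Returns dict race -> P(race | surname, geography).
    """
    # Numerator of Bayes' rule for each race
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("All joint probabilities are zero; posterior undefined.")
    # Normalize so the posterior sums to one
    return {race: value / total for race, value in unnormalized.items()}


# Hypothetical inputs: a surname that leans White, in a ZIP code where
# Black residents are relatively concentrated.
prior = {"white": 0.60, "black": 0.35, "other": 0.05}
likelihood = {"white": 0.001, "black": 0.004, "other": 0.002}

posterior = bisg_posterior(prior, likelihood)
for race, prob in posterior.items():
    print(f"{race}: {prob:.3f}")
```

In this made-up case the geographic information shifts the estimate toward the Black category even though the surname alone favors White, which is the kind of updating the paragraph above describes.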
One pressing question before applying BISG to redistricting is ascertaining the degree of error given how the researcher assigns racial categories from BISG posterior probability estimates. Clark et al. (2021) follow the practice of summing the estimated probabilities that an individual is of a given race up to a geographic unit of interest, such as a precinct. However, some scholars, such as Enos, Kaufman, and Sands (2019), assign a single race to a voter given the racial category with the highest estimated probability, also known as deterministic, modal, or pluralistic assignment. The work by Enos et al. (2019) avoids substantial error by relying on segregated Los Angeles (with extremely homogeneous precincts by race), and also by dropping observations where the predicted posterior for the plurality race is under 90%. Crabtree and Chykina (2018), Rhinehart and Geras (2020), Lu et al. (2019), Abott and Magazinnik (2020), and Grumbach et al. (2020) all employ pluralistic assignment of BISG race estimates.
Plurality assignment goes against best practices within population-level social sciences (King 1997) given the potential for extreme and clustered errors. Normally, plurality assignment of race would not be considered for redistricting. In the aforementioned studies, scholars required whole assignment of their observations to a single racial group due to their research design, or relied upon statistical packages that defaulted to such assignment (Lu et al. 2019, 465). Plurality assignment might also appeal to redistricting practitioners; knowing each individual voter’s race could allow for more sophisticated voter targeting while redistricting. However, the utility of either method depends on its accuracy when applied to redistricting. Therefore, it is important to assess the relative accuracy of both the plurality assignment and summing probabilities in the redistricting context. For a more technical explanation of BISG and the differences between PSM and PM, see Section A of the Supplementary Material.
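The contrast between the two allocation rules can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the procedure from the Supplementary Material: the function names, the three race categories, and the voter posteriors are all hypothetical. PSM sums each voter's posterior probabilities within a precinct, while PM first assigns each voter to their highest-probability race and then counts.

```python
from collections import defaultdict

RACES = ("white", "black", "other")  # hypothetical category set


def psm_shares(voters):
    """Probability summed method: sum posteriors by precinct, then normalize."""
    sums = defaultdict(lambda: dict.fromkeys(RACES, 0.0))
    counts = defaultdict(int)
    for precinct, probs in voters:
        counts[precinct] += 1
        for race in RACES:
            sums[precinct][race] += probs[race]
    return {p: {r: sums[p][r] / counts[p] for r in RACES} for p in sums}


def pm_shares(voters):
    """Plurality method: assign each voter their modal race, then count."""
    tallies = defaultdict(lambda: dict.fromkeys(RACES, 0))
    counts = defaultdict(int)
    for precinct, probs in voters:
        modal_race = max(RACES, key=lambda r: probs[r])
        tallies[precinct][modal_race] += 1
        counts[precinct] += 1
    return {p: {r: tallies[p][r] / counts[p] for r in RACES} for p in tallies}


# Ten hypothetical voters in one racially mixed precinct, each with the
# same moderately uninformative posterior.
voters = [("P1", {"white": 0.55, "black": 0.40, "other": 0.05})] * 10

psm = psm_shares(voters)
pm = pm_shares(voters)
print("PSM Black share:", round(psm["P1"]["black"], 3))
print("PM Black share: ", round(pm["P1"]["black"], 3))
```

In this constructed example PM reports no Black voters at all because no individual posterior crosses the plurality threshold, whereas PSM preserves the 40% implied by the posteriors. This is one way the clustered errors mentioned above can arise in heterogeneous units.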
3 BISG Validation and Simulation Results
We validate the application of BISG using two states with racial information in their voter files: North Carolina and Georgia. These states also require Black majority-minority districts at the congressional level. We implemented BISG using the R package zipWRUext (Clark et al. 2021), which uses surname and ZIP code demographics to calculate the joint probability of race for individuals. While not as accurate as using addresses matched to Census block data for non-Black racial minority estimates, zipWRUext allows us to quickly produce accurate estimates of the predicted race of each voter using ZIP codes without having to undergo a costly and time-consuming geocoding process. Section B of the Supplementary Material describes our data in more detail, and we perform diagnostics on the individual-level race BISG predictions in Section C of the Supplementary Material.
Next, we estimate the proportion of Black and White voters in each precinct, using both the PSM and the PM assignment procedures. These estimates are benchmarked to the actual self-reported racial data within the voter files. Figure 1 shows a density plot of the precinct-level errors in racial estimates, calculated as the absolute percentage difference between the BISG estimated and true reported number of voters of each race. We plot the results separately for both North Carolina and Georgia. For White voters, the modal error approaches zero for both BISG assignment methods, although PM has a longer right tail, indicating worse performance relative to PSM. For Black voters, PSM vastly outperforms PM in reducing precinct-level errors. From these precinct-level results alone, we confidently recommend PSM as the preferred method when estimating the racial composition of precincts by using BISG on voter files.
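The precinct-level benchmark can be illustrated with a short Python sketch. It takes one plausible reading of the error measure, namely the absolute difference between estimated and self-reported counts expressed as a percentage of all voters in the precinct; the exact denominator used in the paper is an assumption here. The function name and every number below are hypothetical.

```python
def precinct_abs_errors(est_counts, true_counts, totals):
    """Absolute error per precinct, as a percentage of precinct voters.

    est_counts:  dict precinct -> BISG-estimated number of voters of one race
    true_counts: dict precinct -> self-reported number of voters of that race
    totals:      dict precinct -> total voters in the precinct
    """
    errors = {}
    for precinct, n_voters in totals.items():
        if n_voters == 0:
            continue  # skip empty precincts rather than divide by zero
        gap = abs(est_counts[precinct] - true_counts[precinct])
        errors[precinct] = 100.0 * gap / n_voters
    return errors


# Hypothetical counts of Black voters in two precincts
totals = {"P1": 1000, "P2": 800}
true_black = {"P1": 400, "P2": 80}
psm_black = {"P1": 385.0, "P2": 96.0}  # summed probabilities (made up)
pm_black = {"P1": 250, "P2": 40}       # plurality head counts (made up)

psm_err = precinct_abs_errors(psm_black, true_black, totals)
pm_err = precinct_abs_errors(pm_black, true_black, totals)
for precinct in totals:
    print(precinct, "PSM:", psm_err[precinct], "PM:", pm_err[precinct])
```

Collecting these per-precinct values for each method and race gives the kind of error distribution that a density plot such as Figure 1 summarizes.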
To evaluate the accuracy of BISG estimates of race at the district level, we perform 10,000 redistricting simulations for each of North Carolina’s and Georgia’s congressional district maps using the redist package in R (Fifield et al. 2020b), version 2. We craft a base map from the precinct simplified map employed by Curiel and Steelman (2018) for North Carolina, and do the same using Georgia’s precinct shapefile for their 2010 congressional districts. We then proceed to simulate districts via rook contiguity. For each simulated plan, we calculate the absolute percentage point difference between the BISG estimated proportion and the actual voter file proportion of each race in each district. We use our simulations to observe a distribution of district-level errors and create a 95% confidence interval around these estimates.
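The district-level error calculation can be sketched as follows in Python. This is a simplified stand-in under loud assumptions: plans are drawn by randomly dealing precincts into districts, which ignores contiguity, compactness, and population balance and is not what the redist package does. The 95% interval is taken as the 2.5th to 97.5th percentile of errors across plans, which is one plausible way to summarize the simulated distribution. All names and data are hypothetical.

```python
import random


def district_errors(plan, est_counts, true_counts, totals):
    """Absolute percentage point error in one race's share, per district."""
    agg = {}
    for precinct, district in plan.items():
        est, true, n = agg.get(district, (0.0, 0.0, 0))
        agg[district] = (
            est + est_counts[precinct],
            true + true_counts[precinct],
            n + totals[precinct],
        )
    return {d: 100.0 * abs(est - true) / n for d, (est, true, n) in agg.items()}


def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 1]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q
    low = int(pos)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (pos - low)


def random_plan(precincts, n_districts, rng):
    """Toy plan: shuffle precincts and deal them out (no contiguity check)."""
    shuffled = precincts[:]
    rng.shuffle(shuffled)
    return {p: i % n_districts for i, p in enumerate(shuffled)}


rng = random.Random(0)
precincts = [f"P{i}" for i in range(40)]
totals = {p: rng.randint(500, 1500) for p in precincts}
true_black = {p: totals[p] * rng.random() for p in precincts}
# Hypothetical BISG estimate: truth plus noise, clipped to a valid count
est_black = {
    p: min(totals[p], max(0.0, true_black[p] + rng.gauss(0, 0.03) * totals[p]))
    for p in precincts
}

errors = []
for _ in range(200):  # the paper uses 10,000 plans per state
    plan = random_plan(precincts, 4, rng)
    errors.extend(district_errors(plan, est_black, true_black, totals).values())

lower, upper = percentile(errors, 0.025), percentile(errors, 0.975)
print(f"95% of district errors fall between {lower:.2f} and {upper:.2f} points")
```

Because precinct-level over- and under-estimates partly cancel when summed to districts, errors in a setup like this tend to shrink with aggregation, which is consistent with the motivation for evaluating accuracy at the district level separately.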
The error rates and confidence intervals for North Carolina and Georgia are plotted in Figure 2, for both White and Black voters. The x-axis is the share of the district population that is White when plotting the absolute error for White voters (plots (a) and (b)), and the share of the district population that is Black when plotting the absolute error for Black voters (plots (c) and (d)). In nearly all district-level estimates, summing the estimated probabilities (PSM) results in significantly lower absolute error rates relative to the PM, consistent with the precinct-level diagnostics.
While the error rates for PSM are low in general, they vary both across states and across racial groups. In North Carolina, the errors for PSM are close to zero for the percentage of White voters in each district, and never go above five percentage points for the percentage of Black voters in each district. In Georgia, the PSM error rates are slightly higher—for White voters, they max out around 10 percentage points, but for Black voters, the error rates are lower and, like North Carolina, peak around 5 percentage points. These simulation results further suggest that using the BISG PSM rather than the PM of race assignment in redistricting work will produce more accurate estimates of the racial composition of districts.
4 Discussion
As simulations become more common in redistricting (Fifield et al. 2020a), and as the new redistricting cycle progresses without the previous protections of VRA preclearance, BISG has the potential to help researchers construct optimal majority-minority districts. This can be especially useful in states where voter race data are missing from voter files. Our letter performs the first empirical validation of applying BISG in redistricting, and provides a set of simple recommendations and guidelines for researchers using BISG in redistricting analyses.
First, researchers using BISG should aggregate up to some polygonal unit of interest by summing the estimated probabilities of racial membership. Although it might be tempting to assign race to single voters in order to aid in point-based redistricting attempts, the errors will be drastically higher. Second, researchers should be prepared to deal with around a 5–10 percentage point error rate in estimating race at the district level.
In states where voter race is not collected, BISG offers a quick and fairly accurate work-around. However, the electoral context matters. Insofar as electoral preferences can be divided between White and non-White categories, such as in the drawing of coalition districts, BISG reaches high levels of accuracy. When researchers need to estimate the district composition of a specific racial minority group, such as Black or Hispanic voters, the potential for greater error should be considered.
Lastly, we show that it is possible to achieve these BISG estimates at a relatively low cost via modern BISG packages in programs such as R. Imai and Khanna (2016) greatly expanded the ease of integrating Census data and surname dictionaries for BISG, and Clark et al. (2021) demonstrate the ability to attain accurate estimates using ZIP codes while avoiding the need to geocode altogether. We use these new methods here to provide accurate race estimates for millions of voters in just a couple of minutes, and we demonstrate heterogeneity in errors associated with BISG that researchers should be aware of.
Future work should look at the accuracy of BISG and redistricting in states where non-Black racial minorities and Hispanic voters make up a more significant share of the electorate. Because the two biggest racial categories in North Carolina and Georgia are Black and White, and due to significant residential segregation in each state, BISG will produce more accurate estimates in these states relative to states with more heterogeneity in racial group demographics. Other work can and should try to incorporate BISG estimates with differential turnout across racial groups from voter history (which is often contained in voter files) to create majority-minority districts.
Legislators, researchers, and everyday citizens will have access to a whole new set of quantitative tools during the 2020 redistricting cycle. Many of these tools and methods are aimed at reducing partisan and racial biases in maps to promote more fair and equal representation. However, these tools and methods can still produce biased or inefficient districts if voter race data themselves are unrepresentative of the actual electorate. Our letter helps to reduce the errors in estimating aggregate racial data, and can assist mapmakers using these new quantitative tools to create efficient majority-minority districts.
Acknowledgment
We would like to thank the three anonymous reviewers and the Editor (Jeff Gill) for their thoughtful comments and discussion.
Data Availability Statement
Replication materials can be found on Harvard Dataverse at Curiel and DeLuca (2022). For privacy reasons, personal identifying information from the voter files is redacted from the replication materials.
Supplementary Material
For supplementary material accompanying this paper, please visit https://doi.org/10.1017/pan.2022.14.