Policy Significance Statement
A growing and robust community is deploying technologies and analytics to address public policy challenges. This landscape review highlights historical trends and priority areas for Data for Policy Area 2: Technologies and Analytics. We review characteristics of submissions from academic and nonacademic authors, comment on relationships from data collection to decision-making, and document advances in policy analytics related to machine learning, the internet-of-things, digital twins, and distributed ledger system technologies.
1. Introduction
Since 2015, Data for Policy has established itself as a leading global forum for cross-sectoral and interdisciplinary exchange on digital revolutions in policy-making (Verhulst et al., Reference Verhulst, Engin and Crowcroft2019). The conference and its partner journal, Data and Policy, serve a diverse community network that spans a range of disciplines and sectors. The journal has organized this field into six focus areas to capture the emerging trends shaping global discussion (Engin et al., Reference Engin, Gardner, Hyde, Verhulst and Crowcroft2024). Here, we present a report for Focus Area 2: Technologies and Analytics, which expands on both established and new data streams from personal, proprietary, administrative, and public sources and surveys the current landscape of analytical technologies and challenges for both practitioners and researchers.
A community of research and practice has emerged around the deployment of data processing technologies and analytical tools for evidence-based policy-making (Kim et al., Reference Kim, Trimi and Chung2014; Suominen and Hajikhani, Reference Suominen and Hajikhani2021). This community is increasingly using data analytics to generate field evidence and large-scale case studies to inform policymaking on various societal challenges (Verhulst et al., Reference Verhulst, Engin and Crowcroft2019; Mergel et al., Reference Mergel, Rethemeyer and Isett2016; Anshari et al., Reference Anshari, Almunawar and Lim2018). Drawing on foundational knowledge from computational social science (CSS) (Lazer et al., Reference Lazer, Pentland, Watts, Aral, Athey, Contractor, Freelon, Gonzalez-Bailon, King, Margetts, Nelson, Salganik, Strohmaier, Vespignani and Wagn2020) and related fields at the data science-policy interface, this community includes collaborators from academia, industry, and government. In this rapidly evolving field, the collection of technologies and platforms carries a breadth of social implications, given that technology developments occur more rapidly than use regulations can be established. In this landscape review, we discuss policy data interactions and topics established within Data for Policy over the last few years.
We highlight three fundamental challenges for the implementation of data tools within policy analysis. First, a greater focus on causality is needed where the objective is to uncover the effects of policy changes on a given population. We have seen increased adoption of experimental methods (such as randomized controlled trials) in policy studies, but their use still lags behind traditional approaches (Angrist and Pischke, Reference Angrist and Pischke2010). With access to larger datasets, especially in the context of digital trace data, we expect a greater emphasis on new methods that can be combined with machine learning (ML) to disentangle causality in observational datasets with potentially many more variables than observations (Athey and Imbens, Reference Athey and Imbens2017). These integrative approaches can also offer valuable insights into understanding heterogeneity, which can bring us closer to estimating individual causal effects and to meaningfully distinguishing between personalized and population-based decision-making (Athey and Imbens, Reference Athey and Imbens2016; Mueller and Pearl, Reference Mueller and Pearl2023; Wager and Athey, Reference Wager and Athey2018). To overcome challenges in applying causal theories or targeting policy interventions, the Data for Policy community is increasingly encouraging counterfactual thinking, especially by leveraging a combination of both experimental and observational data.
Second, to translate insights from massive quantities of data, the community is increasingly engaging specialists in cross-sector collaborations. This includes data scientists and managers in public or private organizations who have decision authority in data collection platforms and governance, as well as participatory research activities engaging communities affected by policy or technology decisions. For example, of Data for Policy Conference Focus Area 2 submissions for 2021 and 2022, one-fourth of the 129 authors represented cross-sector authorship (e.g. academic-industry, academic-government, academic-non-governmental organizations (NGOs)) as is depicted in Figure 1. We argue that this scientific model of collaborative research is important to accelerate translational research, provide testbeds for learning, develop use cases for data innovations, and meaningfully bridge competing cultures between research and practice.

Figure 1. Data for Policy Contributing Authors for Focus Area 2: Technologies and Analytics (2021–2022). The majority (60 percent) of the 129 submitting authors were solely from academic institutions, while one-fourth represented cross-sector authorship (academic-government, academic-industry, and/or academic-NGO scientific collaborations), 9 percent were government authors, and 6 percent were NGO/Industry authors only.
Third, considering the rate of data innovations in the private sector by platform owners, aggregators, and intermediaries, there is a constant interplay between increasing data access (and therefore greater capabilities to analyze human behavior) and increasing data protections. A classification of submitted Area 2 abstracts to Data for Policy 2021 revealed the most relevant concepts to be data mining, big data analytics, and data protection challenges, which is consistent with this relationship. These keywords were derived directly from the submitted user abstracts and were not predetermined by the journal.
Figure 2 conceptualizes this typical scaffolding between the data, the analytical tools and technologies used, and the applications that support evidence-based decision-making. The middle layer, which represents Technologies and Analytics, is built upon increasingly complex data infrastructure and is constantly evolving. Specifically, this paper expands upon ML, the Internet of Things (IoT), digital twins, and blockchain and distributed ledger systems (DLSs), but the layer encompasses additional tools that are characteristic of Data for Policy’s Focus Area 2 (Engin et al., Reference Engin, Gardner, Hyde, Verhulst and Crowcroft2024) and leaves room for the continual evolution of eligible technologies.

Figure 2. From Data to Decision-Making: A Conceptual Framework of Data-Policy Interactions within Focus Area 2: Technologies and Analytics. The darker shaded boxes indicate topics covered in this Landscape Review. The lighter-shaded boxes indicate topics that exist under the Focus Area 2 umbrella but are not included in this paper.
The remainder of this report is organized as follows: In Section 2, we review aspects of modern research data collection, associated information and real-time communication technologies, and government and administrative records. Section 3 reflects on four data tools that uniquely leverage data from the digital environment: ML applied to policy decision-making (Section 3.1); the internet-of-things in smart, connected infrastructure (Section 3.2); digital twin technologies for planning and design in the built environment (Section 3.3); and distributed ledger technologies that capture distributed trust (Section 3.4). In Section 4, we comment on the application of these analytical tools for policy evaluation.
2. Data sources
Policymakers have always had access to a variety of conventional data sources for measurement and evaluation in government statistics, often in the form of surveys that attempt to measure population trends, health, education, crime, and other aspects of social life (Groves, Reference Groves2011). However, high-quality population surveys are complicated logistical operations that can be expensive or infrequently conducted. The gradual decline in response rates over recent decades (Brick and Williams, Reference Brick and Williams2013; Singer, Reference Singer2006) has motivated the search for alternative sources of data. This wealth of new digital data sources has led to the development of technical solutions for the storage, manipulation, and analysis of data (Lazer and Radford, Reference Lazer and Radford2017).
2.1. Conventional data sources
Government and administrative records have been valuable as a form of big data in social science research (Connelly et al., Reference Connelly, Playford, Gayle and Dibben2016). Administrative data, collected primarily for nonstatistical purposes, is readily available to governments and can be used to produce estimates of attributes that are not easily captured in surveys (Nordbotten, Reference Nordbotten2010). An important element in this discussion is the potential value of record linkage, which allows for the integration of different data streams, for instance by combining census responses with tax and property data. Despite these advantages to overcome data silos, there is growing concern over incidental disclosure and reidentification. Electronic health records and other personally identifiable data have captured attention in this context, especially regarding the improvement and personalization of care (Cebul et al., Reference Cebul, Love, Jain and Hebert2011; Cowie et al., Reference Cowie, Blomster, Curtis, Duclaux, Ford, Fritz and Goldman2017; Abul-Husn and Kenny, Reference Abul-Husn and Kenny2019). In addition to privacy concerns, survey data is infrequently updated, which limits our ability to measure social phenomena and evaluate the impacts of policy changes. For example, in the United States, measures of household energy consumption are updated within the Residential Energy Consumption Survey (RECS) and Commercial Buildings Energy Consumption Survey (CBECS), which are conducted every few years. As a result, monitoring the effects of energy-efficient and emissions-reducing policies becomes impractical for impact evaluation (EIA, 2022).
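As an illustration of the record linkage idea described above, the following minimal Python sketch joins two small record sets on a shared identifier. The field names (`person_id`, `household_size`, `taxable_income`) and all values are hypothetical and chosen only for illustration. The sketch assumes an exact, error-free key; linkage in practice often relies on probabilistic matching and disclosure controls that are not shown here.

```python
def link_records(census, tax, key="person_id"):
    """Exact-key (deterministic) linkage of two lists of dict records.

    Returns one merged record for every census row that has a
    matching tax row; unmatched rows are dropped.
    """
    # Index the tax records by the linkage key for constant-time lookup.
    tax_by_key = {row[key]: row for row in tax}
    linked = []
    for row in census:
        match = tax_by_key.get(row[key])
        if match is not None:
            # Combine the fields of both sources into a single record.
            linked.append({**row, **match})
    return linked


# Hypothetical toy data, not drawn from any real census or tax file.
census = [
    {"person_id": 1, "household_size": 3},
    {"person_id": 2, "household_size": 1},
    {"person_id": 3, "household_size": 4},
]
tax = [
    {"person_id": 1, "taxable_income": 42000},
    {"person_id": 3, "taxable_income": 58000},
]

linked = link_records(census, tax)
```

In this toy case only the records present in both sources are retained, which is also why linked datasets can be smaller and less representative than either input.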
2.2. Digital data sources
Digital innovations in nearly every aspect of modern life have had a transformative impact on the availability of data that can be used to study and inform public policy (Salganik, Reference Salganik2019; Jungherr et al., Reference Jungherr, Rivero and Gayo-Avello2020). Every day, a large portion of our behaviors are captured by digital systems (Golder and Macy, Reference Golder and Macy2014). This is true for the increasing number of activities that are mediated by a computer or a cellular device (ranging from web browsing to connecting on social media and using mobile apps). Similar arguments can be made for the myriad of sensors and electronic systems with which we interact daily. Other examples include public or private CCTV systems (Taylor and Gill, Reference Taylor and Gill2014) and a wide array of sensors in public and private spaces (Ratti and Claudel, Reference Ratti and Claudel2016; van de Sanden et al., Reference van de Sanden, Willems and Brengman2019).
The decentralized nature of digital data collection and non-representative datasets allows for the repurposing of information for secondary uses (Salganik, Reference Salganik2019). For example, in urban analytics, smart cities projects (Batty et al., Reference Batty, Axhausen, Giannotti, Pozdnoukhov, Bazzani, Wachowicz, Ouzounis and Portugali2012; Albino et al., Reference Albino, Umberto and Dangelico2015) have taken advantage of active and passive monitoring devices (Singleton et al., Reference Singleton, Spielman and Folch2017; Asensio et al., Reference Asensio, Apablaza, Lawson, Chen and Horner2021) to study mobility (Aguilera and Boutueil, Reference Aguilera and Boutueil2018), criminality (Ferguson, Reference Ferguson2017; Meijer and Wessels, Reference Meijer and Wessels2019), and the resilience of urban infrastructure in natural disasters (Khalaf et al., Reference Khalaf, Abir, Al-Jumeily, Fergus and Idowu2015; Dong and Shan, Reference Dong and Shan2013). Further, mobile devices such as smartphones and wearable technology (Gandy et al., Reference Gandy, Baker and Zeagler2017) have become valuable sources of data for automated contact tracing and data collection. These devices consistently capture metrics related to browsing behavior, geolocation (Nikolic and Bierlaire, Reference Nikolic and Bierlaire2017; Wu et al., Reference Wu, Brown and Sreenan2013), patterns of communication (Green et al., Reference Green, Moszczynski, Asbah, Morgan, Klyn, Foutry, Ndira, Selman, Monawe, Likaka, Sibande and Smith2021; Blumenstock et al., Reference Blumenstock, Cadamuro and On2015), and other features that can be used to study real-time aspects of human behavior.
Websites and social media also serve as valuable sources of digital data. Individuals and organizations maintain an online presence for a large variety of activities (learning, shopping, dating, and so forth). Some of these activities occur in public spaces into which researchers can gain programmatic access via structured (for instance, through a REST API) or unstructured methods (web scraping or web harvesting); or they can be investigated by the providers of the service through internal data (Kramer et al., Reference Kramer, Guillory and Hancock2014; Yang et al., Reference Yang, Holtz, Jaffe, Suri, Sinha, Weston, Joyce, Shah, Sherman, Hecht and Teevan2021). This has allowed for aspects of public and private life to be studied at scale. For example, internet and social media data have played a role in the research on public attention, communication, and public health (Klašnja et al., Reference Klašnja, Barberá, Beauchamp, Nagler and Tucker2015; Dugas et al., Reference Dugas, Jalalpour, Gel, Levin, Torcaso, Igusa and Rothman2013; Arora et al., Reference Arora, McKee and Stuckler2019). Additionally, satellite and aerial photography have been widely used for the analysis of urban and rural environments, including studying patterns of growth, mobility, and poverty (Wania et al., Reference Wania, Kemper, Tiede and Zeil2014; Zhang et al., Reference Zhang, Wu, Zhu and Liu2019; Jean et al., Reference Jean, Burke, Xie, Davis, Lobell and Ermon2016). Images from interactive panoramas like Google Street View have been used to gain insights into subjective perceptions of streetscapes (Ye et al., Reference Ye, Zeng, Shen, Zhang and Lu2019; Liu et al., Reference Liu, Han, Xiong, Qing, Ji and Peng2019; Rundle et al., Reference Rundle, Bader, Richards, Neckerman and Teitler2011; Liu et al., Reference Liu, Silva, Wu and Wang2017). For further discussion on the alternative applications of data from the digital environment to address public issues, see Verhulst (Reference Verhulst2021).
2.3. Challenges and opportunities underlying data sources
The existing literature characterizes digital data as advantageous compared to surveys, which have limitations that threaten the accuracy of data collection (Salganik, Reference Salganik2019). For example, due to cognitive biases, recalling past behavior is burdensome for many respondents. This limits the accuracy of survey responses especially as respondents are asked to recall information over longer periods of time (Grotpeter, Reference Grotpeter2008). Additionally, respondents are less likely to truthfully report answers to sensitive questions when there is an interviewer present (Tourangeau and Yan, Reference Tourangeau and Yan2007; Krumpal, Reference Krumpal2013). Although methods exist to address these shortcomings (Lensvelt-Mulders et al., Reference Lensvelt-Mulders, Hox, Van der Heijden and Maas2005; Blair and Imai, Reference Blair and Imai2012), the ability to directly measure behaviors offers advantages in the availability, accuracy, and depth of the data. This is clearly illustrated in the use of GPS-enabled devices to complement mobility data from large transportation surveys (Stopher et al., Reference Stopher, FitzGerald and Xu2007). Rather than asking a small sample of respondents to recall detailed information, digital data offers the opportunity to study a larger population, with greater accuracy, precision, and detail (see, for instance, Merry and Bettinger, Reference Merry and Bettinger2019; Wolf et al., Reference Wolf, Oliveira and Thompson2003).
Digital data also poses several relevant challenges. For example, in the study of public opinion, social media can be considered an attractive data source for its potential to replace traditional polls (Tumasjan et al., Reference Tumasjan, O’Sprenger, Sandner and Welpe2011; McKelvey et al., Reference McKelvey, DiGrazia and Rojas2014; DiGrazia et al., Reference DiGrazia, McKelvey, Bollen and Rojas2013). However, not everyone owns a digital device, has access to the internet, or maintains a social media presence. This means that the pool of individuals who can be studied is not always representative of the general public. This digital divide has been documented among adults with the right to vote (Gayo-Avello et al., Reference Gayo-Avello, Metaxas and Mustafaraj2011; Murphy et al., Reference Murphy, Link, Childs, Tesfaye, Dean, Stern, Pasek, Cohen, Callegaro and Harwood2014; Barberá and Rivero, Reference Barberá and Rivero2014). In an effort to mitigate biases in population sampling, suitable weighting methods have emerged (Sen et al., Reference Sen, Floeck, Weller, Weiss and Wagner2019; Elliott and Valliant, Reference Elliott and Valliant2017; Kennedy et al., Reference Kennedy, Mercer, Keeter, Hatley, McGeeney and Gimenez2016). Another disadvantage associated with digital data is that researchers often lack control over the concepts and comprehensiveness of measurement. In the case of online behavior, monitoring a respondent’s digital device through tracking cookies may not capture all of their online activity. For example, multiple cookies on user devices may be tracked at different times, which preserves some anonymity (Barthel et al., Reference Barthel, Mitchell, Asare-Marfo, Kennedy and Worden2020).
However, online behaviors are often repetitive and predictable over time, and recent studies with close proximity networks have shown that individuals may be identified even in pseudonymized datasets (Cretu et al., Reference Creţu, Frederico, Marrone, Dong, Bronstein and De Montjoye2022). This example highlights the common tradeoff between comprehensiveness and privacy.
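To make the weighting idea raised earlier in this discussion more concrete, the sketch below applies a simple post-stratification adjustment to a non-representative sample. It is one possible illustration rather than the method of any cited study: the grouping variable (`age_group`), the outcome (`supports_policy`), the population shares, and the responses are all hypothetical, and the sketch assumes that every group present in the sample has a known population share.

```python
def poststratify(sample, population_shares,
                 group_key="age_group", value_key="supports_policy"):
    """Post-stratification: reweight respondents so that each group's
    weighted share matches its (assumed known) population share."""
    n = len(sample)
    # Count respondents per group.
    counts = {}
    for r in sample:
        counts[r[group_key]] = counts.get(r[group_key], 0) + 1
    # Weight = population share / sample share for each group.
    weights = {g: population_shares[g] / (c / n) for g, c in counts.items()}
    total_weight = sum(weights[r[group_key]] for r in sample)
    weighted_mean = sum(
        weights[r[group_key]] * r[value_key] for r in sample
    ) / total_weight
    return weights, weighted_mean


# Hypothetical online sample that over-represents younger respondents.
sample = (
    [{"age_group": "young", "supports_policy": 1}] * 6
    + [{"age_group": "young", "supports_policy": 0}] * 2
    + [{"age_group": "old", "supports_policy": 0}] * 2
)
# Assumed population shares (illustrative only).
population_shares = {"young": 0.5, "old": 0.5}

unweighted_mean = sum(r["supports_policy"] for r in sample) / len(sample)
weights, weighted_mean = poststratify(sample, population_shares)
```

Here the unweighted estimate of support is pulled toward the over-represented group, while the weighted estimate gives each group the influence implied by its assumed population share. Weighting of this kind can only correct for characteristics that are observed and for which population benchmarks exist.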
In the future, there will be increasing opportunities to repurpose alternative social datasets from the digital environment (Kalton, Reference Kalton2019; Rao, Reference Rao2021). Governments and their national statistical offices will be required to make important decisions regarding where and when to use conventional and digital data sources for policy. Survey data, though expensive, can be used to benchmark data from other sources that can be collected more cheaply, frequently, or with more granularity (Blumenstock et al., Reference Blumenstock, Cadamuro and On2015; Keusch et al., Reference Keusch, Struminskaya, Kreuter and Weichbold2020a, Reference Keusch, Bähr, Haas, Kreuter and Trappmann2020b). New sources in modern data collection, including social data, sensors, and digital platforms, are becoming serious complements and, in some cases, alternatives to conventional government surveys. These data sources are faster and cheaper to obtain through application programming interfaces but require increasingly complex tools to parse, handle, and compute. In the following section, we describe four commonly employed technologies and their potential capabilities for near real-time analysis.
3. Technologies and analytics
The abundance of conventional and digital data allows for real-time monitoring and response while simultaneously posing data-processing challenges due to the vast amount of data available, also known as a “data avalanche”. However, policymakers do not often face raw datasets, numeric and textual spreadsheets, or databases when crafting policy. They instead rely on the insights, knowledge, and evidence extracted from datasets through analytical processes.
The Technologies and Analytics Layer in Figure 2 encompasses existing and emerging technologies that link data collection and applications within policy. We do not explicitly differentiate between technology and analytics, as we find it arbitrary to draw a boundary between the two and note that it is not uncommon that an analytical module is underpinned by many different technologies. For example, Digital Twins and the IoT are often intertwined in the context of novel modeling, sensing, and data harvesting technologies, while from a system view, they sometimes also consist of analytical software components such as ML-enhanced IoT systems. Hence, the four widely used tools are blended under the umbrella of technologies and analytics in this section.
3.1. Machine learning
ML in CSS refers to the algorithms that allow computers to build predictions around behavioral data, thus “learning” and optimizing parameters over time. ML approaches may be supervised, in which the analyst provides labeled datasets to train the computer algorithm, or unsupervised, where the computer analyzes datasets without training on labeled data (Baraniuk et al., Reference Baraniuk, Donoho and Gavish2020). In recent years, there has been growth in the application of ML for policy problems, including supervised and unsupervised learning (Athey, Reference Athey2017). Increasingly, governments are using supervised ML in prediction problems to determine how to best allocate resources. For instance, the New York City Fire Department’s FireCast program uses ML to predict which buildings are most vulnerable to fire and deploy inspection teams (Heaton, Reference Heaton2015). Similar algorithms have been proposed for an increasing range of policy-relevant applications, such as environmental monitoring (Hino et al., Reference Hino, Benami and Brooks2018), preventing malfeasance in public procurement (Gallego et al., Reference Gallego, Rivero and Martínez2021), and restaurant hygiene inspections (Glaeser et al., Reference Glaeser, Hillis, Kominers and Luca2016).
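The distinction between supervised and unsupervised learning drawn above can be illustrated with a deliberately small Python sketch. It is not the approach of FireCast or of any cited study: it uses a nearest-centroid classifier as a stand-in for supervised learning and a one-dimensional two-cluster k-means as a stand-in for unsupervised learning, and the single feature (a hypothetical building age in years) and the labels are invented for illustration.

```python
from statistics import mean


def fit_centroids(values, labels):
    """Supervised: learn one centroid per label from labeled examples."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    return {y: mean(vs) for y, vs in groups.items()}


def predict(centroids, value):
    """Assign a new value to the label with the nearest centroid."""
    return min(centroids, key=lambda y: abs(value - centroids[y]))


def two_means(values, iterations=10):
    """Unsupervised: split unlabeled values into two clusters
    (1-D k-means with k=2), returning the two cluster centers."""
    lo, hi = min(values), max(values)
    for _ in range(iterations):
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not high:  # all values fell into one cluster; stop early
            break
        lo, hi = mean(low), mean(high)
    return lo, hi


# Hypothetical feature: building age in years (illustrative values only).
ages = [5, 8, 12, 60, 75, 90]
labels = ["low_risk", "low_risk", "low_risk",
          "high_risk", "high_risk", "high_risk"]

centroids = fit_centroids(ages, labels)   # uses the analyst's labels
prediction = predict(centroids, 70)
centers = two_means(ages)                 # uses no labels at all
```

In the supervised case the analyst's labels determine what the model learns to predict; in the unsupervised case the algorithm recovers structure from the data alone, and interpreting the resulting clusters is left to the analyst.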
Researchers and analysts are also drawing on ML to leverage new datasets that capture hard-to-measure variables for policy analysis. Examples include using Twitter to identify illegal sales of opioids (Mackey et al., Reference Mackey, Kalyanam, Katsuki and Lanckriet2017), developing economic uncertainty indices from news articles (Azqueta-Gavaldon, Reference Azqueta-Gavaldon2017), predicting income levels from phone metadata (Blumenstock et al., Reference Blumenstock, Cadamuro and On2015), optimizing Covid-19 vaccine deployment strategies in Africa (Mellado et al., Reference Mellado, Wu, Kong, Bragazzi, Asgary, Kawonga, Choma, Hayasi, Lieberman, Mathaha, Mbada, Ruan, Stevenson and Orbinski2021), and predicting suicide risk using Reddit data (Yao et al., Reference Yao, Rashidan, Dong, Hongyi, Rosenthal and Wang2021; Allen et al., Reference Allen, Bagroy, Davis and Krishnamurti2019). In many of these cases, researchers trained artificial intelligence (AI) to identify patterns (e.g., timing, wording, or events associated with behaviors of interest) from a small dataset and apply these algorithms to classify instances across a larger number of observations. Previously, tracking activities that are illegal or take place over large geographic areas would require a substantial and costly effort. Now, with deep neural networks, which are scalable and generalizable across domains, predictor variables from digital datasets can be observed or evaluated almost continuously (provided a robust, digitally available data feed). ML also enables researchers to utilize an abundance of unstructured data sources such as textual data, images, video, and other non-numeric data sources that might contain valuable information about the environment, policy preferences, or people’s behaviors, among other aspects.
A major benefit of ML classification is the ability to link easily observed variables with more policy-relevant spatial or temporal qualities that may be harder to obtain. For example, in transportation infrastructure where there is poor data and network interoperability across jurisdictions, researchers have been able to deploy deep learning algorithms to automatically detect failures in electric vehicle charging stations, with accuracy approaching or often exceeding that of human experts (Asensio et al., Reference Asensio, Alvarez, Dror, Wenzel, Hollauer and Ha2020; Ha et al., Reference Ha, Marchetto, Dharur and Asensio2021). Deep learning algorithms have also been applied alongside satellite imagery to measure indicators of household consumption in poorer countries, where government statistics have more limited availability (Jean et al., Reference Jean, Burke, Xie, Davis, Lobell and Ermon2016; Vinuesa and Sirmacek, Reference Vinuesa and Sirmacek2021). More generally, deep neural networks such as transformer-based architectures (Vaswani et al., Reference Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez and Kaiser2017; Yang et al., Reference Yang, Dai, Yang, Carbonell, Salakhutdinov and Le2019; Devlin et al., Reference Devlin, Chang, Lee and Toutanova2018), and convolutional or recurrent neural networks (Gu et al., Reference Gu, Wang, Kuen, Ma, Shahroudy, Shuai, Liu, Wang, Wang, Cai and Chen2018; LeCun et al., Reference LeCun, Bengio and Hinton2015) are being deployed in an increasing number of policy-relevant applications to automatically discover context-aware, spatially resolved, and domain-specific insights (Hicks et al., Reference Hicks, Zullo, Doshi and Asensio2022).
Large language models (e.g. GPT-4, BERT, BLOOM, Llama) have demonstrated strong performance in text generation. Because such technologies often emerge more rapidly than regulation or consensus around use policies can be determined, there is evidence of misuse: for example, ChatGPT has been used to plagiarize on academic assignments and research studies and to generate false narratives and misinformation (Else, Reference Else2023; Brewster et al., Reference Brewster, Arvanitis and Sadeghi2023). To address this, several academic publishers, including Springer Nature and Cambridge University Press, have updated their policies to preclude generated text from attributed authorship (Thorpe, Reference Thorpe2023; Cambridge University Press, CUP, 2023).
For policy domains, these AI systems are enhanced by humans who can intervene during training, testing, or validation to consider more complex issues (e.g. behavioral intent, psychological states, minority class representations, and other important social considerations). This blending of human and machine intelligence will also increase the need for high-quality, labeled training data that has been experimentally curated and approved by Institutional Review Boards (IRB) in sponsored research. There is a growing call for businesses to adopt IRB approval processes in data protocols as part of industry self-regulation and AI ethics boards (Blackman, Reference Blackman2021). We recognize that cross-disciplinary approaches that combine algorithmic advances with experimentally curated training data will continue to expand research frontiers and applications for ML/AI in social and policy domains.
Although there have been calls by the community for new regulations on applications of ML/AI, we note that there already exist mechanisms and applicable tools within the regulatory landscape, such as IRBs for the private sector and privacy regulations (e.g. General Data Protection Regulation (GDPR) in Europe and California Consumer Privacy Act (CCPA) in the US), which can and have been leveraged to mitigate negative effects of ML/AI (Renieris, Reference Renieris2023). For example, the Information Commissioner’s Office (ICO), which operates within the GDPR, fined the facial recognition technology company Clearview AI for the unlawful collection and use of biometric data in 2022 (ICO, 2022). For further discussion, we refer readers to Focus Area 3: Policy and Literacy for Data and Focus Area 4: Ethics, Equity & Trustworthiness.
The growth of machine intelligence for policy decision-making also presents several challenges. First, social science inquiry that serves as a foundation for policy analysis typically focuses on developing a clear model explaining causal relationships. In contrast, many computational or purely data-driven approaches seek to optimize predictive performance, even if it makes the underlying model extremely complex or uninterpretable. Finding ways to integrate social theories with explainable AI systems (Amarasinghe et al., Reference Amarasinghe, Rodolfa, Lamba and Ghani2023) to illustrate why two variables are related at a micro level or better adapt interventions (Pallmann et al., Reference Pallmann, Bedding, Choodari-Oskooei, Dimairo, Flight, Hampson, Holmes, Mander, Odondi, Sydes, Villar, Wason, Weir, Wheeler and Jaki2018) is important to improve theoretically-driven, computational approaches (Hofman et al., Reference Hofman, Watts, Athey, Garip, Griffiths, Kleinberg, Margetts, Mullainathan, Salganik, Vazire, Vespignani and Yarkoni2021). Similarly, empirical constructs measured in real datasets often do not adequately capture the underlying social concepts they aim to articulate (Wagner et al., Reference Wagner, Strohmaier, Olteanu, Kıcıman, Contractor and Eliassi-Rad2021), signaling a need for better validation and interpretation of measurements derived through ML (Buckee et al., Reference Buckee, Noor and Sattenspiel2021).
Second, many ML algorithms are seen as a “black box”, wherein the actual mechanism linking data point A to prediction B is both hidden and not linked to a clear theoretical model in transfer functions. When working on policy questions that have real effects on health and well-being, these sorts of “trust the expert” approaches can erode trust in government and compliance (Coyle and Weller, Reference Coyle and Weller2020). This is especially true for supervised ML approaches, where the data scientist’s choice of algorithm and/or prediction weights suggests to the public that bias may be built into the prediction (Carmel, Reference Carmel2016).
Finally, while many in the Data for Policy community (policymakers and researchers alike) are excited about ML-powered data innovations, in many policy settings, the prediction enabled by ML may not be sufficient to fully answer existing policy questions. In some cases, this is because it is fundamentally a social question, wherein the computer can allocate a particular probability to a particular outcome, but society needs to decide what level of risk it is willing to accept (Kleinberg et al., Reference Kleinberg, Ludwig, Mullainathan and Obermeyer2015; Athey, Reference Athey2017). In other cases, a stronger causal understanding is needed to develop effective policy solutions. However, integrating theories of policy and governance into ML policy analytics can be challenging because of the hands-off nature of the models and a historical over-reliance on nonexperimental data, where researchers do not precisely control the conditions during data collection, such as with random or quasi-random allocation mechanisms (Dunning, Reference Dunning2012). We are increasingly seeing effective uses of ML to target policy interventions, predict human behavior at a more granular level, and understand future events with greater precision (Amarasinghe et al., Reference Amarasinghe, Rodolfa, Lamba and Ghani2023). Despite practical issues on how best to apply these tools, there is an encouraging future for an increasing number of prediction policy problems.
3.2 Internet of Things
The IoT refers to systems in which computing and network-connected capabilities are incorporated within everyday objects to collect and distribute data without human intervention (Internet Society, 2015). The concept has evolved from a foundation in ubiquitous computing and includes various natural or built environment monitoring devices, wearable accessories, home appliances, vehicles, drones, smartphones, and computers (Satyanarayanan, Reference Satyanarayanan2001; Krumm, Reference Krumm2018; Friedewald and Raabe, Reference Friedewald and Raabe2011; Atzori et al., Reference Atzori, Iera and Morabito2010). These technologies are capable of producing, transferring, and consuming real-time data about themselves and their surrounding environment. The collected data are then either streamed in real time or stored as historical data to be fed into ML and AI models for devising, monitoring, and evaluating various policies and regulations across different countries (Behrendt, Reference Behrendt2020; Salem, Reference Salem2017; Tanczer et al., Reference Tanczer, Brass, Elsden, Carr, Blackstock, Ellis and Mohan2019). On a more granular level, mobile sensing networks, which are comprised of numerous physical sensing nodes that record and wirelessly relay massive amounts of data, have been part of IoT systems for over two decades (Intanagonwiwat et al., Reference Intanagonwiwat, Govindan and Estrin2000; Pottie and Kaiser, Reference Pottie and Kaiser2000; Tilak et al., Reference Tilak, Abu-Ghazaleh and Heinzelman2002).
Cities around the world are increasingly confronting issues such as traffic congestion, air pollution, and natural hazards, which have prompted data-driven interventions to make cities more efficient, adaptable, and resilient. Over the past decades, the concept of IoT has been implemented in many real-world scenarios and ‘smart’ applications, particularly in the urban and city governance context, for example, smart transport and smart cities (Atzori et al., Reference Atzori, Iera and Morabito2010). IoT-enabled decision or planning support systems are perhaps the most commonly used policy tools in the urban governance and smart city domain (Al Sharif and Pokharel, Reference Al Sharif and Pokharel2022). For instance, many cities have deployed IoT devices and sensor networks to cope with the natural hazards related to climate change (see Pantalona et al., Reference Pantalona, Tsalakanidou, Nikolopoulos, Kompatsiaris, Lombardo, Norbiato and Haberstock2021). Milton Keynes, UK, is one of the cities selected by Innovate UK to test an IoT-empowered real-time air pollution monitoring system to support better public service for citizens and companies (Government Office for Science UK, 2014; Cheng et al., Reference Cheng, Li, Li, Jiang, Li, Jia and Jiang2014). In India, IoT has been adopted in waste management for smart city scenarios (Sharma et al., Reference Sharma, Joshi, Kannan, Govindan, Singh and Purohit2020). IoT systems are also a crucial driving force in the global economy. For instance, IoT-enabled bike-sharing systems (Behrendt, Reference Behrendt2020) promote greener transport and support the UN Sustainable Development Goals (SDGs). Additionally, mobile sensing networks have been deployed since the 1990s to receive and transmit real-time data for pollution monitoring, satellite imaging and broadcasting, smart-home monitoring, and numerous other applications (Kahn et al., Reference Kahn, Katz and Pister1999; Dinh et al., Reference Dinh, Lee, Niyato and Wang2013; Vermesan et al., Reference Vermesan, Friess, Guillemin, Sundmaeker, Eisenhauer, Moessner, Gall and Cousin2022).
While bringing a number of social and economic benefits to governance and policy (Government Office for Science UK, 2014), IoT is still evolving and faces several major limitations. First, existing applications of IoT have raised public concerns over data security and privacy protection (Tanczer et al., Reference Tanczer, Brass, Elsden, Carr, Blackstock, Ellis and Mohan2019; Ukil et al., Reference Ukil, Bandyopadhyay and Pal2014; Opara et al., Reference Opara, Johng, Hill and Chung2022). Similar to many other big data technologies, IoT systems collect, share, and analyze copious amounts of data about people and the environment, including sensitive information such as locations, trajectories, activities, health, and biometric data. The distributed architecture of IoT could also expose sensors and devices to potential attacks and data interception, which undermines public trust in data security. In contrast with distributed IoT devices that may not have local control, personal mobile sensing devices could be more socially acceptable, considering that their activities are more easily turned off (Choudhury et al., Reference Choudhury, Borriello, Consolvo, Haehnel, Harrison, Hemingway, Hightower, Klasnja, Koscher, LaMarca, Landay, LeGrand, Lester, Rahimi, Rea and Wyatt2008). However, it is often not transparent to the people who contribute data what data are collected, where they are stored, and who has access to them (Corallo et al., Reference Corallo, Lassi, Lezzi and Luperto2022). Consequently, IoT devices present information problems related to user control and privacy.
Second, our capability to process the data collected by various IoT systems lags behind the rate of data accumulation. Data availability is no longer the limiting factor in many applications: IoT-generated information about people and their environment is harvested continuously and is growing exponentially. The heterogeneous forms of the data collected, which include text, images, audio, and video, further compound the complexity of data processing (Kazmi et al., Reference Kazmi, Serrano and Lenis2018). Thus, it is still a challenging task to digest and refine the data and present more comprehensible knowledge to policymakers. There is a shortage of efficient and effective data analytical models as well as skilled IoT specialists, which is recognized in the UK government's Blackett Review (Government Office for Science UK, 2014).
Third, there is a plethora of IoT solutions and applications for governance and regulation, while IoT systems themselves lack standards and policies. As a novel technology used in multiple industries, IoT has many different domain-specific terminologies and standards. Countries may also differ in how they define IoT and in the priorities of their IoT strategies. These inconsistencies have the potential to cause interoperability issues when coupling disparate IoT components on the internet, which is one of the barriers to adopting IoT more widely. Although a few jurisdictions have already explored and devised IoT-related regulations, for instance, India (Chatterjee and Kar, Reference Chatterjee and Kar2018), the UK (Tanczer et al., Reference Tanczer, Brass, Elsden, Carr, Blackstock, Ellis and Mohan2019), and the EU (Remotti et al., Reference Remotti2021), many regulations are still not IoT-specific and do not keep pace with the evolution of IoT.
In response to the above challenges, future interactions between IoT and policymaking could be explored in several directions. In terms of data processing and knowledge discovery to inform policy-making processes, IoT technology should be implemented with a more powerful and smarter analytical backend, for instance, one that uses ML and AI to extract meaningful patterns and evidence from the massive, heterogeneous datasets collected by IoT systems. This also requires closer collaboration between policymakers and data experts (Government Office for Science UK, 2014). Policymakers should clearly present their real-world problems and define their queries to data experts, and in return, data experts need to inform them of both the findings of models and, more importantly, the caveats that arise from possible data bias and model premises.
With respect to data security problems, IoT systems themselves should further embrace new digital technologies in data encryption and protection (Minoli and Occhiogrosso, Reference Minoli and Occhiogrosso2018). Automatic security-enhancing mechanisms need to be implemented to distinguish essential from non-essential traffic over the IoT network, in order to restrict the transfer of sensitive information, such as personally identifiable information, without compromising the normal functionality of the IoT devices (Mandalari et al., Reference Mandalari, Dubois, Kolcun, Paracha, Haddadi and Choffnes2021). Moreover, concerning regulation, policymakers should develop more IoT-specific standards, policies, and laws (e.g., Government Office for Science UK, 2014) to guide the development of IoT. Policymaking and governance need to be more forward-thinking and adaptive to cope with rapidly evolving IoT technologies.
3.3 Digital twins
The concept of Digital Twins, first coined by Michael Grieves at a Society of Manufacturing Engineers conference in 2003, has now proliferated beyond its origin in product lifecycle management into many other domains, including manufacturing, farming, healthcare, architecture, and city planning (Grieves, Reference Grieves2015). Unlike models and simulations, digital twins are more complex virtual environments that utilize real-time data to generate multiple analyses (Bennett et al., Reference Bennett, Birkin, Ding, Duncan and Engin2023; Wright and Davidson, Reference Wright and Davidson2020). The Digital Twins ecosystem is underpinned by the various data sources mentioned in Section 2, as well as novel technologies such as sensor networks, IoT, 5G communication, cloud computing, ML, AI, virtual reality, augmented reality, mixed reality, geographic information systems (GISs), and building information modeling (BIM) (Wang et al., Reference Wang, Xu, Jiang and Zhong2022). Recent years have seen the increased use of various digital twin scenarios and applications, particularly during the Covid-19 pandemic when people moved many physical, face-to-face activities to virtual cyberspace. There are several domain-specific definitions of Digital Twins, but in general, they are real-time, virtual representations of various physical or functional entities, and examples include digital human bodies, jet engines, buildings, infrastructures, and cities (Batty, Reference Batty2018).
With innovative data and analytic techniques, the performance and dynamics of real-world entities can be measured, modeled, simulated, and predicted by their Digital Twins in virtual and software environments. These capabilities are effective and powerful tools for data-informed and evidence-based policymaking, particularly in the urban planning and management context (Engin et al., Reference Engin, van Dijk, Lan, Longley, Treleaven, Batty and Penn2020). Digital Twins are employed to test various what-if scenarios for long-term urban planning and development. In order to achieve sustainable development, Digital Twins provide promising solutions to mitigate urban and regional issues such as poverty and inequalities (Birks et al., Reference Birks, Heppenstall and Malleson2020), carbon footprints (Bauer et al., Reference Bauer, Stevens and Hazeleger2021; Solman et al., Reference Solman, Kirkegaard, Smits and Van Vliet2022), traffic congestion (Kumar et al., Reference Kumar, Madhumathi, Chelliah, Tao and Wang2018), natural hazards (Fernández and Ceacero-Moreno, Reference Fernández and Ceacero-Moreno2021), and public health problems (El Saddik et al., Reference El Saddik, Badawi, Velazquez, Laamarti, Diaz, Bagaria and Arteaga-Falconi2019). Digital twins are often difficult to replicate and scale. Recently, probabilistic graphical models have been proposed to ensure that digital twin representations and processes can be sufficiently scaled from experimental data to other physical assets (Mohammadi and Taylor, Reference Mohammadi and Taylor2021; Kapteyn et al., Reference Kapteyn, Pretorius and Willcox2021).
The applications of Digital Twins in governance and policymaking are numerous. During the Covid-19 pandemic, individual-level biometric data were collected and analyzed in digital replicas of people's activity spaces to detect co-presence with infected individuals (Ada Lovelace Institute, 2023). Smartphone-based digital contact tracing apps were developed and deployed to alert citizens of possible exposure in countries around the globe (Phillips et al., Reference Phillips, Babcock and Orbinski2022). On a larger scale, Virtual Singapore is a government-led initiative that aims to build a dynamic, three-dimensional, city-scale Digital Twin of Singapore (Singapore Land Authority, n.d.). It enables different stakeholders, including members of the government, citizens, businesses, and the research community, to perform virtual experimentation, virtual test-bedding, long-term urban planning and decision-making, and research and development. Amaravati City in India is reported to be the first city to be newly developed on a greenfield site and born as a Digital Twin (Jansen, Reference Jansen2019). It ambitiously aims to digitally recreate everything happening in the city. For instance, it allows for real-time construction progress monitoring and advanced mobility simulations. The European Space Agency has also launched several Digital Twin activities to visualize, monitor, model, and forecast natural and human activities, using earth observation data combined with AI, which could help tackle pressing global issues such as climate change (European Space Agency, 2021).
Although Digital Twins offer promising platforms of data and policy interaction for integrating existing and emerging data sources and technologies, they also face many critiques and challenges, such as model difficulties (Tao and Qi, Reference Tao and Qi2019) and scaling issues (Niederer et al., Reference Niederer, Sacks, Girolami and Willcox2021). Scholars have also argued for the need to make more rapid adaptations in response to natural disasters and other challenges (Mohammadi and Taylor, Reference Mohammadi and Taylor2021). Batty (Reference Batty2018) argues that most of the current computer models are abstractions or simplifications, rather than Digital Twins of the real world, and calls for a collaborative exploration as a society of how close our models can get to real-world systems. To address data privacy and availability problems, Papyshev and Yarime (Reference Papyshev and Yarime2021) borrow the concept of “data labeling” from the ML industry practice and propose a task-based approach to generating synthetic data for City Digital Twins. On one hand, we call for the integration of novel data and technologies into Digital Twins to provide information and evidence for policymaking; on the other, we must pay more attention to the data and technology issues faced in Digital Twins implementation such as data infrastructure construction, data sharing, data security, privacy protection, interoperability, and platform standards, which can be regulated, directed and coordinated by relevant policies.
3.4 Blockchain and DLSs
Blockchain and DLSs are gaining attention in the government and business sectors due to their unique data-sharing features, which are designed to increase transparency, authenticity, and reliability (Zutshi et al., Reference Zutshi, Grilo and Nodehi2021; Guo and Yu, Reference Guo and Yu2022). A DLS is a digital framework that maintains ledgers across multiple nodes or participants within a network, aiming to guarantee the security and accuracy of data (Marbouh et al., Reference Marbouh, Simsekler, Salah, Jayaraman and Ellahham2022). Blockchain emerged as an evolution of DLSs, with an inherent capability to record transactions in chronological order in a secure and verifiable manner (Salah et al., Reference Salah, Rehman, Nizamuddin and Al-Fuqaha2019). Blockchain technologies append data in “blocks”, offering a range of benefits to support analytical applications, such as traceability, built-in anonymity, and secure transaction protocols (Mirabelli and Solina, Reference Mirabelli and Solina2020). Further, this technology offers decentralization that democratizes decision-making, with no single authority in control (Beduschi, Reference Beduschi2021). In particular, smart contract technology is receiving growing attention due to its ability to streamline transaction processes and its potential for automating legal protocols (Hawashin et al., Reference Hawashin, Jayaraman, Salah, Yaqoob, Simsekler and Ellahham2022). The features embedded in DLSs have the potential to bring change to the economic landscape with a new business model in which the end customer is the primary beneficiary (Upadhyay et al., Reference Upadhyay, Mukhuty, Kumar and Kazancoglu2021).
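To make the idea of chronologically ordered, verifiable blocks concrete, the following is a minimal single-machine sketch in Python. It is illustrative only and does not represent any particular blockchain platform: the function names (`block_hash`, `append_block`, `is_valid`) and the block fields are hypothetical, and real systems add networking, consensus, and digital signatures. It shows only how linking each block to the hash of its predecessor makes later tampering detectable.

```python
import hashlib
import json

GENESIS_PREV_HASH = "0" * 64  # placeholder predecessor for the first block (assumption)


def block_hash(index, prev_hash, transactions):
    """Return the SHA-256 digest of a block's contents."""
    payload = json.dumps(
        {"index": index, "prev_hash": prev_hash, "transactions": transactions},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain, transactions):
    """Append a new block that points at the hash of the current last block."""
    index = len(chain)
    prev_hash = chain[-1]["hash"] if chain else GENESIS_PREV_HASH
    block = {
        "index": index,
        "prev_hash": prev_hash,
        "transactions": transactions,
        "hash": block_hash(index, prev_hash, transactions),
    }
    chain.append(block)
    return block


def is_valid(chain):
    """Check that every block's hash is correct and links to its predecessor."""
    expected_prev = GENESIS_PREV_HASH
    for position, block in enumerate(chain):
        if block["index"] != position or block["prev_hash"] != expected_prev:
            return False
        recomputed = block_hash(block["index"], block["prev_hash"], block["transactions"])
        if block["hash"] != recomputed:
            return False
        expected_prev = block["hash"]
    return True


if __name__ == "__main__":
    ledger = []
    append_block(ledger, [{"from": "A", "to": "B", "amount": 5}])
    append_block(ledger, [{"from": "B", "to": "C", "amount": 2}])
    print(is_valid(ledger))  # chain as built
    ledger[0]["transactions"][0]["amount"] = 500  # tamper with an earlier record
    print(is_valid(ledger))  # validation after tampering
```

In this sketch, changing any earlier transaction alters that block's recomputed hash, so validation fails unless every subsequent block is also rewritten, which is the property that underlies the traceability described above.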
Fundamentally, two different types of blockchain networks have emerged, namely permissionless (i.e., public) and permissioned (i.e., private) networks (Engin and Treleaven, Reference Engin and Treleaven2019). While any user can add nodes to the network in a public blockchain (e.g., Bitcoin), only preauthorized users can add nodes to a private blockchain network (e.g., Hyperledger Fabric) and take part in reaching consensus. Many public networks, for instance, use a Proof-of-Work (PoW) consensus mechanism in which no single actor dominates. However, public networks may suffer as more personal information will be required to verify the data added to the blockchain in attempts to prevent fraudulent activity. Further, although PoW ensures data immutability, its environmental and sustainability costs, such as bandwidth, electricity usage, and CPU time, are significant challenges. In private networks, by contrast, protocols are developed to better utilize computational resources. Despite such benefits, the challenge in private networks is to identify a technical solution that balances data verifiability against the level of privacy among stakeholders. In both types of network, another challenge is the increasing size of blockchains, which may create storage and synchronization issues (Wong et al., Reference Wong, Yeung, Lau and So2021).
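The computational cost of PoW can be illustrated with a simplified hash puzzle. The sketch below is an assumption-laden toy rather than the scheme of any production network (Bitcoin, for example, compares a double SHA-256 digest against a numeric target); the function names `mine` and `verify` and the "leading hexadecimal zeros" rule are chosen here purely for illustration.

```python
import hashlib


def _digest(data, nonce):
    """Hash the block data together with a candidate nonce."""
    return hashlib.sha256(f"{data}:{nonce}".encode("utf-8")).hexdigest()


def mine(data, difficulty):
    """Search for a nonce whose digest starts with `difficulty` hex zeros."""
    target_prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = _digest(data, nonce)
        if digest.startswith(target_prefix):
            return nonce, digest
        nonce += 1


def verify(data, nonce, difficulty):
    """Verification needs a single hash, however long mining took."""
    return _digest(data, nonce).startswith("0" * difficulty)


if __name__ == "__main__":
    nonce, digest = mine("example block", difficulty=4)
    print(nonce, digest, verify("example block", nonce, 4))
```

Because each hexadecimal digit of the digest is zero with probability 1/16, the expected number of attempts grows as 16 raised to the difficulty, while verification remains a single hash. This asymmetry is what makes rewriting history expensive, and it is also the source of the electricity and CPU costs noted above.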
Various industries and domains, including finance (Tapscott and Tapscott, Reference Tapscott and Tapscott2017), supply chain (Jabbar and Dani, Reference Jabbar and Dani2020; Kayikci et al., Reference Kayikci, Gozacan-Chase, Rejeb and Mathiyazhagan2022), and healthcare (Omar et al., Reference Omar, Jayaraman, Salah, Simsekler, Yaqoob and Ellahham2020; Bali et al., Reference Bali, Bali, Mohanty and Gaur2022), are exploring the potential benefits of DLSs. Several academic initiatives aim to leverage the use of blockchain in various application areas. For instance, the Cambridge Centre for Carbon Credits (4C) is building a trusted, decentralized voluntary carbon market for funding nature-based projects and seeks further partnerships with governments, the private sector, and NGOs to promote projects concerned with biodiversity and the climate crisis (Cambridge Zero Policy Forum, 2021). Despite the benefits of the technology, there have been a considerable number of failures in blockchain implementations. For instance, Browne (Reference Browne2017) shows that of the 26,000 blockchain projects that started in 2016, only 8% were still active in 2017. Various causes may explain these failures and the hesitancy to adopt the technology, chief among them the hype around blockchain driven by the volatility of cryptocurrencies (Jalal et al., Reference Jalal, Alon and Paltrinieri2021; Guo and Yu, Reference Guo and Yu2022). The recent decline of cryptocurrencies and the downfall of major crypto enterprises have raised further questions about the future of these technologies. In addition to these dramatic declines, concerns about money laundering, tax evasion attempts, and illicit payments have led financial services firms and venture capitalists to question the worth of investing in DLSs.
It should be noted that mainstream blockchain research primarily emphasizes technological aspects, often overlooking current regulatory functions. Although regulatory bodies have initiated working groups (e.g., the Australian Government National Blockchain Roadmap Working Group and the European Blockchain Partnership), questions persist on the effectiveness and usability of legal mechanisms, particularly due to disintermediation (De Filippi et al., Reference De Filippi, Mannan and Reijers2022). To address this challenge, governments across the globe, such as those of Estonia (Ojo and Adebayo, Reference Ojo, Adebayo, Ojo and Millard2017), the United Kingdom (Carson, Reference Carson2018), the United Arab Emirates (Alketbi et al., Reference Alketbi, Nasir and Abu Talib2020), and New Zealand (Demestichas et al., Reference Demestichas, Peppes, Alexakis and Adamopoulou2020), have started embracing DLSs, particularly blockchain, as a strategic driver for technology and policy transformation. A study by IBM revealed that nine in ten government organizations were exploring opportunities to enhance their operations in different application areas, including financial transaction management, contract management, regulatory compliance, and citizen services (Cuomo et al., Reference Cuomo, Pureswaran and Zaharchuk2017).
Further application areas, such as elections (Baudier et al., Reference Baudier, Kondrateva, Ammi and Seulliet2021) and vaccine passports that protect personal privacy (Tsoi et al., Reference Tsoi, Sung, Lee, Yiu, Fung and Wong2021), have also been explored by governments seeking to leverage the technology. To successfully develop and implement such applications, a suitable policy environment is imperative to support early collaborations between technology developers and policymakers and foster innovation compliance. For instance, some scholars support the idea that “minimum regulatory brakes” are the key to adding more value and efficiency to application areas (Yeoh, Reference Yeoh2017). Such “hands-off” regulatory approaches have to date been adopted in the US and EU and show the potential for distributed trust frameworks. However, other scholars advocate for increased policy intervention, specifically on scalability, privacy, security, sustainability, and anonymity (Hassan et al., Reference Hassan, Ali, Rahouti, Latif, Kanhere, Singh, Janjua, Mian, Qadir and Crowcroft2020; Liiv, Reference Liiv and Liiv2021). Considering the potential benefits and challenges of the technology, policymakers may enable experimentation (McQuinn and Castro, Reference McQuinn and Castro2019) and learn from the experiences of others in the global landscape (PWC, 2019) in order to recommend informed regulatory changes. Future studies may benefit from exploring the potential impact of DLSs and blockchain across the entire technology sector, as well as their disruption of government and business operations and policymaking.
4. Using new data sources and analytics in policy-making
Evidence-based policy refers to efforts to prioritize data-based decision-making in policy processes (Head, Reference Head2008; Howlett, Reference Howlett2009; Evidence-Based Policymaking Collaborative, 2016). The proliferating data sources and analytical techniques made available through big data science enable new ways of bringing evidence to the design, implementation, and monitoring of policies and programs (Anshari et al., Reference Anshari, Almunawar and Lim2018; Kim et al., Reference Kim, Trimi and Chung2014; Giest, Reference Giest2017; Suominen and Hajikhani, Reference Suominen and Hajikhani2021). However, it is important to note that these advances do not automatically translate into uptake by policymakers. Governments face numerous constraints, from limited budgets, to external political and social pressures, to varying technical expertise, that limit their capability to fully capitalize upon the information available (Mergel et al., Reference Mergel, Rethemeyer and Isett2016; Schweinfest and Jansen, Reference Schweinfest and Jansen2021). This is true not only of new sources of data and innovative methodologies; it is a more general concern that also affects traditional evidence-based approaches. How governments draw on data to inform decision-making is covered in greater depth by Data for Policy Focus Area 1: Digital and Data-Driven Transformations in Governance.
The data and analytical approaches discussed in Sections 2 and 3 raise several unique challenges with respect to their uptake in evidence-based decision-making. First, governments tend to lag behind the private sector in adopting new computing technologies (Dunleavy et al., Reference Dunleavy, Margetts, Bastow and Tinkler2006). Given the rapid pace of advancements in data and analytics, government agencies are often delayed in adopting the newest approaches. Second, many government workers perceive a skills gap in the use of data and analytics, despite viewing data as a central component of their jobs (SAS, 2014). The World Economic Forum Jobs Report estimates that 24 percent of government and public sector organizations are making big data and AI a reskilling priority (World Economic Forum, WEF, 2023). When government agencies lack data expertise but want to use new analytical approaches, they have to rely on other actors (primarily from the private sector), leading to the growth of public–private partnerships for data (Giest, 2017). These public–private partnerships add to institutional complexity (Head, Reference Head2008) but also offer opportunities for innovation (Janssen et al., Reference Janssen, Konopnicki, Snowdon and Ojo2017). Third, some government administrators lack an understanding of what data analytics entails or are skeptical about its ability to address policy problems (Guenduez et al., Reference Guenduez, Mettler and Schedler2020). Fourth, the integration of big data can vary widely between developed and developing countries due to challenges in basic data availability and skills within the public sector. Applications are more common in developed countries, where access to data technology skills is more readily available (Purkayastha and Braa, Reference Purkayastha and Braa2013). Additional known challenges that could hinder developing countries from integrating the same technologies include limited data capture, infrastructural constraints, human resource scarcity, privacy and security constraints, and cultural barriers (Luna et al., Reference Luna, Mayan, García, Almerares and Househ2014; Hilbert, Reference Hilbert2015).
Despite these limitations, there are many examples of government agencies drawing on new data sources and analytical techniques, as we have illustrated throughout this paper. New training programs are helping government employees build expertise in data analytics which could overcome existing skills gaps (Kreuter et al., Reference Kreuter, Ghani and Lane2019). Further uptake of these approaches could be facilitated by policymakers, analytical experts, and members of policy-affected communities collaboratively identifying data needs and codeveloping analytical approaches, as the coproduction of knowledge is known to result in credible, salient, and trusted information (Ulibarri, Reference Ulibarri2018; Cravens, Reference Cravens2016; Morisette et al., Reference Morisette, Cravens, Miller, Talbert, Talbert, Jarnevich, Fink, Decker and Odell2017).
5. Closing
The Data for Policy community is contributing to innovations in digital data use and supporting technologies and analytics for policy decision-making. We conclude this initial landscape report with three observations highlighted by our community: there is a need for a greater emphasis on (i) model explainability, (ii) broader cross-sector collaboration, and (iii) data accessibility.
First, we note that without the integration of appropriate social science theories or hypothesis testing to guide feature selection in computational modeling, there is often the “black box” temptation to model phenomena using fully data-driven approaches. Although such approaches continue to be very useful in some domains (e.g., cancer detection and pollution monitoring), a greater focus on causal inference can help prevent the social ills sometimes observed in algorithmic decision-making (for additional information, see Veale et al., Reference Veale, van Kleek and Binns2018; Data for Policy Focus Area 5: Algorithmic Governance).
Our second observation is the need to increase cross-sector collaborations. Broadening this network between academics and practitioners is especially important as significant decisions regarding the use of personal data are being made largely outside of academia. In addition, such collaborations could benefit from more direct engagement with representatives of the communities affected (both positively and negatively) and/or the general public as a way to increase trust and reduce unintended side effects. These new models of scientific collaboration will help catalyze principled engagement for data-informed decision-making within the public sector.
The third observation relates to challenges associated with data accessibility and preserving anonymity. A majority of digital data sources are concentrated in platforms controlled by private companies and are often inaccessible to government agencies. Consequently, there continue to be significant legal and financial barriers to accessing these data even when there is a compelling need (Salganik, Reference Salganik2019). In addition, digital data present significant challenges related to incidental disclosure or re-identification, and recent work has shown that personally identifiable information can be recovered even from anonymized or pseudonymized datasets (Kearns and Roth, Reference Kearns and Roth2019; De Montjoye et al., Reference De Montjoye, Redaelli, Kumar Singh and Pentland2015; Cretu et al., Reference Creţu, Frederico, Marrone, Dong, Bronstein and De Montjoye2022). Numerous approaches have been developed to address this problem, most notably the framework of differential privacy, but its application to standard social datasets has been met with criticism and faces limits related to data integrity (Cummings et al., Reference Cummings, Gupta, Kimpara and Morgenstern2019; Dwork, Reference Dwork2008; Abowd, Reference Abowd2018; Dwork, Reference Dwork, Kohli and Mulligan2019; Ruggles et al., Reference Ruggles, Fitch, Magnuson and Schroeder2019). These issues are addressed in further detail in Data for Policy Focus Area 4: Ethics, Equity, & Trustworthiness.
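The tension between privacy and data integrity can be seen in a minimal sketch of the Laplace mechanism, one standard way of achieving differential privacy for a counting query. The code below is an illustration under stated assumptions, not a description of any agency's implementation: the function names `laplace_noise` and `private_count` are hypothetical, the query is assumed to be a simple count (so adding or removing one person changes the answer by at most 1), and production systems also track a privacy budget across repeated queries.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that satisfy `predicate`.

    A count has sensitivity 1, so Laplace noise with scale 1 / epsilon
    gives epsilon-differential privacy for this single query.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical toy data: ages of ten survey respondents.
    ages = [23, 35, 41, 29, 67, 52, 38, 71, 19, 44]
    for eps in (0.1, 1.0, 10.0):
        noisy = private_count(ages, lambda age: age >= 40, eps)
        print(f"epsilon={eps}: noisy count of respondents aged 40+ = {noisy:.2f}")
```

A smaller epsilon gives stronger privacy but noisier answers, which is one way to understand the data integrity concerns raised about applying this framework to standard social datasets, particularly for small subpopulations where the noise can be large relative to the true count.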
The Area 2 committee will be focusing on manuscripts that investigate the impacts of LLMs and other generative AI models and how regulators can respond to ensure sound decision-making that benefits humanity and societies. We invite authors working at the data science-policy interface to engage with the community through the Data for Policy conference series and submissions to the Data and Policy journal.
Acknowledgments
We thank members of the editorial board and the Data for Policy community who provided valuable discussion and reviews.
Author contribution
Conceptualization: O.I.A., T.L., C.M., G.R., M.C.E.S., N.U.; Methodology: O.I.A., T.L., C.M., G.R., M.C.E.S., N.U.; Data curation: O.I.A., M.C.E.S.; Data visualization: O.I.A., T.L., C.M.; Investigation: O.I.A., T.L., C.M., G.R., M.C.E.S., N.U.; Project administration: O.I.A., C.M.; Validation: O.I.A., C.M., N.U.; Writing—original draft: O.I.A., T.L., C.M., G.R., M.C.E.S., N.U.; Writing—review & editing: O.I.A., C.M., N.U. All authors approved the final submitted draft.
Data availability statement
Manuscript submission data that support the findings of this study are available upon request from dataandpolicy@cambridge.org. The data are not publicly available due to author privacy restrictions for double-blind peer review.
Provenance
This article was authored by the Editors associated with Data for Policy Focus Area 2: Data Technologies & Analytics. It was independently reviewed.
Funding statement
O.I.A. and C.M. were partially supported by National Science Foundation Award #1945332. The funder had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interest
The authors declare no competing interests.