The nineteenth century was the first era of “big data” in the modern world, and American literary texts published during this time, such as Herman Melville’s Moby-Dick (1851), offer an aesthetic reframing of how individuals and institutions within a culture of data use information at scale to claim authority over knowledge and, by extension, power over people. Moby-Dick also gestures toward the ways that African and African American bodies were subjected to the most brutal regimes of quantification that the nineteenth century had to offer in the form of the transatlantic and intra-American slave trade. One of the major problems facing American literary studies and digital humanities today is the question of how to excavate and explicate the quantitative turn of earlier centuries as we seek to better understand the cultures of data we live in today. The best initial response to this problem is not to begin with a specific digital tool per se, but to build a set of guiding principles for how to critically approach data, media, and power from within a context that recognizes the distinctive contributions of literary texts as aesthetic objects. This essay models one such approach.
Machine learning (ML) techniques have emerged as a powerful tool for predicting weather and climate systems. However, much of the progress to date focuses on predicting the short-term evolution of the atmosphere. Here, we look at the potential for ML methodology to predict the evolution of the ocean. The presence of land in the domain is a key difference between ocean modeling and previous work looking at atmospheric modeling. We train a convolutional neural network (CNN) to emulate a process-based General Circulation Model (GCM) of the ocean, in a configuration which contains land. We assess performance on predictions over the entire domain and near the land (coastal points). Our results show that the CNN replicates the underlying GCM well when assessed over the entire domain. RMS errors over the test dataset are low in comparison to the signal being predicted, and the CNN model gives an order of magnitude improvement over a persistence forecast. When we partition the domain into near land and the ocean interior and assess performance over these two regions, we see that the model performs notably worse over the near land region. Near land, RMS scores are comparable to those from a simple persistence forecast. Our results indicate that ocean interaction with land is something the network struggles with and highlight that this may be an area where advanced ML techniques specifically designed for, or adapted for, the geosciences could bring further benefits.
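The comparison described in this abstract, RMS error of an emulator against a persistence baseline, split into near-land and interior points, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the six-point grid, the field values, the mask, and the names `rmse`, `near_land`, and `interior` are hypothetical and are not taken from the paper.

```python
import math

def rmse(pred, truth, mask=None):
    """Root-mean-square error over points where mask is True (all points if mask is None)."""
    pairs = [
        (p, t)
        for i, (p, t) in enumerate(zip(pred, truth))
        if mask is None or mask[i]
    ]
    return math.sqrt(sum((p - t) ** 2 for p, t in pairs) / len(pairs))

# Hypothetical 1-D "ocean" with six grid points; all values are made up.
state_now = [10.0, 10.0, 10.0, 10.0, 10.0, 10.0]   # field at time t
truth_next = [11.0, 11.0, 11.0, 11.0, 11.0, 11.0]  # reference field at time t+1
model_next = [11.1, 10.9, 11.0, 11.1, 10.0, 12.0]  # emulator prediction for t+1

# A persistence forecast assumes the field does not change between steps.
persistence_next = list(state_now)

# Hypothetical partition: the last two points are adjacent to land.
near_land = [False, False, False, False, True, True]
interior = [not flag for flag in near_land]

for name, mask in [("all", None), ("interior", interior), ("near land", near_land)]:
    print(
        name,
        "model:", round(rmse(model_next, truth_next, mask), 3),
        "persistence:", round(rmse(persistence_next, truth_next, mask), 3),
    )
```

With these toy numbers the emulator beats persistence by more than a factor of ten in the interior and only matches it near land, which mirrors the qualitative pattern the abstract reports without reproducing its actual results.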
Originating from a unique partnership between data scientists (datavaluepeople) and peacebuilders (Build Up), this commentary explores an innovative methodology to overcome key challenges in social media analysis by developing customized text classifiers through a participatory design approach, engaging both peace practitioners and data scientists. It advocates for researchers to focus on developing frameworks that prioritize being usable and participatory in field settings, rather than perfect in simulation. Focusing on a case study investigating the polarization within online Christian communities in the United States, we outline a testing process with a dataset consisting of 8954 tweets and 10,034 Facebook posts to experiment with active learning methodologies aimed at enhancing the efficiency and accuracy of text classification. This commentary demonstrates that the inclusion of domain expertise from peace practitioners significantly refines the design and performance of text classifiers, enabling a deeper comprehension of digital conflicts. This collaborative framework seeks to transition from a data-rich, analysis-poor scenario to one where data-driven insights robustly inform peacebuilding interventions.
The application of data analytics to product usage data has the potential to enhance engineering and decision-making in product planning. To achieve this effectively for cyber-physical systems (CPS), it is necessary to possess specialized expertise in technical products, innovation processes, and data analytics. An understanding of the process from domain knowledge to data analysis is of critical importance for the successful completion of projects, even for those without expertise in these areas. In this paper, we set out the foundation for a toolbox for data analytics, which will enable the creation of domain-specific pipelines for product planning. The toolbox includes a morphological box that covers the necessary pipeline components, based on a thorough analysis of literature and practitioner surveys. This comprehensive overview is unique. The toolbox based on it promises to support and enable domain experts and citizen data scientists, enhancing efficiency in product design, speeding up time to market, and shortening innovation cycles.
Behavioral Network Science explains how and why structure matters in the behavioral sciences. Exploring open questions in language evolution, child language learning, memory search, age-related cognitive decline, creativity, group problem solving, opinion dynamics, conspiracies, and conflict, readers will learn essential behavioral science theory alongside novel network science applications. This book also contains an introductory guide to network science, demonstrating how to turn data into networks, quantify network structure across scales, and hone one's intuition for how structure arises and evolves. Online R code allows readers to explore the data and reproduce all the visualizations and simulations for themselves, empowering them to make contributions of their own. For data scientists interested in gaining a professional understanding of how the behavioral sciences inform network science, or behavioral scientists interested in learning how to apply network science from the ground up, this book is an essential guide.
Maximise student engagement and understanding of matrix methods in data-driven applications with this modern teaching package. Students are introduced to matrices in two preliminary chapters, before progressing to advanced topics such as the nuclear norm, proximal operators and convex optimization. Highlighted applications include low-rank approximation, matrix completion, subspace learning, logistic regression for binary classification, robust PCA, dimensionality reduction and Procrustes problems. Extensively classroom-tested, the book includes over 200 multiple-choice questions suitable for in-class interactive learning or quizzes, as well as homework exercises (with solutions available for instructors). It encourages active learning with engaging 'explore' questions, with answers at the back of each chapter, and Julia code examples to demonstrate how the mathematics is actually used in practice. A suite of computational notebooks offers a hands-on learning experience for students. This is a perfect textbook for upper-level undergraduates and first-year graduate students who have taken a prior course in linear algebra basics.
Drawing examples from real-world networks, this essential book traces the methods behind network analysis and explains how network data is first gathered, then processed and interpreted. The text will equip you with a toolbox of diverse methods and data modelling approaches, allowing you to quickly start making your own calculations on a huge variety of networked systems. This book sets you up to succeed, addressing the questions of what you need to know and what to do with it, when beginning to work with network data. The hands-on approach adopted throughout means that beginners quickly become capable practitioners, guided by a wealth of interesting examples that demonstrate key concepts. Exercises using real-world data extend and deepen your understanding, and develop effective working patterns in network calculations and analysis. Suitable for both graduate students and researchers across a range of disciplines, this novel text provides a fast-track to network data expertise.
Multivariate biomarker discovery is increasingly important in the realm of biomedical research, and is poised to become a crucial facet of personalized medicine. This will prompt the demand for a myriad of novel biomarkers representing distinct 'omic' biosignatures, allowing selection and tailoring treatments to the various individual characteristics of a particular patient. This concise and self-contained book covers all aspects of predictive modeling for biomarker discovery based on high-dimensional data, as well as modern data science methods for identification of parsimonious and robust multivariate biomarkers for medical diagnosis, prognosis, and personalized medicine. It provides a detailed description of state-of-the-art methods for parallel multivariate feature selection and supervised learning algorithms for regression and classification, as well as methods for proper validation of multivariate biomarkers and predictive models implementing them. This is an invaluable resource for scientists and students interested in bioinformatics, data science, and related areas.
This commentary discusses opportunities for advancing the field of developmental psychopathology through the integration of data science and neuroscience approaches. We first review elements of our research program investigating how early life adversity shapes neurodevelopment and may convey risk for psychopathology. We then illustrate three ways that data science techniques (e.g., machine learning) can support developmental psychopathology research, such as by distinguishing between common and diverse developmental outcomes after stress exposure. Finally, we discuss logistical and conceptual refinements that may aid the field moving forward. Throughout the piece, we underscore the profound impact of Dr Dante Cicchetti, reflecting on how his work influenced our own, and gave rise to the field of developmental psychopathology.
Cyber-Physical-Systems provide extensive data gathering opportunities along the lifecycle, enabling data-driven design to improve the design process. However, its implementation faces challenges, particularly in the initial data capturing stage. To identify those, a comprehensive approach combining a systematic literature review and an industry survey was applied. Four groups of interrelated challenges were identified as most relevant to practitioners: data selection, data availability in systems, knowledge about data science processes and tools, and guiding users in targeted data capturing.
In “How Can Spin, Ply, and Knot Direction Contribute to Understanding the Quipu Code?” (2005), mathematician Marcia Ascher referenced new data on 59 Andean khipus to assess the significance of their variable twists and knots. However, this aggregative, comparative impulse arose late in Ascher's khipu research; the mathematical relations she had identified among 200+ previously cataloged khipus were specified only at the level of individual specimens. This article pursues a new scale of analysis, generalizing the “Ascher relations” to recognize meaningful patterns in a 650-khipu corpus, the largest yet subjected to computational study. We find that Ascher formulae characterize at least 74% of khipus, which exhibit meaningful arrangements of internal sums. Top cords are shown to register a minority of sum relationships and are newly identified as markers of low-level, “working” khipus. We reunite two fragments of a broken khipu using arithmetic properties discovered between the strings. Finally, this analysis suggests a new khipu convention—the use of white pendant cords as boundary markers for clusters of sum cords. In their synthesis, exhaustive search, confirmatory study, mathematical rejoining, and hypothesis generation emerge as distinct contributions to khipu description, typology, and decipherment.
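The idea of searching a khipu for internal sums can be sketched as a brute-force scan for cords whose value equals the sum of a contiguous run of other cords. This is a deliberately simplified assumption, not the article's method: the function `find_sum_relations`, the flat list representation, and the cord values are all hypothetical, and real khipus have hierarchical structure that this ignores.

```python
def find_sum_relations(values, min_run=2):
    """Return (i, start, end) triples where values[i] == sum(values[start:end])
    and cord i lies outside the run values[start:end]."""
    relations = []
    n = len(values)
    for start in range(n):
        total = 0
        for end in range(start + 1, n + 1):
            total += values[end - 1]  # running sum of values[start:end]
            if end - start < min_run or total == 0:
                continue  # skip single-cord runs and trivial zero sums
            for i in range(n):
                outside_run = i < start or i >= end
                if outside_run and values[i] == total:
                    relations.append((i, start, end))
    return relations

# Hypothetical cord values read left to right along the primary cord.
cords = [60, 10, 20, 30, 7]

for i, start, end in find_sum_relations(cords):
    print(f"cord {i} ({cords[i]}) = sum of cords {start}..{end - 1} {cords[start:end]}")
```

An exhaustive scan like this is cubic in the number of cords, which is acceptable for a single specimen; a corpus-scale search would likely need prefix sums or hashing, which this sketch leaves out.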
Political opposition to fiscal climate policy, such as a carbon tax, typically appeals to fiscal conservative ideology. Here, we ask to what extent public opposition to the carbon tax in Canada is, in fact, ideological in origin. As an object of study, ideology is a latent belief structure over a set of issue topics, and in particular their relationships, as revealed through stated opinions. Ideology is thus amenable to a generative modeling approach within the text-as-data paradigm. We use the Structural Topic Model, which generates word content from a set of latent topics and mixture weights placed on them. We fit the model to open-ended survey responses of Canadians elaborating on their support of or opposition to a carbon tax, then use it to infer the set of mixture weights used by each response. We demonstrate that this set, more so than the observed word use, efficiently discriminates opposition from support, with near-perfect accuracy on held-out data. We then operationalize ideology as the empirical distribution of inferred topic mixture weights. We propose and use an evaluation of ideology-driven beliefs based on four statistics of this distribution capturing the specificity, variability, expressivity, and alignment of the underlying ideology. We find that the ideology behind responses from respondents who opposed the carbon tax is more specific and aligned, much less expressive, and of similar variability as compared with those who support the tax. We discuss the implications of our results for climate policy and of broad application of our approach in social science.
This study aimed to identify and understand the major topics of discussion under the #sustainability hashtag on Twitter (now known as “X”) and understand user engagement. The sharp increase in social media usage combined with a rise in climate anomalies in recent years makes the area of sustainability with respect to social media a critical topic. Python was used to gather Twitter posts between January 1, 2023, and March 1, 2023. User engagement metrics were analyzed using a variety of statistical analysis methods, including keyword-frequency analysis and Latent Dirichlet Allocation (LDA), which were used to identify significant topics of discussion under the #sustainability hashtag. Additionally, histograms and scatter plots were used to visualize user engagement. LDA was run with 7 topics, a number chosen after trials with different numbers of topics were compared to determine which best fit the dataset. The frequency analysis provided a basic overview of the discourse surrounding #sustainability with the topics of technology, business and industry, environmental awareness, and discussion of the future. The LDA model provided a more comprehensive view, including additional topics such as Environmental, Social, and Governance (ESG) and infrastructure, investing, collaboration, and education. These findings have implications for researchers, businesses, organizations, and politicians seeking to align their strategies and actions with the major topics surrounding sustainability on Twitter to have a greater impact on their audience. Researchers can use the results of this study to guide further research on the topic or contextualize their study with existing literature within the field of sustainability.
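The keyword-frequency step mentioned in this abstract can be sketched with the Python standard library alone. The posts, the stopword list, and the function name `keyword_counts` below are assumptions made up for illustration; the study's own data, preprocessing, and LDA step are not reproduced here.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "and", "of", "to", "in", "for", "is", "our", "on"}

def keyword_counts(posts):
    """Count lowercase alphabetic tokens across posts, ignoring stopwords."""
    counts = Counter()
    for post in posts:
        tokens = re.findall(r"[a-z]+", post.lower())  # also drops '#' and punctuation
        counts.update(token for token in tokens if token not in STOPWORDS)
    return counts

# Made-up example posts, not taken from the study's dataset.
posts = [
    "Green technology is the future of business #sustainability",
    "Investing in clean technology for the future #sustainability",
    "Education and collaboration on climate #sustainability",
]

counts = keyword_counts(posts)
for word, n in counts.most_common(3):
    print(word, n)
```

Raw counts like these give only the basic overview the abstract attributes to frequency analysis; a topic model such as LDA additionally groups co-occurring words, which a simple counter cannot do.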
The Sustainable Development Goals are global objectives set by the UN. They cover fundamental issues in development such as poverty, education, economic growth, and climate. Despite growing data across policy dimensions, popular statistical approaches offer limited solutions as these datasets are not big or detailed enough to meet their technical requirements. Complexity Economics and Sustainable Development provides a novel framework to handle these challenging features, suggesting that complexity science, agent-based modelling, and computational social science can overcome these limitations. Building on interdisciplinary socioeconomic theory, it provides a new framework to quantify the link between public expenditure and development while accounting for complex interdependencies and public governance. Accompanied by comprehensive data of worldwide development indicators and open-source code, it provides a detailed construction of the analytic toolkit, familiarising readers with a diverse set of empirical applications and drawing policy implications that are insightful to a diverse readership.
There are now an estimated 114 million forcibly displaced people worldwide, some 88% of whom are in low- and middle-income countries. For governments and international organizations to design effective policies and responses, they require comparable and accessible socioeconomic data on those affected by forced displacement, including host communities. Such data is required to understand needs, as well as interactions between complex drivers of displacement and barriers to durable solutions. However, high-quality data of this kind takes time to collect and is costly. Can the ever-increasing volume of open data and evolving innovative techniques accelerate and enhance its generation? Are there applications of alternative data sources, advanced statistics, and machine learning that could be adapted for forced displacement settings, considering their specific legal and ethical dimensions? As a catalytic bridge between the World Bank and UNHCR, the Joint Data Center on Forced Displacement convened a workshop to answer these questions. This paper summarizes the emergent messages from the workshop and recommendations for future areas of focus and ways forward for the community of practice on socioeconomic data on forced displacement. Three recommended areas of future focus are: enhancing and optimizing household survey sampling approaches; estimating forced displacement socioeconomic indicators from alternative data sources; and amplifying data accessibility and discoverability. Three key features of the recommended approach are: strong complementarity with the existing data-collection-to-use pipeline; data responsibility built in and tailored to forced displacement contexts; and iterative assessment of operational relevance to ensure continuous focus on improving outcomes for those affected by forced displacement.
Based on the authors' extensive teaching experience, this hands-on graduate-level textbook teaches how to carry out large-scale data analytics and design machine learning solutions for big data. With a focus on fundamentals, this extensively class-tested textbook walks students through key principles and paradigms for working with large-scale data, frameworks for large-scale data analytics (Hadoop, Spark), and explains how to implement machine learning to exploit big data. It is unique in covering the principles that aspiring data scientists need to know, without detail that can overwhelm. Real-world examples, hands-on coding exercises and labs combine with exceptionally clear explanations to maximize student engagement. Well-defined learning objectives, exercises with online solutions for instructors, lecture slides, and an accompanying suite of lab exercises of increasing difficulty in Jupyter Notebooks offer a coherent and convenient teaching package. An ideal teaching resource for courses on large-scale data analytics with machine learning in computer/data science departments.
Despite the critical role that quantitative scientists play in biomedical research, graduate programs in quantitative fields often focus on technical and methodological skills, not on collaborative and leadership skills. In this study, we evaluate the importance of team science skills among collaborative biostatisticians for the purpose of identifying training opportunities to build a skilled workforce of quantitative team scientists.
Methods:
Our workgroup described 16 essential skills for collaborative biostatisticians. Collaborative biostatisticians were surveyed to assess the relative importance of these skills in their current work. The importance of each skill is summarized overall and compared across career stages, highest degrees earned, and job sectors.
Results:
Survey respondents were 343 collaborative biostatisticians spanning career stages (early: 24.2%, mid: 33.8%, late: 42.0%) and job sectors (academia: 69.4%, industry: 22.2%, government: 4.4%, self-employed: 4.1%). All 16 skills were rated as at least somewhat important by > 89.0% of respondents. Significant heterogeneity in importance by career stage and by highest degree earned was identified for several skills. Two skills (“regulatory requirements” and “databases, data sources, and data collection tools”) were more likely to be rated as absolutely essential by those working in industry (36.5%, 65.8%, respectively) than by those in academia (19.6%, 51.3%, respectively). Three additional skills were identified as important by survey respondents, for a total of 19 collaborative skills.
Conclusions:
We identified 19 team science skills that are important to the work of collaborative biostatisticians, laying the groundwork for enhancing graduate programs and establishing effective on-the-job training initiatives to meet workforce needs.
Statistical profiling of job seekers is an attractive option to guide the activities of public employment services. Many hope that algorithms will improve both efficiency and effectiveness of employment services’ activities that are so far often based on human judgment. Against this backdrop, we evaluate regression and machine-learning models for predicting job-seekers’ risk of becoming long-term unemployed using German administrative labor market data. While our models achieve competitive predictive performance, we show that training an accurate prediction model is just one element in a series of design and modeling decisions, each having notable effects that span beyond predictive accuracy. We observe considerable variation in the cases flagged as high risk across models, highlighting the need for systematic evaluation and transparency of the full prediction pipeline if statistical profiling techniques are to be implemented by employment agencies.
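The observation that different models flag different cases as high risk can be made concrete with a small overlap check. The sketch below is an assumption for illustration only: the risk scores, case ids, the 50% flagging share, and the helper names `flag_high_risk` and `jaccard` are hypothetical and do not come from the study.

```python
def flag_high_risk(scores, share):
    """Return the ids of the highest-scoring cases, sized to the given share of all cases."""
    k = max(1, round(len(scores) * share))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

def jaccard(a, b):
    """Overlap of two sets: size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)

# Made-up predicted risks of long-term unemployment from two hypothetical models.
scores_model_a = {"a": 0.90, "b": 0.80, "c": 0.40, "d": 0.30, "e": 0.20, "f": 0.10}
scores_model_b = {"a": 0.85, "b": 0.25, "c": 0.70, "d": 0.20, "e": 0.30, "f": 0.10}

flagged_a = flag_high_risk(scores_model_a, share=0.5)
flagged_b = flag_high_risk(scores_model_b, share=0.5)

print("flagged by A:", sorted(flagged_a))
print("flagged by B:", sorted(flagged_b))
print("Jaccard overlap:", jaccard(flagged_a, flagged_b))
```

Two models can have similar aggregate accuracy and still disagree on which individuals they flag, so an overlap measure of this kind is one simple way to surface the variation the abstract describes.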
Edited by
Xiuzhen Huang, Cedars-Sinai Medical Center, Los Angeles; Jason H. Moore, Cedars-Sinai Medical Center, Los Angeles; Yu Zhang, Trinity University, Texas
The goal of this chapter is to explore and review the role of artificial intelligence (AI) in scientific discovery from data. Specifically, we present AI as a useful tool for advancing a No-Boundary Thinking (NBT) approach to bioinformatics and biomedical informatics. NBT is an agnostic methodology for scientific discovery and education that accesses, integrates, and synthesizes data, information, and knowledge from all disciplines to define important problems, leading to innovative and significant questions that can subsequently be addressed by individuals or collaborative teams with diverse expertise. Given this definition, AI is uniquely poised to advance NBT as it has the potential to employ data science for discovery by using information and knowledge from multiple disciplines. We present three recent AI approaches to data analysis that each contribute to a foundation for an NBT research strategy by either incorporating expert knowledge, automating machine learning, or both. We end with a vision for fully automating the discovery process while embracing NBT.
This article aims to explore the ethical issues arising from attempts to diversify genomic data and include individuals from underserved groups in studies exploring the relationship between genomics and health. We employed a qualitative synthesis design, combining data from three sources: 1) a rapid review of empirical articles published between 2000 and 2022 with a primary or secondary focus on diversifying genomic data, or the inclusion of underserved groups and ethical issues arising from this, 2) an expert workshop and 3) a narrative review. Using these three sources we found that ethical issues are interconnected across structural factors and research practices. Structural issues include failing to engage with the politics of knowledge production, existing inequities, and their effects on how harms and benefits of genomics are distributed. Issues related to research practices include a lack of reflexivity, exploitative dynamics and the failure to prioritise meaningful co-production. Ethical issues arise from both the structure and the practice of research, which can inhibit researcher and participant opportunities to diversify data in an ethical way. Diverse data are not ethical in and of themselves, and without being attentive to the social, historical and political contexts that shape the lives of potential participants, endeavours to diversify genomic data run the risk of worsening existing inequities. Efforts to construct more representative genomic datasets need to develop ethical approaches that are situated within wider attempts to make the enterprise of genomics more equitable.