
Can Generative AI Produce Novel Evidence?

Published online by Cambridge University Press: 27 August 2025

Donal Khosrowi*
Affiliation: Institute of Philosophy, Leibniz University Hannover, Germany

Finola Finn
Affiliation: Centre for Contemporary and Digital History, University of Luxembourg, Luxembourg

*Corresponding author: Donal Khosrowi; Email: donal.khosrowi@philos.uni-hannover.de

Abstract

Researchers in history and the historical sciences explore the use of generative AI (GenAI) systems for reconstructing destroyed artifacts. This paper poses a novel question: Can such GenAI systems generate evidence that provides new knowledge about the world or can they only produce hypotheses that we might seek evidence for? Exploring responses to this question, the paper argues that (1) GenAI outputs can at least be understood as higher-order evidence (Parker 2022) and (2) may also constitute de novo synthetic evidence.

Information

Type: Contributed Paper

Creative Commons Licence (CC BY)
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of the Philosophy of Science Association

1. Introduction

Artificial intelligence (AI) systems, including generative AI (GenAI), play ever larger roles across the sciences: They are used to make novel discoveries, e.g., of proteins, drugs, or materials (Jumper et al. 2021; Wang et al. 2023); to identify new concepts and equations in physics (Iten et al. 2020; Udrescu et al. 2020; Wu and Tegmark 2019); and to suggest new hypotheses, ideas, research questions, or experiments (Krenn et al. 2023; Melnikov et al. 2018). These increasingly extensive roles played by AI put foundational concepts we use to understand and structure scientific pursuits under pressure. For instance, what does it mean to be a scientific ‘discoverer’ (Clark and Khosrowi 2022)? Are AI systems like AlphaFold only ‘tools’ that humans use, or can they exhibit attributes such as epistemic ‘autonomy’ or scientific ‘understanding’ (Barman et al. 2024) that we consider essential to the role of ‘discoverer’ or ‘researcher’?

While use cases of AI in physics, chemistry, and biology have attracted increasing attention from philosophers, there are also underexplored emerging uses of AI in history, archaeology, and the historical sciences more broadly, where researchers explore the use of GenAI for reconstructing partially destroyed manuscripts and artifacts (Navarro et al. 2022, 2023; Assael et al. 2022; Lamb et al. 2022; Papavassileiou et al. 2023; Moral-Andrés et al. 2023; Wang et al. 2024). Turning attention to these uses, this paper draws out a novel conceptual disruption (Löhr 2023; Hopster and Löhr 2023) regarding how we should understand the outputs of GenAI systems: Can GenAI systems generate evidence that provides genuinely new knowledge in the way that, say, finding new material evidence can? Or can they only produce hypotheses, which may give us reasons for pursuit, but ultimately are the kind of thing that we require evidence for? Call this the evidence question. Like other conceptual disruptions caused by AI, the evidence question does not have a straightforward answer and highlights substantial uncertainty around how we should apply the concept of ‘evidence’ (see also Rowbottom et al. 2023 and Zakharova 2024 for related projects). The issues this raises are not merely terminological but have epistemic and methodological import for practicing researchers. Classifying an output as ‘evidence’ rather than a ‘hypothesis’ confers information about it; in turn, existing norms attached to these concepts may trigger different expectations, attitudes, and actions as appropriate in relation to an output.

Beyond putting the evidence question on the map, this paper also explores potential responses to it. We first consider related debates in the philosophy of computer simulation, where scholars such as Wendy Parker (2022) have examined whether simulation systems, such as those used in climate science, can provide (new) evidence for claims about the Earth’s climate. Drawing on this debate, we argue that GenAI systems can at least provide higher-order evidence in Parker’s sense, i.e. evidence that other evidence for a claim about the world exists. We then proceed to explore a more ambitious argument, according to which GenAI systems can produce de novo synthetic evidence, which could be epistemically on par with traditional forms of evidence, such as material evidence or expert judgment. The argument suggests they do so by performing pattern recognition-type inferences to yield outputs that provide genuinely new knowledge to agents who lack the ability to make those same inferences. Importantly, while this argument hints at interesting possibilities for understanding GenAI outputs as de novo synthetic evidence, it remains agnostic on what historians and other historical scientists should or would do with such evidence. In particular, we do not suggest that synthetic evidence is ever an end point or silver bullet for historical and archaeological inquiry (Nygren and Drimmer 2023). If used, it would require description, analysis, contextualization, and interpretation by researchers, as with any other form of evidence.

The discussion is organized as follows. Section 2 outlines the emerging use of GenAI in history and the historical sciences. Section 3 sharpens the evidence question. Section 4 explores debates in the philosophy of computer simulation and sketches sequential arguments that GenAI can produce at least higher-order evidence and, possibly, synthetic evidence. Section 5 makes concrete the conditions under which we may take GenAI outputs seriously. Section 6 concludes.

2. Generative AI in history and the historical sciences

A central challenge for historical researchers is that the ‘record’ of historical evidence, e.g. manuscripts or artifacts such as pottery, is an imperfect and partial reflection of past events, and is eroded – both figuratively and literally. Not everything survives, and what does is often incomplete or broken. Standard activities in getting a handle on the past (e.g. analyzing the stratigraphic relationships between features at archaeological sites, entertaining larger inferences about chronology, or inferring trade patterns) hence revolve around reconstructing what was from what remains. Reconstructing partially destroyed artifacts, e.g. to better determine relevant morphological or textural features, is currently often performed by hand, which is resource intensive, can further deteriorate remaining fragments, and cannot deal with fragments that are missing (Navarro et al. 2023). In dealing with these and similar challenges, there is a rich tradition in the historical sciences, especially in archaeology, of recruiting technologies from other fields (Wylie 2000), e.g. for sensing and scanning, or, in computational archaeology, using machine-learning methods. For instance, Navarro et al. (2023) develop a GenAI system based on generative adversarial networks (GANs; Goodfellow et al. 2014) called IberianVoxel, which reconstructs broken Iberian pottery artifacts as 3D models. GANs consist of a coupled generator and discriminator architecture; in Navarro et al.’s case, the generator produces 3D voxel geometries of pottery and the discriminator ‘judges’ whether the geometries produced by the generator look like they were drawn from the data distribution of scanned real artifacts on which it is trained. After a period of adversarial training, the GAN is evaluated, including by surveying domain experts to assess reconstruction quality. The authors report that “archaeologists judge that IberianVoxel generated a correct Iberian style from an initial fragment, and also consider that the reconstructed pottery is between Good and Very Good” (2023, 5839), and conclude their system is “very helpful for exploring and designing automatic procedures to aid experts with the pottery completion task” (5833).
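
To make the generator–discriminator coupling concrete, here is a minimal sketch of adversarial training over voxel grids, assuming PyTorch. The layer sizes, grid resolution, and class names are illustrative assumptions, not IberianVoxel’s published design.

```python
# Minimal adversarial training loop over 3D voxel grids (PyTorch).
# All names, sizes, and layers are illustrative stand-ins, not the
# published IberianVoxel architecture.
import torch
import torch.nn as nn

VOX = 16       # voxel grid resolution (assumption; real systems are finer)
Z_DIM = 64     # latent dimension (assumption)
BATCH = 32

class Generator(nn.Module):
    """Maps a latent noise vector to a 3D voxel occupancy grid."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(Z_DIM, 256), nn.ReLU(),
            nn.Linear(256, VOX ** 3), nn.Sigmoid(),  # occupancy in [0, 1]
        )

    def forward(self, z):
        return self.net(z).view(-1, 1, VOX, VOX, VOX)

class Discriminator(nn.Module):
    """'Judges' whether a voxel grid looks drawn from the real-scan data."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(VOX ** 3, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1),  # logit: real vs. generated
        )

    def forward(self, x):
        return self.net(x)

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real_scans = torch.rand(BATCH, 1, VOX, VOX, VOX)  # placeholder for scanned artifacts

for step in range(100):
    # Discriminator step: learn to separate real scans from generated grids.
    fake = G(torch.randn(BATCH, Z_DIM)).detach()
    loss_d = (bce(D(real_scans), torch.ones(BATCH, 1))
              + bce(D(fake), torch.zeros(BATCH, 1)))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: produce grids the discriminator accepts as real.
    loss_g = bce(D(G(torch.randn(BATCH, Z_DIM))), torch.ones(BATCH, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```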

Systems such as IberianVoxel are first steps on a trajectory towards more advanced systems permitting finer-grained inferences, especially as GenAI technologies become cheaper to train. Extrapolating along this trajectory, let us imagine a stylized toy case inspired by IberianVoxel to draw out the central question of this paper more clearly. Consider AlphaPot, an imagined GenAI system that has been trained on a very large dataset consisting of images and corresponding high-quality 3D scans of a wide range of pottery artifacts in various states of decay. AlphaPot is trained to reconstruct masked/corrupted features of an input, i.e. parts of a 3D model of a scanned real artifact are intentionally corrupted (e.g. by generating synthetic data based on real artifacts that simulate fragmentation, see Lamb et al. 2022, or by physically breaking artifacts and then re-scanning them) and the system is forced to predict how the uncorrupted artifact would have looked. Assume we are impressed with AlphaPot’s performance on unseen test data: It accurately reconstructs broken artifacts for which the ground truth geometry is known. Imagine now that we use AlphaPot to provide a reconstruction R of a novel, partially destroyed artifact A, for which the ground truth is unknown. A is missing pieces that haven’t been recovered, but are believed to be essential to classifying A’s likely origin or function. R, let us assume, is a plausible-looking 3D model exhibiting fine-grained morphological features that would significantly aid a domain expert (or, for that matter, another AI system) in telling when and where A originated.
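
AlphaPot is imagined, so any implementation detail is stipulative; still, the masked-reconstruction training regime described above can be sketched as follows, assuming PyTorch and a crude block-corruption in place of realistic fracture simulation (the `corrupt` helper and all sizes are hypothetical).

```python
# Illustrative masked-reconstruction objective for the imagined AlphaPot:
# corrupt part of an intact voxel scan, train a network to restore the
# original, then query it on an artifact whose ground truth is unknown.
import torch
import torch.nn as nn

VOX = 16  # voxel grid resolution (assumption)

def corrupt(scan: torch.Tensor) -> torch.Tensor:
    """Zero out a random block, crudely simulating missing fragments."""
    damaged = scan.clone()
    i, j, k = torch.randint(0, VOX // 2, (3,)).tolist()
    damaged[..., i:i + VOX // 2, j:j + VOX // 2, k:k + VOX // 2] = 0.0
    return damaged

# Toy stand-in for a 3D encoder-decoder reconstruction network.
reconstructor = nn.Sequential(
    nn.Flatten(),
    nn.Linear(VOX ** 3, 512), nn.ReLU(),
    nn.Linear(512, VOX ** 3), nn.Sigmoid(),
)
opt = torch.optim.Adam(reconstructor.parameters(), lr=1e-3)

intact_scans = torch.rand(64, 1, VOX, VOX, VOX)  # placeholder ground truths

for step in range(200):
    damaged = corrupt(intact_scans)
    pred = reconstructor(damaged).view_as(intact_scans)
    # Loss compares the prediction against the known intact geometry.
    loss = ((pred - intact_scans) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Inference on a novel, partially destroyed artifact A (ground truth
# unknown): the output is the reconstruction R discussed in the text.
scan_A = corrupt(torch.rand(1, 1, VOX, VOX, VOX))
R = reconstructor(scan_A).view(1, 1, VOX, VOX, VOX)
```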

3. What’s going on here, epistemically?

The key disruption motivating this paper is now clearly in view: What’s going on here, epistemically? Has AlphaPot generated a hypothesis or made a prediction? Or has it generated evidence, providing experts with genuinely new knowledge about how A looked when it was still intact? Understood as a mere hypothesis, R is the kind of thing that might give us reasons for pursuit, e.g. to seek further evidence to support that R is indeed what A looked like when it was still intact. By contrast, understood as evidence, R might already, by itself, support a range of hypotheses regarding A, as well as figure in further downstream inferences that A bears on, e.g. about trade taking place between communities.[1]

In making progress on the evidence question, we need to find a benchmark first. A standard Bayesian conception of evidence requires only that a token of evidence E has the capacity to increase the posterior probability we assign to a hypothesis H (say, a claim about what A looked like) relative to some background theory T (Bovens and Hartmann 2003). It is easy to imagine that R has some such capacity, but that is not a very interesting insight (see also Rowbottom et al. 2023, who explore additional complications). Other, functional conceptions of evidence focus on what role evidence plays. Here, we are sympathetic to accounts that consider evidence as always being (1) of something, (2) for something, and (3) to someone, relative to a theory of evidence (Hacking 2006; Martini 2021; see also Kosso 2009; Jordanova 2012). For instance, a freshly excavated artifact A is evidence of something, e.g. the fact that pottery of A’s kind was made, used, or traded at site S; evidence for something, e.g. an inferred claim that pottery of A’s kind was produced in P but ended up at S through a trade route; and evidence to someone who has a theory of evidence T and relevant background knowledge K to tell what A can be evidence of and for. Beyond following such structured conceptions, the subsequent discussion will remain largely uncommitted to specific philosophical accounts of evidence. Instead, we find it more productive to consider evidential practices in history and the historical sciences more broadly and think about what existing benchmark types of evidence we could compare GenAI outputs to. What could such benchmarks be? Historical researchers rely on primary sources, e.g. artifacts and documents that are close (causally, spatially, temporally, by provenance) to the phenomena of interest. Relevant benchmarks to address the evidence question could hence be, for instance, a highly similar, intact artifact B found in the same stratum at the same site, or pertinent text, illustrations, or tools bearing on the likely morphological features of A. Likewise, expert judgment that joins up available background theory and primary material evidence in a larger inference is another candidate. The evidence question is sharper now: Could AlphaPot’s outputs be considered evidence comparable to these benchmarks, e.g. other, material evidence like B that could licence an analogical inference that “A would have probably looked like B when it was intact”, or expert judgment that joins various resources together to yield, say, a rendition or description of what A would look like, were it still intact?
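
For reference, the Bayesian condition mentioned at the start of this paragraph can be stated compactly; this is a minimal formalization using the paragraph’s own notation (E, H, T).

```latex
% Incremental (Bayesian) conception of evidence: E is evidence for H,
% relative to background theory T, just in case conditioning on E raises
% the probability of H.
\[
  E \text{ is evidence for } H \text{ relative to } T
  \quad\iff\quad
  P(H \mid E, T) \;>\; P(H \mid T).
\]
% In the toy case: R counts as (minimal) Bayesian evidence for the
% hypothesis H = "A had such-and-such intact shape" as soon as
% P(H | R, T) > P(H | T).
```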

4. Yes, but what kind of evidence is it?

The answer that we want to explore here is yes, GenAI systems like AlphaPot have the capacity to generate synthetic evidence that provides genuinely new knowledge about the world. What could an argument for such a thesis look like? A first pass could build on familiar successes of using AI systems for inferential tasks in science, like AlphaFold 2.0 (Jumper et al. 2021). Specifically, at training, these systems (1) latch onto information, especially high-dimensional and distributed correlational information or patterns, in training data, and (2) learn a model, i.e. an abstract representational space encoding relevant features and a corresponding function F within that space, which maps inputs to outputs in a way that minimizes empirical risk (at least locally). At inference, such models, given an input (e.g. a scan of a partially destroyed artifact), (3) generate outputs that yield accurate reconstructions of the input, as governed by F. This kind of story could touch on results in machine learning (e.g. Cybenko 1989) and statistical learning theory (e.g. Vapnik 2000; Bargagli Stoffi et al. 2022) to explain notable successes of machine learning systems, e.g. in latching onto complex, subtle, and distributed patterns that escape human attention, such as in skin cancer classification or protein structure prediction, or in successfully learning novel high-dimensional representations (e.g. word or image embeddings) that can be used for text and image synthesis, as demonstrated by GenAI systems like ChatGPT or Stable Diffusion (Rombach et al. 2022).
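
Step (2) of this story amounts to standard empirical risk minimization, which can be stated compactly; this is a textbook formulation rather than a description of any particular system cited above.

```latex
% Empirical risk minimization: given training pairs (x_i, y_i), e.g.
% corrupted scans x_i and their intact ground truths y_i, the system
% selects from a hypothesis class \mathcal{F} the map minimizing the
% average loss on the training data.
\[
  \hat{F} \;=\; \operatorname*{arg\,min}_{F \in \mathcal{F}}\,
  \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(F(x_i),\, y_i\bigr),
\]
% where \ell is a task-appropriate loss, e.g. voxel-wise squared error.
```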

This story, while somewhat compelling, is still too simple. Here, we focus on concerns arising from related debates in the philosophy of scientific models and computer simulation. In this space, philosophers have tried to understand whether models and simulations can provide genuinely new knowledge about the world and, if so, how (e.g. Parker 2022; Beisbart 2012). In a nutshell, sceptics about the epistemological significance of models and simulations point out that these tools only help us recognize the consequences of knowledge that we already possess, such as assumptions (e.g. of equations) and initial conditions (e.g. measurements, parameterizations). These consequences can at most be evidence in the sense that they provide new information to agents who are not able to, or simply did not, derive those same consequences given the same assumptions and initial conditions. But they would not be evidence to a more ideal agent who would already recognize these consequences under some form of inferential closure. So, while observation and experimentation allow us to gather new experience (Beisbart 2012, 245), models and simulations don’t bring anything new to the table; though they do help limited epistemic agents better see what’s already on the table.

What does this mean for the evidence question? Parker summarizes the consequences of the sceptical view on computer simulation as follows: “If computer simulation is at bottom an attempt to calculate the implications of a set of modelling assumptions, then simulation results … seem to be predictions rather than evidence; they are the kind of thing we might seek evidence for” (Parker 2022, 1522; emphasis added). On such a view, the outputs of GenAI systems like AlphaPot are predictions, or, more generally, hypotheses. They might alert us to possibilities for how an artifact may have looked, and may give us reasons for pursuing these hypotheses by means of bringing evidence to bear on them; but they are not to be taken as evidence that could already, by itself, support knowledge claims about artifacts or figure importantly (alongside other evidence) in larger, downstream inferences, such as about trade taking place between different communities.

Filling the space between more extreme views that either consider simulation results evidence, or deny that they can ever be, Parker offers a finer-grained view to characterize what simulations provide to agents. Specifically, Parker argues that simulation outputs can be higher-order evidence: they can be evidence E that other evidence E′ for a hypothesis H exists. Specifically, such higher-order evidence can help agents obtain genuinely new knowledge of the world if they either (1) don’t have access to E′, or else (2) lack the background knowledge needed to understand how E′ bears on H. So, while simulations “do not provide information about the world that goes beyond that which is already implicit in their assumptions, particular epistemic agents—including even scientists and engineers using simulation models—might still gain genuinely new knowledge of the world via simulation” (Parker 2022, 1522).

Parker’s view offers a useful backstop for thinking about GenAI outputs. At the very least, they seem able to figure as higher-order evidence. A 3D reconstruction of a broken artifact A from a suitably validated system like AlphaPot provides new knowledge about specific artifacts to agents who either don’t have access to the training data[2] E′ that bear on the reconstructive query about A, or else lack the background knowledge to understand how E′ bears on questions about A. This is a useful insight already, but it also seems interesting to explore whether GenAI systems could ever provide more than ‘just’ higher-order evidence.

4.1. More than higher order?

What might GenAI systems be doing that goes beyond what simulation systems do? A central difference seems to be that GenAI systems can exhibit higher degrees of independence, which allows them to perform computations that instantiate kinds of inference that simulation systems do not. Specifically, simulation systems in the climate sciences are built based on highly developed antecedent understanding of the physics equations describing aspects of the Earth’s climate system (background knowledge), parameterized according to our best understanding of key parameters and known/understood aspects of the phenomena involved, and calibrated using data regarding the Earth’s climate system. Together, these inputs substantially constrain the behavior of simulation systems.

GenAI systems exhibit comparatively higher independence because they are not as tightly constrained. There are no accepted equations that describe, say, the ‘grammar’ of Iberian pottery. Nor, for lack of such equations, are there measurements that GenAI systems are parameterized with. In short, there is no developed body of background knowledge that is explicitly encoded when building GenAI systems (at least in unsupervised/self-supervised regimes), nor would our existing background knowledge permit building systems in a way that mirrors the strategies behind building simulation systems. Rather, the very purpose of machine learning approaches is often to extract pertinent background knowledge from data, e.g. to find a function F that usefully captures features of a joint distribution and can be used to perform successful inferences. For this enterprise to be successful, GenAI systems must exhibit considerable degrees of freedom to ‘settle’ on representational spaces, representations, and input–output relationships that are (1) predictively useful, (2) possibly inaccessible to humans by other means (e.g. visual inspection), and (3) potentially novel to humans. GenAI systems hence harbor the capacity for a special kind of novelty in their outputs. Unlike simulation systems, they can generate synthetic evidence, i.e. evidence E that is not only psychologically novel to agents who lack other evidence E′ or background knowledge K, but is novel to agents who do not possess the same inferential abilities to extract pertinent knowledge K (e.g. of F) from the same training data. Such abilities are different from computational abilities to derive implications of equations and initial conditions. They are more akin to the inferential ability to ‘recognize’ that such-and-such is a good way to represent or compress data, or that such-and-such is a successful (i.e. error-minimizing) way to ‘fill in the blanks’ of a reconstructive query.

On the narrative presented here, GenAI systems bring inferential abilities to the table that simulation systems don’t. But why should this lead us to conclude that they can generate evidence that provides genuinely new knowledge to agents? Couldn’t, or shouldn’t, we still maintain that the relevant information with bearing on reconstructive queries ‘resides in’ the training data that GenAI systems are trained on?[3] This would bring us back to understanding GenAI systems as, at most, providing higher-order evidence in Parker’s sense, and to concluding that no evidence is generated that is novel over and above whatever is contained in these data.

A good way to explore how GenAI systems can provide novelty beyond higher-order evidence is to think about patterns. A standard success narrative in machine learning-based inference, alluded to above, is centrally tied to systems’ abilities to identify patterns, including subtle and distributed ones, and to exploit them for inference. But what is a pattern, anyway? This is yet another conceptual issue that ML systems press us to confront with greater care. Here, it is useful to distinguish two general types of views on patterns: ontic and epistemic views. On the first, a pattern is constituted by a collection of material facts about the world that may be distributed across entities. A pattern, on this view, is always ‘there’, even without a mind to recognize and exploit it for inference (see Ladyman and Ross 2007 on ‘real patterns’). On an epistemic view (cf. Dennett 1991; McAllister 2010; Haugeland 1998; Kästner and Haueis 2021), patterns come into existence through an epistemic agent that recovers them, including, say, by devising an ontology of entities and features ranging over a domain (e.g. pots, fractures, materials, textures); by making efforts to describe and represent these entities and features within that domain in an abstract way (e.g. material or shape types); and by exploring how these representations hang together, e.g. causally or probabilistically, at that abstract representational level. On such a view, a pattern is instantiated by, refers to, and supervenes on, concrete material things, but ultimately resides at an abstract representational level (cf. Ladyman and Ross 2007 on ‘second-order patterns’). If we find such an epistemic view compelling, then this allows that GenAI systems, like other epistemic agents, can perform inferential activities that bring patterns into existence.[4] This ability sets GenAI systems apart from simulation systems: they may produce outputs, based on abilities to infer patterns from data, that are novel to agents who do not possess such abilities.[5]

5. Strictures on synthetic evidence

We now have a sketch of an argument for the claim that GenAI systems like AlphaPot may produce synthetic evidence, i.e. evidence E that provides genuinely new knowledge about the world to agents who do not possess the same inferential abilities to recover E from primary evidence E′ as the system that produced E. But when can we expect GenAI systems to produce good synthetic evidence? As researchers are exploring use cases of LLMs in history, for instance to ‘ventriloquize’ the voices of the past through LLMs trained and/or fine-tuned on historical text corpora, enabling researchers to ‘query’ past societies or individuals (Hutson et al. 2024), there is a real risk of low-cost bogus AI-driven research. While there are a variety of salient concerns about the reliability of GenAI systems, such as regarding ‘hallucinations’, brittleness, lack of generalization abilities, and epistemic opacity, here we outline some potential virtues that GenAI systems may exhibit, if designed and deployed responsibly. These virtues help us better understand the conditions under which we may reasonably hope these systems to make valuable epistemic contributions.

  • Scope: GenAI systems are good at processing and ‘drawing on’ large amounts of rich data, which is relevant when patterns are distributed across large numbers of entities and different data modalities.

  • Sensitivity: ML systems are known to usefully latch onto subtle, distributed patterns, especially in quantitative data, that are often not accessible to human perception.

  • Probabilism: ML-based inference is probabilistic. Outputs are sampled from a modeled joint distribution. This often means that other possibilities for an output are not discarded by a system, but remain, or could be made, available to investigators.

  • Mechanicity: Many GenAI systems produce (near-)repeatable outputs from the same inputs, so they can be subjected to systematic intervention, allowing investigators to understand how outputs depend on inputs. For instance, they may upsample rare input types (e.g., by using synthetic training data to induce more variation regarding specific artifact types) and gauge whether outputs change for specific query types (this point and Probabilism above are illustrated in the sketch after this list).

  • Theory freedom/agnosticism: Especially in unsupervised or self-supervised learning regimes, GenAI systems organize data somewhat independently of existing theory, working against unhelpful forms of theory-laden observation.

  • Complexity: Universal function approximation theorems (e.g. Cybenko Reference Cybenko1989) and statistical learning theory (Vapnik Reference Vapnik2000; Bargagli Stoffi et al. Reference Bargagli Stoffi, Cevolani and Gnecco2022) provide (probabilistic) guarantees for specific system types to successfully approximate arbitrarily complex input–output relationships under suitable conditions. This is important as there are no good reasons to believe that, say, the ‘grammar’ of Iberian pottery (i.e. the ‘rules’ that best describe the joint distribution of morphological features of Iberian pottery artifacts) is easily captured by simple, human-expressible functions.

  • Granularity: GenAI systems perform inference at multiple levels, including at fine-grained pixel or voxel levels that may not be salient to human investigators. Such systems are hence not as susceptible as humans to latch exclusively onto patterns or analogies obtaining at higher, more salient levels of analysis, e.g. inferring that artifact A probably had inscription S because B, C, and D, which look morphologically similar, do.
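
To illustrate how Probabilism and Mechanicity could be probed in practice, here is a minimal sketch, assuming PyTorch; the `reconstruct` function is a hypothetical placeholder for a stochastic reconstructor like the imagined AlphaPot.

```python
# Hypothetical probe of the Probabilism and Mechanicity virtues: sample
# several candidate reconstructions rather than one, and use seeding to
# make outputs repeatable so input interventions can be compared cleanly.
import torch

def reconstruct(scan: torch.Tensor) -> torch.Tensor:
    """Placeholder stochastic reconstructor: output = input + sampled noise."""
    return (scan + 0.1 * torch.randn_like(scan)).clamp(0.0, 1.0)

scan_A = torch.rand(1, 1, 16, 16, 16)  # scan of partially destroyed artifact A

# Probabilism: keep a set of sampled candidates available to investigators
# instead of discarding all alternatives to a single output.
candidates = [reconstruct(scan_A) for _ in range(8)]

# Mechanicity: with a fixed seed, the same input yields the same output...
torch.manual_seed(0); r1 = reconstruct(scan_A)
torch.manual_seed(0); r2 = reconstruct(scan_A)
assert torch.allclose(r1, r2)

# ...so a controlled intervention on the input (masking one more region)
# can be traced to a specific change in the output.
intervened = scan_A.clone()
intervened[..., :8, :8, :8] = 0.0
torch.manual_seed(0); r3 = reconstruct(intervened)
print("mean output shift under intervention:", (r3 - r1).abs().mean().item())
```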

Of course, specific GenAI systems are not guaranteed to exhibit any of these virtues to significant degrees; only well-engineered systems may. Moreover, many of the candidate virtues outlined here can turn into vices if the properties they track are expressed too strongly: think of theory freedom or scope, which could lead a system to attend to irrelevant or misleading information when there are good theoretical reasons not to. Spelling out a contextualist virtue epistemology for GenAI systems in science is arguably a larger project that will require more space, which is why the virtues sketched here should provide only some early inspiration rather than a full-fledged account of what GenAI systems may bring to the table. That said, it seems promising to explore such an account in articulating answers to the evidence question.

6. Conclusions

This paper puts an important new question about the role of generative AI (GenAI) systems in the sciences on the map. The evidence question asks: Can GenAI systems generate evidence that provides agents, including experts, with genuinely new knowledge about the world? Focusing on history and the historical sciences more broadly, where researchers explore the use of GenAI systems to reconstruct partially destroyed manuscripts and artifacts to learn about the past, we argued that it is currently unclear whether we should understand the outputs produced by these systems as mere hypotheses or as evidence, where the former may give researchers reasons for pursuit and for seeking out further evidence, while the latter may already licence knowledge claims about the world and figure directly in supporting further inferences. Given this conceptual and practical uncertainty, we sketched how we may understand GenAI outputs not only as higher-order evidence in the sense of Parker (2022) but also as synthetic evidence, i.e. evidence that can provide agents, including experts, with genuinely new knowledge about the world. They do so by acquiring and deploying pattern recognition-type inferential abilities to produce outputs that are evidence to agents who lack those same inferential abilities, which may include even our best domain experts. The scope of this argument sketch is narrow: it applies, for now, only to the emerging uses of GenAI in the historical sciences discussed here. But zooming out, the evidence question may also extend to a range of other domains that explore the utility of GenAI for advanced inferential tasks (e.g. structural biology and materials science). For philosophers of science this is good news, inviting us to help characterize and resolve the conceptual and methodological disruptions affecting emerging scientific practices, and to contribute to the development of sound methodologies involving GenAI.

Acknowledgments

We thank Wendy Parker as well as the audiences at PSA 2024, Aarhus University, DKPhil2024, the University of Groningen, University of Tübingen, the GRK-SOCRATES Colloquium, the Machine Discovery and Creation Virtual Workshop, LICPOS 2023, and the 2023 MCMP-LUH-Wuppertal Workshop for their many helpful questions and suggestions, which contributed significantly to refining the arguments presented here.

Declarations

None to declare.

Funding information

The research for this article was supported by a grant from the Ministry of Science and Culture of Lower Saxony (MWK), Grant No. 11-7620-1155/2021. Towards its completion, this research was also supported by the Luxembourg National Research Fund (FNR; Grant No. 13307816).

Footnotes

[1] To be clear, we do not draw a principled distinction between ‘hypothesis’ and ‘evidence.’ In line with Bayesian accounts (e.g. Bovens and Hartmann 2003), the difference is contextual. Another way of putting the evidence question is whether R constitutes, or gives rise to, a mere hypothesis H that enjoys no support thus far, and hence has an uninformative or low prior, or whether R constitutes, or gives rise to, a pre-justified hypothesis H′ about A that (1) has a high prior and (2) may stand in relevant evidential relationships with yet other hypotheses H″, e.g. about A’s likely origin, or whether trade took place between where A was found and other locations.

[2] In particular, they might lack access to the information contained in that data, e.g. regularities about the ‘grammar’ of Iberian pottery.

[3] This concern also flags a version of the problem of old evidence; see, e.g., Sprenger (2015) for a discussion.

[4] This view is still compatible with realist views like Ladyman and Ross’s (2007); epistemic agents, or GenAI systems on our account, bring into existence real second-order patterns that represent real first-order patterns.

[5] Of course, we must mind anthropomorphic pitfalls. Terms like ‘recognizing’, ‘using’, and so on, must not be taken to suggest that GenAI systems literally have mental states or cognitive abilities associated with these terms.

References

Assael, Yannis, Sommerschield, Thea, Shillingford, Brendan, et al. 2022. “Restoring and Attributing Ancient Texts Using Deep Neural Networks.” Nature 603:280–83. https://doi.org/10.1038/s41586-022-04448-z
Barman, Kristian G., Caron, Sascha, Claassen, Tom, and de Regt, Henk. 2024. “Towards a Benchmark for Scientific Understanding in Humans and Machines.” Minds & Machines 34:6. https://doi.org/10.1007/s11023-024-09657-1
Bargagli Stoffi, Falco J., Cevolani, Gustavo, and Gnecco, Giorgio. 2022. “Simple Models in Complex Worlds: Occam’s Razor and Statistical Learning Theory.” Minds & Machines 32:13–42. https://doi.org/10.1007/s11023-022-09592-z
Beisbart, Claus. 2012. “How Can Computer Simulations Produce New Knowledge?” European Journal for Philosophy of Science 2:395–434. https://doi.org/10.1007/s13194-012-0049-7
Bovens, Luc, and Hartmann, Stephan. 2003. Bayesian Epistemology. Oxford: Oxford University Press. https://doi.org/10.1093/0199269750.001.0001
Clark, Elinor, and Khosrowi, Donal. 2022. “Decentering the Discoverer: How AI Helps Us Rethink Scientific Discovery.” Synthese 200:463. https://doi.org/10.1007/s11229-022-03902-9
Cybenko, George. 1989. “Approximation by Superpositions of a Sigmoidal Function.” Mathematics of Control, Signals, and Systems 2:303–14. https://doi.org/10.1007/BF02551274
Dennett, Daniel C. 1991. “Real Patterns.” Journal of Philosophy 88:27–51.
Durán, Juan M., and Formanek, Nico. 2018. “Grounds for Trust: Essential Epistemic Opacity and Computational Reliabilism.” Minds & Machines 28:645–66. https://doi.org/10.1007/s11023-018-9481-6
Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. 2014. “Generative Adversarial Nets.” In Advances in Neural Information Processing Systems, vol. 27, edited by Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., and Weinberger, K. Q. Red Hook, NY: Curran Associates, Inc.
Hacking, Ian. 2006. The Emergence of Probability: A Philosophical Study of Early Ideas About Probability, Induction and Statistical Inference. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511817557
Haugeland, John. 1998. “Pattern and Being.” In Having Thought: Essays in the Metaphysics of Mind, 267–90. Cambridge, MA: Harvard University Press.
Hopster, Jeroen, and Löhr, Guido. 2023. “Conceptual Engineering and Philosophy of Technology: Amelioration or Adaptation?” Philosophy & Technology 36:70. https://doi.org/10.1007/s13347-023-00670-3
Hutson, James, Huffman, Paul, and Ratican, Jeremiah. 2024. “Digital Resurrection of Historical Figures: A Case Study on Mary Sibley through Customized ChatGPT.” Metaverse 4 (2):2424. https://doi.org/10.54517/m.v4i2.2424
Iten, Raban, Metger, Tony, Wilming, Henrik, del Rio, Lidia, and Renner, Renato. 2020. “Discovering Physical Concepts with Neural Networks.” Physical Review Letters 124:010508. https://doi.org/10.1103/PhysRevLett.124.010508
Jordanova, Ludmilla. 2012. The Look of the Past: Visual and Material Evidence in Historical Practice. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139051095
Jumper, John, Evans, Richard, Pritzel, Alexander, et al. 2021. “Highly Accurate Protein Structure Prediction with AlphaFold.” Nature 596:583–89. https://doi.org/10.1038/s41586-021-03819-2
Kästner, Lena, and Haueis, Philipp. 2021. “Discovering Patterns: On the Norms of Mechanistic Inquiry.” Erkenntnis 86:1635–60. https://doi.org/10.1007/s10670-019-00174-7
Kosso, Peter. 2009. “Philosophy of Historiography.” In A Companion to the Philosophy of History and Historiography, edited by Tucker, Aviezer, 7–25. Chichester: Wiley Blackwell. https://doi.org/10.1002/9781444304916
Krenn, Mario, Buffoni, Lorenzo, Coutinho, Bruno, Eppel, Sagi, Foster, Jacob Gates, Gritsevskiy, Andrew, Lee, Harlin, Lu, Yichao, Moutinho, João P., Sanjabi, Nima, Sonthalia, Rishi, Tran, Ngoc Mai, Valente, Francisco, Xie, Yangxinyu, Yu, Rose, and Kopp, Michael. 2023. “Forecasting the Future of Artificial Intelligence with Machine Learning-Based Link Prediction in an Exponentially Growing Knowledge Network.” Nature Machine Intelligence 5:1326–35. https://doi.org/10.1038/s42256-023-00735-0
Ladyman, James, and Ross, Don, with Spurrett, David, and Collier, John. 2007. Every Thing Must Go: Metaphysics Naturalized. Oxford: Oxford University Press. https://doi.org/10.1093/acprof:oso/9780199276196.001.0001
Lamb, Nikolas, Banerjee, Sean, and Kholgade Banerjee, Natasha. 2022. “DeepJoin: Learning a Joint Occupancy, Signed Distance, and Normal Field Function for Shape Repair.” ACM Transactions on Graphics 41 (6):230. https://doi.org/10.1145/3550454.3555470
Löhr, Guido. 2023. “Conceptual Disruption and 21st Century Technologies: A Framework.” Technology in Society 74:102327. https://doi.org/10.1016/j.techsoc.2023.102327
Martini, Carlo. 2021. “What ‘Evidence’ in Evidence-Based Medicine?” Topoi 40:299–305. https://doi.org/10.1007/s11245-020-09703-4
McAllister, James W. 2010. “The Ontology of Patterns in Empirical Data.” Philosophy of Science 77 (5):804–14. https://doi.org/10.1086/656555
Melnikov, Alexey A., Nautrup, Hendrik P., Krenn, Mario, and Briegel, Hans J. 2018. “Active Learning Machine Learns to Create New Quantum Experiments.” PNAS 115 (6):1221–26. https://doi.org/10.1073/pnas.1714936115
Moral-Andrés, Fernando, Merino-Gómez, Elena, Reviriego, Pedro, and Lombardi, Fabrizio. 2023. “Can Artificial Intelligence Reconstruct Ancient Mosaics?” Studies in Conservation 69 (5):313–26. https://doi.org/10.1080/00393630.2023.2227798
Navarro, Pablo, Cintas, Celia, Lucena, Manuel, Fuertes, José Manuel, Segura, Rafael, Delrieux, Claudio, and González-José, Rolando. 2022. “Reconstruction of Iberian Ceramic Potteries Using Generative Adversarial Networks.” Scientific Reports 12:10644. https://doi.org/10.1038/s41598-022-14910-7
Navarro, Pablo, Cintas, Celia, Lucena, Manuel, Fuertes, José Manuel, Rueda, Antonio, Segura, Rafael, Ogayar-Anguita, Carlos, González-José, Rolando, and Delrieux, Claudio. 2023. “IberianVoxel: Automatic Completion of Iberian Ceramics for Cultural Heritage Studies.” In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI’23), edited by Elkind, Edith, 5833–41. https://doi.org/10.24963/ijcai.2023/647
Nygren, Christopher, and Drimmer, Sonja. 2023. “Art History and AI: Ten Axioms.” International Journal for Digital Art History 9:5.02–5.13. https://doi.org/10.11588/dah.2023.9.90400
Papavassileiou, Katerina, Kosmopoulos, Dimitrios I., and Owens, Gareth. 2023. “A Generative Model for the Mycenaean Linear B Script and Its Application in Infilling Text from Ancient Tablets.” ACM Journal on Computing and Cultural Heritage 16 (3):52. https://doi.org/10.1145/3593431
Parker, Wendy S. 2022. “Evidence and Knowledge from Computer Simulation.” Erkenntnis 87:1521–38. https://doi.org/10.1007/s10670-020-00260-1
Rombach, Robin, Blattmann, Andreas, Lorenz, Dominik, Esser, Patrick, and Ommer, Björn. 2022. “High-Resolution Image Synthesis with Latent Diffusion Models.” Preprint, arXiv:2112.10752. https://doi.org/10.48550/arXiv.2112.10752
Rowbottom, Darrell P., Curtis-Trudel, André, and Peden, William. 2023. “Evidence, Computation and AI: Why Evidence Is Not Just in the Head.” Asian Journal of Philosophy 2:11. https://doi.org/10.1007/s44204-023-00061-7
Sprenger, Jan. 2015. “A New Solution to the Problem of Old Evidence.” Philosophy of Science 82 (3):383–401.
Udrescu, Silviu-Marian, Tan, Andrew, Feng, Jiahai, Neto, Orisvaldo, Wu, Tailin, and Tegmark, Max. 2020. “AI Feynman 2.0: Pareto-Optimal Symbolic Regression Exploiting Graph Modularity.” In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), edited by Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H., 4860–71. Red Hook, NY: Curran Associates, Inc.
Vapnik, Vladimir N. 2000. The Nature of Statistical Learning Theory. New York: Springer.
Wang, Hanchen, Fu, Tianfan, Du, Yuanqi, et al. 2023. “Scientific Discovery in the Age of Artificial Intelligence.” Nature 620:47–60. https://doi.org/10.1038/s41586-023-06221-2
Wang, Shibin, Guo, Wenjie, Xu, Yubo, Liu, Dong, and Li, Xueshan. 2024. “Coarse-to-Fine Generative Model for Oracle Bone Inscriptions Inpainting.” In Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024), edited by Pavlopoulos, John, Sommerschield, Thea, Assael, Yannis, Gordin, Shai, Cho, Kyunghyun, Passarotti, Marco, Sprugnoli, Rachele, Liu, Yudong, Li, Bin, and Anderson, Adam, 107–14. Kerrville, TX: Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.ml4al-1.12
Wu, Tailin, and Tegmark, Max. 2019. “Toward an Artificial Intelligence Physicist for Unsupervised Learning.” Physical Review E 100 (3):033311. https://doi.org/10.1103/PhysRevE.100.033311
Wylie, Alison. 2000. “Questions of Evidence, Legitimacy, and the (Dis)Unity of Science.” American Antiquity 65 (2):227–37. https://doi.org/10.2307/2694057
Zakharova, Daria. 2024. “The Epistemology of AI-Driven Science: The Case of AlphaFold.” Preprint. https://philsci-archive.pitt.edu/id/eprint/24182