1. Introduction
Artificial intelligence (AI) technology, and in particular generative artificial intelligence (GenAI), has impacted language learning like no other technology before (Liu et al., Reference Liu, Darvin and Ma2024; Zhang, Reference Zhang2024). However, AI is not just another new technology. AI represents a new phase in the evolution and manipulation of symbolic systems within digital culture (Lévy, Reference Lévy2025a; Mulgan, Reference Mulgan2017). AI has created a hybrid space where the capacities of large language models (LLMs) meet human reasoning and judgement (Lévy, Reference Lévy2021; Mulgan, Reference Mulgan2017). The pedagogical issue is no longer access (Han, Reference Han2024) but accountable use (Pérez-Paredes, Curry & Aguado Jiménez et al., Reference Pérez-Paredes, Curry and Aguado Jiménez2025).
Recent research has shown that a combination of corpus pedagogy and AI use might make learning more effective for the teaching of, for example, business academic writing (Wang et al., Reference Wang, Deng and Liu2025) and general-purpose writing (Dong & Wang, Reference Dong and Wang2025) in higher education (HE) contexts. This talk argues that corpus-based pedagogy, data-driven learning, and corpus literacy (Boulton & Cobb, Reference Boulton and Cobb2017; Ma, Reference Ma, McCallum and Tafazoli2025; Pérez-Paredes & Boulton, Reference Pérez-Paredes and Boulton2025) have the potential to become the confirmation and transparency layer of critical AI literacy (CAIL), so AI-mediated language learning remains accountable to attested language use.
The talk advances three contributions. First, it offers a conceptual account of the convergence between corpus linguistics (CL) and AI, articulated through two classroom scenarios: (i) AI as a low-friction interface that extends data-driven learning (DDL) (Boulton & Cobb, Reference Boulton and Cobb2017; Pérez-Paredes & Boulton, Reference Pérez-Paredes and Boulton2025) and (ii) DDL and corpora as the operational core of critical AI literacy (CAIL). Second, it sets out an integration proposal that shows how routine corpus-based teaching and learning practices, such as sourcing and documenting language data, filtering, building concordances, analysing collocates, and triangulating findings, instantiate CAIL (Pérez-Paredes, Curry & Ordoñana-Guillamón et al., Reference Pérez-Paredes, Curry and Ordoñana-Guillamón2025) with concrete tasks and assessment criteria. Third, it advances a humanities-anchored stance on method that specifies what counts as evidence, requires attribution and disclosure for AI mediation, and centres learner agency within AI-rich ecologies so that LLM outputs are consistently checked against attested language use.
My position is upfront: in language learning, AI’s sheer availability needs to be mediated by what I describe here as open-box pedagogies. Corpus-based pedagogy, DDL, and corpus evidence exemplify how educators, and ultimately language learners, can find their way into new language learning ecologies. I show how this can be achieved through convergence scenarios and a humanities-anchored literacy model (CAIL).
2. Language learning ecologies, AI, and corpus linguistics
The learning ecologies in which language pedagogy now operates differ from those described in early corpus-informed education. McEnery and Wilson’s (Reference McEnery and Wilson1997) call for systematic, goal-oriented integration of corpora emerged in a pre-internet, pre-smartphone period when digital tools were relatively stable and access to authentic language data was the main challenge. Since then, the expansion of digital technologies has reshaped the conditions under which language learning takes place. Today’s learners inhabit ecologies that are multimodal and mediated by digital artefacts, requiring pedagogical models that address far more complex interactions between tools, contexts, and user agency (Gillespie, Reference Gillespie2020; Godwin-Jones, Reference Godwin-Jones2015; Kern, Reference Kern2024).
Digital mediation has diversified participation structures and extended the spaces where learners use and encounter language beyond the spaces delineated in McEnery and Wilson (Reference McEnery and Wilson1997) and much of the DDL research in the following decade. Smartphones, videoconferencing, online platforms, and multimodal resources blur distinctions between physical and virtual settings, enabling language use that spans local and remote contexts simultaneously (Godwin-Jones, Reference Godwin-Jones2020). Such environments create continuous flows of linguistic input and interaction, with learners moving across formal instruction, informal practice, social media, and AI-supported communication. Language learning now is therefore best understood through ecological perspectives that view meaning-making as emerging from distributed alignments among users, artefacts, spaces, and sociocultural expectations (Kern, Reference Kern2024; Reinhardt & Oskoz, Reference Reinhardt and Oskoz2021). In these fluid configurations, learners’ trajectories are shaped as much by the technological tools that scaffold participation as by individual cognition or curricular design.
The rapid normalisation of generative AI has intensified this ecological complexity (Abdelhalim et al., Reference Abdelhalim, Alsahil, Baek and Warschauer2025). AI-driven platforms provide unprecedented accessibility, responsiveness, and individualised support (Han, Reference Han2024; Kern, Reference Kern2024). They enable learners to practise language at any moment, lower affective barriers, and offer adaptive explanations and multimodal input tailored to learners’ immediate needs. However, these affordances coexist with well-documented risks: opaque and inconsistent outputs, hallucinations, and the erosion of learner autonomy when AI is used uncritically (Pérez-Paredes, Curry & Aguado Jiménez et al., Reference Pérez-Paredes, Curry and Aguado Jiménez2025). The variability of responses even to identical queries and learners’ difficulties in judging accuracy highlight the need for open-box pedagogies that foreground evaluation and source tracing (Zhang, Reference Zhang2024).
In AI-rich ecologies, what becomes pedagogically central is the evidence and procedural transparency that underpins learning. If classrooms incorporate AI, they must also incorporate mechanisms for verifying claims against attested usage. In this way, corpus-based practices introduce productive friction into otherwise frictionless AI-mediated environments, helping learners compare alternatives and justify linguistic choices (Pérez-Paredes, Curry & Aguado Jiménez et al., Reference Pérez-Paredes, Curry and Aguado Jiménez2025; Stewart, Reference Stewart2025).
These ecological shifts clarify why corpus linguistics remains essential to language education. Corpora provide traceable evidence against which AI-generated language can be interrogated. They reveal idiomaticity, register variation, and socio-cultural specificity in ways that LLMs often approximate but do not consistently reproduce. Because corpora are curated, often annotated, and their design is replicable, they support cumulative inquiry and help prevent overgeneralisations across genres and contexts. Engaging with concordances, collocational profiles, and frequency distributions cultivates critical awareness of how language varies across situations (Pérez-Paredes, Curry & Aguado Jiménez et al., Reference Pérez-Paredes, Curry and Aguado Jiménez2025), anchoring learners’ interpretations in empirical observation rather than surface-level system outputs.
3. Corpus linguistics and DDL impact on language education
While corpus linguistics (CL) has continued to exert a steady influence on language education over the last two decades (Gillespie, Reference Gillespie2020; McEnery & Wilson, Reference McEnery and Wilson1997), GenAI has moved into mainstream use in just a few years (Han, Reference Han2024). The use of CL has maintained a sustained presence in vocabulary, grammar, writing, and reading, which indicates an enduring research strand rather than a passing trend, as has been the case with other technologies in CALL (Gillespie, Reference Gillespie2020). In the last decade, this presence has been mainstreamed. Major reference works and syntheses now routinely include DDL and CL applications (Hampel & Stickler, Reference Hampel and Stickler2024; Lafford et al., Reference Lafford, Cabrera and i Macià2025; Lusta et al., Reference Lusta, Demirel and Mohammadzadeh2023; McCallum & Tafazoli, Reference McCallum and Tafazoli2025; Roelcke et al., Reference Roelcke, Breeze and Engberg2025; Sun & Park, Reference Sun and Park2023), and recent research agendas (Pérez-Paredes, Reference Pérez-Paredes and Crosthwaite2024; Crosthwaite & Boulton, Reference Crosthwaite, Boulton, Tyne, Bilger, Buscail, Leray, Curry and Pérez-Sabater2026; Pérez-Paredes & Boulton, Reference Pérez-Paredes and Boulton2025) ask not just how to use corpora in language education, but how to expand their role across languages, learner profiles, teacher education, and types of language data.
This momentum suggests a shift in DDL from tool use towards a mindset of inquiry, in which students learn to pose testable claims about language and verify them against corpora with transparent methods. Against this backdrop, tools such as AntConc or English-Corpora.org have evolved towards the convergence that I advocate in this talk. AntConc-style RAG over user corpora (Anthony, Reference Anthony2025) and platforms like CorpusChat pair authentic datasets with conversational interfaces to return Key Word in Context (KWIC) results, frequencies, and collocations alongside explanations, lowering users’ entry barriers without abandoning language evidence. The pedagogical implication is clear: DDL is increasingly being reframed as a literacy for inquiry and replication in which corpus methods remain the verification layer for AI-mediated learning, assessment rewards process transparency, and teachers can integrate these practices.
So, in this complex scenario, is there a role for corpora in an AI-driven language learning ecology? Corpus linguistics offers several pedagogical advantages. Among its strengths are its empirical foundation (McEnery, Reference McEnery2025), which allows learners to engage directly with authentic language data, and its capacity to promote learner autonomy by encouraging exploration and hypothesis testing. Corpus-based approaches support the development of pattern recognition skills and contribute to learners’ data literacy (Pérez-Paredes & Boulton, Reference Pérez-Paredes and Boulton2025). However, despite these benefits, there are notable challenges that have hindered its broader adoption in educational contexts. These include the need for more comprehensive teacher training, limited uptake among practitioners, and usability issues associated with existing corpus tools, which can present barriers to effective integration in the classroom (Boulton & Cobb, Reference Boulton and Cobb2017; Boulton & Vyatkina, Reference Boulton and Vyatkina2021; Chambers, Reference Chambers2019).
Large language models (LLMs) present several benefits in educational contexts (Han, Reference Han2024; Kern, Reference Kern2024). They offer high levels of accessibility, making language learning support available across a range of contexts and learner needs (Curry & McEnery, Reference Curry and McEnery2025; Kern, Reference Kern2024). Their responsiveness allows for adaptive interactions that can mimic conversational practice (Yuan et al., Reference Yuan, Li and Sawaengdist2024), while their capacity for providing tailored support contributes to the scaffolding of learner tasks (Mizumoto, Reference Mizumoto2023). Nevertheless, their use is not without complications. Concerns have been raised regarding their opaque decision-making processes, the generation of inaccurate or misleading output, commonly referred to as hallucinations, and the potential for over-reliance, which may lead to reduced learner autonomy and deskilling.
For this paper, what makes ecologies pedagogically critical is their evidentiary framework. If AI is present, classrooms may require traceability (corpus, dataset, or search filters), replicability (frequency counts, KWIC, collocates reproducible by others), and bias notes (register, domain, representativeness). Ecological design therefore links tools to assessment: submissions include prompts, corpus queries, and evidence trails; claims are verified against attestable usage; and reflection addresses limits and ethics. In short, an AI-rich ecology works when corpus-based practices discipline model outputs, teachers can audit the process, and learners develop accountable agency rather than outsource judgement.
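One low-tech way to operationalise such an evidence trail is a structured record that travels with each submission. The sketch below, in Python, is a hypothetical illustration: the class name, fields, and all values (including the frequency count) are invented for the example, not drawn from any real corpus, tool, or assessment scheme.

```python
from dataclasses import dataclass, asdict

@dataclass
class EvidenceTrail:
    claim: str          # the linguistic claim being made
    prompt: str         # the AI prompt, disclosed verbatim
    corpus_query: str   # a query another learner could re-run (replicability)
    corpus: str         # which corpus was consulted (traceability)
    frequency: int      # a reproducible count supporting the claim
    bias_note: str      # register/domain/representativeness caveats

trail = EvidenceTrail(
    claim="'moreover' is more frequent in academic prose than in fiction",
    prompt="Is 'moreover' formal?",
    corpus_query='[word="moreover"]',
    corpus="a written academic subcorpus (hypothetical)",
    frequency=1234,  # invented figure, for illustration only
    bias_note="Written academic sections only; spoken data excluded.",
)
record = asdict(trail)  # serialisable, so it can be attached to a submission
```

A record like this makes the audit described above concrete: a teacher can re-run `corpus_query`, check `frequency`, and read `bias_note` before accepting the claim.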
The integration of artificial intelligence (AI) within language education may represent a significant paradigm shift, primarily by enhancing accessibility and scaffolding capabilities beyond those achievable through traditional pedagogical methods. AI-driven platforms such as ChatGPT provide learners with continuous, unrestricted access to language practice, enabling engagement whenever motivation arises and thus transcending the limitations of fixed classroom schedules. This persistent availability contributes to the democratisation of language learning by reducing financial and logistical barriers, fostering greater educational equity. The provision of multimodal input addresses diverse learner preferences and enriches the overall educational experience. In contrast to static resources, AI facilitates dynamic, adaptive interaction by tailoring explanations to individual proficiency levels, responding to follow-up queries, and simulating authentic communicative scenarios contextualised to learners’ specific needs, thereby promoting active and meaningful engagement.
Nevertheless, caution is warranted regarding uncritical dependence on AI, underscoring the imperative to develop learners’ critical AI literacy skills to evaluate the accuracy and limitations of AI-generated outputs. Given that AI-produced language diverges from authentic human usage, the continued relevance of corpus-based pedagogies is emphasised to anchor AI interaction in empirically attested language data. Corpus linguistics, through data-driven learning approaches, complements AI by encouraging critical engagement with authentic language examples, enabling learners to verify and contextualise AI outputs and thereby avoiding passive acceptance of machine-generated content.
4. Convergence scenarios
The convergence of AI and corpus linguistics in language education that I advocate proposes a balanced and pedagogically sound model that advances the accessibility and responsiveness of AI while simultaneously nurturing learners’ analytical and metacognitive competencies through corpus literacy. This integrative framework aligns with emerging calls for ethically grounded, discipline-specific AI literacy education (Curry & McEnery, Reference Curry and McEnery2025; Pérez-Paredes, Curry & Ordoñana-Guillamón et al., Reference Pérez-Paredes, Curry and Ordoñana-Guillamón2025), which equips learners to use AI tools critically and responsibly within their language learning paths. The principal challenge remains the design of curricula and instructional strategies that combine the affordances of AI with corpus-informed critical pedagogy to support effective and ethically responsible language education in the digital era.
As we saw in the previous sections, while corpus linguistics emphasises methodological transparency and learner agency, applications of LLMs such as ChatGPT prioritise ease of access. In practice, CL and AI need not be seen as oppositional: integrating LLMs with corpus-based tools could enhance learner engagement while preserving analytical depth. Such integrations have been proposed by corpus software developers such as Davies (Reference Davies2025) and Anthony (Reference Anthony2025), and by several other corpus linguists such as Curry & McEnery (Reference Curry and McEnery2025) and Cheung & Crosthwaite (Reference Cheung and Crosthwaite2025).
The challenge lies in balancing the pedagogical rigour of corpus approaches with the immediacy of AI-driven technologies, ensuring that learners benefit from both critical reflection and adaptive support. I suggest two broad converging scenarios: in the first, AI expands DDL; in the second, corpus literacy complements AI-driven language learning. Let’s examine them.
5. Converging scenario 1: AI as an extension of DDL
AI can extend DDL if and only if it returns traceable corpus evidence (in the form of KWIC results, frequency counts, filters, etc.) and logs the search path. Otherwise, AI applications may clearly undermine DDL’s ethos.
Prototypical DDL (Pérez-Paredes, Reference Pérez-Paredes and Crosthwaite2024) has language learners manipulate concordances via tools like AntConc, English-Corpora.org, or Sketch Engine to discover patterns directly from corpora. While powerful, these tools often demand steep learning curves (e.g. Boulton & Cobb’s Reference Boulton and Cobb2017 meta-analysis on delayed post-testing and training needs). Corpus interfaces and complex query languages have presented obstacles for learners (Pérez-Paredes et al., Reference Pérez-Paredes, Sánchez-Tornel and Alcaraz Calero2012, Reference Pérez-Paredes, Sánchez-Tornel, Alcaraz Calero and Jiménez2011). Stewart (Reference Stewart2025) has suggested that, given the high degree of sophistication involved in corpus consultation, even the latest version of Sketch Engine ‘risks derailing even more experienced users’ (p. 551). For him, the risk of corpora becoming irrelevant in language education is now more real than ever, as for most language learners ‘the increasing expectation is to be able to start conducting searches immediately and intuitively’ (pp. 551–552). AI, however, can serve as a more intuitive proxy corpus interface, accepting natural-language queries such as ‘Show me how “despite” is used in academic articles’, and returning tailored concordance lines with explanatory scaffolding.
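As an illustration of the kind of evidence such an interface should return, a Key Word in Context (KWIC) routine can be sketched in a few lines of Python; the function name, toy corpus, and formatting below are illustrative, not a feature of any of the tools mentioned above.

```python
import re

def kwic(corpus_lines, node, width=30):
    """Return simple KWIC concordance lines for a node word."""
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    results = []
    for line in corpus_lines:
        for m in pattern.finditer(line):
            left = line[max(0, m.start() - width):m.start()]
            right = line[m.end():m.end() + width]
            # Right-align the left context so node words line up vertically.
            results.append(f"{left:>{width}} | {m.group(0)} | {right}")
    return results

corpus = [
    "Despite the small sample, the effect was robust.",
    "The model performed well despite noisy input.",
]
for line in kwic(corpus, "despite"):
    print(line)
```

Even a toy routine like this makes the evidential contrast visible: every concordance line traces back to an attested sentence, which is precisely what an unanchored LLM answer cannot offer.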
The AI-as-an-extension-of-DDL scenario tries to bridge the methodological rigour of prototypical DDL with the accessibility of GenAI. Mark Davies’s English-Corpora.org website now allows users to work with corpus data in richer ways. Using LLMs, they can group and sort collocates, explore shades of meaning in polysemous items, compare near-synonyms through their typical companions, and observe shifts across genres, periods, or dialect areas. It also supports close reading of KWIC lines, helping users, including language learners, examine grammatical behaviour, text-type tendencies, and pragmatic uses. The system can propose topic-related words or rephrasings and then show how often they appear in different corpus sections. For Davies, the aim is to strengthen corpus work rather than replace it, since the corpus data remain visible and verifiable, encouraging users to confirm, adjust, or question any model output.
Anthony (Reference Anthony2025) argues that combining LLMs with corpus tools creates a productive synthesis rather than a replacement of established methods. For him, AI offers interpretive support. Anthony (Reference Anthony2025) presents ChatAI in AntConc as a bridge that lets users query corpora in natural language and then send findings to an LLM for expanded interpretation. This reduces technical effort, increases transparency by showing model settings, and limits fabricated output by anchoring responses in real data.
Cheung & Crosthwaite (Reference Cheung and Crosthwaite2025) have suggested that a scalable paradigm integrating AI and CL can enhance future academic writing pedagogy. Their proposal represents a novel synthesis of corpus linguistics and generative artificial intelligence designed to support discipline-specific academic writing by university students. Drawing on the pedagogical principles of data-driven learning (DDL), CorpusChat employs retrieval-augmented generation (RAG) with ChatGPT to query authentic, pre-loaded corpora such as the British Academic Written English corpus (BAWE) and bespoke in-house collections. Access to these corpora preserves the transparency and reliability of corpus data while typical GenAI hallucinations and user-interface barriers are mitigated. Core corpus functions, such as frequency information and concordance lines, are encapsulated within an intuitive chat interface, enabling users to pose natural-language queries and receive both raw corpus evidence and AI-generated explanations without requiring technical query syntax.
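A retrieval-augmented pipeline of this kind can be caricatured as two steps: retrieve attested corpus lines, then anchor the model’s prompt in them. The Python sketch below is a minimal illustration under that assumption, with the actual LLM call omitted; the function names and toy corpus are hypothetical, and this is not CorpusChat’s implementation.

```python
def retrieve_evidence(corpus_lines, query_term, max_hits=5):
    """Retrieval step: pull attested corpus lines containing the query term."""
    hits = [line for line in corpus_lines if query_term.lower() in line.lower()]
    return hits[:max_hits]

def build_prompt(question, evidence):
    """Augmentation step: anchor the model's answer in retrieved, citable lines."""
    numbered = "\n".join(f"[{i + 1}] {line}" for i, line in enumerate(evidence))
    return (
        "Answer the learner's question using ONLY the corpus lines below, "
        "citing line numbers.\n\n"
        f"Corpus evidence:\n{numbered}\n\n"
        f"Question: {question}"
    )

corpus = [
    "Moreover, the results confirm earlier findings.",
    "The data, moreover, suggest a second factor.",
]
evidence = retrieve_evidence(corpus, "moreover")
prompt = build_prompt("How is 'moreover' used in academic writing?", evidence)
# The prompt would now be sent to an LLM; crucially, the raw corpus
# lines remain visible to the learner for verification.
```

The design choice that matters pedagogically is that the evidence is numbered and exposed, so any explanation the model produces can be checked line by line against attested usage.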
In the AI-as-an-extension-of-DDL scenario, the role of language teachers is central. They may upload corpora, create bespoke course materials, and deploy tailored chatbots, which fosters exploratory engagement in both general language learning and specific disciplinary discourses. This could be a major contribution to popularising DDL, as language teachers have consistently voiced criticism concerning the poor fit between most available corpora and the specific curricular needs of their students, and those corpora’s lack of flexibility to address such needs (Chambers, Reference Chambers2019; Pérez-Paredes & Abad, Reference Pérez-Paredes, Abad, Tyne and Spina2025). Returning to Cheung & Crosthwaite (Reference Cheung and Crosthwaite2025), CorpusChat’s hybrid approach not only enhanced students’ motivational engagement and digital disciplinary literacies but also facilitated data-driven feedback that streamlined the corpus consultation process and deepened learners’ understanding of authentic language patterns. This is a promising finding that echoes Mizumoto’s (Reference Mizumoto2023) research on metacognition and DDL.
In short, this scenario, in which AI extends DDL, strengthens reflective and learner-centred CALL practices.
6. Converging scenario 2: From data literacy to AI literacy
In this second scenario, DDL and corpus literacy operationalise what I and my colleagues have termed critical AI literacy (CAIL) (Pérez-Paredes, Curry & Ordoñana-Guillamón et al., Reference Pérez-Paredes, Curry and Ordoñana-Guillamón2025). The same acts that build corpus data and corpus literacies, such as sourcing, filtering, or triangulating language data sources, become language learners’ habits for interrogating AI output.
As language learning ecologies become more digitally mediated (Kern, Reference Kern2021), language learners need to develop not only data literacy, but also AI literacy. This is precisely where the critical work carried out during the last three decades or so in data-driven language learning (Boulton & Vyatkina, Reference Boulton and Vyatkina2021, Reference Boulton and Vyatkina2024; O’Sullivan, Reference O’Sullivan2007) offers an opportunity for corpus linguists and applied linguists to examine critically the impact of LLMs on language learning.
Corpus linguistics has long supported data literacy by helping learners analyse patterns, make informed hypotheses, and evaluate linguistic evidence (Boulton, Reference Boulton2010; Crosthwaite & Baisa, Reference Crosthwaite and Baisa2024; Pérez-Paredes & Boulton, Reference Pérez-Paredes and Boulton2025; Sinclair, Reference Sinclair2003). With the growing presence of AI in educational settings, this effort must expand into critical AI literacy, that is, an understanding of how AI systems generate and present information. In a recent critical AI literacy (CAIL) model (Pérez-Paredes, Curry & Ordoñana-Guillamón et al., Reference Pérez-Paredes, Curry and Ordoñana-Guillamón2025), I propose an integrated approach that combines corpus, digital data, and metacognitive literacies. In this context, learners are encouraged to interrogate AI outputs by asking where the data come from and how they are processed. This approach promotes an informed use of AI and cultivates the higher-order thinking essential for autonomous and critically engaged learning in contemporary digital classrooms.
Language learners need opportunities to build critical skills in prompt design and in judging the adequacy of model outputs, including spotting bias or fabricated claims. In this setting, roles shift. The learner becomes a researcher who frames questions, interrogates both small and large corpora, and compares AI suggestions against attested data. The teacher becomes a curator who selects tools and helps learners trace evidence for claims. The direction of travel is towards hybrid and inclusive pedagogies in which AI and corpus linguistics work together to support transparency and critical engagement. This approach encourages collaborative co-design of learning pathways grounded in verifiable data. In short, corpus data and corpus literacy form a foundational aspect of AI literacy, as they equip learners to interpret AI outputs critically and to incorporate AI tools ethically into applied linguistics and language learning. Ultimately, CAIL reinforces the importance of an informed and reflective approach to AI integration in language education.
The ‘from data literacy to AI literacy’ scenario emerges as the need for critical approaches is evidenced by rigorous research that, unlike designs relying on surveys of learners’ opinions about the impact of AI, closely examines real engagement with AI for learning purposes. In a study with undergraduates at an EU university (Pérez-Paredes, Curry & Aguado Jiménez et al., Reference Pérez-Paredes, Curry and Aguado Jiménez2025), a sequential mixed-methods design was used to explore how integrating corpus data and AI supports language learning. Learners completed surveys and participated in structured workshops using Sketch Engine, ChatGPT, and similar tools. These workshops featured vocabulary and grammar tasks aligned with the learners’ curriculum. Quantitative data came from pre- and post-workshop surveys and activity evaluations, while qualitative insights were gathered through reflective questions and a follow-up focus group with six students. Learners, initially unfamiliar with corpus tools, recognised their value in offering authentic language data and aiding vocabulary acquisition. Perhaps unsurprisingly, learners’ use of Sketch Engine yielded mixed results. Their answers fell into three groups: selections of obvious L1-cognate collocates from Sketch Engine data; uses that were plausible but unattested in Sketch Engine; and proper nouns arising from skewed text selection that were mistaken for collocates. This suggests learners struggle with critical evaluation of corpus output and underscores gaps in their corpus-literacy skills.
But it was the analysis of the students’ engagement with AI that was revealing of the risks of over-reliance (Lee, Reference Lee2024) that language education currently faces. The learners’ trust often lacked critical scrutiny, suggesting that they over-relied on the tool’s perceived authority. The responses generated by ChatGPT varied significantly, even for identical queries, producing alternative grammatical structures and rephrasings. This clearly posed challenges for consistency and classroom management. Students’ ability to engage effectively with ChatGPT was strongly influenced by their prompt-writing skills: very few crafted detailed prompts, while most submitted vague ones, which affected the usefulness of the outputs despite the guidelines provided in the workshop. Some learners admitted in the focus group to using ChatGPT to answer some post-activity reflective questions about their own practice. In the focus group discussions, however, students demonstrated emerging critical AI literacy, recognising the importance of cross-checking AI-generated content and advocating for a more balanced, informed use of such tools in language education. All the students who participated in the focus group found it difficult to accept that ChatGPT would provide different answers to identical or very similar questions.
These findings corroborate rigorous research that has examined actual language learners’ interaction with AI. Zhang (Reference Zhang2024) looked at 87 Chinese EFL learners’ use of AI in a writing course. The author found that they experienced cognitive dissonance when integrating AI-generated content with their original ideas, with 82% reporting concerns about the appropriation of their authorial voice. AI engagement produced marked textual convergence and exposed deficits in learners’ evaluative and ethical capacities for applying AI.
7. Reconfiguring language learning ecologies through open-box pedagogies
The convergence of corpus literacy and artificial intelligence exemplified in the two scenarios discussed above could contribute to reshaping the language classroom and beyond, inviting a reconfiguration of learning ecologies that foregrounds cognitive depth and critical literacy. These are, arguably, relevant educational aims.
The two scenarios converge on the view that corpus linguistics remains vital in language education yet diverge in how AI should be positioned. The first scenario frames AI as a direct extension of DDL. Here, AI reduces barriers (Chambers, Reference Chambers2019) by accepting natural-language queries and scaffolding interpretation while preserving the traceability central to DDL. Systems such as English-Corpora.org, Anthony’s ChatAI, and CorpusChat exemplify this synthesis: learners gain access to refined corpus output without needing complex query syntax, and teachers curate corpora and design bespoke learning tasks. Research here should examine how these interfaces influence noticing, pattern formation, and transfer to production, and whether long-term learning benefits match those shown in earlier DDL studies.
The second scenario moves beyond procedural enhancement, arguing that corpus literacy serves as a foundation for critical AI literacy. The key claim is that the same habits learners develop when sourcing or triangulating corpus data prepare them to question AI output. Instead of treating AI as a proxy interface, this scenario foregrounds evaluation. Early findings in this field show that learners struggle with both corpus output and AI variability, revealing a need for systematic attention to trust and bias as well as the negotiation of authorial voice. Research here should focus on how learners’ critical habits develop over time, how corpus-anchored questioning influences their judgement of AI output, and how teachers can scaffold comparison between attested data and AI suggestions.
The integration advocated in this talk fosters key educational gains for language learners and teachers (see Fig. 1) such as (1) a shift from procedural to conceptual engagement with language and language learning technology, promoting critical thinking; (2) the strategic use of technology for solving language-related problems; (3) the development of metacognitive skills through reflective interaction with data and AI outputs; and (4) the cultivation of interdisciplinary skills such as computational thinking.

Figure 1. Gains for language learners and teachers in open-box pedagogies.
These four areas are central to open-box pedagogies. Let’s examine them.
The DDL literature (Boulton & Cobb, Reference Boulton and Cobb2017; Boulton & Vyatkina, Reference Boulton and Vyatkina2021) has shown that, rather than following rote rules or memorised forms, learners examining corpus data are encouraged to understand why language behaves the way it does. This approach reframes the classroom as a space for inquiry where learners notice patterns, question usage, and build hypotheses based on authentic data, exercising deeper critical thinking.

In this process, technology becomes a strategic ally: not just a delivery mechanism, but a tool for solving linguistic problems. Learners use corpus tools and AI to test assumptions and seek clarification about usage. This interaction fosters metacognitive growth (Mizumoto, Reference Mizumoto2023) as students evaluate outputs and regulate their learning strategies. For instance, deciding whether an AI-generated answer is trustworthy involves weighing evidence, questioning sources, and comparing it with corpus data. Some of the students in Pérez-Paredes, Curry & Aguado Jiménez et al. (Reference Pérez-Paredes, Curry and Aguado Jiménez2025) were surprised to see that AI outputs provide a variety of replies to the same question.

Open-box pedagogies also support the development of interdisciplinary skills, especially computational thinking (Pérez-Paredes & Zapata-Ros, Reference Pérez-Paredes and Zapata-Ros2018). As learners navigate search interfaces, interpret frequency data, and adjust parameters such as genre or publication venue, they engage with core concepts like logic, abstraction, and pattern recognition. These skills prepare them not only for language use, but for broader digital literacy and problem-solving across domains: essential competencies in today’s data-saturated learning ecologies.
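The collocational reasoning involved here reduces, at its simplest, to a windowed co-occurrence count. The Python sketch below is a minimal, illustrative version under that simplification; real corpus tools add span settings, lemmatisation, and association measures such as MI or log-likelihood.

```python
from collections import Counter

def collocates(tokens, node, window=3):
    """Count words co-occurring with a node word within +/- `window` tokens."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the node word itself
                    counts[tokens[j].lower()] += 1
    return counts

# Toy data, for illustration only.
text = "strong tea and strong coffee but powerful arguments and strong arguments"
freq = collocates(text.split(), "strong")
```

In this toy example, comparing `collocates(tokens, "strong")` with `collocates(tokens, "powerful")` shows, in frequency terms, which companions each adjective attracts — exactly the kind of attested pattern against which an AI paraphrase can then be checked.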
However, this reconfiguration is not without risk. As AI tools grow in prominence (Handley, Reference Handley2024), there is a danger that learners may rely on non-propositional knowledge (Simone, Reference Simone2001), that is, surface-level, system-generated outputs, at the expense of grounded, empirical understanding. This shift could undermine learners’ epistemic agency and neglect the need for interpretive depth. Here, CL plays a crucial anchoring role. By engaging learners in the analysis of authentic, attested language data, CL situates language learning in tangible linguistic realities and cultivates a more critical stance toward AI outputs. In a digital age dominated by fast and frictionless tools, open box pedagogies such as corpus-informed teaching offer friction with purpose. They slow down the learning process, demanding reflection, interpretation, and justification (Pérez-Paredes, Curry & Aguado Jiménez et al., Reference Pérez-Paredes, Curry and Aguado Jiménez2025; Stewart, Reference Stewart2025). When harmonised with the responsive affordances of AI, such pedagogies can foster a dynamic, data-literate learning ecology, one where learners not only consume language but interrogate its form, function, and sources.
In Pérez-Paredes, Curry & Ordoñana-Guillamón et al. (Reference Pérez-Paredes, Curry and Ordoñana-Guillamón2025), I argue for a balanced pedagogical approach that combines the analytical depth of corpus tools with the accessibility of GenAI. The paper underscores the need for explicit instruction in both corpus and AI literacies to promote informed and critical language learning. Corpus literacy is positioned as a foundational competence for developing critical AI literacy (CAIL), which comprises technical understanding, critical thinking, ethical awareness, and practical application. We call for a shift in language education towards integrated literacy models that prepare students to navigate both human- and AI-mediated language landscapes critically and independently (Kern, Reference Kern2025). First, technical understanding gives learners a grasp of how large language models are trained, why they may hallucinate, and how biases in internet-scale datasets shape the linguistic choices these systems propose. Such knowledge allows students to interrogate default usages, trace sources, and make informed decisions about when AI support is pedagogically appropriate. Second, critical thinking orients learners to evaluate outputs rather than accept them uncritically: they compare alternatives; judge accuracy, register, and cultural fit; and counter the ELIZA effect by triangulating with trusted resources such as corpora or teacher feedback. Third, ethical awareness extends learners’ scrutiny beyond correctness to the social and epistemic implications of AI use, including privacy, authorship and integrity, unequal resource burdens, and the reproduction of linguistic hegemony. It fosters norms of transparency and defines the boundary between assistance and substitution. Finally, practical application translates literacy into strategy.
Language learners select suitable tools for specific tasks, prompt with precision, integrate human oversight, and transfer competencies across contexts, from corpus-based collocation checks to genre-sensitive drafting or translation. The overarching argument is that these four dimensions are mutually reinforcing: technical insight grounds critique; critique guides ethical judgement; ethics and critique shape purposeful practice; and iterative practice consolidates understanding. Cultivating all four domains develops autonomous learners who can harness AI for language development while resisting over-reliance, thereby aligning technological affordances with defensible educational aims.
8. The future of corpus linguistics
The transparency of corpora, which supports critical and ethical analysis, makes them an educationally relevant resource for reliable linguistic study and for language teaching and learning. AI-generated outputs, by contrast, may lack these qualities owing to their probabilistic and sometimes opaque nature. As Lévy (Reference Lévy2025b) himself has put it, given the generative, statistical and probabilistic nature of AI, users should always check information in real encyclopaedias or libraries.
The two scenarios discussed earlier share commitments to transparency and learner agency. Their divergence lies in emphasis: one centres on access and usability, the other on evaluation and method. A productive way forward may be to integrate both strands: interfaces that simplify corpus access could be paired with pedagogies that promote learners’ critical judgement. Longitudinal, classroom-embedded studies will be essential to understand how learners progress from accessing patterns to interrogating the systems that generate them.
The convergence I advocate requires involving language teachers so that they become the cornerstone of ethical, educationally grounded practices in language education, practices that expose black-box uses of AI and truly empower language learners with literacy skills for the 21st century. As Russell (Reference Russell2019) put it, the aims of AI often diverge from those of education: AI systems are designed to optimise objectives such as efficiency or measurable outcomes without understanding the broader human values central to education. The result is misalignment, as AI may promote behaviours that increase metrics but do not foster genuine learning or human development.
The accessibility and scaffolded support afforded by AI respond to learners’ need for low-friction interfaces in increasingly complex digital environments. At the same time, the methodological transparency and empirical grounding of corpus linguistics offer the verification layer necessary to preserve learners’ agency. Taken together, these shifts illustrate why AI and corpus-based pedagogy should not be treated as competing traditions but as complementary components of coherent, open-box pedagogies that align technological affordances with defensible educational aims.
Without explicit design that incorporates human values such as those discussed in the two scenarios above, AI risks undermining educational goals by focusing narrowly on optimisation rather than on cultivating individuals capable of learning and exercising critical judgement. However, human judgement can be easily eroded. I advocate a humanities-anchored stance that specifies what counts as linguistic evidence in language education, requires disclosure of AI mediation, and centres learner agency within AI ecologies, so that LLM outputs are consistently checked against attested language use and process transparency is explicitly embraced and fostered by educators.
Corpora and corpus literacy are well positioned to lead this transformation. Research has shown over the years how corpora can reveal idiomaticity and variability in real contexts of use in ways that generative models frequently approximate but do not reliably reproduce. Corpora are traceable: patterns can be linked to identifiable sources, genres, speakers, and situations, enabling accountable interpretation. By contrast, LLM outputs are products of probabilistic patterning within opaque architectures, which hinders verification and weakens claims about usage. Corpora can be deliberately curated and annotated to target specific varieties, registers, or topics, even in language education (Pérez-Paredes & Boulton, Reference Pérez-Paredes and Boulton2025), thereby aligning language evidence with specific learning aims. LLM outputs remain hostage to the composition and biases of their training data and often drift toward overgeneralisation.
Research has long shown that engaging directly with corpus evidence cultivates critical awareness of language use across social contexts. In other words, corpora and corpus literacy can reveal to learners how different contexts of use call for different patterns and lexis. AI outputs, however, mask these realities and dilute socio-cultural specificity. Corpus-based inquiry is fundamentally replicable: stable datasets and annotation schemes support cumulative, methodical approaches, whereas the stochastic generation of LLMs erodes experimental consistency.
As McEnery (Reference McEnery2025) has recently stated, the future of corpus work depends on careful use of new AI tools, wider data access, and deeper linguistic insight, with corpus experts assessing technologies and adopting those that offer reliable value today. For him, if they are found to be useful and insightful, they ‘should be incorporated into our toolset, at times complementing or supplementing existing tools or, perhaps, on occasion supplanting them’ (p. 1). My argument is that, in classroom contexts where AI-driven language learning ecologies are meaningful, corpora provide a reliable epistemic substrate, supplying evidence of how language is used and building accountable knowledge in the language classroom.
Acknowledgements
I would like to thank Mark Davies and Laurence Anthony for their feedback and suggestions to improve this paper.