Introduction
Al-Hoorie, Hiver and In’nami (2024; henceforth Al-Hoorie et al.) make a compelling case for why L2 Motivational Self System (L2MSS) research is currently in a state of validation crisis. It is a crisis evidenced by (a) a lack of systematic validation efforts, and (b) initial examinations that reveal discriminant validity issues regarding core L2MSS scales. Both call into question the credibility of the field.
In a previous response to Al-Hoorie et al., we (Henry & Liu, 2024) argued that even though a validation crisis in L2MSS research might be manifested in jangle fallacy problems at the measurement level, its roots lie at the construct level. While we argued that jingle fallacy problems at the construct level are of a magnitude such that it may no longer be meaningful to continue the investigation of the ideal L2 self and the ought L2 self as they are currently conceptualized, other scholars have focused their responses on controversies at the measurement level.
In this paper, we engage with the current debate on validation by identifying approaches that can lead toward a constructive resolution of the crisis, and through which the field can emerge in a stronger position. For this to occur, two fundamental conditions are needed: psychological readiness and methodological maturity. While discussions have so far mostly concentrated on methodology, we believe that the human element—researcher emotions and mindsets—is equally important if the crisis is to be resolved. With this in mind, we examine the emotional responses that a crisis can trigger, and consider how a state of crisis can be transformed into opportunities for growth. Moving on to methodological maturity, we discuss contemporary views of validity and advocate for an argument-based approach to validation that can support more exacting efforts in validation research.
The emotional weight of a crisis
As L2 motivation researchers, we recognize that the concerns raised by Al-Hoorie et al. (2024) can trigger complex emotions. Initial reactions can be underpinned by “fight or flight” instincts that range from denial to despair. While many of us will recognize the need to separate the research from the researcher, the research that we do and the discipline in which we operate can feel deeply personal. A crisis that calls into question the credibility of a field that we are part of cannot be other than unsettling. It can be tempting to downplay or dismiss issues that can compromise confidence in our work and to shy away from critiques that have been leveled. Anxiety or unease over the implications that a crisis can have for ongoing or future projects is an understandable reaction. Frustration is also a natural response, not least in our field, where researchers may have made career-spanning investments in work based on L2MSS theory. The realization that their research may have been built on less than stable foundations can be disheartening. Fear can also be a prevalent emotion. Concern about the stigma of association with a potentially discredited field is easy to fathom. Whatever the emotion, or combination of emotions, that is generated, there is a risk of paralyzing effects: action needed to address the crisis in constructive ways may not be taken.
While rarely a part of open discussion, it is important to recognize that these emotions stem from and reflect a commitment to the field. However, if unaddressed, they can present psychological barriers that can hinder meaningful progress. Denial can lead to a reluctance to engage with critical perspectives. Anxiety can result in unduly conservative research practices that decelerate progress. While frustration might escalate into a loss of motivation or the cynical dismissal of an entire field, fear can prevent the researcher from opening their work to critical scrutiny.
Given these complex emotions and their implications, it becomes important to consider how the psychological readiness needed to navigate the crisis in productive ways can be achieved. While emotions are ultimately personal (and private), we believe that the creation of a safe space for critical reflection and scrutiny is crucial if change is to happen. In the following section, we explore strategies that can foster such an environment, and which can bring about a transformation from a state of “crisis” to a state of credibility.
From crisis to credibility
Reframing the “crisis” narrative
An immediate and readily implementable step toward the creation of an environment for constructive dialogue is to consider Al-Hoorie et al.’s narrative framing. As a rhetorical device, the declaration of a “validation crisis” has been effective. It has directed attention to issues that have been systematically underappreciated, and it has sparked much-needed debate. However, as with any powerful framing, there is a possibility that the term can be misinterpreted and that the entire field is perceived as fundamentally unsound.
Of particular concern is the potential conflation of a validation crisis and a validity crisis. While Al-Hoorie et al. purposefully used the term “validation crisis,” it can easily be misunderstood as a “validity crisis.” The distinction is nuanced but crucial. A validation crisis primarily refers to insufficient or absent validation efforts, a point that the authors emphasize. In contrast, a validity crisis would indicate a demonstrated lack of validity resulting from systematic validation attempts. Currently, the field faces the former. While initial evidence has now been supplied, field-wide systematic validation efforts would be needed before convincing claims can be made about the latter.
Another concern involves the defeatist connotations associated with the term “crisis.” As Vazire (2018) has put it, “crisis implies that we are at a loss for solutions, when in fact we have identified many ways to improve science’s credibility” (p. 411). While Vazire (2018) was commenting on why she preferred to refer to psychology’s replicability crisis as a “credibility revolution” (see Liu, 2023 for a recent discussion of this in relation to applied linguistics), the same principle applies to the validation crisis in our field. Given that systematic validation efforts have yet to be undertaken, perpetuation of a “crisis” discourse carries an additional risk: the premature stigmatizing of an entire field. Association with a field in crisis can prompt researchers to steer clear of this line of inquiry. This, in turn, could lead to a premature dismissal of valuable research directions, and stifle progress in addressing the very issues that require resolution.
As motivation researchers working in applied linguistics, we are well aware of the importance of semantics and the power of (re)framing. How we think about our challenges can shape how we approach them. While Al-Hoorie et al.’s portrayal of a “validation crisis” has been effective in drawing attention to critical issues that have long been neglected, and has been the catalyst for productive thinking (e.g., Henry & Liu, 2024; Oga-Baldwin, 2024), to move to the next phase it can be useful to talk in terms of a “credibility revolution.” Calling this a “credibility revolution” instead of a “crisis” is not just a play on words. Rather, it signifies a strategic shift in orientation. By thinking productively about methodological innovation and conceptual revision, we stand not only to enhance our agency in driving positive change; we can also mitigate the risk of allowing complex negative emotions to cloud our judgment.
Valuing controversy
Unlike research in other applied disciplines of motivation science, L2 motivation research has a history largely built around models developed within the field, and where the influence of mainstream theories and frameworks has been limited. Among the many problems that insularity has brought, the “endowment effect” has created particular challenges. While in any field researchers may “end up within a certain theoretical camp for reasons other than pure science” (King & Fryer, 2024, p. 10), in a field as insular as ours, flags are easily tied to the masts of particular models. However, it is not merely the case that researchers can have a natural predisposition to afford greater value to the theories upon which their careers have been founded. The “endowment effect” can easily lead to camp-ism, defensiveness, and a shuttering off of productive communication with researchers who take opposing views (King & Fryer, 2024).
As motivation researchers, we need to recognize that controversy is important. Because it can highlight “the importance of exacting definitions of constructs,” and can encourage researchers from different camps to engage in debate, controversy can be the driver of development (Ryan, 2024, p. 6). Beyond the need to approach controversy in a non-defensive manner, it is important to engage with challenging controversies: those that have the potential to be productive and which can require us “to take a step back and rigorously evaluate the theories we use” (King & Fryer, 2024, p. 10).
While controversies abound in motivational science, not all will necessarily be productive. Controversy per se is not a trigger for development:
To make progress in motivation research, it may be useful to focus on resolving existing controversial issues. However, it is also important to consider under what conditions productive controversies arise. Two especially important conditions are (a) precision of theories and (b) precision of measures and empirical study designs to test them. Precise theoretical propositions and precise measurements are needed, otherwise, contradictions may not be detectable.
(Pekrun, 2024, p. 7, emphasis added)
In relation to Pekrun’s first point, the precision of theories, we (2024) have argued that jingle fallacy problems at the construct level have created significant problems in the construal and operationalization of the original L2MSS constructs, and in subsequent iterations where L2 self-guides have been bifurcated to reflect promotion and prevention motives (Dörnyei, 2009; Papi, Bondarenko, Mansouri, Feng, & Jiang, 2019). In our response to Al-Hoorie et al.’s (2024, p. 10) initiative in “opening a discussion” around validity in L2MSS research, we were at pains to not only address the theoretical imprecision in the L2 self-guide construct and the consequences that follow when it is operationalized. In a spirit of productive engagement and potential cross-fertilization (King & Fryer, 2024), we also explained how self-guides and other standards (Higgins, Strauman, & Klein, 1986) can be theoretically incorporated into frameworks of L2 motivation that draw on self-determination theory (e.g., Noels et al., 2019). While it needs to be recognized that “whenever scholars forward new theoretical models or attempt to reframe or restructure what already exists, they are taking risks” (Alexander, 2024, p. 11), we believe that a commitment to theoretical precision can facilitate integration across frameworks and can shift L2 motivation research into a more productive orbit.
Normalizing failure and (self-)correction
Another step that we believe to be important in developing a constructive environment for reform is the normalization of failure and the encouragement of (self-)correction. Here, we can again look to our colleagues in psychology for valuable lessons.
One lesson involves evaluation of the odds of failure. In recent decades, and in response to the replicability crisis, researchers in psychology have worked hard to improve the replicability of their research findings. However, as they have come to realize, a 100% replication rate is neither realistic nor desirable. As Nosek et al. (2022) have pointed out, achieving near 100% replicability would require “adopting an extremely conservative research agenda that studies phenomena that are already well understood or have extremely high prior odds. Such an approach would produce nearly zero research progress” (p. 730). In fact, Nosek et al. argue that a healthy, theoretically generative research enterprise will inevitably include some nonreplicable findings. As they put it, “science exists to expand the boundaries of knowledge. In this pursuit, false starts and promising leads that turn out to be dead ends are inevitable” (p. 730). Here we can extrapolate this lesson to the case of validation. Just as we should not expect 100% replication rates, neither should we anticipate perfect validation results across all measures and constructs in all contexts. The process of validation is iterative, ongoing, and contextual (AERA, APA, & NCME, 2014). Rather than providing a binary “valid” or “invalid” verdict, the key is to reveal areas for improvement or refinement, a point to which we return in our discussion of validity and validation.
Another lesson involves the importance of intellectual humility in navigating research challenges. As Nosek et al. (2022) have emphasized, researchers should “[get] used to being wrong – a lot” (p. 733). They need to develop mindsets that prioritize getting it right over being right. In the context of the current crisis (Al-Hoorie et al., 2024), this would involve a willingness to critically examine our own work and an openness to revising our perspectives in the light of emerging evidence. Here, the “loss-of-confidence project” in psychology (Rohrer et al., 2021) can be a source of inspiration. This project invited researchers to publicly share instances where they had lost confidence in their own published findings. By creating a platform for such disclosures, the project aimed to destigmatize self-correction and promote it as a normal and valuable part of the research process. As Bishop (2018) has argued, “the reputations of scientists will depend not on whether there are flaws in their research, but on how they respond when those flaws are noted” (p. 437). By shifting our cultural norms to value critical self-reflection and correction, we can create an environment where rigorous scrutiny of one’s own work constitutes a hallmark of scientific integrity.
Resisting the allure of novelty
If we are serious about normalizing failure and encouraging (self-)correction, an obsession with novelty also needs to be confronted. Normalizing failure is not about lowering standards. Rather, it involves creating an environment for continuous improvement through critical self-examination. By focusing on quality, we reduce the temptation to conduct hasty, speculative, or careless research in the pursuit of novelty.
Here, Plonsky’s (2024b) framework for study quality provides valuable guidance. High-quality research is described as “(a) methodologically rigorous, (b) transparent, (c) ethical, and (d) of value to society” (p. 1). Notably, novelty is not a criterion. This absence is particularly relevant to validation challenges in L2MSS research, where the pursuit of novel findings has often overshadowed rigorous validation efforts.
The omission of novelty is important. When novelty is removed as a parameter for high-quality research, validation and replication studies gain equal footing with original research. Moreover, the emphasis on methodological rigor and transparency as hallmarks of quality also aligns with open science movements in applied linguistics. Recent discussions on the topic highlight the importance of these aspects (e.g., Al-Hoorie, Cinaglia, et al., 2024; Liu et al., 2023; Marsden & Morgan-Short, 2023; Plonsky, 2024a). Given the scarcity of open science practices in L2MSS research (Liu, 2024), a credibility revolution would also set the field on a path where it could catch up with ongoing developments in applied linguistics. Adopting open science practices, such as pre-registration, data/code/materials sharing, and transparent reporting, would help ensure that validation efforts also meet the standards of quality research.
Toward systematic validation research
Having examined several preconditions for successful navigation of the “validation crisis” in L2MSS research, we now turn to methodological maturity. Here, we define “methodological maturity” as a field’s collective capacity to consistently implement, evaluate, and refine rigorous research methods. In the context of validity and validation, this would mean the ability (a) to design and implement robust and systematic validation studies, (b) to critically evaluate the results, and (c) to continuously refine measures and theories based on the findings. In the following sections, we briefly review validity concerns in applied linguistics and explain how contemporary views of validity can help to address them. We then present an argument-based approach to validation and describe how it can provide a promising framework for guiding systematic validation efforts.
Prevalent concerns for validity
While researchers involved in L2MSS research may have been the first to officially declare a state of validation crisis (Al-Hoorie et al., 2024), concerns regarding validity issues are not new in applied linguistics. Over a decade ago, Norris and Ortega (2012) problematized a “tendency to assume – rather than build an empirical case for – the validity for whatever assessment method is adopted” (p. 575) regardless of the learner population studied or the theoretical interpretations that a researcher employs. Ellis (2021) went further, noting that while there is general recognition of validity issues, researchers have “largely ignored” them (p. 197). To evaluate the extent to which concerns such as these are warranted, Plonsky (2024) showed that in a corpus analysis of 23,142 articles from 22 mainstream applied linguistics journals, only 4% made explicit mention of construct validity. In a similar vein, Teimouri, Sudina, and Plonsky (2021) observed that researchers often “rely on conventions and/or to report reliability and validity evidence from other studies, for example, rather than doing so themselves” (p. 378). These findings underscore the urgent need for more rigorous validation practices and transparent reporting in applied linguistics research. As Plonsky (2024) has argued, “it is incumbent upon researchers to provide explicit evidence of the validity of their measures” (p. 7). This is necessary not simply to fulfill the criteria for methodological rigor, but also to meet the ethical obligation of producing trustworthy findings. In this sense, the validation crisis extends beyond L2MSS research. Lessons drawn from the crisis can resonate with the wider applied linguistics community and can contribute to the further improvement of research quality.
Contemporary views of validity
To address the validation crisis effectively, it is important that we align our understanding of validity and validation with views currently held in measurement science. The Standards for Educational and Psychological Testing (henceforth the Standards; AERA et al., 2014) represent the current consensus and state-of-the-art guidelines in measurement research. As Purpura, Brown, and Schoonen (2015) stated in their call for greater validity of quantitative measures in applied linguistics, “the development, use, and evaluation of all measured constructs… should be guided by professional standards for “good” practices such as those recommended in the Standards for Educational and Psychological Testing” (p. 39, original emphasis). A comprehensive guide for best practices in test development, use, and interpretation, the Standards can offer some guidance for more robust validation efforts.
Notwithstanding the ongoing debates on validity theories, such as differing views on the role of consequential validity (Cizek, 2020), there is a broad consensus regarding the key characteristics of validity. This set of characteristics, mainly derived from Cronbach (e.g., 1971) and Messick (e.g., 1989), constitutes the core of the contemporary view of validity as reflected in the Standards. Table 1 shows the six foundational tenets as summarized by Cizek (2020, p. 37).
Table 1. Key tenets of contemporary validity theories (Cizek, 2020, p. 37).

The Standards define validity as “the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests,” and validation as “accumulating relevant evidence to provide a sound scientific basis for the proposed score interpretations” (p. 11). In other words, validity is not an inherent property of the instrument. As Cronbach (1971) noted, “one validates, not a test, but an interpretation of data arising from a specified procedure” (p. 447). This point is further emphasized by Messick (1989), who has observed that “what is validated is not the test or observation device as such but the inferences derived from test scores” (p. 13).
Furthermore, the current version of the Standards (2014) also favors Messick’s (1989) notion of validity as a unitary concept, rather than the different types of validity (e.g., content validity, predictive validity) originally specified in the first edition of the Standards back in the 1950s. The different “types” of validity have now been replaced by differing “sources” of validity evidence used to evaluate the adequacy of the inferences from a set of test scores. An important implication of this unitary view of validity is that it resists a binary verdict: “validity is a matter of degree, not all or none” (Messick, 1989, p. 13). To place validity on a continuum is necessary because, in practice, we rarely have all evidence pointing unequivocally to a dichotomous evaluation of the inference as valid or invalid. Similarly, the same evidence may bear different weight, depending on the intended inferences, the context, or the person making the judgment. Such variability means that validation cannot be a one-time activity. Rather, it involves a continuous process to ensure ongoing support for a test’s intended inferences, qualification of those inferences, or discovery that the intended inferences are no longer adequately supported (Cizek, 2020).
From instrument validity to inference validity
In line with the contemporary view of validity, we believe that an important step toward addressing the validation crisis is a shift in the way we think about validity, from instrument-focused to inference-focused. A shift in thinking is needed for multiple reasons. First, it can help researchers move away from the problematic assumptions that have contributed to the current validation crisis. Conceptualizing validity as a matter of inferences (and not instruments) challenges the false assumption that once an instrument has been “validated” in one context, it can be uncritically applied to another. While not every operational use of an instrument would require a full validation study, researchers must move beyond simply citing prior validation evidence without thoughtful consideration (e.g., merely citing that a scale has been “validated” in other research, as cautioned by Teimouri et al., 2021). This shift in perspective foregrounds the need to carefully evaluate whether existing evidence adequately supports the intended interpretation or use in a new context or population.
Second, it can encourage researchers to be more measured in the claims that they make. Rather than relying on generic statements about a scale’s validity, researchers would need to specifically articulate the inferences that they seek to make and provide evidence to support them. As Plonsky (2024b) has noted, “transparency is what allows us to evaluate – and is therefore a prerequisite for – every other facet of quality” (p. 4). In the context of validation, transparency extends beyond sharing research instruments. It involves making transparent the theoretical assumptions that underpin our measures and the inferences that we seek to draw. This, we argue, is a crucial component in a critical evaluation of validity.
A further advantage of this approach is that it aligns better with the complex and context-dependent nature of a psychological construct such as motivation. Just as we would not expect a construct to function similarly across all contexts, neither should we assume uniform measurement quality (regardless of context). This resonates with the emphasis on validation as an ongoing process that seeks continuous improvement through the adaptation of measures. As theories evolve, and as contexts shift, it ensures that measurement remains relevant and meaningful. Finally, this perspective can help mitigate the “jingle-jangle” fallacies prevalent in our field (Al-Hoorie et al., 2024; Henry & Liu, 2024), where constructs with the same name may be conceptualized differently (and thereby entail different inferences), or where constructs with different names overlap substantially in their intended interpretations. By focusing on specific inferences, rather than general claims of validity, we will be able to more clearly delineate and evaluate what, exactly, our measures are capturing.
Argument-based approach to validation
Now that we have established the importance of shifting focus from instrument validity to inference validity, the next logical question is: How do we go about validating these inferences? To address this question, we turn to the argument-based approach to validation, an approach that can provide a systematic framework for evaluating the validity of score interpretations and uses. The argument-based approach to validation was primarily developed by Kane (1992, 2006, 2013), who drew on Toulmin’s (2003) model of argument to structure and evaluate test score inferences. This approach aligns well with contemporary views of validity and supplies the conceptual tools needed for applying the Standards (2014) in practice.
Importantly, the focus on inferences and evidential support renders the argument-based validation framework applicable to any type of scores, whether derived from performance tasks or self-report measures. As a systematic framework, the argument-based approach to validation has been applied in varying forms of psychological and educational research, including language testing and applied linguistics. In the field of language testing, the approach gained traction back in the 2000s. In validating the TOEFL iBT, Chapelle, Enright, and Jamieson (2008) provided one of the first comprehensive applications of this approach. Moving beyond large-scale language tests, Purpura et al. (2015) and Révész and Brunfaut (2020) made a compelling case for how Kane’s framework could be utilized to justify the interpretation of scores obtained through L2 elicitation devices for research purposes, and thus expanded the scope of application to second language acquisition and applied linguistics research in general. Over the years, edited volumes and monographs on validity arguments in language testing and beyond (e.g., Chapelle, 2021; Chapelle & Voss, 2021; Cizek, 2020) have been produced to facilitate wider adoption of the approach.
Given the successful application of argument-based validation in neighboring fields, there is significant potential in applying this framework to L2MSS research (and indeed other areas of L2 psychology). By adapting the principles to the specific context of L2 motivation, a more robust and systematic approach to addressing the crisis can be developed.
An integrated framework for argument-based validation
Drawing on insights from the Standards (2014) and key works on argument-based validation (e.g., Chapelle, 2021; Cizek, 2020; Kane, 2013), we present a schematic representation of an integrated framework for argument-based validation (Figure 1).

Figure 1. An integrated framework for argument-based validation.
The figure illustrates the workflow for argument-based validation, which begins with theory and ends with validated score interpretation and use. In an argument-based approach, theory plays a fundamental role throughout the process, from informing the initial construction of the argument and the instrument to guiding the validation process and interpretation of generated evidence (Chapelle, 2021). From theory, we move to the core, argument-based validation process. This process consists of three key stages. Stage 1 involves constructing the interpretation/use argument (IUA). This concerns the intended, theory-informed interpretations or uses of the test/instrument. Since the focus of research will vary, different types of inferences are required. For instance, while language testing often focuses on assessment-based inferences (e.g., generalization, extrapolation), in motivation research the explanation inference is likely to be of particular relevance, that is, the inference that articulates how the scale scores relate to the underlying construct. Regardless of the type, these inferences must be carefully specified in alignment with both the theory and the evidence (see Chapelle, 2021 for detailed instructions on how to build and combine complex chains of inferences). A basic argument structure (Toulmin, 2003) is illustrated in the figure: the data (or grounds, i.e., the scores), claim (i.e., the intended interpretation/use), warrant (i.e., justification for the claim), backing (i.e., supporting evidence), and rebuttal (i.e., counterclaim). The IUA serves as the guide for the next stage.
Stage 2 focuses on conducting validation research. In this stage, evidence is collected to support the claim. Here we draw on the Standards for a comprehensive list of sources of evidence: content-oriented evidence (i.e., analysis of the instrument and its relevance to the construct being measured), evidence based on response processes (i.e., theoretical or empirical evidence about the psychological processes or cognitive operations of the respondents), evidence based on internal structure (i.e., analysis of relationships among scale items or parts of the instrument), evidence based on relations to other variables (i.e., analysis of relationships with other related variables), evidence based on relations to criteria (i.e., analysis of how the scores relate to criterion variables), and evidence based on consequences of testing (i.e., evaluation of intended and unintended consequences of the test). The type and combination of evidence gathered will depend on the specific claims or (chain of) inferences to be validated (see Chapelle, 2021 for a list of evidence corresponding to various types of inferences), as well as practical considerations such as resources and feasibility (Purpura et al., 2015).
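The Toulmin-style structure described above can be made concrete as a small data structure. The following is only a minimal sketch in Python, not part of the framework itself: the class name `Inference`, the `is_supported` rule, and the example scale and evidence strings are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Inference:
    """One inference in an interpretation/use argument (hypothetical sketch)."""
    data: str      # grounds, i.e., the observed scores
    claim: str     # the intended interpretation or use
    warrant: str   # justification linking the data to the claim
    backing: List[str] = field(default_factory=list)    # supporting evidence
    rebuttals: List[str] = field(default_factory=list)  # unresolved counterclaims

    def is_supported(self) -> bool:
        # Illustrative rule only: at least one piece of backing and no
        # unresolved rebuttal. Real validity judgments are matters of degree.
        return bool(self.backing) and not self.rebuttals


# Hypothetical explanation inference for a motivation scale.
explanation = Inference(
    data="Scores on a hypothetical ideal L2 self scale",
    claim="Scores reflect learners' future-oriented L2 self-image",
    warrant="Items were written to target the construct as theoretically defined",
)
print(explanation.is_supported())  # no backing has been recorded yet

explanation.backing.append("Hypothetical response-process evidence from interviews")
print(explanation.is_supported())
```

Writing the argument down in this explicit form, whatever the notation, makes it easier to see which claims still lack backing and which rebuttals remain open before any evidence is collected.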
Stage 3 involves developing the validity argument: an integrated evaluative judgment that assesses how well the collected evidence supports or challenges the IUA. The backward arrows in the diagram represent the iterative nature of this process. If the validity argument does not adequately support the intended score interpretation or use, researchers may need to repeat the process. This could involve constructing a new interpretive argument, collecting additional or different types of evidence, or even revising the measurements or underlying theory. As evidence accumulates over time, secondary research/synthesis will be required to provide more informed guidance on refining the measurements and/or the theory. The figure also illustrates how previous cases of score interpretation and uses can serve as supporting evidence when constructing similar IUAs in future research. This iterative approach ensures that the validation process is: (a) cumulative, (b) responsive to new evidence, (c) continuously improving in its ability to support meaningful score interpretations and uses, and (d) supporting the refinement of measurements and theories in L2 motivation research.
An example application
To showcase the utility of this framework, we draw on Al-Hoorie, McClelland, et al. (2024) as an example of how argument-based validation might work in practice. The authors conducted two studies examining the validity of the ideal L2 self construct. In Study 1, they experimentally manipulated ideal L2 self items to explicitly refer to ability beliefs and tested for discriminant validity across three countries. Both exploratory and confirmatory factor analyses suggested that the ideal L2 self and L2 ability beliefs were not distinct. In Study 2, the authors used cognitive interviewing to examine participants’ thought processes when responding to ideal L2 self items and found that responses to the ideal L2 self scales were dominated by references to current ability beliefs.
These studies represent an excellent example of the type of validation efforts needed in the field. In conducting this work, and by incorporating evidence based on response processes, the authors moved beyond the conventional sole focus on the internal structure of the construct and its relationship with other variables. While Al-Hoorie, McClelland, et al. (2024) did not explicitly align their study with a formal validation framework, we would advocate the use of the argument-based approach to facilitate future synthesis and cumulative work. By conducting studies in accordance with this approach, several advantages stand to be gained: (a) the specific inferences being drawn can be more clearly articulated, (b) the evidence that accumulates can be systematically evaluated, and (c) findings (validity arguments) can be situated within a broader validation program for L2MSS research.
Applying the argument-based validation framework, a structure is provided for the central arguments of Al-Hoorie, McClelland, et al.’s research (Figure 2). The core of the IUA is that scores on the ideal L2 self scale reflect the intended construct, i.e., learners’ vision of themselves as future L2 users (Dörnyei, 2009). This claim is supported by the warrant that the scale scores reflect an imagined future L2 self that is distinct from beliefs involving current L2 abilities. However, the study also considers a potential rebuttal: that the ideal L2 self scale scores are not empirically distinguishable from those of learners’ current ability beliefs. It should be noted that this IUA serves as an illustrative example and focuses on only one (explanation) inference, whereas a comprehensive IUA typically involves a chain of interconnected inferences (Chapelle, 2021).

Figure 2. Argument structure for Al-Hoorie, McClelland, et al. (2024).
To evaluate this argument, the researchers collected both quantitative and qualitative evidence. They conducted factor analyses and regression analysis to supply evidence of the construct’s internal structure and relationships with other variables, largely supporting the rebuttal. They also conducted cognitive interviews to gather evidence based on the response processes, which also predominantly aligned with the rebuttal.
Based on the evidence, we can construct the following validity argument: the intended interpretation of the ideal L2 self scale scores as reflecting future visions is not adequately supported. The evidence collected suggests that the scale scores were not empirically distinguishable from those of current ability beliefs, thus challenging the intended interpretation of the ideal L2 self scale scores. At this juncture, researchers will have two basic options. They can either modify the ideal L2 self scale to better capture future visions, or they can reconsider how the construct can be conceptualized within L2MSS theory. Both would necessitate follow-up studies to validate new inferences.
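For readers who find it helpful to see the argument structure written out explicitly, the example above can be sketched as a small data structure. This is only an illustrative sketch under stated assumptions: the class and field names (`InterpretationUseArgument`, `Evidence`, `favors_rebuttal`) are hypothetical, the binary coding of each piece of evidence is a simplification of findings that "largely" or "predominantly" aligned with the rebuttal, and the tally in `summarize` is a crude stand-in for the integrated evaluative judgment that Stage 3 requires. It is not the implementation of the open-access tool mentioned in this article.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Source(Enum):
    """Sources of validity evidence listed in Stage 2 (following the Standards)."""
    CONTENT = "content"
    RESPONSE_PROCESSES = "response processes"
    INTERNAL_STRUCTURE = "internal structure"
    RELATIONS_TO_OTHER_VARIABLES = "relations to other variables"
    RELATIONS_TO_CRITERIA = "relations to criteria"
    CONSEQUENCES = "consequences of testing"


@dataclass
class Evidence:
    source: Source
    description: str
    favors_rebuttal: bool  # hypothetical binary coding; real evidence is graded


@dataclass
class InterpretationUseArgument:
    grounds: str   # the data, i.e., the scores
    claim: str     # the intended interpretation/use
    warrant: str   # justification for the claim
    rebuttal: str  # the counterclaim
    evidence: List[Evidence] = field(default_factory=list)  # backing and counter-evidence

    def summarize(self) -> str:
        """Crude tally of the evidence; not a substitute for evaluative judgment."""
        if not self.evidence:
            return "no evidence collected"
        against = sum(e.favors_rebuttal for e in self.evidence)
        if against == 0:
            return "claim supported"
        if against == len(self.evidence):
            return "claim challenged"
        return "mixed evidence"


# The single (explanation) inference from the illustrative example.
iua = InterpretationUseArgument(
    grounds="Scores on the ideal L2 self scale",
    claim="Scores reflect learners' vision of themselves as future L2 users",
    warrant="Scores reflect an imagined future L2 self distinct from current ability beliefs",
    rebuttal="Scores are not empirically distinguishable from current ability beliefs",
    evidence=[
        Evidence(Source.INTERNAL_STRUCTURE, "Factor analyses", True),
        Evidence(Source.RELATIONS_TO_OTHER_VARIABLES, "Regression analysis", True),
        Evidence(Source.RESPONSE_PROCESSES, "Cognitive interviews", True),
    ],
)

print(iua.summarize())
```

Recording each piece of evidence alongside its source in this way may make it easier to see which sources have not yet been examined for a given inference, and to compare validity arguments across studies.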
From this application, we can see how the argument-based validation framework translates abstract validity concepts into concrete, actionable steps. By providing a clear structure for articulating and evaluating validity claims, the framework “forces” us to think more intentionally, and to critically consider the inferences that we make from measurement scores. It also moves us beyond traditional psychometric analyses to consider multiple sources of evidence. Most tellingly, the approach can serve as a common language and methodology that would enable more systematic and programmatic validation efforts across the field. To facilitate its wider adoption, we have developed a free and open-access tool (https://validarg.netlify.app/) that makes it easier for researchers to construct and visualize argument structures like the one illustrated in Figure 2. Ultimately, this approach can serve as a unifying framework for more rigorous and cumulative research, both for the L2MSS and beyond.
Looking forward
As L2 motivation researchers, we know that the will to learn and the skill to do so are equally crucial for successful language acquisition. The same principle applies to resolving the current crisis in L2MSS research. To move forward constructively, we need both psychological readiness and methodological maturity.
For psychological readiness, there is a need to reframe the “crisis” narrative, to value controversy, to normalize failure and (self-)correction, and to resist the allure of novelty. Methodologically, we suggest that an argument-based approach to validation can provide a promising direction. The integrated framework outlined in this article offers an anchor for systematic validation and for structured thinking about how validation is approached.
Moving forward, both individual local validation studies and field-wide syntheses of validity arguments will be crucial if the validation crisis is to be successfully navigated. Enhanced rigor is needed at multiple levels. Increased theoretical precision in construct definitions, more rigorous construct operationalization, and more systematic validation efforts can each go some way toward resolving the controversies now plaguing L2MSS research. Finally, we believe that the validation crisis in L2MSS research can have field-wide implications. Efforts to address the crisis can form the foundations for a credibility revolution that can place L2 motivation research at the forefront of methodological rigor and constructive self-scrutiny in applied linguistics.
Acknowledgements
This research was conducted within the research program Transdisciplinary Approaches to Learning, Acquisition, Multilingualism (TEAM) funded by Riksbankens Jubileumsfond, grant number M23-0052. We gratefully acknowledge this support. We would also like to express our gratitude to Dan Isbell, Ali H. Al-Hoorie, and Phil Hiver for their valuable feedback on earlier versions of this manuscript.
 
 


