
Score-Based Tests of Differential Item Functioning via Pairwise Maximum Likelihood Estimation

Published online by Cambridge University Press:  01 January 2025

Ting Wang* (University of Missouri)
Carolin Strobl (University of Zurich)
Achim Zeileis (Universität Innsbruck)
Edgar C. Merkle (University of Missouri)

*Correspondence should be made to Ting Wang, Department of Psychological Sciences, University of Missouri, Columbia, MO, USA. Email: twb8d@mail.missouri.edu

Abstract

Measurement invariance is a fundamental assumption in item response theory models, where the relationship between a latent construct (ability) and observed item responses is of interest. Violations of this assumption can render the scale misleading to interpret or introduce systematic bias against certain groups of persons. While a number of methods have been proposed to detect measurement invariance violations, they typically require that the problematic item parameters and the respondent grouping be specified in advance; in practice, this information is usually unknown. As an alternative, this paper focuses on a recently proposed family of tests based on stochastic processes of casewise derivatives of the likelihood function (i.e., scores). These score-based tests require estimation of only the null model (in which measurement invariance is assumed to hold), and they have previously been applied in factor-analytic, continuous-data contexts as well as in models of the Rasch family. In this paper, we aim to extend these tests to two-parameter item response models, with a strong emphasis on pairwise maximum likelihood. The tests’ theoretical background and implementation are detailed, and the tests’ ability to identify problematic item parameters is studied via simulation. An empirical example illustrating the tests’ use in practice is also provided.
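The core idea behind the score-based tests described in the abstract can be illustrated with a minimal numerical sketch. The sketch below is a hypothetical illustration, not the authors' implementation: it simulates mean-zero casewise score contributions (their behavior under the null of measurement invariance), orders cases by a suspected DIF covariate, decorrelates the scores with the empirical information matrix, and computes a double-maximum statistic on the cumulative score process.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: casewise score contributions s_i for p item
# parameters, evaluated at the null (invariance-assumed) estimates.
# Here they are simulated as mean-zero noise, which is how they
# behave when measurement invariance holds.
n, p = 500, 4
scores = rng.normal(size=(n, p))
scores -= scores.mean(axis=0)            # scores sum to zero at the MLE

# Order persons by the auxiliary variable (e.g., age) along which
# DIF is suspected; a stand-in uniform covariate is used here.
order = np.argsort(rng.uniform(size=n))

# Decorrelate via the empirical information matrix, then form the
# cumulative score process B(t) = I^{-1/2} * sum_{i <= nt} s_i / sqrt(n).
info = scores.T @ scores / n
decorrelate = np.linalg.inv(np.linalg.cholesky(info))
process = np.cumsum(scores[order] @ decorrelate.T, axis=0) / np.sqrt(n)

# Double-maximum statistic: large values signal instability of some
# parameter, i.e., a potential measurement invariance violation.
dm_stat = np.abs(process).max()
print(f"double-max statistic: {dm_stat:.3f}")
```

In practice the statistic would be compared against critical values derived from the limiting Brownian-bridge process rather than inspected directly.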

Type
Original Paper
Copyright
Copyright © 2017 The Psychometric Society


Footnotes

Supported by National Science Foundation Grants SES-1061334 and 1460719.

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s11336-017-9591-8) contains supplementary material, which is available to authorized users.
