Restricted Recalibration of Item Response Theory Models

Published online by Cambridge University Press:  01 January 2025

Yang Liu*
Affiliation: University of Maryland
Ji Seung Yang
Affiliation: University of Maryland
Alberto Maydeu-Olivares
Affiliation: University of South Carolina; University of Barcelona
* Correspondence should be made to Yang Liu, Department of Human Development and Quantitative Methodology, University of Maryland, College Park, USA. Email: yliu87@umd.edu

Abstract

In item response theory (IRT), it is often necessary to perform restricted recalibration (RR) of the model: A set of (focal) parameters is estimated holding a set of (nuisance) parameters fixed. Typical applications of RR include expanding an existing item bank, linking multiple test forms, and associating constructs measured by separately calibrated tests. In the current work, we provide full statistical theory for RR of IRT models under the framework of pseudo-maximum likelihood estimation. We describe the standard error calculation for the focal parameters, the assessment of overall goodness-of-fit (GOF) of the model, and the identification of misfitting items. We report a simulation study to evaluate the performance of these methods in the scenario of adding a new item to an existing test. Parameter recovery for the focal parameters as well as Type I error and power of the proposed tests are examined. An empirical example is also included, in which we validate the pediatric fatigue short-form scale in the Patient-Reported Outcome Measurement Information System (PROMIS), compute global and local GOF statistics, and update parameters for the misfitting items.
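
As a hedged illustration of the pseudo-maximum likelihood idea behind RR, the following sketch uses notation introduced here and is not reproduced from the article. Let $\theta$ collect the focal parameters and $\nu$ the nuisance parameters, with $\tilde{\nu}$ the values carried over from an earlier calibration and held fixed. The sketch assumes that the model is correctly specified, that $\tilde{\nu}$ was estimated from a sample independent of the $N$ new respondents, and that $\sqrt{N}(\tilde{\nu} - \nu_0)$ is asymptotically normal with covariance $\Sigma_{\nu}$. Under those assumptions, a standard first-order expansion of the focal score gives:

```latex
% Pseudo-ML estimator: maximize the log-likelihood over the focal
% parameters only, with the nuisance parameters fixed at \tilde{\nu}.
\hat{\theta} = \arg\max_{\theta} \; \ell_N(\theta, \tilde{\nu})

% Expanding the focal score around the true values (\theta_0, \nu_0):
\sqrt{N}(\hat{\theta} - \theta_0)
  \approx \mathcal{I}_{\theta\theta}^{-1}
  \left[ N^{-1/2} \, \frac{\partial \ell_N(\theta_0, \nu_0)}{\partial \theta}
         - \mathcal{I}_{\theta\nu} \, \sqrt{N}(\tilde{\nu} - \nu_0) \right]

% Resulting asymptotic covariance of the focal estimates, where
% \mathcal{I} is the per-observation Fisher information, partitioned
% into focal (\theta) and nuisance (\nu) blocks:
\operatorname{acov}\!\left[ \sqrt{N}(\hat{\theta} - \theta_0) \right]
  = \mathcal{I}_{\theta\theta}^{-1}
  + \mathcal{I}_{\theta\theta}^{-1} \mathcal{I}_{\theta\nu}
    \, \Sigma_{\nu} \,
    \mathcal{I}_{\nu\theta} \mathcal{I}_{\theta\theta}^{-1}
```

The second term is the extra uncertainty that comes from treating estimated nuisance parameters as if they were known. It vanishes when $\Sigma_{\nu} = 0$ or $\mathcal{I}_{\theta\nu} = 0$, and only then do the naive standard errors based on $\mathcal{I}_{\theta\theta}^{-1}$ alone remain valid.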

Type
Original Paper
Copyright
Copyright © 2019 The Psychometric Society

Footnotes

Electronic supplementary material: The online version of this article (https://doi.org/10.1007/s11336-019-09667-4) contains supplementary material, which is available to authorized users.

The authors would like to thank Dr. David Thissen from the Department of Psychology at the University of North Carolina at Chapel Hill for his feedback and suggestions about this work. The participation of Ji Seung Yang was supported by the National Science Foundation under Grant EHR-1534846. The participation of Alberto Maydeu-Olivares was supported by the National Science Foundation under Grant SES-1659936.
