Restricted Recalibration of Item Response Theory Models

Yang Liu; Ji Seung Yang; Alberto Maydeu-Olivares

doi:10.1007/s11336-019-09667-4

Published online by Cambridge University Press: 01 January 2025

Yang Liu*: University of Maryland
Ji Seung Yang: University of Maryland
Alberto Maydeu-Olivares: University of South Carolina; University of Barcelona
*Correspondence should be made to Yang Liu, Department of Human Development and Quantitative Methodology, University of Maryland, College Park, USA. Email: yliu87@umd.edu

Abstract

In item response theory (IRT), it is often necessary to perform restricted recalibration (RR) of the model: A set of (focal) parameters is estimated holding a set of (nuisance) parameters fixed. Typical applications of RR include expanding an existing item bank, linking multiple test forms, and associating constructs measured by separately calibrated tests. In the current work, we provide full statistical theory for RR of IRT models under the framework of pseudo-maximum likelihood estimation. We describe the standard error calculation for the focal parameters, the assessment of overall goodness-of-fit (GOF) of the model, and the identification of misfitting items. We report a simulation study to evaluate the performance of these methods in the scenario of adding a new item to an existing test. Parameter recovery for the focal parameters as well as Type I error and power of the proposed tests are examined. An empirical example is also included, in which we validate the pediatric fatigue short-form scale in the Patient-Reported Outcome Measurement Information System (PROMIS), compute global and local GOF statistics, and update parameters for the misfitting items.
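The core idea of RR, estimating focal parameters while nuisance parameters stay fixed, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a two-parameter logistic (2PL) model, a standard normal latent variable approximated on a fixed grid, and a plain grid search in place of a proper optimizer. All parameter values, the sample size, and the names `a_fixed`, `b_fixed`, and `neg_loglik` are hypothetical choices for illustration. It mimics the scenario of adding one new item to an existing test, and it covers point estimation only, not the standard errors or GOF statistics described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def irf_2pl(theta, a, b):
    """2PL item response function; returns an array of shape (len(theta), n_items)."""
    return 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))

# Hypothetical existing test: 10 items whose parameters are held fixed (nuisance).
a_fixed = np.full(10, 1.5)
b_fixed = np.linspace(-1.5, 1.5, 10)
# Hypothetical new item (focal); these are the values the sketch tries to recover.
a_new_true, b_new_true = 1.2, 0.3

# Simulate responses to all 11 items for n respondents.
n = 2000
theta = rng.standard_normal(n)
p_all = irf_2pl(theta, np.append(a_fixed, a_new_true), np.append(b_fixed, b_new_true))
y = (rng.random(p_all.shape) < p_all).astype(float)
y_old, y_new = y[:, :10], y[:, 10]

# Equally spaced grid approximating the N(0, 1) latent density.
nodes = np.linspace(-4.0, 4.0, 41)
weights = np.exp(-0.5 * nodes**2)
weights /= weights.sum()

# Log-likelihood of the old-item responses at each node.
# Computed once, because the old-item parameters never change.
p_old = irf_2pl(nodes, a_fixed, b_fixed)                                  # (41, 10)
log_l_old = y_old @ np.log(p_old).T + (1.0 - y_old) @ np.log(1.0 - p_old).T  # (n, 41)

def neg_loglik(a, b):
    """Negative marginal log-likelihood as a function of the focal parameters only."""
    p_new = 1.0 / (1.0 + np.exp(-a * (nodes - b)))                        # (41,)
    log_l_new = np.outer(y_new, np.log(p_new)) + np.outer(1.0 - y_new, np.log(1.0 - p_new))
    marginal = np.exp(log_l_old + log_l_new) @ weights                    # (n,)
    return -np.log(marginal).sum()

# Crude grid search over the focal parameters (stands in for a real optimizer).
a_grid = np.linspace(0.5, 2.5, 41)
b_grid = np.linspace(-1.0, 1.0, 41)
vals = np.array([[neg_loglik(a, b) for b in b_grid] for a in a_grid])
i, j = np.unravel_index(vals.argmin(), vals.shape)
a_hat, b_hat = a_grid[i], b_grid[j]
print("estimated new-item slope and location:", a_hat, b_hat)
```

Because the old-item likelihood is evaluated once and reused, each evaluation of `neg_loglik` only touches the new item. Note that in this sketch the fixed parameters are the true generating values; in practice they would be estimates from an earlier calibration, and accounting for their sampling variability is part of what a full treatment has to address.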

Keywords

item response theory; measurement invariance; cross-validation; item calibration; pseudo-maximum likelihood; residual; contingency table; goodness of fit

Information

Type: Original Paper
Information: Psychometrika, Volume 84, Issue 2, June 2019, pp. 529–553

DOI: https://doi.org/10.1007/s11336-019-09667-4
Copyright: Copyright © 2019 The Psychometric Society

Footnotes

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s11336-019-09667-4) contains supplementary material, which is available to authorized users.

The authors would like to thank Dr. David Thissen from the Department of Psychology at the University of North Carolina at Chapel Hill for his feedback and suggestions about this work. The participation of Ji Seung Yang was supported by the National Science Foundation under Grant EHR-1534846. The participation of Alberto Maydeu-Olivares was supported by the National Science Foundation under Grant SES-1659936.
