
Assessing Item Fit for Unidimensional Item Response Theory Models Using Residuals from Estimated Item Response Functions

Published online by Cambridge University Press:  01 January 2025

Shelby J. Haberman
Affiliation: Educational Testing Service

Sandip Sinharay*
Affiliation: Educational Testing Service

Kyong Hee Chon
Affiliation: Western Kentucky University

*Requests for reprints should be sent to Sandip Sinharay, CTB/McGraw-Hill, Monterey, CA, USA. E-mail: sandip_sinharay@ctb.com

Abstract

Residual analysis (e.g., Hambleton & Swaminathan, Item response theory: principles and applications, Kluwer Academic, Boston, 1985; Hambleton, Swaminathan, & Rogers, Fundamentals of item response theory, Sage, Newbury Park, 1991) is a popular method for assessing the fit of item response theory (IRT) models. We suggest a form of residual analysis that may be applied to assess item fit for unidimensional IRT models. The residual analysis consists of a comparison of the maximum-likelihood estimate of the item characteristic curve with an alternative ratio estimate of the item characteristic curve. The large-sample distribution of the residual is proved to be standard normal when the IRT model fits the data. We compare the performance of our suggested residual to the standardized residual of Hambleton et al. (Fundamentals of item response theory, Sage, Newbury Park, 1991) in a detailed simulation study. We then calculate our suggested residuals using data from an operational test. The residuals appear to be useful in assessing item fit for unidimensional IRT models.
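The abstract contrasts the proposed residual with the standardized residual of Hambleton et al. (1991), which compares observed and model-implied proportions correct across the ability range. The sketch below illustrates that general idea, not the authors' exact procedure (their residual compares a maximum-likelihood estimate of the item characteristic curve with a ratio estimate): it simulates responses to a single item under a two-parameter logistic (2PL) model and computes binned standardized residuals. The item parameters, sample size, and binning scheme are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

def icc_2pl(theta, a, b):
    """Item characteristic curve for a 2PL item: P(correct | theta)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical item parameters and simulated examinee abilities.
a, b = 1.2, 0.3
theta = rng.normal(size=5000)
responses = (rng.random(theta.size) < icc_2pl(theta, a, b)).astype(int)

# Standardized residuals in the spirit of Hambleton et al. (1991):
# bin examinees by ability, then compare the observed proportion correct
# in each bin with the model-implied probability at the bin midpoint.
edges = np.linspace(-2.5, 2.5, 11)
for lo, hi in zip(edges[:-1], edges[1:]):
    mask = (theta >= lo) & (theta < hi)
    n = mask.sum()
    if n == 0:
        continue
    obs = responses[mask].mean()          # observed proportion correct
    exp = icc_2pl(0.5 * (lo + hi), a, b)  # model-implied probability
    z = (obs - exp) / np.sqrt(exp * (1.0 - exp) / n)
    print(f"theta in [{lo:+.2f}, {hi:+.2f}): n={n:4d}  z={z:+.2f}")
```

Because the data here are generated from the same model being checked, the residuals should mostly fall within roughly plus or minus two; systematic departures in real data would flag item misfit. Note that this binned statistic ignores uncertainty in the estimated item parameters and abilities, which is one motivation for the asymptotically standard normal residual the paper develops.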

Type
Original Paper
Copyright
Copyright © 2012 The Psychometric Society


Footnotes

Note: Any opinions expressed in this publication are those of the authors and not necessarily of Educational Testing Service. Sandip Sinharay conducted this study and wrote this report while on staff at Educational Testing Service. He is currently at CTB/McGraw-Hill.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Bock, R.D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443–459.
Bock, R.D., & Haberman, S.J. (2009). Confidence bands for examining goodness-of-fit of estimated item response functions. Paper presented at the annual meeting of the Psychometric Society, Cambridge, UK.
Box, G.E.P., & Draper, N.R. (1987). Empirical model-building and response surfaces. New York: Wiley.
Chon, K.H., Lee, W., & Dunbar, S.B. (2010). A comparison of item fit statistics for mixed IRT models. Journal of Educational Measurement, 47, 318–338.
Cochran, W.G. (1977). Sampling techniques (3rd ed.). New York: Wiley.
Dodeen, H. (2004). The relationship between item parameters and item fit. Journal of Educational Measurement, 41, 259–268.
du Toit, M. (2003). IRT from SSI. Lincolnwood: Scientific Software International.
Glas, C.A.W., & Suarez-Falcon, J.C. (2003). A comparison of item-fit statistics for the three-parameter logistic model. Applied Psychological Measurement, 27(2), 87–106.
Haberman, S.J. (1976). Generalized residuals for log-linear models. In Proceedings of the ninth international biometrics conference (pp. 104–172). Boston: International Biometric Society.
Haberman, S.J. (1977). Log-linear models and frequency tables with small expected cell counts. The Annals of Statistics, 5, 1148–1169.
Haberman, S.J. (1977). Maximum likelihood estimates in exponential response models. The Annals of Statistics, 5, 815–841.
Haberman, S.J. (1978). Analysis of qualitative data, Vol. I: Introductory topics. New York: Academic Press.
Haberman, S.J. (1979). Analysis of qualitative data, Vol. II: New developments. New York: Academic Press.
Haberman, S.J. (1988). A stabilized Newton–Raphson algorithm for log-linear models for frequency tables derived by indirect observation. Sociological Methodology, 18, 193–211.
Haberman, S.J. (2006). Adaptive quadrature for item response models (Research Rep. No. RR-06-29). Princeton: ETS.
Haberman, S.J. (2009). Use of generalized residuals to examine goodness of fit of item response models (Research Rep. No. RR-09-15). Princeton: ETS.
Haberman, S.J., & Sinharay, S. (2012). Assessing goodness of fit of item response theory models using generalized residuals. Unpublished manuscript.
Hambleton, R.K., & Han, N. (2005). Assessing the fit of IRT models to educational and psychological test data: A five-step plan and several graphical displays. In W.R. Lenderking & D. Revicki (Eds.), Advances in health outcomes research methods, measurement, statistical analysis, and clinical applications (pp. 57–78). Washington: Degnon Associates.
Hambleton, R.K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer Academic.
Hambleton, R.K., Swaminathan, H., & Rogers, H.J. (1991). Fundamentals of item response theory. Newbury Park: Sage.
Holland, P.W. (1990). The Dutch identity: A new tool for the study of item response models. Psychometrika, 55, 5–18.
Kang, T., & Chen, T.T. (2008). Performance of the generalized S-χ² item-fit index for polytomous IRT models. Journal of Educational Measurement, 45, 391–406.
Kolen, M.J., & Brennan, R.L. (2004). Test equating, scaling, and linking (2nd ed.). New York: Springer.
Li, Y., & Rupp, A.A. (2011). Performance of the S-χ² statistic for full-information bifactor models. Educational and Psychological Measurement, 71, 986–1005.
Liang, T., Han, T.K., & Hambleton, R.K. (2009). ResidPlots-2: Computer software for IRT graphical residual analyses. Applied Psychological Measurement, 33, 411–412.
Louis, T. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society, Series B, 44.
Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47.
Mislevy, R.J., & Bock, R.D. (1991). BILOG 3.11 [Computer software]. Lincolnwood: Scientific Software International.
Muraki, E. (1997). A generalized partial credit model. In W.J. van der Linden & R.K. Hambleton (Eds.), Handbook of modern item response theory. New York: Springer.
Muraki, E., & Bock, R.D. (2003). PARSCALE 4: IRT item analysis and test scoring for rating-scale data [Computer program]. Chicago: Scientific Software.
Naylor, J.C., & Smith, A.F.M. (1982). Applications of a method for the efficient computation of posterior distributions. Applied Statistics, 31, 214–225.
Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24, 50–64.
Rao, C.R. (1973). Linear statistical inference and its applications (2nd ed.). New York: Wiley.
Reckase, M.D. (1997). The past and future of multidimensional item response theory. Applied Psychological Measurement, 21, 25–36.
Sinharay, S. (2005). Assessing fit of unidimensional item response theory models using a Bayesian approach. Journal of Educational Measurement, 42, 375–394.
Sinharay, S. (2006). Bayesian item fit analysis for unidimensional item response theory models. British Journal of Mathematical and Statistical Psychology, 59, 429–449.
Sinharay, S. (2010). How often do subscores have added value? Results from operational and simulated data. Journal of Educational Measurement, 47, 150–174.
Stone, C.A., & Zhang, B. (2003). Assessing goodness of fit of item response theory models: A comparison of traditional and alternative procedures. Journal of Educational Measurement, 40(4), 331–352.
von Davier, M., Sinharay, S., Beaton, A.E., & Oranje, A. (2006). The statistical procedures used in National Assessment of Educational Progress. In C.R. Rao & S. Sinharay (Eds.), Handbook of statistics (pp. 205–233). Amsterdam: North-Holland.
Wainer, H., & Thissen, D. (1987). Estimating ability with the wrong model. Journal of Educational Statistics, 12, 339–368.
Yen, W. (1981). Using simulation results to choose a latent trait model. Applied Psychological Measurement, 5, 245–262.