
Psychometrics: from Practice to Theory and Back

15 Years of nonparametric multidimensional IRT, DIF/test equity, and skills diagnostic assessment

Published online by Cambridge University Press:  01 January 2025

William Stout*
Affiliation: Department of Statistics, University of Illinois; Educational Testing Service
*Requests for reprints should be sent to William Stout, Department of Statistics, University of Illinois, 725 S. Wright Street, Champaign, IL 61820. E-mail: stout@stat.uiuc.edu

Abstract

The paper surveys 15 years of progress in three psychometric research areas: latent dimensionality structure, test fairness, and skills diagnosis of educational tests. It is proposed that one effective model for selecting and carrying out research is to choose one's research questions from practical challenges facing educational testing, to bring sophisticated probability modeling and statistical analyses to bear on these questions, and finally to make the effectiveness of the research answers in meeting the educational testing challenges the ultimate criterion for judging the value of the research. The problem-solving power and the joy of working with a dedicated, focused, and collegial group of colleagues are emphasized. Finally, it is suggested that the summative assessment testing paradigm that has driven test measurement research for over half a century is giving way to a new paradigm that also embraces skills-level formative assessment, opening up a plethora of challenging, exciting, and societally important research problems for psychometricians.

Type: Articles
Copyright © 2002 The Psychometric Society

Footnotes

This article is based on the Presidential Address William Stout gave on June 23, 2002, at the 67th Annual Meeting of the Psychometric Society held in Chapel Hill, North Carolina.—Editor

I wish to especially thank Sarah Hartz and Louis Roussos for their suggestions that helped shape this paper. I wish to thank all my former Ph.D. students: Without their contributions, the content of this paper would have been vastly different and much less interesting!

Dedication: I want to dedicate this paper to my wife, Barbara Meihoefer, who was lost to illness in this year of my presidency. For, in addition to all the wonderful things she meant to me personally and the enormous support she gave concerning my career, she truly enjoyed and greatly appreciated my psychometric colleagues and indeed found psychometrics an important and fascinating intellectual endeavor, in particular finding the skills diagnosis area exciting and important: She often took time from her career as a business manager and entrepreneur to attend psychometric meetings with me and to discuss research projects with my colleagues and me. She would have enjoyed this paper.—William Stout
