
Standard Error of Ability Estimates and the Classification Accuracy and Consistency of Binary Decisions

Published online by Cambridge University Press: 01 January 2025

Ying Cheng* (University of Notre Dame)
Cheng Liu (University of Notre Dame)
John Behrens (Pearson)

*Correspondence should be sent to Ying Cheng, Department of Psychology, University of Notre Dame, 118 Haggar Hall, Notre Dame, IN 46556, USA. E-mail: ycheng4@nd.edu

Abstract

While estimation bias is a primary concern in psychological and educational measurement, the standard error is of equal importance in linking key aspects of the assessment structure, especially when the goal of the assessment is to classify individuals into categories (e.g., mastery/non-mastery). In this paper, we show analytically how the standard error of ability estimates affects expected classification accuracy and consistency when the decision is binary. As the standard error decreases, the conditional classification accuracy and consistency increase. Given an examinee population and a cut score, a smaller standard error over the entire latent trait continuum guarantees higher overall expected classification accuracy and consistency. We also show the interrelationship among the standard error, the expected classification consistency, and reliability. Utilizing the relationship between the standard error and expected classification accuracy and consistency, we derive upper bounds on the overall expected classification accuracy and consistency of a fixed-length computerized adaptive test. Lower bounds on the expected classification accuracy and consistency are also derived under several stopping rules for variable-length computerized adaptive testing. Implications of these analytical results for operational tests are discussed.
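To make these relationships concrete, below is a minimal numerical sketch; it is not taken from the paper, and the function names, the constant standard error, and the standard normal examinee population are assumptions of the illustration. Under one standard formulation, the ability estimate is treated as normally distributed around the true ability, so the conditional classification accuracy at a cut score is Phi(|theta - cut| / SE), and the conditional consistency is p^2 + (1 - p)^2 with p = Phi((theta - cut) / SE); both increase as SE decreases.

```python
import numpy as np
from scipy.stats import norm

def conditional_accuracy(theta, cut, se):
    """P(theta_hat falls on the same side of the cut as theta),
    assuming theta_hat ~ N(theta, se^2)."""
    return norm.cdf(np.abs(theta - cut) / se)

def conditional_consistency(theta, cut, se):
    """P(same binary decision on two independent administrations | theta)."""
    p = norm.cdf((theta - cut) / se)  # P(classified above the cut | theta)
    return p**2 + (1.0 - p)**2

def overall_indices(cut, se, grid=np.linspace(-4.0, 4.0, 2001)):
    """Overall expected accuracy and consistency, averaging the conditional
    indices over a standard normal population (an assumption of this sketch)."""
    w = norm.pdf(grid)
    w = w / w.sum()  # discretized population weights
    acc = np.sum(w * conditional_accuracy(grid, cut, se))
    con = np.sum(w * conditional_consistency(grid, cut, se))
    return acc, con

# A smaller standard error yields higher overall expected accuracy and consistency:
for se in (0.5, 0.3, 0.1):
    acc, con = overall_indices(cut=0.0, se=se)
    print(f"SE = {se}: expected accuracy = {acc:.3f}, consistency = {con:.3f}")
```

Running the sketch shows both overall indices rising monotonically as the (here constant) standard error shrinks, which is the qualitative pattern described above; the paper's analytical results sharpen this into explicit upper bounds for fixed-length and lower bounds for variable-length computerized adaptive tests.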

Type
Original Paper
Copyright
Copyright © 2014 The Psychometric Society
