
Monitoring Scale Scores over Time via Quality Control Charts, Model-Based Approaches, and Time Series Techniques

Published online by Cambridge University Press:  01 January 2025

Yi-Hsuan Lee*
Affiliation: Educational Testing Service

Alina A. von Davier
Affiliation: Educational Testing Service

*Requests for reprints should be sent to Yi-Hsuan Lee, Educational Testing Service, Princeton, NJ, USA. E-mail: YLee@ETS.ORG

Abstract

Maintaining a stable score scale over time is critical for all standardized educational assessments. Traditional quality control tools and approaches for assessing scale drift either require special equating designs or may be too time-consuming to apply regularly to an operational test with a short window between an administration and its score reporting. Thus, the traditional methods are not sufficient to catch unusual testing outcomes in a timely manner. This paper presents a new approach for score monitoring and assessment of scale drift. It combines quality control charts, model-based approaches, and time series techniques to accommodate the following needs of monitoring scale scores: continuous monitoring, adjustment for customary variations, identification of abrupt shifts, and assessment of autocorrelation. Performance of the methodologies is evaluated using manipulated data based on real responses from 71 administrations of a large-scale high-stakes language assessment.
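To make the quality-control-chart idea concrete, the sketch below implements a two-sided tabular CUSUM chart (in the spirit of Page, 1954) applied to a sequence of administration-level mean scale scores. This is an illustrative example only, not the authors' operational procedure; the target mean, allowance `k`, and decision interval `h` are hypothetical values chosen for the demonstration.

```python
def cusum(values, target, k, h):
    """Two-sided tabular CUSUM for detecting a sustained shift in the mean.

    values: sequence of administration-level mean scale scores
    target: the in-control (expected) mean
    k: allowance/slack per observation, often half the shift to be detected
    h: decision interval; an alarm fires when either cumulative sum exceeds h
    Returns the 1-based index of the first alarming administration, or None.
    """
    c_pos = c_neg = 0.0
    for i, x in enumerate(values, start=1):
        # Accumulate deviations above and below target, discounted by k,
        # and reset to zero whenever the sum would go negative.
        c_pos = max(0.0, c_pos + (x - target) - k)
        c_neg = max(0.0, c_neg + (target - x) - k)
        if c_pos > h or c_neg > h:
            return i
    return None

# Hypothetical example: 40 in-control administrations at the target mean,
# followed by an abrupt 2-point upward shift in the scale scores.
scores = [100.0] * 40 + [102.0] * 10
print(cusum(scores, target=100.0, k=0.3, h=5.0))  # → 43
```

Each post-shift administration adds 2.0 − 0.3 = 1.7 to the upper sum, so the chart crosses the decision interval on the third shifted administration (index 43), illustrating how CUSUM charts flag small sustained drifts faster than inspecting individual points.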

Type
Original Paper
Copyright
Copyright © 2013 The Psychometric Society

