
Observed-Score Equating as a Test Assembly Problem

Published online by Cambridge University Press:  01 January 2025

Wim J. van der Linden*
Affiliation:
University of Twente
Richard M. Luecht
Affiliation:
National Board of Medical Examiners
*Requests for reprints should be sent to W. J. van der Linden, Department of Educational Measurement and Data Analysis, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands. E-mail: vanderlinden@edte.utwente.nl

Abstract

A set of linear conditions on item response functions is derived that guarantees identical observed-score distributions on two test forms. The conditions can be added as constraints to a linear programming model for test assembly, so that a new test form is assembled with an observed-score distribution optimally equated to the distribution on an old form. For a well-designed item pool and items fitting the IRT model, use of the model results in observed-score pre-equating and removes the need for post hoc equating by a conventional observed-score equating method. An empirical example illustrates the use of the model for an item pool from the Law School Admission Test.
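As a rough illustration of what "identical observed-score distributions" means for a given ability level, the sketch below computes the number-correct distribution of a form from its item response probabilities and compares two candidate forms against an old form. Several things here are assumptions and are not stated in the abstract: the three-parameter logistic response function, the item parameters (all hypothetical), the recursive computation of the score distribution, and the use of sums of powers of the response probabilities as the quantity being matched. It is a minimal sketch, not the authors' assembly model.

```python
import math


def p_3pl(theta, a, b, c):
    """Response probability under a 3PL model (assumed here for illustration)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def score_distribution(probs):
    """Number-correct distribution given theta, built up one item at a time."""
    dist = [1.0]  # with zero items, a score of 0 has probability 1
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for x, mass in enumerate(dist):
            nxt[x] += mass * (1.0 - p)   # item answered incorrectly
            nxt[x + 1] += mass * p       # item answered correctly
        dist = nxt
    return dist


def power_sums(probs, max_order):
    """Sums of r-th powers of the response probabilities, r = 1..max_order."""
    return [sum(p ** r for p in probs) for r in range(1, max_order + 1)]


def max_gap(u, v):
    """Largest absolute difference between two equally long sequences."""
    return max(abs(x - y) for x, y in zip(u, v))


def form_probs(form, theta):
    return [p_3pl(theta, a, b, c) for a, b, c in form]


# Hypothetical item parameters (a, b, c); none come from the paper.
old_form = [(1.0, -0.5, 0.2), (0.8, 0.0, 0.2), (1.2, 0.5, 0.2)]
candidate_same = [(1.2, 0.5, 0.2), (1.0, -0.5, 0.2), (0.8, 0.0, 0.2)]
candidate_diff = [(1.0, 0.8, 0.2), (0.8, 1.0, 0.2), (1.2, 1.5, 0.2)]
thetas = [-1.0, 0.0, 1.0]

for theta in thetas:
    old_p = form_probs(old_form, theta)
    for name, form in [("same", candidate_same), ("diff", candidate_diff)]:
        new_p = form_probs(form, theta)
        dist_gap = max_gap(score_distribution(old_p), score_distribution(new_p))
        sums_gap = max_gap(power_sums(old_p, len(old_p)),
                           power_sums(new_p, len(new_p)))
        print(f"theta={theta:+.1f} {name}: dist gap={dist_gap:.4f}, "
              f"power-sum gap={sums_gap:.4f}")
```

In this toy setup the reordered form reproduces the old form's score distribution at every ability value, while the harder form does not. In an assembly model, gaps of this kind would be controlled through constraints on the item-selection variables rather than checked after the fact.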

Type
Original Paper
Copyright
Copyright © 1998 The Psychometric Society

Footnotes

The authors are most indebted to Norman D. Verhelst for suggesting Proposition 4 and its proof, to the Law School Admission Council (LSAC) for making available the data set, and to Wim M. M. Tielen for his computational assistance.
