Evidence and Inference in Educational Assessment

Robert J. Mislevy

doi:10.1007/BF02294388

Published online by Cambridge University Press: 01 January 2025

Affiliation: Educational Testing Service
Requests for reprints should be sent to Robert J. Mislevy, Educational Testing Service, Princeton, NJ 08541.


Abstract

Educational assessment concerns inference about students' knowledge, skills, and accomplishments. Because data are never so comprehensive and unequivocal as to ensure certitude, test theory evolved in part to address questions of weight, coverage, and import of data. The resulting concepts and techniques can be viewed as applications of more general principles for inference in the presence of uncertainty. Issues of evidence and inference in educational assessment are discussed from this perspective.

Keywords

Bayesian inference networks; cognitive psychology; evidence; inference; performance assessment; probability; psychometrics; test theory

Type: Original Paper
Information: Psychometrika, Volume 59, Issue 4, December 1994, pp. 439–483

DOI: https://doi.org/10.1007/BF02294388
Copyright: Copyright © 1994 The Psychometric Society


Footnotes

Presidential address to the Psychometric Society, presented June 25, 1994, in Champaign, Illinois.

Supported by (1) Contract No. N00014-91-J-4101, R&T 4421573-01, from the Cognitive Science Program, Cognitive and Neural Sciences Division, Office of Naval Research, (2) the National Center for Research on Evaluation, Standards, and Student Testing (CRESST), Educational Research and Development Program, cooperative agreement number R117G10027 and CFDA catalog number 84.117G, as administered by the Office of Educational Research and Improvement, U.S. Department of Education, and (3) the Statistical and Psychometric Research Division of Educational Testing Service. I am grateful for comments and suggestions from Henry Braun, Drew Gitomer, Richard Patz, Jonathan Troper, and Howard Wainer.

References

Aitkin, M., & Longford, N. (1986). Statistical modeling issues in school effectiveness studies. Journal of the Royal Statistical Society, Series A, 149, 1–43.

American Council on the Teaching of Foreign Languages (1989). ACTFL proficiency guidelines, Yonkers, NY: Author.

Andreassen, S., Jensen, F. V., & Olesen, K. G. (1990). Medical expert systems based on causal probabilistic networks, Aalborg, Denmark: Aalborg University, Institute of Electronic Systems.

Andersen, S. K., Jensen, F. V., Olesen, K. G., & Jensen, F. (1989). HUGIN: A shell for building Bayesian belief universes for expert systems [computer program], Aalborg, Denmark: HUGIN Expert.

Anderson, T. J., & Twining, W. L. (1991). Analysis of evidence, Boston: Little, Brown, & Co.

Askin, W. (1985). Evaluating the Advanced Placement portfolio in studio art, Princeton, NJ: Educational Testing Service.

Bentham, J. (1825). A treatise on judicial evidence, London: Hunt & Clarke.

Bentham, J. (1827). Rationale of judicial evidence, London: Hunt & Clarke.

Box, G. E. P., & Tiao, G. C. (1973). Bayesian inference in statistical analysis, Reading, MA: Addison-Wesley.

Cohen, L. J. (1977). The probable and the provable, Oxford: The Clarendon Press.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles, New York: Wiley.

de Finetti, B. (1974). Theory of probability, London: Wiley.

Diaconis, P., & Freedman, D. (1980). Finite exchangeable sequences. The Annals of Probability, 8, 745–764.

Falmagne, J-C. (1989). A latent trait model via a stochastic learning theory for a knowledge space. Psychometrika, 54, 283–303.

Glaser, R., Lesgold, A., & Lajoie, S. (1987). Toward a cognitive theory for the measurement of achievement. In Ronning, R., Glover, J., Conoley, J. C., & Witt, J. (Eds.), The influence of cognitive psychology on testing and measurement: The Buros-Nebraska Symposium on measurement and testing (pp. 41–85). Hillsdale, NJ: Erlbaum.

Good, I. J. (1950). Probability and the weighing of evidence, New York: Hafner.

Greeno, J. G. (1989). A perspective on thinking. American Psychologist, 44, 134–141.

Gulliksen, H. (1961). Measurement of learning and mental abilities. Psychometrika, 26, 93–107.

Haertel, E. H., & Wiley, D. E. (1993). Representations of ability structures: Implications for testing. In Frederiksen, N., Mislevy, R. J., & Bejar, I. I. (Eds.), Test theory for a new generation of tests (pp. 359–384). Hillsdale, NJ: Erlbaum.

Holland, P. W., & Rosenbaum, P. R. (1986). Conditional association and unidimensionality in monotone latent trait variable models. Annals of Statistics, 14, 1523–1543.

Inhelder, B., & Piaget, J. (1958). The growth of logical thinking from childhood to adolescence, New York: Basic.

Jöreskog, K. G., & Sörbom, D. (1979). Advances in factor analysis and structural equation models, Cambridge, MA: Abt Books.

Kahneman, D., Slovic, P., & Tversky, A. (1982). Judgment under uncertainty: Heuristics and biases, Cambridge: Cambridge University Press.

Kempf, W. (1983). Some theoretical concerns about applying latent trait models in educational testing. In Anderson, S. B., & Helmick, J. S. (Eds.), On educational testing (pp. 252–270). San Francisco: Jossey-Bass.

Kolmogorov, A. N. (1950). Foundations of the theory of probability, New York: Chelsea.

Koretz, D. (1992). Evaluating and validating indicators of mathematics and science education, Santa Monica, CA: RAND.

Kuhn, T. S. (1970). The structure of scientific revolutions (2nd ed.), Chicago: University of Chicago Press.

Kyllonen, P. C., Lohman, D. F., & Snow, R. E. (1984). Effects of aptitudes, strategy training, and test facets on spatial task performance. Journal of Educational Psychology, 76, 130–145.

Lakatos, I. (1970). Falsification and the methodology of scientific research programs. In Lakatos, I., & Musgrave, A. (Eds.), Criticism and the growth of knowledge (pp. 91–196). Cambridge: Cambridge University Press.

Lauritzen, S. L., & Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems (with discussion). Journal of the Royal Statistical Society, Series B, 50, 157–224.

Lazarsfeld, P. F. (1950). The logical and mathematical foundation of latent structure analysis. In Stouffer, S. A., Guttman, L., Suchman, E. A., Lazarsfeld, P. F., Star, S. A., & Clausen, J. A. (Eds.), Studies in social psychology in World War II, Volume 4: Measurement and prediction (pp. 362–412). Princeton, NJ: Princeton University Press.

Levine, M., & Drasgow, F. (1982). Appropriateness measurement: Review, critique, and validating studies. British Journal of Mathematical and Statistical Psychology, 35, 42–56.

Lewis, C. (1986). Test theory and Psychometrika: The past twenty-five years. Psychometrika, 51, 11–22.

Linacre, J. M. (1989). Multi-faceted Rasch measurement, Chicago: MESA Press.

Lindley, D. V., & Novick, M. R. (1981). The role of exchangeability in inference. Annals of Statistics, 9, 45–58.

Lord, F. M. (1980). Applications of item response theory to practical testing problems, Hillsdale, NJ: Erlbaum.

Martin, J. D., & VanLehn, K. (1993). OLEA: Progress toward a multi-activity, Bayesian student modeler. In Brna, S. P., Ohlsson, S., & Pain, H. (Eds.), Artificial intelligence in education: Proceedings of AI-ED 93 (pp. 410–417). Charlottesville, VA: Association for the Advancement of Computing in Education.

Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177–196.

Mislevy, R. J. (in press). Probability-based inference in cognitive diagnosis. In Nichols, P., Chipman, S., & Brennan, R. (Eds.), Cognitively diagnostic assessment. Hillsdale, NJ: Erlbaum.

Mislevy, R. J., & Sheehan, K. M. (1989). The role of collateral information about examinees in item parameter estimation. Psychometrika, 54, 661–679.

Mislevy, R. J., Sheehan, K. M., & Wingersky, M. S. (1993). How to equate tests with little or no data. Journal of Educational Measurement, 30, 55–78.

Mislevy, R. J., Yamamoto, K., & Anacker, S. (1992). Toward a test theory for assessing student understanding. In Lesh, R. A., & Lamon, S. (Eds.), Assessments of authentic performance in school mathematics (pp. 293–318). Washington, DC: American Association for the Advancement of Science.

Mitchell, R. (1992). Testing for learning: How new approaches to evaluation can improve American schools, New York: The Free Press.

Myford, C. M., & Mislevy, R. J. (in press). Monitoring and improving a portfolio assessment system (ETS Research Report). Princeton, NJ: Educational Testing Service.

Noetic Systems (1991). ERGO [computer program], Baltimore, MD: Author.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference, San Mateo, CA: Kaufmann.

Peploe, M., Wollen, P., & Antonioni, M. (1975). The passenger, New York, NY: Random House.

Platt, W. J. (1975). Policy making and international studies in educational evaluation. In Purves, A. C., & Levine, D. U. (Eds.), Educational policy and international assessment (pp. 33–59). Berkeley, CA: McCutchan.

Posner, G. (1993). Case closed: Lee Harvey Oswald and the assassination of JFK, New York: Random House.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests, Copenhagen: Danish Institute for Educational Research.

Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. Annals of Statistics, 12, 1151–1172.

Schum, D. A. (1981). Sorting out the effects of witness sensitivity and response-criterion placement upon the inferential value of testimonial evidence. Organizational Behavior and Human Performance, 27, 153–196.

Schum, D. A. (1987). Evidence and inference for the intelligence analyst, Lanham, MD: University Press of America.

Shafer, G. (1976). A mathematical theory of evidence, Princeton: Princeton University Press.

Shafer, G., & Shenoy, P. (1988). Bayesian and belief-function propagation, Lawrence, KS: University of Kansas, School of Business.

Siegler, R. S. (1981). Developmental sequences within and between concepts. Monographs of the Society for Research in Child Development, 46 (Serial No. 189).

Spearman, C. (1904). “General intelligence” objectively determined and measured. American Journal of Psychology, 15, 201–292.

Spearman, C. (1927). The abilities of man: Their nature and measurement, New York: Macmillan.

Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions based on item response theory. Journal of Educational Measurement, 20, 345–354.

Tatsuoka, K. K. (1987). Validation of cognitive sensitivity for item response curves. Journal of Educational Measurement, 24, 233–245.

Tatsuoka, K. K. (1990). Toward an integration of item response theory and cognitive error diagnosis. In Frederiksen, N., Glaser, R., Lesgold, A., & Shafto, M. G. (Eds.), Diagnostic monitoring of skill and knowledge acquisition (pp. 453–488). Hillsdale, NJ: Erlbaum.

Thompson, P. W. (1982). Were lions to speak, we wouldn't understand. Journal of Mathematical Behavior, 3, 147–165.

Twining, W. L. (1985). Theories of evidence: Bentham and Wigmore, Stanford, CA: Stanford University Press.

VanLehn, K. (1990). Mind bugs: The origins of procedural misconceptions, Cambridge, MA: MIT Press.

Wainer, H., Dorans, N. J., Flaugher, R., Green, B. F., Mislevy, R. J., Steinberg, L., & Thissen, D. (1990). Computerized adaptive testing: A primer, Hillsdale, NJ: Lawrence Erlbaum.

Wigmore, J. H. (1937). The science of judicial proof (3rd ed.), Boston: Little, Brown, & Co.

Wolf, D., Bixby, J., Glenn, J., & Gardner, H. (1991). To use their minds well: Investigating new forms of student assessment. In Grant, G. (Ed.), Review of Research in Education, Vol. 17 (pp. 31–74). Washington, DC: American Educational Research Association.

Wright, S. (1934). The method of path coefficients. Annals of Mathematical Statistics, 5, 161–215.

Yamamoto, K. (1987). A model that combines IRT and latent class models. Unpublished doctoral dissertation, University of Illinois, Champaign-Urbana.
