Standard Errors and Confidence Intervals of Norm Statistics for Educational and Psychological Tests

Published online by Cambridge University Press: 01 January 2025

Hannah E. M. Oosterhuis*
Affiliation: Tilburg University
L. Andries van der Ark
Affiliation: University of Amsterdam
Klaas Sijtsma
Affiliation: Tilburg University
* Correspondence should be made to Hannah E. M. Oosterhuis, Department of Methodology and Statistics, Tilburg University, PO Box 90153, 5000 LE Tilburg, The Netherlands. Email: H.E.M.Oosterhuis@tilburguniversity.edu

Abstract

Norm statistics allow test users to interpret scores on psychological and educational tests by relating an individual's test score to the test scores of individuals in the same group (for example, the same gender, age, or education group). Because norm statistics are estimated from a sample, they are subject to sampling error, and one would expect researchers to report their standard errors. In practice, standard errors are seldom reported; they are either unavailable or derived under strong distributional assumptions that may not be realistic for test scores. We derived standard errors for four norm statistics (standard deviation, percentile ranks, stanine boundaries, and Z-scores) under the mild assumption that the test scores are multinomially distributed. A simulation study showed that the standard errors were unbiased and that the corresponding Wald-based confidence intervals had good coverage. Finally, we discuss how the standard errors can be applied in practical test use in education and psychology. The procedure is available as the R function check.norms in the mokken package.
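
To make the idea in the abstract concrete, the following is a minimal Python sketch of a standard error and Wald-based confidence interval for one norm statistic, the percentile rank, under the multinomial assumption. It is not the authors' derivation and not the check.norms implementation. It assumes the mid-rank definition of the percentile rank (the percentage of scores below x plus half the percentage equal to x), which makes the statistic a linear function of the multinomial proportions, so its sampling variance follows directly. The function name percentile_rank_wald and the example data are hypothetical.

```python
import math
from collections import Counter


def percentile_rank_wald(scores, x, z=1.96):
    """Percentile rank of score x with a multinomial standard error.

    Assumption (not taken from the paper): the percentile rank is
    100 * (P(score < x) + 0.5 * P(score = x)), i.e. 100 * a'p with
    weights a_j = 1 below x, 0.5 at x, and 0 above x.
    Returns (estimate, standard error, (lower, upper)), all in percent.
    """
    n = len(scores)
    counts = Counter(scores)

    # Observed multinomial proportions, collapsed to "below x" and "equal to x".
    below = sum(c for s, c in counts.items() if s < x) / n
    equal = counts.get(x, 0) / n

    # Linear statistic a'p and its multinomial variance:
    # Var(a'p_hat) = (sum_j a_j^2 p_j - (a'p)^2) / n.
    estimate = below + 0.5 * equal
    second_moment = below + 0.25 * equal
    variance = (second_moment - estimate ** 2) / n
    se = math.sqrt(max(variance, 0.0))

    # Wald interval, truncated to the admissible range [0, 1].
    lower = max(0.0, estimate - z * se)
    upper = min(1.0, estimate + z * se)
    return 100 * estimate, 100 * se, (100 * lower, 100 * upper)


# Hypothetical norm sample of 10 test scores on a 0-2 scale.
sample = [0] * 2 + [1] * 3 + [2] * 5
pr, se, (lo, hi) = percentile_rank_wald(sample, x=1)
print(f"percentile rank = {pr:.1f}, SE = {se:.2f}, 95% CI = ({lo:.1f}, {hi:.1f})")
```

With a sample this small the interval is very wide, which illustrates why reporting standard errors alongside norm statistics matters. Nonlinear statistics such as the standard deviation or Z-scores would need a delta-method approximation rather than this exact linear-case variance.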

Type
Original paper
Copyright
Copyright © 2016 The Psychometric Society

