
Monitoring Countries in a Changing World: A New Look at DIF in International Surveys

Published online by Cambridge University Press: 01 January 2025

Robert J. Zwitser*
Affiliation: University of Amsterdam

S. Sjoerd F. Glaser
Affiliation: University of Amsterdam

Gunter Maris
Affiliation: University of Amsterdam; Cito Institute for Educational Measurement

*Correspondence should be made to Robert J. Zwitser, University of Amsterdam, Amsterdam, The Netherlands. Email: zwitser@uva.nl

Abstract

This paper discusses differential item functioning (DIF) in international surveys, where DIF is likely to occur. What is needed is a statistical approach that takes DIF into account while still allowing meaningful comparisons between countries. Some existing approaches are discussed and an alternative is proposed. The core of this alternative approach is to define the construct as a large set of items and to report results in terms of summary statistics. Since the data are incomplete, measurement models are used to complete them; for that purpose, different models can be used in different countries. The method is illustrated with PISA's reading literacy data. The results indicate that this approach fits the data better than the current PISA methodology, yet the resulting league tables are nearly identical. The implications for monitoring changes over time are discussed.
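To make the reported approach concrete, the following minimal sketch illustrates the core idea the abstract describes: treat the construct as one large item set, complete each country's incomplete response matrix with a country-specific measurement model, and report a summary statistic over the full set. Everything here is a hypothetical illustration, not the authors' code: the data are simulated, the helper names (fit_rasch, complete_and_summarise) are invented, and the Rasch "fit" is a deliberately crude logit-of-proportions approximation rather than the estimation used in the paper.

```python
# Hypothetical sketch of the market-basket idea: per-country model fit,
# model-based completion of missing responses, summary over all items.
import numpy as np

rng = np.random.default_rng(0)

def fit_rasch(resp):
    """Crude Rasch-style fit: logit-transformed marginal proportions.
    Illustrative only; not the estimation method used in the paper."""
    obs = ~np.isnan(resp)
    p_item = np.nansum(resp, axis=0) / obs.sum(axis=0)   # item proportions correct
    p_pers = np.nansum(resp, axis=1) / obs.sum(axis=1)   # person proportions correct
    eps = 1e-3
    beta = -np.log(np.clip(p_item, eps, 1 - eps) / np.clip(1 - p_item, eps, 1 - eps))
    theta = np.log(np.clip(p_pers, eps, 1 - eps) / np.clip(1 - p_pers, eps, 1 - eps))
    return theta, beta

def complete_and_summarise(resp):
    """Replace missing responses by model-implied probabilities,
    then return the mean expected score on the full item set."""
    theta, beta = fit_rasch(resp)
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - beta[None, :])))
    completed = np.where(np.isnan(resp), p, resp)
    return completed.sum(axis=1).mean()

# Hypothetical example: two countries, a 40-item "basket" defining the
# construct; each pupil is administered a random half of the items.
n_items, n_pupils = 40, 500
for country, shift in [("A", 0.0), ("B", 0.3)]:
    theta = rng.normal(shift, 1.0, n_pupils)
    beta = rng.normal(0.0, 1.0, n_items)
    prob = 1.0 / (1.0 + np.exp(-(theta[:, None] - beta[None, :])))
    full = (rng.random((n_pupils, n_items)) < prob).astype(float)
    mask = rng.random((n_pupils, n_items)) < 0.5          # incomplete design
    resp = np.where(mask, full, np.nan)
    print(country, round(complete_and_summarise(resp), 2))
```

Note that each country's data are completed with its own model fit, so item parameters are free to differ across countries (accommodating DIF); only the summary statistic on the common item set is compared.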

Type
Original Paper
Copyright
Copyright © 2016 The Psychometric Society

