Published online by Cambridge University Press: 01 March 2008
In the present study, we explore whether multiple data sources may be more effective than single sources at predicting the words that language learners are likely to know. Second language researchers have hypothesized that there is a relationship between word frequency and the likelihood that words will be encountered or used by second language learners, but it is not yet clear how this relationship can best be measured. An analysis of word frequency measures showed that spoken language frequency alone may predict the occurrence of words in learner textbooks, but that multiple corpora as well as textbook status can improve predictions of learner usage.
Arna van Doorn assembled the vocabulary lists from the three Dutch textbooks. The Max Planck Institute for Psycholinguistics provided access to the CELEX, CGN, and ESF corpora. The analysis was conducted in R (R Development Team, 2005) using the stats (R Development Team, 2005) and MASS (Venables & Ripley, 2002) libraries. This research was supported by the Nederlandse Organisatie voor Wetenschappelijk Onderzoek (NWO). We would also like to thank two anonymous reviewers for useful suggestions, and Jan Hulstijn for providing helpful comments and references for textbook vocabulary selection, including, in addition to those cited in the text, Hazenberg (1994) and Sciarone (1979).