The cross-linguistic performance of word segmentation models over time

Andrew CAINES; Emma ALTMANN-RICHER; Paula BUTTERY

doi:10.1017/S0305000919000485

Published online by Cambridge University Press: 11 October 2019

Andrew CAINES*: Department of Computer Science & Technology, University of Cambridge, Cambridge, UK
Emma ALTMANN-RICHER: Faculty of Modern & Medieval Languages, University of Cambridge, Cambridge, UK
Paula BUTTERY: Department of Computer Science & Technology, University of Cambridge, Cambridge, UK
*Corresponding author: Department of Computer Science & Technology, William Gates Building, 15 JJ Thomson Avenue, Cambridge CB3 0FD, UK. E-mail: andrew.caines@cl.cam.ac.uk

Abstract

We select three word segmentation models with psycholinguistic foundations – transitional probabilities, the diphone-based segmenter, and PUDDLE – which track phoneme co-occurrence and positional frequencies in input strings and, in the case of PUDDLE, build lexical and diphone inventories. The models are evaluated on caregiver utterances in 132 CHILDES corpora representing 28 languages and 11.9 million words. PUDDLE shows the best performance overall, albeit with wide cross-linguistic variation. We explore the reasons for this variation by fitting regression models to performance scores, with predictors that capture lexico-phonological characteristics of the input: word length, utterance length, diversity in the lexicon, the frequency of one-word utterances, the regularity of phoneme patterns at word boundaries, and the distribution of diphones in each language. Together these properties explain four-tenths of the observed variation in segmentation performance, a strong outcome and a solid foundation for studying further variables which make the segmentation task difficult.
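
As an illustration of the first model family named above, the following is a minimal sketch of segmentation by forward transitional probabilities over phoneme strings. It is not the authors' implementation: the function names `train_tp` and `segment`, the toy corpus, and the choice to place a boundary at every local minimum in transitional probability are all assumptions made for the example, and each character stands in for one phoneme.

```python
from collections import Counter


def train_tp(utterances):
    """Estimate forward transitional probabilities P(b | a) from unsegmented strings."""
    firsts, pairs = Counter(), Counter()
    for utt in utterances:
        for a, b in zip(utt, utt[1:]):
            firsts[a] += 1          # times `a` is followed by anything
            pairs[(a, b)] += 1      # times `a` is followed by `b`
    return {pair: n / firsts[pair[0]] for pair, n in pairs.items()}


def segment(utt, tp):
    """Split `utt` wherever the transitional probability is a strict local minimum."""
    if len(utt) < 2:
        return [utt] if utt else []
    # One score per adjacent phoneme pair; unseen pairs get probability 0.
    scores = [tp.get((a, b), 0.0) for a, b in zip(utt, utt[1:])]
    words, start = [], 0
    for i, score in enumerate(scores):
        left = scores[i - 1] if i > 0 else float("inf")
        right = scores[i + 1] if i + 1 < len(scores) else float("inf")
        if score < left and score < right:
            words.append(utt[start:i + 1])  # boundary falls after phoneme i
            start = i + 1
    words.append(utt[start:])
    return words


# Hypothetical toy input: three "words" (ba, di, ku) in every order, no spaces.
corpus = ["badiku", "kubadi", "dikuba", "bakudi", "dibaku", "kudiba"]
tp = train_tp(corpus)
print(segment("badiku", tp))  # ['ba', 'di', 'ku']
```

In this toy corpus the within-word transitions (b to a, d to i, k to u) have probability 1.0, while every cross-word transition has probability 0.5, so the dips line up with the word boundaries. Real caregiver speech is far noisier, which is one reason the paper's evaluation reports wide variation across languages.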

Keywords

word segmentation; CHILDES; statistical learning

Information

Type: Articles
Journal: Journal of Child Language, Volume 46, Issue 6, November 2019, pp. 1169–1201

DOI: https://doi.org/10.1017/S0305000919000485
Copyright: © Cambridge University Press 2019
