
The Development of Canonical Proportion as a Function of Community, Multilingualism, and Target Language’s Syllable Complexity

Published online by Cambridge University Press:  19 February 2026

Kai Jia Tey*
Affiliation:
Département d’Etudes Cognitives, Laboratoire de Sciences Cognitives et de Psycholinguistique, ENS, EHESS, CNRS, Paris, France
Sarah Walker
Affiliation:
School of Economics, University of New South Wales, Australia
Amanda Seidl
Affiliation:
Department of Communication Sciences and Disorders, University of Delaware, USA
Camila Scaff
Affiliation:
Institute of Evolutionary Medicine, Switzerland
Loann Peurey
Affiliation:
Département d’Etudes Cognitives, Laboratoire de Sciences Cognitives et de Psycholinguistique, ENS, EHESS, CNRS, Paris, France
Bridgette L. Kelleher
Affiliation:
Department of Psychological Sciences, Purdue University, USA
William Havard
Affiliation:
Laboratoire Ligérien de Linguistique, University of Orléans, France
Lisa Hamrick
Affiliation:
Department of Psychology, University of South Carolina, USA
Pauline Grosjean
Affiliation:
School of Economics, University of New South Wales, Australia
Margaret Cychosz
Affiliation:
Department of Linguistics, University of California Los Angeles, USA
Heidi Colleran
Affiliation:
BirthRites Lise Meitner Research Group, Max Planck Institute for Evolutionary Anthropology, Germany
Marisa Casillas
Affiliation:
Comparative Human Development, University of Chicago Division of the Social Sciences, USA
Elika Bergelson
Affiliation:
Department of Psychology, Harvard University, USA
Kasia Hitczenko
Affiliation:
Department of Computer Science, The George Washington University, USA
Alejandrina Cristia
Affiliation:
Département d’Etudes Cognitives, Laboratoire de Sciences Cognitives et de Psycholinguistique, ENS, EHESS, CNRS, Paris, France
*Corresponding author: Kai Jia Tey; Email: kaijiatey@gmail.com

Abstract

This study investigates the development of canonical proportion (CP), an indicator of speech development, across diverse language and environmental contexts. Using the Speech Maturity Dataset (SMD) comprising 366 children, aged 0;2–6;4, across 10 different languages and cultures, we explore the influence of multilingual exposure, language syllable complexity, and community type (industrialised, non-industrialised) on CP. We find that monolingual children display higher CP measures than their multilingual peers. In addition, CP is higher for children learning languages with simple syllable complexity than those with more complex syllables. We also find no significant differences in the CP trajectory of children from industrialised versus non-industrialised communities. Integrating these findings in the broader literature, we highlight the importance of diversifying participant samples to capture the complex relationship between language exposure, social environment, and language development.

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press

1. Introduction

Although language is universal among humans, a host of research suggests meaningful individual and group variation that can potentially shed light on the factors and processes involved in typical and atypical language development, including speech development. Recent work suggests that canonical proportion (CP; introduced below) is one measure of speech development that can be deployed in a large age range at scale. The present study explores a recently released large dataset, the Speech Maturity Dataset (SMD; Hitczenko et al., 2025), which builds upon previous work (Cychosz et al., 2021; Hitczenko et al., 2023; Semenzin et al., 2021), and includes data from hundreds of children growing up exposed to various language and cultural backgrounds. Our primary goal is to harness variability in the SMD to understand how several aspects of language environments may relate to the development of CP.

1.1. Indicators of speech development

Research reveals a variety of indicators of the process whereby children come to control the form and meaning of their vocalisations. For example, the number of speech-like vocalisations (i.e., Children’s Vocalisation Counts) increases with children’s age (e.g., Gilkerson et al., 2017) and correlates with standardised tests such as the MacArthur–Bates Communicative Development Inventory (Wang et al., 2020). Other work studies how infants employ such speech-like vocalisations. For example, Long et al. (2020) reported that over half of infants’ speech-like vocalisations (“protophones” in that paper) are directed to nobody in particular (rather than socially motivated), at least based on lab recordings of children aged 3–10 months.

Other research zooms in on phonetic characteristics of the vocalisations themselves. For instance, vocalisations containing clear consonant–vowel transitions begin to appear in infants’ babble at around 7 months of age, constituting about 15% of their syllables by 10 months of age (Oller et al., 1999). Research tracking the syllable or word shapes apparent in infants’ speech often finds continuity across babbling and first words (e.g., Vihman, 2019). Yet other studies, able to determine which words children aim to say, measure the percentage of sounds correctly produced and derive the timeline of mastery of various consonants (for a review, see McLeod & Crowe, 2018). Fine-grained phonetic measures suggest speech development continues well beyond early childhood (Nip & Green, 2013).

1.2. Canonical proportion in early speech development: background and previous research

The present study follows much previous research in setting aside all non-speech-like vocalisations (like crying and laughing), which will not be studied further here. Instead, we focus on a division among speech-like vocalisations, aimed at capturing children’s increasing ability to produce well-formed syllables in both early babbling and later-developing meaningful speech (Cychosz et al., 2021). Among speech-like vocalisations, we distinguish those containing clear consonant–vowel or vowel–consonant transitions (canonical) and those that do not (non-canonical). Experts may argue that the meaning of “canonical” here is different from that found in discussions of canonical babbling ratios (e.g., CBRsyl and CBRutt; Molemans et al., 2012; Oller et al., 1994; note that Cychosz & Long, 2025, reported no effect of using syllables vs. utterances on any of their analyses). Canonical babbling ratio often indexes the proportion of utterances or syllables that are canonical based on a very precise definition (e.g., with a supraglottal consonant and fully resonant vowel joined by a rapid, adult-like transition), and it is typically applied by expert annotators (but see Oller et al., 2001) to babble by pre-linguistic infants. To protect privacy and avoid sharing sensitive information (e.g., names that could be obvious when listening to full vocalisations), we follow Cychosz et al. (2021) by using automated algorithms to select vocalisations and then split them into 500 ms clips. This approach additionally allows judgment by minimally trained annotators.
Even though the methods diverge, it is possible that CP and CBR yield similar results. To the best of our knowledge, no published study has assessed this.

To date, there have been only four studies using CP, which we summarise next. Two cross-linguistic investigations have examined parts of the dataset used in the present study (see Table 1). Cychosz et al. (2021) analysed 52 infants aged 1–36 months, who were growing up in a variety of communities, as monolinguals or bilinguals. The authors found that, similarly to canonical babbling ratios, CP in these data reached about 15% by 10 months of age. However, by employing CP, they could ascertain that this percentage continued to increase in toddlerhood, reaching about 40% by 36 months. The authors argued that CP, thus defined, is a broader indicator of speech development that remains relevant beyond the babbling stage. In addition, they did not visually detect large differences across groups, which they interpreted as a sign that CP may be resilient to differences across languages and/or communities. However, they did not explicitly test for a difference between monolinguals and the others.

Table 1. Number of participants included in each corpus in previous studies and in the present dataset. The three papers largely build on each other. Hitczenko et al. (2023) included all the data in Cychosz et al. (2021) (except for USA–California). Our study includes all the data in Hitczenko et al. (2023) and more.

Building on the dataset of Cychosz et al. (2021), Hitczenko et al. (2023) extended coverage by adding 77 children (total N = 129 children) from three additional communities. By extending the age range to 6 years, they documented increases in CP beyond toddlerhood reaching up to ~80%, lending further credence to Cychosz et al.’s argument that CP may continue to track speech development beyond the pre-linguistic period. Hitczenko et al. (2023) also reported differences in CP based on the typological properties of the language(s) children were learning and whether children were growing up in an industrialised community. They did not investigate potential differences as a function of multilingualism.

Two other studies on CP bring important information to bear regarding the value of the CP measure. Crucially, Semenzin et al. (2021) demonstrated high levels of convergence across laboratory coding and citizen science approaches when studying a sample of 20 North American English-learning children. They derived two CP measures: one from laboratory annotators, who had access to the content and context of children’s entire vocalisations, and another through the same citizen science platform our own study builds on. They report strong correlations (r > .8) between laboratory CP and citizen science CP, indicating a high degree of agreement between the two methods. In addition, Semenzin et al. (2021) documented age-related CP increases among low-risk controls aged 4–18 months, whereas children with Angelman syndrome (a neurogenetic disorder) aged 11–53 months exhibited age-related decreases. This aligns conceptually with other research linking lower canonical babbling ratios with atypical language development (e.g., Patten et al., 2014).

Finally, Ott and Cychosz (2025) examined CP in 130 English-learning children at a mean age of 3 years, whose CP ranged between 20% and 76%. They observed a positive correlation between CP and age in their sample of 28–49-month-olds. Moreover, they documented that CP significantly predicted standardised assessments of speech and language, which the children completed approximately 1 year later, including consonant articulation, vocabulary size, phonological awareness, and phonological working memory. This parallels findings that delayed onset of canonical babbling (the age at which canonical syllables emerge) is associated with smaller vocabularies in later development (Oller et al., 1999).

1.3. Potential factors that may influence the development of CP

The above literature suggests that, although there are systematic increases in CP with age, there may additionally be meaningful individual (Ott & Cychosz, 2025) and group variation in CP (Cychosz et al., 2021; Hitczenko et al., 2023; Semenzin et al., 2021). Here, we study three factors that vary in our dataset (SMD; Hitczenko et al., 2025) and that reflect language and environmental diversity: multilingual exposure, syllable complexity, and community. In this section, we justify why these three factors might (not) influence the trajectory of CP.

Multilingual exposure. Should monolingual status affect CP? Since CP reflects the proportion of well-formed syllables in children’s speech, it serves as a potential indicator of whether multilingual exposure influences the rate of speech development. One long-standing hypothesis in the field of language development has postulated that multilingual exposure might delay early speech milestones (see the discussion in Fibla et al., 2021) for at least two reasons: First, the presence of multiple linguistic systems may introduce cognitive and linguistic complexity and thus confuse the learners; and second, all else equal, multilinguals may receive less input in each of their languages than corresponding monolinguals, which could influence how quickly stable speech forms emerge (in perception and/or production). Although there is a long controversy in the field of lexical acquisition (with some work suggesting delays, and others showing none; Bialystok et al., 2010; Byers-Heinlein, 2013; Hoff et al., 2012; Oller et al., 2007; Pearson et al., 1993), the bulk of the evidence suggests that there may not be any difference between monolinguals and multilinguals. However, there is less work pertaining to phonological structure. By examining CP in monolingual and multilingual children, we aim to determine whether multilingual exposure influences not just vocabulary size but also the very structure of early speech patterns. Relevant evidence comes from both targeted and broader studies of multilingualism. Oller et al. (1997) examined 29 English–Spanish bilingual infants and found that they exhibited canonical babbling ratios similar to those of 44 monolingual infants.
In contrast, more recently, Bergelson et al. (2023) took a wider perspective, analysing 1,001 children’s speech-like vocalisations using data collected with LENA, a wearable device that records daylong, naturalistic language input and output. The sample included children (2–48 months) from diverse linguistic backgrounds, from both industrialised and non-industrialised communities, with some children at risk of atypical development. Despite the large sample size in their study, they found that the number of speech-like vocalisations children produced did not vary as a function of multilingual status (but see Zheng et al., 2023). Admittedly, no previous study looked at CP specifically, a gap we sought to fill.

Syllable complexity. Some previous studies have demonstrated that ambient language influences early speech development (Andruski et al., 2014; de Boysson-Bardies et al., 1989; de Boysson-Bardies & Vihman, 1991; Levitt & Wang, 1991; Poulin-Dubois & Goodz, 2001; Sundara et al., 2020). For instance, previous cross-linguistic research has shown that children learning languages with simpler syllable structures, such as Turkish (moderate syllable complexity), acquire phonological awareness faster than those learning languages with more complex syllable structures, such as English (Stringer, 2021). Given that SMD contains data from several different languages, but not enough in each, we sought to employ a typological classification that systematically differs across languages, and which may directly influence the various types of syllable structures that children are exposed to, since syllables are a foundation of speech production. For this, we turned to Maddieson (2013), who categorised languages as a function of the types of syllables allowed: simple syllable complexity if only open syllables were allowed ((C)V); moderate syllable complexity if additionally some codas and complex onsets were allowed ((C)(C)V(C)); and complex syllable complexity if additional syllable shapes were permissible ((C)(C)(C)V(C)(C)(C)). The details of Maddieson’s categorisation will be discussed in Section 2.
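The three-way categorisation just described can be sketched in code. The following is a minimal illustration in Python, assuming that a language is summarised only by the largest onset and coda sizes it permits; the function name and its two parameters are hypothetical, and the sketch ignores Maddieson’s further restrictions on which consonants may fill the second onset position.

```python
def syllable_complexity(max_onset: int, max_coda: int) -> str:
    """Classify a language from the largest onset and coda (in consonants) it allows.

    Simplifying assumption: only cluster sizes are considered, not the
    types of consonants that may appear in each position.
    """
    if max_onset <= 1 and max_coda == 0:
        return "simple"      # only (C)V syllables
    if max_onset <= 2 and max_coda <= 1:
        return "moderate"    # up to (C)(C)V(C)
    return "complex"         # larger onsets and/or codas


# Illustrative inputs (not taken from any specific language in the dataset)
examples = {
    (1, 0): syllable_complexity(1, 0),
    (2, 1): syllable_complexity(2, 1),
    (3, 3): syllable_complexity(3, 3),
}
```

Under this simplification, any language allowing a two-consonant coda or a three-consonant onset falls into the complex category, mirroring the templates given above.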

How may a language’s syllable complexity status influence early speech development? Previous work shows that children simplify their output to match their articulatory capacities (e.g., Vihman, 2014), such that, early on, CV syllables are over-represented. Moreover, children’s own productions also influence how they perceive spoken input, heightening their attention to sounds and sound sequences they already produce (DePaolis et al., 2013; Vihman, 2014). This suggests that children whose native language allows only simple syllables might be exposed to substantially more input matching their preferred patterns than children exposed to languages with more complex syllables. Potentially, in languages with simpler syllable complexity, as infants begin to babble, their caregivers more easily recognise their babble as real words since there are fewer mismatches with the target word due to, for example, dropped codas or simplified clusters, probably affecting the social feedback loop that reinforces canonical vocalisations (Warlaumont et al., 2014). Thus, considering both input frequency and social feedback, one might expect children learning simple syllable complexity languages to develop CP faster.

Using a subset of the SMD, Hitczenko et al. (2023) found a significant effect of syllable complexity on CP. However, the strongest difference was between moderate and the other types, which challenges the assumption that syllable complexity linearly influences CP development. Similarly, Lee et al. (2018) investigated the canonical babbling ratio in long-form recordings from 21 infants learning English (a complex syllable complexity language) in the United States and Chinese (a moderate one) in both the United States and Taiwan, hypothesising that the lower level of syllable complexity found in Chinese may affect outcomes. Children learning Chinese showed numerically higher canonical ratios, although the difference was not statistically significant. Together, these previous findings invite further attention to the input language’s syllable complexity classification.

Communities. In this study, we follow previous work by adopting a simple first approach to classifying communities (Cristia, 2023; Hitczenko et al., 2023). Communities are classified as non-industrialised if they are rural, practise a subsistence-based economy, and have limited access to formal education and healthcare, in contrast to industrialised communities, which have a market-based economy and wide access to formal education and healthcare services. SMD contains data from children growing up in 10 communities. Based on this classification, most communities in SMD were considered non-industrialised. Several studies in children’s lexical development have reported that children in non-industrialised communities have smaller vocabularies and/or lower language scores than their industrialised counterparts (e.g., Ma et al., 2021; Vogt et al., 2015; see also Ma et al., 2024). However, other studies have found comparable language development trajectories between children from non-industrialised and industrialised communities (Casillas et al., 2020, 2021).

This raises the question: How might community differences influence CP? The two previous studies on which we build reached opposite conclusions. Based on the first 52 children, Cychosz et al. (2021) commented on the lack of salient visual differences in CP across the various communities (each represented by only 3–16 children). In contrast, by increasing the sample size to 129 children and attempting a comparison based on a dichotomous non-industrialised/industrialised distinction, Hitczenko et al. (2023) found significant variability: Children in non-industrialised communities showed higher CPs than those in industrialised communities but similar developmental trajectories, as there were no interaction effects with age. Our expanded dataset allows us to examine the robustness and generalisability of these patterns across a broader sample.

1.4. The present study

Our study harnesses variability in SMD to explore how CP varies by three experiential factors: multilingualism, syllable complexity, and community characteristics. Since our dataset includes and expands upon data analysed in Cychosz et al. (2021) and Hitczenko et al. (2023), our work is an exploratory extension of theirs rather than a confirmatory test, given the overlapping data.

2. Methods

Reproducibility of analyses has been ensured using a repository accessible at the review stage via OSF: https://osf.io/yjq7r/?view_only=afc0ad9758dc457b9af97056f97540d2.

2.1. Dataset

Our data come from the SMD (Hitczenko et al., 2025). Children were individually recorded using an unobtrusive wearable recording device, resulting in long audio recordings (e.g., 15 consecutive hours), which were analysed either manually or using state-of-the-art software (called VTC; Lavechin et al., 2020) to automatically identify sections of the audio in which the child or others vocalised (i.e., babbled, cooed, or spoke). These vocalisations were sampled using different methods, including key child vocalisation sampling (extract audio sections attributed to the key child), loudness sampling (extract audio sections based on the amplitude profile, so as to be independent from VTC), and female adult vocalisation sampling (extract audio sections attributed to female adults). Once sampled, these audio sections were cut into short ~500 ms clips. The 500 ms duration was selected because it is short enough to meet ethical and privacy standards, preserving participants’ sensitive information and voice identity, yet long enough for listeners to reliably judge whether the clip contains a canonical vocalisation, as shown in prior work (e.g., Semenzin et al., 2021).
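As a rough illustration of the final cutting step, the sketch below splits one sampled audio section, given by its onset and offset in milliseconds, into consecutive windows of at most 500 ms. The function name, the millisecond representation, and the handling of any leftover segment (kept here as a shorter final clip) are assumptions made for illustration; the actual SMD pipeline may treat remainders differently.

```python
def split_into_clips(start_ms: int, end_ms: int, clip_ms: int = 500):
    """Return consecutive (start, end) windows covering [start_ms, end_ms).

    Assumption: a trailing segment shorter than clip_ms is kept as its
    own, shorter clip rather than being dropped or padded.
    """
    return [
        (t, min(t + clip_ms, end_ms))
        for t in range(start_ms, end_ms, clip_ms)
    ]


# A hypothetical 1.3 s vocalisation starting 1 s into the recording
clips = split_into_clips(1000, 2300)
```

With these hypothetical inputs, the section yields two full 500 ms clips followed by one 300 ms clip.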

CP has been calculated over different units in different studies, such as utterances (in the laboratory annotation used in Semenzin et al., 2021), or chunked vocalisations (Cychosz et al., 2021; Hitczenko et al., 2023; and citizen science CP in Semenzin et al., 2021). Here, we calculated CP using those same fixed-length 500 ms clips. These clips do not necessarily align with complete child utterances or individual syllables.

The clips were then uploaded to a citizen science platform. For the vast majority of the data, the platform was Zooniverse, and the project was called the Maturity of Baby Sounds (https://www.zooniverse.org/projects/laac-lscp/maturity-of-baby-sounds). In this case, several thousand non-expert individuals contributed to crowdsourcing classifications after minimal training. Before the annotation, citizen scientists completed a tutorial and had access to reference materials and discussion forums for general questions. The data from the 52 participants from Cychosz et al. (2021) came from a different citizen science platform, which is no longer available. In all cases, at least three citizen scientists classified each individual clip into different categories: canonical, non-canonical, laughter, crying, or none of the above. Considering only clips in which a majority of annotators agreed on the label, Hitczenko et al. (2023) calculated CP for each child by dividing the total number of clips that had been classified as canonical by the sum of the number of clips classified as either canonical or non-canonical. They did this for every combination of recording date and sampling method. For instance, a child who was recorded on two separate days, and had key-child as well as female-adult vocalisations separately sampled from each day, would have four CP values associated with their participant ID, only two of which – the two key-child CP values – are relevant to the present study. For a more comprehensive description of the data collection and processing involved in the SMD, refer to Hitczenko et al. (2025).
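The CP calculation described above can be sketched as follows. This is a minimal illustration in Python, not the authors’ code: the function names and data layout are hypothetical, and “majority” is assumed here to mean that strictly more than half of a clip’s annotators chose the same label.

```python
from collections import Counter


def majority_label(labels):
    """Return the label chosen by a strict majority of annotators, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def canonical_proportion(clips):
    """Compute CP = canonical / (canonical + non-canonical).

    clips: one list of annotator labels per clip. Clips without a
    majority label, and clips labelled as anything other than canonical
    or non-canonical, do not enter the ratio.
    """
    counts = Counter(majority_label(labels) for labels in clips)
    canonical = counts["canonical"]
    non_canonical = counts["non-canonical"]
    total = canonical + non_canonical
    return canonical / total if total else None


# Hypothetical annotations for five clips from one child and recording day
clips = [
    ["canonical", "canonical", "non-canonical"],          # canonical
    ["non-canonical", "non-canonical", "non-canonical"],  # non-canonical
    ["non-canonical", "non-canonical", "crying"],         # non-canonical
    ["canonical", "non-canonical", "laughter"],           # no majority, dropped
    ["crying", "crying", "none of the above"],            # crying, not counted
]
cp = canonical_proportion(clips)  # 1 canonical out of 3 speech-like clips
```

In this toy example, only three clips contribute to the ratio, so CP is one third; a measure based on 59 canonical and 114 non-canonical clips would give 59/173, or about .34.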

No intra-annotator reliability was assessed, as each citizen scientist typically labelled each clip only once. Support for inter-rater reliability at the level of individual clip labels comes from Zhang et al. (2025; Interspeech), who studied the SMD and reported fair-to-moderate agreement among multiple human annotators, with a weighted Fleiss’ kappa of .276 (95% CI [.266, .285]). However, our analyses do not rely on individual clip labels. Instead, we aggregate all clips belonging to each child on the same recording day to compute a single child-level CP value. The aggregation substantially reduces the impact of any disagreements on individual clips. There is also promising evidence that these aggregated CP values are highly accurate. The validity of this crowd-sourcing approach was supported by Semenzin et al. (2021), who analysed data from 20 children (10 with Angelman syndrome, 10 typically developing; this is the USA–Indiana corpus, within SMD). They found high correlations between CP derived from expert human annotators and citizen scientists (overall: r(18) = .938, CI [.848, .976], p < .001; Angelman syndrome: r(8) = .855, CI [.487, .965], p < .001; typically developing children: r(8) = .966, CI [.858, .992], p < .001).

In its rawest format, SMD data are based on individual chunks. For our scientific purposes, we selected data allowing us to calculate a CP measure per child at a given age. Therefore, we excluded chunk-level data when they had been sampled to focus on female adult voices or when they had not been labelled as belonging to an infant or child. When multiple CP measures were available less than 15 days apart, we merged them. We also excluded 10 CP measures from children with atypical language development (Semenzin et al., 2021). The final dataset includes 366 children (described in detail in Section 2.2), with each child contributing one to three CP measures at different ages (median = 1, mean = 1.04), for a total of 379 included CP measures. On average, each CP measure was based on 59 canonical clips and 114 non-canonical clips. Table 2 presents, for each corpus, the number of CP measures in each age range. Note that children recorded at multiple time points contribute one CP measure for each recording session.
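One possible way to merge measures taken less than 15 days apart is sketched below. The text does not specify how the merge was carried out, so everything here is an assumption made for illustration: the function name, the tuple layout, pooling clip counts (rather than averaging CP values), comparing each measure with the previous one in the group, and keeping the earliest age as the age of the merged measure.

```python
def merge_close_measures(measures, max_gap_days=15):
    """Merge one child's measures that are less than max_gap_days apart.

    measures: list of (age_days, n_canonical, n_non_canonical) tuples,
    each assumed to contain at least one speech-like clip.
    Returns a list of (age_days, cp) tuples, one per merged group.
    """
    groups = []
    for age, canonical, non_canonical in sorted(measures):
        if groups and age - groups[-1]["last_age"] < max_gap_days:
            # Close to the previous measure: pool the clip counts
            groups[-1]["canonical"] += canonical
            groups[-1]["non_canonical"] += non_canonical
            groups[-1]["last_age"] = age
        else:
            groups.append({
                "first_age": age, "last_age": age,
                "canonical": canonical, "non_canonical": non_canonical,
            })
    return [
        (g["first_age"], g["canonical"] / (g["canonical"] + g["non_canonical"]))
        for g in groups
    ]


# Hypothetical child with three recordings, the first two 10 days apart
measures = [(300, 20, 60), (310, 40, 40), (400, 59, 114)]
merged = merge_close_measures(measures)
```

Here the first two recordings are pooled into a single measure (60 canonical out of 160 speech-like clips), while the third, 90 days later, remains separate.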

Table 2. Number of CP measures across age range, by corpus

^Languages spoken in Solomon Islands: Roviana, Avaso, Babatana, Marco, Marovo, Pidjin, Sengga, Simbo, Sisinga, Ughele, Vaghua, and Varisi. *Languages spoken in Vanuatu: Bislama, Venen Taut, Petarmul, Neverver, Uripiv, Vinmavis, Novol, Epi, Nah’ai, Paama, Ninde, Tautu, French, Pinalum, Malo, Rano, Tauta, Santo Language, Ambae, Maevo, South, Atchin, and Tempun.

2.2. Participants

We analysed data from 366 unique children (193 boys; some children contributed data for >1 day) aged 2–76 months. These children came from a variety of linguistic backgrounds, representing 10 different corpora. Table 3 provides an overview of the characteristics of each corpus.
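The age range is given in months here and in years;months notation in the abstract (0;2–6;4). The short helper below, whose name is hypothetical, shows the conversion between the two and can be used to check that the two ranges agree.

```python
def age_to_months(age: str) -> int:
    """Convert a 'years;months' age string (e.g. '6;4') to months."""
    years, months = age.split(";")
    return int(years) * 12 + int(months)


youngest = age_to_months("0;2")
oldest = age_to_months("6;4")
```

Applied to the abstract’s range, the helper gives 2 and 76 months, matching the range reported in this section.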

Table 3. Summary of corpus characteristics

*Languages spoken in Solomon Islands: Roviana, Avaso, Babatana, Marco, Marovo, Pidjin, Sengga, Simbo, Sisinga, Ughele, Vaghua, and Varisi. Languages spoken in Vanuatu: Bislama, Venen Taut, Petarmul, Neverver, Uripiv, Vinmavis, Novol, Epi, Nah’ai, Paama, Ninde, Tautu, French, Pinalum, Malo, Rano, Tauta, Santo Language, Ambae, Maevo, South, Atchin, and Tempun.

Within each corpus, SMD participants are classified as monolingual or multilingual following Bergelson et al. (2023): monolinguals are reportedly exposed to only one language (30.1% in the dataset); otherwise, they are multilinguals (i.e., non-monolinguals), regardless of the number of languages.

Monolingual children were further categorised based on the syllable complexity of their input language. Although it would have been possible to classify multilingual children by the highest syllable complexity among their input languages (see Supplementary Table 2), we thought this might introduce additional noise due to inconsistent exposure levels across languages, since detailed language exposure information at the individual level is not available. Classification followed Maddieson’s (2013) simple, moderate, and complex syllable complexity categories, as follows. In languages classified as having a simple syllable structure, syllables are restricted to (C)V sequences (C: consonant; V: vowel), allowing no onset consonant clusters, so that only V and CV are permissible syllables. This category is relatively rare, representing only 12% of the languages studied by Maddieson. In our data, Yélî Dnye is the only language representing this category.

In contrast, languages classified as having moderate syllable complexity allow single codas and/or consonant clusters (typically involving an approximant or liquid) in onsets. This includes syllable types such as VC, CVC, CCV, and CCVC, in addition to the simpler structures V and CV. Around 57% of the languages studied by Maddieson are in this category. Tsimane’ is the only language classified as moderate syllable complexity in SMD.

Finally, languages with complex syllable complexity allow more intricate consonant clusters in both onsets and codas. Some examples of syllable structures that are allowed in languages with complex syllable complexity are V, CV, VC, CVC, CCV, CCVC, CVCC, CCVCC, CCCVCC, and CCCVCCC. The remaining 31% of languages were classified by Maddieson in this category. In SMD, French, English, and Tseltal are classified as having complex syllable complexity.
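The three categories described above can be expressed as a small decision rule over syllable templates. The sketch below is a simplification and not Maddieson's full procedure: it looks only at onset and coda length and ignores the restriction that the second consonant of a moderate onset cluster is typically a liquid or approximant. Function names are hypothetical.

```python
def syllable_complexity(template):
    """Classify one syllable template such as 'CCVC' (C = consonant, V = vowel)."""
    if template.count("V") != 1 or set(template) - {"C", "V"}:
        raise ValueError("template must contain only C and exactly one V")
    onset, coda = template.split("V")
    if len(onset) <= 1 and len(coda) == 0:
        return "simple"      # V or CV only
    if len(onset) <= 2 and len(coda) <= 1:
        return "moderate"    # single coda and/or two-consonant onset
    return "complex"         # larger onset or coda clusters

def language_complexity(templates):
    """A language is as complex as its most complex permitted template."""
    order = ["simple", "moderate", "complex"]
    return max((syllable_complexity(t) for t in templates), key=order.index)

print(language_complexity(["V", "CV"]))
print(language_complexity(["V", "CV", "VC", "CVC", "CCV", "CCVC"]))
print(language_complexity(["V", "CV", "CVC", "CVCC", "CCCVCCC"]))
```

Classifying a language by its most complex permitted template mirrors the idea that each category includes all the simpler syllable types as well.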

As for the last factor this study focuses on, communities were classified as industrialised or non-industrialised (Cristia, 2023; Hitczenko et al., 2023). Our dataset includes four corpora from industrialised communities and six corpora from non-industrialised communities (see Table 3 for details). Although we acknowledge that classifying real-world communities is a complex and multifaceted process, this simple classification serves as a starting point for our analysis.

2.3. Analyses

All analyses were conducted in R version 4.3.2 (R Core Team, 2024), using the stats (R Core Team, 2024) and car (Fox & Weisberg, 2019) packages for statistical analysis, and the ggplot2 (Wickham, 2016) package for data visualisation. To better match age distributions, and given multicollinearity among the predictors, we examined the predictive value of multilingual exposure, native language syllable complexity, and community type in three separate mixed-effects logistic regression models, with CP as the dependent variable. Weighted analyses were employed to account for differences in the number of clips available for each child and to ensure accurate representation in the models. Specifically, each CP measure was weighted by the total number of canonical and non-canonical clips it was derived from (i.e., excluding those labelled as junk, NA, no majority label, and non-speech). The linear and quadratic terms of age (both z-scored) were also included, in interaction with the main effect being studied. As our study builds on Hitczenko et al. (2023), with a larger dataset, we followed many of their statistical analysis approaches, including the inclusion of the quadratic term for age. We address multiple comparisons within each model by using q-values (which adjust for the false discovery rate). Type 3 Analysis of Variance (ANOVA) tests were performed to confirm the significance of main effects and interactions identified in the regression models.
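To make the age predictors concrete, the sketch below builds z-scored linear and quadratic age terms. It reflects one possible reading of "both z-scored" (age and age squared are each standardised separately); the source does not spell out the exact procedure, and the ages used here are invented for illustration.

```python
from statistics import mean, stdev

def zscore(values):
    """Standardise values to mean 0 and (sample) standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical ages in months, spanning the 2-76 month range of the sample
ages_months = [2, 8, 14, 25, 36, 52, 76]

age_z = zscore(ages_months)                     # linear age term
age2_z = zscore([a ** 2 for a in ages_months])  # quadratic age term

for age, lin, quad in zip(ages_months, age_z, age2_z):
    print(f"{age:>3} months  linear={lin:+.2f}  quadratic={quad:+.2f}")
```

Standardising puts both terms on a comparable scale, so the reported estimates are expressed per standard deviation rather than per month.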

Multilingual exposure, syllable complexity, and community were treated as categorical variables, with monolinguals, simple syllable complexity, and non-industrialised communities set as reference levels, respectively. Re-levelling was performed in relevant analyses to facilitate specific pairwise comparisons among different categories, as described in Section 3. Child identity (child_id) was treated as a random effect nested within corpus to account for individual and corpus-level variability.

3. Results

3.1. Multilingual exposure

To explore whether CP varies as a function of multilingual exposure, we analysed data from all participants, namely 366 children, spanning the entire age range (2–76 months). We fit a mixed-effects logistic regression model to test the effects of multilingual status, age, and their interaction on CP: CP ~ age × multilingualism + age² × multilingualism + (1|corpus/child_id).
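Written out, the model formula above corresponds to a logistic mixed model of roughly the following form. The notation is illustrative and not taken from the source: i indexes CP measures, j children, and k corpora; a and q stand for the z-scored linear and quadratic age terms; m equals 1 for multilingual children and 0 for monolingual children (the reference level); u and v are random intercepts for corpus and for child within corpus.

```latex
\operatorname{logit}\,\mathbb{E}\left[\mathrm{CP}_{ijk}\right]
  = \beta_0 + \beta_1\, a_{ijk} + \beta_2\, q_{ijk}
  + \beta_3\, m_{jk} + \beta_4\, a_{ijk}\, m_{jk} + \beta_5\, q_{ijk}\, m_{jk}
  + u_k + v_{jk},
\qquad
u_k \sim \mathcal{N}(0, \sigma_u^2), \quad
v_{jk} \sim \mathcal{N}(0, \sigma_v^2)
```

In this notation, the main effect of multilingualism is β₃ and the age-by-multilingualism interactions are β₄ and β₅; all coefficients are on the log-odds scale.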

The results revealed a significant main effect of multilingualism (Estimate = −.65, SE = .17, z = −3.83, p < .001), indicating that monolingual children had higher CP than multilingual children (see Sections 1 and 2 of the Supplementary Material for robustness checks). At the mean age of the sample (25.3 months), CP is estimated at .36 for monolinguals and .22 for multilinguals, that is, 63.6% higher for the former than for the latter. Both the linear and quadratic terms for age were significant predictors of CP (Estimate = .52, SE = .06, z = 8.62, p < .001; Estimate = −.14, SE = .05, z = −2.87, p < .01). These results suggest that CP increases with age overall (as indicated by the positive linear term), but that this increase slows down with age (as indicated by the significant negative quadratic term). Notably, no significant interaction between age and multilingualism was observed (Estimate = −.02, SE = .1, z = −.21, p = .84, 95% CI [−0.21, 0.17]). While this result may suggest similar age-related trends across groups, we caution that the confidence interval includes both negligible and small (positive and negative) effect sizes. Thus, the data are consistent with either the true absence of an interaction or insufficient power to detect a true but small interaction. Figure 1 shows the CP distribution in monolingual and multilingual children across age.
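The percentage reported for the monolingual versus multilingual comparison is a relative difference, which is easy to confuse with an absolute difference in proportions. The short sketch below shows both, using the two estimated CP values given above; the function name is hypothetical.

```python
def relative_difference(higher, lower):
    """How much larger `higher` is than `lower`, as a fraction of `lower`."""
    return (higher - lower) / lower

cp_monolingual, cp_multilingual = 0.36, 0.22

rel = relative_difference(cp_monolingual, cp_multilingual)
abs_points = (cp_monolingual - cp_multilingual) * 100

print(f"relative difference: {rel:.1%}")
print(f"absolute difference: {abs_points:.0f} percentage points")
```

The relative figure matches the 63.6% given in the text, while the absolute gap between the two groups is 14 percentage points of CP.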

Figure 1. Canonical proportions by age and multilingual exposure (full sample). The regression line represents the fitted model, and the shaded bands surrounding the line represent the 95% confidence intervals. Each data point represents a single child, with point size representing the total number of vocalisations contributed by that child (larger points represent children who produced more vocalisations).

Overall, these results suggest that CP increased with age, with growth slowing over time, and that monolingual children consistently exhibited higher CP than multilingual children.

3.2. Syllable complexity

Second, we explored the relationship between syllable complexity and CP among monolingual children. Since this analysis focuses only on monolingual children (as we do not have detailed information on the languages spoken by each child in multilingual settings), we included 110 monolingual children (31 learning a simple syllable complexity language, 41 moderate, and 38 complex). This selection excluded 256 children from the analysis. We fit a mixed-effects logistic regression model to test the effects of syllable complexity, age, and their interaction on CP: CP ~ age × syllable_complexity + age² × syllable_complexity + (1|corpus/child_id).

This analysis revealed a significant main effect of syllable complexity on CP. Specifically, children learning a language of moderate syllable complexity (Estimate = −.73, SE = .14, z = −5.11, p < .001) and complex syllable complexity (Estimate = −.95, SE = .27, z = −3.52, p < .001) exhibited significantly lower CP compared with the reference category, namely children learning a language with simple syllable complexity, who exhibited the highest CP (see Section 3 of the Supplementary Material for a robustness check). Re-levelling syllable complexity with “moderate” as the reference category showed that the difference between moderate and complex syllable complexity was not statistically significant (Estimate = −.23, SE = .26, z = −0.87, p = .384). At the mean age of the sample (28.7 months), CP is estimated at .51 for simple, .35 for moderate, and .22 for complex, that is, 45.7% higher for simple than for moderate, and 59.1% higher for moderate than for complex. As in the multilingual model discussed in the previous section, the linear age predictor had a significant positive effect on CP (Estimate = .38, SE = .13, z = 2.99, p < .005), and the interactions were not significant (all p > .05). However, unlike in the previous analysis, the quadratic term for age did not explain significant variance. This suggests that disaggregating monolinguals by syllable complexity reveals less of a plateau in older age groups than in the combined analysis represented in Figure 1. Figure 2 shows monolingual children’s CP as a function of age and their native language’s syllable complexity level.

Figure 2. Canonical proportions by age and syllable complexity in monolingual children. The regression line represents the fitted model, and the shaded bands surrounding the line represent 95% confidence intervals. Each data point represents a single child, with point size indicating the total number of vocalisations contributed by that child (larger points represent children who produced more vocalisations). *Others represent French, Tseltal, and English.

3.3. Communities

Last, we explored the relationship between community types (industrialised vs. non-industrialised) and CP. Noticing an unequal age range between industrialised and non-industrialised groups, we subset data to children aged 3–19 months, resulting in 148 children (115 non-industrialised, 33 industrialised) being included in a mixed-effects logistic regression model: CP ~ age × community + age² × community + (1|corpus/child_id).

The analysis revealed no significant main effect of community (Estimate = −.38, SE = .42, z = −0.9, p = .37), suggesting no significant differences in CP measures between community types. Both the linear (Estimate = .16, SE = .06, z = 2.8, p < .01) and quadratic (Estimate = .16, SE = .08, z = 2.09, p < .05) terms for age were positively associated with CP, suggesting a developmental increase in CP over time that speeds up in this age range (3–19 months), unlike the significant plateauing observed in the analysis centred on multilingualism, which spanned our full age range (2–76 months). No significant interaction was observed between community and age. Figure 3 reveals that children from non-industrialised communities exhibited a numerically higher CP than children from industrialised communities at the youngest ages, with this difference narrowing with age. This pattern goes against what we might expect based on experiential factors. Given that industrialised children are typically exposed to more structured linguistic input, they would be expected to have higher CP. However, our results show the opposite: non-industrialised children start with higher CP, though the gap narrows with age.
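The contrast between a negative quadratic estimate (slowing growth, full sample) and a positive one (accelerating growth, 3–19 months) can be illustrated by looking at the slope of the age curve on the log-odds scale. This is a rough sketch under stated assumptions: it treats the quadratic term as the square of z-scored age, considers only the reference group of each model, and ignores that age was standardised on a different sample in each model.

```python
def logit_slope(b_linear, b_quadratic, age_z):
    """Slope of b_linear * z + b_quadratic * z**2 at z = age_z
    (change in log-odds of CP per standard deviation of age)."""
    return b_linear + 2 * b_quadratic * age_z

# (linear, quadratic) age estimates reported in Sections 3.1 and 3.3
models = {
    "full sample (2-76 months)": (0.52, -0.14),
    "community subset (3-19 months)": (0.16, 0.16),
}

for name, (b1, b2) in models.items():
    slopes = [round(logit_slope(b1, b2, z), 2) for z in (-1, 0, 1)]
    print(name, "slopes at z = -1, 0, 1:", slopes)
```

Under these assumptions, the slope shrinks as age increases in the full-sample model and grows as age increases in the 3–19-month model, which is the pattern the text describes as plateauing versus speeding up.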

Figure 3. Canonical proportions by age and community. The regression line represents the fitted model, and the shaded bands surrounding the line represent 95% confidence intervals. Each data point represents a single child (N = 115 non-industrialised; N = 33 industrialised), with point size indicating the total number of vocalisations contributed by that child (larger points represent children who produced more vocalisations).

4. Discussion

The present study explored the relationship between CP and three factors: multilingual exposure, ambient language syllable complexity, and community. Because our analyses were exploratory, we use statistical significance descriptively and not as null hypothesis testing, especially since we did not adjust for repeated testing across models. In our analyses of the SMD, monolingual children have higher CP than multilingual children. At the mean age of the sample (25.3 months), CP was estimated at .36 for monolinguals and .22 for multilinguals, meaning that monolinguals exhibited a 63.6% higher CP than multilinguals. We also found that monolingual children learning languages with simple syllable complexity exhibited the highest CP, followed by those learning languages with moderate and complex syllable complexity. At the mean age of the monolingual sample (28.7 months), CP was estimated at .51 for simple, .35 for moderate, and .22 for complex syllable complexity. This reflects a 45.7% higher CP for children learning simple versus moderate syllable complexity and a 59.1% difference between moderate and complex syllable complexity. Finally, we failed to find a significant difference between children from non-industrialised and industrialised communities. Next, we discuss each of these effects.

4.1. Multilingual exposure

The observed higher mean CP in monolingual versus multilingual children is consistent with the prediction made in the Introduction, whereby the reduced input per language received by multilingual children causes acquisition delays (Fibla et al., 2021). In line with this prediction, an Australian study (Zheng et al., 2023) found that monolingual children produced more vocalisations than multilingual children at preschool. However, the latter difference was not statistically significant after controlling for variables such as family income and home environment; and previous studies reported no significant differences in babbling (Oller et al., 2007) or speech-like vocalisation patterns between monolingual and multilingual children (Bergelson et al., 2023).

4.2. Syllable complexity

In our analyses, children learning languages with simple syllable complexity showed higher CP than those learning languages with moderately complex or complex syllables. Note that we observed a different pattern from Hitczenko et al. (2023). That analysis attributed all the Solomon Islands corpus children to a “simple” complexity type, whereas we excluded them since we did not have detailed information about their ambient languages. Differences in results thus likely reflect both the increased sample size (Serdar et al., 2021) and our methodological decision to focus specifically on monolingual children, illustrating how results can shift as datasets are expanded and analytical approaches are refined.

Could it be that the CP measure misrepresents languages with consonant clusters in some way? One may wonder if in languages with greater syllable complexity, where onset clusters are allowed, the sections of the 500 ms clip that reflect vowels or consonants are relatively shorter than in simpler syllables, and this leads to a higher proportion of clips labelled as non-canonical. We think this is not a good explanation of the effects we observed for syllable complexity on CP for several reasons. First, this would predict that children at ages where they start producing clusters would show lower CP, but this is not what we see in the data, where group differences between e.g. simple syllable complexity and complex syllable complexity languages are obvious even before 18 months, an age at which children do not readily produce many consonant clusters. Second, the 500 ms duration almost certainly covers a full syllable, even for syllables with complex onsets and codas. Finally, our request to citizen scientists was for speech-like VC or CV transitions, which should not penalise children who produce more clusters.

4.3. Communities

Our results did not align with Hitczenko et al.’s (2023) in terms of community differences. While their analysis of a subset of SMD yielded significantly higher CP in children from non-industrialised communities, ours did not reveal significant differences as a function of community type. This divergence may reflect an interplay between sample size, dataset composition, and analytical approaches that can occur when datasets are expanded.

4.4. Age

The data from all three analyses can shed light on how CP changes with age. During the early stages of development, CP increases rapidly as children make significant progress in oral motor and phonological skills, thereby improving their ability to articulate complex canonical forms (Anthony & Francis, 2005). In fact, our third analysis (Figure 3) suggests that in the 3–19-month range, CP is increasing rapidly, as evidenced by a positive and significant quadratic term. This first stage of increasing CP lasts until at least 36 months of age (see Figure 2, regression line for complex syllable languages). However, as children reach more stable stages of speech production, the rate of CP growth slows. This slowing was reflected in the negative and significant estimate of the quadratic term of age observed in the first analysis (Figure 1). This slowdown aligns with other developmental milestones in children’s speech development. Coordination between lips and jaw continues to be refined between 2 and 6 years of age (Green et al., 2000), and developmental phonological processes (e.g., consonant cluster simplification) become less frequent after around 48 months (Sampallo-Pedroza et al., 2014). Similarly, Hustad et al. (2021) report that speech intelligibility increases markedly from 30 to about 48 months, followed by more gradual growth thereafter. Additional research is necessary to study the relationship between CP and other speech development indices, such as in Ott and Cychosz (2025).

Is it the case that these developmental trends vary as a function of the language’s syllable complexity? Our analyses did not reveal a significant interaction, but we hope the question is revisited by future work with larger sample sizes and better age coverage. Notice that the oldest children learning languages with complex syllable complexity were only 36 months of age; for moderate and simple syllable complexity, this was over 70 months. Future work should also problematise the fact that children learning languages with simple or moderate syllable complexity tend to have a higher CP starting point than children learning complex syllable complexity languages, which is unexpected if differences truly are due to exposure.

4.5. A better understanding of CP differences

This and previous papers present CP as a measure of speech development. But what precise conceptual aspect of speech development does CP capture and how does this relate to the effects we observed (multilingualism, syllable complexity) or lack thereof (community)? As evidence accumulates, it would be important to start developing inductive and deductive theoretical frameworks to provide causal accounts linking children’s experience to outcomes.

One possibility we speculate on is that CP captures differences in syllable duration. According to this explanation, our CP measure will better capture full CV or VC transitions in the speech of children whose syllables are shorter. This explanation would readily account for CP differences as a function of age (since older children produce shorter syllables; Nip & Green, 2013). Perhaps it may also account for syllable complexity differences, should future research find that syllables are shorter in languages with simple syllable complexity. However, we do not readily see how it may account for the multilingualism effect we observed.

Another possibility evoked in the Introduction is that children exposed to more input develop higher CPs because the additional exposure facilitates the formation of perceptual targets, similarly to processes likely involved in children developing larger vocabularies when exposed to more speech. We feel this hypothesis does not carry a large explanatory value with respect to our results. For instance, multilingualism-related differences have rather been attributed to input quantity within each language and ensuing confusion (Fibla et al., 2021), and not the absolute quantities afforded. Similarly, the pathways we described in the Introduction for syllable complexity differences related to the amount of exposure to templates matching children’s spontaneous production, and not the overall amount of input.

Longitudinal work like Ott and Cychosz (2025) is beginning to shed light on the ways in which CP relates to other measures of speech and language development. Further work employing other statistical approaches (e.g., structural equation modelling) and carefully considering causal chains is needed to better understand how experience affects CP and speech development more broadly.

4.6. Limitations

This study builds on datasets previously analysed by Cychosz et al. (2021), Semenzin et al. (2021), and Hitczenko et al. (2023), with those earlier data forming part of our current sample. For that reason, this study cannot be confirmatory, given that our dataset was not statistically independent from previous observations. This exploratory nature does not, in our view, detract from the value of the present work. For example, Hitczenko et al. (2023) found that children from rural communities have higher CP, whereas our analysis of the expanded dataset did not. This difference highlights the importance of revisiting earlier findings using expanded datasets, rather than assuming that a given positive result will necessarily be ratified with a larger sample.

Another limitation concerns statistical inference in large shared datasets like SMD. Our field does not yet have clear guidelines for multiple-comparison correction with resources that are analysed repeatedly, and where some patterns are already known from previous studies. In such cases, it may be more appropriate to use an explicit exploratory approach, as we have done here. Because the same data are used in multiple studies, it is not possible for future researchers to be completely unaware of existing patterns in previous work.

Three other limitations relate to SMD in particular. First, SMD currently contains only one language representing each of the simple and moderate syllable complexity categories, and only three for the complex one. In addition, we could only test the difference between industrialised and non-industrialised communities in a narrow age range (3–19 months); a broader age range would be important to capture the full developmental trajectory. Second, SMD lacks detailed information on the specific languages spoken by multilingual children and their extent of exposure to each language, in part because of interdisciplinary collaboration and data reuse. For example, infants in Solomon Islands were recorded as part of a larger study, limiting the collection of specific language details due to survey length constraints. At present, there are no standardised methods for quickly gathering multilingual exposure information that do not require lengthy (30-min) surveys. As a result, we cannot categorise children by factors thought to be key in bilingualism and multilingualism research, such as language dominance and proportion of exposure (Byers-Heinlein, 2015). Third, SMD relies on judgments based on 500 ms clips. In such clips, suprasegmental cues, such as intonation, rhythm, and duration, are often not fully captured, particularly when isolated from their broader utterance context. Previous studies (Kehoe, 2022; Vihman et al., 1998) have shown that, as children grow, they begin to adjust syllable duration according to prosodic structure, typically producing longer final syllables. Such changes cannot be studied with SMD data.

5. Conclusions

This study offers valuable insights into the potential impact of multilingual exposure, syllable complexity, and community factors on the development of CP in children’s early language production. By analysing a large and diverse dataset, which includes understudied languages and communities, our findings emphasise how universal developmental processes and language-specific characteristics impact early speech development. Our findings suggest that multilingual exposure may explain some variability in CP development. Moreover, provided that a large enough sample and age range is represented, children exposed to languages with simple syllable complexity may tend to have higher CP than those learning languages with moderate and complex syllable complexity, underscoring the influence of the syllable structure of the ambient language on the development of speech production. Finding differences in speech development as a function of these factors further emphasises the importance of considering a range of linguistic and cultural contexts when studying language acquisition. Although an analysis based on a binary classification of communities into non-industrialised versus industrialised did not reveal significant differences, further research with more nuanced classification is needed to shed light on potential community-based differences. Ultimately, this study contributes to a nuanced understanding of the diverse pathways children take across languages during early speech development.

Supplementary material

The supplementary material for this article can be found at http://doi.org/10.1017/S0305000926100476.

Acknowledgements

ChatGPT was used for minor language refinement.

Funding statement

This work was funded by the J. S. McDonnell Foundation Understanding Human Cognition Scholar Award and the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (ExELang; Grant Agreement No. 101001095).

Competing interests

We have no known competing interests to disclose.

Disclosure of use of AI tools

No AI tools were used for data analysis, study design, or interpretation.

Statement of ethical approval

The present paper makes use of the SMD, which comprises 10 different corpora collected independently by different research teams. Ethical approval and consent procedures were obtained separately for each corpus, as detailed below.

1. Corpus: Bolivia (Quechua and Spanish). a. Data were obtained via the scientific archive HomeBank. Ethical approval details are not included in the archive metadata and were managed by the original data collectors.

2. Corpus: Bolivia (Tsimane’). a. Name of the ethics committee that approved the research: UNM Health Sciences Center Human Research Review Committee (HRRC).

b. Ethics Committee approval number: 17-262.

c. Types of consent obtained: Verbal informed consent was obtained from participants and/or parents or legal guardians of child participants using an IRB-approved verbal consent procedure. Child assent was obtained where developmentally appropriate.

3. Corpus: France (French). a. Name of the ethics committee that approved the research: Comité d’Éthique de la Recherche (CERES/CER Paris Descartes; later CER U-Paris).

b. Ethics Committee approval number: IRB 2015140001072 (2015-14), with amendment 2020-20-BERGMANN-CRISTIA-HAVRON.

c. Types of consent obtained: Written informed consent was obtained from parents or legal guardians of participating children. Child assent was obtained where developmentally appropriate.

4. Corpus: Mexico (Tseltal). a. Data were obtained via the scientific archive HomeBank. Ethical approval details are not included in the archive metadata and were managed by the original data collectors.

5. Corpus: PNG (Yélî Dnye). a. Name of the ethics committee that approved the research: Ethics Committee of the Max Planck Institute for Psycholinguistics (ECSW).

b. Ethics Committee approval number: ECSW2017-3001-474 (Manko–Rowland; Language Development).

c. Types of consent obtained: Written informed consent was obtained, including parental consent and child assent/co-consent where applicable.

6. Corpus: Solomon Islands. a. Name of the ethics committee that approved the research: University of New South Wales Human Research Ethics Committee (UNSW HREC).

b. Ethics Committee approval number: HC180755.

c. Types of consent obtained: Written informed consent was obtained, including parental consent and child assent/co-consent where applicable, in accordance with UNSW Human Research Ethics Committee approval.

7. Corpus: USA–California (English & Spanish). a. Data were obtained via the scientific archive HomeBank. Ethical approval details are not included in the archive metadata and were managed by the original data collectors.

8. Corpus: USA–Indiana (English). a. Data were obtained via the scientific archive HomeBank. Ethical approval details are not included in the archive metadata and were managed by the original data collectors.

9. Corpus: USA–New York (English). a. Data were obtained via the scientific archive HomeBank. Ethical approval details are not included in the archive metadata and were managed by the original data collectors.

10. Corpus: Vanuatu. a. Name of the ethics committee that approved the research: Ethics Committee (Ethik-Kommission), Universitätsklinikum Jena, Germany.

b. Ethics Committee approval number: 4818-06/16.

c. Types of consent obtained: Written informed consent was obtained, including parental consent and child assent/co-consent where applicable, in accordance with ethics approval and Vanuatu National Cultural Council research requirements.

References

Andruski, J. E., Casielles, E., & Nathan, G. (2014). Is bilingual babbling language-specific? Some evidence from a case study of Spanish–English dual acquisition. Bilingualism: Language and Cognition, 17(3), 660–672. https://doi.org/10.1017/S1366728913000655
Anthony, J. L., & Francis, D. J. (2005). Development of phonological awareness. Current Directions in Psychological Science, 14(5), 255–259. https://doi.org/10.1111/j.0963-7214.2005.00376.x
Bergelson, E., Soderstrom, M., Schwarz, I.-C., Rowland, C. F., Ramírez-Esparza, N., Hamrick, L. R., Marklund, E., Kalashnikova, M., Guez, A., Casillas, M., Benetti, L., Alphen, P. V., & Cristia, A. (2023). Everyday language input and production in 1,001 children from six continents. Proceedings of the National Academy of Sciences, 120(52), Article e2300671120. https://doi.org/10.1073/pnas.2300671120
Bialystok, E., Luk, G., Peets, K. F., & Sujin, Y. (2010). Receptive vocabulary differences in monolingual and bilingual children. Bilingualism: Language and Cognition, 13(4), 525–531. https://doi.org/10.1017/S1366728909990423
Byers-Heinlein, K. (2013). Parental language mixing: Its measurement and the relation of mixed input to young bilingual children’s vocabulary size. Bilingualism: Language and Cognition, 16(1), 32–48. https://doi.org/10.1017/S1366728912000120
Byers-Heinlein, K. (2015). Methods for studying infant bilingualism. In Schwieter, J. W. (Ed.), The Cambridge handbook of bilingual processing (pp. 133–154). Cambridge University Press. https://doi.org/10.1017/CBO9781107447257.005
Casillas, M., Brown, P., & Levinson, S. C. (2020). Early language experience in a Tseltal Mayan village. Child Development, 91(5), 1819–1835. https://doi.org/10.1111/cdev.13349
Casillas, M., Brown, P., & Levinson, S. C. (2021). Early language experience in a Papuan community. Journal of Child Language, 48(4), 792–814. https://doi.org/10.1017/S0305000920000549
Cristia, A. (2023). A systematic review suggests marked differences in the prevalence of infant-directed vocalization across groups of populations. Developmental Science, 26(1), Article e13265. https://doi.org/10.1111/desc.13265
Cychosz, M., Cristia, A., Bergelson, E., Casillas, M., Baudet, G., Warlaumont, A. S., Scaff, C., Yankowitz, L., & Seidl, A. (2021). Speech development in a large-scale crosslinguistic corpus. Developmental Science, 24(5), Article e13090. https://doi.org/10.1111/desc.13090
Cychosz, M., & Long, H. L. (2025). Canonical babbling development in infancy: A systematic review and meta-analysis of methodological and ambient language influences. https://doi.org/10.31219/osf.io/8ub2v_v1
de Boysson-Bardies, B., Hallé, P., Sagart, L., & Durand, C. (1989). A crosslinguistic investigation of vowel formants in babbling. Journal of Child Language, 16(1), 1–17. https://doi.org/10.1017/S0305000900013404
de Boysson-Bardies, B., & Vihman, M. M. (1991). Adaptation to language: Evidence from babbling and first words in four languages. Language, 67(2), 297–319. https://doi.org/10.1353/lan.1991.0045
DePaolis, R. A., Vihman, M. M., & Nakai, S. (2013). The influence of babbling patterns on the processing of speech. Infant Behavior and Development, 36(4), 642–649. https://doi.org/10.1016/j.infbeh.2013.06.007
Fibla, L., Sebastian-Galles, N., & Cristia, A. (2021). Is there a bilingual disadvantage for word segmentation? A computational modeling approach. Journal of Child Language, 49(6), 1119–1146. https://doi.org/10.1017/S0305000921000568
Fox, J., & Weisberg, S. (2019). An R companion to applied regression (R package version 3.0-6) [Computer software]. https://CRAN.R-project.org/package=car
Gilkerson, J., Richards, J. A., Warren, S. F., Montgomery, J. K., Greenwood, C. R., Kimbrough Oller, D., Hansen, J. H., & Paul, T. D. (2017). Mapping the early language environment using all-day recordings and automated analysis. American Journal of Speech-Language Pathology, 26(2), 248–265. https://doi.org/10.1044/2016_AJSLP-15-0169
Green, J. R., Moore, C. A., Higashikawa, M., & Steeve, R. W. (2000). The physiologic development of speech motor control: Lip and jaw coordination. Journal of Speech, Language, and Hearing Research, 43(1), 239–255. https://doi.org/10.1044/jslhr.4301.239
Hitczenko, K., Bergelson, E., Casillas, M., Colleran, H., Cychosz, M., Grosjean, P., Hamrick, L. R., Kelleher, B. L., Scaff, C., & Seidl, A. (2023). The development of canonical proportion continues past toddlerhood. In Proceedings of the 20th International Congress of Phonetic Sciences (ICPhS) (pp. 1210–1214). https://www.internationalphoneticassociation.org/icphs-proceedings/ICPhS2023/full_papers/774.pdf
Hitczenko, K., Peurey, L., Havard, W. N., Tey, K. J., Seidl, A., Semenzin, C., Scaff, C., Lavechin, M., Kelleher, B., Hamrick, L., Gautheron, L., Cychosz, M., Casillas, M., & Cristia, A. (2025). Speech Maturity Dataset: A cross-cultural corpus of naturalistic child and adult vocalizations. OSF Preprints. https://osf.io/preprints/psyarxiv/w438x_v1?view_only=
Hoff, E., Core, C., Place, S., Rumiche, R., Señor, M., & Parra, M. (2012). Dual language exposure and early bilingual development. Journal of Child Language, 39(1), 1–27. https://doi.org/10.1017/S0305000910000759
Hustad, K. C., Mahr, T. J., Natzke, P., & Rathouz, P. J. (2021). Speech development between 30 and 119 months in typical children I: Intelligibility growth curves for single-word and multiword productions. Journal of Speech, Language, and Hearing Research, 64(10), 3707–3719. https://doi.org/10.1044/2021_JSLHR-21-00142
Kehoe, M. (2022). The prosody of two-syllable words in French-speaking monolingual and bilingual children: A focus on initial accent and final accent. Language and Speech, 65(2), 444–471. https://doi.org/10.1177/00238309211030312
Lavechin, M., Bousbib, R., Bredin, H., Dupoux, E., & Cristia, A. (2020). An open-source voice type classifier for child-centered daylong recordings. Interspeech, 3072–3076. https://doi.org/10.21437/Interspeech.2020-1690
Lee, C.-C., Jhang, Y., Relyea, G., Chen, L., & Oller, D. K. (2018). Babbling development as seen in canonical babbling ratios: A naturalistic evaluation of all-day recordings. Infant Behavior and Development, 50, 140–153. https://doi.org/10.1016/j.infbeh.2017.12.002
Levitt, A. G., & Wang, Q. (1991). Evidence for language-specific rhythmic influences in the reduplicative babbling of French- and English-learning infants. Language and Speech, 34(3), 235–249. https://doi.org/10.1177/002383099103400302
Long, H. L., Bowman, D. D., Yoo, H., Burkhardt-Reed, M. M., Bene, E. R., & Oller, D. K. (2020). Social and endogenous infant vocalizations. PLoS One, 15(8), Article e0224956. https://doi.org/10.1371/journal.pone.0224956
Ma, Y., Jonsson, L., Feng, T., Weisberg, T., Shao, T., Yao, Z., Zhang, D., Dill, S.-E., Guo, Y., Zhang, Y., Friesen, D., & Rozelle, S. (2021). Variations in the home language environment and early language development in rural China. International Journal of Environmental Research and Public Health, 18(5), Article 2671. https://doi.org/10.3390/ijerph18052671
Ma, Y., Zhang, X., Pappas, L., Rule, A., Gao, Y., Dill, S., Feng, T., Zhang, Y., Wang, H., Cunha, F., & Rozelle, S. (2024). Associations between urbanization and the home language environment: Evidence from a LENA study in rural and peri-urban China. Child Development, 95(2), e74–e92. https://doi.org/10.1111/cdev.14034
Maddieson, I. (2013). Syllable structure. In Dryer, M. S. & Haspelmath, M. (Eds.), WALS Online (v2020.4) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.13950591; http://wals.info/chapter/12 (accessed 2025-03-08).
McLeod, S., & Crowe, K. (2018). Children’s consonant acquisition in 27 languages: A cross-linguistic review. American Journal of Speech-Language Pathology, 27(4), 1546–1571. https://doi.org/10.1044/2018_AJSLP-17-0100
Molemans, I., van den Berg, R., Van Severen, L., & Gillis, S. (2012). How to measure the onset of babbling reliably? Journal of Child Language, 39(3), 523–552. https://doi.org/10.1017/S0305000911000171
Nip, I. S. B., & Green, J. R. (2013). Increases in cognitive and linguistic processing primarily account for increases in speaking rate with age. Child Development, 84(4), 1324–1337. https://doi.org/10.1111/cdev.12052
Oller, D. K., Eilers, R. E., & Basinger, D. (2001). Intuitive identification of infant vocal sounds by parents. Developmental Science, 4(1), 49–60. https://doi.org/10.1111/1467-7687.00148
Oller, D. K., Eilers, R. E., Neal, A. R., & Schwartz, H. K. (1999). Precursors to speech in infancy: The prediction of speech and language disorders. Journal of Communication Disorders, 32(4), 223–245. https://doi.org/10.1016/s0021-9924(99)00013-1
Oller, D. K., Eilers, R. E., Steffens, M. L., Lynch, M. P., & Urbano, R. (1994). Speech-like vocalizations in infancy: An evaluation of potential risk factors. Journal of Child Language, 21(1), 33–58. https://doi.org/10.1017/S0305000900008667
Oller, D. K., Eilers, R. E., Urbano, R., & Cobo-Lewis, A. B. (1997). Development of precursors to speech in infants exposed to two languages. Journal of Child Language, 24(2), 407–425. https://doi.org/10.1017/S0305000997003097
Oller, D. K., Pearson, B. Z., & Cobo-Lewis, A. B. (2007). Profile effects in early bilingual language and literacy. Applied Psycholinguistics, 28(2), 191–230. https://doi.org/10.1017/S0142716407070117
Ott, C. M., & Cychosz, M. (2025). Can automated vocal analyses over child-centered audio recordings be used to predict speech-language development? In Proceedings of the Annual Meeting of the Cognitive Science Society (p. 47). https://escholarship.org/uc/item/6gs7p4cd
Patten, E., Belardi, K., Baranek, G. T., Watson, L. R., Labban, J. D., & Oller, D. K. (2014). Vocal patterns in infants with autism spectrum disorder: Canonical babbling status and vocalization frequency. Journal of Autism and Developmental Disorders, 44, 2413–2428. https://doi.org/10.1007/s10803-014-2047-4
Pearson, B. Z., Fernández, S. C., & Oller, D. K. (1993). Lexical development in bilingual infants and toddlers: Comparison to monolingual norms. Language Learning, 43(1), 93–120. https://doi.org/10.1111/j.1467-1770.1993.tb00174.x
Poulin-Dubois, D., & Goodz, N. (2001). Language differentiation in bilingual infants: Evidence from babbling. In Cenoz, J. & Genesee, F. (Eds.), Trends in language acquisition research (Vol. 1, pp. 95–106). John Benjamins Publishing Company. https://doi.org/10.1075/tilar.1.06pou
R Core Team. (2024). R: A language and environment for statistical computing (Version 4.3.2) [Computer software]. R Foundation for Statistical Computing. https://www.R-project.org/
Sampallo-Pedroza, R. M., Cardona-López, L. F., & Ramírez-Gómez, K. E. (2014). Description of oral-motor development from birth to six years of age. Revista de la Facultad de Medicina, 62(4), 593–604. https://doi.org/10.15446/revfacmed.v62n4.45211
Semenzin, C., Hamrick, L., Seidl, A., Kelleher, B. L., & Cristia, A. (2021). Describing vocalizations in young children: A big data approach through citizen science annotation. Journal of Speech, Language, and Hearing Research, 64(7), 2401–2416. https://doi.org/10.1044/2021_JSLHR-20-00661
Serdar, C. C., Cihan, M., Yücel, D., & Serdar, M. A. (2021). Sample size, power and effect size revisited: Simplified and practical approaches in pre-clinical, clinical and laboratory studies. Biochemia Medica, 31(1), 27–53. https://doi.org/10.11613/BM.2021.010502
Stringer, H. (2021). Phonological awareness: What comes before letters and sounds? Getting children ready for phonics [Handout]. Newcastle University. https://research.ncl.ac.uk/media/sites/researchwebsites/languageinterventionintheearlyyears/Lively_handout_HStringer.pdf
Sundara, M., Ward, N., Conboy, B., & Kuhl, P. K. (2020). Exposure to a second language in infancy alters speech production. Bilingualism: Language and Cognition, 23(5), 978–991. https://doi.org/10.1017/S1366728919000853
Vihman, M. M. (2014). Phonological development: The first two years. Wiley-Blackwell. https://doi.org/10.1002/9781394260997
Vihman, M. M. (2019). Phonological templates in development (Vol. 4). Oxford University Press. https://doi.org/10.1093/oso/9780198793564.001.0001
Vihman, M. M., DePaolis, R. A., & Davis, B. L. (1998). Is there a “trochaic bias” in early word learning? Evidence from infant production in English and French. Child Development, 69(4), 935–949. https://doi.org/10.2307/1132354
Vogt, P., Mastin, J. D., & Aussems, S. (2015). Early vocabulary development in rural and urban Mozambique. Child Development Research, 2015, 1–15. https://doi.org/10.1155/2015/189195
Wang, Y., Williams, R., Dilley, L., & Houston, D. M. (2020). A meta-analysis of the predictability of LENA™ automated measures for child language development. Developmental Review, 57, Article 100921. https://doi.org/10.1016/j.dr.2020.100921
Warlaumont, A. S., Richards, J. A., Gilkerson, J., & Oller, D. K. (2014). A social feedback loop for speech development and its reduction in autism. Psychological Science, 25(7), 1314–1324. https://doi.org/10.1177/0956797614531023
Wickham, H. (2016). ggplot2: Create elegant data visualizations using the grammar of graphics (Version 3.4.0) [Computer software]. Comprehensive R Archive Network (CRAN). https://cran.r-project.org/package=ggplot2; https://doi.org/10.1007/978-3-319-24277-4
Zhang, T., Suresh, M., Warlaumont, A. S., Hitczenko, K., Cristia, A., & Cychosz, M. (2025). Employing self-supervised learning models for cross-linguistic child speech maturity classification. Proceedings of Interspeech 2025, 2825–2829. https://doi.org/10.21437/Interspeech.2025-1946
Zheng, Z., Degotardi, S., Sweller, N., & Djonov, E. (2023). Effects of multilingualism on Australian infants’ language environments in early childhood education centers. Infant Behavior and Development, 70, Article 101799. https://doi.org/10.1016/j.infbeh.2022.101799

Table 1. Number of participants included in each corpus in previous studies and in the present dataset. The three papers largely build on each other. Hitczenko et al. (2023) included all the data in Cychosz et al. (2021) (except for USA–California). Our study includes all the data in Hitczenko et al. (2023) and more.

Table 2. Number of CP measures across age range, by corpus

Table 3. Summary of corpus characteristics

Figure 1. Canonical proportions by age and multilingual exposure (full sample). The regression line represents the fitted model, and the shaded bands surrounding the line represent the 95% confidence intervals. Each data point represents a single child, with point size representing the total number of vocalisations contributed by that child (larger points represent children who produced more vocalisations).

Figure 2. Canonical proportions by age and syllable complexity in monolingual children. The regression line represents the fitted model, and the shaded bands surrounding the line represent 95% confidence intervals. Each data point represents a single child, with point size indicating the total number of vocalisations contributed by that child (larger points represent children who produced more vocalisations). *Others represent French, Tseltal, and English.

Figure 3. Canonical proportions by age and community. The regression line represents the fitted model, and the shaded bands surrounding the line represent 95% confidence intervals. Each data point represents a single child (N = 115 non-industrialised; N = 33 industrialised), with point size indicating the total number of vocalisations contributed by that child (larger points represent children who produced more vocalisations).

Supplementary material: File

Tey et al. supplementary material
