Introduction
Reaching intelligible speech is an important milestone in children’s speech and language development. For children with a severe-to-profound hearing impairment who received a cochlear implant, becoming as intelligible as their normally hearing peers is an ultimate goal of their rehabilitation. Intelligibility is often viewed as a crucial benchmark because it “requires all core components of speech perception, cognitive processing, linguistic knowledge, and articulation to be mastered” (Freeman, Pisoni, Kronenberger, & Castellanos, 2017, p. 278). If a child is intelligible to the outside world, they can be considered to have acquired and developed these crucial components. As such, intelligibility is considered to be the most practical single index to apply in assessing competence in oral communication (Kent, Miolo, & Bloedel, 1994; Subtelny, 1977, p. 183). Consequently, measures of speech intelligibility are often applied as diagnostics for speech therapy. According to Gordon-Brannan and Hodson (2000), when one third of the continuous speech of a four-year-old cannot be transcribed correctly by others, this child is a candidate for speech therapy. Because of the general importance of intelligibility and because intelligibility scores can give an indication of whether or not speech therapy is advisable for particular children, speech intelligibility measures are considered “the gold standard for assessing the benefit of cochlear implantation” (Chin, Bergeson, & Phan, 2012, p. 356).
In the present study, intelligibility is conceptualised as the extent to which the elements (i.e., words) in an acoustic signal generated by a speaker can be correctly recovered by a listener (Freeman et al., 2017; van Heuven, 2008; Whitehill, & Ciocca, 2000). For instance, in a transcription task, intelligibility refers to the extent to which a transcriber can identify the words contained in an utterance. For typically developing children, speech is estimated to be intelligible for all listeners, including those not familiar with the child, around the age of four (Baudonck, Buekers, Gillebert, & Van Lierde, 2009; Bowen, 2011; Chin, & Tsai, 2001; Chin, Tsai, & Gao, 2003; Flipsen, 2006; Weiss, 1982). For instance, Flipsen (2006) compared the intelligibility of children’s conversational speech between the ages of 3;01 and 8;05 using different measures. He found that, irrespective of the specific measure used in the analysis, children were already highly intelligible between 4;0 and 5;0, with scores ranging from 88% to 100%. More recently, Hustad, Mahr, Natzke, and Rathouz (2020) studied the mean percentage of intelligible words in normally hearing (NH) children’s imitated speech between 2;06 and 3;11. They found a steady increase of the mean intelligibility of multiword utterances from 40% at 2;06 to 55% at 3;0, 66% at 3;06 and 78% at 3;11. This means that approximately three out of four words of a four-year-old can be identified by an adult listener not familiar with the child. Thus, the literature on NH children shows that their intelligibility increases with chronological age. Older children tend to be more intelligible than younger ones.
However, this does not mean that even 10-year-olds are fully intelligible (Grandon, Martinez, Samson, & Vilain, 2020).
In the current study, the speech intelligibility of children with a cochlear implant (CI) is investigated in comparison with that of peers with normal hearing. A CI partially restores hearing in severe-to-profound sensorineural hearing loss. Even though the signal provided by a CI is still degraded compared to the signal in normal hearing (Drennan, & Rubinstein, 2008), the device enables children with severe-to-profound hearing impairment to perceive speech and other environmental sounds. After cochlear implantation, children’s speech perception has been shown to improve considerably and as a result cochlear implantation is also beneficial for speech and language production (O’Donoghue, 2013; Wie, von Koss Torkildsen, Schauber, Busch, & Litovsky, 2020). Research has shown that children with CI can attain spoken language skills similar to those of their normally hearing peers after three to four years of device use (i.a., Bruijnzeel, Ziylan, Stegeman, Topsakal, & Grolman, 2016; Dettman, Dowell, Choo, Arnott, Abrahams, Davis, Dornan, Leigh, Constantinescu, Cowan, & Briggs, 2016; Geers, & Nicholas, 2013; Wie et al., 2020). However, the population of children with CI is characterized by remarkable variation. On the one hand, variation relates to differences between individual children: while a considerable number of children with CI appear to catch up with their NH peers, some do not catch up at all (Duchesne, & Marschark, 2019; Geers, Nicholas, Tobey, & Davidson, 2016; Nicholas, & Geers, 2007).
On the other hand, variation also relates to differences between domains: some areas of speech and language appear to be more difficult to master than others (Duchesne, & Marschark, 2019). For instance, Faes, Gillis, & Gillis (2015) showed that in a group of children with CI acquiring Dutch, inflectional morphology and sentence length (as a proxy of syntagmatic development) were age-appropriate when the children were 7;0, but the former (and not the latter) was already age-appropriate at age 5;0. Moreover, the phonetics of the same children’s production of vowels was still significantly different from the vowels of their NH peers at the age of 7;0 (Verhoeven, Hide, De Maeyer, Gillis, & Gillis, 2016). Thus, although children with CI start with an initial delay in spoken language, a sizeable group eventually reaches age-appropriate levels of linguistic functioning. But the individual variation is also quite large: while some children do catch up with their normally hearing peers, others do not achieve much language comprehension and production even after five years of device use (Barnard, Fisher, Johnson, Eisenberg, Wang, Quittner, Carson, & Niparko, 2015).
As to intelligibility, most studies found that CI children’s speech intelligibility is less well developed than that of their NH peers (i.a., Castellanos, Kronenberger, Beer, Henning, Colson, & Pisoni, 2014; Chin, & Kuhns, 2014; Freeman et al., 2017; Grandon et al., 2020). For instance, Freeman et al. (2017) compared the intelligibility of 24 children with CI, mean age 4;02, with on average almost three years of device use, with 30 NH age-matched peers. On the BIT test (Osberger, Robbins, Todd, & Riley, 1994), in which children are asked to imitate short utterances, the children with CI reached an intelligibility score of 51% (range 0.8%-95.5%) and the children with NH a score of 84% (range 52.1%-99.3%). On a retest one year later, both groups’ intelligibility scores had increased, to 67.7% (range 6.1%-98%) for the children with CI and 90.4% (range 78.9%-95.6%) for the children with NH. Even at the age of 9;05 and with on average seven to eight years of device use, the intelligibility of children with CI remains significantly lower than that of children with NH (Chin, & Kuhns, 2014). Thus it can safely be concluded that, in general, children with CI are less intelligible than their NH peers, and that there is more individual variation in the intelligibility of children with CI than in NH children.
What causes the variation in the speech and language development and the intelligibility of children with CI, over and above the variation that can be expected in children with NH? This issue is still high on the research agenda (i.a., Bavin, Sarant, Leigh, Prendergast, Busby, & Peterson, 2018; Duchesne, & Marschark, 2019; Houston, Beer, Bergeson, Chin, Pisoni, & Miyamoto, 2012). Many factors have been shown to contribute to the success of spoken language development of children with CI, including: (1) audiology related factors, such as the age at implantation, the duration of device use, bilateral (or contralateral) cochlear implantation and the children’s preoperative and postoperative hearing levels; (2) child related factors, such as the cause of the hearing impairment (genetic, infections), gender, and additional disabilities (mental retardation, speech motor problems); and (3) environmental factors, such as communication modality. An overview is provided in Boons, Brokx, Dhooge, Frijns, Peeraer, Vermeulen, Wouters, and van Wieringen (2012), Fagan, Eisenberg, and Johnson (2020), Gillis (2018) and Niparko, Tobey, Thal, Eisenberg, Wang, Quittner, and Fink (2010). A factor of particular importance here is age. Studies have shown that chronological age is an important factor for intelligibility: as they grow older, children’s intelligibility increases irrespective of their hearing status (Grandon et al., 2020).
But in the case of children with CI, age is a complicated factor, since it can refer not only to children’s chronological age (as is the case for children with NH), but also to the children’s so-called hearing age, which is the time elapsed since the activation of their device. For instance, a child implanted at the age of 1;0 has a hearing age of two years at the age of 3;0. In addition, the age at implantation has been shown to play a critical role in children’s spoken language achievements. In general, earlier implantation appears to lead to better results than later implantation in several domains (Boons et al., 2012; Niparko et al., 2010). But the research findings with respect to the effect of the variable age on the intelligibility of children with CI are not unequivocal. In some studies, a significant effect of chronological age on children’s intelligibility was found (i.a., Flipsen, & Colvard, 2006; Grandon et al., 2020; Habib, Waltzman, Tajudeen, & Svirsky, 2010) but not in others (e.g., Khwaileh, & Flipsen, 2010). Hearing age was found to be a significant predictor of intelligibility by, among others, Flipsen and Colvard (2006), but hearing age was not always considered as a predictor.
Age at implantation predicted children’s intelligibility in a considerable number of studies (i.a., Grandon et al., 2020; Habib et al., 2010; Montag, AuBuchon, Pisoni, & Kronenberger, 2014; Svirsky, Chin, & Jester, 2007) but this was not the case in other studies (i.a., Flipsen, & Colvard, 2006; Khwaileh, & Flipsen, 2010). Nevertheless, a general finding appears to be that earlier implantation leads to better results in speech and language development and in intelligibility. At present there is consistent evidence that implantation in the first two years of life leads to better results in spoken language development than later implantation, and there is some (inconclusive) evidence for even better outcomes of implantation in the first year of life (Bruijnzeel et al., 2016; Dettman et al., 2016).
In the present study, the intelligibility of congenitally hearing-impaired children with a cochlear implant was assessed in comparison with that of normally hearing seven-year-old peers. The children were implanted on average around their first birthday, and all demographic variables were held constant as far as possible (see Method section).
Measuring intelligibility
In the studies reviewed so far, children’s speech intelligibility was assessed in many different ways. The methods can be situated relative to two dimensions: (1) the amount of control that the investigator exerts on the material that is collected and analyzed; and (2) the analytic versus holistic nature of the assessment method, or objective ratings versus subjective ratings (Hustad et al., 2020). With respect to the first dimension, the vast majority of studies used read or imitated speech (i.a., Castellanos et al., 2014; Chin et al., 2012; Chin, & Kuhns, 2014; Freeman et al., 2017; Khwaileh, & Flipsen, 2010; Montag et al., 2014). Using imitated or read aloud speech has several advantages over spontaneously produced speech. For instance, an examiner has a large amount of control over the stimuli so that it is easy to compare a target word or utterance with the child’s production. This makes it straightforward to quantify the overlap between the child’s rendition and the target. This controlled approach can be useful for speech and language pathologists who use the results of the intelligibility test as a starting point for their child-specific speech therapy (Flipsen, 2006). However, read or imitated speech has been suggested to be a “rather poor predictor[s] of scores for connected speech and everyday performance with hearing aids” (Cox, & McDaniel, 1989, p. 347), especially for clinical populations such as hearing-impaired children (Ertmer, 2010).
Spontaneous speech is an alternative to read or imitated speech in assessing speech intelligibility. The most important advantage of spontaneous speech is its greater ecological validity. In other words, spontaneous speech is more comparable to everyday informal speech. Despite this major advantage, only a few studies use spontaneous speech for assessing children’s speech intelligibility (i.a., De Raeve, 2010; Lejeune, & Demanez, 2006; Tye-Murray, Spencer, & Woodworth, 1995; Van Lierde, Vinck, Baudonck, De Vel, & Dhooge, 2005). This may be due to the lack of control over the speech sample: whereas in read or imitated speech, the investigator or the clinician decides on the words or utterances that the child is asked to read or imitate, this control is far less in spontaneous speech because the child decides what to say. Hence, in computing the degree of intelligibility, a straightforward measure such as the number or percentage of words read or imitated correctly cannot be relied on, since there is no predetermined set of words or sentences to be produced. This calls for a measure that does not rely on checking if what the child produced equals what the child was supposed to produce. In the present paper such a method will be proposed.
As to the second dimension, measures of intelligibility can be categorized as “subjective ratings” versus “objective ratings” (Hustad et al., 2020). Subjective ratings use a continuous or an ordinal rating scale on which a holistic, personal perception of a speaker’s intelligibility is represented. Probably the most frequently used rating scale is the Speech Intelligibility Rating (SIR) developed by Cox and McDaniel (1989) (i.a., Calmels, Saliba, Wanna, Cochard, Fillaux, Deguine, & Fraysse, 2004; De Raeve, 2010; Flipsen, 2008; Lejeune, & Demanez, 2006; Toe, & Paatsch, 2013). The SIR requires that participants score a child’s speech on a five-point scale with a verbal description for each score, ranging from unintelligible speech even for an adult familiar with the child to completely intelligible for all listeners. Rating scales such as the SIR offer a valid indication of the children’s speech intelligibility (AlSanosi, & Hassan, 2014; Fang, Ko, Wang, Fang, Chao, Tsou, & Wu, 2014; Flipsen, 2008), especially for assessing the intelligibility of very young children or children with CI implanted at a relatively late age, e.g., late kindergarten (Baudonck, Dhooge, & Van Lierde, 2010; De Raeve, 2010; Toe, & Paatsch, 2013). The reason is that children soon reach ceiling scores on the SIR. For instance, De Raeve (2010) investigated the intelligibility of children implanted before 18 months of age, and found that three years after implantation, the 50th percentile of the group of 45 children scored at the highest level of the SIR.
This ceiling score indicates that, according to the SIR, their speech is intelligible to all listeners. However, it is not clear, for instance, whether intelligibility for all listeners pertains to all of the children’s speech or only to a limited or particular portion. In other words: children may be considered to be very intelligible according to rating scales but there may still be unintelligible parts in their speech (Miller, 2013). Or children may be rated as “completely intelligible” on the SIR rating scale, but one child may still be more intelligible than another, a difference that cannot be captured using the SIR. In this respect, a continuous rating scale, such as the one used in the present study, may offer a more diversified picture of children’s intelligibility.
“Objective ratings” or analytic ratings take a different approach towards measuring speech intelligibility. Typically, listeners phonetically or orthographically transcribe children’s speech. In the case of imitated or read aloud speech, calculating intelligibility then amounts to applying some measure of overlap between the intended targets and the transcription of the listener(s). For spontaneous speech, however, calculating the intelligibility score based on a transcription is not straightforward because a clear target is missing (Flipsen, 2006; Flipsen, & Colvard, 2006; Lagerberg, Asberg, Hartelius, & Persson, 2014). Alternative methods have been proposed that rely on the number of (un)intelligible syllables or words, but these are not unproblematic either (Flipsen, 2006; Lagerberg et al., 2014; Strömbergsson, Holm, Edlund, Lagerberg, & McAllister, 2020).
Since transcriptions of spontaneous speech are difficult to judge in terms of correct or incorrect, the method explored in the present study abandons this dichotomous choice and instead makes use of multiple transcriptions. The intelligibility of the speech material is quantified relative to the entropy of the transcriptions. Entropy was originally developed in information theory (Shannon, 1948) as a measure that expresses the degree of disorder (“chaos”) in data. In linguistic research, entropy measurements were already used for investigating the mutual intelligibility of two closely related languages such as Swedish and Danish (Frinsel, Kingma, Swarte, & Gooskens, 2015; Moberg, Gooskens, Nerbonne, & Vaillette, 2007). In the present context, the assumption is that if a child is highly intelligible, the transcriptions of several listeners will show much uniformity, the degree of disorder or chaos will be low, and, hence, the entropy will be low. Alternatively, if the child’s speech exhibits lower intelligibility, the transcriptions will be less uniform, more chaotic, and will thus have a higher entropy score.
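The relation between uniformity and entropy described above can be illustrated with a minimal sketch. It assumes plain Shannon entropy (in bits) over the variants that listeners wrote for a single word position; the function name and the example transcriptions are hypothetical, and the exact computation used in the study is not reproduced here.

```python
from collections import Counter
from math import log2


def shannon_entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    # Sum p * log2(1/p) over all distinct variants.
    return sum((n / total) * log2(total / n) for n in counts.values())


# Hypothetical transcriptions of one word position by five listeners.
uniform = ["kikker", "kikker", "kikker", "kikker", "kikker"]
mixed = ["kikker", "kikker", "kikkers", "X", "ridder"]

print(shannon_entropy(uniform))  # all listeners agree: entropy is 0.0
print(shannon_entropy(mixed))    # listeners disagree: entropy is higher
```

When all listeners agree, the entropy is zero; the more variants there are and the more evenly they are spread over listeners, the higher the score.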
Aims of this study
The aim of the present study was to investigate the intelligibility of the spontaneous speech of primary-school-aged children with NH and children with CI. The children were all approximately seven years old, and the children with CI received their device on average at 1;0, and at the time of testing had at least five years of device experience. The entropy of multiple transcriptions of the children’s utterances was used as an index of their intelligibility. It was expected that the children with CI would produce speech which was at best as intelligible as the speech of their NH peers. However, given the fact that the NH children had at least one more year of hearing experience, this could result in a lasting advantage for the NH children’s intelligibility. A second expectation related to the extent of variability in the two groups of children. Following the reported trends in the CI literature, it was expected that the entropy scores would show greater variability between subjects with CI than between subjects with NH (Castellanos et al., 2014; Freeman et al., 2017; Montag et al., 2014; Nittrouer, Caldwell-Tarr, Moberly, & Lowenstein, 2014; Peng, Spencer, & Tomblin, 2004; Yanbay, Hickson, Scarinci, Constantinescu, & Dettman, 2014; Young, & Killen, 2002). Therefore, the analysis will proceed in two steps. First of all, the intelligibility of children with CI and NH will be compared at a group level. Secondly, the individual variation between the children will be investigated, and the specific demographic variables pertaining to the children with CI will be examined.
A secondary aim of the present study was to examine the relation between the entropy scores obtained from the transcription task and the scores obtained from the holistic judgements on a continuous rating scale. It was assumed that the entropy of the transcriptions was an index of the intelligibility of the children. If this assumption was correct, the entropy scores derived from a comparison of different transcriptions were expected to show some degree of correlation with other measures of speech intelligibility, such as the score on a rating scale. In other words, we expected a correlation between the entropy scores resulting from the “objective” measurement of entropy, and the “subjective” measure of raters’ judgements of intelligibility.
Method
The aim was to assess the intelligibility of the spontaneous speech of children with CI and children with NH. An experiment was set up using samples of the children’s spontaneous speech. The participating children and the method of collecting and selecting appropriate stimuli for the experiment will be described first. In the experiment, the children’s speech was transcribed by a group of listeners and the same samples were rated by another group of listeners. The participants in the two tasks and the experimental procedure will be described. Finally, the processing of the data resulting from the two experimental tasks and the statistical analyses will be elaborated on.
Stimuli: participating children
In this study, spontaneous speech samples of NH children and children with CI were judged. The parents of the NH and the CI group belonged to the mid-to-high SES stratum as estimated by the Hollingshead Index (Hollingshead, 1975), and were native speakers of Belgian Dutch living in Flanders. The control group consisted of sixteen children with NH (ten girls, six boys), native speakers of Belgian Dutch. They were enrolled in the mainstream education system and had no reported hearing loss or additional disabilities as could be judged from the outcome of the Universal Neonatal Hearing Screening (UNHS) and parental report. At the time of the recording, these children were on average 7;2 years old (SD = 0;7). Their chronological age was comparable to that of the children with CI (Wilcoxon rank sum test: z = –0.11382, p = 0.9094).
Sixteen children with CI (ten boys, six girls) participated in this study. They were all native speakers of Belgian Dutch, living in Flanders, the Dutch speaking area of Belgium. Their parents were native speakers of Dutch with no self-reported hearing impairment, raising their children orally (monolingual Dutch) with a limited support of signs. The children’s hearing impairment was established by the Universal Neonatal Hearing Screening (UNHS) using automated Auditory Brainstem Response hearing tests for newborns, which was administered as a standard procedure in the first weeks of life in Flanders. After the identification of their hearing loss, the children were referred to a specialized audiological centre for further audiological workup. They received acoustic hearing aids and their progress was further monitored. Since their auditory progress was deemed insufficient, they were enrolled as candidates for cochlear implantation. CI candidacy included bilateral hearing loss of at least 85 dB HL (up to 2019). All children were implanted before the age of two (mean = 1;0 (years;months), SD = 0;5). Eleven children underwent sequential bilateral implantation, and two were simultaneously implanted bilaterally. At the time of the recording, the children were between six and eight years old (mean = 7;02, SD = 0;09), and had a minimum of five years of device use, with an average of 6;02 (SD = 0;10). Prior to implantation, their average pure tone average (PTA) was 114 dB HL (SD = 9 dB HL). Their average aided hearing threshold was 29 dB HL (SD = 9 dB HL). Detailed information on the individual children is provided in Table 1. Their medical records and the treating audiological centre did not mention any additional health or developmental issues. Hence, there were no known comorbidities besides their hearing impairment. At the time of the recording, all the children were enrolled in the mainstream education system.
Stimuli: Recording and selection
Audio recordings were made of the children in a quiet room in the comfort of their home or school. The children were asked to tell a story cued by the picture book “Frog, where are you?” (Mayer, 1969). Before starting the recordings, the children were allowed to flip through the booklet and look at the pictures. Next, they were asked to tell the story to the researcher and/or caregiver who “did not know the story”. The children were prompted to tell the story independently, but if needed the caregiver or the researcher encouraged and helped the child.
The recordings were orthographically transcribed with the CLAN editor in CHAT format (MacWhinney, 2000). The transcriptions were only used in the selection process of the stimuli for the experiment. In the first step, all the utterances of approximately seven words were selected (e.g., Dutch: “De jongen is bang van de uil”, English: “The boy is afraid of the owl”). Then, the corresponding audio fragments were checked. Fragments with background noise, crosstalk and the like were not retained. In addition, utterances with long hesitations, revisions or non-words were also excluded, as well as syntactically ill-formed or incomplete sentences. Finally, a selection of ten utterances was randomly made for each child with NH and each child with CI, resulting in a total of 320 stimuli for the experiment.
The 320 stimuli were divided into five series of 64 utterances. Each series contained two utterances of each CI and NH child, which were randomly selected (without replacement) from the final selection of 10 utterances per child. These five series of 64 utterances were entered into the online tool Qualtrics (Qualtrics, 2005).
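The construction of the series described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the procedure actually used: the function `build_series`, the child and utterance labels, and the fixed random seed are all hypothetical.

```python
import random


def build_series(utterances_by_child, n_series=5, per_child=2, seed=0):
    """Distribute each child's utterances over the series without replacement."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    series = [[] for _ in range(n_series)]
    for child, utterances in utterances_by_child.items():
        if len(utterances) != n_series * per_child:
            raise ValueError(f"{child}: expected {n_series * per_child} utterances")
        # Shuffle a copy, then hand consecutive pairs to consecutive series.
        shuffled = rng.sample(utterances, len(utterances))
        for i in range(n_series):
            chunk = shuffled[i * per_child:(i + 1) * per_child]
            series[i].extend((child, utterance) for utterance in chunk)
    return series


# Hypothetical labels: 32 children (16 CI, 16 NH) with 10 utterances each.
children = {
    f"child_{c:02d}": [f"utt_{c:02d}_{k}" for k in range(10)] for c in range(32)
}
series = build_series(children)
print([len(s) for s in series])  # five series of 64 stimuli each
```

Because every child's ten utterances are shuffled once and then split into five pairs, each utterance ends up in exactly one series and each series contains two utterances per child.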
Procedure
The experiment consisted of two tasks: a transcription task and a rating task. Two different and non-overlapping groups of participants were recruited for the tasks in which the same series of stimuli were used.
Transcription task
One hundred language students at the University of Antwerp participated in the transcription study. They were native speakers of Belgian Dutch without self-reported hearing problems and without any particular experience with the speech of hearing-impaired children. They were on average 23 years old (SD = 5). The experiment was performed on campus in a computer lab. The students sat in front of a computer screen with headphones which they could set at a comfortable level. The participants were divided into five groups. Each group of 20 students was assigned one of the five Qualtrics series and transcribed all 64 stimuli of that series, resulting in 20 transcriptions of each utterance. Each stimulus could be repeated only three times.
Prior to the actual experiment the participants were instructed on how to transcribe. Examples were given in order to ensure that the instructions were correctly understood. More precisely, the listeners were instructed to use only existing Dutch words in standard orthography and to represent the utterances as accurately as possible. This implied that they should not correct the linguistic errors which are typical for children’s speech, such as errors in grammatical gender, erroneous verb conjugations, etc. For unintelligible speech, the symbol ‘X’ was the agreed upon transcription symbol. In other words, the listeners were instructed to write one X to replace an unintelligible word, an unintelligible part of an utterance or a completely unintelligible utterance.
Rating task
One hundred and fifty students enrolled in the applied linguistics program at the University of Antwerp participated in the rating task. They were all native speakers of Belgian Dutch without self-reported hearing problems and without any particular experience with the speech of hearing-impaired children. They were an average of 20 years old (SD = 4). The students completed the rating task at home on their own computer. They were instructed to use headphones to complete the task but received minimal further instructions. On entering the online tool Qualtrics, they saw the instruction: “Duid aan door te klikken of te slepen hoe verstaanbaar deze zin was op een schaal van ‘zeer onverstaanbaar’ tot ‘zeer verstaanbaar’” (Eng.: indicate by clicking or dragging the slider how intelligible the sentence is on a scale from “fully unintelligible” to “fully intelligible”). Underneath that instruction the slider represented in Figure 1 was shown together with a play button and a proceed button. Each stimulus could be repeated only three times. The initial position of the slider was always at the far left of the scale, and only the middle point of the scale was indicated by three vertical dashes.
The experiment was presented to the students as a listening exercise, and they had to use their listening experience to write a short essay on the topic “what is intelligible speech?” as part of their course credit.
Data analysis
Transcription task
Processing the data of the transcription task proceeded in two steps: (1) aligning the transcriptions of the participants of each sentence and (2) computing the entropy of the aligned transcriptions.
Transcription task: Alignment of the transcriptions
The transcriptions of the participants were aligned at the word level. This procedure was repeated for each stimulus separately. As an example, five transcriptions of the same stimulus are provided in Table 2, together with a literal English translation. It can readily be seen in Table 2 that the first transcription (the row indicated by Transcription participant 1) contains five words: “de jongen ziet de kikker”. The transcription of the second transcriber contains only four words and the transcriber used the symbol X to indicate that the last word was unintelligible. Thus, aligning the transcriptions amounts to the following: the transcribers wrote in a free text field (in Qualtrics) and the 20 transcriptions of each utterance needed to end up in a column-like grid structure as in Table 2.
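[Table 2: aligned transcriptions of one stimulus by five participants, with a literal English translation (mean entropy score = 0.5588)]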
A first version of the alignment was automatically produced by a Python script, the output of which was manually checked and adjusted – if needed – in order to maximally align words appropriately. The principal task of the script was to find (nearly) matching words in the orthographic transcriptions and align them (see e.g., the five instances of de ‘the’ in the column Word1 of Table 2 or the four instances of jongen ‘boy’ in the next column of Table 2). If there was no exact match of the words (e.g., hond ‘dog’ in the transcription of participant 5 in Table 2), the alignment took into account the length of the transcriptions and a word’s position. If the transcription length matched, (non-identical) words were aligned if they were on the same position (e.g., the word jongen ‘boy’ in the transcription of the first four participants was aligned with the word hond ‘dog’ of the fifth participant). If the transcription length did not match, the script looked further along the utterance and left blank spaces (indicated as “-----” in the transcription of participant 3 in Table 2) until finding (nearly) matching words (see kokkin ‘cook’ in the transcription of participant 3 in Table 2 which nearly matches kikker ‘frog’ and kikkers ‘frogs’ of participant 1 and 4).
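The alignment logic described above can be approximated in a few lines. The sketch below is not the script used in the study: the function names, the similarity threshold of 0.5, the use of `difflib.SequenceMatcher` for "nearly matching" words and the choice of the longest transcription as the reference are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

GAP = "-----"  # blank position, as in the example table

def similar(a, b, threshold=0.5):
    """Treat two words as (nearly) matching if their similarity ratio is high enough."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def align_to_reference(words, reference):
    """Align one transcription to a reference that is at least as long."""
    if len(words) == len(reference):
        return list(words)  # equal length: align position by position
    aligned = [GAP] * len(reference)
    pos = 0
    for k, word in enumerate(words):
        # Look ahead for a (near) match, leaving room for the remaining words.
        last_start = len(reference) - (len(words) - k)
        match = next(
            (j for j in range(pos, last_start + 1) if similar(word, reference[j])),
            pos,  # no match found: fall back to the next free position
        )
        aligned[match] = word
        pos = match + 1
    return aligned

reference = "de jongen ziet de kikker".split()
shorter = align_to_reference("de jongen de kokkin".split(), reference)
same_length = align_to_reference("de hond ziet de kikker".split(), reference)
```

With these assumptions the four-word transcription is placed as de, jongen, a gap, de, kokkin, while the transcription of equal length is kept position by position. Output of this kind would still need the manual check described above.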
Transcription task: Computing entropy
Given the aligned transcriptions, their relative entropy was calculated using Equation 1. This formula is based on Shannon’s original formula of entropy divided by the maximum entropy (Shannon, 1948). In this study, the entropy calculations were performed at the word level (as is visualised by the different columns in Table 2). If all transcriptions of the individual listeners contained exactly the same words, an entropy score of 0 was obtained (as is the case in the column Word1 containing de in Table 2). When all entries were different, such as in the last column of Table 2, the relative entropy score was 1. Thus, if all transcriptions are the same, the entropy score is low, which indicates high intelligibility. If the listeners’ transcriptions do not agree, the entropy score is higher, and if there is no agreement at all between the transcribers, entropy equals 1.
$$H_{rel} = \frac{-\sum_{i=1}^{n} p_i \log_2 p_i}{\log_2 N} \tag{1}$$
where $p_i$ = the probability of each word’s occurrence; $n$ = the total number of occurrences and $N$ = the number of participants
Three aspects influence the word entropy score: the degree of variance between the transcriptions (i.e., the number of different words in a column), the number of blank spaces and the number of Xs. If listeners identified different words in a particular position in the utterance, this leads to a higher entropy score. When the number of alternative transcriptions increases, the entropy increases by definition, due to the nature of the computation of entropy according to the equation in (1). Blank spaces (if they occurred more than once) on the other hand indicate that the transcriptions agree on the absence of a word in a particular position. Thus, those listeners agreed on the absence of a particular word, while some other listener(s) identified a particular word at that position in the utterance.
Xs are a different matter. X indicates that the listener is not able to identify a particular existing Dutch word. If several Xs were aligned, this meant that the listeners agreed that at that position in the utterance, an unidentifiable word occurred. So the agreement between the listeners pertained to the unidentifiability, and hence, unintelligibility of the word uttered by the child. But the agreement did not relate to the identity of the word uttered by the child. For instance, the first column of Table 2 contains the same word five times, hence the transcribers identified the same word and the entropy equals zero. If that column had contained the symbol X five times, the same entropy would result, indicating the same degree of agreement between the transcribers. Nevertheless, in the first case the agreement pertains to a specific lexical item that was transcribed identically by all transcribers, while in the second case, the agreement pertains only to the fact that the transcribers were not able to identify a particular word. In order not to inflate the agreement between listeners in the case of Xs, all Xs aligned in a particular column in a data table such as Table 2 were recoded as unique entries (e.g., X1, X2, …).
After calculating the relative entropy score for each column, these scores were averaged per utterance resulting in the final utterance entropy score. This numerical utterance entropy score was used as the dependent variable in the statistical analyses.
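The computation of the utterance entropy score can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the function names are hypothetical, blank positions are treated as an ordinary symbol, and the small grid is invented data loosely modelled on the example, so it does not reproduce Table 2 or its score.

```python
import math
from collections import Counter

def relative_entropy(column):
    """Shannon entropy of one aligned column divided by the maximum entropy log2(N)."""
    # Recode every X as a unique entry so that shared Xs do not count as agreement.
    recoded = [f"X{i}" if word == "X" else word for i, word in enumerate(column)]
    total = len(recoded)  # N, the number of transcribers
    counts = Counter(recoded)
    entropy = sum((c / total) * math.log2(total / c) for c in counts.values())
    return entropy / math.log2(total)

def utterance_entropy(grid):
    """Mean of the column-wise relative entropies of an aligned transcription grid."""
    columns = list(zip(*grid))
    return sum(relative_entropy(col) for col in columns) / len(columns)

# Hypothetical aligned grid: one row per transcriber, one column per word position.
grid = [
    "de jongen ziet de kikker".split(),
    "de jongen ziet de X".split(),
    "de jongen ----- de kokkin".split(),
    "de jongen ziet de kikkers".split(),
    "de hond ziet de vis".split(),
]
column_scores = [relative_entropy(col) for col in zip(*grid)]
score = utterance_entropy(grid)
```

In this grid the columns on which all transcribers agree score 0, the last column, in which every entry differs, scores 1, and the utterance score is the mean over the five columns.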
Rating task
In the rating task the listeners indicated the relative intelligibility of each utterance on a scale from “fully unintelligible” to “fully intelligible”. The position on the rating scale was transformed automatically by the Qualtrics software into a natural number between 0 and 100. These scores were standardized (converted into z-scores) in order to take into account the idiosyncratic differences between individual participants’ rating behaviors. The resulting z-scores were entered into the statistical analyses as the dependent variable.
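The standardisation step can be illustrated with a short sketch. It assumes that the z-scores were computed within each rater and with the sample standard deviation; neither detail is stated in the text, and the ratings below are invented.

```python
from statistics import mean, stdev

def standardise(ratings):
    """Convert one rater's raw 0-100 ratings into z-scores (assumed: per rater)."""
    m, s = mean(ratings), stdev(ratings)
    return [(r - m) / s for r in ratings]

# Two hypothetical raters who use different parts of the scale.
strict_rater = standardise([20, 40, 60])
lenient_rater = standardise([60, 70, 80])
```

Although the two hypothetical raters use different parts of the scale, their standardised scores coincide, which is the kind of idiosyncratic difference the conversion is meant to remove.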
Statistical analyses
The statistical analyses were performed in JMP® Pro 15.2. More specifically, multilevel models (MLM) were applied. This type of statistical approach is especially suited for hierarchically structured data. For this study, hierarchy meant that the utterances originated from individual children, who were in turn nested in a hearing status (NH or CI). Building the best fitting MLM model is an iterative process in which random effects and fixed effects are successively entered into a model. After adding an effect, a likelihood ratio test was used to assess whether the addition of that factor led to a significantly better fit of the model. If that was the case, that effect was left in the model, otherwise it was removed. Only the best fitting model is discussed in the results section and included in the tables.
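The model comparison step can be illustrated for the simplest case, in which one fixed effect is added. The sketch below is not the JMP procedure: the function name and the log-likelihood values are hypothetical, and it assumes that both models were fitted by maximum likelihood and differ by exactly one parameter, so that the test statistic is referred to a chi-square distribution with one degree of freedom.

```python
import math

def likelihood_ratio_test_1df(loglik_reduced, loglik_full):
    """Likelihood ratio test for two nested models differing by one parameter."""
    statistic = 2.0 * (loglik_full - loglik_reduced)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(max(statistic, 0.0) / 2.0))
    return statistic, p_value

# Hypothetical log-likelihoods of a model without and with the added effect.
statistic, p_value = likelihood_ratio_test_1df(-100.0, -97.0)
keep_effect = p_value < 0.05
```

With these invented values the added effect would be retained; an effect that does not raise the log-likelihood enough would be removed again.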
In this study, the main factor of interest was Hearing status (with values CI and NH) and also a (linear or quadratic) effect of chronological age was controlled for. The quadratic effect was included in order to test whether the children’s intelligibility scores reached a plateau. If the quadratic effect did not lead to a better fit of the model, it was discarded and not reported in the tables with statistical results. Since the utterances were divided into five series, the factor Series was consistently entered as the first fixed effect. However, adding this factor never led to a better fitting model. Therefore, it was left out of the analyses. The same holds for the factor Gender. In the analyses in which the CI group was considered separately, fixed effects pertaining only to that group were tested. These factors are the ones referred to in Table 1, viz. Age at Implantation, Hearing Age, Etiology (the cause of the hearing impairment: Genetic, CMV infection, Unknown), Bilateral versus Unilateral CI, Aided and Unaided PTA. If adding these fixed effects did not result in a significantly better model fit, their estimates will not be reported in the tables in the results section. In all analyses, results were considered significant when p < 0.05.
In the second part of the results section, individual entropy scores were estimated from the null model, i.e., a model with only the random effect of the individual children without any predicting variables. In this way, the deviation of each child from the intercept was computed and represented in a boxplot.
Results
Intelligibility scores for children with CI and NH: entropy
The main question of this study is whether the intelligibility of normally hearing (NH) children’s spontaneous speech differs from that of children with a cochlear implant (CI). In this analysis, the intelligibility is represented by entropy scores.
In the first instance, the observed values are inspected. The distribution of all the observed entropy scores has a mean of 0.18 (SD=0.19). The mean entropy score of one child (0.72) falls outside the range determined by the interquartile rule. Hence this child, referred to as CI2 in Table 1, can be considered as an outlier and will not be further considered in the statistical analyses. Inspection of the observed values reveals that the entropy of the transcriptions of the children with NH is considerably lower than the entropy of the children with CI. For the NH children: mean = 0.13, SD = 0.12, 95% CI: 0.11 – 0.15, median = 0.09, IQR = 0.18. For the CI children: mean = 0.21, SD = 0.19, 95% CI: 0.18-0.24, median = 0.15, IQR = 0.24. This suggests that the transcriptions of the NH children are considerably more uniform than those of the CI children. Moreover, the variation between the CI children is much larger than that between the children with NH, judging from the standard deviation and the interquartile range of the distributions of their entropy scores.
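The interquartile rule referred to above can be sketched as follows. The multiplier of 1.5 and the quartile method are conventional choices assumed here, since the text does not specify them, and the per-child means are invented apart from the value 0.72.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical mean entropy scores per child; only 0.72 is taken from the text.
child_means = [0.06, 0.10, 0.12, 0.15, 0.17, 0.20, 0.25, 0.37, 0.72]
outliers = iqr_outliers(child_means)
```

In this invented sample only the mean of 0.72 falls outside the fences.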
The best fitting model for the data is reported in Table 3 and consists of the fixed effects Hearing status (NH versus CI) and Chronological age (centred at 85 months). Moreover, including the individual children as a random effect improves the model significantly (p = 0.013). The model shows that the entropy scores of children with NH and CI differ significantly (p = 0.006). More specifically, the entropy score of children with CI is significantly higher than that of children with NH, meaning that the transcriptions of the samples of children with CI show less agreement between the listeners. The estimated entropy score for NH children at intercept is 0.18 and for children with CI 0.22. Considering that the entropy scores can range from 0 to 1, it appears that NH as well as CI children show relatively low entropy scores, suggesting relatively high intelligibility of both groups of children.
Furthermore, the best fitting model shows a significant linear effect of chronological age (p = 0.001). This means that an increase in chronological age leads to a significant decrease of the entropy score (as visualised in Figure 2), suggesting that older children reach lower entropy scores (and thus higher intelligibility) than the younger children in the sample. Thus, speech intelligibility improvements seem to continue into advanced childhood (primary school age). The quadratic effect of chronological age was not significant and did not lead to a better fitting model, implying that no floor effect was estimated. An interaction between the factors Hearing status and Chronological age did not lead to a better fitting model and, hence, is not reported in Table 3. The lack of an interaction effect between hearing status and chronological age suggests that the change in entropy score relative to chronological age is comparable for children with NH and CI (p > 0.05).
In order to assess the development of children with CI and to investigate the effect of the demographic variables specific to this group, a separate model was constructed. As in the previous analysis, the best fitting model contained the variable Chronological Age. Adding other variables to the model, including Hearing Age, Age at Implantation, Gender, Etiology, (Un)aided PTA, or Bilateral versus Unilateral implants, did not improve the model fit, and hence these variables did not explain a significant portion of the variance.
Intelligibility scores for children with CI and NH: Rating scales
In an analysis of the distribution of the scores on the rating scale, child CI2 shows a discrepant score, viz. a mean score of 20.5 (SD = 10.5) on a scale from 0 to 100, whereas for the entire group of children the mean score is 62.1 (SD = 20.5). This means that this child can be considered an outlier and was discarded from the statistical analyses. The observed (not standardized) values of the ratings of the intelligibility of the children with NH are considerably higher than those of the children with CI. For the children with NH: mean = 69.45 (SD = 17.03), 95% CI 66.79-72.11, median = 72.45, IQR = 27.88, and for the children with CI: mean = 57.06 (SD = 19.58), 95% CI = 53.90-60.22, median = 58.43, IQR = 31.65. As was reported for the entropy scores, the ratings for the two groups of children differ considerably.
The best fitting model for the data is similar to the one for the entropy scores, viz. Hearing status and Chronological age are the significant predictors, as shown in Table 4. As was the case for the entropy scores, adding a quadratic effect of Chronological age did not improve the model, and neither did the interaction of the predictors Hearing status and Chronological age.
Considering only the ratings of the children with CI reveals a pattern similar to that of the entropy scores: only chronological age is a significant predictor. Adding the other demographic variables, viz. Gender, Etiology, bilateral versus unilateral CI, Hearing Age and Age at Implantation, did not lead to a significantly better fit.
Individual differences
The previous section indicated that adding the individual children as a random effect significantly improved the model estimating the entropy scores. In order to look into the variability of the individual children, an estimated entropy score is calculated for each child in the sample which is visualised in Figure 3. These scores represent the BLUPs, the best linear unbiased predictions (Henderson, 1975; Liu, Rong, & Liu, 2008).
For the group of NH children, the median estimated entropy score is 0.11 (range: 0.03-0.30). For children with CI, the first striking observation is the outlier in the distribution of entropy values. One child has an average estimated entropy of 0.72, which is almost double the highest score of the other children. As mentioned in the previous section, this outlier was not included in the statistical modelling. Leaving this outlier out of the analysis, the median entropy score for the children with CI is 0.17 (range: 0.06-0.37). The individual scores of the children with CI show a larger amount of intra-group variability. However, eight children with CI score below the third quartile of the NH scores, and 12 children obtain scores below the fourth quartile of the NH children. Four children (i.e., CI1, CI2, CI10 and CI12) have intelligibility scores outside the distribution of the NH children, i.e., scores above the 4th quartile.
Correlation of entropy and scale scores
In the present study, 100 participants transcribed utterances of the two groups of children and 150 participants rated the intelligibility of the same utterances. The rating was holistic in the sense that the participants listened to the stimuli and then positioned a slider between the extremes “fully unintelligible” and “fully intelligible”. The resulting position of the slider was then projected on a scale between 0 and 100 by the Qualtrics software. It was hypothesized that, if both tasks tapped into the same reality, viz. the intelligibility of the children’s spoken utterances, a high correlation should surface in a correlational analysis. More specifically, since a high entropy score indicates an elevated level of divergence of the transcriptions (and hence low intelligibility), a negative correlation was expected with the score on the rating scale. Indeed, low intelligibility corresponds with a low value on the rating scale, and conversely, high intelligibility with a high value on that scale.
A correlational analysis confirms this expectation: a pairwise correlational analysis of the entropy scores resulting from the transcription task and the z-score converted ratings on the scale yields a high negative correlation (Pearson product-moment correlation = -0.85, p < 0.0001). This shows a significant linear relationship between the two variables. Further analysis reveals that the best relationship is a quadratic one, which is shown in the scatterplot in Figure 4, in which, for the sake of familiarity, the raw scores are represented on the X-axis.
The quadratic relationship between the Entropy score and the Scale Score (SS) is expressed in equation (2):
This relationship is highly significant: the adjusted R² equals 0.79, indicating that 79% of the variance in the Entropy score is explained by differences in speech intelligibility as expressed on the rating scale. Conversely, 21% of the variance of the entropy is left unexplained by the rating scale.
Discussion
The aim of the present study was to investigate the spontaneous speech intelligibility of seven-year-old Dutch-speaking children with CI compared to their chronological age matched NH peers. The children with CI were all implanted early, at around one year of age. The children’s intelligibility was estimated by comparing multiple transcriptions of their speech and computing the entropy of the transcriptions, and by having listeners rate the intelligibility on a perceptual rating scale. The main findings of the study can be summarized as follows. First of all, it was found that the intelligibility of children with CI whose implant was activated around one year of age (the youngest child was implanted at the age of five months) was still lagging behind that of children with NH, even after approximately six years of device use. Secondly, children’s intelligibility appears to increase linearly with age. That is, older children were more intelligible for the listeners than younger ones. This effect was apparent in the group of children with CI as well as in the group with NH, indicating that between approximately six and eight years of life children’s intelligibility still increases significantly. Moreover, the linear effect of age and the lack of a significant quadratic effect of age suggest that their intelligibility has not reached a ceiling level yet.
The third finding concerns the method for measuring the intelligibility of spontaneous speech. Children’s intelligibility has predominantly been studied using highly controlled speech, as in imitation studies. Spontaneous speech was deemed out of reach because an objective basis for judging their productions (as correct or not) was lacking. By using multiple transcriptions of children’s spontaneous speech samples and by computing the entropy of those transcriptions, a method was implemented for assessing intelligibility without assuming a “correct” transcription. A fourth noteworthy finding is that this “analytic” approach to speech intelligibility correlated significantly with assessments using “holistic” judgements on a rating scale, thus providing an empirical validation of the approach using entropy.
Intelligibility of children with CI in comparison to NH children
The present study shows that the entropy scores were significantly lower, and the perceptual ratings significantly higher, for NH children than for children with CI. In other words, NH children’s intelligibility appears to be higher than that of children with CI. For both groups, there was an effect of chronological age, which means that intelligibility increases as children grow older. The effect of chronological age established in the present study corroborates the findings of other studies in which age was found to correlate with language outcomes including intelligibility (Boons, De Raeve, Langereis, Peeraer, Wouters, & van Wieringen, 2013; Chin et al., 2003; Flipsen, & Colvard, 2006). Remarkably, the intelligibility of neither of the two groups reached a plateau, as can be inferred from the lack of a significant quadratic effect of age. However, it is as yet unknown if and when children reach maximal intelligibility. Hustad et al. (2020) estimated the intelligibility of four-year-olds at around 78% in an imitation task. In the present study the average entropy score of NH children at approximately eight years of age is predicted to be 0.04 on a scale from 0 to 1, which comes close to a perfect score. But for children with CI the estimated entropy score is still considerably higher, viz. 0.15. In this respect Miller (2013, p. 606) already noted “that even ‘healthy’ speakers do not achieve 100% intelligibility”. In the present study, NH children of approximately seven years of age also did not reach 100% intelligibility. The question then is: what is maximal intelligibility? What is the level of intelligibility that “healthy speakers”, in Miller’s words, are eventually able to reach, and at which age do they reach that point?
Since the sample in this study contained only a single recording per child and the recordings were not selected from a longitudinal follow-up of the same children, the effect of chronological age can only be interpreted as: older children are more intelligible than the younger ones. Hence, at this point a longitudinal follow-up is called for in order to confirm that children’s speech intelligibility still continues to improve up to and after age seven.
The findings of the present study corroborate those reported in the literature concerning the effect of age on children’s intelligibility: irrespective of their hearing, children’s intelligibility increases as they grow older, but at a given age CI children’s intelligibility lags behind that of NH children (Chin, & Kuhns, 2014; Freeman et al., 2017). Moreover, the variability among children with CI is much larger than that among NH children. This can easily be inferred from the results of the present study (see Figure 3). However, some caution is also required in interpreting the results. On the one hand, of the children with CI participating in the present study, twelve score within the range of the NH children, and the score of only four of them is outside that range, including an obvious outlier. This result seems to corroborate the findings of an increasing number of studies which show that early implanted children are catching up with their NH peers after a few years of device experience (Boons et al., 2013; Geers, & Nicholas, 2013; Habib et al., 2010; Nicholas, & Geers, 2007; Wie, 2010). On the other hand, the children with CI participating in this study are not an unbiased sample of congenitally hearing-impaired children with a CI. The present sample consists of children with an early detected hearing impairment, who were implanted at an early age, with no additional comorbidities, with parents belonging to the mid-to-high SES, etc. These are all characteristics which have been shown to be favourable circumstances for speech and language development.
The relative homogeneity of the sample of children with CI probably explains some unexpected findings of the present study. First of all, a factor which has been shown time and again to influence the outcome of children’s speech and language development is the age at implantation (i.a., Boons et al., 2012; Niparko et al., 2010). The analyses presented here show that the age at implantation is not a significant predictor of the children’s intelligibility at age seven, contrary to the findings presented by i.a., Habib et al. (2010). This seems to suggest that it is not the age at implantation, but the children’s experience with their implant, i.e., their hearing age, which determines their intelligibility more strongly. But that did not turn out to be the case either: length of device use was not a significant predictor in the analyses presented here. It was the children’s chronological age which determined the entropy of the transcriptions and the scores on the rating scale most significantly. At present we can only speculate about the relative effect of these factors. The fact that chronological age was found to be a significant predictor of intelligibility (and not age at implantation or hearing age) may be interpreted as indicating that, given the small range of the age at implantation of the children studied here and given the small range of their hearing age, the variability was too small to exert a significant effect. But alternatively, it may be the case that, after a certain amount of time, the effect of the age at implantation is simply not significant anymore, and other factors take over that role, as advocated by Szagun and Stumper (2012).
Using entropy to measure the intelligibility of spontaneous speech
Intelligibility has been mainly measured using (highly) controlled speech in studies of children’s speech and especially in clinical studies. Participants (patients) were typically instructed to read a list of words or sentences. Or participants were instructed to repeat or imitate words or sentences read to them. In such a procedure, the researcher typically judges each word or sentence as either correct or incorrect and uses some summary statistics to quantify the level of intelligibility of the participant’s speech as, for instance, the percentage of words repeated correctly. The main advantage of such a procedure is that the target is clearly determined in advance: the list of words or sentences to be read or repeated is the target and the participant’s rendition can be compared with that target. However, in spontaneous, conversational speech, the target of the speaker is in principle unknown, unless the investigator addresses the participant’s introspection, which is obviously difficult, if not impossible, in the case of young children. Thus, there is no prespecified target with which a child’s spontaneous production can be compared. This makes a transcription task hazardous: how to rate a transcribed word as (in)correct, if the target is unknown?
In the literature the lack of a target has been addressed by using rating scales on which the child’s intelligibility is situated relative to two extremes, such as a Likert scale with the extremes “fully unintelligible” and “fully intelligible”, or a scale on which the various grades are labelled as is the case for the SIR (Cox, & McDaniel, 1989). In all of these cases, intelligibility is graded in a “holistic” way: irrespective of the (unknown) target, what the child says is evaluated relative to an implicit scale of intelligibility. The alternative approach proposed here takes the child’s speech production as its starting point, and several transcribers produce a transcription. The assumption is that transcribers will agree on what the child says if the utterance is intelligible and will disagree more as intelligibility declines. The methodology proposed in this study is to compute entropy as a quantitative expression of the degree of consensus or the degree of chaos among multiple transcriptions. Entropy takes into account the degree of agreement between transcriptions, but also the degree of disagreement between transcriptions (how many different items occur in the transcriptions?) as well as the distribution of those agreements and disagreements. As such, entropy is not only a suitable measure for transcriptions of spontaneous speech, but also for transcriptions of read or imitated speech, especially to shed light on the degree of (un)intelligibility of speech samples. For instance, when imitated speech is transcribed, the intelligibility is usually expressed as the percentage of correctly identified words (relative to the total number of words). The decision is binary: a word is either correctly or incorrectly identified. If the target is, e.g., “frog”, all instances of “frog” are labelled as “correct”, while alternative transcriptions are labelled as “incorrect”.
But incorrect instances are not further taken into account, which obscures the degree of intelligibility to a considerable extent, since the score remains the same whether an incorrectly transcribed word is rendered in exactly the same way by the transcribers or in various different ways. For instance, suppose that a particular word is transcribed correctly in 50% of the cases. What does the remaining 50% of the transcriptions consist of? Possibly the remaining 50% of the transcriptions contains exactly the same word so that there are only two variants in the transcriptions (e.g., the correct transcription “frog” and an incorrect one, such as “frogs”). But it is also conceivable that all the incorrect transcriptions are different words (e.g., “frogs”, “fox”, “fog”, etc.). In both cases, the percentage correct is 50%, but the entropy score will be markedly different. In the case in which there are only two different forms in the transcriptions, the entropy score is still fairly small. However, in the second scenario, the entropy score is greatly affected by the number of different or even unique transcribed words (see, for instance, the rightmost column in Table 2), and the entropy score will be fairly high. Hence, these different scenarios are reflected in differences of the entropy score. Thus, the use of entropy provides a fine-grained metric of speech intelligibility that goes beyond what percentage-correct scores provide.
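The two scenarios can be made concrete with a small worked example. It assumes 20 transcribers, as in the transcription task, and uses invented word lists; the helper `relative_entropy` is a hypothetical implementation of entropy divided by the maximum entropy.

```python
import math
from collections import Counter

def relative_entropy(words):
    """Shannon entropy of the transcribed variants divided by log2 of their number."""
    total = len(words)
    entropy = sum((c / total) * math.log2(total / c) for c in Counter(words).values())
    return entropy / math.log2(total)

# Scenario 1: half "frog", half the same incorrect variant.
two_variants = ["frog"] * 10 + ["frogs"] * 10
# Scenario 2: half "frog", and ten different incorrect variants (placeholders).
many_variants = ["frog"] * 10 + [f"variant{i}" for i in range(10)]

percent_correct = [ws.count("frog") / len(ws) for ws in (two_variants, many_variants)]
score_two = relative_entropy(two_variants)
score_many = relative_entropy(many_variants)
```

Both lists are 50% correct, yet the relative entropy is roughly 0.23 in the first scenario and roughly 0.62 in the second.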
Interestingly, a high correlation was established between the “objective” measurement of intelligibility based on the entropy of transcriptions and the “subjective” holistic measures provided by rating on an unlabelled scale. In the current study, a highly significant correlation of r=-0.85 was computed. In absolute terms, this correlation is higher than the one reported by Habib et al. (Reference Habib, Waltzman, Tajudeen and Svirsky2010), viz. r=0.79, but lower than the one reported by Peng et al. (Reference Peng, Spencer and Tomblin2004), viz. r=0.91. This implies that both measures estimate the same reality to a large extent, but that they do not measure exactly the same variable. In the transcription task, transcribers identify and transcribe words in the child’s speech, and the metric measures the degree of agreement between different transcribers’ identifications. In the rating scale approach, identification of the linguistic items probably plays an important role, but that is not necessarily the case. More and different information can be taken into account in addition to the identification of words, such as the child’s quality of voice, articulatory features such as accuracy, regional accent, and the like. The fact that such ratings rely on implicit criteria of intelligibility makes them less amenable to accurate and explicit assessment.
Perspectives for future research
The present study was restricted to computing the entropy of transcriptions and relating those measurements to particular explanatory variables, such as the children’s hearing status and their chronological age. However, the question arises as to which specific linguistic or acoustic variables explain particular entropy values. For example, does the level of entropy of the transcriptions produced by listeners increase or decrease given particular phonetic or phonological variables, or other linguistic variables such as certain word types or utterance length? In other words, what are the linguistic determinants of entropy values?
A preliminary qualitative investigation of our data revealed discrepancies between transcribers at different levels. For instance, at the segmental level, the transcriptions of the same word differed in the voicing of the same segment, e.g., boom [bo:m] ‘tree’ versus pomp [pɔmp] ‘pump’, or in the place of articulation, e.g., hen [ɦɛn] ‘them’ versus hem [ɦɛm] ‘him’, or gaat [ɣa:t] ‘goes’ versus had [ɦɑt] ‘had’. Listeners also identified different vowels (including diphthongs) at the same position, e.g., bijen [bɛjən] ‘bees’ versus buien [bœyən] ‘showers’, as well as different consonants, e.g., was [wɑs] ‘was’ versus valt [vɑlt] ‘falls’. It should be noted that for these kinds of discrepancies a phonemic transcription is obviously more appropriate than the orthographic one used in the present study. Morphological differences were also apparent. In our sample, word endings often differed between transcriptions, as in kikker ‘frog’ versus kikkers ‘frogs’, schoen ‘shoe’ versus schoenen ‘shoes’, sta ‘stand’ versus staat ‘stands’.
These differences between transcriptions may be used in a more refined calculation of entropy. In the present study, each deviation between transcriptions was weighted equally. In other words, each difference increased the entropy score to the same degree. However, some deviations in the transcriptions are fairly small (e.g., kikker ‘frog’ vs. kikkers ‘frogs’), whereas others can really be considered mismatches (e.g., jongen ‘boy’ vs. hond ‘dog’). Further research is needed to find fruitful ways of refining the measure by taking into account the (linguistic) distance between different transcriptions. For example, the orthographic transcriptions of the listeners could be converted to and aligned on a phonemic level, and the calculation of entropy could take into account the phonological distance between the different alignments (Faes, Gillis, & Gillis, Reference Faes, Gillis and Gillis2016).
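To illustrate why such a refinement might matter, the following minimal sketch compares two hypothetical sets of transcriptions that have the same entropy but differ in how far apart the variants are. It is not the method of the present study, nor the phonological distance of Faes et al.: as a simple stand-in it uses Levenshtein edit distance on the orthographic forms, normalised by the length of the longer word. All function names and data are assumptions made for illustration.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two strings (unit cost per edit)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mean_pairwise_distance(words):
    """Mean length-normalised edit distance over all pairs of transcriptions."""
    pairs = list(combinations(words, 2))
    if not pairs:
        return 0.0
    return sum(edit_distance(a, b) / max(len(a), len(b), 1)
               for a, b in pairs) / len(pairs)


# Hypothetical data: four transcribers, split evenly over two variants,
# so both sets have the same entropy (1 bit).
small_deviation = ["kikker", "kikker", "kikkers", "kikkers"]
real_mismatch = ["jongen", "jongen", "hond", "hond"]

print(round(mean_pairwise_distance(small_deviation), 2))
print(round(mean_pairwise_distance(real_mismatch), 2))
```

With these data the mean distance is roughly 0.10 for kikker/kikkers and 0.44 for jongen/hond, although an unweighted entropy treats both sets identically. A phonemic alignment with a phonologically motivated distance, as suggested above, would replace the orthographic edit distance used in this sketch.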
Conclusion
This study investigated the spontaneous speech intelligibility of seven-year-old normally hearing (NH) children and children with a cochlear implant (CI). Intelligibility scores were calculated using transcription entropy, i.e., a measure of the degree of chaos among listeners’ transcriptions. In addition, intelligibility was holistically judged on a rating scale. A first conclusion is that the intelligibility of the early implanted children with CI was significantly lower than that of their normally hearing peers, implying that they have not yet caught up with their NH peers. Despite the group differences between children with NH and CI, a remarkable result of this study is that there is a high degree of overlap between both groups when the children are considered as individuals rather than as a group: a majority of the children with CI reach intelligibility scores within the range defined by the NH children. A second conclusion is that speech intelligibility still seems to develop over time. In both groups of children, older children reach higher levels of intelligibility than younger ones.
Acknowledgements
We especially thank Jolien Faes for preparing and running the Qualtrics rating scale experiment. Thanks are also due to the action editor and two anonymous reviewers for the constructive comments. This project was funded by a predoctoral research grant of the Research Foundation – Flanders (FWO) to the first author (1100316N). This study was approved by the Ethics Committee for the Social Sciences and Humanities (SHW_15_37) of the University of Antwerp. The authors declare no conflict of interest.