
The articulatory and acoustic characteristics of Polish sibilants and their consequences for diachronic change

Published online by Cambridge University Press:  28 March 2016

Véronique Bukmaier
Affiliation: Institute of Phonetics and Speech Processing, University of Munich, Germany (bukmaier@phonetik.uni-muenchen.de)

Jonathan Harrington
Affiliation: Institute of Phonetics and Speech Processing, University of Munich, Germany (jmh@phonetik.uni-muenchen.de)

Abstract

The study is concerned with the relative synchronic stability of three contrastive sibilant fricatives /s ʂ ɕ/ in Polish. Tongue movement data were collected from nine first-language Polish speakers producing symmetrical real and non-word CVCV sequences in three vowel contexts. A Gaussian model was used to classify the sibilants from spectral information in the noise and from formant frequencies at vowel onset. The physiological analysis showed an almost complete separation between /s ʂ ɕ/ on tongue-tip parameters. The acoustic analysis showed that the greater energy at higher frequencies distinguished /s/ in the fricative noise from the other two sibilant categories. The most salient information at vowel onset was for /ɕ/, which also had a strong palatalizing effect on the following vowel. Whereas either the noise or vowel onset was largely sufficient for the identification of /s ɕ/ respectively, both sets of cues were necessary to separate /ʂ/ from /s ɕ/. The greater synchronic instability of /ʂ/ may derive from its high articulatory complexity coupled with its comparatively low acoustic salience. The data also suggest that the relatively late stage of /ʂ/ acquisition by children may come about because of the weak acoustic information in the vowel for its distinction from /s/.

Type
Research Article
Copyright
Copyright © International Phonetic Association 2016 

1 Introduction

While there have been many studies in the last 30 years on the acoustic (Evers, Reetz & Lahiri 1998, Jongman, Wayland & Wong 2000, Nowak 2006, Shadle 2006, Cheon & Anderson 2008, Maniwa, Jongman & Wade 2009), perceptual (McGuire 2007, Cheon & Anderson 2008, Li et al. 2011) and articulatory characteristics of sibilants (Narayanan, Alwan & Haker 1995), the large majority have focused on the two-way distinction between alveolar /s/ and post-alveolar /ʃ/. Here we are concerned with the comparatively rarer three-way place contrast in sibilants in Polish. Apart from Swedish and Mandarin Chinese, Standard Polish is one of the very few languages that distinguishes lexically between dental /s/ (e.g. sali /sali/ ‘room, gen’), retroflex /ʂ/ (e.g. szali /ʂali/ ‘scale, gen’), and alveolopalatal /ɕ/ (e.g. siali /ɕali/ ‘sown’) sibilants (Gussmann 2007, Żygis, Pape & Jesus 2012a).

In recent years, these three sibilants have been analysed physiologically for Polish in Toda, Maeda & Honda (2010), for Mandarin in Proctor et al. (2012), and in both these languages by Hu (2008). These studies have shown that the three sibilants differ articulatorily not only in tongue position, but also in tongue posture. The fricatives are also distinguished from each other by two other tongue-shape properties. Firstly, whereas in /ʂ s/ the vertical orientation of the tongue tip is typically upward-facing, it is downward-facing for /ɕ/. Secondly, while the tongue tip tends to be curled back to a greater extent for /ʂ/ than for /s/, the degree to which it is retracted has been shown to be somewhat less in Polish and Mandarin than in Indian languages (Hamann 2002a, b; Hu 2008; Toda et al. 2010): for these reasons, there is a greater resemblance in tongue shape between /s ʂ/ in Polish than in Indian languages.

There is some evidence from the physiological analysis of four Polish L1 speakers in Bukmaier et al. (2014) for greater variability in /ʂ/ than in the other sibilants. At a slow speech rate, /s ʂ/ were clearly differentiated in tongue-tip orientation such that /ʂ/ was a sub-laminal production in which the underside of the tongue tip/blade made contact with the place of articulation. However, at a fast speech rate, these orientation differences were much less in evidence, such that /ʂ/ resembled /s/ in being supra- rather than sub-laminal. Hu's (2008) physiological analysis of Mandarin Chinese also pointed to greater articulatory variability in /ʂ/ than in the other two fricatives.

As far as the acoustics are concerned, many studies in the last 50 years have shown that the place of articulation distinction between English /s ʃ/ can be based to a large extent on the spectral characteristics of the fricative noise (Whalen 1991, Shadle & Mair 1996, Evers et al. 1998, Stevens 1998, Jongman et al. 2000, Shadle 2012): more specifically, the shorter front cavity in /s/ causes the energy in the spectrum to be shifted towards higher frequencies, so that both acoustically and perceptually (Fujisaki & Kunisaki 1977, Mann & Repp 1980), a higher spectral centre of gravity (Forrest et al. 1988) differentiates it from /ʃ/. For the three-way place contrast in Polish, these spectral characteristics in the noise can separate /s/ from the other sibilants (Żygis et al. 2014a, b), but as various studies (Jassem 1995, Żygis & Hamann 2003, Nowak 2006) have shown, the centre of gravity in the noise by itself is generally insufficient for the /ʂ ɕ/ separation.

The issue of whether formant transitions contribute to the acoustic and perceptual distinction of place of articulation within fricatives is still unresolved. Some of the first studies to address this issue (Harris 1958, Heinz & Stevens 1961) showed that formant transitions were not necessary for the distinction between sibilants but that they were for the non-sibilant /f θ/ separation in English. On the other hand, although acoustic studies showed evidence of formant transitions extending well into the fricative noise (Soli 1981), subsequent research suggested that vowel transitions were less important for the perceptual distinction of place of articulation in sibilant and non-sibilant fricatives (LaRiviere, Winitz & Herriman 1975, Jongman 1989). However, most of these studies were based on languages with only two sibilant fricatives. By contrast, a more recent cross-linguistic investigation by Wagner, Ernestus & Cutler (2006) showed that the effectiveness of formant transition cues is language-dependent: more specifically, listeners were shown to rely on formant transitions to a greater extent in languages like Polish with fricatives such as /ʂ ɕ/ that are largely undifferentiated in the fricative noise. These results were consistent with those of Nowak (2006), in which L1 Polish listeners identified Polish sibilants from isolated sections of friction noise and in VCV sequences with the transitions into the following vowel removed. Nowak's (2006) results showed that, while fricatives could be reliably identified from the noise section, formant transitions were essential for the separation of /ʂ ɕ/ in VCV sequences. Compatibly, Toda et al. (2010) showed how the quite different tongue shapes for /ʂ ɕ/ contributed to the differences between these sibilants in vowel formant transitions.

Studies of the acquisition of Polish sibilants have shown that children acquire /ʂ/ relatively late and typically after the other sibilants have been acquired (Łukaszewicz 2006, 2007). The articulatory instability in /ʂ/ and the findings from language acquisition might also be related to the diachronic change from the three-way /s ʂ ɕ/ contrast to a two-way distinction as a result of an /s ʂ/ merger in both the Min variety of Mandarin (Duanmu 2006, Chuang & Fon 2010) and in several Polish dialects (Żygis, Pape & Czaplicki 2012b). One of the main motivations for the present study was to investigate the synchronic basis for the diachronic collapse of the /s ʂ/ contrast towards /s/. The more specific aims were to analyse both the physiological and acoustic characteristics of these three fricatives in order to assess whether the identification of /ʂ/ is disadvantaged in comparison with the other two fricatives. To do so, we carried out an electromagnetic articulographic (henceforth EMA) study of nine Polish L1 speakers producing these sibilants and assessed the acoustic distinctiveness of the three fricatives from each other in both the noise and the transitions.

2 Method

2.1 Data collection and speakers

Acoustic and speech movement data were acquired using electromagnetic articulography (AG501, Carstens Medizinelektronik) in a soundproof booth at the IPS in Munich in order to obtain measurements of the horizontal, vertical, and lateral position of the articulators. For the EMA recordings, two sensors were placed on the tongue (Figure 1): one on the midline 1 cm behind the tip of the tongue (TT) and the other level with the molar teeth at the tongue back (TB). Additionally, two sensors were placed on the upper and lower lip, i.e. on the skin just above and below the lips. Four additional sensors were fixed to the maxilla (to the tissue just above the teeth), the nose bridge, and the left and right mastoid bones in order to correct for head movement. For the present study, only the data from the sensors attached to the tongue were analysed. The acoustic speech signal was recorded synchronously with the physiological data using a Sennheiser ME66 supercardioid microphone with the bass roll-off filter turned on (−6 dB at 200 Hz), positioned approximately one metre in front of the subject. Audio data were recorded with a National Instruments CompactDAQ multichannel data acquisition front-end connected via USB to a notebook computer. Synchronization of the audio and speech movement signals was carried out in post-processing after the recording session (see Hoole & Zierdt 2010 for further details of the post-processing of acoustic and articulatory data).

Figure 1 The placement of the two sensors on the surface of the tongue.

The subjects in this experiment were nine L1 Standard Polish-speaking adults (four male, five female) aged between 19 and 28 years. Six speakers were born and went to school in dialect regions with a three-way sibilant contrast (two each from Silesia, Lesser Poland and Greater Poland). The remaining three speakers were born and lived most of their lives (i.e. went to school) in dialect regions in which the alveolar/retroflex contrast is neutralized (two from Mazovia and one from Kashubia). These three subjects were nevertheless included in our analysis because they were judged by an L1 Polish speaker with linguistic training to be speakers of Standard Polish with no perceptible regional colouring. None of the participants had lived outside of Poland for more than two years at the time of recording.

2.2 Speech material and experimental set-up

The participants produced symmetrical CVCV (e.g. /sɛsɛ/) non-words (in which C = /s ʂ ɕ/ and V = /a ɛ ɔ/) as well as Polish disyllabic real words (Table 1) with initial CV sequences (in which C = /s ʂ ɕ/ and V = /a ɛ ɔ/). All target words were embedded in the carrier phrase ‘Ania woła [TARGET WORD] aktualnie’ (literally ‘Ania shouts [TARGET WORD] currently’), with the target word produced with a nuclear pitch accent. The participants read the sentences aloud as they were presented on a computer screen one at a time in randomized order. In cases of mispronunciation or incorrect prosody, the participants were asked to repeat the sentence.

Table 1 Distribution of CV sequences in real words and non-words used as target words.

The recording session consisted of ten blocks alternating between slow and fast speech rates. In order to define individual speech rates and to adjust the corresponding recording time, participants read examples of the speech material at a self-selected fast and slow speech rate in a pretest prior to the actual recording. The display incorporated a progress bar linked to the desired speech rate, which was defined for each speaker and condition from the mean durations of the pre-recording and indicated the time frame for each token. For each speech rate, each of the 22 target words (nine non-words and 13 real words; Table 1) was repeated ten times in randomized order. Some word-initial CV sequences occur more often in the onsets of Polish disyllabic real words, e.g. /sa/-, /ɕa/- and /ʂa/- onsets, as a result of which there were more (near) minimal pairs for these CV sequences (see Table 1, row 1: /sara/, /sama/, /sava/; row 4: /ʂari/, /ʂafa/; row 7: /ɕatka/, /ɕanɔ/). Because of this skewed distribution, the materials included between two (/ɕa/ and /ʂa/ onsets) and three (/sa/ onsets) target words per sequence, with rarer sequences represented by one target word each (e.g. /sɛ sɔ ʂɛ ʂɔ ɕɛ ɕɔ/). Table 1 contains the complete distribution of CV sequences.

The experiment yielded 3960 sentences (22 target words × 10 repetitions × 2 speech rates × 9 speakers), of which 3895 were analysed in this study. The loss of 65 tokens was due to technical problems during the recording session and post-processing. The total number of analysed sibilant–vowel combinations for both real and non-words is given in Table 2.

Table 2 Total number of analysed CV sequences in real and non-words separately for each sibilant–vowel combination.

2.3 Data analysis

2.3.1 Physiological analysis

The post-processing of the raw physiological data was done semi-automatically in Matlab (MathWorks, R2012a), including rotation of the data so that they were parallel to the occlusal plane (Hoole & Zierdt 2010).

The articulatory annotation of the three sibilants was based on the vertical movement of the tongue tip (TT) and the TT tangential velocity. Physiological labels included seven different landmarks (Figure 2). Typically, a complete CVC movement cycle was divided into a CV or opening phase, a nucleus or quasi-target phase, and a VC or closing phase. Onsets and offsets of opening and closing gestures were determined using a 20% threshold criterion of the tangential velocity signal (Hoole & Mooshammer 2002, Hoole et al. 2009). The vowel nucleus was then defined as the interval between the CV offset and the VC onset.
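The 20% velocity-threshold criterion can be sketched as follows. This is not the authors' code but a minimal Python re-implementation under our own assumptions (sampled sensor positions in mm, a sampling rate `fs` in Hz, and a hypothetical helper name):

```python
import numpy as np

def gesture_landmarks(x, y, fs, threshold=0.2):
    """Locate gestural onset/offset from the tangential velocity of a
    sensor trajectory: the first and last samples at which the velocity
    exceeds `threshold` times its peak (20% criterion)."""
    vx = np.gradient(x) * fs                   # horizontal velocity
    vy = np.gradient(y) * fs                   # vertical velocity
    vtan = np.sqrt(vx ** 2 + vy ** 2)          # tangential velocity
    above = vtan >= threshold * vtan.max()     # threshold criterion
    onset = int(np.argmax(above))                           # first crossing
    offset = len(above) - 1 - int(np.argmax(above[::-1]))   # last crossing
    return onset, offset, vtan
```

Applied to the opening and closing gestures of a CVC cycle, the interval between the opening-gesture offset and the closing-gesture onset would then delimit the vowel nucleus.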

Figure 2 Schematic representation of landmark positions: gestural onset (gon), maximum velocity at gestural onset (von), onset of constriction plateau (pon), maximum in constriction (mon), offset of constriction plateau (poff), maximum velocity at gestural offset (voff), and gestural offset (goff).

Using the landmarks in Figure 2, we extracted data from the vertical and horizontal positions of the tongue tip (TT) and the tongue back (TB). We also analysed the orientation of the TT, since this potentially provides information about the retroflex, in which the tongue tip is often curled upwards (Ladefoged 2001; Hamann 2002a, b; Toda et al. 2010; Bukmaier et al. 2014).

2.3.2 Acoustic analysis

The synchronized acoustic data were digitized at 25,600 Hz and automatically segmented and labelled using forced alignment (Munich Automatic Segmentation tool; Schiel 2004). Calculations were made of spectra (256-point discrete Fourier transform with a 40 Hz frequency resolution, 5 ms Blackman window, and a frame shift of 5 ms) and of formant frequencies (F1–F4; pre-emphasis of −0.8, 20 ms Blackman window with a frame shift of 5 ms).

For the acoustic analysis of the fricative noise, spectra were extracted at the temporal midpoint between the acoustic onset and offset of each sibilant. These spectral data were reduced to a set of coefficients using the discrete cosine transformation (DCT) after converting from Hz to the mel scale. For an N-point mel-scaled spectrum x(n), extending from n = 0 to N − 1 over the frequency range 500–3500 mel (414–10,313 Hz), the mth DCT coefficient Cm (m = 0, 1, 2) was calculated with the following equation:

$$\begin{equation*} C_m = \frac{2k_m}{N}\sum_{n=0}^{N-1} x(n)\cos\left(\frac{(2n+1)m\pi}{2N}\right) \end{equation*}$$

These three coefficients Cm (m = 0, 1, 2) encode the mean, the slope, and the curvature, respectively, of the signal to which the DCT is applied (Harrington 2010). Since preliminary analyses had shown that the sibilants were optimally distinguished in the fricative noise by C1 and C2 (i.e. by the slope and curvature of the spectrum respectively), all further quantifications of the sibilants were based on these two coefficients.
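As an illustration, the equation above can be implemented directly. Note one assumption: the excerpt does not define k_m, so the common convention k_m = 1/√2 for m = 0 (else 1) is used here:

```python
import numpy as np

def dct_coefficients(x, m_max=2):
    """DCT coefficients C_0..C_{m_max} of a (mel-scaled) spectrum x,
    computed from the summation formula in the text."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    n = np.arange(N)
    coeffs = []
    for m in range(m_max + 1):
        k_m = 1.0 / np.sqrt(2.0) if m == 0 else 1.0  # assumed convention
        c = (2.0 * k_m / N) * np.sum(
            x * np.cos((2 * n + 1) * m * np.pi / (2 * N)))
        coeffs.append(c)
    return np.array(coeffs)
```

For a flat spectrum, C1 and C2 come out as zero; for a spectrum whose energy changes monotonically with frequency (as for /s/), C1 is non-zero with a sign reflecting the direction of the slope.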

The articulatory and formant data were speaker-normalized using standard z-score normalization (Lobanov 1971). More specifically, where xP.i.T is a raw value of an articulatory or formant parameter P from speaker i at time point T, the corresponding normalized value XP.i.T was given by the following formula:

$$\begin{equation*} X_{P.i.T} = (x_{P.i.T} - x_{P.i.m})/x_{P.i.s} \end{equation*}$$

where xP.i.m and xP.i.s are the mean and standard deviation respectively, calculated across all frames of the same parameter for the same speaker.

When normalization was applied to the data in the fricative noise, (xP.i.m , xP.i.s ) were calculated from all frames of data between the fricatives' acoustic onset and offset; when normalization was applied to the formant parameters, (xP.i.m , xP.i.s ) were calculated from all frames extending between the acoustic vowel onset and offset.
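A minimal sketch of this normalization (hypothetical helper name; per-speaker z-scores computed over all frames of a parameter, as described above):

```python
import numpy as np

def lobanov_normalize(values, speaker_ids):
    """Speaker-normalize a parameter: for each speaker, subtract that
    speaker's mean and divide by that speaker's standard deviation,
    both computed across all of their frames."""
    values = np.asarray(values, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    out = np.empty_like(values)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        out[mask] = (values[mask] - values[mask].mean()) / values[mask].std()
    return out
```

After normalization, each speaker's values have mean 0 and standard deviation 1, so speakers with different vocal-tract sizes (or sensor placements) become directly comparable.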

We also carried out a Gaussian classification of the acoustic (spectral, formant) data in order to determine the degree of separation between the three fricative places of articulation. Classification was based on quadratic discriminant analysis (Srivastava, Jermyn & Joshi 2007), with a training and a testing stage. During training, each fricative class, consisting of a number of observations in a two-dimensional acoustic space, was modelled as a bivariate Gaussian distribution; during testing, observations were classified as one of the fricative classes on the basis of the greatest posterior probability. Training and testing were related using a leave-one-out procedure in which, iteratively for each of the nine speakers in turn, a given speaker's data were classified following training on the data of the other eight speakers. For the fricative noise, the two parameters were C1 and C2 as defined above, extracted at the acoustic temporal midpoint of the fricative; for the vowel, the two parameters were F2 and F3 at the acoustic vowel onset. In vowel classifications, training and testing were additionally carried out using this leave-one-out procedure separately in each of the three /a ɛ ɔ/ vowel contexts. These classifications were carried out separately for the slow and fast rates (thus four classifications: two (slow/fast) based on C1 and C2, and two (slow/fast) on F2 and F3 at the acoustic vowel onset).
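The classification step can be sketched as follows: a minimal bivariate-Gaussian (quadratic) classifier in Python, not the authors' implementation, and with equal class priors assumed (the text does not state the priors used):

```python
import numpy as np

def gaussian_classify(train_X, train_y, test_X):
    """Fit one bivariate Gaussian per class on the training tokens and
    label each test token with the class of highest log density."""
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y)
    params = {}
    for c in np.unique(train_y):
        Xc = train_X[train_y == c]
        params[c] = (Xc.mean(axis=0), np.cov(Xc.T))
    preds = []
    for x in np.asarray(test_X, dtype=float):
        scores = {}
        for c, (mu, cov) in params.items():
            diff = x - mu
            # log Gaussian density up to a constant (equal priors assumed)
            scores[c] = (-0.5 * diff @ np.linalg.inv(cov) @ diff
                         - 0.5 * np.log(np.linalg.det(cov)))
        preds.append(max(scores, key=scores.get))
    return preds
```

In the leave-one-out setup described above, this function would be called nine times, each time with one speaker's tokens as `test_X` and the pooled tokens of the other eight speakers as training data.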

3 Results

The results are presented below separately for the fricative noise (Section 3.1) and for the vowel transitions (Section 3.2). In both cases, the aim was to determine the extent to which there was separation between the three fricative places of articulation and to assess how far these two sets of cues provide complementary information for this purpose.

3.1 Frication

3.1.1 Physiological analysis

The aggregated tongue-tip data in Figure 3 show a clear separation between the fricatives for each of the nine speakers. For most subjects, /s ʂ/ had the most fronted and retracted positions respectively, with /ɕ/ located along the front–back dimension between the other two sibilants. Additionally, the tongue tip was generally lower for /s/ than for the other two fricatives; and /ʂ/ tended to reach the highest position, perhaps as the tongue tip unfolded from an initially curled position.

Figure 3 Lobanov-normalized vertical (y-axis) and horizontal (x-axis) TT trajectories averaged across vowels between the two velocity maxima separately for the dental (grey), retroflex (dark grey) and alveolopalatal (black) fricative (with the circle marking the starting point of the trajectories).

Subsequent analyses showed that various combinations of two physiological parameters provided a very clear separation between the three fricative places of articulation. One of the most effective was the combination of the horizontal position of the tongue tip and its vertical orientation (Figure 4). Recall that the latter provides information about the sensor's rotation in the sagittal (front–back) plane. Since the tongue tip can be expected to be curled back in /ʂ/, but not in /ɕ/, the sensor, which is affixed just behind the tongue tip, should be rotated for /ʂ/ about the axis perpendicular to the sagittal plane – or at least to a greater extent than in /ɕ/. As Figure 4 shows, this was the case for eight out of nine speakers, for whom the rotation was greater for /ʂ/ than for /ɕ/: note in particular that this is the distinguishing feature for two speakers (P5, P8) for whom /ʂ ɕ/ were otherwise undifferentiated as far as the horizontal position of the TT was concerned. Figure 4 also shows that, with the exception of P6, there was almost complete separation between the three fricatives on these two dimensions for the remaining speakers. Thus the general conclusion is that /s ʂ ɕ/ were separated from each other as far as tongue-tip posture is concerned.

Figure 4 Lobanov-normalized TT orientation (y-axis) and horizontal TT position (x-axis) shown separately for each speaker at the moment of maximum constriction (mon in Figure 2) and separately for /s/ (grey triangle), /ʂ/ (dark grey cross) and /ɕ/ (black circle). The confidence ellipses are 2.47 standard deviations and enclose 95% of the data points.

3.1.2 Acoustic analysis

We now consider the extent to which the clear physiological separation between the three fricatives was matched acoustically. The ensemble-averaged spectra in Figure 5 show that /s/ was separated from the other two fricatives by greater energy at higher frequencies, but that the ensemble-averaged spectral shapes for /ʂ ɕ/ were quite similar (see Appendix). We tested various combinations of spectral parameters at the fricatives' temporal midpoint, including spectral moments (Forrest et al. 1988). The two which were most effective in separating the places of articulation were those proportional to the linear slope (C1) and curvature (C2) derived from the discrete cosine transformation, calculated after transforming the frequency axis to the mel scale as described in Section 2.3.2. For C1, if a regression line were drawn through the three spectra, then, as Figure 5 suggests, /s/ would be differentiated from the other two by its rising as opposed to falling slope. For C2, the greater the resemblance of the ensemble-averaged spectrum to a parabolic shape, the greater the value of C2. There is a clear parabolic shape in evidence for the ensemble-averaged /s/ spectrum in Figure 5, and the generally higher amplitude levels over a mid-frequency range for /ɕ/ than for /ʂ/ may provide some basis for their differentiation on this parameter.

Figure 5 Ensemble-averaged fricative spectra at the fricatives’ temporal midpoints separately for the dental (grey), retroflex (dark grey) and alveolopalatal (black) sibilants. The averaging was done across all fricatives produced by all speakers.

For most speakers, Figure 6 shows an overlap of /ʂ ɕ/ in the C1 × C2 space, whereas /s/ was clearly separated from the other two sibilants (except for speaker P6). The data in Figure 6 were consistent with the classifications (see Section 2.3.2 above for details), which showed for the slow rate of speech (Table 3) 96% correct classification for /s/ as opposed to 77% and 63%, respectively, for /ʂ ɕ/. Table 3 shows a high degree of /ʂ ɕ/ confusion for the slow rate of speech, with 25% of /ɕ/ misclassified as /ʂ/ and 23% of /ʂ/ misclassified as /ɕ/. Table 3 also shows that the classification scores at the fast rate of speech followed a broadly similar pattern.

Figure 6 First (slope) and second (curvature) mel-scaled DCT coefficients separately for the dental (grey triangle), retroflex (dark grey cross) and the alveolopalatal (black circle) sibilants averaged across vowel contexts and speech rates.

Table 3 Results for correct classification of the three sibilants with Gaussian training/testing in the C1 × C2 space based on the leave-one-out method, i.e. for a total of nine speakers (k = 1, 2 . . . 9), test on speaker k, train on all other eight speakers. The results are shown for classifications from the slow speech rate and from the fast rate in parentheses.

We tested the influence of place of articulation and speech rate on the classification scores, as well as whether the sibilant had occurred in a real word or a non-word. For this purpose, we ran a mixed model with the binary response (consonant correctly vs. incorrectly classified) as the dependent variable; with fixed factors of place of articulation (three levels: /s ʂ ɕ/), word-type (two levels: real word/non-word), and rate (two levels: slow/fast); and with speaker (nine levels) and word (22 levels: the separate words and non-words in Table 1) as random factors. We also included all the interaction terms between the fixed factors. We assessed the influence of word-type and rate by comparing two models: one with all the factors included, as outlined above, and one that differed from it by dropping word-type and rate. A comparison of these two models showed no significant differences: thus neither word-type nor rate had any significant influence on the classification scores. Predictably, classification scores were significantly influenced by place of articulation (χ2 = 67.2, p < .001).
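The model comparison reported here amounts to a likelihood-ratio test between nested models. A minimal sketch of that test statistic is below; the excerpt does not say which software fitted the mixed models, so the helper simply takes the two models' log-likelihoods as given:

```python
from scipy.stats import chi2

def likelihood_ratio_test(loglik_full, loglik_reduced, df_diff):
    """Compare two nested models: twice the difference in log-likelihood
    is approximately chi-squared distributed with df_diff degrees of
    freedom (the number of parameters dropped)."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    p = chi2.sf(stat, df_diff)  # upper-tail (survival) probability
    return stat, p
```

A chi-squared statistic as large as the 67.2 reported for place of articulation yields p < .001 for any small number of dropped parameters.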

3.2 Coarticulatory effects on adjacent vowels

In the preceding section, we showed that the very clear separation between the three fricatives based on the tongue configuration was not matched by the acoustic analysis of the fricative noise, which showed a substantial /ʂ ɕ/ confusion. Here we apply a similar type of analysis to the onset of the transitions into the vowel.

3.2.1 Physiological analysis

With the exception of speaker P4 (for whom the TB trajectories for the dental and the alveolopalatal were quite similar), Figure 7 shows that the vertical TB position was higher in vowels following /ɕ/ than in vowels following /s ʂ/. These findings suggest that /ɕ/ exerted a strong coarticulatory influence on the following vowel. Figure 7 also shows that, with the exception of speaker P6, /ʂ/ had a more retracted tongue-body position than /s/. Thus, there is considerable information in the tongue body at vowel onset, and often throughout the vowel, for the distinction between the three fricatives.

Figure 7 Trajectories of the vertical and horizontal TB (tongue body) position between the acoustic vowel onset and offset following the dental (grey), retroflex (dark grey) and the alveolopalatal (black) sibilants separately for each speaker.

3.2.2 Acoustic analysis

For all speakers, the F2 transition data in Figure 8 show higher F2 values following /ɕ/, consistent with the observations of the physiological analysis in Figure 7. Although /ʂ ɕ/ overlapped in F2, they were separated to a certain extent by the lower F3 for /ʂ/.

Figure 8 Aggregated, time-normalized trajectories in an F2 (y-axis) × F3 (x-axis) space between the acoustic vowel onset and its temporal midpoint shown separately for each vowel following the three sibilants: dental (grey), retroflex (dark grey) and alveolopalatal (black). The averaging was carried out following Lobanov-normalization of the speaker data.

Figure 9 illustrates further the strong coarticulatory influence of /ɕ/ on the vowels causing marked F2 raising for all vowels and F1 lowering in an /ɛ/ context. Thus, these data provide further evidence that vowels in a /ɕ/ context are strongly palatalized.

Figure 9 F1 (y-axis) and F2 (x-axis) extracted at the vowels' temporal midpoints in /ɕ/ (left panel) and in /s ʂ/ (right panel) contexts for /ɛ/ (grey), /a/ (black) and /ɔ/ (dark grey). The black vowel symbols (a’ ɛ’ ɔ’) in the left panel are the centroids extracted from the ellipses in the right panel, i.e. the centroids of the formants in vowels following /s ʂ/. The ellipses are 2.47 standard deviations and include at least 95% of tokens.

The results of the leave-one-out classification based on a two-parameter model of F2 and F3 at vowel onset show a high classification score of 91% at the slow rate for /ɕ/, with the confusion falling between the other two fricatives on these parameters (Table 4). Although the identification rates for /s ʂ/ at the slow rate (82% and 71% respectively) were well above chance level (33%), there was also marked confusion between them (26% of /ʂ/ misclassified as /s/ and 13% of /s/ misclassified as /ʂ/). Table 4 also shows a similar pattern of classification scores for the fast speech condition. A mixed model with the binary response correct/incorrect classification based on these classifications from the combined F2 and F3 onset values, and with the same fixed and random factors as deployed earlier (Section 3.1.2), showed no significant effects for either rate or word-type: thus once again, the classification scores were unaffected by rate or by whether the word was real or a non-word. Predictably, the classification scores were significantly influenced by consonant place of articulation (χ2 = 23.6, p < .001).

Table 4 Results for correct classification of the three sibilants with Gaussian training/testing based on the F2 and F3 values at vowel onset, using the leave-one-out method, i.e. with a total of nine speakers (k = 1, 2 . . . 9), test on speaker k, train on the other eight speakers. The results are shown for classifications from the slow speech rate and from the fast rate in parentheses.

4 General discussion

The main aim of the present study has been to shed light on the acoustic and articulatory characteristics of the three Polish sibilants /s ʂ ɕ/ and to test whether the greater phonetic instability in /ʂ/ may be the source of the reduction of the three-way contrast to a two-way distinction that has been observed in certain Polish varieties and in Mandarin Chinese. We begin by considering in turn the degree to which the fricatives were separated in the noise and in the transitions.

Earlier studies have generally reported a very high separation between /s ʂ ɕ/ when listeners are presented with noise sections alone (Nowak 2006). Our physiological data for nine speakers show quite unequivocally that /s ʂ ɕ/ were all distinguished on the basis of the position and configuration of the tongue. In particular, /s/ was (predictably) shown to have a very forward tongue-tip constriction, whereas the constriction was most retracted for /ʂ/: this result is consistent with physiological analyses of fricatives in Mandarin Chinese by Hu (2008) and Proctor et al. (2012), who showed a more retracted position for /ʂ/ than for /s ɕ/. The tongue-tip retraction in our data came about because the tip was (as is typical for retroflex consonants) curled back towards the hard palate. This posture was also the main characteristic that differentiated it from /ɕ/; that is, /ɕ ʂ/ differed according to the rotation of the tongue tip about the axis perpendicular to the sagittal plane. Just these two parameterizations of the tongue tip (horizontal position, rotation) were sufficient for an almost complete separation between /s ʂ ɕ/. Thus the very high perceptual distinction between these fricatives based on noise found by Nowak (2006) is likely to be related to the marked physiological differences in tongue position and orientation found in our study.

Our acoustic analysis of the fricative noise was consistent with that of Nowak (2006) and others (e.g. Jassem 1995) in showing a very clear separation between /s/ and the other two categories based on the greater concentration of energy at higher frequencies. According to Halle & Stevens (1997), theoretical considerations of vocal tract modeling suggest that energy typically found in the region associated with the second formant frequency should be lower for /ʂ/ than for /ɕ/: /ɕ/ has a much narrower palatal constriction that suppresses back cavity resonances, leading to an energy increase in the spectral region close to F2. In general, such a difference should result in a slightly greater weighting of spectral energy towards lower frequencies for /ʂ/ than for /ɕ/. This is exactly what is evident from the ensemble-averaged spectra in Figure 5 above, which show a spectral peak in the vicinity of 2 kHz (i.e. in the region of F2) for /ʂ/ that is absent for /ɕ/. These observed differences are consistent with the findings of Li, Edwards & Beckman (2007), who found that energy in this F2 region of the noise spectrum effectively distinguished between /ʂ ɕ/ in Mandarin Chinese. Beyond these differences, and consistently with Nowak (2006), our study shows very similar spectral shapes for /ʂ ɕ/: that is, /ʂ ɕ/ differed principally in that a similar spectral shape occurred at slightly lower frequencies for the retroflex. Compatibly, Żygis & Hamann (2003) showed that a lower spectral centre of gravity of the noise separated /ʂ/ from /ɕ/ for their female speaker, although not for their male speaker.
In the semi-open classification test in which we trained and tested the three fricative categories based on a DCT parameterization of the noise spectrum, around 25% of /ʂ ɕ/ were confused with each other, even though the classification rates were well above chance. This result suggests that, in spite of the very clear physiological distinction, the acoustics of the fricative noise alone are unlikely to provide sufficient information for their separation in more casual, spontaneous speech.
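The DCT parameterization referred to here compresses a (mel-scaled) dB spectrum into a few coefficients, of which the first indexes spectral slope and the second curvature (cf. Figure 6 above and Harrington 2010). The sketch below is a minimal illustration of such a parameterization, not the study's scripts; the function name and test spectra are our own assumptions.

```python
import numpy as np

def dct_coefficients(spectrum_db, num=3):
    """First `num` discrete cosine transform coefficients of a dB spectrum:
    C0 is proportional to the mean level, C1 to the (inverse) linear slope,
    and C2 to the curvature of the spectrum."""
    n = len(spectrum_db)
    k = np.arange(num)[:, None]          # coefficient index
    m = np.arange(n)[None, :]            # spectral channel index
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return basis @ spectrum_db / n

# A falling spectrum (energy weighted towards low frequencies, as for
# /ʂ ɕ/ relative to /s/) yields a positive C1; a U-shaped spectrum
# yields a positive C2 (curvature).
falling = np.linspace(60.0, 20.0, 128)       # dB values over frequency
u_shaped = np.linspace(-1.0, 1.0, 128) ** 2  # high edges, low centre
c_fall = dct_coefficients(falling)
c_u = dct_coefficients(u_shaped)
```

A purely linear spectrum has (numerically) zero curvature, so C2 separates spectra by shape independently of their tilt, which is what allows C1 × C2 plots like Figure 6 to separate the sibilant classes.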

Numerous studies over the last 50 years have shown that formant transitions contribute information to place of articulation distinctions among fricatives. This was shown to be especially so for the non-sibilants /f θ/ in English (Harris 1958). However, other studies have shown that formant transitions into the following vowel can also be important for the /s ʃ/ separation (Delattre, Liberman & Cooper 1962, Soli 1981, Whalen 1991, Lisker 2001, Gordon, Barthmaier & Sands 2002, Li et al. 2007; see Wagner, Ernestus & Cutler 2006 for a comprehensive review). Drawing on an analysis of Shona fricatives, Bladon, Clark & Mickey (1987) were among the first to suggest that formant transitions may be critical in languages with a three-way place of articulation contrast in sibilants. The results from our study show that the second formant transition provided especially salient information for identifying /ɕ/. In agreement with Nowak (2006) and Sawicka (1995), our results also show that /ɕ/ exerted a strong coarticulatory influence on adjacent vowels, causing them to be palatalized: in particular, our physiological data showed a raised tongue-dorsum position at vowel onset extending well into the vowel for /ɕ/ in all speakers and contexts (Figure 4 above) and a concomitantly raised F2 throughout the first half of the vowel, once again in all speakers and contexts. This finding is also consistent with a perceptual study by Lisker (2001), who showed that English listeners were able to separate /ʂ ɕ/ quite reliably on the basis of acoustic information in the vowel alone.
A further new finding from our study is that not just F2 but also F3 may contribute to this distinction. The acoustic theory of speech production predicts that retroflex consonants should be associated with F3 lowering (Fant 1960), and the results from our study show that F3 was lower for /ʂ/ than for the other two fricatives. F3 lowering for /ʂ/ was also found in the acoustic analysis of the Toda language by Gordon et al. (2002). The main result from our semi-open categorizations based on a two-dimensional space of F2 and F3 at vowel onset was that classification scores were well above chance and that almost 90% of /ɕ/ could be identified from this information in the vowel. While the classification scores for /ʂ/ were high at just over 70%, the same data also show substantial /s ʂ/ confusion, such that 12% of /s/ were misclassified as /ʂ/ and 26% of /ʂ/ as /s/. However, this confusion would presumably be resolved in combination with the fricative noise, which according to our analyses enabled an almost 95% separation of /s/ from the other two fricative categories. Overall, then, the general finding from this study is that the fricative noise provides positive information for the separation of /s/ from /ʂ ɕ/ and that the transitions distinguish /ɕ/ from /s ʂ/; the successful identification of /ʂ/ from acoustic data must therefore depend on information both in the noise (to separate it from /s/) and in the vowel (to separate it from /ɕ/). Our study shows that /s/ can be distinguished from the other two fricative categories with reference to information in the noise alone (the energy in the noise spectrum is concentrated in the upper part of the spectrum) and that /ɕ/ can be separated from the other two categories using information in the vowel alone (F2 at vowel onset is high).
By contrast, /ʂ/ requires two sets of cues for its identification: information in the noise (the energy must be concentrated in the lower part of the spectrum, distinguishing it from /s/) and information in the vowel (F2 and F3 must be low, distinguishing it from /ɕ/).

In contrast to an earlier analysis (Bukmaier et al. 2014), our study also showed that rate had no effect on either the fricative noise or the vowel transitions. This was not because the speakers did not vary their speaking rate: for every one of the speakers, the durations of both the fricative noise and the vowel were shorter at the fast than at the slow tempo. We currently have no explanation for the divergent findings between the present study and Bukmaier et al. (2014), but we can tentatively conclude that rate effects are unlikely to be a synchronic factor involved in diachronic /ʂ/ attrition.

Finally, we consider whether the retroflex is likely to be the most unstable of the three categories from both a synchronic and a diachronic perspective. The instability of the retroflex has been suggested, independently of this study, by both Duanmu (2006) and Nowak (2006), who point to the likelihood of the collapse of a three-way to a two-way contrast in many varieties of Mandarin and Polish, typically because of a merger of the dental and retroflex consonants. Similarly, the Taiwan variety of Mandarin lacks the three-way sibilant contrast found in standard Mandarin because, under the influence of Min (which lacks retroflex consonants), the retroflex is frequently substituted by the dental fricative (Chuang & Fon 2010). Chuang & Fon's study also showed that under prosodic prominence speakers typically enhanced only one of the two /s ʂ/ fricatives, rather than both, and in most cases the enhancement was in /s/. As Ladefoged & Bhaskararao (1983) point out, it is the complexity of the gestures involved in the production of retroflex consonants which may explain not only the type of diachronic changes noted above, but also why retroflexes are typologically rare, occurring only in languages with large coronal inventories (i.e. there is no known language that has retroflex consonants as its only coronals). To this we would add that it is perhaps not just the articulatory complexity but also the non-linear relationship between articulation and acoustics that may make /ʂ/ unstable: that is, whereas in our study the retroflex was unambiguously separated from the other two categories on the basis of tongue position, it remained highly confusable both with /ɕ/ in the fricative noise and with /s/ in the vowel.
Thus /ʂ/ may be an example of what in Lindblom's (1998) model is considered a high-cost articulation: one involving complex articulatory maneuvers that nevertheless yield only a limited degree of acoustic or perceptual salience relative to the other fricative categories with which it contrasts.

Studies and analyses of child language acquisition also point to the relative instability of /ʂ/. Studies of Polish have suggested that /ʂ/ is only acquired after /s ɕ/ (Łukaszewicz 2006) and/or that the contrast between the dental and retroflex places of articulation emerges quite late (Łobacz 1996). Moreover, Nittrouer & Studdert-Kennedy (1987) and Nittrouer (1992, 2002) provide evidence that children rely much more than adults on dynamic as opposed to static information for phonetic categorization: for example, young children make far greater use of vowel transitions than of the noise for fricative categorization. With increasing age, this relationship changes, so that children progressively take greater advantage of the information available in the fricative noise. In terms of the present Polish data, such a model predicts that the /s ʂ/ distinction should be the most vulnerable and prone to confusion: this is both because, as the present study shows, there is insufficient information in the vowel for a clear /s ʂ/ separation, and because children would be, according to Nittrouer's model, less able to take advantage of the critical cues for the separation of /ʂ/ from the other fricatives in the comparatively much more static noise section. Further empirical analyses of these fricatives need to be conducted in order to test whether the diachronic instability of /ʂ/ has its origins in the greater confusion in the production and perception of /s ʂ/ by children that is predicted by the results of the present study.

Acknowledgements

This research was supported by ERC grant number 295573 ‘Sound change and the acquisition of speech’. We are grateful to three anonymous JIPA reviewers for their comments on an earlier version of this paper.

Appendix. Calculating spectra using the multitaper methodology

A reviewer suggested that ensemble-averaging requires special conditions following a multitaper methodology (Jesus & Shadle 2002; Shadle 2006, 2010). We tested whether this was so by calculating spectra using a multitaper approach in Matlab (Percival & Walden 1993). We used the default settings with a value of four for the time-bandwidth product (which in turn determines the number of tapers used); the multitaper analysis also used Thomson's (1982) adaptive nonlinear method for combining the individual spectral estimates. As Figure A1 shows, we obtained almost identical results with our approach (Figure 5 in the main text above) and with the multitaper method. Our results are therefore consistent with those in Reidy (2015), showing that, while the multitaper approach may be beneficial for estimating peaks and troughs in the spectrum, its use makes very little difference to parameterizations such as spectral moments (or the types of DCT coefficients we have used here) that are based on the sum of amplitude estimates.
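A multitaper spectrum of the kind described above can also be computed outside Matlab. The sketch below (Python with NumPy/SciPy rather than the Matlab routines used here) combines the eigenspectra with equal weights instead of Thomson's adaptive weighting, so it is a simplified version of the procedure, not a reimplementation of our analysis; the window length and sampling rate are illustrative assumptions.

```python
import numpy as np
from scipy.signal import windows

def multitaper_spectrum(x, fs, nw=4.0):
    """Multitaper power spectrum: average the periodograms obtained with
    2*NW - 1 orthogonal DPSS (Slepian) tapers, with equal weighting
    rather than Thomson's (1982) adaptive combination."""
    n = len(x)
    k = int(2 * nw) - 1                     # conventional number of tapers
    tapers = windows.dpss(n, nw, Kmax=k)    # shape (k, n)
    eigenspectra = np.abs(np.fft.rfft(tapers * x, axis=1)) ** 2
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs, eigenspectra.mean(axis=0)

# Example: 40 ms of noise at 16 kHz, roughly the scale of a window
# centred on a fricative's temporal midpoint.
fs = 16000
rng = np.random.default_rng(0)
x = rng.standard_normal(int(0.04 * fs))
freqs, psd = multitaper_spectrum(x, fs)
spectrum_db = 10 * np.log10(psd)            # dB, as used for ensemble-averaging
```

Averaging several orthogonally tapered periodograms reduces the variance of the spectral estimate relative to a single windowed FFT, which is why the multitaper estimate is smoother at individual peaks and troughs while summed parameterizations (moments, DCT coefficients) change very little.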

Figure A1 Ensemble-averaged spectra calculated using the methodology in Figure 5 (original) and using the multitaper approach (taper) shown separately for the three fricative categories.

Footnotes

1 We also tested whether there was any effect of lexical frequency on the classification scores. Lexical frequencies for the real words were obtained from SUBTLEX-PL, a freely available database of Polish word frequencies based on 101 million words from movie subtitles (Mandera et al. 2014). The database includes word frequencies transformed to the Zipf scale, a logarithmic scale of frequency per billion words (for further details see also van Heuven et al. 2014). We ran a mixed model with the binary response correctly/incorrectly classified consonant from the spectral data (Table 3) as the dependent variable, with place of articulation (three levels: /s ʂ ɕ/) and lexical frequency as fixed factors, and with speaker (nine levels) and word (22 levels) as random factors. We also ran another mixed model with the classification scores from the vowel transitions (Table 4) as the dependent variable. There was no influence of lexical frequency on classification scores in either case.
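For concreteness, the Zipf scale mentioned in this footnote maps a raw corpus count onto log10 of its frequency per billion words, equivalently log10 of frequency per million plus 3 (van Heuven et al. 2014). A minimal sketch, with hypothetical counts and a simple +1 smoothing rather than the exact smoothing used in the SUBTLEX databases:

```python
import math

def zipf(count, corpus_size):
    """Zipf value = log10(frequency per billion words).
    The +1 keeps unseen words finite (a simplification of the
    smoothing used in the SUBTLEX databases)."""
    per_billion = (count + 1) * 1e9 / corpus_size
    return math.log10(per_billion)

corpus = 101_000_000          # approximate size of SUBTLEX-PL in words
# A word occurring about once per thousand words lands near the top of
# the scale; an unseen word lands near the bottom.
print(round(zipf(101_000, corpus), 1))   # → 6.0
print(round(zipf(0, corpus), 1))         # → 1.0
```

The logarithmic transform is what makes the scale usable as a fixed factor in a mixed model: raw counts span several orders of magnitude, whereas Zipf values fall on a roughly linear 1–7 scale.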

References

Bladon, Anthony, Clark, Christopher & Mickey, Katrina. 1987. Production and perception of sibilant fricatives: Shona data. Journal of the International Phonetic Association 17, 39–65.
Brown, Keith (ed.). 2006. Encyclopedia of language and linguistics, 2nd edn. Oxford: Elsevier.
Bukmaier, Véronique, Harrington, Jonathan, Reubold, Ulrich & Kleber, Felicitas. 2014. Synchronic variation in the articulation and the acoustics of the Polish place distinction in sibilants and its implications for diachronic change. 15th Annual Conference of the International Speech Communication Association (Interspeech 2014), Singapore.
Cheon, Sang Y. & Anderson, Victoria B. 2008. Acoustic and perceptual similarities between English and Korean sibilants: Implications for second language acquisition. Korean Linguistics 14, 41–64.
Chuang, Yu-Ying & Fon, Janice. 2010. The effect of dental and retroflex sibilants in Taiwan Mandarin spontaneous speech. Proceedings of the 5th International Conference on Speech Prosody, No. 100414.
Delattre, Pierre C., Liberman, Alvin M. & Cooper, Franklin S. 1962. Formant transitions and loci as acoustic correlates of place of articulation in American fricatives. Studia Linguistica 16, 104–121.
Duanmu, San. 2006. Chinese (Mandarin): Phonology. In Brown (ed.), 351–355.
Evers, Vincent, Reetz, Henning & Lahiri, Aditi. 1998. Crosslinguistic acoustic categorization of sibilants independent of phonological status. Journal of Phonetics 26, 345–370.
Fant, Gunnar. 1960. Acoustic theory of speech production. The Hague: Mouton.
Forrest, Karen, Weismer, Gary, Milenkovic, Paul & Dougall, Ronald D. 1988. Statistical analysis of word-initial voiceless obstruents: Preliminary data. Journal of the Acoustical Society of America 84, 115–123.
Fujisaki, Hiroya & Kunisaki, Osamu. 1978. Analysis, recognition, and perception of voiceless fricative consonants in Japanese. IEEE Transactions (ASSP) 26, 21–27.
Gordon, Matthew, Barthmaier, Paul & Sands, Kathy. 2002. A cross-linguistic acoustic study of voiceless fricatives. Journal of the International Phonetic Association 32, 141–174.
Gussmann, Edmund. 2007. The phonology of Polish. Oxford: Oxford University Press.
Halle, Morris & Stevens, Kenneth N. 1997. The postalveolar fricative of Polish. In Kiritani, Shigeru, Hirose, Hajime & Fujisaki, Hiroshi (eds.), Speech production and language: In honour of Osamu Fujimura, 177–193. Berlin: Mouton de Gruyter.
Hamann, Silke. 2002a. Retroflexion and retraction revised. In Hall, Tracy A., Pompino-Marschall, Bernd & Rochoń, Marzena (eds.), Papers on phonetics and phonology: The articulation, acoustics and perception of consonants (ZAS Papers in Linguistics 28), 13–26. Berlin: Zentrum für Allgemeine Sprachwissenschaft (ZAS).
Hamann, Silke. 2002b. Postalveolar fricatives in Slavic languages as retroflexes. In Baauw, Sergio, Huiskes, Mike & Schoorlemmer, Maaike (eds.), OTS yearbook 2002, 105–127. Utrecht: Utrecht Institute of Linguistics.
Harrington, Jonathan. 2010. Phonetic analysis of speech corpora. Chichester: Wiley-Blackwell.
Harris, Katherine S. 1958. Cues for the discrimination of American English fricatives in spoken syllables. Language and Speech 1, 1–7.
Heinz, Jeff & Stevens, Kenneth N. 1961. On the properties of fricative consonants. Journal of the Acoustical Society of America 33, 589–593.
Hoole, Phil, Bombien, Lasse, Kühnert, Barbara & Mooshammer, Christine. 2009. Intrinsic and prosodic effects on articulatory coordination in initial consonant clusters. In Fant, Gunnar, Fujisaki, Hiroshi & Shen, Jiaxuan (eds.), Frontiers in phonetics and speech science: Festschrift for Wu Zongji, 275–286. Beijing: Commercial Press.
Hoole, Phil & Zierdt, Andreas. 2010. Five-dimensional articulography. In Maassen, Ben & van Lieshout, Pascal H. H. M. (eds.), Speech motor control: New developments in basic and applied research, 331–349. Oxford: Oxford University Press.
Hoole, Phil & Mooshammer, Christine. 2002. Articulatory analysis of the German vowel system. In Auer, Peter, Gilles, Peter & Spiekermann, Helmut (eds.), Silbenschnitt und Tonakzente, 129–152.
Hu, Fang. 2008. The three sibilants in Standard Chinese. Proceedings of the 8th International Seminar on Speech Production (ISSP 2008), 105–108.
Jassem, Wiktor. 1995. The acoustic parameters of Polish voiceless fricatives: Analysis of variance. Phonetica 52, 252–258.
Jesus, Luis M. T. & Shadle, Christine H. 2002. A parametric study of the spectral characteristics of European Portuguese fricatives. Journal of Phonetics 30, 437–464.
Jongman, Allard. 1989. Duration of fricative noise required for identification of English fricatives. Journal of the Acoustical Society of America 85, 1718–1725.
Jongman, Allard, Wayland, Ratree & Wong, Serena. 2000. Acoustic characteristics of English fricatives. Journal of the Acoustical Society of America 108 (3), 1252–1263.
Ladefoged, Peter & Bhaskararao, Peri. 1983. Non-quantal aspects of consonant production: A study of retroflex consonants. Journal of Phonetics 11, 291–302.
Ladefoged, Peter. 2001. Vowels and consonants. Malden, MA: Blackwell.
LaRiviere, Conrad, Winitz, Harris & Herriman, Eve. 1975. The distribution of perceptual cues in English prevocalic fricatives. Journal of Speech and Hearing Research 18, 613–622.
Li, Fangfang, Edwards, Jan & Beckman, Mary E. 2007. Spectral measures for sibilant fricatives of English, Japanese, and Mandarin Chinese. Proceedings of the 16th International Congress of Phonetic Sciences (ICPhS XVI), vol. 4, 917–920.
Li, Fangfang, Munson, Ben, Edwards, Jan, Yoneyama, Kiyoko & Hall, Kathleen C. 2011. Language specificity in the perception of voiceless sibilant fricatives in Japanese and English: Implications for cross-language differences in speech-sound development. Journal of the Acoustical Society of America 129, 999–1011.
Lindblom, Björn. 1998. Systemic constraints and adaptive change in the formation of sound structure. In Hurford, James R., Studdert-Kennedy, Michael & Knight, Chris (eds.), Approaches to the evolution of language: Social and cognitive bases, 242–264. Cambridge: Cambridge University Press.
Lisker, Leigh. 2001. Hearing the Polish sibilants [s š ś]: Phonetic and auditory judgements. In Grønnum, Nina & Rischel, Jørgen (eds.), To honour Eli Fischer-Jørgensen (Travaux du Cercle Linguistique de Copenhague XXX), 226–238.
Łobacz, Piotra. 1996. Polska fonologia dziecięca [Polish child phonology]. Warszawa: Energeia.
Lobanov, Boris. 1971. Classification of Russian vowels spoken by different speakers. Journal of the Acoustical Society of America 49, 606–608.
Łukaszewicz, Beata. 2006. Extrasyllabicity, transparency and prosodic constituency in the acquisition of Polish. Lingua 116, 1–30.
Łukaszewicz, Beata. 2007. Reduction in syllable onsets in the acquisition of Polish: Deletion, coalescence, metathesis, and gemination. Journal of Child Language 34 (1), 52–82.
Mandera, Paweł, Keuleers, Emmanuel, Wodniecka, Zofia & Brysbaert, Marc. 2014. Subtlex-pl: Subtitle-based word frequency estimates for Polish. Behavior Research Methods 47 (2), 471–483.
Maniwa, Kazumi, Jongman, Allard & Wade, Thomas. 2009. Acoustic characteristics of clearly spoken English fricatives. Journal of the Acoustical Society of America 125 (6), 3962–3973.
Mann, Virginia & Repp, Bruno. 1980. Influence of vocalic context on the perception of the [ʃ]–[s] distinction. Perception & Psychophysics 28, 213–228.
McGuire, Grant. 2007. English listeners’ perception of Polish alveopalatal and retroflex voiceless sibilants: A pilot study (UC Berkeley Phonology Lab Annual Report).
Narayanan, Shrikanth, Alwan, Abeer & Haker, Katherine. 1995. An articulatory study of fricative consonants using magnetic resonance imaging. Journal of the Acoustical Society of America 98 (3), 1325–1347.
Nittrouer, Susan. 1992. Age-related differences in perceptual effect of formant transitions within syllables and across syllable boundaries. Journal of Phonetics 20, 1–32.
Nittrouer, Susan. 2002. Learning to perceive speech: How fricative perception changes, and how it stays the same. Journal of the Acoustical Society of America 112, 711–719.
Nittrouer, Susan & Studdert-Kennedy, Michael. 1987. The role of coarticulatory effects in the perception of fricatives by children and adults. Journal of Speech and Hearing Research 30, 319–329.
Nowak, Pawel M. 2006. The role of vowel transitions and frication noise in the perception of Polish sibilants. Journal of Phonetics 34, 139–152.
Percival, Donald B. & Walden, Andrew T. 1993. Spectral analysis for physical applications: Multitaper and conventional univariate techniques. Cambridge: Cambridge University Press.
Proctor, Michael, Lu, Li H., Zhu, Yinghua, Goldstein, Louis & Narayanan, Shrikanth. 2012. Articulation of Mandarin sibilants: A multi-plane realtime MRI study. Proceedings of the 14th Australasian International Conference on Speech Science and Technology, Sydney, Australia.
Reidy, Patrick F. 2015. A comparison of spectral estimation methods for the analysis of sibilant fricatives. Journal of the Acoustical Society of America Express Letters 137 (4), EL248–EL254.
Sawicka, Irena. 1995. Fonologia [Phonology]. In Wróbel, Henryk (ed.), Gramatyka współczesnego języka polskiego: Fonetyka i fonologia [A grammar of contemporary Polish: Phonetics and phonology], 105–194. Warszawa: Wydawnictwo Instytutu Języka Polskiego.
Schiel, Florian. 2004. MAUS goes iterative. Proceedings of the IVth International Conference on Language Resources and Evaluation, Lisbon, Portugal, 1015–1018.
Shadle, Christine H. 2006. Acoustic phonetics. In Brown (ed.), 442–460.
Shadle, Christine H. 2010. The aerodynamics of speech. In Hardcastle, William J. & Laver, John (eds.), The handbook of phonetic sciences, 39–80. Oxford: Blackwell.
Shadle, Christine H. 2012. The acoustics and aerodynamics of fricatives. In Cohn, Abigail C., Fougeron, Cécile & Huffman, Marie K. (eds.), The Oxford handbook of laboratory phonology, 511–526. Oxford: Oxford University Press.
Shadle, Christine H. & Mair, Sheila J. 1996. Quantifying spectral characteristics of fricatives. Proceedings of the International Conference on Spoken Language Processing (ICSLP 96), Philadelphia, PA, 1517–1520.
Soli, Sigfrid D. 1981. Second formants in fricatives: Acoustic consequences of fricative-vowel coarticulation. Journal of the Acoustical Society of America 70, 976–984.
Srivastava, Anuj, Jermyn, Ian & Joshi, Shantanu. 2007. Riemannian analysis of probability density functions with applications in vision. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2007), 1–8.
Stevens, Kenneth N. 1998. Acoustic phonetics. Cambridge, MA: MIT Press.
Thomson, David J. 1982. Spectrum estimation and harmonic analysis. Proceedings of the IEEE 70 (9), 1055–1096.
Toda, Martine, Maeda, Shinji & Honda, Kiyoshi. 2010. Formant-cavity affiliation in sibilant fricatives. In Fuchs, Susanne, Toda, Martine & Żygis, Marzena (eds.), Turbulent sounds: An interdisciplinary guide, 343–374. Berlin & New York: De Gruyter Mouton.
van Heuven, Walter J. B., Mandera, Paweł, Keuleers, Emmanuel & Brysbaert, Marc. 2014. SUBTLEX-UK: A new and improved word frequency database for British English. The Quarterly Journal of Experimental Psychology 67 (6), 1176–1190.
Wagner, Anita, Ernestus, Mirjam & Cutler, Anne. 2006. Formant transitions in fricative identification. Journal of the Acoustical Society of America 120 (4), 2267–2277.
Whalen, Douglas H. 1991. Perception of the English /s/–/ʃ/ distinction relies on fricative noises and transitions, not on brief spectral slices. The Journal of the Acoustical Society of America 90 (4), 1776–1785.
Żygis, Marzena & Hamann, Silke. 2003. Perceptual and acoustic cues of Polish coronal fricatives. Proceedings of the 15th International Congress of Phonetic Sciences (ICPhS XV), Barcelona, 3–9 August, 395–398.
Żygis, Marzena, Pape, Daniel & Jesus, Luis M. T. 2012a. (Non)retroflex Slavic affricates and their motivation: Evidence from Czech and Polish. Journal of the International Phonetic Association 42 (3), 281–329.
Żygis, Marzena, Pape, Daniel & Czaplicki, Bartłomiej. 2012b. Dynamics in sibilant systems: Standard Polish and its dialects (Phonetik & Phonologie 8). Jena: Friedrich-Schiller-Universität Jena.
Żygis, Marzena, Pape, Daniel, Jesus, Luis M. T. & Jaskuła, Marek. 2014a. Intended intonation of statements and polar questions in whispered, semi-whispered and normal speech modes. Proceedings of Speech Prosody, Dublin, 678–682.
Żygis, Marzena, Pape, Daniel, Jesus, Luis M. T. & Jaskuła, Marek. 2014b. How do voiceless consonant clusters contribute to intended intonation? A comparison of normal, whispered, and semi-whispered speech. Proceedings of the 10th International Seminar on Speech Production, Cologne, 476–480.