When a seven is not a seven: Self-ratings of bilingual language proficiency differ between and within language populations

BRENDAN TOMOSCHUK; VICTOR S. FERREIRA; TAMAR H. GOLLAN

doi:10.1017/S1366728918000421

When a seven is not a seven: Self-ratings of bilingual language proficiency differ between and within language populations

Published online by Cambridge University Press: 13 June 2018

BRENDAN TOMOSCHUK ,

VICTOR S. FERREIRA and

TAMAR H. GOLLAN

Show author details

BRENDAN TOMOSCHUK*: Affiliation:
University of California, San Diego
VICTOR S. FERREIRA: Affiliation:
University of California, San Diego
TAMAR H. GOLLAN: Affiliation:
University of California, San Diego
*: Address for correspondence: Brendan Tomoschuk, Department of Psychology, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA, 92093-0109btomoschuk@ucsd.edu

Article contents

Abstract
Analysis 1: Self-ratings and Language combination
Analysis 2: MINT validation
Analysis 3: Exploring language dominance
Analysis 4: Chinese–English bilinguals with different language learning history
Analysis 5: Languages grouped by dominance
General discussion
Footnotes
References

Rights & Permissions

Abstract

Self-ratings of language proficiency are ubiquitous in research on bilingualism, but little is known about their validity, especially when the same scale is used across different types of bilinguals. Self-ratings and picture naming data from 1044 Spanish–English and 519 Chinese–English bilinguals were analyzed in five between- and within-population comparisons. Chinese–English bilinguals scored more extremely than Spanish–English bilinguals, and in opposite directions at different endpoints of the self-ratings scale. Regrouping bilinguals by dominant language, instead of language membership, reduced discrepancies but significant group differences remained. Population differences appeared even in English, though this language is shared between populations. These results demonstrate significant problems with self-ratings, especially when comparing bilinguals of different language combinations; and subgroups of bilinguals who speak the same languages but vary in acquisition history and/or dominance. Objective proficiency measures (e.g., picture naming or proficiency interviews) are superior to self-ratings, to maximize classification accuracy and consistency across studies.

Keywords

bilingualism language dominance self-ratings MINT Oral Proficiency Interview

Information

Type: Research Article
Information: Bilingualism: Language and Cognition , Volume 22 , Issue 3 , May 2019 , pp. 516 - 536

DOI: https://doi.org/10.1017/S1366728918000421 [Opens in a new window]
Copyright: Copyright © Cambridge University Press 2018

Language proficiency is a uniquely important variable in bilingual research. It affects how quickly and effectively bilinguals can access words in their languages, how easily they control language choice and output, and many other phenomena that have implications for understanding linguistic behavior more generally. It is therefore important for researchers to measure a bilingual's language proficiency in the most accurate way possible.

Proficiency is most often measured by self-ratings (Li, Sepanski & Zhao, Reference Li, Sepanski and Zhao2006). Participants are asked to report how well they read, write, speak or comprehend spoken language, typically on a scale of 1 to 7 (or 1 to 10) with 1 representing not at all proficient in a language and 7 being a native speaker of that language. These self-ratings are simple to collect and record. Unfortunately, this simplicity comes with some drawbacks. Self-ratings are vulnerable to the subjectivity and variability of the participants who provide them, as well as more broadly the way researchers frame the questions and the experiment (see Schwarz, Reference Schwarz1999; Dunn & Fox Tree, Reference Dunn and Fox Tree2009). Zell and Krizan (Reference Zell and Krizan2014), in particular, examined the relationship between self-evaluations and performance measures across 22 meta-analyses and found that there was only a moderate correlation between the two (M = 0.29, SD = 0.11).

A related and ongoing discussion within the field has been lack of consistency across researchers in how self-ratings are collected (see Grosjean, Reference Grosjean1998). For this reason, some investigators have developed standardized language history questionnaires, with the intent of reducing between-study variability. One of the most commonly used was developed by Marian, Blumenfeld and Kaushanskya (Reference Marian, Blumenfeld and Kaushanskaya2007), who standardized self-rated proficiency questions and explored the relationship between language background and objective measures of bilingual language proficiency. They administered their questionnaire and a battery of objective proficiency measures (picture naming, passage comprehension, reading fluency, sound awareness and grammaticality judgment) in two different multilingual populations and used a principal components analysis to identify several factors of note when using language background to predict proficiency. In a factor they called “relative L2-L1 competence”, they found that self-rated proficiency of the non-dominant language and estimated current language use combine to account for the most variance (about 25%) in predicting objective proficiency. Many bilingual studies use these results to justify the use of self-ratings, but do not also consider estimated daily language use, acquisition history or other factors the LEAP questionnaire examined.

Although self-ratings are simple and a standardized questionnaire can increase consistency between labs and across experiments relative to not measuring proficiency at all, speakers of different languages can be very different in terms of their linguistic profiles. Languages differ from one another in structure and form, and the people that speak them come from different cultures in which similarly worded questions can take on different meanings. Even within a bilingual language population, some bilinguals may have learned and constantly use both languages at home and at work, while others might have learned one language first and use different languages at home or in school or work, causing language proficiency to vary by setting. Grosjean (Reference Grosjean1998) describes this difference as the complementarity principle, stating that “bilinguals are rarely equally fluent in all language skills in all languages”. These and other cultural and personal differences can affect language proficiency and dominance, which could in turn affect how proficiency is self-rated. It seems unlikely, therefore, that bilinguals from diverse backgrounds would factor all of this variation into a one-dimensional rating of their abilities in each language in the same way. Despite these drawbacks, many researchers still opt for self-ratings rather than objective proficiency. Hulstijn (Reference Hulstijn2012) reports that 55% of 140 empirical studies published in Bilingualism: Language and Cognition did not measure language proficiency objectively.

In this paper we hope to demonstrate the importance of factoring objective measures of proficiency into studies of bilingualism. One such objective measure is the Multilingual Naming test or MINT. The MINT is a standardized picture-naming task in which participants name 68 pictures of varying frequency in both of their languages. It has been validated as a proficiency measure that captures variance in lexical retrieval for bilinguals who speak English, Spanish, and Mandarin (Gollan, Weissberger, Runnqvist, Montoya & Cera, Reference Gollan, Weissberger, Runnqvist, Montoya and Cera2012; Ivanova, Salmon & Gollan, Reference Ivanova, Salmon and Gollan2013; Sheng, Lu & Gollan, Reference Sheng, Lu and Gollan2014), and also appears to function similarly for predicting proficiency in Hebrew–English, Spanish–English and Chinese–English bilingual children and young adults (Gollan, Starr & Ferreira, Reference Gollan, Starr and Ferreira2015). The MINT excludes cognates (translations that are phonologically similar between the two target languages), and words with potential cultural differences (such as abacus which is low frequency in English but higher in Mandarin since it is used as an educational tool in China). While not a catch-all measure of all domains that affect language proficiency (including grammar and syntax), it was developed and measured against the more comprehensive Oral Proficiency Interviews (OPI), and demonstrated to be more accurate than the Boston Naming Task (BNT, Kaplan, Goodglass & Weintraub, Reference Kaplan, Goodglass and Weintraub1983) for capturing bilingual language proficiency. Here we seek to further improve consistency across studies in bilingual research by investigating how effective subjective metrics like self-rated proficiency are at capturing similarities and differences between language combinations, and how well these relate to the MINT.

In the present study, we performed five analyses on two pooled sets of data from previous studies that used the MINT, to measure the extent to which self-rated proficiency scores can reasonably be compared or collapsed across Spanish–English (typically people who grew up in the greater San Diego area) and Chinese–English (people who grew up in China studying at UC San Diego, or Chinese heritage speakers who grew up in the U.S.) bilinguals and with different dominance profiles (English-dominant or other-language dominant, see Table 1 for full participant information). For each of these analyses we investigated this relationship in self-reports of English as well as a bilingual's other language. We also report a simulation that explores the effects suggested by these analyses. One hypothesis is that the simple nature of self-rated proficiency is enough to allow bilinguals to reasonably estimate their own skills and that this estimation will allow for valid comparison between bilingual populations and within-language subgroups. If so, we should see that the relationship between the self-ratings and MINT scores pattern together regardless of bilingual population (Analyses 1 and 2) and within-language subgroups (Analyses 3, 4 and 5). Alternatively, different bilingual sub-groups may rate themselves based on distinct subjective standards: for example, assessing their own performance against different comparison groups. If so, between-group comparisons could reveal substantial differences across groups in chosen self-rating level and objectively measured performance. The latter pattern would raise significant concerns with the use of self-ratings to measure proficiency when comparing or collapsing across bilinguals of different language combinations or even dominance profiles within bilinguals of just one group.

Table 1a. Participant characteristics of Spanish–English bilinguals from Analyses 1,3 and 5.

Table 1b. Participant characteristics of Chinese–English bilinguals from Analyses 1, 3 and 5.

Analysis 1: Self-ratings and Language combination

To examine consistency in self-rated language proficiency between populations, we first looked at MINT scores as a function of self-rated proficiency in both languages, split into Spanish–English bilinguals and Chinese–English bilinguals.

Method

Participants

Spanish–English (n = 992) and Chinese–English (n = 223) bilingual undergraduates at the University of California, San Diego participated in 15 different studies for course credit. All Spanish–English bilinguals reported proficiency in Spanish and English with 702 reporting English as their dominant language, 128 reporting Spanish, and 162 reporting balanced proficiency. All Chinese–English bilinguals reported proficiency in both Mandarin and English with 72 reporting English as their dominant language, 139 reporting Mandarin as their dominant language, and 12 reporting balanced proficiency. Full participant characteristics are listed in Table 1.

Procedure

Bilinguals completed a language history questionnaire in which they rated their proficiency in both languages (and any other they reported knowing) on speaking, reading, writing, and listening on a scale from 1 to 7, with the following anchors: 1 – Almost none, 2 – Very Poor, 3 – Fair, 4 – Functional, 5 – Good, 6 – Very Good, 7 – Like a native speaker. In most cases, bilinguals completed the questionnaire at the beginning of the experiments and the MINT (Gollan et al., Reference Gollan, Weissberger, Runnqvist, Montoya and Cera2012) at the end, first in English and then in Spanish or Mandarin. Forty of the Spanish–English bilinguals completed their language history questionnaire at the end of the experiment, after the MINT.

Analysis

Simple regression was done using the Stats package in R (R Core Team, 2013). Self-rated speaking proficiency was the independent variable and MINT scores – first with either Mandarin or Spanish, and then again with English – were the dependent measures. In this way, self-rated speaking accounts for as much of the variance as possible before the factors of interest are considered.

Results and discussion

Figure 1 illustrates the results of these first analyses with Figure 1a showing the other-language results, and Figure 1b showing English. Figure 1a reveals a crossover interaction showing that, on average, Chinese–English bilinguals obtained higher other-language MINT scores at higher self-ratings and lower MINT scores at lower ratings as compared to Spanish–English bilinguals. To illustrate, Chinese–English bilinguals who rated themselves a 7 (out of 7) in Chinese proficiency scored an average of 59.0 (6.1) out of 68 on the Chinese MINT whereas Spanish–English bilinguals who rated themselves as a 7 in Spanish proficiency scored 50.9 (8.0) out of 68 – that is, greater than a standard deviation difference across language combinations. Conversely, for the bilinguals who rated themselves a 3, Chinese–English bilinguals averaged 30.1 (12.0) out of 68 while Spanish–English bilinguals averaged 42.1 (9.9) out of 68, an even larger difference. Though there are considerably fewer data points at the low than at the high end of the self-rating scale, particularly for Chinese–English bilinguals, these differences resulted in a significant interaction between self-rated proficiency and language combination, as shown in Table 2.

Figure 1. MINT scores as a function of self-rated proficiency in 992 Spanish-English and 223 Chinese-English bilinguals.

Table 2. Regression of other-language MINT scores on to subjective self-rating speaking ability and language combination for Analysis 1, shown in Figure 1a.Footnote ¹

adj. R² = 0.30

Furthermore, Figure 1b also shows a significant interaction (analyses reported in Table 3) between English self-rated speaking and English MINT scores such that Spanish–English bilinguals scored higher in the MINT at any given self-rating as compared to Chinese–English bilinguals, except at the higher end of the scale. This may suggest that Spanish–English bilinguals had higher standards of performance in both languages, but this cannot account for the cross-over pattern found in Figure 1a. Population differences in self-rating, especially in the language both bilingual populations share (English, in this case), could introduce potentially serious problems in studies that use self-ratings to select proficient bilinguals.

Table 3. Regression of English MINT onto subjective self-rating speaking ability and language combination for Analysis 1, shown in Figure 1b.

adj. R² = 0.39

Why might self-rating differences arise between bilinguals of different language combinations? It may be that Chinese–English bilinguals perform more extremely at either end of the self-rated proficiency scale (when rating Chinese), simply due to linguistic differences between the Chinese and Spanish languages, or cultural differences between the populations. Alternatively, it may be that other common factors of bilingualism research (such as first versus second language dominant bilinguals, age of acquisition) may drive this population level effect. Before considering these options, it is important to confirm that the MINT converges across languages with other objective measures of proficiency to a greater extent than with the self-ratings.

Analysis 2: MINT validation

One reason why scores might differ across populations is if the MINT itself introduces a between-population bias. To assess this empirically, in Analysis 2 we examined the validity of the MINT by reanalyzing data from Gollan et al. (Reference Gollan, Weissberger, Runnqvist, Montoya and Cera2012) and Sheng et al. (Reference Sheng, Lu and Gollan2014) together to provide direct comparison of self-rated proficiency across the two different language combinations (something that was not done in the original MINT papers). These studies investigated the validity of the MINT, in English and either Spanish or Chinese by comparing MINT scores to Oral Proficiency Interview (OPI) scores. OPI scores are proficiency ratings given by a single experimenter who is trained to look for specific criteria when determining proficiency level based on a structured face-to-face interview in each language. These interviews were modeled on methods developed by the American Council for Teaching Foreign Languages (ACTFL; see Gollan et al., Reference Gollan, Weissberger, Runnqvist, Montoya and Cera2012). Participant characteristics are listed in Table 4.

Table 4. Participant characteristics for Analysis 2, adapted from Gollan et al. (Reference Gollan, Weissberger, Runnqvist, Montoya and Cera2012) and Sheng et al. (Reference Sheng, Lu and Gollan2014). See original publications for full participant characteristics. Note that Self-Rated Speaking is out of a possible 10 rather than 7 and MINT is out of a possible 1.0 rather than 68.