Summations
- The search identified 58 studies that administered the ASEBA Child Behaviour Checklist or Youth Self-Report forms to sub-Saharan African participants and reported at least one psychometric property.
- Most studies reported only coefficient alpha as a measure of internal consistency. Only nine studies, all from East African countries, were specifically focused on the psychometric properties of the ASEBA forms.
- There is some evidence to support the structural validity and internal consistency of the ASEBA forms. Evidence concerning the content validity, criterion validity, and reliability of the ASEBA forms is limited.
Considerations
- Although most studies used translated versions of the ASEBA forms, very few reported details on the translation and adaptation process. This limited our ability to evaluate the validity of the translation process.
- Inconsistencies and large variation (not only in measurement properties, but also in country, language, and other factors) meant that we could not quantitatively pool the results and arrive at reasonable conclusions.
- Limitations and inconsistencies meant that we were not able to make definitive recommendations regarding the use of the ASEBA forms in sub-Saharan Africa based on the available evidence. More comprehensive, good-quality psychometric studies are needed.
Introduction
Behavioural and emotional problems represent a significant health burden amongst children and adolescents in sub-Saharan Africa (SSA; Cortina et al., 2012; Kusi-Mensah et al., 2019; Mulatu, 1995; Ndetei et al., 2016). As specialist child mental health services in SSA are limited (Jenkins et al., 2010), screening tools, such as the Achenbach System of Empirically Based Assessment (ASEBA) forms (Achenbach & Rescorla, 2000, 2001), are often administered in community settings to detect common childhood emotional and behavioural problems (Hoosen et al., 2018). Screening tools typically do not require specialist training and are generally quick and easy to administer and score, making them particularly advantageous in low- and middle-income settings (Sharp et al., 2014). Despite their widespread use, there are no systematic reviews on the validity, reliability, and cultural appropriateness of the ASEBA forms in the diverse populations and contexts of sub-Saharan Africa. This review seeks to address that gap by evaluating the use and reported psychometric properties of the ASEBA forms in SSA.
The majority of behavioural screening tools used in SSA are developed in North America or Europe and are generally well established in these regions (Fernald et al., 2009; Sharp et al., 2014; Sweetland et al., 2014). Using such tools in SSA is often more efficient and feasible compared to developing new tools locally and also enables comparison of findings cross-culturally (Van Widenfelt et al., 2005). However, tools developed in the global north are generally designed for direct application to English-speaking individuals from Western and urbanised populations (Nezafat Maldonado et al., 2019). It follows that using a tool with individuals who are not English-speaking, or who are from cultures that differ substantially from that of the original target population, may present issues for both administration and interpretation (Sweetland et al., 2014).
There are several challenges associated with translating a tool from English into another language (De Kock et al., 2013; Fernald et al., 2009). For example, an English word or phrase may not have a linguistic equivalent in the target language, or, if an equivalent word or phrase does exist, it may not form part of the vernacular of the target population. A direct translation may also have a slightly different or ambiguous meaning in the target language (Van Widenfelt et al., 2005). Many African languages do not have established terms to describe specific mental illnesses, emotions, or personality traits (Atilola, 2015; Van Eeden & Mantsha, 2007). Poor translations of items that measure psychological constructs may therefore introduce bias, which may, in turn, compromise the validity and reliability of the scores derived from the tool.
It is also important to consider whether constructs being measured by a tool are relevant and understood in the same way in different cultures (i.e., construct equivalence; Van De Vijver & Leung, 2011). This is applicable even when a tool is administered in its original English. A South African study conducted a pilot test to evaluate the cultural appropriateness of items on the Child Behaviour Checklist (CBCL) in English and in two other South African languages (LeCroix et al., 2020; Palin et al., 2009). Feedback from participants led to the removal of the item “sets fires”, intended to measure rule-breaking behaviour. In this context, it is likely that setting fires (e.g., for cooking) is commonplace amongst children and adolescents as part of daily life, and so participants interpreted the item in this way, instead of the intended interpretation (i.e., setting a fire with intent to cause harm or damage). Hence, establishing linguistic and construct equivalence prior to using a tool outside of its original context is critical to ensure that the tool is measuring what it intends to measure.
Measurement, or psychometric, properties (e.g., validity and reliability) are not properties of the tool itself, but are characteristics of the data derived from the tool in a specific context (Zumbo & Chan, 2014). Most applied research studies conduct only rudimentary psychometric evaluations of scores obtained from psychological tools (Dima, 2018; Flake et al., 2017; Vacha-Haase & Thompson, 2011). The result is the use of tools, including behavioural screening tools, without sufficient evidence to support the validity and reliability of the scores derived from the tools in a given context. Scores generated from such a tool may not accurately reflect the ‘true scores’ of the respondents (De Kock et al., 2013). This may, in turn, increase the risk of misdiagnosis, which has implications for referral and the provision of appropriate interventions. Hence, until psychometric equivalence of a behavioural screening tool is established, results should be interpreted with caution.
The ASEBA forms, developed in the United States, are currently used as screening tools for clinical and research purposes in SSA. The ASEBA forms are designed to quickly and effectively measure maladaptive behaviours in children and adolescents. One major advantage of the ASEBA forms is that data can be obtained from multiple informants (i.e., caregiver, teacher, and self-report), allowing for a comprehensive overview of the child’s behaviour in different contexts. The most recent version of the Preschool Forms includes the parent-report Child Behaviour Checklist for Ages 1.5–5 (CBCL/1.5–5) and the Caregiver-Teacher Report Form for Ages 1.5–5 (C-TRF; Achenbach & Rescorla, 2000). The School-Age Forms include the parent-report Child Behaviour Checklist for Ages 6–18 (CBCL/6–18), the Teacher’s Report Form (TRF), and the Youth Self Report (YSR; Achenbach & Rescorla, 2001). In this review, we refer to the parent-report forms collectively as the ‘CBCL’, but when referring to a specific age form, we use the corresponding ASEBA abbreviation (e.g., CBCL/1.5–5 or CBCL/6–18).
The forms are presented as lists of items describing a range of behaviours (e.g., “avoids looking others in the eye”, “has trouble getting to sleep”). Respondents indicate their agreement with the items by selecting “not true” (scored as 0), “somewhat or sometimes true” (scored as 1), or “very true/often true” (scored as 2). These scores are summed to provide a Total Problems score, where higher scores indicate the presence of more problem behaviours. Items are grouped into syndrome scales, which are further grouped into two broad band scales (Internalizing Problems and Externalizing Problems; see Figs. 1 and 2). Items are also grouped into DSM-oriented scales (see Figs. 1 and 2), aligned with diagnostic criteria for a number of disorders specified in the fifth edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5; American Psychiatric Association, 2013).
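The summation logic described above can be sketched in a few lines of Python. This is an illustrative sketch only: the item IDs, response labels, and scale groupings below are hypothetical placeholders (the licensed ASEBA item content is not reproduced), and in practice raw scale scores are converted to normed T-scores using the ASEBA manuals, a step omitted here.

```python
# Hypothetical sketch of ASEBA-style raw-score summation (not licensed content).
# Each response label maps to a 0/1/2 score; scale scores are sums of item scores.

RESPONSE_SCORES = {
    "not true": 0,
    "somewhat or sometimes true": 1,
    "very true/often true": 2,
}

def score_form(responses, scales):
    """Sum item scores into raw scale scores.

    responses: {item_id: response label}
    scales: {scale_name: [item_ids belonging to that scale]}
    Returns per-scale raw sums plus a Total Problems score over all items.
    """
    item_scores = {item: RESPONSE_SCORES[label] for item, label in responses.items()}
    scores = {name: sum(item_scores[i] for i in items) for name, items in scales.items()}
    scores["Total Problems"] = sum(item_scores.values())
    return scores

# Hypothetical three-item example
responses = {
    "q1": "not true",
    "q2": "somewhat or sometimes true",
    "q3": "very true/often true",
}
scales = {"Internalizing": ["q1", "q2"], "Externalizing": ["q3"]}
print(score_form(responses, scales))
```

Under this layout, a syndrome or broad band scale is simply a named subset of items, which mirrors how the review compares studies that used individual subscales versus the Total Problems score.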
It is not clear where or how the ASEBA forms are used in sub-Saharan Africa, or to what extent the scores from the ASEBA forms have been evaluated for their validity, reliability, and cultural appropriateness for the diverse populations and contexts in this region. This study had four primary objectives, namely (i) to collate all studies that used the ASEBA forms with sub-Saharan African (SSAn) participants, (ii) to describe the use of the ASEBA forms across SSA (including the use of translations), (iii) to evaluate the reported psychometric properties of the scores of different forms and subscales, and finally, (iv) to make recommendations regarding the use of the ASEBA forms in SSA based on available evidence.
Methods
We searched PubMed, EBSCO (APA PsycInfo, APA PsycArticles, ERIC, Academic Search Premier, Health Source: Nursing/Academic Edition, Africa-Wide Information, CINAHL), Scopus, and Google Scholar databases. In addition, we searched ProQuest and the University of Cape Town’s (UCT) library database for relevant dissertations/theses, book chapters, and conference abstracts. A detailed overview of the search strategy is presented in Table S1 in the digital supplement.
Search strategy
For all database searches, the ASEBA terms “child behaviour checklist” OR “child behavior checklist” OR CBCL OR “Youth Self-Report” OR “Teacher’s Report Form” were added as the first line of the search. Although most publications used the American spelling (“behavior”), the British spelling was included so as not to exclude articles from journals that use it. A preliminary search including only the ASEBA terms revealed a number of journal articles that referred to the CBCL as the “Children’s Behaviour Checklist”, “Child Behaviour Check List”, or “CBC”. However, including these variants in the final search terms made no substantial difference to the number of search results. We conducted another trial search without inverted commas around “Child Behaviour Checklist” to check whether authors were using the tool name more loosely. However, this returned many more results, most of which were not meaningful (i.e., they referred to other tools with the word ‘checklist’ in their names). We did not include the abbreviations for the Youth Self-Report (YSR) or the Teacher’s Report Form (TRF): including “YSR” did not substantially affect the number of results, and including “TRF” generated too many irrelevant results.
The SSAn search terms were adapted from Pienaar et al. (2011) and the list of SSAn countries from the United Nations’ Standard Country or Area Codes for Statistical Use (United Nations Statistics Division, 1998). We excluded surrounding islands and territories (e.g., Madagascar, Comoros) from the SSAn search terms, as we were primarily interested in continental SSAn countries. For four countries, we included alternative names (e.g., “Ivory Coast” in addition to “Côte d’Ivoire”). After conducting preliminary searches with the ASEBA and SSA terms described above, we noticed that many results involved African-American samples. Hence, for all searches, we narrowed the search by excluding the following terms: “African-American” OR “African American”.
We did not include any psychometrics-related words or terms (e.g., validity, Cronbach’s alpha) because of (i) the broad range of psychometric-related terms, and (ii) inconsistencies in indexing and reporting of psychometric properties. As it was more likely that a study utilised the CBCL as an outcome measure rather than the tool itself being the subject of the study, no Medical Subject Headings (MeSH terms) were included in the search terms. For the same reason, “all/full text” fields were selected for all lines of the searches, except for one database where this yielded too many results (see Table S1 in the digital supplement). No coverage dates (i.e., year limits) or publication types were specified in any database search. All records were saved to a reference library (Endnote X9), after which duplicate records were removed.
Inclusion criteria
A study was eligible to be included in the final analysis if:
i. The study was written in English.
ii. The study reported original findings.
iii. The study sample (or at least a portion of the sample) was from a SSAn country. Immigrants and refugees currently living outside sub-Saharan Africa were eligible if the study reported specific data for the SSAn participants. For immigrants/refugees, either the child or at least one parent/caregiver had to have been born in a SSAn country.
iv. The study used an ASEBA form (any form or any version) in its standard format and reported the data derived from the tool. Minor adaptations to the tool (e.g., excluding items due to cultural inappropriateness) were acceptable, as long as the modifications were clearly specified and justified, and the tool was still recognisable as an ASEBA form.
v. The study reported psychometric properties (e.g., validity, reliability) of the ASEBA form for the study sample. Inherent in this criterion is the exclusion of case studies.
Screening and review process
Two of the authors (M.R.Z. and C.F.) independently screened and reviewed all records for eligibility. As the ASEBA forms were generally not the subject of the study (and therefore did not appear in the title or abstract), the full text of each article was scanned or read at each stage of the review process until relevance or eligibility (or lack thereof) became clear. At each stage, the reviewers identified and discussed any discrepancies until a consensus was reached. In the event of an impasse, the reviewers presented the article in question to the fourth author (K.A.D.), who made the final decision.
The review comprised three distinct stages:
i. A brief screening of all full-text studies to check for relevance. Did the study include a SSAn sample and use an ASEBA form?
ii. A more thorough screening of the relevant studies to check for eligibility. Did the study describe the SSAn sample (or sub-sample) and were there specific data for those participants? Was the ASEBA form used in the standard way? We excluded studies if the description of the sample or the country of origin was vague or if there were no specific data pertaining to the SSAn participants. We also excluded studies that used an ASEBA form in a non-standard way, as we wanted to reasonably compare the psychometric properties across studies. At this stage, the reviewers scanned the reference lists of relevant articles to look for other literature that did not appear in the original search results.
iii. A review of the studies identified at the second stage to determine whether the study reported psychometric properties of the tool. Any psychometric analyses were acceptable, so long as the statistics were for the study sample (i.e., not from another study or the tool manual). We included studies that met these criteria in the final analysis.
We extracted and summarised key information from the included studies, such as details related to the sample, the country of origin, the ASEBA versions administered, the language(s) of administration, any translation or adaptation processes, and the psychometric analyses conducted. If any details were missing from an article or were unclear, we contacted the corresponding author for clarification or, if applicable, referred to other articles related to the same umbrella study. In the event that the corresponding author did not respond after two attempts to contact them, we noted the uncertainty in our records.
We then evaluated the psychometric properties of the ASEBA forms with reference to COSMIN (COnsensus-based Standards for the selection of health Measurement INstruments) criteria for good measurement properties (Prinsen et al., 2018). COSMIN describes three phases of psychometric evaluation. The first phase involves investigating a tool’s content validity, that is, the extent to which the content of the tool adequately reflects the construct being measured. Specifically, COSMIN recommends that tools be relevant, comprehensive, and comprehensible with respect to the construct of interest and the target population (Terwee et al., 2018b). The second phase concerns evaluating the internal structure of the tool, including structural validity, internal consistency, and cross-cultural validity (measurement invariance). The third phase involves evaluating the remaining measurement properties, including reliability, measurement error, criterion validity, and hypothesis testing for construct validity, including concurrent, convergent, divergent, and known-groups validity (Mokkink et al., 2018b). Each measurement property is rated on a three-point scale of ‘sufficient’, ‘indeterminate’, or ‘insufficient’.
We also assessed the methodological quality of the studies using the COSMIN Risk of Bias checklist (Mokkink et al., 2018a). COSMIN utilises a four-point rating system to grade each measurement property in a study as ‘very good’, ‘adequate’, ‘doubtful’, or ‘inadequate’. The overall rating of the quality of each measurement property is determined by taking the lowest rating of any standards corresponding to that property.
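This "lowest rating counts" rule can be made concrete with a minimal sketch; the helper below is a hypothetical illustration, not part of the COSMIN materials.

```python
# Sketch of COSMIN's lowest-rating rule for methodological quality:
# a property's overall rating is the worst rating among its assessed standards.
# Ratings are listed worst-first so that a smaller index means a worse rating.

RATINGS = ["inadequate", "doubtful", "adequate", "very good"]

def overall_rating(standard_ratings):
    """Return the lowest (worst) rating among the standards for one property."""
    return min(standard_ratings, key=RATINGS.index)

# A property graded 'very good' on two standards but 'doubtful' on one
# receives 'doubtful' overall.
print(overall_rating(["very good", "adequate", "doubtful"]))
```

The design point is simply that one weak standard caps the overall rating, which is why a single inadequate design choice (e.g., an unsuitable statistic) downgrades a study's evidence for that property.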
Results
Search results
A flow diagram adapted from the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) Statement (Moher et al., 2010) details the number of studies included and excluded at each stage of the review process (Fig. 3).
During the search for manuscripts, it became apparent that full-text versions of many “grey” references, including conference abstracts, poster presentations, dissertations/theses, and other unpublished work, were difficult to access. We therefore randomly sampled 20 records from each category of grey literature to check for relevance and to determine whether pursuing the full-text search for these records was worthwhile. Of the 20 conference abstracts/presentations sampled, only one was relevant to the aims of the systematic review. The dissertation category was more promising in terms of relevance; however, we could not access full texts for over 60% of the records in these categories. We experienced similar difficulties when trying to access full-text versions of book chapters (digital and/or hard copies). Considering all these factors, we decided to exclude 245 records of “grey” literature and book chapters, and to focus exclusively on published journal articles. We retained articles from peer-reviewed monographs and reports from reputable institutions with original data, as these papers were easily accessible and contained relevant and useful data.
We could not access the full texts of 43 out of the 804 journal articles (5%); these records were therefore excluded. The remaining 761 full-text articles were screened for relevance. Five hundred and sixty-five records (74%) were deemed irrelevant to the aims and scope of the systematic review. During the screening process, the reviewers discovered 13 possibly relevant journal articles that were not included in the search results. This resulted in a total of 209 journal articles with (possible) SSAn participants who completed the ASEBA forms. Of the 209 studies, 64 (31%) were excluded for reasons related to the sample description and use of the ASEBA forms. Specifically, 20 studies described the African sample vaguely (e.g., “participants from Africa”, with no specific reference to country of origin), and a further 20 studies included SSAn participants in their samples but did not report stratified ASEBA data for those participants. Furthermore, 19 studies used or scored items in non-standardised ways. Examples of non-standardised use included administering a small selection of items independently (i.e., not as part of an established ASEBA subscale), using an incomplete subscale without justification, using a few items as part of a new measure, and altering the standard response format (e.g., from a three-point scale to a two-point scale). We did not have access to ASEBA forms published before 1991, so we could not ascertain whether the three studies that used these older versions administered full or partial subscales. Two studies (reporting data from the same sample) used a modified version of the YSR whose items bore little resemblance to those of the original YSR. Two studies administered the ASEBA forms as part of the study but did not report the actual data. Finally, we excluded two case studies. A brief description of these excluded articles can be found in Table S2 in the digital supplement.
After these exclusions, we were left with 145 studies with specific data for SSAn participants that used the ASEBA forms in the standardised way. A few different articles stemmed from the same ‘umbrella’ study, as indicated by identical or overlapping samples. In addition, a few multi-nation studies utilised the same data set for secondary data analysis. Of the 145 studies, only 58 (40%) reported the psychometric properties of the ASEBA forms for the study sample, and these studies were included in the final analysis. The digital supplement presents a list of the studies without reported psychometric properties (n = 87; Appendix S1).
Study characteristics
Region
Figure 4 illustrates the number of studies from the different SSAn countries. The map presents data from all 145 studies that met the first two inclusion criteria (i.e., SSAn participants and standardised use of the ASEBA forms), regardless of whether psychometric properties were reported. We displayed data from all 145 studies to visualise patterns of ASEBA usage across sub-Saharan Africa. Six studies from outside of sub-Saharan Africa comprised immigrant participants who originated from one of the following fourteen countries: Angola, Eritrea, Ethiopia, Gambia, Ghana, Guinea, Guinea-Bissau, Kenya, Mali, Niger, Nigeria, Sierra Leone, Senegal, and Somalia.
An overview of the map shows the predominance of studies with samples from Southern African (n = 70, 48%) and East African (n = 66, 45%) countries. In Southern Africa, all samples originated from South Africa, while East African samples were distributed across a number of countries, including Kenya (n = 27, 41%), Uganda (n = 19, 29%), Ethiopia (n = 11, 17%), Malawi (n = 4, 6%), Rwanda (n = 2, 3%), Zambia (n = 1, 2%), and Tanzania (n = 1, 2%). In contrast, there were only two studies from Central Africa, which came from the Democratic Republic of the Congo (DRC) and Cameroon, respectively, and only one study from West Africa, which came from Ghana. Many participants from immigrant samples originally came from countries in West Africa.
The distribution of studies across sub-Saharan Africa was similar after narrowing the studies down to those that reported psychometric properties of the ASEBA forms (N = 58). The breakdown of studies by region was as follows: Southern Africa (n = 29, 50%), East Africa (n = 25, 43%), Central Africa (n = 1, 2%) and immigrant samples (n = 3, 5%) from East and West Africa, living in Sweden (n = 1) and the USA (n = 2).
Relevant information about the studies including the reported psychometric properties (n = 58) is presented in Table 1.
CBCL = Child Behaviour Checklist, YSR = Youth Self-Report, CFA = confirmatory factor analysis, IC = internal consistency, TRR = test-retest reliability, IRR = inter-rater reliability, ROC = receiver operating characteristic, NR = not reported.
For journal articles corresponding to the same umbrella study: If the sample sizes and the psychometric statistics were markedly different, these articles were listed separately in different rows. If these differences were negligible, the articles are listed in a single row.
Names of language and ethnic groups are reported verbatim from the articles. The same language is sometimes referred to by slightly different names (e.g., Xhosa = isiXhosa). The syndrome scale name ‘Delinquent Behaviour/Delinquency’ was updated to ‘Rule-Breaking Behaviour’ in the 2000/2001 versions.
# Corresponding author provided clarification and/or additional information upon request.
* Psychometric studies (i.e., the primary focus of the study was the psychometric properties of the ASEBA forms).
a Multi-nation studies.
b Sample comprises refugees from other sub-Saharan African countries.
CFA = confirmatory factor analysis, EFA = exploratory factor analysis, AUC = area under the curve.
The standard has been relaxed for the purpose of this review, as all included studies used Pearson’s or Spearman’s correlation coefficients to estimate reliability.
a We decided that a clinical assessment by a qualified mental health professional (e.g., registered clinical psychologist, psychiatrist, social worker) based on a standardised and validated diagnostic tool (e.g., DSM-5), could be considered a gold standard in this context.
b COSMIN recommends an interval period of two weeks.
c Note that COSMIN recommends the use of weighted kappa to estimate the reliability of ordinal (i.e., non-continuous) variables.
ASEBA versions
Studies were published between 2003 and 2021. Most studies used the most recent versions of the ASEBA forms published in 2000 and 2001 (n = 27, 47%) or older versions published in the 1990s (n = 29, 50%). One study used the 2001 CBCL/6–18 and the 1991 YSR. Three studies (5%) used versions published in the 1980s. The parent-report CBCL forms were used more frequently (n = 38, 66%) than the YSR (n = 28, 48%). Eight studies (14%) administered both the CBCL and the YSR. No studies used the TRF.
Sample characteristics
The majority of samples comprised school-aged children and adolescents, with only four studies (7%) including pre-school-aged children. Participants across all regions were typically from poorly resourced communities or populations with specific vulnerabilities (e.g., refugees, orphans, survivors of trauma or violence, children living with an HIV-infected parent). Although most samples were drawn from communities or schools (i.e., non-clinical populations), six samples (10%) comprised children with one or more illnesses, the most common being HIV and cerebral malaria.
Sample sizes ranged from 17 (Gershoff et al., 2010) to 3516 (Meinck et al., 2019). The average sample size across all 58 studies was 493.16 (SD = 642.77), and the median sample size was 281 (IQR = 105.25–600).
Use of the ASEBA forms
The CBCL and YSR were primarily used as outcome measures of child behavioural and emotional problems. Two studies, one each from South Africa and Kenya, used the CBCL and YSR, respectively, to estimate the prevalence of child behavioural and emotional problems (Cortina et al., 2013; Magai et al., 2018). Twenty-nine studies (50%) used one or both of the broad band scales, ‘Internalising Problems’ and ‘Externalising Problems’, and/or all problem items as a single scale (‘Total Problems’; n = 16, 28%). Twenty-four studies (41%) used one or more of the syndrome scales individually (i.e., not as part of a broadband scale). Only four studies (7%) used one or more of the DSM-oriented scales.
Of the 58 studies, only nine (16%) were “psychometric” studies (i.e., where a primary focus of the study was the psychometric properties of the CBCL or YSR). Interestingly, although half of the studies were from South Africa, all psychometric studies came from East African countries, namely Ethiopia (n = 4), Kenya (n = 2), Uganda (n = 2), and Zambia (n = 1).
Languages of administration, translations, and adaptations
Forty-nine studies (84%) used at least one translated version of the tool, while five studies (9%) administered the forms exclusively in English. Information regarding the language(s) of administration was not available for two studies. In South Africa, 10 out of 29 studies (34%) administered the tool in more than one language. Studies from other regions of SSA typically administered the ASEBA forms in one of the local languages (e.g., Swahili or Luo in Kenya). Two Kenya-based studies obtained official Swahili translations from ASEBA (Magai et al., 2018; Magai et al., 2021). At least one study from Uganda used the Luganda translation of the CBCL prepared by Bangirana et al. (2009). The study from Cameroon used an existing French translation of the CBCL/4–18 from another study (Wadji et al., 2020). One study from Ethiopia also used an existing Amharic translation of the CBCL/6–18 but slightly modified some translated items to improve their comprehensibility for a rural setting (Isaksson et al., 2017). Seven studies, one from Sweden (with Somali immigrants), two from Kenya, and four from the same umbrella study in South Africa, obtained a translation license from ASEBA.
The level of detail reported about the translation and adaptation processes varied considerably. Some studies included statements such as “all research materials were translated and back-translated”. Others reported the translation process in great detail, including the number of people involved at each stage of the process, as well as each person’s qualifications and areas of expertise. Published guidelines on how to approach translation and adaptation of tools vary somewhat but tend to have overlapping features (Sousa & Rojjanasrirat, 2011). According to COSMIN translation guidelines, a ‘very good’ translation process requires (i) at least two independent forward translators with a mother-tongue in the target language, one with expertise in the construct being measured, the other naïve on the construct being measured, (ii) at least two independent back-translators, naïve on the construct being measured, with a mother tongue in the source (original) language, (iii) a clear description of how discrepancies will be resolved, (iv) a review committee (excluding the translators, preferably including the tool developer), (v) a pilot study (e.g., cognitive interview) inspecting the content validity of the translated version with a sample representative of the target population, and (vi) a written feedback report on the translation process (Mokkink et al., 2019). In light of the inconsistent reporting of the translation processes in the included studies, we could not evaluate the translations using these guidelines.
All but three studies that conducted their own translations and adaptations (n = 44) reported using forward and back-translation methods. Thirteen studies reported using an expert panel – typically including cultural advisors, community representatives, local healthcare workers or mental health experts, and psychometricians – to evaluate the tool instructions, response format, and items for conceptual equivalence and cultural appropriateness. Three studies also conducted interviews and focus groups with members of the target population to rate the clarity of the instructions, response format, and individual items. Fourteen studies piloted the translated versions in samples ranging in size from 20 to 200 individuals. Most of these studies did not report detailed findings of the pilot testing or focus groups. However, a few studies removed some items based on community feedback. A study from Uganda did not administer “culturally inappropriate” items on the YSR, including “I set fires” (Eggum et al., Reference Eggum, Sallquist and Eisenberg2011). Another two South African studies from the same umbrella study removed the item “sets fires” from the CBCL (LeCroix et al., Reference LeCroix, Chan, Henrich, Palin, Shanley and Armistead2020; Palin et al., Reference Palin, Armistead, Clayton, Ketchen, Lindner, Kokot-Louw and Pauw2009). Interestingly, one study each from Kenya and Ethiopia also removed this item post-hoc from the YSR and CBCL, respectively, as it did not perform well in confirmatory factor analyses (CFA; Ivanova et al., Reference Ivanova, Achenbach, Dumenci, Rescorla, Almqvist, Weintraub, Bilenberg, Bird, Chen, Dobrean, Döpfner, Erol, Fombonne, Fonseca, Frigerio, Grietens, Hannesdóttir, Kanbayashi, Lambert, Larsson, Leung, Liu, Minaei, Mulatu, Novik, Oh, Roussos, Sawyer, Simsek, Steinhausen, Metzke, Wolanczyk, Yang, Zilber, Zukauskiene and Verhulst2007b; Magai et al., Reference Magai, Malik and Koot2018). 
A study from Ethiopia removed items related to suicide and another two studies with Somali participants removed sex-related items (Hall et al., Reference Hall, Puffer, Murray, Ismael, Bass, Sim and Bolton2014; Murray et al., Reference Murray, Hall, Dorsey and Bolton2018; Osman et al., Reference Osman, Flacking, Schön and Klingberg-Allvin2017).
Psychometric properties
Fifty-six out of 58 studies (97%) reported internal consistency for one or more subscales using coefficient alpha (also known as Cronbach’s alpha). There was substantial variation in the alpha coefficients reported for the same subscale. For example, alpha for the CBCL Internalising Problems scale ranged from 0.66 to 0.95 across 16 studies, and for the same subscale on the YSR from 0.61 to 0.95 across nine studies. There were too few studies in each country and language category to conduct a stratified reliability generalisation meta-analysis. Hence, we were not able to calculate an ‘aggregated’ internal consistency statistic for each translated version of the subscales.
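Part of the appeal of coefficient alpha is that it can be computed directly from raw item scores, as the ratio of summed item variances to the variance of the total scale score. As a minimal illustrative sketch (the data are fabricated 0–2 item responses, not drawn from any included study):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for a (respondents x items) matrix of item scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # sample variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the summed scale score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Fabricated 0-2 responses (5 respondents x 3 items), for illustration only
scores = np.array([[1, 1, 2],
                   [2, 1, 2],
                   [2, 2, 2],
                   [0, 1, 0],
                   [2, 2, 2]])
print(round(cronbach_alpha(scores), 2))  # -> 0.84
```

Because alpha depends on the particular sample’s covariances, the same subscale can legitimately yield different values across samples, which is one reason for the variation noted above.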
Among the South African studies (n = 29), all but two reported only internal consistency statistics for one or more subscales. Eleven studies from South Africa administered the tool in more than one language, but only three of those reported separate alpha statistics for the different translated versions. Two studies conducted separate CFAs for the Internalising Problems and Externalising Problems broadband scales but did not report detailed results for these analyses. All studies from Kenya (n = 10) reported coefficient alpha statistics, three conducted CFAs, and one reported test-retest reliability for the broadband scales, syndrome scales, and DSM-oriented scales. Studies from Ethiopia (n = 7) conducted more comprehensive psychometric analyses. Two studies (both by Ivanova et al., Reference Ivanova, Achenbach, Rescorla, Dumenci, Almqvist, Bilenberg, Bird, Broberg, Dobrean, Döpfner, Erol, Forns, Hannesdottir, Kanbayashi, Lambert, Leung, Minaei, Mulatu, Novik, Oh, Roussos, Sawyer, Simsek, Steinhausen, Weintraub, Winkler Metzke, Wolanczyk, Zilber, Zukauskiene and Verhulst2007a, b, using data from Mulatu, Reference Mulatu1997) conducted CFAs on the CBCL/4-18 and YSR, respectively. Two studies evaluated criterion-related validity using receiver operating characteristic (ROC) curves. Finally, one investigated the test-retest reliability of the YSR, and another investigated combined test-retest and inter-rater reliability of both the CBCL/6-18 and the YSR. All studies from Uganda (n = 6) reported coefficient alpha, and one also reported test-retest reliability. The only study from Zambia conducted a comprehensive psychometric evaluation of the YSR, including CFA, internal consistency, criterion validity, test-retest reliability, and hypothesis testing. Studies from Tanzania (n = 1), Rwanda (n = 1), Cameroon (n = 1), and those from outside SSA with immigrant samples (n = 3) reported coefficient alpha only.
COSMIN evaluation of the psychometric-focused studies
We thoroughly reviewed eight of the nine psychometric-focused studies from East African countries using COSMIN guidelines. We excluded one study from Kenya, as it was the only study that used the CBCL/1.5-5 (Kariuki et al., Reference Kariuki, Abubakar, Murray, Stein and Newton2016). Four of the eight studies evaluated the School-Age version of the CBCL, and six evaluated the YSR (two studies administered both). We decided to limit the COSMIN evaluation to psychometric-focused studies only, as the remaining studies (n = 49) reported only internal consistency statistics, typically for one or two subscales.
Table 2 displays the COSMIN criteria for good measurement properties, with a visual icon allocated to each rating. We added a fourth rating, ‘mixed results’, to indicate a measurement property with different ratings for different sub-groups of participants. The table lists only the measurement properties and criteria that were relevant to the included studies. In this review, there were no published studies that specifically evaluated the content validity, cross-cultural validity, measurement error, or responsiveness of the ASEBA forms.
We relaxed three standards set out by COSMIN. First, the guidelines state that reliability for ordinal scales should be estimated using weighted Cohen’s Kappa (κ). However, as no studies reported this statistic, we accepted Pearson’s or Spearman’s correlation coefficients as acceptable alternatives for estimating reliability. Second, with regards to hypothesis testing for construct validity, COSMIN recommends that reviewers formulate a set of hypotheses about the expected magnitude and direction of the correlations between measures and mean differences in scores between groups, based on theoretical understandings of the construct, prior to the review. This is intended to reduce the possibility of bias and to ensure standardisation across studies. This was not feasible for the present review, as child behaviour is such a broad construct with many possible correlates. Hence, we accepted hypotheses as long as the authors provided evidence to substantiate them. In general, the authors needed to provide a rationale (including empirical evidence) for (i) comparing groups of individuals who differed on a particular characteristic, or (ii) comparing tools that measured similar, related, or unrelated constructs, respectively. For example, one study hypothesised that girls would, on average, report higher levels of internalising symptoms than boys and cited previous studies to support this assumption (Magai et al., Reference Magai, Malik and Koot2018). Another study hypothesised that greater mental health symptom severity/frequency would be associated with lower caregiver support but did not provide any evidence to support this association (Murray et al., Reference Murray, Bolton, Kane, Lakin, Skavenski Van Wyk, Paul and Murray2020). Accordingly, we assigned ‘very good’ and ‘doubtful’ risk of bias ratings to these studies, respectively. Third, for the internal consistency standard, COSMIN also requires “at least low evidence for sufficient structural validity”.
However, we removed this standard for individual studies as the ASEBA forms are well-established worldwide and many studies have confirmed their structural validity (Ivanova et al., Reference Ivanova, Achenbach, Rescorla, Dumenci, Almqvist, Bilenberg, Bird, Broberg, Dobrean, Döpfner, Erol, Forns, Hannesdottir, Kanbayashi, Lambert, Leung, Minaei, Mulatu, Novik, Oh, Roussos, Sawyer, Simsek, Steinhausen, Weintraub, Winkler Metzke, Wolanczyk, Zilber, Zukauskiene and Verhulst2007a, b). COSMIN also recommends that reviewers determine a reasonable ‘gold standard’ prior to assessing the methodological quality of criterion-validity studies. We decided that a clinical assessment by a qualified mental health professional (e.g., registered clinical psychologist, psychiatrist, social worker) based on a standardised and validated diagnostic tool (e.g., DSM-5), could be considered a gold standard in this context. Table 3 displays the COSMIN criteria for evaluating the methodological quality of studies reporting psychometric properties. We assigned a colour to each rating for ease of reading. Table 4 and Table 5 present the combined results of the COSMIN analysis.
All studies measuring structural validity conducted a CFA using tetrachoric or polychoric correlation matrices with a robust weighted least squares estimator, which is recommended for ordinal data (Li, Reference Li2016). In terms of structural validity, two ‘very good’ quality studies, one each from Kenya and Ethiopia, supported the factorial structure of the CBCL syndrome scales using CFAs. The YSR syndrome scales also performed well in CFAs across all four studies, although the methodological quality of three studies was somewhat compromised by smaller sample sizes. These findings suggest that the latent constructs measured by the ASEBA syndrome scales are adequately explained by the specific behavioural problems (i.e., items) in these populations (De Kock et al., Reference De Kock, Kanjee, Foxcroft, Foxcroft and Roodt2013).
Internal consistency was the most commonly reported measurement property, with three and five studies reporting coefficient alpha for CBCL and YSR subscales, respectively. The COSMIN methodological standards for internal consistency are minimal, hence the quality of the methods used for this measurement property was ‘very good’ overall. Alpha coefficients for the broadband scales were generally higher than those for the syndrome scales and DSM-oriented scales. This was not surprising, as the value of alpha is influenced by the length of the tool (Cortina, Reference Cortina1993). For the CBCL, the Aggressive Behaviour, Attention Problems, and Somatic Complaints syndrome scales performed the best in terms of internal consistency, meeting the ‘sufficient’ criteria in at least two out of the three studies. The Withdrawn/Depressed syndrome scale, however, did not meet the necessary criteria in any of the three studies. One study in this category had a relatively small sample size (n = 64; Bangirana et al., Reference Bangirana, Nakasujja, Giordani, Opoka, John and Boivin2009) compared to the other two studies measuring internal consistency of the CBCL. In this smaller study, only one of the syndrome and DSM-oriented scales met the ‘sufficient’ criteria. For the YSR, Somatic Complaints was the only syndrome scale to meet the criteria in at least three of the four studies. Only two studies (one each for the CBCL and YSR) estimated internal consistencies for the DSM-oriented subscales, and these results were consistently insufficient.
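The dependence of alpha on scale length can be made concrete with the standardised-alpha formula, α = k·r̄ / (1 + (k − 1)·r̄), where k is the number of items and r̄ the average inter-item correlation. A brief sketch (the item counts and correlation here are hypothetical, chosen only to mimic a short syndrome scale versus a long broadband scale):

```python
def alpha_from_interitem(k: int, r_bar: float) -> float:
    """Standardised alpha implied by k items with average inter-item correlation r_bar."""
    return k * r_bar / (1 + (k - 1) * r_bar)

# Same modest average inter-item correlation (0.2), different scale lengths:
print(round(alpha_from_interitem(8, 0.2), 2))   # short syndrome-type scale -> 0.67
print(round(alpha_from_interitem(30, 0.2), 2))  # long broadband-type scale -> 0.88
```

Holding item quality constant, a longer scale mechanically yields a higher alpha, which is why broadband-scale alphas should not be compared directly against syndrome-scale alphas.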
The methodological quality of the reliability analyses for both the CBCL and the YSR was either ‘doubtful’ or ‘inadequate’. Reasons for the poorer quality included time intervals between administrations that were too long or too short (e.g., 9 weeks, 5-7 days), participants undergoing an intervention between administrations, and other methodological flaws, such as not specifying how a subset of the sample was selected for re-administration and a lack of evidence that participants were stable on the construct between administrations. In terms of the measurement property itself, the correlation coefficients were consistently insufficient across forms and subscales. Overall, current evidence for the test-retest reliability of the ASEBA forms in these SSAn populations is inadequate.
In terms of criterion validity, the study by Geibel et al. (Reference Geibel, Habtamu, Mekonnen, Jani, Kay, Shibru, Bedilu, Kalibala and Seedat2016) was the only one to use a psychiatric assessment as a gold standard for the ROC analysis. However, these assessments were not based on a standardised clinical diagnostic tool. Two studies developed their own criteria to identify cases (Hall et al., Reference Hall, Puffer, Murray, Ismael, Bass, Sim and Bolton2014; Murray et al., Reference Murray, Bolton, Kane, Lakin, Skavenski Van Wyk, Paul and Murray2020). Murray et al. (Reference Murray, Bolton, Kane, Lakin, Skavenski Van Wyk, Paul and Murray2020) created a four-item screening questionnaire, administered to the child and their caregiver, asking whether the child had significant psychosocial problems (“yes” or “no”). In the study by Hall et al. (Reference Hall, Puffer, Murray, Ismael, Bass, Sim and Bolton2014), refugee camp social workers identified cases using a list of common internalising and externalising symptoms. The social workers’ assessments were then corroborated with caregivers’ responses to a short screening questionnaire. Based on the few studies included in this analysis, there is very limited evidence to substantiate the criterion validity of the ASEBA forms in sub-Saharan Africa.
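For readers less familiar with ROC-based criterion validity, the area under the curve (AUC) equals the probability that a randomly chosen ‘case’ scores higher on the screening tool than a randomly chosen ‘non-case’ (with ties counted as half). A self-contained sketch using fabricated total problem scores, assuming case status has been assigned by some gold-standard assessment:

```python
def auc(case_scores, noncase_scores):
    """Mann-Whitney estimate of the ROC area under the curve:
    P(case score > non-case score) + 0.5 * P(tie)."""
    wins = ties = 0
    for c in case_scores:
        for n in noncase_scores:
            if c > n:
                wins += 1
            elif c == n:
                ties += 1
    return (wins + 0.5 * ties) / (len(case_scores) * len(noncase_scores))

# Fabricated scores, for illustration only
cases = [38, 45, 52, 60]      # children classified as cases by a clinical assessment
noncases = [12, 20, 38, 33]   # children classified as non-cases
print(auc(cases, noncases))   # -> 0.96875
```

An AUC near 0.5 indicates discrimination no better than chance, so the informativeness of the analysis rests entirely on the credibility of the gold standard used to define cases, which is why the choice of reference assessment matters so much in the studies above.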
Four studies conducted some form of hypothesis testing to examine the construct validity of the ASEBA subscales. All four studies evaluated known-groups validity based on clinical characteristics of the sample (i.e., ‘case’ vs ‘non-case’) with mixed results, and one study from Kenya examined sex differences in levels of internalising and externalising behaviours respectively (Harder et al., Reference Harder, Mutiso, Khasakhala, Burke, Rettew, Ivanova and Ndetei2014). The study from Zambia estimated convergent and divergent validity of the Internalising Problems and Externalising Problems subscales on the YSR (Murray et al., Reference Murray, Bolton, Kane, Lakin, Skavenski Van Wyk, Paul and Murray2020). Only two out of the five comparator measures used (measuring post-traumatic stress and well-being, respectively) had adequate psychometric properties in a similar population. For both the Internalising Problems and Externalising Problems subscales, less than 75% of the results were in accordance with the hypotheses.
Discussion
We identified 145 studies that used the ASEBA forms to measure child behaviour problems in SSAn samples. This suggests that the ASEBA forms are used frequently, at least for research purposes, in SSAn contexts. However, fewer than half of the studies reported any measurement properties of the ASEBA forms. Of the studies that did report measurement properties, most reported only coefficient alpha as a measure of internal consistency for the subscales used. The widespread use of the ASEBA forms in sub-Saharan Africa without evaluation of measurement properties warrants consideration. A tool’s measurement properties are inextricably tied to the context in which it is administered. Without sufficient evidence to support the validity of the information derived from the tools used, the dependability of results remains questionable. The tendency of applied researchers to conduct and report limited psychometric evaluations only (i.e., coefficient alpha), without any further investigation or interpretation, remains a challenge to research in this field. Comprehensive psychometric analyses are necessary to arrive at meaningful and accurate conclusions about a tool’s measurement properties (Dima, Reference Dima2018). In addition, psychometric analyses should be reported clearly and comprehensively, and this information should be easily accessible to readers. COSMIN is in the process of developing a checklist for standards on reporting measurement properties (see https://www.cosmin.nl/tools/checklists-assessing-methodological-study-qualities/). This will hopefully aid in developing a standardised and transparent approach to reporting measurement properties in research studies.
Most of the studies included in the final analysis administered at least one translated version of the CBCL or YSR, and almost all translations were created specifically for use in those studies. There were inconsistencies with regard to the reporting of the translation procedures. Descriptions of the translation procedures often provided little detail, raising doubts about the integrity of the translations, as judged by COSMIN standards. Although it is possible that rigorous methodological guidelines were adhered to, this information was not readily available in most cases. Consequently, we were not able to evaluate the translation procedures across studies. The quality of a translation may significantly impact the validity of a tool (De Kock et al., Reference De Kock, Kanjee, Foxcroft, Foxcroft and Roodt2013; Van Widenfelt et al., Reference Van Widenfelt, Treffers, De Beurs, Siebelink, Koudijs and Siebelink2005). In a sense, a translated version of a tool becomes its own outcome measure that should be evaluated for content validity (Terwee et al., Reference Terwee, Prinsen, Chiarotto and Mokkink2018a). Although a few studies reported the use of focus groups and pilot testing to assess relevance and comprehensibility, the results of these investigations (e.g., any re-wordings or modifications made to the original draft) were not always reported. Transparent reporting of translations and adaptations serves two important purposes. First, it grants readers the opportunity to evaluate the validity of the translated versions. Second, it serves as a useful record for researchers or clinicians who may be interested in administering the translated tool in future studies or in clinical settings. In this review, we found that only three out of the ten studies that administered the ASEBA forms in more than one language conducted separate psychometric analyses for each version. This would be considered an important step to rule out potential measurement bias.
In this review, we also evaluated eight of the nine psychometric-focused studies using COSMIN standards and criteria for good measurement properties. To our knowledge, these nine studies are the only published journal articles addressing the validity of the ASEBA forms in SSAn contexts. Overall, evidence to support the validity and reliability of the CBCL and YSR in SSAn countries in the existing literature is limited. Furthermore, the variable quality of the methods used across different studies to assess the measurement properties of the CBCL and YSR precludes us from making confident recommendations regarding their use in these regions.
Having said this, the statistical methods used, as assessed by the COSMIN Risk of Bias Checklist, were generally adequate. The main exceptions to this were the reliability and criterion validity analyses. More studies with different designs and larger samples are needed to learn about the criterion validity, test-retest reliability, and inter-rater reliability of the ASEBA forms in SSA. Criterion validity is a very important measurement property if the ASEBA forms are to be used as screening tools in community settings. Although no single ‘gold-standard’ instrument currently exists for child behavioural and emotional problems, judging ASEBA scores against clinical assessments based on standardized diagnostic tools may be a strong starting point.
Coefficient alpha was the most frequently reported statistic across all studies. However, coefficient alpha has well-documented limitations as a measure of internal consistency (Dunn et al., Reference Dunn, Baguley and Brunsden2014; Sijtsma, Reference Sijtsma2009). Ordinal coefficient alpha may generate a more reliable estimate of internal consistency for Likert scales, such as the ASEBA forms (Zumbo et al., Reference Zumbo, Gadermann and Zeisser2007). In terms of structural validity, the majority of studies were of a very good standard, barring a few with sample sizes smaller than required for a tool measuring multiple constructs with many items. Although the broadband scales (Internalising Problems and Externalising Problems) were frequently administered in SSA, no studies conducted higher-order or bifactor CFAs that would have investigated the unidimensionality of these broadband scales.
Most of the studies included in the final analysis came from South Africa, although none of these studies were specifically focused on the measurement properties of the ASEBA forms. Hence, there remains limited evidence to support the validity of the ASEBA forms in a South African context. A smaller but significant proportion of the included studies came from East African countries (notably Kenya, Ethiopia, and Uganda). All nine psychometric-focused studies came from East African countries. Compared to Southern and East Africa, there were very few studies from West and Central Africa. It is possible that other measures of child behavioural and emotional problems are more popular in these regions. Although only one study came from a West African country (i.e., Ghana), individuals of West African origin living outside of SSA were also represented in the included studies.
Limitations
The two reviewers made every effort to ensure that all papers were thoroughly screened and reviewed. However, it is possible that a few articles were either not included in the search results, accidentally removed from the reference library, or incorrectly screened. A limitation of our study was the exclusion of unpublished “grey” literature, including theses, books, and conference presentations. Although we made the decision to exclude these records for practical reasons, grey literature would have likely enriched our analysis and reduced the risk of publication bias. Another important limitation of our study was that we could not use the COSMIN GRADE approach to quantitatively pool the results from individual studies and grade the overall quality of evidence for each measurement property (Mokkink et al., Reference Mokkink, Prinsen and Patrick2018b). Results from individual studies were too inconsistent to pool quantitatively. Moreover, there were too few studies in each “sub-group” (e.g., country, language of administration, sample characteristics) to arrive at reliable conclusions for each possible combination of subscale and sub-group.
Gaps identified and recommendations
One important gap in the current literature is the dearth of studies evaluating the content validity of the ASEBA forms in sub-Saharan Africa. Content validity, the extent to which the content of a tool adequately represents the construct it measures, is arguably the most important of all measurement properties (Terwee et al., Reference Terwee, Prinsen, Chiarotto, Westerman, Patrick, Alonso, Bouter, de Vet and Mokkink2018b). If a tool does not have content validity, then all other measurement properties are irrelevant. As described earlier, there were some attempts to evaluate the relevance and comprehensibility of the ASEBA forms through pilot testing. To our knowledge, only one included study explored the comprehensiveness of the ASEBA forms. Prior to conducting their study, Hall et al. (Reference Hall, Puffer, Murray, Ismael, Bass, Sim and Bolton2014) used qualitative methods to identify local symptoms of internalising and externalising behaviours in Somali refugees living in Ethiopia. The authors added 11 and 4 of these locally derived symptoms to the Internalising Problems and Externalising Problems subscales, respectively. Although we could not include this preliminary study in the current analysis, the findings emphasise the importance of evaluating the comprehensiveness of behavioural screening tools in sub-Saharan Africa.
Summary and conclusion
The primary aim of the present review was to investigate the measurement properties of the ASEBA forms in SSAn countries, where translated versions of the forms are frequently administered. At present, evidence is limited in terms of both the number and quality of available studies. East African countries have already made significant progress with regard to evaluating translated versions of the ASEBA forms in local contexts. In South Africa, however, the measurement properties of the ASEBA forms remain under-studied despite their widespread use in research. Data from other areas of sub-Saharan Africa are largely absent. This review has demonstrated the importance of validating existing behavioural tools for culturally and linguistically diverse contexts in SSA. Comprehensive and ongoing psychometric evaluations of tools require time and resources. However, the result is that clinicians and researchers become more confident that the inferences made based on these tools are accurate and dependable.
Supplementary material
For supplementary material accompanying this paper visit https://doi.org/10.1017/neu.2022.5
Acknowledgements
We thank Mrs Mary Shelton from the University of Cape Town’s Health Sciences Library for her advice and assistance with the systematic review search strategy.
Authors’ contributions
All authors were involved in the design of the systematic review and the development of a search methodology. M.R.Z. and C.F. independently screened and reviewed all records and extracted the data from the articles. M.R.Z. drafted the manuscript, which was approved by all authors prior to submission.
Financial support
This research received no specific grant from any funding agency, commercial, or not-for-profit sectors. M.R.Z. received postgraduate fellowship funding from the Harry Crossley Research Foundation.
Conflict of interest
None.