Introduction
Clinical supervision is a cornerstone in psychotherapist training, but research in this area has been hampered by a lack of validated tools for assessing supervision quality (Watkins, 2012). This is partly due to the many different theoretical models for supervision, together with difficulties in operationalizing supervisor behaviours and supervision features (Watkins, 2011). The absence of assessment tools makes it difficult to draw firm conclusions about even fundamental supervision components (Freitas, 2002; Milne et al., 2010; Reiser and Milne, 2012), and to date, no supervision model has been empirically validated (Alfonsson et al., 2018).
Without empirical guidance, psychotherapy supervision has mostly been modelled on psychotherapy practice. Liese and Beck’s (1997) model for supervision in cognitive behaviour therapy (CBT) is one early example. Their model resembles cognitive therapy, and also includes elements used to promote behaviour change in CBT (e.g. goal setting, guided discovery, skills training). Other researchers have instead suggested that supervision may be a specific form of intervention by itself, and therefore should follow principles centred on the supervision context (Falender and Shafranske, 2012; Milne, 2008; Reiser and Milne, 2012). This is clearly expressed in the CORE Competence Framework (Roth and Pilling, 2007), which includes guidelines for both clinical supervision and competencies of different psychotherapeutic schools. Supervision in this model is thereby seen as a unique entity that includes specific competencies that can be observed and measured; these are to some extent common across different forms of psychotherapy, but also partly therapy-specific. The Supervision Competency Scale (SCS) (Kennerley and Clohessy, 2010; Mueller, 2010) is a tool for assessing CBT supervision built on the CORE Competence Framework. The SCS includes 33 items in six areas (i.e. structuring of sessions, enhancing learning, supervisory relationship, other process issues, professional/ethical practice and reflective practice). The instrument is rather extensive, and its validity and reliability are not yet fully known.
Another tool for assessing CBT supervision is SAGE (Supervision: Adherence and Guidance Evaluation) (Milne et al., 2011). Originally developed to assess the competencies of evidence-based clinical supervision (EBCS) (Milne, 2009; Milne and Dunkerley, 2010), SAGE mirrors existing observational coding instruments for evaluation of psychotherapy, such as the Cognitive Therapy Scale–Revised (CTS-R) (Blackburn et al., 2001). The original 23-item version of SAGE was empirically derived to assess the competence of supervisors based upon direct observation of practice samples. The 23 items included different aspects of relationship variables, supervisor’s competencies, and supervisee’s engagement in experiential learning (Milne et al., 2011). As clinical supervision is difficult to reliably assess, the authors suggest a minimum of three to four supervision sessions in order to obtain a reasonable assessment of competence. Recently, SAGE has been refined into a shorter 14-item version (Short–SAGE), which has undergone some preliminary psychometric evaluation supporting its usability (Reiser et al., 2018). However, rating therapist or supervisor competence is challenging, and poor inter-rater reliability is a common problem in observational coding instruments. Additionally, measures of reliability of ratings often vary widely both across and within studies (Roth et al., 2019), and many studies use incorrect statistical procedures, or fail to fully report information necessary to interpret the inter-rater reliability results (Hallgren, 2012). Short–SAGE has not yet been evaluated in this regard, and the aim of this study was therefore to assess the inter-rater reliability of Short–SAGE.
Method
Design
Data in this study were retrieved from an ongoing study of basic and advanced level psychotherapy students’ courses in CBT. In the study, via audio recordings of therapy sessions, supervisors monitored and assessed students’ clinical performance when they provided psychotherapy in diverse clinical settings. The supervision sessions included performance feedback, and were conducted in groups with three to four supervisees (i.e. each supervisee received approximately 45 minutes of supervision per session). All sessions were audio recorded. The supervisors were experienced psychotherapists (i.e. 14–32 years of professional psychotherapy experience, and 4–24 years of experience as CBT supervisors), with advanced training in CBT and additional training in CBT supervision. They were not trained or instructed to follow the specific EBCS model, but to comply with the broader CBT supervision framework (e.g. review and provide feedback on students’ therapy sessions, foster reflective thinking using Socratic methods, encourage students’ responsibility for further learning, and, when necessary, teach/model therapeutic skills). All supervisors were independently contracted to the training centre and not part of the permanent staff. The research team approached all supervisors currently providing supervision at the training centre, and all chose to participate in the study.
This study used audio recordings from 48 supervision sessions, provided over 2 months by 12 different supervisors, collected from both basic and advanced CBT training courses. From these, a random sample of 20 recordings of approximately 50 minutes each was extracted for analysis. Prior to analysis, each recording was scrutinized by an independent staff member in order to delete identifiable patient information. The audio recordings were then randomly assigned to three of four coders in a cross-over design, and then independently coded with Short–SAGE. All four coders were clinical psychologists with advanced training in CBT. They were working as teachers in clinical and/or supervision courses, and had practical experience of supervision in diverse settings. The coders also had experience in assessing CBT and clinical supervision, and were familiar with the EBCS model of supervision. Additionally, they all had various previous training in other coding procedures, such as the Cognitive Therapy Scale, Teacher’s PETS and The Motivational Interviewing Treatment Integrity Code. Prior to coding, the coders received the Short–SAGE manual and were asked to familiarize themselves with the coding instructions. They then participated in three 3-hour Short–SAGE workshops, in which three randomly selected supervision sessions were analysed and discussed in order to promote a common understanding of the instrument and to reach scoring consensus. The training outline was based on the Short–SAGE manual, and the description of Loades and Armstrong (2016). In each workshop, the coders listened to a recorded supervision session and then discussed the ratings of each item until the rationale was clarified and consensus was reached. The coders had an additional meeting regarding two items they perceived as most difficult to interpret (i.e. Prompting and Conceptualizing).
To further the coders’ understanding of these items, contact was made with the founders of SAGE, who provided more elaborate instructions. No coder in the group served as the gold standard or master coder. The goal was instead to reach consensus between all coders, for all scores, in each session. After the four workshops with joint coding, based on the level of agreement at the fourth workshop, the inter-rater agreement was deemed high enough for the independent coding. For the subsequent codings, each coder was provided with the Short–SAGE manual. All recordings were coded independently and submitted anonymously. The coding sheets were then compiled by an independent staff member, and provided with a code number.
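The random assignment of each recording to three of the four coders can be illustrated with a short sketch. This is not the authors' procedure: the coder labels, recording identifiers, seed, and the function name `assign_coders` are all hypothetical, and the sketch does not attempt to balance workload across coders as a full cross-over design might.

```python
import random


def assign_coders(recordings, coders, per_recording=3, seed=None):
    """Randomly pick `per_recording` distinct coders for every recording.

    Hypothetical helper: the source only states that recordings were
    randomly assigned to three of four coders, not how this was done.
    """
    rng = random.Random(seed)
    return {rec: sorted(rng.sample(coders, per_recording)) for rec in recordings}


# Illustrative identifiers only (20 recordings, 4 coders, as in the study design).
recordings = [f"rec{i:02d}" for i in range(1, 21)]
coders = ["A", "B", "C", "D"]
assignment = assign_coders(recordings, coders, seed=1)

for rec in recordings[:3]:
    print(rec, assignment[rec])
```

Because each recording is rated by only three of the four coders, the resulting rating matrix has one missing coder per recording, which matters for how the reliability coefficients are later computed.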
Instrument
Short–SAGE contains 14 items (i.e. Managing, Agenda-setting, Formulating, Questioning, Prompting, Demonstrating, Teaching, Training/Experimenting, Evaluating, Feedback, Reflecting, Conceptualizing, Planning, and Experiencing), each scored on a 7- or 3-point Likert scale (Reiser et al., 2018). The factor structure is consistent with the underlying conceptual framework that guided the original 23-item SAGE version, including two of the major components: (1) The Supervision Cycle (i.e. specific supervisor behaviours which are believed to facilitate optimal experiential learning); and (2) The Supervisee Cycle (i.e. specific, observable supervisee learning competencies) (Reiser et al., 2018). On the main 7-point scale of Short–SAGE, 0 to 2 indicates incompetent/not demonstrated, 3 to 4 indicates competent, and 5 to 6 indicates expert competence. The 3-point scale, or the RAG (red-amber-green) scale, provides a coarser (i.e. incompetent/competent/expert) rating for each item, and is mainly used for training purposes.
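The score bands of the 7-point scale described above can be written down as a small lookup. This is a minimal sketch; the function name `sage_band` is hypothetical, and the source does not say that RAG ratings are derived from 7-point scores, so the sketch only encodes the stated 0 to 2, 3 to 4 and 5 to 6 bands.

```python
def sage_band(score: int) -> str:
    """Map a 7-point Short-SAGE item score (0-6) to its competence band."""
    if not 0 <= score <= 6:
        raise ValueError("Short-SAGE 7-point scores range from 0 to 6")
    if score <= 2:
        return "incompetent/not demonstrated"
    if score <= 4:
        return "competent"
    return "expert"


for s in range(7):
    print(s, sage_band(s))
```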
Data analyses
The inter-rater reliability was assessed with a two-way, random effects, absolute agreement, intraclass correlation (ICC) (Hallgren, 2012; Koo and Li, 2016; McGraw and Wong, 1996). Results for both single (i.e. the reliability of the ratings based on ratings provided by a single coder) and average (i.e. the reliability of the ratings, based on the mean value of ratings provided by several coders) measures are presented, for both the 7- and 3-point scale of Short–SAGE. Following published guidelines, the chosen design should result in a 90% ability to detect modest (0.4) correlations between raters’ scores (Bujang and Baharum, 2017; Walter et al., 1998). The ICCs were interpreted according to the recommendations of Cicchetti and Sparrow (1981): <0.40 as poor; 0.40–0.59 as fair; 0.60–0.74 as good; and 0.75–1.00 as excellent.
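To make the analysis concrete, the following is a minimal sketch of the two-way random-effects, absolute-agreement ICC for single and average measures, using the mean-square formulas given by McGraw and Wong (1996), together with the interpretation bands listed above. It assumes a complete recordings-by-coders matrix with no missing ratings, which is a simplification of the three-of-four-coders design used here; the function names and the example ratings are hypothetical and are not data from this study.

```python
import numpy as np


def icc_two_way_random_absolute(ratings):
    """Return (single-measure, average-measure) absolute-agreement ICC.

    `ratings` is an n x k array: n rated sessions (rows), k coders (columns).
    Assumes no missing values (a simplification for illustration).
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Sums of squares for sessions (rows), coders (columns) and residual error.
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)
    ss_error = np.sum((x - grand) ** 2) - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    single = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    average = (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
    return float(single), float(average)


def cicchetti_band(icc):
    """Interpretation bands as listed in the text."""
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"


# Hypothetical 7-point ratings: 5 sessions (rows) rated by 3 coders (columns).
example = [[3, 4, 3], [2, 2, 3], [4, 4, 5], [1, 2, 1], [3, 3, 4]]
single, average = icc_two_way_random_absolute(example)
print(round(single, 2), cicchetti_band(single))
print(round(average, 2), cicchetti_band(average))
```

In practice an established statistics package would normally be used, not least because it also reports confidence intervals and significance tests, which this sketch omits.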
Results
Table 1 shows the mean Short–SAGE 7-point scale scores for each item, for each coder. In all coded supervision sessions, for the Short–SAGE 7-point scale, no item reached the instrument’s highest value (i.e. 6), so the range was somewhat restricted. Additionally, four of the 14 items showed floor effects (i.e. Formulating, Teaching, Evaluating and Reflecting) with no scores lower than 2 (Table 2). This indicates that all supervision sessions included elements of supervisors actively encouraging the supervisees to analyse/synthesise and generate clinical presentations, didactic information from the supervisors, monitoring activities from the supervisors, and that the supervisees in all 20 sessions, in the light of their own understanding, summarised relevant events from their therapy sessions. Moreover, four of the 14 items showed ceiling effects: three of them (i.e. Prompting, Demonstrating, and Training/Experimenting) with no scores higher than 4, and one of them (Feedback) with no scores higher than 3 (Table 2). This indicates that none of the supervisors, in an expert way, prompted/cued the supervisee about relevant material, modelled/illustrated skills or engaged the supervisees in experiential learning, and that none of them, in a proficient way, let the supervisees summarise the supervision session.
Table notes: a, single measure; b, average measure.
All inter-rater correlations for the 7-point scale were statistically significant. For the single measures, six of the 14 items were in the poor range (i.e. <0.40), seven in the fair range (i.e. 0.40–0.59), and one was in the good range (i.e. 0.60–0.74). For the average measures, three of the items were in the fair range, six in the good range, and five were in the excellent range (i.e. 0.75–1.00) (Table 2).
For the 3-point scale, five of the 14 inter-rater correlations turned out non-significant (i.e. Prompting, Teaching, Evaluating, Feedback and Reflecting). For the remaining nine items, the single measure analyses resulted in six items in the poor range (i.e. <0.40), two in the fair range (i.e. 0.40–0.59), and one in the good range (i.e. 0.60–0.74). The average measure analyses for the remaining nine items resulted in four items in the fair range, four in the good, and one item in the excellent range (i.e. 0.75–1.00) (Table 3).
Table notes: a, single measure; b, average measure; c, range too restricted for adequate analysis.
For both scales, there were rather large discrepancies between the ICC for single and average measures, indicating low levels of percentage agreement across items (Tables 2 and 3). In a post-hoc analysis, all coders were compared pairwise in order to detect any outlier with consistently lower inter-rater reliability, but no single coder stood out in this regard.
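Part of the gap between single and average measures is arithmetic: for the two-way random-effects, absolute-agreement ICC, the average-measure value equals the Spearman–Brown step-up of the single-measure value. A minimal sketch, assuming k = 3 coders per recording as in this design; the input values are illustrative band boundaries, not results from this study, and the function name is hypothetical.

```python
def average_from_single(single_icc: float, k: int) -> float:
    """Spearman-Brown step-up: reliability of the mean of k coders' ratings."""
    return k * single_icc / (1 + (k - 1) * single_icc)


# Illustrative single-measure values, stepped up for three coders.
for single in (0.20, 0.30, 0.40, 0.60):
    print(single, round(average_from_single(single, k=3), 2))
```

The step-up is proportionally largest when the single-measure ICC is low, which is consistent with items moving up one or two interpretation bands when ratings are averaged over coders.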
Discussion
The aim of this study was to assess the inter-rater reliability of Short–SAGE. For the 7-point scale, the range was somewhat restricted, and the analyses revealed floor effects for four of the 14 items, and ceiling effects for another four. For both scales, for the single ICC measures, only one item was in the good range, and the rest of the items were in the fair to poor range. Moreover, on the 3-point scale, five of the 14 inter-rater correlations turned out to be non-significant. This 3-point RAG scale (i.e. red-amber-green) has been proposed by the authors as an educational tool; it is useful as a basis for supervisor–supervisee discussions. Poor inter-rater reliability may be less problematic in non-evaluative contexts. However, results of the current study indicate that the RAG scale is not reliable for evaluative purposes where the exact level (e.g. ‘fail’ and ‘pass’) is important, at least not without extensive coder training. Moreover, the average measurements showed, not surprisingly, better results for both scales. However, as the Short–SAGE assessment is normally done by a single rater, the single-measure results are most relevant for the assessment of the scale’s inter-rater reliability.
Unfortunately, this study’s results did not provide additional information regarding whether larger samples of sessions and/or more extensive coder training would have generated higher ICC scores. As Syed and Nelson (2015) state in their article on Guidelines for Establishing Reliability when Coding Narrative Data: reliability is not a product, but a process that involves multiple time-intensive steps. However, as the literature rarely describes the process of training of coders who measure treatment fidelity (Kramer Schmidt et al., 2019), it is difficult to know exactly how that training should be conducted. The Short–SAGE manual proposes a 1-day training workshop for raters, guided by the full SAGE manual. Other researchers have instead proposed a considerable amount of training, conducted in a stepped training approach, with a level of inter-rater reliability specified a priori (Hallgren, 2012; Syed and Nelson, 2015).
Interestingly, for some of the supervision sessions, many of the items scored 0 (Table 2), indicating absence of features, or highly inappropriate performance in that specific domain. This is especially surprising when it comes to Agenda-setting, Questioning, Demonstrating, and Feedback; supervisor behaviours that could arguably be expected in most, if not all, CBT supervision sessions. However, while CBT supervision textbooks quite unanimously promote supervisor behaviours, such as the use of agenda, Socratic questioning and modelling (e.g. Watkins and Milne, 2014), our experience tells us that CBT supervision content varies to a large degree. Clinical supervision has not been monitored or scrutinized as closely as some psychotherapy methods, and the content has not been studied more objectively until recently (Alfonsson et al., 2018). Hopefully, clinical supervision can develop in a similar way to psychotherapy, including more transparency and a closer adherence to published guidelines. That being said, few of the supervision techniques, including those described in Short–SAGE, have been experimentally explored, and to a large extent, we still do not know exactly which supervision behaviours are effective in the training of psychotherapists. In other words, even if it is possible to improve the reliability of Short–SAGE, both the validity of the instrument, and the underlying model that Short–SAGE is supposed to measure, are still unclear. Taken together, it is difficult to know if the low ICC levels in this study are related to the training of coders, the sample (i.e. both ceiling and floor effects) and/or the instrument itself.
Limitations
This study has important limitations. Our samples of both supervision sessions (i.e. 20) and coders (i.e. 4) were small. However, neither the Short–SAGE manual nor the article describing the instrument’s psychometric properties (Reiser et al., 2018) contains any information on the recommended number of sessions or coders for assessing the inter-rater reliability of Short–SAGE. To our knowledge, inter-rater reliability has not previously been investigated for a supervision coding instrument. However, instruments for assessing CBT competence, such as CTS-R, have demonstrated adequate inter-rater reliability (Blackburn et al., 2001). In their study, Blackburn and colleagues had a total of 102 sessions coded by two out of four coders. The present study had a similar approach using fewer sessions (i.e. at least three coders coded each recorded session in a cross-over design). However, the restricted ranges, with both floor and ceiling effects, and the use of absolute agreement in this study’s analyses, resulted in lower statistical power than expected. The restricted ranges also limit the conclusions that can be drawn regarding a context that includes a fuller range of supervisor behaviours, which thus limits the generalizability of the results. A larger sample of sessions and coders may be needed to further analyse the inter-rater reliability of Short–SAGE, but coding of sessions consumes large amounts of resources, and is often associated with practical difficulties. Moreover, in this study, the coders did not have any formal training in using Short–SAGE. Similar to the one-day training workshop for raters proposed in the Short–SAGE manual, the coders in this study had participated in three in-house three-hour Short–SAGE workshops, provided by two clinical psychologists with advanced training in CBT, and experience in using SAGE in their work as supervisors.
In the workshops, three recorded supervision sessions were analysed. The coders also had an additional meeting regarding two difficult items, and received more elaborate instructions from the founders of SAGE. The results of the study indicate that more extensive training, as proposed by researchers like Hallgren (2012) and Syed and Nelson (2015), which requires considerable effort, may be essential for reaching adequate inter-rater reliability. Taken together, the results of this study do not provide information regarding whether the low ICC levels were related to the training of coders, the coded sample (i.e. both ceiling and floor effects) and/or the instrument itself (i.e. the Short–SAGE).
Conclusions
For both research and training purposes, validated tools for assessing supervision quality are highly needed. However, instruments for measuring adherence and/or competence are of little value if the coders do not attain inter-rater reliability. In this study, only one of the 14 items of Short–SAGE was in the good range, and the rest of the items were in the fair to poor range. Unfortunately, the results did not provide additional information regarding whether more extensive training and/or larger samples of sessions and coders would have generated a higher degree of correlation and agreement between items. Due to limited research in this area, it is unclear whether expert codings of global scores of supervisor behaviours are a valid method for measuring supervision quality. More behaviour-oriented approaches, like those developed for motivational interviewing (Moyers et al., 2005), may be more accurate. Codings that provide more detailed information clearly provide richer data, and might also more easily generate agreement between items. Short–SAGE may be used to assess supervision quality in both research and training settings. Whether quality of supervision is associated with improvements in supervisees’ competencies is as yet unclear. Short–SAGE provides a tool that may enable empirical research in this area. Further studies are needed to assess whether extensive training can improve the instrument’s inter-rater reliability.
Acknowledgements
None.
Financial support
This research received no specific grant from any funding agency, commercial or not-for-profit sectors.
Conflict of interest
None.
Ethics statements
All participants were informed about the study procedure and provided informed consent. The procedure was approved by the Regional Ethics Committee Board (No. 2018/1735-31/3), and all authors have abided by the Ethical Principles of Psychologists and Code of Conduct as set out by the BABCP and BPS.
Key practice points
(1) Clinical supervision is a cornerstone in psychotherapist training, but there are few available tools to assess clinical supervision quality.
(2) For both research and training purposes, tools for assessing supervision are highly needed. However, instruments for measuring adherence and/or competence are of little value if the coders do not attain inter-rater reliability.
(3) Whether quality of supervision is associated with improvements in supervisees’ competencies is as yet unclear.