1. Introduction and outline
Customer service dialog systems have become widely used in many everyday settings, but still make frustrating errors that are sometimes obvious to a human: for example, failing to understand a customer’s message and asking an irrelevant question as a result. In practice, tuning these systems to limit such behaviors is an expensive and time-consuming art. This paper describes the design, implementation, and early results of an approach to improving overall dialog system quality by recognizing and addressing such individual failures. This is done by combining individual Actionable Conversational Quality Indicators (ACQIs) with a running interaction quality (IQ) score, to show which problems were identified, what steps can be taken to fix them, and how these issues affected an overall assessment of the user experience. IQ is a standard conversational quality measure developed by Schmitt, Schatz, and Minker (2011) and Schmitt and Ultes (2015). This paper introduces ACQIs, which are designed so that each conversational quality indicator has associated recommended actions, but which do not require explicit knowledge of the dialog system architecture.
The paper starts by explaining some of the background on how task-oriented dialog systems are built and maintained (Section 2): while there are several online tools that support this, many readers may be unfamiliar with their use. Previous work on evaluating dialog system performance (discussed in Section 3) has investigated the use of conversation-level quality metrics and individual turn-level assessments of good and bad interactions. In particular, the IQ score of Schmitt et al. (2011) and Schmitt and Ultes (2015) is discussed in this section, and the combined use of ACQIs and IQ scores plays a major role throughout the rest of the paper. Section 4 introduces the datasets that are used for examples and experiments throughout the rest of the paper.
Section 5 explains the heart of this paper: the design of the ACQI taxonomy. This section explains the motivation and decisions behind ACQIs, including how they are made actionable and explainable, and how they give feedback that is specifically tailored to the dialog system in question. ACQIs are agnostic of dialog system architecture, relying only on textual input, but can be mapped to any number of associated actions. Section 6 describes the annotation work for ACQI datasets, analysis of the ACQI distributions, and the work on combining ACQI and IQ scoring to distinguish those parts of the dialog that need particular improvement.
Section 7 describes experiments on automatically predicting ACQIs based on the annotated datasets and features extracted from the dialog text. We demonstrate that correct ACQI labels can be predicted with a weighted average f1-score of 79%, and that textual features can predict labeled IQ scores (1–5) with an average accuracy of 60%. Analyzing the distribution of ACQIs and suggested actions when IQ drops shows that the number of potential improvement strategies to evaluate could be reduced by up to 81%, depending on the accuracy of the ACQI and IQ classifiers. We argue that tools built using these approaches could improve the effectiveness of bot-builders and reduce their cognitive burden.
2. Conversational AI systems and challenges for bot-builders
Since the year 2000, dialog systems (often called chatbots) have grown from mainly research demonstration systems to include various user-facing commercial offerings. Dialog systems fall into three broad categories (Deriu et al. 2020): conversational agents, question answering systems, and task-oriented systems (which are the topic of this paper). Each category has corresponding dialog quality measurement strategies. Conversational agents, which often receive the most attention in news articles when released, are typically unstructured and open-domain, with no particular objective other than an engaging conversation. Success in conversational agents is frequently measured by how long users are willing to continue interacting. Question answering systems can be evaluated by considering the accuracy of the answers given.
Task-oriented systems, which are the focus of this article, typically have a rigid structure and a limited scope. Most commercial and many research dialog systems have an underlying modular structure as in Fig. 1. These systems are built to resolve the consumer’s issue, answer questions, route to an appropriate representative, or guide the user through a task as efficiently as possible. Like question answering systems, task-oriented systems have clear “failure” cases: just as a question answering system can fail to answer a question, a task-oriented system can fail to complete a task.
Task-oriented systems are prevalent in customer service: ideally, they can automate simple tasks like routing requests to the right agent, thus freeing up human customer service specialists to handle more demanding situations. Several technology companies offer dialog system services to support such chatbots. These include offerings from large and general technology companies such as Microsoft LUIS, IBM Watson, Google DialogFlow, and from more specialist providers such as Salesforce, Intercom, and LivePerson.
While there are several differences between these platforms, there are some typical themes:
- The platforms provide various tools and widgets for customers to build and deploy their own dialog systems.
- The dialogs built in this way are “designed” or “scripted.” The process of building a dialog involves declaring various steps, inputs the user can be encouraged to make at those steps, and what action the system should take in response.
- This process admits the possibility of error or failure states, when the input from a user is something the system (knowingly or unknowingly) cannot process successfully.
A simple example might be that the dialog is at a stage where the user is asked to pick from a list of available options, such as “Enter 1 for English, Ingrese 2 para Español.” In this case, any entry other than the numbers “1” or “2” would be a problem. This problem can be addressed in a few ways. A traditional keyboard interface might say “Option unrecognized, please type 1 or 2.” A form-filling approach might be to use a “Select from Menu” element instead of a “Prompted Text Entry” element, so that the customer can only give input that corresponds to the actions the dialog system can take right now. An example from the LivePerson Conversation Builder is shown in Fig. 2. The dialog developer is about to add a new interaction to the conversation so far; they have a choice of widgets including those for text entry, scheduling, and payments, and there are various formats for asking questions and tracking the responses. There are analogous features in other bot-building platforms.
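To make the failure state concrete, the following minimal sketch (our own illustration, not code from any of the platforms mentioned) shows a scripted menu step with an explicit fallback of the kind a bot-builder has to anticipate:

```python
# Hypothetical scripted menu step: any input other than "1" or "2" is an
# error state, handled here by re-prompting rather than guessing.
def language_menu_step(user_input: str) -> str:
    choice = user_input.strip()
    if choice == "1":
        return "Continuing in English."
    if choice == "2":
        return "Continuando en Español."
    return "Option unrecognized, please type 1 or 2."
```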
It is crucial to note that dialog system platforms are typically used by customer support specialists, not the customers themselves. In the rest of this paper, these users will be described using the colloquial but industry-wide term “bot-builders.” This is emphasized because one of the ways to improve these dialog systems is to provide tools that make bot-builders more effective. The work described in this paper shows that two established approaches to measuring conversations, scoring individual actions and assessing the conversation quality as a whole, can be combined to provide such a tool that helps bot-builders to identify and fix particular pain points in a deployed dialog system. The intuitively appealing assumption that useful actionable feedback about dialog systems can be produced through (semi-) automated methods is supported by Hockey et al. (2003), which shows that users of a task-oriented dialog system who received actionable feedback in failure cases outperformed the control group that did not.
3. Previous approaches to evaluating dialog systems, including IQ
Meaningful evaluation of automated dialog system performance has long been recognized as crucial to progress in dialog system research (Deriu et al. 2020) and is an active area of development, including the recent introduction of the “sensibleness and specificity” metric of Adiwardana et al. (2020). Since the introduction of dialog system-building platforms as described in Section 2, it has also become a daily concern for customer service bot-builders.
One of the most common industry measures for dialog system effectiveness is the automation rate or containment rate. Containment rate directly affects the cost savings that a business may make through automation, for example “Over three years and a conservative 25% containment rate, the cost savings is worth more than $13.0 million to the organization” (Forrester Research 2020). However, analysis of containment or automation rates is still relatively rare in the research literature and is mainly found in evaluations of speech/voice-driven dialog systems (Pieraccini et al. 2009). The tradeoff between the automation rate and the number of setbacks a user can encounter is analyzed by Witt (2011), again with spoken dialog. The relative lack of research literature on containment rates can be easily attributed to different settings and incentives. Task-oriented commercial dialog systems are often developed to support existing customer service operations staffed by human agents, whereas most dialog systems in research settings do not have human agents to respond to escalations, so the systems cannot escalate, and containment cannot be measured.
For the task-oriented dialog systems that are our primary concern, there is an underlying concept of giving a “right” response. Early evaluation work based on this idea compared the actual responses to a predefined key of reference answers (Hirschman et al. 1990). The proportion of matches to the key gave the measure of performance. Well-known weaknesses of this approach include being specific to particular systems, domains, and dialog strategies. More portable strategies that measure inappropriate utterance ratio, turn correction ratio, or implicit recovery (Polifroni et al. 1992; Shriberg, Wade, and Price 1992; Hirschman and Pao 1993; Danieli and Gerbino 1995) are intellectual ancestors of the ACQI part of our approach, in that they identify events that are indicators of the quality of the conversation. Both of these early approaches share the limitation of being unable to model or compare the contribution that the various factors make to performance.
The PARADISE approach (Walker et al. 1997) overcomes this limitation by using a decision-theoretic framework to specify the relative contribution of various factors to a dialog system’s overall performance. This and other ideas introduced by PARADISE, such as separating the accomplishment of a task from how the system does it, support evaluation that is portable across different systems, domains, and dialog strategies. We have tried to emulate some of these best practices by defining ACQIs that avoid being too implementation-specific, though without insisting that all ACQIs must be relevant to all dialog systems (see Section 5). Also, for our IQ models, in contrast with much work on IQ (Schmitt et al. 2011, 2012; Schmitt and Ultes 2015; Stoyanchev, Maiti, and Bangalore 2017), we have chosen not to use any features extracted from the dialog systems themselves, such as the semantic parse from the natural language understanding (NLU) component or features from an automatic speech recognition (ASR) component, because they may not be available (e.g., there are no ASR features in a text-only system, and not all NLU produces semantic parses). Instead, in this paper we rely solely on the text of the dialog, partly to facilitate comparison, and partly to improve portability across a wider variety of dialog systems (though this remains challenging, as shown in the experiments below).
3.1. IQ score
More recent work has developed several dialog quality measurement strategies, which are categorized by Bodigutla, Polymenakos, and Matsoukas (2019) as follows: sentiment detection; per-turn dialog quality; explicitly soliciting feedback from the user; task success; and dialog-level satisfaction ratings as in PARADISE (Walker et al. 1997). These methods are useful but have well-known limitations: for example, sentiment analysis on messages misses many problems, there is response bias in user polling, and negative outcomes weigh more in the consumer’s mind than positive ones (Han and Anderson 2020).
The per-turn dialog quality approaches such as Response Quality (RQ; Bodigutla et al. 2020) or IQ (Schmitt et al. 2011; Schmitt, Ultes, and Minker 2012; Schmitt and Ultes 2015) are based on turn-by-turn expert annotator ratings on a five-point scale that contribute to a running evaluation of the dialog up to that point.
The method in this paper most directly builds on the IQ method of Schmitt et al. (2011) and Schmitt and Ultes (2015). We also draw from later work of Stoyanchev, Maiti, and Bangalore (2019), which applies this IQ method to customer service dialogs. One reason we preferred IQ over the similar RQ approach is that with three positive states and two negative, RQ’s scale is less useful for identifying problems, and we would be surprised if, in the data we are studying, the “excellent” rating were ever used. IQ’s “dissatisfaction scale” with mostly negative ratings was better but still not ideal for our purposes. This led us to alterations of the IQ scale, which are discussed in detail in Section 6.2. We also preferred IQ to RQ because IQ has been shown to correlate with user satisfaction (Schmitt and Ultes 2015) and because of the availability of the IQ-annotated LEGOv2 dataset and comparable prior work. In IQ annotation of a conversation, one point is typically added for a good interaction and a point is subtracted for a bad interaction, with some exceptional cases, for example when the dialog “obviously collapses.” A benefit of this method is that it can help to identify where there are problems in a dialog system, which is noted as a contrast and improvement over previous methods by Schmitt et al. (2011):
While the intention of PARADISE is to evaluate and compare SDS or different system versions among each other, it is not suited to evaluate a spoken dialog at arbitrary points during an interaction. (Schmitt et al. 2011, Section 1)
Our work can be seen as an extension of the IQ approach in our use of a running dialog quality measure. Adding ACQI improves over IQ alone by recommending how to resolve a problem instead of only identifying where it exists.
4. Datasets used in this work
The experimental work in the paper uses two main datasets, which were annotated and used in the prediction experiments in subsequent sections. Statistics about these datasets are summarized in Table 1.
4.1. LEGOv2
LEGOv2 (Schmitt et al. 2012; Schmitt and Ultes 2015) is a parameterized and annotated corpus derived from the CMU Let’s Go database of Raux et al. (2005, 2006). This corpus was developed by the Dialogue Systems Group at Ulm University and was used in the development of the IQ score of Schmitt et al. (2011) and Schmitt and Ultes (2015), discussed in Section 3.
The data consist of phone-mediated customer service interactions between an automated dialog system and callers drawn from the general population in the vicinity of Pittsburgh, Pennsylvania. The LEGOv2 dataset represents many of the characteristics that are still challenging when deploying task-oriented spoken language systems “in the wild”:
- The callers are drawn from the general population.
- The task at hand is authentic: callers have a presumed need to access bus information.
- The callers are using standard personal or public telephones in real-world settings that include such challenges as third-party speech, television programs in the background, highly variable audio quality, and irritated and/or amused speech.
Things that have changed since the initial creation of LEGOv2 include:
- Speech recognition technology is far more capable now than it was when Let’s Go began.
- The public is much more familiar with conversational AI systems.
- Both speech-mediated and text-mediated dialog systems are commercially important and widely deployed, so lessons learned from Let’s Go have greater potential impact.
Previous approaches for modeling quality in LEGOv2 have included features automatically extracted from the dialog system (Asri et al. 2014; Rach, Minker, and Ultes 2017; Stoyanchev et al. 2017; Ultes 2019) or have used meta-data (Ling et al. 2020). We have chosen to use only features derived from the text itself, taken from a transcribed version of LEGOv2, as this approach is also applicable to text-based dialog systems (including those developed by LivePerson) and avoids depending on particular system implementation decisions for features.
4.2. Datasets from LivePerson dialog systems
An approved collection of transcripts of text-based customer service conversations with LivePerson dialog systems was also extracted and annotated.
We intentionally chose a set of bots that serve different functions, come from different industries, and have different overall quality (as measured by final IQ score). To this end, 130 conversations were chosen from each of 4 dialog systems, giving a total of 520 conversations, comparable to the number used from the LEGOv2 dataset, though the LivePerson conversations have fewer turns on average (Table 1).
These dialog systems often respond with structured content, such as embedded HTML with buttons, toggles, or drop-down menus. In these cases, we represent the text of the options separated by “***,” for example “BUTTON OPTIONS *** Main Menu *** Pick a Color *** Pick a different item.”
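As a minimal sketch (our own illustration of the representation just described, not LivePerson production code), structured content can be flattened to this text form before annotation and featurization:

```python
# Flatten structured button content into the "***"-separated text form
# described above.
def flatten_button_options(prompt: str, options: list[str]) -> str:
    return " *** ".join([prompt] + options)

flatten_button_options("BUTTON OPTIONS",
                       ["Main Menu", "Pick a Color", "Pick a different item"])
# 'BUTTON OPTIONS *** Main Menu *** Pick a Color *** Pick a different item'
```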
5. The design of the ACQI taxonomy
This central section describes the design of the ACQI taxonomy and how it extends the IQ system introduced above.
A running score like IQ allows bot-builders to identify dialog system responses where there is a meaningful decrease in score, but does not provide direct diagnosis of the problems that may be there. If nothing other than the running score is available, bot-builders have little option other than to manually review, form their own taxonomy of failure causes, and come up with appropriate fixes. This process in practice typically takes days or weeks to complete and is prioritized largely by operator intuition.
To make the problems explicit and to suggest solutions, we introduce Actionable Conversation Quality Indicators or ACQIs. ACQIs highlight moments in chatbot conversations that impact customer experience. In previous work, Finch and Choi (2020) list 21 dimensions across 20 publications that human evaluators have used to measure the quality of open-domain dialog systems. While these dimensions are not used directly in our work, they partially inspire our taxonomy of ACQIs (Table 5, leftmost column). For instance, our “Doesn’t Understand” ACQI is inspired by Coherence (Luo et al. 2018; Wu et al. 2019), Correctness (Liu et al. 2018; Wang et al. 2020), Relevance (Moghe et al. 2018; Lin et al. 2019; Qiu et al. 2019), Logic (Li and Sun 2018), and Sensibleness (Adiwardana et al. 2020). Other elements of our taxonomy are informed by Jain et al. (2018), who provide a set of best practices for developing dialog systems for messaging-based bots. We separated “misunderstanding” into “Input Rejected,” “Ignores Consumer,” and “Does Not Understand” categories, because each of these requires different actions that can mitigate the understanding issue. Such separation of failure states highlights the design-for-actionability of the ACQI taxonomy.
Table 2 shows our complete ACQI taxonomy, including descriptions, and a targeted range of potential actions for bot-builders to take. In addition to making the ACQIs actionable, the taxonomy aims to make the issues aggregable, so that bot-builders can analyze aggregated statistics about conversations and prioritize fixing the most prevalent issues accordingly. By exposing predicted ACQIs in an appropriately aggregated format, we empower bot-builders to make more data-driven decisions when improving their dialog systems.
The ACQIs and the taxonomy dimensions are derived from analyzing user experiences directly at the message level, as well as from literature on the (human) evaluation of open-domain dialog systems, consultation with experts, and ongoing feedback from expert users involved in bot-building. The ACQIs in our taxonomy can have negative, neutral, or positive impact on the conversation (see Section 6.4 for an exploration of positive, negative, and neutral changes in IQ for each ACQI). The ACQIs are dialog system architecture-independent, in that we use only the text of the user’s input and the system’s output for annotation and training predictive models. A text version of the input and output, the actual language of the conversation, is available in nearly all systems, since even in spoken systems the ASR produces a written representation of its result. This approach is in contrast to using system-specific indicators, such as backend errors that depend on a particular type of functionality or ASR confidence scores that depend on having ASR. Moreover, the taxonomy is such that not all ACQIs are relevant to all dialog systems: for example, the LEGOv2 system does not support transfer to a human agent, so “Bad Transfer” cannot occur (though the associated action “Set Transfer Expectations” can still potentially be useful, for example by indicating a need to inform the user there are no human agents available).
Because IQ + ACQI will be aggregated and used to improve consumer experience (CX), we put an explicit emphasis on CX and actionability. In order to connect our ACQIs to actions available to bot-builders, we carried out a series of interviews with bot-builders and assembled a repertoire of tuning actions based on their practices, with descriptions and examples for each action. We identified 28 distinct actions bot-builders take at LivePerson. In the case of LEGOv2, we consulted with 2 domain experts and identified 31 actions.
Once the action set was defined, we next needed to attach the actions to ACQIs. A LivePerson expert in improving dialog systems created a proposal mapping each action onto any applicable ACQI(s). The expert then conducted interviews with LivePerson bot-builders to validate the mappings. After validation, we transposed the relationship so that for each of our ACQIs there was an assigned set of actions that the bot-builder may make (Table 2). The mapping showed that our ACQIs could guide bot-builders to take 23 of 28 possible actions for LivePerson systems and 25 of 31 possible actions for LEGOv2.
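A minimal sketch of this transposition step is shown below; the ACQI names are taken from the taxonomy in the text, while the action names are hypothetical placeholders, since the full contents of Table 2 are not reproduced here.

```python
# Hypothetical action -> ACQI mapping (action names are placeholders).
ACTION_TO_ACQIS = {
    "Add training phrases to intent": ["Does Not Understand", "Input Rejected"],
    "Set Transfer Expectations": ["Bad Transfer", "Unable to Resolve"],
    "Apply intent recognition to first utterance": ["Ignored Consumer Statement"],
}

# Transpose so that each ACQI lists the actions a bot-builder may take.
ACQI_TO_ACTIONS: dict[str, list[str]] = {}
for action, acqis in ACTION_TO_ACQIS.items():
    for acqi in acqis:
        ACQI_TO_ACTIONS.setdefault(acqi, []).append(action)
```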
5.1. Desirable properties of an ACQI taxonomy
A good ACQI taxonomy should be actionable, easy to understand, have highly bot-dependent ACQI incidence rates, and have a significant impact on consumer experience. In the following sections, we justify these properties and provide measurement strategies where appropriate. For some ACQIs and dialog systems, it may be appropriate to bypass the need for annotation by automatically extracting ACQIs from the system logs. An example of this would be an ACQI indicating that the NLU returned a confidence score below a predefined threshold and that the system responded by asking the user to rephrase their intent (Table 3).
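A rough sketch of such a log-based extraction rule is shown below; the log field names and marker phrases are assumptions for illustration, not part of any particular platform’s schema.

```python
# Hypothetical rule: flag turns where NLU confidence was low and the bot
# asked the user to rephrase (field names and phrases are assumed).
REPHRASE_MARKERS = ("could you rephrase", "didn't catch that")

def log_based_acqi(turn: dict, threshold: float = 0.5) -> str | None:
    low_confidence = turn.get("nlu_confidence", 1.0) < threshold
    asked_rephrase = any(m in turn.get("bot_text", "").lower()
                         for m in REPHRASE_MARKERS)
    return "Does Not Understand" if low_confidence and asked_rephrase else None
```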
5.1.1. Actionable
Without actionable ACQIs, bot-builders are left with the time-consuming process of reviewing large numbers of transcripts and/or carefully analyzing the model’s feature space, from which they can attempt to deduce what mitigation strategies are appropriate. Unfortunately, many features are not actionable from the bot-builder’s perspective. For instance, conversation length can be highly predictive of conversation quality, but from their perspective, there are no clear steps to uniformly reduce conversation length. The problem is that conversation length is not the cause of a bad conversation; it is a consequence of it.
For example, for the dialog system “Food Expert,” the initial turn was predicted to have ignored a consumer’s initial intent. This can lead to longer conversations, as can an order with complicated modifications, but the associated dialog quality improvement strategies for the operator are quite different. ACQI allows for separation and sizing of these situations: the second is varied and complicated, whereas the first is a single problem with a negative effect on IQ. From the bot-builder’s perspective, this one is a relatively easy fix: they just need to make sure that intent recognition is applied to the first customer utterance.
5.1.2. Easy to understand
For bot-builders, understanding is a necessary condition for fixing an issue. The terms used in the ACQI taxonomy are deliberately chosen to be familiar to bot-builders and have been refined with use to add clarity where requested. The relative success of this effort is reflected in the encouraging inter-annotator agreements reported in Table 4 below. However, this finding is potentially influenced partly by the use of experts in bot-building and conversation modeling as annotators. This has yet to be corroborated with less experienced bot-builders.
5.1.3. Reusable where appropriate
The ACQI taxonomy in Table 2 is designed to fit a wide breadth of task-driven dialog systems, and the considerable overlap between the LivePerson and the LEGOv2 columns reflects the fact that many of the ACQIs and actions are pertinent to both settings. However, the ACQI taxonomy for two dialog systems should only be shared insofar as this makes sense for the two systems. A key example is that many research systems do not have the ability to transfer the dialog to a human agent, so “Bad Transfer” is not applicable. Another example is the “Unable to Resolve” ACQI: this situation is common in many dialog systems, but the specific reasons why a resolution is not possible depends on what other backend capabilities the dialog system can call upon. For example, two dialog systems may both understand that a user wishes to reschedule an appointment, but if they have different levels of access to calendar data and booking operations, this may be resolvable by one system and not the other.
5.1.4. Bot dependent ACQI reports
As the purpose of the ACQI taxonomy is to guide bot-builders to appropriate fixes, it is imperative that the rates at which ACQIs are present are bot-dependent. Teams working on very specific dialog systems may wish to use a subset or create their own smaller taxonomy. For instance, if the dialog system does not allow for transfers (as is the case for LEGOv2), “Bad Transfer” should be excluded from the taxonomy. There is clearly a design tension between these bot-specific concerns and the desire for the taxonomy to be reusable where appropriate. This tension is a challenge from the point of view of dialog system research, though it is natural in the context of customer service as a whole. Taking again the example of a request to reschedule an appointment: the backend integration to reliably automate such requests is nearly always considerably larger than the work of adapting the ACQIs themselves. Manual maintenance and adaptation of the ACQI taxonomy is often a comparatively smaller concern for commercial than for research dialog systems, whereas the need for ACQIs dedicated to the needs of various situations is sometimes more pressing.
5.1.5. Impact on customer experience
Each ACQI should have an actionable relevance to CX. As we demonstrate in Section 6.4, several elements of the taxonomy only indicate a poor customer experience in particular circumstances. For example, the “Ask for Confirmation” ACQI is more likely to indicate a negative CX when it occurs multiple times within a given dialog.
This concludes the summary of the ACQI taxonomy itself. In practice, we found that ACQIs were most reliably predicted and effectively used in combination with a running quality score. This work is described next.
6. Combining ACQIs with IQ in conversational datasets
This section describes the work done on annotating ACQIs along with IQ, which turned out to be necessary for distinguishing those ACQIs that warrant action. This is because dialog context matters. For instance, a system attempting to correct a misunderstanding of a malformed user statement is quite different from a system failing to understand an unambiguous answer to a direct question.
6.1. Observations from annotating ACQIs
Our initial approach to building a model for finding ACQIs and making associated recommendations to bot-builders was based on the assumption that particular ACQIs are bad for the user experience and should be avoided. This turned out not to be the case. As the conversations from the LivePerson datasets were annotated, we observed that ACQIs alone may be bad, good, or neither. Asking for confirmation (i.e., the “Ask for Confirmation” ACQI) contributes to a good conversation when the user input is genuinely vague, but results in a poor conversational experience when the user input was clear and the system should have understood. For example, asking the user to confirm that “next Wednesday” refers to a particular calendar date is sometimes helpful (especially if today is Tuesday!). Asking whether the user’s response was “3” when the user just selected “3” can, by contrast, be obvious and irritating. If a consumer uses “Restart” one time, it can be seen as a good signal that the consumer requested to go back and the system responded appropriately, but if it is used more frequently it can be a signal that their request is not being properly handled. Based on ACQI designations alone, bot-builders cannot always be sure whether a corrective action is needed. Essentially, ACQIs are context-dependent and combine (with context and with each other) nonlinearly. In this work, we explore the context dependency via the relationship with IQ. We leave a more structured statistical modeling framework for future work.
6.2. Annotating ACQIs and IQ together
To mitigate the issue of ACQI instances not being universally good, bad, or neutral, it was decided to combine ACQIs with IQ scores. There are several differences between the guidelines given in Schmitt et al. (2011) and Schmitt and Ultes (2015) and our own. Most importantly, we removed all guidelines relating to how much the score can be increased or decreased. Given restrictive guidelines, the change in IQ is almost entirely (99.7%) in increments of 0, 1, or -1. From observations of negativity bias (Rozin and Royzman 2001), we know that humans are biased toward giving greater weight to experiences with undesirable behaviors. Allowing larger changes gives annotators the ability to reflect this. The impact of the removal of these guidelines can be seen in Fig. 3. In spite of these changes, our annotator agreement slightly increased, with $\rho = 0.69$ and $0.72$ reported in their work and $\rho = 0.76$ in ours.
We also altered the “dissatisfaction scale” used by Schmitt and Ultes (2015) (5-satisfactory, 4-slightly unsatisfactory, 3-unsatisfactory, 2-very unsatisfactory, 1-extremely unsatisfactory) to 5-good, 4-satisfactory, 3-bad, 2-very bad, 1-terrible. While still retaining a dissatisfaction bias, we added the “good” category because we wanted to allow for the possibility of systems having good interactions. While we largely agree with Schmitt and Ultes (2015) that a dissatisfaction scale has advantages for the task of identifying problems in a conversation, we find utility in having an additional positive option. Identifying good interactions can provide models for bot-builders of what works, and the scale should be more robust to a future in which these systems perform well more of the time. The range of our IQ scores allows room for improvement even if currently the improvement is infrequent or not technically feasible. In addition to wanting to build on the IQ approach’s success, we stayed with a five-point scale rather than a simpler three-point scale such as “good/neutral/bad” because the five-point scale fits the range of interactions better. Conversational interactions move the conversation forward smoothly (“good”) or clumsily (“satisfactory”), or something has gone wrong, for example the system has misunderstood and/or done the wrong action (“bad,” “very bad,” or “terrible”). It is not clear what a neutral interaction would be, and reduction to a binary positive/negative labeling is too simplistic. With a range of “bad” values, bot-builders can use those values to prioritize which problem to address first.
Finally, the annotation guideline from Schmitt and Ultes (2015) dictating that each conversation should start with a satisfactory rating was removed. Although annotators were encouraged to start conversations with a “Satisfactory” rating, they were permitted to assign other ratings when motivated, largely because of cases we observed in which a consumer initiated a conversation and the bot ignored or misunderstood their first message.
6.3. Annotation and inter-annotator agreement
Annotation of the 531 LEGOv2 conversations and 520 LivePerson conversations was carried out by three experienced annotators employed by LivePerson. Annotators were given instructions (see Appendix A) and a small initial annotation job. The results of the initial job were quality assured by LivePerson taxonomy experts, and feedback was given to individual annotators to increase alignment and consistency. After QA and feedback, the annotators were given larger jobs, and those jobs were measured for agreement. Following Schmitt et al. (2011) and Schmitt and Ultes (2015), we took the median score of our 3 annotators as ground truth. For 3-way ties with ACQIs (6.4% of turns), we chose the most common label for that particular dialog system out of the labels the annotators had chosen. So, as in Schmitt et al. (2011) and Schmitt and Ultes (2015), we use third parties rather than the users of the dialog system to judge the turn-by-turn quality, and as in that work our annotators are experts. The biggest differences between their annotation work and ours are that the guidelines for the running IQ score are simplified, and we include an additional annotation task for ACQI to recommend an appropriate fix. The averages of the minimum and final IQ scores are shown in Table 3, which shows that LEGOv2 has the lowest dips in IQ score during a conversation, but by the end of the conversations, the IQ for the various bots is quite similar.
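A minimal sketch of this aggregation step, under our own assumptions about how the per-turn labels are stored, looks like the following:

```python
# Gold IQ: median of the three annotators. Gold ACQI: majority label, with
# three-way ties broken by the most common label for that dialog system.
import statistics
from collections import Counter

def gold_iq(scores: list[int]) -> int:
    return int(statistics.median(scores))

def gold_acqi(labels: list[str], system_label_counts: Counter) -> str:
    counts = Counter(labels)
    label, freq = counts.most_common(1)[0]
    if freq > 1:                       # a majority exists
        return label
    # Three-way tie: fall back to the system's most frequent label overall.
    return max(labels, key=lambda l: system_label_counts[l])
```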
In spite of having similar overall IQ outcomes, the LEGOv2 dialog system has fewer neutral turns and more positive and negative turns, whereas the LivePerson dialog systems have more neutral turns. This is shown in Fig. 4.
We measured inter-annotator agreement to assess the clarity of our ACQI taxonomy and IQ rules when applied to real dialog systems. The results have a Cohen’s Kappa (CK) of 0.68, which is substantial agreement according to Landis and Koch (1977). See Table 4 for more details.
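For reference, a sketch of the agreement computation, assuming two annotators’ labels are aligned turn by turn (scikit-learn provides both the unweighted kappa used here and the linearly weighted variant reported later in Table 6):

```python
from sklearn.metrics import cohen_kappa_score

# Toy example: two annotators' ACQI labels aligned turn by turn.
annotator_a = ["Ask for Information", "Does Not Understand", "Provides Assistance"]
annotator_b = ["Ask for Information", "Input Rejected", "Provides Assistance"]
acqi_kappa = cohen_kappa_score(annotator_a, annotator_b)

# Ordinal IQ scores (1-5) use the linearly weighted variant.
iq_a, iq_b = [5, 4, 3, 4], [5, 3, 3, 4]
iq_kappa = cohen_kappa_score(iq_a, iq_b, weights="linear")
```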
6.4. Analysis of combined ACQI and IQ annotations
By looking at the turn-to-turn change in IQ given the presence of a particular ACQI, we gain a more nuanced understanding of how a dialog system is performing. In previous attempts to measure dialog quality, a lack of understanding and an inability to resolve a consumer’s issue (task completion) are taken to be unequivocally bad. This is not always the case.
From Fig. 5, we see that with the exception of “Bad Transfer” (which was not annotated as positive), all of our ACQIs can be positive, negative, or neutral. “Does Not Understand,” “Ignored Consumer Statement,” and “Input Rejected” are usually negative, but can be reasonable responses if the consumer is typing things that are out of scope or incomprehensible. However, they are negative when the consumer’s utterance should have been understood by the system. “Unable to Resolve” is positive when the system correctly identifies that the consumer’s request is out of scope. “Provides Assistance” is positive if the assistance was requested and appropriate, but negative if it was not, or if the assistance has already been provided. Similarly, “Ask for Information” can be positive or negative, depending on the relevance of the requested information.
Analyzing the annotated datasets by ACQI and dialog system is also instructive (see Table 5). For most of the dialog systems, “Does Not Understand” is the largest category of ACQI associated with a decrease in score. The Junior Sales Assistant is an exception: for this system, 54% of the ACQIs are marked as “Provides Assistance,” indicating that this bot is making too many improper transfers or inappropriate suggestions to the customer. While having the score alone would be somewhat valuable in locating this issue, the ACQI gives additional guidance on what steps can be taken to mitigate the undesirable behavior.
We can also analyze combinations of ACQIs and their effect on a conversation. For example, Fig. 6 shows the mean score change with a 95% error bound for the “Ask for Confirmation” ACQI. We observe that asking for confirmation once normally leads to an increase in IQ score, but after this the effectiveness decreases, and asking for confirmation more than 6 times can be actively harmful. It is important to realize that the ACQIs combine nonlinearly. That is, a single instance of an ACQI in a conversation may be healthy, but multiple copies of it may be a negative indicator. The extent to which the ACQIs have known meaning is the extent to which structural statistical models can be built, allowing bot-builders to test and refine explicit hypotheses about how conversational events actually aggregate into the dialog system user experience. The careful definition and annotation work described so far enables us to start quantifying such effects in ways that were not hitherto available. Analyzing such combinations of causes and effects in dialog systems will be extended in future work.
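A sketch of the aggregation behind this kind of analysis is given below; the per-turn data layout (one ACQI label and one IQ change per turn) is our assumption for illustration.

```python
from collections import defaultdict

def mean_iq_delta_by_count(conversations, acqi="Ask for Confirmation"):
    """Average IQ change at the 1st, 2nd, ... occurrence of an ACQI.

    Each conversation is a list of turns of the form
    {"acqi": str, "iq_delta": int} (assumed layout).
    """
    deltas = defaultdict(list)
    for turns in conversations:
        occurrences = 0
        for turn in turns:
            if turn["acqi"] == acqi:
                occurrences += 1
                deltas[occurrences].append(turn["iq_delta"])
    return {n: sum(v) / len(v) for n, v in sorted(deltas.items())}
```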
7. Experimental validation of the predictive ability of the ACQI and IQ model
While the annotation and analysis so far were already able to provide some useful insights, the larger goal of these efforts is to build systems that can automatically highlight crucial ACQIs in a dialog system so that they can be fixed.
There is prior work in predicting IQ from labeled conversations, using a variety of methods. The original work of Schmitt et al. (2011) used Support Vector Machines, and more recent work has used stateful neural network models including LSTMs (Rach et al. 2017) and biLSTMs (Bodigutla et al. 2020).
7.1. Vectorized features from conversations
In common with these approaches, we adopted a vector representation for features. However, we deliberately restricted our model to use only features extracted directly from the conversation text and the annotation, so that the method could be more universally applicable, and in particular, to be able to use the same annotation setup and featurization processes for data from the LEGOv2 and LivePerson dialog systems.
For our text features (indicated with “text” in Table 6), we use the pretrained contextual sentence embeddings of Sentence-BERT (Reimers and Gurevych 2019). Our embedding dimension is 768 for each speaker’s response. When predicting ACQI and IQ, we concatenate the embeddings for both consumer and dialog system for the two most recent turns (the current turn and one prior). This results in a $4\times 768 = 3072$ dimensional vector. When the previous utterance is unavailable, we use a zero vector of the appropriate dimension. This feature vector represents a system where the model uses only surface textual features for the two most recent turns per speaker.
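A minimal sketch of this featurization using the sentence-transformers library is shown below; the specific pretrained checkpoint (“all-mpnet-base-v2”, which produces 768-dimensional embeddings) is an assumption, since the paper only specifies Sentence-BERT with dimension 768.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-mpnet-base-v2")  # assumed checkpoint, 768-dim
DIM = 768

def embed(text: str | None) -> np.ndarray:
    # Zero vector when the previous utterance is unavailable.
    return encoder.encode(text) if text else np.zeros(DIM)

def text_features(cust_prev, bot_prev, cust_cur, bot_cur) -> np.ndarray:
    # Consumer and system embeddings for the two most recent turns,
    # concatenated into a 4 x 768 = 3072-dimensional vector.
    return np.concatenate([embed(cust_prev), embed(bot_prev),
                           embed(cust_cur), embed(bot_cur)])
```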
To compare with a system that also uses its observations and predictions of the rest of the conversation so far, we experimented with adding features derived from the ACQI labels. For the features “annotated-acqi” and “predicted-acqi,” we use cumulative counts over a one-hot encoding of the ACQIs present. These feature sets were also combined with the text features as “annotated-acqi+text” and “predicted-acqi+text,” for which we concatenated the cumulative ACQI counts to the contextual sentence embeddings.
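A sketch of these count features, continuing the previous example (the abbreviated label list is illustrative only):

```python
import numpy as np

ACQI_LABELS = ["Does Not Understand", "Ask for Information",
               "Ask for Confirmation", "Provides Assistance"]  # abbreviated

def acqi_count_features(acqis_so_far: list[str]) -> np.ndarray:
    # Cumulative counts over a one-hot encoding of the ACQIs seen so far.
    counts = np.zeros(len(ACQI_LABELS))
    for label in acqis_so_far:
        if label in ACQI_LABELS:
            counts[ACQI_LABELS.index(label)] += 1
    return counts

def acqi_plus_text(acqis_so_far: list[str], text_vec: np.ndarray) -> np.ndarray:
    # "annotated-acqi+text" / "predicted-acqi+text": counts prepended to text.
    return np.concatenate([acqi_count_features(acqis_so_far), text_vec])
```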
7.2. Model training and prediction
To get a good representation of many ACQIs across the training and test sets, we performed nested cross-validation (following Krstajic et al. 2014), which reduces bias when testing different configurations. We implemented nested cross-validation using the multi-label Scikit-learn Python package and methodology of Szymański and Kajdanowicz (2017), which supports balanced multi-label train/test splits. Our implementation used 5 cross-validation folds for the inner and outer loops, where we stored the estimates for each sample in the test set of a fold and computed a final score over all of the folds.
We use the features described in Section 7.1 to train a variety of classifiers: for predicting ACQI we tested logistic regression, random forest, and xgboost, and for IQ we tested the above plus a linear regressor. We found that the best-performing text-based model for predicting both ACQIs and IQ was logistic regression, and we report the results using this model in Sections 7.3 and 7.4. In this model, C (inverse regularization strength) is set to 0.01 and we use the “balanced” class weight setting in Scikit-learn. For predicting IQ, we also trained a BERT classifier (BERT-Base; Devlin et al. 2019) using back-propagation to tune the encoder weights. We use Hugging Face’s bert-base-uncased, tuned on subsets of the LEGO and LivePerson data. Our input for the BERT classifier is the concatenated text from the two most recent turns for the consumer and dialog system, separated with special [cust] and [bot] tokens to indicate the speaker.
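As a sketch of this configuration (plain stratified K-fold is shown here for brevity; the actual runs use the nested, multi-label-balanced splits described above):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Logistic regression with the settings reported above.
clf = LogisticRegression(C=0.01, class_weight="balanced", max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# X: turn-level feature matrix from Section 7.1; y: ACQI or IQ labels.
# preds = cross_val_predict(clf, X, y, cv=cv)

def bert_input(cust_prev, bot_prev, cust_cur, bot_cur) -> str:
    # Assumed input formatting for the BERT classifier: the two most recent
    # turns concatenated, with speaker tokens marking consumer and bot.
    return f"[cust] {cust_prev} [bot] {bot_prev} [cust] {cust_cur} [bot] {bot_cur}"
```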
7.3. Predicting IQ: Results and discussion
For the IQ prediction experiment, results for each of the dialog systems using a BERT model with text-only input and a logistic regression classifier using features derived from text and ACQI scores (as described in Sections 7.1 and 7.2) are presented in Table 6. These results can also be compared with the annotator agreement findings of Table 4.
Points to highlight include:
- The text-only prediction, which is the easiest to implement, achieves linear weighted Cohen’s Kappa performance slightly lower than, but comparable to, annotator agreement (average 0.49 vs. 0.52).
- Using only ACQI information (either annotated or predicted) leads to a loss in recall in all but one dialog system, and on average reduces recall by around 10% using annotated labels and 20% using predicted labels.
- Though the use of annotated labels along with text features leads to improved correlation with annotators (e.g., an average 6% improvement in kappa score), these gains do not transfer reliably to the more realistic case of using predicted ACQI labels as features.
- The increase in average recall from using annotated labels is small (only 2%), and the use of predicted labels causes average recall to drop by 7%.
- The logistic regression models outperform BERT for all dialog systems except LEGOv2. This suggests that BERT underperforms on this task when training data is limited.
- The best Unweighted Average Recall (UAR) results are between the best result using a BiLSTM with traditional cross-validation (0.78) and the best result using dialog-wise cross-validation (0.54) presented by Ultes (2019).
These experiments leave much room for optimization and improvement of various kinds, including trying different text featurizers, and the number of turns and relative weights of messages used in the vector encoding. The important findings are that we can predict the exact IQ score approximately 60% of the time and that the use of vector embeddings derived directly from the message texts is the most reliable practical method tested for building features.
7.4. Predicting ACQI: Results and discussion
For the ACQI prediction experiment, results for each of the dialog systems using only text-derived features and logistic regression as a classifier (as described above) are presented in Table 7.
Points to note include:
- The overall weighted average f1-score is 0.790. We are predicting the correct ACQI nearly 80% of the time overall.
- The accuracy of the classification depends significantly on the support of the class: common ACQIs are predicted much more accurately than rare ones.
- Because of this, the macro average performance (not weighted by support) is worse, with an f1-score of just 0.574.
- The f1-score for an ACQI can be cross-referenced against Fig. 5 and Table 5 to guide improvement efforts. For example, “Does Not Understand” occurs relatively frequently and with overwhelmingly negative impact on IQ score, but the f1-score for predicting this class is only 0.509. Improving ACQI classification performance for this class would therefore be especially impactful.
Learning curves in Fig. 7 show some of the change in performance with different training set sizes.
- It can be seen that for both LEGOv2 and LivePerson dialog systems, macro and average performance on a 20% held-out set improves sharply for the first 125 training conversations.
- Given that training on only 125 conversations leads to near-peak performance, this offers the possibility of fast tuning iterations.
We investigated the generalizability of the IQ and ACQI models using a productionized version of the underlying LivePerson model. These results can be found in Appendix B. We find that the models transfer well across systems within the LivePerson framework; however, the LivePerson model does not generalize well outside of the framework to the LEGOv2 data. Furthermore, the LEGOv2 model does not generalize well to LivePerson data. Further investigation into cross-domain generalizability is an important area for future work, particularly the differences between written and transcribed speech data, though we find it promising that different bots within a single framework can share a model.
7.5. Using ACQIs to simplify bot-tuning
As a final result, we estimate the extent to which predicting the correct ACQI could help bot-builders involved in bot-tuning. Referring back to the ACQI taxonomy in Table 2, without any extra contextual guidance, bot-builders have 28 possible action strategies with LivePerson dialog systems and 31 with LEGOv2 dialog systems. The unaided default case for bot-builders is exhaustive search of those action strategies.
This was compared with the number of options that would be available using the predicted ACQI labeling. For this simple simulation, we made the following assumptions:
1. Each appropriate action is equally likely in the absence of IQ/ACQI.
2. If IQ is available, tuning is only required when the score decreases. If IQ is unavailable, then all actions related to the system are relevant.
3. Given the presence of a decremented IQ score, each action is equally likely.
4. If an ACQI is available, all actions that are not assigned to at least one ACQI are still included in the list of options available to the bot-builder.
5. There is always a special action, $\langle$No Action$\rangle$, that may be applicable for the bot-builder.
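A minimal sketch of the per-turn option counting under these assumptions is given below; the ACQI-to-action mapping is a placeholder (see the hypothetical mapping sketched in Section 5).

```python
def options_with_acqi(predicted_acqi: str,
                      acqi_to_actions: dict[str, set[str]],
                      all_actions: set[str]) -> int:
    """Number of candidate actions when an ACQI prediction is available.

    Actions not assigned to any ACQI always remain candidates (assumption 4),
    plus the special <No Action> option (assumption 5).
    """
    assigned = set().union(*acqi_to_actions.values()) if acqi_to_actions else set()
    candidates = acqi_to_actions.get(predicted_acqi, set()) | (all_actions - assigned)
    return len(candidates) + 1  # +1 for <No Action>
```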
The results of this exercise are presented in Table 8. We found that the largest single simplification comes from the use of IQ: if IQ can be modeled accurately, then the average number of recommended options is reduced from 28.6 to 9.73 (about 34%). Adding ACQI classifications as well reduces the average down to 5.4 (about 19% of the original number). This makes a hypothetical but strong case: if IQ and ACQI can be accurately predicted on a turn-by-turn level, the amount of effort it takes a bot-builder to diagnose problems and suggest possible solutions could be reduced by an estimated 81%. While this is an optimistic hypothesis, the potential reward is large enough to encourage more development in this area. Note that even with inaccurate ACQI classification and IQ scores, the search space is fixed so a bot-builder will do no worse than the default exhaustive search they would require without assistance. However, any correct ACQIs and IQs will reduce options for at least a portion of the cases. Although we have not explicitly modeled this, in practice the bot-builders will be looking at aggregate statistics on ACQIs and IQ scores and that aggregation will make inaccuracies in the predictions less important. Even less-than-perfect classification and scores will identify multiple occurrences of a problematic ACQI and the general direction of its IQ score.
8. Conclusion
ACQIs are designed to provide bot-builders with actionable explanations of why their deployed dialog systems fail based on data from user interactions. We have explored the key desirable properties for building an ACQI taxonomy, based on recommendations from the literature, interviews, and collaboration with dialog system experts. Based on an annotated dataset of just over 1000 conversations, we have shown that ACQIs are particularly useful when combined with IQ, in particular so that the decision of whether to take a recommended action can be focused on places in the dialog where quality decreases.
The annotated datasets were used to train predictive models, which achieved a weighted average f1-score of 79% using features based solely on vectorized embeddings of recent messages and logistic regression for classification. While these results are preliminary, such a classification model could be used to reduce the number of options for a bot-builder to consider by as much as 81%. Results like this should be directly useful to bot-builders for troubleshooting and refining their dialog systems: if the ACQI-based suggestions show up as a tooltip (similar to refactoring tips in software-integrated development environments), they may be useful in the majority of cases, while being easy to ignore in the remainder.
The prioritization of the bot-builder as a key user persona is the driving principle for much of this work. We hope that more research focused on making bot-builders more effective is encouraged and highlighted in the dialog system community, as a crucial route to optimizing the experience of dialog system users overall.
Acknowledgements
The authors would like to thank LivePerson Inc. for financial support while this work was carried out.
Competing interests
All authors were employed at LivePerson, Inc during the main work described in this paper. The authors declare no other competing interest.
Appendix A: Annotation Instructions
Annotation was done by in-house LivePerson annotators using an annotation tool illustrated in Fig. A1 to apply the scores in Table A1 and labels in Table A2. The tool provides a “tooltip” on demand by hovering over items. For the annotation used in this work, the tooltip bubble included the ACQI (as “Bot State”) and IQ labels (as “Quality”); hovering over those labels accesses the label definition. As described earlier, ACQI and IQ were annotated at the same time in the same pass. Note that “start” in the instructions to the annotators means a state before the conversation begins.
The following are the annotation instructions used by our expert annotators:
Assumptions for annotations
- Annotation starts at “ask for information” and “satisfactory.”
- Increment and decrement as much as necessary depending on how the conversation is going.
- The conversation is rated at the turn level and considering the conversation as it goes along.
Please make sure to pick carefully between the Bot States “request confirmation” and “request information.” These are easily confused, so make sure to take a moment and consider the difference between them. “Ask for confirmation” is only when the bot is asking to confirm something the customer has already said, while “ask for information” is a request for new information that the customer has not provided. Similarly, “input rejected,” “does not understand” and “ignored consumer statement” are very easily confused. Please refer to the definition tool-tips for these 3 Bot States so that you are applying them correctly.
For the Quality Rating, we recommend everyone starting each conversation with “Satisfactory” and going from there. This will help everyone align on the same Quality Rating.
[In addition in training, the annotators were told to start (in their heads, not as an annotation) with Satisfactory as the starting point and increment up or down based on how the bot performs starting with the first turn, and then adjusting per turn as the bot does better or worse]
Appendix B: Generalizability
Tables B1 and B2 present generalization results across multiple LivePerson domains on an internal dataset. We considered four domains in the LivePerson internal dataset. For each domain, conversations from that domain were held out as the test set, and the models were trained on the conversations from the remaining domains. This process was repeated for all four domains.
The knowledge transfer within the LivePerson domains was quite good. However, we also compared the models trained on LivePerson data and evaluated on LEGOv2 data and vice versa. Tables B3 and B4 show that the knowledge transfer between these two datasets is poor. Further investigation into the reason for this is an important area for future work.