1. Introduction
We introduce the novel task of understanding multi-sentence questions. Specifically, we focus our attention on multi-sentence entity-seeking questions (MSEQs), that is, questions that expect one or more entities as the answer. Such questions are commonly found in online forums, blog posts, discussion boards, etc., and come from a variety of domains including tourism, books, and consumer products.
Figure 1 shows an example of an MSEQ from a tourism forumFootnote a, where the user is interested in finding a hotel that satisfies some constraints and preferences; an answer to this question is thus the name of a hotel (entity) which needs to satisfy some properties such as being a “budget” option. A preliminary analysis of such entity-seeking questions from online forums reveals that almost all of them contain multiple sentences—they often elaborate on a user’s specific situation before asking the actual question.
In order to understand and answer such a user question, we convert the question into a machine representation consisting of labels identifying the informative portions in a question. We are motivated by our work’s applicability to a wide variety of domains and therefore choose not to restrict the representation to use a domain-specific vocabulary. Instead, we design an open semantic representation, inspired in part by Open QA (Fader, Zettlemoyer and Etzioni Reference Fader, Zettlemoyer and Etzioni2014), in which we explicitly annotate the answer (entity) type; other answer attributes, while identified, are not further categorized. For example, in Figure 1 “place to stay” is labeled as entity.type while “budget” is labeled as an entity.attr. We also allow attributes of the user to be represented. Domain-specific annotations such as location for tourism questions are permitted. Such labels can then be supplied to a downstream information retrieval (IR) or a QA component to directly present an answer entity.
We pose the task of understanding MSEQs as a semantic labeling (shallow parsingFootnote b) task where tokens from the question are annotated with a semantic label from our open representation. However, in contrast to related literature on semantic role labeling (SRL) (Yang and Mitchell Reference Yang and Mitchell2017), slot-filling tasks (Bapna et al.Reference Bapna, Tur, Hakkani-Tur and Heck2017), and query formulation (Vtyurina and Clarke Reference Vtyurina and Clarke2016; Wang and Nyberg Reference Wang and Nyberg2016; Nogueira and Cho Reference Nogueira and Cho2017), semantic parsing of MSEQs raises several novel challenges.
MSEQs express a wide variety of intents and requirements that span multiple sentences, requiring the model to capture within-sentence as well as inter-sentence interactions effectively. In addition, questions can be unnecessarily belabored, requiring the system to reason about what is important and what is not. Lastly, we find that generating training data for parsing MSEQs is hard due to the complex nature of the task, so models must operate in low training data settings.
In order to address these challenges and label MSEQs, we use a bidirectional LSTM conditional random field (BiLSTM CRF) (Huang, Xu and Yu Reference Huang, Xu and Yu2015) as our base model and extend it in three ways. First, we improve performance by inputting contextual embeddings from BERT (Devlin et al. Reference Devlin, Chang, Lee and Toutanova2019) into the model. We refer to this configuration as BERT BiLSTM CRF. Second, we encode knowledge by incorporating hand-designed features as well as semantic constraints over the entire multi-sentence question during end-to-end training. This can be thought of as incorporating constrained conditional model (CCM)-style constraints and inference (Chang, Ratinov and Roth Reference Chang, Ratinov and Roth2007) in a neural model. Finally, we find that crowdsourcing complete annotations is hard, since the task is complex. In this work, we are able to improve training by using partially labeled questions, which are easier to source.
1.1 Contributions
In summary, our paper makes the following contributions:
-
1. We present the novel task of understanding MSEQs. We define open semantic labels which minimize schema or ontology-specific semantic vocabulary and can easily generalize across domains. These semantic labels identify informative portions of a question that can be used by a downstream answering component.
-
2. The core of our model uses a BERT BiLSTM CRF model. We extend this by providing hand-designed features and using CCM inference, which allows us to specify within-sentence as well as inter-sentence (hard and soft) constraints. This helps encode prior knowledge about the labeling task.
-
3. We present detailed experiments on our models using the tourism domain as an example. We also demonstrate how crowdsourced partially labeled questions can be effectively used in our constraint-based tagging framework to help improve labeling accuracy. We find that our best model achieves 15 points (pt) improvement in F1 scores over a baseline BiLSTM CRF.
-
4. We demonstrate the applicability of our semantic labels in two different end tasks. (i) The first is a novel task of directly answering tourism-related MSEQs using a web-based semi-structured knowledge source. Our semantic labels help formulate a more effective query to knowledge sources and our system answers 36% more questions with 35% more (relative) accuracy as compared to baselines. (ii) The second task is semantic labeling of MSEQs in a new domain about book recommendations with minimal training data.
2. Problem statement
Given an MSEQ, our goal is to first parse and generate a semantic representation of the question using labels that identify informative portions of a question. The semantic representation of the question can then be used to return an entity answer for the question, using a knowledge source. Thus, our QA system consists of two modules (see Figure 2): question understanding (MSEQ parsing) and a querying module to return entity answers. The modularized two-step architecture allows us to tackle different aspects of the problem independently. The semantic representation generated by the question understanding module is generic and not tied to a specific corpus or ontology. This allows the answering module to be optimized efficiently for any knowledge source, and supports the integration of multiple data sources, each with its own schema and strengths for answering. In this paper, we experiment with the Google Places Web collectionFootnote c as our knowledge source. It consists of semi-structured data including geographic information, entity categories, entity reviews, etc. The collection is queried using a web API that accepts an unstructured text string as query.
3. Related work
To the best of our knowledge, we are the first to explicitly address the task of understanding MSEQs and demonstrate its use in an answering task. There are different aspects of our work that relate to existing literature and we discuss them in this section. We begin by contrasting our work on multi-sentence question understanding and answering with recent work on question-answering (Section 3.1). We then include a review of related work on semantic representations of questions (Section 3.2) followed by a brief survey of recent literature on semantic labeling (Section 3.3). We conclude with a summary in Section 3.4.
3.1 Question answering systems
There are two common approaches for QA systems—joint and pipelined, both with different advantages. The joint systems usually train an end-to-end neural architecture, with a softmax over candidate answers (or spans over a given passage) as the final layer (Iyyer et al. Reference Iyyer, Boyd-Graber, Claudino, Socher and Daume2014; Rajpurkar et al. Reference Rajpurkar, Jia and Liang2018). Such systems can be rapidly retrained for different domains, as they use minimal hand-constructed or domain-specific features. But, they require huge amounts of labeled QA pairs for training.
In contrast, a pipelined approach (Kwiatkowski et al. Reference Kwiatkowski, Choi, Artzi and Zettlemoyer2013; Berant and Liang Reference Berant and Liang2014; Fader et al. Reference Fader, Zettlemoyer and Etzioni2014; Fader, Zettlemoyer and Etzioni Reference Fader, Zettlemoyer and Etzioni2013; Vtyurina and Clarke Reference Vtyurina and Clarke2016; Wang and Nyberg Reference Wang and Nyberg2016) divides the task into two components—question processing (understanding) and querying the knowledge source. Our work follows the second approach.
We summarize popular approaches in QA systems on the basis of (a) the type of questions they answer, (b) the nature of the knowledge base/corpus used for answering, and (c) the nature of the answers returned by the answering system (see Table 1).
In this paper, we return entity answers to MSEQs. The problem of returning direct (non-document/passage) answers to questions from background knowledge sources has been studied, but primarily for single-sentence factoid-like questions (Berant and Liang Reference Berant and Liang2014; Fader et al. Reference Fader, Zettlemoyer and Etzioni2014; Sun et al. Reference Sun, Ma, Yih, Tsai, Liu and Chang2015; Yin et al. Reference Yin, Duan, Kao, Bao and Zhou2015; Saha et al. Reference Saha, Floratou, Sankaranarayanan, Minhas, Mittal and Özcan2016; Khot, Sabharwal and Clark Reference Khot, Sabharwal and Clark2017; Lukovnikov et al. Reference Lukovnikov, Fischer, Lehmann and Auer2017; Zheng et al. Reference Zheng, Yu, Zou and Cheng2018; Zhao et al. Reference Zhao, Chung, Goyal and Metallinou2019). Reading comprehension tasks (Trischler et al. Reference Trischler, Wang, Yuan, Harris, Sordoni, Bachman and Suleman2016; Joshi et al. Reference Joshi, Choi, Weld and Zettlemoyer2017; Trivedi et al. Reference Trivedi, Maheshwari, Dubey and Lehmann2017; Rajpurkar et al. Reference Rajpurkar, Jia and Liang2018; Yang et al. Reference Yang, Qi, Zhang, Bengio, Cohen, Salakhutdinov and Manning2018; Dua et al. Reference Dua, Wang, Dasigi, Stanovsky, Singh and Gardner2019), which require answers to be generated from unstructured text, also only return answers for relatively simple (single-sentence) questions.
Other works have considered multi-sentence questions, but in different settings, such as the specialized setting of answering multiple-choice SAT exam questions and science questions (Seo et al. Reference Seo, Hajishirzi, Farhadi, Etzioni and Malcolm2015; Clark et al. Reference Clark, Etzioni, Khot, Sabharwal, Tafjord, Turney and Khashabi2016; Guo et al. Reference Guo, Liu, He, Liu, Zhao and Wei2017; Khot et al. Reference Khot, Sabharwal and Clark2017; Palmer, Hwa and Riedel Reference Palmer, Hwa and Riedel2017; Zhang et al. Reference Zhang, Wu, He, Liu and Su2018), mathematical word problems (Liang et al. Reference Liang, Hsu, Huang, Li, Miao and Su2016), and textbook questions (Sachan, Dubey and Xing Reference Sachan, Dubey and Xing2016). Such systems do not return entity answers to questions. Community QA systems (Pithyaachariyakul and Kulkarni Reference Pithyaachariyakul and Kulkarni2018; Qiu and Huang Reference Qiu and Huang2015; Shen et al. Reference Shen, Rong, Jiang, Peng, Tang and Xiong2015; Tan et al. Reference Tan, Xiang and Zhou2015; Bogdanova and Foster Reference Bogdanova and Foster2016) match questions with user-provided answers, instead of entities from background knowledge source. IR-based systems (Vtyurina and Clarke Reference Vtyurina and Clarke2016; Wang and Nyberg Reference Wang and Nyberg2016; Pithyaachariyakul and Kulkarni Reference Pithyaachariyakul and Kulkarni2018) query the web for open-domain questions, but return long (1000-character) passages as answers; they have not been developed for or tested on entity-seeking questions. These techniques that can handle MSEQs (Vtyurina and Clarke Reference Vtyurina and Clarke2016; Wang and Nyberg Reference Wang and Nyberg2016; Pithyaachariyakul and Kulkarni Reference Pithyaachariyakul and Kulkarni2018) typically perform retrieval using keywords extracted from questions; these do not “understand” the questions and cannot answer many tourism questions, as our experiments show (Section 7). The more traditional solutions (e.g., semantic parsing) that parse the questions deeply can process only single-sentence questions (Fader et al. Reference Fader, Zettlemoyer and Etzioni2013, Reference Fader, Zettlemoyer and Etzioni2014; Kwiatkowski et al. Reference Kwiatkowski, Choi, Artzi and Zettlemoyer2013; Berant and Liang Reference Berant and Liang2014; Zheng et al. Reference Zheng, Yu, Zou and Cheng2018).
Finally, systems such as QANTA (Iyyer et al. Reference Iyyer, Boyd-Graber, Claudino, Socher and Daume2014) also answer complex multi-sentence questions, but their methods can only select answers from a small list of entities and also require large amounts of training data with redundancy of QA pairs. In contrast, the Google Places API we experiment with (as our knowledge source) has millions of entities. It is important to note that for answering an MSEQ, the answer space can include thousands of candidate entities per question, with large unstructured review documents about each entity that help determine the best answer entity. Thus, these documents are significantly longer than passages (or similar length articles) that have traditionally been used in neural QA tasks. Recently, tasks that require multi-hop reasoning have also been proposed. This involves simple QA via neural machine comprehension of longer/multi-passage documents (Trivedi et al. Reference Trivedi, Maheshwari, Dubey and Lehmann2017; Welbl et al. Reference Welbl, Stenetorp and Riedel2018; Yang et al. Reference Yang, Qi, Zhang, Bengio, Cohen, Salakhutdinov and Manning2018). Extending such a task for MSEQs could be an interesting extension for future work.
We discuss literature on parsing (understanding) questions in the next section.
3.2 Question parsing
QA systems use a variety of different intermediate semantic representations. Most of them, including the rich body of work in NLIDB (Natural Language Interfaces for Databases) and semantic parsing, parse single sentence questions into a query based on the underlying ontology or database schema and are often learned directly by defining grammars, rules, and templates (Zettlemoyer Reference Zettlemoyer2009; Liang Reference Liang2011; Berant et al. Reference Berant, Chou, Frostig and Liang2013; Kwiatkowski et al. Reference Kwiatkowski, Choi, Artzi and Zettlemoyer2013; Sun et al. Reference Sun, Ma, Yih, Tsai, Liu and Chang2015; Yih et al. Reference Yih, Chang, He and Gao2015; Reddy et al. Reference Reddy, Tackstrom, Collins, Kwiatkowski, Das, Steedman and Lapata2016; Saha et al. Reference Saha, Floratou, Sankaranarayanan, Minhas, Mittal and Özcan2016; Abujabal et al. Reference Abujabal, Yahya, Riedewald and Weikum2017; Cheng et al. Reference Cheng, Reddy, Saraswat and Lapata2017; Khot et al. Reference Khot, Sabharwal and Clark2017; Lukovnikov et al. Reference Lukovnikov, Fischer, Lehmann and Auer2017; Zheng et al. Reference Zheng, Yu, Zou and Cheng2018). Works such as Fader et al. (Reference Fader, Zettlemoyer and Etzioni2014) and Berant and Liang (Reference Berant and Liang2014) build open semantic representations for single sentence questions that are not tied to a specific knowledge source or ontology. We follow a similar approach and develop an open semantic representation for MSEQs. Our representation uses labels that help a downstream answering component return entity answers.
Recent works build neural models that represent a question as a continuous-valued vector (Bordes, Chopra, and Weston Reference Bordes, Weston and Usunier2014a; Bordes, Weston, and Usunier Reference Bordes, Chopra and Weston2014b; Chen et al. Reference Chen, Jose, Yu, Yuan and Zhang2016; Xu et al. Reference Xu, Reddy, Feng, Huang and Zhao2016; Zhang et al. Reference Zhang, Wu, Wang, Zhou and Li2016), but such methods require significant amounts of training data. Some systems rely on IR and do not construct explicit semantic representations at all (Sun et al. Reference Sun, Ma, Yih, Tsai, Liu and Chang2015; Vtyurina and Clarke Reference Vtyurina and Clarke2016); they rely on selecting keywords from the question for querying and as shown in our experiments do not perform well for answering MSEQs. Work such as that by Nogueira and Cho (Reference Nogueira and Cho2017) uses reinforcement learning to select query terms in a document retrieval task and requires a large collection of document-relevant judgments. Extending such an approach for our task could be an interesting extension for future work.
We now summarize recent methods employed to generate semantic representations of questions.
3.3 Neural semantic parsing
There is a large body of literature dealing with semantic parsing of single sentences, especially for frames in PropBank and FrameNet (Baker et al. Reference Baker, Fillmore and Lowe1998; Palmer, Gildea and Kingsbury Reference Palmer, Gildea and Kingsbury2005). Most recently, methods that use neural architectures for SRL have been developed. For instance, work by Zhou and Xu (Reference Zhou and Xu2015) uses a BiLSTM CRF for labeling sentences with PropBank predicate argument structures, while work by (He et al. Reference He, Lee, Lewis and Zettlemoyer2017, Reference He, Lee, Levy and Zettlemoyer2018) relies on a BiLSTM with BIO-encoding constraints during LSTM decoding. Other recent work by Yang and Mitchell (Reference Yang and Mitchell2017) proposes a BiLSTM CRF model that is further used in a graphical model that encodes SRL structural constraints as factors. Work such as Bapna et al. (Reference Bapna, Tur, Hakkani-Tur and Heck2017) uses a BiLSTM tagger for predicting task-oriented information slots from sentences. Our work uses similar approaches for labeling (parsing) MSEQs, but we note that such systems cannot be directly used in our task due to their model-specific optimization for their label space. However, we adapt the label space of the recent deep SRL system (He et al. Reference He, Lee, Lewis and Zettlemoyer2017) for our task and use its predicate tagger as a baseline for evaluation (Section 6).
3.4 Summary
In summary, while related work shares aspects with our task, there are three main distinguishing features that are not jointly addressed in existing work: (i) Question type: A major focus of existing work has been on single-sentence questions, sometimes with the added complexity arising out of entity relations and co-reference. Such questions are often posed as “which/where/when/who/what” questions. However, our work uses multi-sentence questions which can additionally contain vague expressions of intent as well as information that is irrelevant for the answering task. (ii) Knowledge: Most existing information-seeking QA systems either answer factoid-style questions from knowledge graphs and structured knowledge bases or answer them from paragraphs of text that contain explicit answers. In contrast, our work uses unstructured or semi-structured knowledge sources and our querying representation makes no assumptions about the underlying knowledge store. (iii) Answer-type: Existing QA systems either return answer spans (reading comprehension tasks) or documents (from the web or large text collections) to fulfill a knowledge-grounded information query that relies on explicit mention (or with some degree of semantic gap) of the answer. In contrast, our QA pipeline returns entity answers from a (black box) web API that accepts a text string as query and internally uses structured and unstructured data including entity reviews containing subjective opinions to return an answer.
In the next section, we describe our question representation (Section 4) followed by details about our labeling system (Section 5). We present experiments in Section 6 and details of our answering component in Section 7. We finally conclude the paper in Section 8 along with suggestions for future work.
4. Semantic labels for MSEQs
As mentioned earlier, our question understanding component parses an MSEQ into an open semantic representation. Our choice of representation is motivated by two goals. First, we wish to make minimal assumptions about the domain of the QA task and, therefore, minimize domain-specific semantic vocabularyFootnote d. Second, we wish to identify only the informative elements of a question, so that a robust downstream QA or IR system can meaningfully answer it. As a first step toward a generic representation for an MSEQ, we make two assumptions: that a multi-sentence question asks only one final question, and that the expected answer is one or more entities. This precludes Boolean, comparison, “why”/“how,” and multi-part questions.
We have two labels associated with the entity being sought: entity.type and entity.attr, to capture the type and the attributes of the entity, respectively. We also include a label user.attr to capture the properties of the user asking the question. The semantic labels of entity.type and entity.attr are generic and will be applicable to any domain. Other generic labels to identify related entities (e.g., in questions where users ask for entities similar to a list of entities) could also be defined. We also allow the possibility of incorporating additional labels which are domain-specific. For instance, for the tourism domain, location could be important, so we can include an additional label entity.location describing the location of the answer entity.
Figure 1 illustrates the choice of our labels with an example from the tourism domain. Here, the user is interested in finding a “place to stay” (entity.type) that satisfies some properties such as “budget” (entity.attr). The question includes some information about the user herself, for example, “will not have a car” which may become relevant for answering the question. The phrase “San Francisco” describes the location of the entity and is labeled with a domain-specific label (entity.location).
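To make the representation concrete, the sketch below shows one possible in-memory form of a labeled MSEQ. This is a minimal Python illustration rather than part of our system; the paraphrased question text and spans are made up for illustration, while the label names follow the definitions above.

```python
# Illustrative sketch: one possible in-memory form of the open semantic
# representation for a (paraphrased) tourism MSEQ. The spans are illustrative;
# only the label set follows the representation defined in Section 4.
labeled_question = {
    "text": "We will not have a car. Looking for a budget place to stay "
            "in San Francisco.",
    "labels": [
        {"span": "will not have a car", "label": "user.attr"},
        {"span": "budget",              "label": "entity.attr"},
        {"span": "place to stay",       "label": "entity.type"},
        {"span": "San Francisco",       "label": "entity.location"},  # domain-specific
    ],
}

# Equivalent token-level view used by the sequence labeler (Section 5):
tokens = ["budget", "place", "to", "stay", "in", "San", "Francisco"]
tags   = ["entity.attr", "entity.type", "entity.type", "entity.type",
          "other", "entity.location", "entity.location"]
```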
5. MSEQ semantic labeling
We formulate the task of outputting the semantic representation for a user question as a sequence labeling problem. There is a one-to-one correspondence between our token-level label set and the semantic labels described in Section 4. We utilize a BERT BiLSTM CRF for sequence labeling, and as described previously, we extend the model in order to address the challenges posed by MSEQs: (a) First, we incorporate hand-engineered features especially designed for our labeling task. (b) Second, we make use of a CCM (Chang et al. Reference Chang, Ratinov and Roth2007) to incorporate within-sentence as well as inter-sentence constraints. These constraints act as a prior and help ameliorate the problems posed by our low-data setting. (c) Third, we use Amazon Mechanical Turk (AMT) to obtain additional partially labeled data which we use in our constraint-driven framework.
5.1 Features
We incorporate a number of (domain-independent) features into our BERT BiLSTM CRF model where each unique feature is represented as a one-hot vector and concatenated with the BERT embedding representation of each token. In experiments with BiLSTM CRF models without BERT, we replace the BERT embeddings with pre-trained word2vec (Mikolov et al. Reference Mikolov, Chen, Corrado and Dean2013) embeddings that are concatenated with the one-hot feature embeddings.
Our features are described as follows: (a) Lexical features indicating capitalization, numerals, etc., and token-level features based on part-of-speech tags and named-entity recognition labels. (b) Hand-designed entity.type and entity.attr specific features. These include indicators for guessing potential types, based on targets of WH (what, where, which) words and certain verb classes; multi-sentence features that are based on dependency parses of individual sentences that aid in attribute detection—for example, for every noun and adjective, an attribute indicator feature is on if any of its ancestors is a potential type as indicated by the type feature; indicator features for descriptive phrases (Contractor et al. Reference Contractor, Mausam and Singla2016), such as adjective–noun pairs. (c) For each token, we include cluster ids generated from a clustering of word2vec vectors (Mikolov et al. Reference Mikolov, Chen, Corrado and Dean2013) run over a large tourism corpus. (d) We also use the count of a token in the entire post as a feature for that token (Vtyurina and Clarke Reference Vtyurina and Clarke2016).
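As an illustration of how such features enter the model, the sketch below concatenates a token's pretrained embedding with one-hot feature indicators. The feature inventory, dimensions, and names are illustrative placeholders, not our exact configuration.

```python
import numpy as np

def one_hot(index: int, size: int) -> np.ndarray:
    v = np.zeros(size)
    v[index] = 1.0
    return v

def token_input(embedding: np.ndarray, feats: dict) -> np.ndarray:
    """Concatenate a token's pretrained embedding (BERT or word2vec) with one-hot
    hand-designed features. Feature names and sizes here are placeholders."""
    pieces = [
        embedding,                                   # contextual (BERT) or word2vec vector
        one_hot(int(feats["is_capitalized"]), 2),    # lexical indicator
        one_hot(feats["pos_id"], 45),                # POS tag id
        one_hot(feats["cluster_id"], 100),           # word2vec cluster id
        one_hot(min(feats["count_in_post"], 9), 10), # frequency of token in the post
    ]
    return np.concatenate(pieces)

# Example usage with a dummy 200-d embedding
x = token_input(np.zeros(200), {"is_capitalized": 1, "pos_id": 12,
                                "cluster_id": 37, "count_in_post": 3})
print(x.shape)  # (357,) = 200 + 2 + 45 + 100 + 10
```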
5.2 Constraints
Since we label multiple-sentence questions, we need to capture patterns spanning across sentences. One alternative would be to model these patterns as features defined over nonadjacent tokens (labels). But this can make the modeling quite complex. Instead, we model them as global constraints over the set of possible labels.
We design the following constraints: (i) a type constraint (hard): every question must have at least one entity.type token; (ii) an attribute constraint (soft), which penalizes the absence of an entity.attr label in the sequence; (iii) a soft constraint that prefers that all entity.type tokens occur in the same sentence. The last constraint helps reduce erroneous entity.type labels but allows the labeler to choose entity.type-labeled tokens from multiple sentences only if it is very confident. Thus, while the first two constraints are directed toward improving recall, the last constraint helps improve the precision of entity.type labels.
In order to use our constraints, we employ CCMs for our task (Chang et al. Reference Chang, Ratinov and Roth2007) which use an alternate learning objective expressed as the difference between the original log-likelihood and a constraint violation penalty:

$$\sum_{i} \Big[\, \textbf{w} \cdot \phi\big(\textbf{x}^{\textbf{(i)}}, \textbf{y}^{\textbf{(i)}}\big) \;-\; \sum_{k} \rho_{k}\, d_{C_k}\big(\textbf{x}^{\textbf{(i)}}, \textbf{y}^{\textbf{(i)}}\big) \Big] \qquad (1)$$
Here, i indexes over all examples and k over all constraints. $\textbf{x}^{\textbf{(i)}}$ is the ith sequence and $\textbf{y}^{\textbf{(i)}}$ is its labeling. $\phi$ and w are feature and weight vectors, respectively. $d_{C_k}$ and $\rho_k$, respectively, denote the violation score and weight associated with the kth constraint. The w parameters are learned analogously to a vanilla CRF, and the $\rho$ parameters are computed by counting. Inference in CCMs is formulated as an integer linear program (ILP); see Chang et al. (Reference Chang, Ratinov and Roth2007) for details. The original CCM formulation was in the context of regular CRFs (Lafferty, McCallum and Pereira Reference Lafferty, McCallum and Pereira2001) and we extend its use to a combined model of BERT BiLSTM CRF with CCM constraints (referred to as BERT BiLSTM CCM) that is trained end to end (Figure 3).
Specifically, let $\mathcal{Y}$ be the set of label indicesFootnote e. Let T be the sequence length and $x_0 \cdots x_{T-1}$ be the tokens, $\phi (x^{i}) \in \mathbb{R}^{|\mathcal{Y}|}$ be the feature vector for the ith token (the output of the feed-forward layer in the BiLSTM-CRF), with $\phi(x^i)[j]$ denoting the feature associated with the ith token and jth label. Let $w \in \mathbb{R}^{|\mathcal{Y}| \times |\mathcal{Y}|}$ be the transition matrix, with w[i, j] denoting the weights associated with a transition from $i \rightarrow j$. Then

$$\max \;\; \sum_{l \in \mathcal{Y}} \mathbb{1}_{0, l}\, \phi(x^{0})[l] \;+\; \sum_{i=1}^{T-1} \sum_{l_s \in \mathcal{Y}} \sum_{l_t \in \mathcal{Y}} \mathbb{1}_{i, l_s, l_t} \big( \phi(x^{i})[l_t] + w[l_s, l_t] \big) \qquad (2)$$
defines the Viterbi decoding for a linear chain CRF. The variable $\mathbb{1}_{0, l} = 1$ if the first token of the sequence is tagged l in the optimal Viterbi sequence, and zero otherwise. Furthermore $\mathbb{1}_{i, l_s, l_t} = 1$ if the ith token is tagged with label $l_t$ and the $(i-1)$th token is tagged $l_s$ in the optimal Viterbi sequence, and is marked zero otherwise.
Type label constraints (hard): In order to model the type-based hard constraint (there has to be at least one entity.type label in the sequence), we add the following constraint to the optimization problem:

$$\mathbb{1}_{0,\, entity.type} \;+\; \sum_{i=1}^{T-1} \sum_{l_s \in \mathcal{Y}} \mathbb{1}_{i,\, l_s,\, entity.type} \;\geq\; 1 \qquad (3)$$
Here, $\mathbb{1}_{0, entity.type} = 1$ if the first token is tagged as a type, while $\sum_{l_s \in \mathcal{Y}}\mathbb{1}_{i, l_s, entity.type} = 1$ if the ith token is tagged with entity.type.
Attribute label constraints (soft): In order to model the attribute-based constraint (the nonexistence of an entity.attr label in the sequence is penalized), we introduce a dummy variable d for our ILP formulation. Then, given the constraint violation penalty $\eta$, we change the model optimization problem by subtracting $\eta\, d$ from the objective in Equation (2) and adding the constraint

$$\mathbb{1}_{0,\, entity.attr} \;+\; \sum_{i=1}^{T-1} \sum_{l_s \in \mathcal{Y}} \mathbb{1}_{i,\, l_s,\, entity.attr} \;+\; d \;\geq\; 1, \qquad d \in \{0, 1\} \qquad (4)$$
Here, if the constraint is violated, then $d=1$ and the objective suffers a penalty of $\eta$. Conversely, since it is a minimization over d as well, if the constraint is satisfied, then $d=0$ and the objective is not penalized.
Inter-sentence-type constraint: We model the constraint that all entity.type labels should appear in a single sentence. We implement this as a soft constraint by imposing an L1 penalty on the number of sentences containing an entity.type (thereby ensuring that fewer sentences contain type labels). Let the number of sentences be k. Let $e_i$ denote the index of the start of the ith sentence, such that $\{x_j, e_i \leq j < e_{i+1}\}$ are the tokens in the ith sentence (note that $e_0 = 0$). Define $z_0, ..., z_{k-1}$ to model sentence indicators, with $z_i = 1$ if the ith sentence contains a type. Let $\eta_2$ be the associated penalty. We then modify the optimization problem as follows: we subtract $\eta_2 \sum_{i=0}^{k-1} z_i$ from the objective and, for every sentence i and every token j with $e_i \leq j < e_{i+1}$, add the constraint

$$z_i \;\geq\; \sum_{l_s \in \mathcal{Y}} \mathbb{1}_{j,\, l_s,\, entity.type}, \qquad z_i \in \{0, 1\} \qquad (5)$$
Here, the variable j indexes over the tokens for the ith sentence. $\sum_{ls} \mathbb{1}_{j, l_s, entity.type} = 1$ if the jth token is a type, and is 0 otherwise. Hence if any of the tokens in the ith sentence is labeled a type, $z_i = 1$. Note that combined with Equation (3), we also have $\sum_i z_i \geq 1$.
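For concreteness, the sketch below encodes the decoding objective and the three constraints as an ILP using the PuLP library. This is an illustrative substitute, not our implementation (which uses the GLPK solver); the penalty values are placeholders. Here `phi[i][l]` stands for the per-token label scores from the BiLSTM-CRF feed-forward layer and `w[ls][lt]` for the CRF transition weights, both assumed to be plain Python lists.

```python
# Sketch (not the authors' code): CCM inference as an ILP with PuLP, following
# the indicator variables defined above.
import pulp

def ccm_decode(phi, w, num_labels, sent_start, type_l, attr_l, eta=1.0, eta2=1.0):
    T, L = len(phi), num_labels
    prob = pulp.LpProblem("ccm_decode", pulp.LpMaximize)

    # Indicator variables: u[l] for the first token, b[i][(ls, lt)] for i >= 1.
    u = {l: pulp.LpVariable(f"u_{l}", cat="Binary") for l in range(L)}
    b = {i: {(ls, lt): pulp.LpVariable(f"b_{i}_{ls}_{lt}", cat="Binary")
             for ls in range(L) for lt in range(L)} for i in range(1, T)}
    d = pulp.LpVariable("d_attr", cat="Binary")                     # attribute slack
    z = {s: pulp.LpVariable(f"z_{s}", cat="Binary") for s in range(len(sent_start))}

    # Objective: emission + transition scores minus soft-constraint penalties.
    score = pulp.lpSum(u[l] * phi[0][l] for l in range(L)) + \
            pulp.lpSum(b[i][(ls, lt)] * (phi[i][lt] + w[ls][lt])
                       for i in range(1, T) for ls in range(L) for lt in range(L))
    prob += score - eta * d - eta2 * pulp.lpSum(z.values())

    # Sequence consistency: one label per position, chained transitions.
    prob += pulp.lpSum(u.values()) == 1
    for i in range(1, T):
        prob += pulp.lpSum(b[i].values()) == 1
        for ls in range(L):
            incoming = u[ls] if i == 1 else pulp.lpSum(b[i-1][(lp, ls)] for lp in range(L))
            prob += incoming == pulp.lpSum(b[i][(ls, lt)] for lt in range(L))

    def tagged(i, l):            # expression equal to 1 iff token i receives label l
        return u[l] if i == 0 else pulp.lpSum(b[i][(ls, l)] for ls in range(L))

    # (3) hard: at least one entity.type token in the question.
    prob += pulp.lpSum(tagged(i, type_l) for i in range(T)) >= 1
    # (4) soft: penalize the absence of any entity.attr token via slack d.
    prob += pulp.lpSum(tagged(i, attr_l) for i in range(T)) + d >= 1
    # (5) soft: penalize every sentence that contains an entity.type token.
    bounds = list(sent_start) + [T]
    for s in range(len(sent_start)):
        for j in range(bounds[s], bounds[s + 1]):
            prob += z[s] >= tagged(j, type_l)

    prob.solve(pulp.PULP_CBC_CMD(msg=False))

    # Read the label sequence off the solved indicators.
    seq = [max(range(L), key=lambda l: u[l].value())]
    for i in range(1, T):
        seq.append(max(range(L), key=lambda lt: sum((b[i][(ls, lt)].value() or 0)
                                                    for ls in range(L))))
    return seq
```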
5.3 Partially labeled data
Data collection: In order to obtain a larger amount of labeled data for our task, we make use of crowdsourcing (AMT). Since our labeling task can be complex, we divide our crowd task into multiple steps. We first ask the crowd to (i) filter out forum questions that are not entity-seeking questions. For the questions that remain, the crowd provides (ii) $user.*$ labels and (iii) $entity.*$ labels. Taking inspiration from He, Lewis and Zettlemoyer (Reference He, Lewis and Zettlemoyer2015), for each step, instead of directly asking for token labels, we ask a series of indirect questions as described in the next section that can help source high-precision annotations.
5.3.1 Crowdsourcing task
We defined three AMT tasks in the form of questionnaires:
-
Questionnaire 1 : To identify posts of relevance to our task, filtering out posts that may be unrelatedFootnote f.
-
Questionnaire 2 : To identify the user entities and their labels.
-
Questionnaire 3 : To identify the answer entities and their labels.
In the first questionnaire (AMT Task 1), we ask the workers to identify any non-entity-seeking questions as well as the number of entity types requested in a given query. We remove any posts that ask for multiple entity typesFootnote g. The second questionnaire (AMT Task 2) asks the following question to the AMT workers. We paid $0.20 to each worker for this task.
-
“Which continuous sequences of words (can be multiple sequences) in the QUESTION describes the nature/identity/qualities of USER?”
The QUESTION refers to the actual question posed by a user on a forum page and the answer to these questions gives us the user.attr labels. Figure 4 shows a sample snippet of the questionnaire.
The last questionnaire asks the following questions to the AMT workers.
-
“Given that the USER is asking only a single type of recommendation/suggestion, which sequence of words (only one sequence from a single sentence, prefer a continuous sequence) in QUESTION tells you what the USER is asking for?”
-
“What is the shortest sequence of words in ‘A1 (Answer to Question 1)’ describes a category? For example, place to stay, restaurant, show, place to eat, place to have dinner, spot, hotel, etc.”
-
“What words/phrases (need not be continuous, can be multiple) in the QUESTION give a sense of location about the ANSWER or ‘A2’ (Answer to Question 2)?”
-
“What words/phrases (need not be continuous, can be multiple) in the QUESTION give more description about the ANSWER or the ‘A2’ (Answer to Question 2)?”
These questions give us the entity.type, entity.location, and entity.attribute labels. We paid $0.30 to each worker for this task.
We obtain two sets of labels (different workers) on each question. However, due to the complex nature of the task, we find that workers are not complete in their labeling, and we therefore only use token labels on which both sets of workers agree. Thus, we are able to source annotations with high precision, while recall can be low. Table 2 shows token-level agreement statistics for labels collected over a set of 400 MSEQs from the tourism domain. Some of the disagreement arises from labeling errors due to the complex nature of the task. In other cases, the disagreement results from workers choosing one of several possible correct answers. For example, in the phrase “good restaurant for dinner,” one worker labels $entity.type=$ “restaurant,” $entity.attr=$ “good,” and $entity.attr=$ “dinner,” while another worker simply chooses the entire phrase as entity.type.
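A minimal sketch of this agreement-based merging: token labels on which the two workers agree are kept, and all other positions are treated as unknown so they can be completed during partially supervised training (Section 5.3.2). The merge helper below is illustrative.

```python
# Sketch: keep only token labels on which both crowd workers agree, treating the
# rest as unknown ("?"). Label names follow the paper; the helper is illustrative.
def merge_annotations(worker1, worker2, unknown="?"):
    """worker1/worker2: per-token label lists of equal length."""
    return [a if a == b else unknown for a, b in zip(worker1, worker2)]

w1 = ["entity.attr", "entity.type", "other"]
w2 = ["entity.attr", "entity.attr", "other"]
print(merge_annotations(w1, w2))  # ['entity.attr', '?', 'other']
```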
5.3.2 Training with partially labeled posts
We devise a novel method to use this partially labeled data, along with our small training set of expert labeled data, to learn the parameters of our CCM model. We utilize a modified version of constraints-driven learning (CoDL) (Chang et al. Reference Chang, Ratinov and Roth2007) which uses a semi-supervised iterative weight update algorithm, where the weights at each step are computed using a combination of the models learned on the labeled and the unlabeled set (Chang et al. Reference Chang, Ratinov and Roth2007).
Given a data set consisting of a few fully labeled as well as unlabeled examples, the CoDL learning algorithm first learns a model using only the labeled subset. This model is then used to find labels (in a hard manner) for the unlabeled examples while taking care of constraints (Section 5.2). A new model is then learned on this newly annotated set and is combined with the model learned on the labeled set in a linear manner. The parameter update can be described as

$$\theta^{(t+1)} \;=\; \gamma\, \theta_{0} \;+\; (1 - \gamma)\, \text{Learn}\big(U^{(t)}\big) \qquad (6)$$

where $\theta_{0}$ denotes the parameters learned on the labeled set.
Here, t denotes the iteration number, $U^{(t)}$ denotes the unlabeled examples, and $\text{Learn}$ is a function that learns the parameters of the model. In our setting, $\text{Learn}$ trains the neural network via back-propagation. Instead of using unlabeled examples in $U^{(t)}$, we utilize the partially labeled set whose missing values have been filled in using the parameters at iteration t, and inference over the set involves predicting only the missing labels. This is done using the ILP-based formulation described previously, with an added constraint that the predicted labels for the partially annotated sequences have to be consistent with the human labels. $\gamma$ controls the relative importance of the labeled and partial examples. To the best of our knowledge, we are the first to exploit partial supervision from a crowdsourcing platform in this manner.
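The sketch below illustrates this modified CoDL loop. The `train` and `ccm_fill_missing` functions are placeholders for back-propagation training of the model and for constrained inference that completes only the unknown labels; the data structures and parameter values are assumptions made for illustration.

```python
# Sketch of the modified CoDL loop described above (not the authors' code).
def codl_train(labeled, partial, train, ccm_fill_missing, gamma=0.9, iters=5):
    theta_labeled = train(labeled)                 # model learned on expert labels
    theta = theta_labeled
    for t in range(iters):
        # Complete only the unknown labels of partially annotated questions,
        # keeping the human-provided labels fixed (constrained inference).
        completed = [ccm_fill_missing(q, theta) for q in partial]
        theta_partial = train(completed)           # re-learn on the completed set
        # Linear combination of the two parameter sets, as in CoDL.
        theta = {k: gamma * theta_labeled[k] + (1 - gamma) * theta_partial[k]
                 for k in theta_labeled}
    return theta
```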
6. Experimental evaluation
The goal of our experimental evaluation was to analyze the effectiveness of our proposed model for the task of understanding MSEQs. We next describe our data set, evaluation methodology, and results in detail.
6.1 Data set
For our current evaluation, we used the following three semantic labels: entity.type, entity.attr, and entity.location. We also used a default label other to mark any tokens not matching any of the semantic labels.
We use 150 expert-annotated tourism forum questions (9200 annotated tokens) as our labeled data set and perform leave-one-out cross-validation. This set was labeled by two experts, including one of the authors, with high agreement. For experiments with partially labeled learning, we add 400 partially annotated questions from crowdsourced workers to our training set. As described in Section 5.3.1, each question is annotated by two workers and we retain token labels marked the same by two workers, while treating the other labels as unknown. We still compute a leave-one-out cross-validation on our original 150 expert-annotated questions (complete crowd data is included in each training fold).
6.2 Methodology
Sequence-tagged tokens identify phrases for each semantic label; therefore, instead of reporting metrics at the token level, we compute a more meaningful joint metric over tagged phrases. We define a matching-based metric that first matches each extracted segment with the closest one in the gold set, and then computes segment-level precision using constituent tokens. Analogously, recall is computed by matching each segment in the gold set with the best one in the extracted set. As an example, for Figure 1, suppose the system extracts “convenient to the majority” and “local budget” for entity.attr, with the gold entity.attr segments being “budget,” “best,” and “convenient to the majority that first time visitors would like to see.” Our matching metric computes precision as 0.75: 1.0 for “convenient to the majority” (covered completely by the long gold segment) and 0.5 for “local budget” (partially covered by “budget”). Recall is 0.45: 1.0 for “budget” (completely covered by the predicted “local budget”), 0.0 for “best” (not covered by any predicted segment), and 0.333 for “convenient to the majority … like to see” (partially covered by the predicted “convenient to the majority”).
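A sketch of this matching-based metric follows, assuming each segment is a whitespace-tokenized string. The coverage computation shown is a simplified token-overlap variant, so its numbers can differ slightly from the hand computation above when tokens repeat.

```python
# Sketch of the matching-based metric: each predicted segment is matched to the
# gold segment with maximal token overlap and scored by the fraction of its
# tokens covered; recall does the same in the other direction.
def _coverage(segment, pool):
    """Best fraction of `segment` tokens covered by any single segment in `pool`."""
    seg = segment.lower().split()
    best = 0.0
    for other in pool:
        oth = set(other.lower().split())
        best = max(best, sum(tok in oth for tok in seg) / len(seg))
    return best

def match_precision_recall(predicted, gold):
    precision = sum(_coverage(p, gold) for p in predicted) / max(len(predicted), 1)
    recall = sum(_coverage(g, predicted) for g in gold) / max(len(gold), 1)
    return precision, recall

pred = ["convenient to the majority", "local budget"]
gold = ["budget", "best",
        "convenient to the majority that first time visitors would like to see"]
p, r = match_precision_recall(pred, gold)
print(round(p, 2), round(r, 2))  # 0.75 0.47 -- cf. the 0.75 / 0.45 hand computation
```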
We use the Mallet toolkitFootnote h for our baseline CRF implementation and the GLPK ILP-based solverFootnote i for CCM inference. In the case of BiLSTM-based CRF, we use the implementation provided by Gardner et al. (Reference Gardner, Grus, Neumann, Tafjord, Dasigi, Liu, Peters, Schmitz and Zettlemoyer2017). The BiLSTM network at each time step feeds into a linear chain CRF layer. The input states in the LSTM are modeled using a 200-dimension word vector representation of the token. These word vector representations were pre-trained using the word2vec model (Mikolov et al. Reference Mikolov, Chen, Corrado and Dean2013) on a large collection of 80,000 tourism questions. In the case of BERT BiLSTM CRF, we use contextualized embeddings from the pretrained BERT-small model as input to the LSTM layer, using the BERT implementation from HuggingFace Transformers (Wolf et al. Reference Wolf, Debut, Sanh, Chaumond, Delangue, Moi, Cistac, Rault, Louf, Funtowicz and Brew2019). For CoDL learning, we set $\gamma$ to 0.9 as per the original authors’ recommendations.
6.3 Results
Table 3 reports the performance of our semantic labeler under different incremental configurations. We find that the models based on BiLSTM CRF and BERT BiLSTM CRF (middle and lower halves of the table) outperform a CRF system (upper half of the table) in each comparable setting—for instance, a baseline vanilla CRF-based system using all features gives us an aggregate F1 of 50.8, while the performance of BiLSTM CRF and BERT BiLSTM CRF using features is 56.2 and 64.4, respectively. As a baseline, we use the neural predicate tagger from the deep SRL system (He et al. Reference He, Lee, Lewis and Zettlemoyer2017) adapted to our label space, and we find that it performs similarly to our CRF setup. The use of hand-designed features and CCM constraints in the BERT BiLSTM CRF (referred to as BERT BiLSTM CCM), along with learning from partially annotated crowd data, yields over a 15 pt gain over the baseline BiLSTM CRF model. Further, we note that hand-curated features, within-sentence and cross-sentence constraints, and partial supervision each help successively improve the results in all configurations. Next, we study the effect of each of these enhancements in detail.
6.3.1 Effect of features
In an ablation study performed to learn the incremental importance of each feature, we find that descriptive phrases and our hand-constructed multi-sentence type and attribute indicators improve the performance of each label by 2–3 pt. Word2vec features help type detection because entity.type labels often occur in similar contexts, leading to informative vectors for typical type words. The frequency of non-stopword tokens in the multi-sentence post is an indicator of a word’s relative importance, and this feature also helps improve overall performance.
6.3.2 Effect of constraints
A closer inspection of Table 3 reveals that the vanilla CRF configuration sees more benefit in using our CCM constraints as compared to the BiLSTM CRF-based model (4 vs. 1 pt). To understand why, we study the detailed precision-recall characteristics of individual labels; the results for entity.type are reported in Table 4. We find that the BiLSTM CRF-based model has significantly higher recall than its equivalent vanilla CRF counterpart, while the opposite trend is observed for precision. As a result, since two of the three constraints we used in CCM are oriented toward improving recallFootnote j, we find that they improve overall F1 more by finding tags that were otherwise of lower probability (i.e., improving recall). Interestingly, in the case of the BERT BiLSTM CRF-based model, we find that the precision-recall characteristics are similar (higher precision than recall) to those seen in the vanilla CRF-based setup, and thus, again, the benefit of using constraints is larger.
6.3.3 Effect of partial supervision
In order to further understand the effect of partial supervision, we trained a CCM-based model that makes use of all the crowdsourced labels for training, by adding conflicting labels for a question as two independent training data points. As can be seen, using the entire noisy crowd-labeled sequences (row labeled “CCM (with all crowd data)” in upper half of Table 3) hurts the performance significantly resulting in an aggregate F1 of just 48.0 while using partially labeled data with CCM results in an F1 of 56.5. The corresponding F1 scores of partially supervised BiLSTM CCM and BERT BiLSTM CCM systems (trained using partially labeled data) are 58.3 and 66.4, respectively.
Overall: Our results demonstrate that hand-engineered features, within-sentence and inter-sentence constraints, and partially labeled data each help improve the accuracy of labeling MSEQs.
7. MSEQ semantic labels: Application
We now demonstrate the usefulness of our MSEQ semantic labels and tagging framework (i) by enabling a QA end-task which returns entity answers for MSEQs—to the best of our knowledge, we are the first to attempt such a QA task—and (ii) by demonstrating the creation of an MSEQ labeler for a different domain (book recommendations).
7.1 MSEQ Labeler based QA system
Our novel QA task evaluation attempts to return entity answers for multi-sentence tourism forum questions. We use our sequence tagger described previously to generate the semantic labels of the questions. These semantic labels and their targets are used to formulate a query to the Google Places collection, which serves as our knowledge sourceFootnote k. The Google Places collection contains details about eateries, attractions, hotels, and other points of interests from all over the world, along with reviews and ratings from users. It exposes an end point that can be used to execute free text queries and it returns entities as results.
We convert the phrases tagged with semantic labels into a Google Places query via the transformation: “concat(entity.attr) entity.type in entity.location.” Here, concat lists all attributes in a space-separated fashion. Since some of the attributes may be negated in the original question, we filter out these attributes and do not include them in the query for Google Places.
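The sketch below illustrates this query formulation together with a simplified call to a Places-style text-search endpoint. The URL shown is Google's documented Text Search endpoint, but the call is illustrative and omits error handling, paging, and API-key management; negated attributes are assumed to have been filtered out already.

```python
# Sketch of the query formulation described above. The label-to-query
# transformation follows the paper; the HTTP call is a simplified illustration.
import requests

def build_query(labels):
    """labels: dict mapping semantic labels to lists of tagged phrases."""
    attrs = " ".join(labels.get("entity.attr", []))
    etype = " ".join(labels.get("entity.type", []))
    loc = " ".join(labels.get("entity.location", []))
    return f"{attrs} {etype} in {loc}".strip()

def query_places(labels, api_key):
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/place/textsearch/json",
        params={"query": build_query(labels), "key": api_key},
    )
    return [r["name"] for r in resp.json().get("results", [])][:3]  # top-3 answers

q = build_query({"entity.attr": ["budget"], "entity.type": ["place to stay"],
                 "entity.location": ["San Francisco"]})
print(q)  # "budget place to stay in San Francisco"
```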
Detection of negations: We use a list of triggers that indicate negation. We start with a manually curated set of seed words, and expand it using synonym and antonym counter fitted word vectors (Mrksic et al. Reference Mrksic, Seaghdha, Thomson, Gasic, Rojas-Barahona, Su, Vandyke, Wen and Young2016). The resulting set of trigger words flags the presence of a negation in a sentence. We also define the scope of a negation trigger as a token (or a set of continuous tokens with the same label) labeled by our sequence tagger that occurs within a specified window of the trigger word. Table 5 reports the accuracy of our negation rules as evaluated by an author. The “Gold” columns denote the performance when using gold semantic label mentions. The “System” columns are the performance when using labels generated by our sequence tagger.
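A minimal sketch of this negation handling follows. The seed trigger list and window size are assumptions for illustration; our actual trigger set is expanded with counter-fitted word vectors, and the scope is applied to spans produced by the sequence tagger.

```python
# Sketch of the negation handling described above: a small trigger list (assumed
# seeds) and a fixed window that drops tagged attribute spans in a trigger's scope.
NEGATION_TRIGGERS = {"not", "no", "without", "avoid", "except"}   # illustrative seeds
WINDOW = 4                                                        # assumed scope size

def drop_negated(tokens, attr_spans):
    """attr_spans: list of (start, end) token index pairs tagged entity.attr."""
    trigger_idx = [i for i, t in enumerate(tokens) if t.lower() in NEGATION_TRIGGERS]
    kept = []
    for start, end in attr_spans:
        in_scope = any(i < start <= i + WINDOW for i in trigger_idx)
        if not in_scope:
            kept.append((start, end))
    return kept

toks = "we want to avoid touristy restaurants near the bazar".split()
print(drop_negated(toks, [(4, 6)]))  # [] -- "touristy restaurants" is negated
```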
7.1.1 Baseline
Since there are no baselines for this task, we adapt and re-implement a recent complex QA system (called WebQA) originally meant for finding appropriate Google results (documents) for questions posed in user forums (Vtyurina and Clarke Reference Vtyurina and Clarke2016). WebQA first shortlists the top 10 words in the question using a tf-idf-based scheme computed over the set of all questions. A supervised method is then used to further shortlist three to four words to form the final query. In our setting, we lack the data to train a supervised method for selecting these words from the tf-idf-ranked list. Therefore, for best performance, instead of using supervised learning for further shortlisting keywords (as in the original paper), in our implementation an expert manually chooses the 3–4 best words from the top 10 words. This query, executed against the Google Places collection API, returns answer entities instead of documents.
We randomly select 300 new unseen questions (different from the questions used in the previous section) from a tourism forum website and manually remove 110 that are not entity-seeking. The remaining 190 questions form our test set. Our annotators manually check each entity answer returned by the systems for correctness. Inter-annotator agreement for relevance of answers measured on 1300+ entities from 100 questions was 0.79. Evaluating whether an entity answer returned is correct is subjective and time-consuming. For each entity answer returned, annotators need to manually query a web-search engine to evaluate whether an entity returned by the system adequately matches the requirements of the user posting the question. Given the subjective and time-consuming nature of this task, we believe 0.79 is an adequate level of agreement on entity answers.
7.1.2 MSEQ-QA: Results
Results: Table 6 reports Accuracy@3, which gives credit if any one of the top three answers is a correct answer. We also report mean reciprocal rank (MRR). Both of these measures are computed only on the subset of attempted questions (any answer returned). Recall is computed as the percentage of questions answered correctly within the top three answers over all questions. In case the user question requires more than one entity typeFootnote l, we mark an answer correct as long as one of them is attempted and answered correctly. Note that these answers are ranked by Google Places based on relevance to the query. As can be seen, the use of our semantic labelsFootnote m (MSEQ-QA) results in nearly 15 points higher accuracy and 14 points higher recall compared to WebQA (manual), because of a more directed and effective query to the Google Places collection.
Overall, our semantic labels-based QA system (MSEQ-QA) answers approximately 54% of the questions with an accuracy of 57% for this challenging task of answering MSEQs.
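For reference, the sketch below shows how the measures in Table 6 can be computed from ranked answer lists and manual correctness judgments; the toy data exists only to exercise the functions and is not from our evaluation.

```python
# Sketch of the evaluation measures reported in Table 6. `results` maps each test
# question to its ranked answers plus per-answer correctness judgments.
def evaluate(results, k=3):
    attempted = [r for r in results if r["answers"]]          # any answer returned
    correct_at_k = [r for r in attempted if any(r["judgments"][:k])]
    acc_at_k = len(correct_at_k) / len(attempted)
    mrr = sum(1.0 / (r["judgments"].index(True) + 1)
              for r in attempted if True in r["judgments"]) / len(attempted)
    recall = len(correct_at_k) / len(results)                 # over all questions
    return acc_at_k, mrr, recall

toy = [{"answers": ["a", "b", "c"], "judgments": [False, True, False]},
       {"answers": ["d"],           "judgments": [False]},
       {"answers": [],              "judgments": []}]
print(evaluate(toy))  # (0.5, 0.25, 0.333...)
```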
7.1.3 MSEQ-QA: Qualitative study and error analysis
Table 7 presents some examples of questionsFootnote n answered by the MSEQ Labeler-based QA system. As can be seen, our system supports a variety of question intents/entities, and due to our choice of an open semantic representation, we are not limited to specific entity types, entity instances, attributes, or locations. For example, in Q1 the user is looking for “local dinner suggestions” on Christmas eve, and the answer entity returned by our system is to dine at the “St. Peter Stiftskulinarium” in Salzburg, while in Q2 the user is looking for recommendations for “SOM tours” (Sound of Music Tours). A quick internet search shows that our system’s answer, “Bob’s Special Tours,” is famous for its SOM tours in that area. This question also requests restaurant suggestions in the old town, but since we focus on returning answers for just one entity.type, this part of the question is not attempted by our system. Questions with more than one entity.type request are fairly common, and this sometimes causes confusion for our system, especially if entity.attribute tags relate to different entity.type values. Since we do not attempt to disambiguate or link different entity.attribute tags to their corresponding entity.type values, this is often a source of error. Our constraint that forces all entity.type labels to come from one sentence mitigates this to some extent, but it can still be a source of errors. Q4 is incorrect because the entity returned does not fulfill the location constraint of being close to the “bazar,” while Q5 returns an incorrect entity type.
Q9 is a complicated question with strict location, budget, and attribute constraints, and the top-ranked returned entity “Hotel Pegasus Crown” fulfills most of the user’s requirementsFootnote o.
Error analysis: We conducted a detailed error study on 105 of the test set questions and find that approximately 60% of the questions were not answered by our QA system pipeline due to limitations of the knowledge source, while approximately 40% of the recall loss in the system can be traced to errors in the semantic labels. See Table 8 for a detailed error analysis.
7.2 Understanding MSEQs in another domain
In contrast to methods that require tens of thousands of training data points, our question understanding framework works with a few hundred questions. We demonstrate the general applicability of our features and constraints by employing them on the task of understanding multi-sentence questions seeking book recommendations.
Using questions collected from an online book reading forum,Footnote p we annotatedFootnote q 95 questions with their semantic labels. We retrained both CRF- and CCM-based supervised systems as before on this data set. Because location is not relevant for books, we use the two general labels: entity.type and entity.attr.
We train the labeler with no feature adaptation or changes from the one developed for tourism, retaining the same constraints as before. We tune the hyper-parameters with a grid search. Table 9 shows the performance of our sequence labeler over leave-one-out cross-validation. We find that our generic features for type and attr defined earlier also work acceptably well for this domain, and we obtain F1 scores comparable to those seen for tourism. These experiments demonstrate that simple semantic labels can indeed be useful to represent multi-sentence questions and that such a representation is easily applicable to different domains.
8. Conclusion and future work
We have presented the novel task of understanding MSEQs. MSEQs are an important class of questions, as they appear frequently on online forums. They expose novel challenges for semantic parsing: they contain multiple sentences requiring the model to capture cross-sentence interactions, and parsers must be built in low-data settings due to the challenges associated with sourcing training data. We define a set of open semantic labels that we use to formulate a multi-sentence question parsing task.
Our solution consists of sequence labeling based on a BiLSTM CRF model. We use hand-engineered features, inter-sentence CCM constraints, and partially supervised training, enabling the use of crowdsourced incomplete annotations. We find that these methods result in a 7 pt gain over the baseline BiLSTM CRF. The use of contextualized pretrained embeddings such as BERT results in an additional 6–8 pt improvement. We further demonstrate the strength of our work by applying the semantic labels to a novel end-QA task that returns entity answers for MSEQs from a web API-based unstructured knowledge source and outperforms baselines. Further, we demonstrate how our approach allows rapid bootstrapping of MSEQ semantic parsers for new domains.
We see our paper as the first attempt toward end-to-end QA in the challenging setting of multi-sentence questions answered directly on the basis of information in unstructured and semi-structured knowledge sources. Our best model answers 54% of the questions with an Accuracy@3 of 57%. Our work opens up several future research directions, which can be broadly divided in two categories. First, we would like to improve on the existing system in the pipelined setting. Error analysis on our test set suggests the need for a deeper IR system that parses constructs from our semantic representation to execute multiple sub-queries. Currently, about 60% of recall loss is due to limitations in the knowledge source and query formulation, while a sizeable 40% may be addressed by improvements to question understanding.
As a second direction, we would like to train an end-to-end neural system to solve our QA task. This would require generating a large data set of labeled QA pairs which could perhaps be sourced semiautomatically using data available in tourism QA forums. However, answer posts in forums can often refer to multiple entities and automatically inferring the exact answer entity for the question can be challenging. Further, we would have to devise efficient techniques to deal with hundreds of thousands of potential class labels (entities). Comparing the performance of the pipelined model and the neural model and examining if one works better than the other in specific settings would also be interesting to look at.
We will make our training data and other resources available for further research.
Acknowledgments
We would like to thank Poojan Mehta, who designed and set up the annotation tasks on AMT, and Krunal Shah, who assisted with the evaluation of the QA system. We would also like to acknowledge the IBM Research India PhD program that enables the first author to pursue the PhD at IIT Delhi. We thank the reviewers for their feedback and Dinesh Raghu, Sachindra Joshi, Dan Weld, Oren Etzioni, Peter Clark, Gaurav Pandey, Dinesh Khandelwal for their helpful suggestions on early versions of this paper. We thank all AMT workers who participated in our tasks.
Financial support
Mausam is supported by Google language understanding and knowledge discovery focused research grants, a Bloomberg award, grants by 1MG and a Microsoft Azure sponsorship. Parag Singla is supported by the DARPA Explainable Artificial Intelligence (XAI) Program with number N66001-17-2-4032. Mausam and Parag Singla are both supported by the IBM AI-Horizons Network Grant, IBM SUR awards as well as Visvesvaraya faculty awards by Govt. of India.