1 Introduction

Questions are a means of acquiring knowledge, and since the advent of the Internet, many questions have been posted on community question-answering (CQA) sites. To help users find questions efficiently, we therefore need a system that displays the important parts of questions in search results. On a CQA site such as Yahoo! Chiebukuro [31], the first sentence of a question tends to be displayed as a headline (or list item) because of a restriction on the display area. Note that, to reduce the burden on users who post questions, many CQA sites do not provide an input field for headlines in the submission form. The most important sentence in a question, however, should be displayed instead of the first sentence, because the first sentence sometimes does not provide enough information, as shown in Fig. 1 (translated to English).

Fig. 1. Example of a posted question and its answer.

This task can be formalized as extractive summarization, which has long been addressed, e.g., by using a graph-based method [21], a topic-based method [10], or a feature-based method [22]. The development of neural networks has led some studies [7, 24] to report high-performance models that use large amounts of training data. Such training data, however, is costly to create and cannot always be prepared for practical use.

In this paper, we harness question-answer (QA) pairs to alleviate this problem. Many QA pairs on CQA sites can be easily obtained without annotation costs and are expected to be useful because, in general, each answer should be closely related to the most important sentence in the question. In fact, the answer in Fig. 1 includes keywords such as “initial setup” and “Wi-Fi” in the main question sentence. Our framework can be regarded as a semi-supervised approach with a small amount of labeled data and a large amount of unlabeled (paired) data. The main difference from classical semi-supervised settings is that unlabeled data has a paired structure. This allows us to formulate our problem as a multi-task problem of sentence extraction and answer generation. One of the difficulties of this formulation is “data imbalance”, meaning that there is a small amount of data for sentence extraction and a large amount for answer generation. Therefore, we focus on this data imbalance problem and investigate how to use the unlabeled paired data from the viewpoint of training methods.

The contributions of our study are as follows.

  • We address extractive question summarization with QA pairs as a case study of a semi-supervised setting with unlabeled paired data and we propose a simple framework to systematically examine different ways to use these pairs.

  • We compare different training methods, namely, pretraining, separate training, and multi-task training, as well as normal training. Our experimental results show that (a) multi-task training performs the best but does not work well without an appropriate sampling method to reduce the data imbalance, and that (b) the multi-task training method is further enhanced with data augmentation based on distant supervision, which can simply solve the data imbalance problem. Our data and code will be publicly available [14].

2 Framework

Our framework consists of two models (Fig. 2): the sentence extraction model (SEM), based on a sequence labeling structure, and the answer generation model (AGM), based on a sequence-to-sequence structure. SEM directly solves our task, whereas AGM provides auxiliary information via attention weights.

Fig. 2. Overview of our framework.

SEM first encodes a question with sentences \((s_1, \dots , s_m)\) into sentence vectors \((h_1, \dots , h_m)\) via a hierarchical encoder based on two LSTM units for words and sentences. Then, for each sentence \(s_{i}\), the model calculates the extraction probability \(p(s_i)\), which represents the importance score of \(s_i\), by applying a binary softmax function with a linear transformation to \(h_i\). In the training phase, we use the cross entropy loss \(L_{\text {ext}}\) based on \(p(s_i)\) and the true label, similarly to classification tasks. We use SEM to define the importance score of \(s_i\) as \(f_{\text {ext}}(s_i)=p(s_i)\), which is used for the evaluation phase, together with a score obtained by AGM as described below.
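Below is a minimal PyTorch sketch of SEM as described above; the hidden sizes, the unbatched single-question input format, and other details are our own simplifying assumptions, not the authors' implementation.

```python
# Minimal SEM sketch: word-level LSTM -> sentence vectors -> sentence-level
# LSTM -> linear transformation + binary softmax per sentence.
import torch
import torch.nn as nn

class SEM(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.word_lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.sent_lstm = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, 2)  # two classes: extract or not

    def forward(self, question):
        # question: (num_sentences, max_words) tensor of word ids for one question
        emb = self.embed(question)                      # (m, w, emb_dim)
        _, (h_word, _) = self.word_lstm(emb)            # final word-level state per sentence
        sent_vecs = h_word[-1].unsqueeze(0)             # (1, m, hid_dim)
        h, _ = self.sent_lstm(sent_vecs)                # contextualized sentence vectors h_i
        return self.out(h.squeeze(0))                   # (m, 2) logits

model = SEM(vocab_size=10000)
question = torch.randint(0, 10000, (5, 20))             # 5 sentences, 20 word ids each
labels = torch.tensor([0, 1, 0, 0, 0])                   # only the best sentence has label 1
logits = model(question)
L_ext = nn.functional.cross_entropy(logits, labels)      # cross-entropy loss
f_ext = torch.softmax(logits, dim=-1)[:, 1]              # importance scores p(s_i)
```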

AGM encodes a question into sentence vectors in the same way as in SEM. The model uses these vectors to generate an answer (word sequence) by using an ordinary sequence-to-sequence model with an attention mechanism. We do not use a hierarchical decoder, because the main purpose of this study is not to improve the performance of answer generation. In the training phase, we use the negative log likelihood loss \(L_{\text {gen}}\) based on a predicted sequence and the correct sequence. In the evaluation phase, we calculate importance scores by using attention weights \(\alpha _j(i)\), each of which represents the alignment level with respect to \(s_i\) at the j-th step in generation. Specifically, we define the importance score of \(s_i\) obtained by AGM as the average of the attention weights for \(s_i\), i.e., \(f_{\text {gen}}(s_i) = \frac{1}{k} \sum _{j=1}^{k} \alpha _j(i)\).
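The AGM-side score can be computed directly from the decoder's attention matrix. The following sketch only assumes that the k-by-m attention weights are available; the shapes and values are hypothetical.

```python
# f_gen(s_i): average attention weight received by sentence s_i over the
# k generation steps.
import torch

def f_gen_scores(attention_weights: torch.Tensor) -> torch.Tensor:
    """attention_weights: (k, m) tensor of alpha_j(i); returns an (m,) score vector."""
    return attention_weights.mean(dim=0)

alpha = torch.tensor([[0.7, 0.2, 0.1],   # 4 generated words attending
                      [0.6, 0.3, 0.1],   # over 3 question sentences
                      [0.2, 0.7, 0.1],
                      [0.5, 0.4, 0.1]])
print(f_gen_scores(alpha))               # per-sentence average attention
```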

In our framework, we can thus simultaneously train two models in a multi-task setting (SEM and AGM are the respective main and auxiliary models) and combine their importance scores to estimate the most important sentence. We introduce two tuning parameters \(\lambda \) and \(\kappa \) for training and evaluation phases, respectively. The final loss function for the training phase is \(\lambda L_{\text {ext}}+ (1-\lambda )L_{\text {gen}}\), and the score function for the evaluation phase is \(\kappa f_{\text {ext}}(s_i) + (1-\kappa )f_{\text {gen}}(s_i)\).
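Putting the two pieces together, the following sketch shows how the tuning parameters enter training and evaluation; loss_ext/loss_gen and f_ext/f_gen are assumed to come from SEM and AGM as sketched above.

```python
import torch

def combined_loss(loss_ext, loss_gen, lam):
    # Training objective: lambda * L_ext + (1 - lambda) * L_gen
    return lam * loss_ext + (1.0 - lam) * loss_gen

def combined_score(f_ext, f_gen, kappa):
    # Evaluation score: kappa * f_ext(s_i) + (1 - kappa) * f_gen(s_i)
    return kappa * f_ext + (1.0 - kappa) * f_gen

# Toy usage: the sentence with the highest combined score is selected.
f_ext = torch.tensor([0.1, 0.8, 0.1])
f_gen = torch.tensor([0.2, 0.5, 0.3])
best = int(torch.argmax(combined_score(f_ext, f_gen, kappa=0.5)))
print(best)  # -> 1
```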

3 Experiment

Datasets: We prepared two datasets, Pair and Label, which were based on a publicly available CQA dataset [25] provided by Yahoo! Chiebukuro. These two datasets formed a semi-supervised setting with unlabeled paired data, in which Pair included many unlabeled QA pairs for training AGM, while Label included a few labeled questions for SEM.

Pair consisted of 100K QA pairs, each of which included a randomly sampled question and its best answer annotated in the CQA dataset. In the sampling procedure, we removed pairs including more than 10 sentences to reduce the computational cost, as such pairs accounted for less than 5% of the total. For the same reason, we removed pairs including sentences of more than 50 words.
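As an illustration only, the two pruning rules above can be expressed as a simple filter over tokenized pairs; the dict layout and field names below are hypothetical.

```python
# Drop pairs with more than 10 sentences or with any sentence longer than 50 words.
MAX_SENTENCES = 10
MAX_WORDS = 50

def keep_pair(pair):
    sentences = pair["question_sentences"] + pair["answer_sentences"]
    if len(sentences) > MAX_SENTENCES:
        return False
    return all(len(sent) <= MAX_WORDS for sent in sentences)

raw_pairs = [
    {"question_sentences": [["how", "do", "i", "finish", "the", "wi-fi", "setup"]],
     "answer_sentences": [["check", "the", "initial", "setup", "guide"]]},
]
pairs = [p for p in raw_pairs if keep_pair(p)]
print(len(pairs))  # -> 1
```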

Label consisted of 775 questions sampled separately but in a similar way to Pair. Every sentence in each question had a binary label representing whether the sentence was the most important, meaning that only the best sentence had a label of 1, while the others had a label of 0. We used crowdsourcing to annotate Label. In the crowdsourcing, five workers were given a question and asked to select the best sentence representing the main focus of the question. We included only questions for which at least four workers selected the same sentence.
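The agreement filter can be expressed compactly; the data layout below is hypothetical, but the rule matches the description above (at least four of five workers must choose the same sentence).

```python
from collections import Counter

def aggregate_labels(num_sentences, worker_choices, min_agreement=4):
    """worker_choices: sentence indices selected by the five workers."""
    best, count = Counter(worker_choices).most_common(1)[0]
    if count < min_agreement:
        return None                                   # discard the question
    return [1 if i == best else 0 for i in range(num_sentences)]

print(aggregate_labels(3, [0, 0, 0, 0, 1]))           # -> [1, 0, 0]
print(aggregate_labels(3, [0, 1, 2, 0, 1]))           # -> None (insufficient agreement)
```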

Unsupervised Baselines: We prepared the following unsupervised methods as simple baselines.

  • Lead: Selects the initial sentence.

  • TfIdf: Selects the sentence with the highest average tf-idf weight, computed over the CQA dataset (sketched after this list).

  • SimEmb: Selects the sentence most similar to the input question on the basis of the word mover’s distance [18].

  • LexRank: Uses a graph-based, unsupervised, extractive summarization model [8], which was trained with all the questions.
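As a concrete reference point, here is a toy sketch of the TfIdf baseline; the corpus, tokenization, and scoring details are our assumptions rather than the exact baseline implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

corpus = ["i bought a new smartphone yesterday",
          "the wi-fi connection fails during the initial setup",
          "what should i do"]                       # toy stand-in for the CQA dataset
vectorizer = TfidfVectorizer().fit(corpus)

def tfidf_select(sentences):
    scores = []
    for sent in sentences:
        vec = vectorizer.transform([sent])
        scores.append(vec.sum() / max(vec.nnz, 1))  # average tf-idf over matched words
    return int(np.argmax(scores))

question = ["i bought a new smartphone yesterday",
            "the wi-fi connection fails during the initial setup",
            "what should i do"]
print(tfidf_select(question))                       # index of the selected sentence
```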

Compared Methods: We systematically compared the following methods to study how to effectively use Pair by changing the parameter settings of \(\lambda \) and \(\kappa \) in our framework.

  • Ext: Trains and uses SEM only (\(\lambda =1\), \(\kappa =1\)).

  • Gen: Trains and uses AGM only (\(\lambda =0\), \(\kappa =0\)).

  • Sep: Trains SEM (\(\lambda =1\)) and AGM (\(\lambda =0\)) separately and combines them in the evaluation phase. Then, \(\kappa \) is tuned with the development set.

  • Pre: Trains SEM (\(\lambda =1\)) after initializing the encoder’s parameters by using AGM (\(\lambda =0\)). Prediction is done with SEM (\(\kappa =1\)).

  • Multi: Trains SEM and AGM simultaneously. Mini-batches are created for each dataset and shuffled, with the loss calculated per mini-batch. Then, \(\lambda \) and \(\kappa \) are tuned with the development set.
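A schematic training loop for Multi is shown below; the per-batch loss weighting and the stand-in loss functions are assumptions consistent with the description above, not the released code.

```python
import random
import torch

def multi_task_epoch(label_batches, pair_batches, sem_loss, agm_loss, optimizer, lam):
    """One epoch: mini-batches are built per dataset, shuffled together, and
    each batch contributes its own (weighted) loss."""
    batches = [("ext", b) for b in label_batches] + [("gen", b) for b in pair_batches]
    random.shuffle(batches)
    for kind, batch in batches:
        optimizer.zero_grad()
        loss = lam * sem_loss(batch) if kind == "ext" else (1 - lam) * agm_loss(batch)
        loss.backward()
        optimizer.step()

# Dummy usage with a single parameter and squared-error stand-ins for the losses.
w = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.SGD([w], lr=0.1)
dummy_loss = lambda batch: (w - batch).pow(2).sum()
multi_task_epoch([torch.tensor(1.0)], [torch.tensor(2.0)], dummy_loss, dummy_loss, opt, lam=0.5)
```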

Oversampling/Undersampling: We additionally prepared two variants of Multi to reduce the data imbalance between Label and Pair, because the dataset for the auxiliary task is much larger than that for the main task. Specifically, we used oversampling and undersampling as follows (a sketch follows the list).

  • MultiOver: Oversamples Label multiple times to be the same size as Pair.

  • MultiUnder: Undersamples Pair to be the same size as Label in every epoch.
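The two sampling variants can be sketched as follows; label_data and pair_data are hypothetical lists of training examples, and the exact duplication/subsampling scheme is our assumption.

```python
import random

def oversample(label_data, pair_data):
    # MultiOver: repeat the small labeled set until it roughly matches Pair.
    reps = max(len(pair_data) // max(len(label_data), 1), 1)
    return label_data * reps, pair_data

def undersample(label_data, pair_data, epoch_seed):
    # MultiUnder: draw a fresh subset of Pair of the same size as Label every epoch.
    rng = random.Random(epoch_seed)
    return label_data, rng.sample(pair_data, k=min(len(label_data), len(pair_data)))

label_data, pair_data = list(range(3)), list(range(100))
print(len(oversample(label_data, pair_data)[0]))       # 99 labeled examples after repetition
print(len(undersample(label_data, pair_data, 0)[1]))   # 3 pairs for this epoch
```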

Distant Supervision: Furthermore, we prepared a pseudo labeled dataset Pseudo, which included pseudo (noisy) labels for all the questions in Pair. This pseudo labeling approach is often called distant supervision, in which unlabeled data is automatically annotated with heuristic rules. Following Ishigaki et al. [15], we adopted their heuristic rule that single-sentence questions are basically self-contained and have summary-like characteristics. Because their labels for single-sentence questions could not be directly used for our questions with multiple sentences, we first trained a classifier with their labels and used it to make Pseudo (a sketch of this step follows the list below). Thus, using Pseudo, we prepared the following variants of Multi, Ext, Sep, and Pre for comparison.

  • MultiDist: Multi trained with Label, Pair, and Pseudo.

  • ExtDist/SepDist/PreDist: Variants of Ext/Sep/Pre, similar to MultiDist.
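The pseudo-labeling step could look like the sketch below. This is our reading of the procedure (a sentence-level classifier fit on summary-like sentences from single-sentence questions, then applied to multi-sentence questions), with toy data and features, not the authors' exact classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
import numpy as np

# Toy stand-ins for the heuristic labels: single-sentence questions serve as
# summary-like positives, other sentences as negatives.
sentences = ["how do i reset my phone",
             "which laptop should i buy for work",
             "thanks in advance",
             "i bought it last week"]
labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(sentences, labels)

def pseudo_label(question_sentences):
    # Give pseudo label 1 to the sentence the classifier scores highest.
    scores = clf.predict_proba(question_sentences)[:, 1]
    best = int(np.argmax(scores))
    return [1 if i == best else 0 for i in range(len(question_sentences))]

print(pseudo_label(["i got a new phone yesterday",
                    "how do i finish the wi-fi setup",
                    "any help is appreciated"]))
```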

Evaluation: For evaluating the performance, we used an accuracy measure calculated by dividing the number of questions for which the target method correctly selected the most important sentence by the number of questions used. Note that well-known metrics such as ROUGE and precision/recall were not appropriate, because our task was to find only one sentence as a (snippet) headline. We divided the labeled data Label into five sets (train:develop:test = 3:1:1) and performed five-fold cross-validation to evaluate the methods.
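For reference, a sketch of the evaluation protocol (top-1 accuracy with five-fold cross-validation and a 3:1:1 split) is given below; the fold rotation scheme is our assumption.

```python
import numpy as np
from sklearn.model_selection import KFold

def accuracy(predicted_indices, gold_indices):
    # Fraction of questions whose most important sentence is selected correctly.
    correct = sum(p == g for p, g in zip(predicted_indices, gold_indices))
    return correct / len(gold_indices)

def five_fold_splits(n_questions, seed=0):
    # Each fold uses one fifth as test, one fifth as development, and the rest as train.
    folds = list(KFold(n_splits=5, shuffle=True, random_state=seed).split(np.arange(n_questions)))
    for i in range(5):
        test_idx = folds[i][1]
        dev_idx = folds[(i + 1) % 5][1]
        train_idx = np.setdiff1d(np.arange(n_questions), np.concatenate([test_idx, dev_idx]))
        yield train_idx, dev_idx, test_idx

for train_idx, dev_idx, test_idx in five_fold_splits(20):
    pass  # train on train_idx, tune lambda/kappa on dev_idx, report accuracy on test_idx
print(accuracy([1, 0, 2], [1, 0, 0]))  # -> 0.666...
```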

Table 1. Accuracy on the question summarization task. Each check mark indicates that the corresponding dataset was used.

Results: Table 1 lists the results. The three row groups from top to bottom correspond to unsupervised, semi-supervised, and distantly supervised settings. In the first group, Lead performed the best, whereas the other methods (TfIdf, SimEmb, and LexRank) did not work well. This indicates the difficulty of our task and confirms that we need supervision to develop practical models.

In the second group, MultiUnder performed the best, although Multi (without sampling) performed worse than Ext did. This suggests that reducing the data imbalance is a key factor for our setting. MultiOver also worked well but did not reach the performance of MultiUnder. The reason seems to be that sampling the same data many times yields overfitting. Among other methods, Sep performed well because of an ensemble effect of Ext and Gen, whereas Gen by itself performed the worst because it did not use any labels. Pre unexpectedly performed worse than Ext did, although Shimizu et al. [27] reported that sentiment classifiers were more enhanced by pretraining with tweet-reply pairs than by language model pretraining. This implies that the performance depends on the task settings, so our framework can be useful for other tasks.

In the third group, MultiDist (without sampling) performed the best. The differences from the other methods in this group were statistically significant according to the sign test (\(p<0.05\)). Although distant supervision itself has positive effects, as shown by the improvement for ExtDist, it has an extra bonus in that the pseudo labels can simply solve the data imbalance. These results suggest that there is room to study combinations of multi-task training and distant supervision for other NLP tasks. We also prepared a larger labeled dataset than Label, and experiments on it showed similar tendencies. We will study how the size of the labeled data affects performance in future work.

4 Related Work

Several studies have considered semi-supervised settings for summarization tasks [1, 20, 30], but in contrast to our main focus, none of them considered multi-task settings, especially with paired data. In the multi-task field, there have been several studies on summarization tasks. Guo et al. [11] improved an abstractive summarization model by using multi-task training with entailment and question generation tasks. Their work used human-annotated data from the SQuAD dataset for these auxiliary tasks, and these data were much smaller than the data for the main task, so their setting was completely different from ours. Angelidis et al. [2] addressed summarization of opinions from Amazon reviews by using multi-task training with aspect extraction and sentiment prediction tasks. Their work is related to ours in that they targeted user-generated content, but their auxiliary tasks were basic subtasks of opinion summarization with explicit aspect or sentiment labels. This implies that the usefulness of their auxiliary tasks was clearer than that of ours, in which we only assume a paired structure without any explicit labels. The study most related to ours is the work by Isonuma et al. [17], who proposed an extractive summarization method for news articles through multi-task training with a document classification task. Their strategy was similar to ours in that they used categories originally attached to news articles without costly annotation, but in many cases, such as on CQA sites, we cannot access such categories or other useful meta-information for documents.

Several studies have used QA or similar structures for summarization tasks. Chen et al. [6] used a QA system to predict summarization quality in the evaluation phase, in contrast to our study, which uses QA paired data in the training phase. Arumae and Liu [3] used QA data to calculate a reward function for reinforcement learning in the training phase; however, they used Cloze-style (fill-in-the-blank) questions, and we cannot directly apply their method to our task. Gao et al. [9] used an article-comments structure to personalize summaries in a multi-modal setting with multiple inputs, i.e., an article and its comments, rather than in a multi-task setting with multiple outputs, as in our study. Note that we did not consider such a multi-modal setting, as we assumed that answers would not always be available for posted questions.

Many studies have used CQA data, but most have addressed different tasks, i.e., answering questions [4, 5, 23, 28], retrieving similar questions [19, 23, 26], and generating questions [12]. Tamura et al. [29] focused on extracting a core sentence and identifying the question type as a classification task for answering multiple-sentence questions. Higurashi et al. [13] proposed a learning-to-rank approach for extracting an important substring from a question. Although these models are useful for retrieving important information, they were trained with only labeled data. Finally, Ishigaki et al. [16] addressed neural abstractive and extractive approaches to summarizing lengthy questions by using a large amount of paired data consisting of questions and headlines. Their method is therefore not applicable to our task, in which we assume questions without headlines.

5 Conclusion

We have addressed an extractive question summarization task with QA pairs as a case study of a semi-supervised setting with unlabeled paired data. Our results suggest that multi-task training is effective especially with undersampling and distant supervision. For future work, we will apply our framework to other tasks with similar structures, such as news articles with comments.