Romain Paulus, Caiming Xiong and Richard Socher

The last few decades have witnessed a fundamental change in the challenge of taking in new information. The bottleneck is no longer access to information; now it’s our ability to keep up. We all have to read more and more to keep up-to-date with our jobs, the news, and social media. We’ve looked at how AI can improve people’s work by helping with this information deluge, and one potential answer is to have algorithms automatically summarize longer texts.

Training a model that can generate long, coherent, and meaningful summaries remains an open research problem. In fact, generating any kind of longer text is hard for even the most advanced deep learning algorithms. In order to make summarization successful, we introduce two separate improvements: a more contextual word generation model and a new way of training summarization models via reinforcement learning (RL).

Combining these two improvements enables the system to create relevant and highly readable multi-sentence summaries of long text, such as news articles, significantly improving on previous results. Our algorithm can be trained on a variety of different types of texts and summary lengths. In this blog post, we present the main contributions of our model and an overview of the natural language challenges specific to text summarization.

Automatic summarization models can work in one of two ways: by extraction or by abstraction. Extractive models perform "copy-and-paste" operations: they select relevant phrases of the input document and concatenate them to form a summary. They are quite robust since they use existing natural-language phrases that are taken straight from the input, but they lack flexibility since they cannot use novel words or connectors. They also cannot paraphrase like people sometimes do. In contrast, abstractive models generate a summary based on the actual “abstracted” content: they can use words that were not in the original input. This gives them a lot more potential to produce fluent and coherent summaries, but it is also a much harder problem, as the model now has to generate coherent phrases and connectors.

Even though abstractive models are more powerful in theory, it is common for them to make mistakes in practice. Typical mistakes include incoherent, irrelevant or repeated phrases in generated summaries, especially when trying to create long text outputs. They historically lacked a sense of general coherence, flow and readability. In this work, we tackle these issues and design a more robust and coherent abstractive summarization model.

In order to understand our new abstractive model, let’s first define the basic building blocks and then introduce our new training scheme.

Recurrent neural networks (RNNs) are deep learning models that can process sequences (e.g. text) of variable length and compute a useful representation (or hidden state) for each element. These networks process each element of the sequence (in this case, each word) one by one; for each new input in the sequence, the network outputs a new hidden state as a function of that input and the previous hidden state. In this sense, the hidden state calculated at each word is a function of all the words read up to that point.

The input (reading) and output (generating) RNNs can be combined in a joint model where the final hidden state of the input RNN is used as the initial hidden state of the output RNN. Combined in this way, the joint model is able to read any text and generate a different text from it. This framework is called an encoder-decoder RNN (or Seq2Seq) and is the basis of our summarization model. In addition, we replace the traditional encoder RNN by a bidirectional encoder, which uses two different RNNs to read the input sequence: one that reads the text from left-to-right (as illustrated in Figure 4) and another that reads from right-to-left. This helps our model to have a better representation of the input context.
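
The recurrence and the bidirectional reading described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: a basic tanh RNN cell rather than the cell used in the actual model, random untrained weights, and hypothetical helper names (`rnn_step`, `run_rnn`, `bidirectional_encode`).

```python
import math
import random

def rnn_step(x, h_prev, W_x, W_h):
    """One recurrent step: new hidden state from the input and the previous state."""
    size = len(h_prev)
    return [
        math.tanh(
            sum(W_x[i][j] * x[j] for j in range(len(x)))
            + sum(W_h[i][j] * h_prev[j] for j in range(size))
        )
        for i in range(size)
    ]

def run_rnn(inputs, W_x, W_h, hidden_size):
    """Read a sequence one element at a time, returning every hidden state."""
    h = [0.0] * hidden_size
    states = []
    for x in inputs:
        h = rnn_step(x, h, W_x, W_h)
        states.append(h)
    return states

def random_matrix(rows, cols, rng):
    # Untrained placeholder weights, for illustration only.
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def bidirectional_encode(inputs, hidden_size, rng):
    """Concatenate left-to-right and right-to-left states for each position."""
    emb = len(inputs[0])
    fwd = run_rnn(inputs, random_matrix(hidden_size, emb, rng),
                  random_matrix(hidden_size, hidden_size, rng), hidden_size)
    # The backward RNN reads the reversed input; its states are flipped back
    # so that index i again refers to word i.
    bwd = run_rnn(inputs[::-1], random_matrix(hidden_size, emb, rng),
                  random_matrix(hidden_size, hidden_size, rng), hidden_size)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]

rng = random.Random(0)
word_vectors = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(5)]  # 5 "words"
encoded = bidirectional_encode(word_vectors, hidden_size=4, rng=rng)
```

Each entry of `encoded` then summarizes one word together with the context on both sides of it.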

To make our model outputs more coherent, we allow the decoder to look back at parts of the input document when generating a new word with a technique called temporal attention. Instead of relying entirely on its own hidden state, the decoder can incorporate contextual information about different parts of the input with an attention function. This attention is then modulated to ensure that the model uses different parts of the input when generating the output text, hence increasing information coverage of the summary.
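
One way to realize this modulation is to divide each input position's current attention score by the scores it received at earlier decoding steps, so that positions already attended to are down-weighted. The sketch below assumes a plain dot-product score and made-up two-dimensional states; the function name `temporal_attention` and its arguments are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def temporal_attention(decoder_state, encoder_states, past_sums):
    """Attend over encoder states, penalizing positions attended to before.

    past_sums[i] is the sum of exp(score) for input position i over all
    previous decoding steps, or None at the first step.
    """
    exp_scores = [math.exp(dot(decoder_state, h)) for h in encoder_states]
    if past_sums is None:
        modulated = exp_scores
        new_sums = exp_scores[:]
    else:
        modulated = [e / p for e, p in zip(exp_scores, past_sums)]
        new_sums = [p + e for p, e in zip(past_sums, exp_scores)]
    total = sum(modulated)
    weights = [m / total for m in modulated]
    dim = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states)) for k in range(dim)]
    return context, weights, new_sums

encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [2.0, 0.0]

# Two decoding steps with the same decoder state: the second step spreads its
# attention because the first position was already heavily attended to.
_, weights_1, sums = temporal_attention(decoder_state, encoder_states, None)
_, weights_2, sums = temporal_attention(decoder_state, encoder_states, sums)
```

In this toy run the first input position dominates at step one, and at step two its weight falls back to the same level as the others.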

In addition, to make sure that our model doesn't repeat itself, we define an intra-decoder attention function that lets the decoder look back at its own previous hidden states. Finally, the decoder combines the context vector from the temporal attention with the one from the intra-decoder attention to generate the next word in the output summary. Figure 5 illustrates the combination of these two attention functions at a given decoding step.
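
A minimal sketch of the intra-decoder attention and of the final combination step might look as follows. It assumes dot-product scores, a zero context vector at the first decoding step, and simple concatenation as the way the vectors are combined; the names `intra_decoder_context` and `combine` are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def intra_decoder_context(current_state, previous_states):
    """Context vector over the decoder's own earlier hidden states."""
    if not previous_states:
        # Nothing generated yet: use a zero vector (an assumption of this sketch).
        return [0.0] * len(current_state)
    weights = softmax([dot(current_state, h) for h in previous_states])
    dim = len(current_state)
    return [sum(w * h[k] for w, h in zip(weights, previous_states)) for k in range(dim)]

def combine(decoder_state, encoder_context, decoder_context):
    """Concatenate the vectors that would feed the next-word prediction."""
    return decoder_state + encoder_context + decoder_context

first_context = intra_decoder_context([1.0, 0.0], [])
later_context = intra_decoder_context([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
features = combine([1.0, 0.0], [0.3, 0.7], later_context)
```

In a full model, `features` would pass through a learned output layer to produce a probability for every word in the vocabulary.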

To train this model on real-world data like news articles, a common approach is the teacher forcing algorithm: the model generates a summary while being fed a reference summary, and it is assigned a word-by-word error (or “local supervision”, as shown in Figure 6) each time it generates a new word.
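
The word-by-word error is typically the negative log-likelihood of each reference word. The sketch below uses a made-up three-word vocabulary and hand-written probabilities in place of a real model's outputs; `teacher_forcing_loss` is a hypothetical name.

```python
import math

def teacher_forcing_loss(step_distributions, reference):
    """Sum of per-word negative log-likelihoods of the reference summary.

    step_distributions[t] maps each vocabulary word to the probability the
    model assigns at step t, given the reference words before t as input.
    """
    return -sum(math.log(dist[word]) for dist, word in zip(step_distributions, reference))

reference = ["storm", "hits", "coast"]
# Illustrative probabilities only; a trained model would produce these.
step_distributions = [
    {"storm": 0.7, "hits": 0.1, "coast": 0.2},
    {"storm": 0.1, "hits": 0.6, "coast": 0.3},
    {"storm": 0.2, "hits": 0.3, "coast": 0.5},
]
loss = teacher_forcing_loss(step_distributions, reference)
```

The loss shrinks only when the model puts more probability on the exact reference word at each position.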

This method can be used to train any sequence generation model based on recurrent neural networks, with very decent results. However, for our particular task, summaries don't have to match a reference sequence word by word in order to be correct. As you can imagine, two humans may generate very different summaries of the same news article, sometimes using different styles, words or sentence orders, while still being considered good summaries.

The problem with teacher forcing here is that as soon as the first few words are generated, the training is misguided: it sticks strictly to the one officially correct summary and cannot adjust to a potentially correct but different beginning.

Taking this into consideration, we can do better than the word-by-word approach of teacher forcing. A different kind of training called reinforcement learning (RL) can be applied here.

First, the RL algorithm lets the model generate its own summary, then it uses an external scorer to compare the generated summary against the ground truth. This scorer then indicates to the model how "good" the generated summary was. If the score is high, then the model can update itself to make such summaries more likely to appear in the future. Otherwise, if the score is low, the model will get penalized and change its generation procedure to prevent similar summaries. This reinforced model is very good at increasing the summarization score that evaluates the entire sequence rather than a word-by-word prediction.
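
This update can be sketched as a policy-gradient loss. The example assumes the score is compared against a baseline score, a common variance-reduction choice that the description above does not specify; `rl_loss` and the numeric values are hypothetical.

```python
import math

def rl_loss(sample_log_probs, sample_reward, baseline_reward=0.0):
    """Policy-gradient loss for one sampled summary.

    Minimizing this raises the probability of samples that score above the
    baseline and lowers it for samples that score below.
    """
    advantage = sample_reward - baseline_reward
    return -advantage * sum(sample_log_probs)

# Log-probabilities the model assigned to the words of a summary it sampled.
sample_log_probs = [math.log(0.5), math.log(0.4)]

# Same sampled summary, scored above and below the (assumed) baseline.
good = rl_loss(sample_log_probs, sample_reward=0.6, baseline_reward=0.4)
bad = rl_loss(sample_log_probs, sample_reward=0.2, baseline_reward=0.4)
```

Only the gradient matters here: lowering `good` means raising the log-probabilities of the sampled words, while lowering `bad` means reducing them.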

What exactly is this scorer, and how does it tell if summaries are "good"? Since asking a human to manually evaluate millions of summaries is slow and impractical at scale, we rely on an automated evaluation metric called ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE works by comparing matching sub-phrases in the generated summaries against sub-phrases in the ground truth reference summaries, even if they are not perfectly aligned. The variants of ROUGE work in the same fashion but compare different units: ROUGE-1 matches single words, ROUGE-2 matches pairs of consecutive words, and ROUGE-L uses the longest common subsequence.
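
To make the idea concrete, here is a simplified n-gram recall in the spirit of ROUGE-1 and ROUGE-2. It is not the official ROUGE implementation: it skips stemming and precision/F-measure, and does not cover ROUGE-L. The function names and example sentences are made up.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each match so a repeated candidate n-gram is not counted twice.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

reference = "the storm hit the coast on monday"
candidate = "storm hit the coast"

rouge_1 = rouge_n_recall(candidate, reference, n=1)  # 4 of 7 reference words matched
rouge_2 = rouge_n_recall(candidate, reference, n=2)  # 3 of 6 reference bigrams matched
```

Because the score depends only on overlapping n-grams, a candidate can score well without being fluent.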

While ROUGE scores have a good correlation with human judgment in general, the summaries with the highest ROUGE aren't necessarily the most readable or natural ones. This became an issue when we trained our model to maximize the ROUGE score with reinforcement learning alone. We observed that our models with the highest ROUGE scores also generated barely-readable summaries.

To bring the best of both worlds, our model is trained with teacher forcing and reinforcement learning at the same time, making use of both word-level and whole-summary-level supervision. In particular, we find that ROUGE-optimized RL helps improve recall (i.e., all important information that needs to be summarized is indeed summarized), while word-level supervision ensures good language flow, making the summary more coherent and readable.
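
One natural way to write such a joint objective is as a weighted sum of the two losses. The symbols below are assumptions for illustration: $L_{\text{ml}}$ stands for the word-level teacher forcing loss, $L_{\text{rl}}$ for the reinforcement learning loss, and $\gamma$ for a hypothetical mixing weight.

```latex
L_{\text{mixed}} = \gamma \, L_{\text{rl}} + (1 - \gamma) \, L_{\text{ml}}, \qquad 0 \le \gamma \le 1
```

With $\gamma = 0$ this reduces to pure teacher forcing, and with $\gamma = 1$ to pure RL training.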

Until recently, the highest ROUGE-1 score for abstractive summarization on the CNN/Daily Mail dataset was 35.46. The combination of our intra-decoder attention RNN model with joint supervised and RL training improves this score to 39.87, and to 41.16 with RL only. Figure 9 shows other summarization scores for existing models and ours. Even though our pure RL model has higher ROUGE scores, our supervised+RL model is more readable and therefore more relevant for this summarization task. Note that See et al. use a slightly different data format, so their results are not directly comparable with ours and the others, but they still give a good reference point.

What does such a large improvement mean in terms of real summaries? Here we show a couple of multi-sentence summaries based on documents from the development split of the dataset. Our model and its simpler baselines generated these after training on the CNN/Daily Mail dataset. As you can see, the summaries have significantly improved, but there’s still more work needed to make them perfect.

Our model significantly improves the state-of-the-art in multi-sentence summary generation, outperforming existing abstractive models and extractive baselines. We believe that our contributions - the intra-decoder attention module and the combined training objective - could improve other sequence generation tasks, especially for long outputs.

Our work also touches on the limits of automatic evaluation metrics such as ROUGE, and shows that better metrics are required to evaluate - and optimize - summarization models. An ideal metric will correlate well with human judgment in terms of both summary coherence and readability. When such a metric is used with our reinforced summarization model, summaries may improve even further.