Question answering remains one of the most difficult challenges we face in Natural Language Processing. The idea of creating an agent capable of open-domain question answering - answering arbitrary questions with respect to arbitrary documents - has long captured our imagination. An agent that responds in natural language rather than by lists of query results (as in search) takes on almost anthropomorphic qualities, and spurs the imagination to think about the future of artificial intelligence.
The path to open-domain question answering has been long and challenging. One crucial problem that has dogged researchers on this path has been the lack of large-scale datasets. Traditional question answering datasets, such as MCTest, have been high in quality. However, the cost of annotating such datasets with human experts has been prohibitive, thus keeping them small. Recently, researchers have devised techniques to create arbitrarily large cloze-form question answering datasets. These cloze-form datasets, such as the CNN/DailyMail corpus, are created by replacing an entity with a placeholder, thereby creating a problem similar to fill-in-the-blank. Namely, the task is to infer the missing entity by choosing amongst all the entities that appear in the document. The cloze-form question answering task is not as natural as open-domain question answering, but the ease with which cloze-form datasets can be created has led to dramatic progress in the development of expressive models such as deep neural networks for question answering.
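To make the cloze construction concrete, here is a minimal sketch in Python. The function names, the `@placeholder` token, and the toy sentences are assumptions chosen for illustration; this is not the actual pipeline used to build the CNN/DailyMail corpus, which also involves entity detection and other steps not shown here.

```python
def make_cloze(query, answer_entity, placeholder="@placeholder"):
    """Turn a sentence into a fill-in-the-blank question by hiding one entity."""
    if answer_entity not in query:
        raise ValueError("the answer entity must appear in the query")
    # Replace only the first occurrence of the entity with the placeholder.
    return query.replace(answer_entity, placeholder, 1)


def candidate_answers(document, entities):
    """Keep only the entities that actually appear in the document."""
    return [entity for entity in entities if entity in document]


# Toy data (hypothetical, loosely based on the Priestley passage in this post).
document = "Joseph Priestley focused sunlight on mercuric oxide inside a glass tube."
entities = ["Joseph Priestley", "mercuric oxide", "London"]
query = "Joseph Priestley published his findings in 1775."

cloze_question = make_cloze(query, "Joseph Priestley")
candidates = candidate_answers(document, entities)
```

A model trained on such data sees `cloze_question` together with `document` and must pick the hidden entity out of `candidates`. Because no human annotation is needed, this kind of example can be generated in bulk.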
Some of the earliest question answering systems date back to BASEBALL and STUDENT in the 1960s. These systems tended to be limited in domain, but they are nevertheless telling of our fascination with autonomous agents that can understand and communicate in natural language to answer questions.
In recent years, the exponential increase in data and in computational power has enabled the development of ever more powerful machine learning systems. In particular, the resurgence of neural networks has led to the widespread adoption of deep learning models in domains ranging from machine translation to object recognition to speech recognition. Today, we announce the Dynamic Coattention Network (DCN), an end-to-end deep learning system for question answering. The DCN combines a coattentive encoder with a dynamic decoder. The combination of these two techniques allows the DCN to significantly outperform other systems on the Stanford Question Answering Dataset.
In most neural network approaches for Natural Language Processing, the system builds a static representation of the input document upon which to perform inference.
Although this approach has produced remarkable systems for tasks such as machine translation, we feel that it is insufficient for question answering. The reason behind this intuition is that it is incredibly difficult to build a static representation over a document to answer arbitrary questions. It is much easier to build a representation over the document to answer a single question that is known in advance.
To make this idea more concrete, let's consider an example. Suppose I gave you the following document. You can only read this document once (don't cheat!).
In the meantime, on August 1, 1774, an experiment conducted by the British clergyman Joseph Priestley focused sunlight on mercuric oxide (HgO) inside a glass tube, which liberated a gas he named "dephlogisticated air". He noted that candles burned brighter in the gas and that a mouse was more active and lived longer while breathing it. After breathing the gas himself, he wrote: "The feeling of it to my lungs was not sensibly different from that of common air, but I fancied that my breast felt peculiarly light and easy for some time afterwards." Priestley published his findings in 1775 in a paper titled "An Account of Further Discoveries in Air" which was included in the second volume of his book titled Experiments and Observations on Different Kinds of Air. Because he published his findings first, Priestley is usually given priority in the discovery.
Do you remember who published the paper? What was his occupation? How about the chemical used in his experiments on oxygen? How about when he published his findings? What was the name of the paper he published? Hopefully, you would agree that it is hard to answer these questions based on a single reading.
Now, let's try something else. I am going to give you a document and I would like you to answer the question "what is needed to make combustion happen".
Highly concentrated sources of oxygen promote rapid combustion. Fire and explosion hazards exist when concentrated oxidants and fuels are brought into close proximity; an ignition event, such as heat or a spark, is needed to trigger combustion. Oxygen is the oxidant, not the fuel, but nevertheless the source of most of the chemical energy released in combustion. Combustion hazards also apply to compounds of oxygen with a high oxidative potential, such as peroxides, chlorates, nitrates, perchlorates, and dichromates because they can donate oxygen to a fire.
Now, read the document one more time to answer the question "what role does oxygen play in combustion?"
The first approach, in which you were forced to cram as much information about the document as possible without knowing what the questions would be, is analogous to the traditional approach of building a static representation. The second approach, in which you were able to read the document again for each question, is analogous to building a conditional representation of the document, based on the question. Hopefully, you'll agree with me that the latter is much easier than the former, since you can selectively read the document and discard information irrelevant to the question. This is exactly the idea behind our Coattention Encoder, the first of two parts of the DCN.
For each document and question pair, the Coattention Encoder builds a conditional representation of the document given the question, as well as a conditional representation of the question given the document. The encoder then builds a final representation of the document, taking into account the two previous conditional representations. A subsequent decoder module then produces an answer from this final representation.
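The flow described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the DCN implementation. It assumes that words are already encoded as small vectors, that relevance between a document word and a question word is a dot product, and that attention weights come from a softmax. The final fusion step is reduced to simple concatenation here. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(queries, keys, values):
    """For each query vector, return a weighted average of `values`,
    weighted by the softmax of its dot products with `keys`."""
    dim = len(values[0])
    summaries = []
    for q in queries:
        weights = softmax([dot(q, k) for k in keys])
        summaries.append(
            [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
        )
    return summaries


def coattention(doc, question):
    # For each question word, a summary of the document words relevant to it.
    doc_for_question = attend(question, doc, doc)
    # Pair every question word with its document summary.
    question_context = [q + c for q, c in zip(question, doc_for_question)]
    # For each document word, a summary of the paired question vectors.
    question_for_doc = attend(doc, question, question_context)
    # Final representation: each document word next to its conditional context.
    return [d + c for d, c in zip(doc, question_for_doc)]


# Toy input: 3 document words and 2 question words, each a 2-d vector.
doc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question = [[1.0, 0.0], [0.5, 0.5]]
encoded = coattention(doc, question)
```

The result has one vector per document word, so a decoder can still point at positions in the document, but every vector now depends on the question. Changing the question changes the encoding, which mirrors rereading the document with a specific question in mind.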