
A Survey on Machine Reading Comprehension Systems

Published online by Cambridge University Press:  19 January 2022

Razieh Baradaran
Affiliation:
Computer and Information Technology Department, University of Qom, Qom, Iran
Razieh Ghiasi
Affiliation:
Computer and Information Technology Department, University of Qom, Qom, Iran
Hossein Amirkhani*
Affiliation:
Computer and Information Technology Department, University of Qom, Qom, Iran
*Corresponding author. E-mail: amirkhani@qom.ac.ir

Abstract

Machine Reading Comprehension (MRC) is a challenging task and a hot topic in Natural Language Processing. The goal of this field is to develop systems for answering questions about a given context. In this paper, we present a comprehensive survey on diverse aspects of MRC systems, including their approaches, structures, inputs/outputs, and research novelties. We illustrate the recent trends in this field based on a review of 241 papers published during 2016–2020. Our investigation demonstrates that the focus of research has shifted in recent years from answer extraction to answer generation, from single- to multi-document reading comprehension, and from learning from scratch to using pre-trained word vectors. Moreover, we discuss the popular datasets and evaluation metrics in this field. The paper ends with an investigation of the most-cited papers and their contributions.

Type
Survey Paper
Copyright
© The Author(s), 2022. Published by Cambridge University Press

1. Introduction

Machine Reading Comprehension (MRC) is a challenging task in Natural Language Processing (NLP) aimed at evaluating the extent to which machines achieve the goal of natural language understanding. In order to assess a machine's comprehension of a piece of natural language text, a set of questions about the text is given to the machine, and the machine's responses are evaluated against the gold standard. Nowadays, MRC is known as the research area of reading comprehension for machines based on question answering (QA). Each instance in MRC datasets contains a context $C$, a related question $Q$, and an answer $A$. Figure 1 shows examples from the SQuAD (Rajpurkar et al. 2016) and CNN/Daily Mail (Hermann et al. 2015) datasets. The goal of MRC systems is to learn the predictive function $f$ that extracts/generates the appropriate answer $A$ based on the context $C$ and the related question $Q$:

$$f: (C, Q) \to A$$
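To make this setup concrete, the following minimal Python sketch (not from the original paper; the dataclass and the toy word-overlap baseline are illustrative assumptions) represents an MRC instance $(C, Q, A)$ and a trivial extractive system implementing $f$:

```python
from dataclasses import dataclass
from typing import Callable

# A minimal sketch of an MRC instance: (context C, question Q, answer A).
@dataclass
class MRCInstance:
    context: str   # C: the passage the system must read
    question: str  # Q: a question about the passage
    answer: str    # A: gold answer (a span of C in extractive settings)

# The predictive function f: (C, Q) -> A that an MRC system learns.
MRCSystem = Callable[[str, str], str]

def naive_baseline(context: str, question: str) -> str:
    """Toy extractive baseline: return the context sentence sharing
    the most words with the question."""
    q_words = set(question.lower().split())
    sentences = context.split(". ")
    return max(sentences, key=lambda s: len(q_words & set(s.lower().split())))

example = MRCInstance(
    context="Tesla was born in 1856. He moved to the United States in 1884.",
    question="When was Tesla born?",
    answer="1856",
)
print(naive_baseline(example.context, example.question))
```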

Furthermore, MRC systems have important applications in diverse areas, such as conversational agents (Hewlett, Jones and Lacoste 2017; Reddy, Chen and Manning 2019) and customer service support (Cui et al. 2017a).

Although MRC is referred to as QA in some studies, these two concepts are different in the following ways:

  • The main objective of QA systems is to answer the input questions, while the main goal of an MRC system, as the name suggests, is to demonstrate the machine's ability to understand natural language by answering questions about a specific context that it reads.

  • The only input to QA systems is the question, while the inputs to MRC systems comprise the question and the corresponding context, which should be used to answer the question. As a result, MRC is sometimes referred to as QA from text (Deng and Liu 2018).

  • The main information source used to answer questions in MRC systems is natural language text, while in QA systems, structured and semi-structured data sources, such as knowledge bases, are commonly used in addition to unstructured data like text.

1.1. History

The history of reading comprehension for machines dates back to the 1970s, when researchers identified it as a convenient way to test computer comprehension ability. One of the most prominent early studies was the QUALM system (Lehnert 1977). This system was limited to handwritten scripts and could not be easily generalized to larger domains. Due to the complexity of the task, research in this area declined in the 1980s and 1990s. In the late 1990s, Hirschman et al. (1999) revived the field of MRC by creating a new dataset, including 120 stories and questions from 3rd- to 6th-grade material, followed by a workshop on reading comprehension tests as a tool for assessing machine comprehension at ANLP/NAACL 2000.

Figure 1. Samples from the SQuAD (Rajpurkar et al. 2016) and CNN/Daily Mail (Chen, Bolton and Manning 2016) datasets. The original article of the CNN/Daily Mail example can be found at https://edition.cnn.com/2015/03/10/entertainment/feat-star-wars-gay-character

Another revolution in this field occurred between 2013 and 2015 with the introduction of labeled training datasets mapping (context, question) pairs to the answer. This transformed the MRC problem into a supervised learning task (Chen 2018). Two prominent datasets in this period were the MCTest dataset (Richardson, Burges and Renshaw 2013) with 500 stories and 2000 questions and the ProcessBank dataset (Berant et al. 2014) with 585 questions over 200 paragraphs related to biological processes. From 2015 onward, the introduction of large datasets such as CNN/Daily Mail (Hermann et al. 2015) and SQuAD (Rajpurkar et al. 2016) opened a new window in the MRC field by enabling the development of deep models.

In recent years, with the success of machine learning techniques, especially neural networks, and the use of recurrent neural networks to process sequential data such as text, MRC has become an active area in NLP. The goal of this paper is to categorize these studies, provide related statistics, and show the trends in this field. Some recent surveys focused on QA systems (Bouziane et al. 2015; Kodra and Meçe 2017). Some papers presented a partial survey of MRC systems but did not provide a comprehensive classification of the different aspects and statistics in this field (Arivuchelvan and Lakahmi 2017; Zhang et al. 2019a). Ingale and Singh (2019) only provided a review of MRC datasets. Liu et al. (2019c) provided a review of different aspects of neural MRC models, including definitions, differences, popular datasets, architectures, and new trends, based on 85 papers. In our study, we present a comprehensive review of 241 papers, analyzing and categorizing MRC studies from different aspects including problem-solving approaches, inputs/outputs, model structures, research novelties, and datasets. We also provide statistics on the amount of research attention to these aspects in different years, which are not provided in previous reviews.

1.2. Outline

In order to select papers, the queries "reading comprehension," "machine reading," "machine reading comprehension," and "machine comprehension" were submitted to the Google Scholar service. Also, the ACL Anthology website, which includes top related NLP conferences such as ACL, EMNLP, NAACL, and CoNLL, was searched with the same queries to extract the remaining related papers. We excluded retrieved papers that were published only on arXiv, as well as QA papers with no novelty in the MRC phase. We also excluded papers on conversational or dialogue MRC, because these papers focus on multi-turn QA in conversational contexts with different challenges (Gupta, Rawat and Yu 2020). We limited our study to papers published in recent years, that is, from 2016 to September 2020. Table 1 shows the number of reviewed papers over different years.

Table 1. Number of reviewed papers over different years.

The contributions of this paper are as follows:

  • Investigating recently published MRC papers from different perspectives, including problem-solving approaches, system inputs/outputs, contributions of these studies, and evaluation metrics.

  • Providing statistics for each category over different years and highlighting the trends in this field.

  • Reviewing available datasets and classifying them based on important factors.

  • Discussing specific aspects of the most-cited papers such as their contribution, their citation number, and year and venue of their publication.

The rest of this paper is organized as follows. Section 2 focuses on the main problem-solving approaches for the MRC task. The review of the papers based on the basic phases of an MRC system is presented in Section 3. Section 4 provides an analysis of the types of inputs/outputs of MRC systems. The recent datasets and evaluation measures are reviewed in Sections 5 and 6, respectively. In Section 7, the MRC studies are categorized based on their contributions and novelties. The most-cited papers are investigated in Section 8. Section 9 provides future trends and opportunities. Finally, the paper is concluded in Section 10.

2. Problem-solving approaches

The approaches used for developing MRC systems can be grouped into three categories: rule-based methods, classical machine learning-based methods, and deep learning-based methods.

The traditional rule-based methods use rules handcrafted by linguistic experts. These methods suffer from the incompleteness of the rules. Moreover, this approach is domain-specific: for any new domain, a new set of rules must be handcrafted. As an example, Riloff and Thelen (2000) present a rule-based MRC system called Quarc, which reads a short story and answers the input question by extracting the most relevant sentences. Quarc uses a separate set of rules for each question type (WHO, WHAT, WHEN, WHERE, and WHY). In this system, several NLP tools are used for parsing, part-of-speech tagging, morphological analysis, entity recognition, and semantic class tagging. As another example, Akour et al. (2011) introduce QArabPro, a system for answering reading comprehension questions in Arabic. It is also built on a set of rules for each question type and uses multiple NLP components, including question classification, query reformulation, stemming, and root extraction.
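As a hedged toy illustration of this style (the rules below are invented for illustration and are not Quarc's actual rules), a rule-based sentence scorer might look like this:

```python
import re

def score_sentence(question: str, sentence: str) -> int:
    q_words = set(question.lower().split())
    s_words = set(sentence.lower().split())
    score = len(q_words & s_words)            # word-match rule shared by all types
    if question.lower().startswith("when") and re.search(r"\b\d{4}\b", sentence):
        score += 3                            # toy WHEN rule: reward sentences with a year
    if question.lower().startswith("who") and re.search(r"\b[A-Z][a-z]+\b", sentence):
        score += 2                            # toy WHO rule: reward capitalized names
    return score

def answer(question: str, story: str) -> str:
    # Return the sentence with the highest handcrafted-rule score.
    sentences = story.split(". ")
    return max(sentences, key=lambda s: score_sentence(question, s))
```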

The second approach is based on classical machine learning. These methods rely on a set of human-defined features and train a model to map input features to the output. Note that in classical machine learning-based methods, even though handcrafted rules are not necessary, feature engineering is a critical necessity.

For example, Ng, Teo and Kwan (2000) developed a machine learning-based MRC system and introduced several features to be extracted from a context sentence, like "the number of matching words/verb types between the question and the sentence," "the number of matching words/verb types between the question and the previous/next sentence," "co-reference information," and binary features like "sentence-contain-person," "sentence-contain-time," "sentence-contain-location," "sentence-is-title," and so on.
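A rough sketch of such feature extraction follows; the feature names paraphrase the description above, and the crude surface heuristics stand in for the proper NER and coreference tools a real system would use:

```python
def sentence_features(question, sentence, prev_sentence, next_sentence):
    # Classical feature engineering: turn a candidate sentence into a
    # human-designed feature vector for a classical learner.
    q = set(question.lower().split())
    return {
        "match_words":      len(q & set(sentence.lower().split())),
        "match_words_prev": len(q & set(prev_sentence.lower().split())),
        "match_words_next": len(q & set(next_sentence.lower().split())),
        # Crude stand-ins for NER-based binary features:
        "contains_person":  int(any(w.istitle() for w in sentence.split()[1:])),
        "contains_time":    int(any(w.isdigit() for w in sentence.split())),
    }
# These feature vectors would then be fed to a classical learner
# (e.g., a decision tree or logistic regression) to rank candidate sentences.
```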

The third approach uses deep learning methods to learn features from raw input data automatically. These methods require a large amount of training data to build high-accuracy models. Because of the growth of available data and computational power in recent years, deep learning methods have achieved state-of-the-art results in many tasks. In the MRC task, most recent research falls into this category. The two main deep learning architectures used by MRC researchers are the Recurrent Neural Network (RNN) and the Convolutional Neural Network (CNN).

RNNs are often used for modeling sequential data by iterating through the sequence elements and maintaining a state containing information about what has been seen so far. Two common types of RNNs are Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997) and Gated Recurrent Unit (GRU) (Cho et al. 2014), in uni-directional and bi-directional versions (Chen et al. 2016; Kobayashi et al. 2016; Chen et al. 2017; Dhingra et al. 2017; Seo et al. 2017; Clark and Gardner 2018; Ghaeini et al. 2018; Hoang, Wiseman and Rush 2018; Hu et al. 2018a; Liu et al. 2018a). In MRC systems, as in other NLP tasks, these architectures are commonly used in different parts of the pipeline, such as for representing questions and contexts (Chen et al. 2016; Kobayashi et al. 2016; Chen et al. 2017; Dhingra et al. 2017; Seo et al. 2017; Clark and Gardner 2018; Ghaeini et al. 2018; Hoang et al. 2018; Hu et al. 2018a; Liu et al. 2018a) or in higher levels of the MRC system such as the modeling layer (Choi et al. 2017b; Seo et al. 2017; Li, Li and Lv 2018). In recent years, the attention-based Transformer (Vaswani et al. 2017) has emerged as a powerful alternative to the RNN architecture. For more detailed information, refer to Section 3.
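A minimal PyTorch sketch of this common pattern, encoding pre-embedded question and context tokens with a shared Bi-LSTM (dimensions are arbitrary illustrative choices, not taken from any cited system):

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Encode a token-embedding sequence with a bi-directional LSTM."""
    def __init__(self, emb_dim=100, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):              # x: (batch, seq_len, emb_dim)
        out, _ = self.rnn(x)           # out: (batch, seq_len, 2 * hidden)
        return out

enc = BiLSTMEncoder()
question = torch.randn(2, 12, 100)     # toy pre-embedded question tokens
context = torch.randn(2, 300, 100)     # toy pre-embedded context tokens
q_enc, c_enc = enc(question), enc(context)
```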

CNN is a type of deep learning model widely used in computer vision applications. It utilizes layers of convolution filters applied to local regions of their inputs (LeCun et al. 1998). CNN models have subsequently been shown to be effective for NLP and have achieved excellent results in various NLP tasks (Kim 2014). In MRC systems, CNN is used in the embedding phase (especially for character embedding) (Seo et al. 2017; Indurthi et al. 2018) as well as in higher-level phases (introduced in Section 3) for modeling interactions between the question and passage, as in QANet (Yu et al. 2018). QANet uses CNN and self-attention blocks instead of the RNN, which results in faster answer span detection on the SQuAD dataset (Rajpurkar et al. 2016).

3. MRC phases

Most of the recent deep learning-based MRC systems have the following phases: embedding phase, reasoning phase, and prediction phase. Many of the reviewed papers focus on developing new structures for these phases, especially the reasoning phase.

3.1. Embedding phase

In this phase, input characters, words, or sentences are represented by real-valued dense vectors in a meaningful space. The goal of this phase is to provide question and context embeddings. Different levels of embedding are used in MRC systems. Character-level and word-level embeddings can capture the properties of words, subwords, and characters, while higher-level representations can capture syntactic and semantic information of the input text. Table 2 shows the statistics of various embedding methods used in the reviewed papers. Since no paper uses character embedding as its only embedding method, there is no character embedding column in this table. For a complete list of papers categorized based on their embedding methods, refer to Table A1.

Table 2. Statistics of different embedding methods used by reviewed papers.

3.1.1. Character embedding

Some papers use character embedding as part of their embedding phase. This type of embedding is useful for overcoming the problems of unknown and rare words (Dhingra et al. 2016; Seo et al. 2017). To generate the input representation, deep neural network models are commonly used. Inspired by Kim's work (Kim 2014), some papers have used CNN models to embed the input characters (Seo et al. 2017; Zhang et al. 2017; Gong and Bowman 2018; Kundu and Ng 2018a). Some other papers have used character-level information captured from the final state of an RNN model like LSTM (or Bi-LSTM) and GRU (or Bi-GRU) (Wang, Yuan and Trischler 2017a; Yang et al. 2017a; Du and Cardie 2018; Hu et al. 2018a; Wang et al. 2018b). As another approach, which uses both CNN and LSTM to embed input characters, LSTM-char CNN (Kim et al. 2016) is also used in the MRC literature (Prakash, Tripathy and Banu 2018). We classify these papers into two categories, CNN and RNN, so the sum of percentages is greater than 100% in Figure 2. This figure shows the percentage of different character embedding methods over different years. Other methods include skip-gram, n-grams, and more recent methods like ELMo (Peters et al. 2018). It is clear that the use of CNN has been consistently higher than the use of RNN for character embedding across different years.
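A minimal sketch of the CNN-based character encoder in the style of Kim (2014): character vectors of each word are convolved and max-pooled over time into a fixed-size word vector (all hyperparameters below are illustrative assumptions):

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Embed a word from its characters: embed, convolve, max-pool over time."""
    def __init__(self, n_chars=100, char_dim=16, out_dim=64, kernel=5):
        super().__init__()
        self.emb = nn.Embedding(n_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, out_dim, kernel_size=kernel, padding=2)

    def forward(self, char_ids):                  # (n_words, max_word_len)
        x = self.emb(char_ids).transpose(1, 2)    # (n_words, char_dim, max_word_len)
        x = torch.relu(self.conv(x))              # (n_words, out_dim, max_word_len)
        return x.max(dim=2).values                # max-over-time pool: (n_words, out_dim)

cnn = CharCNN()
word_char_ids = torch.randint(0, 100, (32, 10))  # 32 words, 10 character ids each
char_vectors = cnn(word_char_ids)                # (32, 64)
```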

Figure 2. The percentage of different character embedding methods over different years

3.1.2. Word embedding

Word embedding represents words or subwords in a numeric vector space and is performed by two main approaches: 1) non-contextual embedding and 2) contextual embedding.

Non-contextual word embedding

Non-contextual embeddings provide a single general representation for each word, regardless of its context. There are three main non-contextual approaches: 1) one-hot encoding, 2) learning word vectors jointly with the main task, and 3) using pre-trained word vectors (fixed or fine-tuned).

One-hot encoding is the most basic way to turn a token into a vector. One-hot vectors are binary, sparse, and very high dimensional, with the size of the vocabulary (the number of unique words in the corpus). To represent a word w, all vector elements are set to zero, except the one that identifies w. This approach has been less popular than the other approaches in recent papers (Cui et al. 2016; Liu and Perez 2017).
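A toy example of one-hot encoding over a four-word vocabulary:

```python
import numpy as np

# Each word maps to a sparse binary vector the size of the vocabulary,
# with a single 1 at the word's index.
vocab = {"the": 0, "machine": 1, "reads": 2, "text": 3}

def one_hot(word: str) -> np.ndarray:
    v = np.zeros(len(vocab), dtype=np.int8)
    v[vocab[word]] = 1
    return v

print(one_hot("machine"))   # [0 1 0 0]
```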

Another popular way to represent words is learned word vectors, which deliver dense real-valued representations. In the presence of a large amount of training data, it is advised to learn the word vectors from scratch jointly with the main task (Chollet 2017).

Some studies have shown that initializing word vectors with pre-trained values results in better accuracies than random initialization (Dhingra et al. 2017; Ren, Cheng and Su 2020; Wang et al. 2020a; Zhou, Luo and Wu 2020b). This approach is especially useful in low-data scenarios (Chollet 2017; Dhingra et al. 2017). GloVe (Pennington, Socher and Manning 2014) is the most common pre-trained non-contextual word representation used in the reviewed papers (Chen et al. 2016; Yin, Ebert and Schütze 2016; Chen et al. 2017; Liu et al. 2017; Wang et al. 2017a; Xiong, Zhong and Socher 2017; Gong and Bowman 2018; Wang et al. 2018b). Word2Vec (Mikolov et al. 2013) is another word embedding used in this task (Kobayashi et al. 2016; Chaturvedi, Pandit and Garain 2018; Šuster and Daelemans 2018). These pre-trained word vectors are either fine-tuned (Chen et al. 2016; Kobayashi et al. 2016; Zhang et al. 2017; Clark and Gardner 2018; Liu et al. 2018c; Šuster and Daelemans 2018) or kept fixed (Seo et al. 2017; Shen et al. 2017; Weissenborn, Wiese and Seiffe 2017; Gong and Bowman 2018). Fine-tuning the vectors of some keywords such as "what," "how," "which," and "many" can be crucial for QA systems, while most of the pre-trained word vectors can be kept fixed (Chen et al. 2017). Table 3 shows the statistics of these approaches in different years. Finally, it is worth noting that some papers use hand-designed word features such as named entity (NE) tags and part-of-speech (POS) tags along with word embeddings (Huang et al. 2018).

Table 3. Statistics of different word representation methods in the reviewed papers.

Contextual word embedding

Despite the relative success of non-contextual embeddings, they are static: all meanings of a word are represented with a fixed vector (Ethayarajh 2019). Different from static word embeddings, contextual embeddings move beyond word-level semantics and represent each word considering its context (surrounding words). To obtain the context-based representation of the words, two approaches can be adopted: 1) learning the word vectors jointly with the main task and 2) using pre-trained contextual word vectors (fixed or fine-tuned).

For learning the contextual word vectors, a sequence modeling method, usually an RNN, is used. For example, Chen et al. (2017) used a multi-layer Bi-LSTM model for this purpose. In the study by Yang, Kang and Seo (2020), forward and backward GRU hidden states are combined to generate contextual representations of query and context words. Bajgar, Kadlec and Kleindienst (2017) used different approaches for the query and context words, where the combination of forward and backward GRU hidden states is exploited for representing the context words, while the final hidden state of the GRU is used for the query. On the other hand, pre-trained contextualized embeddings such as ELMo (Peters et al. 2018), BERT (Devlin et al. 2018), and GPT (Radford et al. 2018) are deep neural language models trained on large unlabeled corpora. The ELMo method (Peters et al. 2018) obtains the contextualized embeddings with a 2-layer Bi-LSTM, while BERT (Devlin et al. 2018) and GPT (Radford et al. 2018) are bi-directional and uni-directional Transformer-based (Vaswani et al. 2017) language models, respectively. These embeddings are used either alongside other embeddings (Hu et al. 2018a; Hu et al. 2018b; Seo et al. 2018; Lee and Kim 2020; Ren et al. 2020) or alone (Bauer, Wang and Bansal 2018; Zheng et al. 2020).
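As a hedged sketch, contextual token embeddings can be obtained from a pre-trained BERT checkpoint through the Hugging Face transformers library; the checkpoint name below is one common public model, not necessarily the one used in the cited papers:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Question and context are packed into one sequence, as in BERT-style MRC.
batch = tokenizer("Who founded Virgin Airlines?",
                  "Virgin Airlines was founded by Richard Branson.",
                  return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state   # (1, seq_len, 768)
# Each token vector in `hidden` depends on its full context; the same word
# gets different vectors in different sentences, unlike GloVe or Word2Vec.
```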

Due to the success of contextual word embeddings in many NLP tasks, there is a clear trend toward using these embeddings in recent years (Bauer et al. 2018; Hu et al. 2018a; Hu et al. 2018b; Wang and Bansal 2018). This is evident in Table 3, where the use of fixed pre-trained and fine-tuned contextual embeddings has increased from 0% in 2016 to 23% and 36% in 2020, respectively. Note that some papers use multiple methods, so the sum of percentages in the tables may be greater than 100%.

3.1.3. Hybrid word-character embedding

The combination of word embedding and character embedding is used in some reviewed papers (Seo et al. 2017; Yang et al. 2017a; Zhang et al. 2017; Gong and Bowman 2018). Hybrid embedding tries to use the strengths of both word and character embeddings. A simple approach is to concatenate the word and character embeddings. As an example, Lee and Kim (2020) used GloVe as the word embedding and the output of a CNN model as the character embedding.

This approach suffers from a potential problem. Word embedding performs better for frequent words (subwords) but can be harmful for representing rare words (subwords); the reverse is true for character embedding (Yang et al. 2017a). To solve this problem, some researchers introduced a gating mechanism that regulates the flow of information. Yang et al. (2017a) used a fine-grained gating mechanism for the dynamic concatenation of word and character embeddings. This mechanism uses a gate vector, a linear transformation of word features (POS and NE), to control the flow of information from the word and character embeddings. Seo et al. (2017) used highway networks (Srivastava, Greff and Schmidhuber 2015) for embedding concatenation; these networks use an LSTM-inspired gating mechanism. According to Table 2, the use of hybrid embedding in the reviewed papers increased from 0% in 2016 to 54% in 2018. However, with the success of language model-based contextual embeddings, the direct combination of character and word embeddings has decreased thereafter.
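A simplified sketch of such a gated combination; for illustration, the gate here is computed from the word vector alone, which simplifies the fine-grained gating of Yang et al. (2017a):

```python
import torch
import torch.nn as nn

class GatedHybridEmbedding(nn.Module):
    """Mix word and character embeddings with a learned per-dimension gate."""
    def __init__(self, dim=100):
        super().__init__()
        self.gate = nn.Linear(dim, dim)   # gate computed from the word vector

    def forward(self, word_emb, char_emb):          # both: (batch, seq, dim)
        g = torch.sigmoid(self.gate(word_emb))      # per-dimension gate in (0, 1)
        return g * word_emb + (1 - g) * char_emb    # frequent words lean on word_emb

emb = GatedHybridEmbedding()
w, c = torch.randn(2, 20, 100), torch.randn(2, 20, 100)
hybrid = emb(w, c)                                  # (2, 20, 100)
```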

3.1.4. Sentence embedding

Sentence embedding is a high-level representation in which an entire sentence is encoded into a single vector. It is often used along with other embeddings (Yin et al. 2016). However, sentence embedding is not very popular in MRC systems, because the answer is often a part of a sentence, not a whole sentence.

3.2. Reasoning phase

The goal of this phase is to match the input query (question) with the input document (context). In other words, this phase determines the parts of the context that are relevant for answering the question by calculating the relevance between the question and context parts. Recently, the Phrase-Indexed Question Answering (PIQA) model (Seo et al. 2018) enforces complete independence between the document encoder and the question encoder and does not include any cross attention between question and document. In this model, each document is processed beforehand, and its phrase index vectors are generated. Then, at inference time, the answer is obtained by retrieving the indexed phrase vector nearest to the query vector.

The attention mechanism (Bahdanau, Cho and Bengio 2015), originally introduced for machine translation, is used in this phase. In recent years, with the advent of the attention-based Transformer architecture (Vaswani et al. 2017) as an alternative to common sequential structures like the RNN, new Transformer-based language models, such as BERT (Devlin et al. 2018) and XLNet (Yang et al. 2019b), have been introduced. They serve as the basis for new state-of-the-art results on the MRC task (Sharma and Roychowdhury 2019; Su et al. 2019; Yang et al. 2019a; Tu et al. 2020; Zhang et al. 2020a; Zhang et al. 2020b) by adding or modifying final layers and fine-tuning on the target task.

The attention mechanism used in MRC systems can be examined from three perspectives: direction, dimension, and number of steps. For the statistics, refer to Table 4.

Table 4. Statistics of different attention mechanisms used in the reasoning phase of MRC systems.

3.2.1. Direction

Some research only uses the context-to-query (C2Q) attention vector (Cui et al. 2016; Wang and Jiang 2017; Weissenborn et al. 2017; Huang et al. 2018), called the one-directional attention mechanism. It signifies which query words are relevant to each context word (Cui et al. 2017b; Seo et al. 2017).

In the bi-directional attention mechanism, query-to-context (Q2C) attention weights are also calculated (Cui et al. 2017b; Liu et al. 2017; Min, Seo and Hajishirzi 2017; Seo et al. 2017; Xiong et al. 2017; Clark and Gardner 2018) along with C2Q. They signify which context words have the closest similarity to one of the query words and are hence critical for answering the question (Cui et al. 2017b; Seo et al. 2017). In Transformer-based MRC models such as BERT-based ones, the question and context are processed as one sequence, so the attention mechanism can be considered bi-directional. As shown in Table 4, the ratio of bi-directional attention usage has increased in recent years.
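A simplified sketch of bi-directional attention over a shared similarity matrix, in the spirit of BiDAF (Seo et al. 2017); for brevity, a plain dot product replaces BiDAF's trilinear similarity function:

```python
import torch

def bidirectional_attention(C, Q):
    # C: (batch, c_len, d) context encodings; Q: (batch, q_len, d) query encodings
    S = torch.bmm(C, Q.transpose(1, 2))             # (batch, c_len, q_len) similarities
    c2q = torch.bmm(S.softmax(dim=2), Q)            # each context word attends to query words
    b = S.max(dim=2).values.softmax(dim=1)          # which context words matter most overall
    q2c = torch.bmm(b.unsqueeze(1), C)              # (batch, 1, d) summary of critical context
    return c2q, q2c.expand_as(C)                    # tile Q2C summary across positions

C, Q = torch.randn(2, 300, 128), torch.randn(2, 12, 128)
c2q, q2c = bidirectional_attention(C, Q)            # both: (2, 300, 128)
```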

3.2.2. Dimension

There are two attention dimensions in the reviewed papers: one-dimensional and two-dimensional attention. In one-dimensional attention, the whole question is represented by one embedding vector, usually the last hidden state of the contextual embedding (Chen et al. 2016; Kadlec et al. 2016b; Dhingra et al. 2017; Shen et al. 2017; Weissenborn et al. 2017). It cannot pay more attention to important question words. On the contrary, in two-dimensional attention, every word in the query has its own embedding vector (Chen et al. 2017; Cui et al. 2017b; Seo et al. 2017; Yang et al. 2017b; Clark and Gardner 2018).

According to Table 4, 86% of all reviewed papers use two-dimensional attention. Also, the use of two-dimensional attention has increased over recent years.

3.2.3. Number of steps

According to the number of reasoning steps, three types of MRC systems can be seen: single-step reasoning, multi-step reasoning with a fixed number of steps, and dynamic multi-step reasoning.

In single-step reasoning, question and passage matching is done in a single step. However, the obtained representation can be processed through multiple layers to extract or generate the answer (Chen et al. 2016; Seo et al. 2017; Clark and Gardner 2018). In multi-step reasoning, question and passage matching is done in multiple steps such that the question-aware context representation is updated by integrating the intermediate information at each step. The number of steps can be static (Yang et al. 2017b; Hu et al. 2018a) or dynamic (Dhingra et al. 2017; Shen et al. 2017; Song et al. 2018). Dynamic multi-step reasoning uses a termination module to decide whether the inferred information is sufficient for answering or more reasoning steps are still needed. Therefore, the number of reasoning steps in this model depends on the complexity of the passage and question. Naturally, in multi-step reasoning, the model complexity increases with the number of reasoning steps. In Transformer-based MRC models, the number of steps is fixed and depends on the number of layers.
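A toy sketch of fixed multi-step reasoning, in which a question summary vector is refined over a fixed number of attention steps against the context; real architectures use richer matching and update functions than this simplification:

```python
import torch
import torch.nn as nn

class MultiStepReader(nn.Module):
    """Refine a question state over a fixed number of reasoning steps."""
    def __init__(self, dim=128, steps=3):
        super().__init__()
        self.steps = steps
        self.cell = nn.GRUCell(dim, dim)

    def forward(self, C, q):          # C: (batch, c_len, dim); q: (batch, dim)
        for _ in range(self.steps):
            attn = torch.bmm(C, q.unsqueeze(2)).squeeze(2).softmax(dim=1)
            read = torch.bmm(attn.unsqueeze(1), C).squeeze(1)   # attended evidence
            q = self.cell(read, q)    # integrate intermediate information
        return q

reader = MultiStepReader()
state = reader(torch.randn(2, 300, 128), torch.randn(2, 128))   # (2, 128)
```

A dynamic variant would replace the fixed loop with a learned termination module that decides after each step whether to stop.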

According to Table 4, about 61% of the reviewed papers use single-step reasoning, but the popularity of multi-step reasoning has increased over recent years. For a detailed list of the reasoning methods used in different papers, refer to Table A2.

3.3. Prediction phase

The final output of an MRC system is specified in the prediction phase. The output can be extracted from the context or generated according to the context. In generation mode, a decoder module generates the answer words one by one (Hewlett et al. 2017). In some cases, multiple choices are presented to the system, and it must select the best answer according to the question and passage(s) (Greco et al. 2016). These multi-choice systems can be seen in both extractive and generative models, depending on whether the answer choices occur in the passage or not.

The extraction mode is implemented in different forms. If the answer is a span of the context, many studies predict the start and end indices of the span by estimating probability distributions over the positions in the entire context (Chen et al. 2017; Wang and Jiang 2017; Xiong et al. 2017; Yang et al. 2017a; Yang et al. 2017b; Clark and Gardner 2018).
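A minimal sketch of this common span-prediction head: two linear layers turn the question-aware context representation into start and end distributions over context positions (dimensions are illustrative):

```python
import torch
import torch.nn as nn

class SpanHead(nn.Module):
    """Score each context position as the answer span's start or end."""
    def __init__(self, dim=128):
        super().__init__()
        self.start = nn.Linear(dim, 1)
        self.end = nn.Linear(dim, 1)

    def forward(self, H):                            # H: (batch, c_len, dim)
        p_start = self.start(H).squeeze(-1).softmax(dim=1)
        p_end = self.end(H).squeeze(-1).softmax(dim=1)
        return p_start, p_end

head = SpanHead()
p_s, p_e = head(torch.randn(2, 300, 128))
# At inference, pick the pair (i, j) with i <= j maximizing p_s[i] * p_e[j],
# often under a maximum span length constraint.
```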

In some studies, the candidate chunks (answers) are extracted first and then ranked by a trained model. These chunks can be sentences (Duan et al. 2017; Min et al. 2017) or entities (Sachan and Xing 2018). In the study by Ren et al. (2020), after extracting the candidate chunks from various contexts, a linear transformation is used along with a sigmoid function to compute the score of the answer candidates.

Table 5 shows the statistics of these categories in the reviewed papers. It is clear that most papers (65%) extract the answer span from the passage(s). It seems that the development of rich span-based datasets like SQuAD (Rajpurkar et al. 2016) is the reason for this popularity. Also, the generation-based papers have increased from 10% in 2016 to 55% in 2020 (the sum of the answer generation and candidate ranking columns in the generation mode). For more details, refer to Table A3.

Table 5. Statistics of different prediction phase categories in the reviewed papers.

4. Input/Output-based analysis

4.1. MRC systems input

The inputs to an MRC system are question and passage texts. The passage is often referred to as context. Moreover, in some systems, the candidate answer list is part of the input.

4.1.1. Question

Input questions can be grouped into three categories: factoid questions, non-factoid questions, and yes/no questions.

Factoid questions can be answered with simple facts expressed in short text answers like a personal name, temporal expression, or location (Jurafsky and Martin 2019). For example, the answer to the question "Who founded Virgin Airlines?" is a personal name, while the questions "What is the average age of the onset of autism?" and "Where is Apple Computer based?" have a number and a location as answers, respectively. In other words, the answers to factoid questions are one or more entities or a short expression. Because of their simplicity compared to other types, most research in the MRC literature has focused on this type of question (Chen et al. 2017; Seo et al. 2017; Clark and Gardner 2018; Huang et al. 2018).

Non-factoid questions, on the other hand, are open-ended questions that usually require long and complex passage-level answers, such as descriptions, opinions, and explanations (Hashemi et al. 2020). For example, "Why does Queen Elizabeth sign her name Elizabeth?" "What is the difference between MRC and QA?" and "What do you think about MRC?" are instances of non-factoid questions. Among our reviewed papers, 32% of the works focus on non-factoid questions. Because of their difficulty, systems dealing with non-factoid questions often have lower accuracies (Wang et al. 2016; Tan et al. 2018a; Wang et al. 2018a).

Yes/no questions, as their name indicates, have yes or no as answers. According to our investigation, the papers that deal with this type of question consider other types of questions as well (Li et al. 2018; Liu et al. 2018b; Zhang et al. 2018).

Refer to Table 6 for the statistics of input/output types in MRC systems. It is clear from the table that the popularity of non-factoid and yes/no questions has increased. Note that since some papers focus on multiple question types, the sum of percentages is greater than 100% in this table.

4.1.2. Context

The input context can be a single passage or multiple passages. Obviously, as the context gets longer, finding the answer becomes harder and more time-consuming. Until now, most papers have focused on a single passage (Seo et al. 2017; Wang et al. 2017b; Yang et al. 2017a; Yang et al. 2017b; Zhang et al. 2017), but multi-passage MRC systems are becoming more popular (Xie and Xing 2017; Huang et al. 2018; Liu et al. 2018b; Wang et al. 2018b). According to Table 6, only 4% of the reviewed papers focused on multiple passages in 2016, but this ratio has increased in recent years, reaching 52% and 32% in 2019 and 2020, respectively.

4.2. MRC systems output

The output of MRC systems can be classified into two categories: abstractive (generative) output and extractive (selective) output.

In the abstractive mode, the answer is not necessarily an exact span of the context and is generated according to the question and context. This output type is especially suitable for non-factoid questions (Greco et al. 2016; Choi et al. 2017b; Tan et al. 2018a).

In the extractive mode, the answer is a specific span of the context (Liu et al. 2017; Min et al. 2017; Seo et al. 2017; Yang et al. 2017b; Liu et al. 2018c). This output type is appropriate for factoid questions; nevertheless, the answer to a factoid question may be generative, and the answer to a non-factoid question may be extractive. For example, the answer to a non-factoid question may be a whole sentence extracted from the context.

There has generally been more focus on extractive MRC systems, but according to Table 6, the popularity of abstractive MRC systems has increased over recent years. From another point of view, MRC outputs can be categorized as multiple-choice style, cloze style, and detail style:

  • In the multiple-choice style mode, the answer must be selected from a predefined set $A$ containing $k$ candidate answers:

    $$A = \{a_1, \ldots, a_k\},$$
    where $a_j$ can be a word, a phrase, or a sentence of length $l_j$:
    $$a_j = (a_{j,1}, a_{j,2}, \ldots, a_{j,l_j})$$
  • In the cloze style mode, the question includes a blank that must be filled as an answer according to the context. In the available cloze style datasets, the answer is an entity from the context.

  • In the detail style mode, there is no candidate or blank, so the answer must be extracted or generated according to the context. In the extractive mode, the answer must be a definite span of the context, so it can be shown as $(a_{start}, a_{end})$, where $a_{start}$ and $a_{end}$ are, respectively, the start and end indices of the answer in the context. In the generative mode, on the other hand, the answer can be generated from any custom vocabulary $V$ and is not limited to the context words.

As shown in Table 6, most of the reviewed papers (72%) have focused on the detail style mode. Also, about 100%, 70%, and 82% of the reviewed papers have focused on factoid questions, single-passage contexts, and extractive answers, respectively, due to their lower complexity and the existence of rich datasets. For a more detailed categorization of papers based on their inputs/outputs, refer to Table A4.

Table 6. Statistics of input/output types in MRC systems.

5. MRC datasets

Rich datasets are the first prerequisite for accurate machine learning models. In particular, deep neural network models require high volumes of training data to achieve good results. For this reason, in recent years, many researchers have focused on collecting big datasets. For example, the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al. 2016), a popular MRC dataset used in many studies, includes over 100,000 training samples.

MRC datasets can be categorized according to their volume, domain, question type, answer type, context type, data collection method, and language.

In terms of domain, MRC datasets can be classified into two categories: open domain and closed domain. Open domain datasets contain diverse subjects, while closed domain datasets focus on specific areas such as the medical domain. For example, the SQuAD dataset (Rajpurkar et al. 2016), which contains Wikipedia articles, is an open domain dataset, while emrQA (Pampari et al. 2018), BIOMRC (Pappas et al. 2020), and LUPARCQ (Horbach et al. 2020) are closed domain datasets.

There are two data collection approaches for MRC datasets: the automatic approach and the crowdsourcing approach. The former generates questions/answers without direct human intervention. For instance, datasets that contain cloze style questions, such as the Children's Book Test dataset (Hill et al. 2016), are generated by removing important entities from the text. Also, in some datasets, questions are automatically extracted from search engine user logs (Nguyen et al. 2016) or real reading comprehension tests (Lai et al. 2017).

On the other hand, in the crowdsourcing approach, humans generate questions and answers or select related paragraphs. Of course, a dataset can be built by a combination of these two approaches. For instance, in MS MARCO (Nguyen et al. 2016), questions were generated automatically, while they were answered and evaluated through crowdsourcing.

Table 7 shows a detailed list of the datasets proposed from 2016 to 2020 in chronological order. In this table, the datasets with a link address are publicly available. Finally, Figure 3 shows the progress made on two representative datasets, SQuAD1.1 and RACE, illustrating the advancements in the field. Note that only the articles available in our reviewed papers are reported in this figure. According to this figure, the state-of-the-art performance on the SQuAD1.1 dataset increased from about 0.7 in 2017 to a superhuman level of 0.95 in 2019. For the RACE dataset, despite the progress in accuracy from around 0.4 in 2017 to around 0.8 in 2020, performance is still below the human level, which shows that this dataset is more challenging.

Table 7. MRC datasets proposed from 2016 to 2020. (A: Answer, P: Passage, Q: Question)

6. MRC evaluation measures

Based on the system output type, different evaluation metrics are introduced. We classify these measures into two categories: extractive metrics and generative metrics.

Figure 3. The progress made on two datasets: SQuAD1.1 (top) and RACE (bottom). The data points are taken from https://rajpurkar.github.io/SQuAD-explorer and http://www.qizhexie.com/data/RACE_leaderboard.html, respectively. Only the articles available in our reviewed papers are reported.

6.1. Extractive metrics

These metrics are used for the extractive outputs. Table 8 shows the statistics of these measures in the reviewed papers.

  • F1 score. The harmonic mean of precision and recall is a common extractive metric for evaluating MRC systems. It treats the system output and the ground truth answer as bags of tokens (words). Precision is calculated as the number of correctly predicted tokens divided by the number of all predicted tokens. Recall is the number of correctly predicted tokens divided by the number of ground truth tokens. The final F1 score is then obtained by averaging over all question-answer pairs (see the sketch after this list).

  • Exact Match (EM). This is the percentage of answers that exactly match the correct answers. If there are multiple gold answers for a question in a dataset, a match with at least one of them counts as an exact match. Some QA systems, such as multiple-choice QA systems (Zhang et al. 2020c) or sentence selection QA systems (Min et al. 2017), refer to this measure as accuracy (ACC) instead of EM.

  • Mean Average Precision (MAP). This measure is used when the system returns several answers along with their ratings. The MAP for a set of question-answer pairs is the mean of the Average Precision (AveP) scores for each one. AveP is an evaluation measure used in information retrieval systems that evaluates a ranked list of documents in response to a given query. AveP for a single query is calculated by taking the mean of the precision scores obtained after each relevant document is retrieved, with relevant documents that are not retrieved receiving a precision score of zero (Turpin and Scholer 2006). In the MRC literature, the ranked list of answers for a given question is evaluated.

  • Mean Reciprocal Rank (MRR). This is a common evaluation metric for factoid QA systems, introduced in the TREC QA track in 1999. According to the definition presented in the "Evaluation of Factoid Answers" section of the "Speech and Language Processing" book (Jurafsky and Martin 2019), MRR evaluates a ranked list of answers based on the inverse of the rank of the correct answer. For example, if the rank of the correct answer in the output list of a system is 4, the reciprocal rank score for that question is 1/4. This measure is then averaged over all questions in the test set.

  • Precision@K. This measure is also borrowed from the information retrieval literature. It is the number of correct answers among the first k returned answers, without considering the positions of those correct answers (Manning, Raghavan and Schütze 2008).

  • Hit@K or Top-K. Hit@K, which is equivalent to Top-K accuracy, counts the samples whose first k returned answers include the correct answer.
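As a concrete reference, the following sketch computes SQuAD-style token-level F1 and Exact Match for a single prediction; the answer normalization used by the official evaluation script (e.g., punctuation and article stripping) is simplified here:

```python
from collections import Counter

def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between predicted and gold answers (simplified)."""
    pred, gold = prediction.lower().split(), ground_truth.lower().split()
    common = Counter(pred) & Counter(gold)   # multiset intersection of tokens
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred)
    recall = num_same / len(gold)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction: str, ground_truth: str) -> bool:
    return prediction.lower().strip() == ground_truth.lower().strip()

print(f1_score("in the 1990s", "the 1990s"))   # partial credit: 0.8
print(exact_match("1856", "1856"))             # True
```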

Table 8. Statistics of evaluation measures used in reviewed papers.

6.2. Generative metrics

The metrics used for evaluating the performance of generative MRC systems are the same metrics used for machine translation and summarization evaluation. Table 8 shows the statistics of these measures in the reviewed papers.

  • Recall-Oriented Understudy for Gisting Evaluation (ROUGE). This measure compares a system-generated answer with the human-generated one (Lin 2004). It is defined as the recall of the system based on n-grams, that is, the number of correctly generated n-grams divided by the total number of n-grams in the human-generated answer (see the sketch after this list).

  • BiLingual Evaluation Understudy (BLEU). This metric was first introduced for evaluating the output of the machine translation task. It is defined as the precision of the system based on n-grams, that is, the number of correctly generated n-grams divided by the total number of n-grams in the system-generated answer (Papineni et al. 2002).

  • Metric for Evaluation of Translation with Explicit Ordering (METEOR). This measure is designed to fix some weaknesses of the popular BLEU measure. METEOR is based on an alignment between the system output and the reference output. It introduces a penalty for adjacent tokens that cannot be mapped between the reference and the system output. For this, unigrams are grouped into longer chunks if they are adjacent in both the reference and the system output. The more adjacent unigrams, the fewer the chunks and the smaller the penalty (Banerjee and Lavie 2005).

  • Consensus-based Image Description Evaluation (CIDEr). This measure was initially introduced for evaluating the image description generation task (Vedantam, Lawrence Zitnick and Parikh 2015). It is based on n-gram matching of the system output and the reference output in their stem or root forms. According to this measure, n-grams that are not in the reference output should not appear in the system output. Also, n-grams that are common across the dataset are less informative and receive lower weights.
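A minimal sketch of the n-gram core of these measures: ROUGE-n as n-gram recall and BLEU-n as n-gram precision for a single answer pair (smoothing, multiple references, and BLEU's brevity penalty are omitted):

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(system: str, reference: str, n: int = 2) -> float:
    sys_ng, ref_ng = ngrams(system.split(), n), ngrams(reference.split(), n)
    overlap = sum((sys_ng & ref_ng).values())
    return overlap / max(sum(ref_ng.values()), 1)   # recall: divide by reference n-grams

def bleu_n(system: str, reference: str, n: int = 2) -> float:
    sys_ng, ref_ng = ngrams(system.split(), n), ngrams(reference.split(), n)
    overlap = sum((sys_ng & ref_ng).values())
    return overlap / max(sum(sys_ng.values()), 1)   # precision: divide by system n-grams

print(rouge_n("the cat sat on the mat", "the cat lay on the mat"))   # 0.6
```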

Figure 4 shows the ratio of extractive/generative measures used in the reviewed papers. According to this figure, the generative measures have been more popular in recent years than in 2016 and 2017. The obvious reason for this is the trend toward developing abstractive MRC systems. For more details, refer to Table A5.

7. Research contribution

The contributions of MRC research can be grouped into four categories: developing new model structures, creating new datasets, combining MRC with other tasks, and introducing new evaluation measures. Table 9 shows the statistics of these categories. Note that some studies have more than one contribution type, so the sum of some rows is greater than 100%. For example, Ma, Jurczyk and Choi (2018) introduced a new dataset from the "Friends" sitcom transcripts and developed a new model architecture as well. For more details, refer to Table A6.

Table 9. Statistics of different research contributions to MRC task in the reviewed papers.

Figure 4. Ratio of reviewed papers (%) for extractive/generative evaluation metrics

7.1. Developing new model structures

Many MRC papers have focused on developing new model structures to address the weaknesses of previous models. Most of them developed new internal structures (Cui et al. 2016; Kadlec et al. 2016b; Kobayashi et al. 2016; Cui et al. 2017b; Dhingra et al. 2017; Seo et al. 2017; Shen et al. 2017; Xiong et al. 2017; Hu et al. 2018a; Huang et al. 2018). Some others changed the system inputs. For example, in the study by Chen et al. (2017), in addition to word vectors, linguistic features such as NE and POS vectors were also used as inputs to the model. Also, some papers introduced a new way of feeding the input into the system. For example, Hewlett et al. (2017) proposed breaking the context into overlapping windows and entering each window as an input to the system.

7.2. Creating new datasets

One of the main reasons for the advancement of MRC research in recent years is the introduction of rich datasets. Many studies have focused on creating new datasets with new features (Nguyen et al. 2016; Rajpurkar et al. 2016; Joshi et al. 2017; Lai et al. 2017; Trischler et al. 2017; He et al. 2018; Šuster and Daelemans 2018). The main trend is to develop multi-document datasets, abstractive style outputs, and more complex questions that require more advanced reasoning. Also, some papers focus on customizing available datasets instead of creating new ones. For example, Horbach et al. (2020) proposed turning existing datasets, such as SQuAD and NewsQA, into interactive datasets. Some other papers added annotations to existing datasets to provide interpretable clues for investigating the models' behavior as well as to prevent models from exploiting biases and annotation artifacts. For example, SQuAD 2.0 (Rajpurkar et al. 2018) was created by adding unanswerable questions to SQuAD, and SQuAD2-CR (Lee et al. 2020) was developed by adding causal and relational annotations to the unanswerable questions of SQuAD 2.0. Similarly, $R^4C$ (Inoue et al. 2020) was created by adding derivations to questions in HotpotQA.

7.3. Combining with other tasks

Simultaneous learning of multiple tasks (multi-task learning) (Collobert and Weston Reference Collobert and Weston2008) and exploiting the knowledge learned in one task for another task (transfer learning) (Ruder Reference Ruder2019) have been promising directions for obtaining better results, especially in data-poor settings. As an example, Wang et al. (Reference Wang, Yuan and Trischler2017a) trained their MRC model jointly with a question generation task and achieved better results; a toy sketch of this kind of joint training follows this paragraph. Besides these approaches, some papers exploit the solutions of other tasks as sub-modules in their MRC systems. As an example, Yin et al. (Reference Yin, Ebert and Schütze2016) used a question classifier and a natural language inference (NLI) system as two sub-modules in their MRC system.
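The following toy PyTorch sketch illustrates the hard-parameter-sharing pattern behind such multi-task systems: a single shared encoder feeds both a span-extraction head (the MRC task) and an auxiliary classification head (e.g., question classification), and the task losses are summed with a weighting factor. It is a minimal illustration of the general pattern, not the architecture of any cited paper; all dimensions, heads, and loss weights are arbitrary.

```python
import torch
import torch.nn as nn

class SharedEncoderMTL(nn.Module):
    """Hard parameter sharing: one encoder, two task-specific heads."""

    def __init__(self, vocab=10000, embed=128, hidden=128, aux_classes=6):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed)
        self.encoder = nn.LSTM(embed, hidden, batch_first=True,
                               bidirectional=True)          # shared across tasks
        self.span_head = nn.Linear(2 * hidden, 2)           # start/end logits per token
        self.aux_head = nn.Linear(2 * hidden, aux_classes)  # auxiliary task head

    def forward(self, token_ids):
        states, _ = self.encoder(self.embed(token_ids))     # (batch, seq, 2*hidden)
        start_logits, end_logits = self.span_head(states).split(1, dim=-1)
        aux_logits = self.aux_head(states.mean(dim=1))      # pooled representation
        return start_logits.squeeze(-1), end_logits.squeeze(-1), aux_logits

model = SharedEncoderMTL()
tokens = torch.randint(0, 10000, (2, 50))                   # a fake 2-example batch
start, end, aux = model(tokens)
gold_start, gold_end = torch.tensor([3, 7]), torch.tensor([5, 9])
gold_aux = torch.tensor([0, 2])
loss = (nn.functional.cross_entropy(start, gold_start)
        + nn.functional.cross_entropy(end, gold_end)
        + 0.5 * nn.functional.cross_entropy(aux, gold_aux))  # weighted joint loss
loss.backward()  # gradients flow into the shared encoder from both tasks
```

Because both heads back-propagate into the same encoder, the auxiliary signal acts as a regularizer on the representations used by the MRC head.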

7.4. Introducing new evaluation measures

Reliable assessment of MRC systems is still a challenging topic. While some systems surpass human performance on specific datasets such as SQuAD according to the current measures (Rajpurkar et al. Reference Rajpurkar, Zhang, Lopyrev and Liang2016), further investigation shows that these systems fail to achieve a thorough and true understanding of human language (Jia and Liang Reference Jia and Liang2017; Wang and Bansal Reference Wang and Bansal2018). In these papers, the passage is edited so as to mislead the model, and such adversarial attacks can themselves be seen as a measure of the true comprehension ability of systems. Also, some papers have evaluated the comprehension and reasoning capabilities required for solving the MRC problem in the available datasets (Chen et al. Reference Chen, Bolton and Manning2016; Sugawara et al. Reference Sugawara, Inui, Sekine and Aizawa2018).
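For reference, the extractive measures that these critiques target are simple to state. The following is a minimal sketch of SQuAD-style Exact Match (EM) and token-level F1; the official evaluation script additionally strips articles and punctuation during normalization, which is omitted here.

```python
from collections import Counter

def normalize(text):
    # Simplified normalization; the official SQuAD script also removes
    # articles ("a", "an", "the") and punctuation before comparing.
    return text.lower().split()

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    pred, ref = normalize(prediction), normalize(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())  # shared tokens
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Eiffel Tower", "Eiffel Tower"))         # 0.0
print(round(f1_score("the Eiffel Tower", "Eiffel Tower"), 3))  # 0.8
```

The adversarial studies above show that high scores under such surface-overlap measures do not necessarily reflect genuine comprehension.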

8. Hot MRC papers

Table 10 shows the top papers of each year from 2016 to 2020, based on the number of citations in the Google Scholar service until September 2020. Ten papers have been selected for each year except 2020, for which only five papers have been chosen because, at the time of writing, the papers of 2020 had not yet accumulated enough citations. According to this table, hot papers often introduce a new successful model structure or a new dataset.

Table 10. Hot papers based on the number of citations in the Google Scholar service until September 2020.

9. Future trends and opportunities

The MRC task has witnessed great progress in recent years. In particular, as in other NLP tasks, fine-tuning pre-trained language models such as BERT (Devlin et al. Reference Devlin, Chang, Lee and Toutanova2018) and XLNet (Yang et al. Reference Yang, Dai, Yang, Carbonell, Salakhutdinov and Le2019b) on the target task has achieved impressive success in MRC, such that many state-of-the-art systems use these language models. However, they suffer from some shortcomings, which keep them far from real reading comprehension. In the following, we list some of these challenges and new trends in the MRC field:

  - Out-of-domain distributions: models fine-tuned on one dataset often degrade considerably on contexts and questions drawn from other domains.

  - Multi-document MRC: retrieving and aggregating evidence scattered across several documents remains difficult.

  - Numerical reasoning: questions that require counting, arithmetic, or comparison are still challenging.

  - No-answer questions: reliably detecting that the given context does not contain the answer.

  - Non-factoid questions: generating long, abstractive answers for why- and how-type questions.

  - Low-resource languages: most datasets and pre-trained models target English, leaving many languages underexplored.
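To ground the fine-tuning paradigm discussed above, the snippet below runs extractive QA with a pre-trained Transformer through the Hugging Face transformers library. This is a minimal sketch: the checkpoint name is an illustrative, publicly available model already fine-tuned on SQuAD, and a real MRC system would fine-tune on its own target dataset and constrain the predicted end position to follow the start position.

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Illustrative checkpoint: DistilBERT fine-tuned on SQuAD-style data.
name = "distilbert-base-uncased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name)

question = "What is the goal of MRC systems?"
context = ("Machine reading comprehension systems read a given context "
           "and answer questions about it.")
inputs = tokenizer(question, context, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# Greedy span selection: most probable start and end token positions.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
answer = tokenizer.decode(inputs["input_ids"][0][start:end + 1])
print(answer)
```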

10. Conclusion

MRC, as a hot research topic in NLP, focuses on reading document(s) and answering the corresponding questions. The main goal of an MRC system is to gain a comprehensive understanding of text documents in order to reason about and respond to related questions. In this paper, we presented an overview of various aspects of recent MRC studies, including approaches, internal architectures, input/output types, research contributions, and evaluation measures. We reviewed 241 papers published during 2016–2020 to investigate recent studies and identify new trends.

Based on the question type, MRC papers are categorized into factoid, non-factoid, and yes/no questions. In addition, the input context is categorized as a single passage or multiple passages. According to our statistics, a trend toward non-factoid questions and multiple passages has been evident in recent years. The output types are classified into extractive and abstractive outputs. From another point of view, the output types are categorized into multiple-choice, cloze, and detail styles. The statistics showed that although extractive outputs have been dominant, abstractive outputs have been gaining popularity in recent years.

Furthermore, we reviewed the developed datasets along with their features, such as data volume, domain, question type, answer type, context type, collection method, and data language. The number of developed datasets has increased in recent years, and they are in general more challenging than earlier datasets. Regarding research contributions, some papers develop new model structures, some introduce new datasets, some combine MRC with other tasks, and others introduce new evaluation measures. The majority of papers developed novel model structures or introduced new datasets. Moreover, we presented the most-cited papers, which indicate the most popular datasets and models in the MRC literature. Finally, we discussed possible future trends and important challenges for existing models, including issues related to out-of-domain distributions, multi-document MRC, numerical reasoning, no-answer questions, non-factoid questions, and low-resource languages.

Appendix

Table A1. Reviewed papers categorized based on their embedding phase

Table A2. Reviewed papers categorized based on their reasoning phase

Table A3. Reviewed papers categorized based on their prediction phase

Table A4. Reviewed papers categorized based on their input/output

Table A5. Reviewed papers categorized based on their evaluation metric

Table A6. Reviewed papers categorized based on their novelties

References

Aghaebrahimian, A. (2018). Linguistically-Based Deep Unstructured Question Answering. Proceedings of the 22nd Conference on Computational Natural Language Learning, pp. 433–43.
Akour, M., Abufardeh, S., Magel, K. and Al-Radaideh, Q. (2011). QArabPro: A Rule Based Question Answering System for Reading Comprehension Tests in Arabic. American Journal of Applied Sciences 8, 652–661.
Amirkhani, H., Jafari, M.A., Amirak, A., Pourjafari, Z., Jahromi, S.F. and Kouhkan, Z. (2020). FarsTail: A Persian Natural Language Inference Dataset. arXiv preprint arXiv:2009.08820.
Andor, D., He, L., Lee, K. and Pitler, E. (2019). Giving BERT a Calculator: Finding Operations and Arguments with Reading Comprehension. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5947–52. Hong Kong, China.
Angelidis, S., Frermann, L., Marcheggiani, D., Blanco, R. and Màrquez, L. (2019). Book QA: Stories of Challenges and Opportunities. Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pp. 78–85. Hong Kong, China.
Anuranjana, K., Rao, V. and Mamidi, R. (2019). HindiRC: A Dataset for Reading Comprehension in Hindi. 20th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing), La Rochelle, France.
Arivuchelvan, K.M. and Lakahmi, K. (2017). Reading Comprehension System-A Review. Indian Journal of Scientific Research (IJSR) 14, 83–90.
Asai, A. and Hajishirzi, H. (2020). Logic-Guided Data Augmentation and Regularization for Consistent Question Answering. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5642–50.
Back, S., Chinthakindi, S.C., Kedia, A., Lee, H. and Choo, J. (2020). NeurQuRI: Neural Question Requirement Inspector for Answerability Prediction in Machine Reading Comprehension. International Conference on Learning Representations.
Back, S., Yu, S., Indurthi, S.R., Kim, J. and Choo, J. (2018). MemoReader: Large-Scale Reading Comprehension through Neural Memory Controller. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2131–40.
Bahdanau, D., Cho, K. and Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. Proceedings of the International Conference on Learning Representations (ICLR).
Bajgar, O., Kadlec, R. and Kleindienst, J. (2017). Embracing data abundance: BookTest Dataset for Reading Comprehension. International Conference on Learning Representations (ICLR) Workshop.
Banerjee, S. and Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72.
Bao, H., Dong, L., Wei, F., Wang, W., Yang, N., Cui, L., Piao, S. and Zhou, M. (2019). Inspecting Unification of Encoding and Matching with Transformer: A Case Study of Machine Reading Comprehension. Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pp. 14–18.
Bauer, L., Wang, Y. and Bansal, M. (2018). Commonsense for Generative Multi-Hop Question Answering Tasks. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4220–30.
Berant, J., Srikumar, V., Chen, P.-C., Vander Linden, A., Harding, B., Huang, B., Clark, P. and Manning, C.D. (2014). Modeling Biological Processes for Reading Comprehension. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1499–510.
Berzak, Y., Malmaud, J. and Levy, R. (2020). STARC: Structured Annotations for Reading Comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5726–35.
Bi, B., Wu, C., Yan, M., Wang, W., Xia, J. and Li, C. (2019). Incorporating External Knowledge into Machine Reading for Generative Question Answering. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2521–30. Hong Kong, China.
Bouziane, A., Bouchiha, D., Doumi, N. and Malki, M. (2015). Question answering systems: survey and trends. Procedia Computer Science 73, 366–375.
Cao, Y., Fang, M., Yu, B. and Zhou, J.T. (2020). Unsupervised Domain Adaptation on Reading Comprehension. AAAI, pp. 7480–87.
Charlet, D., Damnati, G., Béchet, F. and Heinecke, J. (2020). Cross-lingual and cross-domain evaluation of Machine Reading Comprehension with Squad and CALOR-Quest corpora. Proceedings of The 12th Language Resources and Evaluation Conference, pp. 5491–97.
Chaturvedi, A., Pandit, O. and Garain, U. (2018). CNN for Text-Based Multiple Choice Question Answering. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 272–77.
Chen, A., Stanovsky, G., Singh, S. and Gardner, M. (2019a). Evaluating question answering evaluation. Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pp. 119–24.
Chen, D. (2018). Neural Reading Comprehension and Beyond. PhD thesis, Stanford University.
Chen, D., Bolton, J. and Manning, C.D. (2016). A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 2358–67.
Chen, D., Fisch, A., Weston, J. and Bordes, A. (2017). Reading Wikipedia to answer open-domain questions. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1870–79.
Chen, X., Liang, C., Yu, A.W., Zhou, D., Song, D. and Le, Q.V. (2020). Neural symbolic reader: Scalable integration of distributed and symbolic representations for reading comprehension. International Conference on Learning Representations (ICLR 2020).
Chen, Z., Cui, Y., Ma, W., Wang, S. and Hu, G. (2019b). Convolutional spatial attention model for reading comprehension with multiple-choice questions. Proceedings of the AAAI Conference on Artificial Intelligence, pp. 6276–83.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–34. Doha, Qatar.
Choi, E., Hewlett, D., Lacoste, A., Polosukhin, I., Uszkoreit, J. and Berant, J. (2017a). Hierarchical Question Answering for Long Documents. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 209–20. Vancouver, Canada.
Choi, E., Hewlett, D., Uszkoreit, J., Polosukhin, I., Lacoste, A. and Berant, J. (2017b). Coarse-to-fine question answering for long documents. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 209–20.
Chollet, F. (2017). Deep Learning with Python. Manning Publications Co.
Clark, C. and Gardner, M. (2018). Simple and effective multi-paragraph reading comprehension. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), pp. 845–55.
Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. Proceedings of the 25th International Conference on Machine Learning, pp. 160–67. ACM.
Cui, L., Huang, S., Wei, F., Tan, C., Duan, C. and Zhou, M. (2017a). SuperAgent: a customer service chatbot for e-commerce websites. Proceedings of ACL, System Demonstrations, pp. 97–102. Association for Computational Linguistics.
Cui, Y., Che, W., Liu, T., Qin, B., Wang, S. and Hu, G. (2019a). Cross-Lingual Machine Reading Comprehension. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1586–95. Hong Kong, China.
Cui, Y., Chen, Z., Wei, S., Wang, S., Liu, T. and Hu, G. (2017b). Attention-over-attention neural networks for reading comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 593–602.
Cui, Y., Liu, T., Che, W., Xiao, L., Chen, Z., Ma, W., Wang, S. and Hu, G. (2019b). A Span-Extraction Dataset for Chinese Machine Reading Comprehension. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5883–89. Hong Kong, China.
Cui, Y., Liu, T., Chen, Z., Wang, S. and Hu, G. (2016). Consensus Attention-Based Neural Networks for Chinese Reading Comprehension. Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 1777–86.
Das, R., Dhuliawala, S., Zaheer, M. and McCallum, A. (2019). Multi-step Retriever-Reader Interaction for Scalable Open-domain Question Answering. International Conference on Learning Representations.
Dasigi, P., Liu, N.F., Marasović, A., Smith, N.A. and Gardner, M. (2019). Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5925–32. Hong Kong, China.
Dehghani, M., Azarbonyad, H., Kamps, J. and de Rijke, M. (2019). Learning to transform, combine, and reason in open-domain question answering. Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 681–89.
Deng, L. and Liu, Y. (2018). Deep Learning in Natural Language Processing. Springer Singapore.
Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–86.
Dhingra, B., Liu, H., Yang, Z., Cohen, W.W. and Salakhutdinov, R. (2017). Gated-attention readers for text comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1832–46.
Dhingra, B., Pruthi, D. and Rajagopal, D. (2018). Simple and Effective Semi-Supervised Question Answering. Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2018), pp. 582–87.
Dhingra, B., Zhou, Z., Fitzpatrick, D., Muehl, M. and Cohen, W.W. (2016). Tweet2vec: Character-based distributed representations for social media. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 269–74.
Ding, M., Zhou, C., Chen, Q., Yang, H. and Tang, J. (2019). Cognitive Graph for Multi-Hop Reading Comprehension at Scale. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2694–703. Florence, Italy.
Du, X. and Cardie, C. (2018). Harvesting Paragraph-Level Question-Answer Pairs from Wikipedia. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), pp. 1907–17.
Dua, D., Singh, S. and Gardner, M. (2020). Benefits of Intermediate Annotations in Reading Comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5627–34.
Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S. and Gardner, M. (2019). DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2368–78. Minneapolis, Minnesota.
Duan, N., Tang, D., Chen, P. and Zhou, M. (2017). Question generation for question answering. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 866–74.
Dunietz, J., Burnham, G., Bharadwaj, A., Chu-Carroll, J., Rambow, O. and Ferrucci, D. (2020). To Test Machine Comprehension, Start by Defining Comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7839–59.
Elgohary, A., Zhao, C. and Boyd-Graber, J. (2018). A dataset and baselines for sequential open-domain question answering. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1077–83.
Ethayarajh, K. (2019). How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pp. 55–65. Hong Kong, China.
Fisch, A., Talmor, A., Jia, R., Seo, M., Choi, E. and Chen, D. (2019). MRQA 2019 Shared Task: Evaluating Generalization in Reading Comprehension. Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pp. 1–13. Hong Kong, China.
Frermann, L. (2019). Extractive NarrativeQA with Heuristic Pre-Training. Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pp. 172–82.
Gardner, M., Berant, J., Hajishirzi, H., Talmor, A. and Min, S. (2019). On Making Reading Comprehension More Comprehensive. Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pp. 105–12.
Ghaeini, R., Fern, X.Z., Shahbazi, H. and Tadepalli, P. (2018). Dependent Gated Reading for Cloze-Style Question Answering. Proceedings of the 27th International Conference on Computational Linguistics, pp. 3330–45.
Giuseppe, A. (2017). Question Dependent Recurrent Entity Network for Question Answering. NL4AI: 1st Workshop on Natural Language for Artificial Intelligence, pp. 69–80. CEUR.
Golub, D., Huang, P.-S., He, X. and Deng, L. (2017). Two-stage synthesis networks for transfer learning in machine comprehension. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 835–44.
Gong, Y. and Bowman, S.R. (2018). Ruminating reader: Reasoning with gated multi-hop attention. Proceedings of the Workshop on Machine Reading for Question Answering, pp. 1–11. Association for Computational Linguistics.
Greco, C., Suglia, A., Basile, P., Rossiello, G. and Semeraro, G. (2016). Iterative multi-document neural attention for multiple answer prediction. 2016 AI*IA Workshop on Deep Understanding and Reasoning: A Challenge for Next-Generation Intelligent Agents (URANIA).
Guo, S., Li, R., Tan, H., Li, X., Guan, Y., Zhao, H. and Zhang, Y. (2020). A Frame-based Sentence Representation for Machine Reading Comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 891–96.
Gupta, D., Lenka, P., Ekbal, A. and Bhattacharyya, P. (2018). Uncovering Code-Mixed Challenges: A Framework for Linguistically Driven Question Generation and Neural based Question Answering. Proceedings of the 22nd Conference on Computational Natural Language Learning, pp. 119–30.
Gupta, M., Kulkarni, N., Chanda, R., Rayasam, A. and Lipton, Z.C. (2019). AmazonQA: A Review-Based Question Answering Task. Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), pp. 4996–5002.
Gupta, S. and Khade, N. (2020). BERT Based Multilingual Machine Comprehension in English and Hindi. ACM Transactions on Asian and Low-Resource Language Information Processing 19.
Gupta, S., Rawat, B.P.S. and Yu, H. (2020). Conversational Machine Comprehension: a Literature Review. Proceedings of the 28th International Conference on Computational Linguistics, pp. 2739–53.
Hardalov, M., Koychev, I. and Nakov, P. (2019). Beyond English-Only Reading Comprehension: Experiments in Zero-Shot Multilingual Transfer for Bulgarian. Proceedings of Recent Advances in Natural Language Processing, pp. 447–59. Varna, Bulgaria.
Hashemi, H., Aliannejadi, M., Zamani, H. and Croft, W.B. (2020). ANTIQUE: A non-factoid question answering benchmark. European Conference on Information Retrieval, pp. 166–73. Springer.
He, W., Liu, K., Liu, J., Lyu, Y., Zhao, S., Xiao, X., Liu, Y., Wang, Y., Wu, H. and She, Q. (2018). DuReader: a Chinese Machine Reading Comprehension Dataset from Real-World Applications. Proceedings of the Workshop on Machine Reading for Question Answering, pp. 37–46. Association for Computational Linguistics.
Hermann, K.M., Kocisky, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M. and Blunsom, P. (2015). Teaching machines to read and comprehend. Advances in Neural Information Processing Systems, pp. 1693–701.
Hewlett, D., Jones, L. and Lacoste, A. (2017). Accurate supervised and semi-supervised machine reading for long documents. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2011–20.
Hill, F., Bordes, A., Chopra, S. and Weston, J. (2016). The Goldilocks principle: Reading children’s books with explicit memory representations. Proceedings of the International Conference on Learning Representations (ICLR).
Hirschman, L., Light, M., Breck, E. and Burger, J.D. (1999). Deep Read: A Reading Comprehension System. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pp. 325–32.
Hoang, L., Wiseman, S. and Rush, A.M. (2018). Entity Tracking Improves Cloze-style Reading Comprehension. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1049–55.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation 9, 1735–1780.
Horbach, A., Aldabe, I., Bexte, M., de Lacalle, O.L. and Maritxalar, M. (2020). Linguistic Appropriateness and Pedagogic Usefulness of Reading Comprehension Questions. Proceedings of The 12th Language Resources and Evaluation Conference, pp. 1753–62.
Htut, P.M., Bowman, S.R. and Cho, K. (2018). Training a Ranking Function for Open-Domain Question Answering. Proceedings of NAACL-HLT 2018: Student Research Workshop, pp. 120–27.
Hu, M., Peng, Y., Huang, Z. and Li, D. (2019a). A Multi-Type Multi-Span Network for Reading Comprehension that Requires Discrete Reasoning. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1596–606. Hong Kong, China.
Hu, M., Peng, Y., Huang, Z. and Li, D. (2019b). Retrieve, Read, Rerank: Towards End-to-End Multi-Document Reading Comprehension. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy.
Hu, M., Peng, Y., Huang, Z., Qiu, X., Wei, F. and Zhou, M. (2018a). Reinforced Mnemonic Reader for Machine Reading Comprehension. Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), pp. 4099–106.
Hu, M., Peng, Y., Wei, F., Huang, Z., Li, D., Yang, N. and Zhou, M. (2018b). Attention-Guided Answer Distillation for Machine Reading Comprehension. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4243–52.
Hu, M., Wei, F., Peng, Y., Huang, Z., Yang, N. and Li, D. (2019c). Read + Verify: Machine Reading Comprehension with Unanswerable Questions. Proceedings of the AAAI Conference on Artificial Intelligence, pp. 6529–37.
Huang, H.-Y., Zhu, C., Shen, Y. and Chen, W. (2018). FusionNet: Fusing via fully-aware attention with application to machine comprehension. Proceedings of the International Conference on Learning Representations (ICLR).
Huang, K., Tang, Y., Huang, J., He, X. and Zhou, B. (2019a). Relation Module for Non-Answerable Predictions on Reading Comprehension. Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pp. 747–56.
Huang, L., Le Bras, R., Bhagavatula, C. and Choi, Y. (2019b). Cosmos QA: Machine Reading Comprehension with Contextual Commonsense Reasoning. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2391–401. Hong Kong, China.
Indurthi, S., Yu, S., Back, S. and Cuayáhuitl, H. (2018). Cut to the Chase: A Context Zoom-in Network for Reading Comprehension. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 570–75. Association for Computational Linguistics.
Ingale, V. and Singh, P. (2019). Datasets for Machine Reading Comprehension: A Literature Review. Available at SSRN 3454037.
Inoue, N., Stenetorp, P. and Inui, K. (2020). R4C: A Benchmark for Evaluating RC Systems to Get the Right Answer for the Right Reason. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6740–50.
Jia, R. and Liang, P. (2017). Adversarial examples for evaluating reading comprehension systems. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2021–31.
Jiang, Y., Joshi, N., Chen, Y.-C. and Bansal, M. (2019). Explore, Propose, and Assemble: An Interpretable Model for Multi-Hop Reading Comprehension. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2714–25. Florence, Italy.
Jin, D., Gao, S., Kao, J.-Y., Chung, T. and Hakkani-Tur, D. (2020). MMM: Multi-stage multi-task learning for multi-choice reading comprehension. The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20).
Jin, W., Yang, G. and Zhu, H. (2019). An Efficient Machine Reading Comprehension Method Based on Attention Mechanism. 2019 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), pp. 1297–302. IEEE.
Jing, Y., Xiong, D. and Zhen, Y. (2019). BiPaR: A Bilingual Parallel Dataset for Multilingual and Cross-lingual Reading Comprehension on Novels. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pp. 2452–62. Hong Kong, China.
Joshi, M., Choi, E., Weld, D.S. and Zettlemoyer, L. (2017). TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1601–11.
Jurafsky, D. and Martin, J.H. (2019). Speech and Language Processing. Pearson, London.
Kadlec, R., Bajgar, O. and Kleindienst, J. (2016a). From Particular to General: A Preliminary Case Study of Transfer Learning in Reading Comprehension. NIPS Machine Intelligence Workshop.
Kadlec, R., Schmid, M., Bajgar, O. and Kleindienst, J. (2016b). Text understanding with the attention sum reader network. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 908–18.
Ke, N.R., Zolna, K., Sordoni, A., Lin, Z., Trischler, A., Bengio, Y., Pineau, J., Charlin, L. and Pal, C. (2018). Focused Hierarchical RNNs for Conditional Sequence Processing. Proceedings of the 35th International Conference on Machine Learning.
Khashabi, D., Chaturvedi, S., Roth, M., Upadhyay, S. and Roth, D. (2018). Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 252–62.
Kim, Y. (2014). Convolutional neural networks for sentence classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–51. Doha, Qatar.
Kim, Y., Jernite, Y., Sontag, D. and Rush, A.M. (2016). Character-aware neural language models. Thirtieth AAAI Conference on Artificial Intelligence.
Kobayashi, S., Tian, R., Okazaki, N. and Inui, K. (2016). Dynamic entity representation with max-pooling improves machine reading. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 850–55.
Kočiský, T., Schwarz, J., Blunsom, P., Dyer, C., Hermann, K.M., Melis, G. and Grefenstette, E. (2018). The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics 6, 317–328.
Kodra, L. and Meçe, E.K. (2017). Question Answering Systems: A Review on Present Developments, Challenges and Trends. International Journal of Advanced Computer Science 8, 217–224.
Kundu, S. and Ng, H.T. (2018a). A Nil-Aware Answer Extraction Framework for Question Answering. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4243–52.
Kundu, S. and Ng, H.T. (2018b). A Question-Focused Multi-Factor Attention Network for Question Answering. Association for the Advancement of Artificial Intelligence (AAAI 2018).
Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J. and Lee, K. (2019). Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics 7, 453–466.
Lai, G., Xie, Q., Liu, H., Yang, Y. and Hovy, E. (2017). RACE: Large-scale reading comprehension dataset from examinations. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 785–94.
LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324.
Lee, G., Hwang, S.-w. and Cho, H. (2020). SQuAD2-CR: Semi-supervised Annotation for Cause and Rationales for Unanswerability in SQuAD 2.0. Proceedings of The 12th Language Resources and Evaluation Conference, pp. 5425–32.
Lee, H.-g. and Kim, H. (2020). GF-Net: Improving Machine Reading Comprehension with Feature Gates. Pattern Recognition Letters 129, 8–15.
Lee, K., Park, S., Han, H., Yeo, J., Hwang, S.-w. and Lee, J. (2019a). Learning with limited data for multilingual reading comprehension. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2833–43.
Lee, S., Kim, D. and Park, J. (2019b). Domain-agnostic Question-Answering with Adversarial Training. Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pp. 196–202. Hong Kong, China.
Lehnert, W.G. (1977). The process of question answering. Yale University, New Haven, CT, Department of Computer Science.
Li, H., Zhang, X., Liu, Y., Zhang, Y., Wang, Q., Zhou, X., Liu, J., Wu, H. and Wang, H. (2019a). D-NET: A Simple Framework for Improving the Generalization of Machine Reading Comprehension. Proceedings of the 2nd Machine Reading for Reading Comprehension Workshop at EMNLP.
Li, J., Li, B. and Lv, X. (2018). Machine Reading Comprehension Based on the Combination of BIDAF Model and Word Vectors. Proceedings of the 2nd International Conference on Computer Science and Application Engineering, p. 89. ACM.
Li, X., Zhang, Z., Zhu, W., Li, Z., Ni, Y., Gao, P., Yan, J. and Xie, G. (2019b). Pingan Smart Health and SJTU at COIN-Shared Task: utilizing Pre-trained Language Models and Common-sense Knowledge in Machine Reading Tasks. Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, pp. 93–98.
Li, Y., Li, H. and Liu, J. (2019c). Towards Robust Neural Machine Reading Comprehension via Question Paraphrases. 2019 International Conference on Asian Language Processing (IALP), pp. 290–95. IEEE.
Liang, Y., Li, J. and Yin, J. (2019). A New Multi-Choice Reading Comprehension Dataset for Curriculum Learning. Asian Conference on Machine Learning, pp. 742–57.
Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004). Association for Computational Linguistics.
Lin, K., Tafjord, O., Clark, P. and Gardner, M. (2019). Reasoning Over Paragraph Effects in Situations. Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pp. 58–62. Hong Kong, China.
Lin, X., Liu, R. and Li, Y. (2018). An Option Gate Module for Sentence Inference on Machine Reading Comprehension. Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 1743–46. ACM.
Liu, C., Zhao, Y., Si, Q., Zhang, H., Li, B. and Yu, D. (2018a). Multi-Perspective Fusion Network for Commonsense Reading Comprehension. Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data, pp. 262–74. Springer.
Liu, D., Gong, Y., Fu, J., Yan, Y., Chen, J., Jiang, D., Lv, J. and Duan, N. (2020a). RikiNet: Reading Wikipedia Pages for Natural Question Answering. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6762–71.
Liu, F. and Perez, J. (2017). Gated end-to-end memory networks. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 1–10.
Liu, J., Lin, Y., Liu, Z. and Sun, M. (2019a). XQA: A cross-lingual open-domain question answering dataset. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2358–68.
Liu, J., Wei, W., Sun, M., Chen, H., Du, Y. and Lin, D. (2018b). A Multi-answer Multi-task Framework for Real-world Machine Reading Comprehension. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2109–18.
Liu, K., Liu, X., Yang, A., Liu, J., Su, J., Li, S. and She, Q. (2020b). A Robust Adversarial Training Approach to Machine Reading Comprehension. AAAI, pp. 8392–400.
Liu, R., Hu, J., Wei, W., Yang, Z. and Nyberg, E. (2017). Structural embedding of syntactic trees for machine comprehension. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 815–24.
Liu, S., Zhang, S., Zhang, X. and Wang, H. (2019b). R-trans: RNN Transformer Network for Chinese Machine Reading Comprehension. IEEE Access 7, 27736–27745.
Liu, S., Zhang, X., Zhang, S., Wang, H. and Zhang, W. (2019c). Neural Machine Reading Comprehension: Methods and Trends. Applied Sciences 9, 3698.
Liu, X., Shen, Y., Duh, K. and Gao, J. (2018c). Stochastic Answer Networks for Machine Reading Comprehension. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), pp. 1694–704.
Liu, Y., Huang, Z., Hu, M., Du, S., Peng, Y., Li, D. and Wang, X. (2018d). MFM: A Multi-level Fused Sequence Matching Model for Candidates Filtering in Multi-paragraphs Question-Answering. Pacific Rim Conference on Multimedia, pp. 449–58. Springer.
Longpre, S., Lu, Y., Tu, Z. and DuBois, C. (2019). An Exploration of Data Augmentation and Sampling Techniques for Domain-Agnostic Question Answering. Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pp. 220–27.
Ma, K., Jurczyk, T. and Choi, J.D. (2018). Challenging Reading Comprehension on Daily Conversation: Passage Completion on Multiparty Dialog. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2039–48.
Manning, C.D., Raghavan, P. and Schütze, H. (2008). Chapter 8: Evaluation in information retrieval. In Introduction to Information Retrieval. Cambridge University Press.
Miao, H., Liu, R. and Gao, S. (2019a). A Multiple Granularity Co-Reasoning Model for Multi-choice Reading Comprehension. 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–7. IEEE.
Miao, H., Liu, R. and Gao, S. (2019b). Option Attentive Capsule Network for Multi-choice Reading Comprehension. International Conference on Neural Information Processing, pp. 306–18. Springer.
Mihaylov, T. and Frank, A. (2018). Knowledgeable Reader: Enhancing Cloze-Style Reading Comprehension with External Commonsense Knowledge. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), pp. 821–32.
Mihaylov, T. and Frank, A. (2019). Discourse-Aware Semantic Self-Attention for Narrative Reading Comprehension. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2541–52. Hong Kong, China.
Mihaylov, T., Kozareva, Z. and Frank, A. (2017). Neural Skill Transfer from Supervised Language Tasks to Reading Comprehension. Workshop on Learning with Limited Labeled Data: Weak Supervision and Beyond at NIPS.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S. and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, pp. 3111–19.
Min, S., Seo, M. and Hajishirzi, H. (2017). Question answering through transfer learning from large fine-grained supervision data. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Short Papers), pp. 510–17.
Min, S., Zhong, V., Socher, R. and Xiong, C. (2018). Efficient and Robust Question Answering from Minimal Context over Documents. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), pp. 1725–35.
Min, S., Zhong, V., Zettlemoyer, L. and Hajishirzi, H. (2019). Multi-hop Reading Comprehension through Question Decomposition and Rescoring. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 6097–109. Florence, Italy.
Mozannar, H., Maamary, E., El Hajal, K. and Hajj, H. (2019). Neural Arabic Question Answering. Proceedings of the Fourth Arabic Natural Language Processing Workshop, pp. 108–18. Florence, Italy.
Munkhdalai, T. and Yu, H. (2017). Reasoning with memory augmented neural networks for language comprehension. Proceedings of the International Conference on Learning Representations (ICLR).
Nakatsuji, M. and Okui, S. (2020). Answer Generation through Unified Memories over Multiple Passages. Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20).
Ng, H.T., Teo, L.H. and Kwan, J.L.P. (2000). A Machine Learning Approach to Answering Questions for Reading Comprehension Tests. Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics, pp. 124–32.
Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R. and Deng, L. (2016). MS MARCO: A Human Generated Machine Reading Comprehension Dataset. Proceedings of the Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches, co-located with the 30th Annual Conference on Neural Information Processing Systems (CoCo@NIPS), Barcelona, Spain.
Nie, Y., Wang, S. and Bansal, M. (2019). Revealing the Importance of Semantic Retrieval for Machine Reading at Scale. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2553–66. Hong Kong, China.
Nishida, K., Nishida, K., Nagata, M., Otsuka, A., Saito, I., Asano, H. and Tomita, J. (2019a). Answering while Summarizing: Multi-task Learning for Multi-hop QA with Evidence Extraction. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2335–45. Florence, Italy.
Nishida, K., Nishida, K., Saito, I., Asano, H. and Tomita, J. (2020). Unsupervised Domain Adaptation of Language Models for Reading Comprehension. Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pp. 5392–99.
Nishida, K., Saito, I., Nishida, K., Shinoda, K., Otsuka, A., Asano, H. and Tomita, J. (2019b). Multi-style Generative Reading Comprehension. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2273–84. Florence, Italy.
Nishida, K., Saito, I., Otsuka, A., Asano, H. and Tomita, J. (2018). Retrieve-and-read: Multi-task learning of information retrieval and reading comprehension. Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 647–56. ACM.
Niu, Y., Jiao, F., Zhou, M., Yao, T., Xu, J. and Huang, M. (2020). A Self-Training Method for Machine Reading Comprehension with Soft Evidence Extraction. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3916–27.
Onishi, T., Wang, H., Bansal, M., Gimpel, K. and McAllester, D. (2016). Who did What: A large-scale person-centered cloze dataset. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2230–35.
Osama, R., El-Makky, N.M. and Torki, M. (2019). Question Answering Using Hierarchical Attention on Top of BERT Features. Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pp. 191–95.
Ostermann, S., Modi, A., Roth, M., Thater, S. and Pinkal, M. (2018). MCScript: A Novel Dataset for Assessing Machine Comprehension Using Script Knowledge. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC), Miyazaki, Japan.
Ostermann, S., Roth, M. and Pinkal, M. (2019). MCScript2.0: A Machine Comprehension Corpus Focused on Script Events and Participants. Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), pp. 103–17. Minneapolis, Minnesota.
Pampari, A., Raghavan, P., Liang, J. and Peng, J. (2018). emrQA: A Large Corpus for Question Answering on Electronic Medical Records. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2357–68.
Pang, L., Lan, Y., Guo, J., Xu, J., Su, L. and Cheng, X. (2019). HAS-QA: Hierarchical answer spans model for open-domain question answering. Proceedings of the AAAI Conference on Artificial Intelligence, pp. 6875–82.
Papineni, K., Roukos, S., Ward, T. and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–18. Association for Computational Linguistics.
Pappas, D., Stavropoulos, P., Androutsopoulos, I. and McDonald, R. (2020). BioMRC: A Dataset for Biomedical Machine Reading Comprehension. Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, pp. 140–49.
Park, C., Lee, C. and Song, H. (2019). VS3-NET: Neural Variational Inference Model for Machine-Reading Comprehension. ETRI Journal 41, 771–781.
Pennington, J., Socher, R. and Manning, C. (2014). GloVe: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–43.
Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K. and Zettlemoyer, L. (2018). Deep contextualized word representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pp. 2227–37. New Orleans, Louisiana.
Prakash, T., Tripathy, B.K. and Banu, K.S. (2018). ALICE: A Natural Language Question Answering System Using Dynamic Attention and Memory. International Conference on Soft Computing Systems, pp. 274–82. Springer.
Pugaliya, H., Route, J., Ma, K., Geng, Y. and Nyberg, E. (2019). Bend but Don’t Break? Multi-Challenge Stress Test for QA Models. Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pp. 125–36.
Qiu, D., Zhang, Y., Feng, X., Liao, X., Jiang, W., Lyu, Y., Liu, K. and Zhao, J. (2019). Machine Reading Comprehension Using Structural Knowledge Graph-aware Network. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5898–903.
Radford, A., Narasimhan, K., Salimans, T. and Sutskever, I. (2018). Improving language understanding by generative pre-training.
Rajpurkar, P., Jia, R. and Liang, P. (2018). Know What You Don’t Know: Unanswerable Questions for SQuAD. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Short Papers), pp. 784–89.
Rajpurkar, P., Zhang, J., Lopyrev, K. and Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–92.
Ran, Q., Lin, Y., Li, P., Zhou, J. and Liu, Z. (2019). NumNet: Machine Reading Comprehension with Numerical Reasoning. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2474–84. Hong Kong, China.
Reddy, S., Chen, D. and Manning, C.D. (2019). CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics 7, 249–266.
Ren, M., Huang, H., Wei, R., Liu, H., Bai, Y., Wang, Y. and Gao, Y. (2019). Multiple Perspective Answer Reranking for Multi-passage Reading Comprehension. CCF International Conference on Natural Language Processing and Chinese Computing, pp. 736–47. Springer.
Ren, Q., Cheng, X. and Su, S. (2020). Multi-Task Learning with Generative Adversarial Training for Multi-Passage Machine Reading Comprehension. AAAI, pp. 8705–12.
Richardson, M., Burges, C.J.C. and Renshaw, E. (2013). MCTest: A challenge dataset for the open-domain machine comprehension of text. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 193–203.
Riloff, E. and Thelen, M. (2000). A Rule-Based Question Answering System for Reading Comprehension Tests. Proceedings of the 2000 ANLP/NAACL Workshop on Reading Comprehension Tests as Evaluation for Computer-Based Language Understanding Systems, pp. 13–19. Association for Computational Linguistics.
Ruder, S. (2019). Neural Transfer Learning for Natural Language Processing. PhD thesis, National University of Ireland, Galway.
Sachan, M. and Xing, E. (2018). Self-Training for Jointly Learning to Ask and Answer Questions. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 629–40.
Saha, A., Aralikatte, R., Khapra, M.M. and Sankaranarayanan, K. (2018). DuoRC: Towards Complex Language Understanding with Paraphrased Reading Comprehension. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), pp. 1683–93.
Salant, S. and Berant, J. (2018). Contextualized Word Representations for Reading Comprehension. Proceedings of NAACL-HLT 2018, pp. 554–59.
Sayama, H.F., Araujo, A.V. and Fernandes, E.R. (2019). FaQuAD: Reading Comprehension Dataset in the Domain of Brazilian Higher Education. 2019 8th Brazilian Conference on Intelligent Systems (BRACIS), pp. 443–48. IEEE.
Schlegel, V., Valentino, M., Freitas, A., Nenadic, G. and Batista-Navarro, R. (2020). A framework for evaluation of Machine Reading Comprehension Gold Standards. Proceedings of the 12th Conference on Language Resources and Evaluation, pp. 5359–69.
Seo, M., Kembhavi, A., Farhadi, A. and Hajishirzi, H. (2017). Bidirectional attention flow for machine comprehension. Proceedings of the 5th International Conference on Learning Representations (ICLR).
Seo, M., Kwiatkowski, T., Parikh, A.P., Farhadi, A. and Hajishirzi, H. (2018). Phrase-Indexed Question Answering: A New Challenge for Scalable Document Comprehension. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 559–64.
Shao, C.C., Liu, T., Lai, Y., Tseng, Y. and Tsai, S. (2018). DRCD: a Chinese Machine Reading Comprehension Dataset. Proceedings of the Workshop on Machine Reading for Question Answering, pp. 37–46. Association for Computational Linguistics.
Sharma, P. and Roychowdhury, S. (2019). IIT-KGP at COIN 2019: Using pre-trained Language Models for modeling Machine Comprehension. Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, pp. 80–84.
Shen, Y., Huang, P.-S., Gao, J. and Chen, W. (2017). ReasoNet: Learning to Stop Reading in Machine Comprehension. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2017), pp. 1047–55. ACM.
Sheng, Y., Lan, M. and Wu, Y. (2018). ECNU at SemEval-2018 Task 11: Using Deep Learning Method to Address Machine Comprehension Task. Proceedings of The 12th International Workshop on Semantic Evaluation, pp. 1048–52. Association for Computational Linguistics.
Song, J., Tang, S., Qian, T., Zhu, W. and Wu, F. (2018). Reading Document and Answering Question via Global Attentional Inference. Pacific Rim Conference on Multimedia (PCM 2018), pp. 335–45. Springer.
Song, L., Wang, Z., Yu, M., Zhang, Y., Florian, R. and Gildea, D. (2020). Evidence Integration for Multi-hop Reading Comprehension with Graph Neural Networks. IEEE Transactions on Knowledge and Data Engineering.
Soni, S. and Roberts, K. (2020). Evaluation of Dataset Selection for Pre-Training and Fine-Tuning Transformer Language Models for Clinical Question Answering. Proceedings of The 12th Language Resources and Evaluation Conference, pp. 5532–38.
Srivastava, R.K., Greff, K. and Schmidhuber, J. (2015). Training very deep networks. Advances in Neural Information Processing Systems, pp. 2377–85.
Su, D., Xu, Y., Winata, G.I., Xu, P., Kim, H., Liu, Z. and Fung, P. (2019). Generalizing Question Answering System with Pre-trained Language Model Fine-tuning. Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pp. 203–11.
Sugawara, S., Inui, K., Sekine, S. and Aizawa, A. (2018). What Makes Reading Comprehension Questions Easier? Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4208–19.
Sugawara, S., Kido, Y., Yokono, H. and Aizawa, A. (2017). Evaluation Metrics for Machine Reading Comprehension: Prerequisite Skills and Readability. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 806–17.
Sugawara, S., Stenetorp, P., Inui, K. and Aizawa, A. (2020). Assessing the Benchmarking Capacity of Machine Reading Comprehension Datasets. AAAI, pp. 8918–27.
Sun, K., Yu, D., Yu, D. and Cardie, C. (2019). Improving Machine Reading Comprehension with General Reading Strategies. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2633–43.
Sun, K., Yu, D., Yu, D. and Cardie, C. (2020). Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension. Transactions of the Association for Computational Linguistics 8, 141–155.
Šuster, S. and Daelemans, W. (2018). CliCR: A Dataset of Clinical Case Reports for Machine Reading Comprehension. Proceedings of NAACL-HLT 2018, pp. 1551–63.
Swayamdipta, S., Parikh, A.P. and Kwiatkowski, T. (2018). Multi-mention learning for reading comprehension with neural cascades. Proceedings of the International Conference on Learning Representations (ICLR).
Takahashi, T., Taniguchi, M., Taniguchi, T. and Ohkuma, T. (2019). CLER: Cross-task learning with expert representation to generalize reading and understanding. Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pp. 183–90.
Talmor, A. and Berant, J. (2019). MultiQA: An Empirical Investigation of Generalization and Transfer in Reading Comprehension. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4911–21. Florence, Italy.
Tan, C., Wei, F., Yang, N., Du, B., Lv, W. and Zhou, M. (2018a). S-Net: From Answer Extraction to Answer Synthesis for Machine Reading Comprehension. Association for the Advancement of Artificial Intelligence (AAAI).
Tan, C., Wei, F., Zhou, Q., Yang, N., Lv, W. and Zhou, M. (2018b). I Know There Is No Answer: Modeling Answer Validation for Machine Reading Comprehension. CCF International Conference on Natural Language Processing and Chinese Computing, pp. 85–97. Springer.
Tang, H., Hong, Y., Chen, X., Wu, K. and Zhang, M. (2019a). How to Answer Comparison Questions. 2019 International Conference on Asian Language Processing (IALP), pp. 331–36. IEEE.
Tang, M., Cai, J. and Zhuo, H.H. (2019b). Multi-matching network for multiple choice reading comprehension. Proceedings of the AAAI Conference on Artificial Intelligence, pp. 7088–95.
Tay, Y., Luu, A.T. and Hui, S.C. (2018). Multi-Granular Sequence Encoding via Dilated Compositional Units for Reading Comprehension. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2141–51.
Tay, Y., Wang, S., Luu, A.T., Fu, J., Phan, M.C., Yuan, X., Rao, J., Hui, S.C. and Zhang, A. (2019). Simple and Effective Curriculum Pointer-Generator Networks for Reading Comprehension over Long Narratives. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4922–31. Florence, Italy.
Trischler, A., Wang, T., Yuan, X., Harris, J., Sordoni, A., Bachman, P. and Suleman, K. (2017). NewsQA: A machine comprehension dataset. Proceedings of the 2nd Workshop on Representation Learning for NLP.
Tu, M., Huang, K., Wang, G., Huang, J., He, X. and Zhou, B. (2020). Select, Answer and Explain: Interpretable Multi-Hop Reading Comprehension over Multiple Documents. AAAI, pp. 9073–80.
Tu, M., Wang, G., Huang, J., Tang, Y., He, X. and Zhou, B. (2019). Multi-hop Reading Comprehension across Multiple Documents by Reasoning over Heterogeneous Graphs. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2704–13. Florence, Italy.
Turpin, A. and Scholer, F. (2006). User performance versus precision measures for simple search tasks. Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 11–18. ACM.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, pp. 5998–6008.
Vedantam, R., Lawrence Zitnick, C. and Parikh, D. (2015). CIDEr: Consensus-based image description evaluation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–75.
Wang, B., Guo, S., Liu, K., He, S. and Zhao, J. (2016). Employing External Rich Knowledge for Machine Comprehension. International Joint Conference on Artificial Intelligence (IJCAI), pp. 2929–35.
Wang, B., Yao, T., Zhang, Q., Xu, J. and Wang, X. (2020b). ReCO: A Large Scale Chinese Reading Comprehension Dataset on Opinion. AAAI, pp. 9146–53.
Wang, B., Yao, T., Zhang, Q., Xu, J., Liu, K., Tian, Z. and Zhao, J. (2019a). Unsupervised Story Comprehension with Hierarchical Encoder-Decoder. Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, pp. 93–100.
Wang, B., Zhang, X., Zhou, X. and Li, J. (2020a). A Gated Dilated Convolution with Attention Model for Clinical Cloze-Style Reading Comprehension. International Journal of Environmental Research and Public Health 17, 1323.
Wang, C. and Jiang, H. (2019). Explicit Utilization of General Knowledge in Machine Reading Comprehension. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2263–72. Florence, Italy.
Wang, H., Gan, Z., Liu, X., Liu, J., Gao, J. and Wang, H. (2019e). Adversarial Domain Adaptation for Machine Reading Comprehension. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2510–20. Hong Kong, China.
Wang, H., Lu, W. and Tang, Z. (2019d). Incorporating External Knowledge to Boost Machine Comprehension Based Question Answering. European Conference on Information Retrieval, pp. 819–27. Springer.
Wang, H., Yu, D., Sun, K., Chen, J., Yu, D., McAllester, D. and Roth, D. (2019b). Evidence Sentence Extraction for Machine Reading Comprehension. Proceedings of the 23rd Conference on Computational Natural Language Learning, pp. 696–707. Hong Kong, China.
Wang, H., Yu, M., Guo, X., Das, R., Xiong, W. and Gao, T. (2019c). Do Multi-hop Readers Dream of Reasoning Chains? Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pp. 91–97. Hong Kong, China.
Wang, S. and Jiang, J. (2017). Machine comprehension using match-LSTM and answer pointer. Proceedings of the International Conference on Learning Representations (ICLR), pp. 1–15. Toulon, France.
Wang, S., Yu, M., Chang, S. and Jiang, J. (2018a). A Co-Matching Model for Multi-choice Reading Comprehension. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Short Papers), pp. 746–51.
Wang, S., Yu, M., Guo, X., Wang, Z., Klinger, T., Zhang, W., Chang, S., Tesauro, G., Zhou, B. and Jiang, J. (2018b). R3: Reinforced ranker-reader for open-domain question answering. Association for the Advancement of Artificial Intelligence (AAAI 2018).
Wang, S., Yu, M., Jiang, J., Zhang, W., Guo, X., Chang, S., Wang, Z., Klinger, T., Tesauro, G. and Campbell, M. (2018c). Evidence Aggregation for Answer Re-Ranking in Open-Domain Question Answering. Proceedings of the International Conference on Learning Representations (ICLR).
Wang, T., Yuan, X. and Trischler, A. (2017a). A joint model for question answering and question generation. Learning to Generate Natural Language Workshop, ICML 2017.
Wang, W., Yan, M. and Wu, C. (2018d). Multi-granularity hierarchical attention fusion networks for reading comprehension and question answering. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1705–14.
Wang, W., Yang, N., Wei, F., Chang, B. and Zhou, M. (2017b). Gated self-matching networks for reading comprehension and question answering. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 189–98.
Wang, Y. and Bansal, M. (2018). Robust Machine Comprehension Models via Adversarial Training. Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2018), pp. 575–81. Association for Computational Linguistics.
Wang, Y., Liu, K., Liu, J., He, W., Lyu, Y., Wu, H., Li, S. and Wang, H. (2018e). Multi-Passage Machine Reading Comprehension with Cross-Passage Answer Verification. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1918–27.
Wang, Z., Liu, J., Xiao, X., Lyu, Y. and Wu, T. (2018f). Joint Training of Candidate Extraction and Answer Selection for Reading Comprehension. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), pp. 1715–24.
Watarai, T. and Tsuchiya, M. (2020). Developing Dataset of Japanese Slot Filling Quizzes Designed for Evaluation of Machine Reading Comprehension. Proceedings of the 12th Language Resources and Evaluation Conference, pp. 6895–901.
Weissenborn, D., Wiese, G. and Seiffe, L. (2017). Making Neural QA as Simple as Possible but not Simpler. Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL), pp. 271–80. Vancouver, Canada.
Welbl, J., Liu, N.F. and Gardner, M. (2017). Crowdsourcing multiple choice science questions. Proceedings of the 3rd Workshop on Noisy User-generated Text, pp. 94–106. Association for Computational Linguistics.
Welbl, J., Minervini, P., Bartolo, M., Stenetorp, P. and Riedel, S. (2020). Undersensitivity in neural reading comprehension. International Conference on Learning Representations (ICLR 2020).
Welbl, J., Stenetorp, P. and Riedel, S. (2018). Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics 6, 287–302.
Wu, B., Huang, H., Wang, Z., Feng, Q., Yu, J. and Wang, B. (2019). Improving the Robustness of Deep Reading Comprehension Models by Leveraging Syntax Prior. Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pp. 53–57.
Wu, Z. and Xu, H. (2020). Improving the Robustness of Machine Reading Comprehension Model with Hierarchical Knowledge and Auxiliary Unanswerability Prediction. Knowledge-Based Systems, 106075.
Xia, J., Wu, C. and Yan, M. (2019). Incorporating Relation Knowledge into Commonsense Reading Comprehension with Multi-task Learning. Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2393–96. Beijing, China.
Xie, P. and Xing, E. (2017). A constituent-centric neural architecture for reading comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1405–14.
Xie, Q., Lai, G., Dai, Z. and Hovy, E. (2018). Large-scale Cloze Test Dataset Created by Teachers. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2344–56.
Xiong, C., Zhong, V. and Socher, R. (2017). Dynamic coattention networks for question answering. Proceedings of the 5th International Conference on Learning Representations (ICLR).
Xiong, C., Zhong, V. and Socher, R. (2018). DCN+: Mixed objective and deep residual coattention for question answering. Proceedings of the International Conference on Learning Representations (ICLR).
Xiong, W., Yu, M., Guo, X., Wang, H., Chang, S., Campbell, M. and Wang, W.Y. (2019). Simple yet Effective Bridge Reasoning for Open-Domain Multi-Hop Question Answering. Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pp. 48–52. Hong Kong, China.
Xu, Y., Liu, W., Chen, G., Ren, B., Zhang, S., Gao, S. and Guo, J. (2019a). Enhancing Machine Reading Comprehension With Position Information. IEEE Access 7, 141602–141611.
Xu, Y., Liu, X., Shen, Y., Liu, J. and Gao, J. (2019b). Multi-task Learning with Sample Re-weighting for Machine Reading Comprehension. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2644–55. Minneapolis, Minnesota.
Yadav, M., Vig, L. and Shroff, G. (2017). Learning and Knowledge Transfer with Memory Networks for Machine Comprehension. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 850–59.
Yan, M., Xia, J., Wu, C., Bi, B., Zhao, Z., Zhang, J., Si, L., Wang, R., Wang, W. and Chen, H. (2019). A deep cascade model for multi-document reading comprehension. Proceedings of the AAAI Conference on Artificial Intelligence, pp. 7354–61.
Yan, M., Zhang, H., Jin, D. and Zhou, J.T. (2020). Multi-source Meta Transfer for Low Resource Multiple-Choice Question Answering. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7331–41.
Yang, A., Wang, Q., Liu, J., Liu, K., Lyu, Y., Wu, H., She, Q. and Li, S. (2019a). Enhancing Pre-Trained Language Representations with Rich Knowledge for Machine Reading Comprehension. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2346–57.
Yang, Y., Kang, S. and Seo, J. (2020). Improved Machine Reading Comprehension Using Data Validation for Weakly Labeled Data. IEEE Access 8, 5667–5677.
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. and Le, Q.V. (2019b). XLNet: Generalized Autoregressive Pretraining for Language Understanding. Advances in Neural Information Processing Systems, pp. 5753–63.
Yang, Z., Dhingra, B., Yuan, Y., Hu, J., Cohen, W.W. and Salakhutdinov, R. (2017a). Words or characters? Fine-grained gating for reading comprehension. Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France.
Yang, Z., Hu, J., Salakhutdinov, R. and Cohen, W.W. (2017b). Semi-supervised QA with generative domain-adaptive nets. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1040–50.
Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W., Salakhutdinov, R. and Manning, C.D. (2018). HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–80.
Yao, J., Feng, M., Feng, H., Wang, Z., Zhang, Y. and Xue, N. (2019). Smart: A stratified machine reading test. CCF International Conference on Natural Language Processing and Chinese Computing, pp. 67–79. Springer.
Yin, W., Ebert, S. and Schütze, H. (2016). Attention-based convolutional neural network for machine comprehension. Proceedings of the 2016 NAACL Human-Computer Question Answering Workshop, pp. 15–21.
Yu, A.W., Dohan, D., Luong, M.-T., Zhao, R., Chen, K., Norouzi, M. and Le, Q.V. (2018). QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. Proceedings of the International Conference on Learning Representations (ICLR).
Yu, J., Zha, Z. and Yin, J. (2019). Inferential machine comprehension: Answering questions by recursively deducing the evidence chain from text. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2241–51.
Yu, W., Jiang, Z., Dong, Y. and Feng, J. (2020). ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning. International Conference on Learning Representations (ICLR 2020).
Yuan, F., Shou, L., Bai, X., Gong, M., Liang, Y., Duan, N., Fu, Y. and Jiang, D. (2020a). Enhancing Answer Boundary Detection for Multilingual Machine Reading Comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 925–34.
Yuan, F., Xu, Y., Lin, Z., Wang, W. and Shi, G. (2019). Multi-perspective Denoising Reader for Multi-paragraph Reading Comprehension. International Conference on Neural Information Processing, pp. 222–34. Springer.
Yuan, X., Fu, J., Cote, M.-A., Tay, Y., Pal, C. and Trischler, A. (2020b). Interactive machine comprehension with information seeking agents. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2325–38.
Yue, X., Gutierrez, B.J. and Sun, H. (2020). Clinical Reading Comprehension: A Thorough Analysis of the emrQA Dataset. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
Zhang, C., Luo, C., Lu, J., Liu, A., Bai, B., Bai, K. and Xu, Z. (2020a). Read, Attend, and Exclude: Multi-Choice Reading Comprehension by Mimicking Human Reasoning Process. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1945–48.
Zhang, J., Zhu, X., Chen, Q., Ling, Z., Dai, L., Wei, S. and Jiang, H. (2017). Exploring question representation and adaptation with neural networks. 2017 3rd IEEE International Conference on Computer and Communications (ICCC), pp. 1975–84. IEEE.
Zhang, S., Zhao, H., Wu, Y., Zhang, Z., Zhou, X. and Zhou, X. (2020b). DCMN+: Dual co-matching network for multi-choice reading comprehension. AAAI.
Zhang, X. and Wang, Z. (2020). Reception: Wide and Deep Interaction Networks for Machine Reading Comprehension (Student Abstract). AAAI, pp. 13987–88.
Zhang, X., Wu, J., He, Z., Liu, X. and Su, Y. (2018). Medical Exam Question Answering with Large-scale Reading Comprehension. The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18).
Zhang, X., Yang, A., Li, S. and Wang, Y. (2019a). Machine Reading Comprehension: A Literature Review. arXiv preprint arXiv:1907.01686.
Zhang, Z., Wu, Y., Zhou, J., Duan, S., Zhao, H. and Wang, R. (2020c). SG-Net: Syntax-Guided Machine Reading Comprehension. AAAI, pp. 9636–43.
Zhang, Z., Zhao, H., Ling, K., Li, J., Li, Z., He, S. and Fu, G. (2019b). Effective subword segmentation for text comprehension. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27, 1664–1674.
Zheng, B., Wen, H., Liang, Y., Duan, N., Che, W., Jiang, D., Zhou, M. and Liu, T. (2020). Document Modeling with Graph Attention Networks for Multi-grained Machine Reading Comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6708–18.
Zhou, M., Huang, M. and Zhu, X. (2020a). Robust reading comprehension with linguistic constraints via posterior regularization. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28.
Zhou, X., Luo, S. and Wu, Y. (2020b). Co-Attention Hierarchical Network: Generating Coherent Long Distractors for Reading Comprehension. Proceedings of the AAAI Conference on Artificial Intelligence, pp. 9725–32.
Zhu, H., Dong, L., Wei, F., Wang, W., Qin, B. and Liu, T. (2019). Learning to Ask Unanswerable Questions for Machine Reading Comprehension. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4238–48. Florence, Italy.
Zhuang, Y. and Wang, H. (2019). Token-level dynamic self-attention network for multi-passage reading comprehension. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2252–62.
Figure 1. Samples from SQuAD (Rajpurkar et al. 2016) and CNN/Daily Mail (Chen, Bolton and Manning 2016) datasets. The original article for the CNN/Daily Mail example is available at https://edition.cnn.com/2015/03/10/entertainment/feat-star-wars-gay-character
Table 1. Number of reviewed papers over different years.
Table 2. Statistics of different embedding methods used by the reviewed papers.
Figure 2. The percentage of different character embedding methods over different years.
Table 3. Statistics of different word representation methods in the reviewed papers.
Table 4. Statistics of different attention mechanisms used in the reasoning phase of MRC systems.
Table 5. Statistics of different prediction phase categories in the reviewed papers.
Table 6. Statistics of input/output types in MRC systems.
Table 7. MRC datasets proposed from 2016 to 2020 (A: Answer, P: Passage, Q: Question).
Figure 3. The progress made on two datasets: SQuAD1.1 (top) and RACE (bottom). The data points are taken from https://rajpurkar.github.io/SQuAD-explorer and http://www.qizhexie.com/data/RACE_leaderboard.html, respectively. Only the articles covered by our review are reported.
Table 8. Statistics of evaluation measures used in the reviewed papers.
Table 9. Statistics of different research contributions to the MRC task in the reviewed papers.
Figure 4. Ratio of reviewed papers (%) for extractive/generative evaluation metrics.
Table 10. Hot papers based on the number of citations in Google Scholar as of September 2020.
Table A1. Reviewed papers categorized based on their embedding phase.
Table A2. Reviewed papers categorized based on their reasoning phase.
Table A3. Reviewed papers categorized based on their prediction phase.
Table A4. Reviewed papers categorized based on their input/output.
Table A5. Reviewed papers categorized based on their evaluation metric.
Table A6. Reviewed papers categorized based on their novelties.