Dimitris Dimitriadis

Language Processing R&D

Full Stack Web Developer


Datasets In Question Answering


Question Answering (QA) is an AI-complete problem, meaning that current state-of-the-art approaches cannot fully solve it. Until a few years ago, one of the main obstacles in the field was the lack of large datasets. While other Natural Language Processing tasks (e.g., named-entity recognition, language modeling) have plenty of resources, QA has only recently acquired an adequate number of training instances. Furthermore, due to the difficulty of the domain, researchers direct their efforts towards specific QA settings. For example, most of the large, publicly available datasets target extractive QA, a formulation in which the answer to a question is assumed to be part of a passage related to that question. Models aim to predict the span of the passage text that answers the question. Other datasets contain triples of questions, passages, and answers, where the answers have been generated from the passage texts combined with world knowledge. This is a natural language generation problem: models aim to produce an answer text that is as close as possible to the gold answer.
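To make the extractive setting concrete, here is a minimal sketch (with an invented passage) of how such datasets typically represent an answer: not as free-form text, but as a character span inside the passage, stored as the answer text plus a start offset.

```python
# A toy illustration of the extractive-QA setting: the answer is a
# character span inside the passage, not freely generated text.
passage = ("The Transformer architecture was introduced in 2017 "
           "and quickly became the backbone of modern QA models.")
answer_text = "2017"

# Datasets usually store the answer as (text, start offset);
# the model's job is to predict the start/end positions.
start = passage.find(answer_text)
end = start + len(answer_text)

span = passage[start:end]
print(start, end, repr(span))
```

Recovering the answer is then just slicing the passage with the predicted span, which is exactly what extractive models are trained to output.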

As we can see, there are plenty of special cases in QA; it is therefore useful to identify the datasets that can help us train effective models for a particular QA task.

The aim of this article is to present these datasets, giving a short description of each of them along with links to download and experiment with them.

Natural Questions

This dataset was constructed in 2019 by the Google Research team for extractive QA. It contains thousands of real user queries issued to the Google search engine. The dataset is split into train, dev, and test sets, with approximately 310,000, 8,000, and 8,000 instances respectively. An instance consists of a question (user query), a Wikipedia page from the top five search results, a long answer with the length of a paragraph, and a short answer which is one or more entities or yes/no [1]. The Wikipedia page is in HTML format, while the dataset itself is in the JSON Lines text format. (In Python, one can use the jsonlines library to process it.) The dataset is available here.
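Since the dataset ships as JSON Lines (one JSON object per line), parsing it is straightforward even with the standard library; the jsonlines package wraps the same idea in a reader object. A minimal sketch is below — the sample records and the `question_text`/`annotations` field names follow the published NQ schema, but treat them as assumptions and verify against the release you download.

```python
import json
import io

# Natural Questions is distributed in JSON Lines format: one example per line.
# Here we simulate a tiny file in memory; in practice you would open the
# downloaded .jsonl file instead. Field names are illustrative (NQ-style).
sample_file = io.StringIO(
    '{"question_text": "who founded google", "annotations": []}\n'
    '{"question_text": "when was python released", "annotations": []}\n'
)

# Parse each non-empty line as an independent JSON object.
examples = [json.loads(line) for line in sample_file if line.strip()]
for ex in examples:
    print(ex["question_text"])
```

Because each line is self-contained, the file can be streamed record by record, which matters for a corpus of this size.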

SQuAD v1.0

This is another extractive QA dataset, constructed by a Stanford research team. Crowdworkers posed questions about a set of Wikipedia articles, and the answer to each question is a span of text within a passage from those articles. The dataset contains approximately 100,000 questions [2]. Human performance on the dataset is 86.8%. The dataset is available here.
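Unlike Natural Questions, SQuAD is distributed as a single JSON file with a nested structure: articles contain paragraphs, and each paragraph carries its question/answer pairs. A short sketch (with a made-up article) of flattening that structure into (question, context, answer) triples ready for training:

```python
# A SQuAD-style record: data -> paragraphs -> qas -> answers.
# The nesting below mirrors the published SQuAD JSON layout.
squad_like = {
    "data": [{
        "title": "Example_Article",
        "paragraphs": [{
            "context": "SQuAD was released by Stanford in 2016.",
            "qas": [{
                "id": "q1",
                "question": "When was SQuAD released?",
                "answers": [{"text": "2016", "answer_start": 34}],
            }],
        }],
    }]
}

# Flatten the nesting into simple (question, context, answer_text) triples.
triples = []
for article in squad_like["data"]:
    for para in article["paragraphs"]:
        for qa in para["qas"]:
            for ans in qa["answers"]:
                triples.append((qa["question"], para["context"], ans["text"]))

print(triples[0])
```

Note that `answer_start` is a character offset into the context, so the gold span can always be recovered by slicing the context string.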

SQuAD v2.0

The previous version of the dataset contained easy questions, and as a result state-of-the-art models achieved better results than humans. In particular, the main weakness of those models was that they tried to guess an answer within the context of a passage even when no answer was stated in that context. The Stanford team observed this and introduced a second version of the dataset [3]. It contains the questions from version 1 plus about 50,000 unanswerable questions. Models must now answer a question only if an answer actually exists. Based on the leaderboard, the best accuracy is 89%, surpassing human judgment. The dataset is available here.
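In the v2.0 format, each question carries an `is_impossible` flag, and unanswerable questions have an empty answer list. A small sketch (on invented records) of separating the two kinds of questions, which is a typical first preprocessing step:

```python
# Two SQuAD-2.0-style QA entries: one answerable, one unanswerable.
# In v2.0 the "is_impossible" flag marks questions with no answer in the passage.
qas = [
    {"question": "When was SQuAD released?", "is_impossible": False,
     "answers": [{"text": "2016", "answer_start": 34}]},
    {"question": "Who wrote the missing chapter?", "is_impossible": True,
     "answers": []},
]

answerable = [qa for qa in qas if not qa["is_impossible"]]
unanswerable = [qa for qa in qas if qa["is_impossible"]]
print(len(answerable), len(unanswerable))
```

A v2.0 model is trained on both lists: it must extract a span for the first kind and abstain on the second.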


BoolQ

This dataset contains yes/no questions. The creators gathered questions from other datasets such as Natural Questions, or constructed triples of questions, passages, and answers from other sources, in order to build a dataset consisting only of yes/no questions. They argue that answering these questions requires inferring the answer from the passages, so reasoning is mandatory, and they identify six types of reasoning involved. Their best model was one trained on a Natural Language Inference dataset and fine-tuned on BoolQ [4]. The dataset is available here.
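BoolQ is also released in JSON Lines format, with a boolean `answer` field alongside `question`, `passage`, and `title` (field names per the released format — worth double-checking against the files you download). A quick sketch, on invented lines, of tallying the yes/no label balance, a useful sanity check before training a classifier:

```python
import json

# Simulated BoolQ-style JSON Lines records; "answer" is a JSON boolean.
lines = [
    '{"question": "is the sky blue", "passage": "The sky appears blue.", "answer": true}',
    '{"question": "do fish fly", "passage": "Fish live underwater.", "answer": false}',
    '{"question": "is water a liquid", "passage": "Water is a liquid.", "answer": true}',
]

# Count how many examples are labeled yes vs. no.
counts = {"yes": 0, "no": 0}
for line in lines:
    ex = json.loads(line)
    counts["yes" if ex["answer"] else "no"] += 1

print(counts)
```

Checking this balance matters for a binary task: a skewed split means raw accuracy can look deceptively good against a majority-class baseline.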


BioASQ

A biomedical-domain QA dataset constructed by biomedical experts for the BioASQ challenge, which dates back to 2013. The dataset contains questions of several types (yes/no, factoid, list, summary), and systems should answer each question with both a short answer (exact answer) and a paragraph-like answer (ideal answer) [5]. The dataset is available here. Note that to use the dataset, one must also register on the platform.
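The BioASQ training file is a single JSON object with a `questions` list, where each entry carries a `type` (yesno, factoid, list, summary) and a `body` holding the question text, plus the exact/ideal answers. These field names follow the challenge's released format but should be verified against the data you obtain after registering. A small sketch, on invented questions, of counting the question types — a common first look at the data:

```python
from collections import Counter

# A miniature BioASQ-style structure: one JSON object, "questions" list,
# each question tagged with one of the four types.
bioasq_like = {
    "questions": [
        {"body": "Is aspirin an anti-inflammatory drug?", "type": "yesno"},
        {"body": "Which gene is mutated in cystic fibrosis?", "type": "factoid"},
        {"body": "List drugs used to treat hypertension.", "type": "list"},
    ]
}

# Tally how many questions of each type the file contains.
type_counts = Counter(q["type"] for q in bioasq_like["questions"])
print(type_counts)
```

Because each type demands a different output format (a boolean, an entity, a list, or a summary paragraph), most systems branch on this field before producing an answer.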


[1] Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., … & Toutanova, K. (2019). Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7, 453-466.

[2] Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

[3] Rajpurkar, P., Jia, R., & Liang, P. (2018). Know What You Don’t Know: Unanswerable Questions for SQuAD. arXiv preprint arXiv:1806.03822.

[4] Clark, C., Lee, K., Chang, M. W., Kwiatkowski, T., Collins, M., & Toutanova, K. (2019). BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. arXiv preprint arXiv:1905.10044.

[5] Tsatsaronis, G., Balikas, G., Malakasiotis, P., Partalas, I., Zschunke, M., Alvers, M. R., … & Paliouras, G. (2015). An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16, 138.
