Open Domain Question Answering (ODQA)

1. Introduction to the task

Open Domain Question Answering (ODQA) is the task of finding an exact answer to an arbitrary question in Wikipedia articles. Given only a question, the system outputs the best answer it can find. The default ODQA implementation takes a batch of queries as input and returns the best answer for each query.

The English ODQA version consists of the following components:

  • a TF-IDF ranker, which selects the top-N most relevant paragraphs from the TF-IDF index;

  • a Binary Passage Retrieval (BPR) ranker, which selects the top-K most relevant paragraphs from the binary index;

  • a database of paragraphs (by default, from Wikipedia), which returns the texts of the N + K most relevant paragraphs by the IDs selected by the TF-IDF and BPR rankers;

  • a Reading Comprehension component, which finds answers in the paragraphs and assigns a confidence to each answer.

The Russian ODQA version performs retrieval with the TF-IDF index only.
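
As a rough illustration of how the retrieval components fit together, the sketch below merges the paragraph IDs returned by two rankers and looks up the paragraph texts. The function names, the toy database and the de-duplication step are assumptions made for this example; they are not part of the DeepPavlov API.

```python
def merge_candidate_ids(tfidf_ids, bpr_ids):
    """Join IDs from both rankers, keeping first-seen order.

    Dropping duplicates is an assumption of this sketch.
    """
    seen = set()
    merged = []
    for pid in list(tfidf_ids) + list(bpr_ids):
        if pid not in seen:
            seen.add(pid)
            merged.append(pid)
    return merged


def fetch_paragraphs(database, paragraph_ids):
    """Return the texts for the given IDs, skipping unknown IDs."""
    return [database[pid] for pid in paragraph_ids if pid in database]


# Hypothetical toy data standing in for the Wikipedia paragraph database.
toy_database = {
    "p1": "Luke Skywalker is the son of Darth Vader.",
    "p2": "Kangaroos live in Australia.",
    "p3": "TF-IDF is a term weighting scheme.",
}
tfidf_ids = ["p1", "p3"]  # top-N IDs from the TF-IDF ranker (N = 2)
bpr_ids = ["p1", "p2"]    # top-K IDs from the BPR ranker (K = 2)

candidate_ids = merge_candidate_ids(tfidf_ids, bpr_ids)
paragraphs = fetch_paragraphs(toy_database, candidate_ids)
```

The paragraphs collected this way are what the Reading Comprehension component would then read to extract candidate answers.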

Binary Passage Retrieval is a resource-efficient method of building a dense passage index. A dual encoder (with BERT or another Transformer as the backbone) is trained on a question answering dataset (Natural Questions in our case) to maximize the dot product between the embedding of a question and the embedding of a passage that contains the answer, and to minimize it otherwise. The question and passage embeddings are obtained as follows: the vector of the BERT CLS token is fed into a dense layer followed by a hash function, which turns the dense vector into a binary one.
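
The hashing and lookup idea can be sketched with NumPy. This is a minimal illustration, not the DeepPavlov implementation: it assumes the hash function is the element-wise sign of the dense vector and that candidates are ranked by Hamming distance; the function names and toy vectors are hypothetical.

```python
import numpy as np


def to_binary_code(dense):
    """Hash dense vectors to {-1, +1} codes with the sign function (assumed hash)."""
    return np.where(np.asarray(dense) >= 0, 1, -1).astype(np.int8)


def top_k_by_hamming(query_code, passage_codes, k):
    """Return indices of the k passages whose codes differ least from the query."""
    # Hamming distance: number of positions where the two codes disagree.
    # For +/-1 codes of length d, dot product = d - 2 * Hamming distance,
    # so a small Hamming distance means a large dot product.
    distances = np.count_nonzero(passage_codes != query_code, axis=1)
    return np.argsort(distances, kind="stable")[:k]


# Toy dense embeddings (in practice these would come from the encoder).
passage_dense = np.array([
    [0.5, -1.0, 2.0, 0.1],
    [-0.5, 1.0, -2.0, -0.1],
    [0.5, 1.0, 2.0, 0.1],
])
query_dense = np.array([0.3, -0.2, 1.5, 0.9])

passage_codes = to_binary_code(passage_dense)
query_code = to_binary_code(query_dense)
top_ids = top_k_by_hamming(query_code, passage_codes, k=2)
```

Storing one bit per dimension instead of a 32-bit float is what makes such an index compact compared with a dense one.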

2. Get started with the model

First make sure you have the DeepPavlov Library installed. More info about the first installation.

[ ]:
!pip install -q deeppavlov

The example below is given for the basic ODQA config en_odqa_infer_wiki. Check what other ODQA configs are available and simply replace en_odqa_infer_wiki with the config name of your preference. What is a Config File?

Before using the model, make sure that all required packages are installed by running the command:

[ ]:
!python -m deeppavlov install en_odqa_infer_wiki

There are alternative ways to install the model’s packages that do not require executing a separate command – see the options in the next sections of this page.

3. Models list

The table presents a list of all of the ODQA models available in the DeepPavlov Library.

Config                             Description
odqa/en_odqa_infer_wiki.json       Basic config for the English language. Consists of Binary Passage Retrieval, TF-IDF retrieval and a reader.
odqa/en_odqa_pop_infer_wiki.json   Extended config for the English language. Consists of Binary Passage Retrieval, TF-IDF retrieval, a popularity ranker and a reader.
odqa/ru_odqa_infer_wiki.json       Basic config for the Russian language. Consists of a TF-IDF ranker and a reader.

The table presents the scores on the Natural Questions and SberQuAD datasets, together with memory consumption and the time needed to process one query.

Config                         Number of paragraphs   Dataset             F1     EM     RAM    GPU   Time for 1 query
odqa/en_odqa_infer_wiki.json   200                    Natural Questions   45.2   37.0   10.4   2.4   4.9 s
odqa/ru_odqa_infer_wiki.json   100                    SberQuAD            59.2   49.0   13.1   5.3   2.0 s

4. Use the model for prediction

4.1 Predict using Python

English

[ ]:
from deeppavlov import build_model

odqa_en = build_model('en_odqa_infer_wiki', download=True, install=True)

Input: List[questions]

Output: Tuple[List[answers], List[answer scores], List[answer places in paragraph]]

[ ]:
odqa_en(["What is the name of Darth Vader's son?"])
[['Luke Skywalker'], [4.196979999542236]]

Russian

[ ]:
from deeppavlov import build_model

odqa_ru = build_model('ru_odqa_infer_wiki', download=True, install=True)
[ ]:
odqa_ru(["Где живут кенгуру?"])
[['на востоке и юге Австралии'], [0.9999760985374451]]
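
To work with the results, it can help to pair each answer with its score. The helper below is a small sketch that operates on outputs shaped like the ones printed above (a list of answers followed by a list of scores); the function name is hypothetical, and the sample values are copied from the examples rather than produced by running the model.

```python
def pair_answers_with_scores(model_output):
    """Zip the answers list with the scores list, one pair per question."""
    answers, scores = model_output[0], model_output[1]
    return list(zip(answers, scores))


# Outputs copied from the English and Russian examples above.
english_output = [['Luke Skywalker'], [4.196979999542236]]
russian_output = [['на востоке и юге Австралии'], [0.9999760985374451]]

for output in (english_output, russian_output):
    for answer, score in pair_answers_with_scores(output):
        print(f"{answer} (score: {score:.3f})")
```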

4.2 Predict using CLI

You can also get predictions in an interactive mode through the CLI (Command Line Interface).

[ ]:
!python -m deeppavlov interact en_odqa_infer_wiki -d

-d is an optional download key (an alternative to download=True in Python code). It downloads the pre-trained model along with embeddings and all other files needed to run the model.

5. Customize the model

5.1 Description of config parameters

Parameters of bpr component:

  • load_path - path to the checkpoint of the query encoder and the BPR index;

  • query_encoder_file - filename of the query encoder (a Transformer-based model which takes a question as input and produces its binary embedding);

  • bpr_index - filename of the BPR index (a matrix of paragraph binary vectors);

  • pretrained_model - the Transformer model used in the query encoder;

  • max_query_length - maximum length (in sub-tokens) of the input to the query encoder;

  • top_n - how many paragraph IDs to return per question.

Parameters of tfidf_ranker component:

  • top_n - how many paragraph IDs to return per question.
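
One way to change such a parameter is to edit the config as a Python dictionary before building the model. The sketch below assumes the components are listed under the "chainer" and "pipe" keys and are identified by a "class_name" field; the helper name and the toy config are hypothetical, and the real ODQA config may be structured differently, so inspect it before applying this approach.

```python
import copy


def set_component_param(config, class_name, param, value):
    """Return a copy of config with param set on every matching component."""
    updated = copy.deepcopy(config)  # leave the caller's config untouched
    changed = 0
    for component in updated["chainer"]["pipe"]:
        if component.get("class_name") == class_name:
            component[param] = value
            changed += 1
    if changed == 0:
        raise KeyError(f"no component with class_name={class_name!r}")
    return updated


# Hypothetical, heavily simplified config used only for illustration.
toy_config = {
    "chainer": {
        "pipe": [
            {"class_name": "bpr", "top_n": 100},
            {"class_name": "tfidf_ranker", "top_n": 100},
        ]
    }
}

smaller_config = set_component_param(toy_config, "bpr", "top_n", 50)
```

Returning a modified copy keeps the original settings available for comparison.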

Parameters of logit_ranker component:

  • batch_size - the paragraphs from the database (some of which contain the answer to the question and some of which do not) are split into batches of size batch_size to extract a candidate answer from each paragraph;

  • squad_model - the model which finds the span of an answer in a paragraph;

  • sort_noans - whether to put paragraphs with no answer at the end of the paragraph list, which is sorted by confidence;

  • top_n - the number of possible answers to return for a question;

  • return_answer_sentence - whether to return the sentence from the paragraph that contains the answer.
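
The interplay of sort_noans and top_n can be illustrated with a small sketch. It is one plausible reading of the parameter descriptions above, not the library's code: the function name is hypothetical, and representing "no answer" as an empty string is an assumption of this example.

```python
def select_answers(candidates, top_n=1, sort_noans=False):
    """Rank (answer_text, confidence) pairs and keep the best top_n.

    An empty answer_text stands for "no answer" (assumption of this sketch).
    """
    if sort_noans:
        # Non-empty answers first, then by confidence, both descending.
        ranked = sorted(candidates, key=lambda c: (c[0] != "", c[1]), reverse=True)
    else:
        # Confidence only: a confident "no answer" can outrank real answers.
        ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return ranked[:top_n]


# Toy candidates, one per paragraph.
candidates = [("", 9.0), ("Luke Skywalker", 4.2), ("Anakin", 1.5)]

without_flag = select_answers(candidates, top_n=2, sort_noans=False)
with_flag = select_answers(candidates, top_n=2, sort_noans=True)
```

With the flag enabled, the empty candidate moves behind both real answers even though its confidence is the highest.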

5.2 Building the index and training the reader model

There are two customizable components in ODQA configs:

  • TF-IDF ranker;

  • Reading comprehension model.

If you would like to build the TF-IDF index for your own text database, read here.

In addition, to train the Reader on your data, read here.