Knowledge Base Question Answering (KBQA)

1. Introduction to the task

The knowledge base:

  • is a comprehensive repository of information about a given domain or a number of domains;

  • reflects the ways we model knowledge about a given subject or subjects, in terms of concepts, entities, properties, and relationships;

  • enables us to use this structured knowledge where appropriate, e.g. for answering factoid questions.

Currently, we support Wikidata as a Knowledge Base (Knowledge Graph). In the future, we will expand support for custom knowledge bases.

The question answerer:

  • validates questions against a preconfigured list of question templates, disambiguates entities using entity linking, and answers questions asked in natural language;

  • can be used with Wikidata (English, Russian) and (in the future versions) with custom knowledge graphs.

Here are some of the most popular types of questions supported by the model:

  • Complex questions with numerical values: “What position did Angela Merkel hold on November 10, 1994?”

  • Complex question where the answer is a number or a date: “When did Jean-Paul Sartre move to Le Havre?”

  • Questions with counting of answer entities: “How many sponsors are for Juventus F.C.?”

  • Questions with ordering of answer entities in ascending or descending order of some parameter: “Which country has highest individual tax rate?”

  • Simple questions: “What is crew member Yuri Gagarin’s Vostok?”

The following models are used to find the answer (the links are for the English language model):

  • BERT model for prediction of the query template type. The model classifies questions into 8 classes corresponding to 8 query template types;

  • BERT entity detection model for extraction of entity substrings from the questions;

  • The substring extracted by the entity detection model is used for entity linking, which matches the substring with Wikidata entities based on the Levenshtein distance between the substring and entity titles. The result of the matching procedure is a set of candidate entities, which is then searched for entities connected by one of the top-k relations predicted by the classification model (see the sketch after this list);

  • BERT model for ranking candidate relation paths;

  • Query generator model fills the query template with candidate entities and relations and searches for valid combinations. The query generator uses the Wikidata HDT file.
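
To make the entity linking step concrete, here is a minimal, self-contained sketch of Levenshtein-based candidate ranking. It is an illustration only, not the DeepPavlov implementation, and the candidate titles below are made up:

[ ]:
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Hypothetical candidate entity titles for an extracted substring
candidates = ['Forrest Gump', 'Forrest Gump (novel)', 'Forrest Gump (character)']
substring = 'forrest gump'
ranked = sorted(candidates, key=lambda title: levenshtein(substring, title.lower()))
print(ranked[0])  # the candidate closest to the extracted substring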

2. Get started with the model

First, make sure you have the DeepPavlov Library installed. More info about the first installation is available in the library documentation.

[ ]:
! pip install -q deeppavlov

Then make sure that all the required packages for the model are installed.

[ ]:
! python -m deeppavlov install kbqa_cq_en
! python -m deeppavlov install kbqa_cq_ru

kbqa_cq_en and kbqa_cq_ru here are the names of the models’ config files. What is a Config File?

The configuration file defines the model and describes its hyperparameters. To use another model, change the name of the config file here and in the cells below. The full list of KBQA models with their config names can be found in the table below.
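
For instance, you can open a config and list the components of its pipeline before deciding what to customize. This is a small sketch; the exact set of components depends on your library version:

[ ]:
import json

from deeppavlov import configs

# configs.kbqa.kbqa_cq_en is the path to the config file bundled with the library
with open(configs.kbqa.kbqa_cq_en) as f:
    config = json.load(f)

for component in config['chainer']['pipe']:
    print(component.get('class_name'))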

3. Models list

The table presents a list of all of the KBQA-models available in DeepPavlov Library.

| Config name | Database | Language | RAM    | GPU    |
|-------------|----------|----------|--------|--------|
| kbqa_cq_en  | Wikidata | En       | 3.1 Gb | 3.4 Gb |
| kbqa_cq_ru  | Wikidata | Ru       | 4.3 Gb | 8.0 Gb |

4. Use the model for prediction

4.1 Predict using Python

After installing the model, build it from the config and predict.

[ ]:
from deeppavlov import configs, build_model

kbqa = build_model('kbqa_cq_en', download=True, install=True)

Input: List[sentences]

Output: List[answers]

[ ]:
kbqa(['Who directed Forrest Gump?'])
[['Robert Zemeckis'],
 [['Q187364']],
 [['SELECT ?answer WHERE { wd:Q134773 wdt:P57 ?answer. }']]]
[ ]:
kbqa(['What position was held by Harry S. Truman on 1/3/1935?'])
[['United States senator'],
 [['Q4416090']],
 [['SELECT ?answer WHERE { wd:Q11613 p:P39 ?ent . ?ent ps:P39 ?answer . ?ent ?p ?x filter(contains(?x, n)). }']]]
[ ]:
kbqa(['What teams did Lionel Messi play for in 2004?'])
[['FC Barcelona B, Argentina national under-20 football team'],
 [['Q10467', 'Q1187790']],
 [['SELECT ?answer WHERE { wd:Q615 p:P54 ?ent . ?ent ps:P54 ?answer . ?ent ?p ?x filter(contains(?x, n)). }']]]
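
As the outputs above show, the model returns three parallel lists: the answers, the answer entity IDs, and the generated SPARQL queries. Based on those outputs, a call can be unpacked like this:

[ ]:
answers, answer_ids, queries = kbqa(['Who directed Forrest Gump?'])
print(answers[0])     # 'Robert Zemeckis'
print(answer_ids[0])  # ['Q187364']
print(queries[0])     # ['SELECT ?answer WHERE { wd:Q134773 wdt:P57 ?answer. }']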

The KBQA model for complex question answering in Russian can be used from Python with the following code:

[ ]:
from deeppavlov import configs, build_model

kbqa = build_model('kbqa_cq_ru', download=True, install=True)
[ ]:
kbqa(['Когда родился Пушкин?'])
[['26 мая 1799, 06 июня 1799'],
 [['+1799-05-26^^T', '+1799-06-06^^T']],
 [['SELECT ?answer WHERE { wd:Q7200 wdt:P569 ?answer. }']]]

4.2 Predict using CLI

You can also get predictions in an interactive mode through CLI.

[ ]:
! python -m deeppavlov interact kbqa_cq_en [-d]
! python -m deeppavlov interact kbqa_cq_ru [-d]

-d is an optional download key (alternative to download=True in Python code). It is used to download the pre-trained model along with embeddings and all other files needed to run the model.

Or make predictions for samples from stdin.

[ ]:
! python -m deeppavlov predict kbqa_cq_en -f <file-name>
! python -m deeppavlov predict kbqa_cq_ru -f <file-name>

4.3 Using entity linking and Wiki parser as standalone tools for KBQA

The default configuration for KBQA is designed to use all of the supporting models together as part of the KBQA pipeline. However, you may want to use some of these models on their own, in addition to KBQA.

For example, you might want to use entity linking model as an annotator in your multiskill AI Assistant. Or, you might want to use Wiki Parser component to directly run SPARQL queries against your copy of Wikidata. To support these usages, you can also deploy supporting models as standalone components.

Don’t forget to replace the url parameter values in the examples below with correct URLs.

Config entity_linking_en can be used with the following commands:

[ ]:
! python -m deeppavlov install entity_linking_en -d
[ ]:
! python -m deeppavlov riseapi entity_linking_en [-d] [-p <port>]
[ ]:
import requests

entity_linking_url = 'http://0.0.0.0:5000/model'  # replace with the URL of your deployed service

payload = {"entity_substr": [["Forrest Gump"]], "tags": [["PERSON"]], "probas": [[0.9]],
           "sentences": [["Who directed Forrest Gump?"]]}
response = requests.post(entity_linking_url, json=payload).json()
print(response)

Config wiki_parser can be used with the following command:

[ ]:
! python -m deeppavlov riseapi wiki_parser [-d] [-p <port>]

Arguments of the annotator are parser_info (what we want to extract from Wikidata) and query.

Examples of queries:

To extract triplets for entities, the query argument should be a list of entity IDs, and parser_info should be a list of “find_triplets” strings.

[ ]:
wiki_parser_url = 'http://0.0.0.0:5000/model'  # replace with the URL of your deployed service
requests.post(wiki_parser_url, json={"parser_info": ["find_triplets"], "query": ["Q159"]}).json()

To extract all relations of the entities, the query argument should be a list of entity IDs, and parser_info should be a list of “find_rels” strings.

[ ]:
requests.post(wiki_parser_url, json={"parser_info": ["find_rels"], "query": ["Q159"]}).json()

To find labels for entity IDs, the query argument should be a list of entity IDs, and parser_info should be a list of “find_label” strings.

[ ]:
requests.post(wiki_parser_url, json={"parser_info": ["find_label"], "query": [["Q159", ""]]}).json()

In this example, the second element of the list (an empty string) can be replaced with a sentence.
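
For example, passing an arbitrary sentence as the context (the sentence below is just an illustration):

[ ]:
requests.post(wiki_parser_url, json={"parser_info": ["find_label"], "query": [["Q159", "Who is the president of Russia?"]]}).json()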

To execute SPARQL queries, the query argument should be a list of tuples with the info about the SPARQL queries, and parser_info should be a list of “query_execute” strings.

Let us consider an example of the question “What is the deepest lake in Russia?” with the corresponding SPARQL query SELECT ?ent WHERE { ?ent wdt:P31 wd:T1 . ?ent wdt:R1 ?obj . ?ent wdt:R2 wd:E1 } ORDER BY ASC(?obj) LIMIT 5

Arguments:

  • what_return: ["?obj"],

  • query_seq: [["?ent", "P17", "Q159"], ["?ent", "P31", "Q23397"], ["?ent", "P4511", "?obj"]],

  • filter_info: [],

  • order_info: order_info(variable='?obj', sorting_order='asc').

[ ]:
requests.post("wiki_parser_url", json = {"parser_info": ["query_execute"], "query": [[["?obj"], [["Q159", "P36", "?obj"]], [], [], True]]}).json()

To use the entity linking model in KBQA, you should add the following API Requester component to the pipe in the config file:

{
    "class_name": "api_requester",
    "id": "linker_entities",
    "url": "entity_linking_url",
    "out": ["entity_substr", "entity_ids", "entity_conf", "entity_pages", "entity_labels"],
    "param_names": ["entity_substr", "tags", "probas", "sentences"]
 }

To use the Wiki parser service in KBQA, you should add the following API Requester component to the pipe in the config file:

{
    "class_name": "api_requester",
    "id": "wiki_p",
    "url": "wiki_parser_url",
    "out": ["wiki_parser_output"],
    "param_names": ["parser_info", "query"]
 }

5. Customize the model

5.1 Description of config parameters

Parameters of entity_linker component:

  • num_entities_to_return: int - the number of entity IDs returned for each entity mention in the text;

  • lemmatize: bool - whether to lemmatize entity mentions before searching candidate entity IDs in the inverted index;

  • use_descriptions: bool - whether to perform ranking of candidate entities by similarity of their descriptions to the context;

  • use_connections: bool - whether to use connections between candidate entities for different mentions when ranking;

  • use_tags: bool - whether to search only those entity IDs in the inverted index, which have the same tag as the entity mention;

  • prefixes: Dict[str, Any] - prefixes in the knowledge base for entities and relations;

  • alias_coef: float - the coefficient by which the substring matching score of an entity is multiplied if the entity mention in the text matches the entity title.

Parameters of rel_ranking_infer component:

  • return_elements: List[str] - what elements should be returned by the component in the output tuple (answers are returned by default, optional elements are "confidences", "answer_ids", "entities_and_rels" (entities and relations from SPARQL queries), "queries" (SPARQL queries), "triplets" (triplets from SPARQL queries));

  • batch_size: int - the candidate relations list is split into batches of size batch_size for ranking;

  • softmax: bool - whether to apply softmax function to the confidences list of candidate relations for a question;

  • use_api_requester: bool - whether wiki_parser is called through api_requester;

  • rank: bool - whether to perform ranking of candidate relation paths;

  • nll_rel_ranking: bool - DeepPavlov has two types of relation ranking models: 1) a model which takes a question and a relation and is trained to classify the pair into two classes (relevant / irrelevant relation); 2) a model which takes a question and a list of relations (one relevant, the others irrelevant) and is trained with NLL loss to identify the relevant relation in the list; the output formats of the two cases differ;

  • nll_path_ranking: bool - the same case as nll_rel_ranking for ranking of relation paths;

  • top_possible_answers: int - SPARQL query execution can result in several valid answers, so top_possible_answers is the number of these answers which we leave in the output;

  • top_n: int - number of candidate SPARQL queries (and corresponding answers) in the output for a question;

  • pos_class_num: int - if we use the model which classifies a question-relation pair into two classes (relevant / irrelevant), we should set the index of the positive class (0 or 1);

  • rel_thres: float - only relations with confidence above this threshold are kept;

  • type_rels: List[str] - relations which connect entity and its type in the knowledge graph.

Parameters of query_generator component:

  • entities_to_leave: int - how many entity IDs to use when making combinations of entities and relations for filling in the slots of the SPARQL query template;

  • rels_to_leave: int - how many relations to use when making combinations of entities and relations for filling in the slots of the SPARQL query template;

  • max_comb_num: int - the maximum number of combinations of entities and relations for filling in the slots of the SPARQL query template;

  • map_query_str_to_kb: List[Tuple[str, str]] - a list of elements like ["wd:", "http://we/"], where the first element is the prefix of an entity ("wd:") or relation in the SPARQL query template, and the second is the corresponding prefix in the knowledge base ("http://we/");

  • kb_prefixes: Dict[str, str] - a dictionary {"entity": "wd:E", "rel": "wdt:R", …} - prefixes of entities, relations and types in the knowledge base;

  • gold_query_info: Dict[str, str] - names of unknown variables in SPARQL queries in the dataset (LC-QuAD2.0 or RuBQ2.0);

  • syntax_structure_known: bool - whether the syntactic structure of the question is known (True in kbqa_cq_ru.json, because this config performs syntax parsing with slovnet_syntax_parser).
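
These parameters can be overridden from Python without editing the file on disk, by parsing the config into a dict first. A sketch, assuming the entity linking component is registered under the class_name entity_linker in your config version:

[ ]:
from deeppavlov import build_model
from deeppavlov.core.commands.utils import parse_config

config = parse_config('kbqa_cq_en')
for component in config['chainer']['pipe']:
    if component.get('class_name') == 'entity_linker':  # assumed class_name
        component['num_entities_to_return'] = 3         # keep top-3 entity IDs per mention
kbqa = build_model(config, download=True)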

5.2 Train KBQA components

Train Query Prediction Model

The dataset for training the query prediction model consists of three .csv files: train.csv, valid.csv and test.csv. Each line in these files contains a question and the corresponding query template type, for example:

"What is the longest river in the UK?", 6

Train Entity Detection Model

The dataset is a pickle file split into three parts: train, validation, and test. Each part is a list of tuples of question tokens and tags for each token. An example of a training sample:

(['What', 'is', 'the', 'complete', 'list', 'of', 'records', 'released', 'by', 'Jerry', 'Lee', 'Lewis', '?'],
 ['O', 'O', 'O', 'O', 'B-T', 'I-T', 'I-T', 'O', 'O', 'B-E', 'I-E', 'I-E', 'O'])

B-T marks the first token of an entity type substring, I-T marks the inner tokens of an entity type substring, B-E and I-E do the same for entity substrings, and O marks all other tokens.
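
A minimal sketch of writing such a dataset file, assuming the three parts are stored under "train", "valid" and "test" keys as in the path ranking dataset below:

[ ]:
import pickle

sample = (['Who', 'directed', 'Forrest', 'Gump', '?'],
          ['O', 'O', 'B-E', 'I-E', 'O'])
dataset = {'train': [sample], 'valid': [sample], 'test': [sample]}
with open('entity_detection_dataset.pickle', 'wb') as f:
    pickle.dump(dataset, f)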

Train Path Ranking Model

The dataset (in pickle format) is a dict with three keys: "train", "valid" and "test". The value for each key is a list of samples. An example of a sample:

(['What is the Main St. Exile label, which Nik Powell co-founded?', ['record label', 'founded by']], '1')

The sample contains the question, the relations in the question, and a label (1 if the relations correspond to the question, 0 otherwise).
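
For illustration, a negative sample for the same question could pair it with irrelevant relations (the relations below are arbitrary):

[ ]:
positive = (['What is the Main St. Exile label, which Nik Powell co-founded?',
             ['record label', 'founded by']], '1')
negative = (['What is the Main St. Exile label, which Nik Powell co-founded?',
             ['place of birth', 'spouse']], '0')  # hypothetical irrelevant relations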

Adding Templates For New SPARQL Queries

Templates can be added to the sparql_queries.json file, which is a dictionary where keys are template types and values are templates with additional information. An example of a template:

{
    "query_template": "SELECT ?obj WHERE { wd:E1 p:R1 ?s . ?s ps:R1 ?obj . ?s ?p ?x filter(contains(?x, N)) }",
    "rank_rels": ["wiki", "do_not_rank", "do_not_rank"],
    "rel_types": ["no_type", "statement", "qualifier"],
    "query_sequence": [1, 2, 3],
    "return_if_found": true,
    "template_num": "0",
    "alternative_templates": []
 }
  • query_template is the template of the SPARQL query;

  • rank_rels is a list which defines whether to rank relations: in this example, p:R1 relations are extracted from Wikidata for wd:E1 entities and ranked with RelRanker, while ps:R1 and ?p relations are neither extracted nor ranked;

  • rel_types - direct, statement or qualifier relations;

  • query_sequence - the sequence in which the triplets will be extracted from the Wikidata hdt file;

  • return_if_found - the query generator iterates over all possible combinations of entities, relations and types; if true, it returns the first valid combination found, if false, it considers all combinations;

  • template_num - the type of a template;

  • alternative_templates - type numbers of alternative templates to use if the answer was not found using the current template.
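
A sketch of appending a new template programmatically; the file path is a placeholder for wherever sparql_queries.json lives in your setup, and the key "8" assumes template types 0-7 are already taken:

[ ]:
import json

path = 'sparql_queries.json'  # placeholder: use the actual location in your setup
with open(path) as f:
    templates = json.load(f)

templates['8'] = {  # key assumed to be the next unused template type
    "query_template": "SELECT ?obj WHERE { wd:E1 wdt:R1 ?obj }",
    "rank_rels": ["wiki"],
    "rel_types": ["no_type"],
    "query_sequence": [1],
    "return_if_found": True,
    "template_num": "8",
    "alternative_templates": []
}

with open(path, 'w') as f:
    json.dump(templates, f, indent=2)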