{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#### Knowledge Base Question Answering (KBQA)\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deeppavlov/DeepPavlov/blob/master/docs/features/models/KBQA.ipynb)\n", "\n", "# Table of contents \n", "\n", "1. [Introduction to the task](#1.-Introduction-to-the-task)\n", "\n", "2. [Get started with the model](#2.-Get-started-with-the-model)\n", "\n", "3. [Models list](#3.-Models-list)\n", "\n", "4. [Use the model for prediction](#4.-Use-the-model-for-prediction)\n", "\n", " 4.1. [Predict using Python](#4.1-Predict-using-Python)\n", " \n", " 4.2. [Predict using CLI](#4.2-Predict-using-CLI)\n", "\n", " 4.3. [Using entity linking and Wiki parser as standalone services for KBQA](#4.3-Using-entity-linking-and-Wiki-parser-as-standalone-tools-for-KBQA)\n", " \n", "5. [Customize the model](#5.-Customize-the-model)\n", " \n", " 5.1. [Description of config parameters](#5.1-Description-of-config-parameters)\n", " \n", " 5.2. [Train KBQA components](#5.2-Train-KBQA-components)\n", "\n", "# 1. Introduction to the task\n", "\n", "The knowledge base:\n", "\n", "* is a comprehensive repository of information about given domain or a number of domains;\n", "\n", "* reflects the ways we model knowledge about given subject or subjects, in terms of concepts, entities, properties, and relationships;\n", "\n", "* enables us to use this structured knowledge where appropriate, e.g. answering factoid questions.\n", "\n", "Currently, we support Wikidata as a Knowledge Base (Knowledge Graph). In the future, we will expand support for custom knowledge bases.\n", "\n", "The question answerer:\n", "\n", "* validates questions against the preconfigured list of question templates, disambiguates entities using entity linking and answers questions asked in natural language;\n", "\n", "* can be used with Wikidata (English, Russian) and (in the future versions) with custom knowledge graphs.\n", "\n", "Here are some of the most popular types of questions supported by the model:\n", "\n", "* **Complex questions with numerical values:** “What position did Angela Merkel hold on November 10, 1994?”\n", "* **Complex question where the answer is a number or a date:** “When did Jean-Paul Sartre move to Le Havre?”\n", "* **Questions with counting of answer entities:** “How many sponsors are for Juventus F.C.?”\n", "* **Questions with ordering of answer entities by ascending or descending of some parameter:** “Which country has highest individual tax rate?”\n", "* **Simple questions:** “What is crew member Yuri Gagarin’s Vostok?”\n", "\n", "The following models are used to find the answer (the links are for the English language model):\n", "\n", "* [BERT model](https://github.com/deeppavlov/DeepPavlov/blob/1.0.0rc1/deeppavlov/configs/classifiers/query_pr.json) for prediction of query template type. Model performs classification of questions into 8 classes correponding to 8 query template types;\n", "* [BERT entity detection model](https://github.com/deeppavlov/DeepPavlov/blob/1.0.0rc1/deeppavlov/configs/entity_extraction/entity_detection_en.json) for extraction of entity substrings from the questions;\n", "* Substring extracted by the entity detection model is used for [entity linking](https://github.com/deeppavlov/DeepPavlov/blob/1.0.0rc1/deeppavlov/configs/entity_extraction/entity_linking_en.json). Entity linking performs matching the substring with one of the Wikidata entities. Matching is based on the Levenshtein distance between the substring and an entity title. The result of the matching procedure is a set of candidate entities. There is also the search for the entity among this set with one of the top-k relations predicted by classification model;\n", "* [BERT model](https://github.com/deeppavlov/DeepPavlov/blob/1.0.0rc1/deeppavlov/configs/ranking/rel_ranking_bert_en.json) for ranking candidate relation paths;\n", "* Query generator model is used to fill query template with candidate entities and relations to find valid combinations of entities and relations for query template. Query generation model uses Wikidata HDT file.\n", "\n", "# 2. Get started with the model\n", "\n", "First make sure you have the DeepPavlov Library installed.\n", "[More info about the first installation](https://deeppavlov-test.readthedocs.io/en/latest/notebooks/Get%20Started%20with%20DeepPavlov.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! pip install --q deeppavlov" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then make sure that all the required packages for the model are installed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! python -m deeppavlov install kbqa_cq_en\n", "! python -m deeppavlov install kbqa_cq_ru" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`kbqa_cq_en` and `kbqa_cq_rus` here are the names of the model's *config_files*. [What is a Config File?](https://docs.deeppavlov.ai/en/master/intro/configuration.html) \n", "\n", "Configuration file defines the model and describes its hyperparameters. To use another model, change the name of the *config_file* here and further.\n", "The full list of KBQA models with their config names can be found in the [table](#3.-Models-list).\n", "\n", "# 3. Models list\n", "\n", "The table presents a list of all of the KBQA-models available in DeepPavlov Library.\n", "\n", "| Config name | Database | Language | RAM | GPU |\n", "| :--- | --- | --- | --- | --- |\n", "| [kbqa_cq_en](https://github.com/deeppavlov/DeepPavlov/blob/1.0.0rc1/deeppavlov/configs/kbqa/kbqa_cq_en.json) | Wikidata | En | 3.1 Gb | 3.4 Gb |\n", "| [kbqa_cq_ru](https://github.com/deeppavlov/DeepPavlov/blob/1.0.0rc1/deeppavlov/configs/kbqa/kbqa_cq_en.json) | Wikidata | Ru | 4.3 Gb | 8.0 Gb |\n", "\n", "\n", "# 4. Use the model for prediction\n", "\n", "## 4.1 Predict using Python\n", "\n", "After [installing](#2.-Get-started-with-the-model) the model, build it from the config and predict." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from deeppavlov import configs, build_model\n", "\n", "kbqa = build_model('kbqa_cq_en', download=True, install=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Input**: List[sentences]\n", "\n", "**Output**: List[answers]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['Robert Zemeckis'],\n", " [['Q187364']],\n", " [['SELECT ?answer WHERE { wd:Q134773 wdt:P57 ?answer. }']]]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kbqa(['Who directed Forrest Gump?'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['United States senator'],\n", " [['Q4416090']],\n", " [['SELECT ?answer WHERE { wd:Q11613 p:P39 ?ent . ?ent ps:P39 ?answer . ?ent ?p ?x filter(contains(?x, n)). }']]]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kbqa(['What position was held by Harry S. Truman on 1/3/1935?'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['FC Barcelona B, Argentina national under-20 football team'],\n", " [['Q10467', 'Q1187790']],\n", " [['SELECT ?answer WHERE { wd:Q615 p:P54 ?ent . ?ent ps:P54 ?answer . ?ent ?p ?x filter(contains(?x, n)). }']]]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kbqa(['What teams did Lionel Messi play for in 2004?'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "KBQA model for complex question answering in Russian can be used from Python using the following code:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from deeppavlov import configs, build_model\n", "\n", "kbqa = build_model('kbqa_cq_ru', download=True, install=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['26 мая 1799, 06 июня 1799'],\n", " [['+1799-05-26^^T', '+1799-06-06^^T']],\n", " [['SELECT ?answer WHERE { wd:Q7200 wdt:P569 ?answer. }']]]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kbqa(['Когда родился Пушкин?'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.2 Predict using CLI\n", "\n", "You can also get predictions in an interactive mode through CLI." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! python -m deeppavlov interact kbqa_сq_en [-d]\n", "! python -m deeppavlov interact kbqa_cq_ru [-d]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`-d` is an optional download key (alternative to `download=True` in Python code). It is used to download the pre-trained model along with embeddings and all other files needed to run the model.\n", "\n", "Or make predictions for samples from *stdin*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! python -m deeppavlov predict kbqa_сq_en -f \n", "! python -m deeppavlov predict kbqa_cq_ru -f " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.3 Using entity linking and Wiki parser as standalone tools for KBQA\n", "\n", "Default configuration for KBQA was designed to use all of the supporting models together as a part of the KBQA pipeline. However, there might be a case when you want to work with some of these models in addition to KBQA.\n", "\n", "For example, you might want to use entity linking model as an annotator in your [multiskill AI Assistant](https://github.com/deeppavlov/dream). Or, you might want to use Wiki Parser component to directly run SPARQL queries against your copy of Wikidata. To support these usages, you can also deploy supporting models as standalone components.\n", "\n", "Don’t forget to replace the `url` parameter values in the examples below with correct URLs.\n", "\n", "Config [entity_linking_en](https://github.com/deeppavlov/DeepPavlov/blob/1.0.0rc1/deeppavlov/configs/entity_extraction/entity_linking_en.json) can be used with the following commands:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! python -m deeppavlov install entity_linking_en -d" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! python -m deeppavlov riseapi entity_linking_en [-d] [-p ]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "\n", "payload = {\"entity_substr\": [[\"Forrest Gump\"]], \"tags\": [[\"PERSON\"]], \"probas\": [[0.9]],\n", " \"sentences\": [[\"Who directed Forrest Gump?\"]]}\n", "response = requests.post(entity_linking_url, json=payload).json()\n", "print(response)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Config [wiki_parser](https://github.com/deeppavlov/DeepPavlov/blob/1.0.0rc1/deeppavlov/configs/kbqa/wiki_parser.json) can be used with the following command:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! python -m deeppavlov riseapi wiki_parser [-d] [-p ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Arguments of the annotator are `parser_info` (what we want to extract from Wikidata) and `query`.\n", "\n", "**Examples of queries:**\n", "\n", "To extract triplets for entities, the `query` argument should be the list of entities ids. `parser_info` should be the list of “find_triplets” strings." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "requests.post(wiki_parser_url, json = {\"parser_info\": [\"find_triplets\"], \"query\": [\"Q159\"]}).json()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To extract all relations of the entities, the `query` argument should be the list of entities ids, and `parser_info` should be the list of “find_rels” strings." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "requests.post(wiki_parser_url, json = {\"parser_info\": [\"find_rels\"], \"query\": [\"Q159\"]}).json()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To find labels for entities ids, the `query` argument should be the list of entities ids, and `parser_info` should be the list of “find_label” strings." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "requests.post(wiki_parser_url, json = {\"parser_info\": [\"find_label\"], \"query\": [[\"Q159\", \"\"]]}).json()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, the second element of the list (an empty string) can be replaced with a sentence.\n", "\n", "To execute SPARQL queries, the `query` argument should be the list of tuples with the info about SPARQL queries, and `parser_info` should be the list of “query_execute” strings.\n", "\n", "Let us consider an example of the question “What is the deepest lake in Russia?” with the corresponding SPARQL query `SELECT ?ent WHERE { ?ent wdt:P31 wd:T1 . ?ent wdt:R1 ?obj . ?ent wdt:R2 wd:E1 } ORDER BY ASC(?obj) LIMIT 5`\n", "\n", "Arguments:\n", "\n", "* *what_return*: ```[“?obj”]```,\n", "* *query_seq*: ```[[“?ent”, “P17”, “Q159”], [“?ent”, “P31”, “Q23397”], [“?ent”, “P4511”, “?obj”]]```,\n", "* *filter_info*: ```[]```,\n", "* *order_info*: ```order_info(variable=’?obj’, sorting_order=’asc’)```." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "requests.post(\"wiki_parser_url\", json = {\"parser_info\": [\"query_execute\"], \"query\": [[[\"?obj\"], [[\"Q159\", \"P36\", \"?obj\"]], [], [], True]]}).json()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use entity linking model in KBQA, you should add following API Requester component to the `pipe` in the *config_file*:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "{\n", " \"class_name\": \"api_requester\",\n", " \"id\": \"linker_entities\",\n", " \"url\": \"entity_linking_url\",\n", " \"out\": [\"entity_substr\", \"entity_ids\", \"entity_conf\", \"entity_pages\", \"entity_labels\"],\n", " \"param_names\": [\"entity_substr\", \"tags\", \"probas\", \"sentences\"]\n", " }\n", " ```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use Wiki parser service in KBQA, you should add following API Requester component to the `pipe` in the *config_file*:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "{\n", " \"class_name\": \"api_requester\",\n", " \"id\": \"wiki_p\",\n", " \"url\": \"wiki_parser_url\",\n", " \"out\": [\"wiki_parser_output\"],\n", " \"param_names\": [\"parser_info\", \"query\"]\n", " }\n", " ```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5. Customize the model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.1 Description of config parameters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Parameters of ``entity_linker`` component:\n", "\n", "- ``num_entities_to_return: int`` - the number of entity IDs, returned for each entity mention in text;\n", "- ``lemmatize: bool`` - whether to lemmatize entity mentions before searching candidate entity IDs in the inverted index;\n", "- ``use_decriptions: bool`` - whether to perform ranking of candidate entities by similarity of their descriptions to the context;\n", "- ``use_connections: bool`` - whether to use connections between candidate entities for different mentions for ranking;\n", "- ``use_tags: bool`` - whether to search only those entity IDs in the inverted index, which have the same tag as the entity mention;\n", "- ``prefixes: Dict[str, Any]`` - prefixes in the knowledge base for entities and relations;\n", "- ``alias_coef: float`` - the coefficient which is multiplied by the substring matching score of the entity if the entity mention in the text matches with the entity title." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Parameters of ``rel_ranking_infer`` component:\n", "\n", "- ``return_elements: List[str]`` - what elements should be returned by the component in the output tuple (answers are returned by default, optional elements are `\"confidences\"`, `\"answer_ids\"`, `\"entities_and_rels\"` (entities and relations from SPARQL queries), `\"queries\"` (SPARQL queries), `\"triplets\"` (triplets from SPARQL queries));\n", "- ``batch_size: int`` - candidate relations list will be split into N batches of the size `batch_size` for further ranking;\n", "- ``softmax: bool`` - whether to apply softmax function to the confidences list of candidate relations for a question;\n", "- ``use_api_requester: bool`` - true if wiki_parser [is called through api_requester](#4.3-Using-entity-linking-and-Wiki-parser-as-standalone-tools-for-KBQA);\n", "- ``rank: bool`` - whether to perform ranking of candidate relation paths;\n", "- ``nll_rel_ranking: bool`` - in DeepPavlov we have two types of relation ranking models: 1) the model which takes a question and a relation and is trained to classify question-relation by two classes (relevant / irrelevant relation) 2) the model which takes a question and a list of relations (one relevant relation and others - irrelevant) and is trained to define the relevant relation in the list with NLL loss; the output format in two cases is different;\n", "- ``nll_path_ranking: bool`` - the same case as `nll_rel_ranking` for ranking of relation paths;\n", "- ``top_possible_answers: int`` - SPARQL query execution can result in several valid answers, so `top_possible_answers` is the number of these answers which we leave in the output;\n", "- ``top_n: int`` - number of candidate SPARQL queries (and corresponding answers) in the output for a question;\n", "- ``pos_class_num: int`` - if we use the model which classifies question-relation into two classes (relevant / irrelevant), we should set the number of positive class (0 or 1);\n", "- ``rel_thres: float`` - we leave only relations with the confidence upper threshold;\n", "- ``type_rels: List[str]`` - relations which connect entity and its type in the knowledge graph." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Parameters of ``query_generator`` component:\n", "\n", "- ``entities_to_leave: int`` - how many entity IDs to use to make a a combination of entities and relations for filling in the slots of the SPARQL query template;\n", "- ``rels_to_leave: int`` - how many relations to use to make a a combination of entities and relations for filling in the slots of the SPARQL query template;\n", "- ``max_comb_num: int`` - maximal number of combinations of entities and relations for filling in the slots of SPARQL query template;\n", "- ``map_query_str_to_kb: List[Tuple[str, str]]`` - a list of elements like [\"wd:\", \"http://we/\"], where the first element is a prefix of an entity (\"wd:\") or relation in the SPARQL query template, the second - the corresponding prefix in the knowledge base (\"http://we/\");\n", "- ``kb_prefixes: Dict[str, str]`` - a dictionary {\"entity\": \"wd:E\", \"rel\": \"wdt:R\", ...} - prefixes of entities, relations and types in the knowledge base;\n", "- ``gold_query_info: Dict[str, str]`` - names of unknown variables in SPARQL queries in the dataset (LC-QuAD2.0 or RuBQ2.0);\n", "- ``syntax_structure_known: bool`` - whether the syntax structure of the question is known (is True in kbqa_cq_ru.json, because this config performs syntax parsing with slovnet_syntax_parser)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.2 Train KBQA components" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train Query Prediction Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset for training query prediction model consists of three *.csv* files: *train.csv*, *valid.csv* and *test.csv*. Each line in this file contains question and corresponding query template type, for example:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "\"What is the longest river in the UK?\", 6\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train Entity Detection Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset is a pickle file. The dataset must be split into three parts: train, test, and validation. Each part is a list of tuples of question tokens and tags for each token. An example of training sample:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "(['What', 'is', 'the', 'complete', 'list', 'of', 'records', 'released', 'by', 'Jerry', 'Lee', 'Lewis', '?'],\n", " ['O', 'O', 'O', 'O', 'B-T', 'I-T', 'I-T', 'O', 'O', 'B-E', 'I-E', 'I-E', 'O'])\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`B-T` corresponds to tokens of entity types substrings beginning, `I-T` - to tokens of inner part of entity types substrings, `B-E` and `I-E` - for entities, `O` - for other tokens." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train Path Ranking Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset (in pickle format) is a dict of three keys: \"train\", \"valid\" and \"test\". The value by each key is the list of samples, an example of a sample:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "(['What is the Main St. Exile label, which Nik Powell co-founded?', ['record label', 'founded by']], '1')\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The sample contains the question, relations in the question and label (1 - if the relations correspond to the question, 0 - otherwise)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding Templates For New SPARQL Queries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Templates can be added to sparql_queries.json file, which is a dictionary, where keys are template types and values are templates with additional information. An example of a template:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "{\n", " \"query_template\": \"SELECT ?obj WHERE { wd:E1 p:R1 ?s . ?s ps:R1 ?obj . ?s ?p ?x filter(contains(?x, N)) }\",\n", " \"rank_rels\": [\"wiki\", \"do_not_rank\", \"do_not_rank\"],\n", " \"rel_types\": [\"no_type\", \"statement\", \"qualifier\"],\n", " \"query_sequence\": [1, 2, 3],\n", " \"return_if_found\": true,\n", " \"template_num\": \"0\",\n", " \"alternative_templates\": []\n", " }\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `query_template` is the template of the SPARQL query;\n", "* `rank_rels` is a list which defines whether to rank relations, in this example **p:R1** relations we extract from Wikidata for **wd:E1** entities and rank with RelRanker, **ps:R1** and **?p** relations we do not extract or rank;\n", "* `rel_types` - direct, statement or qualifier relations;\n", "* `query_sequence` - the sequence in which the triplets will be extracted from the Wikidata hdt file;\n", "* `return_if_found` - the parameter which iterates over all possible combinations of entities, relations and types, if true - return the first valid combination found, if false - consider all combinations;\n", "* `template_num` - the type of a template;\n", "* `alternative_templates` - type numbers of alternative templates to use if the answer was not found using the current template." ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 1 }