{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Entity Extraction\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deeppavlov/DeepPavlov/blob/master/docs/features/models/entity_extraction.ipynb)\n",
    "\n",
    "# Table of contents \n",
    "\n",
    "1. [Introduction to the task](#1.-Introduction-to-the-task)\n",
    "\n",
    "2. [Get started with the model](#2.-Get-started-with-the-model)\n",
    "\n",
    "3. [Models list](#3.-Models-list)\n",
    "\n",
    "4. [Use the model for prediction](#4.-Use-the-model-for-prediction)\n",
    "\n",
    "    4.1 [Predict using Python](#4.1-Predict-using-Python)\n",
    "    \n",
    "    4.2 [Predict using CLI](#4.2-Predict-using-CLI)\n",
    "\n",
    "5. [Customize the model](#5.-Customize-the-model)\n",
    "    \n",
    "    5.1 [Description of config parameters](#5.1-Description-of-config-parameters)\n",
    "    \n",
    "    5.2 [Training entity detection model](#5.2-Training-entity-detection-model)\n",
    "    \n",
    "    5.3 [Using custom knowledge base](#5.3-Using-custom-knowledge-base)\n",
    "\n",
    "# 1. Introduction to the task\n",
    "\n",
    "**Entity Detection** is the task of identifying entity mentions in text with corresponding entity types. Entity Detection models in DeepPavlov split the input text into fragments of the lengths less than 512 tokens and find entities with BERT-based models.\n",
    "\n",
    "**Entity Linking** is the task of finding knowledge base entity ids for entity mentions in text. Entity Linking in DeepPavlov supports Wikidata and Wikipedia. Entity Linking component performs the following steps:\n",
    "\n",
    "* extraction of candidate entities from SQLite database;\n",
    "* candidate entities sorting by entity tags (if entity tags are provided);\n",
    "* ranking of candidate entities by connections in Wikidata knowledge graph of candidate entities for different mentions;\n",
    "* candidate entities ranking by context and descriptions using Transformer model [bert-small](https://huggingface.co/prajjwal1/bert-small) in English config and [distilrubert-tiny](https://huggingface.co/DeepPavlov/distilrubert-tiny-cased-conversational-v1).\n",
    "\n",
    "**Entity Extraction** configs perform subsequent Entity Detection and Entity Linking of extracted entity mentions.\n",
    "\n",
    "# 2. Get started with the model\n",
    "\n",
    "First make sure you have the DeepPavlov Library installed.\n",
    "[More info about the first installation.](http://docs.deeppavlov.ai/en/master/intro/installation.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -q deeppavlov"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then make sure that all the required packages for the model are installed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python -m deeppavlov install entity_extraction_en"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`entity_extraction_en` is the name of the model's *config_file*. [What is a Config File?](http://docs.deeppavlov.ai/en/master/intro/configuration.html)\n",
    "\n",
    "There are alternative ways to install the model's packages that do not require executing a separate command -- see the options in the next sections of this page.\n",
    "The full list of models for entity detection, linking and extraction with their config names can be found in the [table](#3.-Models-list).\n",
    "\n",
    "# 3. Models list\n",
    "\n",
    "The table presents a list of all of the models for entity detection, linking and extraction available in the DeepPavlov Library.\n",
    "\n",
    "| Config name | Language | RAM | GPU |\n",
    "| :--- | --- | --- | --- |\n",
    "| entity_detection_en | En | 2.5 Gb | 3.7 Gb |\n",
    "| entity_detection_ru | Ru | 2.5 Gb | 5.3 Gb |\n",
    "| entity_linking_en | En | 2.4 Gb | 1.2 Gb |\n",
    "| entity_linking_ru | Ru | 2.2 Gb | 1.1 Gb |\n",
    "| entity_extraction_en | En | 2.5 Gb | 3.7 Gb |\n",
    "| entity_extraction_ru | Ru | 2.5 Gb | 5.3 Gb |\n",
    "\n",
    "# 4. Use the model for prediction\n",
    "\n",
    "## 4.1 Predict using Python\n",
    "\n",
    "After [installing](#2.-Get-started-with-the-model) the model, build it from the config and predict.\n",
    "\n",
    "### Entity Detection\n",
    "\n",
    "**For English:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "from deeppavlov import build_model\n",
    "\n",
    "ed_en = build_model('entity_detection_en', download=True, install=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**The output elements:**\n",
    "\n",
    "* entity substrings\n",
    "* entity offsets (indices of start and end symbols of entities in text)\n",
    "* entity positions (indices of entity tokens in text)\n",
    "* entity tags\n",
    "* sentences offsets\n",
    "* list of sentences in text\n",
    "* confidences of detected entities"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ed_en(['Forrest Gump is a comedy-drama film directed by Robert Zemeckis and written by Eric Roth.'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**For Russian:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ed_ru = build_model('entity_detection_ru', download=True, install=True)\n",
    "ed_ru(['Москва — столица России, центр Центрального федерального округа и центр Московской области.'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Entity Linking\n",
    "\n",
    "**For English:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "el_en = build_model('entity_linking_en', download=True, install=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**The input elements:**\n",
    "\n",
    "* entity substrings\n",
    "* entity tags (optional argument)\n",
    "* confidences of entity substrings (optional argument)\n",
    "* sentences (context) of the entities (optional argument)\n",
    "* entity offsets (optional argument)\n",
    "* sentences offsets (optional argument)\n",
    "\n",
    "**The output elements:**\n",
    "\n",
    "* entity ids\n",
    "* entity confidences (for each entity - the list with three confidences: substring matching confidence, popularity ranking confidence and context ranking confidence)\n",
    "* entity pages in Wikipedia\n",
    "* entity labels in Wikidata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "el_en([['forrest gump', 'robert zemeckis', 'eric roth']],\n",
    "      [['WORK_OF_ART', 'PERSON', 'PERSON']],\n",
    "      [[1.0, 1.0, 1.0]],\n",
    "      [['Forrest Gump is a comedy-drama film directed by Robert Zemeckis and written by Eric Roth.']],\n",
    "      [[(0, 12), (48, 63), (79, 88)]],\n",
    "      [[(0, 89)]])"
   ]
  },
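  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The character offsets passed to the linker do not have to be written by hand. The cell below is a minimal sketch, not part of DeepPavlov: `find_offsets` is a hypothetical helper that assumes every mention occurs in the text (ignoring case), that mentions are given in reading order, and that lowercasing does not change the length of the string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import List, Tuple\n",
    "\n",
    "\n",
    "def find_offsets(text: str, mentions: List[str]) -> List[Tuple[int, int]]:\n",
    "    # Hypothetical helper: (start, end) character offsets, searched left to right.\n",
    "    lowered = text.lower()\n",
    "    offsets = []\n",
    "    cursor = 0\n",
    "    for mention in mentions:\n",
    "        start = lowered.find(mention.lower(), cursor)\n",
    "        if start == -1:\n",
    "            raise ValueError(f'mention not found in text: {mention!r}')\n",
    "        end = start + len(mention)\n",
    "        offsets.append((start, end))\n",
    "        # Continue after this mention so repeated mentions map to later occurrences.\n",
    "        cursor = end\n",
    "    return offsets\n",
    "\n",
    "\n",
    "text = 'Forrest Gump is a comedy-drama film directed by Robert Zemeckis and written by Eric Roth.'\n",
    "mentions = ['forrest gump', 'robert zemeckis', 'eric roth']\n",
    "\n",
    "entity_offsets = find_offsets(text, mentions)\n",
    "# A single-sentence text has one sentence offset covering the whole string.\n",
    "sentence_offsets = [(0, len(text))]\n",
    "print(entity_offsets)\n",
    "print(sentence_offsets)"
   ]
  },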
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**For Russian:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "el_ru = build_model('entity_linking_ru', download=True, install=True)\n",
    "\n",
    "el_ru([['москва', 'россии', 'центрального федерального округа', 'московской области']],\n",
    "      [['CITY', 'COUNTRY', 'LOC', 'LOC']],\n",
    "      [[1.0, 1.0, 1.0, 1.0]],\n",
    "      [['Москва — столица России, центр Центрального федерального округа и центр Московской области.']],\n",
    "      [[(0, 6), (17, 23), (31, 63), (72, 90)]],\n",
    "      [[(0, 91)]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Entity Extraction\n",
    "\n",
    "**For English:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ex_en = build_model('entity_extraction_en', download=True, install=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**The output elements:**\n",
    "\n",
    "* entity substrings\n",
    "* entity tags\n",
    "* entity offsets\n",
    "* entity ids in the knowledge base\n",
    "* entity linking confidences\n",
    "* entity pages\n",
    "* entity labels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ex_en(['Forrest Gump is a comedy-drama film directed by Robert Zemeckis and written by Eric Roth.'])"
   ]
  },
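  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result holds one list per output element, each with one entry per input text, so it can be convenient to regroup it by mention. The cell below is a sketch rather than DeepPavlov API: `to_records` and the field names in `FIELDS` are hypothetical, and the helper assumes the output elements arrive in the order listed above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Any, Dict, List, Sequence\n",
    "\n",
    "# Hypothetical field names, in the assumed order of the output elements.\n",
    "FIELDS = ('substring', 'tag', 'offset', 'entity_ids', 'confidences', 'pages', 'labels')\n",
    "\n",
    "\n",
    "def to_records(outputs: Sequence[Sequence[Any]]) -> List[List[Dict[str, Any]]]:\n",
    "    # outputs holds one batch-level list per output element.\n",
    "    if len(outputs) != len(FIELDS):\n",
    "        raise ValueError(f'expected {len(FIELDS)} output elements, got {len(outputs)}')\n",
    "    records = []\n",
    "    # The outer zip iterates over input texts, the inner zip over mentions in one text.\n",
    "    for per_text in zip(*outputs):\n",
    "        records.append([dict(zip(FIELDS, values)) for values in zip(*per_text)])\n",
    "    return records\n",
    "\n",
    "\n",
    "# Usage with the model built above (left commented out so the cell runs on its own):\n",
    "# records = to_records(ex_en(['Forrest Gump is a comedy-drama film directed by Robert Zemeckis.']))"
   ]
  },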
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**For Russian:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ex_ru = build_model('entity_extraction_ru', download=True, install=True)\n",
    "\n",
    "ex_ru(['Москва — столица России, центр Центрального федерального округа и центр Московской области.'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.2 Predict using CLI\n",
    "\n",
    "You can also get predictions in an interactive mode through CLI (Сommand Line Interface)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! python -m deeppavlov interact entity_extraction_en -d"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5. Customize the model\n",
    "\n",
    "## 5.1 Description of config parameters\n",
    "\n",
    "Parameters of ``ner_chunker`` component:\n",
    "\n",
    "- ``batch_size: int`` - each text from the input text batch is split into chunks with the length lower than the threshold (because Transformer-based models for entity detection work with limited lengths of the input sequences), than all chunks are concatenated into one list and the list is split into batches of the size ``batch_size``;\n",
    "- ``max_seq_len: int`` - maximum length of chunk (in wordpiece tokens);\n",
    "- ``vocab_file: str`` - vocab file of Transformer tokenizer, which is used to tokenize the text for further splitting into chunks.\n",
    "\n",
    "Parameters of ``entity_detection_parser`` component:\n",
    "    \n",
    "- ``thres_proba: float`` - the NER models return tag confidences for each token; if the probability of \"O\" tag (which is used for tokens not related to entities) for the token is lower than the ``thres_proba``, the tag with the maximum probability from entity tags list is chosen;\n",
    "- ``o_tag: str`` - tag for non-entity tokens (by default is \"O\" tag);\n",
    "- ``tags_file: str`` - the filename with the list of tags used in the NER model.\n",
    "\n",
    "Parameters of ``ner_chunk_model`` component:\n",
    "\n",
    "- ``ner: deeppavlov.core.common.chainer:Chainer`` - the config for entity recognition, which defines entity tags (or \"O\" tag) and tag probabilities for each token in the input text;\n",
    "- ``ner_parser: deeppavlov.models.entity_extraction.entity_detection_parser:EntityDetectionParser`` - the component which processes the tags and tag probabilities returned by the entity recognition model and defines entity substrings;\n",
    "- ``ner2: deeppavlov.core.common.chainer:Chainer`` - (optional) an additional entity recognition config, which can improve the quality of entity recognition in the case of joint usage with ``ner`` config;\n",
    "- ``ner_parser2: deeppavlov.models.entity_extraction.entity_detection_parser:EntityDetectionParser`` - (optional) an additional config for processing entity recognition output.\n",
    "\n",
    "Parameters of ``entity_linker`` component:\n",
    "\n",
    "- ``load_path: str`` - the path to the folder with the inverted index;\n",
    "- ``entity_ranker`` - the component for ranking of candidate entities by descriptions;\n",
    "- ``entities_database_filename: str`` - file with the inverted index (the mapping between entity titles and entity IDs);\n",
    "- ``words_dict_filename: str`` - file with mapping of entity titles to the tags of entity detection model;\n",
    "- ``ngrams_matrix_filename: str`` - matrix of char ngrams of words from entity titles from the knowledge base;\n",
    "- ``num_entities_for_bert_ranking: int`` - number of candidate entities which are re-ranked by context and description using Transformer-based model;\n",
    "- ``num_entities_for_conn_ranking: int`` - number of candidate entities which are re-ranked by connections in the knowledge graph between entities for different mentions in the text;\n",
    "- ``num_entities_to_return: int`` - the number of entity IDs, returned for each entity mention in text; \n",
    "- ``max_paragraph_len: int`` - maximum length of context used for ranking of entities by description;\n",
    "- ``lang: str`` - language of the entity linking model (Russian or English);\n",
    "- ``use_descriptions: bool`` - whether to perform ranking of candidate entities by similarity of their descriptions to the context;\n",
    "- ``alias_coef: float`` - the coefficient which is multiplied by the substring matching score of the entity if the entity mention in the text matches with the entity title;\n",
    "- ``use_tags: bool`` - whether to search only those entity IDs in the inverted index, which have the same tag as the entity mention;\n",
    "- ``lemmatize: bool`` - whether to lemmatize entity mentions before searching candidate entity IDs in the inverted index;\n",
    "- ``full_paragraph: bool`` - whether to use full context for ranking of entities by descriptions or cut the paragraph to one sentence with entity mention;\n",
    "- ``use_connections: bool`` - whether to use connections between candidate entities for different mentions for ranking;\n",
    "- ``kb_filename: str`` - file with the knowledge base in .hdt format;\n",
    "- ``prefixes: Dict[str, Any]`` - prefixes in the knowledge base for entities and relations.\n",
    "\n",
    "## 5.2 Training entity detection model\n",
    "\n",
    "The configs `entity_detection_en` and `entity extraction_en` use `ner_ontonotes_bert` model for detection of entity mentions, the configs `entity_detection_ru` and `entity extraction_ru` use `ner_rus_bert_probas` model. [How to train a NER model](http://docs.deeppavlov.ai/en/master/features/models/NER.html#6.-Customize-the-model).\n",
    "\n",
    "## 5.3 Using custom knowledge base\n",
    "\n",
    "The database filename is defined with the **entities_database_filename** in entity linking configs. The file is in SQLite format with FTS5 extensions for full-text search of entities by entity mention. The database file should contain the **inverted_index** table with the following columns:\n",
    "\n",
    "* ``title`` - entity title (name or alias) in the knowledge base;\n",
    "* ``entity_id`` - entity ID in the knowledge base;\n",
    "* ``num_rels`` - number of relations of the entity with other entities in the knowledge graph;\n",
    "* ``ent_tag`` - entity tag of the entity detection model (for example, CITY, PERSON, WORK_OF_ART, etc.);\n",
    "* ``page`` - page title of the entity (for Wikidata entities - the Wikipedia page);\n",
    "* ``label`` - entity label in the knowledge base;\n",
    "* ``descr`` - entity description in the knowledge base.\n",
    "\n",
    "Tags of entities in the knowledge base should correspond with the tags of the custom NER model or default `ner_ontonotes_bert` or `ner_rus_bert_probas` models. The list of `ner_ontonotes_bert` tags is listed in tags.dict file in ~/.deeppavlov/models/ner_ontonotes_bert_torch_crf directory, the list of `ner_rus_bert_probas tags` - in tags.dict file in ~/.deeppavlov/models/wiki_ner_rus_bert directory."
   ]
  }
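  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A table with the columns from section 5.3 can be prototyped with the standard `sqlite3` module. The cell below is a minimal sketch under stated assumptions: it needs an SQLite build with FTS5 enabled, the rows and the entity IDs `E1` and `E2` are placeholders rather than real knowledge base entries, and the exact table options DeepPavlov expects (for example the tokenizer) are not covered here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sqlite3\n",
    "\n",
    "# In-memory database for the demo; a real knowledge base would live in a file.\n",
    "conn = sqlite3.connect(':memory:')\n",
    "\n",
    "# FTS5 virtual table with the columns listed in section 5.3.\n",
    "conn.execute(\n",
    "    'CREATE VIRTUAL TABLE inverted_index USING fts5('\n",
    "    'title, entity_id, num_rels, ent_tag, page, label, descr)'\n",
    ")\n",
    "\n",
    "# Placeholder rows: IDs, relation counts and descriptions are made up.\n",
    "rows = [\n",
    "    ('forrest gump', 'E1', 120, 'WORK_OF_ART', 'Forrest Gump', 'Forrest Gump', 'example film description'),\n",
    "    ('robert zemeckis', 'E2', 80, 'PERSON', 'Robert Zemeckis', 'Robert Zemeckis', 'example person description'),\n",
    "]\n",
    "conn.executemany('INSERT INTO inverted_index VALUES (?, ?, ?, ?, ?, ?, ?)', rows)\n",
    "conn.commit()\n",
    "\n",
    "# Full-text lookup of candidates by mention, restricted to the title column.\n",
    "query = 'SELECT entity_id, ent_tag FROM inverted_index WHERE inverted_index MATCH ?'\n",
    "candidates = conn.execute(query, ('title: gump',)).fetchall()\n",
    "print(candidates)"
   ]
  }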
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 4
}