{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Named Entity Recognition (NER)\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deeppavlov/DeepPavlov/blob/master/docs/features/models/NER.ipynb)\n",
    "\n",
    "# Table of contents \n",
    "\n",
    "1. [Introduction to the task](#1.-Introduction-to-the-task)\n",
    "\n",
    "2. [Get started with the model](#2.-Get-started-with-the-model)\n",
    "\n",
    "3. [Models list](#3.-Models-list)\n",
    "\n",
    "4. [Use the model for prediction](#4.-Use-the-model-for-prediction)\n",
    "\n",
    "    4.1. [Predict using Python](#4.1-Predict-using-Python)\n",
    "    \n",
    "    4.2. [Predict using CLI](#4.2-Predict-using-CLI)\n",
    "    \n",
    "5. [Evaluate](#5.-Evaluate)\n",
    "    \n",
    "    5.1. [Evaluate from Python](#5.1-Evaluate-from-Python)\n",
    "    \n",
    "    5.2. [Evaluate from CLI](#5.2-Evaluate-from-CLI)\n",
    "\n",
    "6. [Customize the model](#6.-Customize-the-model)\n",
    "    \n",
    "    6.1. [Train your model from Python](#6.1-Train-your-model-from-Python)\n",
    "    \n",
    "    6.2. [Train your model from CLI](#6.2-Train-your-model-from-CLI)\n",
    "\n",
    "7. [NER-tags list](#7.-NER-tags-list)\n",
    "\n",
    "# 1. Introduction to the task\n",
    "\n",
    "**Named Entity Recognition (NER)** is a task of assigning a tag (from a predefined set of tags) to each token in a given sequence. In other words, NER-task consists of identifying named entities in the text and classifying them into types (e.g. person name, organization, location etc). \n",
    "\n",
    "**BIO encoding schema** is usually used in NER task. It uses 3 tags: B for the beginning of the entity, I for the inside of the entity, and O for non-entity tokens. The second part of the tag stands for the entity type.\n",
    "\n",
    "Here is an example of a tagged sequence:\n",
    "\n",
    "| Elon | Musk | founded | Tesla| in | 2003 | . |\n",
    "| --- | --- | --- | --- | --- | --- | --- |\n",
    "| B-PER | I-PER | O | B-ORG | O | B-DATE | O |\n",
    "\n",
    "Here we can see three extracted named entities: *Elon Musk* (which is a person's name), *Tesla* (which is a name of an organization) and *2003* (which is a date). To see more examples try out our [Demo](https://demo.deeppavlov.ai/#/en/ner).\n",
    "\n",
    "The list of possible types of NER entities may vary depending on your dataset domain. The list of tags used in DeepPavlov's models can be found in the [table](#7.-NER-tags-list).\n",
    "\n",
    "# 2. Get started with the model\n",
    "\n",
    "First make sure you have the DeepPavlov Library installed.\n",
    "[More info about the first installation.](http://docs.deeppavlov.ai/en/master/intro/installation.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -q deeppavlov"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then make sure that all the required packages for the model are installed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python -m deeppavlov install ner_ontonotes_bert"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`ner_ontonotes_bert` is the name of the model's *config_file*. [What is a Config File?](http://docs.deeppavlov.ai/en/master/intro/configuration.html) \n",
    "\n",
    "Configuration file defines the model and describes its hyperparameters. To use another model, change the name of the *config_file* here and further.\n",
    "The full list of NER models with their config names can be found in the [table](#3.-Models-list).\n",
    "\n",
    "There are alternative ways to install the model's packages that do not require executing a separate command -- see the options in the next sections of this page.\n",
    "\n",
    "# 3. Models list\n",
    "\n",
    "The table presents a list of all of the NER-models available in the DeepPavlov Library.\n",
    "\n",
    "| Config name | Dataset | Language | Model Size | F1 score (ner_f1) | F1 score (ner_f1_token) |\n",
    "| :--- | --- | --- | --- | --- | ---: |\n",
    "| ner_case_agnostic_mdistilbert| [CoNLL-2003](https://paperswithcode.com/dataset/conll-2003)   | En | 1.6 GB | 89.9 | 91.6 |\n",
    "| ner_conll2003_bert | [CoNLL-2003](https://paperswithcode.com/dataset/conll-2003) | En | 1.3 GB | **91.9** | **93.4** |\n",
    "| ner_ontonotes_bert | [OntoNotes](https://paperswithcode.com/dataset/ontonotes-5-0) | En | 1.3 GB | 89.2 | 92.7 |\n",
    "| ner_collection3_bert | [Collection3](https://www.researchgate.net/publication/313808701_Combining_Knowledge_and_CRF-Based_Approach_to_Named_Entity_Recognition_in_Russian) | Ru | 2.1 GB | **98.5** | **98.9** |\n",
    "| ner_rus_bert | [Collection3](https://www.researchgate.net/publication/313808701_Combining_Knowledge_and_CRF-Based_Approach_to_Named_Entity_Recognition_in_Russian) | Ru | 2.1 GB | 97.6 | 98.5 |\n",
    "| ner_rus_convers_distilrubert_2L | [Collection-rus](https://www.researchgate.net/publication/313808701_Combining_Knowledge_and_CRF-Based_Approach_to_Named_Entity_Recognition_in_Russian) | Ru | 1.3 GB | 92.9 | 96.6 |\n",
    "| ner_rus_convers_distilrubert_6L | [Collection-rus](https://www.researchgate.net/publication/313808701_Combining_Knowledge_and_CRF-Based_Approach_to_Named_Entity_Recognition_in_Russian) | Ru | 1.6 GB | 96.7 | 98.5 |\n",
    "| ner_rus_bert_probas | [Wiki-NER-rus](https://aclanthology.org/I17-1042/) | Ru | 2.1 GB | 72.6 | 79.5 |\n",
    "| ner_ontonotes_bert_mult | [OntoNotes](https://paperswithcode.com/dataset/ontonotes-5-0) | Multi | 2.1 GB | 88.9 | 92.0 |\n",
    "\n",
    "\n",
    "# 4. Use the model for prediction\n",
    "\n",
    "## 4.1 Predict using Python\n",
    "\n",
    "After [installing](#2.-Get-started-with-the-model) the model, build it from the config and predict."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from deeppavlov import build_model\n",
    "\n",
    "ner_model = build_model('ner_ontonotes_bert', download=True, install=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `download` argument defines whether it is necessary to download the files defined in the `download` section of the config: usually it provides the links to the train and test data, to the pretrained models, or to the embeddings.\n",
    "\n",
    "Setting the `install` argument to `True` is equivalent to executing the command line `install` command. If set to `True`, it will first install all the required packages.\n",
    "\n",
    "**Input**: List[sentences]\n",
    "\n",
    "**Output**: List[tokenized sentences, corresponding NER-tags]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[['Bob', 'Ross', 'lived', 'in', 'Florida'],\n",
       "  ['Elon', 'Musk', 'founded', 'Tesla']],\n",
       " [['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE'],\n",
       "  ['B-PERSON', 'I-PERSON', 'O', 'B-ORG']]]"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ner_model(['Bob Ross lived in Florida', 'Elon Musk founded Tesla'])"
   ]
  },
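  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output keeps tokens and tags in two parallel lists, so it is often convenient to pair them up per sentence. A minimal sketch, using the output shown above as literal data so that it runs without the model; `pair_tokens_with_tags` is a hypothetical helper, not a DeepPavlov function.\n",
    "\n",
    "```python\n",
    "# Output copied from the cell above: [tokenized sentences, NER tags].\n",
    "output = [\n",
    "    [['Bob', 'Ross', 'lived', 'in', 'Florida'],\n",
    "     ['Elon', 'Musk', 'founded', 'Tesla']],\n",
    "    [['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE'],\n",
    "     ['B-PERSON', 'I-PERSON', 'O', 'B-ORG']],\n",
    "]\n",
    "\n",
    "\n",
    "def pair_tokens_with_tags(model_output):\n",
    "    # Build one list of (token, tag) pairs per input sentence.\n",
    "    sentences, tag_sequences = model_output\n",
    "    return [list(zip(tokens, tags))\n",
    "            for tokens, tags in zip(sentences, tag_sequences)]\n",
    "\n",
    "\n",
    "for sentence in pair_tokens_with_tags(output):\n",
    "    print(sentence)\n",
    "```"
   ]
  },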
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.2 Predict using CLI\n",
    "\n",
    "You can also get predictions in an interactive mode through CLI (Сommand Line Interface)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! python -m deeppavlov interact ner_ontonotes_bert -d"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`-d` is an optional download key (alternative to `download=True` in Python code). The key `-d` is used to download the pre-trained model along with embeddings and all other files needed to run the model.\n",
    "\n",
    "Or make predictions for samples from *stdin*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! python -m deeppavlov predict ner_ontonotes_bert -f <file-name>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5. Evaluate\n",
    "\n",
    "There are two metrics that are used to evaluate a NER model in DeepPavlov:\n",
    "\n",
    "`ner_f1` is measured on the entity-level (actual text spans should match exactly)\n",
    "\n",
    "`ner_token_f1` is measured on a token level (correct tokens from not fully extracted entities will still be counted as TPs (true positives))\n",
    "\n",
    "## 5.1 Evaluate from Python"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from deeppavlov import evaluate_model\n",
    "\n",
    "model = evaluate_model('ner_ontonotes_bert', download=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.2 Evaluate from CLI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! python -m deeppavlov evaluate ner_ontonotes_bert"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 6. Customize the model\n",
    "\n",
    "## 6.1 Train your model from Python\n",
    "\n",
    "### Provide your data path\n",
    "\n",
    "To train the model on your data, you need to change the path to the training data in the *config_file*.\n",
    " \n",
    "Parse the *config_file* and change the path to your data from Python."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "~/.deeppavlov/downloads/ontonotes/\n"
     ]
    }
   ],
   "source": [
    "from deeppavlov import train_model\n",
    "from deeppavlov.core.commands.utils import parse_config\n",
    "\n",
    "model_config = parse_config('ner_ontonotes_bert')\n",
    "\n",
    "# dataset that the model was trained on\n",
    "print(model_config['dataset_reader']['data_path'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Provide a *data_path* to your own dataset. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# download and unzip a new example dataset\n",
    "!wget http://files.deeppavlov.ai/deeppavlov_data/conll2003_v2.tar.gz\n",
    "!tar -xzvf \"conll2003_v2.tar.gz\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# provide a path to the train file\n",
    "model_config['dataset_reader']['data_path'] = 'contents/train.txt'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### Train dataset format\n",
    "\n",
    "To train the model, you need to have a txt-file with a dataset in the following format:\n",
    "\n",
    "```\n",
    "EU B-ORG\n",
    "rejects O\n",
    "the O\n",
    "call O\n",
    "of O\n",
    "Germany B-LOC\n",
    "to O\n",
    "boycott O\n",
    "lamb O\n",
    "from O\n",
    "Great B-LOC\n",
    "Britain I-LOC\n",
    ". O\n",
    "\n",
    "China B-LOC\n",
    "says O\n",
    "time O\n",
    "right O\n",
    "for O\n",
    "Taiwan B-LOC\n",
    "talks O\n",
    ". O\n",
    "```\n",
    "\n",
    "The source text is **tokenized** and **tagged**. For each token, there is a tag with **BIO** markup. Tags are separated from tokens with **whitespaces**. Sentences are separated with **empty lines**.\n",
    "\n",
    "\n",
    "### Train the model using new config"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ner_model = train_model(model_config)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use your model for prediction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[['Bob', 'Ross', 'lived', 'in', 'Florida'],\n",
       "  ['Elon', 'Musk', 'founded', 'Tesla']],\n",
       " [['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE'],\n",
       "  ['B-PERSON', 'I-PERSON', 'O', 'B-ORG']]]"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ner_model(['Bob Ross lived in Florida', 'Elon Musk founded Tesla'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6.2 Train your model from CLI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! python -m deeppavlov train ner_ontonotes_bert"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 7. NER-tags list"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The table presents a list of all of the NER entity tags used in DeepPavlov's NER-models."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "|              |                                                        |\n",
    "| ------------ | ------------------------------------------------------ |\n",
    "| **PERSON**       | People including fictional                             |\n",
    "| **NORP**         | Nationalities or religious or political groups         |\n",
    "| **FACILITY**     | Buildings, airports, highways, bridges, etc.           |\n",
    "| **ORGANIZATION** | Companies, agencies, institutions, etc.                |\n",
    "| **GPE**          | Countries, cities, states                              |\n",
    "| **LOCATION**     | Non-GPE locations, mountain ranges, bodies of water    |\n",
    "| **PRODUCT**      | Vehicles, weapons, foods, etc. (Not services)          |\n",
    "| **EVENT**        | Named hurricanes, battles, wars, sports events, etc.   |\n",
    "| **WORK OF ART**  | Titles of books, songs, etc.                           |\n",
    "| **LAW**          | Named documents made into laws                         |\n",
    "| **LANGUAGE**     | Any named language                                     |\n",
    "| **DATE**         | Absolute or relative dates or periods                  |\n",
    "| **TIME**         | Times smaller than a day                               |\n",
    "| **PERCENT**      | Percentage (including “%”)                             |\n",
    "| **MONEY**        | Monetary values, including unit                        |\n",
    "| **QUANTITY**     | Measurements such as weight or distance                |\n",
    "| **ORDINAL**      | “first”, “second”, etc.                                |\n",
    "| **CARDINAL**     | Numerals that do not fall under another type           |"
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 4
}