{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#### Named Entity Recognition (NER)\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deeppavlov/DeepPavlov/blob/master/docs/features/models/NER.ipynb)\n", "\n", "# Table of contents \n", "\n", "1. [Introduction to the task](#1.-Introduction-to-the-task)\n", "\n", "2. [Get started with the model](#2.-Get-started-with-the-model)\n", "\n", "3. [Models list](#3.-Models-list)\n", "\n", "4. [Use the model for prediction](#4.-Use-the-model-for-prediction)\n", "\n", " 4.1. [Predict using Python](#4.1-Predict-using-Python)\n", " \n", " 4.2. [Predict using CLI](#4.2-Predict-using-CLI)\n", " \n", "5. [Evaluate](#5.-Evaluate)\n", " \n", " 5.1. [Evaluate from Python](#5.1-Evaluate-from-Python)\n", " \n", " 5.2. [Evaluate from CLI](#5.2-Evaluate-from-CLI)\n", "\n", "6. [Customize the model](#6.-Customize-the-model)\n", " \n", " 6.1. [Train your model from Python](#6.1-Train-your-model-from-Python)\n", " \n", " 6.2. [Train your model from CLI](#6.2-Train-your-model-from-CLI)\n", "\n", "7. [NER-tags list](#7.-NER-tags-list)\n", "\n", "# 1. Introduction to the task\n", "\n", "**Named Entity Recognition (NER)** is a task of assigning a tag (from a predefined set of tags) to each token in a given sequence. In other words, NER-task consists of identifying named entities in the text and classifying them into types (e.g. person name, organization, location etc). \n", "\n", "**BIO encoding schema** is usually used in NER task. It uses 3 tags: B for the beginning of the entity, I for the inside of the entity, and O for non-entity tokens. The second part of the tag stands for the entity type.\n", "\n", "Here is an example of a tagged sequence:\n", "\n", "| Elon | Musk | founded | Tesla| in | 2003 | . |\n", "| --- | --- | --- | --- | --- | --- | --- |\n", "| B-PER | I-PER | O | B-ORG | O | B-DATE | O |\n", "\n", "Here we can see three extracted named entities: *Elon Musk* (which is a person's name), *Tesla* (which is a name of an organization) and *2003* (which is a date). To see more examples try out our [Demo](https://demo.deeppavlov.ai/#/en/ner).\n", "\n", "The list of possible types of NER entities may vary depending on your dataset domain. The list of tags used in DeepPavlov's models can be found in the [table](#7.-NER-tags-list).\n", "\n", "# 2. Get started with the model\n", "\n", "First make sure you have the DeepPavlov Library installed.\n", "[More info about the first installation.](http://docs.deeppavlov.ai/en/master/intro/installation.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -q deeppavlov" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then make sure that all the required packages for the model are installed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python -m deeppavlov install ner_ontonotes_bert" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`ner_ontonotes_bert` is the name of the model's *config_file*. [What is a Config File?](http://docs.deeppavlov.ai/en/master/intro/configuration.html) \n", "\n", "Configuration file defines the model and describes its hyperparameters. To use another model, change the name of the *config_file* here and further.\n", "The full list of NER models with their config names can be found in the [table](#3.-Models-list).\n", "\n", "There are alternative ways to install the model's packages that do not require executing a separate command -- see the options in the next sections of this page.\n", "\n", "# 3. Models list\n", "\n", "The table presents a list of all of the NER-models available in the DeepPavlov Library.\n", "\n", "| Config name | Dataset | Language | Model Size | F1 score (ner_f1) | F1 score (ner_f1_token) |\n", "| :--- | --- | --- | --- | --- | ---: |\n", "| ner_case_agnostic_mdistilbert| [CoNLL-2003](https://paperswithcode.com/dataset/conll-2003) | En | 1.6 GB | 89.9 | 91.6 |\n", "| ner_conll2003_bert | [CoNLL-2003](https://paperswithcode.com/dataset/conll-2003) | En | 1.3 GB | **91.9** | **93.4** |\n", "| ner_ontonotes_bert | [OntoNotes](https://paperswithcode.com/dataset/ontonotes-5-0) | En | 1.3 GB | 89.2 | 92.7 |\n", "| ner_collection3_bert | [Collection3](https://www.researchgate.net/publication/313808701_Combining_Knowledge_and_CRF-Based_Approach_to_Named_Entity_Recognition_in_Russian) | Ru | 2.1 GB | **98.5** | **98.9** |\n", "| ner_rus_bert | [Collection3](https://www.researchgate.net/publication/313808701_Combining_Knowledge_and_CRF-Based_Approach_to_Named_Entity_Recognition_in_Russian) | Ru | 2.1 GB | 97.6 | 98.5 |\n", "| ner_rus_convers_distilrubert_2L | [Collection-rus](https://www.researchgate.net/publication/313808701_Combining_Knowledge_and_CRF-Based_Approach_to_Named_Entity_Recognition_in_Russian) | Ru | 1.3 GB | 92.9 | 96.6 |\n", "| ner_rus_convers_distilrubert_6L | [Collection-rus](https://www.researchgate.net/publication/313808701_Combining_Knowledge_and_CRF-Based_Approach_to_Named_Entity_Recognition_in_Russian) | Ru | 1.6 GB | 96.7 | 98.5 |\n", "| ner_rus_bert_probas | [Wiki-NER-rus](https://aclanthology.org/I17-1042/) | Ru | 2.1 GB | 72.6 | 79.5 |\n", "| ner_ontonotes_bert_mult | [OntoNotes](https://paperswithcode.com/dataset/ontonotes-5-0) | Multi | 2.1 GB | 88.9 | 92.0 |\n", "\n", "\n", "# 4. Use the model for prediction\n", "\n", "## 4.1 Predict using Python\n", "\n", "After [installing](#2.-Get-started-with-the-model) the model, build it from the config and predict." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from deeppavlov import build_model\n", "\n", "ner_model = build_model('ner_ontonotes_bert', download=True, install=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `download` argument defines whether it is necessary to download the files defined in the `download` section of the config: usually it provides the links to the train and test data, to the pretrained models, or to the embeddings.\n", "\n", "Setting the `install` argument to `True` is equivalent to executing the command line `install` command. If set to `True`, it will first install all the required packages.\n", "\n", "**Input**: List[sentences]\n", "\n", "**Output**: List[tokenized sentences, corresponding NER-tags]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[['Bob', 'Ross', 'lived', 'in', 'Florida'],\n", " ['Elon', 'Musk', 'founded', 'Tesla']],\n", " [['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE'],\n", " ['B-PERSON', 'I-PERSON', 'O', 'B-ORG']]]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ner_model(['Bob Ross lived in Florida', 'Elon Musk founded Tesla'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.2 Predict using CLI\n", "\n", "You can also get predictions in an interactive mode through CLI (Сommand Line Interface)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! python -m deeppavlov interact ner_ontonotes_bert -d" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`-d` is an optional download key (alternative to `download=True` in Python code). The key `-d` is used to download the pre-trained model along with embeddings and all other files needed to run the model.\n", "\n", "Or make predictions for samples from *stdin*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! python -m deeppavlov predict ner_ontonotes_bert -f " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5. Evaluate\n", "\n", "There are two metrics that are used to evaluate a NER model in DeepPavlov:\n", "\n", "`ner_f1` is measured on the entity-level (actual text spans should match exactly)\n", "\n", "`ner_token_f1` is measured on a token level (correct tokens from not fully extracted entities will still be counted as TPs (true positives))\n", "\n", "## 5.1 Evaluate from Python" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from deeppavlov import evaluate_model\n", "\n", "model = evaluate_model('ner_ontonotes_bert', download=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.2 Evaluate from CLI" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! python -m deeppavlov evaluate ner_ontonotes_bert" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 6. Customize the model\n", "\n", "## 6.1 Train your model from Python\n", "\n", "### Provide your data path\n", "\n", "To train the model on your data, you need to change the path to the training data in the *config_file*.\n", " \n", "Parse the *config_file* and change the path to your data from Python." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "~/.deeppavlov/downloads/ontonotes/\n" ] } ], "source": [ "from deeppavlov import train_model\n", "from deeppavlov.core.commands.utils import parse_config\n", "\n", "model_config = parse_config('ner_ontonotes_bert')\n", "\n", "# dataset that the model was trained on\n", "print(model_config['dataset_reader']['data_path'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Provide a *data_path* to your own dataset. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# download and unzip a new example dataset\n", "!wget http://files.deeppavlov.ai/deeppavlov_data/conll2003_v2.tar.gz\n", "!tar -xzvf \"conll2003_v2.tar.gz\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# provide a path to the train file\n", "model_config['dataset_reader']['data_path'] = 'contents/train.txt'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Train dataset format\n", "\n", "To train the model, you need to have a txt-file with a dataset in the following format:\n", "\n", "```\n", "EU B-ORG\n", "rejects O\n", "the O\n", "call O\n", "of O\n", "Germany B-LOC\n", "to O\n", "boycott O\n", "lamb O\n", "from O\n", "Great B-LOC\n", "Britain I-LOC\n", ". O\n", "\n", "China B-LOC\n", "says O\n", "time O\n", "right O\n", "for O\n", "Taiwan B-LOC\n", "talks O\n", ". O\n", "```\n", "\n", "The source text is **tokenized** and **tagged**. For each token, there is a tag with **BIO** markup. Tags are separated from tokens with **whitespaces**. Sentences are separated with **empty lines**.\n", "\n", "\n", "### Train the model using new config" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ner_model = train_model(model_config)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use your model for prediction." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[['Bob', 'Ross', 'lived', 'in', 'Florida'],\n", " ['Elon', 'Musk', 'founded', 'Tesla']],\n", " [['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE'],\n", " ['B-PERSON', 'I-PERSON', 'O', 'B-ORG']]]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ner_model(['Bob Ross lived in Florida', 'Elon Musk founded Tesla'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6.2 Train your model from CLI" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! python -m deeppavlov train ner_ontonotes_bert" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 7. NER-tags list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The table presents a list of all of the NER entity tags used in DeepPavlov's NER-models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "| | |\n", "| ------------ | ------------------------------------------------------ |\n", "| **PERSON** | People including fictional |\n", "| **NORP** | Nationalities or religious or political groups |\n", "| **FACILITY** | Buildings, airports, highways, bridges, etc. |\n", "| **ORGANIZATION** | Companies, agencies, institutions, etc. |\n", "| **GPE** | Countries, cities, states |\n", "| **LOCATION** | Non-GPE locations, mountain ranges, bodies of water |\n", "| **PRODUCT** | Vehicles, weapons, foods, etc. (Not services) |\n", "| **EVENT** | Named hurricanes, battles, wars, sports events, etc. |\n", "| **WORK OF ART** | Titles of books, songs, etc. |\n", "| **LAW** | Named documents made into laws |\n", "| **LANGUAGE** | Any named language |\n", "| **DATE** | Absolute or relative dates or periods |\n", "| **TIME** | Times smaller than a day |\n", "| **PERCENT** | Percentage (including “%”) |\n", "| **MONEY** | Monetary values, including unit |\n", "| **QUANTITY** | Measurements such as weight or distance |\n", "| **ORDINAL** | “first”, “second”, etc. |\n", "| **CARDINAL** | Numerals that do not fall under another type |" ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 4 }