{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#### Spelling correction\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deeppavlov/DeepPavlov/blob/master/docs/features/models/spelling_correction.ipynb)\n", "\n", "# Table of contents \n", "\n", "1. [Introduction to the task](#1.-Introduction-to-the-task)\n", "\n", "2. [Get started with the model](#2.-Get-started-with-the-model)\n", "\n", "3. [Models list](#3.-Models-list)\n", "\n", "4. [Use the model for prediction](#4.-Use-the-model-for-prediction)\n", "\n", " 4.1. [Predict using Python](#4.1-Predict-using-Python)\n", "\n", " 4.2. [Predict using CLI](#4.2-Predict-using-CLI)\n", "\n", "5. [Customize the model](#5.-Customize-the-model)\n", "\n", " 5.1. [Training configuration](#5.1-Training-configuration)\n", "\n", " 5.2. [Language model](#5.2-Language-model)\n", "\n", "6. [Comparison](#6.-Comparison)\n", "\n", "# 1. Introduction to the task\n", "\n", "Spelling correction is the task of detecting misspelled words in a text and replacing them with the correct ones.\n", "\n", "For example, the sentence\n", "\n", "```\n", "The platypus lives in eastern Astralia, inkluding Tasmania.\n", "```\n", "\n", "with spelling mistakes ('Astralia', 'inkluding') will be corrected to\n", "\n", "```\n", "The platypus lives in eastern Australia, including Tasmania.\n", "```\n", "\n", "# 2. Get started with the model\n", "\n", "First, make sure you have the DeepPavlov Library installed.\n", "[More info about the first installation.](http://docs.deeppavlov.ai/en/master/intro/installation.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -q deeppavlov" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then make sure that all the required packages for the model are installed." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python -m deeppavlov install brillmoore_wikitypos_en" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`brillmoore_wikitypos_en` is the name of the model's *config_file*. [What is a Config File?](http://docs.deeppavlov.ai/en/master/intro/configuration.html)\n", "\n", "There are alternative ways to install the model's packages that do not require executing a separate command; see the options in the following sections of this page.\n", "The full list of models for spelling correction with their config names can be found in the [table](#3.-Models-list).\n", "\n", "# 3. Models list\n", "\n", "The table lists all of the spelling correction models available in the DeepPavlov Library.\n", "\n", "| Config name | Language | RAM |\n", "| :--- | --- | --- |\n", "| brillmoore_wikitypos_en | En | 6.7 GB |\n", "| levenshtein_corrector_ru | Ru | 8.7 GB |\n", "\n", "We provide two types of pipelines for spelling correction:\n", "\n", "* [levenshtein_corrector](#4.1.1-Levenshtein-corrector) uses the Damerau-Levenshtein distance to find correction candidates.\n", "\n", "* [brillmoore](#4.1.2-Brillmoore) uses a statistics-based error model to find correction candidates.\n", "\n", "In both cases, correction candidates are chosen based on context with the help of a [KenLM language model](https://docs.deeppavlov.ai/en/master/features/models/spelling_correction.html#language-model).\n", "\n", "You can find [the comparison](#6.-Comparison) of these and other approaches near the end of this page.\n", "\n", "# 4. 
Use the model for prediction\n", "\n", "## 4.1 Predict using Python\n", "\n", "### 4.1.1 Levenshtein corrector\n", "\n", "[This component](https://docs.deeppavlov.ai/en/master/apiref/models/spelling_correction.html#deeppavlov.models.spelling_correction.levenshtein.LevenshteinSearcherComponent) finds all the candidates in a static dictionary within a set Damerau-Levenshtein distance. It can split one token into two, but it cannot merge two tokens into one.\n", "\n", "**Component config parameters**:\n", "\n", "- ``in`` — list with one element: name of this component's input in\n", " chainer's shared memory\n", "- ``out`` — list with one element: name for this component's output in\n", " chainer's shared memory\n", "- ``class_name`` is always ``\"spelling_levenshtein\"`` or ``deeppavlov.models.spelling_correction.levenshtein.searcher_component:LevenshteinSearcherComponent``.\n", "- ``words`` — list of all correct words (should be a reference)\n", "- ``max_distance`` — maximum allowed Damerau-Levenshtein distance\n", " between source words and candidates\n", "- ``error_probability`` — probability assigned to every edit" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from deeppavlov import build_model\n", "\n", "# Build the Russian Levenshtein pipeline; download=True fetches the required files.\n", "model = build_model('levenshtein_corrector_ru', download=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['утконос живет в восточной австралии на обширном ареале от холодных плато тасмании и австралийских альп до дождевых лесов прибрежного квинсленда.']\n" ] } ], "source": [ "model(['Утканос живет в Васточной Австралии на обширном ареале от холодных плато Тасмании и Австралийских Альп до дождевых лесов прибрежного Квинсленда.'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.1.2 Brillmoore\n", "\n", "[This 
component](https://docs.deeppavlov.ai/en/master/apiref/models/spelling_correction.html#deeppavlov.models.spelling_correction.brillmoore.ErrorModel) is based on [An Improved Error Model for Noisy Channel Spelling Correction](http://www.aclweb.org/anthology/P00-1037) by Eric Brill and Robert C. Moore and uses a statistics-based error model to find the best candidates in a static dictionary.\n", "\n", "**Component config parameters:**\n", "\n", "- ``in`` — list with one element: name of this component's input in\n", " chainer's shared memory\n", "- ``out`` — list with one element: name for this component's output in\n", " chainer's shared memory\n", "- ``class_name`` is always ``\"spelling_error_model\"`` or ``deeppavlov.models.spelling_correction.brillmoore.error_model:ErrorModel``.\n", "- ``save_path`` — path where the model will be saved after a\n", " training session\n", "- ``load_path`` — path to the pretrained model\n", "- ``window`` — window size for the error model from ``0`` to ``4``,\n", " defaults to ``1``\n", "- ``candidates_count`` — maximum allowed number of candidates for every\n", " source token\n", "- ``dictionary`` — description of a static dictionary model, an instance\n", " of (or a class inherited from)\n", " ``deeppavlov.vocabs.static_dictionary.StaticDictionary``\n", "\n", " - ``class_name`` — ``\"static_dictionary\"`` for a custom dictionary or one\n", " of the two provided:\n", "\n", " - ``\"russian_words_vocab\"`` to automatically download and use a\n", " list of Russian words from\n", " https://github.com/danakt/russian-words/\n", " - ``\"wikitionary_100K_vocab\"`` to automatically download a list\n", " of the most common words from Project Gutenberg, as listed on\n", " Wiktionary\n", "\n", " - ``dictionary_name`` — name of the directory where the dictionary will\n", " be built and loaded from, defaults to ``\"dictionary\"`` for\n", " static\\_dictionary\n", " - ``raw_dictionary_path`` — path to a file with a line-separated\n", " list of dictionary words, 
required for static\\_dictionary" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from deeppavlov import build_model\n", "\n", "# Build the English Brill-Moore pipeline; download=True fetches the required files.\n", "model = build_model('brillmoore_wikitypos_en', download=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['the platypus lives in australia.']\n" ] } ], "source": [ "model(['The platypus lives in Astralia.'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.2 Predict using CLI\n", "\n", "You can also get predictions in interactive mode through the CLI (Command Line Interface)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! python -m deeppavlov interact brillmoore_wikitypos_en -d" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5. Customize the model\n", "\n", "## 5.1 Training configuration\n", "\n", "For the training phase, the config file also needs to include these\n", "parameters:\n", "\n", "- ``dataset_iterator`` — it should always be set like\n", " ``\"dataset_iterator\": {\"class_name\": \"typos_iterator\"}``\n", "\n", " - ``class_name`` is always ``typos_iterator``\n", " - ``test_ratio`` — ratio of test data to train data, from ``0.`` to\n", " ``1.``, defaults to ``0.``\n", "\n", "- ``dataset_reader``\n", "\n", " - ``class_name`` — ``typos_custom_reader`` for a custom dataset or one of\n", " the two provided:\n", "\n", " - ``typos_kartaslov_reader`` to automatically download and\n", " process a misspellings dataset for the Russian language from\n", " https://github.com/dkulagin/kartaslov/tree/master/dataset/orfo_and_typos\n", " - ``typos_wikipedia_reader`` to automatically download and\n", " process a list of common misspellings from the English\n", " Wikipedia - https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines\n", "\n", " - ``data_path`` — required for typos\\_custom\\_reader as a 
path to\n", " a dataset file,\n", " where each line contains a misspelling and the correct spelling\n", " of a word separated by a tab character\n", "\n", "The component configuration for ``spelling_error_model`` must also\n", "include a ``fit_on`` parameter: a list of two elements,\n", "the names of the component's input and true output in the chainer's shared\n", "memory.\n", "\n", "## 5.2 Language model\n", "\n", "The provided pipelines use [KenLM](http://kheafield.com/code/kenlm/) to process language models, so if you want to build your own, we suggest you consult its website. We also provide our own language models for the\n", "[English](http://files.deeppavlov.ai/lang_models/en_wiki_no_punkt.arpa.binary.gz) (5.5 GB) and\n", "[Russian](http://files.deeppavlov.ai/lang_models/ru_wiyalen_no_punkt.arpa.binary.gz) (3.1 GB) languages.\n", "\n", "# 6. Comparison\n", "\n", "We compared our pipelines with\n", "[Yandex.Speller](http://api.yandex.ru/speller/),\n", "[JamSpell](https://github.com/bakwc/JamSpell) and\n", "[PyHunSpell](https://github.com/blatinier/pyhunspell)\n", "on the [test set](http://www.dialog-21.ru/media/3838/test_sample_testset.txt) for the [SpellRuEval\n", "competition](http://www.dialog-21.ru/en/evaluation/2016/spelling_correction/)\n", "on Automatic Spelling Correction for Russian:\n", "\n", "| Correction method | Precision | Recall | F-measure | Speed (sentences/s) |\n", "| :---------------- | --------- | ------ | --------- | ------------------- |\n", "| Yandex.Speller | 83.09 | 59.86 | 69.59 | 5.0 |\n", "| DeepPavlov levenshtein_corrector_ru | 59.38 | 53.44 | 56.25 | 39.3 |\n", "| Hunspell + lm | 41.03 | 48.89 | 44.61 | 2.1 |\n", "| JamSpell | 44.57 | 35.69 | 39.64 | 136.2 |\n", "| Hunspell | 30.30 | 34.02 | 32.06 | 20.3 |" ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 4 }