{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Spelling correction\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deeppavlov/DeepPavlov/blob/master/docs/features/models/spelling_correction.ipynb)\n",
    "\n",
    "# Table of contents \n",
    "\n",
    "1. [Introduction to the task](#1.-Introduction-to-the-task)\n",
    "\n",
    "2. [Get started with the model](#2.-Get-started-with-the-model)\n",
    "\n",
    "3. [Models list](#3.-Models-list)\n",
    "\n",
    "4. [Use the model for prediction](#4.-Use-the-model-for-prediction)\n",
    "\n",
    "    4.1. [Predict using Python](#4.1-Predict-using-Python)\n",
    "\n",
    "    4.2. [Predict using CLI](#4.2-Predict-using-CLI)\n",
    "\n",
    "5. [Customize the model](#5.-Customize-the-model)\n",
    "\n",
    "    5.1. [Training configuration](#5.1-Training-configuration)\n",
    "\n",
    "    5.2. [Language model](#5.2-Language-model)\n",
    "\n",
    "6. [Comparison](#6.-Comparison)\n",
    "\n",
    "# 1. Introduction to the task\n",
    "\n",
    "Spelling correction is detection of words in the text with spelling errors and replacement them with correct ones.\n",
    "\n",
    "For example, the sentence\n",
    "\n",
    "```\n",
    "The platypus lives in eastern Astralia, inkluding Tasmania.\n",
    "```\n",
    "\n",
    "with spelling mistakes ('Astralia', 'inkluding') will be corrected as\n",
    "\n",
    "```\n",
    "The platypus lives in eastern Australia, including Tasmania.\n",
    "```\n",
    "\n",
    "# 2. Get started with the model\n",
    "\n",
    "First make sure you have the DeepPavlov Library installed.\n",
    "[More info about the first installation.](http://docs.deeppavlov.ai/en/master/intro/installation.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -q deeppavlov"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then make sure that all the required packages for the model are installed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python -m deeppavlov install brillmoore_wikitypos_en"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`brillmoore_wikitypos_en` is the name of the model's *config_file*. [What is a Config File?](http://docs.deeppavlov.ai/en/master/intro/configuration.html)\n",
    "\n",
    "There are alternative ways to install the model's packages that do not require executing a separate command -- see the options in the next sections of this page.\n",
    "The full list of models for spelling correction with their config names can be found in the [table](#3.-Models-list).\n",
    "\n",
    "# 3. Models list\n",
    "\n",
    "The table presents a list of all of the models for entity detection, linking and extraction available in the DeepPavlov Library.\n",
    "\n",
    "| Config name | Language | RAM |\n",
    "| :--- | --- | --- |\n",
    "| brillmoore_wikitypos_en | En | 6.7 Gb |\n",
    "| levenshtein_corrector_ru | Ru | 8.7 Gb |\n",
    "\n",
    "We provide two types of pipelines for spelling correction:\n",
    "\n",
    "* [levenshtein_corrector](#4.1.1-Levenshtein-corrector) uses simple Damerau-Levenshtein distance to find correction candidates\n",
    "\n",
    "* [brillmoore](#4.1.2-Brillmoore) uses statistics based error model for it.\n",
    "\n",
    "In both cases correction candidates are chosen based on context with the help of a [kenlm language model](https://docs.deeppavlov.ai/en/master/features/models/spelling_correction.html#language-model).\n",
    "\n",
    "You can find [the comparison](#6.-Comparison) of these and other approaches near the end of this readme.\n",
    "\n",
    "# 4. Use the model for prediction\n",
    "\n",
    "## 4.1 Predict using Python\n",
    "\n",
    "### 4.1.1 Levenshtein corrector\n",
    "\n",
    "[This component](https://docs.deeppavlov.ai/en/master/apiref/models/spelling_correction.html#deeppavlov.models.spelling_correction.levenshtein.LevenshteinSearcherComponent) finds all the candidates in a static dictionary on a set Damerau-Levenshtein distance. It can separate one token into two but it will not work the other way around.\n",
    "\n",
    "**Component config parameters**:\n",
    "\n",
    "-  ``in`` — list with one element: name of this component's input in\n",
    "   chainer's shared memory\n",
    "-  ``out`` — list with one element: name for this component's output in\n",
    "   chainer's shared memory\n",
    "-  ``class_name`` always equals to ``\"spelling_levenshtein\"`` or ``deeppavlov.models.spelling_correction.levenshtein.searcher_component:LevenshteinSearcherComponent``.\n",
    "-  ``words`` — list of all correct words (should be a reference)\n",
    "-  ``max_distance`` — maximum allowed Damerau-Levenshtein distance\n",
    "   between source words and candidates\n",
    "-  ``error_probability`` — assigned probability for every edit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from deeppavlov import build_model, configs\n",
    "\n",
    "model = build_model('levenshtein_corrector_ru', download=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['утконос живет в восточной австралии на обширном ареале от холодных плато тасмании и австралийских альп до дождевых лесов прибрежного квинсленда.']\n"
     ]
    }
   ],
   "source": [
    "model(['Утканос живет в Васточной Австралии на обширном ареале от холодных плато Тасмании и Австралийских Альп до дождевых лесов прибрежного Квинсленда.'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.1.2 Brillmoore\n",
    "\n",
    "[This component](https://docs.deeppavlov.ai/en/master/apiref/models/spelling_correction.html#deeppavlov.models.spelling_correction.brillmoore.ErrorModel) is based on [An Improved Error Model for Noisy Channel Spelling Correction](http://www.aclweb.org/anthology/P00-1037) by Eric Brill and Robert C. Moore and uses statistics based error model to find best candidates in a static dictionary.\n",
    "\n",
    "**Component config parameters:**\n",
    "\n",
    "-  ``in`` — list with one element: name of this component's input in\n",
    "   chainer's shared memory\n",
    "-  ``out`` — list with one element: name for this component's output in\n",
    "   chainer's shared memory\n",
    "-  ``class_name`` always equals to ``\"spelling_error_model\"`` or ``deeppavlov.models.spelling_correction.brillmoore.error_model:ErrorModel``.\n",
    "-  ``save_path`` — path where the model will be saved at after a\n",
    "   training session\n",
    "-  ``load_path`` — path to the pretrained model\n",
    "-  ``window`` — window size for the error model from ``0`` to ``4``,\n",
    "   defaults to ``1``\n",
    "-  ``candidates_count`` — maximum allowed count of candidates for every\n",
    "   source token\n",
    "-  ``dictionary`` — description of a static dictionary model, instance\n",
    "   of (or inherited from)\n",
    "   ``deeppavlov.vocabs.static_dictionary.StaticDictionary``\n",
    "\n",
    "   -  ``class_name`` — ``\"static_dictionary\"`` for a custom dictionary or one\n",
    "      of two provided:\n",
    "\n",
    "      -  ``\"russian_words_vocab\"`` to automatically download and use a\n",
    "         list of russian words from\n",
    "         `https://github.com/danakt/russian-words/ <https://github.com/danakt/russian-words/>`__\n",
    "      -  ``\"wikitionary_100K_vocab\"`` to automatically download a list\n",
    "         of most common words from Project Gutenberg from\n",
    "         `Wiktionary <https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists#Project_Gutenberg>`__\n",
    "\n",
    "   -  ``dictionary_name`` — name of a directory where a dictionary will\n",
    "      be built to and loaded from, defaults to ``\"dictionary\"`` for\n",
    "      static\\_dictionary\n",
    "   -  ``raw_dictionary_path`` — path to a file with a line-separated\n",
    "      list of dictionary words, required for static\\_dictionary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from deeppavlov import build_model, configs\n",
    "\n",
    "model = build_model('brillmoore_wikitypos_en', download=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['the platypus lives in australia.']\n"
     ]
    }
   ],
   "source": [
    "model(['The platypus lives in Astralia.'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.2 Predict using CLI\n",
    "\n",
    "You can also get predictions in an interactive mode through CLI (Сommand Line Interface)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! python -m deeppavlov interact brillmoore_wikitypos_en -d"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5. Customize the model\n",
    "\n",
    "## 5.1 Training configuration\n",
    "\n",
    "For the training phase config file needs to also include these\n",
    "parameters:\n",
    "\n",
    "-  ``dataset_iterator`` — it should always be set like\n",
    "   ``\"dataset_iterator\": {\"class_name\": \"typos_iterator\"}``\n",
    "\n",
    "   -  ``class_name`` always equals to ``typos_iterator``\n",
    "   -  ``test_ratio`` — ratio of test data to train, from ``0.`` to\n",
    "      ``1.``, defaults to ``0.``\n",
    "\n",
    "-  ``dataset_reader``\n",
    "\n",
    "   -  ``class_name`` — ``typos_custom_reader`` for a custom dataset or one of\n",
    "      two provided:\n",
    "\n",
    "      -  ``typos_kartaslov_reader`` to automatically download and\n",
    "         process misspellings dataset for russian language from\n",
    "         https://github.com/dkulagin/kartaslov/tree/master/dataset/orfo_and_typos\n",
    "      -  ``typos_wikipedia_reader`` to automatically download and\n",
    "         process a list of common misspellings from english\n",
    "         Wikipedia - https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines\n",
    "\n",
    "   -  ``data_path`` — required for typos\\_custom\\_reader as a path to\n",
    "      a dataset file,\n",
    "      where each line contains a misspelling and a correct spelling\n",
    "      of a word separated by a tab symbol\n",
    "\n",
    "Component's configuration for ``spelling_error_model`` also has to\n",
    "have as ``fit_on`` parameter — list of two elements:\n",
    "names of component's input and true output in chainer's shared\n",
    "memory.\n",
    "\n",
    "## 5.2 Language model\n",
    "\n",
    "Provided pipelines use [KenLM](http://kheafield.com/code/kenlm/) to process language models, so if you want to build your own, we suggest you consult its website. We do also provide our own language models for\n",
    "[english](http://files.deeppavlov.ai/lang_models/en_wiki_no_punkt.arpa.binary.gz) (5.5GB) and\n",
    "[russian](http://files.deeppavlov.ai/lang_models/ru_wiyalen_no_punkt.arpa.binary.gz) (3.1GB) languages.\n",
    "\n",
    "# 6. Comparison\n",
    "\n",
    "We compared our pipelines with\n",
    "[Yandex.Speller](http://api.yandex.ru/speller/),\n",
    "[JamSpell](https://github.com/bakwc/JamSpell) and\n",
    "[PyHunSpell](https://github.com/blatinier/pyhunspell)\n",
    "on the [test set](http://www.dialog-21.ru/media/3838/test_sample_testset.txt) for the [SpellRuEval\n",
    "competition](http://www.dialog-21.ru/en/evaluation/2016/spelling_correction/)\n",
    "on Automatic Spelling Correction for Russian:\n",
    "\n",
    "| Correction method | Precision | Recall | F-measure | Speed (sentences/s) |\n",
    "| :---------------- | --------- | ------ | --------- | ------------------- |\n",
    "| Yandex.Speller | 83.09 | 59.86 | 69.59 | 5. |\n",
    "| DeepPavlov levenshtein_corrector_ru | 59.38 | 53.44 | 56.25 | 39.3 |\n",
    "| Hunspell + lm | 41.03 | 48.89 | 44.61 | 2.1 |\n",
    "| JamSpell | 44.57 | 35.69 | 39.64 | 136.2 |\n",
    "| Hunspell | 30.30 | 34.02 | 32.06 | 20.3 |"
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 4
}