{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Spelling correction\n",
"\n",
"[](https://colab.research.google.com/github/deeppavlov/DeepPavlov/blob/master/docs/features/models/spelling_correction.ipynb)\n",
"\n",
"# Table of contents \n",
"\n",
"1. [Introduction to the task](#1.-Introduction-to-the-task)\n",
"\n",
"2. [Get started with the model](#2.-Get-started-with-the-model)\n",
"\n",
"3. [Models list](#3.-Models-list)\n",
"\n",
"4. [Use the model for prediction](#4.-Use-the-model-for-prediction)\n",
"\n",
" 4.1. [Predict using Python](#4.1-Predict-using-Python)\n",
"\n",
" 4.2. [Predict using CLI](#4.2-Predict-using-CLI)\n",
"\n",
"5. [Customize the model](#5.-Customize-the-model)\n",
"\n",
" 5.1. [Training configuration](#5.1-Training-configuration)\n",
"\n",
" 5.2. [Language model](#5.2-Language-model)\n",
"\n",
"6. [Comparison](#6.-Comparison)\n",
"\n",
"# 1. Introduction to the task\n",
"\n",
"Spelling correction is detection of words in the text with spelling errors and replacement them with correct ones.\n",
"\n",
"For example, the sentence\n",
"\n",
"```\n",
"The platypus lives in eastern Astralia, inkluding Tasmania.\n",
"```\n",
"\n",
"with spelling mistakes ('Astralia', 'inkluding') will be corrected as\n",
"\n",
"```\n",
"The platypus lives in eastern Australia, including Tasmania.\n",
"```\n",
"\n",
"# 2. Get started with the model\n",
"\n",
"First make sure you have the DeepPavlov Library installed.\n",
"[More info about the first installation.](http://docs.deeppavlov.ai/en/master/intro/installation.html)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install -q deeppavlov"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then make sure that all the required packages for the model are installed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python -m deeppavlov install brillmoore_wikitypos_en"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`brillmoore_wikitypos_en` is the name of the model's *config_file*. [What is a Config File?](http://docs.deeppavlov.ai/en/master/intro/configuration.html)\n",
"\n",
"There are alternative ways to install the model's packages that do not require executing a separate command -- see the options in the next sections of this page.\n",
"The full list of models for spelling correction with their config names can be found in the [table](#3.-Models-list).\n",
"\n",
"# 3. Models list\n",
"\n",
"The table presents a list of all of the models for entity detection, linking and extraction available in the DeepPavlov Library.\n",
"\n",
"| Config name | Language | RAM |\n",
"| :--- | --- | --- |\n",
"| brillmoore_wikitypos_en | En | 6.7 Gb |\n",
"| levenshtein_corrector_ru | Ru | 8.7 Gb |\n",
"\n",
"We provide two types of pipelines for spelling correction:\n",
"\n",
"* [levenshtein_corrector](#4.1.1-Levenshtein-corrector) uses simple Damerau-Levenshtein distance to find correction candidates\n",
"\n",
"* [brillmoore](#4.1.2-Brillmoore) uses statistics based error model for it.\n",
"\n",
"In both cases correction candidates are chosen based on context with the help of a [kenlm language model](https://docs.deeppavlov.ai/en/master/features/models/spelling_correction.html#language-model).\n",
"\n",
"You can find [the comparison](#6.-Comparison) of these and other approaches near the end of this readme.\n",
"\n",
"# 4. Use the model for prediction\n",
"\n",
"## 4.1 Predict using Python\n",
"\n",
"### 4.1.1 Levenshtein corrector\n",
"\n",
"[This component](https://docs.deeppavlov.ai/en/master/apiref/models/spelling_correction.html#deeppavlov.models.spelling_correction.levenshtein.LevenshteinSearcherComponent) finds all the candidates in a static dictionary on a set Damerau-Levenshtein distance. It can separate one token into two but it will not work the other way around.\n",
"\n",
"**Component config parameters**:\n",
"\n",
"- ``in`` — list with one element: name of this component's input in\n",
" chainer's shared memory\n",
"- ``out`` — list with one element: name for this component's output in\n",
" chainer's shared memory\n",
"- ``class_name`` always equals to ``\"spelling_levenshtein\"`` or ``deeppavlov.models.spelling_correction.levenshtein.searcher_component:LevenshteinSearcherComponent``.\n",
"- ``words`` — list of all correct words (should be a reference)\n",
"- ``max_distance`` — maximum allowed Damerau-Levenshtein distance\n",
" between source words and candidates\n",
"- ``error_probability`` — assigned probability for every edit"
]
},
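{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of how the parameters above fit together, a pipeline entry for this component might look like the fragment below. The memory names ``x_tokens`` and ``tokens_candidates``, the ``#vocab.keys()`` reference and both numeric values are assumptions made for illustration, not values copied from the shipped config.\n",
"\n",
"```json\n",
"{\n",
"  \"class_name\": \"spelling_levenshtein\",\n",
"  \"in\": [\"x_tokens\"],\n",
"  \"out\": [\"tokens_candidates\"],\n",
"  \"words\": \"#vocab.keys()\",\n",
"  \"max_distance\": 1,\n",
"  \"error_probability\": 0.0001\n",
"}\n",
"```\n",
"\n",
"Here ``words`` points at another component of the same pipeline (a hypothetical vocabulary with the id ``vocab``) instead of listing the dictionary inline."
]
},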
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from deeppavlov import build_model, configs\n",
"\n",
"model = build_model('levenshtein_corrector_ru', download=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['утконос живет в восточной австралии на обширном ареале от холодных плато тасмании и австралийских альп до дождевых лесов прибрежного квинсленда.']\n"
]
}
],
"source": [
"model(['Утканос живет в Васточной Австралии на обширном ареале от холодных плато Тасмании и Австралийских Альп до дождевых лесов прибрежного Квинсленда.'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.1.2 Brillmoore\n",
"\n",
"[This component](https://docs.deeppavlov.ai/en/master/apiref/models/spelling_correction.html#deeppavlov.models.spelling_correction.brillmoore.ErrorModel) is based on [An Improved Error Model for Noisy Channel Spelling Correction](http://www.aclweb.org/anthology/P00-1037) by Eric Brill and Robert C. Moore and uses statistics based error model to find best candidates in a static dictionary.\n",
"\n",
"**Component config parameters:**\n",
"\n",
"- ``in`` — list with one element: name of this component's input in\n",
" chainer's shared memory\n",
"- ``out`` — list with one element: name for this component's output in\n",
" chainer's shared memory\n",
"- ``class_name`` always equals to ``\"spelling_error_model\"`` or ``deeppavlov.models.spelling_correction.brillmoore.error_model:ErrorModel``.\n",
"- ``save_path`` — path where the model will be saved at after a\n",
" training session\n",
"- ``load_path`` — path to the pretrained model\n",
"- ``window`` — window size for the error model from ``0`` to ``4``,\n",
" defaults to ``1``\n",
"- ``candidates_count`` — maximum allowed count of candidates for every\n",
" source token\n",
"- ``dictionary`` — description of a static dictionary model, instance\n",
" of (or inherited from)\n",
" ``deeppavlov.vocabs.static_dictionary.StaticDictionary``\n",
"\n",
" - ``class_name`` — ``\"static_dictionary\"`` for a custom dictionary or one\n",
" of two provided:\n",
"\n",
" - ``\"russian_words_vocab\"`` to automatically download and use a\n",
" list of russian words from\n",
" `https://github.com/danakt/russian-words/ `__\n",
" - ``\"wikitionary_100K_vocab\"`` to automatically download a list\n",
" of most common words from Project Gutenberg from\n",
" `Wiktionary `__\n",
"\n",
" - ``dictionary_name`` — name of a directory where a dictionary will\n",
" be built to and loaded from, defaults to ``\"dictionary\"`` for\n",
" static\\_dictionary\n",
" - ``raw_dictionary_path`` — path to a file with a line-separated\n",
" list of dictionary words, required for static\\_dictionary"
]
},
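{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the parameter list above more concrete, here is a hedged sketch of a component entry that uses a custom dictionary. All paths, the memory names ``x_tokens`` and ``tokens_candidates`` and the ``candidates_count`` value are hypothetical. Only the parameter names and the ``window`` default come from the description above.\n",
"\n",
"```json\n",
"{\n",
"  \"class_name\": \"spelling_error_model\",\n",
"  \"in\": [\"x_tokens\"],\n",
"  \"out\": [\"tokens_candidates\"],\n",
"  \"save_path\": \"my_models/error_model.tsv\",\n",
"  \"load_path\": \"my_models/error_model.tsv\",\n",
"  \"window\": 1,\n",
"  \"candidates_count\": 4,\n",
"  \"dictionary\": {\n",
"    \"class_name\": \"static_dictionary\",\n",
"    \"dictionary_name\": \"dictionary\",\n",
"    \"raw_dictionary_path\": \"my_data/words.txt\"\n",
"  }\n",
"}\n",
"```\n",
"\n",
"Switching ``dictionary.class_name`` to one of the two provided vocabularies would make ``raw_dictionary_path`` unnecessary, since it is only required for ``static_dictionary``."
]
},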
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from deeppavlov import build_model, configs\n",
"\n",
"model = build_model('brillmoore_wikitypos_en', download=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['the platypus lives in australia.']\n"
]
}
],
"source": [
"model(['The platypus lives in Astralia.'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4.2 Predict using CLI\n",
"\n",
"You can also get predictions in an interactive mode through CLI (Сommand Line Interface)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! python -m deeppavlov interact brillmoore_wikitypos_en -d"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 5. Customize the model\n",
"\n",
"## 5.1 Training configuration\n",
"\n",
"For the training phase config file needs to also include these\n",
"parameters:\n",
"\n",
"- ``dataset_iterator`` — it should always be set like\n",
" ``\"dataset_iterator\": {\"class_name\": \"typos_iterator\"}``\n",
"\n",
" - ``class_name`` always equals to ``typos_iterator``\n",
" - ``test_ratio`` — ratio of test data to train, from ``0.`` to\n",
" ``1.``, defaults to ``0.``\n",
"\n",
"- ``dataset_reader``\n",
"\n",
" - ``class_name`` — ``typos_custom_reader`` for a custom dataset or one of\n",
" two provided:\n",
"\n",
" - ``typos_kartaslov_reader`` to automatically download and\n",
" process misspellings dataset for russian language from\n",
" https://github.com/dkulagin/kartaslov/tree/master/dataset/orfo_and_typos\n",
" - ``typos_wikipedia_reader`` to automatically download and\n",
" process a list of common misspellings from english\n",
" Wikipedia - https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines\n",
"\n",
" - ``data_path`` — required for typos\\_custom\\_reader as a path to\n",
" a dataset file,\n",
" where each line contains a misspelling and a correct spelling\n",
" of a word separated by a tab symbol\n",
"\n",
"Component's configuration for ``spelling_error_model`` also has to\n",
"have as ``fit_on`` parameter — list of two elements:\n",
"names of component's input and true output in chainer's shared\n",
"memory.\n",
"\n",
"## 5.2 Language model\n",
"\n",
"Provided pipelines use [KenLM](http://kheafield.com/code/kenlm/) to process language models, so if you want to build your own, we suggest you consult its website. We do also provide our own language models for\n",
"[english](http://files.deeppavlov.ai/lang_models/en_wiki_no_punkt.arpa.binary.gz) (5.5GB) and\n",
"[russian](http://files.deeppavlov.ai/lang_models/ru_wiyalen_no_punkt.arpa.binary.gz) (3.1GB) languages.\n",
"\n",
"# 6. Comparison\n",
"\n",
"We compared our pipelines with\n",
"[Yandex.Speller](http://api.yandex.ru/speller/),\n",
"[JamSpell](https://github.com/bakwc/JamSpell) and\n",
"[PyHunSpell](https://github.com/blatinier/pyhunspell)\n",
"on the [test set](http://www.dialog-21.ru/media/3838/test_sample_testset.txt) for the [SpellRuEval\n",
"competition](http://www.dialog-21.ru/en/evaluation/2016/spelling_correction/)\n",
"on Automatic Spelling Correction for Russian:\n",
"\n",
"| Correction method | Precision | Recall | F-measure | Speed (sentences/s) |\n",
"| :---------------- | --------- | ------ | --------- | ------------------- |\n",
"| Yandex.Speller | 83.09 | 59.86 | 69.59 | 5. |\n",
"| DeepPavlov levenshtein_corrector_ru | 59.38 | 53.44 | 56.25 | 39.3 |\n",
"| Hunspell + lm | 41.03 | 48.89 | 44.61 | 2.1 |\n",
"| JamSpell | 44.57 | 35.69 | 39.64 | 136.2 |\n",
"| Hunspell | 30.30 | 34.02 | 32.06 | 20.3 |"
]
}
],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 4
}