{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Neural Ranking\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deeppavlov/DeepPavlov/blob/master/docs/features/models/neural_ranking.ipynb)\n",
    "\n",
    "# Table of contents \n",
    "\n",
    "1. [Introduction to the task](#1.-Introduction-to-the-task)\n",
    "\n",
    "2. [Get started with the model](#2.-Get-started-with-the-model)\n",
    "\n",
    "3. [Models list](#3.-Models-list)\n",
    "\n",
    "4. [Use the model for prediction](#4.-Use-the-model-for-prediction)\n",
    "\n",
    "    4.1. [Predict using Python](#4.1-Predict-using-Python)\n",
    "    \n",
    "    4.2. [Predict using CLI](#4.2-Predict-using-CLI)\n",
    "\n",
    "5. [Customize the model](#5.-Customize-the-model)\n",
    "\n",
    "# 1. Introduction to the task\n",
    "\n",
    "This model solves the tasks of ranking and paraphrase identification based on semantic similarity which is trained with siamese neural networks. The trained network can retrieve the response closest semantically to a given context from some database or answer whether two sentences are paraphrases or not. It is possible to build automatic semantic FAQ systems with such neural architectures.\n",
    "\n",
    "# 2. Get started with the model\n",
    "\n",
    "First make sure you have the DeepPavlov Library installed.\n",
    "[More info about the first installation.](http://docs.deeppavlov.ai/en/master/intro/installation.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -q deeppavlov"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then make sure that all the required packages for the model are installed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python -m deeppavlov install ranking_ubuntu_v2_torch_bert_uncased"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`ranking_ubuntu_v2_torch_bert_uncased` is the name of the model's *config_file*. [What is a Config File?](http://docs.deeppavlov.ai/en/master/intro/configuration.html)\n",
    "\n",
    "There are alternative ways to install the model's packages that do not require executing a separate command -- see the options in the next sections of this page.\n",
    "The full list of models for neural ranking with their config names can be found in the [table](#3.-Models-list).\n",
    "\n",
    "# 3. Models list\n",
    "\n",
    "| Config | Language | Dataset | Transformer model |\n",
    "| :--- | :---: | :--- | :--- |\n",
    "| ranking/ranking_ubuntu_v2_torch_bert_uncased.json | En | [Ubuntu v2](https://github.com/rkadlec/ubuntu-ranking-dataset-creator) | bert-base-uncased |\n",
    "| classifiers/paraphraser_rubert.json | Ru | [paraphraser.ru](https://paraphraser.ru) | DeepPavlov/rubert-base-cased |\n",
    "| classifiers/paraphraser_convers_distilrubert_2L.json | Ru | [paraphraser.ru](https://paraphraser.ru) | DeepPavlov/distilrubert-tiny-cased-conversational |\n",
    "| classifiers/paraphraser_convers_distilrubert_6L.json | Ru | [paraphraser.ru](https://paraphraser.ru) | DeepPavlov/distilrubert-base-cased-conversational |\n",
    "\n",
    "# 4. Use the model for prediction\n",
    "\n",
    "## 4.1 Predict using Python\n",
    "\n",
    "### English"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from deeppavlov import configs, build_model\n",
    "\n",
    "\n",
    "ranking = build_model(\"ranking_ubuntu_v2_torch_bert_uncased\", download=True, install=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ranking([[\"Forrest Gump is a 1994 American epic comedy-drama film directed by Robert Zemeckis.\",\n",
    "          \"Robert Zemeckis directed Forrest Gump.\",\n",
    "          \"Robert Lee Zemeckis was born on May 14, 1952, in Chicago.\"]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Input:** List[List[sentence1, sentence2, ...]], where the sentences from the second to the last will be ranked by similarity with the first sentence.\n",
    "\n",
    "**Output:** List[List[scores]] - similarity scores to the first sentence of the sentences from the second to the last.\n",
    "\n",
    "### Russian"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from deeppavlov import configs, build_model\n",
    "\n",
    "\n",
    "ranking = build_model(\"paraphraser_rubert\", download=True, install=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ranking([\"Форрест Гамп - комедийная драма, девятый полнометражный фильм режиссёра Роберта Земекиса.\"],\n",
    "        [\"Роберт Земекис был режиссером фильма «Форрест Гамп».\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Input:** Tuple[List[sentences1], List[sentence2]], where each element of the list of sentences1 will be compared with the corresponding element of the sentence2 list.\n",
    "\n",
    "**Output:** List[labels] - each label is 1 or 0, 1 - if the sentence from the first list is a paraphrase to the corresponding sentence from the second list, 0 - otherwise.\n",
    "\n",
    "## 4.2 Predict using CLI\n",
    "\n",
    "### English\n",
    "\n",
    "It is not intended to use the class ``deeppavlov.models.torch_bert.torch_bert_ranker.TorchBertRankerModel`` in the interact mode, so it is better to launch the config ranking/ranking_ubuntu_v2_torch_bert_uncased.json [using Python](#4.1-Predict-using-Python).\n",
    "\n",
    "### Russian\n",
    "\n",
    "You can also get predictions in an interactive mode through CLI (Сommand Line Interface)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! python -m deeppavlov interact paraphraser_rubert -d"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5. Customize the model\n",
    "\n",
    "## English\n",
    "\n",
    "To train the ranking model on your own data, you should make a dataset in the following format:\n",
    "\n",
    "- the dataset should have **train.csv**, **valid.csv** and **test.csv** files.\n",
    "\n",
    "- **train.csv** file should contain the following columns: Context, Utterance, Label. Context and utterance are two texts and label (0 or 1) shows the relevance of the utterance to the context.\n",
    "\n",
    "- **valid.csv** and **test.csv** files should contain the following columns: Context, Ground Truth Utterance, Distractor_0, Distractor_1, ..., Distractor_N. Distractor utterances are negative samples (utterances, irrelevant to the context).\n",
    "\n",
    "Then you should put train.csv, valid.csv and test.csv files into the directory ``\"data_path\"`` in the dataset reader from the config and launch training of the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "python -m deeppavlov train ranking_ubuntu_v2_torch_bert_uncased"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Russian\n",
    "\n",
    "To train the ranking model on your own data, you should make a dataset with two files: **paraphrases.xml** (for training) and **paraphrases_gold.xml** (for testing).\n",
    "\n",
    "The xml files should have the following format:\n",
    "\n",
    "    <?xml version='1.0' encoding='UTF8'?>\n",
    "    <data>\n",
    "      <head>\n",
    "        <title>Russian Paraphrase Corpus</title>\n",
    "        <description>This file contains a collection of sentence pairs with crowdsourced annotation. Paraphrase classes: -1: non-paraphrases, 0: loose paraphrases, 1: strict paraphrases.</description>\n",
    "        <reference>http://paraphraser.ru</reference>\n",
    "        <version>1.0 beta</version>\n",
    "        <date>2015-11-28</date>\n",
    "      </head>\n",
    "      <corpus>\n",
    "        <paraphrase>\n",
    "          <value name=\"id\">1</value>\n",
    "          <value name=\"id_1\">201</value>\n",
    "          <value name=\"id_2\">8159</value>\n",
    "          <value name=\"text_1\">text 1</value>\n",
    "          <value name=\"text_2\">text 2</value>\n",
    "          <value name=\"jaccard\">0.65</value>\n",
    "          <value name=\"class\">0</value>\n",
    "        </paraphrase>\n",
    "        <paraphrase>\n",
    "          ...\n",
    "        </paraphrase>\n",
    "      </corpus>\n",
    "    </data>\n",
    "\n",
    "Place **paraphrases.xml** and **paraphrases_gold.xml** files into the directory ``\"data_path\"`` in the dataset reader from the config and launch training of the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "python -m deeppavlov train paraphraser_rubert"
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 4
}