{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#### Tfidf Ranking\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deeppavlov/DeepPavlov/blob/master/docs/features/models/tfidf_ranking.ipynb)\n", "\n", "# Table of contents \n", "\n", "1. [Introduction to the task](#1.-Introduction-to-the-task)\n", "\n", "2. [Get started with the model](#2.-Get-started-with-the-model)\n", "\n", "3. [Models list](#3.-Models-list)\n", "\n", "4. [Use the model for prediction](#4.-Use-the-model-for-prediction)\n", "\n", " 4.1. [Predict using Python](#4.1-Predict-using-Python)\n", " \n", " 4.2. [Predict using CLI](#4.2-Predict-using-CLI)\n", "\n", "5. [Customize the model](#5.-Customize-the-model)\n", " \n", " 5.1. [Fit on Wikipedia](#5.1-Fit-on-Wikipedia)\n", " \n", " 5.2. [Download, parse new Wikipedia dump, build database and index](#5.2-Download,-parse-new-Wikipedia-dump,-build-database-and-index)\n", "\n", "# 1. Introduction to the task\n", "\n", "This is an implementation of a passage ranker based on tf-idf vectorization.\n", "The ranker implementation is based on the [DrQA](https://github.com/facebookresearch/DrQA/) project.\n", "The default ranker implementation takes a batch of queries as input and returns 100 passage titles sorted by relevance.\n", "\n", "# 2. Get started with the model\n", "\n", "First make sure you have the DeepPavlov Library installed.\n", "[More info about the first installation.](http://docs.deeppavlov.ai/en/master/intro/installation.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -q deeppavlov" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then make sure that all the required packages for the model are installed."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python -m deeppavlov install en_ranker_tfidf_wiki" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`en_ranker_tfidf_wiki` is the name of the model's *config_file*. [What is a Config File?](http://docs.deeppavlov.ai/en/master/intro/configuration.html)\n", "\n", "There are alternative ways to install the model's packages that do not require executing a separate command -- see the options in the next sections of this page.\n", "The full list of models for tfidf ranking with their config names can be found in the [table](#3.-Models-list).\n", "\n", "# 3. Models list\n", "\n", "| Config | Language | Description | RAM |\n", "| :--- | :---: | :--- | :---: |\n", "| doc_retrieval/en_ranker_tfidf_wiki.json | En | Config for TF-IDF ranking over Wikipedia | 2.9 GB |\n", "| doc_retrieval/en_ranker_pop_wiki.json | En | Config for TF-IDF ranking, followed by popularity ranking, over Wikipedia | 8.1 GB |\n", "| doc_retrieval/ru_ranker_tfidf_wiki.json | Ru | Config for TF-IDF ranking over Wikipedia | 8.4 GB |\n", "\n", "# 4. Use the model for prediction\n", "\n", "## 4.1 Predict using Python\n", "\n", "### English\n", "\n", "Building (if you don't have your own data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from deeppavlov import build_model, configs\n", "\n", "ranker = build_model(configs.doc_retrieval.en_ranker_tfidf_wiki, download=True, install=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Inference" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[18155097, 628663, 17123727, 628662, 19097375]\n" ] } ], "source": [ "result = ranker(['Who is Ivan Pavlov?'])\n", "print(result[0][:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Russian" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from deeppavlov import build_model, configs\n", "\n", "ranker = build_model(configs.doc_retrieval.ru_ranker_tfidf_wiki, download=True, install=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[4902620, 1900377, 11129584, 1720563, 1720658]\n" ] } ], "source": [ "result = ranker(['Когда произошла Куликовская битва?'])\n", "print(result[0][:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Text for the output titles can be further extracted with the [deeppavlov.vocabs.wiki_sqlite.WikiSQLiteVocab](https://docs.deeppavlov.ai/en/master/apiref/vocabs.html#deeppavlov.vocabs.wiki_sqlite.WikiSQLiteVocab) class.\n", "\n", "## 4.2 Predict using CLI\n", "\n", "You can also get predictions in interactive mode through the CLI (Command Line Interface)."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! python -m deeppavlov interact en_ranker_tfidf_wiki -d" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5. Customize the model\n", "\n", "## 5.1 Fit on Wikipedia\n", "\n", "Run the following to fit the ranker on **English** Wikipedia:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python -m deeppavlov train en_ranker_tfidf_wiki" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the following to fit the ranker on **Russian** Wikipedia:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python -m deeppavlov train ru_ranker_tfidf_wiki" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Training the ranker creates a SQLite database and a tf-idf matrix.\n", "\n", "## 5.2 Download, parse new Wikipedia dump, build database and index\n", "\n", "The **enwiki.db** SQLite database contains ~21 M Wikipedia articles and is built in the following steps:\n", "\n", "- Download a Wikipedia dump file. We took the latest\n", " [enwiki dump](https://dumps.wikimedia.org/enwiki/20230501/)\n", "\n", "- Unpack and extract the articles with [WikiExtractor](https://github.com/attardi/wikiextractor)\n", " (with ``--json``, ``--no-templates``, ``--filter_disambig_pages``\n", " options)\n", "\n", "- [Build a database](#5.1-Fit-on-Wikipedia).\n", "\n", "**enwiki_tfidf_matrix.npz** is a full Wikipedia tf-idf matrix of size **hash_size x number of documents**, which is\n", "$2^{24}$ x 21 M. 
This matrix is built with the [deeppavlov.models.vectorizers.hashing_tfidf_vectorizer.HashingTfIdfVectorizer](https://docs.deeppavlov.ai/en/master/apiref/models/vectorizers.html#deeppavlov.models.vectorizers.hashing_tfidf_vectorizer.HashingTfIdfVectorizer) class.\n", "\n", "The **ruwiki.db** SQLite database contains ~12 M Wikipedia articles and is built in the following steps:\n", "\n", "- Download a Wikipedia dump file. We took the latest [ruwiki dump](https://dumps.wikimedia.org/ruwiki/20230501/)\n", "\n", "- Unpack and extract the articles with [WikiExtractor](https://github.com/attardi/wikiextractor)\n", " (with ``--json``, ``--no-templates``, ``--filter_disambig_pages``\n", " options)\n", "\n", "- [Build a database](#5.1-Fit-on-Wikipedia).\n", "\n", "**ruwiki_tfidf_matrix.npz** is a full Wikipedia tf-idf matrix of size **hash_size x number of documents**, which is\n", "$2^{24}$ x 12 M. This matrix is built with the\n", "[deeppavlov.models.vectorizers.hashing_tfidf_vectorizer.HashingTfIdfVectorizer](https://docs.deeppavlov.ai/en/master/apiref/models/vectorizers.html#deeppavlov.models.vectorizers.hashing_tfidf_vectorizer.HashingTfIdfVectorizer) class." ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 4 }