{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#### Classification\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deeppavlov/DeepPavlov/blob/master/docs/features/models/classification.ipynb)\n", "\n", "[![Medium](https://img.shields.io/badge/Medium-12100E?style=for-the-badge&logo=medium&logoColor=white)](https://medium.com/deeppavlov/text-classification-using-deeppavlov-library-with-pytorch-and-transformers-f14db5528821)\n", "\n", "# Table of contents\n", "\n", "1. [Introduction to the task](#1.-Introduction-to-the-task)\n", "\n", "2. [Get started with the model](#2.-Get-started-with-the-model)\n", "\n", "3. [Models list](#3.-Models-list)\n", "\n", "4. [Use the model for prediction](#4.-Use-the-model-for-prediction)\n", "\n", " 4.1. [Predict using Python](#4.1-Predict-using-Python)\n", "\n", " 4.2. [Predict using CLI](#4.2-Predict-using-CLI)\n", "\n", "5. [Evaluation](#5.-Evaluation)\n", "\n", " 5.1. [from Python](#5.1-Evaluate-from-Python)\n", "\n", " 5.2. [from CLI](#5.2-Evaluate-from-CLI)\n", "\n", "6. [Train the model on your data](#6.-Train-the-model-on-your-data)\n", "\n", " 6.1. [from Python](#6.1-Train-your-model-from-Python)\n", "\n", " 6.2. [from CLI](#6.2-Train-your-model-from-CLI)\n", "\n", "7. [Simple few-shot classifiers](#7.-Simple-few-shot-classifiers)\n", "\n", " 7.1. [Few-shot setting](#7.1-Few-shot-setting)\n", "\n", " 7.2. [Multiple languages support](#7.2-Multiple-languages-support)\n", "\n", " 7.3. [Dataset and Scores](#7.3-Dataset-and-Scores)\n", "\n", "# 1. Introduction to the task\n", "This section describes a family of BERT-based models that solve a variety of different classification tasks.\n", "\n", "**Insults detection** is a binary classification task of identying wether a given sequence is an insult of another participant of communication.\n", "\n", "**Sentiment analysis** is a task of classifying the polarity of the the given sequence. The number of classes may vary depending on the data: positive/negative binary classification, multiclass classification with a neutral class added or with a number of different emotions.\n", "\n", "The models trained for the **paraphrase detection** task identify whether two sentences expressed with different words convey the same meaning.\n", "\n", "**Topic classification** refers to the task of classifying an utterance by the topic which belongs to the conversational domain.\n", "\n", "# 2. Get started with the model\n", "\n", "First make sure you have the DeepPavlov Library installed.\n", "[More info about the first installation.](http://docs.deeppavlov.ai/en/master/intro/installation.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -q deeppavlov" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then make sure that all the required packages for the model are installed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python -m deeppavlov install insults_kaggle_bert" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`insults_kaggle_bert` is the name of the model's *config_file*. [What is a Config File?](http://docs.deeppavlov.ai/en/master/intro/configuration.html)\n", "\n", "Configuration file defines the model and describes its hyperparameters. To use another model, change the name of the *config_file* here and further.\n", "The full list of classification models with their config names can be found in the [table](#3.-Models-list).\n", "\n", "# 3. Models list\n", "\n", "The table presents a list of all of the classification models available in DeepPavlov Library.\n", "\n", "| Config name | Language | Task | Dataset | Model Size | Metric | Score |\n", "| :--- | --- | --- | --- | --- | --- | ---: |\n", "| insults_kaggle_bert | En | Insults | [Insults](https://www.kaggle.com/c/detecting-insults-in-social-commentary) | 1.1 GB | ROC-AUC | 0.8770 |\n", "| paraphraser_rubert | Ru | Paraphrase | [Paraphrase Corpus](http://paraphraser.ru/download/) | 2.0 GB | F1 | 0.8738 |\n", "| paraphraser_convers_distilrubert_2L | Ru | Paraphrase | [Paraphrase Corpus](http://paraphraser.ru/download/) | 1.2 GB | F1 | 0.7396 |\n", "| paraphraser_convers_distilrubert_6L | Ru | Paraphrase | [Paraphrase Corpus](http://paraphraser.ru/download/) | 1.6 GB | F1 | 0.8354 |\n", "| sentiment_sst_conv_bert | En | Sentiment | [SST](https://paperswithcode.com/dataset/sst) | 1.1 GB | Accuracy | 0.6626 |\n", "| sentiment_twitter | Ru | Sentiment | [Twitter Mokoron](https://github.com/mokoron/sentirueval) | 6.2 GB | F1-macro | 0.9961 |\n", "| rusentiment_bert | Ru | Sentiment | [RuSentiment](https://text-machine.cs.uml.edu/projects/rusentiment/) | 1.3 GB | F1-weighted | 0.7005 |\n", "| rusentiment_convers_bert | Ru | Sentiment | [RuSentiment](https://text-machine.cs.uml.edu/projects/rusentiment/) | 1.5 GB | F1-weighted | 0.7724 |\n", "| topics_distilbert_base_uncased | En | Topics | [DeepPavlov Topics](https://deeppavlov.ai/datasets/topics) | 6.2 GB | F1-macro | 0.9961 |\n", "\n", "# 4. Use the model for prediction\n", "\n", "## 4.1 Predict using Python\n", "\n", "After [installing](#2.-Get-started-with-the-model) the model, build it from the config and predict." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from deeppavlov import build_model\n", "\n", "model = build_model('insults_kaggle_bert', download=True, install=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Input format**: List[sentences]\n", "\n", "**Output format**: List[labels]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Insult', 'Not Insult']" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model(['You are kind of stupid', 'You are a wonderful person!'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.2 Predict using CLI\n", "\n", "You can also get predictions in an interactive mode through CLI (Command Line Interface)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python deeppavlov interact insults_kaggle_bert -d" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`-d` is an optional download key (alternative to `download=True` in Python code). The key `-d` is used to download the pre-trained model along with embeddings and all other files needed to run the model.\n", "\n", "Or make predictions for samples from *stdin*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python deeppavlov predict insults_kaggle_bert -f " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5. Evaluation\n", "\n", "## 5.1 Evaluate from Python" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from deeppavlov import evaluate_model\n", "\n", "model = evaluate_model('insults_kaggle_bert', download=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.2 Evaluate from CLI" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python -m deeppavlov evaluate insults_kaggle_bert -d" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 6. Train the model on your data\n", "\n", "## 6.1 Train your model from Python\n", "\n", "### Provide your data path\n", "\n", "To train the model on your data, you need to change the path to the training data in the *config_file*.\n", "\n", "Parse the *config_file* and change the path to your data from Python." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "~/.deeppavlov/downloads/insults_data\n" ] } ], "source": [ "from deeppavlov import train_model\n", "from deeppavlov.core.commands.utils import parse_config\n", "\n", "model_config = parse_config('insults_kaggle_bert')\n", "\n", "# dataset that the model was trained on\n", "print(model_config['dataset_reader']['data_path'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Provide a *data_path* to your own dataset. You can also change any of the hyperparameters of the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# download and unzip a new example dataset\n", "!wget http://files.deeppavlov.ai/datasets/insults_data.tar.gz\n", "!tar -xzvf \"insults_data.tar.gz\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# provide a path to the directory with your train, valid and test files\n", "model_config['dataset_reader']['data_path'] = \"./contents/\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Train dataset format\n", "\n", "### Train the model using new config" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = train_model(model_config)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use your model for prediction." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Insult', 'Not Insult']" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model(['You are kind of stupid', 'You are a wonderful person!'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6.2 Train your model from CLI\n", "\n", "To train the model on your data, create a copy of a config file and change the *data_path* variable in it. After that, train the model using your new *config_file*. You can also change any of the hyperparameters of the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python -m deeppavlov train model_config.json" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 7. Simple few-shot classifiers\n", "\n", "Additionally, in the [faq](https://github.com/deeppavlov/DeepPavlov/tree/master/deeppavlov/configs/faq) section you can find a config for a fast and simple pre-BERT model, which consists of a fasttext vectorizer and a simple logistic regression classifier.\n", "\n", "## 7.1 Few-shot setting\n", "\n", "In the current setting the config can be used for few-shot classification - a task, in which only a few training examples are available for each class (usually from 5 to 10). Note that the config takes the full version of the dataset as the input and samples N examples for each class of the train data in the iterator.\n", "\n", "The sampling is done within the `basic_classification_iterator` component of the pipeline and the `shot` parameter defines the number of examples to be sampled. By default the `shot` parameter is set to `None` (no sampling applied).\n", "\n", "## 7.2 Multiple languages support\n", "\n", "By default `fasttext_logreg` supports classification in English, but can be modified for classification in Russian.\n", "\n", "In order to change `fasttext_logreg` language to Russian, change `LANGUAGE` variable in the `metadata.variables` section from `en` to `ru` and change the Spacy model by changing `SPACY_MODEL` variable from `en_core_web_sm` to `ru_core_news_sm`.\n", "\n", "You can do that by directly editing the config file through an editor or change it through Python (example below). N.B. `read_json` and `find_config` combination is intentionally used instead of `parse_config` to read config in the example, because `parse_config` will replace all `LANGUAGE` and `SPACY_MODEL` usages in the config with the default values from `metadata.variables`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from deeppavlov import build_model\n", "from deeppavlov.core.common.file import read_json, find_config\n", "\n", "model_config = read_json(find_config('fasttext_logreg'))\n", "model_config['metadata']['variables']['LANGUAGE'] = 'ru'\n", "model_config['metadata']['variables']['SPACY_MODEL'] = 'ru_core_news_sm'\n", "model = build_model(model_config, install=True, download=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7.3 Dataset and Scores\n", "\n", "To demonstrate the performance of the model in two languages, we use the English and Russian subsets of [the MASSIVE dataset](https://github.com/alexa/massive).\n", "\n", "MASSIVE is a parallel dataset of utterrances in 52 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation. We only employ the intent classification data. You can see the results of the given configs in 5-shot classification setting in the table below.\n", "\n", "| Config name | Language | Train accuracy | Validation accuracy | Test accuracy |\n", "| :--- | --- | --- | --- | ---: |\n", "| fasttext_logreg | en | 0.9632 | 0.5239 | 0.5155 |\n", "| fasttext_logreg | ru | 0.9231 | 0.4565 | 0.4304 |" ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 4 }