Multi-task BERT in DeepPavlov
=============================

Multi-task BERT in DeepPavlov is an implementation of the BERT training algorithm published in the paper
`Knowledge Transfer Between Tasks and Languages in the Multi-task
Encoder-agnostic Transformer-based Models <https://www.dialog-21.ru/media/5902/karpovdpluskonovalovv002.pdf>`_.

The idea is to share the BERT body between several tasks. This is useful when a model pipeline has several
components that use BERT and the amount of GPU memory is limited. Each task has its own 'head' attached to the
output of the BERT encoder. If multi-task BERT has :math:`T` heads, one training iteration consists of

- composing :math:`T` lists of examples, one for each task,

- :math:`T` gradient steps, one gradient step for each task.

By default, on every training step the lists of examples for all but one task are empty, as in the original MT-DNN repository.

While one BERT head is being trained, the parameters of the other heads do not change. On each training step, the
parameters of both the active head and the BERT body are updated.
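
The scheme above can be sketched in plain Python. This is an illustration only, not the DeepPavlov implementation:
the function name ``compose_step`` and the uniform choice of the active task are assumptions made for the example.

.. code:: python

    import random


    def compose_step(task_batches, rng):
        """Pick one active task and return T lists, all but one of them empty."""
        task_names = list(task_batches)
        active = rng.choice(task_names)  # assumption: tasks are sampled uniformly
        step = {name: (task_batches[name] if name == active else []) for name in task_names}
        return active, step


    rng = random.Random(0)
    task_batches = {
        "cola": ["sentence a", "sentence b"],
        "rte": [("premise", "hypothesis")],
        "conll": [["token1", "token2"]],
    }
    active, step = compose_step(task_batches, rng)
    # Names of the tasks that actually received examples on this step
    non_empty = [name for name, examples in step.items() if examples]

Because only the active task contributes examples, only its head and the shared body would receive a gradient
update on that step.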

Currently, multi-task BERT heads support classification, regression, NER, and multiple-choice tasks.

On this page, multi-task BERT usage is explained using a toy configuration file for a model that is trained on
single-sentence classification, sentence-pair classification, regression, multiple choice, and NER.
The config for this model is :config:`multitask_example <configs/multitask/multitask_example.json>`.

Other examples of using multitask models can be found in :config:`mt_glue <configs/multitask/mt_glue.json>`.

Train config
------------

When using the ``multitask_transformer`` component, you can use the same config file for both training and inference.

Data reading and iteration are performed by :class:`~deeppavlov.dataset_readers.multitask_reader.MultiTaskReader`
and :class:`~deeppavlov.dataset_iterators.multitask_iterator.MultiTaskIterator`. These classes are composed
of task readers and iterators and generate batches that contain data from heterogeneous datasets. The example below
demonstrates the usage of the multi-task dataset reader:

.. code:: json

  "dataset_reader": {
    "class_name": "multitask_reader",
    "task_defaults": {
      "class_name": "huggingface_dataset_reader",
      "path": "glue",
      "train": "train",
      "valid": "validation",
      "test": "test"
    },
    "tasks": {
      "cola": {"name": "cola"},
      "copa": {
        "path": "super_glue",
        "name": "copa"
      },
      "conll": {
        "class_name": "conll2003_reader",
        "use_task_defaults": false,
        "data_path": "{DOWNLOADS_PATH}/conll2003/",
        "dataset_name": "conll2003",
        "provide_pos": false
      }
    }
  }

Nested dataset readers are listed in the ``tasks`` section. By default, the parameters of nested readers are taken from
the ``task_defaults`` section. Values from ``tasks`` can complement these parameters, like the ``name`` parameter in
``dataset_reader.tasks.cola``, and can overwrite default values, like the ``path`` parameter in
``dataset_reader.tasks.copa``. In ``dataset_reader.tasks.conll``, ``use_task_defaults`` is ``false``. This special
parameter forces ``multitask_reader`` to ignore ``task_defaults`` when creating the nested reader, so the
dataset reader for the ``conll`` task uses only the parameters from ``dataset_reader.tasks.conll``.
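
The merging rules above can be sketched as a small function. This is a minimal sketch, not the code of
``multitask_reader``: the helper ``resolve_task_params`` is hypothetical, and removing ``use_task_defaults`` from the
parameters passed to the nested reader is an assumption.

.. code:: python

    import copy


    def resolve_task_params(task_defaults, tasks):
        """Merge per-task reader parameters with the shared defaults."""
        resolved = {}
        for task_name, params in tasks.items():
            params = copy.deepcopy(params)
            # Assumption: the flag is consumed here and not passed to the nested reader
            use_defaults = params.pop("use_task_defaults", True)
            merged = copy.deepcopy(task_defaults) if use_defaults else {}
            merged.update(params)  # task values complement or overwrite the defaults
            resolved[task_name] = merged
        return resolved


    task_defaults = {
        "class_name": "huggingface_dataset_reader",
        "path": "glue",
        "train": "train",
        "valid": "validation",
        "test": "test",
    }
    tasks = {
        "cola": {"name": "cola"},
        "copa": {"path": "super_glue", "name": "copa"},
        "conll": {
            "class_name": "conll2003_reader",
            "use_task_defaults": False,
            "data_path": "{DOWNLOADS_PATH}/conll2003/",
            "dataset_name": "conll2003",
            "provide_pos": False,
        },
    }
    resolved = resolve_task_params(task_defaults, tasks)

Here ``resolved["cola"]`` keeps ``path`` equal to ``glue`` and gains ``name``, ``resolved["copa"]`` has ``path``
overwritten with ``super_glue``, and ``resolved["conll"]`` contains none of the default keys.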

The same principle with default values applies to ``multitask_iterator``.

Batches generated by ``multitask_iterator`` are tuples of two elements: the model inputs and the labels.
Both inputs and labels are lists of tuples. The inputs have the following format:
``[(first_task_inputs[0], second_task_inputs[0],...), (first_task_inputs[1], second_task_inputs[1], ...), ...]``
where ``first_task_inputs``, ``second_task_inputs``, and so on are the x values of batches from the task dataset
iterators. The labels in the second element have a similar format.

If task datasets have different sizes, then for smaller datasets the lists are padded with ``None`` values. For example,
if the first task dataset inputs are ``[0, 1, 2, 3, 4, 5, 6]``, the second task dataset inputs are ``[7, 8, 9]``,
and the batch size is ``2``, then multi-task input mini-batches will be ``[(0, 7), (1, 8)]``, ``[(2, 9), (3, None)]``,
``[(4, None), (5, None)]``, ``[(6, None)]``.
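
The padding example above can be reproduced with a few lines of Python. This is a sketch of the resulting batch
layout only: ``multitask_batches`` is a hypothetical helper, and ``multitask_iterator`` may build its batches
differently internally.

.. code:: python

    from itertools import zip_longest


    def multitask_batches(task_inputs, batch_size):
        """Yield mini-batches of tuples, padding exhausted tasks with None."""
        # One tuple per example index, with None where a shorter dataset has run out
        rows = list(zip_longest(*task_inputs, fillvalue=None))
        for start in range(0, len(rows), batch_size):
            yield rows[start:start + batch_size]


    first_task_inputs = [0, 1, 2, 3, 4, 5, 6]
    second_task_inputs = [7, 8, 9]
    batches = list(multitask_batches([first_task_inputs, second_task_inputs], batch_size=2))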

In this tutorial, there are 5 datasets. Considering the batch structure, ``chainer`` inputs in
:config:`multitask_example <configs/multitask/multitask_example.json>` are:

.. code:: json

  "in": ["x_cola", "x_rte", "x_stsb", "x_copa", "x_conll"],
  "in_y": ["y_cola", "y_rte", "y_stsb", "y_copa", "y_conll"]

Sometimes a task dataset iterator returns inputs or labels consisting of more than one element. For example, a
model input element could consist of two strings. If such a variable needs to be split, the ``InputSplitter``
component can be used. Data preparation in the multi-task setting can be similar to the preparation in the single-task
setting, except for the names of the variables.

To streamline the config, however, ``input_splitter`` and ``tokenizer`` can be unified into the
``multitask_pipeline_preprocessor``. This preprocessor takes either a single preprocessor class name for all tasks in
the ``preprocessor`` parameter, or a list of preprocessor names in the ``preprocessors`` parameter. After the input is
split by ``possible_keys_to_extract``, every preprocessor (initialized beforehand) processes the input.
Note that if the ``strict`` parameter (default: ``False``) is set to ``True``, the component always tries to split the
data. Here is the definition of
``multitask_pipeline_preprocessor`` from the :config:`multitask_example <configs/multitask/multitask_example.json>`:

.. code:: json

  "class_name": "multitask_pipeline_preprocessor",
  "possible_keys_to_extract": [0, 1],
  "preprocessors": [
    "TorchTransformersPreprocessor",
    "TorchTransformersPreprocessor",
    "TorchTransformersPreprocessor",
    "TorchTransformersMultiplechoicePreprocessor",
    "TorchTransformersNerPreprocessor"
  ],
  "do_lower_case": true,
  "n_task": 5,
  "vocab_file": "{BACKBONE}",
  "max_seq_length": 200,
  "max_subword_length": 15,
  "token_masking_prob": 0.0,
  "return_features": true,
  "in": ["x_cola", "x_rte", "x_stsb", "x_copa", "x_conll"],
  "out": [
    "bert_features_cola",
    "bert_features_rte",
    "bert_features_stsb",
    "bert_features_copa",
    "bert_features_conll"
  ]

The ``multitask_transformer`` component has common and task-specific parameters. Task-specific parameters are provided
inside the ``tasks`` parameter. ``tasks`` is a dictionary whose keys are task names and whose values are task-specific
parameters (``type``, ``options``). Common parameters are ``backbone_model`` (the same parameter as in the tokenizer)
and all parameters from ``torch_bert``.
**The order of tasks MATTERS.**
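
One way to read the warning about task order: assuming the entries of ``in`` and ``out`` are matched to tasks by
position, the keys of ``tasks`` should be listed in the same order as those entries. The sketch below illustrates that
reading with a hypothetical helper, ``pair_tasks_with_io``; it is not part of DeepPavlov, and the two-task fragment is
shortened for the example.

.. code:: python

    import json

    component = json.loads("""
    {
      "tasks": {
        "cola": {"type": "classification", "options": 2},
        "stsb": {"type": "regression", "options": 1}
      },
      "in": ["bert_features_cola", "bert_features_stsb"],
      "out": ["y_cola_pred_probas", "y_stsb_pred"]
    }
    """)


    def pair_tasks_with_io(component):
        """Pair each task with the input and output that share its position."""
        task_names = list(component["tasks"])  # dicts keep the order of the JSON keys
        if not len(task_names) == len(component["in"]) == len(component["out"]):
            raise ValueError("tasks, in and out must have the same length")
        return {
            name: (x, y)
            for name, x, y in zip(task_names, component["in"], component["out"])
        }


    pairing = pair_tasks_with_io(component)

Swapping ``cola`` and ``stsb`` inside ``tasks`` without reordering ``in`` and ``out`` would pair the regression head
with the CoLA features under this assumption.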

Here is the definition of ``multitask_transformer`` from the :config:`multitask_example <configs/multitask/multitask_example.json>`:

.. code:: json

  "id": "multitask_transformer",
  "class_name": "multitask_transformer",
  "optimizer_parameters": {"lr": 2e-5},
  "gradient_accumulation_steps": "{GRADIENT_ACC_STEPS}",
  "learning_rate_drop_patience": 2,
  "learning_rate_drop_div": 2.0,
  "return_probas": true,
  "backbone_model": "{BACKBONE}",
  "save_path": "{MODEL_PATH}",
  "load_path": "{MODEL_PATH}",
  "tasks": {
    "cola": {
      "type": "classification",
      "options": 2
    },
    "rte": {
      "type": "classification",
      "options": 2
    },
    "stsb": {
      "type": "regression",
      "options": 1
    },
    "copa": {
      "type": "multiple_choice",
      "options": 2
    },
    "conll": {
      "type": "sequence_labeling",
      "options": "#vocab_conll.len"
    }
  },
  "in": [
    "bert_features_cola",
    "bert_features_rte",
    "bert_features_stsb",
    "bert_features_copa",
    "bert_features_conll"
  ],
  "in_y": ["y_cola", "y_rte", "y_stsb", "y_copa", "y_ids_conll"],
  "out": [
    "y_cola_pred_probas",
    "y_rte_pred_probas",
    "y_stsb_pred",
    "y_copa_pred_probas",
    "y_conll_pred_ids"
  ]
         
Note that ``proba2labels`` can now take several arguments.

.. code:: json

  {
    "in":["y_cola_pred_probas", "y_rte_pred_probas", "y_copa_pred_probas"],
    "out":["y_cola_pred_ids", "y_rte_pred_ids", "y_copa_pred_ids"],
    "class_name":"proba2labels",
    "max_proba":true
  }
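
As a rough illustration of what happens with several inputs when ``max_proba`` is enabled, the sketch below takes the
index of the largest probability for each example and returns one list of labels per input. ``argmax_labels`` is a
hypothetical stand-in written for this page, not the ``proba2labels`` class itself.

.. code:: python

    def argmax_labels(*proba_batches):
        """Return one list of class indices per input batch of probability vectors."""
        return [
            [max(range(len(probas)), key=probas.__getitem__) for probas in batch]
            for batch in proba_batches
        ]


    y_cola_pred_probas = [[0.9, 0.1], [0.2, 0.8]]
    y_rte_pred_probas = [[0.4, 0.6]]
    y_cola_pred_ids, y_rte_pred_ids = argmax_labels(y_cola_pred_probas, y_rte_pred_probas)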

You may need to create your own metric for early stopping. In this example, the target metric is the average of ROC AUC
for the insults and sentiment tasks and F1 for the NER task:

.. code:: python

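    from deeppavlov.metrics.fmeasure import ner_f1
    from deeppavlov.metrics.roc_auc_score import roc_auc_score


    def roc_auc__roc_auc__ner_f1(true_onehot1, pred_probas1, true_onehot2, pred_probas2, ner_true3, ner_pred3):
        """Average the ROC AUC of two classification tasks and the F1 of the NER task."""
        roc_auc1 = roc_auc_score(true_onehot1, pred_probas1)
        roc_auc2 = roc_auc_score(true_onehot2, pred_probas2)
        # NER F1 is divided by 100 so that all three terms are on the same 0..1 scale
        ner_f1_3 = ner_f1(ner_true3, ner_pred3) / 100
        return (roc_auc1 + roc_auc2 + ner_f1_3) / 3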

If the code above is saved in ``custom_metric.py``, the metric can be used in the config as
``custom_metric:roc_auc__roc_auc__ner_f1`` (``module.submodules:function_name`` reference format).

You can make an inference-only config. In this config, there is no need for a dataset reader or a dataset iterator.
The ``train`` field and the components that prepare ``in_y`` are removed. In the ``multitask_transformer`` component
configuration, all training parameters (learning rate, optimizer, etc.) are omitted.
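
The top-level part of that conversion can be sketched as follows. ``strip_training_sections`` is a hypothetical helper
and the toy config is made up for the example; removing the label-preparing components from the pipe and the training
parameters of ``multitask_transformer`` still has to be done by hand.

.. code:: python

    import copy


    def strip_training_sections(config):
        """Return a copy of the config without the training-only top-level parts."""
        inference = copy.deepcopy(config)
        for key in ("dataset_reader", "dataset_iterator", "train"):
            inference.pop(key, None)
        # Labels are not available at inference time
        inference.get("chainer", {}).pop("in_y", None)
        return inference


    train_config = {
        "dataset_reader": {"class_name": "multitask_reader"},
        "dataset_iterator": {"class_name": "multitask_iterator"},
        "chainer": {"in": ["x_cola"], "in_y": ["y_cola"], "pipe": [], "out": ["y_cola_pred_ids"]},
        "train": {"epochs": 3},
    }
    inference_config = strip_training_sections(train_config)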

Here are the results of ``deeppavlov/configs/multitask/mt_glue.json`` compared to the analogous single-task configs,
according to the test server.

+-------------------+-------------+----------------+----------+---------------+-----------------------+---------------+------------+----------+----------+----------------+
| Task              | Score       | CoLA           | SST-2    | MRPC          | STS-B                 | QQP           | MNLI(m/mm) | QNLI     | RTE      | AX             |
+-------------------+-------------+----------------+----------+---------------+-----------------------+---------------+------------+----------+----------+----------------+
| Metric            | from server | Matthew's Corr | Accuracy | F1 / Accuracy | Pearson/Spearman Corr | F1 / Accuracy | Accuracy   | Accuracy | Accuracy | Matthew's Corr |
+===================+=============+================+==========+===============+=======================+===============+============+==========+==========+================+
| Multitask config  | 77.8        | 43.6           | 93.2     | 88.6/84.2     | 84.3/84.0             | 70.1/87.9     | 83.0/82.6  | 90.6     | 75.4     | 35.4           |
+-------------------+-------------+----------------+----------+---------------+-----------------------+---------------+------------+----------+----------+----------------+
| Singletask config | 77.6        | 53.6           | 92.7     | 87.7/83.6     | 84.4/83.1             | 70.5/88.9     | 84.4/83.2  | 90.3     | 63.4     | 36.3           |
+-------------------+-------------+----------------+----------+---------------+-----------------------+---------------+------------+----------+----------+----------------+