Pre-trained embeddings¶
BERT¶
We are publishing several pre-trained BERT models:
RuBERT for the Russian language
Slavic BERT for Bulgarian, Czech, Polish, and Russian
Conversational BERT for informal English
Conversational BERT for informal Russian
Sentence Multilingual BERT for encoding sentences in 101 languages
Sentence RuBERT for encoding sentences in Russian
A description of these models is available in the BERT section of the docs.
License¶
The pre-trained models are distributed under the Apache 2.0 license.
Downloads¶
The TensorFlow models can be run with the original BERT repo code, while the PyTorch models can be run with the Hugging Face Transformers library.
The download links are:
| Description | Model parameters | Download links |
|---|---|---|
| RuBERT | vocab size = 120K, parameters = 180M, size = 632MB | |
| Slavic BERT | vocab size = 120K, parameters = 180M, size = 632MB | |
| Conversational BERT | vocab size = 30K, parameters = 110M, size = 385MB | |
| Conversational RuBERT | vocab size = 120K, parameters = 180M, size = 630MB | |
| Sentence Multilingual BERT | vocab size = 120K, parameters = 180M, size = 630MB | |
| Sentence RuBERT | vocab size = 120K, parameters = 180M, size = 630MB | |
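For example, the PyTorch weights can be loaded in a few lines with Transformers. This is a minimal sketch, assuming the RuBERT checkpoint is available on the Hugging Face Hub under the ID DeepPavlov/rubert-base-cased:

import torch
from transformers import AutoTokenizer, AutoModel

# The model ID below is an assumption; substitute the checkpoint you downloaded
tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")
model = AutoModel.from_pretrained("DeepPavlov/rubert-base-cased")

inputs = tokenizer("Привет, мир!", return_tensors="pt")  # "Hello, world!"
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence length, hidden size)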
ELMo¶
ELMo can be used from Python as follows:
import tensorflow as tf
import tensorflow_hub as hub

# Load the ELMo module trained on Russian WMT News (TF1-style hub.Module);
# the archive is downloaded and cached on first use
elmo = hub.Module("http://files.deeppavlov.ai/deeppavlov_data/elmo_ru-news_wmt11-16_1.5M_steps.tar.gz", trainable=True)

# The "default" signature accepts plain strings ("это предложение" means "this sentence")
embeddings = elmo(["это предложение", "word"], signature="default", as_dict=True)["elmo"]

sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess.run(tf.tables_initializer())  # initialize any lookup tables the module defines
sess.run(embeddings)
The TensorFlow Hub module also accepts tokenized sentences in the following format:
# Shorter sentences are padded with empty strings; sequence_len holds the true lengths
tokens_input = [["мама", "мыла", "раму"], ["рама", "", ""]]
tokens_length = [3, 1]
embeddings = elmo(inputs={"tokens": tokens_input, "sequence_len": tokens_length},
                  signature="tokens", as_dict=True)["elmo"]
sess.run(embeddings)
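Assuming the module follows the standard TF-Hub ELMo output layout, a fixed-size sentence vector is also available under the "default" key (mean pooling of the contextualized token embeddings); a hedged sketch:

# "default" is an assumption based on the standard ELMo hub module layout
sentence_embeddings = elmo(inputs={"tokens": tokens_input, "sequence_len": tokens_length},
                           signature="tokens", as_dict=True)["default"]
sess.run(sentence_embeddings)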
Downloads¶
The models can be downloaded and run as TensorFlow Hub modules from:
| Description | Dataset parameters | Perplexity | TensorFlow Hub module |
|---|---|---|---|
| ELMo on Russian Wikipedia | lines = 1M, tokens = 386M, size = 5GB | 43.692 | |
| ELMo on Russian WMT News | lines = 63M, tokens = 946M, size = 12GB | 49.876 | |
| ELMo on Russian Twitter | lines = 104M, tokens = 810M, size = 8.5GB | 94.145 | |
fastText¶
We are publishing pre-trained word vectors for the Russian language. Several models were trained on a joint Russian Wikipedia and Lenta.ru corpus. We also introduce one model for Russian conversational language that was trained on a Russian Twitter corpus.
All vectors are 300-dimensional. We used fastText skip-gram (see Bojanowski et al. (2016)) to train the vectors, with various preprocessing options (see below).
You can get the vectors in either binary or text (.vec) format, for both fastText and GloVe.
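As a minimal sketch, a downloaded binary model can be queried with the fasttext Python bindings; the filename below is a placeholder for whichever model you pick from the Downloads table:

import fasttext

# Placeholder filename; use the .bin file you actually downloaded
model = fasttext.load_model("ft_skipgram_ru_300.bin")
vector = model.get_word_vector("россия")  # "Russia"; subwords handle OOV words too
print(vector.shape)  # (300,)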
License¶
The pre-trained word vectors are distributed under the Apache 2.0 license.
Downloads¶
The pre-trained fastText skipgram models can be downloaded from:
| Domain | Preprocessing | Vectors |
|---|---|---|
| Wiki+Lenta | tokenize (nltk word_tokenize), lemmatize (pymorphy2) | |
| Wiki+Lenta | tokenize (nltk word_tokenize), lowercasing | |
| Wiki+Lenta | tokenize (nltk wordpunct_tokenize) | |
| Wiki+Lenta | tokenize (nltk word_tokenize) | |
| Wiki+Lenta | tokenize (nltk word_tokenize), remove stopwords | |
| Twitter | tokenize (nltk word_tokenize) | |
Word vector training parameters¶
These word vectors were trained with the following parameters (values in […] are fastText defaults):
fastText (skipgram)
lr [0.1]
lrUpdateRate [100]
dim 300
ws [5]
epoch [5]
neg [5]
loss [softmax]
pretrainedVectors []
saveOutput [0]
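For reference, here is a hedged sketch of reproducing these settings with the fastText Python bindings; "corpus.txt" is a placeholder path for a preprocessed training corpus:

import fasttext

model = fasttext.train_unsupervised(
    "corpus.txt",          # placeholder: preprocessed training corpus
    model="skipgram",
    lr=0.1,
    lrUpdateRate=100,
    dim=300,               # the only non-default value in the list above
    ws=5,
    epoch=5,
    neg=5,
    loss="softmax",
)
model.save_model("ft_skipgram_ru_300.bin")  # placeholder output filename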