deeppavlov.models.embedders
class deeppavlov.models.embedders.fasttext_embedder.FasttextEmbedder(load_path: Union[str, pathlib.Path], pad_zero: bool = False, mean: bool = False, **kwargs)[source]

Class implements the fastText embedding model.

- Parameters
  load_path – path to load the pre-trained embedding model from
  pad_zero – whether to pad samples with zeros
  mean – whether to return the mean of token embeddings
- Attributes
  model – fastText model instance
  tok2emb – dictionary with already embedded tokens
  dim – dimension of embeddings
  pad_zero – whether to pad the sequence of tokens with zeros
  load_path – path to the pre-trained fastText binary model
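The pad_zero and mean options control how a batch of per-token embeddings is post-processed. A minimal numpy sketch of these two behaviors (the helper name and tiny 2-dimensional "embeddings" are hypothetical, for illustration only; real fastText vectors are typically 300-dimensional):

```python
import numpy as np

def postprocess(batch, pad_zero=False, mean=False):
    """Post-process a batch of per-token embedding lists in the way the
    embedder's options suggest (hypothetical helper, illustration only)."""
    if mean:
        # One vector per sample: the average of its token embeddings.
        return [np.mean(sample, axis=0) for sample in batch]
    if pad_zero:
        # Pad every sample with zero vectors up to the longest sample.
        max_len = max(len(sample) for sample in batch)
        dim = len(batch[0][0])
        return [np.vstack([sample, np.zeros((max_len - len(sample), dim))])
                for sample in batch]
    return [np.array(sample) for sample in batch]

# Two samples of different lengths, embedding dimension 2.
batch = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
padded = postprocess(batch, pad_zero=True)  # padded[1] gains a zero row
means = postprocess(batch, mean=True)       # one 2-d vector per sample
```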
class deeppavlov.models.embedders.tfidf_weighted_embedder.TfidfWeightedEmbedder(embedder: deeppavlov.core.models.component.Component, tokenizer: Optional[deeppavlov.core.models.component.Component] = None, pad_zero: bool = False, mean: bool = False, tags_vocab_path: Optional[str] = None, vectorizer: Optional[deeppavlov.core.models.component.Component] = None, counter_vocab_path: Optional[str] = None, idf_base_count: int = 100, log_base: int = 10, min_idf_weight=0.0, **kwargs)[source]

The class embeds a sentence as a weighted average of its token embeddings. The weights can be taken from the TF-IDF vectorizer given in vectorizer, or calculated as TF-IDF from the counter vocabulary given in counter_vocab_path. One can also pass tags_vocab_path pointing to a vocabulary with tag weights; in that case a batch of tags should be given as the second input to the __call__ method.
- Parameters
  embedder – embedder instance
  tokenizer – tokenizer instance; should be able to detokenize a sentence
  pad_zero – whether to pad samples with zeros
  mean – whether to return mean token embedding
  tags_vocab_path – optional path to a vocabulary with tag weights
  vectorizer – vectorizer instance; should be trained with analyzer="word"
  counter_vocab_path – path to the counter vocabulary
  idf_base_count – minimal count value (tokens that occur fewer times are not counted)
  log_base – logarithm base for the TF-IDF coefficient calculation from the counter vocabulary
  min_idf_weight – minimal idf weight
- Attributes
  embedder – embedder instance
  tokenizer – tokenizer instance; should be able to detokenize a sentence
  dim – dimension of embeddings
  pad_zero – whether to pad samples with zeros
  mean – whether to return mean token embedding
  tags_vocab – vocabulary with weights for tags
  vectorizer – vectorizer instance
  counter_vocab_path – path to the counter vocabulary
  counter_vocab – counter vocabulary
  idf_base_count – minimal count value (tokens that occur fewer times are not counted)
  log_base – logarithm base for the TF-IDF coefficient calculation from the counter vocabulary
  min_idf_weight – minimal idf weight
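The counter-vocabulary path can be sketched in plain Python. The exact formula TfidfWeightedEmbedder uses may differ; the functions below are hypothetical and only illustrate the roles of idf_base_count, log_base, and min_idf_weight in turning raw token counts into weights for averaging:

```python
import math

def idf_weight(count, total_count, idf_base_count=100, log_base=10,
               min_idf_weight=0.0):
    """Sketch of a counter-based IDF-style weight: rarer tokens get
    larger weights (hypothetical, not DeepPavlov's exact formula)."""
    # Counts below idf_base_count are clamped so extremely rare
    # (or unseen) tokens do not receive runaway weights.
    count = max(count, idf_base_count)
    weight = math.log(total_count / count, log_base)
    return max(weight, min_idf_weight)

def weighted_average(tokens, embeddings, counts, total_count):
    """Weighted average of token embeddings using the weights above."""
    weights = [idf_weight(counts.get(t, 0), total_count) for t in tokens]
    total_w = sum(weights) or 1.0
    dim = len(embeddings[0])
    return [sum(w * e[i] for w, e in zip(weights, embeddings)) / total_w
            for i in range(dim)]
```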
Examples

>>> from deeppavlov.models.embedders.tfidf_weighted_embedder import TfidfWeightedEmbedder
>>> from deeppavlov.models.embedders.fasttext_embedder import FasttextEmbedder
>>> fasttext_embedder = FasttextEmbedder('/data/embeddings/wiki.ru.bin')
>>> fastTextTfidf = TfidfWeightedEmbedder(embedder=fasttext_embedder,
...     counter_vocab_path='/data/vocabs/counts_wiki_lenta.txt')
>>> fastTextTfidf([['большой', 'и', 'розовый', 'бегемот']])
[array([ 1.99135890e-01, -7.14746421e-02,  8.01428872e-02, -5.32840924e-02,
         5.05212297e-02,  2.76053832e-01, -2.53270134e-01, -9.34443950e-02,
         ...
         1.18385439e-02,  1.05643446e-01, -1.21904516e-03,  7.70555378e-02])]
__call__(batch: List[List[str]], tags_batch: Optional[List[List[str]]] = None, mean: Optional[bool] = None, *args, **kwargs) → List[Union[list, numpy.ndarray]][source]

Infer on the given data.

- Parameters
  batch – tokenized text samples
  tags_batch – optional batch of corresponding tags
  mean – whether to return mean token embedding (does not depend on self.mean)
  *args – additional arguments
  **kwargs – additional arguments
- Returns
  embedded text samples
class deeppavlov.models.embedders.transformers_embedder.TransformersBertEmbedder(load_path: Union[str, pathlib.Path], bert_config_path: Optional[Union[str, pathlib.Path]] = None, truncate: bool = False, **kwargs)[source]

Transformers-based BERT model for embedding tokens, subtokens and sentences.

- Parameters
  load_path – path to a pretrained BERT pytorch checkpoint
  bert_config_path – path to a BERT configuration file
  truncate – whether to remove zero-paddings from returned data
__call__(subtoken_ids_batch: Collection[Collection[int]], startofwords_batch: Collection[Collection[int]], attention_batch: Collection[Collection[int]]) → Tuple[Collection[Collection[Collection[float]]], Collection[Collection[Collection[float]]], Collection[Collection[float]], Collection[Collection[float]], Collection[Collection[float]]][source]

Predict embedding values for a given batch.

- Parameters
  subtoken_ids_batch – padded indexes for every subtoken
  startofwords_batch – a mask matrix with 1 for the first subtoken of every token and 0 for every other subtoken
  attention_batch – a mask matrix with 1 for every significant subtoken and 0 for paddings
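The two masks can be illustrated with a toy numpy example. The embeddings, tokens, and the 2-dimensional shape below are hypothetical (real BERT hidden states are typically 768-dimensional); the sketch only shows how startofwords selects word-level vectors and how attention excludes padding from a sentence-level average:

```python
import numpy as np

# Hypothetical subtoken embeddings for a sentence whose three real
# subtokens form two words ("play" + "##ing", "chess"), plus one padding.
subtoken_embs = np.array([[1., 1.],   # "play"  -> first subtoken of word 1
                          [2., 2.],   # "##ing" -> continuation subtoken
                          [3., 3.],   # "chess" -> first subtoken of word 2
                          [0., 0.]])  # padding
startofwords = np.array([1, 0, 1, 0])  # 1 marks the first subtoken of a word
attention    = np.array([1, 1, 1, 0])  # 1 marks significant subtokens

# Word-level embeddings: keep only first-subtoken rows.
token_embs = subtoken_embs[startofwords.astype(bool)]

# A sentence embedding as the mean over significant (non-padding) subtokens.
sent_emb = (subtoken_embs * attention[:, None]).sum(axis=0) / attention.sum()
```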