deeppavlov.models.embedders
class deeppavlov.models.embedders.fasttext_embedder.FasttextEmbedder(load_path: Union[str, pathlib.Path], pad_zero: bool = False, mean: bool = False, **kwargs)[source]

Class implements the fastText embedding model.

- Parameters
  load_path – path to load the pre-trained embedding model from
  pad_zero – whether to pad samples with zeros
  mean – whether to return the mean of token embeddings
- Attributes
  model – fastText model instance
  tok2emb – dictionary with already embedded tokens
  dim – dimension of embeddings
  pad_zero – whether to pad the sequence of tokens with zeros
  load_path – path to the pre-trained fastText binary model
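The pad_zero and mean options control how a batch of per-token embeddings is post-processed. A minimal numpy sketch of these two behaviors (the helper name and tiny 2-dimensional "embeddings" are hypothetical, for illustration only; real fastText vectors are typically 300-dimensional):

```python
import numpy as np

def postprocess(batch, pad_zero=False, mean=False):
    """Post-process a batch of per-token embedding lists in the way the
    embedder's options suggest (hypothetical helper, illustration only)."""
    if mean:
        # One vector per sample: the average of its token embeddings.
        return [np.mean(sample, axis=0) for sample in batch]
    if pad_zero:
        # Pad every sample with zero vectors up to the longest sample.
        max_len = max(len(sample) for sample in batch)
        dim = len(batch[0][0])
        return [np.vstack([sample, np.zeros((max_len - len(sample), dim))])
                for sample in batch]
    return [np.array(sample) for sample in batch]

# Two samples of different lengths, embedding dimension 2.
batch = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
padded = postprocess(batch, pad_zero=True)  # padded[1] gains a zero row
means = postprocess(batch, mean=True)       # one 2-d vector per sample
```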
class deeppavlov.models.embedders.tfidf_weighted_embedder.TfidfWeightedEmbedder(embedder: deeppavlov.core.models.component.Component, tokenizer: Optional[deeppavlov.core.models.component.Component] = None, pad_zero: bool = False, mean: bool = False, tags_vocab_path: Optional[str] = None, vectorizer: Optional[deeppavlov.core.models.component.Component] = None, counter_vocab_path: Optional[str] = None, idf_base_count: int = 100, log_base: int = 10, min_idf_weight=0.0, **kwargs)[source]

The class embeds a sentence as a weighted average of its token embeddings. The weights can be taken from the TF-IDF vectorizer given in vectorizer, or calculated as TF-IDF from the counter vocabulary given in counter_vocab_path. One can also pass tags_vocab_path pointing to a vocabulary with tag weights; in that case a batch of tags should be given as the second input to the __call__ method.
- Parameters
  embedder – embedder instance
  tokenizer – tokenizer instance; should be able to detokenize a sentence
  pad_zero – whether to pad samples with zeros
  mean – whether to return mean token embedding
  tags_vocab_path – optional path to a vocabulary with tag weights
  vectorizer – vectorizer instance; should be trained with analyzer="word"
  counter_vocab_path – path to the counter vocabulary
  idf_base_count – minimal count value (tokens that occur fewer times are not counted)
  log_base – logarithm base for the TF-IDF coefficient calculation from the counter vocabulary
  min_idf_weight – minimal idf weight
- Attributes
  embedder – embedder instance
  tokenizer – tokenizer instance; should be able to detokenize a sentence
  dim – dimension of embeddings
  pad_zero – whether to pad samples with zeros
  mean – whether to return mean token embedding
  tags_vocab – vocabulary with weights for tags
  vectorizer – vectorizer instance
  counter_vocab_path – path to the counter vocabulary
  counter_vocab – counter vocabulary
  idf_base_count – minimal count value (tokens that occur fewer times are not counted)
  log_base – logarithm base for the TF-IDF coefficient calculation from the counter vocabulary
  min_idf_weight – minimal idf weight
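The counter-vocabulary path can be sketched in plain Python. The exact formula TfidfWeightedEmbedder uses may differ; the functions below are hypothetical and only illustrate the roles of idf_base_count, log_base, and min_idf_weight in turning raw token counts into weights for averaging:

```python
import math

def idf_weight(count, total_count, idf_base_count=100, log_base=10,
               min_idf_weight=0.0):
    """Sketch of a counter-based IDF-style weight: rarer tokens get
    larger weights (hypothetical, not DeepPavlov's exact formula)."""
    # Counts below idf_base_count are clamped so extremely rare
    # (or unseen) tokens do not receive runaway weights.
    count = max(count, idf_base_count)
    weight = math.log(total_count / count, log_base)
    return max(weight, min_idf_weight)

def weighted_average(tokens, embeddings, counts, total_count):
    """Weighted average of token embeddings using the weights above."""
    weights = [idf_weight(counts.get(t, 0), total_count) for t in tokens]
    total_w = sum(weights) or 1.0
    dim = len(embeddings[0])
    return [sum(w * e[i] for w, e in zip(weights, embeddings)) / total_w
            for i in range(dim)]
```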
Examples

>>> from deeppavlov.models.embedders.tfidf_weighted_embedder import TfidfWeightedEmbedder
>>> from deeppavlov.models.embedders.fasttext_embedder import FasttextEmbedder
>>> fasttext_embedder = FasttextEmbedder('/data/embeddings/wiki.ru.bin')
>>> fastTextTfidf = TfidfWeightedEmbedder(embedder=fasttext_embedder,
...     counter_vocab_path='/data/vocabs/counts_wiki_lenta.txt')
>>> fastTextTfidf([['большой', 'и', 'розовый', 'бегемот']])
[array([ 1.99135890e-01, -7.14746421e-02,  8.01428872e-02, -5.32840924e-02,
         5.05212297e-02,  2.76053832e-01, -2.53270134e-01, -9.34443950e-02,
         ...
         1.18385439e-02,  1.05643446e-01, -1.21904516e-03,  7.70555378e-02])]
__call__(batch: List[List[str]], tags_batch: Optional[List[List[str]]] = None, mean: Optional[bool] = None, *args, **kwargs) → List[Union[list, numpy.ndarray]][source]

Infer on the given data.

- Parameters
  batch – tokenized text samples
  tags_batch – optional batch of corresponding tags
  mean – whether to return mean token embedding (does not depend on self.mean)
  *args – additional arguments
  **kwargs – additional arguments
- Returns
  embedded text samples
class deeppavlov.models.embedders.transformers_embedder.TransformersBertEmbedder(load_path: Union[str, pathlib.Path], bert_config_path: Optional[Union[str, pathlib.Path]] = None, truncate: bool = False, **kwargs)[source]

Transformers-based BERT model for embedding tokens, subtokens and sentences.

- Parameters
  load_path – path to a pretrained BERT pytorch checkpoint
  bert_config_path – path to a BERT configuration file
  truncate – whether to remove zero-paddings from returned data
__call__(subtoken_ids_batch: Collection[Collection[int]], startofwords_batch: Collection[Collection[int]], attention_batch: Collection[Collection[int]]) → Tuple[Collection[Collection[Collection[float]]], Collection[Collection[Collection[float]]], Collection[Collection[float]], Collection[Collection[float]], Collection[Collection[float]]][source]

Predict embedding values for a given batch.

- Parameters
  subtoken_ids_batch – padded indexes for every subtoken
  startofwords_batch – a mask matrix with 1 for the first subtoken of every token and 0 for every other subtoken
  attention_batch – a mask matrix with 1 for every significant subtoken and 0 for paddings
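The two masks can be illustrated with a toy numpy example. The embeddings, tokens, and the 2-dimensional shape below are hypothetical (real BERT hidden states are typically 768-dimensional); the sketch only shows how startofwords selects word-level vectors and how attention excludes padding from a sentence-level average:

```python
import numpy as np

# Hypothetical subtoken embeddings for a sentence whose three real
# subtokens form two words ("play" + "##ing", "chess"), plus one padding.
subtoken_embs = np.array([[1., 1.],   # "play"  -> first subtoken of word 1
                          [2., 2.],   # "##ing" -> continuation subtoken
                          [3., 3.],   # "chess" -> first subtoken of word 2
                          [0., 0.]])  # padding
startofwords = np.array([1, 0, 1, 0])  # 1 marks the first subtoken of a word
attention    = np.array([1, 1, 1, 0])  # 1 marks significant subtokens

# Word-level embeddings: keep only first-subtoken rows.
token_embs = subtoken_embs[startofwords.astype(bool)]

# A sentence embedding as the mean over significant (non-padding) subtokens.
sent_emb = (subtoken_embs * attention[:, None]).sum(axis=0) / attention.sum()
```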