deeppavlov.models.tokenizers
class deeppavlov.models.tokenizers.lazy_tokenizer.LazyTokenizer(**kwargs)

    Tokenizes if there is something to tokenize.
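    A minimal usage sketch (hypothetical input; assumes the standard DeepPavlov batch-in, batch-out calling convention, where a component is called on a list of documents):

        from deeppavlov.models.tokenizers.lazy_tokenizer import LazyTokenizer

        tokenizer = LazyTokenizer()
        # A batch of raw strings gets tokenized; input with "nothing to
        # tokenize" (already-tokenized batches) is expected to pass through.
        print(tokenizer(['Hello, world!']))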
class deeppavlov.models.tokenizers.nltk_moses_tokenizer.NLTKMosesTokenizer(escape: bool = False, *args, **kwargs)

    Class for splitting texts into tokens using the NLTK wrapper over MosesTokenizer.

    Parameters: escape – whether to escape characters for use in HTML markup

    escape
        whether to escape characters for use in HTML markup

    tokenizer
        tokenizer instance from nltk.tokenize.moses

    detokenizer
        detokenizer instance from nltk.tokenize.moses
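    A usage sketch (hypothetical input; assumes the batch calling convention used by the other tokenizers in this module, where a batch of strings is tokenized and a batch of token lists is detokenized via the detokenizer attribute):

        from deeppavlov.models.tokenizers.nltk_moses_tokenizer import NLTKMosesTokenizer

        tokenizer = NLTKMosesTokenizer(escape=False)
        tokens = tokenizer(["Isn't this great?"])  # batch of strings -> batch of token lists
        restored = tokenizer(tokens)               # batch of token lists -> batch of strings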
class deeppavlov.models.tokenizers.nltk_tokenizer.NLTKTokenizer(tokenizer: str = 'wordpunct_tokenize', download: bool = False, *args, **kwargs)

    Class for splitting texts into tokens using NLTK.

    Parameters:
        - tokenizer – tokenization mode for nltk.tokenize
        - download – whether to download nltk data

    tokenizer
        tokenizer instance from nltk.tokenizers
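    A usage sketch (hypothetical input; the tokenizer argument names a tokenization mode from nltk.tokenize, such as the default wordpunct_tokenize):

        from deeppavlov.models.tokenizers.nltk_tokenizer import NLTKTokenizer

        tokenizer = NLTKTokenizer(tokenizer='wordpunct_tokenize', download=True)
        # wordpunct_tokenize separates alphabetic runs from punctuation,
        # e.g. "Don't" -> ['Don', "'", 't']
        print(tokenizer(["Don't split me."]))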
class deeppavlov.models.tokenizers.ru_sent_tokenizer.RuSentTokenizer(shortenings: Set[str] = {'внутр', 'прим', 'проц', 'corp', 'зав', 'жен', 'сек', 'гос', 'коп', 'муж', 'мн', 'inc', 'адм', 'эт', 'авт', 'ед', 'co', 'дол', 'обр', 'руб', 'яз', 'барр', 'долл', 'сан', 'лат', 'р', 'рус', 'о', 'зам', 'куб', 'мин', 'накл', 'ред', 'повел', 'русск', 'дифф', 'шутл', 'тыс', 'га', 'искл', 'букв', 'корп', 'обл'}, joining_shortenings: Set[str] = {'яп', 'д', 'итал', 'устар', 'им', 'dr', 'vs', 'пер', 'слав', 'араб', 'mrs', 'тел', 'евр', 'сокр', 'св', 'стр', 'греч', 'пл', 'кит', 'англ', 'ms', 'mr', 'г', 'ул', 'см', 'корп', 'рис'}, paired_shortenings: Set[Tuple[str, str]] = {('и', 'о'), ('т', 'п'), ('н', 'э'), ('т', 'е'), ('у', 'е')}, **kwargs)

    Rule-based sentence tokenizer for the Russian language: https://github.com/deepmipt/ru_sentence_tokenizer

    Parameters:
        - shortenings – set of known shortenings. Use the default value if working on news or fiction texts
        - joining_shortenings – set of shortenings after which a sentence split is not possible (e.g. “ул”). Use the default value if working on news or fiction texts
        - paired_shortenings – set of known paired shortenings (e.g. “т. е.”). Use the default value if working on news or fiction texts
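    A usage sketch (hypothetical input; assumes the component is called on a batch of documents and returns a batch of sentence lists):

        from deeppavlov.models.tokenizers.ru_sent_tokenizer import RuSentTokenizer

        st = RuSentTokenizer()
        # "ул." is in joining_shortenings, so no sentence break
        # is inserted after it.
        print(st(['Я живу на ул. Ленина. Это в центре города.']))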
class deeppavlov.models.tokenizers.split_tokenizer.SplitTokenizer(**kwargs)

    Generates utterance tokens with plain Python str.split(). Doesn't have any parameters.
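    A usage sketch (hypothetical input; per str.split() semantics, tokens are separated by runs of whitespace):

        from deeppavlov.models.tokenizers.split_tokenizer import SplitTokenizer

        tokenizer = SplitTokenizer()
        # Equivalent to [doc.split() for doc in batch]
        print(tokenizer(['a quick  test']))  # expected: [['a', 'quick', 'test']]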
class deeppavlov.models.tokenizers.spacy_tokenizer.StreamSpacyTokenizer(disable: Union[typing.List[str], NoneType] = None, stopwords: Union[typing.List[str], NoneType] = None, batch_size: Union[int, NoneType] = None, ngram_range: Union[typing.List[int], NoneType] = None, lemmas: bool = False, n_threads: Union[int, NoneType] = None, lowercase: Union[bool, NoneType] = None, alphas_only: Union[bool, NoneType] = None, spacy_model: str = 'en_core_web_sm', **kwargs)

    Tokenize or lemmatize a list of documents. The default spacy model is en_core_web_sm. Returns a list of tokens or lemmas for a whole document. If called on a batch of token lists (List[List[str]]), performs the detokenizing procedure.

    Parameters:
        - disable – spacy pipeline elements to disable, for performance reasons
        - stopwords – a list of stopwords that should be ignored during tokenizing/lemmatizing and ngrams creation
        - batch_size – a batch size for inner spacy multi-threading
        - ngram_range – size of ngrams to create; only unigrams are returned by default
        - lemmas – whether to perform lemmatizing or not
        - n_threads – a number of threads for inner spacy multi-threading
        - lowercase – whether to perform lowercasing or not; performed by default by the _tokenize() and _lemmatize() methods
        - alphas_only – whether to filter out non-alpha tokens; performed by default by the _filter() method
        - spacy_model – a string name of the spacy model to use; DeepPavlov searches for this name among downloaded spacy models; the default model, en_core_web_sm, is downloaded automatically during DeepPavlov installation
    stopwords
        a list of stopwords that should be ignored during tokenizing/lemmatizing and ngrams creation

    model
        a loaded spacy model

    batch_size
        a batch size for inner spacy multi-threading

    ngram_range
        size of ngrams to create; only unigrams are returned by default

    lemmas
        whether to perform lemmatizing or not

    n_threads
        a number of threads for inner spacy multi-threading

    lowercase
        whether to perform lowercasing or not; performed by default by the _tokenize() and _lemmatize() methods

    alphas_only
        whether to filter out non-alpha tokens; performed by default by the _filter() method
    __call__(batch: Union[typing.List[str], typing.List[typing.List[str]]]) → Union[typing.List[typing.List[str]], typing.List[str]]

        Tokenize or detokenize strings, depending on the type structure of the passed arguments.

        Parameters: batch – a batch of documents to tokenize/lemmatize, or a batch of lists of tokens/lemmas to detokenize
        Returns: a batch of lists of tokens/lemmas, or a batch of detokenized strings
        Raises: TypeError – if the first element of batch is neither List nor str
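    A round-trip sketch based on the __call__ signature above (hypothetical input; the en_core_web_sm model must be available):

        from deeppavlov.models.tokenizers.spacy_tokenizer import StreamSpacyTokenizer

        tokenizer = StreamSpacyTokenizer(lemmas=True, lowercase=True)
        docs = ['The quick brown foxes are jumping.']
        lemmas = tokenizer(docs)      # List[str] -> List[List[str]]: tokenize/lemmatize
        restored = tokenizer(lemmas)  # List[List[str]] -> List[str]: detokenize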
class deeppavlov.models.tokenizers.ru_tokenizer.RussianTokenizer(stopwords: Union[typing.List[str], NoneType] = None, ngram_range: List[int] = None, lemmas: bool = False, lowercase: Union[bool, NoneType] = None, alphas_only: Union[bool, NoneType] = None, **kwargs)

    Tokenize or lemmatize a list of documents for the Russian language. The default models are the ToktokTokenizer tokenizer and the pymorphy2 lemmatizer. Returns a list of tokens or lemmas for a whole document. If called on a batch of token lists (List[List[str]]), performs the detokenizing procedure.

    Parameters:
        - stopwords – a list of stopwords that should be ignored during tokenizing/lemmatizing and ngrams creation
        - ngram_range – size of ngrams to create; only unigrams are returned by default
        - lemmas – whether to perform lemmatizing or not
        - lowercase – whether to perform lowercasing or not; performed by default by the _tokenize() and _lemmatize() methods
        - alphas_only – whether to filter out non-alpha tokens; performed by default by the _filter() method
    stopwords
        a list of stopwords that should be ignored during tokenizing/lemmatizing and ngrams creation

    tokenizer
        an instance of the ToktokTokenizer tokenizer class

    lemmatizer
        an instance of the pymorphy2.MorphAnalyzer lemmatizer class

    ngram_range
        size of ngrams to create; only unigrams are returned by default

    lemmas
        whether to perform lemmatizing or not

    lowercase
        whether to perform lowercasing or not; performed by default by the _tokenize() and _lemmatize() methods

    alphas_only
        whether to filter out non-alpha tokens; performed by default by the _filter() method

    tok2morph
        token-to-lemma cache
    __call__(batch: Union[typing.List[str], typing.List[typing.List[str]]]) → Union[typing.List[typing.List[str]], typing.List[str]]

        Tokenize or detokenize strings, depending on the type structure of the passed arguments.

        Parameters: batch – a batch of documents to tokenize/lemmatize, or a batch of lists of tokens/lemmas to detokenize
        Returns: a batch of lists of tokens/lemmas, or a batch of detokenized strings
        Raises: TypeError – if the first element of batch is neither List nor str
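    A round-trip sketch based on the __call__ signature above (hypothetical input):

        from deeppavlov.models.tokenizers.ru_tokenizer import RussianTokenizer

        tokenizer = RussianTokenizer(lemmas=True, lowercase=True)
        docs = ['Мама мыла раму']
        lemmas = tokenizer(docs)      # List[str] -> List[List[str]]: tokenize/lemmatize
        restored = tokenizer(lemmas)  # List[List[str]] -> List[str]: detokenize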