deeppavlov.models.preprocessors¶
- class deeppavlov.models.preprocessors.dirty_comments_preprocessor.DirtyCommentsPreprocessor(remove_punctuation: bool = True, *args, **kwargs)[source]¶
Class that implements preprocessing of English texts with a low level of literacy, such as comments.
- class deeppavlov.models.preprocessors.mask.Mask(*args, **kwargs)[source]¶
Takes a batch of tokens and returns the masks of corresponding length
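The behaviour described above can be sketched in pure Python (this is an illustrative re-implementation, not the actual DeepPavlov code; the helper name `make_masks` is hypothetical): each sample yields a mask of ones matching its token count, zero-padded to the longest sample in the batch.

```python
# Hypothetical sketch of the Mask behaviour: ones for real tokens,
# zeros for padding positions up to the batch-wide maximum length.
def make_masks(batch_tokens):
    max_len = max(len(tokens) for tokens in batch_tokens)
    return [[1.0] * len(tokens) + [0.0] * (max_len - len(tokens))
            for tokens in batch_tokens]

masks = make_masks([["a", "b", "c"], ["d"]])
```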
- class deeppavlov.models.preprocessors.one_hotter.OneHotter(depth: int, pad_zeros: bool = False, single_vector=False, *args, **kwargs)[source]¶
One-hot featurizer with zero-padding. If single_vector is set, returns a single vector per sample, which can have several elements equal to 1.
- Parameters
depth – the depth for one-hotting
pad_zeros – whether to pad elements of batch with zeros
single_vector – whether to return one vector for the sample (sum of each one-hotted vectors)
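The parameters above can be illustrated with a small pure-Python sketch (the real component operates on numpy batches; `one_hot_sample` is an illustrative name, not part of the DeepPavlov API):

```python
# Hypothetical sketch of one-hot featurization for a single sample.
def one_hot_sample(indices, depth, single_vector=False):
    if single_vector:
        # Sum of the one-hot vectors: one vector per sample,
        # possibly with several elements equal to 1.
        vec = [0] * depth
        for i in indices:
            vec[i] = 1
        return vec
    # One one-hot vector of length `depth` per token index.
    return [[1 if j == i else 0 for j in range(depth)] for i in indices]
```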
- class deeppavlov.models.preprocessors.sanitizer.Sanitizer(diacritical: bool = True, nums: bool = False, *args, **kwargs)[source]¶
Remove all combining characters like diacritical marks from tokens
- Parameters
diacritical – whether to remove diacritical signs or not; diacritical signs are combining marks such as circumflexes and stress (accent) marks
nums – whether to replace all digits with 1 or not
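A minimal sketch of this sanitizing logic, using the standard-library unicodedata module (an illustrative re-implementation under the assumptions stated in the parameters above, not the actual DeepPavlov code):

```python
import unicodedata

# Hypothetical sketch: decompose to NFD and drop combining characters
# (diacritical marks); optionally replace every digit with "1".
def sanitize(token, diacritical=True, nums=False):
    if diacritical:
        token = "".join(ch for ch in unicodedata.normalize("NFD", token)
                        if not unicodedata.combining(ch))
    if nums:
        token = "".join("1" if ch.isdigit() else ch for ch in token)
    return token
```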
- deeppavlov.models.preprocessors.str_lower.str_lower(batch: Union[str, list, tuple])[source]¶
Recursively search for strings in a list and convert them to lowercase
- Parameters
batch – a string or a list containing strings at some level of nesting
- Returns
the same structure where all strings are converted to lowercase
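The recursive lowercasing described above can be sketched as follows (an illustrative re-implementation, not the library source; note this sketch returns lists for any nested sequence):

```python
# Hypothetical sketch: lowercase a string, or recurse into any
# list/tuple and lowercase every string found inside it.
def str_lower(batch):
    if isinstance(batch, str):
        return batch.lower()
    return [str_lower(item) for item in batch]
```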
- class deeppavlov.models.preprocessors.str_token_reverser.StrTokenReverser(tokenized: bool = False, *args, **kwargs)[source]¶
Component for converting strings to strings with reversed token positions
- Parameters
tokenized – whether the input strings are already tokenized (lists of tokens); only needed to reverse tokenized strings.
- __call__(batch: Union[str, list, tuple]) Union[List[str], List[Union[List[str], List[StrTokenReverserInfo]]]] [source]¶
Recursively search for strings in a list and convert them to strings with reversed token positions
- Parameters
batch – a string or a list containing strings
- Returns
the same structure where the token order of every string is reversed
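The core token-reversal step can be sketched for a single input (an illustrative simplification of the component's behaviour; `reverse_tokens` is a hypothetical helper name):

```python
# Hypothetical sketch: reverse the token order of one string.
# If `tokenized` is True the input is already a list of tokens;
# otherwise split on whitespace and rejoin after reversing.
def reverse_tokens(sample, tokenized=False):
    tokens = sample if tokenized else sample.split()
    reversed_tokens = list(reversed(tokens))
    return reversed_tokens if tokenized else " ".join(reversed_tokens)
```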
- class deeppavlov.models.preprocessors.str_utf8_encoder.StrUTF8Encoder(max_word_length: int = 50, pad_special_char_use: bool = False, word_boundary_special_char_use: bool = False, sentence_boundary_special_char_use: bool = False, reversed_sentense_tokens: bool = False, bos: str = '<S>', eos: str = '</S>', **kwargs)[source]¶
Component for encoding all strings to UTF-8 codes
- Parameters
max_word_length – Max length of words of input and output batches.
pad_special_char_use – Whether to use special char for padding or not.
word_boundary_special_char_use – Whether to add word boundaries by special chars or not.
sentence_boundary_special_char_use – Whether to add sentence boundaries by special chars or not.
reversed_sentense_tokens – Whether to use reversed sequences of tokens or not.
bos – Name of the special token marking the beginning of a sentence.
eos – Name of the special token marking the end of a sentence.
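The word-level encoding can be sketched in pure Python (an illustrative re-implementation; the helper name `encode_word` and the padding code value are assumptions, not the component's actual constants):

```python
# Hypothetical sketch: encode a word to its UTF-8 byte codes,
# truncated to max_word_length and optionally padded with a
# special padding code (260 here is an assumed, out-of-byte-range value).
def encode_word(word, max_word_length=50,
                pad_special_char_use=False, pad_code=260):
    codes = list(word.encode("utf-8"))[:max_word_length]
    if pad_special_char_use:
        codes += [pad_code] * (max_word_length - len(codes))
    return codes
```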
- class deeppavlov.models.preprocessors.odqa_preprocessors.DocumentChunker(sentencize_fn: Callable = nltk.sent_tokenize, keep_sentences: bool = True, tokens_limit: int = 400, flatten_result: bool = False, paragraphs: bool = False, number_of_paragraphs: int = -1, *args, **kwargs)[source]¶
Make chunks from a document or a list of documents. Can keep sentences intact (not split between chunks) if needed.
- Parameters
sentencize_fn – a function for sentence segmentation
keep_sentences – whether to keep sentences intact, i.e. avoid splitting them between chunks
tokens_limit – the maximum number of tokens in a single chunk (usually this number corresponds to the SQuAD model limit)
flatten_result – whether to flatten the resulting list of lists of chunks
paragraphs – whether to split the document by paragraphs; if set to True, tokens_limit is ignored
- keep_sentences¶
whether to keep sentences intact, i.e. avoid splitting them between chunks
- tokens_limit¶
the maximum number of tokens in a single chunk
- flatten_result¶
whether to flatten the resulting list of lists of chunks
- paragraphs¶
whether to split the document by paragraphs; if set to True, tokens_limit is ignored
- __call__(batch_docs: List[Union[str, List[str]]], batch_docs_ids: Optional[List[Union[str, List[str]]]] = None) Union[Tuple[Union[List[str], List[List[str]]], Union[List[str], List[List[str]]]], List[str], List[List[str]]] [source]¶
Make chunks from a batch of documents. There can be several documents in each batch.
- Parameters
batch_docs – a batch of documents / a batch of lists of documents
batch_docs_ids (optional) – a batch of document ids / a batch of lists of document ids
- Returns
chunks of docs (flattened or not) and, if batch_docs_ids was passed, chunks of doc ids (flattened or not)
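The sentence-preserving chunking described above can be sketched for a single document (an illustrative simplification with keep_sentences assumed True; `chunk_document` is a hypothetical helper name, and any sentence-segmentation function can stand in for nltk.sent_tokenize):

```python
# Hypothetical sketch: greedily pack whole sentences into chunks,
# starting a new chunk once adding the next sentence would exceed
# tokens_limit (tokens counted by whitespace splitting here).
def chunk_document(doc, sentencize_fn, tokens_limit=400):
    chunks, current, count = [], [], 0
    for sent in sentencize_fn(doc):
        n = len(sent.split())
        if current and count + n > tokens_limit:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```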