dataset_iterators

Concrete DatasetIterator classes.
class deeppavlov.dataset_iterators.basic_classification_iterator.BasicClassificationDatasetIterator(data: dict, fields_to_merge: List[str] = None, merged_field: str = None, field_to_split: str = None, split_fields: List[str] = None, split_proportions: List[float] = None, seed: int = None, shuffle: bool = True, *args, **kwargs)

    Gets a data dictionary from a DatasetReader instance, merges fields if necessary and splits a field if necessary.

    Parameters:
    - data – dictionary of data with fields "train", "valid" and "test" (or some of them)
    - fields_to_merge – list of fields (out of "train", "valid", "test") to merge
    - merged_field – name of the field (out of "train", "valid", "test") in which to save the merged fields
    - field_to_split – name of the field (out of "train", "valid", "test") to split
    - split_fields – list of fields (out of "train", "valid", "test") in which to save the split field
    - split_proportions – list of corresponding proportions for splitting
    - seed – random seed
    - shuffle – whether to shuffle examples in batches
    - *args – arguments
    - **kwargs – arguments

    Attributes:
    - data – dictionary of data with fields "train", "valid" and "test" (or some of them)
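The merge/split behaviour described above can be sketched with the standard library alone. The helper below is a hypothetical re-implementation of the field_to_split / split_fields / split_proportions logic, not DeepPavlov code; the exact rounding of split sizes is an assumption:

```python
import random

def split_field(samples, proportions, seed=None):
    """Shuffle samples and split them into len(proportions) parts,
    mirroring the field_to_split / split_fields / split_proportions idea."""
    rng = random.Random(seed)
    samples = samples[:]
    rng.shuffle(samples)
    splits, start = [], 0
    for p in proportions:
        size = int(len(samples) * p)  # size rounding is an assumption
        splits.append(samples[start:start + size])
        start += size
    return splits

data = {"train": [("text %d" % i, "label") for i in range(10)], "valid": [], "test": []}
# merge "train" and "valid", then split the result 80/20 into new train/valid parts
merged = data["train"] + data["valid"]
train_part, valid_part = split_field(merged, [0.8, 0.2], seed=42)
```

This mirrors a typical configuration where fields_to_merge joins subsets before field_to_split re-partitions them by split_proportions.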
class deeppavlov.dataset_iterators.dialog_iterator.DialogDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)

    Iterates over dialog data and generates batches where one sample is one dialog. A subclass of DataLearningIterator.

    Attributes:
    - train – list of training dialogs (tuples (context, response))
    - valid – list of validation dialogs (tuples (context, response))
    - test – list of dialogs used for testing (tuples (context, response))
class deeppavlov.dataset_iterators.dialog_iterator.DialogDBResultDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)

    Iterates over dialog data and outputs a list of all 'db_result' fields (if present). The class helps to build a list of all 'db_result' values present in a dataset. Inherits key methods and attributes from DataLearningIterator.

    Attributes:
    - train – list of tuples (db_result dictionary, '') from "train" data
    - valid – list of tuples (db_result dictionary, '') from "valid" data
    - test – list of tuples (db_result dictionary, '') from "test" data
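The extraction of 'db_result' values can be sketched as follows. This is a minimal illustrative re-implementation assuming dialogs are lists of turn dictionaries; the turn layout is an assumption, not DeepPavlov's internal representation:

```python
def collect_db_results(dialogs):
    """Collect every 'db_result' field found in dialog turns, paired with
    an empty string, as the iterator's attributes describe."""
    results = []
    for dialog in dialogs:
        for turn in dialog:  # each turn assumed to be a dict (illustrative)
            db_result = turn.get("db_result")
            if db_result is not None:
                results.append((db_result, ""))
    return results

dialogs = [
    [{"text": "hi"}, {"text": "found it", "db_result": {"name": "Fitzbillies"}}],
    [{"text": "bye"}],
]
pairs = collect_db_results(dialogs)
```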
class deeppavlov.dataset_iterators.dstc2_intents_iterator.Dstc2IntentsDatasetIterator(data: dict, fields_to_merge: List[str] = None, merged_field: str = None, field_to_split: str = None, split_fields: List[str] = None, split_proportions: List[float] = None, seed: int = None, shuffle: bool = True, *args, **kwargs)

    Gets a data dictionary from a DSTC2DatasetReader instance, constructs intents from acts and slots, merges fields if necessary and splits a field if necessary.

    Parameters:
    - data – dictionary of data with fields "train", "valid" and "test" (or some of them)
    - fields_to_merge – list of fields (out of "train", "valid", "test") to merge
    - merged_field – name of the field (out of "train", "valid", "test") in which to save the merged fields
    - field_to_split – name of the field (out of "train", "valid", "test") to split
    - split_fields – list of fields (out of "train", "valid", "test") in which to save the split field
    - split_proportions – list of corresponding proportions for splitting
    - seed – random seed
    - shuffle – whether to shuffle examples in batches
    - *args – arguments
    - **kwargs – arguments

    Attributes:
    - data – dictionary of data with fields "train", "valid" and "test" (or some of them)
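"Constructs intents from acts and slots" can be sketched roughly as combining each dialog act with its slot names into composite labels. The exact label scheme (separator, handling of slot-free acts) is an assumption for illustration, not the documented DSTC2 format:

```python
def build_intents(acts):
    """Combine each dialog act with its slot names into composite intent
    labels, e.g. act 'inform' + slot 'food' -> 'inform_food' (scheme assumed)."""
    intents = []
    for act, slots in acts:
        if slots:
            for slot_name, _slot_value in slots:
                intents.append("{}_{}".format(act, slot_name))
        else:
            intents.append(act)  # acts without slots keep the bare act name
    return intents

intents = build_intents([("inform", [("food", "italian")]), ("bye", [])])
```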
class deeppavlov.dataset_iterators.dstc2_ner_iterator.Dstc2NerDatasetIterator

    Iterates over data for the DSTC2 NER task. The dataset is a dict with fields 'train', 'test' and 'valid', each of which stores a list of samples (pairs x, y).

    Parameters:
    - data – list of (x, y) pairs, samples from the dataset; both x and y can be tuples of different input features
    - dataset_path – path to the dataset
    - seed – value for the random seed
    - shuffle – whether to shuffle the data
class deeppavlov.dataset_iterators.kvret_dialog_iterator.KvretDialogDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)

    Inputs data from DSTC2DatasetReader, constructs a dialog history for each turn and generates batches (one sample is one turn). Inherits key methods and attributes from DataLearningIterator.

    Attributes:
    - train – list of "train" (context, response) tuples
    - valid – list of "valid" (context, response) tuples
    - test – list of "test" (context, response) tuples
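Constructing a dialog history per turn can be sketched like this. The sketch assumes a dialog is a list of (context, response) tuples and represents history as a flat utterance list; the real iterator's history encoding may differ:

```python
def turns_with_history(dialog):
    """For each (context, response) turn, prepend the utterances of all
    previous turns, so that one sample is a turn with its full history."""
    samples, history = [], []
    for context, response in dialog:
        samples.append((history + [context], response))
        history.extend([context, response])  # grow the history turn by turn
    return samples

dialog = [("hi", "hello, how can I help?"), ("find pizza", "ok, searching")]
samples = turns_with_history(dialog)
```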
deeppavlov.dataset_iterators.morphotagger_iterator.preprocess_data(data: List[Tuple[List[str], List[str]]], to_lower: bool = True, append_case: str = 'first') → List[Tuple[List[Tuple[str]], List[str]]]

    Processes all words in data using process_word().

    Parameters:
    - data – a list of pairs (words, tags), each pair corresponding to a single sentence
    - to_lower – whether to lowercase words
    - append_case – whether to add a case mark

    Returns: a list of preprocessed sentences
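The per-word processing can be sketched as lowercasing plus an explicit case mark. The mark tokens ('<FIRST_UPPER>', '<ALL_UPPER>') and their placement are illustrative assumptions, not the exact symbols process_word() emits:

```python
def process_word(word, to_lower=True, append_case="first"):
    """Lowercase a word and attach a case mark: '<ALL_UPPER>' for all-caps
    words, '<FIRST_UPPER>' for capitalized ones (mark names are assumed)."""
    if word.isupper() and len(word) > 1:
        mark = "<ALL_UPPER>"
    elif word and word[0].isupper():
        mark = "<FIRST_UPPER>"
    else:
        mark = None
    processed = word.lower() if to_lower else word
    if mark is None:
        return (processed,)
    # append_case='first' puts the mark before the word, otherwise after
    return (mark, processed) if append_case == "first" else (processed, mark)

out = process_word("Moscow")
```

Keeping case as a separate mark lets the tagger share embeddings between "Moscow" and "moscow" while still seeing capitalization, which matters for proper-noun tags.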
class deeppavlov.dataset_iterators.morphotagger_iterator.MorphoTaggerDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, validation_split: float = 0.2)

    Iterates over data for morphological tagging. A subclass of DataLearningIterator.

    Parameters:
    - seed – random seed for data shuffling
    - shuffle – whether to shuffle data during batching
    - validation_split – the fraction of validation data (used only if there is no valid subset in data)
class deeppavlov.dataset_iterators.sqlite_iterator.SQLiteDataIterator(data_dir: str = '', data_url: str = 'http://files.deeppavlov.ai/datasets/wikipedia/enwiki.db', batch_size: int = None, shuffle: bool = None, seed: int = None, **kwargs)

    Iterates over an SQLite database: generates batches from SQLite data and provides access to document ids and document contents.

    Parameters:
    - data_dir – a directory where the downloaded DB is saved
    - data_url – a URL from which to download the DB
    - batch_size – the number of samples in a single batch
    - shuffle – whether to shuffle data during batching
    - seed – random seed for data shuffling

    Attributes:
    - connect – a DB connection
    - db_name – a DB name
    - doc_ids – DB document ids
    - doc2index – a dictionary of document indices and their titles
    - batch_size – the number of samples in a single batch
    - shuffle – whether to shuffle data during batching
    - random – an instance of the Random class
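The id/content lookups this iterator provides can be sketched with Python's built-in sqlite3 module. The table and column names below ("documents", "id", "text") are assumptions for illustration; the real enwiki.db schema may differ:

```python
import sqlite3

# Build a tiny in-memory documents table with the shape the sketch assumes:
# an id column and a text column (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, text TEXT)")
conn.executemany("INSERT INTO documents VALUES (?, ?)",
                 [("doc1", "first article"), ("doc2", "second article")])

def get_doc_ids(conn):
    """Return all document ids, analogous to the doc_ids attribute."""
    return [row[0] for row in conn.execute("SELECT id FROM documents ORDER BY id")]

def get_doc_content(conn, doc_id):
    """Fetch one document's text by id, using parameter substitution."""
    row = conn.execute("SELECT text FROM documents WHERE id = ?", (doc_id,)).fetchone()
    return row[0] if row else None

ids = get_doc_ids(conn)
content = get_doc_content(conn, "doc2")
```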
class deeppavlov.dataset_iterators.typos_iterator.TyposDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)

    Implementation of DataLearningIterator used for training ErrorModel.
class deeppavlov.dataset_iterators.ranking_iterator.RankingIterator(data: Dict[str, List], sample_candidates_pool: bool = False, sample_candidates_pool_valid: bool = True, sample_candidates_pool_test: bool = True, num_negative_samples: int = 10, num_ranking_samples_valid: int = 10, num_ranking_samples_test: int = 10, seed: int = None, shuffle: bool = False, len_vocab: int = 0, pos_pool_sample: bool = False, pos_pool_rank: bool = True, random_batches: bool = False, batches_per_epoch: int = None, triplet_mode: bool = True, hard_triplets_sampling: bool = False, num_positive_samples: int = 5)

    The class contains methods for iterating over a dataset for ranking in training, validation and test mode.

    Note: each sample in data['train'] is arranged as follows: {'context': 21507, 'response': 7009, 'pos_pool': [7009, 7010], 'neg_pool': None}. The context is stored under the 'context' key of the sample and is represented by a single integer. The correct response is stored under the 'response' key; its value is also always a single integer. The list of possible correct responses (there may be several) is stored under the 'pos_pool' key; the value of 'response' should be equal to one item of that list. The list of possible negative responses (there can be many of them, 100–10000) is stored under the 'neg_pool' key; its value is None when global sampling is used, or a list of fixed length when sampling from predefined negative responses is used. It is important that the values in 'pos_pool' and 'neg_pool' do not overlap. Single items in 'context', 'response', 'pos_pool' and 'neg_pool' are represented by single integers that map to lists of integers via some integer-to-list-of-integers dictionary; these lists of integers are converted to lists of tokens with some integer-to-token dictionary. Samples in data['valid'] and data['test'] have almost the same representation as the train sample shown above.

    Parameters:
    - data – a dictionary containing training, validation and test parts of the dataset, obtainable via the train, valid and test keys
    - sample_candidates_pool – whether to sample candidates from a predefined pool of candidates for each sample in training mode; if False, negative sampling from the whole data will be performed
    - sample_candidates_pool_valid – whether to validate a model on a predefined pool of candidates for each sample; if False, sampling from the whole data will be performed for validation
    - sample_candidates_pool_test – whether to test a model on a predefined pool of candidates for each sample; if False, sampling from the whole data will be performed for testing
    - num_negative_samples – the size of a predefined pool of candidates, or of a data subsample from the whole data, in training mode
    - num_ranking_samples_valid – the size of a predefined pool of candidates, or of a data subsample from the whole data, in validation mode
    - num_ranking_samples_test – the size of a predefined pool of candidates, or of a data subsample from the whole data, in test mode
    - seed – random seed
    - shuffle – whether to shuffle data
    - len_vocab – the length of the vocabulary used for sampling in training, validation and test mode
    - pos_pool_sample – whether to sample the response from pos_pool each time a batch is generated; if False, the value stored under the 'response' key will be used
    - pos_pool_rank – whether to count samples from the whole pos_pool as correct answers in validation / test mode
    - random_batches – whether to choose batches randomly or iterate over data sequentially in training mode
    - batches_per_epoch – the number of batches to choose per epoch in training mode; only required if random_batches is set to True
    - triplet_mode – whether to use a model with triplet loss; if False, a model with cross-entropy loss will be used
    - hard_triplets_sampling – whether to use the hard-triplets method of sampling in training mode
    - num_positive_samples – the number of contexts to choose from pos_pool for each context; only required if hard_triplets_sampling is set to True
    gen_batches(batch_size: int, data_type: str = 'train', shuffle: bool = True) → Tuple[List[List[Tuple[int, int]]], List[int]]

        Generates batches of inputs and expected outputs to train neural networks.

        Parameters:
        - batch_size – the number of samples in a batch
        - data_type – can be either 'train', 'test' or 'valid'
        - shuffle – whether to shuffle the dataset before batching

        Returns: a tuple of a batch of inputs and a batch of expected outputs. Inputs and expected outputs have a different structure and meaning depending on the class attribute values and data_type.
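The global-negative-sampling case described in the note above (neg_pool is None, candidates drawn from the whole vocabulary while excluding pos_pool) can be sketched as follows. This is an illustrative re-implementation; the integer ids in the example sample are arbitrary:

```python
import random

def sample_negatives(sample, num_negative_samples, len_vocab, seed=None):
    """When neg_pool is None, draw response ids from the whole vocabulary,
    excluding the correct responses listed in pos_pool; otherwise use the
    predefined neg_pool."""
    rng = random.Random(seed)
    if sample["neg_pool"] is not None:
        pool = sample["neg_pool"]
    else:
        positives = set(sample["pos_pool"])
        pool = [i for i in range(len_vocab) if i not in positives]
    return rng.sample(pool, num_negative_samples)

# toy sample in the documented layout; len_vocab is kept tiny for illustration
sample = {"context": 21507, "response": 7009, "pos_pool": [7009, 7010], "neg_pool": None}
negatives = sample_negatives(sample, num_negative_samples=3, len_vocab=20, seed=0)
```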
class deeppavlov.dataset_iterators.squad_iterator.SquadIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)

    SquadIterator allows iterating over examples in SQuAD-like datasets and is used to train SquadModel. It extracts context, question, answer_text and answer_start position from the dataset. An example from a dataset is a tuple of (context, question) and (answer_text, answer_start).

    Attributes:
    - train – train examples
    - valid – validation examples
    - test – test examples
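The extraction of ((context, question), (answer_text, answer_start)) pairs can be sketched against the standard SQuAD JSON layout (data → paragraphs → qas → answers). This is an illustrative sketch, not SquadIterator's actual code:

```python
def squad_examples(dataset):
    """Flatten a SQuAD-like dict into ((context, question),
    (answer_text, answer_start)) pairs."""
    examples = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    examples.append(((context, qa["question"]),
                                     (answer["text"], answer["answer_start"])))
    return examples

dataset = {"data": [{"paragraphs": [{
    "context": "Paris is the capital of France.",
    "qas": [{"question": "What is the capital of France?",
             "answers": [{"text": "Paris", "answer_start": 0}]}],
}]}]}
examples = squad_examples(dataset)
```

answer_start is a character offset into context, so context[answer_start:answer_start + len(answer_text)] should recover answer_text.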