dataset_iterators

Concrete DatasetIterator classes.
class deeppavlov.dataset_iterators.basic_classification_iterator.BasicClassificationDatasetIterator(data: dict, fields_to_merge: List[str] = None, merged_field: str = None, field_to_split: str = None, split_fields: List[str] = None, split_proportions: List[float] = None, seed: int = None, shuffle: bool = True, *args, **kwargs)

    Gets a data dictionary from a DatasetReader instance, merges fields if necessary and splits a field if necessary.

    Parameters:
    - data – dictionary of data with fields "train", "valid" and "test" (or some of them)
    - fields_to_merge – list of fields (out of "train", "valid", "test") to merge
    - merged_field – name of the field (out of "train", "valid", "test") in which to save the merged fields
    - field_to_split – name of the field (out of "train", "valid", "test") to split
    - split_fields – list of fields (out of "train", "valid", "test") in which to save the split field
    - split_proportions – list of corresponding proportions for splitting
    - seed – random seed
    - shuffle – whether to shuffle examples in batches
    - *args – arguments
    - **kwargs – arguments

    Attributes:
    - data – dictionary of data with fields "train", "valid" and "test" (or some of them)
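The merge/split behaviour described above can be sketched with the standard library alone. The helper below is a hypothetical re-implementation of the field_to_split / split_fields / split_proportions logic, not DeepPavlov code; the exact rounding of split sizes is an assumption:

```python
import random

def split_field(samples, proportions, seed=None):
    """Shuffle samples and split them into len(proportions) parts,
    mirroring the field_to_split / split_fields / split_proportions idea."""
    rng = random.Random(seed)
    samples = samples[:]
    rng.shuffle(samples)
    splits, start = [], 0
    for p in proportions:
        size = int(len(samples) * p)  # size rounding is an assumption
        splits.append(samples[start:start + size])
        start += size
    return splits

data = {"train": [("text %d" % i, "label") for i in range(10)], "valid": [], "test": []}
# merge "train" and "valid", then split the result 80/20 into new train/valid parts
merged = data["train"] + data["valid"]
train_part, valid_part = split_field(merged, [0.8, 0.2], seed=42)
```

This mirrors a typical configuration where fields_to_merge joins subsets before field_to_split re-partitions them by split_proportions.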
class deeppavlov.dataset_iterators.dialog_iterator.DialogDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)

    Iterates over dialog data and generates batches where one sample is one dialog. A subclass of DataLearningIterator.

    Attributes:
    - train – list of training dialogs (tuples (context, response))
    - valid – list of validation dialogs (tuples (context, response))
    - test – list of dialogs used for testing (tuples (context, response))
class deeppavlov.dataset_iterators.dialog_iterator.DialogDBResultDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)

    Iterates over dialog data and outputs a list of all 'db_result' fields (if present). The class helps to build a list of all 'db_result' values present in a dataset. Inherits key methods and attributes from DataLearningIterator.

    Attributes:
    - train – list of tuples (db_result dictionary, '') from "train" data
    - valid – list of tuples (db_result dictionary, '') from "valid" data
    - test – list of tuples (db_result dictionary, '') from "test" data
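The extraction of 'db_result' values can be sketched as follows. This is a minimal illustrative re-implementation assuming dialogs are lists of turn dictionaries; the turn layout is an assumption, not DeepPavlov's internal representation:

```python
def collect_db_results(dialogs):
    """Collect every 'db_result' field found in dialog turns, paired with
    an empty string, as the iterator's attributes describe."""
    results = []
    for dialog in dialogs:
        for turn in dialog:  # each turn assumed to be a dict (illustrative)
            db_result = turn.get("db_result")
            if db_result is not None:
                results.append((db_result, ""))
    return results

dialogs = [
    [{"text": "hi"}, {"text": "found it", "db_result": {"name": "Fitzbillies"}}],
    [{"text": "bye"}],
]
pairs = collect_db_results(dialogs)
```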
class deeppavlov.dataset_iterators.dstc2_intents_iterator.Dstc2IntentsDatasetIterator(data: dict, fields_to_merge: List[str] = None, merged_field: str = None, field_to_split: str = None, split_fields: List[str] = None, split_proportions: List[float] = None, seed: int = None, shuffle: bool = True, *args, **kwargs)

    Gets a data dictionary from a DSTC2DatasetReader instance, constructs intents from acts and slots, merges fields if necessary and splits a field if necessary.

    Parameters:
    - data – dictionary of data with fields "train", "valid" and "test" (or some of them)
    - fields_to_merge – list of fields (out of "train", "valid", "test") to merge
    - merged_field – name of the field (out of "train", "valid", "test") in which to save the merged fields
    - field_to_split – name of the field (out of "train", "valid", "test") to split
    - split_fields – list of fields (out of "train", "valid", "test") in which to save the split field
    - split_proportions – list of corresponding proportions for splitting
    - seed – random seed
    - shuffle – whether to shuffle examples in batches
    - *args – arguments
    - **kwargs – arguments

    Attributes:
    - data – dictionary of data with fields "train", "valid" and "test" (or some of them)
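"Constructs intents from acts and slots" can be sketched roughly as combining each dialog act with its slot names into composite labels. The exact label scheme (separator, handling of slot-free acts) is an assumption for illustration, not the documented DSTC2 format:

```python
def build_intents(acts):
    """Combine each dialog act with its slot names into composite intent
    labels, e.g. act 'inform' + slot 'food' -> 'inform_food' (scheme assumed)."""
    intents = []
    for act, slots in acts:
        if slots:
            for slot_name, _slot_value in slots:
                intents.append("{}_{}".format(act, slot_name))
        else:
            intents.append(act)  # acts without slots keep the bare act name
    return intents

intents = build_intents([("inform", [("food", "italian")]), ("bye", [])])
```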
class deeppavlov.dataset_iterators.dstc2_ner_iterator.Dstc2NerDatasetIterator

    Iterates over data for the DSTC2 NER task. The dataset is a dict with fields 'train', 'test' and 'valid', each of which stores a list of samples (pairs x, y).

    Parameters:
    - data – list of (x, y) pairs, samples from the dataset; both x and y can be tuples of different input features
    - dataset_path – path to the dataset
    - seed – value for the random seed
    - shuffle – whether to shuffle the data
class deeppavlov.dataset_iterators.kvret_dialog_iterator.KvretDialogDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)

    Inputs data from DSTC2DatasetReader, constructs a dialog history for each turn and generates batches (one sample is one turn). Inherits key methods and attributes from DataLearningIterator.

    Attributes:
    - train – list of "train" (context, response) tuples
    - valid – list of "valid" (context, response) tuples
    - test – list of "test" (context, response) tuples
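Constructing a dialog history per turn can be sketched like this. The sketch assumes a dialog is a list of (context, response) tuples and represents history as a flat utterance list; the real iterator's history encoding may differ:

```python
def turns_with_history(dialog):
    """For each (context, response) turn, prepend the utterances of all
    previous turns, so that one sample is a turn with its full history."""
    samples, history = [], []
    for context, response in dialog:
        samples.append((history + [context], response))
        history.extend([context, response])  # grow the history turn by turn
    return samples

dialog = [("hi", "hello, how can I help?"), ("find pizza", "ok, searching")]
samples = turns_with_history(dialog)
```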
deeppavlov.dataset_iterators.morphotagger_iterator.preprocess_data(data: List[Tuple[List[str], List[str]]], to_lower: bool = True, append_case: str = 'first') → List[Tuple[List[Tuple[str]], List[str]]]

    Processes all words in data using process_word().

    Parameters:
    - data – a list of pairs (words, tags), each pair corresponding to a single sentence
    - to_lower – whether to lowercase words
    - append_case – whether to add a case mark

    Returns: a list of preprocessed sentences
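The per-word processing can be sketched as lowercasing plus an explicit case mark. The mark tokens ('<FIRST_UPPER>', '<ALL_UPPER>') and their placement are illustrative assumptions, not the exact symbols process_word() emits:

```python
def process_word(word, to_lower=True, append_case="first"):
    """Lowercase a word and attach a case mark: '<ALL_UPPER>' for all-caps
    words, '<FIRST_UPPER>' for capitalized ones (mark names are assumed)."""
    if word.isupper() and len(word) > 1:
        mark = "<ALL_UPPER>"
    elif word and word[0].isupper():
        mark = "<FIRST_UPPER>"
    else:
        mark = None
    processed = word.lower() if to_lower else word
    if mark is None:
        return (processed,)
    # append_case='first' puts the mark before the word, otherwise after
    return (mark, processed) if append_case == "first" else (processed, mark)

out = process_word("Moscow")
```

Keeping case as a separate mark lets the tagger share embeddings between "Moscow" and "moscow" while still seeing capitalization, which matters for proper-noun tags.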
class deeppavlov.dataset_iterators.morphotagger_iterator.MorphoTaggerDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, validation_split: float = 0.2)

    Iterates over data for morphological tagging. A subclass of DataLearningIterator.

    Parameters:
    - seed – random seed for data shuffling
    - shuffle – whether to shuffle data during batching
    - validation_split – the fraction of validation data (used only if there is no valid subset in data)
class deeppavlov.dataset_iterators.sqlite_iterator.SQLiteDataIterator(data_dir: str = '', data_url: str = 'http://files.deeppavlov.ai/datasets/wikipedia/enwiki.db', batch_size: int = None, shuffle: bool = None, seed: int = None, **kwargs)

    Iterates over an SQLite database: generates batches from SQLite data and provides access to document ids and document contents.

    Parameters:
    - data_dir – a directory where the downloaded DB is saved
    - data_url – a URL from which to download the DB
    - batch_size – the number of samples in a single batch
    - shuffle – whether to shuffle data during batching
    - seed – random seed for data shuffling

    Attributes:
    - connect – a DB connection
    - db_name – a DB name
    - doc_ids – DB document ids
    - doc2index – a dictionary of document indices and their titles
    - batch_size – the number of samples in a single batch
    - shuffle – whether to shuffle data during batching
    - random – an instance of the Random class
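The id/content lookups this iterator provides can be sketched with Python's built-in sqlite3 module. The table and column names below ("documents", "id", "text") are assumptions for illustration; the real enwiki.db schema may differ:

```python
import sqlite3

# Build a tiny in-memory documents table with the shape the sketch assumes:
# an id column and a text column (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, text TEXT)")
conn.executemany("INSERT INTO documents VALUES (?, ?)",
                 [("doc1", "first article"), ("doc2", "second article")])

def get_doc_ids(conn):
    """Return all document ids, analogous to the doc_ids attribute."""
    return [row[0] for row in conn.execute("SELECT id FROM documents ORDER BY id")]

def get_doc_content(conn, doc_id):
    """Fetch one document's text by id, using parameter substitution."""
    row = conn.execute("SELECT text FROM documents WHERE id = ?", (doc_id,)).fetchone()
    return row[0] if row else None

ids = get_doc_ids(conn)
content = get_doc_content(conn, "doc2")
```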
class deeppavlov.dataset_iterators.typos_iterator.TyposDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)

    Implementation of DataLearningIterator used for training ErrorModel.
class deeppavlov.dataset_iterators.ranking_iterator.RankingIterator(data: Dict[str, List], sample_candidates_pool: bool = False, sample_candidates_pool_valid: bool = True, sample_candidates_pool_test: bool = True, num_negative_samples: int = 10, num_ranking_samples_valid: int = 10, num_ranking_samples_test: int = 10, seed: int = None, shuffle: bool = False, len_vocab: int = 0, pos_pool_sample: bool = False, pos_pool_rank: bool = True, random_batches: bool = False, batches_per_epoch: int = None, triplet_mode: bool = True, hard_triplets_sampling: bool = False, num_positive_samples: int = 5)

    The class contains methods for iterating over a dataset for ranking in training, validation and test mode.

    Note: each sample in data['train'] is arranged as follows: {'context': 21507, 'response': 7009, 'pos_pool': [7009, 7010], 'neg_pool': None}. The context is stored under the 'context' key of the sample and is represented by a single integer. The correct response is stored under the 'response' key; its value is also always a single integer. The list of possible correct responses (there may be several) is stored under the 'pos_pool' key; the value of 'response' should be equal to one item of that list. The list of possible negative responses (there can be many of them, 100–10000) is stored under the 'neg_pool' key; its value is None when global sampling is used, or a list of fixed length when sampling from predefined negative responses is used. It is important that the values in 'pos_pool' and 'neg_pool' do not overlap. Single items in 'context', 'response', 'pos_pool' and 'neg_pool' are represented by single integers that map to lists of integers via some integer-to-list-of-integers dictionary; these lists of integers are converted to lists of tokens with some integer-to-token dictionary. Samples in data['valid'] and data['test'] have almost the same representation as the train sample shown above.

    Parameters:
    - data – a dictionary containing training, validation and test parts of the dataset, obtainable via the train, valid and test keys
    - sample_candidates_pool – whether to sample candidates from a predefined pool of candidates for each sample in training mode; if False, negative sampling from the whole data will be performed
    - sample_candidates_pool_valid – whether to validate a model on a predefined pool of candidates for each sample; if False, sampling from the whole data will be performed for validation
    - sample_candidates_pool_test – whether to test a model on a predefined pool of candidates for each sample; if False, sampling from the whole data will be performed for testing
    - num_negative_samples – the size of a predefined pool of candidates, or of a data subsample from the whole data, in training mode
    - num_ranking_samples_valid – the size of a predefined pool of candidates, or of a data subsample from the whole data, in validation mode
    - num_ranking_samples_test – the size of a predefined pool of candidates, or of a data subsample from the whole data, in test mode
    - seed – random seed
    - shuffle – whether to shuffle data
    - len_vocab – the length of the vocabulary used for sampling in training, validation and test mode
    - pos_pool_sample – whether to sample the response from pos_pool each time a batch is generated; if False, the value stored under the 'response' key will be used
    - pos_pool_rank – whether to count samples from the whole pos_pool as correct answers in validation / test mode
    - random_batches – whether to choose batches randomly or iterate over data sequentially in training mode
    - batches_per_epoch – the number of batches to choose per epoch in training mode; only required if random_batches is set to True
    - triplet_mode – whether to use a model with triplet loss; if False, a model with cross-entropy loss will be used
    - hard_triplets_sampling – whether to use the hard-triplets method of sampling in training mode
    - num_positive_samples – the number of contexts to choose from pos_pool for each context; only required if hard_triplets_sampling is set to True
    gen_batches(batch_size: int, data_type: str = 'train', shuffle: bool = True) → Tuple[List[List[Tuple[int, int]]], List[int]]

        Generates batches of inputs and expected outputs to train neural networks.

        Parameters:
        - batch_size – the number of samples in a batch
        - data_type – can be either 'train', 'test' or 'valid'
        - shuffle – whether to shuffle the dataset before batching

        Returns: a tuple of a batch of inputs and a batch of expected outputs. Inputs and expected outputs have a different structure and meaning depending on the class attribute values and data_type.
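The global-negative-sampling case described in the note above (neg_pool is None, candidates drawn from the whole vocabulary while excluding pos_pool) can be sketched as follows. This is an illustrative re-implementation; the integer ids in the example sample are arbitrary:

```python
import random

def sample_negatives(sample, num_negative_samples, len_vocab, seed=None):
    """When neg_pool is None, draw response ids from the whole vocabulary,
    excluding the correct responses listed in pos_pool; otherwise use the
    predefined neg_pool."""
    rng = random.Random(seed)
    if sample["neg_pool"] is not None:
        pool = sample["neg_pool"]
    else:
        positives = set(sample["pos_pool"])
        pool = [i for i in range(len_vocab) if i not in positives]
    return rng.sample(pool, num_negative_samples)

# toy sample in the documented layout; len_vocab is kept tiny for illustration
sample = {"context": 21507, "response": 7009, "pos_pool": [7009, 7010], "neg_pool": None}
negatives = sample_negatives(sample, num_negative_samples=3, len_vocab=20, seed=0)
```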
class deeppavlov.dataset_iterators.squad_iterator.SquadIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)

    SquadIterator allows iterating over examples in SQuAD-like datasets and is used to train SquadModel. It extracts context, question, answer_text and answer_start position from the dataset. An example from a dataset is a tuple of (context, question) and (answer_text, answer_start).

    Attributes:
    - train – train examples
    - valid – validation examples
    - test – test examples
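The extraction of ((context, question), (answer_text, answer_start)) pairs can be sketched against the standard SQuAD JSON layout (data → paragraphs → qas → answers). This is an illustrative sketch, not SquadIterator's actual code:

```python
def squad_examples(dataset):
    """Flatten a SQuAD-like dict into ((context, question),
    (answer_text, answer_start)) pairs."""
    examples = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    examples.append(((context, qa["question"]),
                                     (answer["text"], answer["answer_start"])))
    return examples

dataset = {"data": [{"paragraphs": [{
    "context": "Paris is the capital of France.",
    "qas": [{"question": "What is the capital of France?",
             "answers": [{"text": "Paris", "answer_start": 0}]}],
}]}]}
examples = squad_examples(dataset)
```

answer_start is a character offset into context, so context[answer_start:answer_start + len(answer_text)] should recover answer_text.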