dataset_iterators¶
Concrete DatasetIterator classes.
-
class deeppavlov.dataset_iterators.basic_classification_iterator.BasicClassificationDatasetIterator(data: dict, fields_to_merge: Optional[List[str]] = None, merged_field: Optional[str] = None, field_to_split: Optional[str] = None, split_fields: Optional[List[str]] = None, split_proportions: Optional[List[float]] = None, seed: Optional[int] = None, shuffle: bool = True, split_seed: Optional[int] = None, stratify: Optional[bool] = None, *args, **kwargs)[source]¶
Gets a data dictionary from a DatasetReader instance, merges fields if necessary, and splits a field if necessary.
- Parameters
data – dictionary of data with fields "train", "valid" and "test" (or some of them)
fields_to_merge – list of fields (out of "train", "valid", "test") to merge
merged_field – name of the field (out of "train", "valid", "test") to save the merged fields to
field_to_split – name of the field (out of "train", "valid", "test") to split
split_fields – list of fields (out of "train", "valid", "test") to save the split fields to
split_proportions – list of corresponding proportions for splitting
seed – random seed for iterating
shuffle – whether to shuffle examples in batches
split_seed – random seed for splitting the dataset; if split_seed is None, the split is based on seed
stratify – whether to use a stratified split
*args – arguments
**kwargs – keyword arguments
-
data ¶ dictionary of data with fields "train", "valid" and "test" (or some of them)
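The field-splitting behaviour described by field_to_split, split_fields and split_proportions can be sketched in plain Python. This is a minimal illustration of the technique, not DeepPavlov's actual implementation; the helper name split_field is hypothetical.

```python
import random

def split_field(examples, proportions, seed=42):
    """Split a list of (x, y) examples into parts with the given
    proportions, shuffling with a fixed seed first (a sketch of the
    split logic described above, not DeepPavlov's code)."""
    examples = examples[:]
    random.Random(seed).shuffle(examples)
    parts, start = [], 0
    for p in proportions[:-1]:
        size = int(len(examples) * p)
        parts.append(examples[start:start + size])
        start += size
    parts.append(examples[start:])  # remainder goes to the last part
    return parts

# e.g. split a "train" field into new train/valid parts with 80/20 proportions
data = {"train": [(f"text {i}", i % 2) for i in range(10)]}
train, valid = split_field(data["train"], [0.8, 0.2])
```

Passing a split_seed-style argument instead of reusing the iteration seed keeps the train/valid boundary stable across runs even when batch shuffling changes.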
-
class deeppavlov.dataset_iterators.siamese_iterator.SiameseIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: Optional[int] = None, shuffle: bool = True, *args, **kwargs)[source]¶
The class contains methods for iterating over a dataset for ranking in training, validation and test modes.
-
class deeppavlov.dataset_iterators.sqlite_iterator.SQLiteDataIterator(load_path: Union[str, pathlib.Path], batch_size: Optional[int] = None, shuffle: Optional[bool] = None, seed: Optional[int] = None, **kwargs)[source]¶
Iterates over an SQLite database: generates batches from SQLite data and retrieves document ids and documents.
- Parameters
load_path – a path to local DB file
batch_size – a number of samples in a single batch
shuffle – whether to shuffle data during batching
seed – random seed for data shuffling
-
connect ¶ a DB connection
-
db_name ¶ a DB name
-
doc_ids ¶ DB document ids
-
doc2index ¶ a dictionary of document indices and their titles
-
batch_size ¶ a number of samples in a single batch
-
shuffle ¶ whether to shuffle data during batching
-
random ¶ an instance of the Random class
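The batching pattern over an SQLite database can be sketched with the standard sqlite3 module: collect document ids once, then yield (optionally shuffled) batches of rows. This is illustrative stdlib code under an assumed one-table schema, not SQLiteDataIterator itself.

```python
import random
import sqlite3

# Assumed toy schema: a single "documents" table of (id, text) rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, text TEXT)")
conn.executemany("INSERT INTO documents VALUES (?, ?)",
                 [(f"doc{i}", f"body of document {i}") for i in range(5)])

# Collect all document ids up front, as the doc_ids attribute suggests.
doc_ids = [row[0] for row in conn.execute("SELECT id FROM documents")]

def gen_batches(ids, batch_size=2, shuffle=False, seed=0):
    """Yield batches of (id, text) rows; shuffling uses a seeded Random
    instance, mirroring the random attribute described above."""
    ids = ids[:]
    if shuffle:
        random.Random(seed).shuffle(ids)
    for i in range(0, len(ids), batch_size):
        batch_ids = ids[i:i + batch_size]
        placeholders = ",".join("?" * len(batch_ids))
        rows = conn.execute(
            f"SELECT id, text FROM documents WHERE id IN ({placeholders})",
            batch_ids).fetchall()
        yield rows

batches = list(gen_batches(doc_ids, batch_size=2))
```

Fetching only the ids eagerly and the row bodies per batch keeps memory bounded when the database is large.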
-
class deeppavlov.dataset_iterators.squad_iterator.SquadIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: Optional[int] = None, shuffle: bool = True, *args, **kwargs)[source]¶
SquadIterator allows iterating over examples in SQuAD-like datasets. It is used to train torch_transformers_squad:TorchTransformersSquad. It extracts context, question, answer_text and the answer_start position from the dataset. An example from the dataset is a tuple of (context, question) and (answer_text, answer_start).
-
train ¶ train examples
-
valid ¶ validation examples
-
test ¶ test examples
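The example layout described above can be shown with a tiny hand-made sample (the strings here are invented for illustration): x is a (context, question) pair, y is an (answer_text, answer_start) pair, and answer_start is a character offset into the context.

```python
# A minimal SQuAD-style example, matching the tuple layout described above.
context = "DeepPavlov is an open-source conversational AI library."
question = "What is DeepPavlov?"
answer_text = "an open-source conversational AI library"
answer_start = context.index(answer_text)

x = (context, question)
y = (answer_text, answer_start)

# answer_start indexes into the context, so the answer span can be recovered:
recovered = context[answer_start:answer_start + len(answer_text)]
```

Keeping the offset character-based (rather than token-based) is what makes the format tokenizer-agnostic; span-extraction models map it to token positions themselves.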
-
class deeppavlov.dataset_iterators.typos_iterator.TyposDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: Optional[int] = None, shuffle: bool = True, *args, **kwargs)[source]¶
Implementation of DataLearningIterator used for training ErrorModel.