deeppavlov.models.vectorizers
- class deeppavlov.models.vectorizers.hashing_tfidf_vectorizer.HashingTfIdfVectorizer(tokenizer: Component, hash_size=16777216, doc_index: Optional[dict] = None, save_path: Optional[str] = None, load_path: Optional[str] = None, **kwargs)
Create a tfidf matrix of size [n_documents X n_features(hash_size)] from a collection of documents.
- Parameters
tokenizer – a tokenizer class
hash_size – a hash size, power of two
doc_index – a dictionary of document ids and their titles
save_path – a path to the .npz file where the tfidf matrix is saved
load_path – a path to the .npz file from which the tfidf matrix is loaded
- hash_size
a hash size
- tokenizer
instance of a tokenizer class
- term_freqs
a dictionary with tfidf terms and their frequencies
- doc_index
document ids, either provided by the user or generated automatically
- rows
tfidf matrix rows corresponding to terms
- cols
tfidf matrix cols corresponding to docs
- data
tfidf matrix data corresponding to tfidf values
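The role of hash_size can be illustrated with a minimal sketch of the hashing trick: each token is hashed and reduced to a feature index in the range [0, hash_size). The function name hash_token and the use of zlib.crc32 are assumptions made for illustration only; the hash function the library actually uses is not stated on this page and may differ.

```python
import zlib

HASH_SIZE = 2 ** 24  # 16777216, the default hash_size in the class signature


def hash_token(token: str, hash_size: int = HASH_SIZE) -> int:
    """Map a token to a feature index in [0, hash_size).

    crc32 is an illustrative stand-in, not the library's hash function.
    """
    return zlib.crc32(token.encode("utf-8")) % hash_size


# The same token always lands in the same bucket.
indices = [hash_token(t) for t in ["tf", "idf", "tf"]]
```

When hash_size is a power of two, the modulo reduction is equivalent to keeping the low bits of the hash, which is one reason such sizes are commonly chosen.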
- __call__(questions: List[str]) → csr_matrix
Transform input list of documents to tfidf vectors.
- Parameters
questions – a list of input strings
- Returns
transformed documents as a csr_matrix with shape [n_documents X hash_size]
- fit(docs: List[str], doc_ids: List[Any], doc_nums: List[int]) → None
Fit the vectorizer.
- Parameters
docs – a list of input documents
doc_ids – a list of document ids corresponding to input documents
doc_nums – a list of document integer ids as they appear in a database
- Returns
None
- get_count_matrix(row: List[int], col: List[int], data: List[int], size: int) → csr_matrix
Get count matrix.
- Parameters
row – tfidf matrix rows corresponding to terms
col – tfidf matrix cols corresponding to docs
data – tfidf matrix data corresponding to tfidf values
size – doc_index size
- Returns
a count csr_matrix
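The row, col and data arguments describe the matrix in coordinate form: entry k says that term row[k] occurs data[k] times in document col[k]. The sketch below assembles such triples into a count table using only the standard library. The name toy_count_matrix, the explicit hash_size argument, and the dictionary representation are illustrative assumptions; the real method returns a scipy csr_matrix, whose constructor accepts the same (data, (row, col)) triples and sums duplicate coordinates.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def toy_count_matrix(row: List[int], col: List[int], data: List[int],
                     size: int, hash_size: int
                     ) -> Tuple[Dict[Tuple[int, int], int], Tuple[int, int]]:
    """Accumulate (term, doc, count) triples into a sparse count table.

    Returns the non-zero cells and the logical shape (hash_size, size).
    """
    cells: Dict[Tuple[int, int], int] = defaultdict(int)
    for r, c, v in zip(row, col, data):
        if not (0 <= r < hash_size and 0 <= c < size):
            raise IndexError(f"cell ({r}, {c}) is outside the matrix")
        cells[(r, c)] += v  # repeated coordinates are summed
    return dict(cells), (hash_size, size)


cells, shape = toy_count_matrix(
    row=[3, 3, 5], col=[0, 0, 1], data=[1, 2, 4], size=2, hash_size=8
)
```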
- get_counts(docs: List[str], doc_ids: List[Any]) → Generator[Tuple[KeysView, ValuesView, List[int]], Any, None]
Get term counts for a list of documents.
- Parameters
docs – a list of input documents
doc_ids – a list of document ids corresponding to input documents
- Yields
a tuple of term hashes, count values and column ids
- Returns
None
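A rough sketch of what such a generator might look like is given below: for each document it yields the hashed terms, their counts, and a column id repeated once per distinct term. The whitespace tokenizer, the crc32 hash, the explicit doc_index argument and the name toy_get_counts are all assumptions for illustration; the actual method uses the configured tokenizer and the instance's own state.

```python
import zlib
from collections import Counter
from typing import Any, Dict, Iterator, KeysView, List, Tuple, ValuesView


def toy_get_counts(docs: List[str], doc_ids: List[Any],
                   doc_index: Dict[Any, int], hash_size: int = 2 ** 24
                   ) -> Iterator[Tuple[KeysView, ValuesView, List[int]]]:
    """Yield (term hashes, counts, column ids) for each document."""
    for doc, doc_id in zip(docs, doc_ids):
        tokens = doc.lower().split()  # stand-in for the real tokenizer
        counts = Counter(
            zlib.crc32(t.encode("utf-8")) % hash_size for t in tokens
        )
        # One column id per distinct term, so the three sequences align.
        col = [doc_index[doc_id]] * len(counts)
        yield counts.keys(), counts.values(), col


results = list(toy_get_counts(["a b a"], ["d0"], {"d0": 0}))
```

Each yielded triple can be appended to running row, data and col lists, which is the coordinate form expected by get_count_matrix.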
- static get_tfidf_matrix(count_matrix: csr_matrix) → Tuple[csr_matrix, array]
Convert a count matrix into a tfidf matrix.
- Parameters
count_matrix – a count matrix
- Returns
a tuple of tfidf matrix and term frequencies
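This page does not state the weighting formula, so the following is only a sketch under an assumed scheme: tf = log(1 + count) and a smoothed idf = log((N - n_t + 0.5) / (n_t + 0.5)) clipped at zero, where N is the number of documents and n_t the number of documents containing term t. The function name toy_tfidf and the dictionary-based matrix are hypothetical; the real method operates on a csr_matrix.

```python
import math
from typing import Dict, Tuple

Cell = Tuple[int, int]  # (term index, document index)


def toy_tfidf(cells: Dict[Cell, int], n_docs: int
              ) -> Tuple[Dict[Cell, float], Dict[int, int]]:
    """Convert a sparse count table into tfidf weights (assumed formula)."""
    positive = {k: v for k, v in cells.items() if v > 0}

    # Document frequency: number of documents in which each term occurs.
    doc_freq: Dict[int, int] = {}
    for term, _doc in positive:
        doc_freq[term] = doc_freq.get(term, 0) + 1

    tfidf: Dict[Cell, float] = {}
    for (term, doc), count in positive.items():
        n_t = doc_freq[term]
        idf = max(0.0, math.log((n_docs - n_t + 0.5) / (n_t + 0.5)))
        tfidf[(term, doc)] = math.log1p(count) * idf
    return tfidf, doc_freq


weights, freqs = toy_tfidf({(1, 0): 2, (1, 1): 1, (2, 0): 1}, n_docs=4)
```

In this example term 1 appears in half of the four documents, so its assumed idf is log(1) = 0 and its weights vanish, while the rarer term 2 keeps a positive weight.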
- load() → Tuple[csr_matrix, Dict]
Load a tfidf matrix as csr_matrix.
- Returns
a tuple of tfidf matrix and csr data
- Raises
FileNotFoundError – if load_path doesn’t exist
- partial_fit(docs: List[str], doc_ids: List[Any], doc_nums: List[int]) → None
Partially fit on one batch.
- Parameters
docs – a list of input documents
doc_ids – a list of document ids corresponding to input documents
doc_nums – a list of document integer ids as they appear in a database
- Returns
None
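Batch-wise fitting can be pictured as accumulating coordinate triples across calls, so that a single matrix can be built once all batches have been seen. The class below is a hypothetical stand-in (ToyAccumulator is not part of the library), with a whitespace tokenizer and crc32 hashing assumed for illustration; it only mirrors the documented attributes doc_index, rows, cols and data.

```python
import zlib
from collections import Counter
from typing import Any, Dict, List


class ToyAccumulator:
    """Collect term counts batch by batch (illustrative sketch only)."""

    def __init__(self, hash_size: int = 2 ** 24) -> None:
        self.hash_size = hash_size
        self.doc_index: Dict[Any, int] = {}
        self.rows: List[int] = []
        self.cols: List[int] = []
        self.data: List[int] = []

    def partial_fit(self, docs: List[str], doc_ids: List[Any],
                    doc_nums: List[int]) -> None:
        # Remember which matrix column each document id maps to.
        for doc_id, num in zip(doc_ids, doc_nums):
            self.doc_index[doc_id] = num
        # Append this batch's (term, doc, count) triples to the running lists.
        for doc, doc_id in zip(docs, doc_ids):
            counts = Counter(
                zlib.crc32(t.encode("utf-8")) % self.hash_size
                for t in doc.lower().split()
            )
            self.rows.extend(counts.keys())
            self.data.extend(counts.values())
            self.cols.extend([self.doc_index[doc_id]] * len(counts))


acc = ToyAccumulator()
acc.partial_fit(["red green"], ["a"], [0])
acc.partial_fit(["green green blue"], ["b"], [1])
```

After the two calls, acc.rows, acc.cols and acc.data hold the triples for both batches and could be passed to a count-matrix builder in one step.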