corpora.mmcorpus – Corpus in Matrix Market format
Corpus in the Matrix Market format.
Initialize the matrix reader.
The input refers to a file on the local filesystem, which is expected to be in the sparse (coordinate) Matrix Market format. Documents are assumed to be the rows of the matrix, and document features its columns.
input is either a string (file path) or a file-like object that supports seek() (e.g. gzip.GzipFile, bz2.BZ2File).
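To make the row-per-document layout concrete, here is a minimal standard-library sketch of reading a coordinate Matrix Market file. It is not the reader described on this page: the function name read_mm_documents and the sample data are assumptions made for illustration, and only the plain "row column value" entry form is handled.

```python
import io
from collections import defaultdict


def read_mm_documents(lines):
    """Parse sparse (coordinate) Matrix Market text into per-row documents.

    `lines` is any iterable of text lines, e.g. an open file object.
    Returns (num_docs, num_terms, docs), where docs[i] is a list of
    (column_index, value) pairs for row i, with 0-based indices.
    """
    lines = iter(lines)
    header = next(lines)
    if not header.lower().startswith("%%matrixmarket matrix coordinate"):
        raise ValueError("expected a coordinate Matrix Market header")

    # Skip comment lines (starting with '%') and blank lines, then read
    # the size line: number of rows, number of columns, number of entries.
    for line in lines:
        if line.strip() and not line.startswith("%"):
            num_docs, num_terms, _num_nnz = (int(x) for x in line.split())
            break
    else:
        raise ValueError("missing size line")

    docs = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        row, col, value = line.split()
        # Matrix Market indices are 1-based; convert them to 0-based.
        docs[int(row) - 1].append((int(col) - 1, float(value)))

    # Rows without any entry become empty documents.
    return num_docs, num_terms, [docs.get(i, []) for i in range(num_docs)]


# Hypothetical sample: 3 documents, 4 features, 4 non-zero entries.
sample = io.StringIO(
    "%%MatrixMarket matrix coordinate real general\n"
    "% comment lines may follow the header\n"
    "3 4 4\n"
    "1 1 2.0\n"
    "1 3 1.0\n"
    "3 2 5.0\n"
    "3 4 0.5\n"
)
num_docs, num_terms, documents = read_mm_documents(sample)
print(documents)
```

The sketch takes an iterable of lines rather than a path, so the same function works on an open file or an in-memory buffer; the second document has no entries and therefore comes back empty.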
Return the document at file offset offset (in bytes).
Load a previously saved object from file (also see save).
If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don’t use mmap, load large arrays as normal objects.
Save the object to file (also see load).
If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.
You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.
ignore is a set of attribute names that should not be serialized (file handles, caches, etc.). On a subsequent load(), these attributes will be set to None.
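The effect of ignore can be illustrated with a small standard-library sketch. This is not the library's implementation: save_attrs and load_attrs are hypothetical helpers, and the object is a plain SimpleNamespace. It only mirrors the behaviour described above, where ignored attributes are left out of the file and come back as None.

```python
import os
import pickle
import tempfile
from types import SimpleNamespace


def save_attrs(obj, fname, ignore=frozenset()):
    """Pickle the attributes of `obj`, leaving out the names in `ignore`."""
    state = {k: v for k, v in vars(obj).items() if k not in ignore}
    with open(fname, "wb") as fout:
        # Remember which names were skipped so they can be reset on load.
        pickle.dump({"state": state, "ignored": sorted(ignore)}, fout)


def load_attrs(fname):
    """Rebuild an object; attributes ignored at save time come back as None."""
    with open(fname, "rb") as fin:
        payload = pickle.load(fin)
    obj = SimpleNamespace(**payload["state"])
    for name in payload["ignored"]:
        setattr(obj, name, None)
    return obj


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.pkl")
    original = SimpleNamespace(num_docs=3, num_terms=4, cache={"hot": [1, 2]})
    save_attrs(original, path, ignore={"cache"})
    restored = load_attrs(path)

print(restored.num_docs, restored.num_terms, restored.cache)
```

Code that reads such an attribute after loading has to be prepared to find None and rebuild the cache or reopen the file handle.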
Save a corpus in the Matrix Market format to disk.
This function is automatically called by MmCorpus.serialize; don’t call it directly, call serialize instead.
Iterate through the document stream corpus, saving the documents to fname and recording the byte offset of each document. Save the resulting index structure to the file index_fname (or fname.index if index_fname is not set).
This relies on the underlying corpus class serializer providing (in addition to standard iteration):
the save_corpus method, which returns a sequence of byte offsets, one for each saved document,
the docbyoffset(offset) method, which returns a document positioned at offset bytes within the persistent storage (file).
Example:
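>>> from gensim.corpora import MmCorpus
>>> # `corpus` is any iterable of documents in bag-of-words format
>>> MmCorpus.serialize('test.mm', corpus)
>>> mm = MmCorpus('test.mm')  # `mm` document stream now has random access
>>> print(mm[42])  # retrieve document no. 42, etc.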
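The byte-offset index behind this random access can be sketched with the standard library alone. The sketch assumes a simplified "docno termid value" line format rather than the real Matrix Market layout. The names save_with_offsets and docbyoffset are illustrative (the latter borrowed from the method described above), and using -1 for an empty document is a convention of this sketch.

```python
import os
import tempfile


def save_with_offsets(docs, fname):
    """Write docs as 'docno termid value' lines; return each doc's start offset."""
    offsets = []
    with open(fname, "wb") as fout:
        for docno, doc in enumerate(docs):
            if not doc:
                offsets.append(-1)  # empty document: nothing is written for it
                continue
            offsets.append(fout.tell())  # byte position where this document starts
            for termid, value in doc:
                fout.write(f"{docno} {termid} {value}\n".encode("utf8"))
    return offsets


def docbyoffset(fname, offset):
    """Return the document whose first line starts `offset` bytes into the file."""
    if offset == -1:
        return []
    doc, current = [], None
    with open(fname, "rb") as fin:
        fin.seek(offset)  # jump straight to the document, no scan from the start
        for line in fin:
            docno, termid, value = line.decode("utf8").split()
            if current is None:
                current = docno
            elif docno != current:
                break  # reached the first line of the next document
            doc.append((int(termid), float(value)))
    return doc


corpus = [[(0, 2.0), (2, 1.0)], [], [(1, 5.0), (3, 0.5)]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.txt")
    index = save_with_offsets(corpus, path)
    second = docbyoffset(path, index[1])
    third = docbyoffset(path, index[2])

print(index)
print(third)
```

Because the index holds one offset per document, fetching document number n costs a single seek plus reading that document's own lines, regardless of where it sits in the file.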
Skip file headers that appear before the first document.