sample.textmatrix {lsa} | R Documentation |
Draws a random sample of the documents in a corpus, to help reduce the size of the corpus.
sample.textmatrix(textmatrix, samplesize, index.return=FALSE)
textmatrix | A document-term matrix. |
samplesize | Desired number of files in the sample. |
index.return | If set to TRUE, the positions of the sampled documents in the original column vectors are returned as well. |
A corpus is often too big to be processed in memory. One technique to reduce its size is to select a subset of the documents at random, on the assumption that random selection does not change the nature of the term sets and their distributions.
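The idea can be sketched in base R without the package. The toy matrix `tdm`, its counts, and the variable names below are illustrative assumptions and are not part of lsa; the sketch only shows random selection of document columns and a rough comparison of term shares before and after sampling.

```r
# Toy term-by-document matrix: rows are terms, columns are documents.
# The counts are made up for illustration.
tdm <- matrix(
  c(1, 0, 0, 0,
    1, 0, 1, 2,
    0, 1, 0, 0,
    0, 0, 2, 0,
    1, 1, 0, 1,
    0, 1, 0, 0),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    c("cat", "dog", "hamster", "monster", "mouse", "sushi"),
    c("D1", "D2", "D3", "D4")
  )
)

set.seed(42)                           # fixed seed so the draw is repeatable
samplesize <- 3
ix <- sample.int(ncol(tdm), samplesize)  # column positions, drawn without replacement
sampled <- tdm[, ix, drop = FALSE]       # keep only the sampled documents

# Compare each term's share of all tokens in the full and the sampled corpus.
full_share   <- rowSums(tdm) / sum(tdm)
sample_share <- rowSums(sampled) / sum(sampled)
print(round(cbind(full = full_share, sample = sample_share), 2))
```

On a corpus this small the shares can differ noticeably; the assumption that sampling preserves the distributions is more plausible for large corpora.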
filelist | A list of filenames of the documents in the corpus. |
ix | If index.return is set to TRUE, a list is returned; |
Fridolin Wild f.wild@open.ac.uk
# create some files
td = tempfile()
dir.create(td)
write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/"))
write( c("hamster", "mouse", "sushi"), file=paste(td, "D2", sep="/"))
write( c("dog", "monster", "monster"), file=paste(td, "D3", sep="/"))
write( c("dog", "mouse", "dog"), file=paste(td, "D4", sep="/"))
# create the document-term matrix
myMatrix = textmatrix(td, minWordLength=1)
# draw a random sample of size 3
sample.textmatrix(myMatrix, 3)
# clean up
unlink(td, recursive=TRUE)
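To illustrate what returning the positions alongside the sample can look like, here is a minimal base R sketch. The helper `sample_columns` and the element names `x` and `ix` are hypothetical; this is not the lsa implementation and may differ from what `sample.textmatrix` actually returns.

```r
# Hypothetical helper, NOT the lsa implementation: samples document columns
# from a term-by-document matrix and optionally reports their positions.
sample_columns <- function(m, samplesize, index.return = FALSE) {
  stopifnot(samplesize <= ncol(m))
  ix <- sample.int(ncol(m), samplesize)  # positions in the original matrix
  sampled <- m[, ix, drop = FALSE]
  if (index.return) {
    list(x = sampled, ix = ix)           # element names are assumptions
  } else {
    sampled
  }
}

# Small made-up matrix: three terms, four documents.
m <- matrix(1:12, nrow = 3,
            dimnames = list(c("t1", "t2", "t3"), c("D1", "D2", "D3", "D4")))

set.seed(1)
res <- sample_columns(m, 2, index.return = TRUE)

# The returned positions recover the sampled columns from the original matrix.
stopifnot(identical(res$x, m[, res$ix, drop = FALSE]))
print(colnames(m)[res$ix])
```

Keeping the positions makes it possible to trace each sampled document back to the full corpus, for example to look up metadata stored separately.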