topicsDtm {topics} | R Documentation |
Document Term Matrix
Description
Creates a document term matrix (Dtm) from text data.
Usage
topicsDtm(
  data,
  ngram_window = c(1, 3),
  stopwords = stopwords::stopwords("en", source = "snowball"),
  removalword = "",
  occ_rate = 0,
  removal_mode = "none",
  removal_rate_most = 0,
  removal_rate_least = 0,
  split = 1,
  seed = 42L,
  save_dir,
  load_dir = NULL,
  threads = 1
)
Arguments
data |
(list) the list containing the text data with each entry belonging to a unique id |
ngram_window |
(list) the minimum and maximum n-gram length, e.g. c(1,3) |
stopwords |
(stopwords) the stopwords to remove, e.g. stopwords::stopwords("en", source = "snowball") |
removalword |
(string) the word to remove |
occ_rate |
(integer) the rate of occurrence of a word to be removed |
removal_mode |
(string) the mode of removal: "none", "frequency", "term" or "percentage". "frequency" removes all words that occur fewer than removal_rate_least times or more than removal_rate_most times. "term" removes an absolute number of the least frequent and most frequent terms. "percentage" removes the share of terms given by removal_rate_least and removal_rate_most, relative to the number of terms in the matrix. |
removal_rate_most |
(integer) the rate of most frequent words to be removed, functionality depends on removal_mode |
removal_rate_least |
(integer) the rate of least frequent words to be removed, functionality depends on removal_mode |
split |
(float) the proportion of the data to be used for training |
seed |
(integer) the random seed for reproducibility |
save_dir |
(string) the directory in which to save the results; if NULL, no results are saved. |
load_dir |
(string) the directory to load from. |
threads |
(integer) the number of threads to use |
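The three removal modes described under removal_mode can be illustrated with a small base R sketch. It is not the package's implementation: filter_terms, its arguments and the toy frequencies are hypothetical, and the sketch assumes that the "frequency" bounds are inclusive and that percentages are rounded down to whole terms.

```r
# Toy term frequencies (hypothetical data, not taken from the topics package)
term_freq <- c(the = 50, sad = 12, tired = 8, alone = 4, hopeful = 2, zest = 1)

# Hypothetical helper that mimics the three documented removal modes
filter_terms <- function(freq, mode, rate_least = 0, rate_most = 0) {
  sorted <- sort(freq)  # ascending order: least frequent term first
  n <- length(sorted)
  if (mode == "frequency") {
    # Keep terms whose count lies within [rate_least, rate_most] (assumed inclusive)
    keep <- sorted[sorted >= rate_least & sorted <= rate_most]
  } else if (mode == "term" || mode == "percentage") {
    if (mode == "percentage") {
      # Convert percentages into absolute numbers of terms (assumed rounded down)
      rate_least <- floor(n * rate_least / 100)
      rate_most <- floor(n * rate_most / 100)
    }
    # Drop rate_least terms from the bottom and rate_most terms from the top
    keep <- sorted[seq_len(n) > rate_least & seq_len(n) <= n - rate_most]
  } else {
    keep <- sorted  # "none": nothing is removed
  }
  names(keep)
}

by_frequency <- filter_terms(term_freq, "frequency", rate_least = 4, rate_most = 20)
by_term <- filter_terms(term_freq, "term", rate_least = 1, rate_most = 1)
by_percentage <- filter_terms(term_freq, "percentage", rate_least = 50, rate_most = 20)
print(by_frequency)
print(by_term)
print(by_percentage)
```

The same pair of values therefore means different things per mode: counts of occurrences for "frequency", numbers of terms for "term", and shares of the vocabulary for "percentage".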
Value
the document term matrix
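As a rough picture of what a document term matrix with an n-gram window contains, the base R sketch below builds one by hand for two toy documents. It only illustrates the concept: make_ngrams, the underscore separator and the dense integer matrix are assumptions, and the object returned by topicsDtm may be structured differently.

```r
# Toy documents (hypothetical data)
docs <- c(d1 = "i feel sad", d2 = "i feel fine")

# Hypothetical helper: all n-grams of a token vector for every n in ngram_window
make_ngrams <- function(tokens, ngram_window = c(1, 2)) {
  out <- character(0)
  for (n in seq(ngram_window[1], ngram_window[2])) {
    if (length(tokens) < n) next  # document too short for this n
    starts <- seq_len(length(tokens) - n + 1)
    grams <- vapply(starts, function(i) {
      paste(tokens[i:(i + n - 1)], collapse = "_")
    }, character(1))
    out <- c(out, grams)
  }
  out
}

# Tokenise on single spaces and collect the n-grams of each document
ngrams <- lapply(strsplit(docs, " ", fixed = TRUE), make_ngrams)
vocab <- sort(unique(unlist(ngrams)))

# Rows are documents, columns are terms, cells are occurrence counts
dtm <- t(vapply(ngrams, function(g) {
  as.integer(table(factor(g, levels = vocab)))
}, integer(length(vocab))))
colnames(dtm) <- vocab
print(dtm)
```

Widening ngram_window adds columns for longer phrases, which is why the removal options matter for keeping the vocabulary manageable.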
Examples
# Create a Dtm and remove the terms that occur fewer than 4 times or more than 500 times.
save_dir_temp <- tempfile()
dtm <- topicsDtm(data = dep_wor_data$Depphrase,
                 removal_mode = "frequency",
                 removal_rate_least = 4,
                 removal_rate_most = 500,
                 save_dir = save_dir_temp)
# Create a Dtm and remove the single least frequent and the single most frequent term.
dtm <- topicsDtm(data = dep_wor_data$Depphrase,
                 removal_mode = "term",
                 removal_rate_least = 1,
                 removal_rate_most = 1,
                 save_dir = save_dir_temp)
# Create a Dtm and remove the 1% least frequent and the 1% most frequent terms.
dtm <- topicsDtm(data = dep_wor_data$Depphrase,
                 removal_mode = "percentage",
                 removal_rate_least = 1,
                 removal_rate_most = 1,
                 save_dir = save_dir_temp)
# Load a precomputed Dtm from a directory.
dtm <- topicsDtm(load_dir = save_dir_temp,
                 seed = 42,
                 save_dir = save_dir_temp)