bow_pp_create_basic_text_rep {aifeducation} | R Documentation
This function prepares raw texts for use with TextEmbeddingModel.
bow_pp_create_basic_text_rep(
data,
vocab_draft,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE,
remove_url = TRUE,
remove_separators = TRUE,
split_hyphens = FALSE,
split_tags = FALSE,
language_stopwords = "de",
use_lemmata = FALSE,
to_lower = FALSE,
min_termfreq = NULL,
min_docfreq = NULL,
max_docfreq = NULL,
window = 5,
weights = 1/(1:5),
trace = TRUE
)
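The defaults `window = 5` and `weights = 1/(1:5)` supply one weight per position in the co-occurrence window. The base R lines below only evaluate that default expression. Reading the positions as token distances (so that closer tokens count more) is an assumption, since the page does not state how the weights are applied internally.

```r
# Evaluate the default of the `weights` argument from the usage above.
window <- 5
weights <- 1 / (1:5)

# One weight per position in the window
# (assumption: position corresponds to the distance between two tokens).
stopifnot(length(weights) == window)

print(round(weights, 3))
# [1] 1.000 0.500 0.333 0.250 0.200
```

If `window` is changed, it seems sensible to pass a `weights` vector of the same length, for example `weights = 1/(1:window)`.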
data |
vocab_draft | Object created with bow_pp_create_vocab_draft.
remove_punct |
remove_symbols |
remove_numbers |
remove_url |
remove_separators |
split_hyphens |
split_tags |
language_stopwords |
use_lemmata |
to_lower |
min_termfreq |
min_docfreq |
max_docfreq |
window |
weights |
trace |
Returns a list of class basic_text_rep with the following components.
dfm:
Document-Feature-Matrix. Rows correspond to the documents, columns to the tokens.
Each cell holds the number of times the token occurs in the document.
fcm:
Feature-Co-Occurrence-Matrix.
information:
list containing information about the vocabulary used. These are:
n_sentence:
Number of sentences
n_document_segments:
Number of document segments/raw texts
n_token_init:
Number of initial tokens
n_token_final:
Number of final tokens
n_lemmata:
Number of lemmas
configuration:
list recording whether the vocabulary was converted to lower case and whether it
uses original tokens or lemmas.
language_model:
list containing information about the applied language model. These are:
model:
the udpipe language model
label:
the label of the udpipe language model
upos:
the applied universal part-of-speech tags
language:
the language
vocab:
a data.frame with the original vocabulary
Other Preparation:
bow_pp_create_vocab_draft()
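To make the `dfm` and `fcm` components more concrete, here is a minimal base R sketch that builds both kinds of matrix by hand for a made-up, already tokenized corpus. It does not call aifeducation. The names `docs`, `dfm_toy` and `fcm_toy` are hypothetical, and the symmetric, distance-weighted counting is a simple scheme chosen for illustration, not necessarily what the function does.

```r
# Toy corpus: two already tokenized documents (made-up data).
docs <- list(
  doc1 = c("school", "teacher", "school"),
  doc2 = c("teacher", "student")
)

# Vocabulary: every distinct token, sorted for a stable column order.
vocab <- sort(unique(unlist(docs)))

# Document-feature matrix: one row per document, one column per token.
# Each cell counts how often the token occurs in the document.
dfm_toy <- t(vapply(
  docs,
  function(tokens) tabulate(match(tokens, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(dfm_toy) <- vocab

# Feature co-occurrence matrix with one weight per distance in the window
# (assumed scheme: weight 1/d for two tokens that are d positions apart).
window <- 2
weights <- 1 / (1:window)
fcm_toy <- matrix(0, length(vocab), length(vocab),
                  dimnames = list(vocab, vocab))
for (tokens in docs) {
  n <- length(tokens)
  for (i in seq_len(n)) {
    for (d in seq_len(window)) {
      j <- i + d
      if (j <= n) {
        a <- tokens[i]
        b <- tokens[j]
        fcm_toy[a, b] <- fcm_toy[a, b] + weights[d]
        # Mirror the count so the matrix stays symmetric.
        if (a != b) fcm_toy[b, a] <- fcm_toy[b, a] + weights[d]
      }
    }
  }
}

print(dfm_toy)
print(fcm_toy)
```

In this toy corpus, "school" occurs twice in `doc1`, so `dfm_toy["doc1", "school"]` is 2. "school" and "teacher" are adjacent twice in `doc1`, so `fcm_toy["school", "teacher"]` is also 2.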