ngrams_segmentation {NUSS}    R Documentation
ngrams_segmentation
Description

Segments an input sequence into possible segmented texts based on the
n-grams segmentation approach.

Usage
ngrams_segmentation(
sequences,
ngrams_dictionary,
retrieve = "most-scored",
simplify = TRUE,
omit_zero = TRUE,
score_formula = "points / words.number ^ 2"
)
Arguments

sequences
    character vector, sequences to be segmented (e.g., a hashtag), with or
    without the leading '#'.
ngrams_dictionary
    data.frame containing ids, n-grams to search, words to use for
    segmentation, and their points. See details.
retrieve
    character vector of length 1, the type of segmentation to return:
    'all', 'first-shortest', 'most-pointed', or 'most-scored' (default).
    See value.
simplify
    logical, whether adjacent numbers should be merged into one and
    underscores removed. See the simplification section.
omit_zero
    logical, whether words with 0 points should be omitted from the word
    count. See the simplification section.
score_formula
    character vector of length 1, with the formula used to calculate the
    score.
Value

The output will always be a data.frame. If retrieve = 'all' is used, the
result includes all possible segmentations of the given sequence.
If retrieve = 'first-shortest' is used, the first of the shortest
segmentations (with respect to the order of the words' appearance in the
dictionary) is returned (1 row).
If retrieve = 'most-pointed' is used, the segmentation with the most total
points is returned (1 row).
If retrieve = 'most-scored' is used, the segmentation with the highest score
is returned (1 row), where the score is calculated by default as
score = points / words.number ^ 2 (or as specified by the user).
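As a quick arithmetic illustration of the default formula (the numbers below
are made up): a segmentation worth 12 points that uses 3 counted words scores
12 / 3^2, so segmentations built from fewer, higher-pointed words are preferred.

points <- 12
words.number <- 3
points / words.number^2   # 1.333333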
The output is not in the input order. If the input order is needed, apply the
function to each sequence separately, for example with lapply, as in the
sketch below.
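A minimal sketch of that pattern, assuming each call returns one row under the
default retrieve = 'most-scored' (demo_texts, demo_dict, and tags are
illustrative names, not part of the package):

library(NUSS)
demo_texts <- c("this is science", "science is everywhere")
demo_dict  <- ngrams_dictionary(demo_texts)
tags <- c("thisisscience", "scienceiseverywhere")
# one row per input tag, stacked in the original input order
in_order <- do.call(rbind, lapply(tags, ngrams_segmentation,
                                  ngrams_dictionary = demo_dict))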
Details

The dictionary has to be a data.frame with four named columns: 1) to_search,
2) to_replace, 3) id, 4) points.
'to_search' should be a column of type character, containing the n-grams to
look for. Word case may be used.
'to_replace' should be a column of type character, containing the n-grams
that should be used to build the segmentation vector when 'to_search'
matches the text.
'id' should be a column of type numeric, containing the id of the unigram.
'points' should be a column of type numeric, containing the number of points
for the word; the higher, the better. Unigrams with 0 points can be removed
from the word count with the omit_zero argument. The ngrams_dictionary can
be created with the ngrams_dictionary() function.
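For illustration only, a hypothetical hand-built dictionary with the four
required columns might look like the following (the values are made up; in
practice the dictionary would normally come from ngrams_dictionary()):

custom_dict <- data.frame(
  to_search  = c("thisis", "science"),   # character: n-grams to look for
  to_replace = c("this is", "science"),  # character: words used in the segmentation
  id         = c(1, 2),                  # numeric: unigram id
  points     = c(2, 5)                   # numeric: the higher, the better
)
str(custom_dict)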
Simplification

Two arguments control simplification: simplify removes spaces between numbers
and removes underscores, while omit_zero removes the ids of 0-pointed
unigrams and omits them from the word count.
By default, the segmented sequence will be simplified, and numbers and
underscores will be excluded from the word count used for score computation,
since they are neutral: they have to appear in every segmentation anyway.
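A brief sketch of the effect of these two arguments (demo_dict is an
illustrative name; the exact output depends on the dictionary):

demo_dict <- ngrams_dictionary(c("this is science"))
# simplified output with the defaults
ngrams_segmentation("this_is_science", demo_dict)
# no simplification: underscores are not removed, 0-point words are counted
ngrams_segmentation("this_is_science", demo_dict,
                    simplify = FALSE, omit_zero = FALSE)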
Examples

texts <- c("this is science",
"science is #fascinatingthing",
"this is a scientific approach",
"science is everywhere",
"the beauty of science")
ndict <- ngrams_dictionary(texts)
ngrams_segmentation("thisisscience", ndict)     # concatenated input
ngrams_segmentation("this_is_science", ndict)   # underscore-separated input
ngrams_segmentation("ThisIsScience", ndict)     # camel-case input
ngrams_segmentation("thisisscience",
ndict,
simplify=FALSE,
omit_zero=FALSE)
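The other retrieve modes described in the value section can be requested the
same way (a brief sketch; the outputs depend on the dictionary):

ngrams_segmentation("thisisscience", ndict, retrieve = "all")            # every candidate segmentation
ngrams_segmentation("thisisscience", ndict, retrieve = "first-shortest") # first of the shortest (1 row)
ngrams_segmentation("thisisscience", ndict, retrieve = "most-pointed")   # most total points (1 row)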