ngrams_segmentation {NUSS}    R Documentation

Segmenting sequences with n-grams.

Description

ngrams_segmentation segments input sequences into candidate segmented texts using an n-gram-based segmentation approach.

Usage

ngrams_segmentation(
  sequences,
  ngrams_dictionary,
  retrieve = "most-scored",
  simplify = TRUE,
  omit_zero = TRUE,
  score_formula = "points / words.number ^ 2"
)

Arguments

sequences

character vector of sequences to be segmented (e.g., hashtags).

ngrams_dictionary

data.frame, containing ids, n-grams to search for, words to use for segmentation, and their points. See the ngrams_dictionary section.

retrieve

character vector of length 1, the type of result to return: 'all', 'first-shortest', 'most-pointed', or 'most-scored'. See the Value section.

simplify

logical, whether adjacent numbers should be merged into one and underscores removed. See the Simplification section.

omit_zero

logical, whether words with 0 points should be omitted from the word count. See the Simplification section.

score_formula

character vector of length 1, with formula to calculate score.
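
As an illustration of what the default score_formula expresses, the sketch below evaluates the formula string against a small table of candidate segmentations in base R. The candidate rows, their values, and the column name 'segmented' are hypothetical, and evaluating the string with eval(parse()) is an assumption made for this example, not necessarily how the package does it internally.

```r
# Hypothetical candidate segmentations; the values are illustrative only.
candidates <- data.frame(
  segmented    = c("this is science", "th is is science"),
  points       = c(6, 6),
  words.number = c(3, 4),
  stringsAsFactors = FALSE
)

# Default formula from the Usage section.
score_formula <- "points / words.number ^ 2"

# Evaluate the formula string using the data.frame columns as variables
# (an assumption about the mechanism, used here only for illustration).
candidates$score <- eval(parse(text = score_formula), envir = candidates)

# The candidate with the highest score; fewer words win at equal points.
best <- candidates[which.max(candidates$score), ]
```

Because the word count is squared in the denominator, segmentations with fewer words are favoured when the total points are equal.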

Value

The output is always a data.frame. Its content depends on the retrieve argument:
If retrieve='all' is used, all possible segmentations of the given sequence are returned.
If retrieve='first-shortest' is used, the first of the shortest segmentations is returned (with respect to the order of the words' appearance in the dictionary; 1 row).
If retrieve='most-pointed' is used, the segmentation with the most total points is returned (1 row).
If retrieve='most-scored' is used, the segmentation with the highest score is returned, where the score is calculated as
score = points / words.number ^ 2 (or as specified by the user in score_formula).
The output is not in the input order. If the input order is needed, call the function on each sequence with lapply.
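
One way to keep results aligned with the input order is to segment each sequence separately with lapply, as sketched below. The sketch assumes the NUSS package is installed and uses only the two functions shown in the Examples; the 'sequences' vector and the 'results' list are names chosen for this example.

```r
# Assumes the NUSS package is installed; the block is skipped otherwise.
if (requireNamespace("NUSS", quietly = TRUE)) {
  texts <- c("this is science", "science is everywhere")
  ndict <- NUSS::ngrams_dictionary(texts)

  sequences <- c("scienceiseverywhere", "thisisscience")

  # One call per sequence, so the list follows the input order.
  results <- lapply(sequences, function(s) {
    NUSS::ngrams_segmentation(s, ndict)
  })
  names(results) <- sequences
}
```

Each list element is the data.frame returned for the corresponding input sequence.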

ngrams_dictionary

The dictionary has to be a data.frame with four named columns: 1) to_search, 2) to_replace, 3) id, 4) points.
'to_search' should be a column of type character, containing the n-grams to look for. Word case might be used.
'to_replace' should be a column of type character, containing the n-grams that should be used for creating the segmentation vector if 'to_search' matches the text.
'id' should be a column of type numeric, containing the id of the unigram.
'points' should be a column of type numeric, containing the number of points for the word (the higher, the better). Unigrams with 0 points might be removed from the word count with the omit_zero argument.
The dictionary can be created with the ngrams_dictionary() function.
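
To make the expected layout concrete, here is a minimal sketch of a hand-built data.frame with the four required columns, followed by basic checks of the column names and types. The words, ids, and points are invented for illustration, and 'ndict_manual' is a hypothetical name; a dictionary produced by ngrams_dictionary() may contain different content.

```r
# Hypothetical dictionary with unigrams only; all values are illustrative.
ndict_manual <- data.frame(
  to_search  = c("this", "is", "science", "a"),
  to_replace = c("this", "is", "science", "a"),
  id         = c(1, 2, 3, 4),
  points     = c(2, 4, 3, 0),  # "a" has 0 points (see omit_zero)
  stringsAsFactors = FALSE
)

# Check the structure described in this section.
required <- c("to_search", "to_replace", "id", "points")
stopifnot(
  all(required %in% names(ndict_manual)),
  is.character(ndict_manual$to_search),
  is.character(ndict_manual$to_replace),
  is.numeric(ndict_manual$id),
  is.numeric(ndict_manual$points)
)
```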

Simplification

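Two arguments control simplification:
simplify - merges adjacent numbers into one and removes underscores.
omit_zero - omits words with 0 points from the word count.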

Examples

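# Build an n-grams dictionary from a small set of texts
texts <- c("this is science",
           "science is #fascinatingthing",
           "this is a scientific approach",
           "science is everywhere",
           "the beauty of science")
ndict <- ngrams_dictionary(texts)

# Segment a plain lowercase sequence with the default settings
ngrams_segmentation("thisisscience", ndict)
# Segment a sequence that contains underscores
ngrams_segmentation("this_is_science", ndict)
# Segment a mixed-case sequence
ngrams_segmentation("ThisIsScience", ndict)
# Turn off simplification and keep 0-point words in the word count
ngrams_segmentation("thisisscience",
                    ndict,
                    simplify = FALSE,
                    omit_zero = FALSE)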


[Package NUSS version 0.1.0 Index]