Spectrum {kerntools} | R Documentation |
Spectrum kernel
Description
'Spectrum()' computes the basic Spectrum kernel between strings. This kernel measures the similarity of two strings by counting the substrings of length l that occur in both.
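As a rough illustration of the idea, the sketch below computes the kernel value for a single pair of strings in base R, following the definition in Leslie et al. (inner product of substring counts). The helpers `count_lmers()` and `spectrum_pair()` are hypothetical names introduced here for illustration; they are not part of 'kerntools' and are not claimed to mirror its internals.

```r
# Hypothetical helper (not part of kerntools): counts every substring of
# length l in a single string.
count_lmers <- function(s, l) {
  n <- nchar(s)
  if (n < l) return(table(character(0)))
  starts <- seq_len(n - l + 1)
  table(substring(s, starts, starts + l - 1))
}

# Hypothetical helper: kernel value for two strings, computed as the inner
# product of their l-mer counts (only shared l-mers contribute).
spectrum_pair <- function(s, t, l) {
  cs <- count_lmers(s, l)
  ct <- count_lmers(t, l)
  shared <- intersect(names(cs), names(ct))
  sum(as.numeric(cs[shared]) * as.numeric(ct[shared]))
}

# "hello_world" and "hello_word" share the 2-mers
# he, el, ll, lo, o_, _w, wo, or (each occurring once in both strings).
spectrum_pair("hello_world", "hello_word", l = 2)
```

Under this definition, a substring that occurs several times contributes the product of its counts in the two strings, so repeated motifs weigh more than unique ones.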
Usage
Spectrum(
x,
alphabet,
l = 1,
group.ids = NULL,
weights = NULL,
feat_space = FALSE,
cos.norm = FALSE
)
Arguments
x |
Vector of strings (length N). |
alphabet |
Alphabet of reference. |
l |
Length of the substrings. |
group.ids |
(optional) A vector of ids. If supplied, the kernel is computed over groups of strings within x instead of over the individual strings. |
weights |
(optional) A numeric vector as long as x, used to weight each string differently. |
feat_space |
If FALSE, only the kernel matrix is returned. Otherwise, the feature space (i.e. a table with the number of times that each substring of length l appears in each string) is also returned (Default: FALSE). |
cos.norm |
Should the resulting kernel matrix be cosine normalized? (Default: FALSE). |
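For readers unfamiliar with cosine normalization, the following base R sketch shows the usual definition, in which each entry K[i, j] is divided by sqrt(K[i, i] * K[j, j]). The helper `cos_normalize()` and the small matrix `K` are made up for illustration; this is an assumption about what cos.norm = TRUE corresponds to, not a copy of the package code.

```r
# Hypothetical helper (not part of kerntools): cosine-normalizes a kernel
# matrix so that every diagonal entry becomes 1.
cos_normalize <- function(K) {
  d <- sqrt(diag(K))
  K / outer(d, d)
}

# Toy 2 x 2 kernel matrix, invented for this example.
K <- matrix(c(10, 8,
               8, 9),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("a", "b"), c("a", "b")))

Kn <- cos_normalize(K)
Kn
```

After normalization the off-diagonal entries lie between 0 and 1 for count-based kernels, which makes strings of different lengths easier to compare.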
Details
In large datasets this function may be slow. In that case, you may use the 'stringdot()' function of the 'kernlab' package, or the 'spectrumKernel()' function of the 'kebabs' package.
Value
Kernel matrix (dimension: NxN) or, if feat_space is not FALSE, a list with the kernel matrix and the feature space.
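The relationship between the two returned objects can be sketched in base R: given a table of substring counts with one row per string, the kernel matrix is the matrix of inner products between rows. The helper `lmer_counts()` is a hypothetical name, and this sketch only tabulates substrings that actually occur in the input, which may differ from the layout of the feature space that 'Spectrum()' returns.

```r
# Hypothetical helper (not part of kerntools): builds an N x V table of
# l-mer counts, where V is the number of distinct l-mers observed in x.
lmer_counts <- function(x, l) {
  lmers <- lapply(x, function(s) {
    starts <- seq_len(max(nchar(s) - l + 1, 0))
    substring(s, starts, starts + l - 1)
  })
  vocab <- sort(unique(unlist(lmers)))
  counts <- do.call(rbind, lapply(lmers, function(m) {
    as.numeric(table(factor(m, levels = vocab)))
  }))
  colnames(counts) <- vocab
  counts
}

x <- c(s1 = "abab", s2 = "abba")
feat <- lmer_counts(x, l = 2)  # rows: strings, columns: observed 2-mers
K <- tcrossprod(feat)          # N x N matrix of inner products
K
```

Keeping the count table around is useful when the individual substrings need to be inspected later, for example to see which l-mers drive the similarity between two strings.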
References
Leslie, C., Eskin, E., and Noble, W.S. The spectrum kernel: a string kernel for SVM protein classification. Pac Symp Biocomput. 2002:564-75. PMID: 11928508.
Examples
## Examples of alphabets. _ stands for a blank space, a gap, or the
## start or the end of a sequence.
NT <- c("A","C","G","T","_") # DNA nucleotides
AA <- c("A","C","D","E","F","G","H","I","K","L","M","N","P","Q","R","S","T",
"V","W","Y","_") # canonical amino acids
letters_ <- c(letters,"_")
## Example of data
strings <- c("hello_world","hello_word","hola_mon","kaixo_mundua",
"saluton_mondo","ola_mundo", "bonjour_le_monde")
names(strings) <- c("english1","english_typo","catalan","basque",
"esperanto","galician","french")
## Computing the kernel:
Spectrum(strings,alphabet=letters_,l=2)