RPatternJoin-package {RPatternJoin} | R Documentation |
String Similarity Joins for Hamming and Levenshtein Distances
Description
This package is a tool for edit similarity joins of words under small (< 3) edit distance constraints.
It supports Levenshtein distance and Hamming distance (with insertions/deletions allowed at the end of a word).
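The Hamming variant mentioned above can be illustrated with a minimal base R sketch. It assumes the distance is the number of mismatches over the common prefix plus the difference in lengths; this reading is an assumption that matches the small example in the Examples section, not a statement of how the package computes it. The helper `hamming_end` is hypothetical and not part of RPatternJoin.

```r
# Hypothetical helper (not part of RPatternJoin): Hamming distance where
# trailing insertions/deletions are allowed. Assumed definition: mismatches
# over the common prefix plus the length difference.
hamming_end <- function(a, b) {
  x <- strsplit(a, "")[[1]]          # split into single characters
  y <- strsplit(b, "")[[1]]
  n <- min(length(x), length(y))     # length of the overlapping part
  mismatches <- sum(x[seq_len(n)] != y[seq_len(n)])
  mismatches + abs(length(x) - length(y))
}

hamming_end("ABC", "AX")   # 1 substitution + 1 trailing deletion = 2
hamming_end("ABC", "QQQ")  # 3 substitutions = 3
```

Under this definition, "ABC" and "AX" fall within a cutoff of 2, while "QQQ" pairs only with itself.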
Details
The package offers several similarity join algorithms, all of which can be accessed through the similarityJoin function.
The software was originally developed for edit similarity joins of short amino-acid/nucleotide sequences from Adaptive Immune Repertoires, where the number of words is relatively large (10^5-10^6) and the average word length is relatively small (10-100).
The algorithms work with any alphabet and any list of words; however, larger lists or longer words can lead to memory issues.
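For a sense of scale at these list sizes, the following base R sketch estimates the memory needed by a dense all-pairs distance matrix, which is what a brute-force comparison would build. The helper `dense_matrix_gib` is hypothetical, and the 8 bytes per cell is an assumption (one R double per entry). The figure describes the brute-force matrix only and says nothing about the memory use of RPatternJoin itself.

```r
# Hypothetical helper: size in GiB of a dense n x n matrix,
# assuming 8 bytes per cell (an R double).
dense_matrix_gib <- function(n, bytes_per_cell = 8) {
  n^2 * bytes_per_cell / 2^30
}

dense_matrix_gib(5000)  # roughly 0.19 GiB
dense_matrix_gib(1e5)   # roughly 74.5 GiB
```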
Author(s)
Daniil Matveev <dmatveev@sfsu.edu>
See Also
similarityJoin, edit_dist1_example
Examples
library(RPatternJoin)
## Small example
similarityJoin(c("ABC", "AX", "QQQ"), 2, "Hamming", output_format = "adj_pairs")
#      [,1] [,2]
# [1,]    1    1
# [2,]    1    2
# [3,]    2    1
# [4,]    2    2
# [5,]    3    3
## Larger example
# The `edit_dist1_example` function generates a random list
# of `num_strings` strings with an average string length of `avg_len`.
strings <- edit_dist1_example(avg_len = 25, num_strings = 5000)
# First, let's do it with the `stringdist` package.
library(stringdist)
unname(system.time({
  which(stringdist::stringdistmatrix(strings, strings, "lv") <= 1, arr.ind = TRUE)
})["elapsed"])
# Runtime on a macOS machine with a 2.2 GHz i7 processor and 16 GB of DDR4 RAM:
# [1] 63.773
# Now let's do it with the `similarityJoin` function.
unname(system.time({
  similarityJoin(strings, 1, "Levenshtein", output_format = "adj_pairs")
})["elapsed"])
# Runtime on the same machine:
# [1] 0.105
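
The brute-force comparison above can also be sketched with base R alone, which is handy for checking results on a small input. This is a minimal sketch using `utils::adist` (generalized Levenshtein distance); the word list and variable names are made up for illustration, and the approach builds the full distance matrix, so it is only practical for short lists.

```r
# Illustrative word list (not from the package).
words <- c("ABC", "ABD", "AB", "QQQ")

# Full pairwise Levenshtein distance matrix from base R.
d <- utils::adist(words)

# Index pairs (i, j) with distance <= 1, including each word with itself,
# sorted by first and then second index.
pairs <- which(d <= 1, arr.ind = TRUE)
pairs <- unname(pairs[order(pairs[, 1], pairs[, 2]), , drop = FALSE])
pairs
```

Here the first three words are all within distance 1 of one another, while "QQQ" matches only itself.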