similarityJoin {RPatternJoin} | R Documentation |
Build Adjacency Matrix
Description
Build Adjacency Matrix
Usage
similarityJoin(
strings,
cutoff,
metric,
method = "partition_pattern",
drop_deg_one = FALSE,
special_chars = TRUE,
output_format = "adj_matrix"
)
Arguments
strings |
Input vector of strings.
To avoid hidden errors, the function will give a warning if strings contain characters not in the English alphabet.
To disable this warning, change |
cutoff |
Cutoff: |
metric |
Edit distance type: |
method |
Method: |
drop_deg_one |
Drop isolated strings: |
special_chars |
Enable check for special characters in strings: |
output_format |
Output format: |
Value
If output_format = adj_pairs
- 2-column matrix where
each row is a pair of indices of strings with an edit distance \leq
cutoff
.
If output_format = adj_matrix
- the same output is presented as a sparse adjacency matrix with corresponding strings and
their indices in the original vector are stored in dimnames of the adjacency matrix.
I.e. (adj_matrix[i, j]=1
) \Leftrightarrow
distance between dimnames(adj_matrix)[[1]][i]
and dimnames(adj_matrix)[[1]][i]
is \leq
cutoff
.
If drop_deg_one
is FALSE
, then dimnames(adj_matrix)[[1]] = strings
and dimnames(adj_matrix)[[2]]=1:length(strings)
.
Otherwise, dimnames(adj_matrix)[[1]] = strings
without isolated strings and dimnames(adj_matrix)[[2]]
=original indices
of strings in dimnames(adj_matrix)[[1]]
(original = index in input strings
vector).
See Also
Examples
library(RPatternJoin)
library(Matrix)
## Example 1
# Consider the following example with small similar words:
strings <- c("cat", "ecast", "bat", "cats", "chat")
# Let's find all pairs s.t. strings can be modified
# to each other with at most 2 substitutions.
# For this we choose our metric to be Hamming distance and cutoff to be 2.
metric <- "Hamming"
cutoff <- 2
# By default we use 'partition_pattern' method
# since it is the most memory efficient.
method <- "partition_pattern"
# Let's output the result as an adjacency matrix.
output_format <- "adj_matrix"
drop_deg_one <- TRUE
similarityJoin(
strings, cutoff, metric,
method = method, drop_deg_one = drop_deg_one)
# 3 x 3 sparse Matrix of class "dgCMatrix"
# cat bat cats
# 1 1 1 1
# 3 1 1 1
# 4 1 1 1
## Example 2
# On the same strings, let's calculate pairs of strings with edit distance \eqn{\leq} 1.
cutoff <- 1
metric <- "Levenshtein"
# Let's output the result as an adjacency matrix, but drop strings without any connections.
drop_deg_one <- FALSE
similarityJoin(
strings, cutoff, metric,
method = method, drop_deg_one = drop_deg_one)
# cat ecast bat cats chat
# 1 1 . 1 1 1
# 2 . 1 . . .
# 3 1 . 1 . .
# 4 1 . . 1 .
# 5 1 . . . 1
## Example 3
# Now let's simulate a larger example.
# The `edit_dist1_example` function generate a random list
# of `num_strings` strings with the average string length=`avg_len`.
strings <- edit_dist1_example(avg_len = 25, num_strings = 5000)
# Firstly let's do it with `stringdist` package.
library(stringdist)
system.time({
which(stringdist::stringdistmatrix(strings, strings, "lv") <= 1, arr.ind = TRUE)
})["elapsed"]
# Runtime on macOS machine with 2.2 GHz i7 processor and 16GB of DDR4 RAM:
# elapsed
# 63.773
# Now let's do it with similarityJoin function.
system.time({
similarityJoin(strings, 1, "Levenshtein", output_format = "adj_pairs")
})["elapsed"]
# Runtime on the same machine:
# elapsed
# 0.105