similarityJoin {RPatternJoin}R Documentation

Build Adjacency Matrix

Description

Build Adjacency Matrix

Usage

similarityJoin(
  strings,
  cutoff,
  metric,
  method = "partition_pattern",
  drop_deg_one = FALSE,
  special_chars = TRUE,
  output_format = "adj_matrix"
)

Arguments

strings

Input vector of strings. To avoid hidden errors, the function will give a warning if strings contain characters not in the English alphabet. To disable this warning, change special_chars to FALSE.

cutoff

Cutoff: 0,1,2. The function will search all pairs of strings with edit distance less than or equal to the cutoff.

metric

Edit distance type: Hamming, Levenshtein.

method

Method: partition_pattern, semi_pattern, pattern. This parameter determines what algorithm will be used for similarity join. Methods will differ in time and space complexity, but produce the same output. By default, we recommend using partition_pattern, since it is the most memory efficient.

drop_deg_one

Drop isolated strings: TRUE, FALSE. Works only for output_format=adj_matrix. The default is FALSE.

special_chars

Enable check for special characters in strings: TRUE, FALSE. The default is TRUE.

output_format

Output format: adj_matrix, adj_pairs. The default is adj_matrix.

Value

If output_format = adj_pairs - 2-column matrix where each row is a pair of indices of strings with an edit distance \leq cutoff.
If output_format = adj_matrix - the same output is presented as a sparse adjacency matrix with corresponding strings and their indices in the original vector are stored in dimnames of the adjacency matrix.
I.e. (adj_matrix[i, j]=1) \Leftrightarrow distance between dimnames(adj_matrix)[[1]][i] and dimnames(adj_matrix)[[1]][i] is \leq cutoff.
If drop_deg_one is FALSE, then dimnames(adj_matrix)[[1]] = strings and dimnames(adj_matrix)[[2]]=1:length(strings).
Otherwise, dimnames(adj_matrix)[[1]] = strings without isolated strings and dimnames(adj_matrix)[[2]]=original indices of strings in dimnames(adj_matrix)[[1]] (original = index in input strings vector).

See Also

edit_dist1_example

Examples

library(RPatternJoin)
library(Matrix)

## Example 1
# Consider the following example with small similar words:
strings <- c("cat", "ecast", "bat", "cats", "chat")
# Let's find all pairs s.t. strings can be modified
# to each other with at most 2 substitutions.
# For this we choose our metric to be Hamming distance and cutoff to be 2.
metric <- "Hamming"
cutoff <- 2
# By default we use 'partition_pattern' method
# since it is the most memory efficient.
method <- "partition_pattern"
# Let's output the result as an adjacency matrix.
output_format <- "adj_matrix"
drop_deg_one <- TRUE

similarityJoin(
  strings, cutoff, metric,
  method = method, drop_deg_one = drop_deg_one)
# 3 x 3 sparse Matrix of class "dgCMatrix"
#   cat bat cats
# 1   1   1    1
# 3   1   1    1
# 4   1   1    1


## Example 2
# On the same strings, let's calculate pairs of strings with edit distance \eqn{\leq} 1.
cutoff <- 1
metric <- "Levenshtein"
# Let's output the result as an adjacency matrix, but drop strings without any connections.
drop_deg_one <- FALSE

similarityJoin(
  strings, cutoff, metric,
  method = method, drop_deg_one = drop_deg_one)
#   cat ecast bat cats chat
# 1   1     .   1    1    1
# 2   .     1   .    .    .
# 3   1     .   1    .    .
# 4   1     .   .    1    .
# 5   1     .   .    .    1


## Example 3
# Now let's simulate a larger example.

# The `edit_dist1_example` function generate a random list
# of `num_strings` strings with the average string length=`avg_len`.
strings <- edit_dist1_example(avg_len = 25, num_strings = 5000)

# Firstly let's do it with `stringdist` package.

library(stringdist)
system.time({
  which(stringdist::stringdistmatrix(strings, strings, "lv") <= 1, arr.ind = TRUE)
})["elapsed"]
# Runtime on macOS machine with 2.2 GHz i7 processor and 16GB of DDR4 RAM:
# elapsed
# 63.773


# Now let's do it with similarityJoin function.
system.time({
  similarityJoin(strings, 1, "Levenshtein", output_format = "adj_pairs")
})["elapsed"]
# Runtime on the same machine:
# elapsed
# 0.105

[Package RPatternJoin version 1.0.0 Index]