brute_force_knn {rnndescent} | R Documentation |
Find exact nearest neighbors by brute force
Description
Returns the exact nearest neighbors of a dataset. A brute force search is
carried out: all possible pairs of points are compared, and the nearest
neighbors are returned.
Usage
brute_force_knn(
data,
k,
metric = "euclidean",
use_alt_metric = TRUE,
n_threads = 0,
verbose = FALSE,
obs = "R"
)
Arguments
data |
Matrix of n items to generate neighbors for, with observations
in the rows and features in the columns. Optionally, input can be passed
with observations in the columns, by setting obs = "C" , which should be
more efficient. Possible formats are base::data.frame() , base::matrix()
or Matrix::sparseMatrix() . Sparse matrices should be in dgCMatrix
format. Dataframes will be converted to numerical matrix format
internally, so if your data columns are logical and intended to be used
with the specialized binary metric s, you should convert it to a logical
matrix first (otherwise you will get the slower dense numerical version).
|
k |
Number of nearest neighbors to return.
|
metric |
Type of distance calculation to use. One of:
-
"braycurtis"
-
"canberra"
-
"chebyshev"
-
"correlation" (1 minus the Pearson correlation)
-
"cosine"
-
"dice"
-
"euclidean"
-
"hamming"
-
"hellinger"
-
"jaccard"
-
"jensenshannon"
-
"kulsinski"
-
"sqeuclidean" (squared Euclidean)
-
"manhattan"
-
"rogerstanimoto"
-
"russellrao"
-
"sokalmichener"
-
"sokalsneath"
-
"spearmanr" (1 minus the Spearman rank correlation)
-
"symmetrickl" (symmetric Kullback-Leibler divergence)
-
"tsss" (Triangle Area Similarity-Sector Area Similarity or TS-SS
metric)
-
"yule"
For non-sparse data, the following variants are available with
preprocessing: this trades memory for a potential speed up during the
distance calculation. Some minor numerical differences should be expected
compared to the non-preprocessed versions:
For non-sparse binary data passed as a logical matrix, the following
metrics have specialized variants which should be substantially faster than
the non-binary variants (in other cases the logical data will be treated as
a dense numeric vector of 0s and 1s):
-
"dice"
-
"hamming"
-
"jaccard"
-
"kulsinski"
-
"matching"
-
"rogerstanimoto"
-
"russellrao"
-
"sokalmichener"
-
"sokalsneath"
-
"yule"
|
use_alt_metric |
If TRUE , use faster metrics that maintain the
ordering of distances internally (e.g. squared Euclidean distances if using
metric = "euclidean" ), then apply a correction at the end. Probably
the only reason to set this to FALSE is if you suspect that some
sort of numeric issue is occurring with your data in the alternative code
path.
|
n_threads |
Number of threads to use.
|
verbose |
If TRUE , log information to the console.
|
obs |
set to "C" to indicate that the input data orientation stores
each observation as a column. The default "R" means that observations are
stored in each row. Storing the data by row is usually more convenient, but
internally your data will be converted to column storage. Passing it
already column-oriented will save some memory and (a small amount of) CPU
usage.
|
Details
This method is accurate but scales poorly with dataset size, so use with
caution with larger datasets. Having the exact neighbors as a ground truth to
compare with approximate results is useful for benchmarking and determining
parameter settings of the approximate methods.
Value
the nearest neighbor graph as a list containing:
Examples
# Find the 4 nearest neighbors using Euclidean distance
# If you pass a data frame, non-numeric columns are removed
iris_nn <- brute_force_knn(iris, k = 4, metric = "euclidean")
# Manhattan (l1) distance
iris_nn <- brute_force_knn(iris, k = 4, metric = "manhattan")
# Multi-threading: you can choose the number of threads to use: in real
# usage, you will want to set n_threads to at least 2
iris_nn <- brute_force_knn(iris, k = 4, metric = "manhattan", n_threads = 1)
# Use verbose flag to see information about progress
iris_nn <- brute_force_knn(iris, k = 4, metric = "euclidean", verbose = TRUE)
[Package
rnndescent version 0.1.6
Index]