multiplet {twinning} | R Documentation |
multiplet()
extends twin()
to partition datasets into multiple statistically similar disjoint sets, termed as multiplets, under the three different strategies described in Vakayil and Joseph (2022).
multiplet(data, k, strategy = 1, format_data = TRUE, leaf_size = 8)
data |
The dataset including both the predictors and response(s); should not contain missing values, and only numeric and/or factor column(s) are allowed. |
k |
The desired number of multiplets. |
strategy |
An integer either 1, 2, or 3 referring to the three strategies for generating multiplets. Strategy 2 perfroms best, but requires |
format_data |
If set to |
leaf_size |
Maximum number of elements in the leaf-nodes of the kd-tree. |
List with the multiplet id, ranging from 1 to k
, for each row in data
.
Vakayil, A., & Joseph, V. R. (2022). Data Twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, to appear. arXiv preprint arXiv:2110.02927.
Blanco, J. L. & Rai, P. K. (2014). nanoflann: a C++ header-only fork of FLANN, a library for nearest neighbor (NN) with kd-trees. https://github.com/jlblancoc/nanoflann.
## 1. Generating 10 multiplets of a numeric dataset
X = rnorm(n=100, mean=0, sd=1)
Y = rnorm(n=100, mean=X^2, sd=1)
data = cbind(X, Y)
multiplet_idx = multiplet(data, k=10)
multiplet_1 = data[which(multiplet_idx == 1), ]
multiplet_10 = data[which(multiplet_idx == 10), ]
## 2. Generating 4 multiplets of the iris dataset using strategy 2
multiplet_idx = multiplet(iris, k=4, strategy=2)
multiplet_1 = iris[which(multiplet_idx == 1), ]
multiplet_4 = iris[which(multiplet_idx == 4), ]