SMART {FACT} | R Documentation |
SMART - Scoring Metric after Permutation
SMART estimates the importance of a feature to the clustering algorithm by measuring, with a scoring function, how cluster assignments change after the selected feature is permuted. Cluster-specific SMART indicates the importance of a feature for a specific cluster versus the remaining ones, measured by a binary scoring metric. Global SMART assigns importance scores across all clusters, measured by a multi-class scoring metric. Currently, SMART can only be used with hard label predictors.
Let M \in \mathbb{N}_0^{k \times k} denote the multi-cluster confusion matrix and M_c \in \mathbb{N}_0^{2 \times 2} the binary confusion matrix for cluster c versus the remaining clusters. SMART for feature set S corresponds to:
\text{Multi-cluster scoring:} \quad \text{SMART}(X, \tilde{X}_S) = h_{\text{multi}}(M) \\
\text{Binary scoring:} \quad \text{SMART}(X, \tilde{X}_S) = \text{AVE}(h_{\text{binary}}(M_1), \dots, h_{\text{binary}}(M_k))
where \text{AVE} averages a vector of binary scores, e.g., via micro or macro averaging.
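The binary scoring path can be illustrated with a minimal base R sketch. It is not the package's implementation: the helper names `binary_confusion`, `f1_from_counts` and `smart_f1` are hypothetical, the F1 score stands in for h_binary, and micro averaging is assumed to pool the confusion counts over all clusters before scoring.

```r
# Illustrative sketch only; all helper names are hypothetical.

# Counts of the 2 x 2 confusion matrix M_c for one cluster versus the rest,
# comparing assignments before and after permuting a feature.
binary_confusion <- function(before, after, cluster) {
  truth <- before == cluster
  pred <- after == cluster
  c(tp = sum(truth & pred), fp = sum(!truth & pred),
    fn = sum(truth & !pred), tn = sum(!truth & !pred))
}

# F1 score computed from the confusion counts (stands in for h_binary).
f1_from_counts <- function(m) {
  denom <- 2 * m[["tp"]] + m[["fp"]] + m[["fn"]]
  if (denom == 0) return(NA_real_)
  2 * m[["tp"]] / denom
}

# avg = NULL returns one score per cluster; "macro" averages the per-cluster
# scores; "micro" pools the counts over clusters before scoring (assumption).
smart_f1 <- function(before, after, avg = NULL) {
  clusters <- sort(unique(before))
  counts <- lapply(clusters, function(cl) binary_confusion(before, after, cl))
  names(counts) <- clusters
  if (is.null(avg)) {
    return(vapply(counts, f1_from_counts, numeric(1)))
  }
  if (avg == "macro") {
    return(mean(vapply(counts, f1_from_counts, numeric(1)), na.rm = TRUE))
  }
  if (avg == "micro") {
    return(f1_from_counts(Reduce(`+`, counts)))
  }
  stop("avg must be NULL, \"micro\" or \"macro\"")
}

# Toy assignments: two of eight points change cluster after permutation.
before <- c(1, 1, 1, 1, 2, 2, 3, 3)
after <- c(1, 1, 1, 2, 2, 2, 3, 1)

cluster_wise <- smart_f1(before, after)              # 0.75, 0.8, 0.667
macro_score <- smart_f1(before, after, avg = "macro")
micro_score <- smart_f1(before, after, avg = "micro") # 0.75
```

In this toy case the micro score equals the share of unchanged assignments (6 of 8), while the macro score weights the small third cluster as heavily as the large first one.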
To reduce the variance that shuffling introduces into the estimate, one can shuffle t times and evaluate the distribution of scores. Let \tilde{X}_S^{(t)} denote the t-th shuffling iteration for feature set S. The SMART point estimate is given by:
\overline{\text{SMART}}(X, \tilde{X}_S) = \psi\left(\text{SMART}(X, \tilde{X}_S^{(1)}), \dots, \text{SMART}(X, \tilde{X}_S^{(t)})\right)
where \psi extracts a sample statistic such as the mean, the median, or a quantile.
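The repeated shuffling can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: it uses `kmeans` on the built-in `iris` data, a hypothetical nearest-centroid function `predict_cluster` as the hard label predictor, accuracy (the share of unchanged assignments) as the score, and the 5%, 50% and 95% quantiles as \psi.

```r
# Illustrative sketch only; predict_cluster and score_once are hypothetical.
set.seed(1)
X <- iris[, 1:4]
fit <- kmeans(X, centers = 3, nstart = 10)

# Hard label predictor: assign each row to the nearest fitted centroid.
predict_cluster <- function(centers, newdata) {
  d <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(as.matrix(newdata), 2, centers[k, ])^2)
  })
  max.col(-d, ties.method = "first")
}

before <- predict_cluster(fit$centers, X)

# One shuffling iteration: permute a single feature and score how many
# cluster assignments stay the same (accuracy as the similarity score).
score_once <- function(feature) {
  X_perm <- X
  X_perm[[feature]] <- sample(X_perm[[feature]])
  after <- predict_cluster(fit$centers, X_perm)
  mean(before == after)
}

# Repeat the shuffling and summarize the score distribution per feature.
n_repetitions <- 20
scores <- sapply(names(X), function(f) replicate(n_repetitions, score_once(f)))
summary_stats <- t(apply(scores, 2, quantile, probs = c(0.05, 0.5, 0.95)))
summary_stats
```

A lower median similarity means that permuting the feature changes more assignments, so the feature matters more to the clustering.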
avg
(character(1) or NULL)
If NULL, cluster-specific (binary) metrics are calculated. "micro" summarizes binary scores to a global score that treats each instance in the data set with equal importance. "macro" summarizes binary scores to a global score that treats each cluster with equal importance.
metric
character(1)
The binary similarity metric used.
predictor
ClustPredictor
The object (created with ClustPredictor$new()) holding the cluster algorithm and the data.
data.sample
data.frame
The data, including features and soft/hard cluster labels.
sampler
any
Sampler from the predictor object.
features
(character or list)
Features or feature sets for which importance scores are calculated.
n.repetitions
(numeric(1))
How often is the shuffling of the feature repeated?
results
(data.table)
A data.table containing the results from the SMART procedure.
new()
Create a SMART object
SMART$new(predictor, features = NULL, metric = "f1", avg = NULL, n.repetitions = 5)
predictor
ClustPredictor
The object (created with ClustPredictor$new()) holding the cluster algorithm and the data.
features
(character or list)
The features for which importance scores are calculated. The default value of NULL implies all features. Use a named list of character vectors to define groups of features for which joint importance will be calculated.
metric
character(1)
The binary similarity metric used. Defaults to "f1", where the F1 score is used. Other possible binary scores are "precision", "recall", "jaccard", "folkes_mallows" and "accuracy".
avg
(character(1) or NULL)
Either NULL, "micro" or "macro". The default NULL calculates cluster-specific (binary) metrics. "micro" summarizes binary scores to a global score that treats each instance in the data set with equal importance. "macro" summarizes binary scores to a global score that treats each cluster with equal importance. For unbalanced clusters, "macro" is recommended.
n.repetitions
(numeric(1))
How often should the shuffling of the feature be repeated? The higher the number of repetitions, the more stable and accurate the results become.
(data.frame)
data.frame with the results of the feature importance computation. One row per feature with the following columns:
For global scores:
importance.05 (5% quantile of importance values from the repetitions)
importance (median importance)
importance.95 (95% quantile)
permutation.error (median error over all repetitions)
For cluster-specific scores, each column refers to a different cluster.
print()
Print a SMART object
SMART$print()
character
Information about predictor, data, metric, and avg, and the head of the results.
plot()
Plots the similarity score results of a SMART object.
SMART$plot(log = FALSE, single_cl = NULL)
log
logical(1)
Indicates whether results should be log-transformed. This can be useful to distinguish the importance if similarity scores are all close to 1.
single_cl
character(1)
Only used for cluster-specific scores (avg = NULL). Should match one of the cluster names. In this case, importance scores for a single cluster are plotted.
The plot shows the similarity per feature.
For global scores:
When n.repetitions in SMART$new was larger than 1, multiple similarity estimates are obtained per feature. The similarities are aggregated and the plot shows the median similarity per feature (as dots) and also the 90%-quantile, which helps to understand how much variance the computation has per feature.
For cluster-specific scores:
Stacks the similarity estimates of all clusters per feature. Can be used to obtain a global estimate as a sum of cluster-wise similarities.
ggplot2 plot object
clone()
The objects of this class are cloneable with this method.
SMART$clone(deep = FALSE)
deep
Whether to make a deep clone.
# load data and packages
require(FACT)
require(factoextra)
require(FuzzyDBScan)
multishapes = as.data.frame(multishapes[, 1:2])
# Set up and train FuzzyDBScan
eps = c(0, 0.2)
pts = c(3, 15)
res = FuzzyDBScan$new(multishapes, eps, pts)
res$plot("x", "y")
# create hard label predictor
predict_part = function(model, newdata) model$predict(new_data = newdata, cmatrix = FALSE)$cluster
predictor = ClustPredictor$new(res, as.data.frame(multishapes), y = res$clusters,
predict.function = predict_part, type = "partition")
# Run SMART globally
macro_f1 = SMART$new(predictor, n.repetitions = 50, metric = "f1", avg = "macro")
macro_f1 # print global SMART
macro_f1$plot(log = TRUE) # plot global SMART
# Run cluster-specific SMART
classwise_f1 = SMART$new(predictor, n.repetitions = 50, metric = "f1")
classwise_f1 # print cluster-specific SMART
classwise_f1$plot(log = TRUE) # plot cluster-specific SMART