nomclust {nomclust} | R Documentation |
The nomclust() function runs hierarchical cluster analysis (HCA) on objects characterized by nominal (categorical) variables. It covers the whole clustering process, from the proximity matrix calculation to the evaluation of clustering quality.
The function contains twelve similarity measures for nominal data summarized by Boriah et al. (2008) and by Sulc and Rezankova (2019). It offers three linkage methods that can be used for categorical data. The obtained clusters can be evaluated by seven evaluation criteria; see Sulc et al. (2018). The output of the nomclust() function may serve as an input for visualization functions in the nomclust package.
nomclust( data, measure = "lin", method = "average", clu.high = 6, eval = TRUE, prox = 100, opt = TRUE )
data |
A data.frame or a matrix with cases in rows and variables in columns. |
measure |
A character string defining the similarity measure used for computation of proximity matrix in HCA:
|
method |
A character string defining the clustering method. The following methods can be used: |
clu.high |
A numeric value giving the maximum number of clusters for which cluster membership variables are produced. |
eval |
A logical value; if TRUE, evaluation of the clustering results is performed. |
prox |
A logical value or a numeric value. A logical TRUE indicates that the proximity matrix is included in the output. A numeric (integer) value gives the maximum number of cases in a dataset for which the proximity matrix is included in the output. |
opt |
A logical value; if TRUE, the time optimization method is run to substantially decrease the computation time of the dissimilarity matrix calculation. The time optimization method cannot be run if the proximity matrix is to be produced. In such a case, this parameter is automatically set to FALSE. |
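The interplay of the prox and opt arguments described above can be sketched in base R. The helper resolve_prox_opt() and its n_cases argument are hypothetical and not part of the nomclust package; the sketch only restates the documented rules and assumes that a dataset with exactly prox cases still gets its proximity matrix.

```r
# Hypothetical helper, NOT part of nomclust: it only mirrors the rules
# stated in the argument descriptions above.
resolve_prox_opt <- function(n_cases, prox = 100, opt = TRUE) {
  if (is.logical(prox)) {
    # logical prox: TRUE keeps the proximity matrix, FALSE drops it
    keep_prox <- isTRUE(prox)
  } else {
    # numeric prox: keep the matrix only for datasets up to that size
    # (assumption: the boundary value itself is included)
    keep_prox <- n_cases <= prox
  }
  # documented rule: the time optimization is switched off
  # whenever the proximity matrix is to be produced
  use_opt <- isTRUE(opt) && !keep_prox
  list(prox = keep_prox, opt = use_opt)
}

# 20 cases with the defaults: matrix is produced, optimization is off
resolve_prox_opt(n_cases = 20)

# 150 cases with the defaults: no matrix, optimization stays on
resolve_prox_opt(n_cases = 150)
```

Under this reading, requesting opt = TRUE on a small dataset has no effect unless prox is set to FALSE or to a value below the number of cases.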
The function returns a list with up to five components.
The mem component contains cluster membership partitions for the selected numbers of clusters in the form of a list.
The eval component contains seven evaluation criteria as vectors in a list: the within-cluster mutability coefficient (WCM), the within-cluster entropy coefficient (WCE), the pseudo F indices based on the mutability (PSFM) and the entropy (PSFE), the Bayesian (BIC) and Akaike (AIC) information criteria for categorical data, and the BK index. To see them all at once, converting the list to a data.frame is more convenient.
The opt component is present in the output together with the eval component. It displays the optimal number of clusters for the evaluation criteria from the eval component, except for WCM and WCE, where the optimal number of clusters is based on the elbow method.
The prox component contains the dissimilarity matrix in the form of a "dist" object.
The dend component can be found in the output only together with the prox component. It contains all the information necessary for dendrogram creation.
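To make the shape of the mem and prox components more concrete, the following base R sketch builds analogous objects by hand. It does not call nomclust() and is not the package's implementation: the toy data are made up, the simple matching dissimilarity is used as a stand-in measure, and the names toy, mem, mem_df and the "clu_" prefix are assumptions chosen for illustration.

```r
# Toy nominal data; the values are made up for illustration only.
toy <- data.frame(
  colour = c("red", "red", "blue", "blue", "green", "green"),
  shape  = c("round", "round", "square", "square", "round", "square"),
  size   = c("S", "S", "L", "L", "M", "L")
)

# Simple matching dissimilarity: the share of variables on which
# two cases differ (a stand-in for the package's measures).
m <- as.matrix(toy)
n <- nrow(m)
d <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- mean(m[i, ] != m[j, ])
  }
}

# A "dist" object, analogous in form to the prox component.
prox <- as.dist(d)

# Average-linkage hierarchical clustering on the dissimilarities.
tree <- hclust(prox, method = "average")

# Membership vectors for 2 up to clu.high clusters, stored as a list
# (analogous in form to the mem component).
clu.high <- 3
mem <- lapply(2:clu.high, function(k) cutree(tree, k = k))
names(mem) <- paste0("clu_", 2:clu.high)

# The same memberships as a data.frame, one column per cluster solution.
mem_df <- as.data.frame(mem)
mem_df
```

Keeping the partitions in a list makes it easy to pick one solution, while the data.frame form is handier for binding the memberships to the original cases.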
Zdenek Sulc.
Contact: zdenek.sulc@vse.cz
Boriah S., Chandola V. and Kumar V. (2008). Similarity measures for categorical data: A comparative evaluation. In: Proceedings of the 8th SIAM International Conference on Data Mining, SIAM, p. 243-254.
Sulc Z., Cibulkova J., Prochazka J. and Rezankova H. (2018). Internal Evaluation Criteria for Categorical Data in Hierarchical Clustering: Optimal Number of Clusters Determination. Metodoloski Zveski, 15(2), p. 1-20.
Sulc Z. and Rezankova H. (2019). Comparison of Similarity Measures for Categorical Data in Hierarchical Clustering. Journal of Classification, 35(1), p. 58-72. DOI: 10.1007/s00357-019-09317-5.
evalclust, nomprox, eval.plot, dend.plot.
# sample data
data(data20)

# creating an object with results of hierarchical clustering
hca.object <- nomclust(data20, measure = "lin", method = "average",
                       clu.high = 5, prox = TRUE, opt = FALSE)

# obtaining values of evaluation indices
data20.eval <- hca.object$eval

# getting the optimal numbers of clusters
data20.opt <- hca.object$opt

# extracting cluster membership variables
data20.mem <- hca.object$mem

# extracting cluster membership variables as a data frame
data20.mem <- as.data.frame(hca.object$mem)

# obtaining a proximity matrix
data20.prox <- as.matrix(hca.object$prox)

# setting the maximal number of objects for which a proximity matrix
# is provided in the output to 30
hca.object <- nomclust(data20, measure = "lin", method = "average",
                       clu.high = 5, prox = 30, opt = FALSE)

# generating a larger dataset containing repeatedly occurring objects
set.seed(150)
sample150 <- sample(1:nrow(data20), 150, replace = TRUE)
data150 <- data20[sample150, ]

# running hierarchical clustering WITH the time optimization
start <- Sys.time()
hca.object.opt.T <- nomclust(data150, measure = "lin", opt = TRUE)
end <- Sys.time()
end - start

# running hierarchical clustering WITHOUT the time optimization
start <- Sys.time()
hca.object.opt.F <- nomclust(data150, measure = "lin", opt = FALSE)
end <- Sys.time()
end - start