rf.clustering {PDtoolkit} | R Documentation |
Risk factor clustering
Description
rf.clustering
implements correlation-based clustering of risk factors.
The clustering procedure is based on hclust from the stats
package.
Usage
rf.clustering(db, metric, k = NA)
Arguments
db |
Data frame of risk factors supplied for clustering analysis.
|
metric |
Correlation metric used for distance calculation. Available options are:
- "raw pearson": distance calculated as as.dist(1 - cor(db, method = "pearson"));
- "raw spearman": distance calculated as as.dist(1 - cor(db, method = "spearman"));
- "common pearson": distance calculated as as.dist((1 - cor(db, method = "pearson")) / 2);
- "common spearman": distance calculated as as.dist((1 - cor(db, method = "spearman")) / 2);
- "absolute pearson": distance calculated as as.dist(1 - abs(cor(db, method = "pearson")));
- "absolute spearman": distance calculated as as.dist(1 - abs(cor(db, method = "spearman")));
- "sqrt pearson": distance calculated as as.dist(sqrt(1 - cor(db, method = "pearson")));
- "sqrt spearman": distance calculated as as.dist(sqrt(1 - cor(db, method = "spearman")));
- "x2y": distance calculated as as.dist(1 - dx2y(d = db)[[2]]).
The x2y metric was proposed by Professor Rama Ramakrishnan, and details can be found at this
link.
This metric is especially handy if the analyst wants to perform clustering before any binning procedure in order to decrease the number of risk factors. Additionally,
the x2y algorithm processes numerical and categorical risk factors at once and is able to identify
non-linear relationships between pairs of risk factors. The x2y metric is not symmetric with respect to its inputs (x, y),
therefore the arithmetic average of the xy and yx values is used to produce the final value for each pair.
|
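The distance definitions listed above can be reproduced with base R alone. The following is a minimal sketch, not the package's internal code: it assumes a data frame of numeric columns (mtcars is used purely as a stand-in for risk factors), relies on the default linkage of hclust (the linkage used by rf.clustering is not stated here), and uses a made-up score matrix to illustrate the averaging of xy and yx values described for x2y.

```r
# Illustrative stand-in for a data frame of numeric risk factors (assumption)
db <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

rho <- cor(db, method = "spearman")

# Distance variants corresponding to the metric options
d.raw      <- as.dist(1 - rho)        # values in [0, 2]
d.common   <- as.dist((1 - rho) / 2)  # values in [0, 1]
d.absolute <- as.dist(1 - abs(rho))   # values in [0, 1], sign of correlation ignored
d.sqrt     <- as.dist(sqrt(1 - rho))  # values in [0, sqrt(2)]

# Hierarchical clustering on one of the distances, cut into 2 clusters.
# hclust() is called with its default linkage here (assumption).
hc <- hclust(d.common)
clusters <- cutree(hc, k = 2)
clusters

# Averaging an asymmetric pairwise score (made-up values) so that it can be
# turned into a distance, as described for the x2y metric
s <- matrix(c(1.0, 0.6, 0.2,
              0.4, 1.0, 0.5,
              0.0, 0.3, 1.0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))
s.sym <- (s + t(s)) / 2
d.x2y <- as.dist(1 - s.sym)
```

The "common" and "absolute" variants both map into [0, 1], but they treat strong negative correlation differently: "common" places such pairs far apart, while "absolute" places them close together.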
k |
Number of clusters. If the default value (NA) is passed, an automatic elbow method
is used to determine the optimal number of clusters; otherwise, the selected number of clusters is used.
|
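The exact elbow criterion used when k = NA is not described here. As a rough illustration of how a number of clusters can be picked automatically from a hierarchical clustering, the sketch below uses a different, simpler heuristic: the largest gap between consecutive merge heights. The data (mtcars), the variable names and the heuristic itself are assumptions for illustration and may differ from what rf.clustering does internally.

```r
# Illustrative stand-in for a data frame of numeric risk factors (assumption)
db <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec", "drat")]
d  <- as.dist((1 - cor(db, method = "spearman")) / 2)
hc <- hclust(d)

# Merge heights, largest first: h[j] is the height at which
# j + 1 clusters are merged into j clusters
h <- rev(hc$height)

# Gap between consecutive merges; a large gap after h[j] suggests
# stopping at j + 1 clusters (largest-gap heuristic, not the package's method)
gaps   <- h[-length(h)] - h[-1]
k.auto <- which.max(gaps) + 1

k.auto
cutree(hc, k = k.auto)
```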
Value
The function rf.clustering
returns a data frame with the risk factors, their assigned clusters and the
distance to centroid (ordered from smallest to largest).
The last column (distance to centroid) can be used to select one or more risk factors per
cluster.
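Selecting more than one risk factor per cluster can be sketched with base R as follows. The data frame below is a hypothetical result: the column names clusters and dist.to.centroid follow the example in this page, while the name of the first column (rf) and all values are assumptions made for illustration.

```r
# Hypothetical output in the shape described above (values are made up)
cr <- data.frame(
  rf = c("rf1", "rf2", "rf3", "rf4", "rf5"),
  clusters = c(1, 1, 1, 2, 2),
  dist.to.centroid = c(0.10, 0.25, 0.40, 0.05, 0.30)
)

# Rank risk factors within each cluster by distance to centroid
cr$rank.in.cluster <- ave(cr$dist.to.centroid, cr$clusters,
                          FUN = function(x) rank(x, ties.method = "first"))

# Keep the two risk factors closest to the centroid in each cluster
top2 <- cr[cr$rank.in.cluster <= 2, ]
top2
```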
Examples
suppressMessages(library(PDtoolkit))
library(rpart)
data(loans)
# clustering using the common spearman metric
# first, categorize the numeric risk factors
num.rf <- sapply(loans, is.numeric)
num.rf <- names(num.rf)[!names(num.rf) %in% "Creditability" & num.rf]
loans[, num.rf] <- sapply(num.rf, function(x)
  sts.bin(x = loans[, x], y = loans[, "Creditability"])[[2]])
# replace categories with WoE values so that all risk factors are numeric
loans.woe <- replace.woe(db = loans, target = "Creditability")[[1]]
cr <- rf.clustering(db = loans.woe[, -which(names(loans.woe) %in% "Creditability")],
                    metric = "common spearman",
                    k = NA)
cr
# select one risk factor per cluster with the minimum distance to centroid
# (dplyr provides %>%, group_by() and slice(); loaded explicitly in case it is not already attached)
library(dplyr)
cr %>% group_by(clusters) %>%
  slice(which.min(dist.to.centroid))
[Package PDtoolkit version 1.2.0 Index]