JANE {JANE} | R Documentation |
Fit JANE
Description
Fit the latent space cluster model using an EM algorithm.
Usage
JANE(
A,
D = 2,
K = 2,
model,
initialization = "GNN",
case_control = FALSE,
DA_type = "none",
seed = NULL,
control = list()
)
Arguments
A |
A square matrix or sparse matrix of class 'dgCMatrix' representing the adjacency matrix of the unweighted network of interest. |
D |
Integer (scalar or vector) specifying the dimension of the latent space (default is 2). |
K |
Integer (scalar or vector) specifying the number of clusters to consider (default is 2). |
model |
A character string specifying the model to fit:
|
initialization |
A character string or a list to specify the initial values for the EM algorithm:
|
case_control |
A logical; if |
DA_type |
A character string to specify the type of deterministic annealing approach to use
|
seed |
(optional) An integer value to specify the seed for reproducibility. |
control |
A list of control parameters. See 'Details'. |
Details
If an unsymmetric adjacency matrix A is supplied for model %in% c('NDH', 'RS')
the user will be asked if they would like to proceed with converting A to a symmetric matrix (i.e., A <- 1.0 * ( (A + t(A)) > 0.0 )
).
control
:
The control
argument is a named list that the user can supply containing the following components:
verbose
A logical; if
TRUE
causes additional information to be printed out about the progress of the EM algorithm (default isFALSE
).max_its
An integer specifying the maximum number of iterations for the EM algorithm (default is
1e3
).min_its
An integer specifying the minimum number of iterations for the EM algorithm (default is
10
).priors
A list of prior hyperparameters (default is
NULL
). Seespecify_priors
on how to specify the hyperparameters.n_interior_knots
(only relevant for
model %in% c('RS', 'RSR')
) An integer specifying the number of interior knots used in fitting a natural cubic spline for degree heterogeneity models (default is5
).termination_rule
A character string to specify the termination rule to determine the convergence of the EM algorithm:
'prob_mat'
: uses change in the absolute difference in\hat{Z}
(i.e., theN \times K
cluster membership probability matrix) between subsequent iterations (default)'Q'
: uses change in the absolute difference in the objective function of the E-step evaluated using parameters from subsequent iterations'ARI'
: comparing the classifications between subsequent iterations using adjusted Rand index'NMI'
: comparing the classifications between subsequent iterations using normalized mutual information'CER'
: comparing the classifications between subsequent iterations using classification error rate
tolerance
A numeric specifying the tolerance used for
termination_rule %in% c('Q', 'prob_mat')
(default is1e-3
).tolerance_ARI
A numeric specifying the tolerance used for
termination_rule = 'ARI'
(default is0.999
).tolerance_NMI
A numeric specifying the tolerance used for
termination_rule = 'NMI'
(default is0.999
).tolerance_CER
A numeric specifying the tolerance used for
termination_rule = 'CER'
(default is0.01
).n_its_start_CA
An integer specifying what iteration to start computing cumulative averages (note: cumulative average of
U
, the latent position matrix, is not tracked whentermination_rule = 'Q'
) (default is20
).tolerance_diff_CA
A numeric specifying the tolerance used for the change in cumulative average of
termination_rule
metric andU
(note: cumulative average ofU
is not tracked whentermination_rule = 'Q'
) (default is1e-3
).consecutive_diff_CA
An integer specifying the tolerance for the number of consecutive instances where change in cumulative average is less than
tolerance_diff_CA
(default is5
).quantile_diff
A numeric in
[0,1]
specifying the quantile used in computing the change in the absolute difference ofZ
andU
between subsequent iterations (default is1
, i.e., max).beta_temp_schedule
A numeric vector specifying the temperature schedule for deterministic annealing (default is
1
, i.e., deterministic annealing not utilized).n_control
An integer specifying the fixed number of controls (i.e., non-links) sampled for each actor; only relevant when
case_control = TRUE
(default is100
whencase_control = TRUE
andNULL
whencase_control = FALSE
).n_start
An integer specifying the maximum number of starts for the EM algorithm (default is
5
).max_retry
An integer specifying the maximum number of re-attempts if starting values cause issues with EM algorithm (default is
5
).IC_selection
A character string to specify the information criteria used to select the optimal fit based on the combinations of
K
,D
, andn_start
considered:'BIC_logit'
: BIC computed from logistic regression component'BIC_mbc'
: BIC computed from model based clustering component'ICL_mbc'
: ICL computed from model based clustering component'Total_BIC'
: sum of'BIC_logit'
and'BIC_mbc'
'Total_ICL'
: sum of'BIC_logit'
and'ICL_mbc'
(default)
sd_random_U_GNN
(only relevant when
initialization = 'GNN'
) A positive numeric value specifying the standard deviation for the random draws from a normal distribution to initializeU
(default is1
).max_retry_GNN
(only relevant when
initialization = 'GNN'
) An integer specifying the maximum number of re-attempts for theGNN
approach before switching to random starting values (default is10
).n_its_GNN
(only relevant when
initialization = 'GNN'
) An integer specifying the maximum number of iterations for theGNN
approach (default is10
).downsampling_GNN
(only relevant when
initialization = 'GNN'
) A logical; ifTRUE
employs downsampling s.t. the number of links and non-links are balanced for theGNN
approach (default isTRUE
).
Running JANE
in parallel:
JANE
integrates the future and future.apply packages to fit the various combinations of K
, D
, and n_start
in parallel. The 'Examples' section below provides an example of how to run JANE
in parallel. See plan
and future.apply
for more details.
Choosing the number of clusters:
JANE
allows for the following model selection criteria to choose the number of clusters:
'BIC_logit'
: BIC computed from logistic regression component'BIC_mbc'
: BIC computed from model based clustering component'ICL_mbc'
: ICL (Biernacki et al. (2000)) computed from model based clustering component'Total_BIC'
: Sum of'BIC_logit'
and'BIC_mbc'
, this is the model selection criterion proposed by Handcock et al. (2007)'Total_ICL'
: (default) sum of'BIC_logit'
and'ICL_mbc'
, this criterion is similar to'Total_BIC'
, but uses ICL for the model based clustering component which tends to favor more well-separated clusters.
Warning: It is not certain whether it is appropriate to use the model selection criterion above to select D
.
Value
A list of S3 class
"JANE
" containing the following components:
input_params |
A list containing the input parameters for |
A |
The square sparse adjacency matrix of class 'dgCMatrix' used in fitting the latent space cluster model. This matrix can be different than the input A matrix as isolates are removed. |
IC_out |
A matrix containing the relevant information criteria for all combinations of |
all_convergence_ind |
A matrix containing the convergence information (i.e., 1 = converged, 0 = did not converge) and number of iterations for all combinations of |
optimal_res |
A list containing the estimated parameters of interest based on the optimal fit selected. It is recommended to use |
optimal_starting |
A list containing the starting parameters used in the EM algorithm that resulted in the optimal fit selected. It is recommended to use |
References
Biernacki, C., Celeux, G., Govaert, G., 2000. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 719–725.
Handcock, M.S., Raftery, A.E., Tantrum, J.M., 2007. Model-based clustering for social networks. Journal of the Royal Statistical Society Series A: Statistics in Society 170, 301–354.
Examples
# Simulate network
mus <- matrix(c(-1,-1,1,-1,1,1),
nrow = 3,
ncol = 2,
byrow = TRUE)
omegas <- array(c(diag(rep(7,2)),
diag(rep(7,2)),
diag(rep(7,2))),
dim = c(2,2,3))
p <- rep(1/3, 3)
beta0 <- 1.0
sim_data <- JANE::sim_A(N = 100L,
model = "NDH",
mus = mus,
omegas = omegas,
p = p,
beta0 = beta0,
remove_isolates = TRUE)
# Run JANE on simulated data
res <- JANE::JANE(A = sim_data$A,
D = 2L,
K = 3L,
initialization = "GNN",
model = "NDH",
case_control = FALSE,
DA_type = "none")
# Run JANE on simulated data - consider multiple D and K
res <- JANE::JANE(A = sim_data$A,
D = 2:5,
K = 2:10,
initialization = "GNN",
model = "NDH",
case_control = FALSE,
DA_type = "none")
# Run JANE on simulated data - parallel with 5 cores
future::plan(future::multisession, workers = 5)
res <- JANE::JANE(A = sim_data$A,
D = 2L,
K = 3L,
initialization = "GNN",
model = "NDH",
case_control = FALSE,
DA_type = "none")
future::plan(future::sequential)
# Run JANE on simulated data - case/control approach with 20 controls sampled for each actor
res <- JANE::JANE(A = sim_data$A,
D = 2L,
K = 3L,
initialization = "GNN",
model = "NDH",
case_control = TRUE,
DA_type = "none",
control = list(n_control = 20))
# Reproducibility
res1 <- JANE::JANE(A = sim_data$A,
D = 2L,
K = 3L,
initialization = "GNN",
seed = 1234,
model = "NDH",
case_control = FALSE,
DA_type = "none")
res2 <- JANE::JANE(A = sim_data$A,
D = 2L,
K = 3L,
initialization = "GNN",
seed = 1234,
model = "NDH",
case_control = FALSE,
DA_type = "none")
## Check if results match
all.equal(res1, res2)
# Another reproducibility example where the seed was not set.
# It is possible to replicate the results using the starting values due to
# the nature of EM algorithms
res3 <- JANE::JANE(A = sim_data$A,
D = 2L,
K = 3L,
initialization = "GNN",
model = "NDH",
case_control = FALSE,
DA_type = "none")
## Extract starting values
start_vals <- res3$optimal_start
## Run JANE using extracted starting values, no need to specify D and K
## below as function will determine those values from start_vals
res4 <- JANE::JANE(A = sim_data$A,
initialization = start_vals,
model = "NDH",
case_control = FALSE,
DA_type = "none")
## Check if optimal_res are identical
all.equal(res3$optimal_res, res4$optimal_res)