JANE {JANE}    R Documentation

Fit JANE

Description

Fit the latent space cluster model using an EM algorithm.

Usage

JANE(
  A,
  D = 2,
  K = 2,
  model,
  initialization = "GNN",
  case_control = FALSE,
  DA_type = "none",
  seed = NULL,
  control = list()
)

Arguments

A

A square matrix or sparse matrix of class 'dgCMatrix' representing the adjacency matrix of the unweighted network of interest.

D

Integer (scalar or vector) specifying the dimension of the latent space (default is 2).

K

Integer (scalar or vector) specifying the number of clusters to consider (default is 2).

model

A character string specifying the model to fit:

  • 'NDH': undirected network with no degree heterogeneity

  • 'RS': undirected network with degree heterogeneity

  • 'RSR': directed network with degree heterogeneity

initialization

A character string or a list to specify the initial values for the EM algorithm:

  • 'GNN': uses a type of graph neural network approach to generate initial values (default)

  • 'random': uses random initial values

  • A user-supplied list of initial values. See specify_initial_values for how to specify initial values

case_control

A logical; if TRUE then uses a case-control approximation approach (default is FALSE).

DA_type

A character string to specify the type of deterministic annealing approach to use:

  • 'none': does not employ a deterministic annealing approach (default)

  • 'cooling': employs a traditional deterministic annealing approach where temperature decreases

  • 'heating': employs a deterministic anti-annealing approach where temperature increases

  • 'hybrid': employs a combination of the 'cooling' and 'heating' approaches

seed

(optional) An integer value to specify the seed for reproducibility.

control

A list of control parameters. See 'Details'.
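
As a minimal sketch of preparing the A argument described above, the following builds a sparse adjacency matrix of class 'dgCMatrix' from an edge list. The 4-node edge list is hypothetical and purely illustrative; sparseMatrix() comes from the Matrix package, which is assumed to be installed.

```r
library(Matrix)

# Hypothetical undirected edge list on 4 nodes (illustration only)
edges <- rbind(c(1, 2), c(2, 3), c(3, 4), c(1, 4))

# Enter every edge in both directions so the matrix is symmetric;
# x = 1 is recycled, giving an unweighted (0/1) adjacency matrix
A <- sparseMatrix(i = c(edges[, 1], edges[, 2]),
                  j = c(edges[, 2], edges[, 1]),
                  x = 1,
                  dims = c(4, 4))

# Check the class and symmetry before passing A to JANE()
inherits(A, "dgCMatrix")
isSymmetric(A)
```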

Details

If an asymmetric adjacency matrix A is supplied for model %in% c('NDH', 'RS'), the user will be asked whether they would like to proceed with converting A to a symmetric matrix (i.e., A <- 1.0 * ( (A + t(A)) > 0.0 )).
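
The conversion expression above can be tried on a small matrix to see its effect. The 3-node matrix below is a hypothetical example, not package data; the sketch only illustrates the formula and is not the package's internal code.

```r
# Hypothetical 3-node directed adjacency matrix (not symmetric)
A <- matrix(c(0, 1, 0,
              0, 0, 1,
              0, 0, 0),
            nrow = 3, byrow = TRUE)

# A link in either direction becomes an undirected link
A_sym <- 1.0 * ((A + t(A)) > 0.0)

isSymmetric(A)      # FALSE
isSymmetric(A_sym)  # TRUE
```

Under this rule, a pair of actors is linked in the symmetric matrix whenever at least one of the two directed links is present.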

control:

The control argument is a named list that the user can supply containing the following components:

verbose

A logical; if TRUE causes additional information to be printed out about the progress of the EM algorithm (default is FALSE).

max_its

An integer specifying the maximum number of iterations for the EM algorithm (default is 1e3).

min_its

An integer specifying the minimum number of iterations for the EM algorithm (default is 10).

priors

A list of prior hyperparameters (default is NULL). See specify_priors on how to specify the hyperparameters.

n_interior_knots

(only relevant for model %in% c('RS', 'RSR')) An integer specifying the number of interior knots used in fitting a natural cubic spline for degree heterogeneity models (default is 5).

termination_rule

A character string to specify the termination rule to determine the convergence of the EM algorithm:

  • 'prob_mat': uses change in the absolute difference in \hat{Z} (i.e., the N \times K cluster membership probability matrix) between subsequent iterations (default)

  • 'Q': uses change in the absolute difference in the objective function of the E-step evaluated using parameters from subsequent iterations

  • 'ARI': compares the classifications between subsequent iterations using the adjusted Rand index

  • 'NMI': compares the classifications between subsequent iterations using normalized mutual information

  • 'CER': compares the classifications between subsequent iterations using the classification error rate

tolerance

A numeric specifying the tolerance used for termination_rule %in% c('Q', 'prob_mat') (default is 1e-3).

tolerance_ARI

A numeric specifying the tolerance used for termination_rule = 'ARI' (default is 0.999).

tolerance_NMI

A numeric specifying the tolerance used for termination_rule = 'NMI' (default is 0.999).

tolerance_CER

A numeric specifying the tolerance used for termination_rule = 'CER' (default is 0.01).

n_its_start_CA

An integer specifying the iteration at which to start computing cumulative averages (note: the cumulative average of U, the latent position matrix, is not tracked when termination_rule = 'Q') (default is 20).

tolerance_diff_CA

A numeric specifying the tolerance used for the change in the cumulative average of the termination_rule metric and of U (note: the cumulative average of U is not tracked when termination_rule = 'Q') (default is 1e-3).

consecutive_diff_CA

An integer specifying the tolerance for the number of consecutive instances where the change in cumulative average is less than tolerance_diff_CA (default is 5).

quantile_diff

A numeric in [0,1] specifying the quantile used in computing the change in the absolute difference of Z and U between subsequent iterations (default is 1, i.e., max).

beta_temp_schedule

A numeric vector specifying the temperature schedule for deterministic annealing (default is 1, i.e., deterministic annealing not utilized).

n_control

An integer specifying the fixed number of controls (i.e., non-links) sampled for each actor; only relevant when case_control = TRUE (default is 100 when case_control = TRUE and NULL when case_control = FALSE).

n_start

An integer specifying the maximum number of starts for the EM algorithm (default is 5).

max_retry

An integer specifying the maximum number of re-attempts if the starting values cause issues with the EM algorithm (default is 5).

IC_selection

A character string to specify the information criteria used to select the optimal fit based on the combinations of K, D, and n_start considered:

  • 'BIC_logit': BIC computed from logistic regression component

  • 'BIC_mbc': BIC computed from model based clustering component

  • 'ICL_mbc': ICL computed from model based clustering component

  • 'Total_BIC': sum of 'BIC_logit' and 'BIC_mbc'

  • 'Total_ICL': sum of 'BIC_logit' and 'ICL_mbc' (default)

sd_random_U_GNN

(only relevant when initialization = 'GNN') A positive numeric value specifying the standard deviation for the random draws from a normal distribution to initialize U (default is 1).

max_retry_GNN

(only relevant when initialization = 'GNN') An integer specifying the maximum number of re-attempts for the GNN approach before switching to random starting values (default is 10).

n_its_GNN

(only relevant when initialization = 'GNN') An integer specifying the maximum number of iterations for the GNN approach (default is 10).

downsampling_GNN

(only relevant when initialization = 'GNN') A logical; if TRUE, employs downsampling so that the numbers of links and non-links are balanced for the GNN approach (default is TRUE).
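
As a sketch of how the components above fit together, a control list can be assembled once and then passed to JANE(). The particular values are arbitrary choices for illustration, and the commented-out call assumes the JANE package is installed and that A is a suitable adjacency matrix; components left out of the list are presumed to keep the defaults stated above.

```r
# Only component names documented above are used
ctrl <- list(
  verbose          = TRUE,   # print progress of the EM algorithm
  max_its          = 500,    # maximum number of EM iterations
  termination_rule = "ARI",  # compare classifications between iterations
  tolerance_ARI    = 0.999,  # tolerance used with the 'ARI' rule
  n_start          = 3       # maximum number of starts
)

# Hypothetical call (not run here):
# res <- JANE::JANE(A = A, D = 2L, K = 3L, model = "NDH", control = ctrl)
```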

Running JANE in parallel:

JANE integrates the future and future.apply packages to fit the various combinations of K, D, and n_start in parallel. The 'Examples' section below provides an example of how to run JANE in parallel. See plan and future.apply for more details.
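
To gauge how much work there is to parallelize, it can help to count the combinations involved. The sketch below assumes one fit per combination of D, K, and start, with every start run; that is a simplification, since n_start is documented as a maximum. The ranges are hypothetical inputs and the column name start is illustrative.

```r
# Hypothetical inputs: D = 2:5, K = 2:10, and the default n_start of 5
fits <- expand.grid(D = 2:5, K = 2:10, start = seq_len(5))

# Upper bound on the number of fits: 4 * 9 * 5
nrow(fits)
```

With many more combinations than workers, each worker would be expected to handle several fits in turn.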

Choosing the number of clusters:

JANE allows the model selection criteria listed under IC_selection above ('BIC_logit', 'BIC_mbc', 'ICL_mbc', 'Total_BIC', and 'Total_ICL') to be used to choose the number of clusters.

Warning: It is not certain whether it is appropriate to use the model selection criteria above to select D.

Value

A list of S3 class "JANE" containing the following components:

input_params

A list containing the input parameters for IC_selection, case_control, and DA_type used in the function call.

A

The square sparse adjacency matrix of class 'dgCMatrix' used in fitting the latent space cluster model. This matrix can differ from the input A matrix because isolates are removed.

IC_out

A matrix containing the relevant information criteria for all combinations of K, D, and n_start considered. The 'selected' column indicates the optimal fit chosen.

all_convergence_ind

A matrix containing the convergence information (i.e., 1 = converged, 0 = did not converge) and number of iterations for all combinations of K, D, n_start, and beta_temperature considered.

optimal_res

A list containing the estimated parameters of interest based on the optimal fit selected. It is recommended to use summary() to extract the parameters of interest. See summary.JANE for more details.

optimal_starting

A list containing the starting parameters used in the EM algorithm that resulted in the optimal fit selected. It is recommended to use summary() to extract the parameters of interest. See summary.JANE for more details.

References

Biernacki, C., Celeux, G., Govaert, G., 2000. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 719–725.

Handcock, M.S., Raftery, A.E., Tantrum, J.M., 2007. Model-based clustering for social networks. Journal of the Royal Statistical Society Series A: Statistics in Society 170, 301–354.

Examples


# Simulate network
mus <- matrix(c(-1,-1,1,-1,1,1), 
              nrow = 3,
              ncol = 2, 
              byrow = TRUE)
omegas <- array(c(diag(rep(7,2)),
                  diag(rep(7,2)), 
                  diag(rep(7,2))), 
                  dim = c(2,2,3))
p <- rep(1/3, 3)
beta0 <- 1.0
sim_data <- JANE::sim_A(N = 100L, 
                        model = "NDH",
                        mus = mus, 
                        omegas = omegas, 
                        p = p, 
                        beta0 = beta0, 
                        remove_isolates = TRUE)
                        
# Run JANE on simulated data
res <- JANE::JANE(A = sim_data$A,
                  D = 2L,
                  K = 3L,
                  initialization = "GNN", 
                  model = "NDH",
                  case_control = FALSE,
                  DA_type = "none")

# Run JANE on simulated data - consider multiple D and K
res <- JANE::JANE(A = sim_data$A,
                  D = 2:5,
                  K = 2:10,
                  initialization = "GNN", 
                  model = "NDH",
                  case_control = FALSE,
                  DA_type = "none")
                  
# Run JANE on simulated data - parallel with 5 cores
future::plan(future::multisession, workers = 5)
res <- JANE::JANE(A = sim_data$A,
                  D = 2L,
                  K = 3L,
                  initialization = "GNN", 
                  model = "NDH",
                  case_control = FALSE,
                  DA_type = "none")
future::plan(future::sequential)

# Run JANE on simulated data - case/control approach with 20 controls sampled for each actor
res <- JANE::JANE(A = sim_data$A,
                  D = 2L,
                  K = 3L,
                  initialization = "GNN", 
                  model = "NDH",
                  case_control = TRUE,
                  DA_type = "none",
                  control = list(n_control = 20))
                   
# Reproducibility
res1 <- JANE::JANE(A = sim_data$A,
                   D = 2L,
                   K = 3L,
                   initialization = "GNN", 
                   seed = 1234,
                   model = "NDH",
                   case_control = FALSE,
                   DA_type = "none")

res2 <- JANE::JANE(A = sim_data$A,
                   D = 2L,
                   K = 3L,
                   initialization = "GNN", 
                   seed = 1234,
                   model = "NDH",
                   case_control = FALSE,
                   DA_type = "none")  

## Check if results match
all.equal(res1, res2)    

# Another reproducibility example where the seed was not set.
# The results can be replicated from the starting values because the
# EM algorithm is deterministic given its starting values
res3 <- JANE::JANE(A = sim_data$A,
                   D = 2L,
                   K = 3L,
                   initialization = "GNN", 
                   model = "NDH",
                   case_control = FALSE,
                   DA_type = "none")
## Extract starting values (component name as listed under 'Value')
start_vals <- res3$optimal_starting

## Run JANE using extracted starting values, no need to specify D and K 
## below as function will determine those values from start_vals
res4 <- JANE::JANE(A = sim_data$A,
                   initialization = start_vals, 
                   model = "NDH",
                   case_control = FALSE,
                   DA_type = "none")
                   
## Check if optimal_res are identical
all.equal(res3$optimal_res, res4$optimal_res)                   
                            

[Package JANE version 0.2.1 Index]