miic {miic}R Documentation

MIIC, causal network learning algorithm including latent variables

Description

MIIC (Multivariate Information-based Inductive Causation) combines constraint-based and information-theoretic approaches to disentangle direct from indirect effects amongst correlated variables, including cause-effect relationships and the effect of unobserved latent causes.

Usage

miic(
  input_data,
  state_order = NULL,
  true_edges = NULL,
  black_box = NULL,
  n_threads = 1,
  cplx = "nml",
  orientation = TRUE,
  ort_proba_ratio = 1,
  ort_consensus_ratio = NULL,
  propagation = FALSE,
  latent = "orientation",
  n_eff = -1,
  n_shuffles = 0,
  conf_threshold = 0,
  sample_weights = NULL,
  test_mar = TRUE,
  consistent = "no",
  max_iteration = 100,
  consensus_threshold = 0.8,
  negative_info = FALSE,
  mode = "S",
  n_layers = NULL,
  delta_t = NULL,
  mov_avg = NULL,
  keep_max_data = FALSE,
  max_nodes = 50,
  verbose = FALSE
)

Arguments

input_data

[a data frame, required]

A n*d data frame (n samples, d variables) that contains the observational data.

In standard mode, each column corresponds to one variable and each row is a sample that gives the values for all the observed variables. The column names correspond to the names of the observed variables. Numeric columns with at least 5 distinct values will be treated as continuous by default whilst numeric columns with less than 5 distinct values, factors and characters will be considered as categorical.

In temporal mode, the expected data frame layout is variables as columns and time series/time steps as rows. The time step information must be supplied in the first column and, for each time series, be consecutive and in ascending order (increment of 1). Multiple trajectories can be provided, miic will consider that a new trajectory starts each time a smaller time step than the one of the previous row is encountered.

state_order

[a data frame, optional, NULL by default]

A data frame providing extra information for variables. It must have d rows where d is the number of input variables and possible columns are described below. For optional columns, if they are not provided or contain missing values, default values suitable for input_data will be used.

"var_names" (required) contains the name of each variable as specified by colnames(input_data). In temporal mode, the time steps column should not be mentioned in the variables list.

"var_type" (optional) contains a binary value that specifies if each variable is to be considered as discrete (0) or continuous (1).

"levels_increasing_order" (optional) contains a single character string with all of the unique levels of the ordinal variable in increasing order, delimited by comma ','. It will be used during the post-processing to compute the sign of an edge using Spearman's rank correlation. If a variable is continuous or is categorical but not ordinal, this column should be NA.

"is_contextual" (optional) contains a binary value that specifies if a variable is to be considered as a contextual variable (1) or not (0). Contextual variables cannot be the child node of any other variable (cannot have edge with arrowhead pointing to them).

"is_consequence" (optional) contains a binary value that specifies if a variable is to be considered as a consequence variable (1) or not (0). Edges between consequence variables are ignored, consequence variables cannot be the parent node of any other variable and cannot be used as contributors. Edges between a non consequence and consequence variables are pre-oriented toward the consequence.

Several other columns are possible in temporal mode:

"n_layers" (optional) contains an integer value that specifies the number of layers to be considered for the variable. Note that if a "n_layers" column is present in the state_order, its values will overwrite the function parameter.

"delta_t" (optional) contains an integer value that specifies the number of time steps between each layer for the variable. Note that if a "delta_t" column is present in the state_order, its values will overwrite the function parameter.

"mov_avg" (optional) contains an integer value that specifies the size of the moving average window to be applied to the variable. Note that if "mov_avg" column is present in the state_order, its values will overwrite the function parameter.

true_edges

[a data frame, optional, NULL by default]

A data frame containing the edges of the true graph for computing performance after the run.
In standard mode, the expected layout is a two columns data frame, each row representing a true edge with in each column, the variable names. Variables names must exist in the input_data data frame.
In temporal mode, the expected layout is a three columns data frame, with the first two columns being variable names and the third the lag. Variables names must exist in the input_data data frame and the lag must be valid in the time unfolded graph. e.g. a row var1, var2, 3 is valid with n_layers = 4 + delta_t = 1 or n_layers = 2 + delta_t = 3 but not for n_layers = 2 + delta_t = 2 as there is no matching edge in the time unfolded graph.
Please note that the order is important: in standard mode, "var1 var2" will be interpreted as var1 -> var2 and in temporal mode, "var1 var2 3" is interpreted as var1_lag3 -> var2_lag0. Please note also that, in temporal mode, for contextual variables that are not lagged, the expected value in the third column for the time lag is NA.

black_box

[a data frame, optional, NULL by default]

A data frame containing pairs of variables that will be considered as independent during the network reconstruction. In practice, these edges will not be included in the skeleton initialization and cannot be part of the final result.
In standard mode, the expected layout is a two columns data frame, each row representing a forbidden edge with in each column, the variable names. Variables names must exist in the input_data data frame.
In temporal mode, the expected layout is a three columns data frame, with the first two columns being variable names and the third the lag. Variables names must exist in the input_data data frame and the lag must be valid in the time unfolded graph. e.g. a row var1, var2, 3 is valid with n_layers = 4 + delta_t = 1 or n_layers = 2 + delta_t = 3 but not for n_layers = 2 + delta_t = 2 as there is no matching edge in the time unfolded graph. Please note that the order is important: var1, var2, 3 is interpreted as var1_lag3 - var2_lag0. Please note also that, for contextual variables that are not lagged, the expected value in the third column for the time lag is NA.

n_threads

[a positive integer, optional, 1 by default]

When set greater than 1, n_threads parallel threads will be used for computation. Make sure your compiler is compatible with openmp if you wish to use multithreading.

cplx

[a string, optional, "nml" by default, possible values: "nml", "bic"]

In practice, the finite size of the input dataset requires that the 2-point and 3-point information measures should be shifted by a complexity term. The finite size corrections can be based on the Bayesian Information Criterion (BIC). However, the BIC complexity term tends to underestimate the relevance of edges connecting variables with many different categories, leading to the removal of false negative edges. To avoid such biases with finite datasets, the (universal) Normalized Maximum Likelihood (NML) criterion can be used (see Affeldt 2015).

orientation

[a boolean value, optional, TRUE by default]

The miic network skeleton can be partially directed by orienting edge directions, based on the sign and magnitude of the conditional 3-point information of unshielded triples and, in temporal mode, using time. If set to FALSE, the orientation step is not performed.

ort_proba_ratio

[a floating point between 0 and 1, optional, 1 by default]

The threshold when deducing the type of an edge tip (head/tail) from the probability of orientation. For a given edge tip, denote by p the probability of it being a head, the orientation is accepted if (1 - p) / p < ort_proba_ratio. 0 means reject all orientations, 1 means accept all orientations.

ort_consensus_ratio

[a floating point between 0 and 1, optional, NULL by default] Used to determine if orientations correspond to genuine causal edges and, when consistency is activated, to deduce the orientations in the consensus graph.
Oriented edges will be marked as genuine causal when: (1 - p_{head}) / p_{head} < ort_consensus_ratio and p_{tail} / (1 - p_{tail}) < ort_consensus_ratio.
When consistency is activated, ort_consensus_ratio is used as threshold when deducing the type of an consensus edge tip (head/tail) from the average probability of orientations over the cycle of graphs. For a given edge tip, denote by p the average probability of it being a head, the orientation is accepted if (1 - p) / p < ort_consensus_ratio.
If not supplied, the ort_consensus_ratio will be initialized with the ort_proba_ratio value.

propagation

[a boolean value, optional, FALSE by default]

If set to FALSE, the skeleton is partially oriented with only the v-structure orientations. Otherwise, the v-structure orientations are propagated to downstream un-directed edges in unshielded triples following the propagation procedure, relying on probabilities (for more details, see Verny 2017).

latent

[a string, optional, "orientation" by default, possible values: "orientation", "no", "yes"]

When set to "yes", the network reconstruction is taking into account hidden (latent) variables. When set to "orientation", latent variables are not considered during the skeleton reconstruction but allows bi-directed edges during the orientation. Dependence between two observed variables due to a latent variable is indicated with a '6' in the adjacency matrix and in the network edges.summary and by a bi-directed edge in the (partially) oriented graph.

n_eff

[a positive integer, optional, -1 by default]

In standard mode, the n samples given in the input_data data frame are expected to be independent. In case of correlated samples such as in Monte Carlo sampling approaches, the effective number of independent samples n_eff can be estimated using the decay of the autocorrelation function (see Verny 2017). This effective number n_eff of independent samples can be provided using this parameter.

n_shuffles

[a positive integer, optional, 0 by default]

The number of shufflings of the original dataset in order to evaluate the edge specific confidence ratio of all retained edges. Default is 0: no confidence cut is applied. If the number of shufflings is set to an integer > 0, the confidence threshold must also be > 0 (e.g. n_shuffles = 100 and conf_threshold = 0.01).

conf_threshold

[a positive floating point, optional, 0 by default]

The threshold used to filter the less probable edges following the skeleton step (see Verny 2017). Default is 0: no confidence cut is applied. If the confidence threshold is set > 0, the number of shufflings must also be > 0 (e.g. n_shuffles = 100 and conf_threshold = 0.01).

sample_weights

[a numeric vector, optional, NULL by default]

An vector containing the weight of each observation. If defined, it must be a vector of floats in the range [0,1] of size equal to the number of samples.

test_mar

[a boolean value, optional, TRUE by default]

If set to TRUE, distributions with missing values will be tested with Kullback-Leibler divergence: conditioning variables for the given link X - Y, Z will be considered only if the divergence between the full distribution and the non-missing distribution KL(P(X,Y) | P(X,Y)_{!NA}) is low enough (with P(X,Y)_{!NA} as the joint distribution of X and Y on samples which are not missing on Z. This is a way to ensure that data are missing at random for the considered interaction and detect bias due to values not missing at random.

consistent

[a string, optional, "no" by default, possible values: "no", "orientation", "skeleton"]

If set to "orientation": iterate over skeleton and orientation steps to ensure consistency of the separating sets and all disconnected pairs in the final network. If set to "skeleton": iterate over skeleton step to get a consistent skeleton, then orient edges including inconsistent orientations (see Li 2019 for details).

max_iteration

[a positive integer, optional, 100 by default]

When the consistent parameter is set to "skeleton" or "orientation", the maximum number of iterations allowed when trying to find a consistent graph.

consensus_threshold

[a floating point between 0.5 and 1.0, optional, 0.8 by default]

When the consistent parameter is set to "skeleton" or "orientation" and when the result graph is inconsistent or is a union of more than one inconsistent graphs, a consensus graph will be produced based on a pool of graphs. If the result graph is inconsistent, then the pool is made of max_iteration graphs from the iterations, otherwise it is made of those graphs in the union. In the consensus graph, an edge is present when the proportion of non-zero status in the pool is above the threshold. For example, if the pool contains [A, B, B, 0, 0], where "A", "B" are different status of the edge and "0" indicates the absence of the edge. Then the edge is set to connected ("1") if the proportion of non-zero status (0.6 in the example) is equal to or higher than consensus_threshold. (When set to connected, the orientation of the edge will be further determined by the average probability of orientation.)

negative_info

[a boolean value, optional, FALSE by default]

If TRUE, negative shifted mutual information is allowed during the computation when mutual information is inferior to the complexity term. For small dataset with complicated structures, e.g. discrete variables with many levels, allowing for negative shifted mutual information may help identifying weak v-structures related to those discrete variables, as the negative three-point information in those cases will come from the difference between two negative shifted mutual information terms (expected to be negative due to the small sample size). However, under this setting, a v-structure (X -> Z <- Y) in the final graph does not necessarily imply that X is dependent on Y conditioning on Z, As a consequence, the reliability of certain orientations is not guaranteed. By contrast, keeping this parameter as FALSE is more conservative and leads to more reliable orientations (see Cabeli 2021 and Ribeiro-Dantas 2024).

mode

[a string, optional, "S" by default, possible values are "S": Standard (non temporal data) or "TS": Temporal Stationary data]

When temporal mode is activated, the time information must be provided in the first column of input_data. For more details about temporal stationary mode (see Simon 2024).

n_layers

[an integer, optional, NULL by default, must be >= 2 if supplied]

Used only in temporal mode, n_layers defines the number of layers that will be considered for the variables in the time unfolded graph. The layers will be distant of delta_t time steps. If not supplied, the number of layers is estimated from the dynamic of the dataset and the maximum number of nodes max_nodes allowed in the final lagged graph.

delta_t

[an integer, optional, NULL by default, must be >= 1 if supplied]

Used only in temporal mode, delta_t defines the number of time steps between each layer. i.e. on 1000 time steps with n_layers = 3 and delta_t = 7, the time steps kept for the samples conversion will be 1, 8, 15 for the first sample, the next sample will use 2, 9, 16 and so on. If not supplied, the number of time steps between layers is estimated from the dynamic of the dataset and the number of layers.

mov_avg

[an integer, optional, NULL by default, must be >= 2 if supplied]

Used only in temporal mode. When supplied, a moving average operation is applied to all integer and numeric variables that are not contextual variables.

keep_max_data

[a boolean value, optional, FALSE by default]

Used only in temporal mode. If TRUE, rows where some NAs have been introduced during the moving averages and lagging will be kept whilst they will be dropped if FALSE.

max_nodes

[an integer, optional, 50 by default]

Used only in temporal mode and if the n_layers or delta_t parameters are not supplied. max_nodes is used as the maximum number of nodes in the final time-unfolded graph to compute n_layers and/or delta_t. The default is 50 to produce quick runs and can be increased up to 200 or 300 on recent computers to produce more precise results.

verbose

[a boolean value, optional, FALSE by default]

If TRUE, debugging output is printed.

Details

Starting from a complete graph, the method iteratively removes dispensable edges, by uncovering significant information contributions from indirect paths, and assesses edge-specific confidences from randomization of available data. The remaining edges are then oriented based on the signature of causality in observational data. Miic distinguishes genuine causal edges (with both reliable arrow heads and tails) from putative causal edges (with one reliable arrow head only) and latent causal edges (with both reliable arrow heads). (see Ribeiro-Dantas 2024)

In temporal mode, miic reorganizes the dataset using the n_layers and delta_t parameters to transform the time steps into lagged samples. As starting point, a lagged graph is created with only edges having at least one node laying on the last time step. Then, miic standard algorithm is applied to remove dispensable edges. The remaining edges are then duplicated to ensure time invariance (stationary dynamic) and oriented using the temporality and the signature of causality in observational data. The use of temporal mode is presented in Simon 2024.

The method relies on information theoretic principles which replace (conditional) independence tests as described in Affeldt 2015, Cabeli 2020, Cabeli 2021 and Ribeiro-Dantas 2024. It deals with both categorical and continuous variables by performing optimal context-dependent discretization. As such, the input data frame may contain both numerical columns which will be treated as continuous, or character / factor columns which will be treated as categorical. For further details on the optimal discretization method and the conditional independence test, see the function discretizeMutual. The user may also choose to run miic with scheme presented in Li 2019 and Ribeiro-Dantas 2024 to improve the end result's interpretability by ensuring consistent separating sets.

Value

A miic-like object that contains:

References

See Also

discretizeMutual for optimal discretization and (conditional) independence test.

Examples

library(miic)

# EXAMPLE HEMATOPOIESIS
data(hematoData)

# execute MIIC (reconstruct graph)
miic_obj <- miic(
  input_data = hematoData[1:1000,], latent = "yes",
  n_shuffles = 10, conf_threshold = 0.001
)

# plot graph
if(require(igraph)) {
 plot(miic_obj, method="igraph")
}


# write graph to graphml format. Note that to correctly visualize
# the network we created the miic style for Cytoscape (http://www.cytoscape.org/).

writeCytoscapeNetwork(miic_obj, file = file.path(tempdir(), "temp"))

# EXAMPLE CANCER
data(cosmicCancer)
data(cosmicCancer_stateOrder)
# execute MIIC (reconstruct graph)
miic_obj <- miic(
  input_data = cosmicCancer, state_order = cosmicCancer_stateOrder, latent = "yes",
  n_shuffles = 100, conf_threshold = 0.001
)

# plot graph
if(require(igraph)) {
 plot(miic_obj)
}

# write graph to graphml format. Note that to correctly visualize
# the network we created the miic style for Cytoscape (http://www.cytoscape.org/).
writeCytoscapeNetwork(miic_obj, file = file.path(tempdir(), "temp"))

# EXAMPLE COVID CASES (time series demo)
data(covidCases)
# execute MIIC (reconstruct graph in temporal mode)
tmiic_obj <- miic(input_data = covidCases, mode = "TS", n_layers = 3, delta_t = 1, mov_avg = 14)

# to plot the default graph (compact)
if(require(igraph)) {
 plot(tmiic_obj)
}

# to plot the raw temporal network
if(require(igraph)) {
  plot(tmiic_obj, display="raw")
}

# to plot the full temporal network
if(require(igraph)) {
  plot(tmiic_obj, display="lagged")
}




[Package miic version 2.0.3 Index]