cluster_genes {speakeasyR} | R Documentation |
Use the Speakeasy 2 community detection algorithm to cluster genes based on their gene expression. A gene coexpression network is created by taking correlating the input gene expression matrix to genes that tend to be expressed together. This matrix is then clustered to find gene modules.
Note: This is intended for gene expression sampled from bulk sequencing. Samples from single cell sequencing may work but will need to be preprocessed due to the greater noise-to-signal ratio. See the speakeasyR vignette for an example of single cell preprocessing. For more information about working with single cell data see: Malte D Luecken & Fabian J Theis (2019) Current Best Practices in Single‐cell Rna‐seq Analysis: a Tutorial, Molecular Systems Biology.
cluster_genes(
gene_expression,
k = NULL,
discard_transient = 3,
independent_runs = 10,
max_threads = 0,
seed = 0,
target_clusters = 0,
target_partitions = 5,
subcluster = 1,
min_clust = 5,
verbose = FALSE
)
gene_expression |
a matrix of gene expression data with data from multiple samples (in the form genes x samples). |
k |
number of neighbors to include if converting to a k-nearest neighbor graph. Should be a non-negative integer less than the number of genes. If this value is not set the raw GCN is clustered. The kNN graph is a sparse directed graph with binary edges between a node and it's most similar k neighbors. Conversion to a kNN graph can provide good clustering results much faster than using the full graph in cases with a large number of genes. |
discard_transient |
The number of partitions to discard before tracking. |
independent_runs |
How many runs SpeakEasy2 should perform. |
max_threads |
The maximum number of threads to use. By default this is the same as the number of independent runs. If max_threads is greater than or equal to the number of processing cores, all cores may run. If max_threads is less than the number of cores, at most max_threads cores will run. |
seed |
Random seed to use for reproducible results. SpeakEasy2 uses a different random number generator than R, but if the seed is not explicitly set, R's random number generator is used create one. Because of this, setting R's RNG will also cause reproducible results. |
target_clusters |
The number of random initial labels to use. |
target_partitions |
Number of partitions to find per independent run. |
subcluster |
Depth of clustering. If greater than 1, perform recursive clustering. |
min_clust |
Smallest clusters to recursively cluster. If subcluster not set to a value greater than 1, this has no effect. |
verbose |
Whether to provide additional information about the clustering or not. |
A membership vector. If subclustering, returns a matrix with number of rows equal to the number of recursive clustering. Each row is the membership at different hierarchical scales, such that the last rows are the highest resolution.
# Set parameters
set.seed(123) # For reproducibility
ngene <- 200
nsample <- 1000
ncluster <- 5
# Create a function to simulate gene expression data
simulate_gene_expression <- function(ngene, nsample, ncluster) {
# Initialize the expression matrix
expr_matrix <- matrix(0, nrow = ngene, ncol = nsample)
# Create cluster centers for genes
cluster_centers <- matrix(rnorm(ncluster * nsample, mean = 5, sd = 2),
nrow = ncluster, ncol = nsample
)
# Assign genes to clusters
gene_clusters <- sample(1:ncluster, ngene, replace = TRUE)
for (i in 1:ngene) {
cluster <- gene_clusters[i]
expr_matrix[i, ] <- cluster_centers[cluster, ] +
rnorm(nsample, mean = 0, sd = 1)
}
return(list(expr_matrix = expr_matrix, gene_clusters = gene_clusters))
}
# Simulate the data
simulated_data <- simulate_gene_expression(ngene, nsample, ncluster)
# Extract the expression matrix and gene clusters
expr_matrix <- simulated_data$expr_matrix
gene_clusters <- simulated_data$gene_clusters
# Cluster and test quality of results
modules <- cluster_genes(expr_matrix, max_threads = 2)