TCGAretriever is an R library aimed at downloading genomic data from cBioPortal (https://www.cbioportal.org/), including The Cancer Genome Atlas (TCGA) data (free-tier data). TCGA is a program aimed at improving our understanding of Cancer Biology. Several TCGA Datasets are available online. TCGAretriever helps accessing and downloading TCGA data using R via the cBioPortal API. Features of TCGAretriever are:

This tutorial describes few use cases to help getting started with TCGAretriever.

Basic Operations

In this section, we provide an overview of the basic operations and commonly used functions.

Cancer Study IDs

Here we retrieve the list of available studies and select all tcga_pub entries. Each of the values included in the cancer_study_id column is a study identifier and can be passed as csid argument to other functions such as get_genetic_profiles() or get_case_lists().

library(TCGAretriever)
library(reshape2)
library(ggplot2)

# Obtain a list of cancer studies from cBio
all_studies <- get_cancer_studies()

# Find published TCGA datasets
keep <- grepl("tcga_pub$", all_studies[,1])
tcga_studies <- all_studies[keep, ]

# Show results
show_head(tcga_studies, 6, 2)
##       cancer_study_id                                             name
## 8       laml_tcga_pub         Acute Myeloid Leukemia (TCGA, NEJM 2013)
## 21      sarc_tcga_pub     Adult Soft Tissue Sarcomas (TCGA, Cell 2017)
## 36      blca_tcga_pub Bladder Urothelial Carcinoma (TCGA, Nature 2014)
## 55      brca_tcga_pub    Breast Invasive Carcinoma (TCGA, Nature 2012)
## 82  coadread_tcga_pub    Colorectal Adenocarcinoma (TCGA, Nature 2012)
## 110     stes_tcga_pub         Esophageal Carcinoma (TCGA, Nature 2017)

Genetic Profiles (Assays) and Case Lists

It is possible to retrieve genetic profiles and case lists associated to each study of interest. Genetic profiles indicate what kind of data are available for a certain study (e.g., RNA-seq, micro-array, DNA mutation…). Typically, a data type of interest may only be available for a fraction of patients in the cohort. Therefore, it is also important to obtain the corresponding subject with data available for a certain assay.

# Define the cancer study id: brca_tcga_pub
my_csid <- "brca_tcga_pub"

# Obtain genetic profiles
blca_pro <- get_genetic_profiles(csid = my_csid)
show_head(blca_pro, 8, 2)
##                  genetic_profile_id
## 1                brca_tcga_pub_rppa
## 2        brca_tcga_pub_rppa_Zscores
## 3              brca_tcga_pub_gistic
## 4          brca_tcga_pub_linear_CNA
## 5           brca_tcga_pub_mutations
## 6    brca_tcga_pub_methylation_hm27
## 7                brca_tcga_pub_mrna
## 8 brca_tcga_pub_mrna_median_Zscores
##                                                genetic_profile_name
## 1                                         Protein expression (RPPA)
## 2                                Protein expression z-scores (RPPA)
## 3                      Putative copy-number alterations from GISTIC
## 4                         Capped relative linear copy-number values
## 5                                                         Mutations
## 6                                                Methylation (HM27)
## 7                                      mRNA expression (microarray)
## 8 mRNA expression z-scores relative to diploid samples (microarray)

# Obtain cases 
blca_cas <- get_case_lists(csid = my_csid)
show_head(blca_cas, 8, 2)
##             case_list_id      case_list_name
## 1      brca_tcga_pub_all         All samples
## 2 brca_tcga_pub_complete    Complete samples
## 3  brca_tcga_pub_luminal       Luminal (A/B)
## 4    brca_tcga_pub_basal         PAM50 Basal
## 5  brca_tcga_pub_claudin   PAM50 Claudin low
## 6     brca_tcga_pub_her2 PAM50 Her2 enriched
## 7     brca_tcga_pub_luma     PAM50 Luminal A
## 8     brca_tcga_pub_lumb     PAM50 Luminal B

Retrieve Genomic Data

Once we identified a genetic profile of interest and a case list of interest, we can obtain clinical data and genomic data via the TCGAretriever::get_clinical_data() and TCGAretriever::get_profile_data functions respectively.

# Define a set of genes of interest
q_genes <- c("TP53", "CLDN7", "E2F1", "EZH2")
q_cases <- "brca_tcga_pub_complete"
rna_prf <- "brca_tcga_pub_mrna"
mut_prf <- "brca_tcga_pub_mutations"

# Download Clinical Data
brca_cli <- get_clinical_data(case_id = q_cases)

# Download RNA
brca_RNA <- get_profile_data(case_id = q_cases, 
                             gprofile_id = rna_prf, 
                             glist = q_genes, 
                             force_numeric = TRUE)
  • NOTE 1: the resulting data.frame includes ENTREZ_GENE_IDs and OFFICIAL_SMBOLs as first and second column.

  • NOTE 2: expression data can be retrieved as numeric by setting the force_numeric argument to TRUE.

  • NOTE 3: up to n=500 genes can be queried via the get_profile_data() function.


# Show results
show_head(brca_RNA, 4, 4)
##   GENE_ID COMMON TCGA-A2-A0T2-01 TCGA-A2-A04P-01
## 1    1366  CLDN7       2.9118750       2.2225000
## 2    1869   E2F1      -1.6997500      -0.4502500
## 3    2146   EZH2      -1.6728667      -0.8942667
## 4    7157   TP53      -0.3888333      -1.8085000
# Set SYMBOLs as rownames
# Note that you may prefer to use the tibble package for this
rownames(brca_RNA) <- brca_RNA$COMMON
brca_RNA <- brca_RNA[, -c(1,2)]

# Round numeric vals to 3 decimals
for (i in 1:ncol(brca_RNA)) {
  brca_RNA[, i] <- round(brca_RNA[, i], digits = 3)
}

# Download mutations (simple)
brca_MUT <- get_profile_data(case_id = q_cases, 
                             gprofile_id = mut_prf, 
                             glist = q_genes)

rownames(brca_MUT) <- brca_MUT$COMMON
brca_MUT <- brca_MUT[, -c(1,2)]

# Show results
show_head(brca_RNA, 4, 4)
##       TCGA-A2-A0T2-01 TCGA-A2-A04P-01 TCGA-A1-A0SK-01 TCGA-A2-A0CM-01
## CLDN7           2.912           2.223           1.986           2.921
## E2F1           -1.700          -0.450           0.238          -0.568
## EZH2           -1.673          -0.894           0.529          -0.427
## TP53           -0.389          -1.808           1.672           0.490
show_head(brca_MUT, 4, 4)
##       TCGA-A2-A0T2-01 TCGA-A2-A04P-01 TCGA-A1-A0SK-01 TCGA-A2-A0CM-01
## CLDN7             NaN             NaN             NaN             NaN
## E2F1              NaN             NaN             NaN             NaN
## EZH2              NaN             NaN             NaN             NaN
## TP53            Y220C      G108Vfs*15           M133K      E204Sfs*43

NOTE: when using the same case_list_id to retrieve different types of data (genetic profiles) results have consistent structure. In other words, data.frames include info for the same list of cases (and hence, the resulting data.frames have the same number of columns, and identical column names).

# Note that the columns (cases) are identical 
# and have the same order in both data.frames
sum(colnames(brca_MUT) != colnames(brca_RNA))
## [1] 0

Retrieve Large Data

It is possible to download large genomic data (information for a large number of genes) using the fetch_all_tcgadata() function. This function takes the same arguments as the get_profile_data() function. If the number of query genes is bigger than n=500, the job is automatically split into multiple batches. Results are automatically aggregated before being returned. If mutation data are being queried, results can be requested in two alternate formats: 1) matrix format; 2) extended (molten) format. The mutations argument is used to control the format.

—–

Examples and visualizations

In this section, we discuss a simple use case where we used TCGAretriever to study the relationship between gene expression and mutation status of some genes of interest in breast cancer tumors.

Relationship between E2F1 and EZH2 in BRCA

We use the data retrieved before to analyze the relationship between E2F1 and EZH2 expression in TCGA breast cancer samples.

# Visualize the correlation between EZH2 and E2F1
df <- data.frame(sample_id = colnames(brca_RNA), 
                 EZH2 = as.numeric(brca_RNA['EZH2', ]), 
                 E2F1 = as.numeric(brca_RNA['E2F1', ]), 
                 stringsAsFactors = FALSE)

ggplot(df, aes(x = EZH2, y = E2F1)) +
  geom_point(color = 'gray60', size = 0.75) +
  theme_bw() +
  geom_smooth(method = 'lm', color = 'red2', 
              size=0.3, fill = 'gray85') +
  ggtitle('E2F1-EZH2 correlation in BRCA') + 
  theme(plot.title = element_text(hjust = 0.5))

Scatterplot - showing E2F1 RNA expression with respect to EZH2 RNA expression across TCGA breast cancer samples.


Relationship between CLDN7 and TP53 with respect to TP53 mutation status in BRCA

We use the data retrieved before to analyze the relationship between CLDN7 and TP53 expression with respect to the TP53 WT status in TCGA breast cancer samples.

# Coerce to data.frame with numeric features 
xpr_df <- data.frame(sample_id = colnames(brca_RNA), 
                     CLDN7 = as.numeric(brca_RNA['CLDN7', ]),
                     TP53 = as.numeric(brca_RNA['TP53', ]),
                     stringsAsFactors = FALSE)

mut_df <- data.frame(
  sample_id = colnames(brca_RNA), 
  TP53.status = as.factor(ifelse(brca_MUT["TP53",] == "NaN", "WT", "MUT")),
  stringsAsFactors = FALSE)

df <- dplyr::inner_join(xpr_df, mut_df, by='sample_id')

# Visualize the correlation between EZH2 and E2F1
ggplot(df, aes(x = TP53, y = CLDN7)) +
  geom_point(color = 'gray60', size = 0.75) +
  facet_grid(cols = vars(TP53.status)) +
  theme_bw() +
  geom_smooth(mapping = aes(color = TP53.status), 
              method = 'lm', size=0.3, fill = 'gray85') +
  ggtitle('E2F1-EZH2 correlation in BRCA') + 
  theme(plot.title = element_text(hjust = 0.5))

Scatterplot - showing CLDN7 RNA expression with respect to TP53 RNA expression across TCGA breast cancer samples. Samples with wild type (left) and mutant (right) TP53 status are shown in the two panels.

—–

SessionInfo

sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C           LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.3.6     reshape2_1.4.4    TCGAretriever_1.7
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.9       highr_0.9        bslib_0.4.0      compiler_4.2.1  
##  [5] pillar_1.9.0     jquerylib_0.1.4  plyr_1.8.7       tools_4.2.1     
##  [9] digest_0.6.29    lattice_0.20-45  nlme_3.1-159     jsonlite_1.8.4  
## [13] evaluate_0.16    lifecycle_1.0.3  tibble_3.2.1     gtable_0.3.0    
## [17] mgcv_1.8-40      pkgconfig_2.0.3  rlang_1.1.1      Matrix_1.4-1    
## [21] cli_3.6.1        rstudioapi_0.13  curl_4.3.2       yaml_2.3.5      
## [25] xfun_0.32        fastmap_1.1.0    withr_2.5.0      httr_1.4.6      
## [29] stringr_1.5.0    dplyr_1.1.2      knitr_1.39       generics_0.1.3  
## [33] vctrs_0.6.2      sass_0.4.2       tidyselect_1.2.0 grid_4.2.1      
## [37] glue_1.6.2       R6_2.5.1         fansi_1.0.3      rmarkdown_2.14  
## [41] farver_2.1.1     magrittr_2.0.3   splines_4.2.1    scales_1.2.0    
## [45] htmltools_0.5.5  colorspace_2.0-3 labeling_0.4.2   utf8_1.2.2      
## [49] stringi_1.7.8    munsell_0.5.0    cachem_1.0.6

Success! TCGAretriever vignette, version 2023-Jul-09, by D Fantini.