Abstract

Source Code

Setup

suppressPackageStartupMessages({
  library(GenomicSuperSignature)
  library(dplyr)
  library(ggplot2)
})
## Warning: package 'GenomicSuperSignature' was built under R version 4.1.3
## If GenomicSuperSignaturePaper is built locally with RAVmodel in inst/extdata
data.dir <- system.file("extdata", package = "GenomicSuperSignaturePaper")
RAVmodel <- readRDS(file.path(data.dir, "RAVmodel_C2.rds"))
RAVmodel_PLIERpriors <- readRDS(file.path(data.dir, "RAVmodel_PLIERpriors.rds"))
RAVmodel
## class: PCAGenomicSignatures 
## dim: 13934 4764 
## metadata(8): cluster size ... version geneSets
## assays(1): RAVindex
## rownames(13934): CASKIN1 DDX3Y ... CTC-457E21.9 AC007966.1
## rowData names(0):
## colnames(4764): RAV1 RAV2 ... RAV4763 RAV4764
## colData names(4): RAV studies silhouetteWidth gsea
## trainingData(2): PCAsummary MeSH
## trainingData names(536): DRP000987 SRP059172 ... SRP164913 SRP188526

geneSets(RAVmodel)
## [1] "MSigDB C2 version 7.1"
version(RAVmodel)
## [1] "1.1.1"

RAVmodel characterization

RAVmodel is composed of RAVindex, model’s metadata, and annotation modules linked through RAVs. RAVindex has 4,764 RAVs and 1,378 out of them are ‘single-element’ clusters. By definition a single-element cluster is not a ‘repetitive’ signal, leaving only 3,386 actual RAVs. This means we compressed the information from 44,890 samples into 3,386 RAVs, which is less than 1/10 of the initial number of samples. Also, 417 out of 536 training datasets have 40,746 genes and the other 119 training datasets have 41,255 genes, while the RAVindex uses only 13,934 common genes among the top 90% varying genes of all samples. Thus, our method achieves an efficient data compression, maintaining significant information in ~3% of the initial volume of the training data.

Cluster sizes

Cluster sizes, which is the number of PCs in RAVs, are strongly skewed to the right. When we exclude single-element clusters, about 65% of RAVs (2,212 out of 3,386) have two PCs and the mean cluster size is 2.759 PCs per RAV with the largest cluster containing 24 PCs. Interestingly, the distribution of PCs in RAVs changes with the cluster sizes: the majority of PCs in one- and two- element clusters are lower PCs, but once clusters have more than two elements, PC distribution starts to skew to the right. This suggests that RAVs from small clusters tend to represent weak and less common signals. Though, we still keep the ‘single-element’ RAVs for two reasons: 1) If any new data is validated by those ‘single-element’ RAVs, they become ‘repetitive’ signals and thus, could lead to new hypotheses and 2) by keeping all RAVs, we include all potential PCs in the RAVmodel and support different use cases. Since metadata associated with all RAVs are readily accessible, end users can filter downstream results based on cluster sizes or other properties.

## Bar plot of the cluster sizes. It ranges from 1 to 24, strongly right-skewed.
x <- metadata(RAVmodel)$size
table(x)
## x
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   17 
## 1378 2212  628  249  133   61   38   17   13    6    6    6    6    1    1    1 
##   19   20   21   22   24 
##    2    2    2    1    1

summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    2.00    2.25    2.00   24.00

PC compositions in clusters

We check the composition of PCs in different clusters and display the results as bar plots for each cluster size. (Supplementary Figure 10)

We tested whether the top PCs tend to be grouped together forming large clusters. We considered PC numbers as a numeric data points and plotted the mean and standard deviation of PC numbers per cluster.

summary <- as.data.frame(matrix(nrow = ncol(RAVmodel), ncol = 4))
colnames(summary) <- c("RAV", "size", "mean", "sd")
summary$RAV <- paste0("RAV", 1:ncol(RAVmodel))
summary$size <- metadata(RAVmodel)$size

for (i in seq_len(ncol(RAVmodel))) {
    cl <- which(metadata(RAVmodel)$cluster == i)
    cl_pcs <- metadata(RAVmodel)$cluster[cl]
    PCs <- lapply(names(cl_pcs), function(x) {
        strsplit(x, "\\.PC") %>% unlist %>% .[2] %>% as.character
    }) %>% unlist %>% as.numeric
    summary[i,"mean"] <- mean(PCs)
    summary[i,"sd"] <- sd(PCs)
}

head(summary)
##    RAV size     mean        sd
## 1 RAV1    6 1.333333 0.8164966
## 2 RAV2   21 2.142857 1.0141851
## 3 RAV3    4 4.500000 1.9148542
## 4 RAV4    7 4.142857 1.5735916
## 5 RAV5    3 5.333333 2.5166115
## 6 RAV6    3 5.333333 0.5773503

Larger clusters tend to have higher proportions of top PCs.

If we take the mean and standard deviation of the above plot:

Overall, we observed the tendency of larger clusters formed with higher proportions of top PCs.

## Summary table
summaryOfsummary <- dplyr::group_by(summary_sub, size) %>% 
    summarize(mean_mean = mean(mean), mean_sd = mean(sd))
summaryOfsummary
## # A tibble: 20 × 3
##     size mean_mean mean_sd
##    <int>     <dbl>   <dbl>
##  1     2     12.5    3.06 
##  2     3      9.74   3.28 
##  3     4      8.16   3.33 
##  4     5      7.05   3.08 
##  5     6      5.12   2.38 
##  6     7      5.42   2.48 
##  7     8      4.5    2.62 
##  8     9      3.33   1.77 
##  9    10      3.98   1.97 
## 10    11      3.02   1.25 
## 11    12      2.51   1.16 
## 12    13      2.22   1.13 
## 13    14      1.5    0.760
## 14    15      2.33   1.18 
## 15    17      1.29   0.985
## 16    19      4.18   2.26 
## 17    20      1.75   0.716
## 18    21      1.76   0.967
## 19    22      1.09   0.426
## 20    24      6      3.96

GSEA

Clsuter sizes and GSEA

We assess the number of enriched gene sets for each RAV. About 40% of RAVs in RAVmodel_C2 and 50% of RAVs in RAVmodel_PLIERpriors do not have any enriched pathway and the majority of them are one- or two- element clusters (Supplementary Figure 11), suggesting that the smaller clusters are less likely to represent biological features. Because there are RAVs annotated with only one input annotation, MSigDB C2 or PLIERpriors, we include all the RAVs to make our model cover diverse annotation databases. We further evaluate the scope of biological features represented by RAVmodel through two model validation measures, pathway coverage and pathway separation, adopted from the previous study. Pathway coverage is defined as the proportion of pathways annotating RAVs out of all the gene set terms provided. Pathway coverage of RAVmodel_C2 is 0.32. The recount2_MultiPLIER has the pathway coverage of 0.42 while the RAVmodel_PLIERpriors which uses the same gene set as recount2_MultiPLIER has 0.64 pathway coverage. Pathway separation is defined as the ability of the signature model to keep non-overlapping signatures that can differentiate biologically similar pathways. Three biological subjects were tested on RAVmodel_PLIERpriors - type I versus type II interferon, neutrophil versus monocyte, and G1 versus G2 cell cycle phases. RAVmodel can successfully separate them either with top one or top five enriched pathways. Detailed analysis is available HERE

GSEA without LINC genes

We investigated the effect of long intergenic non-coding (LINC) RNA in our model on GSEA annotation. Details of this analysis is available HERE, which also includes pathway coverage analysis.

Redundancy

Redundancy within the cluster is defined as the cluster containing more than one PCs from the same study. The majority of RAVs (78%, 2,628 out of 3,386 non-single-element RAVs) consist of PCs from unique studies. 622 non-single-element RAVs are composed of only one study and 80% of them have no or only one MSigDB C2 pathway enriched.

First, we summarize RAVs with multiple PCs from one study and save it in dup object.

dup <- as.data.frame(matrix(ncol = 4, nrow = ncol(RAVmodel)))
colnames(dup) <- c("RAV", "Dupicated", "Num_duplicated", "Studies_duplicated")
for (i in seq_len(ncol(RAVmodel))) {
    ## PCs in each cluster
    cl_pcs <- which(metadata(RAVmodel)$cluster == i)
    
    ## Studies in the given RAV
    Projs <- lapply(names(cl_pcs), function(x) {
        strsplit(x, "\\.") %>% unlist %>% .[1] %>% as.character
    }) %>% unlist
    
    dup[i, 1] <- paste0("RAV", i)
    dup[i, 2] <- any(duplicated(Projs))
    dup[i, 3] <- sum(duplicated(Projs))
    dup[i, 4] <- length(unique(Projs[duplicated(Projs)]))
}

table(dup$Dupicated) # 758 RAVs have PCs from the same study --> remove them?
## 
## FALSE  TRUE 
##  4006   758
table(dup$Num_duplicated) # how many PCs from the same study
## 
##    0    1    2    3    4 
## 4006  675   70    8    5
table(dup$Studies_duplicated) # how many studies are multiplicated in one RAV
## 
##    0    1    2 
## 4006  752    6

We subset dup to the RAVs with multiple PCs from the same study and saved it in multiPCs. If the number of duplicates PCs and the number of duplicated studies are equal, then there are two PCs from multiple studies. We checked RAVs that have more than two PCs from the same study.

## Check RAVs with more than one PC from a study
multiPCs_ind <- which(dup$Num_duplicated != 0 & dup$Num_duplicated != 1)
multiPCs <- dup[multiPCs_ind,]

## The number of RAVs with more than 2 PCs from a study
multiPCs_ind <- which(dup[,"Num_duplicated"] > dup[,"Studies_duplicated"]) # 77 RAVs

## `dup_RAVs` contains studies contributing an RAV with more than 2 PCs   
dup_RAVs_studies <- vector(mode = "list", length = length(multiPCs_ind))
dup_RAVs_PCs <- vector(mode = "list", length = length(multiPCs_ind))
dup_RAVs_gsea <- vector(mode = "list", length = length(multiPCs_ind))
dup_RAVs_clsize <- vector(mode = "list", length = length(multiPCs_ind))
names(dup_RAVs_studies) <- names(dup_RAVs_PCs) <- names(dup_RAVs_gsea) <- names(dup_RAVs_clsize) <- paste0("RAV", multiPCs_ind)
for (i in seq_along(multiPCs_ind)) {
    res <- getRAVInfo(RAVmodel, multiPCs_ind[i])
    dup_RAVs_studies[[i]] <- res$members$studyName
    dup_RAVs_PCs[[i]] <- res$members$PC
    dup_RAVs_gsea[[i]] <- res$enrichedPathways
    dup_RAVs_clsize[[i]] <- res$clusterSize
}

Characteristics of those 77 dup_RAVs are:

length(unlist(dup_RAVs_studies)) # 249 PCs 
length(unique(unlist(dup_RAVs_studies))) # only 45 studies

table(unlist(dup_RAVs_gsea)) # 42 RAVs with no enriched pathways
table(unlist(dup_RAVs_clsize)) # 64 are 3-element clusters

We summarized the duplication detail.

dup_RAVs_studies_tbl <- as.data.frame(matrix(nrow = length(dup_RAVs_studies),
                                             ncol = 3))
colnames(dup_RAVs_studies_tbl) <- c("RAV", "Study", "# duplicated")

for (i in seq_along(dup_RAVs_studies)) {
    res <- table(dup_RAVs_studies[[i]])
    dup_RAVs_studies_tbl[i,1] <- names(dup_RAVs_studies)[i] # RAV
    dup_RAVs_studies_tbl[i,2] <- names(res) # Study
    dup_RAVs_studies_tbl[i,3] <- res # duplicated number
}

dup_RAVs_studies_tbl$gsea <- unlist(dup_RAVs_gsea)
head(dup_RAVs_studies_tbl)

We checked any RAV with both PC1 and PC2 from the same study.

all_dup_RAVs <- which(dup$Dupicated == TRUE)

topPCs_RAVs <- vector(mode = "list", length = length(all_dup_RAVs))
names(topPCs_RAVs) <- paste0("RAV", all_dup_RAVs)

for (i in seq_along(all_dup_RAVs)) {
    x <- getRAVInfo(RAVmodel, all_dup_RAVs[i])$members
    dup_study <- x$studyName[duplicated(x$studyName)]
    res <- all(filter(x, studyName %in% dup_study)$PC %in% c(1,2,3))
    topPCs_RAVs[[i]] <- res
}

There are 37 RAVs which have both PC1 and PC2 from the same study.

sum(unlist(topPCs_RAVs)) # 37
## [1] 37
which(topPCs_RAVs == TRUE)
##   RAV23   RAV40   RAV59   RAV95  RAV206  RAV257  RAV272  RAV311  RAV450  RAV515 
##       1       2       6       9      17      19      23      26      43      47 
##  RAV518  RAV625  RAV654  RAV683  RAV702  RAV887  RAV937  RAV987 RAV1018 RAV1266 
##      48      55      57      58      60      88     103     113     116     145 
## RAV1303 RAV1345 RAV1406 RAV1455 RAV1528 RAV1726 RAV2058 RAV2112 RAV2686 RAV2902 
##     151     155     165     174     179     198     240     243     312     359 
## RAV2910 RAV3485 RAV3644 RAV3936 RAV4203 RAV4494 RAV4621 
##     361     451     477     529     594     679     710

We checked non-single-element clusters containing PCs from only one study.

ns_ind <- which(metadata(RAVmodel)$size != 1)
s_study <- which(sapply(studies(RAVmodel), length) == 1)

single_study_ind <- intersect(ns_ind, s_study)
length(single_study_ind)
## [1] 622

We checked the number of enriched pathways for 622 RAVs consist of only one study. 80% of them for RAVmodel_C2 and 82% of RAVmodel_PLIERpriors among 622 RAVs have no or one enriched pathway.

## C2
x <- gsea(RAVmodel)[filterList[[4]]]
x_count <- sapply(x, nrow)
table(x_count)
## x_count
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  19  20  22  29 
## 314 185  56  24  10   3   7   3   1   1   1   2   2   1   1   1   2   1   1   1 
##  30  32  33  53  72 
##   1   1   1   1   1

## PLIERpriors
x <- gsea(RAVmodel_PLIERpriors)[filterList[[4]]]
x_count <- sapply(x, nrow)
table(x_count)
## x_count
##   0   1   2   3   4   5   6   7   8   9  10  12 
## 401 109  36  31  25   8   4   2   2   1   2   1

Effect on the existing use cases

Overlap with the RAVs mentioned in the manuscript (ind_all).

dup_ind <- which(dup$Dupicated)
commonRAVs <-intersect(ind_all, dup_ind)

for (rav in commonRAVs) {
    res <- getRAVInfo(RAVmodel, rav)
    print(paste(paste0("RAV", rav), "contains", 
                res$clusterSize, "PCs from", 
                length(unique(res$members$studyName)), "unique studies"))
}
## [1] "RAV683 contains 11 PCs from 10 unique studies"
## [1] "RAV868 contains 7 PCs from 5 unique studies"
## [1] "RAV684 contains 19 PCs from 18 unique studies"
## [1] "RAV23 contains 13 PCs from 12 unique studies"

To guide the interpretation, GenomicSuperSignature gives a message when the output included any of the following RAVs:

  1. single-element RAVs,
  2. RAVs with no or too many enriched pathways, where ‘too-many’ is defined as 5% of input gene sets (276 and 31 for MSigDB C2 and PLIERpriors, respectively),
  3. non-single-element RAVs constructed from a single study. These criteria together include 2,557 RAVs.

This message is for guidance only and not for definitive interpretation.

## Message example
library(bcellViper)
data(bcellViper)
val_all <- validate(dset, RAVmodel)
heatmapTable(val_all, RAVmodel)
## RAV1076 can be filtered based on GSEA_C2
## RAV2538 can be filtered based on GSEA_C2
## RAV1139 can be filtered based on GSEA_C2

Manuscript Figures

Supplementary Figure 9

RAVs are constructed from different numbers of PCs, ranging from 1 to 24. Here, we plotted the number RAVs (y-axis) against the cluster sizes (x-axis) to show the distribution of RAV sizes.

Supplementary Figure 11

We summarized the gene set annotation status of RAVs based on the RAV sizes. We tested two RAVmodels A) RAVmodel annotated with MSigDB C2 and B) RAVmodel annotated with three gene sets provided through the PLIER package. RAVs without enriched pathways are labeled with teal and RAVs with one or more enriched pathways are in red.

## Warning: package 'EBImage' was built under R version 4.1.3

Session Info

sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] EBImage_4.36.0              bcellViper_1.30.0          
##  [3] ggplot2_3.3.5               dplyr_1.0.8                
##  [5] GenomicSuperSignature_1.3.6 SummarizedExperiment_1.24.0
##  [7] Biobase_2.54.0              GenomicRanges_1.46.1       
##  [9] GenomeInfoDb_1.30.1         IRanges_2.28.0             
## [11] S4Vectors_0.32.3            BiocGenerics_0.40.0        
## [13] MatrixGenerics_1.6.0        matrixStats_0.61.0         
## [15] BiocStyle_2.22.0           
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_2.0-3       ggsignif_0.6.3         rjson_0.2.21          
##   [4] ellipsis_0.3.2         rprojroot_2.0.2        circlize_0.4.14       
##   [7] XVector_0.34.0         fftwtools_0.9-11       GlobalOptions_0.1.2   
##  [10] fs_1.5.2               clue_0.3-60            ggpubr_0.4.0          
##  [13] farver_2.1.0           bit64_4.0.5            fansi_1.0.2           
##  [16] codetools_0.2-18       doParallel_1.0.17      cachem_1.0.6          
##  [19] knitr_1.37             jsonlite_1.8.0         broom_0.7.12          
##  [22] cluster_2.1.2          dbplyr_2.1.1           png_0.1-7             
##  [25] BiocManager_1.30.16    readr_2.1.2            compiler_4.1.2        
##  [28] httr_1.4.2             backports_1.4.1        assertthat_0.2.1      
##  [31] Matrix_1.4-0           fastmap_1.1.0          cli_3.2.0             
##  [34] htmltools_0.5.2        tools_4.1.2            gtable_0.3.0          
##  [37] glue_1.6.2             GenomeInfoDbData_1.2.7 rappdirs_0.3.3        
##  [40] Rcpp_1.0.8.3           carData_3.0-5          jquerylib_0.1.4       
##  [43] pkgdown_2.0.2          vctrs_0.3.8            iterators_1.0.14      
##  [46] xfun_0.30              stringr_1.4.0          lifecycle_1.0.1       
##  [49] rstatix_0.7.0          zlibbioc_1.40.0        scales_1.1.1          
##  [52] ragg_1.2.2             hms_1.1.1              parallel_4.1.2        
##  [55] RColorBrewer_1.1-2     ComplexHeatmap_2.10.0  yaml_2.3.5            
##  [58] curl_4.3.2             memoise_2.0.1          sass_0.4.0            
##  [61] stringi_1.7.6          RSQLite_2.2.10         highr_0.9             
##  [64] desc_1.4.1             foreach_1.5.2          tiff_0.1-11           
##  [67] filelock_1.0.2         shape_1.4.6            rlang_1.0.2           
##  [70] pkgconfig_2.0.3        systemfonts_1.0.4      bitops_1.0-7          
##  [73] evaluate_0.15          lattice_0.20-45        purrr_0.3.4           
##  [76] htmlwidgets_1.5.4      labeling_0.4.2         bit_4.0.4             
##  [79] tidyselect_1.1.2       magrittr_2.0.2         bookdown_0.25         
##  [82] R6_2.5.1               magick_2.7.3           generics_0.1.2        
##  [85] DelayedArray_0.20.0    DBI_1.1.2              pillar_1.7.0          
##  [88] withr_2.5.0            abind_1.4-5            RCurl_1.98-1.6        
##  [91] tibble_3.1.6           crayon_1.5.0           car_3.0-12            
##  [94] utf8_1.2.2             BiocFileCache_2.2.1    tzdb_0.2.0            
##  [97] rmarkdown_2.13         jpeg_0.1-9             GetoptLong_1.0.5      
## [100] locfit_1.5-9.5         grid_4.1.2             blob_1.2.2            
## [103] digest_0.6.29          tidyr_1.2.0            textshaping_0.3.6     
## [106] munsell_0.5.0          bslib_0.3.1