Training datasets for RAVmodel building

Abstract

Source

We used human RNA sequencing datasets from RNA-seq Sample Compendia in refine.bio, which hosts uniformly processed gene expression data from EBI’s ArrayExpress, NCBI’s GEO and SRA. Data were downloaded on April 10th, 2020, and we selected studies based on the following criteria: 1) Exclude studies with more than 1,000 samples because they are more likely to be single-cell RNA sequencing datasets. 2) Exclude studies assigned with a Medical Subject Headings (MeSH) term, “Single-Cell Analysis”. 3) Exclude studies with less than or equal to 50 successfully downloaded and imported samples. After filtering, the complete compendium includes 536 studies (defined as a single SRA study) comprising 44,890 samples.

The detail of training data is summarized in studyMeta.tsv table in GenomicSuperSignature package. This vignette explains how this summary table is built.

extdata <- system.file("extdata", package = "GenomicSuperSignature")
studyMeta <- read.table(file.path(extdata, "studyMeta.tsv.gz"))
head(studyMeta)
##   studyName downloaded metadata imported RAVmodel_677 RAVmodel_1399
## 2 DRP000425          4        4    FALSE        FALSE         FALSE
## 3 DRP000464          8        8    FALSE        FALSE         FALSE
## 4 DRP000499         21       21     TRUE        FALSE          TRUE
## 5 DRP000527          2        4    FALSE        FALSE         FALSE
## 6 DRP000665          7        7    FALSE        FALSE         FALSE
## 7 DRP000929          4        4    FALSE        FALSE         FALSE
##   RAVmodel_536
## 2        FALSE
## 3        FALSE
## 4        FALSE
## 5        FALSE
## 6        FALSE
## 7        FALSE
##                                                                                                                                                                     title
## 2                                                                                                           Homo sapiens T24 human bladder cancer cell line Transcriptome
## 3 pre-miRNA profiles obtained through application of locked nucleic acids reveals complex 5’/3’ arm variation including concomitant cleavage and polyuridylation patterns
## 4                                                                                                                      Homo sapiens strain:Human ICESeq Genome sequencing
## 5                                                                                                      RNA sequencing of wild-type or mutant U2AF35 transduced HeLa Cells
## 6                                                                                                             Global transcriptional response against glucose deprivation
## 7                                                                  Human inactive X chromosome is compacted through a polycomb-independent SMCHD1-HBiX1 pathway [RNA-seq]

Availability: metadata

aggregated_metadata.json is downloaded along with the training dataset. We collect the number of samples per study based on this metadata.

data.dir <- system.file("extdata", package = "GenomicSuperSignaturePaper")
meta_json <- jsonlite::fromJSON(file.path(data.dir, 
                                          "aggregated_metadata.json"))
experiments <- meta_json[[2]] 

studies <- list()
for (i in seq_along(experiments)) {
  studies <- append(studies, list(experiments[[i]]$sample_accession_codes))
  names(studies)[i] <- names(experiments)[i]
}

sampleNum <- sapply(studies, function(x) length(x))

Availability: download/import

I summarized the number of samples based on aggregated_metadata.json under metadata column and the number of actually downloaded _quant.sf sample files under downloaded column in the studyMeta table.

# download.dir <- "path/to/download/directory"
dirNames <- list.files(download.dir)

ind_rm <- which(dirNames %in% 
                  c("aggregated_metadata.json", "README.md", "LICENSE.TXT"))
if (length(ind_rm) != 0) {dirNames <- dirNames[-ind_rm]}
length(dirNames) # 6457 studies were downloaded
## [1] 6457

studyMeta <- as.data.frame(matrix(NA, nrow = length(dirNames), ncol = 3))
colnames(studyMeta) <- c("studyName", "downloaded", "metadata") 
studyMeta$studyName <- dirNames

for (i in seq_along(dirNames)) {
  dir_path <- file.path(download.dir, dirNames[i])
  
  ## actually downloaded samples
  ind1 <- which(studyMeta$studyName == dirNames[i]) 
  studyMeta$downloaded[ind1] <- grep("_quant.sf", 
                                     list.files(dir_path)) %>% length
   
  ## available samples based on metadata
  ind2 <- which(names(sampleNum) == dirNames[i])  
  studyMeta$metadata[ind2] <- sampleNum[ind2]
}

Imported samples

1399 studies with more than 20 samples were successfully imported by tximport::tximport

studyNames <- studyMeta$studyName
importedFiles <- c()

# download.dir2 <- "path/to/imported/directory"
for (i in seq_along(studyNames)) {
  out.path <- file.path(download.dir2, studyNames[i])
  fname <- file.path(out.path, paste0(studyNames[i], ".rds"))
  if (file.exists(fname)) {
    importedFiles <- c(importedFiles, studyNames[i])
  }
}

imported_ind <- which(studyMeta$studyName %in% importedFiles)
studyMeta$imported <- FALSE
studyMeta$imported[imported_ind] <- TRUE

Study title is also added.

for (i in seq_along(experiments)) {
  title <- experiments[[i]]$title
  studyMeta$title[i] <- title
}

Example

Example study, SRP152576:
- 13,127 samples are available based on aggregated_metadata.json.
- No sample is available based on refine.bio webpage.
- 100 samples are actually downloaded.
- 3,927 samples are available based on metadata_SRP152576.csv.

ind <- which.max(studyMeta$metadata)  
studyMeta[ind,]
##      studyName downloaded metadata imported
## 5530 SRP152576        100    13127     TRUE
##                                                                                title
## 5530 In vivo molecular signatures of severe dengue infection revealed by viscRNA-Seq

file_dir <- file.path(download.dir, studyMeta$studyName[ind])
fname <- paste0("metadata_", studyMeta$studyName[ind], ".tsv")
sampleMeta <- read.table(file.path(file_dir, fname), header = TRUE, sep = "\t")
dim(sampleMeta)
## [1] 3927   19

head(sampleMeta, 3)
##   refinebio_accession_code experiment_accession refinebio_age
## 1               SRR7481117            SRP152576            14
## 2               SRR7480712            SRP152576            14
## 3               SRR7483223            SRP152576            31
##   refinebio_cell_line refinebio_compound refinebio_disease
## 1                  NA                 NA                NA
## 2                  NA                 NA                NA
## 3                  NA                 NA                NA
##   refinebio_disease_stage refinebio_genetic_information refinebio_organism
## 1                      NA                            NA       HOMO_SAPIENS
## 2                      NA                            NA       HOMO_SAPIENS
## 3                      NA                            NA       HOMO_SAPIENS
##         refinebio_platform refinebio_race refinebio_sex
## 1 NextSeq 500 (NextSeq500)             NA          male
## 2 NextSeq 500 (NextSeq500)             NA          male
## 3 NextSeq 500 (NextSeq500)             NA        female
##   refinebio_source_archive_url refinebio_source_database
## 1                           NA                       SRA
## 2                           NA                       SRA
## 3                           NA                       SRA
##   refinebio_specimen_part refinebio_subject refinebio_time refinebio_title
## 1                   blood     blood (pbmcs)             NA  1001701103_E10
## 2                   blood     blood (pbmcs)             NA  1001701101_C19
## 3                   blood     blood (pbmcs)             NA  1001701401_P23
##   refinebio_treatment
## 1                  NA
## 2                  NA
## 3                  NA

Summary plot

Availability

Metadata bar (light blue) shows the number of studies with the given ranges of study sizes based on the metadata. Downloaded bar (pink) represents the number of studies with the given ranges of study sizes that were successfully downloaded and imported through tximport. Based on metadata, there were studies with more than 100 samples, but at the time of snapshot, only up to 100 samples were available. Thus, the plot displays only up to 100 samples/study cases. Due to the unavailability of certain samples, more studies belong to 0-5 samples/study bracket than metadata suggests.

## Cases where all the available samples (based on metadata) were downloaded
sum(studyMeta$downloaded == studyMeta$metadata)
## [1] 3135

## The number of studies with more than 100 samples: 
## Actually downloaded vs. expected by metadata
sum(studyMeta$downloaded > 100)
## [1] 0
sum(studyMeta$metadata > 100)
## [1] 426

Types of data

data.dir <- system.file("extdata", package = "GenomicSuperSignaturePaper")
fname <- file.path(data.dir, "source_name_annotated.tsv")
source_name_annotated <- read.csv(fname, sep = "\t")

Cancer

Blood

Cell Line

Studies used for model buildling

RAVmodel_677

studies_with_error.csv file contains the sample names that were failed to be downloaded/imported when I tried to use studies > 50 samples based on the metadata. You can find more documentation on this under Methods/model_building.

RAVmodel_1399

All the studies that have more than 20 successfully imported samples. So, it’s practically same as imported column.

RAVmodel_536

These 536 studies meet the criteria, > 50 (downloaded) and <1000 (metadata) samples per study. Also, any study with “Single-Cell Analysis” MeSH term was removed.

MultiPLIER vs. RAVmodel

## Table with study and run accessions
saveRDS(res_all, "accessions.rds")

## Training data used for GenomicSuperSignature
res_all <- readRDS("accessions.rds")

## Training data used for multiPLIER
recount2 <- read.table("https://raw.githubusercontent.com/greenelab/multi-plier/master/data/sample_info/recount2_srr_srs_srp.tsv", header = TRUE)

head(res_all, 3)
##   run_accession source study_accession sample_accession experiment_accession
## 1     DRR006374    SRA       DRP000987        DRS005720            DRX005592
## 2     DRR006375    SRA       DRP000987        DRS005721            DRX005593
## 3     DRR006376    SRA       DRP000987        DRS005722            DRX005594
head(recount2, 3)
##         run     study    sample
## 1 SRR013549 SRP000599 SRS002125
## 2 SRR013550 SRP000599 SRS002126
## 3 SRR013551 SRP000599 SRS002127

multiPLIER used 37027 unique runs from 30301 samples in 1466 studies.
GenomicSuperSignature used 44890 unique runs from 34616 samples in 535 studies.
There are 6839 overlapping runs from 5260 samples in 87 studies used for both models.

Session Info

sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_1.0.8      BiocStyle_2.22.0
## 
## loaded via a namespace (and not attached):
##  [1] highr_0.9           bslib_0.3.1         compiler_4.1.2     
##  [4] pillar_1.7.0        BiocManager_1.30.16 jquerylib_0.1.4    
##  [7] tools_4.1.2         digest_0.6.29       jsonlite_1.8.0     
## [10] evaluate_0.15       memoise_2.0.1       lifecycle_1.0.1    
## [13] tibble_3.1.6        pkgconfig_2.0.3     rlang_1.0.2        
## [16] DBI_1.1.2           cli_3.2.0           yaml_2.3.5         
## [19] pkgdown_2.0.2       xfun_0.30           fastmap_1.1.0      
## [22] stringr_1.4.0       knitr_1.37          desc_1.4.1         
## [25] generics_0.1.2      fs_1.5.2            sass_0.4.1         
## [28] vctrs_0.3.8         systemfonts_1.0.4   tidyselect_1.1.2   
## [31] rprojroot_2.0.2     glue_1.6.2          R6_2.5.1           
## [34] textshaping_0.3.6   fansi_1.0.2         rmarkdown_2.13     
## [37] bookdown_0.25       tidyr_1.2.0         purrr_0.3.4        
## [40] magrittr_2.0.2      htmltools_0.5.2     ellipsis_0.3.2     
## [43] assertthat_0.2.1    ragg_1.2.2          utf8_1.2.2         
## [46] stringi_1.7.6       cachem_1.0.6        crayon_1.5.0

Sehyun Oh

2022-03-24