vignettes/training_datasets/available_samples.Rmd
available_samples.Rmd
Abstract
Source CodeWe used human RNA sequencing datasets from RNA-seq Sample Compendia in refine.bio, which hosts uniformly processed gene expression data from EBI’s ArrayExpress, NCBI’s GEO and SRA. Data were downloaded on April 10th, 2020, and we selected studies based on the following criteria: 1) Exclude studies with more than 1,000 samples because they are more likely to be single-cell RNA sequencing datasets. 2) Exclude studies assigned with a Medical Subject Headings (MeSH) term, “Single-Cell Analysis”. 3) Exclude studies with less than or equal to 50 successfully downloaded and imported samples. After filtering, the complete compendium includes 536 studies (defined as a single SRA study) comprising 44,890 samples.
The detail of training data is summarized in studyMeta.tsv
table in GenomicSuperSignature
package. This vignette explains how this summary table is built.
extdata <- system.file("extdata", package = "GenomicSuperSignature")
studyMeta <- read.table(file.path(extdata, "studyMeta.tsv.gz"))
head(studyMeta)
## studyName downloaded metadata imported RAVmodel_677 RAVmodel_1399
## 2 DRP000425 4 4 FALSE FALSE FALSE
## 3 DRP000464 8 8 FALSE FALSE FALSE
## 4 DRP000499 21 21 TRUE FALSE TRUE
## 5 DRP000527 2 4 FALSE FALSE FALSE
## 6 DRP000665 7 7 FALSE FALSE FALSE
## 7 DRP000929 4 4 FALSE FALSE FALSE
## RAVmodel_536
## 2 FALSE
## 3 FALSE
## 4 FALSE
## 5 FALSE
## 6 FALSE
## 7 FALSE
## title
## 2 Homo sapiens T24 human bladder cancer cell line Transcriptome
## 3 pre-miRNA profiles obtained through application of locked nucleic acids reveals complex 5’/3’ arm variation including concomitant cleavage and polyuridylation patterns
## 4 Homo sapiens strain:Human ICESeq Genome sequencing
## 5 RNA sequencing of wild-type or mutant U2AF35 transduced HeLa Cells
## 6 Global transcriptional response against glucose deprivation
## 7 Human inactive X chromosome is compacted through a polycomb-independent SMCHD1-HBiX1 pathway [RNA-seq]
aggregated_metadata.json
is downloaded along with the training dataset. We collect the number of samples per study based on this metadata.
data.dir <- system.file("extdata", package = "GenomicSuperSignaturePaper")
meta_json <- jsonlite::fromJSON(file.path(data.dir,
"aggregated_metadata.json"))
experiments <- meta_json[[2]]
studies <- list()
for (i in seq_along(experiments)) {
studies <- append(studies, list(experiments[[i]]$sample_accession_codes))
names(studies)[i] <- names(experiments)[i]
}
sampleNum <- sapply(studies, function(x) length(x))
I summarized the number of samples based on aggregated_metadata.json
under metadata
column and the number of actually downloaded _quant.sf
sample files under downloaded
column in the studyMeta
table.
# download.dir <- "path/to/download/directory"
dirNames <- list.files(download.dir)
ind_rm <- which(dirNames %in%
c("aggregated_metadata.json", "README.md", "LICENSE.TXT"))
if (length(ind_rm) != 0) {dirNames <- dirNames[-ind_rm]}
length(dirNames) # 6457 studies were downloaded
## [1] 6457
studyMeta <- as.data.frame(matrix(NA, nrow = length(dirNames), ncol = 3))
colnames(studyMeta) <- c("studyName", "downloaded", "metadata")
studyMeta$studyName <- dirNames
for (i in seq_along(dirNames)) {
dir_path <- file.path(download.dir, dirNames[i])
## actually downloaded samples
ind1 <- which(studyMeta$studyName == dirNames[i])
studyMeta$downloaded[ind1] <- grep("_quant.sf",
list.files(dir_path)) %>% length
## available samples based on metadata
ind2 <- which(names(sampleNum) == dirNames[i])
studyMeta$metadata[ind2] <- sampleNum[ind2]
}
1399 studies with more than 20 samples were successfully imported by tximport::tximport
studyNames <- studyMeta$studyName
importedFiles <- c()
# download.dir2 <- "path/to/imported/directory"
for (i in seq_along(studyNames)) {
out.path <- file.path(download.dir2, studyNames[i])
fname <- file.path(out.path, paste0(studyNames[i], ".rds"))
if (file.exists(fname)) {
importedFiles <- c(importedFiles, studyNames[i])
}
}
imported_ind <- which(studyMeta$studyName %in% importedFiles)
studyMeta$imported <- FALSE
studyMeta$imported[imported_ind] <- TRUE
Study title is also added.
for (i in seq_along(experiments)) {
title <- experiments[[i]]$title
studyMeta$title[i] <- title
}
Example study, SRP152576
:
- 13,127 samples are available based on aggregated_metadata.json
.
- No sample is available based on refine.bio webpage.
- 100 samples are actually downloaded.
- 3,927 samples are available based on metadata_SRP152576.csv
.
ind <- which.max(studyMeta$metadata)
studyMeta[ind,]
## studyName downloaded metadata imported
## 5530 SRP152576 100 13127 TRUE
## title
## 5530 In vivo molecular signatures of severe dengue infection revealed by viscRNA-Seq
file_dir <- file.path(download.dir, studyMeta$studyName[ind])
fname <- paste0("metadata_", studyMeta$studyName[ind], ".tsv")
sampleMeta <- read.table(file.path(file_dir, fname), header = TRUE, sep = "\t")
dim(sampleMeta)
## [1] 3927 19
head(sampleMeta, 3)
## refinebio_accession_code experiment_accession refinebio_age
## 1 SRR7481117 SRP152576 14
## 2 SRR7480712 SRP152576 14
## 3 SRR7483223 SRP152576 31
## refinebio_cell_line refinebio_compound refinebio_disease
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## refinebio_disease_stage refinebio_genetic_information refinebio_organism
## 1 NA NA HOMO_SAPIENS
## 2 NA NA HOMO_SAPIENS
## 3 NA NA HOMO_SAPIENS
## refinebio_platform refinebio_race refinebio_sex
## 1 NextSeq 500 (NextSeq500) NA male
## 2 NextSeq 500 (NextSeq500) NA male
## 3 NextSeq 500 (NextSeq500) NA female
## refinebio_source_archive_url refinebio_source_database
## 1 NA SRA
## 2 NA SRA
## 3 NA SRA
## refinebio_specimen_part refinebio_subject refinebio_time refinebio_title
## 1 blood blood (pbmcs) NA 1001701103_E10
## 2 blood blood (pbmcs) NA 1001701101_C19
## 3 blood blood (pbmcs) NA 1001701401_P23
## refinebio_treatment
## 1 NA
## 2 NA
## 3 NA
Metadata bar (light blue) shows the number of studies with the given ranges of study sizes based on the metadata. Downloaded bar (pink) represents the number of studies with the given ranges of study sizes that were successfully downloaded and imported through tximport
. Based on metadata, there were studies with more than 100 samples, but at the time of snapshot, only up to 100 samples were available. Thus, the plot displays only up to 100 samples/study cases. Due to the unavailability of certain samples, more studies belong to 0-5 samples/study bracket than metadata suggests.
## Cases where all the available samples (based on metadata) were downloaded
sum(studyMeta$downloaded == studyMeta$metadata)
## [1] 3135
## The number of studies with more than 100 samples:
## Actually downloaded vs. expected by metadata
sum(studyMeta$downloaded > 100)
## [1] 0
sum(studyMeta$metadata > 100)
## [1] 426
data.dir <- system.file("extdata", package = "GenomicSuperSignaturePaper")
fname <- file.path(data.dir, "source_name_annotated.tsv")
source_name_annotated <- read.csv(fname, sep = "\t")
studies_with_error.csv
file contains the sample names that were failed to be downloaded/imported when I tried to use studies > 50 samples based on the metadata. You can find more documentation on this under Methods/model_building
.
## Table with study and run accessions
saveRDS(res_all, "accessions.rds")
## Training data used for GenomicSuperSignature
res_all <- readRDS("accessions.rds")
## Training data used for multiPLIER
recount2 <- read.table("https://raw.githubusercontent.com/greenelab/multi-plier/master/data/sample_info/recount2_srr_srs_srp.tsv", header = TRUE)
head(res_all, 3)
## run_accession source study_accession sample_accession experiment_accession
## 1 DRR006374 SRA DRP000987 DRS005720 DRX005592
## 2 DRR006375 SRA DRP000987 DRS005721 DRX005593
## 3 DRR006376 SRA DRP000987 DRS005722 DRX005594
head(recount2, 3)
## run study sample
## 1 SRR013549 SRP000599 SRS002125
## 2 SRR013550 SRP000599 SRS002126
## 3 SRR013551 SRP000599 SRS002127
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_1.0.8 BiocStyle_2.22.0
##
## loaded via a namespace (and not attached):
## [1] highr_0.9 bslib_0.3.1 compiler_4.1.2
## [4] pillar_1.7.0 BiocManager_1.30.16 jquerylib_0.1.4
## [7] tools_4.1.2 digest_0.6.29 jsonlite_1.8.0
## [10] evaluate_0.15 memoise_2.0.1 lifecycle_1.0.1
## [13] tibble_3.1.6 pkgconfig_2.0.3 rlang_1.0.2
## [16] DBI_1.1.2 cli_3.2.0 yaml_2.3.5
## [19] pkgdown_2.0.2 xfun_0.30 fastmap_1.1.0
## [22] stringr_1.4.0 knitr_1.37 desc_1.4.1
## [25] generics_0.1.2 fs_1.5.2 sass_0.4.1
## [28] vctrs_0.3.8 systemfonts_1.0.4 tidyselect_1.1.2
## [31] rprojroot_2.0.2 glue_1.6.2 R6_2.5.1
## [34] textshaping_0.3.6 fansi_1.0.2 rmarkdown_2.13
## [37] bookdown_0.25 tidyr_1.2.0 purrr_0.3.4
## [40] magrittr_2.0.2 htmltools_0.5.2 ellipsis_0.3.2
## [43] assertthat_0.2.1 ragg_1.2.2 utf8_1.2.2
## [46] stringi_1.7.6 cachem_1.0.6 crayon_1.5.0