salmon.Rmd
The AnVIL project is an analysis, visualization, and informatics cloud-based space for data access, sharing and computing across large genomic-related data sets.
For R users with limited computing resources, we introduce the AnVILWorkflow package. This package allows users to run workflows implemented in Terra without installing software, writing any workflow, or managing cloud resources. Terra is a cloud-based genomics platform whose computing resources rely on Google Cloud Platform (GCP).
Use of this package requires AnVIL and Google cloud computing billing accounts. Consult AnVIL training guides for details on establishing these accounts.
If you use AnVILWorkflow within Terra’s RStudio, you need neither extra authentication nor the gcloud SDK. If you use this package locally, it requires the gcloud SDK and the billing account used in Terra. The gcloud SDK can be installed separately.
Check whether your system has the installation with `AnVIL::gcloud_exists()`. It should return `TRUE` for you to use the AnVILWorkflow package.
gcloud_exists()
#> [1] TRUE
If it returns `FALSE`, install the gcloud SDK using the following script:
devtools::install_github("rstudio/cloudml")
cloudml::gcloud_install()
## shell
$ gcloud auth login
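For local use, a quick package-independent sanity check is to ask whether a `gcloud` executable is visible on the `PATH`. The sketch below uses only base R. The helper name `has_gcloud` is made up for illustration, and the check complements rather than replaces `AnVIL::gcloud_exists()`.

```r
# Hypothetical helper (not part of AnVIL or AnVILWorkflow):
# TRUE when an executable called "gcloud" is found on the PATH.
has_gcloud <- function() {
    # Sys.which() returns "" when the command cannot be found
    unname(nzchar(Sys.which("gcloud")))
}

if (has_gcloud()) {
    message("gcloud found at: ", Sys.which("gcloud"))
} else {
    message("gcloud not found; install the SDK, then run `gcloud auth login`.")
}
```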
You need a Terra account. Once you have your own Terra account, you need two pieces of information to use the AnVILWorkflow package: the email address linked to your Terra account and the name of your billing project.
You can set up your working environment using the `setCloudEnv()`
function, as shown below. Provide the input values with YOUR account
information!
accountEmail <- "YOUR_EMAIL@gmail.com"
billingProjectName <- "YOUR_BILLING_ACCOUNT"
setCloudEnv(accountEmail = accountEmail,
billingProjectName = billingProjectName)
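To keep account details out of scripts, one option is to read them from environment variables. This is only a sketch: the variable names `TERRA_ACCOUNT_EMAIL` and `TERRA_BILLING_PROJECT` are made up for this example (AnVILWorkflow does not define them), and `setCloudEnv()` is called only when both values are set and the package is installed.

```r
# Hypothetical environment variables, e.g. defined in ~/.Renviron
accountEmail <- Sys.getenv("TERRA_ACCOUNT_EMAIL", unset = "")
billingProjectName <- Sys.getenv("TERRA_BILLING_PROJECT", unset = "")

# TRUE only when both values are non-empty
have_credentials <- nzchar(accountEmail) && nzchar(billingProjectName)

if (!have_credentials) {
    message("Set TERRA_ACCOUNT_EMAIL and TERRA_BILLING_PROJECT first.")
} else if (requireNamespace("AnVILWorkflow", quietly = TRUE)) {
    # Same call as in the vignette, with values taken from the environment
    AnVILWorkflow::setCloudEnv(accountEmail = accountEmail,
                               billingProjectName = billingProjectName)
}
```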
The remainder of this vignette assumes that a Terra account has been established and successfully linked to a Google cloud computing billing account.
The table below lists the major functions for the three workflow steps: prepare, run, and check results.
| Steps | Functions | Description |
|---|---|---|
| Prepare | cloneWorkspace | Copy the template workspace |
| | updateInput | Take user’s inputs |
| Run | runWorkflow | Launch the workflow in Terra |
| | stopWorkflow | Abort the submission |
| | monitorWorkflow | Monitor the status of your workflow run |
| Result | getOutput | List or download your workflow outputs |
You can find all the workspaces you have access to using the
`AnVIL::avworkspaces()` function. Workspaces maintained by the
Bioconductor team can be listed separately using the
`availableAnalysis()` function. The values under the `analysis` column
can be used for the `analysis` argument, simplifying the cloning
process. For this vignette, we use `"salmon"`.
availableAnalysis()
#> analysis
#> 1 bioBakery
#> 2 salmon
#> description
#> 1 Microbiome analysis using bioBakery
#> 2 Trascript quantification from RNAseq using Salmon | Differential gene expression analysis using DESeq2
analysis <- "salmon"
We refer to the existing workspaces that you have access to and want to
use for your analysis as ‘template’ workspaces. The first step in using
this package is cloning the template workspace with the
`cloneWorkspace` function. Note that you need to provide a unique name
for the cloned workspace through the `workspaceName` argument. Once the
workspace is successfully cloned, the function returns the name of the
cloned workspace. For example, successfully cloning a workspace named
`salmon_test` returns `{YOUR_BILLING_ACCOUNT}/salmon_test`. If the name
is already taken, the call fails with an HTTP 409 conflict, as in the
output below.
#> Error : 'avworkspace_clone' failed:
#> Conflict (HTTP 409).
#> Workspace waldronlab-terra-rstudio/salmon_test already exists
#> Error : 'avworkspace_clone' failed:
#> Conflict (HTTP 409).
#> Workspace waldronlab-terra-rstudio/salmon_test2 already exists
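The HTTP 409 conflicts above are what a name collision looks like: the requested workspace name already exists in the billing project. One way to reduce the chance of a collision, sketched below, is to append a timestamp to a base name. The helper `unique_workspace_name` is hypothetical, and the commented call assumes only the `workspaceName` and `analysis` arguments described in the text.

```r
# Hypothetical helper: build a workspace name that is unlikely to collide
# by appending the current date and time to a prefix.
unique_workspace_name <- function(prefix = "salmon_test") {
    paste0(prefix, "_", format(Sys.time(), "%Y%m%d_%H%M%S"))
}

workspaceName <- unique_workspace_name()
workspaceName

# With a configured account, the clone step would then look like this
# (not run here because it needs Terra access):
# cloneWorkspace(workspaceName = workspaceName, analysis = "salmon")
```

Two calls within the same second still produce the same name, so this lowers the risk of a conflict rather than removing it.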
If you want to clone any other workspace that you have access to but
that is not curated by this package, you can directly enter the name of
the target workspace as `templateName`. For example, to clone the
`Tumor_Only_CNV` workspace:
cloneWorkspace(workspaceName = "cnv_test",
templateName = "Tumor_Only_CNV")
#> Error in mget(arg_names, environment): object 'billingProjectName' not found
You can review the current inputs using the `currentInput` function.
The output below shows all the required and optional inputs for the
workflow.
current_input <- currentInput(workspaceName)
current_input
#> # A tibble: 4 × 4
#> inputType name optional attribute
#> <chr> <chr> <lgl> <chr>
#> 1 Array[File] salmon.fastqs_1 FALSE "this.participants.fastq…
#> 2 Array[File] salmon.fastqs_2 FALSE "this.participants.fastq…
#> 3 File salmon.transcriptome_fasta FALSE "\"gs://fc-78b3af8a-a1d5…
#> 4 String salmon.transcriptome_index_name FALSE "\"new_index_name\""
You can update the inputs of your workflow using the `updateInput`
function. To minimize formatting issues, we recommend making any change
in the current input table returned from the `currentInput` function.
Under the default (`dry = TRUE`), the updated input table is returned
without actually updating Terra/AnVIL. Set `dry = FALSE` to make the
change in Terra/AnVIL.
new_input <- current_input
new_input[4,4] <- "athal_index"
new_input
#> # A tibble: 4 × 4
#> inputType name optional attribute
#> <chr> <chr> <lgl> <chr>
#> 1 Array[File] salmon.fastqs_1 FALSE "this.participants.fastq…
#> 2 Array[File] salmon.fastqs_2 FALSE "this.participants.fastq…
#> 3 File salmon.transcriptome_fasta FALSE "\"gs://fc-78b3af8a-a1d5…
#> 4 String salmon.transcriptome_index_name FALSE "athal_index"
updateInput(workspaceName, inputs = new_input)
#> List of 9
#> $ deleted : logi FALSE
#> $ inputs :List of 4
#> ..$ salmon.fastqs_1 : 'scalar' chr "this.participants.fastq_1"
#> ..$ salmon.fastqs_2 : 'scalar' chr "this.participants.fastq_2"
#> ..$ salmon.transcriptome_fasta : 'scalar' chr "\"gs://fc-78b3af8a-a1d5-4c43-9a84-4b9036230184/athal.fa.gz\""
#> ..$ salmon.transcriptome_index_name: 'scalar' chr "\"athal_index\""
#> $ methodConfigVersion: int 9
#> $ methodRepoMethod :List of 4
#> ..$ methodUri : chr "dockstore://github.com%2FKayla-Morrell%2FAnVILBulkRNASeq/master"
#> ..$ sourceRepo : chr "dockstore"
#> ..$ methodPath : chr "github.com/Kayla-Morrell/AnVILBulkRNASeq"
#> ..$ methodVersion: chr "master"
#> $ name : chr "AnVILBulkRNASeq"
#> $ namespace : chr "bioconductor-rpci-anvil"
#> $ outputs :List of 2
#> ..$ salmon.salmon_index.transcriptome_index: 'scalar' chr ""
#> ..$ salmon.salmon_quant.quant_output : 'scalar' chr ""
#> $ prerequisites : Named list()
#> $ rootEntityType : chr "participant_set"
#> - attr(*, "class")= chr [1:2] "avworkflow_configuration" "list"
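Indexing by position, as in `new_input[4,4]`, edits the wrong cell if the row order ever changes. A sketch of a name-based alternative follows. It uses a plain data frame that mocks the input table (values copied from the printed output above), so it runs without Terra access; with a real workspace, `current_input` would come from `currentInput()` instead.

```r
# Mock of the table returned by currentInput(), copied from the output above
current_input <- data.frame(
    inputType = c("Array[File]", "Array[File]", "File", "String"),
    name = c("salmon.fastqs_1", "salmon.fastqs_2",
             "salmon.transcriptome_fasta",
             "salmon.transcriptome_index_name"),
    optional = FALSE,
    attribute = c(
        "this.participants.fastq_1",
        "this.participants.fastq_2",
        "\"gs://fc-78b3af8a-a1d5-4c43-9a84-4b9036230184/athal.fa.gz\"",
        "\"new_index_name\""
    ),
    stringsAsFactors = FALSE
)

new_input <- current_input

# Locate the row by input name and make sure exactly one row matches
target <- new_input$name == "salmon.transcriptome_index_name"
stopifnot(sum(target) == 1)

new_input$attribute[target] <- "athal_index"
new_input[target, c("name", "attribute")]
```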
You can launch the workflow using the `runWorkflow()` function. You
need to specify the `inputName` of your workflow. If you don’t provide
it, this function will return the list of input names you can use for
your workflow.
runWorkflow(workspaceName)
#> You should provide the inputName from the followings:
#> [1] "AnVILBulkRNASeq_set"
#> Error in runWorkflow(workspaceName):
Example error outputs:
# You should provide the inputName from the followings:
# [1] "AnVILBulkRNASeq_set"
runWorkflow(workspaceName, inputName = "AnVILBulkRNASeq_set")
#> [1] "Workflow is succesfully launched."
You can monitor your workflow runs using the `monitorWorkflow`
function. In the returned table, the `status`, `succeeded`, and
`failed` columns show the submission and the result status.
submissions <- monitorWorkflow(workspaceName = "salmon_test")
submissions
#> # A tibble: 24 × 9
#> submissio…¹ submi…² submissionDate status succe…³ failed submi…⁴ names…⁵
#> <chr> <chr> <dttm> <chr> <int> <int> <chr> <chr>
#> 1 d2a06522-e… shbrie… 2022-11-01 10:10:16 Submi… 0 0 gs://f… waldro…
#> 2 8762e2cf-8… shbrie… 2022-10-27 22:56:42 Abort… 0 0 gs://f… waldro…
#> 3 6c0eed0c-9… shbrie… 2022-10-27 22:55:10 Abort… 0 0 gs://f… waldro…
#> 4 7598308b-9… shbrie… 2022-10-27 22:54:31 Abort… 0 0 gs://f… waldro…
#> 5 e2b6921e-2… shbrie… 2022-10-26 16:02:34 Abort… 0 0 gs://f… waldro…
#> 6 018bebd5-7… shbrie… 2022-10-26 16:01:23 Abort… 0 0 gs://f… waldro…
#> 7 a9704eed-d… shbrie… 2022-10-26 16:00:45 Abort… 0 0 gs://f… waldro…
#> 8 7ac90e50-8… shbrie… 2022-10-26 15:22:08 Abort… 0 0 gs://f… waldro…
#> 9 6ed7afa9-b… shbrie… 2022-10-26 15:15:20 Abort… 0 0 gs://f… waldro…
#> 10 25975032-f… shbrie… 2022-10-25 14:16:26 Abort… 0 0 gs://f… waldro…
#> # … with 14 more rows, 1 more variable: name <chr>, and abbreviated variable
#> # names ¹submissionId, ²submitter, ³succeeded, ⁴submissionRoot, ⁵namespace
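Because the result is a regular table, the submission of interest can be selected programmatically. The sketch below works on a small mock table with made-up IDs; only the column names (`submissionId`, `submissionDate`, `succeeded`, `failed`) are taken from the output above. It picks the most recent submission that has at least one succeeded and no failed workflows.

```r
# Mock submissions table; the IDs and counts are invented for illustration
submissions <- data.frame(
    submissionId = c("id-a", "id-b", "id-c"),
    submissionDate = as.POSIXct(
        c("2022-11-01 10:10:16", "2022-10-27 22:56:42",
          "2022-10-25 14:16:26"),
        tz = "UTC"
    ),
    succeeded = c(0L, 1L, 1L),
    failed = c(0L, 0L, 0L),
    stringsAsFactors = FALSE
)

# Keep submissions with successes and without failures
ok <- submissions[submissions$succeeded > 0 & submissions$failed == 0, ]

# Sort newest first and take the first ID
ok <- ok[order(ok$submissionDate, decreasing = TRUE), ]
latest_ok_id <- ok$submissionId[1]
latest_ok_id
```

An ID selected this way could then be passed on as the `submissionId` argument of the functions that accept one.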
You can abort the most recently submitted job using the `stopWorkflow`
function. To abort a workflow other than the most recently submitted
one, provide its specific `submissionId`.
stopWorkflow(workspaceName)
#> Status of the submitted job (submissionId: d2a06522-e870-476e-a165-9cc09e168989)
#> [1] "Workflow is succesfully aborted."
You can check all the output files from the most recent successful
submission using the `getOutput` function. If you specify the
`submissionId` argument, you get the output files of that specific
submission.
## Output from the aborted submission
output_from_aborted <- getOutput(workspaceName = "salmon_test",
submissionId = "83c8be07-c024-40e8-94c9-cd6756f738ac")
## Output from the successfully completed submission
out <- getOutput(workspaceName = "salmon_test",
submissionId = "4ff369e0-eda0-43c7-8107-1b11f4cc2056")
dim(out)
#> [1] 17 2
head(out)
#> filename name
#> 1 test_index.tar.gz ,"salmon.salmon_index.transcriptome_index"
#> 2 DRR016125_1.tar.gz ,"salmon.salmon_quant.quant_output"
#> 3 DRR016126_1.tar.gz ,"salmon.salmon_quant.quant_output"
#> 4 DRR016127_1.tar.gz ,"salmon.salmon_quant.quant_output"
#> 5 DRR016128_1.tar.gz ,"salmon.salmon_quant.quant_output"
#> 6 DRR016129_1.tar.gz ,"salmon.salmon_quant.quant_output"
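The output table can be filtered like any other data frame, for example to keep only the quantification archives and leave out the index. The sketch below uses a three-row mock modeled on the output above (the `name` values keep the leading comma and quotes as printed) and matches on a fixed substring so that those extra characters do not matter.

```r
# Mock of the getOutput() result, modeled on the first rows printed above
out <- data.frame(
    filename = c("test_index.tar.gz", "DRR016125_1.tar.gz",
                 "DRR016126_1.tar.gz"),
    name = c(',"salmon.salmon_index.transcriptome_index"',
             ',"salmon.salmon_quant.quant_output"',
             ',"salmon.salmon_quant.quant_output"'),
    stringsAsFactors = FALSE
)

# Fixed-string match on the workflow output name
is_quant <- grepl("salmon_quant.quant_output", out$name, fixed = TRUE)
quant_files <- out$filename[is_quant]
quant_files
```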
sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Monterey 12.2.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] AnVILWorkflow_0.99.19 AnVIL_1.9.13.2 dplyr_1.0.10
#> [4] BiocStyle_2.22.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.0 xfun_0.34 bslib_0.4.0
#> [4] purrr_0.3.5 vctrs_0.5.0 generics_0.1.3
#> [7] miniUI_0.1.1.1 htmltools_0.5.3 yaml_2.3.6
#> [10] utf8_1.2.2 rlang_1.0.6 pkgdown_2.0.6
#> [13] jquerylib_0.1.4 pillar_1.8.1 later_1.3.0
#> [16] withr_2.5.0 rapiclient_0.1.3 glue_1.6.2
#> [19] DBI_1.1.3 lambda.r_1.2.4 lifecycle_1.0.3
#> [22] stringr_1.4.1 futile.logger_1.4.3 ragg_1.2.4
#> [25] htmlwidgets_1.5.4 memoise_2.0.1 evaluate_0.17
#> [28] knitr_1.40 fastmap_1.1.0 httpuv_1.6.6
#> [31] curl_4.3.3 parallel_4.1.2 fansi_1.0.3
#> [34] Rcpp_1.0.9 xtable_1.8-4 promises_1.2.0.1
#> [37] DT_0.26 formatR_1.12 BiocManager_1.30.19
#> [40] cachem_1.0.6 desc_1.4.2 jsonlite_1.8.3
#> [43] mime_0.12 systemfonts_1.0.4 fs_1.5.2
#> [46] textshaping_0.3.6 digest_0.6.30 stringi_1.7.8
#> [49] bookdown_0.29 shiny_1.7.3 rprojroot_2.0.3
#> [52] cli_3.4.1 tools_4.1.2 magrittr_2.0.3
#> [55] sass_0.4.2 tibble_3.1.8 futile.options_1.0.1
#> [58] tidyr_1.2.1 pkgconfig_2.0.3 ellipsis_0.3.2
#> [61] assertthat_0.2.1 rmarkdown_2.17 httr_1.4.4
#> [64] rstudioapi_0.14 R6_2.5.1 compiler_4.1.2