--- title: "13-Apul-sRNAseq-ShortStack" author: "Sam White" date: "2023-11-03" output: bookdown::pdf_document2: latex_engine: xelatex bookdown::html_document2: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true bibliography: references.bib link-citations: true --- ```{r setup, include=FALSE} library(knitr) library(kableExtra) library(dplyr) library(reticulate) knitr::opts_chunk$set( echo = TRUE, # Display code chunks eval = FALSE, # Evaluate code chunks warning = FALSE, # Hide warnings message = FALSE, # Hide messages comment = "" # Prevents appending '##' to beginning of lines in code output ) ``` Use [ShortStack](https://github.com/MikeAxtell/ShortStack) [@axtell2013a; @shahid2014; @johnson2016a]to perform alignment of sRNAseq data and annotation of sRNA-producing genes. The *A.millepora* genome will be used as the reference genome for *A.pulchra*, as *A.pulchra* does not currently have a sequenced genome and *A.millepora* had highest alignment rates for standard RNAseq data compared to other published genomes tested. ------------------------------------------------------------------------ Inputs: - Requires trimmed sRNAseq files generated by [08-Apul-sRNAseq-trimming.Rmd](https://github.com/urol-e5/deep-dive/blob/main/D-Apul/code/08-Apul-sRNAseq-trimming.Rmd) - Filenames formatted: `*flexbar_trim.25bp*.gz` - *A.millepora* genome FastA. See [12-Apul-sRNAseq-MirMachine.Rmd](https://github.com/urol-e5/deep-dive/blob/main/D-Apul/code/12-Apul-sRNAseq-MirMachine.Rmd) for download info if needed. Outputs: - See [ShortStack outputs documentation](https://github.com/MikeAxtell/ShortStack#outputs) for full list and detailed descriptions. Software requirements: - Utilizes a [ShortStack](https://github.com/MikeAxtell/ShortStack#installation) Conda/Mamba environment, per the installation instructions. Replace with name of your ShortStack environment and the path to the corresponding conda installation (find this *after* you've activated the environment). E.g. ``` bash # Activate environment conda activate ShortStack4_env # Find conda path which conda ``` ------------------------------------------------------------------------ # Set R variables ```{r R-variables, eval=TRUE} shortstack_conda_env_name <- c("ShortStack4_env") shortstack_cond_path <- c("/home/sam/programs/mambaforge/condabin/conda") ``` # Create a Bash variables file This allows usage of Bash variables across R Markdown chunks. ```{r save-bash-variables-to-rvars-file, engine='bash', eval=TRUE} { echo "#### Assign Variables ####" echo "" echo "# Trimmed FastQ naming pattern" echo "export trimmed_fastqs_pattern='*flexbar_trim.25bp*.fastq.gz'" echo "# Data directories" echo 'export deep_dive_dir=/home/shared/8TB_HDD_01/sam/gitrepos/deep-dive' echo 'export deep_dive_data_dir="${deep_dive_dir}/data"' echo 'export output_dir_top=${deep_dive_dir}/D-Apul/output/13-Apul-sRNAseq-ShortStack' echo 'export trimmed_fastqs_dir="${deep_dive_dir}/D-Apul/output/08-Apul-sRNAseq-trimming/trimmed-reads"' echo "" echo "# Input/Output files" echo 'export genome_fasta_dir=${deep_dive_dir}/D-Apul/data/Amil/ncbi_dataset/data/GCF_013753865.1' echo 'export genome_fasta_name="GCF_013753865.1_Amil_v2.1_genomic.fna"' echo 'export shortstack_genome_fasta_name="GCF_013753865.1_Amil_v2.1_genomic.fa"' echo 'export mirbase_mature_fasta=mature.fa' echo 'export mirbase_mature_fasta_version=mirbase-mature-v22.1.fa' echo 'export genome_fasta="${genome_fasta_dir}/${shortstack_genome_fasta_name}"' echo "" echo "# External data URLs" echo 'export mirbase_fasta_url="https://mirbase.org/download_version_files/22.1/"' echo "" echo "# Set number of CPUs to use" echo 'export threads=46' echo "" echo "# Initialize arrays" echo 'export trimmed_fastqs_array=()' } > .bashvars cat .bashvars ``` # Load [ShortStack](https://github.com/MikeAxtell/ShortStack) conda environment If this is successful, the first line of output should show that the Python being used is the one in your [ShortStack]( conda environment path. E.g. `python: /home/sam/programs/mambaforge/envs/mirmachine_env/bin/python` ```{r load-shortstack-conda-env, eval=TRUE} use_condaenv(condaenv = shortstack_conda_env_name, conda = shortstack_cond_path) # Check successful env loading py_config() ``` # Download [miRBase](https://mirbase.org/) mature miRNA FastA ```{r download-mirbase-mature-fasta, engine='bash', eval=TRUE} # Load bash variables into memory source .bashvars wget \ --directory-prefix ${deep_dive_data_dir} \ --recursive \ --no-check-certificate \ --continue \ --no-host-directories \ --no-directories \ --no-parent \ --quiet \ --execute robots=off \ ${mirbase_fasta_url}/${mirbase_mature_fasta} # Rename to indicate miRBase FastA version mv ${deep_dive_data_dir}/${mirbase_mature_fasta} ${deep_dive_data_dir}/${mirbase_mature_fasta_version} ls -lh "${deep_dive_data_dir}" ``` # Run ShortStack ## Modify genome filename for ShortStack compatability ```{r rename-genome-filename, engine='bash', cache=TRUE, eval=TRUE} # Load bash variables into memory source .bashvars # Copy genome FastA to ShortStack-compatible filename (ending with .fa) cp ${genome_fasta_dir}/${genome_fasta_name} ${genome_fasta_dir}/${shortstack_genome_fasta_name} # Confirm ls -lh ${genome_fasta_dir}/${shortstack_genome_fasta_name} ``` ## Excecute ShortStack command Uses the `--dn_mirna` option to identify miRNAs in the genome, without relying on the `--known_miRNAs`. This part of the code redirects the output of `time` to the end of `shortstack.log` file. - `; } \ 2>> ${output_dir_top}/shortstack.log` ```{r shortstack, engine='bash', cache=TRUE} # Load bash variables into memory source .bashvars # Create array of trimmed FastQs trimmed_fastqs_array=(${trimmed_fastqs_dir}/${trimmed_fastqs_pattern}) # Pass array contents to new variable as space-delimited list trimmed_fastqs_list=$(echo "${trimmed_fastqs_array[*]}") ###### Run ShortStack ###### { time \ ShortStack \ --genomefile "${genome_fasta}" \ --readfile ${trimmed_fastqs_list} \ --known_miRNAs ${deep_dive_data_dir}/${mirbase_mature_fasta_version} \ --dn_mirna \ --threads ${threads} \ --outdir ${output_dir_top}/ShortStack_out \ &> ${output_dir_top}/shortstack.log ; } \ 2>> ${output_dir_top}/shortstack.log ``` ## Check runtime ```{r engine='bash', eval=TRUE} # Load bash variables into memory source .bashvars tail -n 3 ${output_dir_top}/shortstack.log \ | grep "real" \ | awk '{print "ShortStack runtime:" "\t" $2}' ``` # Results ## ShortStack synopsis ```{r shortstack-synopsis, engine='bash', eval=TRUE} # Load bash variables into memory source .bashvars tail -n 20 ${output_dir_top}/shortstack.log ``` ShortStack didn't identify *any* miRNAs. ## Inspect `Results.txt` ```{r results-txt-file, engine='bash', eval=TRUE} # Load bash variables into memory source .bashvars head ${output_dir_top}/ShortStack_out/Results.txt echo "" echo "----------------------------------------------------------" echo "" echo "Nummber of potential loci:" awk '(NR>1)' ${output_dir_top}/ShortStack_out/Results.txt | wc -l ``` Column 20 of the `Results.txt` file identifies if a cluster is a miRNA or not (`Y` or `N`). ```{r results-txt-miRNAs, engine='bash', eval=TRUE} # Load bash variables into memory source .bashvars echo "Number of loci characterized as miRNA:" awk '$20=="Y" {print $0}' ${output_dir_top}/ShortStack_out/Results.txt \ | wc -l echo "" echo "----------------------------------------------------------" echo "" echo "Number of loci _not_ characterized as miRNA:" awk '$20=="N" {print $0}' ${output_dir_top}/ShortStack_out/Results.txt \ | wc -l ``` Column 21 of the `Results.txt` file identifies if a cluster aligned to a known miRNA (miRBase) or not (`Y` or `NA`). Since there are no miRNAs, the following code will *not* print any output. The `echo` command after the `awk` command is simply there to prove that the chunk executed. ```{r results-txt-miRBase-miRNAs, engine='bash', eval=TRUE} # Load bash variables into memory source .bashvars echo "Number of loci matching miRBase miRNAs:" awk '$21!="NA" {print $0}' ${output_dir_top}/ShortStack_out/Results.txt \ | wc -l echo "" echo "----------------------------------------------------------" echo "" echo "Number of loci _not_ matching miRBase miRNAs:" awk '$21=="NA" {print $0}' ${output_dir_top}/ShortStack_out/Results.txt \ | wc -l ``` Although there are loci with matches to miRBase miRNAs, ShortStack did *not* annotated these clusters as miRNAs likely [because they do not *also* match secondary structure criteria](https://github.com/MikeAxtell/ShortStack#mirna-annotation). ### Directory tree of all ShortStack outputs Many of these are large (by GitHub standards) BAM files, so will not be added to the repo. Additionally, it's unlikely we'll utilize most of the other files (bigwig) generated by ShortStack. ```{r shortstack-directory-tree, engine='bash', eval=TRUE} # Load bash variables into memory source .bashvars tree -h ${output_dir_top}/ ``` ------------------------------------------------------------------------ # Citations