---
title: "13-Apul-sRNAseq-ShortStack"
author: "Sam White"
date: "2023-11-03"
output: 
  bookdown::pdf_document2:
    latex_engine: xelatex
  bookdown::html_document2:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
bibliography: references.bib
link-citations: true
---

```{r setup, include=FALSE}
library(knitr)
library(kableExtra)
library(dplyr)
library(reticulate)
knitr::opts_chunk$set(
  echo = TRUE,         # Display code chunks
  eval = FALSE,        # Evaluate code chunks
  warning = FALSE,     # Hide warnings
  message = FALSE,     # Hide messages
  comment = ""         # Prevents appending '##' to beginning of lines in code output
)
```

Use [ShortStack](https://github.com/MikeAxtell/ShortStack) [@axtell2013a; @shahid2014; @johnson2016a]to perform alignment of sRNAseq data and annotation of sRNA-producing genes.

The *A.millepora* genome will be used as the reference genome for *A.pulchra*, as *A.pulchra* does not currently have a sequenced genome and *A.millepora* had highest alignment rates for standard RNAseq data compared to other published genomes tested.

------------------------------------------------------------------------

Inputs:

-   Requires trimmed sRNAseq files generated by [08-Apul-sRNAseq-trimming.Rmd](https://github.com/urol-e5/deep-dive/blob/main/D-Apul/code/08-Apul-sRNAseq-trimming.Rmd)

    -   Filenames formatted: `*flexbar_trim.25bp*.gz`

-   *A.millepora* genome FastA. See [12-Apul-sRNAseq-MirMachine.Rmd](https://github.com/urol-e5/deep-dive/blob/main/D-Apul/code/12-Apul-sRNAseq-MirMachine.Rmd) for download info if needed.

Outputs:

-   See [ShortStack outputs documentation](https://github.com/MikeAxtell/ShortStack#outputs) for full list and detailed descriptions.

Software requirements:

-   Utilizes a [ShortStack](https://github.com/MikeAxtell/ShortStack#installation) Conda/Mamba environment, per the installation instructions.

Replace with name of your ShortStack environment and the path to the corresponding conda installation (find this *after* you've activated the environment).

E.g.

``` bash
# Activate environment
conda activate ShortStack4_env

# Find conda path
which conda
```

------------------------------------------------------------------------

# Set R variables

```{r R-variables, eval=TRUE}
shortstack_conda_env_name <- c("ShortStack4_env")
shortstack_cond_path <- c("/home/sam/programs/mambaforge/condabin/conda")
```

# Create a Bash variables file

This allows usage of Bash variables across R Markdown chunks.

```{r save-bash-variables-to-rvars-file, engine='bash', eval=TRUE}
{
echo "#### Assign Variables ####"
echo ""

echo "# Trimmed FastQ naming pattern"
echo "export trimmed_fastqs_pattern='*flexbar_trim.25bp*.fastq.gz'"

echo "# Data directories"
echo 'export deep_dive_dir=/home/shared/8TB_HDD_01/sam/gitrepos/deep-dive'
echo 'export deep_dive_data_dir="${deep_dive_dir}/data"'
echo 'export output_dir_top=${deep_dive_dir}/D-Apul/output/13-Apul-sRNAseq-ShortStack'
echo 'export trimmed_fastqs_dir="${deep_dive_dir}/D-Apul/output/08-Apul-sRNAseq-trimming/trimmed-reads"'
echo ""

echo "# Input/Output files"
echo 'export genome_fasta_dir=${deep_dive_dir}/D-Apul/data/Amil/ncbi_dataset/data/GCF_013753865.1'
echo 'export genome_fasta_name="GCF_013753865.1_Amil_v2.1_genomic.fna"'
echo 'export shortstack_genome_fasta_name="GCF_013753865.1_Amil_v2.1_genomic.fa"'
echo 'export mirbase_mature_fasta=mature.fa'
echo 'export mirbase_mature_fasta_version=mirbase-mature-v22.1.fa'
echo 'export genome_fasta="${genome_fasta_dir}/${shortstack_genome_fasta_name}"'
echo ""

echo "# External data URLs"
echo 'export mirbase_fasta_url="https://mirbase.org/download_version_files/22.1/"'
echo ""

echo "# Set number of CPUs to use"
echo 'export threads=46'
echo ""

echo "# Initialize arrays"
echo 'export trimmed_fastqs_array=()'


} > .bashvars

cat .bashvars
```

# Load [ShortStack](https://github.com/MikeAxtell/ShortStack) conda environment

If this is successful, the first line of output should show that the Python being used is the one in your [ShortStack](<https://github.com/MikeAxtell/ShortStack> conda environment path.

E.g.

`python:         /home/sam/programs/mambaforge/envs/mirmachine_env/bin/python`

```{r load-shortstack-conda-env, eval=TRUE}
use_condaenv(condaenv = shortstack_conda_env_name, conda = shortstack_cond_path)

# Check successful env loading
py_config()
```

# Download [miRBase](https://mirbase.org/) mature miRNA FastA

```{r download-mirbase-mature-fasta, engine='bash', eval=TRUE}
# Load bash variables into memory
source .bashvars

wget \
--directory-prefix ${deep_dive_data_dir} \
--recursive \
--no-check-certificate \
--continue \
--no-host-directories \
--no-directories \
--no-parent \
--quiet \
--execute robots=off \
 ${mirbase_fasta_url}/${mirbase_mature_fasta}

# Rename to indicate miRBase FastA version
mv ${deep_dive_data_dir}/${mirbase_mature_fasta} ${deep_dive_data_dir}/${mirbase_mature_fasta_version}

ls -lh "${deep_dive_data_dir}"
```

# Run ShortStack

## Modify genome filename for ShortStack compatability

```{r rename-genome-filename, engine='bash', cache=TRUE, eval=TRUE}
# Load bash variables into memory
source .bashvars

# Copy genome FastA to ShortStack-compatible filename (ending with .fa)
cp ${genome_fasta_dir}/${genome_fasta_name} ${genome_fasta_dir}/${shortstack_genome_fasta_name}

# Confirm
ls -lh ${genome_fasta_dir}/${shortstack_genome_fasta_name}
```

## Excecute ShortStack command

Uses the `--dn_mirna` option to identify miRNAs in the genome, without relying on the `--known_miRNAs`.

This part of the code redirects the output of `time` to the end of `shortstack.log` file.

-   `; } \ 2>> ${output_dir_top}/shortstack.log`


```{r shortstack, engine='bash', cache=TRUE}
# Load bash variables into memory
source .bashvars

# Create array of trimmed FastQs
trimmed_fastqs_array=(${trimmed_fastqs_dir}/${trimmed_fastqs_pattern})


# Pass array contents to new variable as space-delimited list
trimmed_fastqs_list=$(echo "${trimmed_fastqs_array[*]}")


###### Run ShortStack ######
{ time \
ShortStack \
--genomefile "${genome_fasta}" \
--readfile ${trimmed_fastqs_list} \
--known_miRNAs ${deep_dive_data_dir}/${mirbase_mature_fasta_version} \
--dn_mirna \
--threads ${threads} \
--outdir ${output_dir_top}/ShortStack_out \
&> ${output_dir_top}/shortstack.log ; } \
2>> ${output_dir_top}/shortstack.log

```

## Check runtime

```{r engine='bash', eval=TRUE}
# Load bash variables into memory
source .bashvars

tail -n 3 ${output_dir_top}/shortstack.log \
| grep "real" \
| awk '{print "ShortStack runtime:" "\t" $2}'

```


# Results

## ShortStack synopsis

```{r shortstack-synopsis, engine='bash', eval=TRUE}
# Load bash variables into memory
source .bashvars

tail -n 20 ${output_dir_top}/shortstack.log
```

ShortStack didn't identify *any* miRNAs.

## Inspect `Results.txt`

```{r results-txt-file, engine='bash', eval=TRUE}
# Load bash variables into memory
source .bashvars

head ${output_dir_top}/ShortStack_out/Results.txt

echo ""
echo "----------------------------------------------------------"
echo ""

echo "Nummber of potential loci:"
awk '(NR>1)' ${output_dir_top}/ShortStack_out/Results.txt | wc -l
```

Column 20 of the `Results.txt` file identifies if a cluster is a miRNA or not (`Y` or `N`).

```{r results-txt-miRNAs, engine='bash', eval=TRUE}
# Load bash variables into memory
source .bashvars

echo "Number of loci characterized as miRNA:"
awk '$20=="Y" {print $0}' ${output_dir_top}/ShortStack_out/Results.txt \
| wc -l
echo ""

echo "----------------------------------------------------------"

echo ""
echo "Number of loci _not_ characterized as miRNA:"
awk '$20=="N" {print $0}' ${output_dir_top}/ShortStack_out/Results.txt \
| wc -l

```

Column 21 of the `Results.txt` file identifies if a cluster aligned to a known miRNA (miRBase) or not (`Y` or `NA`).

Since there are no miRNAs, the following code will *not* print any output.

The `echo` command after the `awk` command is simply there to prove that the chunk executed.

```{r results-txt-miRBase-miRNAs, engine='bash', eval=TRUE}
# Load bash variables into memory
source .bashvars

echo "Number of loci matching miRBase miRNAs:"
awk '$21!="NA" {print $0}' ${output_dir_top}/ShortStack_out/Results.txt \
| wc -l
echo ""

echo "----------------------------------------------------------"

echo ""
echo "Number of loci _not_ matching miRBase miRNAs:"
awk '$21=="NA" {print $0}' ${output_dir_top}/ShortStack_out/Results.txt \
| wc -l

```

Although there are loci with matches to miRBase miRNAs, ShortStack did *not* annotated these clusters as miRNAs likely [because they do not *also* match secondary structure criteria](https://github.com/MikeAxtell/ShortStack#mirna-annotation).

### Directory tree of all ShortStack outputs

Many of these are large (by GitHub standards) BAM files, so will not be added to the repo.

Additionally, it's unlikely we'll utilize most of the other files (bigwig) generated by ShortStack.

```{r shortstack-directory-tree, engine='bash', eval=TRUE}
# Load bash variables into memory
source .bashvars

tree -h ${output_dir_top}/

```

------------------------------------------------------------------------

# Citations