---
title: "11-Apul-sRNAseq-miRdeep2"
author: "Sam White"
date: "2023-11-08"
output: 
  bookdown::html_document2:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
  github_document:
    toc: true
    number_sections: true
  html_document:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
bibliography: references.bib
link-citations: true
---

Use [miRDeep2](https://github.com/rajewsky-lab/mirdeep2) [@friedländer2011] to identify potential miRNAs using _A.pulchra_ sRNAseq reads. The *A.millepora* genome will be used as the reference genome for *A.pulchra*, as *A.pulchra* does not currently have a sequenced genome and *A.millepora* had highest alignment rates for standard RNAseq data compared to other published genomes tested. The mature miRBase miRNA database will also be used.

------------------------------------------------------------------------

Inputs:

-   Requires collapsed reads (i.e. concatenated, unique reads) in FastA format. See [10-Apul-sRNAseq-BLASTn.Rmd](https://github.com/urol-e5/deep-dive/blob/main/D-Apul/code/10-Apul-sRNAseq-BLASTn.Rmd) for code.

-   *A.millepora* genome FastA. See [12-Apul-sRNAseq-MirMachine.Rmd](https://github.com/urol-e5/deep-dive/blob/main/D-Apul/code/12-Apul-sRNAseq-MirMachine.Rmd) for download info if needed.

- [miRBase](https://mirbase.org/download/) mature miRNAs FastA.

Outputs:

-   Primary outputs are a result table in BED, CSV (tab-delimited), and HTML formats.

  - Due to the nature of mirDeep2's naming, trying to use variable names is challenging. As such, the chunks processing those files will require manual intervention to identify and provide the output filename(s); they are _not_ handled at in the `.bashvars` file at the top of this script.
  
- Other output files are too large for GitHub (and some (all?) are not needed for this analysis). Please find a full backup here:

https://gannet.fish.washington.edu/Atumefaciens/gitrepos/deep-dive/D-Apul/output/11-Apul-sRNAseq-miRdeep2/


------------------------------------------------------------------------


```{r setup, include=FALSE}
library(knitr)
library(kableExtra)
library(dplyr)
library(reticulate)
knitr::opts_chunk$set(
  echo = TRUE,         # Display code chunks
  eval = FALSE,        # Evaluate code chunks
  warning = FALSE,     # Hide warnings
  message = FALSE,     # Hide messages
  comment = "",        # Prevents appending '##' to beginning of lines in code output
  width = 1000         # adds scroll bar
)
```

# Create a Bash variables file

This allows usage of Bash variables across R Markdown chunks.

```{r save-bash-variables-to-rvars-file, engine='bash', eval=TRUE}
{
echo "#### Assign Variables ####"
echo ""

echo "# Trimmed FastQ naming pattern"
echo "export trimmed_fastqs_pattern='*flexbar_trim.25bp*.fastq.gz'"
echo ""

echo "# miRTrace FastA naming pattern"
echo "export mirtrace_fasta_pattern='*flexbar_trim.25bp*.fasta'"

echo "# Data directories"
echo 'export deep_dive_dir=/home/shared/8TB_HDD_01/sam/gitrepos/deep-dive'
echo 'export deep_dive_data_dir="${deep_dive_dir}/data"'
echo 'export output_dir_top=${deep_dive_dir}/D-Apul/output/11-Apul-sRNAseq-miRdeep2'
echo 'export genome_fasta_dir=${deep_dive_dir}/D-Apul/data/Amil/ncbi_dataset/data/GCF_013753865.1'
echo 'export trimmed_fastqs_dir="${deep_dive_dir}/D-Apul/output/08-Apul-sRNAseq-trimming/trimmed-reads"'
echo 'export collapsed_reads_dir="${deep_dive_dir}/D-Apul/output/10-Apul-sRNAseq-BLASTn"'
echo ""

echo "# Input/Output files"
echo 'export collapsed_reads_fasta="collapsed-reads-all.fasta"'
echo 'export collapsed_reads_mirdeep2="collapsed-reads-all-mirdeep2.fasta"'
echo 'export concatenated_trimmed_reads_fastq="concatenated-trimmed-reads-all.fastq.gz"'
echo 'export genome_fasta_name="GCF_013753865.1_Amil_v2.1_genomic.fna"'
echo 'export genome_fasta_no_spaces="GCF_013753865.1_Amil_v2.1_genomic-no_spaces.fna"'
echo 'export mirdeep2_mapping_file="Apul-mirdeep2-mapping.arf"'
echo 'export mirbase_mature_fasta_name="mirbase-mature-v22.1.fa"'
echo 'export mirbase_mature_fasta_no_spaces="mirbase-mature-v22.1-no_spaces.fa"'
echo 'export concatenated_mirtrace_reads_fasta="concatenated-mirtrace-reads.fasta"'
echo ""


echo "# Paths to programs"
echo 'export mirdeep2_mapper="mapper.pl"'
echo 'export mirdeep2="miRDeep2.pl"'
echo 'export bowtie_build="/home/shared/bowtie-1.3.1-linux-x86_64/bowtie-build"'
echo ""

echo "# Set number of CPUs to use"
echo 'export threads=46'
echo ""

echo "# Initialize arrays"
echo 'export trimmed_fastqs_array=()'


} > .bashvars

cat .bashvars
```

# Prepare reads for miRDeep2

Per miRDeep2 documentation:

> The readID must end with \_xNumber and is not allowed to contain
> whitespaces. has to have the format name_uniqueNumber_xnumber

```{r reformat-read-IDs, eval=TRUE, engine='bash'}
# Load bash variables into memory
source .bashvars


sed '/^>/ s/-/_x/g' "${collapsed_reads_dir}/${collapsed_reads_fasta}" \
| sed '/^>/ s/>/>seq_/' \
> "${output_dir_top}/${collapsed_reads_mirdeep2}"

grep "^>" ${collapsed_reads_dir}/${collapsed_reads_fasta} \
| head

echo ""
echo "--------------------------------------------------"
echo ""

grep "^>" ${output_dir_top}/${collapsed_reads_mirdeep2} \
| head
```

# Reformat genome FastA description lines

miRDeep2 can't process genome FastAs with spaces in the description
lines.

So, I'm replacing spaces with underscores.

And, for aesthetics, I'm also removing commas.

```{r remove-spaces-genome-FastA, eval=TRUE, engine='bash'}
# Load bash variables into memory
source .bashvars


sed '/^>/ s/ /_/g' "${genome_fasta_dir}/${genome_fasta_name}" \
| sed '/^>/ s/,//g' \
> "${genome_fasta_dir}/${genome_fasta_no_spaces}"

grep "^>" ${genome_fasta_dir}/${genome_fasta_name} \
| head

echo ""
echo "--------------------------------------------------"
echo ""

grep "^>" ${genome_fasta_dir}/${genome_fasta_no_spaces} \
| head
```

# Reformat miRBase FastA description lines

```{r remove-spaces-mirbase-FastA, eval=TRUE, engine='bash'}
# Load bash variables into memory
source .bashvars


sed '/^>/ s/ /_/g' "${deep_dive_data_dir}/${mirbase_mature_fasta_name}" \
| sed '/^>/ s/,//g' \
> "${deep_dive_data_dir}/${mirbase_mature_fasta_no_spaces}"

grep "^>" ${deep_dive_data_dir}/${mirbase_mature_fasta_name} \
| head

echo ""
echo "--------------------------------------------------"
echo ""

grep "^>" ${deep_dive_data_dir}/${mirbase_mature_fasta_no_spaces} \
| head
```

# Bowtie v1 genome index

miRDeep2 requires a Bowtie v1 genome index - cannot use Bowtie2 index

```{r bowtie1-index, eval=FALSE, engine='bash'}
# Load bash variables into memory
source .bashvars

# Check for existence of genome index first
if [ ! -f "${genome_fasta_no_spaces%.*}.ebwt" ]; then
  ${bowtie_build} \
  ${genome_fasta_dir}/${genome_fasta_no_spaces} \
  ${genome_fasta_dir}/${genome_fasta_no_spaces%.*} \
  --threads ${threads} \
  --quiet
fi
```

# Map reads to genome

Requires genome to be previously indexed with Bowtie.

Additionally, requires user to enter path to their `mirdeep2` directory as well
as their `perl5` installation.

```{r map-read-genome, eval=FALSE, engine='bash'}
# Load bash variables into memory
source .bashvars

# Append miRDeep2 to system PATH and set PERL5LIB
export PATH=$PATH:/home/shared/mirdeep2/bin
export PERL5LIB=$PERL5LIB:/home/shared/mirdeep2/lib/perl5

# Run miRDeep2 mapping
time \
${mirdeep2_mapper} \
${output_dir_top}/${collapsed_reads_mirdeep2} \
-c \
-p ${genome_fasta_dir}/${genome_fasta_no_spaces%.*} \
-t ${output_dir_top}/${mirdeep2_mapping_file} \
-o ${threads}
```

# Run miRDeep2

Recommendation is to use the closest related species in the miRDeep2 options, even
if the species isn't very closely related. The documentation indicates that miRDeep2
is always more accurate when at least a species is provided.

The options provided to the command are as follows:

- `none`: Known miRNAs of the species being analyzed.
- `-t S.pupuratus`: Related species.
- `none`: Known miRNA precursors in this species.
- `-P`: Specifies miRBase version > 18.
- `-v`: Remove temporary files after completion.
- `-g -1`: Number of precursors to anlayze. A setting of `-1` will analyze all. Default is 50,000
I set this to `-1` after multiple attempts to run using the default kept failing.

NOTE: This will take an _extremely_ long time to run (days). Could possible by shortened by
excluding `randfold` analysis.


```{r mirdeep2, eval=FALSE, engine='bash'}
# Load bash variables into memory
source .bashvars

# Append miRDeep2 to system PATH and set PERL5LIB
export PATH=$PATH:/home/shared/mirdeep2/bin
export PERL5LIB=$PERL5LIB:/home/shared/mirdeep2/lib/perl5

time \
${mirdeep2} \
${output_dir_top}/${collapsed_reads_mirdeep2} \
${genome_fasta_dir}/${genome_fasta_no_spaces} \
${output_dir_top}/${mirdeep2_mapping_file} \
none \
${deep_dive_data_dir}/${mirbase_mature_fasta_no_spaces} \
none \
-t S.purpuratus \
-P \
-v \
-g -1 \
2>${output_dir_top}/miRDeep2-S.purpuratus-report.log
```

## Check runtime

```{r runtime-mirdeep2, eval=TRUE, engine='bash'}
# Load bash variables into memory
source .bashvars

tail -n 6 ${output_dir_top}/miRDeep2-S.purpuratus-report.log
```

# Move output files to output directory

MiRDeep2 outputs all files to the current working directly with no way
to redirect so want to move to intended output directory.

## Move output files

Output files will be in the format of `result_*` and `error_*`

```{r move-mirdeep2-output-files, eval=FALSE, engine='bash'}
# Load bash variables into memory
source .bashvars

for file in result_* error_*
do
  mv "${file}" "${output_dir_top}/"
done
```

## Identify directories

```{r identify-mirdeep2-output-dirs, eval=TRUE, engine='bash'}
# Load bash variables into memory
source .bashvars

ls -l | grep "^d"
```

## Rsync directories

```{r rsync-mirdeep2-output-files, eval=FALSE, engine='bash'}
# Load bash variables into memory
source .bashvars

rsync -aP . --include='dir***' --exclude='*' --quiet "${output_dir_top}/"

rsync -aP . --include='map***' --exclude='*' --quiet "${output_dir_top}/"

rsync -aP . --exclude='mirgene*' --include='mir***' --exclude='*' --quiet "${output_dir_top}/"

rsync -aP . --include='pdfs***' --exclude='*' --quiet "${output_dir_top}/"

echo ""
echo "Check new location:"
echo ""

ls -l "${output_dir_top}/" | grep "^d"


```

## Confirm deletion patterns work *before* deletion!

FYI - `eval=FALSE` is set because the following command will only work
once...

```{r test-deletion-patterns, eval=FALSE, engine='bash'}
ls --directory dir_* mapper_logs mirdeep_runs mirna_results* pdfs_*
```

## Remove directores from code directory

```{r remove-mirdeep2-output-dirs, eval=FALSE, engine='bash'}
# Load bash variables into memory
source .bashvars

rm -rf dir_* mapper_logs mirdeep_runs mirna_results* pdfs_*

```

## Check to make sure they're gone

```{r check-deletion-success, eval=TRUE, engine='bash', error=TRUE}
ls --directory dir_* mapper_logs mirdeep_runs mirna_results* pdfs_*
```

# Results

## Peep output file format

The formatting of this CSV is terrible. It has a 3-column table on top
of a 17-column table. This makes parsing a bit of a pain in its raw
format.

```{r check-results-file, eval=TRUE, engine='bash'}
# Load bash variables into memory
source .bashvars


head -n 30 "${output_dir_top}/result_08_11_2023_t_14_47_36.csv"
```

## Create more easily parasable results file

This will match the line beginning with `provisional id` and print to
the end of the file (represented by the `$p`. `$` = end, `p` = print)

```{r parsable-results-file, eval=TRUE, engine='bash'}
# Load bash variables into memory
source .bashvars


sed --quiet '/provisional id/,$p' "${output_dir_top}/result_08_11_2023_t_14_47_36.csv" \
> "${output_dir_top}/parsable-result_08_11_2023_t_14_47_36.csv"

head "${output_dir_top}/parsable-result_08_11_2023_t_14_47_36.csv"
```

## Read in output CSV

This chunk provides a more convise overview of the data and it's
columns.

```{r read-in-result-csv, eval=TRUE}

mirdeep_result.df <- read.csv("../output/11-Apul-sRNAseq-miRdeep2/parsable-result_08_11_2023_t_14_47_36.csv",
                              header = TRUE,
                              sep = "\t")

str(mirdeep_result.df)

```

## miRNAs count data

This provides some rudimentary numbers for the miRDeep2 output.

Further analysis is possibly desired to evaluate score thresholds, miRNA
families, etc.

```{r results-counts, eval=TRUE, engine='bash'}
# Load bash variables into memory
source .bashvars

# Total predicted miRNAS
total_miRNAs=$(awk 'NR > 1' ${output_dir_top}/parsable-result_08_11_2023_t_14_47_36.csv \
| wc -l
)

echo "Total of predicted miRNAs: ${total_miRNAs}"
echo ""

# Matches to known mature miRNAs
mature_miRNAs=$(awk -F'\t' '$11 != "-" && $11 != "" {print $11}' ${output_dir_top}/parsable-result_08_11_2023_t_14_47_36.csv \
| wc -l
)

echo "Number of seed matches to known miRNAS: ${mature_miRNAs}"
echo ""

# Novel miRNAs
novel_miRNAs=$(awk -F "\t" '$11 == "-" || $11 == "" {print $11}' ${output_dir_top}/parsable-result_08_11_2023_t_14_47_36.csv \
| awk 'NR > 1' \
| wc -l
)

echo "Number of novel miRNAs: ${novel_miRNAs}"

```

------------------------------------------------------------------------

# Citations