---
title: "12-Apul-sRNAseq-MirMachine"
author: "Sam White"
date: "2023-11-02"
output: 
  bookdown::html_document2:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
bibliography: references.bib
---

```{r setup, include=FALSE}
library(knitr)
library(kableExtra)
library(dplyr)
library(reticulate)
knitr::opts_chunk$set(
  echo = TRUE,         # Display code chunks
  eval = FALSE,        # Evaluate code chunks
  warning = FALSE,     # Hide warnings
  message = FALSE,     # Hide messages
  comment = ""         # Prevents appending '##' to beginning of lines in code output
)
```

Use [MirMachine](https://github.com/sinanugur/MirMachine) [@umu2022] to identify potential
miRNA homologs in the _A.millepora_ genome.

The _A.millepora_ genome will be used as the reference genome for _A.pulchra_, as _A.pulchra_ does not currently
have a sequenced genome and _A.millepora_ had highest alignment rates for standard RNAseq data compared to other published genomes tested.

---

Inputs:

- Genome FastA.

Outputs:

- Confidence-filtered GFF of predicted miRNAs.
- Unfiltered FastA of predicted miRNAs.
- Unfiltered GFF of predicted miRNAs.

Software requirements:

- Requires a [MirMachine](https://github.com/sinanugur/MirMachine) Conda/Mamba environment, per the installation instructions.

---

# Set R variables
Replace with name of your MirMachine environment and the path to the corresponding
conda installation (find this _after_ you've activated the environment).

E.g. 

```{r conda-evn-example, eval=FALSE}
# Activate environment
conda activate ShortStack4_env

# Find conda path
which conda
```
```{r R-variables, eval=TRUE}
mirmachine_conda_env_name <- c("mirmachine_env")
mirmachine_cond_path <- c("/home/sam/programs/mambaforge/condabin/conda")
```


# Create a Bash variables file

This allows usage of Bash variables across R Markdown chunks.

```{r save-bash-variables-to-rvars-file, engine='bash', eval=TRUE}
{
echo "#### Assign Variables ####"
echo ""

echo "# Data directories"
echo 'export deep_dive_dir=/home/shared/8TB_HDD_01/sam/gitrepos/deep-dive'
echo 'export output_dir_top=${deep_dive_dir}/D-Apul/output/12-Apul-sRNAseq-MirMachine'
echo ""

echo 'export mirmarchine_output_prefix="Amil-MirMachine"'

echo "# Input/Output files"
echo 'export genome_fasta_dir=${deep_dive_dir}/D-Apul/data/Amil/ncbi_dataset/data/GCF_013753865.1'
echo 'export genome_fasta_name="GCF_013753865.1_Amil_v2.1_genomic.fna"'
echo 'export genome_fasta="${genome_fasta_dir}/${genome_fasta_name}"'

echo ""

echo "# External data URLs"
echo 'export genome_fasta_url="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/013/753/865/GCF_013753865.1_Amil_v2.1/"'
echo ""

echo "# Set number of CPUs to use"
echo 'export threads=40'
echo ""
} > .bashvars

cat .bashvars
```

# Load [MirMachine](https://github.com/sinanugur/MirMachine) conda environment

If this is successful, the first line of output should show that the Python being used
is the one in your [MirMachine](https://github.com/sinanugur/MirMachine) conda environment path.

E.g.

`python:         /home/sam/programs/mambaforge/envs/mirmachine_env/bin/python`
```{r load-mirmachine-conda-env, eval=TRUE}
use_condaenv(condaenv = mirmachine_conda_env_name, conda = mirmachine_cond_path)

# Check successful env loading
py_config()
```

# Download _A.millepora_ genome from NCBI (`GCF_013753865.1_Amil_v2.1`)
```{r download-ncbi-genome-fasta, engine='bash', eval=TRUE}
# Load bash variables into memory
source .bashvars

wget \
--directory-prefix ${genome_fasta_dir} \
--recursive \
--no-check-certificate \
--continue \
--no-host-directories \
--no-directories \
--no-parent \
--quiet \
--execute robots=off \
--accept "${genome_fasta_name}.gz,md5checksums.txt" ${genome_fasta_url}

ls -lh "${genome_fasta_dir}"
```


## Verify genome FastA MD5 checksum
```{r verify-genome-fasta-checksum, engine='bash', eval=TRUE}
# Load bash variables into memory
source .bashvars

cd "${genome_fasta_dir}"

# Checksums file contains other files, so this just looks for the sRNAseq files.
grep "${genome_fasta_name}" md5checksums.txt | md5sum --check
```

## Decompress FastA file
```{r decompress-FastA, engine='bash', eval=TRUE}
# Load bash variables into memory
source .bashvars

cd "${genome_fasta_dir}"

gunzip --keep "${genome_fasta_name}.gz"

ls -lth
```


# Run [MirMachine](https://github.com/sinanugur/MirMachine)

The `--species` option specifies output naming structure
```{r mirmachine, engine='bash', eval=TRUE}
# Load bash variables into memory
source .bashvars

cd "${output_dir_top}"

time \
MirMachine.py \
--node Metazoa \
--species ${mirmarchine_output_prefix} \
--genome ${genome_fasta} \
--cpu 40 \
--add-all-nodes \
&> mirmachine.log
```

# Results

## View [MirMachine](https://github.com/sinanugur/MirMachine) output structure
```{r mirmachine-output-tree, eval=TRUE, engine='bash'}
# Load bash variables into memory
source .bashvars

tree -h ${output_dir_top}/results/predictions
```

## Check FastA
```{r check-mirmachine-fasta, eval=TRUE, engine='bash'}
# Load bash variables into memory
source .bashvars

echo "Total number of predicted miRNAs (high & low confidence):"
grep --count "^>" "${output_dir_top}/results/predictions/fasta/${mirmarchine_output_prefix}.PRE.fasta"
echo ""
echo "----------------------------------------------------------------------------------"
echo ""

head "${output_dir_top}/results/predictions/fasta/${mirmarchine_output_prefix}.PRE.fasta"
```
## Check filtered GFF
Per [MirMachine](https://github.com/sinanugur/MirMachine) recommendations:

> `filtered_gff/` High confidence miRNA family predictions after bitscore filtering. (This file is what you need in most cases)

So, we'll stick with that.
```{r check-filtered-gff, eval=TRUE, engine='bash'}
# Load bash variables into memory
source .bashvars

# Avoid 10th line - lengthy list of all miRNA families examined
head -n 9 "${output_dir_top}/results/predictions/filtered_gff/${mirmarchine_output_prefix}.PRE.gff"

echo ""
# Pull out lines that do _not_ begin with a '#'.
grep "^[^#]" "${output_dir_top}/results/predictions/filtered_gff/${mirmarchine_output_prefix}.PRE.gff" \
| head

```

## Counts of predicted high-confidence miRNAs
```{r count-predicted-hc-mirnas, eval=TRUE, engine='bash'}
# Load bash variables into memory
source .bashvars

# Skip lines which begin with '#' (i.e. the header components)
echo "Predicted loci:"
grep -c "^[^#]" "${output_dir_top}/results/predictions/filtered_gff/${mirmarchine_output_prefix}.PRE.gff"

echo ""
echo "Unique miRNA families:"
grep "^[^#]" "${output_dir_top}/results/predictions/filtered_gff/${mirmarchine_output_prefix}.PRE.gff" \
| awk -F"[\t=;]" '{print $10}' \
| sort -u \
| wc -l
```

---

# Citations