---
title: "lncRNA Comparison"
author: Zach Bengtsson, Steven Roberts
date: "`r format(Sys.time(), '%d %B, %Y')`" 
output: 
  github_document:
    toc: true
    toc_depth: 3
    number_sections: true
    html_preview: true
  html_document:
    theme: readable
    highlight: zenburn
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
---

```{r setup, include=FALSE}
library(knitr)
library(tidyverse)
library(kableExtra)
library(DESeq2)
library(pheatmap)
library(RColorBrewer)
library(data.table)
library(DT)
library(formattable)
library(Biostrings)
library(spaa)
library(tm)
knitr::opts_chunk$set(
  echo = TRUE,         # Display code chunks
  eval = FALSE,         # Evaluate code chunks
  warning = FALSE,     # Hide warnings
  message = FALSE,     # Hide messages
  fig.width = 6,       # Set plot width in inches
  fig.height = 4,      # Set plot height in inches
  fig.align = "center" # Align plots to the center
)
```

# lncRNA Discovery

(using Acropora pulchra as example)

## HISAT

HISAT2 build to create an index for the Acropora millipora FASTA file...

```{r engine='bash'}
#FASTA file path input
ref_genome="/home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/data/GCF_013753865.1_Amil_v2.1_genomic.fna"
index="../output/05.2-lncRNA/Amil.index"
/home/shared/hisat2-2.2.1/hisat2-build \
-f "$ref_genome" \
"$index"
```

HISAT2 to align paired-end RNA-Seq reads to the

```{bash}
ls /home/shared/8TB_HDD_01/apul/trimmed/
```

```{bash}
find /home/shared/8TB_HDD_01/apul/trimmed/*gz \
| xargs basename -s _R1_001.fastp-trim.20230519.fastq.gz | xargs -I{} \
/home/shared/hisat2-2.2.1/hisat2 \
-x ../output/05.2-lncRNA/Amil.index \
-p 20 \
-1 /home/shared/8TB_HDD_01/apul/trimmed/{}_R1_001.fastp-trim.20230519.fastq.gz \
-2 /home/shared/8TB_HDD_01/apul/trimmed/{}_R2_001.fastp-trim.20230519.fastq.gz \
-S ../output/05.2-lncRNA/{}.sam
```

## Samtools

Samtools to convert the SAM files output from the previous HISAT2 command into sorted BAM files...

```{bash samtools}
for file in ../output/05.2-lncRNA/*.sam; do
    base=$(basename "$file" .sam)
    /home/shared/samtools-1.12/samtools view -bS "$file" | \
    /home/shared/samtools-1.12/samtools sort \
    -o ../output/05.2-lncRNA/"$base".sorted.bam
done
```


```{r engine='bash'}
num_threads=20 # Change this to the number of threads/cores you want to use

find ../output/05.2-lncRNA/ -name "*.sam" | xargs -I {} -P $num_threads bash -c '
    file={}
    base=$(basename "$file" .sam)
    /home/shared/samtools-1.12/samtools view -bS "$file" | \
    /home/shared/samtools-1.12/samtools sort \
    -o ../output/05.2-lncRNA/"$base".sorted.bam
'
```
```{bash}
process_file() {
    file=$1
    base=$(basename "$file" .sam)
    /home/shared/samtools-1.12/samtools view -bS "$file" | \
    /home/shared/samtools-1.12/samtools sort \
    -o ../output/05.2-lncRNA/"$base".sorted.bam
}
export -f process_file

find ../output/05.2-lncRNA/ -name "*.sam" | parallel -j4 process_file
```


## StringTie

StringTie to assemble transcripts from the sorted BAM files

::: panel-tabset
## Software Info

\`StringTie\` is a software tool for assembling RNA-Seq data, which helps in reconstructing a set of transcripts that can potentially explain the RNA-Seq read alignments (often provided in BAM or SAM format) for a given genomic sequence.

Here's a basic overview of what StringTie does:

1\. \*\*Transcript Assembly\*\*: After RNA-Seq reads are aligned to a reference genome using tools like \`TopHat\`, \`HISAT\`, or \`STAR\`, StringTie can use those alignments to assemble transcripts. This process is particularly useful for identifying novel isoforms or alternative splicing events.

2\. \*\*Quantification\*\*: Apart from transcript assembly, StringTie also provides an estimate of the expression levels of each transcript, usually in the form of Fragments Per Kilobase of transcript per Million mapped reads (FPKM) or Transcripts Per Million (TPM).

3\. \*\*Merging Transcripts\*\*: StringTie can merge multiple transcript assemblies, which can be useful when you have RNA-Seq data from multiple conditions or time points and you want to create a unified set of transcripts.

4\. \*\*Comparison with Reference Annotation\*\*: You can also compare the assembled transcripts with a reference annotation (like GTF/GFF files) to determine how many of the assembled transcripts match known annotated transcripts and to find novel ones.

5\. \*\*Ballgown Integration\*\*: StringTie can produce output that's compatible with \`Ballgown\`, a tool for differential expression analysis and visualization of RNA-Seq data.

Why is it popular?

\- \*\*Efficiency\*\*: StringTie is known for its speed and efficiency, making it suitable for large datasets, like those produced in modern RNA-Seq experiments.

\- \*\*Sensitivity\*\*: The algorithm behind StringTie was designed to be highly sensitive, especially when detecting alternative splicing events.

\- \*\*Flexibility\*\*: It offers various functionalities beyond just transcript assembly, like quantification and integration with differential expression tools.

When considering transcript assembly and quantification tools, it's essential to keep in mind the specific needs and characteristics of your dataset. While StringTie is powerful and versatile, there are other tools in the bioinformatics field, each with its strengths and weaknesses, that might be more appropriate for specific use cases or datasets.

## Example Code

Assuming you've already aligned your RNA-Seq reads to a reference genome and have a resulting BAM file (e.g., `aligned_reads.bam`), you can use the following steps:

1.  **Transcript Assembly with StringTie**:

    ``` bash
    stringtie -p 8 -G reference_annotation.gtf -o output.gtf -l sample_name aligned_reads.bam
    ```

    -   `-p 8`: Use 8 threads/processors.
    -   `-G reference_annotation.gtf`: This is the reference annotation in GTF format. You might have one from sources like GENCODE or Ensembl.
    -   `-o output.gtf`: This specifies the name of the output GTF file.
    -   `-l sample_name`: Labels the transcripts with the provided prefix (useful for distinguishing samples).

2.  **Quantify Transcripts**: The previous command also provides expression estimates. However, you might want to merge transcripts from multiple samples or conditions and then re-quantify. For this, you can use a two-step process:

    a.  **Merge Transcripts**:

        ``` bash
        stringtie --merge -p 8 -G reference_annotation.gtf -o merged_transcripts.gtf assembly_list.txt
        ```

        -   `assembly_list.txt`: A text file listing the paths to the GTF files for each sample, one file per line.

    b.  **Re-quantify against Merged Transcriptome**:

        ``` bash
        stringtie -e -B -p 8 -G merged_transcripts.gtf -o sample_name_recquant.gtf aligned_reads.bam
        ```

3.  **Prepare Output for Ballgown**: If you plan to use Ballgown for downstream differential expression analysis, you'd need the `-B` flag during the quantification step (as shown above). This tells StringTie to produce output files in a format Ballgown can ingest.

Remember to adjust the file names (`aligned_reads.bam`, `reference_annotation.gtf`, etc.) to match your dataset. Always consult the StringTie documentation for the most up-to-date information and for an explanation of additional parameters not covered here. \## more
:::

```{bash}
ls ../output/05.2-lncRNA/*.sorted.bam
```

```{bash}
cd ../data
curl -O  https://gannet.fish.washington.edu/seashell/snaps/GCF_013753865.1_Amil_v2.1_genomic.gff
```


```{bash}
find ../output/05.2-lncRNA/ -name "*sorted.bam" | xargs basename -s .sorted.bam | xargs -I{} \
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
-p 20 \
-G ../data/GCF_013753865.1_Amil_v2.1_genomic.gff \
-o ../output/05.2-lncRNA/{}.gtf \
../output/05.2-lncRNA/{}.sorted.bam \
```


```{r, engine='bash'}
head -3 ../output/05.2-lncRNA/*.gtf
```


Use StringTie merge to merge all the individual GTF files into a single merged GTF file...

```{r, engine='bash' stringtie2}
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
--merge \
-G ../data/GCF_013753865.1_Amil_v2.1_genomic.gff \
-o ../output/05.2-lncRNA/apul_merged.gtf \
../output/05.2-lncRNA/*.gtf

```

```{bash}
head ../output/05.2-lncRNA/apul_merged.gtf
tail ../output/05.2-lncRNA/apul_merged.gtf
wc -l ../output/05.2-lncRNA/apul_merged.gtf
```

## GFFcompare

Use gffcompare to compare annotation file generated by StringTie to a reference annotation file and to produce a set of output files summarizing the results of the comparison, including classification of each transcript...

```{bash gffcompare}
/home/shared/gffcompare-0.12.6.Linux_x86_64/gffcompare \
-r ../data/GCF_013753865.1_Amil_v2.1_genomic.gff \
-G ../data/GCF_013753865.1_Amil_v2.1_genomic.gff \
-o /home/shared/8TB_HDD_01/apul/gffcompare/gffcompare_merged \
/home/shared/8TB_HDD_01/apul/merged-gtf/stringtie_merged.gtf \
```

```{bash}
head /home/shared/8TB_HDD_01/apul/gffcompare/gffcompare_merged.combined.gtf

```

## Filter

Filter to get a subset of the transcripts from the original GTF file that are putative lncRNA candidates based on their length and lack of overlap with known reference transcripts...

```{bash filter}
awk '$3 == "transcript" && $1 !~ /^#/ {print}' /home/shared/8TB_HDD_01/apul/gffcompare/gffcompare_merged.combined.gtf | grep 'class_code "u"' | awk '$5 - $4 > 199 {print}' > /home/shared/8TB_HDD_01/apul/filter-one/merged_lncRNA_candidates.gtf
```

```{bash}
head /home/shared/8TB_HDD_01/pver/filter-output/merged_lncRNA_candidates.gtf
```

## Bedtools

Get fasta of lncRNA candidate regions with Bedtools...

```{bash}
/home/shared/bedtools2/bin/bedtools \
getfasta -fi /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/data/GCF_013753865.1_Amil_v2.1_genomic.fna -bed /home/shared/8TB_HDD_01/apul/filter-one/merged_lncRNA_candidates.gtf -fo /home/shared/8TB_HDD_01/apul/get-fasta/merged_lncRNA_candidates.fasta -name -split
```

```{bash}
head /home/shared/8TB_HDD_01/pver/bedtools-output/merged_lncRNA_candidates.fasta
```

## CPC2

Evaluate coding potential of transcripts with Coding Potential Calculator 2...

```{bash}
eval "$(/opt/anaconda/anaconda3/bin/conda shell.bash hook)"
python /home/shared/CPC2_standalone-1.0.1/bin/CPC2.py -i /home/shared/8TB_HDD_01/apul/get-fasta/merged_lncRNA_candidates.fasta -o ~/github/coral-lncRNA/output/apul_merged_cpc2_results
```

## Final Filter

Filter for those transcripts with label "noncoding"...

```{bash}
# Filter out transcripts with coding potential
awk '$8 == "noncoding" {print $1}' /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/ouput/apul_merged_cpc2_results.txt > /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/ouput/apul_noncoding_transcripts_ids.txt
grep -Fwf /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/ouput/apul_noncoding_transcripts_ids.txt /home/shared/8TB_HDD_01/apul/get-fasta/merged_lncRNA_candidates.fasta > /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/ouput/apul_merged_final_lncRNAs.gtf
```

## De-duplicate GTF

Remove duplicates of transcript IDs that appear more than once within the GTF...

```{bash}
awk '!seen[$0]++' ~/github/coral-lncRNA/ouput/apul_merged_final_lncRNAs.gtf > ~/github/coral-lncRNA/ouput/apul_deduplicated_final_lncRNAs.gtf

```

This results in the final GTF list with transcript IDs containing scaffold and base range, information used to filter the reference FASTA for only lncRNAs.

## Reformat to BED

Make GTF into BED format so Bedtools can recognize scaffold and base range...

```{bash}
awk -F":|-" '{print $3 "\t" $4 "\t" $5}' /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/ouput/apul_deduplicated_final_lncRNAs.gtf > /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/ouput/apul_deduplicated_final_lncRNAs.bed

```

## BEDTools

Use BEDTools to extract sequences of lncRNAs using the reformatted GTF list of transcript IDs...

```{bash}
/home/shared/bedtools2/bin/bedtools \
getfasta -fi /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/data/GCF_013753865.1_Amil_v2.1_genomic.fna -bed /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/ouput/apul_deduplicated_final_lncRNAs.bed -fo /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/ouput/apul_bedtools_lncRNAs.fasta -name

```