---
title: "Long non-coding RNA discovery pipeline"
author: "Zach Bengtsson"
date: "`r format(Sys.time(), '%d %B, %Y')`"  
format:
  html:
    toc: true
    toc-depth: 2
    html-math-method: katex
    css: styles.css
    theme: sandstone
---

[GitHub Repository](https://github.com/urol-e5/deep-dive)

## Introduction

This pipeline identifies long non-coding RNAs (lncRNAs) and yields expression count matrices of lncRNAs in RNA-seq data.

## Objectives

1.  Identify lncRNAs present within the RNA-seq data
2.  Create a putative list of lncRNA transcript in GTF and FASTA formats
3.  Build count matrices for expression analysis

# Part 1: Finding lncRNAs

![Workflow of programs](https://github.com/zbengt/zbengt.github.io/blob/master/assets/img/lncRNA-disc.png?raw=true)

## Data Wrangling

Download genome FASTA for mapping and assembly of RNA-seq data

```{r, engine='bash', eval=FALSE, echo=TRUE}
wget -P ../data/ http://INSERT_LINK_TO_YOUR_GENOME/download/genome.fasta.gz
gunzip ../data/genome.fasta.gz
```

Download genome GTF or GFF for use later in transcript assembly

```{r, engine='bash', eval=FALSE, echo=TRUE}
wget -P ../data/ http://INSERT_LINK_TO_YOUR_GENOME_GTF/download/genome.gtf
gunzip ../data/genome.gtf
```

Download paired-end, post-QC RNA-seq FASTQ files...

```{r, engine='bash', eval=FALSE, echo=TRUE}
wget -r \
--no-check-certificate \
--quiet \
--no-directories --no-parent \
-P ../data/fastq \
-A *fastq.gz \
https://INSERT_LINK_TO_YOUR_TRIMMED_FASTQ_FILES
```

What an unzipped FASTQ should look like...

```{r, engine='bash', eval=TRUE, echo=FALSE}
head /home/shared/8TB_HDD_01/pver/C17_R1_001.fastq
```

## HISAT2

HISAT2 build to create an index for the genome FASTA file...

```{r, engine='bash', eval=FALSE, echo=TRUE}
/home/shared/hisat2-2.2.1/hisat2-build \ ##replace this with your path to hisat2-build tool
-f ../data/genome.fasta \
../output/genome.index
```

HISAT2 to align paired-end RNA-Seq reads to the genome.fasta index...

```{r, engine='bash', eval=FALSE, echo=TRUE}
find ../data/fastq/*fastq.gz \
| xargs basename -s -INSERT_FILE_BASENAME_HERE.fastq.gz | xargs -I{} \
/home/shared/hisat2-2.2.1/hisat2 \ ##replace this with your path to hisat2-build tool
-x ../output/genome.index \
-p 48 \
-1 ../data/fastq/{}-R1_001-INSERT_FILE_BASENAME_HERE.fastq.gz \ ##R1 and R2 represent paired end sequences. You will need to examine your file naming system to determine how file basenames distinguish between R1 and R2 for these lines.
-2 ../data/fastq/{}-R2_001-INSERT_FILE_BASENAME_HERE.fastq.gz \
-S ../output/lncRNA-discovery/sam/{}.sam \
2> ../output/lncRNA-discovery/sam/hisat.out
```

## Samtools

Samtools to convert the SAM files output from the previous HISAT2 command into sorted BAM files...

```{r, engine='bash', eval=FALSE, echo=TRUE}
mkdir -p ../output/lncRNA-discovery/bam  # Ensure the output directory exists

# Change samtools path to fit your machine
for samfile in ../output/lncRNA-discovery/sam/*.sam; do
  bamfile="../output/lncRNA-discovery/bam/$(basename "${samfile%.sam}.bam")"
  sorted_bamfile="../output/lncRNA-discovery/bam/$(basename "${samfile%.sam}.sorted.bam")"

  # Convert SAM to BAM
  /home/shared/samtools-1.12/samtools view -bS -@ 20 "$samfile" > "$bamfile"

  # Sort BAM
  /home/shared/samtools-1.12/samtools sort -@ 20 "$bamfile" -o "$sorted_bamfile"

  # Index sorted BAM
  /home/shared/samtools-1.12/samtools index -@ 20 "$sorted_bamfile"
done

```

## StringTie

StringTie uses the sorted BAM files to assemble transcripts for each sample, outputting them as GTF (Gene Transfer Format) files. And then merges all individual GTF assemblies into a single merged GTF file. This step extracts transcript information and merges GTFs from all samples--an important step in creating a canonical list of lncRNAs across all samples included in the pipeline.

```{r, engine='bash', eval=FALSE, echo=TRUE}
find /home/shared/lncRNA-discovery/bam/*bam \
| xargs basename -s -valid_sorted.bam | xargs -I{} \
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \ # Change tool path
-p 8 \
-G ../data/genome.gtf \
-o ../output/lncRNA-discovery/stringtie-output/{}.gtf \
../output/bam/{}-valid_sorted.bam \
```

Use StringTie merge to merge all the individual GTF files into a single merged GTF file...

```{r, engine='bash', eval=FALSE, echo=TRUE}
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \ # Change tool path
--merge \
-G ../data/genome.gtf \
-o ../output/stringtie-merge-output/stringtie_merged.gtf \
../output/stringtie-output/*.gtf
```

## GffCompare

Use gffcompare to compare annotation file generated by StringTie to a reference annotation file and to produce a set of output files summarizing the results of the comparison, including classification of each transcript...

```{r, engine='bash', eval=FALSE, echo=TRUE}
/home/shared/gffcompare-0.12.6.Linux_x86_64/gffcompare \ # CHange tool path
-r ../data/genome.gtf \
-o ../output/lncRNA-discovery/gffcompare_merged \
../output/stringtie-merge-output/stringtie_merged.gtf
```

## Filter

Filters the combined GTF output from GFFcompare to select only the lines representing "transcripts" and excluding lines starting with "\#" (these are lines in the output format from GFFcompare that don't contain transcript information). This step further filters for a class code of "u", and keep only those with lengths greater than 199 bases. The "u' class code from the GFFcompare step is for"unknown" transcripts, that is those that are not previously annotated in our reference GFF as protein coding. The size filter of +200nt is a common filtering step for isolating lncRNAs.

```{r, engine='bash', eval=FALSE, echo=TRUE}
awk '$3 == "transcript" && $1 !~ /^#/ {print}' ../output/gffcompare-merged/gffcompare_merged.annotated.gtf | grep 'class_code "u"' | awk '$5 - $4 > 199 {print}' > ../output/filter-output/lncRNA_candidates.gtf
```

## Bedtools

Extracts the sequence data from the \`\$FASTA\` reference file based on the coordinates from the filtered GTF. The resulting sequences represent potential lncRNA candidates.

```{r, engine='bash', eval=FALSE, echo=TRUE}
/home/shared/bedtools2/bin/fastaFromBed \ #Change tool path

-fi ../data/genome.fasta \

-bed ../output/filter-output/lncRNA_candidates.gtf \

-fo ../output/lncRNA-discovery/candidates-fasta/lncRNA_candidates.fasta \

-name -split
```

Check FASTA contents

```{r, engine='bash', eval=TRUE, echo=FALSE}
fgrep -c ">" ../output/lncRNA-discovery/candidates-fasta/lncRNA_candidates.fasta

head ../output/lncRNA-discovery/candidates-fasta/lncRNA_candidates.fasta
```

## CPC2

Initializes a conda environment (Anaconda) and runs CPC2, a software to predict whether a transcript is coding or non-coding. The results are saved to the \$OUTPUT_DIR. CPC2 uses ORF (Open Reading Frame) Analysis, Isometric Feature Mapping (Isomap), Sequence Homology, RNA Sequence Features, and Quality of Translation to assess coding potential and flag any transcripts we would want to exclude using the FASTA generated in the previous step.

```{r, engine='bash', eval=FALSE, echo=TRUE}
eval "$(/opt/anaconda/anaconda3/bin/conda shell.bash hook)" #Change conda and tool path
python /home/shared/CPC2_standalone-1.0.1/bin/CPC2.py \
-i ../output/lncRNA-discovery/candidates-fasta/lncRNA_candidates.fasta \
-o ../output/lncRNA-discovery/cpc2/lncRNA_cpc2.txt
```

## Final Filter

Filters the CPC2 results to get only noncoding transcripts (using the class "noncoding" from the CPC2 results) and extracts their IDs and matches these IDs with the sequences from the previous step to generate a GTF of long noncoding transcripts.

Matches these IDs with the sequences from the previous step to generate a GTF of noncoding transcripts.

```{r, engine='bash', eval=FALSE, echo=TRUE}
awk '$8 == "noncoding" {print $1}' ../output/lncRNA-discovery/cpc2/lncRNA_cpc2.txt > ../output/lncRNA-discovery/transcript-ids/noncoding_transcripts_ids.txt
```

This results in a final list of transcript IDs identified as lncRNAs.

## Subsetting FASTA

This creates a FASTA of only lncRNAs.

```{r, engine='bash'}

/home/shared/samtools-1.12/samtools faidx ../output/lncRNA-discovery/candidates-fasta/lncRNA_candidates.fasta \

-r ../output/lncRNA-discovery/transcript-ids/noncoding_transcripts_ids.txt > ../output/lncRNA-discovery/lncRNA-fasta/final_lncRNA.fasta

```

Check

```{r, engine='bash', eval=TRUE}

fgrep -c "\>" ../output/lncRNA-discovery/lncRNA-fasta/final_lncRNA.fasta

fgrep "\>" ../output/lncRNA-discovery/lncRNA-fasta/final_lncRNA.fasta | head -5

head ../output/lncRNA-discovery/lncRNA-fasta/final_lncRNA.fasta

```

# Part 2: Expression Count Matrices

![Workflow](https://github.com/course-fish546-2023/zach-lncRNA/blob/main/data/DGE-workflow.png?raw=true)

## Convert Bed to GTF

Transcript quantification using the Kallisto software on multiple pairs of FASTQ files...

```{r, engine='bash', eval=FALSE, echo=TRUE}
awk 'BEGIN{OFS="\t"; count=1} {printf "%s\t.\tlncRNA\t%d\t%d\t.\t+\t.\tgene_id \"lncRNA_%03d\";\n", $1, $2, $3, count++;}' /home/shared/8TB_HDD_03/sr320/github/deep-dive-expression/D-Apul/output/10.1-Apul-lncRNA/Apul_lncRNA.bed \
> ../output/05-Apul-lncRNA/lncRNAs.gtf
```

## Feature Count Quantification

```{r, engine='bash', eval=FALSE, echo=TRUE}
/home/shared/subread-2.0.5-Linux-x86_64/bin/featureCounts \
-T 42 \
-a ../output/05-Apul-lncRNA/lncRNAs.gtf \
-o ../output/05-Apul-lncRNA/counts.txt \
-t lncRNA \
-g gene_id \
-p \
../data/*sorted.bam
```