--- title: "lncRNA Comparison" author: Zach Bengtsson, Steven Roberts date: "`r format(Sys.time(), '%d %B, %Y')`" output: github_document: toc: true toc_depth: 3 number_sections: true html_preview: true html_document: theme: readable highlight: zenburn toc: true toc_float: true number_sections: true code_folding: show code_download: true --- ```{r setup, include=FALSE} library(knitr) library(tidyverse) library(kableExtra) library(DESeq2) library(pheatmap) library(RColorBrewer) library(data.table) library(DT) library(formattable) library(Biostrings) library(spaa) library(tm) knitr::opts_chunk$set( echo = TRUE, # Display code chunks eval = FALSE, # Evaluate code chunks warning = FALSE, # Hide warnings message = FALSE, # Hide messages fig.width = 6, # Set plot width in inches fig.height = 4, # Set plot height in inches fig.align = "center" # Align plots to the center ) ``` # lncRNA Discovery (using Acropora pulchra as example) ## HISAT HISAT2 build to create an index for the Acropora millipora FASTA file... ```{r engine='bash'} #FASTA file path input ref_genome="/home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/data/GCF_013753865.1_Amil_v2.1_genomic.fna" index="../output/05.2-lncRNA/Amil.index" /home/shared/hisat2-2.2.1/hisat2-build \ -f "$ref_genome" \ "$index" ``` HISAT2 to align paired-end RNA-Seq reads to the ```{bash} ls /home/shared/8TB_HDD_01/apul/trimmed/ ``` ```{bash} find /home/shared/8TB_HDD_01/apul/trimmed/*gz \ | xargs basename -s _R1_001.fastp-trim.20230519.fastq.gz | xargs -I{} \ /home/shared/hisat2-2.2.1/hisat2 \ -x ../output/05.2-lncRNA/Amil.index \ -p 20 \ -1 /home/shared/8TB_HDD_01/apul/trimmed/{}_R1_001.fastp-trim.20230519.fastq.gz \ -2 /home/shared/8TB_HDD_01/apul/trimmed/{}_R2_001.fastp-trim.20230519.fastq.gz \ -S ../output/05.2-lncRNA/{}.sam ``` ## Samtools Samtools to convert the SAM files output from the previous HISAT2 command into sorted BAM files... ```{bash samtools} for file in ../output/05.2-lncRNA/*.sam; do base=$(basename "$file" .sam) /home/shared/samtools-1.12/samtools view -bS "$file" | \ /home/shared/samtools-1.12/samtools sort \ -o ../output/05.2-lncRNA/"$base".sorted.bam done ``` ```{r engine='bash'} num_threads=20 # Change this to the number of threads/cores you want to use find ../output/05.2-lncRNA/ -name "*.sam" | xargs -I {} -P $num_threads bash -c ' file={} base=$(basename "$file" .sam) /home/shared/samtools-1.12/samtools view -bS "$file" | \ /home/shared/samtools-1.12/samtools sort \ -o ../output/05.2-lncRNA/"$base".sorted.bam ' ``` ```{bash} process_file() { file=$1 base=$(basename "$file" .sam) /home/shared/samtools-1.12/samtools view -bS "$file" | \ /home/shared/samtools-1.12/samtools sort \ -o ../output/05.2-lncRNA/"$base".sorted.bam } export -f process_file find ../output/05.2-lncRNA/ -name "*.sam" | parallel -j4 process_file ``` ## StringTie StringTie to assemble transcripts from the sorted BAM files ::: panel-tabset ## Software Info \`StringTie\` is a software tool for assembling RNA-Seq data, which helps in reconstructing a set of transcripts that can potentially explain the RNA-Seq read alignments (often provided in BAM or SAM format) for a given genomic sequence. Here's a basic overview of what StringTie does: 1\. \*\*Transcript Assembly\*\*: After RNA-Seq reads are aligned to a reference genome using tools like \`TopHat\`, \`HISAT\`, or \`STAR\`, StringTie can use those alignments to assemble transcripts. This process is particularly useful for identifying novel isoforms or alternative splicing events. 2\. \*\*Quantification\*\*: Apart from transcript assembly, StringTie also provides an estimate of the expression levels of each transcript, usually in the form of Fragments Per Kilobase of transcript per Million mapped reads (FPKM) or Transcripts Per Million (TPM). 3\. \*\*Merging Transcripts\*\*: StringTie can merge multiple transcript assemblies, which can be useful when you have RNA-Seq data from multiple conditions or time points and you want to create a unified set of transcripts. 4\. \*\*Comparison with Reference Annotation\*\*: You can also compare the assembled transcripts with a reference annotation (like GTF/GFF files) to determine how many of the assembled transcripts match known annotated transcripts and to find novel ones. 5\. \*\*Ballgown Integration\*\*: StringTie can produce output that's compatible with \`Ballgown\`, a tool for differential expression analysis and visualization of RNA-Seq data. Why is it popular? \- \*\*Efficiency\*\*: StringTie is known for its speed and efficiency, making it suitable for large datasets, like those produced in modern RNA-Seq experiments. \- \*\*Sensitivity\*\*: The algorithm behind StringTie was designed to be highly sensitive, especially when detecting alternative splicing events. \- \*\*Flexibility\*\*: It offers various functionalities beyond just transcript assembly, like quantification and integration with differential expression tools. When considering transcript assembly and quantification tools, it's essential to keep in mind the specific needs and characteristics of your dataset. While StringTie is powerful and versatile, there are other tools in the bioinformatics field, each with its strengths and weaknesses, that might be more appropriate for specific use cases or datasets. ## Example Code Assuming you've already aligned your RNA-Seq reads to a reference genome and have a resulting BAM file (e.g., `aligned_reads.bam`), you can use the following steps: 1. **Transcript Assembly with StringTie**: ``` bash stringtie -p 8 -G reference_annotation.gtf -o output.gtf -l sample_name aligned_reads.bam ``` - `-p 8`: Use 8 threads/processors. - `-G reference_annotation.gtf`: This is the reference annotation in GTF format. You might have one from sources like GENCODE or Ensembl. - `-o output.gtf`: This specifies the name of the output GTF file. - `-l sample_name`: Labels the transcripts with the provided prefix (useful for distinguishing samples). 2. **Quantify Transcripts**: The previous command also provides expression estimates. However, you might want to merge transcripts from multiple samples or conditions and then re-quantify. For this, you can use a two-step process: a. **Merge Transcripts**: ``` bash stringtie --merge -p 8 -G reference_annotation.gtf -o merged_transcripts.gtf assembly_list.txt ``` - `assembly_list.txt`: A text file listing the paths to the GTF files for each sample, one file per line. b. **Re-quantify against Merged Transcriptome**: ``` bash stringtie -e -B -p 8 -G merged_transcripts.gtf -o sample_name_recquant.gtf aligned_reads.bam ``` 3. **Prepare Output for Ballgown**: If you plan to use Ballgown for downstream differential expression analysis, you'd need the `-B` flag during the quantification step (as shown above). This tells StringTie to produce output files in a format Ballgown can ingest. Remember to adjust the file names (`aligned_reads.bam`, `reference_annotation.gtf`, etc.) to match your dataset. Always consult the StringTie documentation for the most up-to-date information and for an explanation of additional parameters not covered here. \## more ::: ```{bash} ls ../output/05.2-lncRNA/*.sorted.bam ``` ```{bash} cd ../data curl -O https://gannet.fish.washington.edu/seashell/snaps/GCF_013753865.1_Amil_v2.1_genomic.gff ``` ```{bash} find ../output/05.2-lncRNA/ -name "*sorted.bam" | xargs basename -s .sorted.bam | xargs -I{} \ /home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \ -p 20 \ -G ../data/GCF_013753865.1_Amil_v2.1_genomic.gff \ -o ../output/05.2-lncRNA/{}.gtf \ ../output/05.2-lncRNA/{}.sorted.bam \ ``` ```{r, engine='bash'} head -3 ../output/05.2-lncRNA/*.gtf ``` Use StringTie merge to merge all the individual GTF files into a single merged GTF file... ```{r, engine='bash' stringtie2} /home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \ --merge \ -G ../data/GCF_013753865.1_Amil_v2.1_genomic.gff \ -o ../output/05.2-lncRNA/apul_merged.gtf \ ../output/05.2-lncRNA/*.gtf ``` ```{bash} head ../output/05.2-lncRNA/apul_merged.gtf tail ../output/05.2-lncRNA/apul_merged.gtf wc -l ../output/05.2-lncRNA/apul_merged.gtf ``` ## GFFcompare Use gffcompare to compare annotation file generated by StringTie to a reference annotation file and to produce a set of output files summarizing the results of the comparison, including classification of each transcript... ```{bash gffcompare} /home/shared/gffcompare-0.12.6.Linux_x86_64/gffcompare \ -r ../data/GCF_013753865.1_Amil_v2.1_genomic.gff \ -G ../data/GCF_013753865.1_Amil_v2.1_genomic.gff \ -o /home/shared/8TB_HDD_01/apul/gffcompare/gffcompare_merged \ /home/shared/8TB_HDD_01/apul/merged-gtf/stringtie_merged.gtf \ ``` ```{bash} head /home/shared/8TB_HDD_01/apul/gffcompare/gffcompare_merged.combined.gtf ``` ## Filter Filter to get a subset of the transcripts from the original GTF file that are putative lncRNA candidates based on their length and lack of overlap with known reference transcripts... ```{bash filter} awk '$3 == "transcript" && $1 !~ /^#/ {print}' /home/shared/8TB_HDD_01/apul/gffcompare/gffcompare_merged.combined.gtf | grep 'class_code "u"' | awk '$5 - $4 > 199 {print}' > /home/shared/8TB_HDD_01/apul/filter-one/merged_lncRNA_candidates.gtf ``` ```{bash} head /home/shared/8TB_HDD_01/pver/filter-output/merged_lncRNA_candidates.gtf ``` ## Bedtools Get fasta of lncRNA candidate regions with Bedtools... ```{bash} /home/shared/bedtools2/bin/bedtools \ getfasta -fi /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/data/GCF_013753865.1_Amil_v2.1_genomic.fna -bed /home/shared/8TB_HDD_01/apul/filter-one/merged_lncRNA_candidates.gtf -fo /home/shared/8TB_HDD_01/apul/get-fasta/merged_lncRNA_candidates.fasta -name -split ``` ```{bash} head /home/shared/8TB_HDD_01/pver/bedtools-output/merged_lncRNA_candidates.fasta ``` ## CPC2 Evaluate coding potential of transcripts with Coding Potential Calculator 2... ```{bash} eval "$(/opt/anaconda/anaconda3/bin/conda shell.bash hook)" python /home/shared/CPC2_standalone-1.0.1/bin/CPC2.py -i /home/shared/8TB_HDD_01/apul/get-fasta/merged_lncRNA_candidates.fasta -o ~/github/coral-lncRNA/output/apul_merged_cpc2_results ``` ## Final Filter Filter for those transcripts with label "noncoding"... ```{bash} # Filter out transcripts with coding potential awk '$8 == "noncoding" {print $1}' /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/ouput/apul_merged_cpc2_results.txt > /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/ouput/apul_noncoding_transcripts_ids.txt grep -Fwf /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/ouput/apul_noncoding_transcripts_ids.txt /home/shared/8TB_HDD_01/apul/get-fasta/merged_lncRNA_candidates.fasta > /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/ouput/apul_merged_final_lncRNAs.gtf ``` ## De-duplicate GTF Remove duplicates of transcript IDs that appear more than once within the GTF... ```{bash} awk '!seen[$0]++' ~/github/coral-lncRNA/ouput/apul_merged_final_lncRNAs.gtf > ~/github/coral-lncRNA/ouput/apul_deduplicated_final_lncRNAs.gtf ``` This results in the final GTF list with transcript IDs containing scaffold and base range, information used to filter the reference FASTA for only lncRNAs. ## Reformat to BED Make GTF into BED format so Bedtools can recognize scaffold and base range... ```{bash} awk -F":|-" '{print $3 "\t" $4 "\t" $5}' /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/ouput/apul_deduplicated_final_lncRNAs.gtf > /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/ouput/apul_deduplicated_final_lncRNAs.bed ``` ## BEDTools Use BEDTools to extract sequences of lncRNAs using the reformatted GTF list of transcript IDs... ```{bash} /home/shared/bedtools2/bin/bedtools \ getfasta -fi /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/data/GCF_013753865.1_Amil_v2.1_genomic.fna -bed /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/ouput/apul_deduplicated_final_lncRNAs.bed -fo /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/ouput/apul_bedtools_lncRNAs.fasta -name ```