lncRNA Comparison

Author

Zach Bengtsson, Steven Roberts

Published

August 16, 2023

lncRNA Discovery

(using Acropora pulchra as example)

HISAT

HISAT2 build to create an index for the Acropora millipora FASTA file…

#FASTA file path input
ref_genome="/home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/data/GCF_013753865.1_Amil_v2.1_genomic.fna"
index="../output/05.2-lncRNA/Amil.index"
/home/shared/hisat2-2.2.1/hisat2-build \
-f "$ref_genome" \
"$index"

HISAT2 to align paired-end RNA-Seq reads to the Pver_genome_assembly_v1.0.fasta index…

ls /home/shared/8TB_HDD_01/apul/trimmed/
find /home/shared/8TB_HDD_01/apul/trimmed/*gz \
| xargs basename -s _R1_001.fastp-trim.20230519.fastq.gz | xargs -I{} \
/home/shared/hisat2-2.2.1/hisat2 \
-x ../output/05.2-lncRNA/Amil.index \
-p 20 \
-1 /home/shared/8TB_HDD_01/apul/trimmed/{}_R1_001.fastp-trim.20230519.fastq.gz \
-2 /home/shared/8TB_HDD_01/apul/trimmed/{}_R2_001.fastp-trim.20230519.fastq.gz \
-S ../output/05.2-lncRNA/{}.sam

Samtools

Samtools to convert the SAM files output from the previous HISAT2 command into sorted BAM files…

for file in ../output/05.2-lncRNA/*.sam; do
    base=$(basename "$file" .sam)
    /home/shared/samtools-1.12/samtools view -bS "$file" | \
    /home/shared/samtools-1.12/samtools sort \
    -o ../output/05.2-lncRNA/"$base".sorted.bam
done

StringTie

StringTie to assemble transcripts from the sorted BAM files generated by the previous Samtools commands…

`StringTie` is a software tool for assembling RNA-Seq data, which helps in reconstructing a set of transcripts that can potentially explain the RNA-Seq read alignments (often provided in BAM or SAM format) for a given genomic sequence.

Here’s a basic overview of what StringTie does:

1. **Transcript Assembly**: After RNA-Seq reads are aligned to a reference genome using tools like `TopHat`, `HISAT`, or `STAR`, StringTie can use those alignments to assemble transcripts. This process is particularly useful for identifying novel isoforms or alternative splicing events.

2. **Quantification**: Apart from transcript assembly, StringTie also provides an estimate of the expression levels of each transcript, usually in the form of Fragments Per Kilobase of transcript per Million mapped reads (FPKM) or Transcripts Per Million (TPM).

3. **Merging Transcripts**: StringTie can merge multiple transcript assemblies, which can be useful when you have RNA-Seq data from multiple conditions or time points and you want to create a unified set of transcripts.

4. **Comparison with Reference Annotation**: You can also compare the assembled transcripts with a reference annotation (like GTF/GFF files) to determine how many of the assembled transcripts match known annotated transcripts and to find novel ones.

5. **Ballgown Integration**: StringTie can produce output that’s compatible with `Ballgown`, a tool for differential expression analysis and visualization of RNA-Seq data.

Why is it popular?

- **Efficiency**: StringTie is known for its speed and efficiency, making it suitable for large datasets, like those produced in modern RNA-Seq experiments.

- **Sensitivity**: The algorithm behind StringTie was designed to be highly sensitive, especially when detecting alternative splicing events.

- **Flexibility**: It offers various functionalities beyond just transcript assembly, like quantification and integration with differential expression tools.

When considering transcript assembly and quantification tools, it’s essential to keep in mind the specific needs and characteristics of your dataset. While StringTie is powerful and versatile, there are other tools in the bioinformatics field, each with its strengths and weaknesses, that might be more appropriate for specific use cases or datasets.

Assuming you’ve already aligned your RNA-Seq reads to a reference genome and have a resulting BAM file (e.g., aligned_reads.bam), you can use the following steps:

  1. Transcript Assembly with StringTie:

    stringtie -p 8 -G reference_annotation.gtf -o output.gtf -l sample_name aligned_reads.bam
    • -p 8: Use 8 threads/processors.
    • -G reference_annotation.gtf: This is the reference annotation in GTF format. You might have one from sources like GENCODE or Ensembl.
    • -o output.gtf: This specifies the name of the output GTF file.
    • -l sample_name: Labels the transcripts with the provided prefix (useful for distinguishing samples).
  2. Quantify Transcripts: The previous command also provides expression estimates. However, you might want to merge transcripts from multiple samples or conditions and then re-quantify. For this, you can use a two-step process:

    1. Merge Transcripts:

      stringtie --merge -p 8 -G reference_annotation.gtf -o merged_transcripts.gtf assembly_list.txt
      • assembly_list.txt: A text file listing the paths to the GTF files for each sample, one file per line.
    2. Re-quantify against Merged Transcriptome:

      stringtie -e -B -p 8 -G merged_transcripts.gtf -o sample_name_recquant.gtf aligned_reads.bam
  3. Prepare Output for Ballgown: If you plan to use Ballgown for downstream differential expression analysis, you’d need the -B flag during the quantification step (as shown above). This tells StringTie to produce output files in a format Ballgown can ingest.

Remember to adjust the file names (aligned_reads.bam, reference_annotation.gtf, etc.) to match your dataset. Always consult the StringTie documentation for the most up-to-date information and for an explanation of additional parameters not covered here. ## more

:::

find ../output/05.2-lncRNA/*.sorted.bam \
| xargs basename -s .sorted.bam | xargs -I{} \
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
-p 8 \
-G ../data/GCF_013753865.1_Amil_v2.1_genomic.gff \
-o ../output/05.2-lncRNA/{}.gtf \
/home/shared/8TB_HDD_01/apul/bam/{}.bam \

Use StringTie merge to merge all the individual GTF files into a single merged GTF file…

/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
--merge \
-G ../data/GCF_013753865.1_Amil_v2.1_genomic.gff \
-o /home/shared/8TB_HDD_01/apul/merged-gtf/stringtie_merged.gtf \
/home/shared/8TB_HDD_01/apul/stringtie-assembly/*.gtf
head /home/shared/8TB_HDD_01/apul/merged-gtf/stringtie_merged.gtf
tail /home/shared/8TB_HDD_01/apul/merged-gtf/stringtie_merged.gtf

GFFcompare

Use gffcompare to compare annotation file generated by StringTie to a reference annotation file and to produce a set of output files summarizing the results of the comparison, including classification of each transcript…

/home/shared/gffcompare-0.12.6.Linux_x86_64/gffcompare \
-r ../data/GCF_013753865.1_Amil_v2.1_genomic.gff \
-G ../data/GCF_013753865.1_Amil_v2.1_genomic.gff \
-o /home/shared/8TB_HDD_01/apul/gffcompare/gffcompare_merged \
/home/shared/8TB_HDD_01/apul/merged-gtf/stringtie_merged.gtf \
head /home/shared/8TB_HDD_01/apul/gffcompare/gffcompare_merged.combined.gtf

Filter

Filter to get a subset of the transcripts from the original GTF file that are putative lncRNA candidates based on their length and lack of overlap with known reference transcripts…

awk '$3 == "transcript" && $1 !~ /^#/ {print}' /home/shared/8TB_HDD_01/apul/gffcompare/gffcompare_merged.combined.gtf | grep 'class_code "u"' | awk '$5 - $4 > 199 {print}' > /home/shared/8TB_HDD_01/apul/filter-one/merged_lncRNA_candidates.gtf
head /home/shared/8TB_HDD_01/pver/filter-output/merged_lncRNA_candidates.gtf

Bedtools

Get fasta of lncRNA candidate regions with Bedtools…

/home/shared/bedtools2/bin/bedtools \
getfasta -fi /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/data/GCF_013753865.1_Amil_v2.1_genomic.fna -bed /home/shared/8TB_HDD_01/apul/filter-one/merged_lncRNA_candidates.gtf -fo /home/shared/8TB_HDD_01/apul/get-fasta/merged_lncRNA_candidates.fasta -name -split
head /home/shared/8TB_HDD_01/pver/bedtools-output/merged_lncRNA_candidates.fasta

CPC2

Evaluate coding potential of transcripts with Coding Potential Calculator 2…

eval "$(/opt/anaconda/anaconda3/bin/conda shell.bash hook)"
python /home/shared/CPC2_standalone-1.0.1/bin/CPC2.py -i /home/shared/8TB_HDD_01/apul/get-fasta/merged_lncRNA_candidates.fasta -o ~/github/coral-lncRNA/output/apul_merged_cpc2_results

Final Filter

Filter for those transcripts with label “noncoding”…

# Filter out transcripts with coding potential
awk '$8 == "noncoding" {print $1}' /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/ouput/apul_merged_cpc2_results.txt > /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/ouput/apul_noncoding_transcripts_ids.txt
grep -Fwf /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/ouput/apul_noncoding_transcripts_ids.txt /home/shared/8TB_HDD_01/apul/get-fasta/merged_lncRNA_candidates.fasta > /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/ouput/apul_merged_final_lncRNAs.gtf

De-duplicate GTF

Remove duplicates of transcript IDs that appear more than once within the GTF…

awk '!seen[$0]++' ~/github/coral-lncRNA/ouput/apul_merged_final_lncRNAs.gtf > ~/github/coral-lncRNA/ouput/apul_deduplicated_final_lncRNAs.gtf

This results in the final GTF list with transcript IDs containing scaffold and base range, information used to filter the reference FASTA for only lncRNAs.

Reformat to BED

Make GTF into BED format so Bedtools can recognize scaffold and base range…

awk -F":|-" '{print $3 "\t" $4 "\t" $5}' /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/ouput/apul_deduplicated_final_lncRNAs.gtf > /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/ouput/apul_deduplicated_final_lncRNAs.bed

BEDTools

Use BEDTools to extract sequences of lncRNAs using the reformatted GTF list of transcript IDs…

/home/shared/bedtools2/bin/bedtools \
getfasta -fi /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/data/GCF_013753865.1_Amil_v2.1_genomic.fna -bed /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/ouput/apul_deduplicated_final_lncRNAs.bed -fo /home/shared/8TB_HDD_02/zbengt/github/coral-lncRNA/ouput/apul_bedtools_lncRNAs.fasta -name