---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
layout: post
title: Genome Annotation - O.lurida 20190709-v081 Transcript Isoform ID with Stringtie on Mox
date: '2019-07-16 15:07'
tags:
  - Ostrea lurida
  - Olympia oyster
  - mox
  - stringtie
  - gtf
  - 20190709-v081
  - annotation
  - isoforms
categories:
  - 2019
  - Olympia Oyster Genome Sequencing
---
[Earlier today, I generated the necessary Hista2 index, which incorporated splice sites and exons](https://robertslab.github.io/sams-notebook/posts/2019/2019-07-16-Genome-Annotation-O.lurida-20190709-v081-Hisat2-Transcript-Isoform-Index/), for use with Stringtie in order to identify transcript isoforms in our [20190709-Olurida_v081 annotation](https://robertslab.github.io/sams-notebook/2019/07/09/Genome-Annotation-Olurida_v081-with-MAKER-and-Tissue-specific-Transcriptomes-on-Mox/). This annotation utilized tissue-specific transcriptome assemblies provided by Katherine Silliman.

I used all the trimmed FastQ files from the [20180827 Trinity transcriptome assembly](https://robertslab.github.io/sams-notebook/posts/2018/2018-09-19-transcriptome-assembly-olympia-oyster-rnaseq-data-with-trinity/), as this utilized all of our existing RNAseq data.

- [20180827_trinity_oly_RNAseq/trinity_out_dir.tar.gz](https://owl.fish.washington.edu/Athaliana/20180827_trinity_oly_RNAseq/trinity_out_dir.tar.gz)

Command to pull trimmed files (Trimmomatic) out of the Trinity output folder that is a gzipped tarball:

```shell
tar -ztvf trinity_out_dir.tar.gz \
| grep -E "*P.qtrim.gz" \
&& tar -zxvf trinity_out_dir.tar.gz \
-C /home/sam/Downloads/ \
--wildcards "*P.qtrim.gz"
```

This was run locally on my computer (swoose) and then `rsync`'d to Mox.

NOTE: The "P" in the `*.P.qtrim.gz` represents trimmed reads that are properly paired, as determined by Trimmomatic/Trinity. See the [fastq.list.txt](https://gannet.fish.washington.edu/Atumefaciens/20190625_stringtie_oly_v081/fastq.list.txt) for the list of FastQ files used as input. For more info on input FastQ files, refer to [the Nightingales Google Sheet](http://b.link/nightingales).


Here's the quick rundown of how transcript isoform annotation with Stringtie runs:

1. Use Hisat2 reference index with identified splice sites and exons (this was done yesterday).

2. Use Hisat2 to create alignments from each pair of trimmed FastQ files. _Need to use the `--downstream-transcriptome-assembly` option!!!_

3. Use Stringtie to create a GTF for each alignment.

4. Use Stringtie to create a singular, merged GTF from all of the individual GTFs.


SBATCH script (GitHub):

- [20190716_stringtie_20190709-olur-v081.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20190716_stringtie_20190709-olur-v081.sh)

```shell
#!/bin/bash
## Job Name
#SBATCH --job-name=oly_stringtie
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=25-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190716_stringtie_20190709-olur-v081

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo "${PATH}" | tr : \\n >> system_path.log


threads=27
genome_index_name="20190709-Olurida_v081"

# Paths to programs
hisat2_dir="/gscratch/srlab/programs/hisat2-2.1.0"
hisat2="${hisat2_dir}/hisat2"
samtools="/gscratch/srlab/programs/samtools-1.9/samtools"
stringtie="/gscratch/srlab/programs/stringtie-1.3.6.Linux_x86_64/stringtie"

# Input/output files
genome_gff="/gscratch/srlab/sam/data/O_lurida/genomes/Olurida_v081/20190709-Olurida_v081_genome_snap02.all.renamed.putative_function.domain_added.gff"
genome_index_dir="/gscratch/srlab/sam/data/O_lurida/genomes/Olurida_v081"
fastq_dir="/gscratch/srlab/sam/data/O_lurida/RNAseq/"
gtf_list="gtf_list.txt"

## Inititalize arrays
fastq_array_R1=()
fastq_array_R2=()
names_array=()

# Copy Hisat2 genome index
rsync -av "${genome_index_dir}"/${genome_index_name}*.ht2 .

# Generate checksum of GFF file for backtracking to original
# Original named: Olurida_v081_genome_snap02.all.renamed.putative_function.domain_added.gff
# Created in 20190709 Olurida_v081 annotation - renamed to avoid filename clashes with previous annotations.
md5sum "${genome_gff}" > genome_gff.md5

# Create array of fastq R1 files
for fastq in "${fastq_dir}"*R1*.gz
do
  fastq_array_R1+=("${fastq}")
done

# Create array of fastq R2 files
for fastq in "${fastq_dir}"*R2*.gz
do
  fastq_array_R2+=("${fastq}")
done

# Create array of sample names
## Uses parameter substitution to strip leading path from filename
## Uses awk to parse out sample name from filename
for R1_fastq in "${fastq_dir}"*R1*.gz
do
  names_array+=($(echo "${R1_fastq#${fastq_dir}}" | awk -F"[_.]" '{print $1 "_" $5}'))
done

# Create list of fastq files used in analysis
## Uses parameter substitution to strip leading path from filename
for fastq in "${fastq_dir}"*.gz
do
  echo "${fastq#${fastq_dir}}" >> fastq.list.txt
done

# Hisat2 alignments
for index in "${!fastq_array_R1[@]}"
do
  sample_name=$(echo "${names_array[index]}")
  "${hisat2}" \
  -x "${genome_index_name}" \
  --downstream-transcriptome-assembly \
  -1 "${fastq_array_R1[index]}" \
  -2 "${fastq_array_R2[index]}" \
  -S "${sample_name}".sam \
  2> "${sample_name}"_hisat2.err

# Sort SAM files, convert to BAM, and index
  "${samtools}" view \
  -@ "${threads}" \
  -Su "${sample_name}".sam \
  | "${samtools}" sort - \
  -@ "${threads}" \
  -o "${sample_name}".sorted.bam
  "${samtools}" index "${sample_name}".sorted.bam

# Run stringtie on alignments
  "${stringtie}" "${sample_name}".sorted.bam \
  -p "${threads}" \
  -o "${sample_name}".gtf \
  -G "${genome_gff}" \
  -C "${sample_name}.cov_refs.gtf"

# Add GTFs to list file
  echo "${sample_name}.gtf" >> "${gtf_list}"
done

# Create singular transcript file, using GTF list file
"${stringtie}" --merge \
"${gtf_list}" \
-p "${threads}" \
-G "${genome_gff}" \
-o "${genome_index_name}".stringtie.gtf

# Delete unneccessary index files
rm "${genome_index_name}"*.ht2

# Delete unneded SAM files
rm ./*.sam
```

---

# RESULTS

This took ~24hrs to run:

![Screencap of Mox Stringtie runtime](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20190716_stringtie_olur_v081_runtime.png?raw=true)

Output folder:

- [20190716_stringtie_20190709-olur-v081/](https://gannet.fish.washington.edu/Atumefaciens/20190716_stringtie_20190709-olur-v081/)

Merged GTF:

- [20190716_stringtie_20190709-olur-v081/20190709-Olurida_v081.stringtie.gtf](https://gannet.fish.washington.edu/Atumefaciens/20190716_stringtie_20190709-olur-v081/20190709-Olurida_v081.stringtie.gtf)

Although I won't link them here, each input FastQ pair has a corresponding alignment file (BAM), coverage file (`.cov_refs.gtf`), Hisat2 alignment stats file (`_hisat2.err`), and GTF.