---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
layout: post
title: Genome Annotation - O.lurida (v081) Hisat2 Transcript Isoforms Index
date: '2019-06-25 08:01'
tags:
  - Ostrea lurida
  - Olympia oyster
  - v081
  - Olurida_v081
  - hisat2
  - GTF
  - annotation
categories:
  - 2019
  - Olympia Oyster Genome Sequencing
---
Per [this thread in Slack](https://genefish.slack.com/archives/GHB1LCNRW/p1560979157005300?thread_ts=1560978863.004100&cid=GHB1LCNRW), we realized that the ["final" annotation of the Olurida_v081 genome](https://robertslab.github.io/sams-notebook/posts/2019/2019-01-09-Annotation---Olurida_v081-MAKER-Functional-Annotations-on-Mox/) only seemed to have singular mRNA annotations and no apparent isoforms. As such, I decided to see if I could tease out this type of info.

I decided to use [Stringtie](https://ccb.jhu.edu/software/stringtie/index.shtml), as it seemed robust and relatively straightforward. Plus, it had a decent user guide explaining how to tackle this exact problem.

However, before being able to start in with Stringtie, a couple of things needed to be done first to prepare files for use with Stringtie:

1. Create GTF file (basically a GFF specifically for use with transcripts - thus the "T" in GTF) from input GFF file. Done with [GFF utilities software](http://ccb.jhu.edu/software/stringtie/gff.shtml).

2. Identify splice sites and exons in newly-created GTF. Done with [Hisat2](https://ccb.jhu.edu/software/hisat2/manual.shtml) software.

2. Create a Hisat2 reference index that utilizes the GTF. Done with [Hisat2](https://ccb.jhu.edu/software/hisat2/manual.shtml) software.


This was run on Mox. The SBATCH script has a bunch of leftover extraneous steps that aren't relevant to this step of the annotation process; specifically the FastQ manipulation steps. Originally, I had a large script running both this and the subsequent Stringtie process. However, I was having issues with Stringtie and it made more sense to have these GTF/indexing steps as a separate script, so I chopped off the Stringtie stuff and neglected to remove the FastQ stuff. I didn't want to edit the script after I actually ran it, so have left it in here.

SBATCH script (GitHub):

- [20190625_hisat2-build_oly_v081.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20190625_hisat2-build_oly_v081.sh)

```shell
#!/bin/bash
## Job Name
#SBATCH --job-name=oly_hisat2
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=25-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190625_hisat2-build_oly_v081

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo "${PATH}" | tr : \\n >> system_path.log


threads=27
genome_index_name="Olurida_v081"

# Paths to programs
gffread="/gscratch/srlab/programs/gffread-0.11.4.Linux_x86_64/gffread"
hisat2_dir="/gscratch/srlab/programs/hisat2-2.1.0"
hisat2_build="${hisat2_dir}/hisat2-build"
hisat2_exons="${hisat2_dir}/hisat2_extract_exons.py"
hisat2_splice_sites="${hisat2_dir}/hisat2_extract_splice_sites.py"

# Input/output files
genome_gff="/gscratch/srlab/sam/data/O_lurida/genomes/Olurida_v081/20181127_oly_genome_snap02.all.renamed.putative_function.domain_added.gff"
exons="hisat2_exons.tab"
fastq_dir="/gscratch/srlab/sam/data/O_lurida/RNAseq/"
genome_fasta="/gscratch/srlab/sam/data/O_lurida/genomes/Olurida_v081/Olurida_v081.fa"
splice_sites="hisat2_splice_sites.tab"
transcripts_gtf="20190620_oly_genome_snap02.all.renamed.putative_function.domain_added.transcripts.gtf"

## Inititalize arrays
fastq_array_R1=()
fastq_array_R2=()

# Create array of fastq R1 files
for fastq in ${fastq_dir}/*R1*.gz
do
  fastq_array_R1+=(${fastq})
done

# Create array of fastq R2 files
for fastq in ${fastq_dir}/*R2*.gz
do
  fastq_array_R2+=(${fastq})
done

# Create array of sample names
## Uses parameter substitution to strip leading path from filename
## Uses awk to parse out sample name from filename
for R1_fastq in ${fastq_dir}/*R1*.gz
do
  names_array+=($(echo ${R1_fastq#${fastq_dir}} | awk -F"[_.]" '{print $1 "_" $5}'))
done

# Create list of fastq files used in analysis
## Uses parameter substitution to strip leading path from filename
for fastq in ${fastq_dir}*.gz
do
  echo "${fastq#${fastq_dir}}" >> fastq.list.txt
done

# Create transcipts GTF from genome FastA
"${gffread}" -T \
"${genome_gff}" \
-o "${transcripts_gtf}"

# Create Hisat2 exons tab file
"${hisat2_exons}" \
"${transcripts_gtf}" \
> "${exons}"
# Create Hisate2 splice sites tab file
"${hisat2_splice_sites}" \
"${transcripts_gtf}" \
> "${splice_sites}"

# Build Hisat2 reference index using splice sites and exons
"${hisat2_build}" \
"${genome_fasta}" \
"${genome_index_name}" \
--exon "${exons}" \
--ss "${splice_sites}" \
-p "${threads}" \
2> hisat2_build.err
```

---

# RESULTS

Output folder:

- [20190625_hisat2-build_oly_v081/](https://gannet.fish.washington.edu/Atumefaciens/20190625_hisat2-build_oly_v081/)

The Hisat2 index files are: `*.ht2`. These will be used with Stringtie for transcript isoform annotation.