---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
layout: post
title: Genome Annotation - Pgenerosa_v074 Hisat2 Transcript Isoform Index
date: '2019-07-23 13:30'
tags:
  - mox
  - Panopea generosa
  - geoduck
  - hisat2
  - gffutils
  - gff
  - gtf
categories:
  - 2019
  - Geoduck Genome Sequencing
---

Essentially, the steps below (which is what was done here) are needed to prepare files for use with [Stringtie](https://ccb.jhu.edu/software/stringtie/index.shtml):

1. Create GTF file (basically a GFF specifically for use with transcripts - thus the "T" in GTF) from input GFF file. Done with [GFF utilities software](http://ccb.jhu.edu/software/stringtie/gff.shtml).

2. Identify splice sites and exons in newly-created GTF. Done with [Hisat2](https://ccb.jhu.edu/software/hisat2/manual.shtml) software.

3. Create a Hisat2 reference index that utilizes the GTF. Done with [Hisat2](https://ccb.jhu.edu/software/hisat2/manual.shtml) software.

This was run on Mox.

The SBATCH script has a bunch of leftover extraneous steps that aren't relevant to this step of the annotation process; specifically the FastQ manipulation steps. This is due to a copy/paste from a previous Hisat2 run that I neglected to edit out and I didn't want to edit the script after I actually ran it, so have left it in here.

SBATCH script (GitHub):

- [20190723_hisat2-build_pgen_v074.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20190723_hisat2-build_pgen_v074.sh)

```shell
#!/bin/bash
## Job Name
#SBATCH --job-name=oly_hisat2
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=25-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190723_hisat2-build_pgen_v074

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo "${PATH}" | tr : \\n >> system_path.log


threads=28
genome_index_name="Pgenerosa_v074"

# Paths to programs
gffread="/gscratch/srlab/programs/gffread-0.11.4.Linux_x86_64/gffread"
hisat2_dir="/gscratch/srlab/programs/hisat2-2.1.0"
hisat2_build="${hisat2_dir}/hisat2-build"
hisat2_exons="${hisat2_dir}/hisat2_extract_exons.py"
hisat2_splice_sites="${hisat2_dir}/hisat2_extract_splice_sites.py"

# Input/output files
fastq_dir="/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq"
genome_dir="/gscratch/srlab/sam/data/P_generosa/genomes"

genome_gff="${genome_dir}/Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff"
exons="hisat2_exons.tab"
genome_fasta="${genome_dir}/Pgenerosa_v074.fa"
splice_sites="hisat2_splice_sites.tab"
transcripts_gtf="Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gtf"

## Inititalize arrays
fastq_array_R1=()
fastq_array_R2=()

# Create array of fastq R1 files
for fastq in "${fastq_dir}"/*R1*.gz
do
  fastq_array_R1+=("${fastq}")
done

# Create array of fastq R2 files
for fastq in "${fastq_dir}"/*R2*.gz
do
  fastq_array_R2+=("${fastq}")
done

# Create array of sample names
## Uses parameter substitution to strip leading path from filename
## Uses awk to parse out sample name from filename
for R1_fastq in "${fastq_dir}"/*R1*.gz
do
  names_array+=($(echo "${R1_fastq#${fastq_dir}}" | awk -F"[_.]" '{print $1 "_" $5}'))
done

# Create list of fastq files used in analysis
## Uses parameter substitution to strip leading path from filename
for fastq in "${fastq_dir}"/*.gz
do
  echo "${fastq#${fastq_dir}}" >> fastq.list.txt
done

# Create transcipts GTF from genome FastA
"${gffread}" -T \
"${genome_gff}" \
-o "${transcripts_gtf}"

# Create Hisat2 exons tab file
"${hisat2_exons}" \
"${transcripts_gtf}" \
> "${exons}"
# Create Hisate2 splice sites tab file
"${hisat2_splice_sites}" \
"${transcripts_gtf}" \
> "${splice_sites}"

# Build Hisat2 reference index using splice sites and exons
"${hisat2_build}" \
"${genome_fasta}" \
"${genome_index_name}" \
--exon "${exons}" \
--ss "${splice_sites}" \
-p "${threads}" \
2> hisat2_build.err

# Copy Hisat2 index files to my data directory
rsync -av "${genome_index_name}"*.ht2 "${genome_dir}"
```

---

# RESULTS

This took a _shockingly_ short amount of time to complete: ~10mins!

![Screencap of Hisat2 pgen_v074 runtime](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20190723_hisat2_build_pgen074_runtime.png?raw=true)

Other Hisat indexing I've done with _Ostrea lurida_ and the Pgen_v070 assemblies all took ~1hr.

Output folder:

- [20190723_hisat2-build_pgen_v074/](https://gannet.fish.washington.edu/Atumefaciens/20190723_hisat2-build_pgen_v074/)

The Hisat2 index files are: `*.ht2`. These will be used with Stringtie for transcript isoform annotation.