---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
layout: post
title: Genome Annotations - Splice Site and Exon Extractions for C.virginica GCF_002022765.2 Genome Using Hisat2 on Mox
date: '2021-07-20 10:09'
tags: 
  - Hisat2
  - mox
  - Crassostrea virginica
  - Eastern oyster
categories: 
  - 2021
  - Miscellaneous
---
I previously performed quality trimming on the [_Crassostrea virginica_ (Eastern oyster)](https://en.wikipedia.org/wiki/Eastern_oyster) gonad/sperm RNAseq data on [20210714](https://robertslab.github.io/sams-notebook/posts/2021/2021-07-14-Trimming---C.virginica-Gonad-RNAseq-with-FastP-on-Mox/). Next, I needed to identify exons and splice sites, and to generate a genome index with [`HISAT2`](https://daehwankimlab.github.io/hisat2/) for downstream use with [`StringTie`](https://ccb.jhu.edu/software/stringtie/) to identify potential alternative transcripts. This used the following NCBI genome files:

- FastA: `GCF_002022765.2_C_virginica-3.0_genomic.fna`

- GFF: `GCF_002022765.2_C_virginica-3.0_genomic.gff`

- GTF: `GCF_002022765.2_C_virginica-3.0_genomic.gtf`

Metadata for this project is here:

[https://github.com/RobertsLab/project-oyster-comparative-omics/blob/master/metadata/Virginica-Final-DNA-RNA-Yield.csv](https://github.com/RobertsLab/project-oyster-comparative-omics/blob/master/metadata/Virginica-Final-DNA-RNA-Yield.csv)

This was run on Mox.

SBATCH script (GitHub):

- [20210720_cvir_GCF_002022765.2_hisat2-build-index-exons-splices.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20210720_cvir_GCF_002022765.2_hisat2-build-index-exons-splices.sh)


```shell
#!/bin/bash
## Job Name
#SBATCH --job-name=20210720_cvir_GCF_002022765.2_hisat2-build-index-exons-splices
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=5-00:00:00
## Memory per node
#SBATCH --mem=200G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20210720_cvir_GCF_002022765.2_hisat2-build-index-exons-splices

## Script using HISAT2 to build a genome index and to identify exons and splice sites in the NCBI C.virginica genome assembly.


###################################################################################
# These variables need to be set by user

## Assign Variables

# Set number of CPUs to use
threads=40

# Set base name for HISAT2 index files
genome_index_name="cvir_GCF_002022765.2"

# Paths to programs
hisat2_dir="/gscratch/srlab/programs/hisat2-2.1.0"
hisat2_build="${hisat2_dir}/hisat2-build"
hisat2_exons="${hisat2_dir}/hisat2_extract_exons.py"
hisat2_splice_sites="${hisat2_dir}/hisat2_extract_splice_sites.py"

# Input/output files
exons="cvir_GCF_002022765.2_hisat2_exons.tab"
genome_dir="/gscratch/srlab/sam/data/C_virginica/genomes"
genome_gff="${genome_dir}/GCF_002022765.2_C_virginica-3.0_genomic.gff"
genome_fasta="${genome_dir}/GCF_002022765.2_C_virginica-3.0_genomic.fna"
splice_sites="cvir_GCF_002022765.2_hisat2_splice_sites.tab"
transcripts_gtf="${genome_dir}/GCF_002022765.2_C_virginica-3.0_genomic.gtf"

# Programs associative array
declare -A programs_array
programs_array=(
[hisat2_build]="${hisat2_build}" \
[hisat2_exons]="${hisat2_exons}" \
[hisat2_splice_sites]="${hisat2_splice_sites}"
)


###################################################################################################

# Exit script if any command fails
set -e

# Load Mox Python module (needed to run the HISAT2 Python extraction scripts)
module load intel-python3_2017

# Create Hisat2 exons tab file
"${programs_array[hisat2_exons]}" \
"${transcripts_gtf}" \
> "${exons}"

# Create Hisat2 splice sites tab file
"${programs_array[hisat2_splice_sites]}" \
"${transcripts_gtf}" \
> "${splice_sites}"

# Build Hisat2 reference index using splice sites and exons
"${programs_array[hisat2_build]}" \
"${genome_fasta}" \
"${genome_index_name}" \
--exon "${exons}" \
--ss "${splice_sites}" \
-p "${threads}" \
2> hisat2_build.err

# Generate checksums for all files
md5sum * >> checksums.md5

# Copy Hisat2 index files to my data directory for later use with StringTie
rsync -av "${genome_index_name}"*.ht2 "${genome_dir}"


#######################################################################################################

# Capture program options
if [[ "${#programs_array[@]}" -gt 0 ]]; then
  echo "Logging program options..."
  for program in "${!programs_array[@]}"
  do
    {
    echo "Program options for ${program}: "
    echo ""
    # Handle samtools help menus
    if [[ "${program}" == "samtools_index" ]] \
    || [[ "${program}" == "samtools_sort" ]] \
    || [[ "${program}" == "samtools_view" ]]
    then
      ${programs_array[$program]}

    # Handle DIAMOND BLAST menu
    elif [[ "${program}" == "diamond" ]]; then
      ${programs_array[$program]} help

    # Handle NCBI BLASTx menu
    elif [[ "${program}" == "blastx" ]]; then
      ${programs_array[$program]} -help
    fi
    ${programs_array[$program]} -h
    echo ""
    echo ""
    echo "----------------------------------------------"
    echo ""
    echo ""
  } &>> program_options.log || true

    # If MultiQC is in programs_array, copy the config file to this directory.
    if [[ "${program}" == "multiqc" ]]; then
      cp --preserve ~/.multiqc_config.yaml multiqc_config.yaml
    fi
  done
fi


# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
# End the separator line so the first PATH entry starts on its own line
echo ""
echo "${PATH}" | tr : \\n
} >> system_path.log
```
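
The checksum and `rsync` steps near the end of the script lend themselves to a follow-up check that the copied index files match the originals. The following is a minimal sketch of that idea and was not part of the job that was run. `verify_copy` is a hypothetical helper, and the directories and `demo_index.*.ht2` files are mock stand-ins created in temporary locations. It assumes GNU `md5sum` is available.

```shell
#!/bin/bash
# Sketch: confirm that files copied to another directory match their source.
# verify_copy is a hypothetical helper, not part of the original SBATCH script.

verify_copy() {
  local src_dir="$1" dest_dir="$2" pattern="$3"
  local checksums rc
  checksums="$(mktemp)"
  # Record checksums using paths relative to the source directory,
  # then check those same relative paths inside the destination directory.
  # ${pattern} is intentionally unquoted so the glob expands in src_dir.
  ( cd "${src_dir}" && md5sum ${pattern} > "${checksums}" ) \
    && ( cd "${dest_dir}" && md5sum --check --quiet "${checksums}" )
  rc=$?
  rm -f "${checksums}"
  return "${rc}"
}

# Mock data standing in for real index files (contents are made up).
src="$(mktemp -d)"
dest="$(mktemp -d)"
for n in 1 2 3
do
  echo "mock index ${n}" > "${src}/demo_index.${n}.ht2"
done
cp "${src}"/*.ht2 "${dest}"

if verify_copy "${src}" "${dest}" "*.ht2"; then
  copy_status="ok"
else
  copy_status="mismatch"
fi
echo "Copy status: ${copy_status}"
```

With the real files, the equivalent call would pass the job's output directory, the genome directory, and a pattern such as `"cvir_GCF_002022765.2*.ht2"`.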

---

# RESULTS

Runtime was fast, only 12 minutes:

![Runtime for Hisat2 indexing for C.virginica GCF_002022765.2 on Mox](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20210720_cvir_GCF_002022765.2_hisat2-build-index-exons-splices_runtime.png?raw=true)

Output folder:

- [20210720_cvir_GCF_002022765.2_hisat2-build-index-exons-splices/](https://gannet.fish.washington.edu/Atumefaciens/20210720_cvir_GCF_002022765.2_hisat2-build-index-exons-splices/)

This generates a set of 8 [`HISAT2`](https://daehwankimlab.github.io/hisat2/) genome index files (`*.ht2`), as well as an exon and a splice sites file:

- [20210720_cvir_GCF_002022765.2_hisat2-build-index-exons-splices/cvir_GCF_002022765.2_hisat2_exons.tab](https://gannet.fish.washington.edu/Atumefaciens/20210720_cvir_GCF_002022765.2_hisat2-build-index-exons-splices/cvir_GCF_002022765.2_hisat2_exons.tab)

- [20210720_cvir_GCF_002022765.2_hisat2-build-index-exons-splices/cvir_GCF_002022765.2_hisat2_splice_sites.tab](https://gannet.fish.washington.edu/Atumefaciens/20210720_cvir_GCF_002022765.2_hisat2-build-index-exons-splices/cvir_GCF_002022765.2_hisat2_splice_sites.tab)

The contents of those two files are incorporated into the 8 index files, so the files themselves are not used later on.
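
A quick sanity check on the exon and splice sites files is to confirm every line has the expected number of tab-delimited columns. This is a minimal sketch, assuming the layouts described in the HISAT2 manual (three columns for exons, four for splice sites). `check_columns` is a hypothetical helper, and the file contents below are mock data with made-up coordinates, not values from the real output.

```shell
#!/bin/bash
# Sketch: sanity-check HISAT2 exon and splice site files by column count.
# Assumed layouts (per the HISAT2 manual):
#   exons:        chromosome, left, right          (3 tab-delimited columns)
#   splice sites: chromosome, left, right, strand  (4 tab-delimited columns)

# check_columns is a hypothetical helper, not part of HISAT2.
# Prints the number of lines whose field count differs from the expected count.
check_columns() {
  local file="$1"
  local expected="$2"
  awk -F'\t' -v n="${expected}" 'NF != n { bad++ } END { print bad + 0 }' "${file}"
}

# Mock files standing in for the real outputs (names and coordinates are made up).
demo_dir="$(mktemp -d)"
printf 'mock_chr1\t100\t250\nmock_chr1\t400\t520\n' \
  > "${demo_dir}/exons.tab"
# The second splice site line deliberately lacks the strand column.
printf 'mock_chr1\t250\t400\t+\nmock_chr1\t520\t700\n' \
  > "${demo_dir}/splice_sites.tab"

bad_exons="$(check_columns "${demo_dir}/exons.tab" 3)"
bad_splice_sites="$(check_columns "${demo_dir}/splice_sites.tab" 4)"

echo "Malformed exon lines: ${bad_exons}"
echo "Malformed splice site lines: ${bad_splice_sites}"
```

On the real outputs, something like `check_columns cvir_GCF_002022765.2_hisat2_exons.tab 3` would be expected to report zero malformed lines if the extraction worked as intended.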

Next up, run [`StringTie`](https://ccb.jhu.edu/software/stringtie/) to identify all potential isoforms in this RNAseq data.