---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
layout: post
title: Genome Annotation - O.lurida (v081) Hisat2 Transcript Isoforms Index
date: '2019-06-25 08:01'
tags:
  - Ostrea lurida
  - Olympia oyster
  - v081
  - Olurida_v081
  - hisat2
  - GTF
  - annotation
categories:
  - 2019
  - Olympia Oyster Genome Sequencing
---

Per [this thread in Slack](https://genefish.slack.com/archives/GHB1LCNRW/p1560979157005300?thread_ts=1560978863.004100&cid=GHB1LCNRW), we realized that the ["final" annotation of the Olurida_v081 genome](https://robertslab.github.io/sams-notebook/posts/2019/2019-01-09-Annotation---Olurida_v081-MAKER-Functional-Annotations-on-Mox/) only seemed to have a single mRNA annotation per gene and no apparent isoforms. As such, I decided to see if I could tease out this type of info.

I decided to use [Stringtie](https://ccb.jhu.edu/software/stringtie/index.shtml), as it seemed robust and relatively straightforward. Plus, it had a decent user guide explaining how to tackle this exact problem.

However, before being able to start in with Stringtie, a few things needed to be done to prepare files for use with Stringtie:

1. Create a GTF file (a GFF-like format in which each feature is tied to a transcript ID) from the input GFF file. Done with `gffread` from the [GFF utilities software](http://ccb.jhu.edu/software/stringtie/gff.shtml).

2. Identify splice sites and exons in the newly-created GTF. Done with [Hisat2](https://ccb.jhu.edu/software/hisat2/manual.shtml) software.

3. Create a Hisat2 reference index that utilizes the splice sites and exons extracted from the GTF. Done with [Hisat2](https://ccb.jhu.edu/software/hisat2/manual.shtml) software.

This was run on Mox.

The SBATCH script has a bunch of leftover extraneous steps that aren't relevant to this step of the annotation process, specifically the FastQ manipulation steps. Originally, I had a large script running both this and the subsequent Stringtie process.
However, I was having issues with Stringtie and it made more sense to have these GTF/indexing steps as a separate script, so I chopped off the Stringtie stuff and neglected to remove the FastQ stuff. I didn't want to edit the script after I actually ran it, so I have left it in here.

SBATCH script (GitHub):

- [20190625_hisat2-build_oly_v081.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20190625_hisat2-build_oly_v081.sh)

```shell
#!/bin/bash
## Job Name
#SBATCH --job-name=oly_hisat2
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=25-00:00:00
## Memory per node
#SBATCH --mem=500G
## Turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190625_hisat2-build_oly_v081

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability
module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)
date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo "${PATH}" | tr : \\n >> system_path.log

threads=27
genome_index_name="Olurida_v081"

# Paths to programs
gffread="/gscratch/srlab/programs/gffread-0.11.4.Linux_x86_64/gffread"
hisat2_dir="/gscratch/srlab/programs/hisat2-2.1.0"
hisat2_build="${hisat2_dir}/hisat2-build"
hisat2_exons="${hisat2_dir}/hisat2_extract_exons.py"
hisat2_splice_sites="${hisat2_dir}/hisat2_extract_splice_sites.py"

# Input/output files
genome_gff="/gscratch/srlab/sam/data/O_lurida/genomes/Olurida_v081/20181127_oly_genome_snap02.all.renamed.putative_function.domain_added.gff"
exons="hisat2_exons.tab"
fastq_dir="/gscratch/srlab/sam/data/O_lurida/RNAseq/"
genome_fasta="/gscratch/srlab/sam/data/O_lurida/genomes/Olurida_v081/Olurida_v081.fa"
splice_sites="hisat2_splice_sites.tab"
transcripts_gtf="20190620_oly_genome_snap02.all.renamed.putative_function.domain_added.transcripts.gtf"

# NOTE: The FastQ steps below are leftovers from the original combined script.
# Nothing they produce is used by the GTF/indexing steps that follow.

## Initialize arrays
fastq_array_R1=()
fastq_array_R2=()

# Create array of fastq R1 files
for fastq in ${fastq_dir}/*R1*.gz
do
  fastq_array_R1+=(${fastq})
done

# Create array of fastq R2 files
for fastq in ${fastq_dir}/*R2*.gz
do
  fastq_array_R2+=(${fastq})
done

# Create array of sample names
## Uses parameter substitution to strip leading path from filename
## Uses awk to parse out sample name from filename
for R1_fastq in ${fastq_dir}/*R1*.gz
do
  names_array+=($(echo ${R1_fastq#${fastq_dir}} | awk -F"[_.]" '{print $1 "_" $5}'))
done

# Create list of fastq files used in analysis
## Uses parameter substitution to strip leading path from filename
for fastq in ${fastq_dir}*.gz
do
  echo "${fastq#${fastq_dir}}" >> fastq.list.txt
done

# Create transcripts GTF from genome GFF
"${gffread}" -T \
"${genome_gff}" \
-o "${transcripts_gtf}"

# Create Hisat2 exons tab file
"${hisat2_exons}" \
"${transcripts_gtf}" \
> "${exons}"

# Create Hisat2 splice sites tab file
"${hisat2_splice_sites}" \
"${transcripts_gtf}" \
> "${splice_sites}"

# Build Hisat2 reference index using splice sites and exons
"${hisat2_build}" \
"${genome_fasta}" \
"${genome_index_name}" \
--exon "${exons}" \
--ss "${splice_sites}" \
-p "${threads}" \
2> hisat2_build.err
```

---

# RESULTS

Output folder:

- [20190625_hisat2-build_oly_v081/](https://gannet.fish.washington.edu/Atumefaciens/20190625_hisat2-build_oly_v081/)

The Hisat2 index files are: `*.ht2`. These will be used with Stringtie for transcript isoform annotation.
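
Before handing the index off to the next step, it can be worth confirming that the build actually wrote a complete set of `*.ht2` files. Below is a minimal sketch of such a check. It was not part of the script that was run. The function name `check_hisat2_index` is hypothetical, and the sketch assumes the usual small-index layout of eight files named `<basename>.1.ht2` through `<basename>.8.ht2`.

```shell
#!/bin/bash

# Hypothetical helper (not part of the original SBATCH script).
# Assumes a standard (small) Hisat2 index: <basename>.1.ht2 ... <basename>.8.ht2
# Returns 0 if all eight files exist, 1 if any are missing.
check_hisat2_index() {
  local index_base="$1"
  local status=0
  local i

  for i in 1 2 3 4 5 6 7 8
  do
    if [[ ! -e "${index_base}.${i}.ht2" ]]; then
      # Report each missing file on stderr
      echo "Missing index file: ${index_base}.${i}.ht2" >&2
      status=1
    fi
  done

  return "${status}"
}
```

Run from the output folder, this would be called with the index basename used in the build, e.g. `check_hisat2_index "Olurida_v081"`, and the exit status can gate whatever step comes next.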