--- author: Sam White toc-title: Contents toc-depth: 5 toc-location: left layout: post title: Genome Annotation - Pgenerosa_v074 Transcript Isoform ID with Stringtie on Mox date: '2019-07-23 14:58' tags: - mox/stringtie - gtf - Pgenerosa_v074 - v074 - Panopea generosa - geoduck categories: - 2019 - Geoduck Genome Sequencing --- After annotating Pgenerosa_v070 and [comparing feature counts](https://robertslab.github.io/sams-notebook/posts/2019/2019-07-22-Genome-Annotation-Pgenerosa_v070-and-v074-Top-18-Scaffolds-Feature-Count-Comparisons/), there was a drastic difference between the two genome versions. Additionally, both of those genomes ended up with no CDS/exon/gene/mRNA features identified in the largest scaffold. So, to explore this further by seeing where (if??) sequencing reads map to the scaffold, and to obtain transcript isoforms for the genome, I ran Stringtie. A [Hisat2 index was prepared earlier](https://robertslab.github.io/sams-notebook/2019/07/23/Genome-Annotation-Pgenerosa_v074-Hisat2-Transcript-Isoform-Index/). Here's the quick rundown of how transcript isoform annotation with Stringtie runs: 1. Use Hisat2 reference index with identified splice sites and exons (this was done earlier). 2. Use Hisat2 to create alignments from each pair of trimmed FastQ files. _Need to use the `--downstream-transcriptome-assembly` option!!!_ 3. Use Stringtie to create a GTF for each alignment. 4. Use Stringtie to create a singular, merged GTF from all of the individual GTFs. The trimmed FastQ files used were extracted from the following Trinity assemblies (links to notebooks): - [ctenidia (20190409)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Ctenidia-with-HiSeq-and-NovaSeq-Data-on-Mox/) - [gonad (20190409)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Gonad-HiSeq-and-NovaSeq-Data-on-Mox/) - [heart (20190215)](https://robertslab.github.io/sams-notebook/posts/2019/2019-02-15-Transcriptome-Assembly---Geoduck-Tissue-Specific-Assembly-Heart/) - [EPI99 (20190409)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Larvae-Day5-EPI99-with-HiSeq-and-NovaSeq-Data-on-Mox/) - [EPI115 (20190409)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Juvenile-Super-Low-OA-EPI115-with-HiSeq-Data-on-Mox/) - [EPI116 (20190409)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Juvenile-Super-Low-OA-EPI116-with-HiSeq-Data-on-Mox/) - [EPI123 (20190409)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Juvenile-Ambient-OA-EPI123-with-HiSeq-Data-on-Mox/) - [EPI124 (20190409)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Juvenile-Ambient-OA-EPI124-with-HiSeq-and-NovaSeq-Data-on-Mox/) See the [fastq.list.txt](https://gannet.fish.washington.edu/Atumefaciens/20190723_stringtie_pgen_v070/fastq.list.txt) for the list of FastQ files used as input. Also, see the [Nightingales Google Sheet](http://b.link/nightingales) for more details on sequencing files. NOTE: The "P" in the `*.P.qtrim.gz` represents trimmed reads that are properly paired, as determined by Trimmomatic/Trinity. This was run on Mox. SBATCH script (GitHub): - [20190723_stringtie_pgen_v074.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20190723_stringtie_pgen_v074.sh) ```shell #!/bin/bash ## Job Name #SBATCH --job-name=pgen_v074_stringtie ## Allocation Definition #SBATCH --account=srlab #SBATCH --partition=srlab ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=25-00:00:00 ## Memory per node #SBATCH --mem=500G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190723_stringtie_pgen_v074 # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Document programs in PATH (primarily for program version ID) date >> system_path.log echo "" >> system_path.log echo "System PATH for $SLURM_JOB_ID" >> system_path.log echo "" >> system_path.log printf "%0.s-" {1..10} >> system_path.log echo "${PATH}" | tr : \\n >> system_path.log threads=28 genome_index_name="Pgenerosa_v074" # Paths to programs hisat2_dir="/gscratch/srlab/programs/hisat2-2.1.0" hisat2="${hisat2_dir}/hisat2" samtools="/gscratch/srlab/programs/samtools-1.9/samtools" stringtie="/gscratch/srlab/programs/stringtie-1.3.6.Linux_x86_64/stringtie" # Input/output files genome_gff="/gscratch/srlab/sam/data/P_generosa/genomes/Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff" genome_index_dir="/gscratch/srlab/sam/data/P_generosa/genomes" fastq_dir="/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq/" gtf_list="gtf_list.txt" ## Inititalize arrays fastq_array_R1=() fastq_array_R2=() names_array=() # Copy Hisat2 genome index rsync -av "${genome_index_dir}"/Pgenerosa_v074*.ht2 . # Create array of fastq R1 files for fastq in "${fastq_dir}"*R1*.gz do fastq_array_R1+=("${fastq}") done # Create array of fastq R2 files for fastq in "${fastq_dir}"*R2*.gz do fastq_array_R2+=("${fastq}") done # Create array of sample names ## Uses parameter substitution to strip leading path from filename ## Uses awk to parse out sample name from filename for R1_fastq in "${fastq_dir}"*R1*.gz do names_array+=($(echo "${R1_fastq#${fastq_dir}}" | awk -F"_" '{print $1}')) done # Create list of fastq files used in analysis ## Uses parameter substitution to strip leading path from filename for fastq in "${fastq_dir}"*.gz do echo "${fastq#${fastq_dir}}" >> fastq.list.txt done # Hisat2 alignments for index in "${!fastq_array_R1[@]}" do sample_name=$(echo "${names_array[index]}") "${hisat2}" \ -x "${genome_index_name}" \ --downstream-transcriptome-assembly \ -1 "${fastq_array_R1[index]}" \ -2 "${fastq_array_R2[index]}" \ -S "${sample_name}".sam \ 2> "${sample_name}"_hisat2.err # Sort SAM files, convert to BAM, and index "${samtools}" view \ -@ "${threads}" \ -Su "${sample_name}".sam \ | "${samtools}" sort - \ -@ "${threads}" \ -o "${sample_name}".sorted.bam "${samtools}" index "${sample_name}".sorted.bam # Run stringtie on alignments "${stringtie}" "${sample_name}".sorted.bam \ -p "${threads}" \ -o "${sample_name}".gtf \ -G "${genome_gff}" \ -C "${sample_name}.cov_refs.gtf" # Add GTFs to list file, only if non-empty # Identifies GTF files that only have header gtf_lines=$(wc -l < ${sample_name}.gtf ) if [ "${gtf_lines}" -gt 2 ]; then echo "${sample_name}.gtf" >> "${gtf_list}" fi done # Create singular transcript file, using GTF list file "${stringtie}" --merge \ "${gtf_list}" \ -p "${threads}" \ -G "${genome_gff}" \ -o "${genome_index_name}".stringtie.gtf # Delete unneccessary index files rm "${genome_index_name}"*.ht2 # Delete unneded SAM files rm ./*.sam ``` --- # RESULTS The process of getting this to run took longer than it should have. The job got interrupted by an unintended Mox shutdown and also failed a few times before I was able to figure out the problem. The failure was triggered by one sample that didn't end up having any transcripts present in the output GTF. When Stringtie encounters an empty GTF file, it not only spits out an error (indicating as such), but it also kills the program. I had to re-write part of the script to evaluate whether or not input GTFs for the Stringtie merge step had any content. Once it was running with this change, and didn't get interrupted by Mox problems, it took ~ one day: ![Screencap of Pgenerosa_v074 Stringtie runtime on Mox](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20190723_stringtie_pgen_v074_runtime.png?raw=true) Output folder: - [20190723_stringtie_pgen_v074/](https://gannet.fish.washington.edu/Atumefaciens/20190723_stringtie_pgen_v074/) Merged GTF: - [20190723_stringtie_pgen_v074/Pgenerosa_v074.stringtie.gtf](https://gannet.fish.washington.edu/Atumefaciens/20190723_stringtie_pgen_v074/Pgenerosa_v074.stringtie.gtf) Although I won't link them here, each input FastQ pair has a corresponding alignment file (BAM), coverage file (`.cov_refs.gtf`), Hisat2 alignment stats file (`_hisat2.err`), and GTF. Additionally, I'm not going to delve too much into the analysis of this. I'll write a separate notebook post that goes into more depth on how this data looks, how it compares to the Pgenerosa_v074 Stringtie data, and what this means for the annotation of Scaffold 1... #### UPDATE 20190829 Created a massive, merged BAM file for use with GenSAS for further annotation of the Pgenerosa_v074 assembly. Ran the following commands to create merged BAM file and corresponding index file: ```shell samtools merge 20190723_sorted.merged.bam *.bam samtools index 20190723_sorted.merged.bam ```