--- author: Sam White toc-title: Contents toc-depth: 5 toc-location: left layout: post title: Genome Annotation - Pgenerosa_v070 Transcript Isoform ID with Stringtie on Mox date: '2019-07-23 14:56' tags: - mox - stringtie - gtf - Pgenerosa_v070 - v070 - Panopea generosa - geoduck categories: - 2019 - Geoduck Genome Sequencing --- After annotating Pgenerosa_v074 and [comparing feature counts](https://robertslab.github.io/sams-notebook/posts/2019/2019-07-22-Genome-Annotation-Pgenerosa_v070-and-v074-Top-18-Scaffolds-Feature-Count-Comparisons/), there was a drastic difference between the two genome versions. Additionally, both of those genomes ended up with no CDS/exon/gene/mRNA features identified in the largest scaffold. So, to explore this further by seeing where (if??) sequencing reads map to the scaffold, and to obtain transcript isoforms for the genome, I ran Stringtie. A [Hisat2 index was prepared earlier](https://robertslab.github.io/sams-notebook/2019/07/23/Genome-Annotation-Pgenerosa_v070-Hisat2-Transcript-Isoform-Index/). Here's the quick rundown of how transcript isoform annotation with Stringtie runs: 1. Use Hisat2 reference index with identified splice sites and exons (this was done earlier). 2. Use Hisat2 to create alignments from each pair of trimmed FastQ files. _Need to use the `--downstream-transcriptome-assembly` option!!!_ 3. Use Stringtie to create a GTF for each alignment. 4. Use Stringtie to create a singular, merged GTF from all of the individual GTFs. The trimmed FastQ files used were extracted from the following Trinity assemblies (links to notebooks): - [ctenidia (20190409)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Ctenidia-with-HiSeq-and-NovaSeq-Data-on-Mox/) - [gonad (20190409)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Gonad-HiSeq-and-NovaSeq-Data-on-Mox/) - [heart (20190215)](https://robertslab.github.io/sams-notebook/posts/2019/2019-02-15-Transcriptome-Assembly---Geoduck-Tissue-Specific-Assembly-Heart/) - [EPI99 (20190409)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Larvae-Day5-EPI99-with-HiSeq-and-NovaSeq-Data-on-Mox/) - [EPI115 (20190409)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Juvenile-Super-Low-OA-EPI115-with-HiSeq-Data-on-Mox/) - [EPI116 (20190409)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Juvenile-Super-Low-OA-EPI116-with-HiSeq-Data-on-Mox/) - [EPI123 (20190409)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Juvenile-Ambient-OA-EPI123-with-HiSeq-Data-on-Mox/) - [EPI124 (20190409)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Juvenile-Ambient-OA-EPI124-with-HiSeq-and-NovaSeq-Data-on-Mox/) See the [fastq.list.txt](https://gannet.fish.washington.edu/Atumefaciens/20190723_stringtie_pgen_v070/fastq.list.txt) for the list of FastQ files used as input. Also, see the [Nightingales Google Sheet](http://b.link/nightingales) for more details on sequencing files. NOTE: The "P" in the `*.P.qtrim.gz` represents trimmed reads that are properly paired, as determined by Trimmomatic/Trinity. This was run on Mox. SBATCH script (GitHub): - [20190723_stringtie_pgen_v070.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20190723_stringtie_pgen_v070.sh) ```shell #!/bin/bash ## Job Name #SBATCH --job-name=pgen_v070_stringtie ## Allocation Definition #SBATCH --account=srlab #SBATCH --partition=srlab ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=25-00:00:00 ## Memory per node #SBATCH --mem=500G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190723_stringtie_pgen_v070 # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Document programs in PATH (primarily for program version ID) date >> system_path.log echo "" >> system_path.log echo "System PATH for $SLURM_JOB_ID" >> system_path.log echo "" >> system_path.log printf "%0.s-" {1..10} >> system_path.log echo "${PATH}" | tr : \\n >> system_path.log threads=28 genome_index_name="Pgenerosa_v070" # Paths to programs hisat2_dir="/gscratch/srlab/programs/hisat2-2.1.0" hisat2="${hisat2_dir}/hisat2" samtools="/gscratch/srlab/programs/samtools-1.9/samtools" stringtie="/gscratch/srlab/programs/stringtie-1.3.6.Linux_x86_64/stringtie" # Input/output files genome_gff="/gscratch/srlab/sam/data/P_generosa/genomes/Pgenerosa_v070_genome_snap02.all.renamed.putative_function.domain_added.gff" genome_index_dir="/gscratch/srlab/sam/data/P_generosa/genomes" fastq_dir="/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq/" gtf_list="gtf_list.txt" ## Inititalize arrays fastq_array_R1=() fastq_array_R2=() names_array=() # Copy Hisat2 genome index rsync -av "${genome_index_dir}"/Pgenerosa_v070*.ht2 . # Create array of fastq R1 files for fastq in "${fastq_dir}"*R1*.gz do fastq_array_R1+=("${fastq}") done # Create array of fastq R2 files for fastq in "${fastq_dir}"*R2*.gz do fastq_array_R2+=("${fastq}") done # Create array of sample names ## Uses parameter substitution to strip leading path from filename ## Uses awk to parse out sample name from filename for R1_fastq in "${fastq_dir}"*R1*.gz do names_array+=($(echo "${R1_fastq#${fastq_dir}}" | awk -F"_" '{print $1}')) done # Create list of fastq files used in analysis ## Uses parameter substitution to strip leading path from filename for fastq in "${fastq_dir}"*.gz do echo "${fastq#${fastq_dir}}" >> fastq.list.txt done # Hisat2 alignments for index in "${!fastq_array_R1[@]}" do sample_name=$(echo "${names_array[index]}") "${hisat2}" \ -x "${genome_index_name}" \ --downstream-transcriptome-assembly \ -1 "${fastq_array_R1[index]}" \ -2 "${fastq_array_R2[index]}" \ -S "${sample_name}".sam \ 2> "${sample_name}"_hisat2.err # Sort SAM files, convert to BAM, and index "${samtools}" view \ -@ "${threads}" \ -Su "${sample_name}".sam \ | "${samtools}" sort - \ -@ "${threads}" \ -o "${sample_name}".sorted.bam "${samtools}" index "${sample_name}".sorted.bam # Run stringtie on alignments "${stringtie}" "${sample_name}".sorted.bam \ -p "${threads}" \ -o "${sample_name}".gtf \ -G "${genome_gff}" \ -C "${sample_name}.cov_refs.gtf" # Add GTFs to list file echo "${sample_name}.gtf" >> "${gtf_list}" done # Create singular transcript file, using GTF list file "${stringtie}" --merge \ "${gtf_list}" \ -p "${threads}" \ -G "${genome_gff}" \ -o "${genome_index_name}".stringtie.gtf # Delete unneccessary index files rm "${genome_index_name}"*.ht2 # Delete unneded SAM files rm ./*.sam ``` --- # RESULTS This took a bit more than one day to complete: ![Screencap of Stringtie runtime for Pgenerosa_v070 on Mox](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20190723_stringtie_pgen_v070_runtime.png?raw=true) Output folder: - [20190723_stringtie_pgen_v070/](https://gannet.fish.washington.edu/Atumefaciens/20190723_stringtie_pgen_v070/) Merged GTF: - [20190723_stringtie_pgen_v070/Pgenerosa_v070.stringtie.gtf](https://gannet.fish.washington.edu/Atumefaciens/20190723_stringtie_pgen_v070/Pgenerosa_v070.stringtie.gtf) Although I won't link them here, each input FastQ pair has a corresponding alignment file (BAM), coverage file (`.cov_refs.gtf`), Hisat2 alignment stats file (`_hisat2.err`), and GTF. Additionally, I'm not going to delve too much into the analysis of this. I'll write a separate notebook post that goes into more depth on how this data looks, how it compares to the Pgenerosa_v074 Stringtie data, and what this means for the annotation of Scaffold 1...