---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
layout: post
title: Transcriptome Annotation - C.bairdi Using DIAMOND BLASTx on Mox and MEGAN6 Meganizer on swoose
date: '2020-03-18 20:56'
tags:
  - Tanner crab
  - RNAseq
  - DIAMOND
  - BLASTx
  - mox
  - MEGAN6
  - meganizer
  - Chionoecetes bairdi
categories:
  - 2020
  - Tanner Crab RNAseq
---

[After receiving/trimming the latest round of _C.bairdi_ RNAseq data on 20200318](https://robertslab.github.io/sams-notebook/posts/2020/2020-03-18-TrimmingQCMultiQC---C.bairdi-RNAseq-FastQ-with-fastp-on-Mox/), I need to get the data ready for taxonomic selection of sequencing reads. To do this, I first need to run [DIAMOND BLASTx](https://github.com/bbuchfink/diamond), then "meganize" the output files in preparation for loading into [MEGAN6](https://uni-tuebingen.de/fakultaeten/mathematisch-naturwissenschaftliche-fakultaet/fachbereiche/informatik/lehrstuehle/algorithms-in-bioinformatics/software/megan6/), which will allow for taxonomy-specific read separation.

DIAMOND BLASTx will be run on Mox. Meganization will be run on my computer (swoose), because MEGAN6 relies on a Java X11 window, which is not available on Mox (it throws an error when I try to run it there).

I fully expect this process to take a week or two (DIAMOND BLASTx will likely take a few days and read extraction will definitely take many days...).

SBATCH script (GitHub):

- [20200318_cbai_diamond_blastx.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20200318_cbai_diamond_blastx.sh)

```shell
#!/bin/bash
## Job Name
#SBATCH --job-name=cbai_blastx_DIAMOND
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=20-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200318_cbai_diamond_blastx

## Perform DIAMOND BLASTx on trimmed Chionoecetes bairdi (Tanner crab) FastQ files.

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability
module load intel-python3_2017

# SegFault fix?
export THREADS_DAEMON_MODEL=1

# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log

# Program paths
diamond=/gscratch/srlab/programs/diamond-0.9.29/diamond

# DIAMOND NCBI nr database
dmnd=/gscratch/srlab/blastdbs/ncbi-nr-20190925/nr.dmnd

# Capture program options
{
echo "Program options for DIAMOND: "
echo ""
"${diamond}" help
echo ""
echo ""
echo "----------------------------------------------"
echo ""
echo ""
} &>> program_options.log || true

# Trimmed FastQ files directory
fastq_dir=/gscratch/scrubbed/samwhite/outputs/20200318_cbai_RNAseq_fastp_trimming

# Loop through FastQ files, log filenames to fastq_list.txt.
# Run DIAMOND on each FastQ
for fastq in "${fastq_dir}"/*fastp-trim*.fq.gz
do
  # Log input FastQs
  echo "${fastq}" >> fastq_list.txt

  # Strip leading path and extensions
  no_path="${fastq##*/}"
  no_ext="${no_path%%.*}"

  # Run DIAMOND with blastx
  # Output format 100 produces a DAA binary file for use with MEGAN
  "${diamond}" blastx \
  --db "${dmnd}" \
  --query "${fastq}" \
  --out "${no_ext}".blastx.daa \
  --outfmt 100 \
  --top 5 \
  --block-size 15.0 \
  --index-chunks 4
done
```

---

MEGANIZER script (GitHub):

- [20200323_cbai_diamond_blastx_meganizer.sh](https://github.com/RobertsLab/sams-notebook/blob/master/bash_scripts/20200323_cbai_diamond_blastx_meganizer.sh)

```shell
#!/bin/bash

# Script to run MEGAN6 meganizer on DIAMOND DAA files from
# 20200318_cbai_diamond_blastx Mox job.

# Requires MEGAN mapping files from:
# http://ab.inf.uni-tuebingen.de/data/software/megan6/download

# Exit script if any command fails
set -e

# Program path
meganizer=/home/sam/programs/megan/tools/daa2rma

# MEGAN mapping files
prot_acc2tax=/home/sam/data/databases/MEGAN/prot_acc2tax-Jul2019X1.abin
acc2interpro=/home/sam/data/databases/MEGAN/acc2interpro-Jul2019X.abin
acc2eggnog=/home/sam/data/databases/MEGAN/acc2eggnog-Jul2019X.abin

## Initialize arrays
daa_array_R1=()
daa_array_R2=()

# Create array of DAA R1 files
for daa in *R1*.daa
do
  daa_array_R1+=("${daa}")
done

# Create array of DAA R2 files
for daa in *R2*.daa
do
  daa_array_R2+=("${daa}")
done

## Run MEGANIZER

# Capture start "time"
# Uses builtin bash variable called ${SECONDS}
start=${SECONDS}

for index in "${!daa_array_R1[@]}"
do
  start_loop=${SECONDS}
  sample_name=$(echo "${daa_array_R1[index]}" | awk -F "_" '{print $1}')

  echo "Now processing ${sample_name}.daa2rma.rma6"
  echo ""

  # Run daa2rma with paired option
  "${meganizer}" \
  --paired \
  --in "${daa_array_R1[index]}" "${daa_array_R2[index]}" \
  --acc2taxa "${prot_acc2tax}" \
  --acc2interpro2go "${acc2interpro}" \
  --acc2eggnog "${acc2eggnog}" \
  --out "${sample_name}".daa2rma.rma6 \
  2>&1 | tee --append daa2rma_log.txt

  end_loop=${SECONDS}
  loop_runtime=$((end_loop-start_loop))

  echo "Finished processing ${sample_name}.daa2rma.rma6 in ${loop_runtime} seconds."
  echo ""
done

# Capture end "time"
end=${SECONDS}

runtime=$((end-start))

# Print MEGANIZER runtime, in seconds
{
  echo ""
  echo "---------------------"
  echo ""
  echo "Total runtime was: ${runtime} seconds"
} >> daa2rma_log.txt
```

---

# RESULTS

DIAMOND BLASTx took ~4.5 days:

- [DIAMOND BLASTx runtime](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20200318_cbai_diamond_blastx_runtime.png?raw=true)

The subsequent conversion from DAA to RMA6 files took ~5.6 days.

Output folder:

- [20200318_cbai_diamond_blastx/](https://gannet.fish.washington.edu/Atumefaciens/20200318_cbai_diamond_blastx/)

The RMA6 files can now be loaded into MEGAN6 to extract reads based on taxonomy.
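
The meganizer script pairs R1 and R2 DAA files purely by array index, so it depends on both globs returning files in the same sample order. Given that each conversion runs for many hours, a quick check before each `daa2rma` call might be worthwhile. The following is only a rough sketch, not part of the script that was actually run: the helper names `sample_id` and `same_sample` are hypothetical, as are the example file names, and it assumes (as the script's `awk -F "_"` step does) that the sample ID is everything before the first underscore.

```shell
# Hypothetical helper: print the sample ID, i.e. the file name (without any
# leading directory) up to, but not including, the first underscore.
sample_id() {
  local file_name="${1##*/}"
  printf '%s\n' "${file_name%%_*}"
}

# Hypothetical helper: succeed only if both files share the same sample ID.
same_sample() {
  [ "$(sample_id "$1")" = "$(sample_id "$2")" ]
}

# Example with made-up file names (not actual files from this project)
if same_sample "sampleA_R1.blastx.daa" "sampleA_R2.blastx.daa"
then
  echo "sampleA: R1 and R2 match"
fi

if ! same_sample "sampleA_R1.blastx.daa" "sampleB_R2.blastx.daa"
then
  echo "sampleA/sampleB: mismatch detected"
fi
```

Inside the meganizer loop, a line such as `same_sample "${daa_array_R1[index]}" "${daa_array_R2[index]}" || exit 1` placed before the `daa2rma` call would stop the run on a mismatched pair instead of writing a mislabeled RMA6 file.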