--- author: Sam White toc-title: Contents toc-depth: 5 toc-location: left layout: post title: Taxonomic Assignments - C.bairdi NanoPore Reads Using DIAMOND BLASTx on Mox and MEGAN6 daa2rma on swoose date: '2020-09-17 08:16' tags: - Tanner crab - DIAMOND - BLASTx - mox - Chionoecetes bairdi - MEGAN6 - NanoPore categories: - 2020 - Miscellaneous --- Earlier today I [quality filtered (>=Q7) our _C.baird_ NanoPore reads](https://robertslab.github.io/sams-notebook/posts/2020/2020-09-17-Data-Wrangling---C.bairdi-NanoPore-Quality-Filtering-Using-NanoFilt-on-Mox/). One of the things I'd like to do now is to attempt to filter reads taxonomically, since the NanoPore data came from both an uninfected crab and _Hematodinium_-infected crab. I ran DIAMOND BLASTx (on Mox), followed by MEGAN6 `daa2rma` (on swoose - MEGAN6 requires some weird Java X11 thingy that won't work on Mox) to convert the DIAMOND output to the proper MEGAN6 format for visualizing taxonomic assignments. SBATCH script (GitHub) for DIAMOND BLASTx on Mox: - [20200917_cbai_diamond_blastx_nanopore_all_Q7.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20200917_cbai_diamond_blastx_nanopore_all_Q7.sh) ```shell #!/bin/bash ## Job Name #SBATCH --job-name=cbai_blastx_DIAMOND_nanopore_all ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=10-00:00:00 ## Memory per node #SBATCH --mem=200G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200917_cbai_diamond_blastx_nanopore_all_Q7 # Script to run DIAMOND BLASTx on all quality filtered (Q7) C.bairdi NanoPore reads # from 20200917 using the --long-reads option # for subsequent import into MEGAN6 to try to separate reads taxonomically. ################################################################################### # These variables need to be set by user # Input FastQ file fastq=/gscratch/srlab/sam/data/C_bairdi/DNAseq/20200917_cbai_nanopore_all_quality-7.fastq # DIAMOND Output filename prefix prefix=20200917_cbai_diamond_blastx_nanopore_all_Q7 # Set number of CPUs to use threads=28 # Program paths diamond=/gscratch/srlab/programs/diamond-2.0.4/diamond # DIAMOND NCBI nr database with taxonomy dumps dmnd_db=/gscratch/srlab/blastdbs/ncbi-nr-20190925/nr.dmnd ################################################################################### # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # SegFault fix? export THREADS_DAEMON_MODEL=1 # Inititalize arrays programs_array=() # Programs array programs_array=("${diamond}") md5sum "${fastq}" > fastq_checksums.md5 # Run DIAMOND with blastx # Output format 6 produces a standard BLAST tab-delimited file # Run DIAMOND with blastx # Output format 100 produces a DAA binary file for use with MEGAN ${diamond} blastx \ --long-reads \ --db ${dmnd_db} \ --query "${fastq}" \ --out "${prefix}".blastx.daa \ --outfmt 100 \ --top 5 \ --block-size 8.0 \ --index-chunks 1 \ --threads ${threads} # Capture program options for program in "${!programs_array[@]}" do { echo "Program options for ${programs_array[program]}: " echo "" ${programs_array[program]} help echo "" echo "" echo "----------------------------------------------" echo "" echo "" } &>> program_options.log || true done # Document programs in PATH (primarily for program version ID) { date echo "" echo "System PATH for $SLURM_JOB_ID" echo "" printf "%0.s-" {1..10} echo "${PATH}" | tr : \\n } >> system_path.log ``` Bash script (GitHub) for `daa2rma` on Emu: - [20200917_cbai_nanopore_diamond_blastx_daa2rma.sh](https://github.com/RobertsLab/sams-notebook/blob/master/bash_scripts/20200917_cbai_nanopore_diamond_blastx_daa2rma.sh) ```shell #!/bin/bash # Script to run MEGAN6 daa2rma on DIAMOND DAA files from # 20200917_cbai_diamond_blastx_nanopore_all_Q7. # Utilizes the --longReads option # Requires MEGAN mapping file from: # http://ab.inf.uni-tuebingen.de/data/software/megan6/download # MEGAN mapping file megan_map=/home/sam/data/databases/MEGAN6/megan-map-Jul2020-2.db # Programs array declare -A programs_array programs_array=( [daa2rma]="/home/shared/megan_6.19.9/tools/daa2rma" ) threads=16 ######################################################################### # Exit script if any command fails set -e ## Run daa2rma # Capture start "time" # Uses builtin bash variable called ${SECONDS} start=${SECONDS} for daa in *.daa do start_loop=${SECONDS} sample_name=$(basename --suffix ".blastx.daa" "${daa}") echo "Now processing ${sample_name}.daa2rma.rma6" echo "" # Run daa2rma with long reads option ${programs_array[daa2rma]} \ --in "${daa}" \ --longReads \ --mapDB ${megan_map} \ --out "${sample_name}".daa2rma.rma6 \ --threads ${threads} \ 2>&1 | tee --append daa2rma_log.txt end_loop=${SECONDS} loop_runtime=$((end_loop-start_loop)) echo "Finished processing ${sample_name}.daa2rma.rma6 in ${loop_runtime} seconds." echo "" done # Caputure end "time" end=${SECONDS} runtime=$((end-start)) # Print daa2rma runtime, in seconds { echo "" echo "---------------------" echo "" echo "Total runtime was: ${runtime} seconds" } >> daa2rma_log.txt # Capture program options for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" ${programs_array[$program]} --help echo "" echo "" echo "----------------------------------------------" echo "" echo "" } &>> program_options.log || true done # Document programs in PATH { date echo "" echo "System PATH:" echo "" printf "%0.s-" {1..10} echo "${PATH}" | tr : \\n } >> system_path.log ``` --- # RESULTS DIAMOND runtime was pretty quick, 37mins: ![DIAMOND BLASTx runtime of C.baird Q7 NanoPore reads on Mox](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20200917_cbai_diamond_blastx_nanopore_all_Q7_runtime.png?raw=true) DIAMOND BLASTx Output folder: - [20200917_cbai_diamond_blastx_nanopore_all_Q7/](https://gannet.fish.washington.edu/Atumefaciens/20200917_cbai_diamond_blastx_nanopore_all_Q7/) - DIAMOND BLASTx DAA file: - [20200917_cbai_diamond_blastx_nanopore_all_Q7/20200917_cbai_diamond_blastx_nanopore_all_Q7.blastx.daa](https://gannet.fish.washington.edu/Atumefaciens/20200917_cbai_diamond_blastx_nanopore_all_Q7/20200917_cbai_diamond_blastx_nanopore_all_Q7.blastx.daa) (47MB) daa2rma Output folder: - [20200917_cbai_nanopore_diamond_blastx_daa2rma/](https://gannet.fish.washington.edu/Atumefaciens/20200917_cbai_nanopore_diamond_blastx_daa2rma/) - RMA6 file: - [20200917_cbai_nanopore_diamond_blastx_daa2rma/20200917_cbai_diamond_blastx_nanopore_all_Q7.daa2rma.rma6](https://gannet.fish.washington.edu/Atumefaciens/20200917_cbai_nanopore_diamond_blastx_daa2rma/20200917_cbai_diamond_blastx_nanopore_all_Q7.daa2rma.rma6) I opened the RMA6 file in MEGAN6 to see the taxonomic breakdown and this is what I got: ![screencap of MEGAN6 C.bairdi NanoPore Q7 taxonomic assignments; Alveolata, Arthropoda, and E.canceri highlighted](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20200917_cbai_nanopore_Q7_MEGAN6_no-assignments.png?raw=true) NOTE: When using `--long-read` mode, taxonomic assignments counts are always in aligned bases, _not_ aligned reads! Well, the results are pretty intriguing. Here's a short table of the most intriguing data: | Taxonomy | Bases | |-----------------------|------------| | _Alveolata_ | 436,841 | | _Arthropoda_ | 35,864,420 | | _Enterospora canceri_ | 18,134,246 | A few things to note: - As expected, _Arthropoda_ comprises the bulk of assignments. - _Alveolata_ is far fewer than I expected, but it is important to remember that the NanoPore data analyzed here is from two samples: one uninfected, the other infected, so _Hematodinium_ sequence is going to make up a much smaller proportion of reads/bases. - _Enterospora canceri_ (a [known microsporidian agent of infection in crabs](https://academic.oup.com/icesjms/article/65/9/1578/629544)) ha the second most number of bases assigned. AND, this is at the _species_ level, not Phylum level like _Arthropoda_ and _Alveolata_! - I think I've previously noticed a relatively high abundance of _Enterospora canceri_ read assignments in some previous RNAseq taxonomic assignment I performed, but I'll have to double check... - Although not highlighted in the tree/table above, I should not ignore _Aquifex sp._, which has 6,338,292 bases assigned to it! It turns out that members of the Aquifex genus are thermophylic bacteria... Crabs live on the sea floor (which is where one would find hydrothermal vents), but why would we see such a high abundance of these bacteria in muscle/hemolymph? Very curious! Where to go from here. Well, I think I might extract the reads and, well, I don't know what. Individual genome assemblies? I need to think about this a bit more and probably discuss with Steven.