---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
layout: post
title: Transcriptome Annotation - C.bairdi Using DIAMOND BLASTx on Mox and MEGAN6 Meganizer
date: '2020-01-03 10:05'
tags:
  - tanner crab
  - mox
  - MEGAN
  - DIAMOND
  - BLASTx
  - meganizer
  - Chionoecetes bairdi
categories:
  - 2020
  - Tanner Crab RNAseq
---
Although I previously [annotated](https://robertslab.github.io/sams-notebook/posts/2019/2019-12-25-Transcriptome-Annotation-C.bairdi-Trinity-Assembly-Trinotate-on-Mox/) our [_C.bairdi_ transcriptome from 20191218](https://robertslab.github.io/sams-notebook/2019/12/18/Transcriptome-Assembly-C.bairdi-Trimmed-RNAseq-Using-Trinity-on-Mox/), I realized that the assembly and annotations combined infected and uninfected samples, which could make separating crab and _Hematodinium_ sequences a bit more difficult.

I also realized that the MEGAN6 software that I'd previously used for metagenomic taxonomic classification can actually extract sequencing reads. So, I decided to run all of our Tanner crab RNAseq reads through the MEGAN6 process. At the end, I'll separate out reads, based on taxonomy, and then generate "clean" _de novo_ assemblies of Tanner crab and _Hematodinium_!

To start this process, the trimmed reads need to be annotated using DIAMOND BLASTx. Then, the DIAMOND output files need to be "meganized" for importing to MEGAN6.

DIAMOND BLASTx took place on Mox, while "meganization" took place on my lab computer (`swoose`); this is due to the way that MEGAN6 uses Java - it doesn't run properly on Mox.

For reference, these include RNAseq data using a newly established "shorthand": 2018, 2019.

SBATCH script (GitHub):

- [20200103_cbai_diamond_blastx.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20200103_cbai_diamond_blastx.sh)

```shell
#!/bin/bash
## Job Name
#SBATCH --job-name=cbai_blastx_DIAMOND
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=20-00:00:00
## Memory per node
#SBATCH --mem=120G
## Turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200103_cbai_diamond_blastx

## Perform DIAMOND BLASTx on trimmed Chionoecetes bairdi (Tanner crab) FastQ files.

## Trimmed FastQ files originated here:
## https://gannet.fish.washington.edu/Atumefaciens/20191218_cbai_fastp_RNAseq_trimming

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability

module load intel-python3_2017

# SegFault fix?
export THREADS_DAEMON_MODEL=1

# Document programs in PATH (primarily for program version ID)

{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log


# Program paths
diamond=/gscratch/srlab/programs/diamond-0.9.29/diamond

# DIAMOND NCBI nr database
dmnd=/gscratch/srlab/blastdbs/ncbi-nr-20190925/nr.dmnd


# FastQ files directory
fastq_dir=/gscratch/srlab/sam/data/C_bairdi/RNAseq


# Loop through FastQ files, log filenames to fastq_list.txt.
# Run DIAMOND on each FastQ
for fastq in "${fastq_dir}"/*fastp-trim*.fq.gz
do
	# Log input FastQs
	echo "${fastq}" >> fastq_list.txt

	# Strip leading path and extensions
	no_path="${fastq##*/}"
	no_ext="${no_path%%.*}"

	# Run DIAMOND with blastx
	# Output format 100 produces a DAA binary file for use with MEGAN
	${diamond} blastx \
	--db ${dmnd} \
	--query "${fastq}" \
	--out "${no_ext}".blastx.daa \
	--outfmt 100 \
	--top 5 \
	--block-size 15.0 \
	--index-chunks 4
done
```
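
The output filename in the loop above comes from two Bash parameter expansions. As a minimal sketch of how that works, the snippet below wraps the same two expansions in a helper called `daa_name`; both the helper and the example path are hypothetical and are not part of the original script.

```shell
#!/bin/bash

# Hypothetical helper (not in the original script): derive the DAA output
# name from a FastQ path, using the same expansions as the loop.
daa_name() {
  local fastq="$1"
  local no_path="${fastq##*/}"   # remove everything through the last "/"
  local no_ext="${no_path%%.*}"  # remove everything from the first "." onward
  echo "${no_ext}.blastx.daa"
}

# Made-up example path, for illustration only
daa_name "/some/dir/304428_S1_L001_R1_001.fastp-trim.fq.gz"
# prints: 304428_S1_L001_R1_001.blastx.daa
```

Because `%%.*` removes everything from the _first_ dot, a sample name that itself contains a dot would be truncated at that dot.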


MEGANIZER script (GitHub):

- [20200107_cbai_diamond_blastx_meganizer.sh](https://github.com/RobertsLab/sams-notebook/blob/master/bash_scripts/20200107_cbai_diamond_blastx_meganizer.sh)

```shell
#!/bin/bash

# Script to run MEGAN6 meganizer on DIAMOND DAA files from
# 20200103_cbai_diamond_blastx Mox job.

# Requires MEGAN mapping files from:
# http://ab.inf.uni-tuebingen.de/data/software/megan6/download

# Program path
meganizer=/home/sam/programs/megan/tools/daa-meganizer

# MEGAN mapping files
prot_acc2tax=/home/sam/data/databases/MEGAN/prot_acc2tax-Jul2019X1.abin
acc2interpro=/home/sam/data/databases/MEGAN/acc2interpro-Jul2019X.abin
acc2eggnog=/home/sam/data/databases/MEGAN/acc2eggnog-Jul2019X.abin

# Variables
threads=20

## Run MEGANIZER

# Capture start "time"
start=${SECONDS}
for daa in *.daa
do
  "${meganizer}" \
    --in "${daa}" \
    --threads "${threads}" \
    --acc2taxa "${prot_acc2tax}" \
    --acc2interpro2go "${acc2interpro}" \
    --acc2eggnog "${acc2eggnog}"
done

# Capture end "time"
end=${SECONDS}

runtime=$((end-start))

# Print MEGANIZER runtime, in seconds
echo "Runtime was: ${runtime} seconds"
```
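
The script above reports the runtime as a raw number of seconds, which is hard to read for multi-day jobs. Here's a minimal sketch of converting that value to the same `days-hours:minutes:seconds` layout used for the SBATCH walltime. The `format_runtime` helper is hypothetical and is not part of the original script.

```shell
#!/bin/bash

# Hypothetical helper (not in the original script): convert a whole number
# of seconds to days-hours:minutes:seconds.
format_runtime() {
  local total="$1"
  local days=$(( total / 86400 ))
  local hours=$(( (total % 86400) / 3600 ))
  local minutes=$(( (total % 3600) / 60 ))
  local seconds=$(( total % 60 ))
  printf '%d-%02d:%02d:%02d\n' "${days}" "${hours}" "${minutes}" "${seconds}"
}

# Example value, chosen only for illustration
format_runtime 183845
# prints: 2-03:04:05
```

In the meganizer script, this could be called as `format_runtime "${runtime}"` after the loop finishes.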

---

# RESULTS

Runtime was just a bit over two days (although the job sat in the queue for a full day before it was able to start):

![DIAMOND BLASTx runtime](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20200103_cbai_diamond_blastx_runtime.png?raw=true)

Output folder:

- [20200103_cbai_diamond_blastx/](https://gannet.fish.washington.edu/Atumefaciens/20200103_cbai_diamond_blastx/)


Now that this is complete, I will import the meganized files into MEGAN6 to create `rma6` files, and then separately extract crab reads and _Hematodinium_ reads. These will then be used to generate "clean" transcriptome assemblies for Tanner crab and _Hematodinium_.

Here's the full list of MEGANIZED DIAMOND `daa` files and their sizes (note: they're _huge_ files):

- [304428_S1_L001_R1_001.blastx.daa](https://gannet.fish.washington.edu/Atumefaciens/20200103_cbai_diamond_blastx/304428_S1_L001_R1_001.blastx.daa) (56GB)

- [304428_S1_L001_R2_001.blastx.daa](https://gannet.fish.washington.edu/Atumefaciens/20200103_cbai_diamond_blastx/304428_S1_L001_R2_001.blastx.daa) (54GB)

- [304428_S1_L002_R1_001.blastx.daa](https://gannet.fish.washington.edu/Atumefaciens/20200103_cbai_diamond_blastx/304428_S1_L002_R1_001.blastx.daa) (54GB)

- [304428_S1_L002_R2_001.blastx.daa](https://gannet.fish.washington.edu/Atumefaciens/20200103_cbai_diamond_blastx/304428_S1_L002_R2_001.blastx.daa) (52GB)

- [329774_S1_L001_R1_001.blastx.daa](https://gannet.fish.washington.edu/Atumefaciens/20200103_cbai_diamond_blastx/329774_S1_L001_R1_001.blastx.daa) (39GB)

- [329774_S1_L001_R2_001.blastx.daa](https://gannet.fish.washington.edu/Atumefaciens/20200103_cbai_diamond_blastx/329774_S1_L001_R2_001.blastx.daa) (36GB)

- [329774_S1_L002_R1_001.blastx.daa](https://gannet.fish.washington.edu/Atumefaciens/20200103_cbai_diamond_blastx/329774_S1_L002_R1_001.blastx.daa) (34GB)

- [329774_S1_L002_R2_001.blastx.daa](https://gannet.fish.washington.edu/Atumefaciens/20200103_cbai_diamond_blastx/329774_S1_L002_R2_001.blastx.daa) (32GB)

- [329775_S2_L001_R1_001.blastx.daa](https://gannet.fish.washington.edu/Atumefaciens/20200103_cbai_diamond_blastx/329775_S2_L001_R1_001.blastx.daa) (40GB)

- [329775_S2_L001_R2_001.blastx.daa](https://gannet.fish.washington.edu/Atumefaciens/20200103_cbai_diamond_blastx/329775_S2_L001_R2_001.blastx.daa) (36GB)

- [329775_S2_L002_R1_001.blastx.daa](https://gannet.fish.washington.edu/Atumefaciens/20200103_cbai_diamond_blastx/329775_S2_L002_R1_001.blastx.daa) (37GB)

- [329775_S2_L002_R2_001.blastx.daa](https://gannet.fish.washington.edu/Atumefaciens/20200103_cbai_diamond_blastx/329775_S2_L002_R2_001.blastx.daa) (32GB)

- [329776_S3_L001_R1_001.blastx.daa](https://gannet.fish.washington.edu/Atumefaciens/20200103_cbai_diamond_blastx/329776_S3_L001_R1_001.blastx.daa) (35GB)

- [329776_S3_L001_R2_001.blastx.daa](https://gannet.fish.washington.edu/Atumefaciens/20200103_cbai_diamond_blastx/329776_S3_L001_R2_001.blastx.daa) (32GB)

- [329776_S3_L002_R1_001.blastx.daa](https://gannet.fish.washington.edu/Atumefaciens/20200103_cbai_diamond_blastx/329776_S3_L002_R1_001.blastx.daa) (30GB)

- [329776_S3_L002_R2_001.blastx.daa](https://gannet.fish.washington.edu/Atumefaciens/20200103_cbai_diamond_blastx/329776_S3_L002_R2_001.blastx.daa) (29GB)

- [329777_S4_L001_R1_001.blastx.daa](https://gannet.fish.washington.edu/Atumefaciens/20200103_cbai_diamond_blastx/329777_S4_L001_R1_001.blastx.daa) (40GB)

- [329777_S4_L001_R2_001.blastx.daa](https://gannet.fish.washington.edu/Atumefaciens/20200103_cbai_diamond_blastx/329777_S4_L001_R2_001.blastx.daa) (34GB)

- [329777_S4_L002_R1_001.blastx.daa](https://gannet.fish.washington.edu/Atumefaciens/20200103_cbai_diamond_blastx/329777_S4_L002_R1_001.blastx.daa) (36GB)

- [329777_S4_L002_R2_001.blastx.daa](https://gannet.fish.washington.edu/Atumefaciens/20200103_cbai_diamond_blastx/329777_S4_L002_R2_001.blastx.daa) (31GB)
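
With this many large files, it may be worth confirming that every FastQ logged in `fastq_list.txt` has a matching, non-empty DAA file before importing into MEGAN6. Here's a minimal sketch of such a check. The `missing_daa` helper is hypothetical (not something that was run for this post), and it assumes the same output naming used in the SBATCH script.

```shell
#!/bin/bash

# Hypothetical helper (not in the original scripts): print the name of each
# expected DAA file that is missing or empty.
# $1 = file with one FastQ path per line (e.g. fastq_list.txt)
# $2 = directory expected to contain the *.blastx.daa files
missing_daa() {
  local list_file="$1"
  local out_dir="$2"
  local fastq no_path no_ext

  # The "|| [ -n ... ]" handles a final line that lacks a trailing newline
  while IFS= read -r fastq || [ -n "${fastq}" ]
  do
    # Skip blank lines
    [ -n "${fastq}" ] || continue

    # Same name derivation as the DIAMOND loop
    no_path="${fastq##*/}"
    no_ext="${no_path%%.*}"

    # -s is true only if the file exists and is not empty
    if [ ! -s "${out_dir}/${no_ext}.blastx.daa" ]
    then
      echo "${no_ext}.blastx.daa"
    fi
  done < "${list_file}"
}

# Usage (paths are assumptions): missing_daa fastq_list.txt .
```

No output means that every logged FastQ has a corresponding DAA file in the directory that was checked.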