--- author: Sam White toc-title: Contents toc-depth: 5 toc-location: left layout: post title: DIAMOND BLASTx - C.bairdi RNAseq vs C.opilio Genome Proteins on Mox date: '2021-03-12 11:45' tags: - mox - DIAMOND - BLASTx - Chionoecetes bairdi - Chionoecetes opilio - RNAseq categories: - 2021 - Tanner Crab RNAseq --- We want to generate an additional [Tanner crab (_Chionoecetes bairdi_)](http://en.wikipedia.org/wiki/Chionoecetes_bairdi) transcriptome, per [this GitHub issue, to generate an additional _C.bairdi_ transcriptome](https://github.com/RobertsLab/resources/issues/1135). This has come about due to the release of the genome of a very closely related crab species, [_Chionoecetes opilio_ (Snow crab)](https://en.wikipedia.org/wiki/Chionoecetes_opilio). I used [`DIAMOND`](https://github.com/bbuchfink/diamond) BLASTx along with the [Snow crab genome protein FastA](https://ftp.ncbi.nlm.nih.gov/genomes/genbank/invertebrate/Chionoecetes_opilio/all_assembly_versions/GCA_016584305.1_ASM1658430v1/GCA_016584305.1_ASM1658430v1_protein.faa.gz) (8.7MB) from NCBI (Acc: [GCA_016584305.1](https://www.ncbi.nlm.nih.gov/assembly/GCA_016584305.1/)). NOTE: Since this is geared toward just identifying matching reads, the BLASTx output format will _only contain the query ID_. There will be one BLASTx output file for each corresponding input FastQ file. SBATCH script (GitHub): - [20210312_cbai-vs-copi_diamond_blastx.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20210312_cbai-vs-copi_diamond_blastx.sh) ```shell #!/bin/bash ## Job Name #SBATCH --job-name=20210312_cbai-vs-copi_diamond_blastx ## Allocation Definition #SBATCH --account=srlab #SBATCH --partition=srlab ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=20-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20210312_cbai-vs-copi_diamond_blastx ## Script for running BLASTx (using DIAMOND) with all of our C.bairdi RNAseq data to-date. ## BLASTx against C.opilio _(snow crab) NCBI protein FastA ## Output will be in standard BLAST output format 6, but only query ID. ## Output will be used to extract just reads with matches to to C.opilio genome, ## for downstream transcriptome assembly ################################################################################### # These variables need to be set by user # FastQ directory reads_dir=/gscratch/srlab/sam/data/C_bairdi/RNAseq # DIAMOND database dmnd=/gscratch/srlab/sam/data/C_opilio/blastdbs/GCA_016584305.1_ASM1658430v1_protein.dmnd # Programs array declare -A programs_array programs_array=( [diamond]="/gscratch/srlab/programs/diamond-0.9.29/diamond" ) # FastQ array fastq_array=(${reads_dir}/*fastp-trim*.fq.gz) ################################################################################### # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # BLASTx FastQ files for fastq in "${!fastq_array[@]}" do # Remove path from transcriptome using parameter substitution fastq_name="${fastq_array[$fastq]##*/}" # Generate checksums for reference echo "" echo "Generating checksum for ${fastq_array[$fastq]}." md5sum "${fastq_array[$fastq]}">> fastq.checksums.md5 echo "Completed checksum for ${fastq_array[$fastq]}." echo "" # Run DIAMOND with blastx # Output format 6 query only returns a single query ID per match # block-size and index-chunks are computing resource optimatization paraeters ${programs_array[diamond]} blastx \ --db ${dmnd} \ --query "${fastq_array[$fastq]}" \ --out "${fastq_name}".blastx.outfmt6-query \ --outfmt 6 qseqid \ --evalue 1e-4 \ --max-target-seqs 1 \ --max-hsps 1 \ --block-size 15.0 \ --index-chunks 4 done ################################################################################### # Capture program options echo "Logging program options..." for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" # Handle samtools help menus if [[ "${program}" == "samtools_index" ]] \ || [[ "${program}" == "samtools_sort" ]] \ || [[ "${program}" == "samtools_view" ]] then ${programs_array[$program]} # Handle DIAMOND BLAST menu elif [[ "${program}" == "diamond" ]]; then ${programs_array[$program]} help # Handle NCBI BLASTx menu elif [[ "${program}" == "blastx" ]]; then ${programs_array[$program]} -help fi ${programs_array[$program]} -h echo "" echo "" echo "----------------------------------------------" echo "" echo "" } &>> program_options.log || true # If MultiQC is in programs_array, copy the config file to this directory. if [[ "${program}" == "multiqc" ]]; then cp --preserve ~/.multiqc_config.yaml multiqc_config.yaml fi done # Document programs in PATH (primarily for program version ID) { date echo "" echo "System PATH for $SLURM_JOB_ID" echo "" printf "%0.s-" {1..10} echo "${PATH}" | tr : \\n } >> system_path.log echo "Finished logging system PATH" ``` --- # RESULTS Runtime was surprisingly long, considering how fast [`DIAMOND`](https://github.com/bbuchfink/diamond) BLASTx has run on other samples. It took nearly 7hrs to complete: ![DIAMOND BLASTx runtime for 92 C.bairdi RNAseq FastQs vs C.opilio protein FastA](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20210312_cbai-vs-copi_diamond_blastx_runtime.png?raw=true) Output folder: - [20210312_cbai-vs-copi_diamond_blastx/](https://gannet.fish.washington.edu/Atumefaciens/20210312_cbai-vs-copi_diamond_blastx/) - Input FastQ list/MD5 checksums: - [fastq.checksums.md5](https://gannet.fish.washington.edu/Atumefaciens/20210312_cbai-vs-copi_diamond_blastx/fastq.checksums.md5) Due to the large number of output files, I will not link to each of them here. Please browse the output folder linked above.