--- layout: post title: Metagenomics - FastQ BLASTx of Water Samples on Mox date: '2019-05-15 15:07' tags: - metagenomics - Panopea generosa - geoduck - mox - blastx - seqkt categories: - Miscellaneous --- After a meeting on this project today, we decided to try a few things to continue with various approaches to assessing the metagenome. One of the approaches is to just run BLASTx on the FastQ sequences themselves and obtain taxonomy info for them, so that's what I did here. Used the trimmed reads from [20181211](https://robertslab.github.io/sams-notebook/2018/12/11/FastQC-and-Trimming-Metagenomics-(Geoduck)-HiSeqX-Reads-from-20180809.html) as input and combined corresponding paired-end reads into a singular FastA file using [seqkt v1.3](https://github.com/lh3/seqtk/releases/tag/v1.3). Samples breakdown like so: - pH=7.1: MG3, MG6, MG7 - pH=8.2: MG1, MG2, MG5 SBATCH script (GitHub): - [20190515_metagenomics_pgen_fastq_blastx.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20190515_metagenomics_pgen_fastq_blastx.sh) ```shell #!/bin/bash ## Job Name #SBATCH --job-name=blastx_metagenomics ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=25-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190515_metagenomics_pgen_fastq_blastx # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Document programs in PATH (primarily for program version ID) date >> system_path.log echo "" >> system_path.log echo "System PATH for $SLURM_JOB_ID" >> system_path.log echo "" >> system_path.log printf "%0.s-" {1..10} >> system_path.log echo "${PATH}" | tr : \\n >> system_path.log threads=28 # Paths to programs blast_dir="/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin" blastx="${blast_dir}/blastx" seqtk="/gscratch/srlab/programs/seqtk-1.3/seqtk" # Paths to blastdbs blastdb_dir="/gscratch/srlab/blastdbs/ncbi-sp-v5" blast_db="${blastdb_dir}/swissprot_v5" samtools="/gscratch/srlab/programs/samtools-1.9/samtools" # Input files fastq_dir="/gscratch/srlab/sam/data/metagenomics/P_generosa/" ## Inititalize arrays fastq_array_R1=() fastq_array_R2=() names_array=() # Create array of fastq R1 files for fastq in ${fastq_dir}/*R1*.gz do fastq_array_R1+=(${fastq}) done # Create array of fastq R2 files for fastq in ${fastq_dir}/*R2*.gz do fastq_array_R2+=(${fastq}) done # Create array of sample names ## Uses parameter substitution to strip leading path from filename ## Uses awk to parse out sample name from filename for R1_fastq in ${fastq_dir}/*R1*.gz do names_array+=($(echo ${R1_fastq#${fastq_dir}} | awk -F"_" '{print $3 $4}')) done # Create list of fastq files used in analysis ## Uses parameter substitution to strip leading path from filename for fastq in ${fastq_dir}*.gz do echo "${fastq#${fastq_dir}}" >> fastq.list.txt done # Merge paired-end reads into singular FastA files # Uses seqtk for FastQ/FastA manipulation. for index in "${!fastq_array_R1[@]}" do sample_name=$(echo "${names_array[index]}") if [ "${sample_name}" == "MG1" ] \ || [ "${sample_name}" == "MG2" ] \ || [ "${sample_name}" == "MG5" ] then sample_name="${sample_name}"_pH82 else sample_name="${sample_name}"_pH71 fi "${seqtk}" mergefa "${fastq_array_R1[index]}" "${fastq_array_R2[index]}" > "${sample_name}".fasta done # Export BLAST database directory export BLASTDB=${blastdb_dir} # Loop through FastAs # Create list of those FastAs for reference # Parse out sample names # Run BLASTx on each FastA for fasta in *.fasta do echo "${fasta}" >> input.fasta.list.txt "${samtools}" faidx "${fasta}" no_ext=${fasta%%.*} sample_name=$(echo ${no_ext##*/}) # Run BLASTx on each FastA ${blastx} \ -query "${fasta}" \ -db ${blast_db} \ -max_hsps 1 \ -outfmt "6 std staxid ssciname" \ -evalue 1e-10 \ -num_threads ${threads} \ > "${sample_name}".blastx.outfmt6 done ``` Oof, after running for over _nine days_ the Mox node crashed this past weekend: ![Screencap of blastx node failure notification](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20190607_blastx_node_fail.png?raw=true) --- #### RESULTS Output folder: - [20190515_metagenomics_pgen_fastq_blastx](http://gannet.fish.washington.edu/Atumefaciens/20190515_metagenomics_pgen_fastq_blastx)