--- author: Sam White toc-title: Contents toc-depth: 5 toc-location: left layout: post title: Metagenome Assembly - P.generosa Water Sample Trimmed HiSeqX Data Using Megahit on Mox date: '2019-01-02 14:26' tags: - Panopea generosa - geoduck - megahit - mox - metagenome - assembly categories: - 2019 - Miscellaneous --- Attempting an assembly of our geoduck metagenomics HiSeqX data using [Megahit (v.1.1.4)](https://github.com/voutcn/megahit) on a Mox node. Assembly was run using default settings. Assembly coverage was assessed using the following [BBmap (v.38.34) scripts](https://sourceforge.net/projects/bbmap/): - ```bbwrap.sh``` (alignments) - ```pileup.sh``` (coverage) SBATCH script is linked here and pasted in full below: - [20190102_metagenomics_geo_megahit/20190102_metagenomics_geo_megahit.sh](http://gannet.fish.washington.edu/Atumefaciens/20190102_metagenomics_geo_megahit/20190102_metagenomics_geo_megahit.sh) ``` #!/bin/bash ## Job Name #SBATCH --job-name=megahit ## Allocation Definition #SBATCH --account=srlab #SBATCH --partition=srlab ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=5-00:00:00 ## Memory per node #SBATCH --mem=500G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190102_metagenomics_geo_megahit # Load Python Mox module for Python module availability module load intel-python3_2017 # Load Open MPI module for parallel, multi-node processing module load icc_19-ompi_3.1.2 # SegFault fix? export THREADS_DAEMON_MODEL=1 # Document programs in PATH (primarily for program version ID) date >> system_path.log echo "" >> system_path.log echo "System PATH for $SLURM_JOB_ID" >> system_path.log echo "" >> system_path.log printf "%0.s-" {1..10} >> system_path.log echo ${PATH} | tr : \\n >> system_path.log # variables fastq_dir=/gscratch/srlab/sam/data/metagenomics/P_generosa megahit=/gscratch/srlab/programs/megahit_v1.1.4_LINUX_CPUONLY_x86_64-bin/megahit bbmap_dir=/gscratch/srlab/programs/bbmap_38.34 ## Inititalize arrays fastq_array_R1=() fastq_array_R2=() # Create array of fastq R1 files for fastq in ${fastq_dir}/*R1*.gz; do fastq_array_R1+=(${fastq}) done # Create array of fastq R2 files for fastq in ${fastq_dir}/*R2*.gz; do fastq_array_R2+=(${fastq}) done # Create comma-separated list of input files R1_fastq_list=$(IFS=,; echo "${fastq_array_R1[*]}") R2_fastq_list=$(IFS=,; echo "${fastq_array_R2[*]}") # Run Megahit using paired-end reads ${megahit} \ -1 ${R1_fastq_list} \ -2 ${R2_fastq_list} \ --num-cpu-threads 28 # Determine coverage ## Align reads with BBmap BBwrap ${bbmap_dir}/bbwrap.sh \ ref=megahit_out/final.contigs.fa \ in1=${R1_fastq_list} \ in2=${R2_fastq_list} \ out=aln.sam.gz ## Output contig coverage ${bbmap_dir}/pileup.sh \ in=aln.sam.gz \ out=coverage.txt ``` --- # RESULTS This took a little less than a day to run: ![](http://gannet.fish.washington.edu/Atumefaciens/images/20190103_mox_megahit_complete.png) Output folder: - [20190102_metagenomics_geo_megahit/](http://gannet.fish.washington.edu/Atumefaciens/20190102_metagenomics_geo_megahit/) Assembly (FastA): - [20190102_metagenomics_geo_megahit/megahit_out/final.contigs.fa](http://gannet.fish.washington.edu/Atumefaciens/20190102_metagenomics_geo_megahit/megahit_out/final.contigs.fa) Alignment (Gzipped SAM): - [20190102_metagenomics_geo_megahit/aln.sam.gz](http://gannet.fish.washington.edu/Atumefaciens/20190102_metagenomics_geo_megahit/aln.sam.gz) Coverage file (txt): - [20190102_metagenomics_geo_megahit/coverage.txt](http://gannet.fish.washington.edu/Atumefaciens/20190102_metagenomics_geo_megahit/coverage.txt) This assembly yielded 2,276,153 contigs (```grep -c ">" final.contigs.fa```). Here're some alignment stats from the ```BBmmap pileup.sh``` command (can be found at end of [the slurm file (text)](http://gannet.fish.washington.edu/Atumefaciens/20190102_metagenomics_geo_megahit/slurm-553487.out)): ``` Reads: 163236894 Mapped reads: 143703254 Mapped bases: 16437148607 Ref scaffolds: 2276153 Ref bases: 2243059881 Percent mapped: 88.034 Percent proper pairs: 76.768 Average coverage: 7.328 Standard deviation: 27.892 Percent scaffolds with any coverage: 70.54 Percent of reference bases covered: 56.07 ``` Also, ran Quast to assess assembly: ``` python \ /home/sam/software/quast-4.5/quast.py \ --threads=20 \ --min-contig=100 \ --labels=meta_megahit_assembly \ /mnt/gannet/Atumefaciens/20190102_metagenomics_geo_megahit/megahit_out/final.contigs.fa ``` Quast output folder: - [20190102_metagenomics_geo_megahit/megahit_out/quast_results/results_2019_01_03_13_37_32/](http://gannet.fish.washington.edu/Atumefaciens/20190102_metagenomics_geo_megahit/megahit_out/quast_results/results_2019_01_03_13_37_32/) Quast report (HTML): - [20190102_metagenomics_geo_megahit/megahit_out/quast_results/results_2019_01_03_13_37_32/report.html](http://gannet.fish.washington.edu/Atumefaciens/20190102_metagenomics_geo_megahit/megahit_out/quast_results/results_2019_01_03_13_37_32/report.html) Quast report (text): - [20190102_metagenomics_geo_megahit/megahit_out/quast_results/results_2019_01_03_13_37_32/report.txt](http://gannet.fish.washington.edu/Atumefaciens/20190102_metagenomics_geo_megahit/megahit_out/quast_results/results_2019_01_03_13_37_32/report.txt) Quast ```report.txt```: ``` All statistics are based on contigs of size >= 100 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs). Assembly meta_megahit_assembly # contigs (>= 0 bp) 2276153 # contigs (>= 1000 bp) 496414 # contigs (>= 5000 bp) 41075 # contigs (>= 10000 bp) 12255 # contigs (>= 25000 bp) 2797 # contigs (>= 50000 bp) 864 Total length (>= 0 bp) 2243059881 Total length (>= 1000 bp) 1342986697 Total length (>= 5000 bp) 487449312 Total length (>= 10000 bp) 293688660 Total length (>= 25000 bp) 154694058 Total length (>= 50000 bp) 89125138 # contigs 2276153 Largest contig 2035937 Total length 2243059881 GC (%) 39.32 N50 1413 N75 624 L50 308641 L75 932685 # N's per 100 kbp 0.00 ``` Now, should probably BLASTn, identify ORFs, translate and BLASTp those ORFs.