---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
layout: post
title: Gene Prediction - HiSeqX Metagenomics from Geoduck Water Using MetaGeneMark on Mox
date: '2019-01-03 14:49'
tags:
  - gene prediction
  - metagenomics
  - geoduck
  - Panopea generosa
  - MetaGeneMark
  - mox
categories:
  - 2019
  - Miscellaneous
---

[After assembling the metagenomic data yesterday](https://robertslab.github.io/sams-notebook/posts/2019/2019-01-02-Metagenome-Assembly---P.generosa-Water-Sample-HiSeqX-Data-Using-Megahit/), I needed to predict some genes. I did this using [MetaGeneMark (v.3.38)](http://exon.gatech.edu/GeneMark/) and ran it on Mox.

Input FastA (2.2GB):

- [20190102_metagenomics_geo_megahit/megahit_out/final.contigs.fa](http://gannet.fish.washington.edu/Atumefaciens/20190102_metagenomics_geo_megahit/megahit_out/final.contigs.fa)

SBATCH script (text):

- [20190103_metagenomics_geo_metagenemark/20190103_metagenomics_geo_metagenemark.sh](http://gannet.fish.washington.edu/Atumefaciens/20190103_metagenomics_geo_metagenemark/20190103_metagenomics_geo_metagenemark.sh)
```shell
#!/bin/bash
## Job Name
#SBATCH --job-name=busco
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=4-00:00:00
## Memory per node
#SBATCH --mem=500G
## Turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190103_metagenomics_geo_metagenemark

# Load Python Mox module for Python module availability
module load intel-python3_2017

# Load Open MPI module for parallel, multi-node processing
module load icc_19-ompi_3.1.2

# SegFault fix?
export THREADS_DAEMON_MODEL=1

# Document programs in PATH (primarily for program version ID)
date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log

# Variables
gmhmmp=/gscratch/srlab/programs/MetaGeneMark_linux_64_3.38/mgm/gmhmmp
mgm_mod=/gscratch/srlab/programs/MetaGeneMark_linux_64_3.38/mgm/MetaGeneMark_v1.mod
assembly_fasta=/gscratch/scrubbed/samwhite/outputs/20190102_metagenomics_geo_megahit/megahit_out/final.contigs.fa
nuc_out=20190103-mgm-nucleotides.fa
gff_out=20190103-mgm.gff3
prot_out=20190103-mgm-proteins.fa

# Run MetaGeneMark
## Specifying the following:
### -a : output predicted proteins
### -A : write predicted proteins to designated file
### -d : output predicted nucleotides
### -D : write predicted nucleotides to designated file
### -f 3 : Output format in GFF3
### -m : Model file (supplied with software)
### -o : write GFF3 to designated file
${gmhmmp} \
-a \
-A ${prot_out} \
-d \
-D ${nuc_out} \
-f 3 \
-m ${mgm_mod} \
${assembly_fasta} \
-o ${gff_out}
```
This outputs the predicted genes as nucleotide and protein FastA files, along with a GFF3 file.
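Since the job writes three separate files, a quick check that each one exists and is non-empty can catch a silently failed run. The following is a minimal sketch and was not part of the job that was actually run; the `check_outputs` function name is hypothetical, and the file names in the usage comment are the ones set in the script's variables.

```shell
# Hypothetical helper: report whether each expected output file exists
# and is non-empty. Returns non-zero if any file is missing or empty.
check_outputs() {
  status=0
  for f in "$@"; do
    if [ -s "${f}" ]; then
      echo "OK: ${f}"
    else
      echo "MISSING or EMPTY: ${f}" >&2
      status=1
    fi
  done
  return "${status}"
}

# Example usage from the job's working directory:
# check_outputs 20190103-mgm-nucleotides.fa 20190103-mgm.gff3 20190103-mgm-proteins.fa
```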
---
# RESULTS
Whoa! This was ridiculously fast! It completed in ~5 minutes!
Output folder:
- [20190103_metagenomics_geo_metagenemark/](http://gannet.fish.washington.edu/Atumefaciens/20190103_metagenomics_geo_metagenemark/)
Nucleotide FastA (1.6GB):
- [20190103_metagenomics_geo_metagenemark/20190103-mgm-nucleotides.fa](http://gannet.fish.washington.edu/Atumefaciens/20190103_metagenomics_geo_metagenemark/20190103-mgm-nucleotides.fa)
Protein FastA (727MB):
- [20190103_metagenomics_geo_metagenemark/20190103-mgm-proteins.fa](http://gannet.fish.washington.edu/Atumefaciens/20190103_metagenomics_geo_metagenemark/20190103-mgm-proteins.fa)
GFF3 File (1.3GB):
- [20190103_metagenomics_geo_metagenemark/20190103-mgm.gff3](http://gannet.fish.washington.edu/Atumefaciens/20190103_metagenomics_geo_metagenemark/20190103-mgm.gff3)
A cursory glance at the FastA files (```grep -c ">" fasta```) indicates a total of 3,296,610 predicted genes.
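A minimal sketch of that count, applied to both FastA files, is below. The `count_fasta_records` function name is hypothetical, and anchoring the pattern with `^` is an addition here so that only header lines are counted. The loop skips any file that is not in the current directory.

```shell
# Hypothetical helper: count FastA records by counting header lines,
# i.e. lines that begin with ">".
count_fasta_records() {
  grep -c "^>" "$1"
}

# Print a tab-separated file name and record count for each output FastA.
# Files that are not present in the current directory are skipped.
for fasta in 20190103-mgm-nucleotides.fa 20190103-mgm-proteins.fa; do
  if [ -f "${fasta}" ]; then
    printf "%s\t%s\n" "${fasta}" "$(count_fasta_records "${fasta}")"
  fi
done
```

Comparing the two counts is a quick consistency check, on the assumption that each predicted gene should appear once in each file.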
Now, for some annotations using BLASTn and/or BLASTp...