---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
layout: post
title: Gene Prediction - HiSeqX Metagenomics from Geoduck Water Using MetaGeneMark on Mox
date: '2019-01-03 14:49'
tags:
  - gene prediction
  - metagenomics
  - geoduck
  - Panopea generosa
  - MetaGeneMark
  - mox
categories:
  - 2019
  - Miscellaneous
---
[After assembline the metagenomic data yesterday](https://robertslab.github.io/sams-notebook/posts/2019/2019-01-02-Metagenome-Assembly---P.generosa-Water-Sample-HiSeqX-Data-Using-Megahit/), I needed to predict some genes. I did this using [MetaGeneMark (v.3.38)](http://exon.gatech.edu/GeneMark/) and ran it on Mox.

Input FastA(2.2GB):

- [20190102_metagenomics_geo_megahit/megahit_out/final.contigs.fa](http://gannet.fish.washington.edu/Atumefaciens/20190102_metagenomics_geo_megahit/megahit_out/final.contigs.fa)

SBATCH script (text):

- [20190103_metagenomics_geo_metagenemark/20190103_metagenomics_geo_metagenemark.sh](http://gannet.fish.washington.edu/Atumefaciens/20190103_metagenomics_geo_metagenemark/20190103_metagenomics_geo_metagenemark.sh)

<pre><code>
#!/bin/bash
## Job Name
#SBATCH --job-name=busco
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=4-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190103_metagenomics_geo_metagenemark

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Load Open MPI module for parallel, multi-node processing

module load icc_19-ompi_3.1.2

# SegFault fix?
export THREADS_DAEMON_MODEL=1

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log

# Variables
gmhmmp=/gscratch/srlab/programs/MetaGeneMark_linux_64_3.38/mgm/gmhmmp
mgm_mod=/gscratch/srlab/programs/MetaGeneMark_linux_64_3.38/mgm/MetaGeneMark_v1.mod
assembly_fasta=/gscratch/scrubbed/samwhite/outputs/20190102_metagenomics_geo_megahit/megahit_out/final.contigs.fa
nuc_out=20190103-mgm-nucleotides.fa
gff_out=20190103-mgm.gff3
prot_out=20190103-mgm-proteins.fa

# Run MetaGeneMark
## Specifying the following:
### -a : output predicted proteins
### -A : write predicted proteins to designated file
### -d : output predicted nucleotides
### -D : write predicted protein to designated file
### -f 3 : Output format in GFF3
### -m : Model file (supplied with software)
### -o : write GFF3 to designated file
${gmhmmp} \
-a \
-A ${prot_out} \
-d \
-D ${nuc_out} \
-f 3 \
-m ${mgm_mod} \
${assembly_fasta} \
-o ${gff_out}
</code></pre>

This will output predicted genes, both nucleotides and proteins, as FastA files, and a GFF3 file.

---

# RESULTS

Whoa! This was ridiclously fast! It completed in ~5 minutes!

Output folder:

- [20190103_metagenomics_geo_metagenemark/](http://gannet.fish.washington.edu/Atumefaciens/20190103_metagenomics_geo_metagenemark/)

Nucleotide FastA (1.6GB):

- [20190103_metagenomics_geo_metagenemark/20190103-mgm-nucleotides.fa](http://gannet.fish.washington.edu/Atumefaciens/20190103_metagenomics_geo_metagenemark/20190103-mgm-nucleotides.fa)

Protein FastA (727MB):

- [20190103_metagenomics_geo_metagenemark/20190103-mgm-proteins.fa](http://gannet.fish.washington.edu/Atumefaciens/20190103_metagenomics_geo_metagenemark/20190103-mgm-proteins.fa)

GFF3 File (1.3GB):

- [20190103_metagenomics_geo_metagenemark/20190103-mgm.gff3](http://gannet.fish.washington.edu/Atumefaciens/20190103_metagenomics_geo_metagenemark/20190103-mgm.gff3)

A cursory glance at the FastA files (```grep -c ">" fasta```) indicate a total of 3,296,610 genes predicted.

Now, for some annotations using BLASTn and/or BLASTp...