Marinimicrobia and Sulfitobacter in an Oxygen Deficient Zone

Jordan Winter

Project Goal

Create a presence-absence plot of genes in Marinimicrobia metagenome-assembled genomes (MAGs).

Workflow to accomplish this goal:

  • Annotate bins
  • Visualize bins in anvi'o
  • Use KEGG orthologs for gene annotation

Initial Data

  • Annotations (KEGG orthologs and annotated FASTA files)
  • Assembly (FASTA files and metadata, such as BAM files and contig lengths)
  • Bins (list of contigs and reads in each bin)
  • CAT-BAT (taxonomic assignment of each read and contig)

Bins

Each bin file lists the names of the contigs and reads assigned to that bin.

# One name per line, no header row
sulf_bin <- read.csv("../data/Bins/assembly_plus_bins/assembly_plus_bin.4", header = FALSE)
head(sulf_bin)
                      V1
1  MG1058_s15.ctg000016c
2 MG1058_s273.ctg000288c
3 MG1058_s408.ctg000428c

I then retrieved the FASTA sequences for each bin so that completeness and contamination statistics could be calculated.
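One way to pull a bin's sequences out of the assembly can be sketched in base R. This is a minimal illustration, not necessarily how the bin FASTA files were actually produced: `subset_fasta` is a hypothetical helper, the toy assembly is invented, and only the two contig names are taken from the bin file shown above.

```r
# Hypothetical helper: keep only the FASTA records whose names appear in a bin.
# `fasta_lines`: character vector of lines from an assembly FASTA file.
# `bin_contigs`: contig names from a bin file (e.g. sulf_bin$V1).
subset_fasta <- function(fasta_lines, bin_contigs) {
  is_header <- startsWith(fasta_lines, ">")
  # Record index for every line; it increments at each header line
  record_id <- cumsum(is_header)
  # Record name = header text after ">" up to the first whitespace
  record_names <- sub("^>([^[:space:]]+).*$", "\\1", fasta_lines[is_header])
  keep_record <- record_names %in% bin_contigs
  # Keep a line if it belongs to a record that is in the bin
  fasta_lines[record_id > 0 & keep_record[pmax(record_id, 1)]]
}

# Toy assembly (invented sequences); in practice this would come from
# readLines() on the assembly FASTA file.
assembly <- c(
  ">MG1058_s15.ctg000016c len=8",
  "ACGTACGT",
  ">other_contig",
  "GGGGCCCC",
  "TTTT",
  ">MG1058_s273.ctg000288c",
  "ATATATAT"
)
bin_contigs <- c("MG1058_s15.ctg000016c", "MG1058_s273.ctg000288c")

bin_fasta <- subset_fasta(assembly, bin_contigs)
print(bin_fasta)
```

The result can be written with `writeLines()` to give one FASTA file per bin. Multi-line records are kept whole because every sequence line inherits the index of the header above it.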

Bin Annotation

This is part of the code I used to identify Marinimicrobia bins from the CAT-BAT annotation of each contig. The species label with the most supporting ORFs is taken as the consensus for the bin.

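    library(dplyr)
    library(stringr)

    # Sum ORF support per species label, dropping unsupported contigs
    consensus <- all_contigs %>%
      group_by(species) %>%
      summarize(support = sum(ORFs_true)) %>%
      filter(species != "no support")
    # Give every Marinimicrobia label the same name
    index <- which(str_detect(consensus$species, "Marinimicrobia"))
    consensus$species[index] <- "Marinimicrobia"
    # Keep the label(s) with the highest support
    consensus <- consensus$species[which(consensus$support
                                         == max(consensus$support))]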
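To make the consensus step concrete, here is a toy run in base R. The `all_contigs` table and its species labels are invented; only the column names `species` and `ORFs_true` and the "no support" label come from the code above. Note one deliberate difference: this sketch collapses the Marinimicrobia labels before summing, so support spread across several Marinimicrobia lineages is pooled. Whether pooling is the right choice for these bins is an assumption.

```r
# Invented per-contig CAT-BAT summary for a single bin
all_contigs <- data.frame(
  species = c("Candidatus Marinimicrobia bacterium A",
              "Candidatus Marinimicrobia bacterium B",
              "Sulfitobacter sp.",
              "no support"),
  ORFs_true = c(4, 3, 5, 10),
  stringsAsFactors = FALSE
)

# Drop contigs without taxonomic support
supported <- all_contigs[all_contigs$species != "no support", ]

# Collapse all Marinimicrobia labels BEFORE summing (assumption: pooling is wanted)
is_mar <- grepl("Marinimicrobia", supported$species)
supported$species[is_mar] <- "Marinimicrobia"

# Total ORF support per label, then the label(s) with the maximum
support <- tapply(supported$ORFs_true, supported$species, sum)
consensus <- names(support)[support == max(support)]
print(consensus)
```

With these toy numbers the pooled Marinimicrobia support (4 + 3 = 7) beats Sulfitobacter (5), whereas summing each label separately would pick Sulfitobacter. Ties return more than one label, so downstream code should expect a vector.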

BUSCO

I used BUSCO to estimate the completeness and contamination of the bins.

conda activate busco
busco -i github/jordan-marinimicrobia/data/Bin_fa -l bacteria_odb10 \
-m geno -o github/jordan-marinimicrobia/output/busco_outputs -c 8

BUSCO results were combined with clade names in a summary table.

summary_table <- read.csv("../output/all_mar_bins.csv")
kable(head(summary_table, 1))
bin                    sample              Dataset         Complete  Single  Duplicated  Fragmented  Missing
24_sample_bam_bin.118  24_sample_bam_bins  bacteria_odb10  0.8       0.8     0           0.8         98.4

n_markers  Scaffold.N50  Contigs.N50  Percent.gaps  Number.of.scaffolds  sum_len  clade
124        2280          2280         0.000%        782                  821700   NA
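Combining the BUSCO results with clade names amounts to a left join on the bin name. The sketch below uses base R `merge()`; the second bin, its scores, and the `clades` lookup table are invented for illustration, and the real table has more columns than shown here.

```r
# Invented BUSCO summary rows; column names follow the summary table above
busco <- data.frame(
  bin = c("24_sample_bam_bin.118", "24_sample_bam_bin.7"),
  Complete = c(0.8, 91.1),
  Missing = c(98.4, 6.5),
  stringsAsFactors = FALSE
)

# Invented clade lookup: only bins with a consensus label appear here
clades <- data.frame(
  bin = "24_sample_bam_bin.7",
  clade = "Marinimicrobia",
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every BUSCO row; bins without a clade get NA
summary_table <- merge(busco, clades, by = "bin", all.x = TRUE)
print(summary_table)
```

Keeping unmatched bins with `all.x = TRUE` is what produces rows with `clade` equal to `NA`, like the one in the table above.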

Anvi'o

I used anvi'o to visualize the MAGs (that is, the bins) and to inspect their GC content and coverage.

Gene Presence-Absence Code

pathway <- NULL  # accumulates one block of rows per contig
for (one_contig in unique(ko$contig)) {
  ko_small <- subset(ko, contig == one_contig)
  results <- NULL
  for (i in seq_along(all_paths)) {
    path <- all_paths[i]
    # Number of KO annotations on this contig assigned to this pathway
    results$count[i] <- length(which(ko_small$pathway == path))
    results$path[i] <- path
  }
  results <- as.data.frame(results)
  results$bin <- one_contig
  pathway <- rbind(pathway, results)
}

pathway$presence <- "no"
index <- which(pathway$count > 0)
pathway$presence[index] <- "yes"
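The same counts can also be obtained without explicit loops. The sketch below is an alternative to the code above, not the code I ran: it cross-tabulates bins against pathways with `table()`. The toy `ko` table and the pathway names are invented; only the column names `contig` and `pathway` and the vector name `all_paths` come from the code above.

```r
# Invented KO annotation table: one row per annotated gene
ko <- data.frame(
  contig = c("bin_a", "bin_a", "bin_a", "bin_b"),
  pathway = c("nitrogen", "nitrogen", "sulfur", "sulfur"),
  stringsAsFactors = FALSE
)
all_paths <- c("nitrogen", "sulfur", "methane")

# Cross-tabulate; fixing the factor levels keeps pathways with zero hits
counts <- table(
  bin = ko$contig,
  path = factor(ko$pathway, levels = all_paths)
)

# Long format: one row per bin/pathway combination
pathway <- as.data.frame(counts, responseName = "count",
                         stringsAsFactors = FALSE)
pathway$presence <- ifelse(pathway$count > 0, "yes", "no")
print(pathway)
```

Setting the factor levels to `all_paths` matters: without it, a pathway that no gene maps to would be missing from the table instead of appearing with a count of zero.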

Gene Presence-Absence Figure

This shows the presence-absence plot for Sulfitobacter.
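For readers who want to reproduce a plot of this kind, a presence-absence grid can be drawn from a table shaped like `pathway`. This is a minimal base-graphics sketch on invented data, not the code behind the figure; the bin and pathway names and the colors are placeholders.

```r
# Invented long-format presence table (same columns as `pathway`)
pathway <- data.frame(
  bin = rep(c("bin_a", "bin_b"), times = 3),
  path = rep(c("nitrogen", "sulfur", "methane"), each = 2),
  presence = c("yes", "no", "yes", "yes", "no", "no"),
  stringsAsFactors = FALSE
)

# Logical matrix: rows = bins, columns = pathways
presence_mat <- tapply(pathway$presence == "yes",
                       list(pathway$bin, pathway$path), any)

# image() expects z with one row per x value, hence the transpose
image(
  x = seq_len(ncol(presence_mat)),
  y = seq_len(nrow(presence_mat)),
  z = t(presence_mat * 1),
  col = c("white", "steelblue"), zlim = c(0, 1),
  axes = FALSE, xlab = "Pathway", ylab = "Bin"
)
axis(1, at = seq_len(ncol(presence_mat)), labels = colnames(presence_mat))
axis(2, at = seq_len(nrow(presence_mat)), labels = rownames(presence_mat), las = 1)
box()
```

Filled cells mark pathways with at least one annotated gene in that bin.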

Next Steps

In the next few weeks, I plan to refine the Sulfitobacter bins and produce presence-absence gene plots for the “best” Marinimicrobia bins.

  • Add higher resolution to final presence/absence gene plot
  • Go through workflow with “best” Marinimicrobia bins
  • Clean up code and add more explanations of code