--- title: "Evaluating Assemblies" author: Steven Roberts date: "`r format(Sys.time(), '%d %B, %Y')`" output: html_document: theme: readable highlight: zenburn toc: true toc_float: true number_sections: true code_folding: show code_download: true --- ```{r setup, include=FALSE} library(knitr) library(tidyverse) library(kableExtra) library(DESeq2) library(pheatmap) library(RColorBrewer) library(data.table) library(DT) library(ShortRead) library(Biostrings) knitr::opts_chunk$set( echo = TRUE, # Display code chunks eval = FALSE, # Evaluate code chunks warning = FALSE, # Hide warnings message = FALSE, # Hide messages fig.width = 6, # Set plot width in inches fig.height = 4, # Set plot height in inches fig.align = "center" # Align plots to the center ) ``` Rpubs link https://rpubs.com/sr320/1042307 Check out which assembly has best gene set https://github.com/laurahspencer/DuMOAR/issues/25 ### GAWN (Genome Annotation Without Nightmares): Following the GAWN instructions I ran the script GAWN-annotation.sh using the scaffold-only genome, the transcriptome that Giles generated for me, the Swissprot database, and the config file gawn_config.sh. All resulting files are on Google Drive here, but the files <100MB are also in this repo here. Feel free to drop any files resulting from your work in either location. Gawn: https://gannet.fish.washington.edu/seashell/snaps/PGA_assembly.scaffolds_only.gff3 ### GenSaS: All resulting files are on Google Drive here, and files <100 MB are also in this repo here (which includes all except for the repeat elements). If anyone has a GenSaS account I can share the project with you, I just need to know your username. Note that GenSaS was run with the full genome, so any resulting files should be filtered for only the scaffolds. GenSas https://gannet.fish.washington.edu/seashell/snaps/Metacarcinus-magister-v1.0.a1.6448195bde6ca-publish.genes.gff3 https://gannet.fish.washington.edu/seashell/snaps/Metacarcinus-magister-v1.0.a1.6448195bde6ca-publish.genes.fna --- # GAWN ```{r, engine='bash'} cd ../data curl -O https://gannet.fish.washington.edu/seashell/snaps/PGA_assembly.scaffolds_only.gff3 ``` ```{r, engine='bash', eval=TRUE} head ../data/PGA_assembly.scaffolds_only.gff3 ``` ```{r, engine='bash'} awk '$3=="gene"' ../data/PGA_assembly.scaffolds_only.gff3 > ../results/PGA_assembly.scaffolds_only.gene.gff3 ``` ```{r, engine='bash', eval=TRUE} tail ../results/PGA_assembly.scaffolds_only.gene.gff3 ``` ```{r, engine='bash', eval=TRUE} wc -l ../results/PGA_assembly.scaffolds_only.gene.gff3 ``` ## grab fasta ```{r, engine='bash'} /home/shared/bedtools2/bin/bedtools getfasta \ -fi ../data/Mmag_scaffold.fa \ -bed ../results/PGA_assembly.scaffolds_only.gene.gff3 \ -fo ../results/GAWN-gene.fa ``` ```{r, engine='bash'} /home/shared/bedtools2/bin/bedtools getfasta \ -fi ../data/Mmag_scaffold.fa \ -bed ../results/PGA_assembly.scaffolds_only.gene.gff3 \ | tail ``` ```{r, cache=TRUE} # Read in FastA file fasta_filegawn <- "../results/GAWN-gene.fa" sequences <- readDNAStringSet(fasta_filegawn) # Calculate lengths lengths <- width(sequences) # Create a data frame df <- data.frame(length = lengths) # Plot histogram with ggplot2 ggplot(df, aes(x = length)) + geom_histogram(binwidth = 1000, fill = "blue", color = "black") + xlab("Sequence Length") + ylab("Frequency") + ggtitle("Histogram of Sequence Lengths") + theme_minimal() ``` # GenSas https://gannet.fish.washington.edu/seashell/snaps/Metacarcinus-magister-v1.0.a1.6448195bde6ca-publish.genes.gff3 https://gannet.fish.washington.edu/seashell/snaps/Metacarcinus-magister-v1.0.a1.6448195bde6ca-publish.genes.fna ```{r, engine='bash'} cd ../data curl -O https://gannet.fish.washington.edu/seashell/snaps/Metacarcinus-magister-v1.0.a1.6448195bde6ca-publish.genes.gff3 curl -O https://gannet.fish.washington.edu/seashell/snaps/Metacarcinus-magister-v1.0.a1.6448195bde6ca-publish.genes.fna ``` ```{r, engine='bash', eval=TRUE} head ../data/Metacarcinus-magister-v1.0.a1.6448195bde6ca-publish.genes.gff3 ``` ```{r, engine='bash'} awk '!/unscaffolded/' ../data/Metacarcinus-magister-v1.0.a1.6448195bde6ca-publish.genes.gff3 > ../results/Metacarcinus-magister-v1.0.a1.6448195bde6ca-publish.genes-scaffold.gff3 ``` ```{r, engine='bash'} awk '$3=="gene"' ../data/Metacarcinus-magister-v1.0.a1.6448195bde6ca-publish.genes.gff3 | awk '!/unscaffolded/' > ../results/Metacarcinus-magister-v1.0.a1.6448195bde6ca-publish.gene-only.gff3 ``` ```{r, engine='bash', eval=TRUE} tail ../results/Metacarcinus-magister-v1.0.a1.6448195bde6ca-publish.gene-only.gff3 ``` ```{r, engine='bash', eval=TRUE} wc -l ../results/Metacarcinus-magister-v1.0.a1.6448195bde6ca-publish.gene-only.gff3 ``` ```{r, engine='bash', eval=TRUE} grep -c ">" ../data/Metacarcinus-magister-v1.0.a1.6448195bde6ca-publish.genes.fna ``` ```{r, engine='bash', eval=TRUE} tail ../data/Metacarcinus-magister-v1.0.a1.6448195bde6ca-publish.genes.fna ``` ## grab fasta ```{r, engine='bash'} /home/shared/bedtools2/bin/bedtools getfasta \ -fi ../data/Mmag_scaffold.fa \ -bed ../results/Metacarcinus-magister-v1.0.a1.6448195bde6ca-publish.gene-only.gff3 \ -fo ../results/GenSAS-gene.fa ``` ```{r, engine='bash', eval=TRUE} /home/shared/bedtools2/bin/bedtools getfasta \ -fi ../data/Mmag_scaffold.fa \ -bed ../results/Metacarcinus-magister-v1.0.a1.6448195bde6ca-publish.gene-only.gff3 \ | tail ``` ```{r, eval=TRUE} # Read in FastA file fasta_filegen <- "../results/GenSAS-gene.fa" sequences <- readDNAStringSet(fasta_filegen) # Calculate lengths lengths <- width(sequences) # Create a data frame df <- data.frame(length = lengths) # Plot histogram with ggplot2 ggplot(df, aes(x = length)) + geom_histogram(binwidth = 1000, fill = "blue", color = "black") + xlab("Sequence Length") + ylab("Frequency") + ggtitle("Histogram of Sequence Lengths") + theme_minimal() ``` # Genome ```{r, engine='bash', eval=TRUE} md5sum ../data/Mmag_scaffold.fa ``` ```{r, engine='bash', eval=TRUE} grep '>' ../data/Mmag_scaffold.fa | wc -l ``` ```{r, engine='bash', eval=TRUE} grep '>' ../data/Mmag_scaffold.fa | head ``` ```{r, engine='bash'} /home/shared/samtools-1.12/samtools faidx ../data/Mmag_scaffold.fa ```