---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
date: 2018-08-08 22:55:49+00:00
layout: post
slug: genome-annotation-olympia-oyster-genome-annotation-results-02
title: 'Genome Annotation – Olympia oyster genome annotation results #02'
categories:
  - 2018
  - Olympia Oyster Genome Sequencing
tags:
  - annotation
  - JetStream
  - maker
  - olympia oyster
  - Ostrea lurida
  - wq-maker
---

Yesterday, [I annotated our Olympia oyster genome using WQ-MAKER in just 7hrs!](https://robertslab.github.io/sams-notebook/posts/2018/2018-08-07-genome-annotation-olympia-oyster-genome-using-wq-maker-instance-on-jetstream/). See that link for the run setup and configuration. Both are essentially the same for this run, except for the change I'll discuss below.

The results from that run can be seen here:

* [Genome Annotation – Olympia oyster genome annotation results #01](https://robertslab.github.io/sams-notebook/posts/2018/2018-08-08-genome-annotation-olympia-oyster-genome-annotation-results-01/)

In that previous run, I neglected to provide a transposable elements (TE) FastA file for use with RepeatMasker. I remedied that and re-ran the annotation.

I modified [`maker_opts.ctl`](https://owl.fish.washington.edu/Athaliana/20180807_wqmaker_run_oly_02/maker_opts.ctl) to include the following:

    repeat_protein=../../opt/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner

This TE file is part of RepeatMasker.
* * *

# RESULTS

Output folder:

* [20180807_wqmaker_run_oly_02](https://owl.fish.washington.edu/Athaliana/20180807_wqmaker_run_oly_02/)

Annotated genome file (GFF):

* [20180807_wqmaker_run_oly_02/Olurida_v081.all.gff (1GB)](https://owl.fish.washington.edu/Athaliana/20180807_wqmaker_run_oly_02/Olurida_v081.all.gff)

* * *

![](https://owl.fish.washington.edu/Athaliana/20180807_wq-maker_06.png)

* * *

This run took about an hour longer than [the previous run](https://robertslab.github.io/sams-notebook/posts/2018/2018-08-07-genome-annotation-olympia-oyster-genome-complete-brief-note/), but for some reason it ran with only 21 workers instead of 22. This is probably the reason for the increased run time.

I'd like to post a snippet of the GFF file here, but the line lengths are WAY too long and would be virtually impossible to read in this notebook.

The GFF lists a "parent" contig and its corresponding info (start/stop/length). Below each parent are "children" of that contig: regions that matched one of the databases that were queried, e.g. repeatmasker annotations identifying repeat regions, protein2genome for full/partial protein matches, etc. Thus, a single scaffold (contig) can have dozens or hundreds of corresponding annotations!

Probably the easiest and most logical place to start will be those scaffolds that are annotated with a "protein_match", as these have a corresponding GenBank ID. Parsing these out and then doing a join with NCBI protein IDs will give us a basic annotation of "functional" portions of the genome.

Additionally, we should probably do some sort of comparison of this run with the previous run where I did _not_ provide the transposable elements FastA file.
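
As a starting point for pulling out the "protein_match" annotations mentioned above, here is a minimal Python sketch. It is not something I've run against the full file. It assumes the GFF is standard tab-delimited GFF3 (nine columns, feature type in column 3) and that the matched protein's identifier is stored in the `Name` attribute of column 9, which should be confirmed against the actual file. The function names `parse_attributes` and `protein_matches` are just illustrative.

```python
def parse_attributes(field):
    """Split a GFF3 attributes column (key=value;key=value) into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            # Split on the first "=" only, so values containing "=" survive
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def protein_matches(lines):
    """Yield (seqid, start, end, source, hit_name) for each protein_match row.

    `lines` can be any iterable of text lines, e.g. an open file handle.
    ASSUMPTION: the hit identifier lives in the `Name` attribute.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            # GFF3 allows an embedded FASTA section; no features follow it
            break
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "protein_match":
            continue
        attrs = parse_attributes(cols[8])
        yield cols[0], int(cols[3]), int(cols[4]), cols[1], attrs.get("Name", "")
```

Passing an open file handle for `Olurida_v081.all.gff` to `protein_matches()` reads the file one line at a time, which matters for a 1GB file. The resulting tuples could then be written out as a table and joined with NCBI protein IDs.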