--- author: Sam White toc-title: Contents toc-depth: 5 toc-location: left date: 2017-10-18 14:07:09+00:00 layout: post slug: genome-assembly-olympia-oyster-pacbio-canu-v1-6 title: Genome Assembly - Olympia oyster PacBio Canu v1.6 categories: - 2017 - Olympia Oyster Genome Sequencing tags: - canu - jupyter notebook - olympia oyster - Ostrea lurida - PacBio --- I decided to run Canu myself, since documentation for [Sean's Canu run](https://genefish.wordpress.com/2017/06/19/seans-notebook-canu-run-finished/) is a bit lacking. Additionally, it looks like he specified a genome size of 500Mbp, which is probably too small. For this assembly, I set the genome size to 1.9Gbp (based on the info in the [BGI assembly report, using 17-mers for calculating genome size](https://github.com/RobertsLab/project-olympia.oyster-genomic/blob/master/docs/20160512_F15FTSUSAT0327_genome_survey.pdf)), which is probably on the large size. Additionally, I remembered we had [an old PacBio run that we had been forgetting about](https://robertslab.github.io/sams-notebook/posts/2017/2017-10-09-data-management-convert-oly-pacbio-h5-to-fastq/) and thought it would be nice to have incorporated into an assembly. See all the messy details of this in the Jupyter Notebook below, but here's the core info about this Canu assembly. PacBio Input files (available on [Owl/nightingales/O_lurida](https://owl.fish.washington.edu/nightingales/O_lurida/):

m170308_163922_42134_c101174252550000001823269408211742_s1_p0_filtered_subreads.fastq.gz                                                               m170308_230815_42134_c101174252550000001823269408211743_s1_p0_filtered_subreads.fastq.gz
    m130619_081336_42134_c100525122550000001823081109281326_s1_p0.fastq                       m170315_001112_42134_c101169372550000001823273008151717_s1_p0_filtered_subreads.fastq.gz
    m170211_224036_42134_c101073082550000001823236402101737_s1_X0_filtered_subreads.fastq.gz  m170315_063041_42134_c101169382550000001823273008151700_s1_p0_filtered_subreads.fastq.gz
    m170301_100013_42134_c101174162550000001823269408211761_s1_p0_filtered_subreads.fastq.gz  m170315_124938_42134_c101169382550000001823273008151701_s1_p0_filtered_subreads.fastq.gz
    m170301_162825_42134_c101174162550000001823269408211762_s1_p0_filtered_subreads.fastq.gz  m170315_190851_42134_c101169382550000001823273008151702_s1_p0_filtered_subreads.fastq.gz
    m170301_225711_42134_c101174162550000001823269408211763_s1_p0_filtered_subreads.fastq.gz

Canu execution command (see the Jupyter Notebook below for more info):

$time canu \
    useGrid=false \
    -p 20171009_oly_pacbio \
    -d /home/data/20171018_oly_pacbio_canu/ \
    genomeSize=1.9g \
    correctedErrorRate=0.075 \
    -pacbio-raw m*

Results: Well, this took a LONG time to run; a bit over two days! The report file contains some interesting tidbits. For instance: * Unitgigging calculates only 1.84x coverage * Trimming removed >5 billion (!!) bases: `867850 reads 5755379456 bases (reads with no overlaps, deleted)` * Unitigging unassembled: `unassembled: 479693 sequences, total length 2277137864 bp` I'll compare this Canu assembly against [Sean's Canu assembly](https://owl.fish.washington.edu/scaphapoda/Sean/Oly_Canu_Output/oly_pacbio_.contigs.fasta) and see how things look. Report file (text file): [https://owl.fish.washington.edu/Athaliana/20171018_oly_pacbio_canu/20171018_oly_pacbio.report](http://owl.fish.washington.edu/Athaliana/20171018_oly_pacbio_canu/20171018_oly_pacbio.report) Contigs Assembly (FASTA): [https://owl.fish.washington.edu/Athaliana/20171018_oly_pacbio_canu/20171018_oly_pacbio.contigs.fasta](http://owl.fish.washington.edu/Athaliana/20171018_oly_pacbio_canu/20171018_oly_pacbio.contigs.fasta) Complete Canu output directory: [https://owl.fish.washington.edu/Athaliana/20171018_oly_pacbio_canu/](http://owl.fish.washington.edu/Athaliana/20171018_oly_pacbio_canu/) Jupyter Notebook (GitHub): [20171018_docker_oly_canu.ipynb](https://github.com/sr320/LabDocs/blob/master/jupyter_nbs/sam/20171018_docker_oly_canu.ipynb)