---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
layout: post
title: Data Wrangling - Visualization of C.bairdi NanoPore Sequencing Using NanoPlot on Mox
date: '2020-09-14 10:20'
tags:
  - Tanner crab
  - mox
  - nanoplot
  - Chionoecetes bairdi
  - Hematodinium
  - nanopore
  - gDNA
  - sequencing
categories:
  - 2020
  - Miscellaneous
---
I previously converted our _C.bairdi_ NanoPore sequencing data from the raw Fast5 format to FastQ format for our three sets of data:

- [C.bairdi-20102558-2729-Run-01](https://robertslab.github.io/sams-notebook/posts/2020/2020-09-04-Data-Wrangling---NanoPore-Fast5-Conversion-to-FastQ-of-C.bairdi-20102558-2729-Run-01-on-Mox-with-GPU-Node/)

- [C.bairdi-20102558-2729-Run-02](https://robertslab.github.io/sams-notebook/posts/2020/2020-09-04-Data-Wrangling---NanoPore-Fast5-Conversion-to-FastQ-of-C.bairdi-20102558-2729-Run-02-on-Mox-with-GPU-Node/)

- [C.bairdi-6129_403_26](https://robertslab.github.io/sams-notebook/posts/2020/2020-09-04-Data-Wrangling---NanoPore-Fast5-Conversion-to-FastQ-of-C.bairdi-6129_403_26-on-Mox-with-GPU-Node/)

Before proceeding with assembly and/or trying to tease out taxonomic differences (the C.bairdi-6129_403_26 sample is from an individual infected with _Hematodinium_), I want to get an idea of how the data looks. So, I've decided to process the `sequencing_summary.txt` file from each Fast5 conversion with [NanoPlot](https://github.com/wdecoster/NanoPlot). This software spits out some tables and some very nice visualizations to help get a better idea of how the sequencing runs look.

This was run on Mox.

SBATCH script (GitHub):

- [20200914_cbai_nanoplot_nanopore-data.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20200914_cbai_nanoplot_nanopore-data.sh)

```shell
#!/bin/bash
## Job Name
#SBATCH --job-name=cbai_nanoplot_nanopore-data
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200914_cbai_nanoplot_nanopore-data


###################################################################################
# These variables need to be set by user

# Load Anaconda
# Unknown why this is needed, but Anaconda will not run if this line is not included.
. "/gscratch/srlab/programs/anaconda3/etc/profile.d/conda.sh"

# Activate the NanoPlot Anaconda environment
conda activate nanoplot_env

# Set number of CPUs to use
threads=28

# Paths to reads
raw_reads_dir_array=(
"/gscratch/srlab/sam/data/C_bairdi/DNAseq/ont_FAL58500_04bb4d86_20102558-2729" \
"/gscratch/srlab/sam/data/C_bairdi/DNAseq/ont_FAL58500_94244ffd_20102558-2729" \
"/gscratch/srlab/sam/data/C_bairdi/DNAseq/ont_FAL86873_d8db260e_cbai_6129_403_26"
)

# Paths to programs
nanoplot=NanoPlot

###################################################################################

# Exit script if any command fails
set -e

# Capture this directory
wd=$(pwd)

# Initialize associative array (program name -> command),
# so that lookups by name, e.g. [nanoplot], are explicit
declare -A programs_array

# Programs array
programs_array=([nanoplot]="${nanoplot}")

# Loop through NanoPore data directories
# to run NanoPlot
for directory in "${!raw_reads_dir_array[@]}"
do
  # Capture NanoPore directory name
  dir_name=${raw_reads_dir_array[directory]##*/}

  # Make new directory and change to that directory
  mkdir "${dir_name}" && cd "$_"

  current_dir=$(pwd)

  # Run NanoPlot
  ## Sets readtype to 1D (default)
  ## Shows N50 on histograms
  ## Analysis performed using the sequencing summary file generated by
  ## guppy when converting from Fast5 to FastQ
  "${programs_array[nanoplot]}" \
  --threads "${threads}" \
  --outdir "${current_dir}" \
  --readtype 1D \
  --N50 \
  --summary "${raw_reads_dir_array[directory]}"/sequencing_summary.txt

  # Change back to working directory
  cd "${wd}"

done

# Capture program options
for program in "${!programs_array[@]}"
do
  {
  echo "Program options for ${programs_array[$program]}: "
  echo ""
  "${programs_array[$program]}" -h
  echo ""
  echo ""
  echo "----------------------------------------------"
  echo ""
  echo ""
  } &>> program_options.log || true
done

# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log
```

---

# RESULTS

Very fast, 50s:

![NanoPlot cumulative runtime for all three C.bairdi NanoPore data sets](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20200914_cbai_nanoplot_nanopore-data.png?raw=true)

Output folder:

- [20200914_cbai_nanoplot_nanopore-data/](https://gannet.fish.washington.edu/Atumefaciens/20200914_cbai_nanoplot_nanopore-data/)

Each of the NanoPore NanoPlot analyses is in its own folder, linked below. Although I've only linked to the full report (in HTML format) for each, the plots and data present in the full report are also available as individual images/files within those folders.

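Since the summaries below report a read length N50 for each run, here is a minimal sketch of how that value can be computed from a plain list of read lengths. This is only my illustration of the usual N50 definition (the read length at which the longest reads, summed in descending order, reach at least half of all bases); it is not taken from NanoPlot's code, and the `n50` function name and the toy lengths are made up for the example.

```shell
#!/bin/bash

# Hypothetical helper (not part of NanoPlot): compute N50 from read lengths.
# Expects one read length per line on stdin.
n50() {
  # Sort lengths from longest to shortest, then walk down the list
  # until the running total reaches half of all bases.
  sort -rn | awk '
    { len[NR] = $1; total += $1 }
    END {
      half = total / 2
      for (i = 1; i <= NR; i++) {
        running += len[i]
        if (running >= half) { print len[i]; exit }
      }
    }'
}

# Toy example with made-up read lengths (1,500 bases total, half is 750).
# 500 + 400 = 900 >= 750, so this prints 400.
printf '%s\n' 100 200 300 400 500 | n50
```

For real data, the read lengths would need to be pulled out of the `sequencing_summary.txt` file first. I'm assuming the relevant column is the template sequence length column; check the header line of the file before relying on that.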
##### C.bairdi-20102558-2729_Run-01 (FAL58500_94244ffd)

Output folder:

- [20200914_cbai_nanoplot_nanopore-data/ont_FAL58500_94244ffd_20102558-2729/](https://gannet.fish.washington.edu/Atumefaciens/20200914_cbai_nanoplot_nanopore-data/ont_FAL58500_94244ffd_20102558-2729/)

Full report (HTML):

- [20200914_cbai_nanoplot_nanopore-data/ont_FAL58500_94244ffd_20102558-2729/NanoPlot-report.html](https://gannet.fish.washington.edu/Atumefaciens/20200914_cbai_nanoplot_nanopore-data/ont_FAL58500_94244ffd_20102558-2729/NanoPlot-report.html)

| General summary     |              |
|---------------------|--------------|
| Active channels     | 491.0        |
| Mean read length    | 821.6        |
| Mean read quality   | 10.1         |
| Median read length  | 533.0        |
| Median read quality | 10.6         |
| Number of reads     | 78,720.0     |
| Read length N50     | 1,254.0      |
| Total bases         | 64,672,912.0 |

Number, percentage and megabases of reads above quality cutoffs:

| Quality_cutoff | reads(count) | percentage | bases(count) |
|----------------|--------------|------------|--------------|
| >Q5            | 73315        | (93.1%)    | 60.5Mb       |
| >Q7            | 67186        | (85.3%)    | 56.5Mb       |
| >Q10           | 46724        | (59.4%)    | 41.4Mb       |
| >Q12           | 20961        | (26.6%)    | 20.6Mb       |
| >Q15           | 966          | (1.2%)     | 0.8Mb        |

Top 5 highest mean basecall quality scores and their read lengths:

| Rank | mean_quality | read_length(bp) |
|------|--------------|-----------------|
| 1    | 22           | (682)           |
| 2    | 19.8         | (471)           |
| 3    | 19.1         | (414)           |
| 4    | 19.1         | (1539)          |
| 5    | 19           | (234)           |

Top 5 longest reads and their mean basecall quality score:

| Rank | read_length(bp) | mean_quality_score |
|------|-----------------|--------------------|
| 1    | 62629           | (3.3)              |
| 2    | 51173           | (3.5)              |
| 3    | 45909           | (3.0)              |
| 4    | 44689           | (3.4)              |
| 5    | 42633           | (3.0)              |

![FAL58500_94244ffd NanoPlot read density
plot](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20200914_cbai_nanoplot_nanopore-data_ont_FAL58500_94244ffd_20102558-2729_density-plot.png?raw=true)

---

##### C.bairdi-20102558-2729_Run-02 (FAL58500_04bb4d86)

Output folder:

- [20200914_cbai_nanoplot_nanopore-data/ont_FAL58500_04bb4d86_20102558-2729/](https://gannet.fish.washington.edu/Atumefaciens/20200914_cbai_nanoplot_nanopore-data/ont_FAL58500_04bb4d86_20102558-2729/)

Full report (HTML):

- [20200914_cbai_nanoplot_nanopore-data/ont_FAL58500_04bb4d86_20102558-2729/NanoPlot-report.html](https://gannet.fish.washington.edu/Atumefaciens/20200914_cbai_nanoplot_nanopore-data/ont_FAL58500_04bb4d86_20102558-2729/NanoPlot-report.html)

| General summary     |              |
|---------------------|--------------|
| Active channels     | 424.0        |
| Mean read length    | 845.5        |
| Mean read quality   | 8.7          |
| Median read length  | 521.0        |
| Median read quality | 9.3          |
| Number of reads     | 21,519.0     |
| Read length N50     | 1,406.0      |
| Total bases         | 18,195,027.0 |

Number, percentage and megabases of reads above quality cutoffs:

| Quality_cutoff | reads(count) | percentage | bases(count) |
|----------------|--------------|------------|--------------|
| >Q5            | 17207        | (80.0%)    | 15.4Mb       |
| >Q7            | 14864        | (69.1%)    | 13.9Mb       |
| >Q10           | 8797         | (40.9%)    | 8.9Mb        |
| >Q12           | 3257         | (15.1%)    | 3.6Mb        |
| >Q15           | 121          | (0.6%)     | 0.1Mb        |

Top 5 highest mean basecall quality scores and their read lengths:

| Rank | mean_quality | read_length(bp) |
|------|--------------|-----------------|
| 1    | 18.3         | (323)           |
| 2    | 17.8         | (1064)          |
| 3    | 17.4         | (201)           |
| 4    | 17.3         | (761)           |
| 5    | 16.9         | (642)           |

Top 5 longest reads and their mean basecall quality score:

| Rank | read_length(bp) | mean_quality_score |
|------|-----------------|--------------------|
| 1    | 31170           | (3.4)              |
| 2    | 28158           | (3.2)              |
| 3    | 27016           | (3.3)              |
| 4    | 24304           | (3.6)              |
| 5    | 15552           | (2.9)              |

![FAL58500_04bb4d86 NanoPlot read density
plot](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20200914_cbai_nanoplot_nanopore-data_ont_FAL58500_04bb4d86_20102558-2729_density-plot.png?raw=true)

---

##### C.bairdi-6129_403_26 (FAL86873_d8db260e)

Output folder:

- [20200914_cbai_nanoplot_nanopore-data/ont_FAL86873_d8db260e_cbai_6129_403_26/](https://gannet.fish.washington.edu/Atumefaciens/20200914_cbai_nanoplot_nanopore-data/ont_FAL86873_d8db260e_cbai_6129_403_26/)

Full report (HTML):

- [20200914_cbai_nanoplot_nanopore-data/ont_FAL86873_d8db260e_cbai_6129_403_26/NanoPlot-report.html](https://gannet.fish.washington.edu/Atumefaciens/20200914_cbai_nanoplot_nanopore-data/ont_FAL86873_d8db260e_cbai_6129_403_26/NanoPlot-report.html)

| General summary     |            |
|---------------------|------------|
| Active channels     | 503        |
| Mean read length    | 2254.5     |
| Mean read quality   | 10.9       |
| Median read length  | 965        |
| Median read quality | 11.2       |
| Number of reads     | 506495     |
| Read length N50     | 5232       |
| Total bases         | 1141890358 |

Number, percentage and megabases of reads above quality cutoffs:

| Quality_cutoff | reads(count) | percentage | bases(count) |
|----------------|--------------|------------|--------------|
| >Q5            | 493598       | (97.5%)    | 1124.9Mb     |
| >Q7            | 465033       | (91.8%)    | 1075.5Mb     |
| >Q10           | 343296       | (67.8%)    | 880.9Mb      |
| >Q12           | 186709       | (36.9%)    | 568.7Mb      |
| >Q15           | 7394         | (1.5%)     | 17.9Mb       |

Top 5 highest mean basecall quality scores and their read lengths:

| Rank | mean_quality | read_length(bp) |
|------|--------------|-----------------|
| 1    | 20.6         | (980)           |
| 2    | 20.2         | (1228)          |
| 3    | 19.7         | (902)           |
| 4    | 19.6         | (466)           |
| 5    | 19.6         | (757)           |

Top 5 longest reads and their mean basecall quality score:

| Rank | read_length(bp) | mean_quality_score |
|------|-----------------|--------------------|
| 1    | 58854           | (11.8)             |
| 2    | 57315           | (10.8)             |
| 3    | 51351           | (11.5)             |
| 4    | 51326           | (13.7)             |
| 5    | 49825           | (12.9)             |

![FAL86873_d8db260e NanoPlot read density
plot](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20200914_cbai_nanoplot_nanopore-data_ont_FAL86873_d8db260e_cbai_6129_403_26_density-plot.png?raw=true)

---

First, we should compare the two C.bairdi-20102558-2729 runs (degraded input DNA, no _Hematodinium_ infection), as these were the same sample, run on the same flowcell.

Basically, the second run on the flowcell performs worse in nearly every metric: fewer reads, lower quality scores, and shorter median and maximum read lengths (mean read length and N50 are marginally higher in the second run). This is despite the fact that this flowcell was run for a nearly 4.5x longer period of time (72hrs vs. 16hrs). This isn't terribly surprising, as [the second run started with only 414 available pores](https://robertslab.github.io/sams-notebook/posts/2020/2020-01-09-NanoPore-Sequencing---C.bairdi-gDNA-Sample-20102558-2729/), which is well below the ONT-suggested minimum of 800 pores for a "good" flowcell, but it is interesting to see the actual impacts on sequencing that this has.

The C.bairdi-6129_403_26 run (high-quality input DNA, with _Hematodinium_ infection) is a better example of what to expect when using non-degraded DNA. Every metric is _substantially_ better than the C.bairdi-20102558-2729 runs.

Now, how to proceed? First, I think I'll just get an assembly made, utilizing all three sets of data, despite the presence of _Hematodinium_ in the C.bairdi-6129_403_26 run. I'll tackle the assembly using [Flye](https://github.com/fenderglass/Flye), as it seems straightforward, is specifically designed for long read assembly, and has built-in assembly "polishing". This will be a start.

With regard to teasing out _Hematodinium_ sequences, I'll probably go through the DIAMOND BLASTx -> MEGAN6 process that I've used throughout our _C.bairdi_ RNAseq project for transcriptome assemblies. MEGAN6 can handle long reads and makes taxonomic read extraction straightforward.
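As a footnote to the comparison of the two C.bairdi-20102558-2729 runs above, the difference is easier to appreciate as reads per hour of run time. The sketch below is just arithmetic on numbers already in this post: the read counts come from the summary tables, and I'm assuming the 16hr figure belongs to Run-01 and the 72hr figure to Run-02, as the text implies. The `reads_per_hour` function name is made up for the example.

```shell
#!/bin/bash

# Hypothetical helper: reads generated per hour of sequencing run time.
# Usage: reads_per_hour <number_of_reads> <run_time_hours>
reads_per_hour() {
  local reads="$1"
  local hours="$2"
  awk -v r="${reads}" -v h="${hours}" 'BEGIN { printf "%.1f\n", r / h }'
}

# Run-01: 78,720 reads, assumed 16hr run
reads_per_hour 78720 16   # prints 4920.0

# Run-02: 21,519 reads, assumed 72hr run
reads_per_hour 21519 72   # prints 298.9
```

Under those assumptions, the second run produced reads at roughly one sixteenth the rate of the first.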