--- author: Sam White toc-title: Contents toc-depth: 5 toc-location: left layout: post title: Data Analysis - C.virginica MBD Sequencing Coverage date: '2019-04-18 13:54' tags: - Crassostrea virginica - Eastern oyster - MBD - Bismark - jupyter categories: - 2019 - Miscellaneous --- A while ago, Steven tasked me with assessing some questions related to the sequencing coverage we get doing MBD-BSseq in this [GitHub issue](https://github.com/RobertsLab/resources/issues/609). At the heart of it all was really to try to get an idea of how much usable data we actually get when we do an MBD-BSseq project. Yaamini happened to have done an MBD-BSseq project relatively recently, and it's one she's actively working on analyzing, so we went with that data set. Data was initially trimmed: - [20180411 TrimGalore and FastQC](https://robertslab.github.io/sams-notebook/posts/2018/2018-04-11-trimgalorefastqcmultiqc-trim-10bp-53-ends-c-virginica-mbd-bs-seq-fastq-data/) Subsequently, the data was concatenated, subset, and aligned using Bismark: - [20190313 Bismark alignment subsetting](https://robertslab.github.io/sams-notebook/posts/2019/2019-03-13-Methylation-Analysis---C.virginica-Gonad-MBD-with-Varying-Read-Subsets-with-Bismark-on-Mox/) Today, I finally found the time to dedicate to evaluating alignment coverage of each of the Bismark sequence subsets. It was done in a Jupyter Notebook and solely with Bash/Python! I used this as project as an excuse to dive into using Python a bit more, instead of using R. For what I needed to accomplish, I just felt like this approach was simpler (instead of creating an R project and all of that). - [20190418_cvir_mbd_coverage_comparison.ipynb](https://github.com/RobertsLab/code/blob/master/notebooks/sam/20190418_cvir_mbd_coverage_comparison.ipynb) (GitHub) --- # RESULTS Plot of the various "Percent Methylation" coverages in each of the Bismark subsets. The lines/bars are redundant. I added the lines to provide a better visual guide to see how the percent methylation coverage changes from one Bismark subset to the next. A quick guide to X axis labeling: - half_avg_reads: Subset of 50% of the average number of read pairs. - avg_reads: Subset of the average number of read pairs. - half_total_reads: Subset of 50% of the total number of read pairs. - total_reads: All read pairs. ![Bar/line plot showing differences in percent CpG coverage of each Bismark subset](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20190418_cvir_mbd_cov_comparison.png?raw=true) --- ##### Summary Table | Bismark Subset | No. of Read pairs | Trimmed read length (bp) | No. of Bases (Gbp) | Genome size (Gbp) | Genome coverage (x fold) | Mean CpG Coverage | Percent CpG Coverage | Percent 5x CpG Coverage | Percent 10x CpG Coverage | |-------------------------------|-------------------|--------------------------|--------------------|-------------------|--------------------------|-------------------|----------------------|-------------------------|--------------------------| | total_reads_CpG_coverage | 275914272 | 80 | 22.07 | 0.684741 | 32.24 | 7.3 | 45.3 | 14.7 | 9.1 | | half_total_reads_CpG_coverage | 137957136 | 80 | 11.04 | 0.684741 | 16.12 | 3.5 | 33.5 | 9.2 | 5.7 | | avg_reads_CpG_coverage | 27591427 | 80 | 2.21 | 0.684741 | 3.22 | 0.7 | 15.3 | 3.2 | 1.8 | | half_avg_reads_CpG_coverage | 13795713 | 80 | 1.10 | 0.684741 | 1.61 | 0.5 | 11.2 | 2.3 | 1.2 | --- After looking at this a bit, it's surprising how little coverage we actually end up with. This is data for _all_ of the samples combined (n = 10), 100bp PE sequencing. This helps inform us on how much sequencing we actually need to do in order to achieve a target coverage. _However_, this analysis doesn't tell us anything about methylated CpG coverage (although, it could if I tweaked things a tiny bit). Yaamini has been working on that, so she should have those answers. Her analysis and my analysis will likely be used to contribute to our [BS-seq decision tree](https://github.com/RobertsLab/resources/issues/543).