---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
layout: post
title: Methylation Analysis - O.lurida Bismark on Mox
date: '2019-02-22 22:38'
tags:
  - bismark
  - Ostrea lurida
  - mox
  - Olympia oyster
categories:
  - 2019
  - Miscellaneous
---
This is a quick and dirty (i.e. no adaptor/quality trimming) assessment of all of our _Ostrea lurida_ bisulfite sequencing efforts to date in order to get a rough idea of the methylation mapping, per [this GitHub issue](https://github.com/RobertsLab/resources/issues/589). Ran Bismark on Mox on a series of subset of the reads:

- 100k
- 500k
- 1M
- 2M
- 5M
- 10M

<pre><code>
#!/bin/bash
## Job Name
#SBATCH --job-name=bismark
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190224_olurida_se_bismark

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log


# Directories and programs
wd=$(pwd)
bismark_dir="/gscratch/srlab/programs/Bismark-0.19.0"
bowtie2_dir="/gscratch/srlab/programs/bowtie2-2.3.4.1-linux-x86_64/"
samtools="/gscratch/srlab/programs/samtools-1.9/samtools"
reads_dir="/gscratch/scrubbed/samwhite/data/O_lurida/BSseq/"

## genomes

genome="/gscratch/srlab/sam/data/O_lurida/genomes/Olurida_v081/oly_v081_bismark_genome"

## An array of subsets of reads to use in bismark:
## 100k, 500k, 1M, 2M, 5M, 10M
subset_array=(100000 500000 1000000 2000000 5000000 10000000)

## FastQ Files
se_reads="/gscratch/scrubbed/samwhite/data/O_lurida/BSseq/olur_bsseq_all_R1.fastq.gz"

## FastQ files lists
R1_list="${wd}/olur_bsseq_all_R1.list"

# Check for existence of previous concatenation
# If they exist, delete them

for file in ${se_reads} ${R1_list}
do
  if [ -e ${file} ]; then
    rm ${file}
  fi
done

# Concatenate R1 reads and generate lists of FastQs
for fastq in ${reads_dir}*R1*.gz
do
  echo ${fastq} >> ${R1_list}
  cat ${fastq} >> ${se_reads}
done

# Run bismark using bisulftie-converted genome

for subset in ${subset_array[@]}
do
    mkdir subset_${subset}
    cd subset_${subset}
    ${bismark_dir}/bismark \
    --path_to_bowtie ${bowtie2_dir} \
    --genome ${genome} \
    --score_min L,0,-0.6 \
    -u ${subset} \
    -p 28 \
    ${se_reads} \
    2> subset_${subset}_summary.txt

    # Deduplicate bam files
    ${bismark_dir}/deduplicate_bismark \
    --bam \
    --single \
    *.bam

    # Methylation extraction

    ${bismark_dir}/bismark_methylation_extractor \
    --bedgraph \
    --counts \
    --scaffolds \
    --remove_spaces \
    --multicore 28 \
    --buffer_size 75% \
    *deduplicated.bam

    # Bismark processing report

    ${bismark_dir}/bismark2report

    #Bismark summary report

    ${bismark_dir}/bismark2summary

    # Sort files for methylkit and IGV

    find *deduplicated.bam \
    | xargs basename -s .bam \
    | xargs -I{} ${samtools} \
    sort \
    --threads 28 \
    {}.bam \
    -o {}.sorted.bam

    # Index sorted files for IGV
    # The "-@ 56" below specifies number of CPU threads to use.

    find *.sorted.bam \
    | xargs basename -s .sorted.bam \
    | xargs -I{} ${samtools} \
    index -@ 28 \
    {}.sorted.bam
    cd ${wd}
done

</code></pre>

---

# RESULTS

Output folder:

- [20190224_olurida_se_bismark/](http://gannet.fish.washington.edu/Atumefaciens/20190224_olurida_se_bismark/)

Bedgraphs for each subset:

- There is a bedgraph for each single-end subset. I will not link them all. Here's an example for the 10,000,000 subset:
  - [20190224_olurida_se_bismark/subset_10000000/olur_bsseq_all_R1_bismark_bt2.deduplicated.bedGraph.gz](http://gannet.fish.washington.edu/Atumefaciens/20190224_olurida_se_bismark/subset_10000000/olur_bsseq_all_R1_bismark_bt2.deduplicated.bedGraph.gz)

Code used to pull mapping efficiencies:

```
find ./*_olurida_* -name "subset*.txt" -print0 | xargs -0 grep "Mapping efficiency"
./20190224_olurida_se_bismark/subset_500000/subset_500000_summary.txt:Mapping efficiency:	54.4%
./20190224_olurida_se_bismark/subset_10000000/subset_10000000_summary.txt:Mapping efficiency:	53.8%
./20190224_olurida_se_bismark/subset_1000000/subset_1000000_summary.txt:Mapping efficiency:	54.5%
./20190224_olurida_se_bismark/subset_5000000/subset_5000000_summary.txt:Mapping efficiency:	54.3%
./20190224_olurida_se_bismark/subset_100000/subset_100000_summary.txt:Mapping efficiency:	54.2%
./20190224_olurida_se_bismark/subset_2000000/subset_2000000_summary.txt:Mapping efficiency:	54.4%

```

| Reads Subset | Mapping Efficiency (%) |
|--------------|------------------------|
| 100000       | 54.2                   |
| 500000       | 54.4                   |
| 1000000      | 54.5                   |
| 2000000      | 54,4                   |
| 5000000      | 54.3                   |
| 10000000     | 53.8                   |


```

```

| Read Type (PE or SE) | Reads Subset | C methylated in CpG context (%) |
|----------------------|--------------|---------------------------------|
| SE                   | 100000       | 30.1                            |
| SE                   | 500000       | 30.2                            |
| SE                   | 1000000      | 30.2                            |
| SE                   | 2000000      | 30.4                            |
| SE                   | 5000000      | 30.4                            |
| SE                   | 10000000     | 30.3                            |