--- title: "02.10-bismark-deduplication" author: "Sam White" date: "2024-11-09" output: bookdown::html_document2: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true github_document: toc: true number_sections: true html_document: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true bibliography: references.bib --- # Intro ## Inputs This notebook is a continuation of using [Bismark](https://felixkrueger.github.io/Bismark/) [@krueger2011] to perform methylation analysis. Read alignments were run in [02.00-bismark-bowtie2-alignment.Rmd](./02.00-bismark-bowtie2-alignment.Rmd). This step will perform [deduplication](https://felixkrueger.github.io/Bismark/bismark/deduplication/) (Bismark manual) of the BAM files generated by alignment. Results will be summarized by [MultiQC](https://github.com/MultiQC/MultiQC) [@ewels2016]. Expects input BAM filenames formatted like (e.g.): - `CF03-CM03-Zygote_R1_001.fastp-trim_bismark_bt2_pe.bam` ## Outputs - `*_dedup.sorted.bam`: Deduplicated, sorted BAMs. - `*_dedup.sorted.bam.bai`: Deduplicated, sorted BAM index file. Useful for IGV. - `*.deduplicated.bam`: Deduplicated BAMs. These are _not_ sorted. - `*.deduplication_report.txt`: Individual summary reports for each deduplication process. - `multiqc_report.html`: A summary report of the alignment results generated by [MultiQC](https://github.com/MultiQC/MultiQC), in HTML format. Due to the large file sizes of BAMS, these cannot be hosted in the [ceasmallr GitHub repo](https://github.com/sr320/ceasmallr). As such these files are available for download here: - [https://gannet.fish.washington.edu/Atumefaciens/gitrepos/ceasmallr/output/02.10-bismark-deduplication/](https://gannet.fish.washington.edu/Atumefaciens/gitrepos/ceasmallr/output/02.10-bismark-deduplication/) ```{r setup, include=FALSE} library(knitr) knitr::opts_chunk$set( echo = TRUE, # Display code chunks eval = FALSE, # Evaluate code chunks warning = FALSE, # Hide warnings message = FALSE, # Hide messages comment = "" # Prevents appending '##' to beginning of lines in code output ) ``` # Create a Bash variables file This allows usage of Bash variables across R Markdown chunks. ```{r save-bash-variables-to-rvars-file, engine='bash', eval=TRUE} { echo "#### Assign Variables ####" echo "" echo "# Data directories" echo 'export repo_dir=/home/shared/8TB_HDD_01/sam/gitrepos/ceasmallr' echo 'export output_dir_top=${repo_dir}/output/02.10-bismark-deduplication' echo 'export alignment_dir="${repo_dir}/output/02.01-bismark-bowtie2-alignment-SLURM-array"' echo "" echo "# Paths to programs" echo 'export programs_dir="/home/shared"' echo 'export bismark_dir="${programs_dir}/Bismark-0.24.0"' echo 'export bowtie2_dir="${programs_dir}/bowtie2-2.4.4-linux-x86_64"' echo 'export multiqc="/home/sam/programs/mambaforge/bin/multiqc"' echo 'export samtools_dir="${programs_dir}/samtools-1.12"' echo "" echo "# Set number of CPUs to use" echo 'export threads=40' echo "" echo "# Filename patterns" echo "export deduped_filename_pattern='_R1_001.fastp-trim*bismark_bt2_pe.deduplicated.bam'" echo "# Print formatting" echo 'export line="--------------------------------------------------------"' echo "" } > .bashvars cat .bashvars ``` # Deduplication ```{bash deduplication, engine='bash', eval=FALSE} # Load bash variables into memory source .bashvars # Make output directory, if it doesn't exist mkdir --parents "${output_dir_top}" cd "${alignment_dir}" for bam in *_bismark_bt2_pe.bam do # Set stderr filename stderr_name=$(echo "${bam}" | awk -F"." '{print $1"."$2".deduplication.stderr"}') # Run deduplication ${bismark_dir}/deduplicate_bismark \ --samtools_path ${samtools_dir} \ --paired \ --bam \ --output_dir ${output_dir_top} \ ${bam} \ > ${output_dir_top}/${stderr_name} 2>&1 done ``` # Sorting ```{bash sorting, engine='bash', eval=FALSE} # Load bash variables into memory source .bashvars cd "${output_dir_top}" for bam in *deduplicated.bam do sample=$(echo ${bam} | awk -F"_" '{print $1}') ${samtools_dir}/samtools sort \ -@ "${threads}" \ -o ${sample}_dedup.sorted.bam \ ${bam} done ``` ## Check file counts ```{bash check-sorted-bam-file-counts, engine='bash', eval=FALSE} # Load bash variables into memory source .bashvars cd "${output_dir_top}" ls -1 *deduplicated.bam | wc -l ls -1 *dedup.sorted.bam | wc -l ``` # Indexing for Downstream Applications ```{r indexing, engine='bash', eval=FALSE} # Load bash variables into memory source .bashvars cd "${output_dir_top}" find *dedup.sorted.bam \ | xargs basename -s _dedup.sorted.bam \ | xargs -I{} ${samtools_dir}/samtools \ index {}_dedup.sorted.bam ``` # Checksums ```{r generate-checksums, engine='bash', eval=FALSE} # Load bash variables into memory source .bashvars cd "${output_dir_top}" for file in *; do if [ "${file}" != "checksums.md5" ]; then md5sum "${file}" >> checksums.md5 fi done ``` # MultiQC ```{r multiqc, engine='bash', eval=FALSE} # Load bash variables into memory source .bashvars cd "${output_dir_top}" ${multiqc} . ``` # List output files ```{r list-outputs, engine='bash', eval=TRUE} # Load bash variables into memory source .bashvars cd "${output_dir_top}" ls -lh ```