--- title: "02.00-bismark-bowtie2-alignment" author: "Sam White" date: "2024-10-23" output: bookdown::html_document2: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true github_document: toc: true number_sections: true html_document: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true bibliography: references.bib --- # Intro This notebook will align trimmed/repaired FastQs generated by [`00.00-trimming-fastp.Rmd`](./00.00-trimming-fastp.Rmd) to the *C.viginica* genome using [Bismark](https://felixkrueger.github.io/Bismark/) [@krueger2011] and [Bowtie2](https://github.com/BenLangmead/bowtie2) [@langmead2018; @langmead2012]. Alignment results will be summarized by [MultiQC](https://github.com/MultiQC/MultiQC) [@ewels2016]. Alignments were run with the Bowtie score setting of `L,0,-0.6`. NOTE: This process will take ~2 weeks to complete, when run on a computer with 48 CPUs! NOTE: This notebook was not used for this project. It was superseded by [`02.01-bismark-bowtie2-alignment-SLURM-array.sh`](./02.01-bismark-bowtie2-alignment-SLURM-array.sh), which was executed as part of an array of nodes on the Univ. of Washington HPC. However, the code has been left here to use for those who do not have access to such resources. ## Inputs Expects trimmed/repaired FastQs generated by [`00.00-trimming-fastp.Rmd`](./00.00-trimming-fastp.Rmd), with the following format: - `*fastp-trim*.fq.gz` ## Outputs The expected outputs will be: - `*_R1_001.fastp-trim_bismark_bt2_PE_report.txt`: A text file summarizing the alignment input and results. Despite the `R1` naming, these reports are based on *paired reads*; the `R1` naming is a quirk of [Bismark](https://felixkrueger.github.io/Bismark/). - `*_R1_001.fastp-trim_bismark_bt2_pe.bam`:A BAM alignment. Despite the `R1` naming, these BAMs are *paired reads*; the `R1` naming is a quirk of [Bismark](https://felixkrueger.github.io/Bismark/). - `bismark_summary.txt`: An overall summary report of the alignment process. Essentially, this is all of the individual `*report.txt` files combined into a single file. - `multiqc_report.html`: A summary report of the alignment results generated by [MultiQC](https://github.com/MultiQC/MultiQC), in HTML format. Due to the large file sizes of BAMS, these cannot be hosted in the [ceasmallr GitHub repo](https://github.com/sr320/ceasmallr). As such these files are available for download here: - [https://gannet.fish.washington.edu/gitrepos/ceasmallr/](https://gannet.fish.washington.edu/gitrepos/ceasmallr/) ```{r setup, include=FALSE} library(knitr) knitr::opts_chunk$set( echo = TRUE, # Display code chunks eval = FALSE, # Evaluate code chunks warning = FALSE, # Hide warnings message = FALSE, # Hide messages comment = "" # Prevents appending '##' to beginning of lines in code output ) ``` # Create a Bash variables file This allows usage of Bash variables across R Markdown chunks. ```{r save-bash-variables-to-rvars-file, engine='bash', eval=TRUE} { echo "#### Assign Variables ####" echo "" echo "# Data directories" echo 'export repo_dir=/home/shared/8TB_HDD_01/sam/gitrepos/ceasmallr' echo 'export output_dir_top=${repo_dir}/output/02.00-bismark-bowtie2-alignment' echo 'export trimmed_reads_url="https://gannet.fish.washington.edu/Atumefaciens/20220826-cvir-larvae_zygote-BSseq-fastp_trimming/"' echo 'export trimmed_fastqs_dir="${repo_dir}/output/00.00-trimming-fastp"' echo "" echo "# Input files" echo 'export bisulfite_genome_url="http://owl.fish.washington.edu/halfshell/genomic-databank/Cvirginica_v300_bisulfite.tar.gz"' echo 'export bisulfite_genome_gz="Cvirginica_v300_bisulfite.tar.gz"' echo 'export bisulfite_genome_dir="${repo_dir}/data/Cvirginica_v300"' echo "" echo "# Paths to programs" echo 'export programs_dir="/home/shared"' echo 'export bismark_dir="${programs_dir}/Bismark-0.24.0"' echo 'export bowtie2_dir="${programs_dir}/bowtie2-2.4.4-linux-x86_64"' echo 'export multiqc="/home/sam/programs/mambaforge/bin/multiqc"' echo 'export samtools_dir="${programs_dir}/samtools-1.12"' echo "" echo "# Program settings" echo 'export bowtie2_min_score="L,0,-0.6"' echo "" echo "# Set FastQ filename patterns" echo "export fastq_pattern='*.fq.gz'" echo "export R1_fastq_pattern='*_R1_*.fq.gz'" echo "export R2_fastq_pattern='*_R2_*.fq.gz'" echo "export trimmed_fastq_pattern='*fastp-trim*.fq.gz'" echo "" echo "# Set number of CPUs to use" echo 'export threads=10' echo "# Bismark already spawns multiple instances and additional threads are multiplicative." echo 'export bismark_threads=8' echo "" echo "## Inititalize arrays" echo 'export fastq_array_R1=()' echo 'export fastq_array_R2=()' echo 'export trimmed_fastqs_array=()' echo 'export R1_names_array=()' echo 'export R2_names_array=()' echo "" echo "# Print formatting" echo 'export line="--------------------------------------------------------"' echo "" } > .bashvars cat .bashvars ``` # Download bisulfite genome ```{bash download-bisulfite-genome, engine='bash', eval=FALSE} # Load bash variables into memory source .bashvars cd "${repo_dir}"/data wget --quiet \ --continue \ "${bisulfite_genome_url}" ls -ltrh ``` ## Unpack bisulfite genome ```{bash unpack-bisulfite-genome, engine='bash', eval=FALSE} # Load bash variables into memory source .bashvars cd "${repo_dir}"/data if [ ! -d "${bisulfite_genome_dir}" ]; then tar -xzf "${bisulfite_genome_gz}" fi tree "${bisulfite_genome_dir}" ``` # Alignment ## Find read pairs This step is not needed and was only used as a "quick fix" measure to generate a file for use by [`02.01-bismark-bowtie2-alignment-SLURM-array.sh`](./02.01-bismark-bowtie2-alignment-SLURM-array.sh). ```{r find-read-pairs, engine='bash', eval=TRUE} # Load bash variables into memory source .bashvars cd "${trimmed_fastqs_dir}" if [[ -f "${output_dir_top}"/fastq_pairs.txt ]]; then rm "${output_dir_top}"/fastq_pairs.txt fi # Find all _R1_ files and match them with their corresponding _R2_ files for R1_file in *_R1_*.fq.gz; do R2_file="${R1_file/_R1_/_R2_}" if [[ -f "$R2_file" ]]; then echo "$R1_file $R2_file" >> "${output_dir_top}"/fastq_pairs.txt else echo "Warning: No matching R2 file for $R1_file" fi done cat "${output_dir_top}"/fastq_pairs.txt ``` ## Perform alignments ```{bash alignment, engine='bash', eval=FALSE} # Load bash variables into memory source .bashvars cd "${trimmed_fastqs_dir}" find ${trimmed_fastqs_dir}/${R1_fastq_pattern} \ | xargs basename -s "${R1_reads_basename}" \ | xargs -I {} ${bismark_dir}/bismark \ --path_to_bowtie2 ${bowtie2_dir} \ --genome ${bisulfite_genome_dir} \ --score_min "${bowtie2_min_score}" \ --parallel "${bismark_threads}" \ --non_directional \ --samtools_path "${samtools_dir}" \ --gzip \ -p "${threads}" \ -1 ${trimmed_fastqs_dir}/{}"${R1_reads_basename}" \ -2 ${trimmed_fastqs_dir}/{}"${R2_reads_basename}" \ --output_dir "${output_dir_top}" \ 2> "${output_dir_top}"/bismark_summary.txt ``` ## List outputs ```{r list-outputs, engine='bash', eval=FALSE} # Load bash variables into memory source .bashvars cd "${output_dir_top}" ls -lh ``` # MultiQC ```{r multiqc, engine='bash', eval=FALSE} # Load bash variables into memory source .bashvars cd "${output_dir_top}" ${multiqc} . ``` # Checksums ```{r generate-checksums, engine='bash', eval=FALSE} # Load bash variables into memory source .bashvars cd "${output_dir_top}" for file in *; do if [ "${file}" != "checksums.md5" ]; then md5sum "${file}" >> checksums.md5 fi done ``` # Citations