--- title: "02.00-bismark-bowtie2-alignment" author: "Sam White" date: "2024-10-23" output: bookdown::html_document2: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true github_document: toc: true number_sections: true html_document: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true bibliography: references.bib --- # Intro ## Inputs This notebook will download trimmed WGBS FastQs (trimmed on [20220829](https://robertslab.github.io/sams-notebook/posts/2022/2022-08-29-FastQ-Trimming-and-QC---C.virginica-Larval-BS-seq-Data-from-Lotterhos-Lab-and-Part-of-CEABIGR-Project-Using-fastp-on-Mox/index.html) (Sam's Notebook entry)) and align them to the *C.viginica* genome using [Bismark](https://felixkrueger.github.io/Bismark/) [@krueger2011] and [Bowtie2](https://github.com/BenLangmead/bowtie2) [@langmead2018; @langmead2012]. Alignment results will be summarized by [MultiQC](https://github.com/MultiQC/MultiQC) [@ewels2016]. ## Outputs The expected outputs will be: - `*_R1_001.fastp-trim-20220827_bismark_bt2_PE_report.txt`: A text file summarizing the alignment input and results. Despite the `R1` naming, these reports are based on *paired reads*; the `R1` naming is a quirk of [Bismark](https://felixkrueger.github.io/Bismark/). - `*_R1_001.fastp-trim-20220827_bismark_bt2_pe.bam`:A BAM alignment. Despite the `R1` naming, these BAMs are *paired reads*; the `R1` naming is a quirk of [Bismark](https://felixkrueger.github.io/Bismark/). - `bismark_summary.txt`: An overall summary report of the alignment process. Essentially, this is all of the individual `*report.txt` files combined into a single file. - `multiqc_report.html`: A summary report of the alignment results generated by [MultiQC](https://github.com/MultiQC/MultiQC), in HTML format. Due to the large file sizes of BAMS, these cannot be hosted in the [ceasmallr GitHub repo](https://github.com/sr320/ceasmallr). As such these files are available for download here: - [https://gannet.fish.washington.edu/Atumefaciens/gitrepos/ceasmallr/output/02.00-bismark-bowtie2-alignment/](https://gannet.fish.washington.edu/Atumefaciens/gitrepos/ceasmallr/output/02.00-bismark-bowtie2-alignment/) ```{r setup, include=FALSE} library(knitr) knitr::opts_chunk$set( echo = TRUE, # Display code chunks eval = FALSE, # Evaluate code chunks warning = FALSE, # Hide warnings message = FALSE, # Hide messages comment = "" # Prevents appending '##' to beginning of lines in code output ) ``` # Create a Bash variables file This allows usage of Bash variables across R Markdown chunks. ```{r save-bash-variables-to-rvars-file, engine='bash', eval=TRUE} { echo "#### Assign Variables ####" echo "" echo "# Data directories" echo 'export repo_dir=/home/shared/8TB_HDD_01/sam/gitrepos/ceasmallr' echo 'export output_dir_top=${repo_dir}/output/02.00-bismark-bowtie2-alignment' echo 'export trimmed_reads_url="https://gannet.fish.washington.edu/Atumefaciens/20220826-cvir-larvae_zygote-BSseq-fastp_trimming/"' echo 'export trimmed_fastqs_dir="${repo_dir}/output/00.00-trimming-fastp"' echo "" echo "# Input files" echo 'export bisulfite_genome_url="http://owl.fish.washington.edu/halfshell/genomic-databank/Cvirginica_v300_bisulfite.tar.gz"' echo 'export bisulfite_genome_gz="Cvirginica_v300_bisulfite.tar.gz"' echo 'export bisulfite_genome_dir="${repo_dir}/data/Cvirginica_v300"' echo "" echo "# Paths to programs" echo 'export programs_dir="/home/shared"' echo 'export bismark_dir="${programs_dir}/Bismark-0.24.0"' echo 'export bowtie2_dir="${programs_dir}/bowtie2-2.4.4-linux-x86_64"' echo 'export multiqc="/home/sam/programs/mambaforge/bin/multiqc"' echo 'export samtools_dir="${programs_dir}/samtools-1.12"' echo "" echo "# Program settings" echo 'export bowtie2_min_score="L,0,-0.6"' echo "" echo "# Set FastQ filename patterns" echo "export fastq_pattern='*.fq.gz'" echo "export R1_fastq_pattern='*_R1_*.fq.gz'" echo "export R2_fastq_pattern='*_R2_*.fq.gz'" echo "export trimmed_fastq_pattern='*fastp-trim*.fq.gz'" echo "" echo "# Set number of CPUs to use" echo 'export threads=4' echo "# Bismark already spawns multiple instances and additional threads are multiplicative." echo 'export bismark_threads=2' echo "" echo "## Inititalize arrays" echo 'export fastq_array_R1=()' echo 'export fastq_array_R2=()' echo 'export trimmed_fastqs_array=()' echo 'export R1_names_array=()' echo 'export R2_names_array=()' echo "" echo "# Print formatting" echo 'export line="--------------------------------------------------------"' echo "" } > .bashvars cat .bashvars ``` # Download bisulfite genome ```{bash download-bisulfite-genome, engine='bash', eval=TRUE} # Load bash variables into memory source .bashvars cd "${repo_dir}"/data wget --quiet \ --continue \ "${bisulfite_genome_url}" ls -ltrh ``` ## Unpack bisulfite genome ```{bash unpack-bisulfite-genome, engine='bash', eval=TRUE} # Load bash variables into memory source .bashvars cd "${repo_dir}"/data if [ ! -d "${bisulfite_genome_dir}" ]; then tar -xzf "${bisulfite_genome_gz}" fi tree "${bisulfite_genome_dir}" ``` # Download trimmed RNA-seq reads Reads are downloaded from The `--cut-dirs 2` command cuts the preceding directory structure (i.e. `Atumefaciens/20220826-cvir-larvae_zygote-BSseq-fastp_trimming/`) so that we just end up with the reads. ```{bash download-trimmed-reads, engine='bash', eval=TRUE} # Load bash variables into memory source .bashvars # Make output directory if it doesn't exist mkdir --parents ${trimmed_fastqs_dir} # Run wget to retrieve FastQs and MD5 files wget \ --directory-prefix ${trimmed_fastqs_dir} \ --no-check-certificate \ --continue \ --cut-dirs 2 \ --no-host-directories \ --no-parent \ --quiet \ --accept="${trimmed_fastq_pattern}, trimmed-fastq-checksums.md5" \ ${trimmed_reads_url} ls -lh "${trimmed_fastqs_dir}" ``` ## Verify trimmed read checksums ```{bash verify-trimmed-read-checksums, engine='bash', eval=TRUE} # Load bash variables into memory source .bashvars cd "${trimmed_fastqs_dir}" # Checksums file contains other files, so this just looks for the sRNAseq files. for file in *.md5 do md5sum --check "${file}" done ``` # Alignment ## Find read pairs ```{r find-read-pairs, engine='bash', eval=TRUE} # Load bash variables into memory source .bashvars cd "${trimmed_fastqs_dir}" if [[ -f "${output_dir_top}"/fastq_pairs.txt ]]; then rm "${output_dir_top}"/fastq_pairs.txt fi # Find all _R1_ files and match them with their corresponding _R2_ files for R1_file in *_R1_*.fq.gz; do R2_file="${R1_file/_R1_/_R2_}" if [[ -f "$R2_file" ]]; then echo "$R1_file $R2_file" >> "${output_dir_top}"/fastq_pairs.txt else echo "Warning: No matching R2 file for $R1_file" fi done cat "${output_dir_top}"/fastq_pairs.txt ``` ```{bash alignment, engine='bash', eval=FALSE} # Load bash variables into memory source .bashvars cd "${trimmed_fastqs_dir}" find ${trimmed_fastqs_dir}/${R1_fastq_pattern} \ | xargs basename -s "${R1_reads_basename}" \ | xargs -I {} ${bismark_dir}/bismark \ --path_to_bowtie2 ${bowtie2_dir} \ --genome ${bisulfite_genome_dir} \ --score_min "${bowtie2_min_score}" \ --parallel "${bismark_threads}" \ --non_directional \ --samtools_path "${samtools_dir}" \ --gzip \ -p "${threads}" \ -1 ${trimmed_fastqs_dir}/{}"${R1_reads_basename}" \ -2 ${trimmed_fastqs_dir}/{}"${R2_reads_basename}" \ --output_dir "${output_dir_top}" \ 2> "${output_dir_top}"/bismark_summary.txt ``` ## List outputs ```{r list-outputs, engine='bash', eval=TRUE} # Load bash variables into memory source .bashvars cd "${output_dir_top}" ls -lh ``` # MultiQC ```{r multiqc, engine='bash', eval=FALSE} # Load bash variables into memory source .bashvars cd "${output_dir_top}" ${multiqc} . ``` # Checksums ```{r generate-checksums, engine='bash', eval=FALSE} # Load bash variables into memory source .bashvars cd "${output_dir_top}" for file in *; do if [ "${file}" != "checksums.md5" ]; then md5sum "${file}" >> checksums.md5 fi done ``` # Citations