---
title: "02.00-bismark-bowtie2-alignment"
author: "Sam White"
date: "2024-10-23"
output: 
  bookdown::html_document2:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
  github_document:
    toc: true
    number_sections: true
  html_document:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
bibliography: references.bib
---
# Intro

## Inputs
This notebook will download trimmed WGBS FastQs (trimmed on [20220829](https://robertslab.github.io/sams-notebook/posts/2022/2022-08-29-FastQ-Trimming-and-QC---C.virginica-Larval-BS-seq-Data-from-Lotterhos-Lab-and-Part-of-CEABIGR-Project-Using-fastp-on-Mox/index.html) (Sam's Notebook entry)) and align them to the *C.viginica* genome using [Bismark](https://felixkrueger.github.io/Bismark/) [@krueger2011] and [Bowtie2](https://github.com/BenLangmead/bowtie2) [@langmead2018; @langmead2012]. Alignment results will be summarized by [MultiQC](https://github.com/MultiQC/MultiQC) [@ewels2016].

## Outputs

The expected outputs will be:

-   `*_R1_001.fastp-trim-20220827_bismark_bt2_PE_report.txt`: A text file summarizing the alignment input and results. Despite the `R1` naming, these reports are based on *paired reads*; the `R1` naming is a quirk of [Bismark](https://felixkrueger.github.io/Bismark/).

-   `*_R1_001.fastp-trim-20220827_bismark_bt2_pe.bam`:A BAM alignment. Despite the `R1` naming, these BAMs are *paired reads*; the `R1` naming is a quirk of [Bismark](https://felixkrueger.github.io/Bismark/).

-   `bismark_summary.txt`: An overall summary report of the alignment process. Essentially, this is all of the individual `*report.txt` files combined into a single file.

-   `multiqc_report.html`: A summary report of the alignment results generated by [MultiQC](https://github.com/MultiQC/MultiQC), in HTML format.

Due to the large file sizes of BAMS, these cannot be hosted in the [ceasmallr GitHub repo](https://github.com/sr320/ceasmallr). As such these files are available for download here:

- [https://gannet.fish.washington.edu/Atumefaciens/gitrepos/ceasmallr/output/02.00-bismark-bowtie2-alignment/](https://gannet.fish.washington.edu/Atumefaciens/gitrepos/ceasmallr/output/02.00-bismark-bowtie2-alignment/)

```{r setup, include=FALSE}
library(knitr)
knitr::opts_chunk$set(
  echo = TRUE,         # Display code chunks
  eval = FALSE,        # Evaluate code chunks
  warning = FALSE,     # Hide warnings
  message = FALSE,     # Hide messages
  comment = ""         # Prevents appending '##' to beginning of lines in code output
)
```

# Create a Bash variables file

This allows usage of Bash variables across R Markdown chunks.

```{r save-bash-variables-to-rvars-file, engine='bash', eval=TRUE}
{
echo "#### Assign Variables ####"
echo ""

echo "# Data directories"
echo 'export repo_dir=/home/shared/8TB_HDD_01/sam/gitrepos/ceasmallr'
echo 'export output_dir_top=${repo_dir}/output/02.00-bismark-bowtie2-alignment'
echo 'export trimmed_reads_url="https://gannet.fish.washington.edu/Atumefaciens/20220826-cvir-larvae_zygote-BSseq-fastp_trimming/"'
echo 'export trimmed_fastqs_dir="${repo_dir}/output/00.00-trimming-fastp"'
echo ""

echo "# Input files"
echo 'export bisulfite_genome_url="http://owl.fish.washington.edu/halfshell/genomic-databank/Cvirginica_v300_bisulfite.tar.gz"'
echo 'export bisulfite_genome_gz="Cvirginica_v300_bisulfite.tar.gz"'
echo 'export bisulfite_genome_dir="${repo_dir}/data/Cvirginica_v300"'
echo ""

echo "# Paths to programs"
echo 'export programs_dir="/home/shared"'
echo 'export bismark_dir="${programs_dir}/Bismark-0.24.0"'
echo 'export bowtie2_dir="${programs_dir}/bowtie2-2.4.4-linux-x86_64"'
echo 'export multiqc="/home/sam/programs/mambaforge/bin/multiqc"'
echo 'export samtools_dir="${programs_dir}/samtools-1.12"'
echo ""

echo "# Program settings"
echo 'export bowtie2_min_score="L,0,-0.6"'
echo ""

echo "# Set FastQ filename patterns"
echo "export fastq_pattern='*.fq.gz'"
echo "export R1_fastq_pattern='*_R1_*.fq.gz'"
echo "export R2_fastq_pattern='*_R2_*.fq.gz'"

echo "export trimmed_fastq_pattern='*fastp-trim*.fq.gz'"
echo ""

echo "# Set number of CPUs to use"
echo 'export threads=4'
echo "# Bismark already spawns multiple instances and additional threads are multiplicative."
echo 'export bismark_threads=2' 
echo ""


echo "## Inititalize arrays"
echo 'export fastq_array_R1=()'
echo 'export fastq_array_R2=()'
echo 'export trimmed_fastqs_array=()'
echo 'export R1_names_array=()'
echo 'export R2_names_array=()'
echo ""

echo "# Print formatting"
echo 'export line="--------------------------------------------------------"'
echo ""
} > .bashvars

cat .bashvars
```

# Download bisulfite genome

```{bash download-bisulfite-genome, engine='bash', eval=TRUE}
# Load bash variables into memory
source .bashvars

cd "${repo_dir}"/data

wget --quiet \
--continue \
"${bisulfite_genome_url}"

ls -ltrh
```

## Unpack bisulfite genome

```{bash unpack-bisulfite-genome, engine='bash', eval=TRUE}
# Load bash variables into memory
source .bashvars

cd "${repo_dir}"/data

if [ ! -d "${bisulfite_genome_dir}" ]; then
  tar -xzf "${bisulfite_genome_gz}"
fi

tree "${bisulfite_genome_dir}"
```

# Download trimmed RNA-seq reads

Reads are downloaded from <https://gannet.fish.washington.edu/Atumefaciens/20220826-cvir-larvae_zygote-BSseq-fastp_trimming/>

The `--cut-dirs 2` command cuts the preceding directory structure (i.e. `Atumefaciens/20220826-cvir-larvae_zygote-BSseq-fastp_trimming/`) so that we just end up with the reads.

```{bash download-trimmed-reads, engine='bash', eval=TRUE}
# Load bash variables into memory
source .bashvars

# Make output directory if it doesn't exist
mkdir --parents ${trimmed_fastqs_dir}

# Run wget to retrieve FastQs and MD5 files
wget \
--directory-prefix ${trimmed_fastqs_dir} \
--no-check-certificate \
--continue \
--cut-dirs 2 \
--no-host-directories \
--no-parent \
--quiet \
--accept="${trimmed_fastq_pattern}, trimmed-fastq-checksums.md5" \
${trimmed_reads_url}

ls -lh "${trimmed_fastqs_dir}"
```

## Verify trimmed read checksums

```{bash verify-trimmed-read-checksums, engine='bash', eval=TRUE}
# Load bash variables into memory
source .bashvars

cd "${trimmed_fastqs_dir}"

# Checksums file contains other files, so this just looks for the sRNAseq files.
for file in *.md5
do
  md5sum --check "${file}"
done
```

# Alignment

## Find read pairs


```{r find-read-pairs, engine='bash', eval=TRUE}
# Load bash variables into memory
source .bashvars


cd "${trimmed_fastqs_dir}"

if [[ -f "${output_dir_top}"/fastq_pairs.txt ]]; then
  rm "${output_dir_top}"/fastq_pairs.txt
fi

# Find all _R1_ files and match them with their corresponding _R2_ files
for R1_file in *_R1_*.fq.gz; do
    R2_file="${R1_file/_R1_/_R2_}"
    if [[ -f "$R2_file" ]]; then
        echo "$R1_file $R2_file" >> "${output_dir_top}"/fastq_pairs.txt
    else
        echo "Warning: No matching R2 file for $R1_file"
    fi
done


cat "${output_dir_top}"/fastq_pairs.txt


```


```{bash alignment, engine='bash', eval=FALSE}
# Load bash variables into memory
source .bashvars

cd "${trimmed_fastqs_dir}"

find ${trimmed_fastqs_dir}/${R1_fastq_pattern} \
| xargs basename -s "${R1_reads_basename}" \
| xargs -I {} ${bismark_dir}/bismark \
--path_to_bowtie2 ${bowtie2_dir} \
--genome ${bisulfite_genome_dir} \
--score_min "${bowtie2_min_score}" \
--parallel "${bismark_threads}" \
--non_directional \
--samtools_path "${samtools_dir}" \
--gzip \
-p "${threads}" \
-1 ${trimmed_fastqs_dir}/{}"${R1_reads_basename}" \
-2 ${trimmed_fastqs_dir}/{}"${R2_reads_basename}" \
--output_dir "${output_dir_top}" \
2> "${output_dir_top}"/bismark_summary.txt

```

## List outputs

```{r list-outputs, engine='bash', eval=TRUE}
# Load bash variables into memory
source .bashvars

cd "${output_dir_top}"

ls -lh
```

# MultiQC

```{r multiqc, engine='bash', eval=FALSE}
# Load bash variables into memory
source .bashvars

cd "${output_dir_top}"

${multiqc} .
```

# Checksums

```{r generate-checksums, engine='bash', eval=FALSE}
# Load bash variables into memory
source .bashvars

cd "${output_dir_top}"

for file in *; do
  if [ "${file}" != "checksums.md5" ]; then
    md5sum "${file}" >> checksums.md5
  fi
done
```

# Citations