---
title: "02.00-bismark-bowtie2-alignment"
author: "Sam White"
date: "2024-10-23"
output: 
  bookdown::html_document2:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
  github_document:
    toc: true
    number_sections: true
  html_document:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
bibliography: references.bib
---
# Intro

This notebook will align trimmed/repaired FastQs generated by [`00.00-trimming-fastp.Rmd`](./00.00-trimming-fastp.Rmd) to the *C.viginica* genome using [Bismark](https://felixkrueger.github.io/Bismark/) [@krueger2011] and [Bowtie2](https://github.com/BenLangmead/bowtie2) [@langmead2018; @langmead2012]. Alignment results will be summarized by [MultiQC](https://github.com/MultiQC/MultiQC) [@ewels2016].

Alignments were run with the Bowtie score setting of `L,0,-0.6`.

NOTE: This process will take ~2 weeks to complete, when run on a computer with 48 CPUs!

NOTE: This notebook was not used for this project. It was superseded by [`02.01-bismark-bowtie2-alignment-SLURM-array.sh`](./02.01-bismark-bowtie2-alignment-SLURM-array.sh), which was executed as part of an array of nodes on the Univ. of Washington HPC. However, the code has been left here to use for those who do not have access to such resources.

## Inputs

Expects trimmed/repaired FastQs generated by [`00.00-trimming-fastp.Rmd`](./00.00-trimming-fastp.Rmd), with the following format:

- `*fastp-trim*.fq.gz`

## Outputs

The expected outputs will be:

-   `*_R1_001.fastp-trim_bismark_bt2_PE_report.txt`: A text file summarizing the alignment input and results. Despite the `R1` naming, these reports are based on *paired reads*; the `R1` naming is a quirk of [Bismark](https://felixkrueger.github.io/Bismark/).

-   `*_R1_001.fastp-trim_bismark_bt2_pe.bam`:A BAM alignment. Despite the `R1` naming, these BAMs are *paired reads*; the `R1` naming is a quirk of [Bismark](https://felixkrueger.github.io/Bismark/).

-   `bismark_summary.txt`: An overall summary report of the alignment process. Essentially, this is all of the individual `*report.txt` files combined into a single file.

-   `multiqc_report.html`: A summary report of the alignment results generated by [MultiQC](https://github.com/MultiQC/MultiQC), in HTML format.

Due to the large file sizes of BAMS, these cannot be hosted in the [ceasmallr GitHub repo](https://github.com/sr320/ceasmallr). As such these files are available for download here:

- [https://gannet.fish.washington.edu/gitrepos/ceasmallr/](https://gannet.fish.washington.edu/gitrepos/ceasmallr/)

```{r setup, include=FALSE}
library(knitr)
knitr::opts_chunk$set(
  echo = TRUE,         # Display code chunks
  eval = FALSE,        # Evaluate code chunks
  warning = FALSE,     # Hide warnings
  message = FALSE,     # Hide messages
  comment = ""         # Prevents appending '##' to beginning of lines in code output
)
```

# Create a Bash variables file

This allows usage of Bash variables across R Markdown chunks.

```{r save-bash-variables-to-rvars-file, engine='bash', eval=TRUE}
{
echo "#### Assign Variables ####"
echo ""

echo "# Data directories"
echo 'export repo_dir=/home/shared/8TB_HDD_01/sam/gitrepos/ceasmallr'
echo 'export output_dir_top=${repo_dir}/output/02.00-bismark-bowtie2-alignment'
echo 'export trimmed_reads_url="https://gannet.fish.washington.edu/Atumefaciens/20220826-cvir-larvae_zygote-BSseq-fastp_trimming/"'
echo 'export trimmed_fastqs_dir="${repo_dir}/output/00.00-trimming-fastp"'
echo ""

echo "# Input files"
echo 'export bisulfite_genome_url="http://owl.fish.washington.edu/halfshell/genomic-databank/Cvirginica_v300_bisulfite.tar.gz"'
echo 'export bisulfite_genome_gz="Cvirginica_v300_bisulfite.tar.gz"'
echo 'export bisulfite_genome_dir="${repo_dir}/data/Cvirginica_v300"'
echo ""

echo "# Paths to programs"
echo 'export programs_dir="/home/shared"'
echo 'export bismark_dir="${programs_dir}/Bismark-0.24.0"'
echo 'export bowtie2_dir="${programs_dir}/bowtie2-2.4.4-linux-x86_64"'
echo 'export multiqc="/home/sam/programs/mambaforge/bin/multiqc"'
echo 'export samtools_dir="${programs_dir}/samtools-1.12"'
echo ""

echo "# Program settings"
echo 'export bowtie2_min_score="L,0,-0.6"'
echo ""

echo "# Set FastQ filename patterns"
echo "export fastq_pattern='*.fq.gz'"
echo "export R1_fastq_pattern='*_R1_*.fq.gz'"
echo "export R2_fastq_pattern='*_R2_*.fq.gz'"

echo "export trimmed_fastq_pattern='*fastp-trim*.fq.gz'"
echo ""

echo "# Set number of CPUs to use"
echo 'export threads=10'
echo "# Bismark already spawns multiple instances and additional threads are multiplicative."
echo 'export bismark_threads=8' 
echo ""


echo "## Inititalize arrays"
echo 'export fastq_array_R1=()'
echo 'export fastq_array_R2=()'
echo 'export trimmed_fastqs_array=()'
echo 'export R1_names_array=()'
echo 'export R2_names_array=()'
echo ""

echo "# Print formatting"
echo 'export line="--------------------------------------------------------"'
echo ""
} > .bashvars

cat .bashvars
```

# Download bisulfite genome

```{bash download-bisulfite-genome, engine='bash', eval=FALSE}
# Load bash variables into memory
source .bashvars

cd "${repo_dir}"/data

wget --quiet \
--continue \
"${bisulfite_genome_url}"

ls -ltrh
```

## Unpack bisulfite genome

```{bash unpack-bisulfite-genome, engine='bash', eval=FALSE}
# Load bash variables into memory
source .bashvars

cd "${repo_dir}"/data

if [ ! -d "${bisulfite_genome_dir}" ]; then
  tar -xzf "${bisulfite_genome_gz}"
fi

tree "${bisulfite_genome_dir}"
```


# Alignment

## Find read pairs

This step is not needed and was only used as a "quick fix" measure to generate a file for use by [`02.01-bismark-bowtie2-alignment-SLURM-array.sh`](./02.01-bismark-bowtie2-alignment-SLURM-array.sh).

```{r find-read-pairs, engine='bash', eval=TRUE}
# Load bash variables into memory
source .bashvars


cd "${trimmed_fastqs_dir}"

if [[ -f "${output_dir_top}"/fastq_pairs.txt ]]; then
  rm "${output_dir_top}"/fastq_pairs.txt
fi

# Find all _R1_ files and match them with their corresponding _R2_ files
for R1_file in *_R1_*.fq.gz; do
    R2_file="${R1_file/_R1_/_R2_}"
    if [[ -f "$R2_file" ]]; then
        echo "$R1_file $R2_file" >> "${output_dir_top}"/fastq_pairs.txt
    else
        echo "Warning: No matching R2 file for $R1_file"
    fi
done


cat "${output_dir_top}"/fastq_pairs.txt


```


## Perform alignments

```{bash alignment, engine='bash', eval=FALSE}
# Load bash variables into memory
source .bashvars

cd "${trimmed_fastqs_dir}"

find ${trimmed_fastqs_dir}/${R1_fastq_pattern} \
| xargs basename -s "${R1_reads_basename}" \
| xargs -I {} ${bismark_dir}/bismark \
--path_to_bowtie2 ${bowtie2_dir} \
--genome ${bisulfite_genome_dir} \
--score_min "${bowtie2_min_score}" \
--parallel "${bismark_threads}" \
--non_directional \
--samtools_path "${samtools_dir}" \
--gzip \
-p "${threads}" \
-1 ${trimmed_fastqs_dir}/{}"${R1_reads_basename}" \
-2 ${trimmed_fastqs_dir}/{}"${R2_reads_basename}" \
--output_dir "${output_dir_top}" \
2> "${output_dir_top}"/bismark_summary.txt

```

## List outputs

```{r list-outputs, engine='bash', eval=FALSE}
# Load bash variables into memory
source .bashvars

cd "${output_dir_top}"

ls -lh
```

# MultiQC

```{r multiqc, engine='bash', eval=FALSE}
# Load bash variables into memory
source .bashvars

cd "${output_dir_top}"

${multiqc} .
```

# Checksums

```{r generate-checksums, engine='bash', eval=FALSE}
# Load bash variables into memory
source .bashvars

cd "${output_dir_top}"

for file in *; do
  if [ "${file}" != "checksums.md5" ]; then
    md5sum "${file}" >> checksums.md5
  fi
done
```

# Citations