---
title: "14.00-heatwave-genetics-trimmed-bwa-alignments"
author: "Sam White"
date: "2025-07-11"
output: 
  github_document:
    toc: true
    number_sections: true
  bookdown::html_document2:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
  html_document:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
bibliography: references.bib
---

# BACKGROUND

This Rmd file will perform the following:

1. Create BWA genome index using [BWA](https://github.com/lh3/bwa) [@li2013], if needed.

2.  Align trimmed FastQs (previously performed by Laura Spencer) to the *G.macrocepahlus* genome ([NCBI GCF_031168955.1_ASM3116895v1](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_031168955.1/)) using [BWA](https://github.com/lh3/bwa) [@li2013].

3.  Use [picard](https://github.com/broadinstitute/picard) [@Picard2019toolkit] to remove duplicates.

4.  Use [bamUtil](https://github.com/statgen/bamUtil) [@jun2015] to remove overlaps.

5.  Use [samtools](https://www.htslib.org/doc/samtools-depth.html) [@danecek2021] to compute depth.


# INPUTS

- Genome FastA
- BWA genome FastA index

# OUTPUTS

- Deduplicated, sorted, clipped BAM files
- samtools depth file
- BAM index files

```{r setup, include=FALSE}
library(knitr)
knitr::opts_chunk$set(
  echo = TRUE,         # Display code chunks
  eval = FALSE,        # Evaluate code chunks
  warning = FALSE,     # Hide warnings
  message = FALSE,     # Hide messages
  comment = ""         # Prevents appending '##' to beginning of lines in code output
)
```

# VARIABLES

```{r variables}
# DIRECTORIES
top_output_dir <- file.path("..", "output")
output_dir <- file.path(top_output_dir, "14.00-heatwave-genetics-trimmed-bwa-alignments")
data_dir <- file.path("..", "data")
trimmed_reads_dir <- file.path(data_dir, "trimmed-fastqs-heatwave-genetics")

# PROGRAMS
programs_dir <- file.path("", "home", "shared")
bamUtil_dir <- file.path(programs_dir, "bamUtil-1.0.15", "bin")
bwa <- file.path(programs_dir, "bwa-0.7.17", "bwa")
picard_dir <- file.path(programs_dir, "picard-3.4.0-6-g80ca98e2a", "build", "libs")
R1_suffix <- "_trimmed_clipped_R1_paired.fq.gz"
R2_suffix <- "_trimmed_clipped_R2_paired.fq.gz"
samtools_dir <- file.path(programs_dir, "samtools-1.12")
samtools <- file.path(samtools_dir, "samtools")


# FILES
genome_fasta <- file.path(data_dir, "GCF_031168955.1_ASM3116895v1_genomic.fna")
bwa_index <- file.path(data_dir, "bwa-GCF_031168955.1_ASM3116895v1_genomic")

# THREADS
threads <- "40"


# FORMATTING
line <- "-----------------------------------------------"

# Export these as environment variables for bash chunks.
Sys.setenv(
  bamUtil_dir = bamUtil_dir,
  bwa = bwa,
  bwa_index = bwa_index,
  data_dir = data_dir,
  genome_fasta = genome_fasta,
  line = line,
  output_dir = output_dir,
  picard_dir = picard_dir,
  top_output_dir = top_output_dir,
  R1_suffix = R1_suffix,
  R2_suffix = R2_suffix,
  samtools = samtools,
  threads = threads,
  trimmed_reads_dir = trimmed_reads_dir
)
```

# ALIGNMENTS

## Create BWA Index
```{r bwa-index, engine='bash'}
# Check if BWA index file exists
if [[ ! -f "bwa-GCF_031168955.1_ASM3116895v1_genomic.bwt" ]]; then
    echo "BWA index file not found. Creating BWA index..."
    ${bwa} index -p ${data_dir}/${bwa_index} ${data_dir}/${genome_fasta}
else
    echo "BWA index file already exists. Skipping index creation."
fi
```

## Check FastQs
```{r, check-fastqc, engine='bash'}

ls -1 "${trimmed_reads_dir}/"*.fq.gz | wc -l
```

## Align with BWA

This step will produced sorted BAM files.

Alignment is done with the `-M` option, which does the following:

> Mark shorter split hits as secondary (for Picard compatibility).

`samtools view` is run with the `-F 4`. This will exclude unmapped reads.

See [https://broadinstitute.github.io/picard/explain-flags.html](https://broadinstitute.github.io/picard/explain-flags.html) and [https://www.htslib.org/doc/samtools-view.html](https://www.htslib.org/doc/samtools-view.html) for more explanation.


```{r, bwa-alignment, engine='bash'}
# Make output directory, if it doesn't exist
mkdir --parents ${output_dir}

# Declare arrays
R1_array=()
R2_array=()

# Populate arrays
R1_array=("${trimmed_reads_dir}"/*"${R1_suffix}")
R2_array=("${trimmed_reads_dir}"/*"${R2_suffix}")

# Run BWA
time \
for fastq in "${!R1_array[@]}"
do
  # Remove path from FastQ
  sample_no_path=${R1_array[fastq]##*/}
  
  # Capture just samples name (string before first `_`)
  sample=${sample_no_path%%_*}
  
  echo "${fastq}"
  echo "${sample}"
  
  # Run BWA
  ${bwa} mem \
  -M \
  -t ${threads} \
  ${bwa_index} \
  ${R1_array[fastq]} \
  ${R2_array[fastq]} \
  2> ${output_dir}/${sample}-bwa.stderr \
  | samtools view -@ 20 -bS -F 4 \
  | samtools sort -@ 20 -o ${output_dir}/${sample}.sorted.bam
done
```

# REFERENCES