--- title: "14.00-heatwave-genetics-trimmed-bwa-alignments" author: "Sam White" date: "2025-07-11" output: github_document: toc: true number_sections: true bookdown::html_document2: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true html_document: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true bibliography: references.bib --- # BACKGROUND This Rmd file will perform the following: 1. Create BWA genome index using [BWA](https://github.com/lh3/bwa) [@li2013], if needed. 2. Align trimmed FastQs (previously performed by Laura Spencer) to the *G.macrocepahlus* genome ([NCBI GCF_031168955.1_ASM3116895v1](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_031168955.1/)) using [BWA](https://github.com/lh3/bwa) [@li2013]. 3. Use [picard](https://github.com/broadinstitute/picard) [@Picard2019toolkit] to remove duplicates. 4. Use [bamUtil](https://github.com/statgen/bamUtil) [@jun2015] to remove overlaps. 5. Use [samtools](https://www.htslib.org/doc/samtools-depth.html) [@danecek2021] to compute depth. # INPUTS - Genome FastA - BWA genome FastA index # OUTPUTS - Deduplicated, sorted, clipped BAM files - samtools depth file - BAM index files ```{r setup, include=FALSE} library(knitr) knitr::opts_chunk$set( echo = TRUE, # Display code chunks eval = FALSE, # Evaluate code chunks warning = FALSE, # Hide warnings message = FALSE, # Hide messages comment = "" # Prevents appending '##' to beginning of lines in code output ) ``` # VARIABLES ```{r variables} # DIRECTORIES top_output_dir <- file.path("..", "output") output_dir <- file.path(top_output_dir, "14.00-heatwave-genetics-trimmed-bwa-alignments") data_dir <- file.path("..", "data") trimmed_reads_dir <- file.path(data_dir, "trimmed-fastqs-heatwave-genetics") # PROGRAMS programs_dir <- file.path("", "home", "shared") bamUtil_dir <- file.path(programs_dir, "bamUtil-1.0.15", "bin") bwa <- file.path(programs_dir, "bwa-0.7.17", "bwa") picard_dir <- file.path(programs_dir, "picard-3.4.0-6-g80ca98e2a", "build", "libs") R1_suffix <- "_trimmed_clipped_R1_paired.fq.gz" R2_suffix <- "_trimmed_clipped_R2_paired.fq.gz" samtools_dir <- file.path(programs_dir, "samtools-1.12") samtools <- file.path(samtools_dir, "samtools") # FILES genome_fasta <- file.path(data_dir, "GCF_031168955.1_ASM3116895v1_genomic.fna") bwa_index <- file.path(data_dir, "bwa-GCF_031168955.1_ASM3116895v1_genomic") # THREADS threads <- "40" # FORMATTING line <- "-----------------------------------------------" # Export these as environment variables for bash chunks. Sys.setenv( bamUtil_dir = bamUtil_dir, bwa = bwa, bwa_index = bwa_index, data_dir = data_dir, genome_fasta = genome_fasta, line = line, output_dir = output_dir, picard_dir = picard_dir, top_output_dir = top_output_dir, R1_suffix = R1_suffix, R2_suffix = R2_suffix, samtools = samtools, threads = threads, trimmed_reads_dir = trimmed_reads_dir ) ``` # ALIGNMENTS ## Create BWA Index ```{r bwa-index, engine='bash'} # Check if BWA index file exists if [[ ! -f "bwa-GCF_031168955.1_ASM3116895v1_genomic.bwt" ]]; then echo "BWA index file not found. Creating BWA index..." ${bwa} index -p ${data_dir}/${bwa_index} ${data_dir}/${genome_fasta} else echo "BWA index file already exists. Skipping index creation." fi ``` ## Check FastQs ```{r, check-fastqc, engine='bash'} ls -1 "${trimmed_reads_dir}/"*.fq.gz | wc -l ``` ## Align with BWA This step will produced sorted BAM files. Alignment is done with the `-M` option, which does the following: > Mark shorter split hits as secondary (for Picard compatibility). `samtools view` is run with the `-F 4`. This will exclude unmapped reads. See [https://broadinstitute.github.io/picard/explain-flags.html](https://broadinstitute.github.io/picard/explain-flags.html) and [https://www.htslib.org/doc/samtools-view.html](https://www.htslib.org/doc/samtools-view.html) for more explanation. ```{r, bwa-alignment, engine='bash'} # Make output directory, if it doesn't exist mkdir --parents ${output_dir} # Declare arrays R1_array=() R2_array=() # Populate arrays R1_array=("${trimmed_reads_dir}"/*"${R1_suffix}") R2_array=("${trimmed_reads_dir}"/*"${R2_suffix}") # Run BWA time \ for fastq in "${!R1_array[@]}" do # Remove path from FastQ sample_no_path=${R1_array[fastq]##*/} # Capture just samples name (string before first `_`) sample=${sample_no_path%%_*} echo "${fastq}" echo "${sample}" # Run BWA ${bwa} mem \ -M \ -t ${threads} \ ${bwa_index} \ ${R1_array[fastq]} \ ${R2_array[fastq]} \ 2> ${output_dir}/${sample}-bwa.stderr \ | samtools view -@ 20 -bS -F 4 \ | samtools sort -@ 20 -o ${output_dir}/${sample}.sorted.bam done ``` # REFERENCES