---
title: "02-alignment"
author: "Kathleen Durkin"
date: "2025-04-29"
always_allow_html: true
output: 
  github_document:
    toc: true
    toc_depth: 3
    number_sections: true
    html_preview: true 
  bookdown::html_document2:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
---

I'll be performing alignment of WGBS reads to the *C. virginica* genome using `bismark`. Sam has already generated scripts for this purpose as part of the [`ceasmallr`](https://github.com/sr320/ceasmallr/tree/main_) project, but I'll be rerunning alignment for the `FISH 546` bioinformatics class (SP 2025).

Download trimmed reads
```{r, engine='bash', eval=FALSE}
# Run wget to retrieve FastQs and MD5 files
# Note: the --no-clobber command will skip re-downloading any files that are already present in the output directory
wget \
--directory-prefix ../data/trimmed-WGBS-reads \
--recursive \
--no-check-certificate \
--continue \
--cut-dirs 4 \
--no-host-directories \
--no-parent \
--quiet \
--no-clobber \
--accept="*.fq.gz,checksums.md5,multiqc_report.html,fastq_pairs.txt" https://gannet.fish.washington.edu/gitrepos/ceasmallr/output/00.00-trimming-fastp/

```

We should have 64 trimmed reads (32 samples, R1 and R2 for each sample)
```{r, engine='bash'}
echo "How many trimmed reads files were downloaded?"
ls ../data/trimmed-WGBS-reads/*.fq.gz | wc -l
```

Download C.virginica genome
```{r, engine='bash', eval=FALSE}
wget \
--directory-prefix ../data/ \
--recursive \
--no-check-certificate \
--continue \
--cut-dirs 3 \
--no-host-directories \
--no-parent \
--quiet \
--no-clobber \
 https://gannet.fish.washington.edu/gitrepos/ceasmallr/data/Cvirginica_v300/

```


Download scripts files
```{r, engine='bash', eval=FALSE}
curl -L -O https://raw.githubusercontent.com/sr320/ceasmallr/refs/heads/main/code/02.01-bismark-bowtie2-alignment-SLURM-array.sh

curl -L -O https://raw.githubusercontent.com/sr320/ceasmallr/refs/heads/main/code/02.02-bismark-SLURM-job.sh
```

Modify scripts as needed.

Changes I made:

.sh file:

- changed repo directory and output directory


.job file:

- changed the notification email and working directory from Sam's info to mine

- changed final line, which provides file path to the .sh script


Ran job script by running the below line from Klone terminal:
```
sbatch --array=0-$(($(wc -l < ../data/trimmed-WGBS-reads/fastq_pairs.txt) - 1)) 02.02-bismark-SLURM-job.sh
```

Finished running after ~26 hours, check some of the outputs. I'm a little concerned because the email I recieved from Hyak notifying me of completion said `Slurm Array Summary Job_id=25714300_* (25714300) Name=bismark_job_array Failed, Mixed, ExitCode [0-2]`

```{r, engine='bash'}
cd ../output/02.01-bismark-bowtie2-alignment-SLURM-array/

echo "Number of error files (should have 32):"
ls *.err | wc -l

echo "Number of .out files (should have 32):"
ls *.out | wc -l

echo "Number of bismark summary files (should have 32):"
ls *bismark_summary.txt | wc -l

find . -type f -name "*.out" -size -100c

```
There are 33 error and .out files because I mistakenly removed a +1 in the job command, so it started at the wrong index. That means the jobid_0 files are basically duds. So we actually have 32 of each, as expected.

There are only 28 bismark summary files, instead of the expected 32. Excluding the dud jobid_0.out, there are 4 .out files that are very small (<100 B): ./bismark_job_25714300_12.out
./bismark_job_25714300_13.out
./bismark_job_25714300_9.out
./bismark_job_25714300_10.out

Each of these files just contains the following:

```
Contents of pair:


Files  and  have already been processed. Exiting.

```

So... for each of these sub jobs, the read pair is interpreted as having already been processed. Let's look at all the -bismark_summary.txt files (I think there should only be one of these for each pair of processed reads)

These files seem to be named in the format:

SampleID-jobSubID-bismark_summary.txt

```{r, engine='bash'}
cd ../output/02.01-bismark-bowtie2-alignment-SLURM-array/
  
ls *bismark_summary.txt | sort
```

Yikes. Ok it looks like, not only are many reads not run appropriately through Bismark, many of the reads were processed twice (e.g. `CF03-CM03-Zygote-1-bismark_summary.txt`, `CF03-CM03-Zygote-5-bismark_summary.txt`). That means only ~half of the reads were processed, not 28/32. 


It also looks like 

```{r, engine='bash'}
cd ../output/02.01-bismark-bowtie2-alignment-SLURM-array/

echo "How many output .bam files?"
ls *.bam | wc -l

echo "How large are the .bam files?"
ls -lh *.bam
```

There are only 18 output bam files, and 6 of those are empty (size 0)


Since I'm having trouble with Sam's script, I'm going to try Steven's simpler script. It also has