02-alignment ================ Kathleen Durkin 2025-04-29 I’ll be performing alignment of WGBS reads to the *C. virginica* genome using `bismark`. Sam has already generated scripts for this purpose as part of the [`ceasmallr`](https://github.com/sr320/ceasmallr/tree/main_) project, but I’ll be rerunning alignment for the `FISH 546` bioinformatics class (SP 2025). Download trimmed reads ``` bash # Run wget to retrieve FastQs and MD5 files # Note: the --no-clobber command will skip re-downloading any files that are already present in the output directory wget \ --directory-prefix ../data/trimmed-WGBS-reads \ --recursive \ --no-check-certificate \ --continue \ --cut-dirs 4 \ --no-host-directories \ --no-parent \ --quiet \ --no-clobber \ --accept="*.fq.gz,checksums.md5,multiqc_report.html,fastq_pairs.txt" https://gannet.fish.washington.edu/gitrepos/ceasmallr/output/00.00-trimming-fastp/ ``` We should have 64 trimmed reads (32 samples, R1 and R2 for each sample) ``` bash echo "How many trimmed reads files were downloaded?" ls ../data/trimmed-WGBS-reads/*.fq.gz | wc -l ``` ## How many trimmed reads files were downloaded? ## 64 Download C.virginica genome ``` bash wget \ --directory-prefix ../data/ \ --recursive \ --no-check-certificate \ --continue \ --cut-dirs 3 \ --no-host-directories \ --no-parent \ --quiet \ --no-clobber \ https://gannet.fish.washington.edu/gitrepos/ceasmallr/data/Cvirginica_v300/ ``` Download scripts files ``` bash curl -L -O https://raw.githubusercontent.com/sr320/ceasmallr/refs/heads/main/code/02.01-bismark-bowtie2-alignment-SLURM-array.sh curl -L -O https://raw.githubusercontent.com/sr320/ceasmallr/refs/heads/main/code/02.02-bismark-SLURM-job.sh ``` ## % Total % Received % Xferd Average Speed Time Time Time Current ## Dload Upload Total Spent Left Speed ## 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0100 3220 100 3220 0 0 12365 0 --:--:-- --:--:-- --:--:-- 12337 ## % Total % Received % Xferd Average Speed Time Time Time Current ## Dload Upload Total Spent Left Speed ## 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0100 1258 100 1258 0 0 5299 0 --:--:-- --:--:-- --:--:-- 5308 Modify scripts as needed. Changes I made: .sh file: - changed repo directory and output directory .job file: - changed the notification email and working directory from Sam’s info to mine - changed final line, which provides file path to the .sh script Ran job script by running the below line from Klone terminal: sbatch --array=0-$(($$(wc -l < ../data/trimmed-WGBS-reads/fastq_pairs.txt) - 1)) 02.02-bismark-SLURM-job.sh Finished running after ~26 hours, check some of the outputs. I’m a little concerned because the email I recieved from Hyak notifying me of completion said `Slurm Array Summary Job_id=25714300_* (25714300) Name=bismark_job_array Failed, Mixed, ExitCode [0-2]` ``` bash cd ../output/02.01-bismark-bowtie2-alignment-SLURM-array/ echo "Number of error files (should have 32):" ls *.err | wc -l echo "Number of .out files (should have 32):" ls *.out | wc -l echo "Number of bismark summary files (should have 32):" ls *bismark_summary.txt | wc -l find . -type f -name "*.out" -size -100c ``` ## Number of error files (should have 32): ## 33 ## Number of .out files (should have 32): ## 33 ## Number of bismark summary files (should have 32): ## 28 ## ./bismark_job_25714300_0.out ## ./bismark_job_25714300_12.out ## ./bismark_job_25714300_13.out ## ./bismark_job_25714300_9.out ## ./bismark_job_25714300_10.out There are 33 error and .out files because I mistakenly removed a +1 in the job command, so it started at the wrong index. That means the jobid_0 files are basically duds. So we actually have 32 of each, as expected. There are only 28 bismark summary files, instead of the expected 32. Excluding the dud jobid_0.out, there are 4 .out files that are very small (\<100 B): ./bismark_job_25714300_12.out ./bismark_job_25714300_13.out ./bismark_job_25714300_9.out ./bismark_job_25714300_10.out Each of these files just contains the following: Contents of pair: Files and have already been processed. Exiting. So… for each of these sub jobs, the read pair is interpreted as having already been processed. Let’s look at all the -bismark_summary.txt files (I think there should only be one of these for each pair of processed reads) These files seem to be named in the format: SampleID-jobSubID-bismark_summary.txt ``` bash cd ../output/02.01-bismark-bowtie2-alignment-SLURM-array/ ls *bismark_summary.txt | sort ``` ## CF01-CM02-Larvae-8-bismark_summary.txt ## CF02-CM02-Zygote-3-bismark_summary.txt ## CF03-CM03-Zygote-1-bismark_summary.txt ## CF03-CM03-Zygote-5-bismark_summary.txt ## CF03-CM04-Larvae-14-bismark_summary.txt ## CF03-CM04-Larvae-15-bismark_summary.txt ## CF03-CM04-Larvae-4-bismark_summary.txt ## CF03-CM05-Larvae-16-bismark_summary.txt ## CF03-CM05-Larvae-6-bismark_summary.txt ## CF06-CM01-Zygote-19-bismark_summary.txt ## CF06-CM01-Zygote-20-bismark_summary.txt ## CF06-CM02-Larvae-11-bismark_summary.txt ## CF06-CM02-Larvae-22-bismark_summary.txt ## CF07-CM02-Zygote-24-bismark_summary.txt ## CF07-CM02-Zygote-28-bismark_summary.txt ## CF08-CM03-Zygote-23-bismark_summary.txt ## CF08-CM03-Zygote-26-bismark_summary.txt ## CF08-CM03-Zygote-2-bismark_summary.txt ## CF08-CM04-Larvae-7-bismark_summary.txt ## EF01-EM01-Zygote-31-bismark_summary.txt ## EF01-EM01-Zygote-32-bismark_summary.txt ## EF02-EM02-Zygote-17-bismark_summary.txt ## EF03-EM03-Zygote-18-bismark_summary.txt ## EF04-EM04-Zygote-21-bismark_summary.txt ## EF05-EM06-Larvae-25-bismark_summary.txt ## EF06-EM02-Larvae-27-bismark_summary.txt ## EF07-EM01-Zygote-29-bismark_summary.txt ## EF07-EM03-Larvae-30-bismark_summary.txt Yikes. Ok it looks like, not only are many reads not run appropriately through Bismark, many of the reads were processed twice (e.g. `CF03-CM03-Zygote-1-bismark_summary.txt`, `CF03-CM03-Zygote-5-bismark_summary.txt`). That means only ~half of the reads were processed, not 28/32. It also looks like ``` bash cd ../output/02.01-bismark-bowtie2-alignment-SLURM-array/ echo "How many output .bam files?" ls *.bam | wc -l echo "How large are the .bam files?" ls -lh *.bam ``` ## How many output .bam files? ## 18 ## How large are the .bam files? ## -rw-r--r-- 1 kdurkin1 nogroup 9.0G May 3 00:52 CF01-CM02-Larvae_R1_001.fastp-trim_bismark_bt2_pe.bam ## -rw-r--r-- 1 kdurkin1 nogroup 4.7G May 3 03:43 CF02-CM02-Zygote_R1_001.fastp-trim_bismark_bt2_pe.bam ## -rw-r--r-- 1 kdurkin1 nogroup 5.1G May 3 00:04 CF03-CM03-Zygote_R1_001.fastp-trim_bismark_bt2_pe.bam ## -rw-r--r-- 1 kdurkin1 nogroup 0 May 3 07:56 CF03-CM04-Larvae_R1_001.fastp-trim_bismark_bt2_pe.bam ## -rw-r--r-- 1 kdurkin1 nogroup 0 May 3 04:56 CF03-CM05-Larvae_R1_001.fastp-trim_bismark_bt2_pe.bam ## -rw-r--r-- 1 kdurkin1 nogroup 0 May 3 03:38 CF06-CM01-Zygote_R1_001.fastp-trim_bismark_bt2_pe.bam ## -rw-r--r-- 1 kdurkin1 nogroup 0 May 2 23:14 CF06-CM02-Larvae_R1_001.fastp-trim_bismark_bt2_pe.bam ## -rw-r--r-- 1 kdurkin1 nogroup 0 May 3 04:38 CF07-CM02-Zygote_R1_001.fastp-trim_bismark_bt2_pe.bam ## -rw-r--r-- 1 kdurkin1 nogroup 0 May 3 00:49 CF08-CM03-Zygote_R1_001.fastp-trim_bismark_bt2_pe.bam ## -rw-r--r-- 1 kdurkin1 nogroup 105M May 2 20:05 CF08-CM04-Larvae_R1_001.fastp-trim.REPAIRED_bismark_bt2_pe.bam ## -rw-r--r-- 1 kdurkin1 nogroup 0 May 3 01:30 EF01-EM01-Zygote_R1_001.fastp-trim_bismark_bt2_pe.bam ## -rw-r--r-- 1 kdurkin1 nogroup 5.8G May 2 23:38 EF02-EM02-Zygote_R1_001.fastp-trim_bismark_bt2_pe.bam ## -rw-r--r-- 1 kdurkin1 nogroup 3.8G May 2 23:38 EF03-EM03-Zygote_R1_001.fastp-trim.REPAIRED_bismark_bt2_pe.bam ## -rw-r--r-- 1 kdurkin1 nogroup 2.1G May 3 00:04 EF04-EM04-Zygote_R1_001.fastp-trim_bismark_bt2_pe.bam ## -rw-r--r-- 1 kdurkin1 nogroup 16G May 3 06:32 EF05-EM06-Larvae_R1_001.fastp-trim_bismark_bt2_pe.bam ## -rw-r--r-- 1 kdurkin1 nogroup 8.1G May 3 00:16 EF06-EM02-Larvae_R1_001.fastp-trim_bismark_bt2_pe.bam ## -rw-r--r-- 1 kdurkin1 nogroup 4.0G May 3 03:52 EF07-EM01-Zygote_R1_001.fastp-trim_bismark_bt2_pe.bam ## -rw-r--r-- 1 kdurkin1 nogroup 8.5G May 3 04:22 EF07-EM03-Larvae_R1_001.fastp-trim_bismark_bt2_pe.bam There are only 18 output bam files, and 6 of those are empty (size 0)