01.00-E-Peve-WGBS-trimming-cutadapt-FastQC-MultiQC
================
Zoe Dellaert
2025-04-10

- [1 Background](#1-background)
  - [1.1 Outputs](#11-outputs)
  - [1.2 Cutadapt trimming, FastQC, and
    MultiQC](#12-cutadapt-trimming-fastqc-and-multiqc)

------------------------------------------------------------------------

# 1 Background

This Rmd file trims raw WGBS reads using
[cutadapt](https://cutadapt.readthedocs.io/en/stable/), then runs
quality checks on the trimmed reads with
[FastQC](https://github.com/s-andrews/FastQC) and compiles the results
with [MultiQC](https://multiqc.info/) (Ewels et al. 2016).

<div class="callout-note">

If you need to download the raw sequencing reads, please see
[00.00-E-Peve-WGBS-reads-FastQC-MultiQC.Rmd](https://github.com/urol-e5/deep-dive-expression/blob/main/E-Peve/code/00.00-E-Peve-WGBS-reads-FastQC-MultiQC.Rmd).

</div>

## 1.1 Outputs

Due to their size, trimmed FastQs cannot be uploaded to GitHub. All
trimmed FastQs produced by this script are uploaded here:

[01.00-E-Peve-WGBS-trimming-cutadapt-FastQC-MultiQC](https://gannet.fish.washington.edu/gitrepos/urol-e5/deep-dive-expression/E-Peve/output/01.00-E-Peve-WGBS-trimming-cutadapt-FastQC-MultiQC/)
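To work with the trimmed reads locally, the files (and their `.md5`
checksums) can be mirrored from that URL. A minimal sketch, assuming
`wget` is available and directory listing remains enabled on the
server; adjust `--cut-dirs` for your preferred local layout:

``` bash
# Mirror trimmed FastQs and checksums from the gannet server (sketch; not part of the pipeline)
wget -r -np -nH --cut-dirs=5 -A "*.fastq.gz,*.md5" \
  https://gannet.fish.washington.edu/gitrepos/urol-e5/deep-dive-expression/E-Peve/output/01.00-E-Peve-WGBS-trimming-cutadapt-FastQC-MultiQC/
```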

## 1.2 Cutadapt trimming, FastQC, and MultiQC

``` bash
#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=20
#SBATCH --signal=2
#SBATCH --no-requeue
#SBATCH --mem=200GB
#SBATCH -t 48:00:00
#SBATCH --mail-type=BEGIN,END,FAIL,TIME_LIMIT_80 # email when the job starts, ends, fails, or reaches 80% of its time limit
#SBATCH --error=scripts/outs_errs/"%x_error.%j" # if the job fails, the error report is written to this file
#SBATCH --output=scripts/outs_errs/"%x_output.%j" # final job report output is written to this file once the job completes

# load modules needed

module load uri/main
module load cutadapt/3.5-GCCcore-11.2.0

# Set directories and files
reads_dir="../data/raw-fastqs/"
out_dir="../output/01.00-E-Peve-WGBS-trimming-cutadapt-FastQC-MultiQC/"

mkdir -p ${out_dir}

#make arrays of R1 and R2 reads
R1_raw=($('ls' ${reads_dir}*R1*.fastq.gz))
R2_raw=($('ls' ${reads_dir}*R2*.fastq.gz))

R1_name=($(basename -s ".fastq.gz" ${R1_raw[@]}))
R2_name=($(basename -s ".fastq.gz" ${R2_raw[@]}))

for i in ${!R1_raw[@]}; do
    cutadapt \
    -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
    -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
    -o "${out_dir}trimmed_${R1_name[$i]}.fastq.gz" -p "${out_dir}trimmed_${R2_name[$i]}.fastq.gz" \
    ${R1_raw[$i]} ${R2_raw[$i]} \
    -q 20,20 --minimum-length 20 --cores=20

    echo "trimming of ${R1_raw[$i]} and ${R2_raw[$i]} complete"
    
    # Generate md5 checksums for the newly trimmed files
    md5sum "${out_dir}trimmed_${R1_name[$i]}.fastq.gz" > "${out_dir}trimmed_${R1_name[$i]}.fastq.gz.md5"
    md5sum "${out_dir}trimmed_${R2_name[$i]}.fastq.gz" > "${out_dir}trimmed_${R2_name[$i]}.fastq.gz.md5"
done

# unload conflicting modules with modules needed below
module unload cutadapt/3.5-GCCcore-11.2.0

# load modules needed
module load parallel/20240822
module load fastqc/0.12.1
module load uri/main
module load all/MultiQC/1.12-foss-2021b

#make trimmed_qc output folder
mkdir -p ${out_dir}/trimmed_qc
cd ${out_dir}

# Create an array of fastq files to process
files=($('ls' trimmed*.fastq.gz)) 

# Run fastqc in parallel
echo "Starting fastqc..." $(date)
parallel -j 20 "fastqc {} -o trimmed_qc/ && echo 'Processed {}'" ::: "${files[@]}"
echo "fastQC done." $(date)

cd trimmed_qc/

# Compile MultiQC report from the FastQC results
multiqc *

echo "QC of trimmed data complete." $(date)
```
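Once the job finishes, the per-file checksums written by the trimming
loop can be used to confirm the trimmed FastQs are intact after any
copy or transfer. A minimal sketch, assuming it is run from the same
directory the SLURM job was submitted from (the `.md5` files record the
relative `../output/...` paths); the report filename is MultiQC's
default:

``` bash
# Verify the trimmed FastQs against the checksums generated during trimming
# (.md5 files record relative ../output/... paths, so run this from the submit directory)
cat ../output/01.00-E-Peve-WGBS-trimming-cutadapt-FastQC-MultiQC/*.md5 | md5sum -c -

# The compiled report is written to MultiQC's default filename:
# ../output/01.00-E-Peve-WGBS-trimming-cutadapt-FastQC-MultiQC/trimmed_qc/multiqc_report.html
```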

<div id="refs" class="references csl-bib-body hanging-indent"
entry-spacing="0">

<div id="ref-ewels2016" class="csl-entry">

Ewels, Philip, Måns Magnusson, Sverker Lundin, and Max Käller. 2016.
“MultiQC: Summarize Analysis Results for Multiple Tools and Samples in a
Single Report.” *Bioinformatics* 32 (19): 3047–48.
<https://doi.org/10.1093/bioinformatics/btw354>.

</div>

</div>