01.00-D-Apul-WGBS-trimming-cutadapt-FastQC-MultiQC
================
Zoe Dellaert
2025-04-09
- [1 Background](#1-background)
- [1.1 Outputs](#11-outputs)
- [1.2 Cutadapt trimming, FastQC, and
MultiQC](#12-cutadapt-trimming-fastqc-and-multiqc)
------------------------------------------------------------------------
# 1 Background
This Rmd file trims whole-genome bisulfite sequencing (WGBS) reads using
[cutadapt](https://cutadapt.readthedocs.io/en/stable/), then runs
quality checks on the trimmed reads with [FastQC](https://github.com/s-andrews/FastQC) and
[MultiQC](https://multiqc.info/) (Ewels et al. 2016).
If you need to download the raw sequencing reads, please see
[00.00-D-Apul-WGBS-reads-FastQC-MultiQC.Rmd](https://github.com/urol-e5/deep-dive-expression/blob/main/D-Apul/code/00.00-D-Apul-WGBS-reads-FastQC-MultiQC.Rmd).
## 1.1 Outputs
Due to size, trimmed FastQs cannot be uploaded to GitHub. All trimmed
FastQs produced by this script will be uploaded here:
[01.00-D-Apul-WGBS-trimming-cutadapt-FastQC-MultiQC](https://gannet.fish.washington.edu/gitrepos/urol-e5/deep-dive-expression/D-Apul/output/01.00-D-Apul-WGBS-trimming-cutadapt-FastQC-MultiQC/)
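
As a small illustration of how the uploaded files are named, the sketch below maps a raw FastQ path to its trimmed file name, following the `trimmed_` prefix used by the `-o` and `-p` arguments of the script in section 1.2. The helper `trimmed_path` and the file name `sampleA_R1_001.fastq.gz` are hypothetical and only serve the example.

```bash
#!/usr/bin/env bash
# Sketch: map a raw FastQ path to the trimmed file name written by the script.
# trimmed_path is a hypothetical helper; the sample file name is a placeholder.

out_dir="../output/01.00-D-Apul-WGBS-trimming-cutadapt-FastQC-MultiQC/"

trimmed_path() {
    local raw="$1" name
    # Strip the directory and the .fastq.gz suffix, as the script does with basename -s
    name="$(basename -s ".fastq.gz" "${raw}")"
    echo "${out_dir}trimmed_${name}.fastq.gz"
}

trimmed_path "../data/raw-fastqs/sampleA_R1_001.fastq.gz"
```

For the placeholder input, this prints the output directory followed by `trimmed_sampleA_R1_001.fastq.gz`.
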
## 1.2 Cutadapt trimming, FastQC, and MultiQC
``` bash
#!/usr/bin/env bash
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=20
#SBATCH --signal=2
#SBATCH --no-requeue
#SBATCH --mem=200GB
#SBATCH -t 48:00:00
#SBATCH --mail-type=BEGIN,END,FAIL,TIME_LIMIT_80 #email you when job starts, stops and/or fails
#SBATCH --error=scripts/outs_errs/"%x_error.%j" #if your job fails, the error report will be put in this file
#SBATCH --output=scripts/outs_errs/"%x_output.%j" #once your job is completed, any final job report comments will be put in this file
# load modules needed
module load uri/main
module load cutadapt/3.5-GCCcore-11.2.0
# Set directories and files
reads_dir="../data/raw-fastqs/"
out_dir="../output/01.00-D-Apul-WGBS-trimming-cutadapt-FastQC-MultiQC/"
mkdir -p ${out_dir}
# make arrays of R1 and R2 reads (glob results are sorted, so mates share an index)
R1_raw=("${reads_dir}"*R1*.fastq.gz)
R2_raw=("${reads_dir}"*R2*.fastq.gz)
# file names without the directory or the .fastq.gz suffix
R1_name=($(basename -s ".fastq.gz" "${R1_raw[@]}"))
R2_name=($(basename -s ".fastq.gz" "${R2_raw[@]}"))
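# Trim each read pair:
#   -a / -A             3' adapters for R1 / R2 (Illumina TruSeq sequences)
#   -q 20,20            quality-trim the 5' and 3' ends with a cutoff of 20
#   --minimum-length 20 discard pairs in which a read is shorter than 20 bp after trimming
for i in "${!R1_raw[@]}"; do
    cutadapt \
        -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
        -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
        -q 20,20 --minimum-length 20 --cores=20 \
        -o "${out_dir}trimmed_${R1_name[$i]}.fastq.gz" \
        -p "${out_dir}trimmed_${R2_name[$i]}.fastq.gz" \
        "${R1_raw[$i]}" "${R2_raw[$i]}"
    echo "trimming of ${R1_raw[$i]} and ${R2_raw[$i]} complete"
done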
# unload modules that conflict with the modules needed below
module unload cutadapt/3.5-GCCcore-11.2.0
# load modules needed
module load parallel/20240822
module load fastqc/0.12.1
module load uri/main
module load all/MultiQC/1.12-foss-2021b
# make trimmed_qc output folder
mkdir -p "${out_dir}/trimmed_qc"
cd "${out_dir}" || exit 1
# Create an array of fastq files to process
files=(trimmed*.fastq.gz)
# Run fastqc in parallel
echo "Starting fastqc..." $(date)
parallel -j 20 "fastqc {} -o trimmed_qc/ && echo 'Processed {}'" ::: "${files[@]}"
echo "fastQC done." $(date)
cd trimmed_qc/
# Compile MultiQC report from FastQC files
multiqc *
echo "QC of trimmed data complete." $(date)
```
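
The script pairs reads by position: the `*R1*` and `*R2*` files are listed separately, and index `i` in one array is assumed to be the mate of index `i` in the other. A minimal sketch of a sanity check for that assumption is shown below. The helper `check_pairs` and the sample file names are hypothetical, and the check assumes that `R1` appears only once in each file name, as the read token.

```bash
#!/usr/bin/env bash
# Sketch only: file names are hypothetical placeholders created in a temp dir.

demo_dir="$(mktemp -d)"
touch "${demo_dir}/sampleA_R1_001.fastq.gz" "${demo_dir}/sampleA_R2_001.fastq.gz" \
      "${demo_dir}/sampleB_R1_001.fastq.gz" "${demo_dir}/sampleB_R2_001.fastq.gz"

# Glob results are sorted, so mates should land at the same index
R1_raw=("${demo_dir}"/*R1*.fastq.gz)
R2_raw=("${demo_dir}"/*R2*.fastq.gz)

# check_pairs (hypothetical helper) returns 0 only when every R1 file
# has its R2 mate at the same array index
check_pairs() {
    local i r1_base r2_base
    if [ "${#R1_raw[@]}" -ne "${#R2_raw[@]}" ]; then
        echo "unequal number of R1 and R2 files" >&2
        return 1
    fi
    for i in "${!R1_raw[@]}"; do
        r1_base="$(basename "${R1_raw[$i]}")"
        r2_base="$(basename "${R2_raw[$i]}")"
        # Assumes "R1" appears once in the name, as the read token
        if [ "${r1_base/R1/R2}" != "${r2_base}" ]; then
            echo "mismatched pair: ${r1_base} vs ${r2_base}" >&2
            return 1
        fi
    done
    echo "all ${#R1_raw[@]} pairs matched"
}

check_pairs
```

In a real run, a call like this could sit just before the trimming loop, stopping the job early if a mate file is missing rather than trimming mismatched pairs.
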
Ewels, Philip, Måns Magnusson, Sverker Lundin, and Max Käller. 2016.
“MultiQC: Summarize Analysis Results for Multiple Tools and Samples in a
Single Report.” *Bioinformatics* 32 (19): 3047–48.