---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
layout: post
title: Trimming/MultiQC - Methcompare Bisulfite FastQs with fastp on Mox
date: '2020-03-06 13:36'
tags:
  - fastp
  - multiqc
  - mox
  - trimming
  - fastq
categories:
  - 2020
  - Miscellaneous
---
Steven asked me to trim a set of FastQ files, provided by Hollie Putnam, in preparation for methylation analysis using [Bismark](https://rawgit.com/FelixKrueger/Bismark/master/Docs/Bismark_User_Guide.html). The analysis is part of a coral project comparing DNA methylation profiles of different species, as well as comparing different sample prep protocols. There's a dedicated GitHub repo here:

- [Meth_Compare](https://github.com/hputnam/Meth_Compare)

I roughly followed the [trimming pipeline that Hollie had already put together](https://github.com/hputnam/Meth_Compare/blob/master/Meth_Compare_Pipeline.md), but opted to use [fastp](https://github.com/OpenGene/fastp) because it is generally faster than other trimmers and also generates pre/post-trimming graphs and tables, similar to FastQC. Additionally, [MultiQC](https://multiqc.info/) can interpret fastp output to generate summary statistics/graphs, as it does with FastQC.

The data consisted of two different types of libraries: reduced representation bisulfite sequencing (RRBS) and whole genome bisulfite sequencing (WGBS). Knowing this, I followed the Bismark trimming guidelines for each library type. The fastp trimming and MultiQC were run with the following SBATCH script (GitHub):

- [20200305_methcompare_fastp_trimming.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20200305_methcompare_fastp_trimming.sh)

```shell
#!/bin/bash
## Job Name
#SBATCH --job-name=pgen_fastp_trimming_EPI
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=1-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200305_methcompare_fastp_trimming


### WGBS and RRBS trimming using fastp.
### FastQ files were provided by Hollie Putnam.
### See this GitHub repo for more info:
### https://github.com/hputnam/Meth_Compare

# Exit script if any command fails
# set -e

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)

{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log

# Set number of CPUs to use
threads=27

# Paths to programs
fastp=/gscratch/srlab/programs/fastp-0.20.0/fastp
multiqc=/gscratch/srlab/programs/anaconda3/bin/multiqc

# Programs array
programs_array=("${fastp}" "${multiqc}")


# Capture program options
for program in "${!programs_array[@]}"
do
	echo "Program options for ${programs_array[program]}: "
	echo ""
	${programs_array[program]} -h
	echo ""
	echo ""
	echo "----------------------------------------------"
	echo ""
	echo ""
done &>> program_options.log


# Input/output files
trimmed_checksums=trimmed_fastq_checksums.md5

# Initialize arrays
# These were provided by Hollie Putnam
# See https://github.com/hputnam/Meth_Compare/blob/master/Meth_Compare_Pipeline.md
rrbs_array=(Meth4 Meth5 Meth6 Meth13 Meth14 Meth15)
wgbs_array=(Meth1 Meth2 Meth3 Meth7 Meth8 Meth9 Meth10 Meth11 Meth12 Meth16 Meth17 Meth18)

# Assign file suffixes to variables
read1="_R1_001.fastq.gz"
read2="_R2_001.fastq.gz"

# Create list of fastq files used in analysis
for fastq in *.gz
do
  echo "${fastq}" >> fastq.list.txt
done

# Run fastp on RRBS files
# Trims 2bp from the 3' end of read1 (--trim_tail1) and
# 2bp from the 5' end of read2 (--trim_front2), per Bismark instructions for RRBS
# https://rawgit.com/FelixKrueger/Bismark/master/Docs/Bismark_User_Guide.html
for index in "${!rrbs_array[@]}"
do
	timestamp=$(date +%Y%m%d%M%S)
	${fastp} \
	--in1 "${rrbs_array[index]}${read1}" \
	--in2 "${rrbs_array[index]}${read2}" \
	--detect_adapter_for_pe \
	--trim_tail1 2 \
	--trim_front2 2 \
	--thread ${threads} \
	--html "${rrbs_array[index]}.fastp-trim.${timestamp}.report.html" \
	--json "${rrbs_array[index]}.fastp-trim.${timestamp}.report.json" \
	--out1 "${rrbs_array[index]}.fastp-trim.${timestamp}${read1}" \
	--out2 "${rrbs_array[index]}.fastp-trim.${timestamp}${read2}"

	# Generate md5 checksums for newly trimmed files
	{
		md5sum "${rrbs_array[index]}.fastp-trim.${timestamp}${read1}"
		md5sum "${rrbs_array[index]}.fastp-trim.${timestamp}${read2}"
	} >> "${trimmed_checksums}"

done

# Run fastp on WGBS files
# Trims 10bp from both the 5' and 3' ends of all reads
# per Bismark instructions for WGBS Zymo/Swift library kits
# https://rawgit.com/FelixKrueger/Bismark/master/Docs/Bismark_User_Guide.html
for index in "${!wgbs_array[@]}"
do
	timestamp=$(date +%Y%m%d%M%S)
	${fastp} \
	--in1 "${wgbs_array[index]}${read1}" \
	--in2 "${wgbs_array[index]}${read2}" \
	--detect_adapter_for_pe \
	--trim_front1 10 \
	--trim_tail1 10 \
	--trim_front2 10 \
	--trim_tail2 10 \
	--thread ${threads} \
	--html "${wgbs_array[index]}.fastp-trim.${timestamp}.report.html" \
	--json "${wgbs_array[index]}.fastp-trim.${timestamp}.report.json" \
	--out1 "${wgbs_array[index]}.fastp-trim.${timestamp}${read1}" \
	--out2 "${wgbs_array[index]}.fastp-trim.${timestamp}${read2}"

	# Generate md5 checksums for newly trimmed files
	{
		md5sum "${wgbs_array[index]}.fastp-trim.${timestamp}${read1}"
		md5sum "${wgbs_array[index]}.fastp-trim.${timestamp}${read2}"
	} >> "${trimmed_checksums}"

done

# Run multiqc
${multiqc} .
```
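
The script above uses two nearly identical loops that differ only in their trim flags. As a rough sketch, the per-library settings could instead be looked up from one place. The function name `trim_options_for` is hypothetical and not part of the script I ran; the flags and values are copied from the two loops above.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script):
# print the fastp trim flags used above for a given library type.
trim_options_for() {
  case "$1" in
    # RRBS: 2bp off the 3' end of read1, 2bp off the 5' end of read2
    rrbs) echo "--trim_tail1 2 --trim_front2 2" ;;
    # WGBS: 10bp off both ends of both reads
    wgbs) echo "--trim_front1 10 --trim_tail1 10 --trim_front2 10 --trim_tail2 10" ;;
    # Anything else is reported on stderr with a non-zero exit status
    *)    echo "Unknown library type: $1" >&2; return 1 ;;
  esac
}
```

The printed flags could then be passed, unquoted so that they word-split, into a single fastp call shared by both library types.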


---

# RESULTS

This took ~6.5hrs to complete:

![Screencap of Mox runtime](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20200305_methcompare_fastp_trimming_runtime.png?raw=true)

The image above shows a runtime of ~5hrs. However, a subset of samples was _not_ properly processed by fastp (the logs looked fine, with no errors, but no output files were generated; very odd). I re-ran a subset of the code on the "missing" samples and it worked fine, taking ~1.5hrs to process the remaining samples, for a total of ~6.5hrs.
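
Since fastp reported no errors for the skipped samples, a check for missing output files is one way to catch this sort of thing. The following is a minimal sketch, not something I ran as part of this job. It assumes the output naming used in the script above (`<sample>.fastp-trim.<timestamp>_R[12]_001.fastq.gz`), and the function names `has_match` and `find_missing_trimmed` are hypothetical.

```shell
#!/bin/bash
# Sketch: list samples lacking a trimmed output file for either read.
# Read suffixes mirror those in the trimming script.
read1="_R1_001.fastq.gz"
read2="_R2_001.fastq.gz"

has_match() {
  # Succeeds if the first argument is an existing file.
  # An unmatched glob is passed through as a literal string, so this fails.
  [ -e "$1" ]
}

find_missing_trimmed() {
  # Usage: find_missing_trimmed <output_dir> <sample> [<sample> ...]
  out_dir=$1
  shift
  for sample in "$@"
  do
    for suffix in "${read1}" "${read2}"
    do
      # The "." after the sample name keeps Meth1 from matching Meth10, etc.
      if ! has_match "${out_dir}/${sample}.fastp-trim."*"${suffix}"
      then
        echo "${sample}"
        break
      fi
    done
  done
}
```

For example, `find_missing_trimmed . Meth1 Meth2 Meth3` would print the name of each listed sample that is missing a trimmed read1 or read2 file in the current directory. It only checks that the files exist, not that they are complete.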

Output folder:

- [20200305_methcompare_fastp_trimming/](https://gannet.fish.washington.edu/Atumefaciens/20200305_methcompare_fastp_trimming/)

I retained the raw FastQs provided by Hollie for posterity.

Trimmed files are named with the following convention:

- `*.fastp-trim*.gz`
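
The trimming script also writes md5 checksums for the trimmed files to `trimmed_fastq_checksums.md5`. As a rough sketch, anyone copying the trimmed files elsewhere could verify them against that list. This assumes GNU `md5sum` is available and that the checksum file sits in the same directory as the trimmed FastQs; the function name `verify_trimmed_checksums` is hypothetical.

```shell
#!/bin/bash
# Sketch: verify trimmed FastQs against the checksum list written by the trimming script.
verify_trimmed_checksums() {
  # Usage: verify_trimmed_checksums <directory> [<checksum_file>]
  dir=$1
  checksums=${2:-trimmed_fastq_checksums.md5}
  # File names in the list are relative, so run md5sum from inside the directory.
  # The subshell keeps the caller's working directory unchanged.
  # --quiet suppresses the per-file "OK" lines; failures are still printed.
  ( cd "${dir}" && md5sum --check --quiet "${checksums}" )
}
```

The function returns a non-zero exit status if any file fails verification or is missing, so it can be used directly in an `if` statement.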

Individual (on a per read pair basis) fastp HTML reports are named similarly:

- `*.report.html`

MultiQC report (HTML):

- [20200305_methcompare_fastp_trimming/multiqc_report.html](https://gannet.fish.washington.edu/Atumefaciens/20200305_methcompare_fastp_trimming/multiqc_report.html)