---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
layout: post
title: Trimming/FastQC/MultiQC - C.bairdi RNAseq FastQ with fastp on Mox
date: '2019-12-18 09:51'
tags:
  - fastp
  - fastqc
  - multiqc
  - Chionoecetes bairdi
  - tanner crab
  - mox
categories:
  - 2019
  - Tanner Crab RNAseq
---

Grace/Steven asked me to generate a _de novo_ transcriptome assembly of our current _C.bairdi_ RNAseq data in [this GitHub issue](https://github.com/RobertsLab/resources/issues/808). As part of that, I needed to quality trim the data first. Although I could automate this as part of the transcriptome assembly (Trinity has Trimmomatic built in), I would be unable to view the post-trimming results until after the assembly was completed. So, I opted to do the trimming step separately, to evaluate the data prior to assembly.

Trimming was performed using [fastp (v0.20.0)](https://github.com/OpenGene/fastp) on Mox.

I used the following Bash script to initiate file transfer to Mox and then call the SBATCH script for trimming:

- [20191218_cbai_RNAseq_rsync.sh](https://gannet.fish.washington.edu/Atumefaciens/20191218_cbai_fastp_RNAseq_trimming/20191218_cbai_RNAseq_rsync.sh)

```shell
#!/bin/bash

## Script to transfer C.bairdi RNAseq files and then run SBATCH script for fastp trimming.

# Exit script if any command fails
set -e

# Transfer files
rsync -av --progress owl:/volume1/web/nightingales/C_bairdi/*.gz .

# Run SBATCH script to begin fastp trimming
sbatch 20191218_cbai_fastp_RNAseq_trimming.sh
```
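As an aside, rsync's dry-run flag can be used to preview which files would be transferred before actually running the script; this is just a sketch using the same source path as above, not something that was part of the job itself:

```shell
# Preview the transfer without copying anything (sketch only; -n is rsync's dry-run flag)
rsync -avn --progress owl:/volume1/web/nightingales/C_bairdi/*.gz .
```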
SBATCH script (GitHub):

- [20191218_cbai_fastp_RNAseq_trimming.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20191218_cbai_fastp_RNAseq_trimming.sh)

```shell
#!/bin/bash
## Job Name
#SBATCH --job-name=pgen_fastp_trimming_EPI
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20191218_cbai_fastp_RNAseq_trimming

### C.bairdi RNAseq trimming using fastp.

# This script is called by 20191218_cbai_RNAseq_rsync.sh. That script transfers the FastQ files
# to the working directory from: https://owl.fish.washington.edu/nightingales/C_bairdi

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability
module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log

# Set number of CPUs to use
threads=27

# Input/output files
trimmed_checksums=trimmed_fastq_checksums.md5

# Paths to programs
fastp=/gscratch/srlab/programs/fastp-0.20.0/fastp

## Inititalize arrays
fastq_array_R1=()
fastq_array_R2=()
R1_names_array=()
R2_names_array=()

# Create array of fastq R1 files
for fastq in *R1*.gz
do
fastq_array_R1+=("${fastq}")
done

# Create array of fastq R2 files
for fastq in *R2*.gz
do
fastq_array_R2+=("${fastq}")
done

# Create array of sample names
## Uses awk to parse out sample name from filename
for R1_fastq in *R1*.gz
do
R1_names_array+=($(echo "${R1_fastq}" | awk -F"." '{print $1}'))
done

# Create array of sample names
## Uses awk to parse out sample name from filename
for R2_fastq in *R2*.gz
do
R2_names_array+=($(echo "${R2_fastq}" | awk -F"." '{print $1}'))
done

# Create list of fastq files used in analysis
for fastq in *.gz
do
echo "${fastq}" >> fastq.list.txt
done

# Run fastp on files
for index in "${!fastq_array_R1[@]}"
do
timestamp=$(date +%Y%m%d%M%S)
R1_sample_name=$(echo "${R1_names_array[index]}")
R2_sample_name=$(echo "${R2_names_array[index]}")
${fastp} \
--in1 "${fastq_array_R1[index]}" \
--in2 "${fastq_array_R2[index]}" \
--detect_adapter_for_pe \
--thread ${threads} \
--html "${R1_sample_name}".fastp-trim."${timestamp}".report.html \
--json "${R1_sample_name}".fastp-trim."${timestamp}".report.json \
--out1 "${R1_sample_name}".fastp-trim."${timestamp}".fq.gz \
--out2 "${R2_sample_name}".fastp-trim."${timestamp}".fq.gz

# Generate md5 checksums for newly trimmed files
{
md5sum "${R1_sample_name}".fastp-trim."${timestamp}".fq.gz
md5sum "${R2_sample_name}".fastp-trim."${timestamp}".fq.gz
} >> "${trimmed_checksums}"

# Remove original FastQ files
rm "${fastq_array_R1[index]}" "${fastq_array_R2[index]}"
done
```

---

# RESULTS

Took ~40 minutes to complete:

![screencap of fastp runtime on Mox](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20191218_cbai_fastp_RNAseq_trimming_runtime.png?raw=true)

Output folder:

- [20191218_cbai_fastp_RNAseq_trimming](https://gannet.fish.washington.edu/Atumefaciens/20191218_cbai_fastp_RNAseq_trimming)

MultiQC Report (HTML):

- [20191218_cbai_fastp_RNAseq_trimming/multiqc_report.html](https://gannet.fish.washington.edu/Atumefaciens/20191218_cbai_fastp_RNAseq_trimming/multiqc_report.html)

Overall, the data looks fine. There's a high degree of sequence duplication, but this is expected when dealing with RNAseq libraries.

One really nice aspect of using [fastp](https://github.com/OpenGene/fastp) is that it generates an HTML report for each sample trimmed, and the reports include before and after data/plots, so there's almost no need for FastQC. With that said, MultiQC doesn't recognize the [fastp](https://github.com/OpenGene/fastp) reports, but it _does_ recognize the FastQC reports (see the sketch at the end of this post). Having the aggregated report of all files that MultiQC produces is _very_ nice for looking at all the data at one time.

Will proceed with Trinity _de novo_ assembly.
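The FastQC/MultiQC commands themselves aren't captured above. As a rough sketch, the trimmed files would typically be run through FastQC and then aggregated with MultiQC along these lines (assumes both tools are on the PATH; the thread count and output directory are illustrative, not the exact options used for the linked report):

```shell
# Sketch only: assumes fastqc and multiqc are available on the PATH.

# Run FastQC on each trimmed FastQ file
fastqc \
--threads 27 \
--outdir . \
*.fastp-trim.*.fq.gz

# Aggregate the FastQC reports (and anything else MultiQC recognizes) into a single report
multiqc .
```

MultiQC's default output, `multiqc_report.html`, is what's linked in the results above.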