---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
layout: post
title: FastQ Trimming and QC - P.generosa RNA-seq Data from 20220323 on Mox
date: '2023-04-26 10:56'
tags: 
  - mox
  - Panopea generosa
  - Pacific geoduck
  - fastp
  - RNA-seq
  - MultiQC
  - FastQC
  - trimming
categories: 
  - 2023
  - Miscellaneous
---
Addressing the update to [this GitHub Issue](https://github.com/RobertsLab/resources/issues/1434) regarding identifying [_Panopea generosa_ (Pacific geoduck)](http://en.wikipedia.org/wiki/Geoduck) long non-coding RNAs (lncRNAs), I used the RNA-seq data from [the Nextflow NF-Core RNAseq pipeline run on 20220323](https://robertslab.github.io/sams-notebook/posts/2022/2022-03-23-Differential-Gene-Expression---P.generosa-DGE-Between-Tissues-Using-Nextlow-NF-Core-RNAseq-Pipeline-on-Mox/). Although that data was supposed to have been trimmed in the Nextflow NF-Core RNA-seq pipeline, the FastQC reports still show adapter contamination and some funky stuff happening at the 5' end of the reads. So, I've opted to trim the "trimmed" files with [`fastp`](https://github.com/OpenGene/fastp), using a hard 20bp trim at the 5' end of all reads. [`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and [`MultiQC`](https://multiqc.info/) were run before/after trimming. Job was run on Mox.

SLURM Script (GitHub):

- [20230426-pgen-fastqc-fastp-multiqc-RNAseq.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20230426-pgen-fastqc-fastp-multiqc-RNAseq.sh)

```bash
#!/bin/bash
## Job Name
#SBATCH --job-name=20230426-pgen-fastqc-fastp-multiqc-RNAseq
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=5-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20230426-pgen-fastqc-fastp-multiqc-RNAseq

### FastQC and fastp trimming of P.generosa RNA-seq data from 20220323.

https://robertslab.github.io/sams-notebook/posts/2022/2022-03-23-Differential-Gene-Expression---P.generosa-DGE-Between-Tissues-Using-Nextlow-NF-Core-RNAseq-Pipeline-on-Mox/

### fastp expects input FastQ files to be in format: *[12]_val_[12].fq.gz
### E.g. heart_1_val_1.fq.gz


###################################################################################
# These variables need to be set by user

## Assign Variables

# Set FastQ filename patterns
fastq_pattern='*.fq.gz'
R1_fastq_pattern='*val_1.fq.gz'
R2_fastq_pattern='*val_2.fq.gz'

# Set number of CPUs to use
threads=40

# Input/output files
trimmed_checksums=trimmed_fastq_checksums.md5
fastq_checksums=input_fastq_checksums.md5

# FastQC output directory
output_dir=$(pwd)

# Data directories
reads_dir=/gscratch/scrubbed/samwhite/data/P_generosa/RNAseq


## Inititalize arrays
raw_fastqs_array=()
R1_names_array=()
R2_names_array=()

# Paths to programs
fastp=/gscratch/srlab/programs/fastp.0.23.1
fastqc=/gscratch/srlab/programs/fastqc_v0.11.9/fastqc
multiqc=/gscratch/srlab/programs/anaconda3/bin/multiqc

# Programs associative array
declare -A programs_array
programs_array=(
[fastqc]="${fastqc}"
[fastp]="${fastp}" \
[multiqc]="${multiqc}"
)


###################################################################################

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability
module load intel-python3_2017

# Capture date
timestamp=$(date +%Y%m%d)

# Sync raw FastQ files to working directory
echo ""
echo "Transferring files via rsync..."
rsync --archive --verbose \
"${reads_dir}"${fastq_pattern} .
echo ""
echo "File transfer complete."
echo ""

### Run FastQC ###

### NOTE: Do NOT quote raw_fastqc_list
# Create array of trimmed FastQs
raw_fastqs_array=(${fastq_pattern})

# Pass array contents to new variable as space-delimited list
raw_fastqc_list=$(echo "${raw_fastqs_array[*]}")

echo "Beginning FastQC on raw reads..."
echo ""

# Run FastQC
${programs_array[fastqc]} \
--threads ${threads} \
--outdir ${output_dir} \
${raw_fastqc_list}

echo "FastQC on raw reads complete!"
echo ""

### END FASTQC ###

# Create arrays of fastq R1 files and sample names
# Do NOT quote R1_fastq_pattern variable
for fastq in ${R1_fastq_pattern}
do
  fastq_array_R1+=("${fastq}")

  # Use parameter substitution to remove all text up to and including last "." from
  # right side of string.
  R1_names_array+=("${fastq%%.*}")
done

# Create array of fastq R2 files
# Do NOT quote R2_fastq_pattern variable
for fastq in ${R2_fastq_pattern}
do
  fastq_array_R2+=("${fastq}")

  # Use parameter substitution to remove all text up to and including last "." from
  # right side of string.
  R2_names_array+=("${fastq%%.*}")
done


# Create MD5 checksums for raw FastQs
for fastq in ${fastq_pattern}
do
  echo "Generating checksum for ${fastq}"
  md5sum "${fastq}" | tee --append ${fastq_checksums}
  echo ""
done


### RUN FASTP ###

# Run fastp on files
# Adds JSON report output for downstream usage by MultiQC
# Trims 20bp from 5' end of all reads
# Trims poly G, if present
# Uses parameter substitution (e.g. ${R1_sample_name%%_*})to rm the _R[12] for report names.
echo "Beginning fastp trimming."
echo ""

for index in "${!fastq_array_R1[@]}"
do
  R1_sample_name="${R1_names_array[index]}"
  R2_sample_name="${R2_names_array[index]}"
  ${fastp} \
  --in1 ${fastq_array_R1[index]} \
  --in2 ${fastq_array_R2[index]} \
  --detect_adapter_for_pe \
  --trim_poly_g \
  --trim_front1 20 \
  --trim_front2 20 \
  --thread ${threads} \
  --html "${R1_sample_name%%_*}".fastp-trim."${timestamp}".report.html \
  --json "${R1_sample_name%%_*}".fastp-trim."${timestamp}".report.json \
  --out1 "${R1_sample_name}".fastp-trim."${timestamp}".fq.gz \
  --out2 "${R2_sample_name}".fastp-trim."${timestamp}".fq.gz

  # Generate md5 checksums for newly trimmed files
  {
      md5sum "${R1_sample_name}".fastp-trim."${timestamp}".fq.gz
      md5sum "${R2_sample_name}".fastp-trim."${timestamp}".fq.gz
  } >> "${trimmed_checksums}"
  
  # Remove original FastQ files
  echo ""
  echo " Removing ${fastq_array_R1[index]} and ${fastq_array_R2[index]}."
  rm "${fastq_array_R1[index]}" "${fastq_array_R2[index]}"
done

echo ""
echo "fastp trimming complete."
echo ""

### END FASTP ###


### RUN FASTQC ###

### NOTE: Do NOT quote ${trimmed_fastqc_list}

# Create array of trimmed FastQs
trimmed_fastq_array=(*fastp-trim*.fq.gz)

# Pass array contents to new variable as space-delimited list
trimmed_fastqc_list=$(echo "${trimmed_fastq_array[*]}")

# Run FastQC
echo "Beginning FastQC on trimmed reads..."
echo ""
${programs_array[fastqc]} \
--threads ${threads} \
--outdir ${output_dir} \
${trimmed_fastqc_list}

echo ""
echo "FastQC on trimmed reads complete!"
echo ""

### END FASTQC ###

### RUN MULTIQC ###
echo "Beginning MultiQC..."
echo ""
${multiqc} .
echo ""
echo "MultiQC complete."
echo ""

### END MULTIQC ###

####################################################################

# Capture program options
if [[ "${#programs_array[@]}" -gt 0 ]]; then
  echo "Logging program options..."
  for program in "${!programs_array[@]}"
  do
    {
    echo "Program options for ${program}: "
    echo ""
    # Handle samtools help menus
    if [[ "${program}" == "samtools_index" ]] \
    || [[ "${program}" == "samtools_sort" ]] \
    || [[ "${program}" == "samtools_view" ]]
    then
      ${programs_array[$program]}

    # Handle DIAMOND BLAST menu
    elif [[ "${program}" == "diamond" ]]; then
      ${programs_array[$program]} help

    # Handle NCBI BLASTx menu
    elif [[ "${program}" == "blastx" ]]; then
      ${programs_array[$program]} -help
    fi
    ${programs_array[$program]} -h
    echo ""
    echo ""
    echo "----------------------------------------------"
    echo ""
    echo ""
  } &>> program_options.log || true

    # If MultiQC is in programs_array, copy the config file to this directory.
    if [[ "${program}" == "multiqc" ]]; then
      cp --preserve ~/.multiqc_config.yaml multiqc_config.yaml
    fi
  done
fi


# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log
```


---

# RESULTS

Run time was ~2hrs 45mins:

![Screencap showing runtime of 02:43:56](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20230426-pgen-fastqc-fastp-multiqc-RNAseq-runtime.png?raw=true)

Overall, trimming looks satisfactory. Will proceed with next step(s) - [`HISAT2`](https://daehwankimlab.github.io/hisat2/) alignment, [`StringTie`](https://ccb.jhu.edu/software/stringtie/) analysis, and gffcompare.

---

Output folder:

- [20230426-pgen-fastqc-fastp-multiqc-RNAseq/](https://gannet.fish.washington.edu/Atumefaciens/20230426-pgen-fastqc-fastp-multiqc-RNAseq/)

  #### MultiQC Report (HTML)

  - [20230426-pgen-fastqc-fastp-multiqc-RNAseq/multiqc_report.html](https://gannet.fish.washington.edu/Atumefaciens/20230426-pgen-fastqc-fastp-multiqc-RNAseq/multiqc_report.html)

  #### Trimmed FastQs

  - [ctenidia_1_val_1.fastp-trim.20230426.fq.gz](https://gannet.fish.washington.edu/Atumefaciens/20230426-pgen-fastqc-fastp-multiqc-RNAseq/ctenidia_1_val_1.fastp-trim.20230426.fq.gz) (3.8G)

    - MD5: `d8dfc9356937726c87f8a9e0cccc54f7`

  - [ctenidia_2_val_2.fastp-trim.20230426.fq.gz](https://gannet.fish.washington.edu/Atumefaciens/20230426-pgen-fastqc-fastp-multiqc-RNAseq/ctenidia_2_val_2.fastp-trim.20230426.fq.gz) (4.0G)

    - MD5: `c922ade826ad86785f5fab83d14402bf`

  - [gonad_1_val_1.fastp-trim.20230426.fq.gz](https://gannet.fish.washington.edu/Atumefaciens/20230426-pgen-fastqc-fastp-multiqc-RNAseq/gonad_1_val_1.fastp-trim.20230426.fq.gz) (4.1G)

    - MD5: `ba2fe679cb38f69678a92ec30558003b`

  - [gonad_2_val_2.fastp-trim.20230426.fq.gz](https://gannet.fish.washington.edu/Atumefaciens/20230426-pgen-fastqc-fastp-multiqc-RNAseq/gonad_2_val_2.fastp-trim.20230426.fq.gz) (4.4G)

    - MD5: `9628a62060e456e9e3a9522580306cd4`

  - [heart_1_val_1.fastp-trim.20230426.fq.gz](https://gannet.fish.washington.edu/Atumefaciens/20230426-pgen-fastqc-fastp-multiqc-RNAseq/heart_1_val_1.fastp-trim.20230426.fq.gz) (7.2G)

    - MD5: `7a0752fe1d366cd6e9dbd16288c9785d`

  - [heart_2_val_2.fastp-trim.20230426.fq.gz](https://gannet.fish.washington.edu/Atumefaciens/20230426-pgen-fastqc-fastp-multiqc-RNAseq/heart_2_val_2.fastp-trim.20230426.fq.gz) (7.1G)

    - MD5: `34519d234870317a953939f1742f5ea2`

  - [juvenile_1_val_1.fastp-trim.20230426.fq.gz](https://gannet.fish.washington.edu/Atumefaciens/20230426-pgen-fastqc-fastp-multiqc-RNAseq/juvenile_1_val_1.fastp-trim.20230426.fq.gz) (23G)

    - MD5: `974d17adfd390c6dba379c06d3d5b637`

  - [juvenile_2_val_2.fastp-trim.20230426.fq.gz](https://gannet.fish.washington.edu/Atumefaciens/20230426-pgen-fastqc-fastp-multiqc-RNAseq/juvenile_2_val_2.fastp-trim.20230426.fq.gz) (23G)

    - MD5: `6740608da69688b7d8e61d4c5ae431e1`

  - [larvae_1_val_1.fastp-trim.20230426.fq.gz](https://gannet.fish.washington.edu/Atumefaciens/20230426-pgen-fastqc-fastp-multiqc-RNAseq/larvae_1_val_1.fastp-trim.20230426.fq.gz) (4.7G)

    - MD5: `97c5715a26ade72efb305b9496bd1cec`

  - [larvae_2_val_2.fastp-trim.20230426.fq.gz](https://gannet.fish.washington.edu/Atumefaciens/20230426-pgen-fastqc-fastp-multiqc-RNAseq/larvae_2_val_2.fastp-trim.20230426.fq.gz) (5.0G)

    - MD5: `49217a766a3f835f53b93441b890fa5e`