Introduction
FastQC/MultiQC assessment of raw and fastp
-trimmed sequences of E5 P.meandrina sRNAseq data from 20230515.
A max length of 31bp is based on the fastp
insert peak size from previous trimming tests based on the the adapter and polyG trimming results, and previous evaluation of mean read lengths via FastQC
and MultiQC
.
Inputs:
Outputs:
FastQC
HTML reports for raw and trimmed reads.
MultiQC
HTML summaries of FastQC
for raw and trimmed reads.
Trimmed and merged reads with final length of 31bp: *fastp-adapters-polyG-31bp-merged.fq.gz
Libraries were prepared and sequenced by Azenta:
Create a Bash variables file
This allows usage of Bash variables across R Markdown chunks.
{
echo "#### Assign Variables ####"
echo ""
echo "# Data directories"
echo 'export deep_dive_dir=/home/shared/8TB_HDD_01/sam/gitrepos/deep-dive'
echo 'export output_dir_top=${deep_dive_dir}/F-Pmea/output/08.2-Pmea-sRNAseq-trimming-31bp-fastp-merged'
echo 'export raw_reads_dir=${deep_dive_dir}/F-Pmea/data/sRNAseq-raw_reads'
echo 'export raw_fastqc_dir=${raw_reads_dir}/raw-fastqc'
echo 'export raw_reads_url="https://owl.fish.washington.edu/nightingales/P_meandrina/30-852430235/"'
echo 'export trimmed_fastqc_dir=${output_dir_top}/trimmed-fastqc'
echo 'export trimmed_reads_dir=${output_dir_top}/trimmed-reads'
echo ""
echo "# Paths to programs"
echo 'export fastqc=/home/shared/FastQC-0.12.1/fastqc'
echo 'export fastp=/home/shared/fastp'
echo 'export multiqc=/home/sam/programs/mambaforge/bin/multiqc'
echo ""
echo "# Set FastQ filename patterns"
echo "export fastq_pattern='sRNA*'"
echo "export R1_fastq_pattern='*_R1_*.fastq.gz'"
echo "export R2_fastq_pattern='*_R2_*.fastq.gz'"
echo ""
echo "# Set number of CPUs to use"
echo 'export threads=40'
echo ""
echo "# Set maximum read length"
echo 'export max_read_length=31'
echo ""
echo "# Input/output files"
echo 'export fastq_checksums=input_fastq_checksums.md5'
echo 'export trimmed_checksums=trimmed_fastq_checksums.md5'
echo 'export NEB_adapters_fasta=NEB-adapters.fasta'
echo ""
echo "## NEB nebnext-small-rna-library-prep-set-for-illumina adapters"
echo 'export first_adapter="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"'
echo 'export second_adapter="GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT"'
echo ""
echo "## Inititalize arrays"
echo 'export fastq_array_R1=()'
echo 'export fastq_array_R2=()'
echo 'export raw_fastqs_array=()'
echo 'export R1_names_array=()'
echo 'export R2_names_array=()'
echo 'export trimmed_fastqs_array=()'
echo ""
echo "# Programs associative array"
echo "declare -A programs_array"
echo "programs_array=("
echo '[fastqc]="${fastqc}" \'
echo '[multiqc]="${multiqc}" \'
echo '[fastp]="${fastp}"'
echo ")"
} > .bashvars
cat .bashvars
#### Assign Variables ####
# Data directories
export deep_dive_dir=/home/shared/8TB_HDD_01/sam/gitrepos/deep-dive
export output_dir_top=${deep_dive_dir}/F-Pmea/output/08.2-Pmea-sRNAseq-trimming-31bp-fastp-merged
export raw_reads_dir=${deep_dive_dir}/F-Pmea/data/sRNAseq-raw_reads
export raw_fastqc_dir=${raw_reads_dir}/raw-fastqc
export raw_reads_url="https://owl.fish.washington.edu/nightingales/P_meandrina/30-852430235/"
export trimmed_fastqc_dir=${output_dir_top}/trimmed-fastqc
export trimmed_reads_dir=${output_dir_top}/trimmed-reads
# Paths to programs
export fastqc=/home/shared/FastQC-0.12.1/fastqc
export fastp=/home/shared/fastp
export multiqc=/home/sam/programs/mambaforge/bin/multiqc
# Set FastQ filename patterns
export fastq_pattern='sRNA*'
export R1_fastq_pattern='*_R1_*.fastq.gz'
export R2_fastq_pattern='*_R2_*.fastq.gz'
# Set number of CPUs to use
export threads=40
# Set maximum read length
export max_read_length=31
# Input/output files
export fastq_checksums=input_fastq_checksums.md5
export trimmed_checksums=trimmed_fastq_checksums.md5
export NEB_adapters_fasta=NEB-adapters.fasta
## NEB nebnext-small-rna-library-prep-set-for-illumina adapters
export first_adapter="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
export second_adapter="GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT"
## Inititalize arrays
export fastq_array_R1=()
export fastq_array_R2=()
export raw_fastqs_array=()
export R1_names_array=()
export R2_names_array=()
export trimmed_fastqs_array=()
# Programs associative array
declare -A programs_array
programs_array=(
[fastqc]="${fastqc}" \
[multiqc]="${multiqc}" \
[fastp]="${fastp}"
)
Download raw sRNAseq reads
Reads are downloaded from https://owl.fish.washington.edu/nightingales/A_pulchra/30-852430235/
The --cut-dirs 3
command cuts the preceding directory structure (i.e. nightingales/A_pulchra/30-852430235/
) so that we just end up with the reads.
# Load bash variables into memory
source .bashvars
wget \
--directory-prefix ${raw_reads_dir} \
--recursive \
--no-check-certificate \
--continue \
--cut-dirs 3 \
--no-host-directories \
--no-parent \
--quiet \
--accept "${fastq_pattern},checksums.md5" ${raw_reads_url}
ls -lh "${raw_reads_dir}"/${fastq_pattern}
-rw-r--r-- 1 sam sam 347K Feb 16 08:40 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-47-S1-TP2-fastp-adapters-polyG-31bp-merged.html
-rw-r--r-- 1 sam sam 89K Feb 16 08:40 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-47-S1-TP2-fastp-adapters-polyG-31bp-merged.json
-rw-r--r-- 1 sam sam 225K Feb 15 14:38 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-47-S1-TP2_R1_001.fastp-R1-31bp-auto_adapters-polyG.html
-rw-r--r-- 1 sam sam 55K Feb 15 14:38 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-47-S1-TP2_R1_001.fastp-R1-31bp-auto_adapters-polyG.json
-rw-r--r-- 1 sam sam 1.2G May 17 2023 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-47-S1-TP2_R1_001.fastq.gz
-rw-r--r-- 1 sam sam 1.3G May 17 2023 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-47-S1-TP2_R2_001.fastq.gz
-rw-r--r-- 1 sam sam 347K Feb 16 08:41 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-48-S1-TP2-fastp-adapters-polyG-31bp-merged.html
-rw-r--r-- 1 sam sam 89K Feb 16 08:41 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-48-S1-TP2-fastp-adapters-polyG-31bp-merged.json
-rw-r--r-- 1 sam sam 225K Feb 15 14:39 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-48-S1-TP2_R1_001.fastp-R1-31bp-auto_adapters-polyG.html
-rw-r--r-- 1 sam sam 55K Feb 15 14:39 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-48-S1-TP2_R1_001.fastp-R1-31bp-auto_adapters-polyG.json
-rw-r--r-- 1 sam sam 903M May 17 2023 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-48-S1-TP2_R1_001.fastq.gz
-rw-r--r-- 1 sam sam 1.1G May 17 2023 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-48-S1-TP2_R2_001.fastq.gz
-rw-r--r-- 1 sam sam 348K Feb 16 08:42 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-50-S1-TP2-fastp-adapters-polyG-31bp-merged.html
-rw-r--r-- 1 sam sam 89K Feb 16 08:42 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-50-S1-TP2-fastp-adapters-polyG-31bp-merged.json
-rw-r--r-- 1 sam sam 225K Feb 15 14:39 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-50-S1-TP2_R1_001.fastp-R1-31bp-auto_adapters-polyG.html
-rw-r--r-- 1 sam sam 55K Feb 15 14:39 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-50-S1-TP2_R1_001.fastp-R1-31bp-auto_adapters-polyG.json
-rw-r--r-- 1 sam sam 1.1G May 17 2023 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-50-S1-TP2_R1_001.fastq.gz
-rw-r--r-- 1 sam sam 1.2G May 17 2023 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-50-S1-TP2_R2_001.fastq.gz
-rw-r--r-- 1 sam sam 347K Feb 16 08:43 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-53-S1-TP2-fastp-adapters-polyG-31bp-merged.html
-rw-r--r-- 1 sam sam 89K Feb 16 08:43 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-53-S1-TP2-fastp-adapters-polyG-31bp-merged.json
-rw-r--r-- 1 sam sam 225K Feb 15 14:40 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-53-S1-TP2_R1_001.fastp-R1-31bp-auto_adapters-polyG.html
-rw-r--r-- 1 sam sam 55K Feb 15 14:40 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-53-S1-TP2_R1_001.fastp-R1-31bp-auto_adapters-polyG.json
-rw-r--r-- 1 sam sam 1.1G May 17 2023 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-53-S1-TP2_R1_001.fastq.gz
-rw-r--r-- 1 sam sam 1.2G May 17 2023 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-53-S1-TP2_R2_001.fastq.gz
-rw-r--r-- 1 sam sam 347K Feb 16 08:44 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-57-S1-TP2-fastp-adapters-polyG-31bp-merged.html
-rw-r--r-- 1 sam sam 89K Feb 16 08:44 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-57-S1-TP2-fastp-adapters-polyG-31bp-merged.json
-rw-r--r-- 1 sam sam 225K Feb 15 14:41 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-57-S1-TP2_R1_001.fastp-R1-31bp-auto_adapters-polyG.html
-rw-r--r-- 1 sam sam 55K Feb 15 14:41 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-57-S1-TP2_R1_001.fastp-R1-31bp-auto_adapters-polyG.json
-rw-r--r-- 1 sam sam 1.1G May 17 2023 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-57-S1-TP2_R1_001.fastq.gz
-rw-r--r-- 1 sam sam 1.2G May 17 2023 /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/data/sRNAseq-raw_reads/sRNA-POC-57-S1-TP2_R2_001.fastq.gz
Verify raw read checksums
# Load bash variables into memory
source .bashvars
cd "${raw_reads_dir}"
# Checksums file contains other files, so this just looks for the sRNAseq files.
grep "sRNA" checksums.md5 | md5sum --check
sRNA-POC-47-S1-TP2_R1_001.fastq.gz: OK
sRNA-POC-47-S1-TP2_R2_001.fastq.gz: OK
sRNA-POC-48-S1-TP2_R1_001.fastq.gz: OK
sRNA-POC-48-S1-TP2_R2_001.fastq.gz: OK
sRNA-POC-50-S1-TP2_R1_001.fastq.gz: OK
sRNA-POC-50-S1-TP2_R2_001.fastq.gz: OK
sRNA-POC-53-S1-TP2_R1_001.fastq.gz: OK
sRNA-POC-53-S1-TP2_R2_001.fastq.gz: OK
sRNA-POC-57-S1-TP2_R1_001.fastq.gz: OK
sRNA-POC-57-S1-TP2_R2_001.fastq.gz: OK
Create adapters FastA for use with fastp
trimming
# Load bash variables into memory
source .bashvars
mkdir --parents "${output_dir_top}"
echo "Creating adapters FastA."
echo ""
adapter_count=0
# Check for adapters file first
# Then create adapters file if doesn't exist
if [ -f "${output_dir_top}/${NEB_adapters_fasta}" ]; then
echo "${output_dir_top}/${NEB_adapters_fasta} already exists. Nothing to do."
else
for adapter in "${first_adapter}" "${second_adapter}"
do
adapter_count=$((adapter_count + 1))
printf ">%s\n%s\n" "adapter_${adapter_count}" "${adapter}"
done >> "${output_dir_top}/${NEB_adapters_fasta}"
fi
echo ""
echo "Adapters FastA:"
echo ""
cat "${output_dir_top}/${NEB_adapters_fasta}"
echo ""
Creating adapters FastA.
/home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/F-Pmea/output/08.2-Pmea-sRNAseq-trimming-31bp-fastp-merged/NEB-adapters.fasta already exists. Nothing to do.
Adapters FastA:
>adapter_1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
>adapter_2
GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT
FastQC/MultiQC on raw reads
# Load bash variables into memory
source .bashvars
############ RUN FASTQC ############
# Create array of trimmed FastQs
raw_fastqs_array=(${raw_reads_dir}/${fastq_pattern})
# Pass array contents to new variable as space-delimited list
raw_fastqc_list=$(echo "${raw_fastqs_array[*]}")
echo "Beginning FastQC on raw reads..."
echo ""
# Run FastQC
### NOTE: Do NOT quote raw_fastqc_list
${programs_array[fastqc]} \
--threads ${threads} \
--outdir ${raw_fastqc_dir} \
--quiet \
${raw_fastqc_list}
echo "FastQC on raw reads complete!"
echo ""
############ END FASTQC ############
############ RUN MULTIQC ############
echo "Beginning MultiQC on raw FastQC..."
echo ""
${programs_array[multiqc]} ${raw_fastqc_dir} -o ${raw_fastqc_dir}
echo ""
echo "MultiQC on raw FastQs complete."
echo ""
############ END MULTIQC ############
echo "Removing FastQC zip files."
echo ""
rm ${raw_fastqc_dir}/*.zip
echo "FastQC zip files removed."
echo ""
# View directory contents
ls -lh ${raw_fastqc_dir}
Trimming with fastp
# Load bash variables into memory
source .bashvars
# Create output directory, if it doesn't exist.
mkdir --parents "${trimmed_reads_dir}"
# Change to directory with raw reads
cd "${raw_reads_dir}"
# Create arrays of FastQ R1 files and sample names
# Do NOT quote R1_fastq_pattern variable
for fastq in ${R1_fastq_pattern}
do
fastq_array_R1+=("${fastq}")
# Use parameter substitution to remove all text up to and including last "." from
# right side of string.
R1_names_array+=("${fastq%%.*}")
done
# Create array of FastQ R2 files
# Do NOT quote R2_fastq_pattern variable
for fastq in ${R2_fastq_pattern}
do
fastq_array_R2+=("${fastq}")
# Use parameter substitution to remove all text up to and including last "." from
# right side of string.
R2_names_array+=("${fastq%%.*}")
done
############ RUN FASTP ############
# Uses parameter substitution (e.g. ${R1_sample_name%%_*})to rm the _R[12]
# Run fastp on files
echo "Beginning fastp trimming."
echo ""
time \
for index in "${!fastq_array_R1[@]}"
do
# Get sample name
R1_sample_name="${R1_names_array[index]%%_*}"
# Set fastp report title
trimmed_fastp_report_title="${R1_sample_name}-fastp-adapters-polyG-${max_read_length}bp-merged"
# Set merged output location and filename
merged_output=${trimmed_reads_dir}/${trimmed_fastp_report_title}.fq.gz
# Begin fastp trimming
${programs_array[fastp]} \
--in1 ${fastq_array_R1[index]} \
--in2 ${fastq_array_R2[index]} \
--adapter_fasta ${output_dir_top}/${NEB_adapters_fasta} \
--trim_poly_g \
--overlap_len_require 17 \
--length_limit ${max_read_length} \
--merge \
--merged_out ${merged_output} \
--thread ${threads} \
--html "${trimmed_fastp_report_title}.html" \
--json "${trimmed_fastp_report_title}.json" \
--report_title "${trimmed_fastp_report_title}"
# Move to trimmed directory
# This is done so checksums file doesn't include excess path
cd ${trimmed_reads_dir}
# Generate md5 checksums for newly trimmed files
{
md5sum "${trimmed_fastp_report_title}.fq.gz"
} >> "${trimmed_checksums}"
# Change back to to raw reads directory
cd "${raw_reads_dir}"
done
echo ""
echo "fastp trimming complete."
echo ""
echo "Trimmed FastQs MD5 checksums:"
echo ""
cat "${trimmed_reads_dir}/${trimmed_checksums}"
############ END fastp ############
FastQC/MultiQC on trimmed reads
# Load bash variables into memory
source .bashvars
# Create output directory, if it doesn't exist.
mkdir --parents "${trimmed_fastqc_dir}"
############ RUN FASTQC ############
### NOTE: Do NOT quote raw_fastqc_list
# Create array of trimmed FastQs
trimmed_fastqs_array=(${trimmed_reads_dir}/${fastq_pattern})
# Pass array contents to new variable as space-delimited list
trimmed_fastqc_list=$(echo "${trimmed_fastqs_array[*]}")
echo "Beginning FastQC on raw reads..."
echo ""
# Run FastQC
${programs_array[fastqc]} \
--threads ${threads} \
--outdir ${trimmed_fastqc_dir} \
--quiet \
${trimmed_fastqc_list}
echo "FastQC on trimmed reads complete!"
echo ""
############ END FASTQC ############
############ RUN MULTIQC ############
echo "Beginning MultiQC on raw FastQC..."
echo ""
${programs_array[multiqc]} ${trimmed_fastqc_dir} -o ${trimmed_fastqc_dir}
echo ""
echo "MultiQC on trimmed FastQs complete."
echo ""
############ END MULTIQC ############
echo "Removing FastQC zip files."
echo ""
rm ${trimmed_fastqc_dir}/*.zip
echo "FastQC zip files removed."
echo ""
# View directory contents
ls -lh ${trimmed_fastqc_dir}
Summary
A quick comparison of raw and trimmed reads to show trimming worked:
- quality is improved
- length is 31bp
- adapters removed
Citations
---
title: "08.2-Pmea-sRNAseq-trimming-31bp-fastp-merged"
author: "Sam White"
date: "2024-02-15"
output: 
  bookdown::html_document2:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
  github_document:
    toc: true
    number_sections: true
  html_document:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
link-citations: true
---
```{r setup, include=FALSE}
library(knitr)
library(kableExtra)
library(dplyr)
library(reticulate)
knitr::opts_chunk$set(
  echo = TRUE,         # Display code chunks
  eval = FALSE,        # Evaluate code chunks
  warning = FALSE,     # Hide warnings
  message = FALSE,     # Hide messages
  comment = ""         # Prevents appending '##' to beginning of lines in code output
)
```

# Introduction

FastQC/MultiQC assessment of raw and [`fastp`](https://github.com/OpenGene/fastp)-trimmed sequences of E5 *P.meandrina* sRNAseq data from [20230515](https://robertslab.github.io/sams-notebook/posts/2023/2023-05-17-Data-Management---E5-Coral-RNA-seq-and-sRNA-seq-Reorganizing-and-Renaming/).

A max length of 31bp is based on the `fastp` insert peak size from previous trimming tests based on the the adapter and polyG trimming results, and previous evaluation of mean read lengths via [`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and [`MultiQC`](https://multiqc.info/).

Inputs:

-   sRNAseq reads.

Outputs:

-   [`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) HTML reports for raw and trimmed reads.

-   [`MultiQC`](https://multiqc.info/) HTML summaries of [`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) for raw and trimmed reads.

-   Trimmed and merged reads with final length of 31bp: `*fastp-adapters-polyG-31bp-merged.fq.gz`

Libraries were prepared and sequenced by Azenta:

-   Library prep: [NEB nebnext-small-rna-library-prep-set-for-illumina kit](https://www.neb.com/en-us/-/media/nebus/files/manuals/manuale7300_e7330_e7560_e7580.pdf?rev=d0964a2e637843b1afcb9f7d666d07b2&hash=7AC0B0EB012708EFAB0E4DBEEAF1446A) (PDF)

-   Sequencing: Illumina HiSeq 4000, 150bp PE

------------------------------------------------------------------------

# Create a Bash variables file

This allows usage of Bash variables across R Markdown chunks.

```{r save-bash-variables-to-rvars-file, engine='bash', eval=TRUE}
{
echo "#### Assign Variables ####"
echo ""

echo "# Data directories"
echo 'export deep_dive_dir=/home/shared/8TB_HDD_01/sam/gitrepos/deep-dive'
echo 'export output_dir_top=${deep_dive_dir}/F-Pmea/output/08.2-Pmea-sRNAseq-trimming-31bp-fastp-merged'

echo 'export raw_reads_dir=${deep_dive_dir}/F-Pmea/data/sRNAseq-raw_reads'
echo 'export raw_fastqc_dir=${raw_reads_dir}/raw-fastqc'
echo 'export raw_reads_url="https://owl.fish.washington.edu/nightingales/P_meandrina/30-852430235/"'
echo 'export trimmed_fastqc_dir=${output_dir_top}/trimmed-fastqc'
echo 'export trimmed_reads_dir=${output_dir_top}/trimmed-reads'
echo ""

echo "# Paths to programs"
echo 'export fastqc=/home/shared/FastQC-0.12.1/fastqc'
echo 'export fastp=/home/shared/fastp'
echo 'export multiqc=/home/sam/programs/mambaforge/bin/multiqc'
echo ""



echo "# Set FastQ filename patterns"
echo "export fastq_pattern='sRNA*'"
echo "export R1_fastq_pattern='*_R1_*.fastq.gz'"
echo "export R2_fastq_pattern='*_R2_*.fastq.gz'"
echo ""

echo "# Set number of CPUs to use"
echo 'export threads=40'
echo ""

echo "# Set maximum read length"
echo 'export max_read_length=31'
echo ""

echo "# Input/output files"
echo 'export fastq_checksums=input_fastq_checksums.md5'
echo 'export trimmed_checksums=trimmed_fastq_checksums.md5'
echo 'export NEB_adapters_fasta=NEB-adapters.fasta'
echo ""

echo "## NEB nebnext-small-rna-library-prep-set-for-illumina adapters"
echo 'export first_adapter="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"'
echo 'export second_adapter="GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT"'
echo ""

echo "## Inititalize arrays"
echo 'export fastq_array_R1=()'
echo 'export fastq_array_R2=()'
echo 'export raw_fastqs_array=()'
echo 'export R1_names_array=()'
echo 'export R2_names_array=()'
echo 'export trimmed_fastqs_array=()'
echo ""

echo "# Programs associative array"
echo "declare -A programs_array"
echo "programs_array=("
echo '[fastqc]="${fastqc}" \'
echo '[multiqc]="${multiqc}" \'
echo '[fastp]="${fastp}"'
echo ")"
} > .bashvars

cat .bashvars
```

# Download raw sRNAseq reads

Reads are downloaded from <https://owl.fish.washington.edu/nightingales/A_pulchra/30-852430235/>

The `--cut-dirs 3` command cuts the preceding directory structure (i.e. `nightingales/A_pulchra/30-852430235/`) so that we just end up with the reads.

```{bash download-raw-reads, engine='bash', eval=TRUE}
# Load bash variables into memory
source .bashvars

wget \
--directory-prefix ${raw_reads_dir} \
--recursive \
--no-check-certificate \
--continue \
--cut-dirs 3 \
--no-host-directories \
--no-parent \
--quiet \
--accept "${fastq_pattern},checksums.md5" ${raw_reads_url}

ls -lh "${raw_reads_dir}"/${fastq_pattern}
```

## Verify raw read checksums

```{bash verify-raw-read-checksums, engine='bash', eval=TRUE}
# Load bash variables into memory
source .bashvars

cd "${raw_reads_dir}"

# Checksums file contains other files, so this just looks for the sRNAseq files.
grep "sRNA" checksums.md5 | md5sum --check
```

# Create adapters FastA for use with [`fastp`](https://github.com/OpenGene/fastp) trimming

```{bash create-FastA-of-adapters, engine='bash', eval=TRUE}
# Load bash variables into memory
source .bashvars

mkdir --parents "${output_dir_top}"

echo "Creating adapters FastA."
echo ""
adapter_count=0

# Check for adapters file first
# Then create adapters file if doesn't exist
if [ -f "${output_dir_top}/${NEB_adapters_fasta}" ]; then
  echo "${output_dir_top}/${NEB_adapters_fasta} already exists. Nothing to do."
else
  for adapter in "${first_adapter}" "${second_adapter}"
  do
    adapter_count=$((adapter_count + 1))
    printf ">%s\n%s\n" "adapter_${adapter_count}" "${adapter}"
  done >> "${output_dir_top}/${NEB_adapters_fasta}"
fi

echo ""
echo "Adapters FastA:"
echo ""
cat "${output_dir_top}/${NEB_adapters_fasta}"
echo ""
```

# FastQC/MultiQC on raw reads

```{bash raw-fastqc-multiqc, engine='bash', eval=FALSE}
# Load bash variables into memory
source .bashvars

############ RUN FASTQC ############


# Create array of trimmed FastQs
raw_fastqs_array=(${raw_reads_dir}/${fastq_pattern})

# Pass array contents to new variable as space-delimited list
raw_fastqc_list=$(echo "${raw_fastqs_array[*]}")

echo "Beginning FastQC on raw reads..."
echo ""

# Run FastQC
### NOTE: Do NOT quote raw_fastqc_list
${programs_array[fastqc]} \
--threads ${threads} \
--outdir ${raw_fastqc_dir} \
--quiet \
${raw_fastqc_list}

echo "FastQC on raw reads complete!"
echo ""

############ END FASTQC ############

############ RUN MULTIQC ############
echo "Beginning MultiQC on raw FastQC..."
echo ""

${programs_array[multiqc]} ${raw_fastqc_dir} -o ${raw_fastqc_dir}

echo ""
echo "MultiQC on raw FastQs complete."
echo ""

############ END MULTIQC ############

echo "Removing FastQC zip files."
echo ""
rm ${raw_fastqc_dir}/*.zip
echo "FastQC zip files removed."
echo ""

# View directory contents
ls -lh ${raw_fastqc_dir}

```

# Trimming with fastp

```{bash fastp-trimming, engine='bash', cache=TRUE}
# Load bash variables into memory
source .bashvars

# Create output directory, if it doesn't exist.
mkdir --parents "${trimmed_reads_dir}"

# Change to directory with raw reads
cd "${raw_reads_dir}"

# Create arrays of FastQ R1 files and sample names
# Do NOT quote R1_fastq_pattern variable
for fastq in ${R1_fastq_pattern}
do
  fastq_array_R1+=("${fastq}")

  # Use parameter substitution to remove all text up to and including last "." from
  # right side of string.
  R1_names_array+=("${fastq%%.*}")
done


# Create array of FastQ R2 files
# Do NOT quote R2_fastq_pattern variable
for fastq in ${R2_fastq_pattern}
do
  fastq_array_R2+=("${fastq}")

  # Use parameter substitution to remove all text up to and including last "." from
  # right side of string.
  R2_names_array+=("${fastq%%.*}")
done


############ RUN FASTP ############
# Uses parameter substitution (e.g. ${R1_sample_name%%_*})to rm the _R[12]

# Run fastp on files
echo "Beginning fastp trimming."
echo ""

time \
for index in "${!fastq_array_R1[@]}"
do
  # Get sample name
  R1_sample_name="${R1_names_array[index]%%_*}"

  # Set fastp report title
  trimmed_fastp_report_title="${R1_sample_name}-fastp-adapters-polyG-${max_read_length}bp-merged"
  
  # Set merged output location and filename
  merged_output=${trimmed_reads_dir}/${trimmed_fastp_report_title}.fq.gz

  # Begin fastp trimming
  ${programs_array[fastp]} \
  --in1 ${fastq_array_R1[index]} \
  --in2 ${fastq_array_R2[index]} \
  --adapter_fasta ${output_dir_top}/${NEB_adapters_fasta} \
  --trim_poly_g \
  --overlap_len_require 17 \
  --length_limit ${max_read_length} \
  --merge \
  --merged_out ${merged_output} \
  --thread ${threads} \
  --html "${trimmed_fastp_report_title}.html" \
  --json "${trimmed_fastp_report_title}.json" \
  --report_title "${trimmed_fastp_report_title}"
    
  # Move to trimmed directory
  # This is done so checksums file doesn't include excess path
  cd ${trimmed_reads_dir}

  # Generate md5 checksums for newly trimmed files
  {
    md5sum "${trimmed_fastp_report_title}.fq.gz"
  } >> "${trimmed_checksums}"
  
  # Change back to to raw reads directory
  cd "${raw_reads_dir}"

done

echo ""
echo "fastp trimming complete."
echo ""

echo "Trimmed FastQs MD5 checksums:"
echo ""

cat "${trimmed_reads_dir}/${trimmed_checksums}"

############ END fastp ############

```

# FastQC/MultiQC on trimmed reads

```{bash FastQC-MultiQC-trimmed-reads, engine='bash', eval=FALSE}
# Load bash variables into memory
source .bashvars

# Create output directory, if it doesn't exist.
mkdir --parents "${trimmed_fastqc_dir}"

############ RUN FASTQC ############

### NOTE: Do NOT quote raw_fastqc_list
# Create array of trimmed FastQs
trimmed_fastqs_array=(${trimmed_reads_dir}/${fastq_pattern})

# Pass array contents to new variable as space-delimited list
trimmed_fastqc_list=$(echo "${trimmed_fastqs_array[*]}")

echo "Beginning FastQC on raw reads..."
echo ""

# Run FastQC
${programs_array[fastqc]} \
--threads ${threads} \
--outdir ${trimmed_fastqc_dir} \
--quiet \
${trimmed_fastqc_list}

echo "FastQC on trimmed reads complete!"
echo ""

############ END FASTQC ############

############ RUN MULTIQC ############
echo "Beginning MultiQC on raw FastQC..."
echo ""

${programs_array[multiqc]} ${trimmed_fastqc_dir} -o ${trimmed_fastqc_dir}

echo ""
echo "MultiQC on trimmed FastQs complete."
echo ""

############ END MULTIQC ############

echo "Removing FastQC zip files."
echo ""
rm ${trimmed_fastqc_dir}/*.zip
echo "FastQC zip files removed."
echo ""

# View directory contents
ls -lh ${trimmed_fastqc_dir}

```

# Summary

A quick comparison of raw and trimmed reads to show trimming worked:

-   quality is improved
-   length is 31bp
-   adapters removed

| RAW                                                                                                                                                                                                | TRIMMED                                                                                                                                                                                                    |
|-----------------------------------|-------------------------------------|
| ![Raw MultiQC per base sequence quality plot](https://github.com/urol-e5/deep-dive/blob/main/F-Pmea/data/sRNAseq-raw_reads/raw-fastqc/fastqc_per_base_sequence_quality_plot.png?raw=true) | ![Trimmed MultiQC per base sequence quality plot](https://github.com/urol-e5/deep-dive/blob/main/D-Apul/output/08.2-Dapul-sRNAseq-trimming-31bp-fastp-merged/trimmed-fastqc/fastqc_per_base_sequence_quality_plot.png?raw=true) |
| ![Raw MultiQC adapter content plot](https://github.com/urol-e5/deep-dive/blob/main/F-Pmea/data/sRNAseq-raw_reads/raw-fastqc/fastqc_adapter_content_plot.png?raw=true)                     | ![Trimmed MultiQC adapter content plot](https://github.com/urol-e5/deep-dive/blob/main/D-Apul/output/08.2-Dapul-sRNAseq-trimming-31bp-fastp-merged/trimmed-fastqc/fastqc_adapter_content_plot.png?raw=true)                     |

# Citations