FastQC/MultiQC assessment of raw and flexbar-trimmed sequences of E5 A.pulchra sRNAseq data from 20230515.
Inputs:
*.fastq.gz
)Outputs:
FastQC
HTML reports for raw and trimmed reads.
Trimmed reads with final length of 25bp:
*flexbar_trim.25bp.fastq.gz
Libraries were prepared and sequenced by Azenta:
Library prep: NEB nebnext-small-rna-library-prep-set-for-illumina kit (PDF)
Sequencing: Illumina HiSeq 4000, 150bp PE
This allows usage of Bash variables across R Markdown chunks.
{
echo "#### Assign Variables ####"
echo ""
echo "# Data directories"
echo 'export deep_dive_dir=/home/shared/8TB_HDD_01/sam/gitrepos/deep-dive'
echo 'export output_dir_top=${deep_dive_dir}/D-Apul/output/08-Apul-sRNAseq-trimming'
echo 'export raw_fastqc_dir=${deep_dive_dir}/D-Apul/output/08-Apul-sRNAseq-trimming/raw-fastqc'
echo 'export raw_reads_dir=${deep_dive_dir}/D-Apul/data/08-Apul-sRNAseq-trimming/raw-reads'
echo 'export raw_reads_url="https://owl.fish.washington.edu/nightingales/A_pulchra/30-852430235/"'
echo 'export trimmed_fastqc_dir=${deep_dive_dir}/D-Apul/output/08-Apul-sRNAseq-trimming/trimmed-fastqc'
echo 'export trimmed_reads_dir=${deep_dive_dir}/D-Apul/output/08-Apul-sRNAseq-trimming/trimmed-reads'
echo ""
echo "# Paths to programs"
echo 'export fastqc=/home/shared/FastQC-0.12.1/fastqc'
echo 'export flexbar=/home/shared/flexbar-3.5.0-linux/flexbar'
echo 'export multiqc=/home/sam/programs/mambaforge/bin/multiqc'
echo ""
echo "# Set FastQ filename patterns"
echo "export fastq_pattern='*.fastq.gz'"
echo "export R1_fastq_pattern='*_R1_*.fastq.gz'"
echo "export R2_fastq_pattern='*_R2_*.fastq.gz'"
echo ""
echo "# Set number of CPUs to use"
echo 'export threads=40'
echo ""
echo "# Set maximum read length"
echo 'export max_read_length=25'
echo ""
echo "# Input/output files"
echo 'export fastq_checksums=input_fastq_checksums.md5'
echo 'export trimmed_checksums=trimmed_fastq_checksums.md5'
echo 'export NEB_adapters_fasta=NEB-adapters.fasta'
echo ""
echo "## NEB nebnext-small-rna-library-prep-set-for-illumina adapters"
echo 'export first_adapter="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"'
echo 'export second_adapter="GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT"'
echo ""
echo "## Inititalize arrays"
echo 'export fastq_array_R1=()'
echo 'export fastq_array_R2=()'
echo 'export raw_fastqs_array=()'
echo 'export R1_names_array=()'
echo 'export R2_names_array=()'
echo 'export trimmed_fastqs_array=()'
echo ""
echo "# Programs associative array"
echo "declare -A programs_array"
echo "programs_array=("
echo '[fastqc]="${fastqc}" \'
echo '[multiqc]="${multiqc}" \'
echo '[flexbar]="${flexbar}"'
echo ")"
} > .bashvars
cat .bashvars
#### Assign Variables ####
# Data directories
export deep_dive_dir=/home/shared/8TB_HDD_01/sam/gitrepos/deep-dive
export output_dir_top=${deep_dive_dir}/D-Apul/output/08-Apul-sRNAseq-trimming
export raw_fastqc_dir=${deep_dive_dir}/D-Apul/output/08-Apul-sRNAseq-trimming/raw-fastqc
export raw_reads_dir=${deep_dive_dir}/D-Apul/data/08-Apul-sRNAseq-trimming/raw-reads
export raw_reads_url="https://owl.fish.washington.edu/nightingales/A_pulchra/30-852430235/"
export trimmed_fastqc_dir=${deep_dive_dir}/D-Apul/output/08-Apul-sRNAseq-trimming/trimmed-fastqc
export trimmed_reads_dir=${deep_dive_dir}/D-Apul/output/08-Apul-sRNAseq-trimming/trimmed-reads
# Paths to programs
export fastqc=/home/shared/FastQC-0.12.1/fastqc
export flexbar=/home/shared/flexbar-3.5.0-linux/flexbar
export multiqc=/home/sam/programs/mambaforge/bin/multiqc
# Set FastQ filename patterns
export fastq_pattern='*.fastq.gz'
export R1_fastq_pattern='*_R1_*.fastq.gz'
export R2_fastq_pattern='*_R2_*.fastq.gz'
# Set number of CPUs to use
export threads=40
# Set maximum read length
export max_read_length=25
# Input/output files
export fastq_checksums=input_fastq_checksums.md5
export trimmed_checksums=trimmed_fastq_checksums.md5
export NEB_adapters_fasta=NEB-adapters.fasta
## NEB nebnext-small-rna-library-prep-set-for-illumina adapters
export first_adapter="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
export second_adapter="GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT"
## Inititalize arrays
export fastq_array_R1=()
export fastq_array_R2=()
export raw_fastqs_array=()
export R1_names_array=()
export R2_names_array=()
export trimmed_fastqs_array=()
# Programs associative array
declare -A programs_array
programs_array=(
[fastqc]="${fastqc}" \
[multiqc]="${multiqc}" \
[flexbar]="${flexbar}"
)
Reads are downloaded from https://owl.fish.washington.edu/nightingales/A_pulchra/30-852430235/
The --cut-dirs 3
command cuts the preceding directory
structure (i.e. nightingales/A_pulchra/30-852430235/
) so
that we just end up with the reads.
# Load bash variables into memory
source .bashvars
wget \
--directory-prefix ${raw_reads_dir} \
--recursive \
--no-check-certificate \
--continue \
--cut-dirs 3 \
--no-host-directories \
--no-parent \
--quiet \
--accept "sRNA*,checksums.md5" ${raw_reads_url}
ls -lh "${raw_reads_dir}"
total 9.0G
-rw-r--r-- 1 sam sam 1.4K May 17 10:39 checksums.md5
-rw-r--r-- 1 sam sam 848M May 17 10:20 sRNA-ACR-140-S1-TP2_R1_001.fastq.gz
-rw-r--r-- 1 sam sam 892M May 17 10:21 sRNA-ACR-140-S1-TP2_R2_001.fastq.gz
-rw-r--r-- 1 sam sam 1019M May 17 10:22 sRNA-ACR-145-S1-TP2_R1_001.fastq.gz
-rw-r--r-- 1 sam sam 986M May 17 10:22 sRNA-ACR-145-S1-TP2_R2_001.fastq.gz
-rw-r--r-- 1 sam sam 911M May 17 10:23 sRNA-ACR-150-S1-TP2_R1_001.fastq.gz
-rw-r--r-- 1 sam sam 1.1G May 17 10:23 sRNA-ACR-150-S1-TP2_R2_001.fastq.gz
-rw-r--r-- 1 sam sam 902M May 17 10:24 sRNA-ACR-173-S1-TP2_R1_001.fastq.gz
-rw-r--r-- 1 sam sam 917M May 17 10:25 sRNA-ACR-173-S1-TP2_R2_001.fastq.gz
-rw-r--r-- 1 sam sam 803M May 17 10:25 sRNA-ACR-178-S1-TP2_R1_001.fastq.gz
-rw-r--r-- 1 sam sam 836M May 17 10:26 sRNA-ACR-178-S1-TP2_R2_001.fastq.gz
# Load bash variables into memory
source .bashvars
cd "${raw_reads_dir}"
# Checksums file contains other files, so this just looks for the sRNAseq files.
grep "sRNA" checksums.md5 | md5sum --check
sRNA-ACR-140-S1-TP2_R1_001.fastq.gz: OK
sRNA-ACR-140-S1-TP2_R2_001.fastq.gz: OK
sRNA-ACR-145-S1-TP2_R1_001.fastq.gz: OK
sRNA-ACR-145-S1-TP2_R2_001.fastq.gz: OK
sRNA-ACR-150-S1-TP2_R1_001.fastq.gz: OK
sRNA-ACR-150-S1-TP2_R2_001.fastq.gz: OK
sRNA-ACR-173-S1-TP2_R1_001.fastq.gz: OK
sRNA-ACR-173-S1-TP2_R2_001.fastq.gz: OK
sRNA-ACR-178-S1-TP2_R1_001.fastq.gz: OK
sRNA-ACR-178-S1-TP2_R2_001.fastq.gz: OK
# Load bash variables into memory
source .bashvars
echo "Creating adapters FastA."
echo ""
adapter_count=0
# Check for adapters file first
# Then create adapters file if doesn't exist
if [ -f "${output_dir_top}/${NEB_adapters_fasta}" ]; then
echo "${output_dir_top}/${NEB_adapters_fasta} already exists. Nothing to do."
else
for adapter in "${first_adapter}" "${second_adapter}"
do
adapter_count=$((adapter_count + 1))
printf ">%s\n%s\n" "adapter_${adapter_count}" "${adapter}"
done >> "${output_dir_top}/${NEB_adapters_fasta}"
fi
echo ""
echo "Adapters FastA:"
echo ""
cat "${output_dir_top}/${NEB_adapters_fasta}"
echo ""
Creating adapters FastA.
/home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/D-Apul/output/08-Apul-sRNAseq-trimming/NEB-adapters.fasta already exists. Nothing to do.
Adapters FastA:
>adapter_1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
>adapter_2
GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT
# Load bash variables into memory
source .bashvars
############ RUN FASTQC ############
# Create array of trimmed FastQs
raw_fastqs_array=(${raw_reads_dir}/${fastq_pattern})
# Pass array contents to new variable as space-delimited list
raw_fastqc_list=$(echo "${raw_fastqs_array[*]}")
echo "Beginning FastQC on raw reads..."
echo ""
# Run FastQC
### NOTE: Do NOT quote raw_fastqc_list
${programs_array[fastqc]} \
--threads ${threads} \
--outdir ${raw_fastqc_dir} \
--quiet \
${raw_fastqc_list}
echo "FastQC on raw reads complete!"
echo ""
############ END FASTQC ############
############ RUN MULTIQC ############
echo "Beginning MultiQC on raw FastQC..."
echo ""
${programs_array[multiqc]} ${raw_fastqc_dir} -o ${raw_fastqc_dir}
echo ""
echo "MultiQC on raw FastQs complete."
echo ""
############ END MULTIQC ############
echo "Removing FastQC zip files."
echo ""
rm ${raw_fastqc_dir}/*.zip
echo "FastQC zip files removed."
echo ""
# View directory contents
ls -lh ${raw_fastqc_dir}
Beginning FastQC on raw reads...
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
FastQC on raw reads complete!
Beginning MultiQC on raw FastQC...
/// MultiQC 🔍 | v1.14
| multiqc | MultiQC Version v1.17 now available!
| multiqc | Search path : /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/D-Apul/output/08-Apul-sRNAseq-trimming/raw-fastqc
| searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 23/23
| fastqc | Found 10 reports
| multiqc | Compressing plot data
| multiqc | Report : ../output/08-Apul-sRNAseq-trimming/raw-fastqc/multiqc_report.html
| multiqc | Data : ../output/08-Apul-sRNAseq-trimming/raw-fastqc/multiqc_data
| multiqc | MultiQC complete
MultiQC on raw FastQs complete.
Removing FastQC zip files.
FastQC zip files removed.
total 8.2M
-rw-r--r-- 1 sam sam 132K Oct 31 15:14 fastqc_adapter_content_plot.png
-rw-r--r-- 1 sam sam 98K Oct 31 15:14 fastqc_per_base_sequence_quality_plot.png
drwxr-xr-x 2 sam sam 4.0K Nov 1 11:08 multiqc_data
-rw-r--r-- 1 sam sam 1.3M Nov 1 11:08 multiqc_report.html
-rw-r--r-- 1 sam sam 855 Oct 31 15:14 README.md
-rw-r--r-- 1 sam sam 671K Nov 1 11:08 sRNA-ACR-140-S1-TP2_R1_001_fastqc.html
-rw-r--r-- 1 sam sam 686K Nov 1 11:08 sRNA-ACR-140-S1-TP2_R2_001_fastqc.html
-rw-r--r-- 1 sam sam 672K Nov 1 11:08 sRNA-ACR-145-S1-TP2_R1_001_fastqc.html
-rw-r--r-- 1 sam sam 689K Nov 1 11:08 sRNA-ACR-145-S1-TP2_R2_001_fastqc.html
-rw-r--r-- 1 sam sam 668K Nov 1 11:08 sRNA-ACR-150-S1-TP2_R1_001_fastqc.html
-rw-r--r-- 1 sam sam 686K Nov 1 11:08 sRNA-ACR-150-S1-TP2_R2_001_fastqc.html
-rw-r--r-- 1 sam sam 676K Nov 1 11:08 sRNA-ACR-173-S1-TP2_R1_001_fastqc.html
-rw-r--r-- 1 sam sam 685K Nov 1 11:08 sRNA-ACR-173-S1-TP2_R2_001_fastqc.html
-rw-r--r-- 1 sam sam 674K Nov 1 11:08 sRNA-ACR-178-S1-TP2_R1_001_fastqc.html
-rw-r--r-- 1 sam sam 687K Nov 1 11:08 sRNA-ACR-178-S1-TP2_R2_001_fastqc.html
# Load bash variables into memory
source .bashvars
# Change to directory with raw reads
cd "${raw_reads_dir}"
# Create arrays of FastQ R1 files and sample names
# Do NOT quote R1_fastq_pattern variable
for fastq in ${R1_fastq_pattern}
do
fastq_array_R1+=("${fastq}")
# Use parameter substitution to remove all text up to and including last "." from
# right side of string.
R1_names_array+=("${fastq%%.*}")
done
# Create array of FastQ R2 files
# Do NOT quote R2_fastq_pattern variable
for fastq in ${R2_fastq_pattern}
do
fastq_array_R2+=("${fastq}")
# Use parameter substitution to remove all text up to and including last "." from
# right side of string.
R2_names_array+=("${fastq%%.*}")
done
############ RUN FLEXBAR ############
# Uses parameter substitution (e.g. ${R1_sample_name%%_*})to rm the _R[12]
# Uses NEB adapter file
# --adapter-pair-overlap ON: Recommended by NEB sRNA kit
# --qtrim-threshold 25: Minimum quality
# --qtrim-format i1.8: Sets sequencer as illumina
# --post-trim-length: Trim reads from 3' end to max length
# --target: Sets file naming patterns
# --zip-output GZ: Sets type of compression. GZ = gzip
# Run flexbar on files
echo "Beginning flexbar trimming."
echo ""
time \
for index in "${!fastq_array_R1[@]}"
do
R1_sample_name="${R1_names_array[index]}"
R2_sample_name="${R2_names_array[index]}"
# Begin flexbar trimming
${programs_array[flexbar]} \
--reads ${fastq_array_R1[index]} \
--reads2 ${fastq_array_R2[index]} \
--adapters ${output_dir_top}/${NEB_adapters_fasta} \
--adapter-pair-overlap ON \
--qtrim-format i1.8 \
--qtrim-threshold 25 \
--post-trim-length ${max_read_length} \
--threads ${threads} \
--target "${trimmed_reads_dir}/${R1_sample_name%%_*}.flexbar_trim.${max_read_length}bp" \
--zip-output GZ
# Move to trimmed directory
# This is done so checksums file doesn't include excess path
cd ${trimmed_reads_dir}
# Generate md5 checksums for newly trimmed files
{
md5sum "${R1_sample_name%%_*}.flexbar_trim.${max_read_length}bp_1.fastq.gz"
md5sum "${R2_sample_name%%_*}.flexbar_trim.${max_read_length}bp_2.fastq.gz"
} >> "${trimmed_checksums}"
# Change back to to raw reads directory
cd "${raw_reads_dir}"
done
echo ""
echo "flexbar trimming complete."
echo ""
echo "Trimmed FastQs MD5 checksums:"
echo ""
cat "${trimmed_reads_dir}/${trimmed_checksums}"
############ END FLEXBAR ############
Beginning flexbar trimming.
real 19m59.196s
user 295m56.087s
sys 24m36.310s
flexbar trimming complete.
Trimmed FastQs MD5 checksums:
b23f6fdc4f240e383c7a600de877f0be sRNA-ACR-140-S1-TP2.flexbar_trim.25bp_1.fastq.gz
eddd8d23d0ec50b10b99d44fcd2a633e sRNA-ACR-140-S1-TP2.flexbar_trim.25bp_2.fastq.gz
313a035a62a7d21b32aeec003a667bc9 sRNA-ACR-145-S1-TP2.flexbar_trim.25bp_1.fastq.gz
47cc778f7fcd7d0e5341e35e4d9d352e sRNA-ACR-145-S1-TP2.flexbar_trim.25bp_2.fastq.gz
f233655d21bdd984c4e367ff8dfdafae sRNA-ACR-150-S1-TP2.flexbar_trim.25bp_1.fastq.gz
3c2d082af19281931145751952eadeba sRNA-ACR-150-S1-TP2.flexbar_trim.25bp_2.fastq.gz
a09c6afdc13752e0b0365ea6fab41e8b sRNA-ACR-173-S1-TP2.flexbar_trim.25bp_1.fastq.gz
ae6250aef2e37c73861a98c113a8cc89 sRNA-ACR-173-S1-TP2.flexbar_trim.25bp_2.fastq.gz
26c80f93437ef8214f718d0de4f073f6 sRNA-ACR-178-S1-TP2.flexbar_trim.25bp_1.fastq.gz
faf3dc22cc62cecaff092c5d07f4a13a sRNA-ACR-178-S1-TP2.flexbar_trim.25bp_2.fastq.gz
b23f6fdc4f240e383c7a600de877f0be sRNA-ACR-140-S1-TP2.flexbar_trim.25bp_1.fastq.gz
eddd8d23d0ec50b10b99d44fcd2a633e sRNA-ACR-140-S1-TP2.flexbar_trim.25bp_2.fastq.gz
313a035a62a7d21b32aeec003a667bc9 sRNA-ACR-145-S1-TP2.flexbar_trim.25bp_1.fastq.gz
47cc778f7fcd7d0e5341e35e4d9d352e sRNA-ACR-145-S1-TP2.flexbar_trim.25bp_2.fastq.gz
f233655d21bdd984c4e367ff8dfdafae sRNA-ACR-150-S1-TP2.flexbar_trim.25bp_1.fastq.gz
3c2d082af19281931145751952eadeba sRNA-ACR-150-S1-TP2.flexbar_trim.25bp_2.fastq.gz
a09c6afdc13752e0b0365ea6fab41e8b sRNA-ACR-173-S1-TP2.flexbar_trim.25bp_1.fastq.gz
ae6250aef2e37c73861a98c113a8cc89 sRNA-ACR-173-S1-TP2.flexbar_trim.25bp_2.fastq.gz
26c80f93437ef8214f718d0de4f073f6 sRNA-ACR-178-S1-TP2.flexbar_trim.25bp_1.fastq.gz
faf3dc22cc62cecaff092c5d07f4a13a sRNA-ACR-178-S1-TP2.flexbar_trim.25bp_2.fastq.gz
# Load bash variables into memory
source .bashvars
############ RUN FASTQC ############
### NOTE: Do NOT quote raw_fastqc_list
# Create array of trimmed FastQs
trimmed_fastqs_array=(${trimmed_reads_dir}/${fastq_pattern})
# Pass array contents to new variable as space-delimited list
trimmed_fastqc_list=$(echo "${trimmed_fastqs_array[*]}")
echo "Beginning FastQC on raw reads..."
echo ""
# Run FastQC
${programs_array[fastqc]} \
--threads ${threads} \
--outdir ${trimmed_fastqc_dir} \
--quiet \
${trimmed_fastqc_list}
echo "FastQC on trimmed reads complete!"
echo ""
############ END FASTQC ############
############ RUN MULTIQC ############
echo "Beginning MultiQC on raw FastQC..."
echo ""
${programs_array[multiqc]} ${trimmed_fastqc_dir} -o ${trimmed_fastqc_dir}
echo ""
echo "MultiQC on trimmed FastQs complete."
echo ""
############ END MULTIQC ############
echo "Removing FastQC zip files."
echo ""
rm ${trimmed_fastqc_dir}/*.zip
echo "FastQC zip files removed."
echo ""
# View directory contents
ls -lh ${trimmed_fastqc_dir}
Beginning FastQC on raw reads...
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
FastQC on trimmed reads complete!
Beginning MultiQC on raw FastQC...
/// MultiQC 🔍 | v1.14
| multiqc | MultiQC Version v1.17 now available!
| multiqc | Search path : /home/shared/8TB_HDD_01/sam/gitrepos/deep-dive/D-Apul/output/08-Apul-sRNAseq-trimming/trimmed-fastqc
| searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 23/23
| fastqc | Found 10 reports
| multiqc | Compressing plot data
| multiqc | Report : ../output/08-Apul-sRNAseq-trimming/trimmed-fastqc/multiqc_report.html
| multiqc | Data : ../output/08-Apul-sRNAseq-trimming/trimmed-fastqc/multiqc_data
| multiqc | MultiQC complete
MultiQC on trimmed FastQs complete.
Removing FastQC zip files.
FastQC zip files removed.
total 6.5M
-rw-r--r-- 1 sam sam 55K Oct 31 15:14 fastqc_adapter_content_plot.png
-rw-r--r-- 1 sam sam 54K Oct 31 15:14 fastqc_per_base_sequence_quality_plot.png
drwxr-xr-x 2 sam sam 4.0K Nov 1 11:29 multiqc_data
-rw-r--r-- 1 sam sam 1.2M Nov 1 11:29 multiqc_report.html
-rw-r--r-- 1 sam sam 1.1K Oct 31 15:14 README.md
-rw-r--r-- 1 sam sam 523K Nov 1 11:29 sRNA-ACR-140-S1-TP2.flexbar_trim.25bp_1_fastqc.html
-rw-r--r-- 1 sam sam 520K Nov 1 11:29 sRNA-ACR-140-S1-TP2.flexbar_trim.25bp_2_fastqc.html
-rw-r--r-- 1 sam sam 526K Nov 1 11:29 sRNA-ACR-145-S1-TP2.flexbar_trim.25bp_1_fastqc.html
-rw-r--r-- 1 sam sam 522K Nov 1 11:29 sRNA-ACR-145-S1-TP2.flexbar_trim.25bp_2_fastqc.html
-rw-r--r-- 1 sam sam 525K Nov 1 11:29 sRNA-ACR-150-S1-TP2.flexbar_trim.25bp_1_fastqc.html
-rw-r--r-- 1 sam sam 526K Nov 1 11:29 sRNA-ACR-150-S1-TP2.flexbar_trim.25bp_2_fastqc.html
-rw-r--r-- 1 sam sam 525K Nov 1 11:29 sRNA-ACR-173-S1-TP2.flexbar_trim.25bp_1_fastqc.html
-rw-r--r-- 1 sam sam 521K Nov 1 11:29 sRNA-ACR-173-S1-TP2.flexbar_trim.25bp_2_fastqc.html
-rw-r--r-- 1 sam sam 526K Nov 1 11:29 sRNA-ACR-178-S1-TP2.flexbar_trim.25bp_1_fastqc.html
-rw-r--r-- 1 sam sam 530K Nov 1 11:29 sRNA-ACR-178-S1-TP2.flexbar_trim.25bp_2_fastqc.html
A quick comparison of raw and trimmed reads to show trimming worked:
RAW | TRIMMED |
---|---|