--- author: Sam White toc-title: Contents toc-depth: 5 toc-location: left layout: post title: Trimming/FastQC/MultiQC - P.generosa EPI FastQs with FASTP on Mox date: '2019-09-23 14:19' tags: - Panopea generosa - geoduck - fastp - trimming categories: - 2019 - Miscellaneous --- Steven noticed that the [M-Bias plots generated by Bismark from these files](https://github.com/RobertsLab/resources/issues/408#issuecomment-534180697) was a little wonky and asked that I try trimming them a bit more. The files were originally quality/adaptor [trimmed with TrimGalore! on 20180516](https://robertslab.github.io/sams-notebook/posts/2018/2018-05-16-trimgalorefastqcmultiqc-trimgalore-rrbs-geoduck-bs-seq-fastq-data-directional/). I decided to use some trimming software called [fastp](https://github.com/OpenGene/fastp), since it had a lot of options (including trimming without the need for quality filtering/trimming, as well as being multi-threaded). Looking at the M-Bias plots and the original FastQC assessment, I opted to trim the first 20bp from the 5' end of all reads. I followed this up with FastQC and MultiQC. Use the following Bash script to initiate file transfer to Mox and call the SBATCH script: ```shell #!/bin/bash ## Rsync P.generosa EPI files and then run SBATCH script for hard trimming. rsync -av --progress owl:/volume1/web/Athaliana/20180516_geoduck_trimgalore_rrbs/*.fq.gz . sbatch 20190923_pgen_fastp_EPI_trimming.sh ``` SBATCH script (GitHub): - [20190923_pgen_fastp_EPI_trimming.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20190923_pgen_fastp_EPI_trimming.sh) ```shell #!/bin/bash ## Job Name #SBATCH --job-name=pgen_fastp_trimming_EPI ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=10-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20190923_pgen_fastp_EPI_trimming # This script is called by 20190923_pgen_EPI_rsync.sh. That script transfers the FastQ files # to the working directory from: https://owl.fish.washington.edu/Athaliana/20180516_geoduck_trimgalore_rrbs # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Document programs in PATH (primarily for program version ID) { date echo "" echo "System PATH for $SLURM_JOB_ID" echo "" printf "%0.s-" {1..10} echo "${PATH}" | tr : \\n } >> system_path.log # Set number of CPUs to use threads=28 # Set number of nucleotides to hard trim num_nucs_trim=20 # Paths to programs fastp=/gscratch/srlab/programs/fastp-0.20.0/fastp ## Inititalize arrays fastq_array_R1=() fastq_array_R2=() R1_names_array=() R2_names_array=() # Create array of fastq R1 files for fastq in *R1*.gz do fastq_array_R1+=("${fastq}") done # Create array of fastq R2 files for fastq in *R2*.gz do fastq_array_R2+=("${fastq}") done # Create array of sample names ## Uses awk to parse out sample name from filename for R1_fastq in *R1*.gz do R1_names_array+=($(echo "${R1_fastq}" | awk -F"." '{print $1}')) done # Create array of sample names ## Uses awk to parse out sample name from filename for R2_fastq in *R2*.gz do R2_names_array+=($(echo "${R2_fastq}" | awk -F"." '{print $1}')) done # Create list of fastq files used in analysis ## Uses parameter substitution to strip leading path from filename for fastq in *.gz do echo "${fastq}" >> fastq.list.txt done # Run fastp on files and trim set number of nucleotides from 5' end of reads for index in "${!fastq_array_R1[@]}" do R1_sample_name=$(echo "${R1_names_array[index]}") R2_sample_name=$(echo "${R2_names_array[index]}") ${fastp} \ --in1 "${fastq_array_R1[index]}" \ --in2 "${fastq_array_R2[index]}" \ --disable_quality_filtering \ --disable_length_filtering \ --disable_adapter_trimming \ --trim_front1 ${num_nucs_trim} \ --trim_front2 ${num_nucs_trim} \ --thread ${threads} \ --out1 "${R1_sample_name}".20bp-trim.fq.gz \ --out2 "${R2_sample_name}".20bp-trim.fq.gz # Remove original FastQ files rm "${fastq_array_R1[index]}" "${fastq_array_R2[index]}" done ``` --- # RESULTS This was pretty quick, ~2.75hrs: ![fastp runtime screencap](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20190923_fastp_trimming_EPI_runtime.png?raw=true) Output folder: - [20190923_pgen_fastp_EPI_trimming/](https://gannet.fish.washington.edu/Atumefaciens/20190923_pgen_fastp_EPI_trimming/) fastp report (HTML): - [20190923_pgen_fastp_EPI_trimming/fastp.html](https://gannet.fish.washington.edu/Atumefaciens/20190923_pgen_fastp_EPI_trimming/fastp.html) MultiQC report (HTML): - [20190923_pgen_fastp_EPI_trimming/multiqc_report.html](https://gannet.fish.washington.edu/Atumefaciens/20190923_pgen_fastp_EPI_trimming/multiqc_report.html) Firstly, it turns out that [fastp](https://github.com/OpenGene/fastp) is pretty awesome! It automatically generates a summary HTML file with before and after trimming data, similar to that of FastQC. Had I realized this, I might not have bothered with FastQC... Anyway, with that being said, this is how the before/after look for Read 1s via the fastp HTML: ##### BEFORE ![fastp sequence quality plots before](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20190923_fastp_trimming_EPI_before.png?raw=true) --- ##### AFTER ![fastp sequence quality plots before](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20190923_fastp_trimming_EPI_after.png?raw=true)