---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
layout: post
title: Trimming-FastQC-MultiQC - Robertos C.gigas WGBS FastQ Data with fastp FastQC and MultiQC on Mox
date: '2020-08-18 09:57'
tags:
  - wgbs
  - bisulfite sequencing
  - fastq
  - Crassostrea gigas
  - FastQC
  - fastp
  - MultiQc
  - mox
categories:
  - 2020
  - Miscellaneous
---

Steven asked me to [trim Roberto's _C.gigas_ whole genome bisulfite sequencing (WGBS) reads](https://github.com/RobertsLab/resources/issues/992) (GitHub Issue) "following his methods". The only thing specified is trimming Illumina adaptors and then trimming 10bp from the 5' end of reads. No mention of which software was used. I opted to use [fastp](https://github.com/OpenGene/fastp), due to its speed and built-in QC metrics/plots. Despite the built-in tools, I also ran FastQC and MultiQC post-trimming to get a more comprehensive overview. Process was run on Mox.

SBATCH script (GitHub):

- [20200818_cgig_wgbs_roberto_fastp_trimming.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20200818_cgig_wgbs_roberto_fastp_trimming.sh)

```shell
#!/bin/bash
## Job Name
#SBATCH --job-name=cgigas_fastp_trimming_roberto_wgbs
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200818_cgig_wgbs_roberto_fastp_trimming


### Roberto's C.gigas WGBS trimming using fastp.


###################################################################################
# These variables need to be set by user

# Set number of CPUs to use
threads=28

# Input/output files
trimmed_checksums=trimmed_fastq_checksums.md5
raw_reads_dir=/gscratch/srlab/sam/data/C_gigas/wgbs

# Paths to programs
fastp=/gscratch/srlab/programs/fastp-0.20.0/fastp
fastqc=/gscratch/srlab/programs/fastqc_v0.11.8/fastqc
multiqc=/gscratch/srlab/programs/anaconda3/bin/multiqc

###################################################################################

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability
module load intel-python3_2017

# Capture date
timestamp=$(date +%Y%m%d)

## Initialize arrays
fastq_array_R1=()
fastq_array_R2=()
programs_array=()
R1_names_array=()
R2_names_array=()

# Programs array
programs_array=("${fastp}" "${multiqc}" "${fastqc}")

# Sync raw FastQ files to working directory
# Glob is applied to files inside the raw reads directory
rsync --archive --verbose \
"${raw_reads_dir}"/[035]*.fastq.gz .

# Create array of fastq R1 files
for fastq in *R1*.gz
do
  fastq_array_R1+=("${fastq}")
done

# Create array of fastq R2 files
for fastq in *R2*.gz
do
  fastq_array_R2+=("${fastq}")
done

# Create array of R1 sample names
## Uses awk to parse out sample name from filename
for R1_fastq in *R1*.gz
do
  R1_names_array+=($(echo "${R1_fastq}" | awk -F"_" '{print $1}'))
done

# Create array of R2 sample names
## Uses awk to parse out sample name from filename
for R2_fastq in *R2*.gz
do
  R2_names_array+=($(echo "${R2_fastq}" | awk -F"_" '{print $1}'))
done

# Create list of fastq files used in analysis
for fastq in *.gz
do
  echo "${fastq}" >> fastq.list.txt
done

# Run fastp on files
# Trim 10bp from the 5' end of each read
## Loop variable holds the array index, used to pair R1/R2 files and names
for fastq in "${!fastq_array_R1[@]}"
do
  R1_sample_name=$(echo "${R1_names_array[fastq]}")
  R2_sample_name=$(echo "${R2_names_array[fastq]}")
  ${fastp} \
  --in1 "${fastq_array_R1[fastq]}" \
  --in2 "${fastq_array_R2[fastq]}" \
  --detect_adapter_for_pe \
  --trim_front1 10 \
  --trim_front2 10 \
  --thread ${threads} \
  --html "${R1_sample_name}".fastp-trim."${timestamp}".report.html \
  --json "${R1_sample_name}".fastp-trim."${timestamp}".report.json \
  --out1 "${R1_sample_name}".fastp-trim."${timestamp}".fq.gz \
  --out2 "${R2_sample_name}".fastp-trim."${timestamp}".fq.gz

  # Generate md5 checksums for newly trimmed files
  {
    md5sum "${R1_sample_name}".fastp-trim."${timestamp}".fq.gz
    md5sum "${R2_sample_name}".fastp-trim."${timestamp}".fq.gz
  } >> "${trimmed_checksums}"

  # Run FastQC
  ${fastqc} --threads ${threads} \
  "${R1_sample_name}".fastp-trim."${timestamp}".fq.gz \
  "${R2_sample_name}".fastp-trim."${timestamp}".fq.gz

  # Remove original FastQ files
  rm "${fastq_array_R1[fastq]}" "${fastq_array_R2[fastq]}"
done

# Run MultiQC
${multiqc} .

# Capture program options
for program in "${!programs_array[@]}"
do
  {
    echo "Program options for ${programs_array[program]}: "
    echo ""
    ${programs_array[program]} -h
    echo ""
    echo ""
    echo "----------------------------------------------"
    echo ""
    echo ""
  } &>> program_options.log || true
done

# Document programs in PATH (primarily for program version ID)
{
  date
  echo ""
  echo "System PATH for $SLURM_JOB_ID"
  echo ""
  printf "%0.s-" {1..10}
  echo "${PATH}" | tr : \\n
} >> system_path.log
```

---

# RESULTS

Actually took longer than I expected; ~3.5hrs:

![fastp trimming runtime](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20200818_cgig_wgbs_roberto_fastp_trimming_runtime.png?raw=true)

Output folder:

- [20200818_cgig_wgbs_roberto_fastp_trimming/](https://gannet.fish.washington.edu/Atumefaciens/20200818_cgig_wgbs_roberto_fastp_trimming/)

  - Trimmed files can be found with this pattern: `*fastp-trim*.fq.gz`

MultiQC Report (HTML):

- [20200818_cgig_wgbs_roberto_fastp_trimming/multiqc_report.html](https://gannet.fish.washington.edu/Atumefaciens/20200818_cgig_wgbs_roberto_fastp_trimming/multiqc_report.html)

  - NOTE: Report contains summaries from both `fastp` and `FastQC` results

  - Each trimmed file has a corresponding `*_fastqc.html`
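
Since the script writes an md5 checksum for every trimmed file to `trimmed_fastq_checksums.md5`, anyone copying the trimmed reads elsewhere can confirm the transfer was clean. The following is only a minimal sketch, not part of the job that was run: it assumes GNU `md5sum` is available, that the checksum file was copied alongside the `*fastp-trim*.fq.gz` files, and the function name `verify_trimmed_checksums` and the local directory in the usage comment are hypothetical.

```shell
#!/bin/bash
# Sketch: verify trimmed FastQ files against the checksum file produced by
# the SBATCH script. The function name and default arguments are
# illustrative assumptions, not part of the original job.

verify_trimmed_checksums() {
  # Directory holding the trimmed files and the checksum file
  local dir="${1:-.}"
  # Checksum file name (matches the trimmed_checksums variable in the script)
  local checksums="${2:-trimmed_fastq_checksums.md5}"

  # Use a subshell so the caller's working directory is left unchanged.
  (
    cd "${dir}" || exit 1
    # --check recomputes each hash and compares it to the recorded value.
    # --quiet suppresses the per-file "OK" lines; failures are still printed.
    md5sum --check --quiet "${checksums}"
  )
}

# Example usage (hypothetical local copy of the output folder):
# verify_trimmed_checksums ./20200818_cgig_wgbs_roberto_fastp_trimming
```

The function returns a non-zero exit status if any file is missing or its checksum does not match, so it can be used directly in an `if` statement or ahead of downstream alignment steps.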