01.00-E-Peve-WGBS-trimming-fastp-FastQC-MultiQC
================
Sam White
2025-02-07

- 1 Background
  - 1.1 Inputs
  - 1.2 Outputs
- 2 Create a Bash variables file
- 3 Fastp Trimming
- 4 Quality Check with FastQC and MultiQC

------------------------------------------------------------------------

# 1 Background

This Rmd file trims WGBS-seq FastQ files using [fastp](https://github.com/OpenGene/fastp) (Chen 2023), followed by quality checks with [FastQC](https://github.com/s-andrews/FastQC) and [MultiQC](https://multiqc.info/) (Ewels et al. 2016).

If you need to download the raw sequencing reads, please see [00.00-E-Peve-WGBS-reads-FastQC-MultiQC.Rmd](https://github.com/urol-e5/deep-dive-expression/blob/06f8620587e96ecce970b79bd9e501bbd2a6812e/E-Peve/code/00.00-E-Peve-WGBS-reads-FastQC-MultiQC.Rmd).

## 1.1 Inputs

FastQs: Expects paired-end FastQs formatted like so: `-_S_R1_001.fastq.gz`

## 1.2 Outputs

Due to size, trimmed FastQs cannot be uploaded to GitHub. All trimmed FastQs produced by this script are here:

[01.00-E-Peve-WGBS-trimming-fastp-FastQC-MultiQC/](https://gannet.fish.washington.edu/gitrepos/urol-e5/deep-dive-expression/E-Peve/output/01.00-E-Peve-WGBS-trimming-fastp-FastQC-MultiQC/trimmed-fastqs/)

# 2 Create a Bash variables file

This allows usage of Bash variables across R Markdown chunks.

``` bash
{
echo "#### Assign Variables ####"
echo ""

echo "# Data directories"
echo 'export repo_dir="/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/deep-dive-expression"'
echo 'export output_dir_top="${repo_dir}/E-Peve/output/01.00-E-Peve-WGBS-trimming-fastp-FastQC-MultiQC"'
echo 'export raw_reads_dir="${repo_dir}/E-Peve/data/raw-fastqs"'
echo 'export trimmed_fastqs_dir=${output_dir_top}/trimmed-fastqs'
echo 'export trimmed_fastqc_dir=${output_dir_top}/trimmed-fastqc'
echo ""

echo "# Set FastQ filename patterns"
echo "export fastq_pattern='*.fastq.gz'"
echo "export R1_fastq_pattern='*_R1_001.fastq.gz'"
echo "export R2_fastq_pattern='*_R2_001.fastq.gz'"
echo "export trimmed_fastq_pattern='*fastp-trim.fq.gz'"
echo ""

echo "# Set number of CPUs to use"
echo 'export threads=40'
echo ""

echo "# Paths to programs"
echo 'export fastp=/home/shared/fastp'
echo 'export fastqc=/home/shared/FastQC-0.12.1/fastqc'
echo 'export multiqc=/home/sam/programs/mambaforge/bin/multiqc'
echo ""

echo "## Initialize arrays"
echo 'export fastq_array_R1=()'
echo 'export fastq_array_R2=()'
echo 'export raw_fastqs_array=()'
echo 'export R1_names_array=()'
echo 'export R2_names_array=()'
echo ""

echo "# Print formatting"
echo 'export line="--------------------------------------------------------"'
echo ""
} > .bashvars

cat .bashvars
```

    #### Assign Variables ####

    # Data directories
    export repo_dir="/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/deep-dive-expression"
    export output_dir_top="${repo_dir}/E-Peve/output/01.00-E-Peve-WGBS-trimming-fastp-FastQC-MultiQC"
    export raw_reads_dir="${repo_dir}/E-Peve/data/raw-fastqs"
    export trimmed_fastqs_dir=${output_dir_top}/trimmed-fastqs
    export trimmed_fastqc_dir=${output_dir_top}/trimmed-fastqc

    # Set FastQ filename patterns
    export fastq_pattern='*.fastq.gz'
    export R1_fastq_pattern='*_R1_001.fastq.gz'
    export R2_fastq_pattern='*_R2_001.fastq.gz'
    export trimmed_fastq_pattern='*fastp-trim.fq.gz'

    # Set number of CPUs to use
    export threads=40

    # Paths to programs
    export fastp=/home/shared/fastp
    export fastqc=/home/shared/FastQC-0.12.1/fastqc
    export multiqc=/home/sam/programs/mambaforge/bin/multiqc

    ## Initialize arrays
    export fastq_array_R1=()
    export fastq_array_R2=()
    export raw_fastqs_array=()
    export R1_names_array=()
    export R2_names_array=()

    # Print formatting
    export line="--------------------------------------------------------"

# 3 Fastp Trimming

[fastp](https://github.com/OpenGene/fastp) (Chen 2023) is set to auto-detect Illumina adapters and to trim the first 10bp from each read, as past experience shows these first 10bp are more inconsistent than the remainder of the read. PolyG tails are also trimmed (`--trim_poly_g`).

``` bash
# Load bash variables into memory
source .bashvars

# Make output directory, if it doesn't exist
mkdir --parents "${trimmed_fastqs_dir}"

# Change to raw reads directory
cd "${raw_reads_dir}"

# Create arrays of fastq R1 files and sample names
# Sample name is everything before the first underscore
for fastq in ${R1_fastq_pattern}
do
  fastq_array_R1+=("${fastq}")
  R1_names_array+=("$(echo "${fastq}" | awk -F"_" '{print $1}')")
done

# Create arrays of fastq R2 files and sample names
for fastq in ${R2_fastq_pattern}
do
  fastq_array_R2+=("${fastq}")
  R2_names_array+=("$(echo "${fastq}" | awk -F"_" '{print $1}')")
done

# Create MD5 checksums of the raw FastQs for reference
# Skipped if the checksum file already exists
if [ ! -f "${output_dir_top}"/raw-fastq-checksums.md5 ]; then
  for fastq in *.gz
  do
    md5sum "${fastq}" >>"${output_dir_top}"/raw-fastq-checksums.md5
  done
fi

# Run fastp on files
# Adds JSON report output for downstream usage by MultiQC
for index in "${!fastq_array_R1[@]}"
do
  R1_sample_name="${R1_names_array[index]}"
  R2_sample_name="${R2_names_array[index]}"
  echo "${R1_sample_name}"
  echo ""
  ${fastp} \
  --in1 "${fastq_array_R1[index]}" \
  --in2 "${fastq_array_R2[index]}" \
  --detect_adapter_for_pe \
  --trim_front1 10 \
  --trim_front2 10 \
  --trim_poly_g \
  --thread ${threads} \
  --html "${trimmed_fastqs_dir}"/"${R1_sample_name}".fastp-trim.report.html \
  --json "${trimmed_fastqs_dir}"/"${R1_sample_name}".fastp-trim.report.json \
  --out1 "${trimmed_fastqs_dir}"/"${R1_sample_name}"_R1.fastp-trim.fq.gz \
  --out2 "${trimmed_fastqs_dir}"/"${R2_sample_name}"_R2.fastp-trim.fq.gz \
  2>> "${trimmed_fastqs_dir}"/fastp.stderr

  # Generate md5 checksums for newly trimmed files
  cd "${trimmed_fastqs_dir}"
  md5sum "${R1_sample_name}"_R1.fastp-trim.fq.gz > "${R1_sample_name}"_R1.fastp-trim.fq.gz.md5
  md5sum "${R2_sample_name}"_R2.fastp-trim.fq.gz > "${R2_sample_name}"_R2.fastp-trim.fq.gz.md5

  # Change back to previous directory
  cd -
done
```

# 4 Quality Check with FastQC and MultiQC

``` bash
# Load bash variables into memory
source .bashvars

############ RUN FASTQC ############

# Create array of trimmed FastQs
trimmed_fastqs_array=(${trimmed_fastqs_dir}/${trimmed_fastq_pattern})

# Pass array contents to new variable as space-delimited list
trimmed_fastqc_list="${trimmed_fastqs_array[*]}"

echo "Beginning FastQC on trimmed reads..."
echo ""

# Run FastQC
### NOTE: Do NOT quote trimmed_fastqc_list
### It must word-split into one argument per FastQ file
${fastqc} \
--threads ${threads} \
--outdir ${trimmed_fastqs_dir} \
--quiet \
${trimmed_fastqc_list}

echo "FastQC on trimmed reads complete!"
echo ""

############ END FASTQC ############

############ RUN MULTIQC ############
echo "Beginning MultiQC on trimmed FastQC..."
echo ""

${multiqc} ${trimmed_fastqs_dir} -o ${trimmed_fastqs_dir}

echo ""
echo "MultiQC on trimmed FastQs complete."
echo ""

############ END MULTIQC ############

echo "Removing FastQC zip files."
echo ""
rm ${trimmed_fastqs_dir}/*.zip
echo "FastQC zip files removed."
echo ""
```

    Beginning FastQC on trimmed reads...

    application/gzip
    application/gzip
    application/gzip
    application/gzip
    application/gzip
    application/gzip
    application/gzip
    application/gzip
    application/gzip
    application/gzip
    FastQC on trimmed reads complete!

    Beginning MultiQC on trimmed FastQC...

      /// MultiQC 🔍 | v1.14

    | multiqc | MultiQC Version v1.27 now available!
    | multiqc | Search path : /home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/deep-dive-expression/E-Peve/output/01.00-E-Peve-WGBS-trimming-fastp-FastQC-MultiQC/trimmed-fastqs
    | searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 46/46
    | fastqc | Found 10 reports
    | multiqc | Compressing plot data
    | multiqc | Report : ../output/01.00-E-Peve-WGBS-trimming-fastp-FastQC-MultiQC/trimmed-fastqs/multiqc_report.html
    | multiqc | Data : ../output/01.00-E-Peve-WGBS-trimming-fastp-FastQC-MultiQC/trimmed-fastqs/multiqc_data
    | multiqc | MultiQC complete

    MultiQC on trimmed FastQs complete.

    Removing FastQC zip files.

    FastQC zip files removed.

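The fastp loop in section 3 pairs R1 and R2 files purely by array index, so a mistyped filename pattern or a missing file would silently mis-pair reads. The following is a minimal sketch of a pre-flight check that could be run before trimming. It assumes the `_R1_001.fastq.gz` / `_R2_001.fastq.gz` suffixes set in `.bashvars`. The function names `r2_for_r1` and `check_pairs` and the demo sample names are hypothetical and not part of the original notebook.

```bash
# Hypothetical helper: derive the expected R2 filename from an R1 filename,
# assuming the Illumina-style suffixes used in .bashvars.
r2_for_r1() {
  local r1="$1"
  printf '%s\n' "${r1%_R1_001.fastq.gz}_R2_001.fastq.gz"
}

# Hypothetical helper: return 0 only if every R1 FastQ in a directory
# has a matching R2 FastQ alongside it.
check_pairs() {
  local dir="$1"
  local status=0
  local r1
  local r2
  for r1 in "${dir}"/*_R1_001.fastq.gz
  do
    # An unmatched glob stays literal, so test that the file exists
    if [ ! -e "${r1}" ]; then
      echo "No R1 FastQs found in ${dir}" >&2
      return 1
    fi
    r2="$(r2_for_r1 "${r1}")"
    if [ ! -e "${r2}" ]; then
      echo "Missing R2 mate for: ${r1}" >&2
      status=1
    fi
  done
  return "${status}"
}

# Demo on empty placeholder files in a throwaway directory
demo_dir="$(mktemp -d)"
touch "${demo_dir}/sampleA_S1_R1_001.fastq.gz" "${demo_dir}/sampleA_S1_R2_001.fastq.gz"
touch "${demo_dir}/sampleB_S2_R1_001.fastq.gz"   # deliberately lacks an R2 mate

if check_pairs "${demo_dir}"; then
  echo "All R1 files have an R2 mate."
else
  echo "Pairing check failed; fix inputs before running fastp."
fi
rm -rf "${demo_dir}"
```

In the notebook itself, a call such as `check_pairs "${raw_reads_dir}" || exit 1` placed before the fastp loop would stop the chunk early instead of producing mis-paired output.
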
Ewels, Philip, Måns Magnusson, Sverker Lundin, and Max Käller. 2016. “MultiQC: Summarize Analysis Results for Multiple Tools and Samples in a Single Report.” *Bioinformatics* 32 (19): 3047–48.