--- title: "01.20-D-Apul-WGBS-trimming-fastp-FastQC-MultiQC" author: "Sam White" date: "2025-01-02" output: bookdown::html_document2: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true github_document: toc: true number_sections: true html_document: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true bibliography: references.bib --- # Background This notebook will trim raw WGBS FastQs using [fastp](https://github.com/OpenGene/fastp). Samples which generate an error during trimming will attempt to be repaired using [BBTools `repairl.sh`](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/) (BBMap – Bushnell B. – sourceforge.net/projects/bbmap/). Trimming results will be analyzed with FastQC and summarized by [MultiQC](https://github.com/MultiQC/MultiQC) [@ewels2016]. Based off of the initial FastQC/MultiQC, we trimmed 15bp from each read. ## Inputs Raw WGBS FastQ files with the following pattern: - `*.fastq.gz` ::: {.callout-note} If one needs to download the raw FastQs, please see [00.20-D-Apul-WGBS-reads-FastQC-MultiQC.Rmd](./00.20-D-Apul-WGBS-reads-FastQC-MultiQC.Rmd) ::: ## Outputs The expected outputs will be: - `multiqc_report.html`: A summary report of the alignment results generated by [MultiQC](https://github.com/MultiQC/MultiQC), in HTML format. - `*fastp-trim*.fq.gz`: Trimmed FastQ files. - Due to the large file sizes of FastQs, these cannot be hosted in the [timeseries_molecular GitHub repo](https://github.com/urol-e5/timeseries_molecular/tree/main). As such these files are available for download here: - [https://gannet.fish.washington.edu/gitrepos/urol-e5/timeseries_molecular/D-Apul/output/01.20-D-Apul-WGBS-trimming-fastp-FastQC-MultiQC/](https://gannet.fish.washington.edu/gitrepos/urol-e5/timeseries_molecular/D-Apul/output/01.20-D-Apul-WGBS-trimming-fastp-FastQC-MultiQC0/) ```{r setup, include=FALSE} library(knitr) knitr::opts_chunk$set( echo = TRUE, # Display code chunks eval = FALSE, # Evaluate code chunks warning = FALSE, # Hide warnings message = FALSE, # Hide messages comment = "" # Prevents appending '##' to beginning of lines in code output ) ``` # Create a Bash variables file This allows usage of Bash variables across R Markdown chunks. ```{r save-bash-variables-to-rvars-file, engine='bash', eval=TRUE} { echo "#### Assign Variables ####" echo "" echo "# Data directories" echo 'export repo_dir=/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular' echo 'export output_dir_top=${repo_dir}/D-Apul/output/01.20-D-Apul-WGBS-trimming-fastp-FastQC-MultiQC' echo 'export raw_reads_dir="${repo_dir}/D-Apul/data/wgbs-raw-fastqs"' echo 'export trimmed_fastqs_dir="${output_dir_top}/trimmed-fastqs"' echo "" echo "# Paths to programs" echo 'export programs_dir="/home/shared"' echo 'export bbmap_repair="${programs_dir}/bbmap-39.12/repair.sh"' echo 'export bismark_dir="${programs_dir}/Bismark-0.24.0"' echo 'export bowtie2_dir="${programs_dir}/bowtie2-2.4.4-linux-x86_64"' echo 'export fastp="${programs_dir}/fastp-v0.24.0/fastp"' echo 'export fastqc="${programs_dir}/FastQC-0.12.1/fastqc"' echo 'export multiqc="/home/sam/programs/mambaforge/bin/multiqc"' echo 'export samtools_dir="${programs_dir}/samtools-1.12"' echo "" echo "# Set FastQ filename patterns" echo "export fastq_pattern='*.fastq.gz'" echo "export R1_fastq_pattern='*_R1_*.fastq.gz'" echo "export R2_fastq_pattern='*_R2_*.fastq.gz'" echo "export trimmed_fastq_pattern='*fastp-trim*.fq.gz'" echo "" echo "# Set number of CPUs to use" echo 'export threads=40' echo "" echo "## Inititalize arrays" echo 'export fastq_array_R1=()' echo 'export fastq_array_R2=()' echo 'export trimmed_fastqs_array=()' echo 'export R1_names_array=()' echo 'export R2_names_array=()' echo "" echo "# Print formatting" echo 'export line="--------------------------------------------------------"' echo "" } > .bashvars cat .bashvars ``` # Trimming and Error Repair Trimming will remove any Illumina sequencing adapters, as well as polyG and polyX (primarily polyA) sequences. Trimming will also remove the first 25bp from the 5' end of each read. Samples generating an error during trimming will attempt to be repaired with BBTools' `repair.sh` script. ```{r fastp-trimming, engine='bash', eval=FALSE} # Load bash variables into memory source .bashvars # Make output directory, if it doesn't exist mkdir --parents ${output_dir_top} cd "${raw_reads_dir}" # Create arrays of fastq R1 files and sample names # Do NOT quote R1_fastq_pattern variable for fastq in ${R1_fastq_pattern} do fastq_array_R1+=("${fastq}") # Use parameter substitution to remove all text up to and including last "." from # right side of string. R1_names_array+=("${fastq%%.*}") done # Create array of fastq R2 files # Do NOT quote R2_fastq_pattern variable for fastq in ${R2_fastq_pattern} do fastq_array_R2+=("${fastq}") # Use parameter substitution to remove all text up to and including last "." from # right side of string. R2_names_array+=(${fastq%%.*}) done # Run fastp on files # Adds JSON report output for downstream usage by MultiQC echo "Beginning fastp trimming." echo "" for index in "${!fastq_array_R1[@]}" do R1_sample_name="${R1_names_array[index]}" R2_sample_name="${R2_names_array[index]}" stderr_PE_name=$(echo ${R1_sample_name} | awk -F"_" '{print $1}') ${fastp} \ --in1 ${fastq_array_R1[index]} \ --in2 ${fastq_array_R2[index]} \ --detect_adapter_for_pe \ --trim_poly_g \ --trim_poly_x \ --thread 16 \ --trim_front1 25 \ --trim_front2 25 \ --html ${output_dir_top}/"${R1_sample_name}".fastp-trim.report.html \ --json ${output_dir_top}/"${R1_sample_name}".fastp-trim.report.json \ --out1 ${output_dir_top}/"${R1_sample_name}".fastp-trim.fq.gz \ --out2 ${output_dir_top}/"${R2_sample_name}".fastp-trim.fq.gz \ 2> ${output_dir_top}/"${stderr_PE_name}".fastp-trim.stderr grep --before-context 5 "ERROR" ${output_dir_top}/"${stderr_PE_name}".fastp-trim.stderr # Check for fastp errors and then repair if grep --quiet "ERROR" ${output_dir_top}/"${stderr_PE_name}".fastp-trim.stderr; then rm ${output_dir_top}/"${R1_sample_name}".fastp-trim.fq.gz rm ${output_dir_top}/"${R2_sample_name}".fastp-trim.fq.gz ${bbmap_repair} \ in1=${fastq_array_R1[index]} \ in2=${fastq_array_R2[index]} \ out1="${R1_sample_name}".REPAIRED.fastq.gz \ out2="${R2_sample_name}".REPAIRED.fastq.gz \ outs=/dev/null \ 2> "${R1_sample_name}".REPAIRED.stderr ${fastp} \ --in1 "${R1_sample_name}".REPAIRED.fastq.gz \ --in2 "${R2_sample_name}".REPAIRED.fastq.gz \ --detect_adapter_for_pe \ --trim_poly_g \ --trim_poly_x \ --thread ${threads} \ --trim_front1 25 \ --trim_front2 25 \ --html ${output_dir_top}/"${R1_sample_name}".fastp-trim.REPAIRED.report.html \ --json ${output_dir_top}/"${R1_sample_name}".fastp-trim.REPAIRED.report.json \ --out1 ${output_dir_top}/"${R1_sample_name}".fastp-trim.REPAIRED.fq.gz \ --out2 ${output_dir_top}/"${R2_sample_name}".fastp-trim.REPAIRED.fq.gz \ 2> ${output_dir_top}/"${stderr_PE_name}".fastp-trim.REPAIRED.stderr if grep --quiet "ERROR" ${output_dir_top}/"${stderr_PE_name}".fastp-trim.REPAIRED.stderr; then echo "These ${stderr_PE_name} samples are broken." echo "Just give up. :(" echo "" fi fi echo "Finished trimming:" echo "${fastq_array_R1[index]}" echo "${fastq_array_R1[index]}" echo "" # Generate md5 checksums for newly trimmed files cd "${output_dir_top}" md5sum "${R1_sample_name}".fastp-trim.fq.gz > "${R1_sample_name}".fastp-trim.fq.gz.md5 md5sum "${R2_sample_name}".fastp-trim.fq.gz > "${R2_sample_name}".fastp-trim.fq.gz.md5 cd "${raw_reads_dir}" done ``` # FastQC ```{r fastqc, engine='bash'} # Load bash variables into memory source .bashvars cd "${output_dir_top}" ############ RUN FASTQC ############ # Create array of trimmed FastQs trimmed_fastqs_array=(${trimmed_fastq_pattern}) # Pass array contents to new variable as space-delimited list trimmed_fastqc_list=$(echo "${trimmed_fastqs_array[*]}") echo "Beginning FastQC on trimmed reads..." echo "" # Run FastQC ### NOTE: Do NOT quote raw_fastqc_list ${fastqc} \ --threads ${threads} \ --outdir "${output_dir_top}" \ --quiet \ ${trimmed_fastqc_list} echo "FastQC on trimmed reads complete!" echo "" ############ END FASTQC ############ ``` # MultiQC ```{r multiqc, engine='bash'} # Load bash variables into memory source .bashvars cd "${output_dir_top}" ${multiqc} \ --interactive \ . ```