01.00-E-Peve-WGBS-trimming-fastp-FastQC-MultiQC
================
Sam White
2025-02-07

- <a href="#1-background" id="toc-1-background">1 Background</a>
  - <a href="#11-inputs" id="toc-11-inputs">1.1 Inputs</a>
  - <a href="#12-outputs" id="toc-12-outputs">1.2 Outputs</a>
- <a href="#2-create-a-bash-variables-file"
  id="toc-2-create-a-bash-variables-file">2 Create a Bash variables
  file</a>
- <a href="#3-fastp-trimming" id="toc-3-fastp-trimming">3 Fastp
  Trimming</a>
- <a href="#4-quality-check-with-fastqc-and-multiqc"
  id="toc-4-quality-check-with-fastqc-and-multiqc">4 Quality Check with
  FastQC and MultiQC</a>

------------------------------------------------------------------------

# 1 Background

This Rmd file trims WGBS-seq files using
[fastp](https://github.com/OpenGene/fastp) (**chen2023?**), followed by
quality checks with [FastQC](https://github.com/s-andrews/FastQC) and
[MultiQC](https://multiqc.info/) (Ewels et al. 2016).

<div class="callout-note">

If you need to download the raw sequencing reads, please see
[00.00-E-Peve-WGBS-reads-FastQC-MultiQC.Rmd](https://github.com/urol-e5/deep-dive-expression/blob/06f8620587e96ecce970b79bd9e501bbd2a6812e/E-Peve/code/00.00-E-Peve-WGBS-reads-FastQC-MultiQC.Rmd)

</div>

## 1.1 Inputs

FastQs: Expects FastQs formatted like so:
`<colony_ID>-<timepoint>_S<number>_R1_001.fastq.gz`

## 1.2 Outputs

Due to size, trimmed FastQs cannot be uploaded to GitHub. All trimmed
FastQs produced by this script are here:

[01.00-E-Peve-WGBS-trimming-fastp-FastQC-MultiQC/](https://gannet.fish.washington.edu/gitrepos/urol-e5/deep-dive-expression/E-Peve/output/01.00-E-Peve-WGBS-trimming-fastp-FastQC-MultiQC/trimmed-fastqs/)

# 2 Create a Bash variables file

This allows usage of Bash variables across R Markdown chunks.

``` bash
{
echo "#### Assign Variables ####"
echo ""

echo "# Data directories"
echo 'export repo_dir="/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/deep-dive-expression"'
echo 'export output_dir_top="${repo_dir}/E-Peve/output/01.00-E-Peve-WGBS-trimming-fastp-FastQC-MultiQC"'
echo 'export raw_reads_dir="${repo_dir}/E-Peve/data/raw-fastqs"'
echo 'export trimmed_fastqs_dir=${output_dir_top}/trimmed-fastqs'
echo 'export trimmed_fastqc_dir=${output_dir_top}/trimmed-fastqc'
echo ""


echo "# Set FastQ filename patterns"
echo "export fastq_pattern='*.fastq.gz'"
echo "export R1_fastq_pattern='*_R1_001.fastq.gz'"
echo "export R2_fastq_pattern='*_R1_001.fastq.gz'"
echo "export trimmed_fastq_pattern='*fastp-trim.fq.gz'"
echo ""

echo "# Set number of CPUs to use"
echo 'export threads=40'
echo ""

echo "# Paths to programs"
echo 'export fastp=/home/shared/fastp'
echo 'export fastqc=/home/shared/FastQC-0.12.1/fastqc'
echo 'export multiqc=/home/sam/programs/mambaforge/bin/multiqc'
echo ""


echo "## Inititalize arrays"
echo 'export fastq_array_R1=()'
echo 'export fastq_array_R2=()'
echo 'export raw_fastqs_array=()'
echo 'export R1_names_array=()'
echo 'export R2_names_array=()'
echo ""

echo "# Print formatting"
echo 'export line="--------------------------------------------------------"'
echo ""

} > .bashvars

cat .bashvars
```

    #### Assign Variables ####

    # Data directories
    export repo_dir="/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/deep-dive-expression"
    export output_dir_top="${repo_dir}/E-Peve/output/01.00-E-Peve-WGBS-trimming-fastp-FastQC-MultiQC"
    export raw_reads_dir="${repo_dir}/E-Peve/data/raw-fastqs"
    export trimmed_fastqs_dir=${output_dir_top}/trimmed-fastqs
    export trimmed_fastqc_dir=${output_dir_top}/trimmed-fastqc

    # Set FastQ filename patterns
    export fastq_pattern='*.fastq.gz'
    export R1_fastq_pattern='*_R1_001.fastq.gz'
    export R2_fastq_pattern='*_R1_001.fastq.gz'
    export trimmed_fastq_pattern='*fastp-trim.fq.gz'

    # Set number of CPUs to use
    export threads=40

    # Paths to programs
    export fastp=/home/shared/fastp
    export fastqc=/home/shared/FastQC-0.12.1/fastqc
    export multiqc=/home/sam/programs/mambaforge/bin/multiqc

    ## Inititalize arrays
    export fastq_array_R1=()
    export fastq_array_R2=()
    export raw_fastqs_array=()
    export R1_names_array=()
    export R2_names_array=()

    # Print formatting
    export line="--------------------------------------------------------"

# 3 Fastp Trimming

[fastp](https://github.com/OpenGene/fastp) (**chen2023?**) is set to
auto-detect Illumina adapters, as well as trim the first 110bp from each
read, as past experience shows these first 10bp are more inconsistent
than the remainder of the read length.

``` bash
# Load bash variables into memory
source .bashvars

# Make output directories, if it doesn't exist
mkdir --parents "${trimmed_fastqs_dir}"

# Change to raw reads directory
cd "${raw_reads_dir}"

# Create arrays of fastq R1 files and sample names
for fastq in ${R1_fastq_pattern}
do
  fastq_array_R1+=("${fastq}")
  R1_names_array+=("$(echo "${fastq}" | awk -F"_" '{print $1}')")
done

# Create array of fastq R2 files
for fastq in ${R2_fastq_pattern}
do
  fastq_array_R2+=("${fastq}")
  R2_names_array+=("$(echo "${fastq}" | awk -F"_" '{print $1}')")
done

# Create list of fastq files used in analysis
# Create MD5 checksum for reference
if [ ! -f "${output_dir_top}"/raw-fastq-checksums.md5 ]; then
  for fastq in *.gz
    do
      md5sum "${fastq}" >>"${output_dir_top}"/raw-fastq-checksums.md5
  done
fi

# Run fastp on files
# Adds JSON report output for downstream usage by MultiQC
for index in "${!fastq_array_R1[@]}"
do
  R1_sample_name=$(echo "${R1_names_array[index]}")
  R2_sample_name=$(echo "${R2_names_array[index]}")
  echo "${R1_sample_name}"
  echo ""
  ${fastp} \
  --in1 ${fastq_array_R1[index]} \
  --in2 ${fastq_array_R2[index]} \
  --detect_adapter_for_pe \
  --trim_front1 10 \
  --trim_front2 10 \
  --trim_poly_g \
  --thread ${threads} \
  --html "${trimmed_fastqs_dir}"/"${R1_sample_name}".fastp-trim.report.html \
  --json "${trimmed_fastqs_dir}"/"${R1_sample_name}".fastp-trim.report.json \
  --out1 "${trimmed_fastqs_dir}"/"${R1_sample_name}"_R1.fastp-trim.fq.gz \
  --out2 "${trimmed_fastqs_dir}"/"${R2_sample_name}"_R2.fastp-trim.fq.gz \
  2>> "${trimmed_fastqs_dir}"/fastp.stderr

  # Generate md5 checksums for newly trimmed files
  cd "${trimmed_fastqs_dir}"
  md5sum "${R1_sample_name}"_R1.fastp-trim.fq.gz > "${R1_sample_name}"_R1.fastp-trim.fq.gz.md5
  md5sum "${R2_sample_name}"_R2.fastp-trim.fq.gz > "${R2_sample_name}"_R2.fastp-trim.fq.gz.md5
  
  # Change back to previous directory
  cd -
done
```

# 4 Quality Check with FastQC and MultiQC

``` bash
# Load bash variables into memory
source .bashvars


############ RUN FASTQC ############


# Create array of trimmed FastQs
trimmed_fastqs_array=(${trimmed_fastqs_dir}/${trimmed_fastq_pattern})

# Pass array contents to new variable as space-delimited list
trimmed_fastqc_list=$(echo "${trimmed_fastqs_array[*]}")

echo "Beginning FastQC on trimmed reads..."
echo ""

# Run FastQC
### NOTE: Do NOT quote raw_fastqc_list
${fastqc} \
--threads ${threads} \
--outdir ${trimmed_fastqs_dir} \
--quiet \
${trimmed_fastqc_list}

echo "FastQC on trimmed reads complete!"
echo ""

############ END FASTQC ############

############ RUN MULTIQC ############
echo "Beginning MultiQC on trimmed FastQC..."
echo ""

${multiqc} ${trimmed_fastqs_dir} -o ${trimmed_fastqs_dir}

echo ""
echo "MultiQC on trimmed FastQs complete."
echo ""

############ END MULTIQC ############

echo "Removing FastQC zip files."
echo ""
rm ${trimmed_fastqs_dir}/*.zip
echo "FastQC zip files removed."
echo ""
```

    Beginning FastQC on trimmed reads...

    application/gzip
    application/gzip
    application/gzip
    application/gzip
    application/gzip
    application/gzip
    application/gzip
    application/gzip
    application/gzip
    application/gzip
    FastQC on trimmed reads complete!

    Beginning MultiQC on trimmed FastQC...


      /// MultiQC 🔍 | v1.14

    |           multiqc | MultiQC Version v1.27 now available!
    |           multiqc | Search path : /home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/deep-dive-expression/E-Peve/output/01.00-E-Peve-WGBS-trimming-fastp-FastQC-MultiQC/trimmed-fastqs
    |         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 46/46  
    |            fastqc | Found 10 reports
    |           multiqc | Compressing plot data
    |           multiqc | Report      : ../output/01.00-E-Peve-WGBS-trimming-fastp-FastQC-MultiQC/trimmed-fastqs/multiqc_report.html
    |           multiqc | Data        : ../output/01.00-E-Peve-WGBS-trimming-fastp-FastQC-MultiQC/trimmed-fastqs/multiqc_data
    |           multiqc | MultiQC complete

    MultiQC on trimmed FastQs complete.

    Removing FastQC zip files.

    FastQC zip files removed.

<div id="refs" class="references csl-bib-body hanging-indent">

<div id="ref-ewels2016" class="csl-entry">

Ewels, Philip, Måns Magnusson, Sverker Lundin, and Max Käller. 2016.
“MultiQC: Summarize Analysis Results for Multiple Tools and Samples in a
Single Report.” *Bioinformatics* 32 (19): 3047–48.
<https://doi.org/10.1093/bioinformatics/btw354>.

</div>

</div>