---
title: "11-multispecies-RNASeq-trimming"
output: html_document
date: "2025-02-18"
---


```{bash}
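# Code chunk from Sam's shell script, kept here for reference.
# The variables used below (fastq_array_R1, fastq_array_R2, index, threads,
# sample_name, R1_sample_name, R2_sample_name, timestamp) are not defined in
# this notebook, so the chunk will not run on its own.

# Run fastp
# Specifies reports in HTML and JSON formats
/home/shared/fastp \
  --in1 "${fastq_array_R1[index]}" \
  --in2 "${fastq_array_R2[index]}" \
  --detect_adapter_for_pe \
  --thread "${threads}" \
  --html "${sample_name}".fastp-trim."${timestamp}".report.html \
  --json "${sample_name}".fastp-trim."${timestamp}".report.json \
  --out1 "${R1_sample_name}".fastp-trim."${timestamp}".fq.gz \
  --out2 "${R2_sample_name}".fastp-trim."${timestamp}".fq.gz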
```

```{bash}
# Set the directory containing the raw FASTQ files
FASTQ_DIR="/home/shared/8TB_HDD_02/graceac9/multispecies2023"
THREADS=16  # Adjust as needed
OUTDIR="../output/11-multi-fastp"

# Create the output directory if it does not already exist
mkdir -p "$OUTDIR"

# Loop through all R1 files in the directory
for R1_FILE in "${FASTQ_DIR}"/*_R1_001.fastq.gz; do
  # Derive corresponding R2 file name
  R2_FILE="${R1_FILE/_R1_001.fastq.gz/_R2_001.fastq.gz}"
  
  # Ensure the R2 file exists
  if [[ ! -f "$R2_FILE" ]]; then
    echo "Skipping ${R1_FILE}, no matching R2 file found."
    continue
  fi
  
  # Extract the sample name
  SAMPLE_NAME=$(basename "$R1_FILE" _R1_001.fastq.gz)
  
  # Define output file names
  OUT_R1="${OUTDIR}/${SAMPLE_NAME}_R1.fastp-trim.fq.gz"
  OUT_R2="${OUTDIR}/${SAMPLE_NAME}_R2.fastp-trim.fq.gz"
  HTML_REPORT="${OUTDIR}/${SAMPLE_NAME}.fastp-trim.report.html"
  JSON_REPORT="${OUTDIR}/${SAMPLE_NAME}.fastp-trim.report.json"

  # Run fastp: auto-detect paired-end adapters and trim the first 10 bases of each read
  /home/shared/fastp --in1 "$R1_FILE" \
        --in2 "$R2_FILE" \
        --detect_adapter_for_pe \
        --trim_front1 10 \
        --trim_front2 10 \
        --thread "$THREADS" \
        --html "$HTML_REPORT" \
        --json "$JSON_REPORT" \
        --out1 "$OUT_R1" \
        --out2 "$OUT_R2"

  echo "Finished processing: $SAMPLE_NAME"
done

echo "All samples processed."
```
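The pairing logic in the loop rests on two string operations: bash pattern substitution to derive the R2 path, and stripping the shared suffix to get the sample name (done here with the suffix argument of `basename`). A small sketch on a hypothetical file name; the path is made up for illustration and does not need to exist:

```bash
# Hypothetical R1 path, for illustration only
R1_FILE="/tmp/example/SampleA_S1_L001_R1_001.fastq.gz"

# Swap the R1 suffix for the R2 suffix to get the mate file
R2_FILE="${R1_FILE/_R1_001.fastq.gz/_R2_001.fastq.gz}"

# Drop the directory and the R1 suffix to get the sample name
SAMPLE_NAME=$(basename "$R1_FILE" _R1_001.fastq.gz)

echo "$R2_FILE"      # /tmp/example/SampleA_S1_L001_R2_001.fastq.gz
echo "$SAMPLE_NAME"  # SampleA_S1_L001
```

Trying this on one real file name before running the full loop is a quick way to confirm that the suffixes match the sequencing facility's naming scheme.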

```{bash}
# Set CPU threads to use
threads=48

# Make sure the FastQC output directory exists
fastqc_outdir="/home/shared/8TB_HDD_02/graceac9/fastqc/trimmedmusp"
mkdir -p "${fastqc_outdir}"

# Populate array with the trimmed FastQ files
fastq_array=(/home/shared/8TB_HDD_02/graceac9/GitHub/project-pycno-multispecies-2023/output/11-multi-fastp/*.fq.gz)

# Run FastQC
# Expanding the array as "${fastq_array[@]}" passes each file as its own argument
/home/shared/FastQC-0.12.1/fastqc \
  --threads "${threads}" \
  --outdir "${fastqc_outdir}" \
  "${fastq_array[@]}"
```
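The FastQC call only works if every file reaches the program as a separate argument. A minimal sketch of how bash handles this, using hypothetical file names and a hypothetical helper `count_args` that reports how many arguments it received:

```bash
# Hypothetical helper: prints the number of arguments it was given
count_args() {
  echo "$#"
}

# Hypothetical file names, for illustration only
files=("sampleA_R1.fastp-trim.fq.gz" "sampleA_R2.fastp-trim.fq.gz")

# Array joined into one string, then quoted: arrives as a single argument
joined="${files[*]}"
n_joined_quoted=$(count_args "${joined}")

# Joined string left unquoted: the shell splits it on whitespace
n_joined_unquoted=$(count_args ${joined})

# Quoted array expansion: one argument per element
n_array=$(count_args "${files[@]}")

echo "${n_joined_quoted} ${n_joined_unquoted} ${n_array}"  # prints: 1 2 2
```

The unquoted string and the quoted array give the same count here, but only the quoted array expansion would keep a path containing spaces in one piece.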
FastQC files are in: `/home/shared/8TB_HDD_02/graceac9/fastqc/trimmedmusp`


In the terminal of the R project, run:

`eval "$(/opt/anaconda/anaconda3/bin/conda shell.bash hook)"`

`conda activate`

Then navigate into the directory `/home/shared/8TB_HDD_02/graceac9/fastqc/trimmedmusp` and run `multiqc .` in the terminal.

The report is generated within seconds.
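MultiQC can also write the report under a custom name, which would avoid renaming it after the transfer. A sketch, assuming `multiqc` is on the `PATH` once the conda environment is active; the variable `REPORT_NAME` is introduced here for illustration only:

```bash
# Report name used in this notebook for the trimmed data
REPORT_NAME="multiqc_report_trimmedRNAseqData.html"

# Only call MultiQC if it is available in the current shell
if command -v multiqc >/dev/null 2>&1; then
  # --filename sets the name of the generated report
  multiqc . --filename "$REPORT_NAME"
else
  echo "multiqc not found; activate the conda environment first" >&2
fi
```

With this approach the file to transfer would be `multiqc_report_trimmedRNAseqData.html` rather than `multiqc_report.html`.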


To view the report, transfer the HTML file to Owl or Gannet.

In the terminal, while still in the directory where the MultiQC report lives, run the following to `rsync` the file to the directory on Owl:
`rsync --archive --progress --verbose multiqc_report.html grace@owl.fish.washington.edu:/volume1/web/gcrandall/multispeciesSSWD/QCreports`

The report now lives on Owl: http://owl.fish.washington.edu/gcrandall/multispeciesSSWD/QCreports/multiqc_report_trimmedRNAseqData.html

* NOTE: On Owl, I renamed the MultiQC report to `multiqc_report_trimmedRNAseqData.html` to distinguish it from the other MultiQC report that will be stored in the same directory.