---
title: "02-trim-primers"
output: html_document
date: "2023-04-25"
---

This workflow is adapted from the following pipeline: https://benjjneb.github.io/dada2/ITS_workflow.html

# Get started

install r packages
```{r}
install.packages('ShortRead')
install.packages("BiocManager")
BiocManager::install("dada2")
install.packages('ShortRead')
```

load packages
```{r}
library(BiocManager)
library(ShortRead)
library("dada2")
```


To get cutadapt, I downloaded the miniconda installer using curl 

install cutadapt software
```{bash}
#download miniconda from url
cd ../software
curl -O https://repo.anaconda.com/miniconda/Miniconda3-py310_23.3.1-0-MacOSX-arm64.sh

```

and then moved to the directory it was in by typing `cd software` in the terminal. I then typed bash Miniconda3-py310_23.3.1-0-MacOSX-arm64.sh in the terminal to start the setup process. Conda is located in: /Users/hailaschultz/miniconda3. 

I checked that installation worked by typing `conda list` in the terminal. In the terminal I then typed:
`conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict`

and then `conda create -n cutadaptenv cutadapt`

On my personal computer, I had to use `CONDA_SUBDIR=osx-64 conda create -n cutadaptenv cutadapt`

I ran conda init bash

To activate the conda environment, in the terminal I typed `conda activate cutadaptenv`

# Remove Primers
download fastqc

```{bash}
curl -O https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.7.zip
```

use fastqc 
```{bash}
cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/data/fastq-files
/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/software/FastQC/fastqc *fastq.gz -o /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/output/raw-read-qc
```

make multiqc report
in the terminal I ran `conda install multiqc` to install multiqc

```{bash}
cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/output/raw-read-qc

multiqc .
```


take a look at data

```{bash}
zcat 2018APRP12Ve-COI-48samples_S145_L001_R1_001.fastq.gz | head -n 2
```

Make lists of forward and reverse read files

list files for forward and reverse reads
```{r}
#set file path
path <- "../data/fastq-files"
#make matched list of forward and reverse reads
fnFs <- sort(list.files(path, pattern = "_R1_001.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(path, pattern = "_R2_001.fastq.gz", full.names = TRUE))

#specify primer sequences
FWD <- "GGWACWGGWTGAACWGTWTAYCCYCC"  
REV <- "TANACYTCNGGRTGNCCRAARAAYCA"  
#in the reverse primer, I replaced I with N
```


Check if primers are actually in data

get orientations of primers
```{r}
allOrients <- function(primer) {
    # Create all orientations of the input sequence
    require(Biostrings)
    dna <- DNAString(primer)  # The Biostrings works w/ DNAString objects rather than character vectors
    orients <- c(Forward = dna, Complement = complement(dna), Reverse = reverse(dna), 
        RevComp = reverseComplement(dna))
    return(sapply(orients, toString))  # Convert back to character vector
}
FWD.orients <- allOrients(FWD)
REV.orients <- allOrients(REV)
FWD.orients
```

count number of times primers occur
```{r}
primerHits <- function(primer, fn) {
    # Counts number of reads in which the primer is found
    nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
    return(sum(nhits > 0))
}
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs[[1]]), 
    FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs[[1]]), 
    REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs[[1]]), 
    REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs[[1]]))
```
Tell R the path to the cutadapt command

```{r}
cutadapt <- "/Users/hailaschultz/miniconda3/envs/cutadaptenv/bin/cutadapt" 
system2(cutadapt, args = "--version") 
```


use cutadapt
```{r}
path.cut <- file.path(path, "cutadapt")
if(!dir.exists(path.cut)) dir.create(path.cut)
fnFs.cut <- file.path(path.cut, basename(fnFs))
fnRs.cut <- file.path(path.cut, basename(fnRs))

FWD.RC <- dada2:::rc(FWD)
REV.RC <- dada2:::rc(REV)
# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
R1.flags <- paste("-g", FWD, "-a", REV.RC) 
# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
R2.flags <- paste("-G", REV, "-A", FWD.RC) 
# Run Cutadapt
for(i in seq_along(fnFs)) {
  system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2, # -n 2 required to remove FWD and REV from reads
                             "-o", fnFs.cut[i], "-p", fnRs.cut[i], # output files
                             fnFs[i], fnRs[i])) # input files
}
```


Check if adapters were trimmed for one sample
```{r}
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[1]]), 
    FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.cut[[1]]), 
    REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.cut[[1]]), 
    REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.cut[[1]]))
#there are 5 forward reads that weren't trimmed... don't know why this is
```

QC check on files after primers were trimmed

use fastqc 
```{bash}
cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/data/fastq-files/cutadapt
/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/software/FastQC/fastqc *fastq.gz -o /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/output/after-cutadapt-qc
```

make multiqc report

```{bash}
cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/output/after-cutadapt-qc

multiqc .
```