---
title: "Week_06"
author: "Sarah Yerrace""
---

```{r setup, include=FALSE}
library(knitr)
library(tidyverse)
library(kableExtra)
library(DT)
library(Biostrings)
library(tm)
knitr::opts_chunk$set(
  echo = TRUE,         # Display code chunks
  eval = FALSE,         # Evaluate code chunks
  warning = FALSE,     # Hide warnings
  message = FALSE,     # Hide messages
  fig.width = 6,       # Set plot width in inches
  fig.height = 4,      # Set plot height in inches
  fig.align = "center" # Align plots to the center
)
```

# Assignment Details and Goals

Create and inspect and alignment files. Including visualizing and capturing "outside" graphics. Publish notebook in rpubs and provide link at top of code.

Minimally show bam file, and at least 2 genome feature files.

Goals: 1) Work with large files but keep them off the machine 2) work with SAM/BAM 3) Properly mark/ comment work

# Task 1:

This downloads the BAM file

```{r, engine='bash'}
cd ../data
curl -O https://gannet.fish.washington.edu/seashell/bu-mox/scrubbed/120321-cvBS/19F_R1_val_1_bismark_bt2_pe.deduplicated.sorted.bam
curl -O https://gannet.fish.washington.edu/seashell/bu-mox/scrubbed/120321-cvBS/19F_R1_val_1_bismark_bt2_pe.deduplicated.sorted.bam.bai
```

This downloads the genome for C. virginica (Eastern Oyster)

```{r, engine='bash'}
cd ../data
curl -O https://gannet.fish.washington.edu/seashell/bu-mox/data/Cvirg-genome/GCF_002022765.2_C_virginica-3.0_genomic.fa
curl -O https://gannet.fish.washington.edu/seashell/bu-mox/data/Cvirg-genome/GCF_002022765.2_C_virginica-3.0_genomic.fa.fai
```

Run this in terminal: make sure to navigate to appropriate directory first

/home/shared/samtools-1.12/samtools tview\
../data/19F_R1_val_1\_bismark_bt2_pe.deduplicated.sorted.bam\
../data/GCF_002022765.2_C\_virginica-3.0_genomic.fa

Take a screen shot of this to include for the assignment!

# Task 2: Aligning WGS data and visualizing in IGV

This chunk downloads fastq files for C. gigas (Pacific Oyster) R1 is 4.6 GB and R2 is 4.9 GB

```{r, engine='bash'}
cd ../data
curl -O https://owl.fish.washington.edu/nightingales/C_gigas/F143n08_R2_001.fastq.gz
curl -O https://owl.fish.washington.edu/nightingales/C_gigas/F143n08_R1_001.fastq.gz
```

This downloads the genome C. gigas (Pacific Oyster)

```{r, engine='bash'}
cd ../data
curl -O https://gannet.fish.washington.edu/panopea/Cg-roslin/cgigas_uk_roslin_v1_genomic-mito.fa
curl -O https://gannet.fish.washington.edu/panopea/Cg-roslin/cgigas_uk_roslin_v1_genomic-mito.fa.fai
curl -O https://gannet.fish.washington.edu/panopea/Cg-roslin/GCF_902806645.1_cgigas_uk_roslin_v1_genomic-mito.gtf
```

Alignment

This builds index using Hisat. It will have 8 files

```{r, engine='bash'}
/home/shared/hisat2-2.2.1/hisat2-build \
-f ../data/cgigas_uk_roslin_v1_genomic-mito.fa \
../output/cgigas_uk_roslin_v1_genomic-mito.index
```

Build the SAM file.

Sam file is very large: 63.9 GB

```{r, engine='bash'}
/home/shared/hisat2-2.2.1/hisat2 \
-x ../output/cgigas_uk_roslin_v1_genomic-mito.index \
-p 4 \
-1 ../data/F143n08_R1_001.fastq.gz \
-2 ../data/F143n08_R2_001.fastq.gz \
-S ../output/F143_cgigas.sam
```

```{r, engine='bash'}
tail -1 ../output/F143_cgigas.sam
```

Will likely have to point to Crabby-Duck Repo for the rest of the assignment...

This should convert SAM to BAM

```{r, engine='bash'}
# Convert SAM to BAM, using 4 additional threads
/home/shared/samtools-1.12/samtools view -@ 4 -bS \
/home/shared/8TB_HDD_01/sr320/github/steven-crabby-duck/output/F143_cgigas.sam > ../output/F143_cgigas.bam
```

This sorts and indexs the BAM file

```{r, engine='bash'}
# Sort the BAM file, using 4 additional threads
/home/shared/samtools-1.12/samtools sort -@ 4 \
../output/F143_cgigas.bam -o ../output/F143_cgigas_sorted.bam

# Index the sorted BAM file (multi-threading is not applicable to this operation)
/home/shared/samtools-1.12/samtools index \
../output/F143_cgigas_sorted.bam
```

Use `bcftools mpileup` to generate a pileup in a (.txt) Binary Call Format file

However, according to announcement #10, do not have to run bcftools

```{r, engine='bash'}
/home/shared/bcftools-1.14/bcftools mpileup --threads 4 --no-BAQ \
--fasta-ref ../data/cgigas_uk_roslin_v1_genomic-mito.fa \
../output/F143_cgigas_sorted.bam > ../output/F143_mpileup_output.txt
```

Take a look at the mileup file

```{r, engine='bash'}
tail ../output/F143_mpileup_output.txt
```

call variants, output variant sites, and store results in VCF file

```{r, engine='bash'}
cat ../output/F143_mpileup_output.txt \
| /home/shared/bcftools-1.14/bcftools call -mv -Oz \
> ../output/F143_mpile.vcf.gz
```

Note, these chunks may not work but cause use VCF from chunk above for visualization

```{r, engine='bash'}
zgrep "^##" -v ../output/F143_mpile.vcf.gz | \
awk 'BEGIN{OFS="\t"} {split($8, a, ";"); print $1,$2,$4,$5,$6,a[1],$9,$10}' | head

```

```{r, engine='bash'}
/home/shared/bcftools-1.14/bcftools call \
-v -c ../output/F143_mpile.vcf.gz \
> ../output/F143_mpile_calls.vcf
```

# Screenshots as required

![](assignments/output/Screenshot%202023-05-05%20163108.png)

I don't think this is what it's supposed to look like but I couldn't zoom in or anything for some reason...

![](assignments/output/06_IGV_GCF.png)

![](assignments/output/06_IGV_translated.png)