---
title: "15-hisat2-deseq2-summer2022"
output: html_document
date: "2024-09-02"
---
Follow code from 03-hisat2-summer2021_2022.Rmd. A lot of steps were done in the beginning that don't need to be repeated. 

Summer 2022    
`/home/shared/8TB_HDD_02/graceac9/data/pycno2022` 

```{r, engine='bash'}
cd ../data

```


```{r, engine='bash', eval=TRUE}
head ../data/augustus.hints.gtf

```

For `stringtie` later, the gtf needs to be in gff format. There isn't currently one in existence that works. Steven figured out how to fix this issue (https://sr320.github.io/tumbling-oysters/posts/sr320-20-seastar/)  

Won't run code because Steven did this already and the new file lives in:
` ../analyses/12-fix-gff/mod_augustus.gtf`

```{r}
library(knitr)
library(tidyverse)
library(Biostrings)
library(pheatmap)
library(DESeq2)
library(tidyverse)
```

# 1. Build an index
Follow this code: https://htmlpreview.github.io/?https://github.com/urol-e5/deep-dive/blob/8fd4ad4546d1d95464952f0509406efd9e42ffa0/D-Apul/code/04-Apulcra-hisat.html 

```{bash}
pwd
```

From the gtf, get exon list
```{bash}
/home/shared/hisat2-2.2.1/hisat2_extract_exons.py \
../analyses/12-fix-gff/mod_augustus.gtf \
> ../analyses/15-hisat2-deseq2-summer2022/m_exon.tab
```


```{bash}
head ../analyses/15-hisat2-deseq2-summer2022/m_exon.tab
```

from the gtf, get the splice sites
```{bash}
#!/bin/bash

# This script will extract splice sites from the gtf file

# This is the command to extract splice sites from the gtf file
/home/shared/hisat2-2.2.1/hisat2_extract_splice_sites.py \
../analyses/12-fix-gff/mod_augustus.gtf \
> ../analyses/15-hisat2-deseq2-summer2022/m_splice_sites.tab

```

use the genome fasta to make an index for alignment
```{bash}
# build an index 
/home/shared/hisat2-2.2.1/hisat2-build \
../data/ncbi_dataset/data/GCA_032158295.1/GCA_032158295.1_ASM3215829v1_genomic.fna \
../data/Phel_genome_index \
--exon ../analyses/15-hisat2-deseq2-summer2022/m_exon.tab \
--ss ../analyses/15-hisat2-deseq2-summer2022/m_splice_sites.tab \
-p 40 \
../analyses/12-fix-gff/mod_augustus.gtf \
2> ../analyses/15-hisat2-deseq2-summer2022/hisat2-build_stats.txt
```


```{bash}
tail ../analyses/03-hisat2/hisat2-build_stats.txt
```


# 2. ALIGNMENT
Trimmed data in `/home/shared/8TB-`/home/shared/8TB_HDD_02/graceac9/data/pycno2022`
```{bash}
pwd
```


```{bash}
for file in ../../../data/pycno2022/*_R1*fq.gz; do
    # Remove the  part to get the base name
    base=$(basename "$file" _R1_001.fastq.gz.fastp-trim.20231101.fq.gz)
    # Construct the names of the pair of files
    file1=${base}_R1_001.fastq.gz.fastp-trim.20231101.fq.gz
    file2=${base}_R2_001.fastq.gz.fastp-trim.20231101.fq.gz
    # Run the hisat2 command
    /home/shared/hisat2-2.2.1/hisat2 \
    -x ../data/Phel_genome_index \
    --dta \
    -p 20 \
    -1 ../../../data/pycno2022/$file1 \
    -2 ../../../data/pycno2022/$file2 \
    -S ../analyses/15-hisat2-deseq2-summer2022/${base}.sam \
    2> ../analyses/15-hisat2-deseq2-summer2022/${base}-hisat.out
done
```

```{bash}
cat ../analyses/15-hisat2-deseq2-summer2022/*-hisat.out
```

Convert sam to bam
```{bash}
for samfile in ../analyses/15-hisat2-deseq2-summer2022/*.sam; do
  bamfile="${samfile%.sam}.bam"
  sorted_bamfile="${samfile%.sam}.sorted.bam"
  /home/shared/samtools-1.12/samtools view -bS -@ 20 "$samfile" > "$bamfile"
  /home/shared/samtools-1.12/samtools sort -@ 20 "$bamfile" -o "$sorted_bamfile"
  /home/shared/samtools-1.12/samtools index -@ 20 "$sorted_bamfile"
done
```

```{bash}
ls ../analyses/15-hisat2-deseq2-summer2022/*sorted.bam | wc -l
```

```{r, engine='bash', eval=TRUE}
cat ../analyses/15-hisat2-deseq2-summer2022/*hisat.out \
| grep "overall alignment rate"
``` 

# `stringtie`
```{bash}
find ../analyses/15-hisat2-deseq2-summer2022/*sorted.bam \
| xargs basename -s .sorted.bam | xargs -I{} \
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
-p 36 \
-eB \
-G ../analyses/13-hisat-deseq2/mod_augustus.gff \
-o ../analyses/15-hisat2-deseq2-summer2022/{}.gtf \
../analyses/15-hisat2-deseq2-summer2022/{}.sorted.bam
```

```{bash}
eval "$(/opt/anaconda/anaconda3/bin/conda shell.bash hook)"
conda activate
which multiqc


multiqc ../analyses/15-hisat2-deseq2-summer2022/ \
-o ../analyses/15-hisat2-deseq2-summer2022/
```

# get a count matrix
```{bash}
ls ../analyses/15-hisat2-deseq2-summer2022/*gtf
```

copy the above and put into a txt file

write library name to left of path

```{bash}
cat ../analyses/15-hisat2-deseq2-summer2022/list01.txt
```

```{r, engine='bash'}
python /home/shared/stringtie-2.2.1.Linux_x86_64/prepDE.py \
-i ../analyses/15-hisat2-deseq2-summer2022/list01.txt \
-g ../data/gene_count_matrix_2022.csv \
-t ../data/transcript_count_matrix_2022.csv
```

```{bash}
head ../data/*matrix_2022.csv
```

# `DESEQ2` for Experiment C

```{r}
library(DESeq2)
library(dplyr)
library(ggplot2)
library(pheatmap)
library("pcaExplorer")
library("airway")
```


Install 2024-09-06 - `pcaExplorer`:
```{r}
#if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

#BiocManager::install("pcaExplorer")
```
```{r}
#if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
#BiocManager::install("airway")
```


## Explore all the samples with `pcaExplorer` 

made a coldata frame in google sheets https://docs.google.com/spreadsheets/d/1tZuBu8wFnKgh_mvZY8jpleF4nzMsQh-NOHZ1YKoeyqU/edit?gid=0#gid=0 

downloaded as .csv, then copy-pasted into data directory and pushed to github 

in `pcaExplorere`, can select the count matrices to import

it's call coldata2022.csv in data in this repo

```{r}
pcaExplorer()
```


saved a bunch of plots 
can be found in analyses/15-hisat2-deseq2-summer2022/pcaExplorer_plots

## Use `DESeq2` to compare the samples that `pcaExplorer` showed would be interesting 
Want a balanced comparison, so I'll be comparing the libraries shown in the plot [noday0_norepeatstars_size_disease_signs_3pergroup_samplesPca_sampleout.pdf](https://github.com/grace-ac/paper-pycno-sswd-2021-2022/blob/main/analyses/15-hisat2-deseq2-summer2022/pcaExplorer_plots/noday0_norepeatstars_size_disease_signs_3pergroup_samplesPca_sampleout.pdf)

I'm not sure why the sample IDs got all cut-off, but below is a table detailing the samples:    

| library_ID | star_ID | star_age_class | treatment | experiment_day | total_armdrop | total_armtwist | disease_sign  |
|------------|---------|----------------|-----------|----------------|---------------|----------------|---------------|
| PSC.0186   |      15 | adult          | control   |             11 |             0 |              0 | healthy       |
| PSC.0203   |      31 | adult          | control   |             12 |             0 |              0 | healthy       |
| PSC.0209   |       9 | adult          | control   |             13 |             0 |              0 | healthy       |
| PSC.0177   |      34 | juvenile       | control   |             11 |             0 |              0 | healthy       |
| PSC.0219   |      63 | juvenile       | control   |             14 |             0 |              0 | healthy       |
| PSC.0230   |      65 | juvenile       | control   |             15 |             0 |              0 | healthy       |
| PSC.0174   |      30 | adult          | exposed   |             10 |             1 |              6 | first_armdrop |
| PSC.0190   |      10 | adult          | exposed   |             11 |             1 |              0 | first_armdrop |
| PSC.0231   |      13 | adult          | exposed   |             15 |             1 |              5 | first_armdrop |
| PSC.0187   |      57 | juvenile       | exposed   |             11 |             1 |              1 | first_armdrop |
| PSC.0188   |      38 | juvenile       | exposed   |             11 |             1 |              0 | first_armdrop |
| PSC.0228   |      61 | juvenile       | exposed   |             14 |             1 |              0 | first_armdrop |


## `DESeq2` --> Differentially expressed genes and transcripts for the above library comparisons
### DIFFERENTIALLY EXPRESSED TRANSCRIPTS

Read in counts: 
```{r}
counts2022 <- read.csv("../data/transcript_count_matrix_2022.csv", row.names = "transcript_id")
head(counts2022)
```
subset just the counts for the libraries specified in the table above:
```{r}
counts2022sub <- select(counts2022, PSC.0228, PSC.0187, PSC.0188, PSC.0174, PSC.0190, PSC.0231, PSC.0230, PSC.0219, PSC.0177, PSC.0186, PSC.0209, PSC.0203)
head(counts2022sub)
```
create a text file called "conditions.txt" in `/analyses/15-hisat2-deseq2-summer2022/` and put the metadata for the libraries in it, with tabs between each column 

```{r}
deseq2.colData <- data.frame(condition=factor(c("exposed", "exposed", "exposed", "exposed", "exposed", "exposed", "control", "control", "control", "control", "control", "control")), 
                             type=factor(rep("paired-end", 12)),
                             age=factor(c("juvenile", "juvenile", "juvenile", "adult", "adult", "adult", "juvenile", "juvenile", "juvenile", "adult", "adult", "adult")))
rownames(deseq2.colData) <- colnames(counts2022sub)
```


set counts as matrix:
```{r}
counts2022sub <- as.matrix(counts2022sub)
```


```{r}
# Check all sample IDs in colData are also in CountData and match their orders
all(rownames(deseq2.colData) %in% colnames(counts2022sub)) #should return TRUE
```

```{r}
counts2022sub <- counts2022sub[, rownames(deseq2.colData)]
all(rownames(deseq2.colData) == colnames(counts2022sub)) # This should also return TRUE
```


modelled after crab paper https://github.com/RobertsLab/paper-tanner-crab/blob/master/scripts/DESeq.Rmd 

```{r}
# Create a DESeqDataSet from count matrix and labels, with the design taking age into account
dds <- DESeqDataSetFromMatrix(countData = counts2022sub,
                                     colData = deseq2.colData, 
                                     design = ~ condition + age)
```

Check levels of age and condition 
```{r}
levels(dds$age)
```

```{r}
levels(dds$condition)
```

note: `DESeq2` automatically puts the levels in alphabetical order and the first listed level is the reference level for the factor. 

We want control to be the reference. 
```{r}
dds$condition = relevel(dds$condition, "control")
levels(dds$condition)
```


The following will pull the results from `condition` because that is our variable of interest. This tells us how age contributes to the DEGs
```{r}
design(dds) <- formula(~ age + condition)
dds <- DESeq(dds)
```

Access results:
```{r}
deseq2.res <- results(dds)
head(deseq2.res)
```

```{r}
summary(deseq2.res)
```

```{r}
# Count number of hits with adjusted p-value less then 0.05
dim(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ])
```
6189 dets between control and exposed, impacted by age.... 

```{r}
inf <- deseq2.res
# The main plot
plot(inf$baseMean, inf$log2FoldChange, pch=20, cex=0.45, ylim=c(-15, 15), log="x", col="darkgray",
     #main="Infection Status  (pval </= 0.05)",
     xlab="mean of normalized counts",
     ylab="Log2 Fold Change")
# Getting the significant points and plotting them again so they're a different color
inf.sig <- deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ]
points(inf.sig$baseMean, inf.sig$log2FoldChange, pch=20, cex=0.45, col="red")
# 2 FC lines
abline(h=c(-1,1), col="blue")
```
Write out differentially expressed gene list: 
```{r}
#write.table(inf.sig, "../analyses/15-hisat2-deseq2-summer2022/DETlist_2022_exposed_v_control_no_contrast.tab", sep = '\t', row.names = T, quote = FALSE)
```

wrote out 2024-09-17


Now need to perform a contrast to see if there is a difference between these groups as it relates to age. 

In the multifactor section of the `DESeq2` manual:                
The contrast argument of the function _results_ needs a character vector of three componenets: the name of the variable (in this case "age"), and the name of the factor level for the numerator of the log2 ratio (juvenile) and the denominator (adult) 

A **contrast** is a linear combination of estimated log2 fold changes. Can be used to test if differences between groups are equal to zero.         
```{r}
resultsNames(dds)
```

```{r}
deseq2.resage <- results(dds,
                          contrast = c("age", "juvenile",  "adult"))
head(deseq2.resage)
```


```{r}
age <- deseq2.resage
# The main plot
plot(age$baseMean, age$log2FoldChange, pch=20, cex=0.45, ylim=c(-15, 15), log="x", col="darkgray",
     #main="Age  (pval </= 0.05)",
     xlab="mean of normalized counts",
     ylab="Log2 Fold Change")
# Getting the significant points and plotting them again so they're a different color
age.sig <- deseq2.resage[!is.na(deseq2.resage$padj) & deseq2.resage$padj <= 0.05, ]
points(age.sig$baseMean, age.sig$log2FoldChange, pch=20, cex=0.45, col="red")
# 2 FC lines
abline(h=c(-1,1), col="blue")
```
```{r}
# Count number of hits with adjusted p-value less then 0.05
dim(deseq2.resage[!is.na(deseq2.resage$padj) & deseq2.resage$padj <= 0.05, ])
```
92 DETs -  differentially expressed transcripts associated with age that are different between control and exposed... 

```{r}
summary(deseq2.resage)
```

### make a heatmap of those DETs: 

`join` the DET list with the count data from those libraries:
write out the count matrix for those libraries:

```{r}
counts2022 <- read.csv("../data/transcript_count_matrix_2022.csv", row.names = "transcript_id")
head(counts2022)
```

subset just the counts for the libraries specified in the table above:
```{r}
counts2022sub <- select(counts2022, PSC.0228, PSC.0187, PSC.0188, PSC.0174, PSC.0190, PSC.0231, PSC.0230, PSC.0219, PSC.0177, PSC.0186, PSC.0209, PSC.0203)
head(counts2022sub)
```

```{r}
#write.table(counts2022sub, "../analyses/15-hisat2-deseq2-summer2022/subset_summer2022_count_matrix.tab", sep = '\t', row.names = T)
```
wrote out and commented out code 2024-09-12


```{r}
#write.table(age.sig, "../analyses/15-hisat2-deseq2-summer2022/DETlist-contrast_age.tab", sep = "\t", row.names = T, quote = FALSE, col.names = TRUE)
```
Wrote out 09112024. Comment out code. 

read in the files:
```{r}
counts2022sub <- read.delim("../analyses/15-hisat2-deseq2-summer2022/subset_summer2022_count_matrix.tab")
counts2022sub
```
set rownames as a colum called transcript ID
```{r}
library(tibble)
counts2022sub <- tibble::rownames_to_column(counts2022sub, "transcriptID")
head(counts2022sub)
```
read in the DETlist:
```{r}
detage <- read.delim("../analyses/15-hisat2-deseq2-summer2022/DETlist-contrast_age.tab")
head(detage)
```
set rownames as a colum called transcript ID
```{r}
library(tibble)
detage <- tibble::rownames_to_column(detage, "transcriptID")
head(detage)
```
`join` the lists based on the "transcriptID"
```{r}
detagecounts <- left_join(detage, counts2022sub, by = "transcriptID")
head(detagecounts)
```
wrte out table:


```{r}
#write.table(detagecounts, "../analyses/15-hisat2-deseq2-summer2022/DETlist_counts_contrast_age.tab", sep = "\t", row.names = T, quote = FALSE, col.names = TRUE)
```
wrote out 2024-09-12


### `join` the DET list with the blast uniprot output: 
read in the blast output and blastquery go slim output:
```{r}
blast <- read.table("../analyses/06-BLAST/summer2021-uniprot_blastx.tab")
head(blast)
```

rename first column "transcriptID":
```{r}
colnames(blast) <- c("transcriptID", "V2", "uniprot_accession_ID", "gene", "V5", "V6", "V7", "V8", "V9", "V10", "V11", "V12", "V13", "V14")
head(blast)
```
```{r}
blastq <- read.delim("../analyses/06-BLAST/Blastquery-GOslim.tab", sep = '\t', header = F)
head(blastq)
```
rename first column "transcriptID":
```{r}
colnames(blastq) <- c("transcriptID", "GO_ID", "biolodical_process", "V4")
head(blastq)
```

reaad in the DET list from above:
```{r}
detage <- read.table("../analyses/15-hisat2-deseq2-summer2022/DETlist-contrast_age.tab")
head(detage)
```
make rownmaes into a column called "transcriptID":
```{r}
library(tibble)
detage <- tibble::rownames_to_column(detage, "transcriptID")
head(detage)
```

`join` the two tables based on transcriptID:
```{r}
detageblast <- left_join(detage, blast, by = "transcriptID")
head(detageblast)
```
`join` above with blastq
```{r}
detageblastq <- left_join(detageblast, blastq, by = "transcriptID")
head(detageblastq)
```


write out table:
```{r}
#write.table(detageblastq, "../analyses/15-hisat2-deseq2-summer2022/DETlist-contrast_age_annot.tab", sep = "\t", row.names = F, quote = FALSE, col.names = TRUE)
```
Wrote out 09112024. Comment out code. 

57 have annotations.

juvenile exposed are 0228, 0187, 0188
juvenile control are 0230, 0219, 0177

```{r}
# Select top 50 differentially expressed transcripts
res <- results(dds)
res_ordered <- res[order(res$padj), ]
top_genes <- row.names(res_ordered)[1:50]

# Extract counts and normalize
counts <- counts(dds, normalized = TRUE)
counts_top <- counts[top_genes, ]

# Log-transform counts
log_counts_top <- log2(counts_top + 1)

# Generate heatmap
pheatmap(log_counts_top, scale = "row")

```
first six columns are exposed, last six columns are control

# `DESeq2` with Gene List!!! CLEAR ENVIRONMENT
```{r}
library(DESeq2)
library(dplyr)
library(ggplot2)
library(pheatmap)
library("pcaExplorer")
library("airway")
```

| library_ID | star_ID | star_age_class | treatment | experiment_day | total_armdrop | total_armtwist | disease_sign  |
|------------|---------|----------------|-----------|----------------|---------------|----------------|---------------|
| PSC.0186   |      15 | adult          | control   |             11 |             0 |              0 | healthy       |
| PSC.0203   |      31 | adult          | control   |             12 |             0 |              0 | healthy       |
| PSC.0209   |       9 | adult          | control   |             13 |             0 |              0 | healthy       |
| PSC.0177   |      34 | juvenile       | control   |             11 |             0 |              0 | healthy       |
| PSC.0219   |      63 | juvenile       | control   |             14 |             0 |              0 | healthy       |
| PSC.0230   |      65 | juvenile       | control   |             15 |             0 |              0 | healthy       |
| PSC.0174   |      30 | adult          | exposed   |             10 |             1 |              6 | first_armdrop |
| PSC.0190   |      10 | adult          | exposed   |             11 |             1 |              0 | first_armdrop |
| PSC.0231   |      13 | adult          | exposed   |             15 |             1 |              5 | first_armdrop |
| PSC.0187   |      57 | juvenile       | exposed   |             11 |             1 |              1 | first_armdrop |
| PSC.0188   |      38 | juvenile       | exposed   |             11 |             1 |              0 | first_armdrop |
| PSC.0228   |      61 | juvenile       | exposed   |             14 |             1 |              0 | first_armdrop |

Read in counts: 
```{r}
counts2022 <- read.csv("../data/gene_count_matrix_2022.csv", row.names = "gene_id")
head(counts2022)
```
subset just the counts for the libraries specified in the table above:
```{r}
counts2022sub <- select(counts2022, PSC.0228, PSC.0187, PSC.0188, PSC.0174, PSC.0190, PSC.0231, PSC.0230, PSC.0219, PSC.0177, PSC.0186, PSC.0209, PSC.0203)
head(counts2022sub)
```

create a text file called "conditions.txt" in `/analyses/15-hisat2-deseq2-summer2022/` and put the metadata for the libraries in it, with tabs between each column 

```{r}
deseq2.colData <- data.frame(condition=factor(c("exposed", "exposed", "exposed", "exposed", "exposed", "exposed", "control", "control", "control", "control", "control", "control")), 
                             type=factor(rep("paired-end", 12)),
                             age=factor(c("juvenile", "juvenile", "juvenile", "adult", "adult", "adult", "juvenile", "juvenile", "juvenile", "adult", "adult", "adult")))
rownames(deseq2.colData) <- colnames(counts2022sub)
```


set counts as matrix:
```{r}
counts2022sub <- as.matrix(counts2022sub)
```

```{r}
# Check all sample IDs in colData are also in CountData and match their orders
all(rownames(deseq2.colData) %in% colnames(counts2022sub)) #should return TRUE
```

```{r}
counts2022sub <- counts2022sub[, rownames(deseq2.colData)]
all(rownames(deseq2.colData) == colnames(counts2022sub)) # This should also return TRUE
```

modelled after crab paper https://github.com/RobertsLab/paper-tanner-crab/blob/master/scripts/DESeq.Rmd 

```{r}
# Create a DESeqDataSet from count matrix and labels, with the design taking age into account
dds <- DESeqDataSetFromMatrix(countData = counts2022sub,
                                     colData = deseq2.colData, 
                                     design = ~ condition + age)
```

Check levels of age and condition 
```{r}
levels(dds$age)
```

```{r}
levels(dds$condition)
```

note: `DESeq2` automatically puts the levels in alphabetical order and the first listed level is the reference level for the factor. 

We want control to be the reference. 
```{r}
dds$condition = relevel(dds$condition, "control")
levels(dds$condition)
```


The following will pull the results from `condition` because that is our variable of interest. This tells us how age contributes to the DEGs
```{r}
design(dds) <- formula(~ age + condition)
dds <- DESeq(dds)
```

Access results:
```{r}
deseq2.res <- results(dds)
head(deseq2.res)
```

```{r}
summary(deseq2.res)
```

```{r}
# Count number of hits with adjusted p-value less then 0.05
dim(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ])
```

6202 degs

```{r}
inf <- deseq2.res
# The main plot
plot(inf$baseMean, inf$log2FoldChange, pch=20, cex=0.45, ylim=c(-15, 15), log="x", col="darkgray",
     #main="Infection Status  (pval </= 0.05)",
     xlab="mean of normalized counts",
     ylab="Log2 Fold Change")
# Getting the significant points and plotting them again so they're a different color
inf.sig <- deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ]
points(inf.sig$baseMean, inf.sig$log2FoldChange, pch=20, cex=0.45, col="red")
# 2 FC lines
abline(h=c(-1,1), col="blue")
```
Write out differentially expressed gene list: 
```{r}
#write.table(inf.sig, "../analyses/15-hisat2-deseq2-summer2022/DEGlist_2022_exposed_v_control_no_contrast.tab", sep = '\t', row.names = T, quote = FALSE)
```

write out 2024-09-17

Now need to perform a contrast to see if there is a difference between these groups as it relates to age. 

In the multifactor section of the `DESeq2` manual:                
The contrast argument of the function _results_ needs a character vector of three componenets: the name of the variable (in this case "age"), and the name of the factor level for the numerator of the log2 ratio (juvenile) and the denominator (adult) 

A **contrast** is a linear combination of estimated log2 fold changes. Can be used to test if differences between groups are equal to zero.         
```{r}
resultsNames(dds)
```

```{r}
deseq2.resage <- results(dds,
                          contrast = c("age", "juvenile",  "adult"))
head(deseq2.resage)
```

```{r}
age <- deseq2.resage
# The main plot
plot(age$baseMean, age$log2FoldChange, pch=20, cex=0.45, ylim=c(-15, 15), log="x", col="darkgray",
     #main="Age  (pval </= 0.05)",
     xlab="mean of normalized counts",
     ylab="Log2 Fold Change")
# Getting the significant points and plotting them again so they're a different color
age.sig <- deseq2.resage[!is.na(deseq2.resage$padj) & deseq2.resage$padj <= 0.05, ]
points(age.sig$baseMean, age.sig$log2FoldChange, pch=20, cex=0.45, col="red")
# 2 FC lines
abline(h=c(-1,1), col="blue")
```
```{r}
# Count number of hits with adjusted p-value less then 0.05
dim(deseq2.resage[!is.na(deseq2.resage$padj) & deseq2.resage$padj <= 0.05, ])
```

82 degs 

```{r}
summary(deseq2.resage)
```

```{r}
#write.table(age.sig, "../analyses/15-hisat2-deseq2-summer2022/DEGlist-contrast_age.tab", sep = "\t", row.names = T, quote = FALSE, col.names = TRUE)
```

wrote out 2024-09-17

## ANNOTATE DEG LISTS
read in the files:
```{r}
counts2022sub <- read.delim("../analyses/15-hisat2-deseq2-summer2022/subset_summer2022_count_matrix.tab")
counts2022sub
```
set rownames as a colum called transcript ID
```{r}
library(tibble)
counts2022sub <- tibble::rownames_to_column(counts2022sub, "geneID")
head(counts2022sub)
```
read in the DEGlist:
```{r}
degs <- read.delim("../analyses/15-hisat2-deseq2-summer2022/DEGlist_2022_exposed_v_control_no_contrast.tab")
head(degs)
```
set rownames as a colum called gene ID
```{r}
library(tibble)
degs <- tibble::rownames_to_column(degs, "geneID")
head(degs)
```
`join` the lists based on the "geneID"
```{r}
degcounts <- left_join(degs, counts2022sub, by = "geneID")
head(degcounts)
```
wrte out table:


```{r}
#write.table(degcounts, "../analyses/15-hisat2-deseq2-summer2022/DEGlist_counts_exposedVcontrol.tab", sep = "\t", row.names = T, quote = FALSE, col.names = TRUE)
```
wrote out 2024-09-17

### `join` the DEG count list with the blast uniprot output: 
read in the blast output and blastquery go slim output:
```{r}
blast <- read.table("../analyses/06-BLAST/summer2021-uniprot_blastx.tab")
head(blast)
```

rename first column "geneID":
```{r}
colnames(blast) <- c("geneID", "V2", "uniprot_accession_ID", "gene", "V5", "V6", "V7", "V8", "V9", "V10", "V11", "V12", "V13", "V14")
head(blast)
```
```{r}
blastq <- read.delim("../analyses/06-BLAST/Blastquery-GOslim.tab", sep = '\t', header = F)
head(blastq)
```
rename first column "geneID":
```{r}
colnames(blastq) <- c("geneID", "GO_ID", "biolodical_process", "V4")
head(blastq)
```
reaad in the DEG list from above:
```{r}
degs <- read.table("../analyses/15-hisat2-deseq2-summer2022/DEGlist_counts_exposedVcontrol.tab")
head(degs)
```

`join` the two tables based on transcriptID:
```{r}
degsblast <- left_join(degs, blast, by = "geneID")
head(degsblast)
```
`join` above with blastq
```{r}
degsblastq <- left_join(degsblast, blastq, by = "geneID")
head(degsblastq)
```
write out table:
```{r}
#write.table(degsblastq, "../analyses/15-hisat2-deseq2-summer2022/DEGlist-counts_exposedVcontrol_annot.tab", sep = "\t", row.names = F, quote = FALSE, col.names = TRUE)
```
Wrote out 2024-09-17 Comment out code. 

#### ANNOTATE AGE CONTRAST DEGLIST
read in the DEGlist:
```{r}
degsage <- read.delim("../analyses/15-hisat2-deseq2-summer2022/DEGlist-contrast_age.tab")
head(degsage)
```

set rownames as a colum called gene ID
```{r}
library(tibble)
degsage <- tibble::rownames_to_column(degsage, "geneID")
head(degsage)
```


`join` the lists based on the "geneID"
```{r}
degagecounts <- left_join(degsage, counts2022sub, by = "geneID")
head(degagecounts)
```
`join` the blast and blastq


`join` the two tables based on transcriptID:
```{r}
degsagecountsblast <- left_join(degagecounts, blast, by = "geneID")
head(degsagecountsblast)
```

`join` above with blastq
```{r}
degsagecountsblastq <- left_join(degsagecountsblast, blastq, by = "geneID")
head(degsagecountsblastq)
```

write out table:
```{r}
#write.table(degsagecountsblastq, "../analyses/15-hisat2-deseq2-summer2022/DEGlist-counts_contrastAge_annot.tab", sep = "\t", row.names = F, quote = FALSE, col.names = TRUE)
```
Wrote out 2024-09-17 Comment out code.