---
title: "25-compare-2021-2022"
output: html_document
date: "2024-10-16"
---
Rmd to compare the DEG lists from 2021 Control vs Exposed _P. helianthoides_ to the DEG list from 2022 Control vs Exposed _P. helianthoides_.

I'm not totally sure the best way to approach this. 

The 2021 dataset is just adults, 8 control vs 8 exposed. 

The 2022 dataset is controls: 3 adults, 3 juveniles, and exposed: 3 adults, 3 juveniles. In other words, without taking age into account, 6 control vs 6 exposed. 

Is this a fair comparison? Can/should I do something else to get a better picture? 

Goal is to see if there's any difference in immune response across the two years. 

These experiments are comparable in that the inoculum type was the same, though not from the same sources. The inoculum was blended tissue, centrifuged to pellet out chunks, from a wasting _P. helianthoides_. 

```{r}
library(dplyr)
library(tidyverse)
```

2021 DEG list - get the annotated version:
```{r}
deg21annot <- read.delim("../analyses/23-annotating-deg-lists/DEGlist_2021_exposedVcontrol_annotated.tab")
head(deg21annot)
```
6,938 DEGs


2022 DEG list - get annotated
```{r}
deg22annot <- read.delim("../analyses/23-annotating-deg-lists/DEGlist_2022_exposedVcontrol_annotated.tab")
head(deg22annot)
```
6,237 DEGs 

Get a list of the genes that are unique to the 2021 dataset:
```{r}
unique21 <- anti_join(deg21annot, deg22annot, by = "gene_id")
head(unique21)
```
2,824 DEGs that are in the 2021 list that aren't in the 2022 list. 

write out DEG table unique to 2021: 
```{r}
#write.table(unique21, "../analyses/25-compare-2021-2022/DEGlist_unique_2021.tab", sep = '\t', row.names = F, quote = F)
```
wrote out 2024-10-25.


Get a list of DEGs that are unique to the 2022 dataset: 
```{r}
unique22 <- anti_join(deg22annot, deg21annot, by = "gene_id")
head(unique22)
```
2,123 genes that are in the 2022 list that aren't in the 2021 list 

write out table of DEGs unique to 2022 dataset: 
```{r}
#write.table(unique22, "../analyses/25-compare-2021-2022/DEGlist_unique_2022.tab", sep = '\t', row.names = F, quote = F)
```
wrote out 2024-10-25

use `inner_join` to  get the DEGs that match between the years:
```{r}
deg21_22 <- inner_join(deg21annot, deg22annot, by = "gene_id")
head(deg21_22)
```

4,114 DEGs are the same across both years. 

2024-10-30
lots of unnecessary repeat columns. 
remove V2.y - V14.y and onward:
Entry.y, Reviewed.y, Entry.Name.y, Protein.names.y, Gene.Names.y, Organism.y, Length.y, Gene.Ontology.biological.process.y, Gene.Ontology.IDs.y

In other words, delete columns 64-85
```{r}
df <- deg21_22[ -c(64:85) ]
head(df)
```


Write out table of DEGs that are same between 2021 and 2022 datasets:
```{r}
#write.table(df, "../analyses/25-compare-2021-2022/DEGlist_same_2021-2022.tab", sep = '\t', row.names = F, quote = F)
```

wrote out 2024-10-30


# Get genes that contribute to the top 9 significantly enriched biological processes from DAVID output (Benjamini-Hochberg <0.05). 


read in DAVID output of the top 9 enriched processes:
```{r}
sigen <- read.delim("../analyses/25-compare-2021-2022/")
```
=======
wrote out 2024-10-25

# 2024-10-23
subset gene_id column, uniprot accession id column, and GO ID column for GITHUB ISSUE asking for how to get GO Slim terms:
```{r}
same2122 <- read.delim("../analyses/25-compare-2021-2022/DEGlist_same_2021-2022.tab")
head(same2122)
```

```{r}
library(tidyverse)
library(dplyr)
forgoslim <- select(same2122, "gene_id", "uniprot_accession.x", "Gene.Ontology.IDs.x")
head(forgoslim)
```

rename columns to remove .x from columns 2 and 3:
```{r}
forgoslim <- rename(forgoslim, "uniprot_accession" = "uniprot_accession.x", "Gene.Ontology.IDs" = "Gene.Ontology.IDs.x")
head(forgoslim)
```

write out table:
```{r}
#write.table(forgoslim, "../analyses/25-compare-2021-2022/DEGlist_same_2021-2022_forGOslim.tab", sep = '\t', row.names = F, quote = F)

```
comment out 2024-10-25

# 2024-10-28
Get genes related to the top 9 enriched (Benjamini <0.05) processes from the shared 2021-2022 DEG list DAVID output: https://github.com/grace-ac/paper-pycno-sswd-2021-2022/blob/main/analyses/24-2021-2022-enrichment/2021-2022-same-DAVID.txt

read in DAVID output:
```{r}
david <- read.delim("../analyses/24-2021-2022-enrichment/2021-2022-same-DAVID.txt")
head(david)
```

rename the "Genes" column "uniprot_accession"
```{r}
colnames(david) <- c("Category", "Term", "Count", "%", "PValue", "uniprot_accession", "List Total", "Pop Hits", "Pop Total", "Fold Enrichment", "Bonferroni", "Benjamini", "FDR")
head(david)
```

subset the top 9 enriched processes:
```{r}
top9 <- filter(david, Benjamini < 0.05) %>% top_n(9, wt = Benjamini)
head(top9)
```

write out table:
```{r}
#write.table(top9, "../analyses/25-compare-2021-2022/2021-2022-same-DAVID-top9_enriched.tab", sep = '\t', row.names = F, quote = F)

```

wrote out 2024-10-30

Now get list of DEGs that have the uniprot accession IDs from that short 9 term list: 

https://chatgpt.com/share/671c1bb2-c178-8013-aa40-fe7f974859ac --> use R option
https://github.com/RobertsLab/resources/issues/1996 


```{r}
top9 <- read.delim("../analyses/25-compare-2021-2022/2021-2022-same-DAVID-top9_enriched.tab")
head(top9)
```
```{r}
# Load required libraries
library(tidyverse)


# Separate the values by commas into rows
data_long <- top9 %>%
  separate_rows(uniprot_accession, sep = ", ")

# View the output
print(data_long)
```
pull out columns: term + uniprot_accession

```{r}
data_long_sub <- select(data_long, "Term", "uniprot_accession")
head(data_long_sub)
```


join with deg list 4,114 degs
read in deg list:
```{r}
shared2122 <- read.delim("../analyses/25-compare-2021-2022/DEGlist_same_2021-2022.tab")
head(shared2122)
```
rename uniprot_accession.x to uniprot_accession

```{r}
names(shared2122)[names(shared2122) == 'uniprot_accession.x'] <- 'uniprot_accession'
head(shared2122)
```
yay

`join` shared2122 with data_long_sub
use `left_join` with shared2122 first because may be multiple genes per uniprot_accession 
```{r}
top9genelist <- left_join(data_long_sub, shared2122, by = "uniprot_accession")
head(top9genelist)
```
YAY `

488 ROW - makes sense because some uniprot_accession IDs can have multiple genes 


wrrite out table: 
```{r}
#write.table(top9genelist, "../analyses/25-compare-2021-2022/DEGlist_enrichedprocesses_2021-2022.tab", sep = '\t',  row.names = F, quote = F)
```

wrote out 2024-10-30