---
title: "04.2-miRNA-comparison-targets-FE"
author: "Kathleen Durkin"
date: "2025-07-10"
output: 
  github_document:
    toc: true
    number_sections: true
  bookdown::html_document2:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
  html_document:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
editor_options: 
  markdown: 
    wrap: 72
---

Load packages
```{r}
library(dplyr)
library(tidyr)
library(ggplot2)
library(stringr)
library(purrr)
library(GSEABase)
library(GO.db)
library(knitr)
library(tidyverse)
```


Hypothesize that conserved miRNA present in all three species will also have targets enriched for the same processes across species.

# Functionally enriched processes

Load FE tables:

```{r}
# Apul
Apul_FE <- read.csv("../../D-Apul/output/09.13-Apul-mRNA-miRNA-interactions-FE-pooled/miRNA_pooled_sig_cor_targets_topGO_FE.csv") %>% dplyr::select(-X)

# Peve
Peve_FE <- read.csv("../../E-Peve/output/10.14-Peve-mRNA-miRNA-interactions-FE-pooled/miRNA_pooled_sig_cor_targets_topGO_FE.csv") %>% dplyr::select(-X)

# Ptuh
Ptuh_FE <- read.csv("../../F-Ptuh/output/11.14-Ptuh-mRNA-miRNA-interactions-FE-pooled/miRNA_pooled_sig_cor_targets_topGO_FE.csv") %>% dplyr::select(-X)
```

Load in assigned miRNA names
```{r}
Apul_names <- read.csv("../../D-Apul/output/11-Apul-sRNA-ShortStack_4.1.0-pulchra_genome/ShortStack_out/Apul_Results_mature_named_miRNAs.csv") %>% dplyr::select(Name, given_miRNA_name)

Peve_names <- read.csv("../../E-Peve/output/05-Peve-sRNA-ShortStack_4.1.0/ShortStack_out/Peve_Results_mature_named_miRNAs.csv") %>% dplyr::select(Name, given_miRNA_name)

Ptuh_names <- read.csv("../../F-Ptuh/output/05-Ptuh-sRNA-ShortStack_4.1.0/ShortStack_out/Ptuh_Results_mature_named_miRNAs.csv") %>% dplyr::select(Name, given_miRNA_name)
```

Annotate miRNA dfs with given names
```{r}
Apul_FE_df <- left_join(Apul_FE, Apul_names, by = c("miRNA" = "Name"))
Peve_FE_df <- left_join(Peve_FE, Peve_names, by = c("miRNA" = "Name"))
Ptuh_FE_df <- left_join(Ptuh_FE, Ptuh_names, by = c("miRNA" = "Name"))
```

Separate conserved miRNA (present in all 3 species) from the rest. In `04-miRNA-comparison`, identified the miRNA conserved among all 3 species to be: miR-100, miR-2023, miR-2025, and miR-2036
```{r}
Apul_FE_conserved <- Apul_FE_df %>% filter(str_detect(given_miRNA_name, "mir-100|mir-2023|mir-2025|mir-2036"))
Apul_FE_unconserved <- Apul_FE_df %>% filter(!str_detect(given_miRNA_name, "mir-100|mir-2023|mir-2025|mir-2036"))

Peve_FE_conserved <- Peve_FE_df %>% filter(str_detect(given_miRNA_name, "mir-100|mir-2023|mir-2025|mir-2036"))
Peve_FE_unconserved <- Peve_FE_df %>% filter(!str_detect(given_miRNA_name, "mir-100|mir-2023|mir-2025|mir-2036"))

Ptuh_FE_conserved <- Ptuh_FE_df %>% filter(str_detect(given_miRNA_name, "mir-100|mir-2023|mir-2025|mir-2036"))
Ptuh_FE_unconserved <- Ptuh_FE_df %>% filter(!str_detect(given_miRNA_name, "mir-100|mir-2023|mir-2025|mir-2036"))

# Also annotate the full dfs
Apul_FE_df$conservation <- ifelse(Apul_FE_df$given_miRNA_name %in% Apul_FE_conserved$given_miRNA_name, "conserved", "unconserved")
Peve_FE_df$conservation <- ifelse(Peve_FE_df$given_miRNA_name %in% Peve_FE_conserved$given_miRNA_name, "conserved", "unconserved")
Ptuh_FE_df$conservation <- ifelse(Ptuh_FE_df$given_miRNA_name %in% Ptuh_FE_conserved$given_miRNA_name, "conserved", "unconserved")
```

## Conserved miRNA 

Take a look
```{r}
print(Apul_FE_conserved)
print(Peve_FE_conserved)
print(Ptuh_FE_conserved)
```

### miR-100

The only miRNA with targets that are enriched for a given function across all three species is miR-100:

- In Apul, miR-100 targets are functionally enriched for terms related to fundamental cellular processes like **lipid transport/metabolism and gene expression regulation**

- In Peve, miR-100 targets are also enriched for fundamental terms, but of a different theme, including **immune signaling, developmental processes, and DNA/protein modifications**

- In Ptuh, miR-100 targets are only enriched for a single Molecular Function, **calcium ion binding**. This is a fundamental molecular function involved in signaling, and would be important to the processes enriched in both Apul and Peve. It is also an enriched term in Peve. 

Note: Our ability to describe the functional role of miRNA in Ptuh will likely be particularly limited, since our functional annotation of the Ptuh genome is the least complete of our three species (A.millepora genome: 99% of genes functionally annotated; P.evermanni genome: 72%; P.meandrina genome: 60%)

### miR-2025

miR-2025 only has targets with enriched functions in two species, A.pulchra and P.evermanni. 

In both, miR-2025 targets are enriched for child terms of **oxidoreductase activity**, suggesting potential role in **redox balance or metabolism**.


Let's also look at the annotations themselves, regardless of whether the processes were statistically overrepresented. This will give more terms to look at for the less-well-annotated P.tuahiniensis.

# Functionally annotated

Load and format FA tables:

```{r}
# Apul
Apul_FA <- read.csv("../../D-Apul/output/09.13-Apul-mRNA-miRNA-interactions-FE-pooled/miRNA_pooled_targets_FA.csv") %>% dplyr::select(-X)

# Peve
Peve_FA <- read.csv("../../E-Peve/output/10.14-Peve-mRNA-miRNA-interactions-FE-pooled/miRNA_pooled_targets_FA.csv") %>% dplyr::select(-X)

# Ptuh
Ptuh_FA <- read.csv("../../F-Ptuh/output/11.14-Ptuh-mRNA-miRNA-interactions-FE-pooled/miRNA_pooled_targets_FA.csv") %>% dplyr::select(-X)
```

## GoSlim summary terms

Code from [Roberts Lab Handbook](https://robertslab.github.io/resources/bio-Annotation/#map-go-ids-to-goslims)

The expected input file has at least two columns. One each with:

- gene ID

- Gene Ontology (GO) ID

NOTE: The GO IDs in the GO ID column should be separated with a semi-colon.

The basic output from this process will be:

- GOslim IDs (as rownames)

- GOslim terms

Variables
```{r, eval=TRUE}
# Column names corresponding to gene name/ID and GO IDs
GO.ID.column <- "Gene.Ontology.IDs"
gene.ID.column <- "mRNA"

# Relative path or URL to input file
#input.file <- "https://raw.githubusercontent.com/grace-ac/paper-pycno-sswd-2021-2022/d1cdf13c36085868df4ef4b75d2b7de03ef08d1c/analyses/25-compare-2021-2022/DEGlist_same_2021-2022_forGOslim.tab"


##### Official GO info - no need to change #####
goslims_obo <- "goslim_generic.obo"
goslims_url <- "http://current.geneontology.org/ontology/subsets/goslim_generic.obo"
```

Set GSEAbase location and download goslim_generic.obo
```{r download-generic-goslim-obo, eval=TRUE}
# Find GSEAbase installation location
gseabase_location <- find.package("GSEABase")

# Load path to GOslim OBO file
goslim_obo_destintation <- file.path(gseabase_location, "extdata", goslims_obo, fsep = "/")

# Download the GOslim OBO file
download.file(url = goslims_url, destfile = goslim_obo_destintation)

# Loads package files
gseabase_files <- system.file("extdata", goslims_obo, package="GSEABase")
```

Read in gene/GO file
```{r read-in-gene-file, eval=TRUE}
#full.gene.df <- read.csv(file = input.file, header = TRUE, sep = "\t")
full.gene.df <- Apul_FA
```

Remove rows with NA, remove whitespace in GO IDs column and keep just gene/GO IDs columns
```{r remove-NA-and-uniprotIDs, eval=TRUE}

# Clean whitespace, filter NA/empty rows, select columns, and split GO terms using column name variables
gene.GO.df <- full.gene.df %>%
  mutate(!!GO.ID.column := str_replace_all(.data[[GO.ID.column]], "\\s*;\\s*", ";")) %>% # Clean up spaces around ";"
  filter(!is.na(.data[[gene.ID.column]]) & !is.na(.data[[GO.ID.column]]) & .data[[GO.ID.column]] != "") %>% 
  dplyr::select(all_of(c(gene.ID.column, GO.ID.column)))
```

This flattens the file so all of the GO IDs per gene are separated into one GO ID per gene per row.
```{r flatten-gene-and-GO-IDs, eval=TRUE}
flat.gene.GO.df <- gene.GO.df %>% separate_rows(!!sym(GO.ID.column), sep = ";")
```

Groups the genes by GO ID (i.e. lists all genes associated with each unique GO ID)
```{r group-by-GO, eval=TRUE}
grouped.gene.GO.df <- flat.gene.GO.df %>%
  group_by(!!sym(GO.ID.column)) %>%
  summarise(!!gene.ID.column := paste(.data[[gene.ID.column]], collapse = ","))
```

Map GO IDs to GOslims
```{r vectorize-GOIDs, eval=TRUE}
# Vector of GO IDs
go_ids <- grouped.gene.GO.df[[GO.ID.column]]
```

Creates new OBO Collection object of just GOslims, based on provided GO IDs.
```{r extract-GOslims-from-OBO, eval=TRUE}

# Create GSEAbase GOCollection using `go_ids`
myCollection <- GOCollection(go_ids)

# Retrieve GOslims from GO OBO file set
slim <- getOBOCollection(gseabase_files)
```

Get Biological Process (BP) GOslims associated with provided GO IDs.
```{r retrieve-BP-GOslims, eval=TRUE}
# Retrieve Biological Process (BP) GOslims
slimdf <- goSlim(myCollection, slim, "BP", verbose)
```

Performs mapping of of GOIDs to GOslims

Returns:

GOslim IDs (as rownames)
GOslim terms
Counts of GO IDs matching to corresponding GOslim
Percentage of GO IDs matching to corresponding GOslim
GOIDs mapped to corresponding GOslim, in a semi-colon delimited format

```{r map-GO-to-GOslims, eval=TRUE}
# List of GOslims and all GO IDs from `go_ids`
gomap <- as.list(GOBPOFFSPRING[rownames(slimdf)])

# Maps `go_ids` to matching GOslims
mapped <- lapply(gomap, intersect, ids(myCollection))

# Append all mapped GO IDs to `slimdf`
# `sapply` needed to apply paste() to create semi-colon delimited values
slimdf$GO.IDs <- sapply(lapply(gomap, intersect, ids(myCollection)), paste, collapse=";")

# Remove "character(0) string from "GO.IDs" column
slimdf$GO.IDs[slimdf$GO.IDs == "character(0)"] <- ""

# Add self-matching GOIDs to "GO.IDs" column, if not present
for (go_id in go_ids) {
  # Check if the go_id is present in the row names
  if (go_id %in% rownames(slimdf)) {
    # Check if the go_id is not present in the GO.IDs column
    # Also removes white space "trimws()" and converts all to upper case to handle
    # any weird, "invisible" formatting issues.
    if (!go_id %in% trimws(toupper(strsplit(slimdf[go_id, "GO.IDs"], ";")[[1]]))) {
      # Append the go_id to the GO.IDs column with a semi-colon separator
      if (length(slimdf$GO.IDs) > 0 && nchar(slimdf$GO.IDs[nrow(slimdf)]) > 0) {
        slimdf[go_id, "GO.IDs"] <- paste0(slimdf[go_id, "GO.IDs"], "; ", go_id)
      } else {
        slimdf[go_id, "GO.IDs"] <- go_id
      }
    }
  }
}
```

"Flatten" file so each row is single GO ID with corresponding GOslim rownames_to_column needed to retain row name info
```{r flatten-GOslims-file, eval=TRUE}
# "Flatten" file so each row is single GO ID with corresponding GOslim
# rownames_to_column needed to retain row name info
slimdf_separated <- as.data.frame(slimdf %>%
  rownames_to_column('GOslim') %>%
  separate_rows(GO.IDs, sep = ";"))

# Group by unique GO ID
grouped_slimdf <- slimdf_separated %>%
  filter(!is.na(GO.IDs) & GO.IDs != "") %>%
  group_by(GO.IDs) %>%
  summarize(GOslim = paste(GOslim, collapse = ";"),
            Term = paste(Term, collapse = ";"))
```

Sorts GOslims by Count, in descending order and then selects just the Term and Count columns.
```{r sort-and-select-slimdf-counts, eval=TRUE}

slimdf.sorted <- slimdf %>% arrange(desc(Count))

slim.count.df <- slimdf.sorted %>% 
  dplyr::select(Term, Count)
```


Annotate miRNA dfs with given names and GOSlim terms
```{r}
Apul_FA_df <- left_join(Apul_FA, Apul_names, by = c("miRNA" = "Name"))
Peve_FA_df <- left_join(Peve_FA, Peve_names, by = c("miRNA" = "Name"))
Ptuh_FA_df <- left_join(Ptuh_FA, Ptuh_names, by = c("miRNA" = "Name"))

# function to map column of semicolon-delimited GO IDs to my GOSlim terms
IDs_to_slims <- function(df, mapping_df){
  # Step 1: Separate Gene.Ontology.IDs into rows
  df_long <- df %>%
    separate_rows(Gene.Ontology.IDs, sep = ";") %>%
    mutate(Gene.Ontology.IDs = str_trim(Gene.Ontology.IDs))  # remove any extra spaces
  
  # Step 2: Join with mapping_df to get GO Slim terms
  df_mapped <- df_long %>%
    left_join(mapping_df, by = c("Gene.Ontology.IDs" = "GO.IDs"))
  
  # Step 3: Collapse GO Slim terms back into semicolon-delimited format per GI_ID
  df_result <- df_mapped %>%
    group_by(mRNA) %>%
    summarise(
      Gene.Ontology.IDs = paste(Gene.Ontology.IDs, collapse = ";"),
      GO_Slim_Terms = paste(unique(Term[!is.na(Term)]), collapse = ";")
    )
}

# Obtain GOSlim summaries
Apul_FA_slims <- IDs_to_slims(Apul_FA_df, slimdf_separated) %>% dplyr::select(mRNA, GO_Slim_Terms)
Peve_FA_slims <- IDs_to_slims(Peve_FA_df, slimdf_separated) %>% dplyr::select(mRNA, GO_Slim_Terms)
Ptuh_FA_slims <- IDs_to_slims(Ptuh_FA_df, slimdf_separated) %>% dplyr::select(mRNA, GO_Slim_Terms)

# Annotate FA dfs with GOSlim terms
Apul_FA_df <- left_join(Apul_FA_df, Apul_FA_slims, by = "mRNA")
Peve_FA_df <- left_join(Peve_FA_df, Peve_FA_slims, by = "mRNA")
Ptuh_FA_df <- left_join(Ptuh_FA_df, Ptuh_FA_slims, by = "mRNA")
```

Separate based on correlation significance
```{r}
Apul_FA_unsig <- Apul_FA_df %>% filter(p_value >= 0.05)
Peve_FA_unsig <- Peve_FA_df  %>% filter(p_value >= 0.05)
Ptuh_FA_unsig <- Ptuh_FA_df  %>% filter(p_value >= 0.05)

Apul_FA_df <- Apul_FA_df %>% filter(p_value < 0.05)
Peve_FA_df<- Peve_FA_df  %>% filter(p_value < 0.05)
Ptuh_FA_df <- Ptuh_FA_df  %>% filter(p_value < 0.05)
```


Separate conserved miRNA (present in all 3 species) from the rest. In `04-miRNA-comparison`, identified the miRNA conserved among all 3 species to be: miR-100, miR-2023, miR-2025, and miR-2036
```{r}
Apul_FA_conserved <- Apul_FA_df %>% filter(str_detect(given_miRNA_name, "mir-100|mir-2023|mir-2025|mir-2036"))
Apul_FA_unconserved <- Apul_FA_df %>% filter(!str_detect(given_miRNA_name, "mir-100|mir-2023|mir-2025|mir-2036"))

Peve_FA_conserved <- Peve_FA_df %>% filter(str_detect(given_miRNA_name, "mir-100|mir-2023|mir-2025|mir-2036"))
Peve_FA_unconserved <- Peve_FA_df %>% filter(!str_detect(given_miRNA_name, "mir-100|mir-2023|mir-2025|mir-2036"))

Ptuh_FA_conserved <- Ptuh_FA_df %>% filter(str_detect(given_miRNA_name, "mir-100|mir-2023|mir-2025|mir-2036"))
Ptuh_FA_unconserved <- Ptuh_FA_df %>% filter(!str_detect(given_miRNA_name, "mir-100|mir-2023|mir-2025|mir-2036"))

# Also annotate the full dfs
Apul_FA_df$conservation <- ifelse(Apul_FA_df$given_miRNA_name %in% Apul_FA_conserved$given_miRNA_name, "conserved", "unconserved")
Peve_FA_df$conservation <- ifelse(Peve_FA_df$given_miRNA_name %in% Peve_FA_conserved$given_miRNA_name, "conserved", "unconserved")
Ptuh_FA_df$conservation <- ifelse(Ptuh_FA_df$given_miRNA_name %in% Ptuh_FA_conserved$given_miRNA_name, "conserved", "unconserved")
```

## Overlapping GOSlim terms

Take a look
```{r}
unique(Apul_FA_conserved %>% filter(!is.na(GO_Slim_Terms), GO_Slim_Terms != "") %>% dplyr::select(given_miRNA_name))
unique(Peve_FA_conserved %>% filter(!is.na(GO_Slim_Terms), GO_Slim_Terms != "") %>% dplyr::select(given_miRNA_name))
unique(Ptuh_FA_conserved %>% filter(!is.na(GO_Slim_Terms), GO_Slim_Terms != "") %>% dplyr::select(given_miRNA_name))


```

Only miR-100 and miR-2023 have targets that are annotated and can be summarized to GOSlims in all three species. miR-2025 has target GOSlims in Apul and Peve, but miR-2036 targets can only be summarized to GOSlims in Ptuh.

For a conserved miRNA, which processes are targeted in all species?
```{r}
# Apul
# Filter out NA biological processes
Apul_FA_conserved_filt <- Apul_FA_conserved %>%
  filter(!is.na(GO_Slim_Terms))

# Loop through each unique miRNA and calculate term counts
Apul_BP_counts <- Apul_FA_conserved_filt %>%
  group_split(given_miRNA_name) %>%
  set_names(map_chr(., ~ unique(.$given_miRNA_name))) %>%
  map(~ .x %>%
        separate_rows(GO_Slim_Terms, sep = ";\\s*") %>%
        mutate(GO_Slim_Terms = str_trim(GO_Slim_Terms)) %>%
        count(GO_Slim_Terms, sort = TRUE)
      )


# Peve
# Filter out NA biological processes
Peve_FA_conserved_filt <- Peve_FA_conserved %>%
  filter(!is.na(GO_Slim_Terms))

# Loop through each unique miRNA and calculate term counts
Peve_BP_counts <- Peve_FA_conserved_filt %>%
  group_split(given_miRNA_name) %>%
  set_names(map_chr(., ~ unique(.$given_miRNA_name))) %>%
  map(~ .x %>%
        separate_rows(GO_Slim_Terms, sep = ";\\s*") %>%
        mutate(GO_Slim_Terms = str_trim(GO_Slim_Terms)) %>%
        count(GO_Slim_Terms, sort = TRUE)
      )


# Ptuh
# Filter out NA biological processes
Ptuh_FA_conserved_filt <- Ptuh_FA_conserved %>%
  filter(!is.na(GO_Slim_Terms))

# Loop through each unique miRNA and calculate term counts
Ptuh_BP_counts <- Ptuh_FA_conserved_filt %>%
  group_split(given_miRNA_name) %>%
  set_names(map_chr(., ~ unique(.$given_miRNA_name))) %>%
  map(~ .x %>%
        separate_rows(GO_Slim_Terms, sep = ";\\s*") %>%
        mutate(GO_Slim_Terms = str_trim(GO_Slim_Terms)) %>%
        count(GO_Slim_Terms, sort = TRUE)
      )
```

### miR-100

```{r}
# miR-100 (annotations in all 3)
cat("miR-100 annotations present in all 3 species:", "\n")
Reduce(intersect, list(Apul_BP_counts$`apul-mir-100`$GO_Slim_Terms,
                       Peve_BP_counts$`peve-mir-100`$GO_Slim_Terms,
                       Ptuh_BP_counts$`ptuh-mir-100`$GO_Slim_Terms))

# miR-2023 (annotations in all 3)
cat("\n", "miR-2023 annotations present in all 3 species:", "\n")
Reduce(intersect, list(Apul_BP_counts$`apul-mir-2023`$GO_Slim_Terms,
                       Peve_BP_counts$`peve-mir-2023`$GO_Slim_Terms,
                       Ptuh_BP_counts$`ptuh-mir-2023`$GO_Slim_Terms))

# miR-2025 (annotations in Apul and Peve)
cat("\n", "miR-2025 annotations present in all A.pulchra and P.evermanni:", "\n")
Reduce(intersect, list(Apul_BP_counts$`apul-mir-2025`$GO_Slim_Terms,
                       Peve_BP_counts$`peve-mir-2025`$GO_Slim_Terms))

# miR-2036 (P.tuahiniensis)
cat("\n", "miR-2036 annotations present in P.tuahiniensis:", "\n")
Reduce(intersect, list(Ptuh_BP_counts$`ptuh-mir-2036`$GO_Slim_Terms))
```

miR-100 is functionally conserved across the three species for functions related to **cellular organization, transport and signaling, immune response, gene regulation, cell cycle, and development**. These GOSlim terms agree with functions that are functionally enriched in miR-100 targets across the three species.

The other conserved miRNA don't have enough annotated targets in all species to determine whether functional conservation exists. 

Also keep in mind that some important functions (e.g. reproduction, development, immune response) make not appear in this dataset, since deep-dive samples were collected only at a single site and timepoint, and are limited in number.

## Check for ncRNA machinery

load table of genes annotated for ncRNA/methylation machinery
```{r}
prot_df <- read.csv("../data/e5_deep-dive_ncRNA_proteins.csv")

# Remove -T* from Apul protein names 
prot_df <- prot_df %>%
  mutate(Protein_ID = if_else(
    Species == "Apul",
    sub("-T\\d+$", "", Protein_ID),
    Protein_ID
  ))
```

### All putatively binding targets
```{r}
# Check Jill's table of epigenetic machinery genes
left_join(prot_df[prot_df$species == "Apul",], Apul_FA_unsig, by = c("Protein_ID" = "mRNA")) %>% nrow()
left_join(prot_df[prot_df$species == "Peve",], Peve_FA_unsig, by = c("Protein_ID" = "mRNA")) %>% nrow()
left_join(prot_df[prot_df$species == "Ptuh",], Ptuh_FA_unsig, by = c("Protein_ID" = "mRNA")) %>% nrow()

# Manually search for matching words

## function to loop across a list of search terms
count_word_hits <- function(df, column, words) {
  # Ensure relevant columns are character
  text_column <- as.character(df[[column]])
  miRNA_column <- as.character(df$given_miRNA_name)

  # Loop through words and collect counts + miRNAs
  results <- lapply(words, function(word) {
    matches <- grepl(word, text_column, ignore.case = TRUE)
    n_hits <- sum(matches)
    
    if (n_hits > 0) {
      miRNAs <- unique(miRNA_column[matches])
      miRNAs_combined <- paste(miRNAs, collapse = "; ")
    } else {
      miRNAs_combined <- NA
    }
    
    data.frame(
      word = word,
      hits = n_hits,
      associated_miRNAs = miRNAs_combined,
      stringsAsFactors = FALSE
    )
  })
  
  # Combine into one data frame
  result_df <- do.call(rbind, results)
  return(result_df)
}


# Make a list of proteins of interest (DNMT1, Dicer, etc.) from Jill's table 
word_list <- unique(prot_df$Protein_of_interest)

# Search for these proteins of interested in our annotated miRNA targets (correlation significance not considered here)
count_word_hits(Apul_FA_unsig, "Protein.names", word_list)
count_word_hits(Peve_FA_unsig, "Protein.names", word_list)
count_word_hits(Ptuh_FA_unsig, "Protein.names", word_list)
```

Wow! So there are actually quite a few miRNA that putatively bind to the genes that produce epigenetic protein machinery, in all species. All species have miRNA that bind to genes encoding proteins related to AGO2, Dicer, DNMT1, and Piwi. 

Note, however, that this is just a simple, grep-based alphanumeric search for protein names. The above counts may inlude genes that are associated with the search term, but do not actuallye encode that protein (e.g., "DNA methyltransferase 1-associated protein 1 (DNMAP1) (DNMT1-associated protein 1)", instead of "DNA (cytosine-5)-methyltransferase 1 (Dnmt1) (EC 2.1.1.37) (DNA methyltransferase GgaI) (DNA MTase GgaI) (M.GgaI) (MCMT)")

### Significantly correlated targets
```{r}
left_join(prot_df[prot_df$species == "Apul",], Apul_FA_df, by = c("Protein_ID" = "mRNA")) %>% nrow()
left_join(prot_df[prot_df$species == "Peve",], Peve_FA_df, by = c("Protein_ID" = "mRNA")) %>% nrow()
left_join(prot_df[prot_df$species == "Ptuh",], Ptuh_FA_df, by = c("Protein_ID" = "mRNA")) %>% nrow()

# Search for these proteins of interested in our annotated miRNA targets (only those with significantly correlated expression)
count_word_hits(Apul_FA_df, "Protein.names", word_list)
count_word_hits(Peve_FA_df, "Protein.names", word_list)
count_word_hits(Ptuh_FA_df, "Protein.names", word_list)
```
There are only 2 miRNA that target *and* are significantly coexpressed with a protein(s) that's involved in epigenetic regulation: peve-mir-novel-7 and ptuh-mir-novel-6.

#### peve-mir-novel-7

This miRNA targets and is significantly negatively coexpressed with DNMT1, the DNMT responsible for maintaining DNA methylation marks. 

This miRNA is one of the few that is almost exclusively *negatively* correlated with its many targets, which provides strong support for its role as a valid miRNA acting through translational repression or mRNA degradation. It is also negatively correlated wiht the DNMT1 gene it putatively targets, which supports its role as a mediator of DNMT1 expression, and thus potentially a mediator of DNA methylation.

peve-mir-novel-7 is also partially conserved, matching the P.tuahiniensis miRNA ptuh-mir-novel-7 (see https://github.com/urol-e5/deep-dive-expression/blob/main/M-multi-species/code/04-miRNA-comparison.md#421-apul-and-peve).

We can also check to see which terms are functionally expressed in the miRNA's targets:
```{r}
Peve_FE_df %>% filter(given_miRNA_name == "peve-mir-novel-7")
Peve_FA_df %>% filter(given_miRNA_name == "peve-mir-novel-7", !is.na(Gene.Ontology.IDs))
```

#### ptuh-mir-novel-6

This miRNA targets and is significantly negatively coexpressed with AGO2 ("Protein argonaute-2 (Argonaute2)") (manual check of the hit for the word "Piwi" shows that "piwi" is actually just included in the description for the AGO2 protein that ptuh-mir-novel-6 targets). AGO2 is involved in in RISC, binding to sRNA and guiding them to complementary targets. 

ptuh-mir-novel-6 is one of the few miRNA that is almost exclusively *negatively* correlated with its many targets, which provides strong support for its role as a valid miRNA acting through translational repression or mRNA degradation. However, it is *positively* correlated with the AGO2 gene it putatively binds to, which may indicate this is a case of indirect co-regulation.

ptuh-mir-novel-6 is not conserved with any other study species.

We can also check to see which terms are functionally expressed in the miRNA's targets:
```{r}
Peve_FE_df %>% filter(given_miRNA_name == "peve-mir-novel-7")
Peve_FA_df %>% filter(given_miRNA_name == "peve-mir-novel-7", !is.na(Gene.Ontology.IDs))
```

# Plotting

```{r}
# Apul
# Count number of unique mRNA targets per miRNA
Apul_target_counts <- Apul_FA_df %>%
  group_by(given_miRNA_name) %>%
  summarise(num_targets = n_distinct(mRNA)) %>%
  arrange(desc(num_targets))

# Plot
ggplot(Apul_target_counts, aes(x = reorder(given_miRNA_name, -num_targets), y = num_targets)) +
  geom_col(fill = "#408EC6") +
  labs(x = "miRNA", y = "Number of unique mRNA targets", title = "Apul: Target Count per miRNA") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Peve
# Count number of unique mRNA targets per miRNA
Peve_target_counts <- Peve_FA_df %>%
  group_by(given_miRNA_name) %>%
  summarise(num_targets = n_distinct(mRNA)) %>%
  arrange(desc(num_targets))

# Plot
ggplot(Peve_target_counts, aes(x = reorder(given_miRNA_name, -num_targets), y = num_targets)) +
  geom_col(fill = "#1E2761") +
  labs(x = "miRNA", y = "Number of unique mRNA targets", title = "Peve: Target Count per miRNA") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Ptuh
# Count number of unique mRNA targets per miRNA
Ptuh_target_counts <- Ptuh_FA_df %>%
  group_by(given_miRNA_name) %>%
  summarise(num_targets = n_distinct(mRNA)) %>%
  arrange(desc(num_targets))

# Plot
ggplot(Ptuh_target_counts, aes(x = reorder(given_miRNA_name, -num_targets), y = num_targets)) +
  geom_col(fill = "#7A2048") +
  labs(x = "miRNA", y = "Number of unique mRNA targets", title = "Ptuh: Target Count per miRNA") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

```


```{r}
stacked_opacity_bar <- function(df, species_color) {

  # Label correlation direction
  df_labeled <- df %>%
    mutate(
      correlation_direction = case_when(
        PCC.cor > 0 ~ "positive",
        PCC.cor < 0 ~ "negative",
        TRUE ~ "zero"
      )
    )

  # Count number of unique mRNA targets per miRNA, correlation, and region
  plot_data <- df_labeled %>%
    group_by(given_miRNA_name, correlation_direction, region) %>%
    summarise(n_mRNA = n_distinct(mRNA), .groups = "drop") %>%
    # Step 2.5: Flip sign for negative correlations
    mutate(n_mRNA = ifelse(correlation_direction == "negative", -n_mRNA, n_mRNA))

  # Alphabetical ordering
  plot_data <- plot_data %>%
    mutate(given_miRNA_name = factor(given_miRNA_name, levels = sort(unique(given_miRNA_name))))

  # Plot
  ggplot(plot_data, aes(x = given_miRNA_name, y = n_mRNA)) +
    geom_col(aes(fill = correlation_direction, alpha = region), position = "stack") +
    scale_fill_manual(values = c(positive = species_color, negative = "#d7191c")) +
    scale_alpha_manual(values = c("3UTR" = 1, "5UTR" = 0.6, "CDS" = 0.3)) +
    theme_minimal() +
    labs(
      title = "miRNA-mRNA Interactions by Correlation and Region",
      x = "miRNA (alphabetical)",
      y = "Number of Predicted mRNA Targets",
      fill = "Correlation",
      alpha = "Region"
    ) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}

stacked_opacity_bar(Apul_FA_df, "#408EC6")
stacked_opacity_bar(Peve_FA_df, "#1E2761")
stacked_opacity_bar(Ptuh_FA_df, "#7A2048")

stacked_opacity_bar(Apul_FA_conserved, "#408EC6")
stacked_opacity_bar(Peve_FA_conserved, "#1E2761")
stacked_opacity_bar(Ptuh_FA_conserved, "#7A2048")

# Without the miR-100s, since their high expression levels dominate the plots
stacked_opacity_bar(Apul_FA_conserved[!grepl("mir-100",Apul_FA_conserved$given_miRNA_name),], "#408EC6")
stacked_opacity_bar(Peve_FA_conserved[!grepl("mir-100",Peve_FA_conserved$given_miRNA_name),], "#1E2761")
stacked_opacity_bar(Ptuh_FA_conserved[!grepl("mir-100",Ptuh_FA_conserved$given_miRNA_name),], "#7A2048")

```

miR-100 primarily targets via the CDS region. Interestingly, while miR-100 is primarily negatively coexpressed with its targets in A.pulchra and P.tuahiniensis, it is primarily *positively* coexpressed with its targets in P.evermanni. This suggest that, miR-100 primarily silences its targets in A. pulchra and P.tuahiniensis, through translational repression or mRNA degradation, but *upregulates* its targets in P.evermanni.

```{r}
region_pie <- function(df, species_color){

# Manually tabulate counts per region with base R
region_table <- table(df$region)

# Create 3 colors with varying opacity (alpha)
fill_colors <- scales::alpha(species_color, c(1, 0.6, 0.3))  
names(fill_colors) <- c("3UTR", "5UTR", "CDS")

# Convert to data frame
region_df <- as.data.frame(region_table)
names(region_df) <- c("region", "count")

# Calculate percent and label
region_df$percent <- region_df$count / sum(region_df$count) * 100
region_df$label <- paste0(region_df$region, " (", round(region_df$percent, 1), "%)")

# Plot pie chart
ggplot(region_df, aes(x = "", y = count, fill = region)) +
  geom_col(color = "white", width = 1) +
  coord_polar(theta = "y") +
  theme_void() +
  geom_text(aes(label = label), position = position_stack(vjust = 0.5)) +
  labs(title = "Proportion of Total Interactions by Region") +
  scale_fill_manual(values = fill_colors)
}

region_pie(Apul_FA_df, "#408EC6")
region_pie(Peve_FA_df, "#1E2761")
region_pie(Ptuh_FA_df, "#7A2048")

region_pie(Apul_FA_conserved, "#408EC6")
region_pie(Peve_FA_conserved, "#1E2761")
region_pie(Ptuh_FA_conserved, "#7A2048")

region_pie(Apul_FA_df[sign(Apul_FA_df$PCC.cor) > 0,], "#408EC6")
region_pie(Peve_FA_df[sign(Peve_FA_df$PCC.cor) > 0,], "#1E2761")
region_pie(Ptuh_FA_df[sign(Ptuh_FA_df$PCC.cor) > 0,], "#7A2048")

```


# Which miRNA targets have multiple binding sites?

```{r}
Apul_binding_sum <- Apul_FA_df %>%
  group_by(given_miRNA_name, mRNA) %>%
  summarise(
    total_binding_sites = n(),
    sites_3UTR = sum(region == "3UTR"),
    sites_5UTR = sum(region == "5UTR"),
    sites_CDS = sum(region == "CDS"),
    PCC.cor = dplyr::first(PCC.cor),
    Protein.names = dplyr::first(Protein.names),
    .groups = "drop"
  )

Peve_binding_sum <- Peve_FA_df %>%
  group_by(given_miRNA_name, mRNA) %>%
  summarise(
    total_binding_sites = n(),
    sites_3UTR = sum(region == "3UTR"),
    sites_5UTR = sum(region == "5UTR"),
    sites_CDS = sum(region == "CDS"),
    PCC.cor = dplyr::first(PCC.cor),
    Protein.names = dplyr::first(Protein.names),
    .groups = "drop"
  )

Ptuh_binding_sum <- Ptuh_FA_df %>%
  group_by(given_miRNA_name, mRNA) %>%
  summarise(
    total_binding_sites = n(),
    sites_3UTR = sum(region == "3UTR"),
    sites_5UTR = sum(region == "5UTR"),
    sites_CDS = sum(region == "CDS"),
    PCC.cor = dplyr::first(PCC.cor),
    Protein.names = dplyr::first(Protein.names),
    .groups = "drop"
  )
```

```{r}
range(Apul_binding_sum$total_binding_sites)
mean(Apul_binding_sum$total_binding_sites)
median(Apul_binding_sum$total_binding_sites)
Apul_binding_sum %>% arrange(total_binding_sites) %>% tail(10)
cat("\n")
range(Peve_binding_sum$total_binding_sites)
mean(Peve_binding_sum$total_binding_sites)
median(Peve_binding_sum$total_binding_sites)
Peve_binding_sum %>% arrange(total_binding_sites) %>% tail(10)
cat("\n")
range(Ptuh_binding_sum$total_binding_sites)
mean(Ptuh_binding_sum$total_binding_sites)
median(Ptuh_binding_sum$total_binding_sites)
Ptuh_binding_sum %>% arrange(total_binding_sites) %>% tail(10)
```

The vast majority of putatively interacting miRNA-mRNA pairs across all three species are predicted to bind in only one location on the mRNA. A small subset are predicted to have multiple valid binding sites. These instances of multiple binding are usually limited to 2-3 binding sites, and are almost exclusively found in the CDS region of the mRNA. There are two notable exceptions in P.tuahiniensis where there are multiple binding sites predicted in the 3UTR. miR-100 exhibits multiple binding with targets in all three species, again largely via binding in the CDS region. Most notably, in A.pulchra, miR-100 is predicted to bind to one target (FUN_014449, an unannotated gene) in a remarkable *74* configurations!

```{r}
ggplot(Apul_binding_sum, aes(x = total_binding_sites, y = abs(PCC.cor), color = as.factor(sign(PCC.cor)))) +
  geom_point() + 
  geom_jitter()

ggplot(Peve_binding_sum, aes(x = total_binding_sites, y = abs(PCC.cor), color = as.factor(sign(PCC.cor)))) +
  geom_point() + 
  geom_jitter()

ggplot(Ptuh_binding_sum, aes(x = total_binding_sites, y = abs(PCC.cor), color = as.factor(sign(PCC.cor)))) +
  geom_point() + 
  geom_jitter()
```

There's no apparent relationship between number of binding sites and magnitude of PCC expression correlation.

# Is correlation direction related to binding region?

```{r}
cat("A.pulchra:", "\n")
# Create table of binding region vs. correlation direction
Apul_cont_table <- table(Apul_FA_df$region, 
                         ifelse(Apul_FA_df$PCC.cor < 0, "negative", "positive"))
print(Apul_cont_table)
# Chisq test of independence
chisq.test(Apul_cont_table)
chisq.test(Apul_cont_table)$stdres

cat("\n", "\n", "P.evermanni:", "\n")

# Create table of binding region vs. correlation direction
Peve_cont_table <- table(Peve_FA_df$region, 
                         ifelse(Peve_FA_df$PCC.cor < 0, "negative", "positive"))
print(Peve_cont_table)
# Chisq test of independence
chisq.test(Peve_cont_table)
chisq.test(Peve_cont_table)$stdres

cat("\n", "\n", "P.tuahiniensis:", "\n")

# Create table of binding region vs. correlation direction
Ptuh_cont_table <- table(Ptuh_FA_df$region, 
                         ifelse(Ptuh_FA_df$PCC.cor < 0, "negative", "positive"))
print(Ptuh_cont_table)
# Chisq test of independence
chisq.test(Ptuh_cont_table)
chisq.test(Ptuh_cont_table)$stdres
```