14.1-Apul-miRNA-mRNA-coexpression-additional_inputs ================ Kathleen Durkin 2025-02-26 - 0.1 Obtain Pearson’s coefficient correlation values - 1 Incorporate miRanda target prediction results - 1.1 full mRNA - 1.2 3’UTR - 1.3 5’UTR - 2 Summary ``` r #library(energy) library(tidyr) library(dplyr) ``` ## ## Attaching package: 'dplyr' ## The following objects are masked from 'package:stats': ## ## filter, lag ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union ``` r library(readr) library(ggplot2) ``` Now that we’ve found putative interactions, including those with high complementarity, in `07.1-Apul-miRNA-mRNA-miRanda-additional_inputs` we need to validate miRNA function by examining patterns of coexpression. We’d expect a putatively-interacting miRNA-mRNA pair to be highly coexpressed, and we’d expect a negative relationship to indicate target cleavage or translational repression. ## 0.1 Obtain Pearson’s coefficient correlation values Read in, format, and normalize data ``` r mRNA_counts <- read.csv("../../D-Apul/output/02.20-D-Apul-RNAseq-alignment-HiSat2/apul-gene_count_matrix.csv") mRNA_counts <- as.data.frame(mRNA_counts) rownames(mRNA_counts) <- mRNA_counts[,1] #set first column that contains gene names as rownames mRNA_counts <- mRNA_counts[,-1] # remove column w/ gene names # Remove any genes with 0 counts across samples mRNA_counts<-mRNA_counts %>% mutate(Total = rowSums(.[, 1:5]))%>% filter(!Total==0)%>% dplyr::select(!Total) miRNA_counts <- read.delim("../output/03.10-D-Apul-sRNAseq-expression-DESeq2/Apul_miRNA_ShortStack_counts_formatted.txt") # Format miRNA column names to match the mRNA names colnames(miRNA_counts) <- sub("_.*", "", colnames(miRNA_counts)) # Order miRNA column names to match the mRNA column order miRNA_counts <- miRNA_counts[, colnames(mRNA_counts)] # Check that the columns match name and order for both dataframes colnames(miRNA_counts) == colnames(mRNA_counts) ``` ## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE ## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE ## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE ``` r # Remove any miRNAs with 0 for all samples miRNA_counts <- miRNA_counts %>% mutate(Total = rowSums(.[, 1:5]))%>% filter(!Total==0)%>% dplyr::select(!Total) # Function to normalize counts (simple RPM normalization) normalize_counts <- function(counts) { rpm <- t(t(counts) / colSums(counts)) * 1e6 return(rpm) } miRNA_norm <- normalize_counts(miRNA_counts) mRNA_norm <- normalize_counts(mRNA_counts) ``` ``` r # Function to calculate PCC and p-value for a pair of vectors calc_pcc <- function(x, y) { result <- cor.test(x, y, method = "pearson") return(c(PCC = result$estimate, p_value = result$p.value)) } # Create a data frame of all miRNA-mRNA pairs pairs <- expand.grid(miRNA = rownames(miRNA_norm), mRNA = rownames(mRNA_norm)) # Calculate PCC and p-value for each pair pcc_results <- pairs %>% rowwise() %>% mutate( pcc_stats = list(calc_pcc(miRNA_norm[miRNA,], mRNA_norm[mRNA,])) ) %>% unnest_wider(pcc_stats) # Adjust p-values for FDR pcc_results <- pcc_results %>% mutate(adjusted_p_value = p.adjust(p_value, method = "fdr")) # Save as csv write.csv(pcc_results, "../output/14.1-Apul-miRNA-mRNA-coexpression-additional_inputs/Apul-PCC_miRNA_mRNA-full.csv") ``` Check ``` r # Read in results #pcc_results <- read.csv("../output/14.1-Apul-miRNA-mRNA-coexpression-additional_inputs/Apul-PCC_miRNA_mRNA-full.csv") # Use this code to download the PCC results if needed pcc_results <- read.csv("https://gannet.fish.washington.edu/kdurkin1/ravenbackups/timeseries_molecular/D-Apul/output/14.1-Apul-miRNA-mRNA-coexpression-additional_inputs/Apul-PCC_miRNA_mRNA-full.csv") nrow(pcc_results) ``` ## [1] 1534794 ``` r nrow(pcc_results%>% filter(abs(PCC.cor) > 0.90)) ``` ## [1] 24 ``` r nrow(pcc_results %>% filter(p_value < 0.05)) ``` ## [1] 170507 ``` r nrow(pcc_results %>% filter(p_value < 0.05 & abs(PCC.cor) > 0.90)) ``` ## [1] 24 of the 1,534,794 possible miRNA-mRNA interactions, 24 have a Pearson’s correlation coefficient with a magnitude above 0.9, and 170,507 have a significant correlation (pval\<0.05). All of the coefficients with a magnitude \>0.9 are significant. I find it interesting that so many putative interactions have significant pvalues, but so few have correlation coefficients above 0.9. What does the distribution of significant correlation coefficients look like? ``` r ggplot(pcc_results[pcc_results$p_value < 0.05,], aes(x = PCC.cor)) + geom_density(fill = "blue", alpha = 0.5) + labs(title = "Density of Significant Correlations", x = "Correlation Coefficient", y = "Density") + theme_minimal() ``` ![](14.1-Apul-miRNA-mRNA-coexpression-additional_inputs_files/figure-gfm/unnamed-chunk-5-1.png) ``` r ggplot(pcc_results[pcc_results$p_value < 0.05,], aes(x = abs(PCC.cor))) + geom_density(fill = "blue", alpha = 0.5) + labs(title = "Density of Significant Correlations (magnitudes)", x = "Magnitude of Correlation Coefficient", y = "Density") + theme_minimal() ``` ![](14.1-Apul-miRNA-mRNA-coexpression-additional_inputs_files/figure-gfm/unnamed-chunk-5-2.png) # 1 Incorporate miRanda target prediction results ## 1.1 full mRNA ``` r # miRNA-mRNA_full miRanda output miRNA_mRNA_miRanda <- read_delim("../output/07.1-Apul-miRNA-mRNA-miRanda-additional_inputs/Apul-miRanda-mRNA_full-strict-parsed.txt", col_names=FALSE) ``` ## Rows: 814426 Columns: 9 ## ── Column specification ──────────────────────────────────────────────────────── ## Delimiter: "\t" ## chr (6): X1, X2, X5, X6, X8, X9 ## dbl (3): X3, X4, X7 ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` r colnames(miRNA_mRNA_miRanda) <- c("mirna", "Target", "Score", "Energy_Kcal_Mol", "Query_Aln", "Subject_Aln", "Al_Len", "Subject_Identity", "Query_Identity") # format miRNA and mRNA names geneIDs <- read_delim("../output/05-Apul-annotate-UTRs/Apul-mRNA-FUNids.txt", col_names=FALSE) ``` ## Rows: 36447 Columns: 5 ## ── Column specification ──────────────────────────────────────────────────────── ## Delimiter: "\t" ## chr (5): X1, X2, X3, X4, X5 ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` r geneIDs$X4 <- gsub("Parent=", "", geneIDs$X4) miRNA_mRNA_miRanda$mirna <- gsub(">", "", miRNA_mRNA_miRanda$mirna) miRNA_mRNA_miRanda$mirna <- gsub("\\..*", "", miRNA_mRNA_miRanda$mirna) miRNA_mRNA_miRanda <- left_join(miRNA_mRNA_miRanda, geneIDs, by=c("Target" = "X1")) miRNA_mRNA_miRanda <- select(miRNA_mRNA_miRanda, -X2,-X3,-X5) # Finally, create a column that conatins both the miRNA and interacting mRNA pcc_results$interaction <- paste(pcc_results$miRNA, "_", pcc_results$mRNA) miRNA_mRNA_miRanda$interaction <- paste(miRNA_mRNA_miRanda$mirna, "_", miRNA_mRNA_miRanda$X4) # Annotate w PCC info miRNA_mRNA_miRanda <- left_join(miRNA_mRNA_miRanda, pcc_results, by="interaction") ``` ``` r # Filter to high complementarity putative targets target_21bp <- miRNA_mRNA_miRanda[miRNA_mRNA_miRanda$Al_Len > 20,] target_21bp_3mis <- target_21bp[target_21bp$Subject_Identity>85,] # How many w significant correlation? nrow(miRNA_mRNA_miRanda) ``` ## [1] 814426 ``` r nrow(miRNA_mRNA_miRanda %>% filter(p_value < 0.05)) ``` ## [1] 99958 ``` r nrow(target_21bp %>% filter(p_value < 0.05)) ``` ## [1] 17425 ``` r nrow(target_21bp_3mis %>% filter(p_value < 0.05)) ``` ## [1] 26 For miRNA binding to the mRNA CDS, miRanda predicts 814,426 putative interactions. Of these, 99,958 have significant PCCs; 17,425 are \>21bp and have signficant PCCs; and 26 are \>21bp with \<=3 mismatches and have significant PCCs. ## 1.2 3’UTR ``` r # Read in data # miRNA-mRNA_full miRanda output miRNA_3UTR_miRanda <- read_delim("../output/07-Apul-miRNA-mRNA-miRanda/Apul-miRanda-3UTR-strict-parsed.txt", col_names=FALSE) ``` ## Rows: 123123 Columns: 9 ## ── Column specification ──────────────────────────────────────────────────────── ## Delimiter: "\t" ## chr (6): X1, X2, X5, X6, X8, X9 ## dbl (3): X3, X4, X7 ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` r colnames(miRNA_3UTR_miRanda) <- c("mirna", "Target", "Score", "Energy_Kcal_Mol", "Query_Aln", "Subject_Aln", "Al_Len", "Subject_Identity", "Query_Identity") # format miRNA and mRNA names geneIDs <- read_delim("../output/05-Apul-annotate-UTRs/Apul-3UTR-FUNids.txt", col_names=FALSE) ``` ## Rows: 37359 Columns: 5 ## ── Column specification ──────────────────────────────────────────────────────── ## Delimiter: "\t" ## chr (5): X1, X2, X3, X4, X5 ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` r geneIDs$X4 <- gsub("Parent=", "", geneIDs$X4) miRNA_3UTR_miRanda$mirna <- gsub(">", "", miRNA_3UTR_miRanda$mirna) miRNA_3UTR_miRanda$mirna <- gsub("\\..*", "", miRNA_3UTR_miRanda$mirna) #miRNA_3UTR_miRanda$Target <- gsub("::.*", "", miRNA_3UTR_miRanda$Target) miRNA_3UTR_miRanda <- left_join(miRNA_3UTR_miRanda, geneIDs, by=c("Target" = "X1")) ``` ## Warning in left_join(miRNA_3UTR_miRanda, geneIDs, by = c(Target = "X1")): Detected an unexpected many-to-many relationship between `x` and `y`. ## ℹ Row 103 of `x` matches multiple rows in `y`. ## ℹ Row 4166 of `y` matches multiple rows in `x`. ## ℹ If a many-to-many relationship is expected, set `relationship = ## "many-to-many"` to silence this warning. ``` r miRNA_3UTR_miRanda <- select(miRNA_3UTR_miRanda, -X2,-X3,-X5) # Somehow ended up with a many-to-many relationship, so remove duplicate rows miRNA_3UTR_miRanda <- miRNA_3UTR_miRanda[!duplicated(miRNA_3UTR_miRanda), ] # Finally, create a column that conatins both the miRNA and interacting mRNA miRNA_3UTR_miRanda$interaction <- paste(miRNA_3UTR_miRanda$mirna, "_", miRNA_3UTR_miRanda$X4) # Annotate w PCC info miRNA_3UTR_miRanda <- left_join(miRNA_3UTR_miRanda, pcc_results, by="interaction") ``` ``` r # Filter to high complementarity putative targets target_3UTR_21bp <- miRNA_3UTR_miRanda[miRNA_3UTR_miRanda$Al_Len > 20,] target_3UTR_21bp_3mis <- target_3UTR_21bp[target_3UTR_21bp$Subject_Identity>85,] # How many w significant correlation? nrow(miRNA_3UTR_miRanda) ``` ## [1] 123123 ``` r nrow(miRNA_3UTR_miRanda %>% filter(p_value < 0.05)) ``` ## [1] 11693 ``` r nrow(target_3UTR_21bp %>% filter(p_value < 0.05)) ``` ## [1] 2142 ``` r nrow(target_3UTR_21bp_3mis %>% filter(p_value < 0.05)) ``` ## [1] 3 For miRNA binding to the 3’UTR, miRanda predicts 123,123 putative interactions. Of these, 11,693 have significant PCCs; 2,142 are \>21bp and have signficant PCCs; and 3 are \>21bp with \<=3 mismatches and have a significant PCC. ## 1.3 5’UTR ``` r # miRNA-5'UTR miRanda output miRNA_5UTR_miRanda <- read_delim("../output/07.1-Apul-miRNA-mRNA-miRanda-additional_inputs/Apul-miRanda-5UTR_1kb-strict-parsed.txt", col_names=FALSE) ``` ## Rows: 115265 Columns: 9 ## ── Column specification ──────────────────────────────────────────────────────── ## Delimiter: "\t" ## chr (6): X1, X2, X5, X6, X8, X9 ## dbl (3): X3, X4, X7 ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` r colnames(miRNA_5UTR_miRanda) <- c("mirna", "Target", "Score", "Energy_Kcal_Mol", "Query_Aln", "Subject_Aln", "Al_Len", "Subject_Identity", "Query_Identity") # format miRNA and mRNA names geneIDs <- read_delim("../output/05-Apul-annotate-UTRs/Apul-5UTR-FUNids.txt", col_names=FALSE) ``` ## Rows: 37158 Columns: 5 ## ── Column specification ──────────────────────────────────────────────────────── ## Delimiter: "\t" ## chr (5): X1, X2, X3, X4, X5 ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` r geneIDs$X4 <- gsub("Parent=", "", geneIDs$X4) miRNA_5UTR_miRanda$mirna <- gsub(">", "", miRNA_5UTR_miRanda$mirna) miRNA_5UTR_miRanda$mirna <- gsub("\\..*", "", miRNA_5UTR_miRanda$mirna) miRNA_5UTR_miRanda <- left_join(miRNA_5UTR_miRanda, geneIDs, by=c("Target" = "X1")) ``` ## Warning in left_join(miRNA_5UTR_miRanda, geneIDs, by = c(Target = "X1")): Detected an unexpected many-to-many relationship between `x` and `y`. ## ℹ Row 5 of `x` matches multiple rows in `y`. ## ℹ Row 3781 of `y` matches multiple rows in `x`. ## ℹ If a many-to-many relationship is expected, set `relationship = ## "many-to-many"` to silence this warning. ``` r miRNA_5UTR_miRanda <- select(miRNA_5UTR_miRanda, -X2,-X3,-X5) # Somehow ended up with a many-to-many relationship, so remove duplicate rows miRNA_5UTR_miRanda <- miRNA_5UTR_miRanda[!duplicated(miRNA_5UTR_miRanda), ] # Finally, create a column that contains both the miRNA and interacting mRNA pcc_results$interaction <- paste(pcc_results$miRNA, "_", pcc_results$mRNA) miRNA_5UTR_miRanda$interaction <- paste(miRNA_5UTR_miRanda$mirna, "_", miRNA_5UTR_miRanda$X4) # Annotate w PCC info miRNA_5UTR_miRanda <- left_join(miRNA_5UTR_miRanda, pcc_results, by="interaction") ``` ``` r # Filter to high complementarity putative targets target_5UTR_21bp <- miRNA_5UTR_miRanda[miRNA_5UTR_miRanda$Al_Len > 20,] target_5UTR_21bp_3mis <- target_5UTR_21bp[target_5UTR_21bp$Subject_Identity>85,] # How many w significant correlation? nrow(miRNA_5UTR_miRanda) ``` ## [1] 115265 ``` r nrow(miRNA_5UTR_miRanda %>% filter(p_value < 0.05)) ``` ## [1] 10623 ``` r nrow(target_5UTR_21bp %>% filter(p_value < 0.05)) ``` ## [1] 1775 ``` r nrow(target_5UTR_21bp_3mis %>% filter(p_value < 0.05)) ``` ## [1] 3 For miRNA binding to the 5’UTR, miRanda predicts 115,265 putative interactions. Of these, 10,623 have significant PCCs; 1,775 are \>21bp and have signficant PCCs; and 3 are \>21bp with \<=3 mismatches and have significant PCCs. # 2 Summary How does different input and/or complementarity filtering affect \# putative interactions: Reminder summary of miRanda results: | Input | unfiltered | filtered for complementarity | % retained | |:----------|:-----------|:-----------------------------|:-----------| | full mRNA | 814426 | 181 | 0.0222 % | | 5’UTR | 115265 | 20 | 0.0174 % | | 3’UTR | 123123 | 33 | 0.0268 % | For different filters, how many putative interactions ***also show significant coexpression*** (PCC pval \< 0.05)? | Input | All | 21bp | 21bp, \>=3 mismatch | |:----------|:------|:------|:--------------------| | full mRNA | 99958 | 17425 | 26 | | 5’UTR | 10623 | 1775 | 3 | | 3’UTR | 11693 | 2142 | 3 | Note that some putative interactions indicated by miRanda are not present in the counts data (i.e. the miRNA and/or mRNA had 0 counts inour RNAseq data), and are thus excluded from the PCC-filtered data Is there a clear “cutoff” for what complementarity parameters are most associated with significant coexpression? ``` r miRNA_mRNA_miRanda <- miRNA_mRNA_miRanda %>% mutate(Source = "mRNA") miRNA_3UTR_miRanda <- miRNA_3UTR_miRanda %>% mutate(Source = "3UTR") miRNA_5UTR_miRanda <- miRNA_5UTR_miRanda %>% mutate(Source = "5UTR") # Combine combined_df <- bind_rows(miRNA_mRNA_miRanda, miRNA_3UTR_miRanda, miRNA_5UTR_miRanda) # Convert Source to a factor for discrete shapes combined_df$Source <- factor(combined_df$Source) # Convert p-value condition into a categorical variable for coloring combined_df$Significance <- ifelse(combined_df$p_value < 0.05, "p < 0.05", "p ≥ 0.05") significant_df <- combined_df %>% filter(p_value < 0.05) # Plot 2D density to visualize distribution of significantly coexpressed interactions accross complementarity metrics ggplot(significant_df, aes(x = Al_Len, y = as.numeric(gsub("%", "", Subject_Identity)))) + geom_density_2d_filled(alpha = 0.8) + facet_wrap(~ Source, scales = "free") + # allow different scales per facet scale_fill_viridis_d(option = "D") + # different color palettes (discrete) labs(x = "Alignment Length (Al_Len)", y = "Subject Identity (%)", fill = "Density", title = "2D Density - Significant p < 0.05") + theme_minimal() ``` ![](14.1-Apul-miRNA-mRNA-coexpression-additional_inputs_files/figure-gfm/unnamed-chunk-12-1.png) Interesting… there are signifcant coexpressions happening across both metrics of complementarity. ``` r # Save putative interactions with significantly correlated coexpression for visualization # (e.g. creating an interaction network plot in Cytoscape) write.csv(combined_df[combined_df$p_value < 0.05,], "../output/14.1-Apul-miRNA-mRNA-coexpression-additional_inputs/miRanda-PCC-significant-mRNA_3UTR_5UTR.csv", row.names = FALSE) ```