--- title: "24-Apul-miRanda-input-comparisons" author: "Kathleen Durkin" date: "2024-02-12" always_allow_html: true output: github_document: toc: true toc_depth: 3 number_sections: true html_preview: true bookdown::html_document2: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true bibliography: ../../references.bib link-citations: true --- ```{r} library(readr) library(dplyr) library(ggplot2) ``` Up to this point, our miRNA target prediction has been primarily performed using the tool miRanda, which requires only seed binding, with 3'UTR regions as input. However, there is growing evidence that cnidarian miRNA binding functions similarly to plants, requiring near-full complementarity and to act primarily through target cleavage, which can occur at any location in the mRNA sequence (not just the 3'UTR). This would mean our current miRanda target predictions are both a) not sufficiently restricted to fll complementarity, and b) incorrectly limited to the 3'UTR. Let's see what happens when we adjust those parameters # miRNA and full mRNA coding sequences Inputs: - FASTA of A.pulchra mRNA coding sequences - FASTA of A.pulchra mature miRNAs ## Get mRNA fasta ```{r, engine='bash', eval=FALSE} /home/shared/bedtools2/bin/bedtools getfasta \ -fi "../data/Apulchra-genome.fa" \ -bed "../data/Apulcra-genome-mRNA_only.gff" \ -fo "../data/24-Apul-miRanda-input-comparisons/Apul_mRNA_full.fa" ``` ## Run miRanda ```{r, engine='bash', eval=FALSE} # Same settings we've been using: # score cutoff >100 # energy cutoff <-10 # strict binding /home/shared/miRanda-3.3a/src/miranda \ ../output/17-Apul-miRNA-lncRNA-BLASTs-RNAhybrid/Apul_ShortStack_4.1.0_mature.fasta \ ../data/24-Apul-miRanda-input-comparisons/Apul_mRNA_full.fa \ -sc 100 \ -en -10 \ -strict \ -out ../output/24-Apul-miRanda-input-comparisons/Apul-miRanda-mRNA_full-strict_all.tab ``` ## Summarize results Let's look at the output ```{r, engine='bash', eval=FALSE} echo "miranda run finished!" echo "Counting number of interacting miRNA-lncRNA pairs" zgrep -c "Performing Scan" ../output/24-Apul-miRanda-input-comparisons/Apul-miRanda-mRNA_full-strict_all.tab echo "Parsing output" grep -A 1 "Scores for this hit:" ../output/24-Apul-miRanda-input-comparisons/Apul-miRanda-mRNA_full-strict_all.tab | sort | grep '>' > ../output/24-Apul-miRanda-input-comparisons/Apul-miRanda-mRNA_full-strict-parsed.txt echo "counting number of putative interactions predicted (can include multiple interactions between single miRNA-lncRNA pair)" wc -l ../output/24-Apul-miRanda-input-comparisons/Apul-miRanda-mRNA_full-strict_all.tab ``` ```{r, engine='bash'} echo "number of putative interactions:" wc -l ../output/24-Apul-miRanda-input-comparisons/Apul-miRanda-mRNA_full-strict-parsed.txt ``` This is a lot of putative interactions! Note though, that miRanda only requires complementarity of a 8bp seed region of the miRNA. We instead want to look for binding with full or near-full complementarity. Let's look at how many putative interactions are predicted for a binding length of at least 21 nucleotides (the length of our smallest mature miRNA). ```{r, engine='bash'} echo "number of putative interactions of at least 21 nucleotides" awk -F'\t' '$7 >= 21' ../output/24-Apul-miRanda-input-comparisons/Apul-miRanda-mRNA_full-strict-parsed.txt | wc -l echo "" echo "check some:" awk -F'\t' '$7 >= 21' ../output/24-Apul-miRanda-input-comparisons/Apul-miRanda-mRNA_full-strict-parsed.txt | head -5 ``` The header for this output is formatted as: mirna Target Score Energy-Kcal/Mol Query-Aln(start-end) Subject-Al(Start-End) Al-Len Subject-Identity Query-Identity We can see from the percent identities (last 2 entries) that this number includes alignments with multiple mismatches. Let's filter again to reduce the number of permissible mismatches. Let's say we want no more than 3 mismatches (a gap is counted as a mismatch). For an alignment of 21 nucleotides, this would be an percent identity of (21-3)/21 = 85.7%. The miRNA is our "subject", so we will filter by column 8. ```{r, engine='bash'} echo "number of putative interactions of at least 21 nucleotides, with at most 3 mismatches" awk -F'\t' '$7 >= 21' ../output/24-Apul-miRanda-input-comparisons/Apul-miRanda-mRNA_full-strict-parsed.txt | awk -F'\t' '$8 >= 85' | wc -l echo "" echo "check some:" awk -F'\t' '$7 >= 21' ../output/24-Apul-miRanda-input-comparisons/Apul-miRanda-mRNA_full-strict-parsed.txt | awk -F'\t' '$8 >= 85' | head -5 ``` So filtering for full or near-full complementarity reduced the number of putative interactions between miRNA and full mRNA sequences from 617982 to 143. # miRNA and mRNA 5'UTRs We've also created a gff of 1kb 5'UTR regions (using the same method used to define 1kb 3'UTRs, in `15-Apul-annotate-UTRs`). Let's try running this through miRanda as well, since it's possible (though not necessarily expected) that miRNAs will bind here ## Get 5'UTR fasta ```{r, engine='bash', eval=FALSE} /home/shared/bedtools2/bin/bedtools getfasta \ -fi "../data/Apulchra-genome.fa" \ -bed "../output/15-Apul-annotate-UTRs/Apul.GFFannotation.5UTR_1kb_corrected.gff" \ -fo "../data/24-Apul-miRanda-input-comparisons/Apul_5UTR_1kb_corrected.fa" ``` ## Run miRanda ```{r, engine='bash', eval=FALSE} # Same settings we've been using: # score cutoff >100 # energy cutoff <-10 # strict binding /home/shared/miRanda-3.3a/src/miranda \ ../output/17-Apul-miRNA-lncRNA-BLASTs-RNAhybrid/Apul_ShortStack_4.1.0_mature.fasta \ ../data/24-Apul-miRanda-input-comparisons/Apul_5UTR_1kb_corrected.fa \ -sc 100 \ -en -10 \ -strict \ -out ../output/24-Apul-miRanda-input-comparisons/Apul-miRanda-5UTR_1kb-strict_all.tab ``` ## Summarize results Let's look at the output ```{r, engine='bash', eval=FALSE} echo "Number of interacting miRNA-lncRNA pairs" zgrep -c "Performing Scan" ../output/24-Apul-miRanda-input-comparisons/Apul-miRanda-5UTR_1kb-strict_all.tab echo "Parsing output" grep -A 1 "Scores for this hit:" ../output/24-Apul-miRanda-input-comparisons/Apul-miRanda-5UTR_1kb-strict_all.tab | sort | grep '>' > ../output/24-Apul-miRanda-input-comparisons/Apul-miRanda-5UTR_1kb-strict-parsed.txt echo "Number of putative interactions predicted (can include multiple interactions between single miRNA-lncRNA pair)" wc -l ../output/24-Apul-miRanda-input-comparisons/Apul-miRanda-5UTR_1kb-strict_all.tab ``` How many putative interactions are predicted for a binding length of at least 21 nucleotides (the length of our smallest mature miRNA)? ```{r, engine='bash'} echo "number of putative interactions:" wc -l ../output/24-Apul-miRanda-input-comparisons/Apul-miRanda-5UTR_1kb-strict-parsed.txt echo "" echo "number of putative interactions of at least 21 nucleotides" awk -F'\t' '$7 >= 21' ../output/24-Apul-miRanda-input-comparisons/Apul-miRanda-5UTR_1kb-strict-parsed.txt | wc -l echo "" echo "number of putative interactions of at least 21 nucleotides, with at most 3 mismatches" awk -F'\t' '$7 >= 21' ../output/24-Apul-miRanda-input-comparisons/Apul-miRanda-5UTR_1kb-strict-parsed.txt | awk -F'\t' '$8 >= 85' | wc -l ``` So filtering for full or near-full complementarity reduced the number of putative interactions between miRNA and 5'UTR sequences from 87830 to 12 # miRNA and mRNA 3'UTRs Now let's see how filtering changes the outputs of miRanda run with only the 3'UTR mRNA region (the input we have been using up till now) ```{r, engine='bash'} echo "total number of putative interactions:" wc -l ../output/09-Apul-mRNA-miRNA-interactions/miranda_strict_all_1kb_parsed_apul_updated.txt echo "number of putative interactions of at least 21 nucleotides" awk -F'\t' '$7 >= 21' ../output/09-Apul-mRNA-miRNA-interactions/miranda_strict_all_1kb_parsed_apul_updated.txt | wc -l echo "number of putative interactions of at least 21 nucleotides, with at most 3 mismatches" awk -F'\t' '$7 >= 21' ../output/09-Apul-mRNA-miRNA-interactions/miranda_strict_all_1kb_parsed_apul_updated.txt | awk -F'\t' '$8 >= 85' | wc -l ``` When only mRNA 3'UTR regions are used as input, filtering for full/near-full complementarity reduces the number of putative interactions from 6109 to 13. # Examine coexpression Now that we've found putative interactions with high complementarity, we need to validate miRNA function by examining patterns of coexpression. We'd expect a putatively-interacting miRNA-mRNA pair to be highly coexpressed, and we'd expect a negative relationship to indicate target cleavage. ## Obtain Pearson's coefficient correlation values This has already been run by Jill in `09-Apul-mRNA-miRNA-interactions`, but I think the csv stored in this repo has been filtered in some way, because it contains fewer entries than expected: ```{r, engine='bash'} wc -l ../output/09-Apul-mRNA-miRNA-interactions/Apul-PCC_miRNA_mRNA.csv ``` Read in, format, and normalize data ```{r} mRNA_counts <- read_csv("../../D-Apul/output/07-Apul-Hisat/gene_count_matrix.csv") mRNA_counts <- as.data.frame(mRNA_counts) rownames(mRNA_counts) <- mRNA_counts[,1] #set first column that contains gene names as rownames mRNA_counts <- mRNA_counts[,-1] # remove column w/ gene names # Remove any genes with 0 counts across samples mRNA_counts<-mRNA_counts %>% mutate(Total = rowSums(.[, 1:5]))%>% filter(!Total==0)%>% dplyr::select(!Total) miRNA_counts <- read.delim("../output/03.1-Apul-sRNA-summary/Apul_miRNA_ShortStack_counts_formatted.txt") # Remove any miRNAs with 0 for all samples miRNA_counts <- miRNA_counts %>% mutate(Total = rowSums(.[, 1:5]))%>% filter(!Total==0)%>% dplyr::select(!Total) # Rename gene count cols to match miRNA count cols colnames(mRNA_counts) <- c("sample140", "sample145", "sample150", "sample173", "sample178") # Function to normalize counts (simple RPM normalization) normalize_counts <- function(counts) { rpm <- t(t(counts) / colSums(counts)) * 1e6 return(rpm) } miRNA_norm <- normalize_counts(miRNA_counts) mRNA_norm <- normalize_counts(mRNA_counts) ``` ```{r,} nrow(miRNA_norm)*nrow(mRNA_norm) ``` We'd expect 1311336 PCC values (after filtering out genes/miRNAs with 0 counts), but there are only 916384 in Jill's PCC csv. I'll rerun the PCC just to make sure I have correlation values for all miRNA-mRNA pairs of interest ```{r, eval=FALSE} library(energy) # Function to calculate PCC and p-value for a pair of vectors calc_pcc <- function(x, y) { result <- cor.test(x, y, method = "pearson") return(c(PCC = result$estimate, p_value = result$p.value)) } # Create a data frame of all miRNA-mRNA pairs pairs <- expand.grid(miRNA = rownames(miRNA_norm), mRNA = rownames(mRNA_norm)) # Calculate PCC and p-value for each pair pcc_results <- pairs %>% rowwise() %>% mutate( pcc_stats = list(calc_pcc(miRNA_norm[miRNA,], mRNA_norm[mRNA,])) ) %>% unnest_wider(pcc_stats) # Adjust p-values for FDR pcc_results <- pcc_results %>% mutate(adjusted_p_value = p.adjust(p_value, method = "fdr")) # Save as csv write.csv(pcc_results, "../output/24-Apul-miRanda-input-comparisons/Apul-PCC_miRNA_mRNA-full.csv") ``` Check ```{r} # Read in results # pcc_results <- read.csv("../output/24-Apul-miRanda-input-comparisons/Apul-PCC_miRNA_mRNA-full.csv") # Use this code to download the PCC results if needed pcc_results <- read.csv("https://gannet.fish.washington.edu/kdurkin1/ravenbackups/deep-dive-expression/D-Apul/output/24-Apul-miRanda-input-comparisons/Apul-PCC_miRNA_mRNA-full.csv") nrow(pcc_results) nrow(pcc_results%>% filter(PCC.cor > 0.90)) nrow(pcc_results %>% filter(p_value < 0.05)) nrow(pcc_results %>% filter(p_value < 0.05 & PCC.cor > 0.90)) ``` of the 1,311,336 possible miRNA-mRNA interactions, 35,445 have a Pearson's correlation coefficient of above 0.9, and 66,154 have a significant correlation (pval\<0.05). All of the coefficients \>0.9 are significant. ## full mRNA ```{r} # miRNA-mRNA_full miRanda output miRNA_mRNA_miRanda <- read_delim("../output/24-Apul-miRanda-input-comparisons/Apul-miRanda-mRNA_full-strict-parsed.txt", col_names=FALSE) colnames(miRNA_mRNA_miRanda) <- c("mirna", "Target", "Score", "Energy_Kcal_Mol", "Query_Aln", "Subject_Aln", "Al_Len", "Subject_Identity", "Query_Identity") # format miRNA and mRNA names geneIDs <- read_delim("../output/15-Apul-annotate-UTRs/Apul-mRNA-FUNids.txt", col_names=FALSE) geneIDs$X4 <- gsub("Parent=", "", geneIDs$X4) miRNA_mRNA_miRanda$mirna <- gsub(">", "", miRNA_mRNA_miRanda$mirna) miRNA_mRNA_miRanda$mirna <- gsub("\\..*", "", miRNA_mRNA_miRanda$mirna) miRNA_mRNA_miRanda <- left_join(miRNA_mRNA_miRanda, geneIDs, by=c("Target" = "X1")) miRNA_mRNA_miRanda <- select(miRNA_mRNA_miRanda, -X2,-X3,-X5) # Finally, create a column that conatins both the miRNA and interacting mRNA pcc_results$interaction <- paste(pcc_results$miRNA, "_", pcc_results$mRNA) miRNA_mRNA_miRanda$interaction <- paste(miRNA_mRNA_miRanda$mirna, "_", miRNA_mRNA_miRanda$X4) # Annotate w PCC info miRNA_mRNA_miRanda <- left_join(miRNA_mRNA_miRanda, pcc_results, by="interaction") ``` ```{r} # Filter to high complementarity putative targets target_21bp <- miRNA_mRNA_miRanda[miRNA_mRNA_miRanda$Al_Len > 20,] target_21bp_3mis <- target_21bp[target_21bp$Subject_Identity>85,] # How many w significant correlation? nrow(miRNA_mRNA_miRanda) nrow(miRNA_mRNA_miRanda %>% filter(p_value < 0.05)) nrow(target_21bp %>% filter(p_value < 0.05)) nrow(target_21bp_3mis %>% filter(p_value < 0.05)) ``` For miRNA binding to the mRNA CDS, miRanda predicts 617,982 putative interactions. Of these, 24,936 have significant PCCs; 4,300 are \>21bp and have signficant PCCs; and 7 are \>21bp with \<=3 mismatches and have significant PCCs. ## 3'UTR ```{r} # Read in data # miRNA-mRNA_full miRanda output miRNA_3UTR_miRanda <- read_delim("../output/09-Apul-mRNA-miRNA-interactions/miranda_strict_all_1kb_parsed_apul_updated.txt", col_names=FALSE) colnames(miRNA_3UTR_miRanda) <- c("mirna", "Target", "Score", "Energy_Kcal_Mol", "Query_Aln", "Subject_Aln", "Al_Len", "Subject_Identity", "Query_Identity") miRNA_3UTR_miRanda$mirna <- gsub(">", "", miRNA_3UTR_miRanda$mirna) miRNA_3UTR_miRanda$mirna <- gsub("\\..*", "", miRNA_3UTR_miRanda$mirna) miRNA_3UTR_miRanda$Target <- gsub("::.*", "", miRNA_3UTR_miRanda$Target) # Finally, create a column that conatins both the miRNA and interacting mRNA miRNA_3UTR_miRanda$interaction <- paste(miRNA_3UTR_miRanda$mirna, "_", miRNA_3UTR_miRanda$Target) # Annotate w PCC info miRNA_3UTR_miRanda <- left_join(miRNA_3UTR_miRanda, pcc_results, by="interaction") ``` ```{r} # Filter to high complementarity putative targets target_3UTR_21bp <- miRNA_3UTR_miRanda[miRNA_3UTR_miRanda$Al_Len > 20,] target_3UTR_21bp_3mis <- target_3UTR_21bp[target_3UTR_21bp$Subject_Identity>85,] # How many w significant correlation? nrow(miRNA_3UTR_miRanda) nrow(miRNA_3UTR_miRanda %>% filter(p_value < 0.05)) nrow(target_3UTR_21bp %>% filter(p_value < 0.05)) nrow(target_3UTR_21bp_3mis %>% filter(p_value < 0.05)) ``` For miRNA binding to the 3'UTR, miRanda predicts 6,109 putative interactions. Of these, 266 have significant PCCs; 86 are \>21bp and have signficant PCCs; and 1 is \>21bp with \<=3 mismatches and has a significant PCC. ## 5'UTR ```{r} # miRNA-5'UTR miRanda output miRNA_5UTR_miRanda <- read_delim("../output/24-Apul-miRanda-input-comparisons/Apul-miRanda-5UTR_1kb-strict-parsed.txt", col_names=FALSE) colnames(miRNA_5UTR_miRanda) <- c("mirna", "Target", "Score", "Energy_Kcal_Mol", "Query_Aln", "Subject_Aln", "Al_Len", "Subject_Identity", "Query_Identity") # format miRNA and mRNA names geneIDs <- read_delim("../output/15-Apul-annotate-UTRs/Apul-5UTR-FUNids.txt", col_names=FALSE) geneIDs$X4 <- gsub("Parent=", "", geneIDs$X4) miRNA_5UTR_miRanda$mirna <- gsub(">", "", miRNA_5UTR_miRanda$mirna) miRNA_5UTR_miRanda$mirna <- gsub("\\..*", "", miRNA_5UTR_miRanda$mirna) miRNA_5UTR_miRanda <- left_join(miRNA_5UTR_miRanda, geneIDs, by=c("Target" = "X1")) miRNA_5UTR_miRanda <- select(miRNA_5UTR_miRanda, -X2,-X3,-X5) # Somehow ended up with a many-to-many relationship, so remove duplicate rows miRNA_5UTR_miRanda <- miRNA_5UTR_miRanda[!duplicated(miRNA_5UTR_miRanda), ] # Finally, create a column that contains both the miRNA and interacting mRNA pcc_results$interaction <- paste(pcc_results$miRNA, "_", pcc_results$mRNA) miRNA_5UTR_miRanda$interaction <- paste(miRNA_5UTR_miRanda$mirna, "_", miRNA_5UTR_miRanda$X4) # Annotate w PCC info miRNA_5UTR_miRanda <- left_join(miRNA_5UTR_miRanda, pcc_results, by="interaction") ``` ```{r} # Filter to high complementarity putative targets target_5UTR_21bp <- miRNA_5UTR_miRanda[miRNA_5UTR_miRanda$Al_Len > 20,] target_5UTR_21bp_3mis <- target_5UTR_21bp[target_5UTR_21bp$Subject_Identity>85,] # How many w significant correlation? nrow(miRNA_5UTR_miRanda) nrow(miRNA_5UTR_miRanda %>% filter(p_value < 0.05)) nrow(target_5UTR_21bp %>% filter(p_value < 0.05)) nrow(target_5UTR_21bp_3mis %>% filter(p_value < 0.05)) ``` For miRNA binding to the 5'UTR, miRanda predicts 87,830 putative interactions. Of these, 3,426 have significant PCCs; 596 are \>21bp and have signficant PCCs; and 0 are \>21bp with \<=3 mismatches and have significant PCCs. # Summary How does different input and/or complementarity filtering affect \# putative interactions: | Input | unfiltered | filtered for complementarity | \% retained | |-----------|------------|------------------------------|-------------| | 3'UTR | 6109 | 13 | 0.213 % | | full mRNA | 617982 | 143 | 0.0231 % | | 5'UTR | 87830 | 12 | 0.0137 % | For different filters, how many putative interactions ***also show significant coexpression*** (PCC pval \< 0.05)? | Input | All | 21bp | 21bp, \>=3 mismatch | |-----------|-------|------|---------------------| | 3'UTR | 266 | 86 | 1 | | full mRNA | 24936 | 4300 | 7 | | 5'UTR | 3426 | 596 | 0 | Note that, in general, only \~1/3 of significant coexpressions have a *negative* relationship (which would support functional target cleavage or translational repression) Note also that some putative interactions indicated by miRanda are not present in the counts data (i.e. the miRNA and/or mRNA had 0 counts in our RNAseq data), and are thus excluded from the PCC-filtered data Is there a clear "cutoff" for what complementarity parameters are most associated with significant coexpression? ```{r} miRNA_mRNA_miRanda <- miRNA_mRNA_miRanda %>% mutate(Source = "mRNA") miRNA_3UTR_miRanda <- miRNA_3UTR_miRanda %>% mutate(Source = "3UTR") miRNA_5UTR_miRanda <- miRNA_5UTR_miRanda %>% mutate(Source = "5UTR") # Combine combined_df <- bind_rows(miRNA_mRNA_miRanda, miRNA_3UTR_miRanda, miRNA_5UTR_miRanda) # Convert Source to a factor for discrete shapes combined_df$Source <- factor(combined_df$Source) # Convert p-value condition into a categorical variable for coloring combined_df$Significance <- ifelse(combined_df$p_value < 0.05, "p < 0.05", "p ≥ 0.05") significant_df <- combined_df %>% filter(p_value < 0.05) # Plot with jitter (significant coexpression only) ggplot(significant_df, aes(x = Al_Len, y = as.numeric(gsub("%", "", Subject_Identity)), color = Source)) + geom_jitter(size = 1.5, width = 1, height = 1, alpha = 0.1) + scale_color_manual(values = c("mRNA" = "red", "3UTR" = "blue", "5UTR" = "green")) + facet_wrap(~ Source) + labs(x = "Alignment Length (Al_Len)", y = "Subject Identity (%)", color = "Dataset", title = "Significant (p < 0.05) Alignment Length vs. Subject Identity") + theme_minimal() ``` Interesting... there are signifcant coexpressions happening across both metrics of complementarity. Also interesting, there are notably more significant coexpressions from using 5'UTR inputs than using 3'UTRs -- we were expecting more in the 3UTr because that would be the binding location for translational repression. Next steps: - discuss w team -- should we look for both miRNAs that function through target cleavage (full complementary, full coding sequence) *and* miRNAs that function through "canonical" translational silencing (seed complementarity, 3'UTR)? - Run same comparison using BLAST and/or RNAhybrid as the tool. I'm still not 100% clear on what sequence features miRanda takes into consideration, and any that *are* included will have been based on mammalian miRNAs. It may be more appropriate to use a different tool - if we decide to shift target prediction requirements, need to rerun all target prediction done thus far :( ```{r} # Save putative interactions with significantly correlated coexpression for visualization # (e.g. creating an interaction network plot in Cytoscape) write.csv(combined_df[combined_df$p_value < 0.05,], "../output/24-Apul-miRanda-input-comparisons/miRanda-PCC-significant-mRNA_3UTR_5UTR.csv", row.names = FALSE) ```