---
title: "09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR"
author: "Kathleen Durkin"
date: "2025-06-18"
always_allow_html: true
output: 
  github_document:
    toc: true
    toc_depth: 3
    number_sections: true
    html_preview: true 
  bookdown::html_document2:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
---

miRanda is a target prediction software, used to identify likely miRNA-mRNA interactions.

We've decided that we need to expand our consideration of potential miRNA target sites to include the coding sequence and (potentially) the 5'UTR region, since cnidarian miRNAs may function primarily through target cleavage, instead of translational repression.

Run miRanda using the mRNA coding sequences and the 1kb 5UTR regions as inputs.

```{r}
library(dplyr)
```


# mRNA coding sequence

## Get mature miRNA fasta

```{r, engine='bash', eval=FALSE}
awk '/^>/ {keep = ($0 ~ /mature/)} keep' ../output/11-Apul-sRNA-ShortStack_4.1.0-pulchra_genome/ShortStack_out/mir.fasta > ../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/Apul_ShortStack_4.1.0_mature.fasta
```

## Get coding sequence fasta

```{r, engine='bash', eval=FALSE}

/home/shared/bedtools2/bin/bedtools getfasta \
-fi "../data/Apulchra-genome.fa" \
-bed "../output/15-Apul-annotate-UTRs/Apulcra-genome-mRNA_only.gff" \
-fo "../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/Apul_mRNA_full.fa"
```

## Run miRanda

```{r, engine='bash', eval=FALSE}

# score cutoff >100
# energy cutoff <-20
# strict binding

/home/shared/miRanda-3.3a/src/miranda \
../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/Apul_ShortStack_4.1.0_mature.fasta \
../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/Apul_mRNA_full.fa \
-sc 100 \
-en -20 \
-strict \
-out ../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/Apul-miRanda-mRNA_full-strict_all.tab
```

## Summarize results

Let's look at the output

```{r, engine='bash'}

echo "miranda run finished!"
echo "counting number of putative interactions predicted"

zgrep -c "Performing Scan" ../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/Apul-miRanda-mRNA_full-strict_all.tab

echo "Parsing output"
grep -A 1 "Scores for this hit:" ../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/Apul-miRanda-mRNA_full-strict_all.tab | sort | grep '>' > ../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/Apul-miRanda-mRNA_full-strict-parsed.txt

echo "counting number of putative interactions predicted"
wc -l ../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/Apul-miRanda-mRNA_full-strict-parsed.txt

```

This is a lot of putative interactions! Note though, that miRanda only requires complementarity of a 8bp seed region of the miRNA. We instead want to look for binding with full or near-full complementarity. Let's look at how many putative interactions are predicted for a binding length of at least 21 nucleotides (the length of our smallest mature miRNA).

```{r, engine='bash', eval=TRUE}
echo "number of putative interactions of at least 21 nucleotides"
awk -F'\t' '$7 >= 21' ../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/Apul-miRanda-mRNA_full-strict-parsed.txt | wc -l
echo ""
echo "check some:"
awk -F'\t' '$7 >= 21' ../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/Apul-miRanda-mRNA_full-strict-parsed.txt | head -5
```

We can also see from the percent identities (last 2 entries) that this number includes alignments with multiple mismatches. Let's filter again to reduce the number of permissible mismatches. Let's say we want no more than 3 mismatches. For an alignment of 21 nucleotides, this would be an alignment rate of (21-3)/21 = 85.7%.

```{r, engine='bash', eval=TRUE}
echo "number of putative interactions of at least 21 nucleotides, with at most 3 mismatches"
awk -F'\t' '$7 >= 21' ../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/Apul-miRanda-mRNA_full-strict-parsed.txt | awk -F'\t' '$8 >= 85' | wc -l
echo ""
echo "check some:"
awk -F'\t' '$7 >= 21' ../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/Apul-miRanda-mRNA_full-strict-parsed.txt | awk -F'\t' '$8 >= 85' | head -5
```

So filtering for full or near-full complementarity reduced the number of putative interactions between miRNA and full mRNA sequences from 38152 to 107.

# miRNA and mRNA 5'UTRs

We've also created a gff of 1kb 5'UTR regions (using the same method used to define 1kb 3'UTRs, in `05-Apul-annotate-UTRs`). Let's try running this through miRanda as well, since it's possible (though not necessarily expected) that miRNAs will bind here

## Get 5'UTR fasta

```{r, engine='bash', eval=FALSE}
/home/shared/bedtools2/bin/bedtools getfasta \
-fi "../data/Apulchra-genome.fa" \
-bed "../output/05-Apul-annotate-UTRs/Apul.GFFannotation.5UTR_1kb_corrected.gff" \
-fo "../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/Apul_5UTR_1kb_corrected.fa"

```

## Run miRanda

```{r, engine='bash', eval=FALSE}

# Same settings we've been using: 
# score cutoff >100
# energy cutoff <-10
# strict binding

/home/shared/miRanda-3.3a/src/miranda \
../data/06-Apul-miRNA-mRNA-RNAhybrid/miRNA_mature-Apul.fasta \
../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/Apul_5UTR_1kb_corrected.fa \
-sc 100 \
-en -10 \
-strict \
-out ../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/Apul-miRanda-5UTR_1kb-strict_all.tab
```

## Summarize results

Let's look at the output

```{r, engine='bash'}

echo "Number of interacting miRNA-5UTR pairs"
zgrep -c "Performing Scan" ../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/Apul-miRanda-5UTR_1kb-strict_all.tab

echo "Parsing output"
grep -A 1 "Scores for this hit:" ../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/Apul-miRanda-5UTR_1kb-strict_all.tab | sort | grep '>' > ../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/Apul-miRanda-5UTR_1kb-strict-parsed.txt

echo "Number of putative interactions predicted"
wc -l ../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/Apul-miRanda-5UTR_1kb-strict-parsed.txt

```

How many putative interactions are predicted for a binding length of at least 21 nucleotides (the length of our smallest mature miRNA)?

```{r, engine='bash', eval=TRUE}
echo "number of putative interactions:"
wc -l ../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/Apul-miRanda-5UTR_1kb-strict-parsed.txt
echo ""
echo "number of putative interactions of at least 21 nucleotides"
awk -F'\t' '$7 >= 21' ../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/Apul-miRanda-5UTR_1kb-strict-parsed.txt | wc -l
echo ""
echo "number of putative interactions of at least 21 nucleotides, with at most 3 mismatches"
awk -F'\t' '$7 >= 21' ../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/Apul-miRanda-5UTR_1kb-strict-parsed.txt | awk -F'\t' '$8 >= 85' | wc -l
```

So filtering for full or near-full complementarity reduced the number of putative interactions between miRNA and 5'UTR sequences from 4959 to 11

# miRNA and mRNA 3'UTRs

Now let's see how filtering changes the outputs of miRanda run with only the 3'UTR mRNA region (the input we have been using up till now)

```{r, engine='bash', eval=TRUE}
echo "total number of putative interactions:"
wc -l ../output/09-Apul-mRNA-miRNA-interactions/miranda_strict_all_1kb_parsed_apul_updated.txt

echo "number of putative interactions of at least 21 nucleotides"
awk -F'\t' '$7 >= 21' ../output/09-Apul-mRNA-miRNA-interactions/miranda_strict_all_1kb_parsed_apul_updated.txt | wc -l

echo "number of putative interactions of at least 21 nucleotides, with at most 3 mismatches"
awk -F'\t' '$7 >= 21' ../output/09-Apul-mRNA-miRNA-interactions/miranda_strict_all_1kb_parsed_apul_updated.txt | awk -F'\t' '$8 >= 85' | wc -l
```

When only mRNA 3'UTR regions are used as input, filtering for full/near-full complementarity reduces the number of putative interactions from 6109 to 13.

# Summary

| Input     | unfiltered | filtered for complementarity | \% retained |
|-----------|------------|------------------------------|-------------|
| full mRNA | 38152      | 107                          | 0.280 %     |
| 5'UTR     | 4959       | 11                           | 0.222 %     |
| 3'UTR     | 6109       | 13                           | 0.213 %     |


# Combine with PCC

We've also already calculated pairwise Pearson's correlation coefficients for every miRNA-mRNA pair. Let's merge this table with our CDS and 5UTR miRanda results to see which instances of putative binding are supported through expression correlation.

PCC table available at `https://gannet.fish.washington.edu/kdurkin1/ravenbackups/deep-dive-expression/D-Apul/output/09-Apul-mRNA-miRNA-interactions/Apul-PCC_miRNA_mRNA.csv` (too large to store on Gannet), originally generated in `D-Apul/code/09-Apul-mRNA-miRNA-interactions.Rmd`

## Merge with CDS miRanda results

Read in PCC and miRanda tables
```{r}
# Load
Apul_PCC_miRNA_mRNA <- read.csv("https://gannet.fish.washington.edu/kdurkin1/ravenbackups/deep-dive-expression/D-Apul/output/09-Apul-mRNA-miRNA-interactions/Apul-PCC_miRNA_mRNA.csv") %>% select(-X)
Apul_miRanda_miRNA_CDS <- read.table("../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/Apul-miRanda-mRNA_full-strict-parsed.txt")

# Format the miRNA IDs in miRanda table by removing miRNA coordinates
Apul_miRanda_miRNA_CDS$V1 <- gsub(">", "", Apul_miRanda_miRNA_CDS$V1)
Apul_miRanda_miRNA_CDS$V1 <- sub("\\..*", "", Apul_miRanda_miRNA_CDS$V1)
```

We need to associate the CDS genomic coordinates (used in miRanda output) with gene IDs (used in PCC output)
```{r}
# Load and format mapping table
Apul_mRNA_FUNids <- read.table("../output/15-Apul-annotate-UTRs/Apul-mRNA-FUNids.txt") %>% select(V1, V4)
Apul_mRNA_FUNids$V4 <- gsub("Parent=", "", Apul_mRNA_FUNids$V4)

# Join with miRanda table to annotate genomic coordinates with gene IDs
Apul_miRanda_miRNA_CDS <- left_join(Apul_miRanda_miRNA_CDS, Apul_mRNA_FUNids, by = c("V2" = "V1")) %>% unique()
```

Great! Now we can merge the CDS miRanda results and PCC values
```{r}
# Merge
Apul_miRanda_miRNA_CDS_PCC <- left_join(Apul_miRanda_miRNA_CDS, Apul_PCC_miRNA_mRNA, by = c("V1" = "miRNA", "V4.y" = "mRNA")) %>% unique()

# Save
write.csv(Apul_miRanda_miRNA_CDS_PCC, "../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/miRanda_PCC_miRNA_CDS.csv")
```

Note: there may be some NA values in the PCC columns. This would indicate that one (or both) of the member of that miRNA-mRNA pair had all-zero counts (i.e. it was unexpressed in all samples)

## Merge with 5UTR miRanda results

Read in PCC and miRanda tables
```{r}
# Load
Apul_PCC_miRNA_mRNA <- read.csv("https://gannet.fish.washington.edu/kdurkin1/ravenbackups/deep-dive-expression/D-Apul/output/09-Apul-mRNA-miRNA-interactions/Apul-PCC_miRNA_mRNA.csv") %>% select(-X)
Apul_miRanda_miRNA_5UTR <- read.table("../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/Apul-miRanda-5UTR_1kb-strict-parsed.txt")

# Format the miRNA IDs in miRanda table by removing miRNA coordinates
Apul_miRanda_miRNA_5UTR$V1 <- gsub(">", "", Apul_miRanda_miRNA_5UTR$V1)
Apul_miRanda_miRNA_5UTR$V1 <- sub("\\..*", "", Apul_miRanda_miRNA_5UTR$V1)
```

We need to associate the 5UTR genomic coordinates (used in miRanda output) with gene IDs (used in PCC output)
```{r}
# Load and format mapping table
Apul_5UTR_FUNids <- read.table("../output/15-Apul-annotate-UTRs/Apul-5UTR-FUNids.txt") %>% select(V1, V4)
Apul_5UTR_FUNids$V4 <- gsub("Parent=", "", Apul_5UTR_FUNids$V4)

# Join with miRanda table to annotate genomic coordinates with gene IDs
Apul_miRanda_miRNA_5UTR <- left_join(Apul_miRanda_miRNA_5UTR, Apul_5UTR_FUNids, by = c("V2" = "V1")) %>% unique()
```

Great! Now we can merge the 5UTR miRanda results and PCC values
```{r}
# Merge
Apul_miRanda_miRNA_5UTR_PCC <- left_join(Apul_miRanda_miRNA_5UTR, Apul_PCC_miRNA_mRNA, by = c("V1" = "miRNA", "V4.y" = "mRNA")) %>% unique()

# Save
write.csv(Apul_miRanda_miRNA_5UTR_PCC, "../output/09.01-Apul-mRNA-miRNA-interactions-CDS_5UTR/miRanda_PCC_miRNA_5UTR.csv")
```

Note: there may be some NA values in the PCC columns. This would indicate that one (or both) of the member of that miRNA-mRNA pair had all-zero counts (i.e. it was unexpressed in all samples)