06-Apul-miRNA-mRNA-RNAhybrid
================
Kathleen Durkin
2024-12-19

- [1 format miRNA fasta](#1-format-mirna-fasta)
- [2 target prediction for full
  mRNA](#2-target-prediction-for-full-mrna)
- [3 formate mRNA fasta](#3-formate-mrna-fasta)
- [4 RNAcalibrate](#4-rnacalibrate)
- [5 RNAhybrid: mRNA file](#5-rnahybrid-mrna-file)
- [6 RNAhybrid: 3’UTR](#6-rnahybrid-3utr)
- [7 RNAhybrid: 5’UTR](#7-rnahybrid-5utr)
- [8 Summarize results](#8-summarize-results)

``` r
library(ggplot2)
```

RNAhybrid is a miRNA-mRNA target prediction tool, which bases its
predictions primarily on thermodynamic binding stability.

Inputs: - miRNA “query” FASTA file – mature miRNA sequences, since the
mature miRNA molecule is what will presumably bind to an mRNA

- mRNA “target” FASTA file – the *A. pulchra* genome fasta

# 1 format miRNA fasta

ShortStack outputs a fasta containing all annotated miRNAs, but it
includes the full precursor sequence and star sequence, in addition to
the mature miRNA. Since mature miRNAs are the “final forms” that are
generally much more highly expressed, we’ll just look at binding of
mature miRNAs for now.

``` bash
# look at miRNA fasta file
head -5 ../output/04-Apul-sRNA-discovery-ShortStack/ShortStack_out/mir.fasta

echo ""

# check the naming convention for sequences
grep "^>" ../output/04-Apul-sRNA-discovery-ShortStack/ShortStack_out/mir.fasta | head -10

# isolate just the mature miRNA sequences
awk '/^>/ {header=$0} /mature/ {print header; found=1; next} found {print; found=0}' ../output/04-Apul-sRNA-discovery-ShortStack/ShortStack_out/mir.fasta > ../data/06-Apul-miRNA-mRNA-RNAhybrid/miRNA_mature-Apul.fasta

echo ""

# check filtered file
head -6 ../data/06-Apul-miRNA-mRNA-RNAhybrid/miRNA_mature-Apul.fasta
```

    ## >Cluster_1819::ntLink_6:4140195-4140288(+)
    ## TAAATCTGAATGGTCAAATTCGCCTGGATATAAATTTGGTGTGTTAAGCA
    ## ATTCCAATTTTATATCCAGACAAGTTTTGCCATTCAATTGGGGT
    ## >Cluster_1819.mature::ntLink_6:4140247-4140268(+)
    ## TCCAATTTTATATCCAGACAAG
    ## 
    ## >Cluster_1819::ntLink_6:4140195-4140288(+)
    ## >Cluster_1819.mature::ntLink_6:4140247-4140268(+)
    ## >Cluster_1819.star::ntLink_6:4140215-4140236(+)
    ## >Cluster_1832::ntLink_6:4847443-4847535(-)
    ## >Cluster_1832.mature::ntLink_6:4847465-4847486(-)
    ## >Cluster_1832.star::ntLink_6:4847494-4847515(-)
    ## >Cluster_1833::ntLink_6:4871545-4871637(-)
    ## >Cluster_1833.mature::ntLink_6:4871567-4871588(-)
    ## >Cluster_1833.star::ntLink_6:4871596-4871617(-)
    ## >Cluster_1836::ntLink_6:5157537-5157626(+)
    ## 
    ## >Cluster_1819.mature::ntLink_6:4140247-4140268(+)
    ## TCCAATTTTATATCCAGACAAG
    ## >Cluster_1832.mature::ntLink_6:4847465-4847486(-)
    ## ATGATCATAGCACTTTCTTTGT
    ## >Cluster_1833.mature::ntLink_6:4871567-4871588(-)
    ## ATGATCATAGCACTTTCTTTGT

# 2 target prediction for full mRNA

While miRNas are primarily observed binding to the 3’ UTR region of
mRNAs, they are also believed to functionally bind to CDS regions,
though effects on translation are weaker.

# 3 formate mRNA fasta

get a gff file of only CDS sequences

Notably, RNAhybrid only accepts FASTA sequences of 1000 characters or
fewer. We’ll need to ensure the CDS file only contains sequences of
\<1000bp, and break up the sequences that are too long.

``` bash

# mRNA-only genome gff
# Count total sequences in genome gff
wc -l ../output/05-Apul-annotate-UTRs/Apulcra-genome-mRNA_only.gff

# Count the number of sequences that contain >1000 bp
awk '{if ($5 - $4 > 1000) count++} END {print count}' ../output/05-Apul-annotate-UTRs/Apulcra-genome-mRNA_only.gff

# Check how the sequence names are formatted
head -2 ../output/05-Apul-annotate-UTRs/Apulcra-genome-mRNA_only.gff
```

    ## 36447 ../output/05-Apul-annotate-UTRs/Apulcra-genome-mRNA_only.gff
    ## 28668
    ## ntLink_0 funannotate mRNA    1105    7056    .   +   .   ID=FUN_000001-T1;Parent=FUN_000001;product=hypothetical protein;
    ## ntLink_0 funannotate mRNA    10215   15286   .   +   .   ID=FUN_000002-T1;Parent=FUN_000002;product=hypothetical protein;

Welp, looks like the genome-based mRNA gff largely contains long,
\>1000bp sequences (28668/36447, or 78.7%)

First let’s deal with the long sequences while we’re in the compact gff
form. I want to break up any sequence \>1000bp into 1000bp chunks,
adding a line to the gff for each chunk.

(I also want there to be overlap among the chunks, in case the break
between two chunks falls in the middle of an miRNA binding site. Let’s
say a 25bp overlap, since that is just over the maximum expected miRNA
length.)

for now let’s not worry about the overlap.

``` bash
awk -v chunk_size=1000 '
BEGIN {OFS="\t"}
{
    seq_length = $5 - $4
    if (seq_length > chunk_size) {
        start = $4
        ogend = $5
        while (start < ogend) {
            end = start + chunk_size 
            if (end > ogend) end = ogend
            $4 = start
            $5 = end
            print
            start = end
        }
    } else {
        print
    }
}' "../output/05-Apul-annotate-UTRs/Apulcra-genome-mRNA_only.gff" > "../data/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000.gff"
```

``` bash

# mRNA-only genome gff
# Count total sequences in genome gff
wc -l ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000.gff

# Count the number of sequences that contain >1000 bp
awk '{if ($5 - $4 > 1000) count++} END {print count}' ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000.gff

# Check how the sequence names are formatted
head -50 ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000.gff | tail -5

echo ""
echo ""

# I'm getting an error in getfasta related to some lines of the gff having a different number of columns.
# Each line should contain 9 columns. Count number of lines with more than 9
awk -F'\t' 'NF > 9 {count++} END {print count}' ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000.gff

# For some lines, the final column is being interpreted as two columns
head -50 ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000.gff | tail -5 | awk -F'\t' '{print $9}'
head -50 ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000.gff | tail -5 | awk -F'\t' '{print $10}'
```

Fix \# columns issue. Replace any instances of “hypothetical $$tab$$
protein” with “hypothetical $$space$$ protein”

``` bash
# Replace any instances of "hypothetical -tab- protein" with "hypothetical -space- protein"
sed 's/hypothetical\tprotein/hypothetical protein/g'  ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000.gff > ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000_formatted.gff
```

Check

``` bash
# Count number of entries with >9 columns (should be nothing now)
awk -F'\t' 'NF > 9 {count++} END {print count}' ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000_formatted.gff
```

get an mRNA fasta file

``` bash

# Use mRNA gff and genome fasta to extract mRNA fastas
/home/shared/bedtools2/bin/bedtools getfasta -fi "../data/Apulchra-genome.fa" -bed "../output/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000_formatted.gff" -fo "../output/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000.fa"
```

Ok, we now have a fasta file of mRNA sequences, broken up so that no
sequence exceeds 1000bp. We should now be able to run RNAhybrid!

I have RNAhybrid installed on a miniconda environment

    # Check path to the conda environment I'm using
    which conda

    # Install RNAhybrid if neccessary
    conda install -y -c genomedk rnahybrid

    # Check installation
    conda list rnahybrid

# 4 RNAcalibrate

Note: I’m not using the RNAcalibrate results for now because of a weird
issue with all-zero RNAhybrid pvals.

``` bash
# RNAhybrid only has built-in support for humans, worms, and flies.
# We can use RNAcalibrate to derive Extreme Value Distribution parameters for our data.

# USe 3'UTR as input, since we expect majority binding there
RNAcalibrate \
-t ../output/05-Apul-annotate-UTRs/Apul_3UTR_1kb.fasta \
-q ../data/06-Apul-miRNA-mRNA-RNAhybrid/miRNA_mature-Apul.fasta \
> ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAcalibrate-3UTR.txt
```

``` r
# Now we need to pick a distance parameter and shape parameter
RNAcalibrate_out <- read.table("../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAcalibrate-3UTR.txt", sep = " ")

ggplot(RNAcalibrate_out, aes(x = V3)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.05, fill = "blue", color = "black", alpha = 0.7) +
  geom_density(color = "red", size = 1) +
  geom_vline(aes(xintercept = mean(V3)), color = "red", linetype = "dashed", size = 1) +
  geom_text(aes(x = mean(V3) + 0.15, y = 4, label = paste("Mean =", round(mean(V3), 2))), 
            color = "red", vjust = -0.5, hjust = 1.2) +
  labs(title = "Distance Parameter distribution", x = "Values", y = "Density")

ggplot(RNAcalibrate_out, aes(x = V4)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.005, fill = "blue", color = "black", alpha = 0.7) +
  geom_density(color = "red", size = 1) +
  geom_vline(aes(xintercept = mean(V4)), color = "red", linetype = "dashed", size = 1) +
  geom_text(aes(x = mean(V4) + 0.017, y = 37, label = paste("Mean =", round(mean(V4), 2))), 
            color = "red", vjust = -0.5, hjust = 1.2) +
  labs(title = "Shape Parameter distribution", x = "Values", y = "Density")
```

``` bash

# Also try with the full mRNA sequences
RNAcalibrate \
-t ../data/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000.fa \
-q ../data/06-Apul-miRNA-mRNA-RNAhybrid/miRNA_mature-Apul.fasta \
> ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAcalibrate-mRNA.txt
```

``` r
# Now we need to pick a distance parameter and shape parameter
RNAcalibrate_out <- read.table("../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAcalibrate-mRNA.txt", sep = " ")

ggplot(RNAcalibrate_out, aes(x = V3)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.05, fill = "blue", color = "black", alpha = 0.7) +
  geom_density(color = "red", size = 1) +
  geom_vline(aes(xintercept = mean(V3)), color = "red", linetype = "dashed", size = 1) +
  geom_text(aes(x = mean(V3) + 0.15, y = 4, label = paste("Mean =", round(mean(V3), 2))), 
            color = "red", vjust = -0.5, hjust = 1.2) +
  labs(title = "Distance Parameter distribution", x = "Values", y = "Density")

ggplot(RNAcalibrate_out, aes(x = V4)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.005, fill = "blue", color = "black", alpha = 0.7) +
  geom_density(color = "red", size = 1) +
  geom_vline(aes(xintercept = mean(V4)), color = "red", linetype = "dashed", size = 1) +
  geom_text(aes(x = mean(V4) + 0.017, y = 37, label = paste("Mean =", round(mean(V4), 2))), 
            color = "red", vjust = -0.5, hjust = 1.2) +
  labs(title = "Shape Parameter distribution", x = "Values", y = "Density")
```

Distributions looks roughly normal, so I’ll use the mean distance and
shape parameters calculated by RNAcalibrate as input in RNAhybrid.

# 5 RNAhybrid: mRNA file

I’m having problems using the RNAcalibrate estimated parameters, where
every output hybridization has a p-value of 0. This still happens if I
remove the `-e` energy cutoff and the `-p` pvalue cutoff, so the `-d`
parameter specification must be causing it.

I’ll use a built-in estimation for now, since those result in reasonable
pvalues. Options are 3utr_human, 3utr_fly, and 3utr-worm. Not sure if
fly or worm is most closely related to corals (they’re both quite
distant), but I’ll just arbitrarily pick worm for now.

``` bash

RNAhybrid \
-s 3utr_worm \
-e -20 \
-p 0.05 \
-c \
-t ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000.fa \
-q ../data/06-Apul-miRNA-mRNA-RNAhybrid/miRNA_mature-Apul.fasta \
> ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-mRNA-compact_3utrworm.txt
```

# 6 RNAhybrid: 3’UTR

``` bash

RNAhybrid \
-s 3utr_worm \
-e -20 \
-p 0.05 \
-c \
-t ../output/05-Apul-annotate-UTRs/Apul_3UTR_1kb.fasta \
-q ../data/06-Apul-miRNA-mRNA-RNAhybrid/miRNA_mature-Apul.fasta \
> ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-3UTR-compact_3utrworm.txt
```

# 7 RNAhybrid: 5’UTR

``` bash

RNAhybrid \
-s 3utr_worm \
-e -20 \
-p 0.05 \
-c \
-t ../output/05-Apul-annotate-UTRs/Apul_5UTR_1kb.fasta \
-q ../data/06-Apul-miRNA-mRNA-RNAhybrid/miRNA_mature-Apul.fasta \
> ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-5UTR-compact_3utrworm.txt
```

# 8 Summarize results

``` bash
# How many significant hybridizations predicted for each input?
wc -l ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-mRNA-compact_3utrworm.txt
wc -l ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-5UTR-compact_3utrworm.txt
wc -l ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-3UTR-compact_3utrworm.txt
```

    ## 6559 ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-mRNA-compact_3utrworm.txt
    ## 1191 ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-5UTR-compact_3utrworm.txt
    ## 1041 ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-3UTR-compact_3utrworm.txt

``` bash
head -5 ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-3UTR-compact_3utrworm.txt
```

    ## ntLink_8:32704327-32705327:999:Cluster_1832.mature::ntLink_6:4847465-4847486(-):22:-27.8:0.036639:710:A                A     : ACAAGGAGAGUGCUGU      : UGUUUCUUUCACGAUA      :                 CUAGUA
    ## ntLink_3:84074-84461:387:Cluster_1832.mature::ntLink_6:4847465-4847486(-):22:-26.7:0.014807:51:    C               U :     GAGAGUGCUGUGGUU  :     CUUUCACGAUACUAG  :UGUUU               UA
    ## ntLink_6:2139830-2140830:999:Cluster_1832.mature::ntLink_6:4847465-4847486(-):22:-27.7:0.038633:949: G                 G  :  GGGGGGGGUGCUAUGAU   :  UUUCUUUCACGAUACUA   :UG                 GUA
    ## ptg000001l:20692198-20693198:999:Cluster_1832.mature::ntLink_6:4847465-4847486(-):22:-30.0:0.011310:399: A                  U :  GAGGAAGGUGUUAUGAUC  :  UUUCUUUCACGAUACUAG  :UG                  UA
    ## ptg000018l:9119047-9120047:999:Cluster_1832.mature::ntLink_6:4847465-4847486(-):22:-28.0:0.032950:773:U                A     : GCAAGGAGAGUGCUAU      : UGUUUCUUUCACGAUA      :                 CUAGUA

Now let’s read our RNAhybrid results into R for visualization. Note this
is going to be slightly more complicated than it sounds because the
RNAhybrid compact output is colon-delimited and our target- and
query-IDs contain intentional colons than could get confused with column
delimiters.

mRNA output:

``` r
options(scipen=999)
options(digits = 10)
RNAhybrid_mRNA <- read.table("../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-mRNA-compact_3utrworm.txt", sep=":")

# Recombine Columns 1 and 2 (fix incorrect separation of target ID components)
RNAhybrid_mRNA$V1 <- paste(RNAhybrid_mRNA$V1, RNAhybrid_mRNA$V2, sep = ":")
RNAhybrid_mRNA$V2 <- NULL

# Do the same for Columns 4-7 (query ID components)
RNAhybrid_mRNA$V4 <- paste(RNAhybrid_mRNA$V4, RNAhybrid_mRNA$V5, RNAhybrid_mRNA$V6, RNAhybrid_mRNA$V7 , sep = ":")
RNAhybrid_mRNA$V4 <- gsub(":NA:", "::", RNAhybrid_mRNA$V4)
RNAhybrid_mRNA$V5 <- NULL
RNAhybrid_mRNA$V6 <- NULL
RNAhybrid_mRNA$V7 <- NULL

# Rename all columns for readability/accessibility 
colnames(RNAhybrid_mRNA) <- c("target_name", "target_length", "query_name", "query_length",
                              "mfe", "pval", "position",
                              "noncomp_target_seq", "comp_target_seq", "comp_query_seq", "noncomp_query_seq")
```

5’UTR output:

``` r
RNAhybrid_5UTR <- read.table("../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-5UTR-compact_3utrworm.txt", sep=":")

# Recombine Columns 1 and 2 (fix incorrect separation of target ID components)
RNAhybrid_5UTR$V1 <- paste(RNAhybrid_5UTR$V1, RNAhybrid_5UTR$V2, sep = ":")
RNAhybrid_5UTR$V2 <- NULL

# Do the same for Columns 4-7 (query ID components)
RNAhybrid_5UTR$V4 <- paste(RNAhybrid_5UTR$V4, RNAhybrid_5UTR$V5, RNAhybrid_5UTR$V6, RNAhybrid_5UTR$V7 , sep = ":")
RNAhybrid_5UTR$V4 <- gsub(":NA:", "::", RNAhybrid_5UTR$V4)
RNAhybrid_5UTR$V5 <- NULL
RNAhybrid_5UTR$V6 <- NULL
RNAhybrid_5UTR$V7 <- NULL

# Rename all columns for readability/accessibility 
colnames(RNAhybrid_5UTR) <- c("target_name", "target_length", "query_name", "query_length",
                              "mfe", "pval", "position",
                              "noncomp_target_seq", "comp_target_seq", "comp_query_seq", "noncomp_query_seq")
```

3’UTR output:

``` r
RNAhybrid_3UTR <- read.table("../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-3UTR-compact_3utrworm.txt", sep=":")

# Recombine Columns 1 and 2 (fix incorrect separation of target ID components)
RNAhybrid_3UTR$V1 <- paste(RNAhybrid_3UTR$V1, RNAhybrid_3UTR$V2, sep = ":")
RNAhybrid_3UTR$V2 <- NULL

# Do the same for Columns 4-7 (query ID components)
RNAhybrid_3UTR$V4 <- paste(RNAhybrid_3UTR$V4, RNAhybrid_3UTR$V5, RNAhybrid_3UTR$V6, RNAhybrid_3UTR$V7 , sep = ":")
RNAhybrid_3UTR$V4 <- gsub(":NA:", "::", RNAhybrid_3UTR$V4)
RNAhybrid_3UTR$V5 <- NULL
RNAhybrid_3UTR$V6 <- NULL
RNAhybrid_3UTR$V7 <- NULL

# Rename all columns for readability/accessibility 
colnames(RNAhybrid_3UTR) <- c("target_name", "target_length", "query_name", "query_length",
                              "mfe", "pval", "position",
                              "noncomp_target_seq", "comp_target_seq", "comp_query_seq", "noncomp_query_seq")
```

``` r
RNAhybrid_mRNA$Dataset <- "Dataset 1"
RNAhybrid_5UTR$Dataset <- "Dataset 2"
RNAhybrid_3UTR$Dataset <- "Dataset 3"

# Combine the datasets into a single data frame
combined_data <- rbind(RNAhybrid_mRNA, RNAhybrid_5UTR, RNAhybrid_3UTR)

# Create the faceted histogram
ggplot(combined_data, aes(x = pval)) +
  geom_histogram(binwidth = 0.005, color = "black", fill = "skyblue") +
  facet_wrap(~Dataset) +
  theme_minimal() +
  labs(title = "Histogram of Column C", x = "C", y = "Frequency")
```

![](06-Apul-miRNA-mRNA-RNAhybrid_files/figure-gfm/unnamed-chunk-19-1.png)<!-- -->

I’d also like to get the RNAhybrid results in a gff form for IGV
visualization

Save outputs as tab-delimited files

``` r
write.table(RNAhybrid_mRNA, file="../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-mRNA-compact_3utr_worm-formatted.txt", sep="\t", row.names = FALSE, col.names=TRUE, quote = FALSE)
write.table(RNAhybrid_5UTR, file="../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-5UTR-compact_3utr_worm-formatted.txt", sep="\t", row.names = FALSE, quote = FALSE)
write.table(RNAhybrid_3UTR, file="../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-3UTR-compact_3utr_worm-formatted.txt", sep="\t", row.names = FALSE, quote = FALSE)
```

Now convert these formatted RNAhybrid output into a gff-formatted file

mRNA:

``` bash
#!/bin/bash

# Input and output file paths
INPUT_FILE="../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-mRNA-compact_3utr_worm-formatted.txt"  # Replace with the path to your input file
OUTPUT_FILE="../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-mRNA-compact_3utr_worm.gff"

# Write GFF3 header to the output file
echo "##gff-version 3" > "$OUTPUT_FILE"

# Process the input file, skipping the header line
tail -n +2 "$INPUT_FILE" | while IFS=$'\t' read -r target_name target_length query_name query_length mfe pval position noncomp_target_seq comp_target_seq comp_query_seq noncomp_query_seq
do
  # Extract locus name and coordinates from target_name
  locus=$(echo "$target_name" | cut -d':' -f1)
  start_coord=$(echo "$target_name" | cut -d':' -f2 | cut -d'-' -f1)
  start_gff=$((start_coord + position))
  end_gff=$((start_gff + query_length))

  # Extract strandedness from query_name
  strand=$(echo "$query_name" | grep -o '(-\|+)' | tr -d '()')

  # Write the GFF3 line
  echo -e "$locus\tRNAhybrid\tmiRNA_binding\t$start_gff\t$end_gff\t.\t$strand\t.\tID=$query_name;MFE=$mfe;Pval=$pval" >> "$OUTPUT_FILE"
done
```

5’UTR:

``` bash
#!/bin/bash

# Input and output file paths
INPUT_FILE="../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-5UTR-compact_3utr_worm-formatted.txt"  # Replace with the path to your input file
OUTPUT_FILE="../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-5UTR-compact_3utr_worm.gff"

# Write GFF3 header to the output file
echo "##gff-version 3" > "$OUTPUT_FILE"

# Process the input file, skipping the header line
tail -n +2 "$INPUT_FILE" | while IFS=$'\t' read -r target_name target_length query_name query_length mfe pval position noncomp_target_seq comp_target_seq comp_query_seq noncomp_query_seq
do
  # Extract locus name and coordinates from target_name
  locus=$(echo "$target_name" | cut -d':' -f1)
  start_coord=$(echo "$target_name" | cut -d':' -f2 | cut -d'-' -f1)
  start_gff=$((start_coord + position))
  end_gff=$((start_gff + query_length))

  # Extract strandedness from query_name
  strand=$(echo "$query_name" | grep -o '(-\|+)' | tr -d '()')

  # Write the GFF3 line
  echo -e "$locus\tRNAhybrid\tmiRNA_binding\t$start_gff\t$end_gff\t.\t$strand\t.\tID=$query_name;MFE=$mfe;Pval=$pval" >> "$OUTPUT_FILE"
done
```

3’UTR:

``` bash
#!/bin/bash

# Input and output file paths
INPUT_FILE="../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-3UTR-compact_3utr_worm-formatted.txt"  # Replace with the path to your input file
OUTPUT_FILE="../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-3UTR-compact_3utr_worm.gff"

# Write GFF3 header to the output file
echo "##gff-version 3" > "$OUTPUT_FILE"

# Process the input file, skipping the header line
tail -n +2 "$INPUT_FILE" | while IFS=$'\t' read -r target_name target_length query_name query_length mfe pval position noncomp_target_seq comp_target_seq comp_query_seq noncomp_query_seq
do
  # Extract locus name and coordinates from target_name
  locus=$(echo "$target_name" | cut -d':' -f1)
  start_coord=$(echo "$target_name" | cut -d':' -f2 | cut -d'-' -f1)
  start_gff=$((start_coord + position))
  end_gff=$((start_gff + query_length))

  # Extract strandedness from query_name
  strand=$(echo "$query_name" | grep -o '(-\|+)' | tr -d '()')

  # Write the GFF3 line
  echo -e "$locus\tRNAhybrid\tmiRNA_binding\t$start_gff\t$end_gff\t.\t$strand\t.\tID=$query_name;MFE=$mfe;Pval=$pval" >> "$OUTPUT_FILE"
done
```