--- title: "14-Ptuh-miRNA-lncRNA-BLASTs-miRanda" author: "Kathleen Durkin" date: "2025-06-17" always_allow_html: true output: bookdown::html_document2: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true github_document: toc: true toc_depth: 3 number_sections: true html_preview: true bibliography: ../../references.bib link-citations: true --- ```{r load packages} library(dplyr) library(ggplot2) knitr::opts_chunk$set( echo = TRUE, # Display code chunks eval = FALSE # Evaluate code chunks ) ``` Two possible interactions between miRNA and lncRNA are: 1) lncRNA acting as a precursor molecule for miRNA(s), so that the lncRNA contains one or many pre-miRNA sequences and will be broken down into pre-miRNAs molecules, which will then be processed into mature miRNAs. 2) lncRNA acting as a "sponge" for miRNAs, so that an miRNA will bind to the lncRNA instead of being incorporated into an RISC complex to alter gene expression. In situation 1 we would expect one or several **pre-miRNA sequences to appear inside of a lncRNA**. This should be identifiable via BLASTn. In situation 2 we would expect the **mature miRNA sequence to appear inside a lncRNA**. Note that situation 2 is a bit more complicated, because we can't say for certain what sequence similarity is required for binding. In cnidarians, miRNAs seem to act, like plants, through complementarity of the full mature miRNA (this is in contrast to e.g. mammals, where only binding of a short seed region is required) (@moran_cnidarian_2014, @admoni_target_2023). However, for lncRNA acting as sponges, I don't know whether to expect complementarity of the full mature miRNA or only a section, and I don't know what degree of complementarity is required. **Work to identify lncRNA sponges could use BLASTn, but will likely need to include additional methods like miRanda to identify potential binding.** # Prep for BLASTs ## Isolate the pre-mirna and mature mirna sequences ```{r extract-mirna-seqs, engine='bash'} full_mirna_fasta="../output/05-Ptuh-sRNA-ShortStack_4.1.0/ShortStack_out/mir.fasta" premirna_fasta="../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh_ShortStack_4.1.0_precursor.fasta" mature_mirna_fasta="../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh_ShortStack_4.1.0_mature.fasta" star_mirna_fasta="../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh_ShortStack_4.1.0_star.fasta" # Pull out all sequences that DON'T contain "mature" or "star" in sequence name # Note the pre-miRNAs have sequences for both strands awk ' # If the line starts with ">", check the header /^>/ { if ($0 ~ /mature/ || $0 ~ /star/) { print_seq = 0 # Skip sequences with "mature" or "star" in the header } else { print_seq = 1 # Mark sequences for printing } } # Print the header and the next two lines if marked for printing print_seq { print if (!/^>/) { getline; print } # Capture second sequence line } ' "$full_mirna_fasta" > "$premirna_fasta" # Pull out all sequences that contain "mature" in sequence name grep -A 1 "mature" $full_mirna_fasta | grep -v "^--$" > $mature_mirna_fasta # Pull out all sequences that contain "star" in sequence name grep -A 1 "star" $full_mirna_fasta | grep -v "^--$" > $star_mirna_fasta ``` ```{r, engine='bash', eval=TRUE} premirna_fasta="../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh_ShortStack_4.1.0_precursor.fasta" mature_mirna_fasta="../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh_ShortStack_4.1.0_mature.fasta" star_mirna_fasta="../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh_ShortStack_4.1.0_star.fasta" # Check we have appropriate headers, same number of sequences in each grep "^>" $premirna_fasta | head -2 echo "" grep "^>" $mature_mirna_fasta | head -2 echo "" grep "^>" $star_mirna_fasta | head -2 echo "" grep "^>" $premirna_fasta | wc -l echo "" grep "^>" $mature_mirna_fasta | wc -l echo "" grep "^>" $star_mirna_fasta | wc -l echo "" ``` ## Check miRNA lengths ```{r, engine='bash'} # Extract sequence lengths for precursors awk '/^>/ {if (seqlen){print seqlen}; printf $0" " ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh_ShortStack_4.1.0_precursor.fasta > ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh_ShortStack_4.1.0_precursor_lengths.txt # Sequence lengths for matures awk '/^>/ {if (seqlen){print seqlen}; printf $0" " ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh_ShortStack_4.1.0_mature.fasta > ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh_ShortStack_4.1.0_mature_lengths.txt ``` ```{r, eval=TRUE} # Summary stats of precursor and mature lengths precursor_lengths <- read.table("../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh_ShortStack_4.1.0_precursor_lengths.txt", sep = " ", header = FALSE, col.names = c("seqID", "length")) mature_lengths <- read.table("../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh_ShortStack_4.1.0_mature_lengths.txt", sep = " ", header = FALSE, col.names = c("seqID", "length")) cat("Average pre-miRNA length: ", mean(precursor_lengths$length)) cat("\n") cat("Range of pre-miRNA lengths: ", range(precursor_lengths$length)) cat("\n") cat("Average mature miRNA length: ", mean(mature_lengths$length)) cat("\n") cat("Range of mature miRNA lengths: ", range(mature_lengths$length)) ``` ## check lncRNAs LncRNAs were identified from Ptuh RNA-seq data Fasta of Ptuh lncRNAs stored at `https://gannet.fish.washington.edu/acropora/E5-deep-dive-expression/output/01.6-lncRNA-pipline/Ptuh_lncRNA.fasta` ```{r check-lncRNAs, engine='bash', eval=TRUE} curl -L https://gannet.fish.washington.edu/acropora/E5-deep-dive-expression/output/01.6-lncRNA-pipline/Ptuh_lncRNA.fasta -o ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh_lncRNA.fasta echo "Number of lncRNAs:" grep "^>" ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh_lncRNA.fasta | wc -l ``` ```{r, engine='bash'} # Extract sequence lengths for precursors awk '/^>/ {if (seqlen){print seqlen}; printf $0" " ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh_lncRNA.fasta > ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh_lncRNA_lengths.txt ``` ```{r, eval=TRUE} # Summary stats of lncRNA lengths lncRNA_lengths <- read.table("../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh_lncRNA_lengths.txt", sep = " ", header = FALSE, col.names = c("seqID", "length")) cat("Average lncRNA length: ", mean(lncRNA_lengths$length)) cat("\n") cat("Range of lncRNA lengths: ", range(lncRNA_lengths$length)) ggplot(lncRNA_lengths, aes(x = length)) + geom_histogram(binwidth = 500) + labs(title = "A. pulchra lncRNA sequence lengths", x = "Sequence Length [nucleotides]", y = "Frequency") + xlim(200, 50000) + ylim(0, 800) + theme_minimal() ``` # BLASTs ## Make databases Database of pre-miRNAs: ```{r make-premirna-databse, engine='bash'} /home/shared/ncbi-blast-2.11.0+/bin/makeblastdb \ -in ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh_ShortStack_4.1.0_precursor.fasta \ -dbtype nucl \ -out ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/blasts/Ptuh-db/Ptuh_ShortStack_4.1.0_precursor ``` Database of mature miRNAs: ```{r make-mature-mirna-databse, engine='bash'} /home/shared/ncbi-blast-2.11.0+/bin/makeblastdb \ -in ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh_ShortStack_4.1.0_mature.fasta \ -dbtype nucl \ -out ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/blasts/Ptuh-db/Ptuh_ShortStack_4.1.0_mature ``` ## Run BLASTn Generate a list of blast results. It seems plausible that a single lncRNA, which would be hundreds or thousands of nucleotides long, could interact with multiple miRNAs, so I will allow up to 10 hits (~25% of Ptuh miRNAs) for each lncRNA. I want to see the top hits no matter how poor the match is, so I will not filter by e-value at this stage. I’ll also include the “-word_size 4” option, which reduces the required length of the initial match. Full pre-miRNAs: ```{r blastn-premirna, engine='bash'} /home/shared/ncbi-blast-2.11.0+/bin/blastn \ -task blastn \ -query ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh_lncRNA.fasta \ -db ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/blasts/Ptuh-db/Ptuh_ShortStack_4.1.0_precursor \ -out ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/blasts/lncRNA_to_precursor_blastn.tab \ -num_threads 40 \ -word_size 4 \ -max_target_seqs 10 \ -max_hsps 1 \ -outfmt 6 ``` ```{r, engine='bash', eval=TRUE} wc -l ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/blasts/lncRNA_to_precursor_blastn.tab ``` Note we have less than (10 * [# of lncRNAs]) output alignments because, while I did not set an evalue threshold, the default evalue threshold of evalue=10 is still in place. That means extremely poor matches were still excluded by default. Mature miRNAs: Note that I'm using the blastn-short option here because all of our mature miRNAs are less than 30 nucleotides long (recommended by [BLAST user manual](https://www.ncbi.nlm.nih.gov/books/NBK279684/table/appendices.T.blastn_application_options/)) ```{r blastn-mature-mirna, engine='bash'} /home/shared/ncbi-blast-2.11.0+/bin/blastn \ -task blastn \ -query ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh_lncRNA.fasta \ -db ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/blasts/Ptuh-db/Ptuh_ShortStack_4.1.0_mature \ -out ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/blasts/lncRNA_to_mature_blastn.tab \ -num_threads 40 \ -word_size 4 \ -max_target_seqs 10 \ -max_hsps 1 \ -outfmt 6 ``` ```{r, engine='bash', eval=TRUE} wc -l ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/blasts/lncRNA_to_mature_blastn.tab ``` # Examine BLAST tables Read into R and assign informative column labels ```{r read-in-blast-tables, eval=TRUE} precursor_lncRNA_BLASTn <- read.table("../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/blasts/lncRNA_to_precursor_blastn.tab", sep="\t", header=FALSE) mature_lncRNA_BLASTn <- read.table("../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/blasts/lncRNA_to_mature_blastn.tab", sep="\t", header=FALSE) colnames(precursor_lncRNA_BLASTn) <- c("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore") colnames(mature_lncRNA_BLASTn) <- c("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore") ``` ## LncRNAs as miRNA precursors Are there any alignments of the full precursor miRNA to a lncRNA? Our precursor sequences are 90-98 nucleotides long, so let's look for any alignments of at least 90 nucleotides with 0 mismatches. ```{r, eval=TRUE} precursor_lncRNA_BLASTn %>% filter(length >= 90) %>% filter(mismatch == 0) %>% unique() %>% nrow() precursor_lncRNA_BLASTn %>% filter(length >= 90) %>% filter(mismatch == 0) %>% select(qseqid) %>% unique() %>% nrow() precursor_lncRNA_BLASTn %>% filter(length >= 90) %>% filter(mismatch == 0) %>% select(sseqid) %>% unique() %>% nrow() precursor_lncRNA_BLASTn %>% filter(length >= 90) %>% filter(mismatch == 0) %>% unique() ``` We have 50 alignments of a full pre-miRNA to a lncRNA with no mismatches. 46 lncRNA and 20 miRNA are represented. Note that, as in A.pulchra and P.evermanni, there are instances of a single pre-miRNA matching to several overlapping lncRNA, potentially representing multiple isoforms of a single lncRNA gene. Save these results ```{r} precursor_lncRNAs <- precursor_lncRNA_BLASTn %>% filter(length >= 90) %>% filter(mismatch == 0) write.table(precursor_lncRNAs, "../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/lncRNAs_as_miRNA_precursors.txt") ``` ## LncRNAs as miRNA sponges I'm not sure whether to expect lncRNAs to bind miRNAs in the same way cnidarian miRNA-mRNA binding occurs (nearly perfect complementarity of mature sequence), or whether the mechanism could differ (e.g., requires only a complementary seed region, as in vertebrate miRNA-mRNA binding). that means I don't know what alignment parameters to require for our BLAST results. For now let's say the aligned region must be at least 8 nucleotides (the expected length of an miRNA seed region), and let's require a low evalue of 1e-3, to generally restrict results to those with high complementarity. ```{r, eval=TRUE} mature_lncRNA_BLASTn %>% filter(length >= 8) %>% filter(evalue <= 0.001) %>% unique() ``` 207 putative lncRNA sponges with these parameters. Ultimately though these results are insufficient to determine lncRNA sponging. We need to evaluate miRNA-lncRNA binding. # miRanda miRanda is a target prediction software, used to identify likely miRNA-mRNA interactions. Inputs: - FASTA of A.pulchra lncRNAs - FASTA of A.pulchra mature miRNAs ## Run miRanda ```{r, engine='bash', eval=FALSE} # score cutoff >100 # energy cutoff <-20 # strict binding /home/shared/miRanda-3.3a/src/miranda \ ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh_ShortStack_4.1.0_mature.fasta \ ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh_lncRNA.fasta \ -sc 100 \ -en -20 \ -strict \ -out ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh-miRanda-lncRNA-strict_all.tab ``` # Summarize results Let's look at the output ```{r, engine='bash', eval=TRUE} echo "miranda run finished!" echo "Counting number of interacting miRNA-lncRNA pairs" zgrep -c "Performing Scan" ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh-miRanda-lncRNA-strict_all.tab echo "Parsing output" grep -A 1 "Scores for this hit:" ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh-miRanda-lncRNA-strict_all.tab | sort | grep '>' > ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh-miRanda-lncRNA-strict-parsed.txt echo "counting number of putative interactions predicted (can include multiple interactions between single miRNA-lncRNA pair)" wc -l ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh-miRanda-lncRNA-strict_all.tab ``` This is a lot of putative interactions! We can probably narrow it down though. In vertebrates, miRNA-mRNA binding only requires complementarity of an miRNA seed region of ~8 nucleotides. This requirement is built in to miRanda target prediction. In cnidarians, however, miRNA-mRNA binding is believed to require near-complete complementarity of the full mature miRNA, similarly to plants ( @admoni_target_2023 , @admoni_mirna-target_2025 ). While I couldn't find any information on expected requirements for miRNA-lncRNA sponges, its possible the binding will function similarly to miRNA-mRNA binding. Let's look at how many putative interactions are predicted for a binding length of at least 21 nucleotides (the length of our smallest mature miRNA). ```{r, engine='bash', eval=TRUE} echo "number of putative interactions of at least 21 nucleotides" awk -F'\t' '$7 >= 21' ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh-miRanda-lncRNA-strict-parsed.txt | wc -l echo "" echo "check some:" awk -F'\t' '$7 >= 21' ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh-miRanda-lncRNA-strict-parsed.txt | head -5 ``` The header for this output is formatted as: mirna Target Score Energy-Kcal/Mol Query-Aln(start-end) Subject-Al(Start-End) Al-Len Subject-Identity Query-Identity We can see from the percent identities (last 2 entries) that this number includes alignments with multiple mismatches. Let's filter again to reduce the number of permissible mismatches. Let's say we want no more than 3 mismatches (a gap is counted as a mismatch). For an alignment of 21 nucleotides, this would be an percent identity of (21-3)/21 = 85.7%. The miRNA is our "subject", so we will filter by column 8. ```{r, engine='bash'} echo "number of putative interactions of at least 21 nucleotides, with at most 3 mismatches" awk -F'\t' '$7 >= 21' ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh-miRanda-lncRNA-strict-parsed.txt | awk -F'\t' '$8 >= 85' | wc -l echo "" echo "check some:" awk -F'\t' '$7 >= 21' ../output/14-Ptuh-miRNA-lncRNA-BLASTs-miRanda/Ptuh-miRanda-lncRNA-strict-parsed.txt | awk -F'\t' '$8 >= 85' | head -5 ``` This is a dramatically smaller number -- only interactions are at least 21 nucleotides with <=3 mismatches