--- title: "06-Apul-miRNA-mRNA-RNAhybrid" author: "Kathleen Durkin" date: "2024-12-19" always_allow_html: true output: bookdown::html_document2: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true github_document: toc: true toc_depth: 3 number_sections: true html_preview: true --- ```{r load-packages} library(ggplot2) ``` RNAhybrid is a miRNA-mRNA target prediction tool, which bases its predictions primarily on thermodynamic binding stability. Inputs: - miRNA "query" FASTA file -- mature miRNA sequences, since the mature miRNA molecule is what will presumably bind to an mRNA - mRNA "target" FASTA file -- the *A. pulchra* genome fasta # format miRNA fasta ShortStack outputs a fasta containing all annotated miRNAs, but it includes the full precursor sequence and star sequence, in addition to the mature miRNA. Since mature miRNAs are the "final forms" that are generally much more highly expressed, we'll just look at binding of mature miRNAs for now. ```{r, engine='bash'} # look at miRNA fasta file head -5 ../output/04-Apul-sRNA-discovery-ShortStack/ShortStack_out/mir.fasta echo "" # check the naming convention for sequences grep "^>" ../output/04-Apul-sRNA-discovery-ShortStack/ShortStack_out/mir.fasta | head -10 # isolate just the mature miRNA sequences awk '/^>/ {header=$0} /mature/ {print header; found=1; next} found {print; found=0}' ../output/04-Apul-sRNA-discovery-ShortStack/ShortStack_out/mir.fasta > ../data/06-Apul-miRNA-mRNA-RNAhybrid/miRNA_mature-Apul.fasta echo "" # check filtered file head -6 ../data/06-Apul-miRNA-mRNA-RNAhybrid/miRNA_mature-Apul.fasta ``` # target prediction for full mRNA While miRNas are primarily observed binding to the 3' UTR region of mRNAs, they are also believed to functionally bind to CDS regions, though effects on translation are weaker. # formate mRNA fasta get a gff file of only CDS sequences Notably, RNAhybrid only accepts FASTA sequences of 1000 characters or fewer. We'll need to ensure the CDS file only contains sequences of <1000bp, and break up the sequences that are too long. ```{r check-genome-seq-lengths, engine='bash'} # mRNA-only genome gff # Count total sequences in genome gff wc -l ../output/05-Apul-annotate-UTRs/Apulcra-genome-mRNA_only.gff # Count the number of sequences that contain >1000 bp awk '{if ($5 - $4 > 1000) count++} END {print count}' ../output/05-Apul-annotate-UTRs/Apulcra-genome-mRNA_only.gff # Check how the sequence names are formatted head -2 ../output/05-Apul-annotate-UTRs/Apulcra-genome-mRNA_only.gff ``` Welp, looks like the genome-based mRNA gff largely contains long, >1000bp sequences (28668/36447, or 78.7%) First let's deal with the long sequences while we're in the compact gff form. I want to break up any sequence >1000bp into 1000bp chunks, adding a line to the gff for each chunk. (I also want there to be overlap among the chunks, in case the break between two chunks falls in the middle of an miRNA binding site. Let's say a 25bp overlap, since that is just over the maximum expected miRNA length.) for now let's not worry about the overlap. ```{r, engine='bash', eval=FALSE} awk -v chunk_size=1000 ' BEGIN {OFS="\t"} { seq_length = $5 - $4 if (seq_length > chunk_size) { start = $4 ogend = $5 while (start < ogend) { end = start + chunk_size if (end > ogend) end = ogend $4 = start $5 = end print start = end } } else { print } }' "../output/05-Apul-annotate-UTRs/Apulcra-genome-mRNA_only.gff" > "../data/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000.gff" ``` ```{r, engine='bash', eval=FALSE} # mRNA-only genome gff # Count total sequences in genome gff wc -l ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000.gff # Count the number of sequences that contain >1000 bp awk '{if ($5 - $4 > 1000) count++} END {print count}' ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000.gff # Check how the sequence names are formatted head -50 ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000.gff | tail -5 echo "" echo "" # I'm getting an error in getfasta related to some lines of the gff having a different number of columns. # Each line should contain 9 columns. Count number of lines with more than 9 awk -F'\t' 'NF > 9 {count++} END {print count}' ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000.gff # For some lines, the final column is being interpreted as two columns head -50 ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000.gff | tail -5 | awk -F'\t' '{print $9}' head -50 ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000.gff | tail -5 | awk -F'\t' '{print $10}' ``` Fix # columns issue. Replace any instances of "hypothetical [tab] protein" with "hypothetical [space] protein" ```{r, engine='bash', eval=FALSE} # Replace any instances of "hypothetical -tab- protein" with "hypothetical -space- protein" sed 's/hypothetical\tprotein/hypothetical protein/g' ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000.gff > ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000_formatted.gff ``` Check ```{r, engine='bash'} # Count number of entries with >9 columns (should be nothing now) awk -F'\t' 'NF > 9 {count++} END {print count}' ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000_formatted.gff ``` get an mRNA fasta file ```{r, engine='bash', eval=FALSE} # Use mRNA gff and genome fasta to extract mRNA fastas /home/shared/bedtools2/bin/bedtools getfasta -fi "../data/Apulchra-genome.fa" -bed "../output/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000_formatted.gff" -fo "../output/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000.fa" ``` Ok, we now have a fasta file of mRNA sequences, broken up so that no sequence exceeds 1000bp. We should now be able to run RNAhybrid! I have RNAhybrid installed on a miniconda environment ``` # Check path to the conda environment I'm using which conda # Install RNAhybrid if neccessary conda install -y -c genomedk rnahybrid # Check installation conda list rnahybrid ``` # RNAcalibrate Note: I'm not using the RNAcalibrate results for now because of a weird issue with all-zero RNAhybrid pvals. ```{r, engine='bash', eval=FALSE} # RNAhybrid only has built-in support for humans, worms, and flies. # We can use RNAcalibrate to derive Extreme Value Distribution parameters for our data. # USe 3'UTR as input, since we expect majority binding there RNAcalibrate \ -t ../output/05-Apul-annotate-UTRs/Apul_3UTR_1kb.fasta \ -q ../data/06-Apul-miRNA-mRNA-RNAhybrid/miRNA_mature-Apul.fasta \ > ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAcalibrate-3UTR.txt ``` ```{r, eval=FALSE} # Now we need to pick a distance parameter and shape parameter RNAcalibrate_out <- read.table("../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAcalibrate-3UTR.txt", sep = " ") ggplot(RNAcalibrate_out, aes(x = V3)) + geom_histogram(aes(y = after_stat(density)), binwidth = 0.05, fill = "blue", color = "black", alpha = 0.7) + geom_density(color = "red", size = 1) + geom_vline(aes(xintercept = mean(V3)), color = "red", linetype = "dashed", size = 1) + geom_text(aes(x = mean(V3) + 0.15, y = 4, label = paste("Mean =", round(mean(V3), 2))), color = "red", vjust = -0.5, hjust = 1.2) + labs(title = "Distance Parameter distribution", x = "Values", y = "Density") ggplot(RNAcalibrate_out, aes(x = V4)) + geom_histogram(aes(y = after_stat(density)), binwidth = 0.005, fill = "blue", color = "black", alpha = 0.7) + geom_density(color = "red", size = 1) + geom_vline(aes(xintercept = mean(V4)), color = "red", linetype = "dashed", size = 1) + geom_text(aes(x = mean(V4) + 0.017, y = 37, label = paste("Mean =", round(mean(V4), 2))), color = "red", vjust = -0.5, hjust = 1.2) + labs(title = "Shape Parameter distribution", x = "Values", y = "Density") ``` ```{r, engine='bash', eval=FALSE} # Also try with the full mRNA sequences RNAcalibrate \ -t ../data/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000.fa \ -q ../data/06-Apul-miRNA-mRNA-RNAhybrid/miRNA_mature-Apul.fasta \ > ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAcalibrate-mRNA.txt ``` ```{r, eval=FALSE} # Now we need to pick a distance parameter and shape parameter RNAcalibrate_out <- read.table("../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAcalibrate-mRNA.txt", sep = " ") ggplot(RNAcalibrate_out, aes(x = V3)) + geom_histogram(aes(y = after_stat(density)), binwidth = 0.05, fill = "blue", color = "black", alpha = 0.7) + geom_density(color = "red", size = 1) + geom_vline(aes(xintercept = mean(V3)), color = "red", linetype = "dashed", size = 1) + geom_text(aes(x = mean(V3) + 0.15, y = 4, label = paste("Mean =", round(mean(V3), 2))), color = "red", vjust = -0.5, hjust = 1.2) + labs(title = "Distance Parameter distribution", x = "Values", y = "Density") ggplot(RNAcalibrate_out, aes(x = V4)) + geom_histogram(aes(y = after_stat(density)), binwidth = 0.005, fill = "blue", color = "black", alpha = 0.7) + geom_density(color = "red", size = 1) + geom_vline(aes(xintercept = mean(V4)), color = "red", linetype = "dashed", size = 1) + geom_text(aes(x = mean(V4) + 0.017, y = 37, label = paste("Mean =", round(mean(V4), 2))), color = "red", vjust = -0.5, hjust = 1.2) + labs(title = "Shape Parameter distribution", x = "Values", y = "Density") ``` Distributions looks roughly normal, so I'll use the mean distance and shape parameters calculated by RNAcalibrate as input in RNAhybrid. # RNAhybrid: mRNA file I'm having problems using the RNAcalibrate estimated parameters, where every output hybridization has a p-value of 0. This still happens if I remove the `-e` energy cutoff and the `-p` pvalue cutoff, so the `-d` parameter specification must be causing it. I'll use a built-in estimation for now, since those result in reasonable pvalues. Options are 3utr_human, 3utr_fly, and 3utr-worm. Not sure if fly or worm is most closely related to corals (they're both quite distant), but I'll just arbitrarily pick worm for now. ```{r, engine='bash', eval=FALSE} RNAhybrid \ -s 3utr_worm \ -e -20 \ -p 0.05 \ -c \ -t ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apulcra-genome-mRNA_only_MAX1000.fa \ -q ../data/06-Apul-miRNA-mRNA-RNAhybrid/miRNA_mature-Apul.fasta \ > ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-mRNA-compact_3utrworm.txt ``` # RNAhybrid: 3'UTR ```{r, engine='bash', eval=FALSE} RNAhybrid \ -s 3utr_worm \ -e -20 \ -p 0.05 \ -c \ -t ../output/05-Apul-annotate-UTRs/Apul_3UTR_1kb.fasta \ -q ../data/06-Apul-miRNA-mRNA-RNAhybrid/miRNA_mature-Apul.fasta \ > ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-3UTR-compact_3utrworm.txt ``` # RNAhybrid: 5'UTR ```{r, engine='bash', eval=FALSE} RNAhybrid \ -s 3utr_worm \ -e -20 \ -p 0.05 \ -c \ -t ../output/05-Apul-annotate-UTRs/Apul_5UTR_1kb.fasta \ -q ../data/06-Apul-miRNA-mRNA-RNAhybrid/miRNA_mature-Apul.fasta \ > ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-5UTR-compact_3utrworm.txt ``` # Summarize results ```{r, engine='bash'} # How many significant hybridizations predicted for each input? wc -l ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-mRNA-compact_3utrworm.txt wc -l ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-5UTR-compact_3utrworm.txt wc -l ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-3UTR-compact_3utrworm.txt ``` ```{r, engine='bash'} head -5 ../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-3UTR-compact_3utrworm.txt ``` Now let's read our RNAhybrid results into R for visualization. Note this is going to be slightly more complicated than it sounds because the RNAhybrid compact output is colon-delimited and our target- and query-IDs contain intentional colons than could get confused with column delimiters. mRNA output: ```{r} options(scipen=999) options(digits = 10) RNAhybrid_mRNA <- read.table("../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-mRNA-compact_3utrworm.txt", sep=":") # Recombine Columns 1 and 2 (fix incorrect separation of target ID components) RNAhybrid_mRNA$V1 <- paste(RNAhybrid_mRNA$V1, RNAhybrid_mRNA$V2, sep = ":") RNAhybrid_mRNA$V2 <- NULL # Do the same for Columns 4-7 (query ID components) RNAhybrid_mRNA$V4 <- paste(RNAhybrid_mRNA$V4, RNAhybrid_mRNA$V5, RNAhybrid_mRNA$V6, RNAhybrid_mRNA$V7 , sep = ":") RNAhybrid_mRNA$V4 <- gsub(":NA:", "::", RNAhybrid_mRNA$V4) RNAhybrid_mRNA$V5 <- NULL RNAhybrid_mRNA$V6 <- NULL RNAhybrid_mRNA$V7 <- NULL # Rename all columns for readability/accessibility colnames(RNAhybrid_mRNA) <- c("target_name", "target_length", "query_name", "query_length", "mfe", "pval", "position", "noncomp_target_seq", "comp_target_seq", "comp_query_seq", "noncomp_query_seq") ``` 5'UTR output: ```{r} RNAhybrid_5UTR <- read.table("../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-5UTR-compact_3utrworm.txt", sep=":") # Recombine Columns 1 and 2 (fix incorrect separation of target ID components) RNAhybrid_5UTR$V1 <- paste(RNAhybrid_5UTR$V1, RNAhybrid_5UTR$V2, sep = ":") RNAhybrid_5UTR$V2 <- NULL # Do the same for Columns 4-7 (query ID components) RNAhybrid_5UTR$V4 <- paste(RNAhybrid_5UTR$V4, RNAhybrid_5UTR$V5, RNAhybrid_5UTR$V6, RNAhybrid_5UTR$V7 , sep = ":") RNAhybrid_5UTR$V4 <- gsub(":NA:", "::", RNAhybrid_5UTR$V4) RNAhybrid_5UTR$V5 <- NULL RNAhybrid_5UTR$V6 <- NULL RNAhybrid_5UTR$V7 <- NULL # Rename all columns for readability/accessibility colnames(RNAhybrid_5UTR) <- c("target_name", "target_length", "query_name", "query_length", "mfe", "pval", "position", "noncomp_target_seq", "comp_target_seq", "comp_query_seq", "noncomp_query_seq") ``` 3'UTR output: ```{r} RNAhybrid_3UTR <- read.table("../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-3UTR-compact_3utrworm.txt", sep=":") # Recombine Columns 1 and 2 (fix incorrect separation of target ID components) RNAhybrid_3UTR$V1 <- paste(RNAhybrid_3UTR$V1, RNAhybrid_3UTR$V2, sep = ":") RNAhybrid_3UTR$V2 <- NULL # Do the same for Columns 4-7 (query ID components) RNAhybrid_3UTR$V4 <- paste(RNAhybrid_3UTR$V4, RNAhybrid_3UTR$V5, RNAhybrid_3UTR$V6, RNAhybrid_3UTR$V7 , sep = ":") RNAhybrid_3UTR$V4 <- gsub(":NA:", "::", RNAhybrid_3UTR$V4) RNAhybrid_3UTR$V5 <- NULL RNAhybrid_3UTR$V6 <- NULL RNAhybrid_3UTR$V7 <- NULL # Rename all columns for readability/accessibility colnames(RNAhybrid_3UTR) <- c("target_name", "target_length", "query_name", "query_length", "mfe", "pval", "position", "noncomp_target_seq", "comp_target_seq", "comp_query_seq", "noncomp_query_seq") ``` ```{r} RNAhybrid_mRNA$Dataset <- "Dataset 1" RNAhybrid_5UTR$Dataset <- "Dataset 2" RNAhybrid_3UTR$Dataset <- "Dataset 3" # Combine the datasets into a single data frame combined_data <- rbind(RNAhybrid_mRNA, RNAhybrid_5UTR, RNAhybrid_3UTR) # Create the faceted histogram ggplot(combined_data, aes(x = pval)) + geom_histogram(binwidth = 0.005, color = "black", fill = "skyblue") + facet_wrap(~Dataset) + theme_minimal() + labs(title = "Histogram of Column C", x = "C", y = "Frequency") ``` I'd also like to get the RNAhybrid results in a gff form for IGV visualization Save outputs as tab-delimited files ```{r, eval=FALSE} write.table(RNAhybrid_mRNA, file="../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-mRNA-compact_3utr_worm-formatted.txt", sep="\t", row.names = FALSE, col.names=TRUE, quote = FALSE) write.table(RNAhybrid_5UTR, file="../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-5UTR-compact_3utr_worm-formatted.txt", sep="\t", row.names = FALSE, quote = FALSE) write.table(RNAhybrid_3UTR, file="../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-3UTR-compact_3utr_worm-formatted.txt", sep="\t", row.names = FALSE, quote = FALSE) ``` Now convert these formatted RNAhybrid output into a gff-formatted file mRNA: ```{r, engine='bash', eval=FALSE} #!/bin/bash # Input and output file paths INPUT_FILE="../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-mRNA-compact_3utr_worm-formatted.txt" # Replace with the path to your input file OUTPUT_FILE="../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-mRNA-compact_3utr_worm.gff" # Write GFF3 header to the output file echo "##gff-version 3" > "$OUTPUT_FILE" # Process the input file, skipping the header line tail -n +2 "$INPUT_FILE" | while IFS=$'\t' read -r target_name target_length query_name query_length mfe pval position noncomp_target_seq comp_target_seq comp_query_seq noncomp_query_seq do # Extract locus name and coordinates from target_name locus=$(echo "$target_name" | cut -d':' -f1) start_coord=$(echo "$target_name" | cut -d':' -f2 | cut -d'-' -f1) start_gff=$((start_coord + position)) end_gff=$((start_gff + query_length)) # Extract strandedness from query_name strand=$(echo "$query_name" | grep -o '(-\|+)' | tr -d '()') # Write the GFF3 line echo -e "$locus\tRNAhybrid\tmiRNA_binding\t$start_gff\t$end_gff\t.\t$strand\t.\tID=$query_name;MFE=$mfe;Pval=$pval" >> "$OUTPUT_FILE" done ``` 5'UTR: ```{r, engine='bash', eval=FALSE} #!/bin/bash # Input and output file paths INPUT_FILE="../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-5UTR-compact_3utr_worm-formatted.txt" # Replace with the path to your input file OUTPUT_FILE="../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-5UTR-compact_3utr_worm.gff" # Write GFF3 header to the output file echo "##gff-version 3" > "$OUTPUT_FILE" # Process the input file, skipping the header line tail -n +2 "$INPUT_FILE" | while IFS=$'\t' read -r target_name target_length query_name query_length mfe pval position noncomp_target_seq comp_target_seq comp_query_seq noncomp_query_seq do # Extract locus name and coordinates from target_name locus=$(echo "$target_name" | cut -d':' -f1) start_coord=$(echo "$target_name" | cut -d':' -f2 | cut -d'-' -f1) start_gff=$((start_coord + position)) end_gff=$((start_gff + query_length)) # Extract strandedness from query_name strand=$(echo "$query_name" | grep -o '(-\|+)' | tr -d '()') # Write the GFF3 line echo -e "$locus\tRNAhybrid\tmiRNA_binding\t$start_gff\t$end_gff\t.\t$strand\t.\tID=$query_name;MFE=$mfe;Pval=$pval" >> "$OUTPUT_FILE" done ``` 3'UTR: ```{r, engine='bash', eval=FALSE} #!/bin/bash # Input and output file paths INPUT_FILE="../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-3UTR-compact_3utr_worm-formatted.txt" # Replace with the path to your input file OUTPUT_FILE="../output/06-Apul-miRNA-mRNA-RNAhybrid/Apul-RNAhybrid-3UTR-compact_3utr_worm.gff" # Write GFF3 header to the output file echo "##gff-version 3" > "$OUTPUT_FILE" # Process the input file, skipping the header line tail -n +2 "$INPUT_FILE" | while IFS=$'\t' read -r target_name target_length query_name query_length mfe pval position noncomp_target_seq comp_target_seq comp_query_seq noncomp_query_seq do # Extract locus name and coordinates from target_name locus=$(echo "$target_name" | cut -d':' -f1) start_coord=$(echo "$target_name" | cut -d':' -f2 | cut -d'-' -f1) start_gff=$((start_coord + position)) end_gff=$((start_gff + query_length)) # Extract strandedness from query_name strand=$(echo "$query_name" | grep -o '(-\|+)' | tr -d '()') # Write the GFF3 line echo -e "$locus\tRNAhybrid\tmiRNA_binding\t$start_gff\t$end_gff\t.\t$strand\t.\tID=$query_name;MFE=$mfe;Pval=$pval" >> "$OUTPUT_FILE" done ```