10.1-Ptuh-mRNA-lncRNA-correlation-PCC ================ 2025-06-23 This code will use Pearson’s correlation coefficient to examine possible correlations between mRNA and lncRNA expression. Run with updated lncRNA counts matrix 8/5/2025. Counts were updated to correct for a previous issue of lncRNA with all-zero counts, and then filtered to only include lncRNA with at least 5 counts in at least 1 sample as valid. Both matrices (the mRNA counts and lncRNA counts) are prohibitively large (~30,000 rows and ~15,000 rows, respectively). I’ve used the above code to create an R script that will perform the PCC calculations, `run_mRNA_lncRNA_PCC.R` stored in `F-Ptuh/output/21.2-Ptuh-mRNA-lncRNA-correlation-PCC`. To run the PCC calcs in the background, allowing continued work in Rstudio, run this code chunk from the terminal. ``` bash cd ../output/10.1-Ptuh-mRNA-lncRNA-correlation-PCC/ Rscript run_mRNA_lncRNA_PCC_chunks100.R > run_mRNA_lncRNA_PCC_chunks100_log.txt 2>&1 & ``` Once the script is finished running, you can toss the intermediate “chunk” files ``` bash cd ../output/10.1-Ptuh-mRNA-lncRNA-correlation-PCC/ rm -r chunk*.csv ``` Read in gene counts ``` r mRNA_counts <- read.csv("../output/06.2-Ptuh-Hisat/Ptuh-gene_count_matrix.csv") rownames(mRNA_counts) <- mRNA_counts$gene_id mRNA_counts <- mRNA_counts %>% dplyr::select(-gene_id) mRNA_counts <- mRNA_counts %>% dplyr::rename( "sample47"=`RNA.POC.47`, "sample48"=`RNA.POC.48`, "sample50"=`RNA.POC.50`, "sample53"=`RNA.POC.53`, "sample57"=`RNA.POC.57`) #head(mRNA_counts) # Remove any mRNAs with 0 for all samples mRNA_counts <- mRNA_counts %>% mutate(Total = rowSums(.[, 1:5]))%>% filter(!Total==0)%>% dplyr::select(!Total) # Number of mRNA nrow(mRNA_counts) ``` ## [1] 26508 Read in lncRNA data. Generated in `M-multi-species/output/01.6-lncRNA-pipeline`, counts matrix available at ``` r lncRNA_counts<-read_table(file="../../M-multi-species/output/01.6-lncRNA-pipeline/Ptuh-lncRNA-counts-filtered.txt") %>% dplyr::rename("lncrna_id"=Geneid, "sample47"=`...output.01.6.Ptuh.lncRNA.pipeline.RNA.POC.47.S1.TP2.sorted.bam`, "sample48"=`...output.01.6.Ptuh.lncRNA.pipeline.RNA.POC.48.S1.TP2.sorted.bam`, "sample50"=`...output.01.6.Ptuh.lncRNA.pipeline.RNA.POC.50.S1.TP2.sorted.bam`, "sample53"=`...output.01.6.Ptuh.lncRNA.pipeline.RNA.POC.53.S1.TP2.sorted.bam`, "sample57"=`...output.01.6.Ptuh.lncRNA.pipeline.RNA.POC.57.S1.TP2.sorted.bam`) ``` ## ## ── Column specification ──────────────────────────────────────────────────────── ## cols( ## Geneid = col_character(), ## Chr = col_character(), ## Start = col_double(), ## End = col_double(), ## Strand = col_character(), ## Length = col_double(), ## ...output.01.6.Ptuh.lncRNA.pipeline.RNA.POC.47.S1.TP2.sorted.bam = col_double(), ## ...output.01.6.Ptuh.lncRNA.pipeline.RNA.POC.48.S1.TP2.sorted.bam = col_double(), ## ...output.01.6.Ptuh.lncRNA.pipeline.RNA.POC.50.S1.TP2.sorted.bam = col_double(), ## ...output.01.6.Ptuh.lncRNA.pipeline.RNA.POC.53.S1.TP2.sorted.bam = col_double(), ## ...output.01.6.Ptuh.lncRNA.pipeline.RNA.POC.57.S1.TP2.sorted.bam = col_double() ## ) ``` r # Change to df lncRNA_counts_df <- as.data.frame(lncRNA_counts) %>% dplyr::select(!c("Chr", "Start", "End", "Strand", "Length")) row.names(lncRNA_counts_df) <- lncRNA_counts_df[,1] lncRNA_counts_df <- lncRNA_counts_df[,-1] # remove the first column (gene names) if needed # Remove any lncRNAs with 0 for all samples lncRNA_counts_df <- lncRNA_counts_df %>% mutate(Total = rowSums(.[, 1:5]))%>% filter(!Total==0)%>% dplyr::select(!Total) # Number of lncRNA nrow(lncRNA_counts_df) ``` ## [1] 16152 Normalize counts ``` r # Function to normalize counts (simple RPM normalization) normalize_counts <- function(counts) { rpm <- t(t(counts) / colSums(counts)) * 1e6 return(rpm) } # Normalize mRNA and mRNA counts mRNA_norm <- normalize_counts(mRNA_counts) #mRNA_norm <- as.matrix(mRNA_counts_filt) lncRNA_norm <- normalize_counts(lncRNA_counts_df) #mRNA_norm <- as.matrix(mRNA_counts_filt) ``` Calculate PCC ``` r # Function to calculate PCC and p-value for a pair of vectors calc_pcc <- function(x, y) { result <- cor.test(x, y, method = "pearson") return(c(PCC = result$estimate, p_value = result$p.value)) } # Create a data frame of all mRNA-lncRNA pairs pairs <- expand.grid(mRNA = rownames(mRNA_norm), lncRNA = rownames(lncRNA_norm)) # Calculate PCC and p-value for each pair pcc_results <- pairs %>% rowwise() %>% mutate( pcc_stats = list(calc_pcc(mRNA_norm[mRNA,], lncRNA_norm[lncRNA,])) ) %>% unnest_wider(pcc_stats) # Adjust p-values for FDR pcc_results <- pcc_results %>% mutate(adjusted_p_value = p.adjust(p_value, method = "fdr")) # Save write.csv(pcc_results, "../output/10.1-Ptuh-mRNA-lncRNA-correlation-PCC/Ptuh-PCC_mRNA_lncRNA.csv") ``` Read back in PCC results (available in large-file storage at `https://gannet.fish.washington.edu/kdurkin1/ravenbackups/deep-dive-expression/F-Ptuh/output/10.1-Ptuh-mRNA-lncRNA-correlation-PCC/Ptuh-PCC_mRNA_lncRNA.csv`) ``` r # pcc_results <- read.csv("https://gannet.fish.washington.edu/kdurkin1/ravenbackups/deep-dive-expression/F-Ptuh/output/10.1-Ptuh-mRNA-lncRNA-correlation-PCC/Ptuh-PCC_mRNA_lncRNA.csv") ``` Inspect the data ``` r # nrow(pcc_results) # length(unique(pcc_results$mRNA)) # length(unique(pcc_results$lncRNA)) # # # Are there any pairs that have a PCC correlation > |0.5| and a p-value < 0.05? # sig_pairs <- pcc_results %>% # filter(abs(PCC.cor) > 0.5 & p_value < 0.05) # # ## Count positive and negative PCC.cor values # all_count <- nrow(sig_pairs) # positive_count <- sum(sig_pairs$PCC.cor > 0) # negative_count <- sum(sig_pairs$PCC.cor < 0) # cat("Number of correlations with |PCC| > 0.5 and p-val < 0.05:", all_count, "\n") # cat("Number of rows with positive PCC.cor:", positive_count, "(", (100*positive_count)/all_count, ")", "\n") # cat("Number of rows with negative PCC.cor:", negative_count, "(", (100*negative_count)/all_count, ")", "\n") ``` Bash version: ``` bash cd ../output/10.1-Ptuh-mRNA-lncRNA-correlation-PCC/ #head Ptuh-PCC_mRNA_lncRNA.csv # Num of lines tail -n +2 Ptuh-PCC_mRNA_lncRNA.csv | wc -l # Num unique mRNAs and lncRNAs awk -F, 'NR > 1 {print $1}' Ptuh-PCC_mRNA_lncRNA.csv | sort | uniq | wc -l awk -F, 'NR > 1 {print $2}' Ptuh-PCC_mRNA_lncRNA.csv | sort | uniq | wc -l # filter for significance awk -F, 'NR==1 || (sqrt(($3)^2) > 0.5 && $4 < 0.05)' Ptuh-PCC_mRNA_lncRNA.csv > sig_pairs.csv # Count significant pars and split by direction (pos or neg PCC) # Total tail -n +2 sig_pairs.csv | wc -l # Positive PCC awk -F, 'NR>1 && $3 > 0.5 && $4 < 0.05' sig_pairs.csv | wc -l # Negative PCC awk -F, 'NR>1 && $3 < -0.5 && $4 < 0.05' sig_pairs.csv | wc -l ``` ## 26751367 ## 26508 ## 16152 ## 26751367 ## 19681301 ## 7070066 26,508 mRNA significantly correlate (p-val \< 0.05) with a total of 16,152 lncRNA, with a total of 26,751,367 unique pairs with significantly correlated expression. This means that every single one of our mRNA significantly correlates with at least one lncRNA, and every lncRNA significantly correlates with at least one mRNA. Of these, all also had correlation magnitudes of at least 0.5. The majority of correlations (19,681,301 or 73.6%) were positive, though a fair number (7,070,066 or 26.4%) were negative. How many mRNAs per lncRNA and vice versa for the sig pairs? ``` r # ## sig pairs # lncRNAs_per_mRNA <- sig_pairs %>% # group_by(mRNA) %>% # summarize(n_lncRNAs = n_distinct(lncRNA)) %>% # arrange(desc(n_lncRNAs)) # # print("lncRNAs per mRNA, significant. mean, range:") # mean(lncRNAs_per_mRNA$n_lncRNAs) # range(lncRNAs_per_mRNA$n_lncRNAs) # # cat("\n") # # mRNAs_per_lncRNA <- sig_pairs %>% # group_by(lncRNA) %>% # summarize(n_mRNAs = n_distinct(mRNA)) %>% # arrange(desc(n_mRNAs)) # # print("mRNAs per lncRNA, significnat. mean, range:") # mean(mRNAs_per_lncRNA$n_mRNAs) # range(mRNAs_per_lncRNA$n_mRNAs) ``` Bash version: ``` bash cd ../output/10.1-Ptuh-mRNA-lncRNA-correlation-PCC/ #lncRNAs per mRNA awk -F, 'NR > 1 {pair[$1","$2]=1} END { for (p in pair) { split(p, a, ",") count[a[1]]++ } for (k in count) print count[k] }' sig_pairs.csv > lncRNAs_per_mRNA_counts.txt # Mean awk '{sum+=$1} END {print "Mean:", sum/NR}' lncRNAs_per_mRNA_counts.txt # Range sort -n lncRNAs_per_mRNA_counts.txt | awk 'NR==1{min=$1} {max=$1} END {print "Range:", min, max}' # mRNAs per lncRNA awk -F, 'NR > 1 {pair[$2","$1]=1} END { for (p in pair) { split(p, a, ",") count[a[1]]++ } for (k in count) print count[k] }' sig_pairs.csv > mRNAs_per_lncRNA_counts.txt # Mean awk '{sum+=$1} END {print "Mean:", sum/NR}' mRNAs_per_lncRNA_counts.txt # Range sort -n mRNAs_per_lncRNA_counts.txt | awk 'NR==1{min=$1} {max=$1} END {print "Range:", min, max}' ``` ## Mean: 1009.18 ## Range: 133 3759 ## Mean: 1656.23 ## Range: 463 3266 For the significant pairs (\|PCC\| \> 0.5, pval \< 0.05), the mRNAs are correlated with 133-3,759 unique lncRNAs (1009.18 on average), while the lncRNAs are correlated with 463-3,266 unique mRNAs (1656.23 on average).