09.1-Peve-mRNA-lncRNA-correlation-PCC ================ 2025-06-23 - [0.0.1 Note](#001-note) This code will use Pearson’s correlation coefficient to examine possible correlations between mRNA and lncRNA expression. Run with updated lncRNA counts matrix 8/5/2025. Counts were updated to correct for a previous issue of lncRNA with all-zero counts, and then filtered to only include lncRNA with at least 5 counts in at least 1 sample as valid. Both matrices (the mRNA counts and lncRNA counts) are prohibitively large (~30,000 rows, and ~10,000 rows, respectively). I’ve used the above code to create an R script that will perform the PCC calculations, `run_mRNA_lncRNA_PCC.R` stored in `E-Peve/output/21.2-Peve-mRNA-lncRNA-correlation-PCC`. To run the PCC calcs in the background, allowing continued work in Rstudio, run this code chunk from the terminal. ``` bash cd ../output/09.1-Peve-mRNA-lncRNA-correlation-PCC/ Rscript run_mRNA_lncRNA_PCC_chunks100.R > run_mRNA_lncRNA_PCC_chunks100_log.txt 2>&1 & ``` Read in gene counts ``` r mRNA_counts <- read.csv("../output/06.2-Peve-Hisat/Peve-gene_count_matrix.csv") rownames(mRNA_counts) <- mRNA_counts$gene_id mRNA_counts <- mRNA_counts %>% select(-gene_id) mRNA_counts <- mRNA_counts %>% rename("sample71"=`RNA.POR.71`, "sample73"=`RNA.POR.73`, "sample76"=`RNA.POR.76`, "sample79"=`RNA.POR.79`, "sample82"=`RNA.POR.82`) head(mRNA_counts) ``` ## sample71 sample73 sample76 sample79 sample82 ## gene-Peve_00006864 0 0 3 263 79 ## gene-Peve_00022042 439 396 552 640 286 ## gene-Peve_00009109 89 181 33 293 1789 ## gene-Peve_00009108 0 0 0 0 0 ## gene-Peve_00004359 4598 2449 6571 2745 10812 ## gene-Peve_00004358 147 159 24 155 153 ``` r # Remove any mRNAs with 0 for all samples mRNA_counts <- mRNA_counts %>% mutate(Total = rowSums(.[, 1:5]))%>% filter(!Total==0)%>% dplyr::select(!Total) ``` Read in lncRNA data. Generated in `E-Peve/code/32-Peve-lncRNA-matrix.Rmd`, counts matrix available at ``` r lncRNA_counts<-read_table(file="../../M-multi-species/output/01.6-lncRNA-pipeline/Peve-lncRNA-counts-filtered.txt") %>% rename("lncrna_id"=Geneid, "sample71"=`X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Peve.lncRNA.pipeline.RNA.POR.71.sorted.bam`, "sample73"=`X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Peve.lncRNA.pipeline.RNA.POR.73.sorted.bam`, "sample76"=`X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Peve.lncRNA.pipeline.RNA.POR.76.sorted.bam`, "sample79"=`X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Peve.lncRNA.pipeline.RNA.POR.79.sorted.bam`, "sample82"=`X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Peve.lncRNA.pipeline.RNA.POR.82.sorted.bam`) ``` ## ## ── Column specification ──────────────────────────────────────────────────────── ## cols( ## Geneid = col_character(), ## Chr = col_character(), ## Start = col_double(), ## End = col_double(), ## Strand = col_character(), ## Length = col_double(), ## X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Peve.lncRNA.pipeline.RNA.POR.71.sorted.bam = col_double(), ## X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Peve.lncRNA.pipeline.RNA.POR.73.sorted.bam = col_double(), ## X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Peve.lncRNA.pipeline.RNA.POR.76.sorted.bam = col_double(), ## X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Peve.lncRNA.pipeline.RNA.POR.79.sorted.bam = col_double(), ## X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Peve.lncRNA.pipeline.RNA.POR.82.sorted.bam = col_double() ## ) ``` r # Change to df lncRNA_counts_df <- as.data.frame(lncRNA_counts) %>% select(!c("Chr", "Start", "End", "Strand", "Length")) row.names(lncRNA_counts_df) <- lncRNA_counts_df[,1] lncRNA_counts_df <- lncRNA_counts_df[,-1] # remove the first column (gene names) if needed # Remove any lncRNAs with 0 for all samples lncRNA_counts_df <- lncRNA_counts_df %>% mutate(Total = rowSums(.[, 1:5]))%>% filter(!Total==0)%>% dplyr::select(!Total) ``` Normalize counts ``` r # Function to normalize counts (simple RPM normalization) normalize_counts <- function(counts) { rpm <- t(t(counts) / colSums(counts)) * 1e6 return(rpm) } # Normalize mRNA and mRNA counts mRNA_norm <- normalize_counts(mRNA_counts) #mRNA_norm <- as.matrix(mRNA_counts_filt) lncRNA_norm <- normalize_counts(lncRNA_counts_df) #mRNA_norm <- as.matrix(mRNA_counts_filt) ``` Calculate PCC ``` r # Function to calculate PCC and p-value for a pair of vectors calc_pcc <- function(x, y) { result <- cor.test(x, y, method = "pearson") return(c(PCC = result$estimate, p_value = result$p.value)) } # Create a data frame of all mRNA-lncRNA pairs pairs <- expand.grid(mRNA = rownames(mRNA_norm), lncRNA = rownames(lncRNA_norm)) # Calculate PCC and p-value for each pair pcc_results <- pairs %>% rowwise() %>% mutate( pcc_stats = list(calc_pcc(mRNA_norm[mRNA,], lncRNA_norm[lncRNA,])) ) %>% unnest_wider(pcc_stats) # Adjust p-values for FDR pcc_results <- pcc_results %>% mutate(adjusted_p_value = p.adjust(p_value, method = "fdr")) # Save write.csv(pcc_results, "../output/09.1-Peve-mRNA-lncRNA-correlation-PCC/Peve-PCC_mRNA_lncRNA.csv") ``` Once you’ve generated the full correlation matrix, can toss the “chunks” of correlations ``` bash rm -r ../output/09.1-Peve-mRNA-lncRNA-correlation-PCC/chunk*.csv ``` ### 0.0.1 Note Below summary stats originally in R, but the PCC results file is too large to load to an R object. Instead, duplicated the R logic in bash to work directly with the CSV file Read back in PCC results (available in large-file storage at `https://gannet.fish.washington.edu/kdurkin1/ravenbackups/deep-dive-expression/E-Peve/output/09.1-Peve-mRNA-lncRNA-correlation-PCC/Peve-PCC_mRNA_lncRNA.csv`) ``` r # pcc_results <- read.csv("https://gannet.fish.washington.edu/kdurkin1/ravenbackups/deep-dive-expression/E-Peve/output/09.1-Peve-mRNA-lncRNA-correlation-PCC/Peve-PCC_mRNA_lncRNA.csv") ``` Bash version: ``` bash wget -O ../output/09.1-Peve-mRNA-lncRNA-correlation-PCC/Peve-PCC_mRNA_lncRNA.csv https://gannet.fish.washington.edu/kdurkin1/ravenbackups/deep-dive-expression/E-Peve/output/09.1-Peve-mRNA-lncRNA-correlation-PCC/Peve-PCC_mRNA_lncRNA.csv ``` Inspect the data ``` r # nrow(pcc_results) # length(unique(pcc_results$mRNA)) # length(unique(pcc_results$lncRNA)) # # # Are there any pairs that have a PCC correlation > |0.5| and a p-value < 0.05? # sig_pairs <- pcc_results %>% # filter(abs(PCC.cor) > 0.5 & p_value < 0.05) # # ## Count positive and negative PCC.cor values # all_count <- nrow(sig_pairs) # positive_count <- sum(sig_pairs$PCC.cor > 0) # negative_count <- sum(sig_pairs$PCC.cor < 0) # cat("Number of correlations with |PCC| > 0.5 and p-val < 0.05:", all_count, "\n") # cat("Number of rows with positive PCC.cor:", positive_count, "(", (100*positive_count)/all_count, ")", "\n") # cat("Number of rows with negative PCC.cor:", negative_count, "(", (100*negative_count)/all_count, ")", "\n") ``` Bash version: ``` bash cd ../output/09.1-Peve-mRNA-lncRNA-correlation-PCC/ #head Peve-PCC_mRNA_lncRNA.csv # Num of lines tail -n +2 Peve-PCC_mRNA_lncRNA.csv | wc -l # Num unique mRNAs and lncRNAs awk -F, 'NR > 1 {print $1}' Peve-PCC_mRNA_lncRNA.csv | sort | uniq | wc -l awk -F, 'NR > 1 {print $2}' Peve-PCC_mRNA_lncRNA.csv | sort | uniq | wc -l # filter for significance awk -F, 'NR==1 || (sqrt(($3)^2) > 0.5 && $4 < 0.05)' Peve-PCC_mRNA_lncRNA.csv > sig_pairs.csv # Count significant pairs and split by direction (pos or neg PCC) # Total tail -n +2 sig_pairs.csv | wc -l # Positive PCC awk -F, 'NR>1 && $3 > 0.5 && $4 < 0.05' sig_pairs.csv | wc -l # Negative PCC awk -F, 'NR>1 && $3 < -0.5 && $4 < 0.05' sig_pairs.csv | wc -l ``` ## 4866282 ## 31497 ## 1045 ## 4866283 ## 4724217 ## 142066 31,497 mRNA significantly correlate (p-val \< 0.05) with a total of 1,045 lncRNA, to form a total of 4,866,282 unique pairs with significantly correlated expression. Of these, all also had correlation magnitudes of at least 0.5. The vast majority of correlations (4,724,217 or 97.1%) were positive, though some (142,066 or 2.9%) were negative. How many mRNAs per lncRNA and vice versa for the sig pairs? ``` r # ## sig pairs # lncRNAs_per_mRNA <- sig_pairs %>% # group_by(mRNA) %>% # summarize(n_lncRNAs = n_distinct(lncRNA)) %>% # arrange(desc(n_lncRNAs)) # # print("lncRNAs per mRNA, significant. mean, range:") # mean(lncRNAs_per_mRNA$n_lncRNAs) # range(lncRNAs_per_mRNA$n_lncRNAs) # # cat("\n") # # mRNAs_per_lncRNA <- sig_pairs %>% # group_by(lncRNA) %>% # summarize(n_mRNAs = n_distinct(mRNA)) %>% # arrange(desc(n_mRNAs)) # # print("mRNAs per lncRNA, significnat. mean, range:") # mean(mRNAs_per_lncRNA$n_mRNAs) # range(mRNAs_per_lncRNA$n_mRNAs) ``` Bash version: ``` bash cd ../output/09.1-Peve-mRNA-lncRNA-correlation-PCC/ #lncRNAs per mRNA awk -F, 'NR > 1 {pair[$1","$2]=1} END { for (p in pair) { split(p, a, ",") count[a[1]]++ } for (k in count) print count[k] }' sig_pairs.csv > lncRNAs_per_mRNA_counts.txt # Mean awk '{sum+=$1} END {print "Mean:", sum/NR}' lncRNAs_per_mRNA_counts.txt # Range sort -n lncRNAs_per_mRNA_counts.txt | awk 'NR==1{min=$1} {max=$1} END {print "Range:", min, max}' # mRNAs per lncRNA awk -F, 'NR > 1 {pair[$2","$1]=1} END { for (p in pair) { split(p, a, ",") count[a[1]]++ } for (k in count) print count[k] }' sig_pairs.csv > mRNAs_per_lncRNA_counts.txt # Mean awk '{sum+=$1} END {print "Mean:", sum/NR}' mRNAs_per_lncRNA_counts.txt # Range sort -n mRNAs_per_lncRNA_counts.txt | awk 'NR==1{min=$1} {max=$1} END {print "Range:", min, max}' ``` ## Mean: 154.5 ## Range: 1 483 ## Mean: 4656.73 ## Range: 359 8813 For the significant pairs (\|PCC\| \> 0.5, pval \< 0.05), the mRNAs are correlated with 1-483 unique lncRNAs, while the lncRNAs are correlated with 359-8,813 unique mRNAs. Generally, mRNA interact with far fewer unique lncRNA (mean = 154.5) than do lncRNA with unique mRNA (mean = 4656.73).