21.1-Apul-mRNA-lncRNA-correlation-PCC ================ 2025-06-23 - [0.0.1 Note](#001-note) This code will use Pearson’s correlation coefficient to examine possible correlations between mRNA and lncRNA expression. Run with updated lncRNA counts matrix 8/5/2025. Counts were updated to correct for a previous issue of lncRNA with all-zero counts, and then filtered to only include lncRNA with at least 5 counts in at least 1 sample as valid. Both matrices (the mRNA counts and lncRNA counts) are prohibitively large (~30,000 rows). I’ve used the below code to create an R script that will perform the PCC calculations, `run_mRNA_lncRNA_PCC.R` stored in `D-Apul/output/21.2-Apul-mRNA-lncRNA-correlation-PCC`. To run the PCC calcs in the background, allowing continued work in Rstudio, run this code chunk from the terminal: ``` bash cd ../output/21.1-Apul-mRNA-lncRNA-correlation-PCC/ Rscript run_mRNA_lncRNA_PCC_chunks100.R > run_mRNA_lncRNA_PCC_chunks100_log.txt 2>&1 & ``` Read in gene counts ``` r mRNA_counts <- read.csv("../output/07-Apul-Hisat/Apul-gene_count_matrix.csv") rownames(mRNA_counts) <- mRNA_counts$gene_id mRNA_counts <- mRNA_counts %>% select(-gene_id) # mRNA_counts %>% # dplyr::rename("sample140"=`RNA.ACR.140`, # "sample145"=`RNA.ACR.145`, # "sample150"=`RNA.ACR.150`, # "sample173"=`RNA.ACR.173`, # "sample178"=`RNA.ACR.178`) head(mRNA_counts) ``` ## RNA.ACR.140 RNA.ACR.145 RNA.ACR.150 RNA.ACR.173 RNA.ACR.178 ## FUN_035039 553 340 256 485 510 ## FUN_035038 2486 775 743 1250 1092 ## FUN_035031 46 6 25 41 29 ## FUN_035030 183 252 48 78 73 ## FUN_035033 1519 311 555 990 370 ## FUN_035032 1764 1297 1035 1763 1360 ``` r # Remove any mRNAs with 0 for all samples mRNA_counts <- mRNA_counts %>% mutate(Total = rowSums(.[, 1:5]))%>% filter(!Total==0)%>% dplyr::select(!Total) ``` Read in lncRNA data. Generated in `/M-multi-species/output/01.6-lncRNA-pipeline/Apul-lncRNA-counts-filtered.txt`, counts matrix available at ``` r # Updated counts matrix, accessed 8/5/2025 lncRNA_counts<-read_table(file="../../M-multi-species/output/01.6-lncRNA-pipeline/Apul-lncRNA-counts-filtered.txt") %>% dplyr::rename("lncrna_id"=Geneid, "sample140"=`X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Apul.lncRNA.pipeline.RNA.ACR.140.sorted.bam`, "sample145"=`X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Apul.lncRNA.pipeline.RNA.ACR.145.sorted.bam`, "sample150"=`X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Apul.lncRNA.pipeline.RNA.ACR.150.sorted.bam`, "sample173"=`X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Apul.lncRNA.pipeline.RNA.ACR.173.sorted.bam`, "sample178"=`X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Apul.lncRNA.pipeline.RNA.ACR.178.sorted.bam`) ``` ## ## ── Column specification ──────────────────────────────────────────────────────── ## cols( ## Geneid = col_character(), ## Chr = col_character(), ## Start = col_double(), ## End = col_double(), ## Strand = col_character(), ## Length = col_double(), ## X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Apul.lncRNA.pipeline.RNA.ACR.140.sorted.bam = col_double(), ## X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Apul.lncRNA.pipeline.RNA.ACR.145.sorted.bam = col_double(), ## X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Apul.lncRNA.pipeline.RNA.ACR.150.sorted.bam = col_double(), ## X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Apul.lncRNA.pipeline.RNA.ACR.173.sorted.bam = col_double(), ## X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Apul.lncRNA.pipeline.RNA.ACR.178.sorted.bam = col_double() ## ) ``` r # Change to df lncRNA_counts_df <- as.data.frame(lncRNA_counts) %>% select(!c("Chr", "Start", "End", "Strand", "Length")) row.names(lncRNA_counts_df) <- lncRNA_counts_df[,1] lncRNA_counts_df <- lncRNA_counts_df[,-1] # remove the first column (gene names) if needed # Remove any lncRNAs with 0 for all samples lncRNA_counts_df <- lncRNA_counts_df %>% mutate(Total = rowSums(.[, 1:5]))%>% filter(!Total==0)%>% dplyr::select(!Total) ``` Number of mRNA and lncRNA included in the correlation analysis ``` r nrow(mRNA_counts) ``` ## [1] 33624 ``` r nrow(lncRNA_counts) ``` ## [1] 31490 33,624 mRNA and 31,490 lncRNA will be included in the PCC analysis. Normalize counts ``` r # Function to normalize counts (simple RPM normalization) normalize_counts <- function(counts) { rpm <- t(t(counts) / colSums(counts)) * 1e6 return(rpm) } # Normalize mRNA and mRNA counts mRNA_norm <- normalize_counts(mRNA_counts) #mRNA_norm <- as.matrix(mRNA_counts_filt) lncRNA_norm <- normalize_counts(lncRNA_counts_df) #mRNA_norm <- as.matrix(mRNA_counts_filt) ``` Calculate PCC ``` r # Function to calculate PCC and p-value for a pair of vectors calc_pcc <- function(x, y) { result <- cor.test(x, y, method = "pearson") return(c(PCC = result$estimate, p_value = result$p.value)) } # Create a data frame of all mRNA-lncRNA pairs pairs <- expand.grid(mRNA = rownames(mRNA_norm), lncRNA = rownames(lncRNA_norm)) # Calculate PCC and p-value for each pair pcc_results <- pairs %>% rowwise() %>% mutate( pcc_stats = list(calc_pcc(mRNA_norm[mRNA,], lncRNA_norm[lncRNA,])) ) %>% unnest_wider(pcc_stats) # Adjust p-values for FDR pcc_results <- pcc_results %>% mutate(adjusted_p_value = p.adjust(p_value, method = "fdr")) # Save write.csv(pcc_results, "../output/28-Apul-mRNA-lncRNA-interactions/Apul-PCC_mRNA_lncRNA.csv") ``` ### 0.0.1 Note Below summary stats originally in R, but the PCC results file is too large to load to an R object. Instead, duplicated the R logic in bash to work directly with the CSV file Read back in PCC results (available in large-file storage at `https://gannet.fish.washington.edu/kdurkin1/ravenbackups/deep-dive-expression/D-Apul/output/21.1-Apul-mRNA-lncRNA-correlation-PCC/Apul-PCC_mRNA_lncRNA.csv`) ``` r # pcc_results <- read.csv("https://gannet.fish.washington.edu/kdurkin1/ravenbackups/deep-dive-expression/D-Apul/output/21.1-Apul-mRNA-lncRNA-correlation-PCC/Apul-PCC_mRNA_lncRNA.csv") ``` Bash version: ``` bash wget -O ../output/21.1-Apul-mRNA-lncRNA-correlation-PCC/Apul-PCC_mRNA_lncRNA.csv https://gannet.fish.washington.edu/kdurkin1/ravenbackups/deep-dive-expression/D-Apul/output/21.1-Apul-mRNA_lncRNA-correlation-PCC/Apul-PCC_mRNA_lncRNA.csv ``` Inspect the data ``` r # nrow(pcc_results) # length(unique(pcc_results$mRNA)) # length(unique(pcc_results$lncRNA)) # # # Are there any pairs that have a PCC correlation > |0.5| and a p-value < 0.05? # sig_pairs <- pcc_results %>% # filter(abs(PCC.cor) > 0.5 & p_value < 0.05) # # ## Count positive and negative PCC.cor values # all_count <- nrow(sig_pairs) # positive_count <- sum(sig_pairs$PCC.cor > 0) # negative_count <- sum(sig_pairs$PCC.cor < 0) # cat("Number of correlations with |PCC| > 0.5 and p-val < 0.05:", all_count, "\n") # cat("Number of rows with positive PCC.cor:", positive_count, "(", (100*positive_count)/all_count, ")", "\n") # cat("Number of rows with negative PCC.cor:", negative_count, "(", (100*negative_count)/all_count, ")", "\n") ``` Bash version: ``` bash cd ../output/21.1-Apul-mRNA-lncRNA-correlation-PCC/ #head Apul-PCC_mRNA_lncRNA.csv # Num of lines tail -n +2 Apul-PCC_mRNA_lncRNA.csv | wc -l # Num unique mRNAs and lncRNAs awk -F, 'NR > 1 {print $1}' Apul-PCC_mRNA_lncRNA.csv | sort | uniq | wc -l awk -F, 'NR > 1 {print $2}' Apul-PCC_mRNA_lncRNA.csv | sort | uniq | wc -l # filter for significance awk -F, 'NR==1 || (sqrt(($3)^2) > 0.5 && $4 < 0.05)' Apul-PCC_mRNA_lncRNA.csv > sig_pairs.csv # Count significant pars and split by direction (pos or neg PCC) # Total tail -n +2 sig_pairs.csv | wc -l # Positive PCC awk -F, 'NR>1 && $3 > 0.5 && $4 < 0.05' sig_pairs.csv | wc -l # Negative PCC awk -F, 'NR>1 && $3 < -0.5 && $4 < 0.05' sig_pairs.csv | wc -l ``` ## 118197142 ## 33624 ## 31490 ## 118197142 ## 107168482 ## 11028660 33,624 mRNA significantly correlate (p-val \< 0.05) with a total of 31,490 lncRNA, with a total of 118,197,142 unique pairs with significantly correlated expression. Of these, all also had correlation magnitudes of at least 0.5. The majority of correlations (107,168,482 or 90.7%) were positive, though some (11,028,660 or 9.3%) were negative. How many mRNAs per lncRNA and vice versa for the sig pairs? ``` r # ## sig pairs # lncRNAs_per_mRNA <- sig_pairs %>% # group_by(mRNA) %>% # summarize(n_lncRNAs = n_distinct(lncRNA)) %>% # arrange(desc(n_lncRNAs)) # # print("lncRNAs per mRNA, significant. mean, range:") # mean(lncRNAs_per_mRNA$n_lncRNAs) # range(lncRNAs_per_mRNA$n_lncRNAs) # # cat("\n") # # mRNAs_per_lncRNA <- sig_pairs %>% # group_by(lncRNA) %>% # summarize(n_mRNAs = n_distinct(mRNA)) %>% # arrange(desc(n_mRNAs)) # # print("mRNAs per lncRNA, significnat. mean, range:") # mean(mRNAs_per_lncRNA$n_mRNAs) # range(mRNAs_per_lncRNA$n_mRNAs) ``` Bash version: ``` bash cd ../output/21.1-Apul-mRNA-lncRNA-correlation-PCC/ #lncRNAs per mRNA awk -F, 'NR > 1 {pair[$1","$2]=1} END { for (p in pair) { split(p, a, ",") count[a[1]]++ } for (k in count) print count[k] }' sig_pairs.csv > lncRNAs_per_mRNA_counts.txt # Mean awk '{sum+=$1} END {print "Mean:", sum/NR}' lncRNAs_per_mRNA_counts.txt # Range sort -n lncRNAs_per_mRNA_counts.txt | awk 'NR==1{min=$1} {max=$1} END {print "Range:", min, max}' # mRNAs per lncRNA awk -F, 'NR > 1 {pair[$2","$1]=1} END { for (p in pair) { split(p, a, ",") count[a[1]]++ } for (k in count) print count[k] }' sig_pairs.csv > mRNAs_per_lncRNA_counts.txt # Mean awk '{sum+=$1} END {print "Mean:", sum/NR}' mRNAs_per_lncRNA_counts.txt # Range sort -n mRNAs_per_lncRNA_counts.txt | awk 'NR==1{min=$1} {max=$1} END {print "Range:", min, max}' ``` ## Mean: 3515.26 ## Range: 409 13731 ## Mean: 3753.48 ## Range: 646 7702 For the significant pairs (\|PCC\| \> 0.5, pval \< 0.05), the mRNAs are correlated with 490-13,731 unique lncRNAs (3,515 on average), while the lncRNAs are correlated with 646-7,702 unique mRNAs (3,753 on average).