10.1-Ptuh-mRNA-lncRNA-correlation-PCC
================
2025-06-23

This code will use Pearson’s correlation coefficient to examine possible
correlations between mRNA and lncRNA expression.

Run with updated lncRNA counts matrix 8/5/2025. Counts were updated to
correct for a previous issue of lncRNA with all-zero counts, and then
filtered to only include lncRNA with at least 5 counts in at least 1
sample as valid.

Both matrices (the mRNA counts and lncRNA counts) are prohibitively
large (~30,000 rows and ~15,000 rows, respectively). I’ve used the above
code to create an R script that will perform the PCC calculations,
`run_mRNA_lncRNA_PCC.R` stored in
`F-Ptuh/output/21.2-Ptuh-mRNA-lncRNA-correlation-PCC`. To run the PCC
calcs in the background, allowing continued work in Rstudio, run this
code chunk from the terminal.

``` bash
cd ../output/10.1-Ptuh-mRNA-lncRNA-correlation-PCC/

Rscript run_mRNA_lncRNA_PCC_chunks100.R > run_mRNA_lncRNA_PCC_chunks100_log.txt 2>&1 &
```

Once the script is finished running, you can toss the intermediate
“chunk” files

``` bash
cd ../output/10.1-Ptuh-mRNA-lncRNA-correlation-PCC/

rm -r chunk*.csv
```

Read in gene counts

``` r
mRNA_counts <- read.csv("../output/06.2-Ptuh-Hisat/Ptuh-gene_count_matrix.csv")
rownames(mRNA_counts) <- mRNA_counts$gene_id
mRNA_counts <- mRNA_counts %>% dplyr::select(-gene_id)
mRNA_counts <- mRNA_counts %>% dplyr::rename( 
         "sample47"=`RNA.POC.47`, 
         "sample48"=`RNA.POC.48`, 
         "sample50"=`RNA.POC.50`, 
         "sample53"=`RNA.POC.53`, 
         "sample57"=`RNA.POC.57`)

#head(mRNA_counts)

# Remove any mRNAs with 0 for all samples 
mRNA_counts <- mRNA_counts %>%
     mutate(Total = rowSums(.[, 1:5]))%>%
    filter(!Total==0)%>%
    dplyr::select(!Total)

# Number of mRNA
nrow(mRNA_counts)
```

    ## [1] 26508

Read in lncRNA data. Generated in
`M-multi-species/output/01.6-lncRNA-pipeline`, counts matrix available
at
<https://raw.githubusercontent.com/urol-e5/deep-dive-expression/refs/heads/main//M-multi-species/output/01.6-lncRNA-pipeline/Ptuh-lncRNA-counts-filtered.txt>

``` r
lncRNA_counts<-read_table(file="../../M-multi-species/output/01.6-lncRNA-pipeline/Ptuh-lncRNA-counts-filtered.txt") %>%
  dplyr::rename("lncrna_id"=Geneid, 
         "sample47"=`...output.01.6.Ptuh.lncRNA.pipeline.RNA.POC.47.S1.TP2.sorted.bam`, 
         "sample48"=`...output.01.6.Ptuh.lncRNA.pipeline.RNA.POC.48.S1.TP2.sorted.bam`, 
         "sample50"=`...output.01.6.Ptuh.lncRNA.pipeline.RNA.POC.50.S1.TP2.sorted.bam`, 
         "sample53"=`...output.01.6.Ptuh.lncRNA.pipeline.RNA.POC.53.S1.TP2.sorted.bam`, 
         "sample57"=`...output.01.6.Ptuh.lncRNA.pipeline.RNA.POC.57.S1.TP2.sorted.bam`)
```

    ## 
    ## ── Column specification ────────────────────────────────────────────────────────
    ## cols(
    ##   Geneid = col_character(),
    ##   Chr = col_character(),
    ##   Start = col_double(),
    ##   End = col_double(),
    ##   Strand = col_character(),
    ##   Length = col_double(),
    ##   ...output.01.6.Ptuh.lncRNA.pipeline.RNA.POC.47.S1.TP2.sorted.bam = col_double(),
    ##   ...output.01.6.Ptuh.lncRNA.pipeline.RNA.POC.48.S1.TP2.sorted.bam = col_double(),
    ##   ...output.01.6.Ptuh.lncRNA.pipeline.RNA.POC.50.S1.TP2.sorted.bam = col_double(),
    ##   ...output.01.6.Ptuh.lncRNA.pipeline.RNA.POC.53.S1.TP2.sorted.bam = col_double(),
    ##   ...output.01.6.Ptuh.lncRNA.pipeline.RNA.POC.57.S1.TP2.sorted.bam = col_double()
    ## )

``` r
# Change to df 
lncRNA_counts_df <- as.data.frame(lncRNA_counts) %>% dplyr::select(!c("Chr", "Start", "End", "Strand", "Length"))
row.names(lncRNA_counts_df) <- lncRNA_counts_df[,1]
lncRNA_counts_df <- lncRNA_counts_df[,-1]  # remove the first column (gene names) if needed

# Remove any lncRNAs with 0 for all samples 
lncRNA_counts_df <- lncRNA_counts_df %>%
     mutate(Total = rowSums(.[, 1:5]))%>%
    filter(!Total==0)%>%
    dplyr::select(!Total)

# Number of lncRNA
nrow(lncRNA_counts_df)
```

    ## [1] 16152

Normalize counts

``` r
# Function to normalize counts (simple RPM normalization)
normalize_counts <- function(counts) {
  rpm <- t(t(counts) / colSums(counts)) * 1e6
  return(rpm)
}

# Normalize mRNA and mRNA counts
mRNA_norm <- normalize_counts(mRNA_counts)
#mRNA_norm <- as.matrix(mRNA_counts_filt)

lncRNA_norm <- normalize_counts(lncRNA_counts_df)
#mRNA_norm <- as.matrix(mRNA_counts_filt)
```

Calculate PCC

``` r
# Function to calculate PCC and p-value for a pair of vectors
calc_pcc <- function(x, y) {
  result <- cor.test(x, y, method = "pearson")
  return(c(PCC = result$estimate, p_value = result$p.value))
}

# Create a data frame of all mRNA-lncRNA pairs
pairs <- expand.grid(mRNA = rownames(mRNA_norm), lncRNA = rownames(lncRNA_norm))

# Calculate PCC and p-value for each pair
pcc_results <- pairs %>%
  rowwise() %>%
  mutate(
    pcc_stats = list(calc_pcc(mRNA_norm[mRNA,], lncRNA_norm[lncRNA,]))
  ) %>%
  unnest_wider(pcc_stats)

# Adjust p-values for FDR
pcc_results <- pcc_results %>%
  mutate(adjusted_p_value = p.adjust(p_value, method = "fdr"))

# Save
write.csv(pcc_results, "../output/10.1-Ptuh-mRNA-lncRNA-correlation-PCC/Ptuh-PCC_mRNA_lncRNA.csv")
```

Read back in PCC results (available in large-file storage at
`https://gannet.fish.washington.edu/kdurkin1/ravenbackups/deep-dive-expression/F-Ptuh/output/10.1-Ptuh-mRNA-lncRNA-correlation-PCC/Ptuh-PCC_mRNA_lncRNA.csv`)

``` r
# pcc_results <- read.csv("https://gannet.fish.washington.edu/kdurkin1/ravenbackups/deep-dive-expression/F-Ptuh/output/10.1-Ptuh-mRNA-lncRNA-correlation-PCC/Ptuh-PCC_mRNA_lncRNA.csv")
```

Inspect the data

``` r
# nrow(pcc_results)
# length(unique(pcc_results$mRNA))
# length(unique(pcc_results$lncRNA))
# 
# # Are there any pairs that have a PCC correlation > |0.5| and a p-value < 0.05?
# sig_pairs <- pcc_results %>%
#   filter(abs(PCC.cor) > 0.5 & p_value < 0.05)
# 
# ## Count positive and negative PCC.cor values
# all_count <- nrow(sig_pairs)
# positive_count <- sum(sig_pairs$PCC.cor > 0)
# negative_count <- sum(sig_pairs$PCC.cor < 0)
# cat("Number of correlations with |PCC| > 0.5 and p-val < 0.05:", all_count, "\n")
# cat("Number of rows with positive PCC.cor:", positive_count, "(", (100*positive_count)/all_count, ")", "\n")
# cat("Number of rows with negative PCC.cor:", negative_count, "(", (100*negative_count)/all_count, ")", "\n")
```

Bash version:

``` bash
cd ../output/10.1-Ptuh-mRNA-lncRNA-correlation-PCC/
#head Ptuh-PCC_mRNA_lncRNA.csv

# Num of lines
tail -n +2 Ptuh-PCC_mRNA_lncRNA.csv | wc -l

# Num unique mRNAs and lncRNAs
awk -F, 'NR > 1 {print $1}' Ptuh-PCC_mRNA_lncRNA.csv | sort | uniq | wc -l
awk -F, 'NR > 1 {print $2}' Ptuh-PCC_mRNA_lncRNA.csv | sort | uniq | wc -l


# filter for significance
awk -F, 'NR==1 || (sqrt(($3)^2) > 0.5 && $4 < 0.05)' Ptuh-PCC_mRNA_lncRNA.csv > sig_pairs.csv

# Count significant pars and split by direction (pos or neg PCC)
# Total
tail -n +2 sig_pairs.csv | wc -l

# Positive PCC
awk -F, 'NR>1 && $3 > 0.5 && $4 < 0.05' sig_pairs.csv | wc -l

# Negative PCC
awk -F, 'NR>1 && $3 < -0.5 && $4 < 0.05' sig_pairs.csv | wc -l
```

    ## 26751367
    ## 26508
    ## 16152
    ## 26751367
    ## 19681301
    ## 7070066

26,508 mRNA significantly correlate (p-val \< 0.05) with a total of
16,152 lncRNA, with a total of 26,751,367 unique pairs with
significantly correlated expression. This means that every single one of
our mRNA significantly correlates with at least one lncRNA, and every
lncRNA significantly correlates with at least one mRNA.

Of these, all also had correlation magnitudes of at least 0.5. The
majority of correlations (19,681,301 or 73.6%) were positive, though a
fair number (7,070,066 or 26.4%) were negative.

How many mRNAs per lncRNA and vice versa for the sig pairs?

``` r
# ## sig pairs 
# lncRNAs_per_mRNA <- sig_pairs %>%
#   group_by(mRNA) %>%
#   summarize(n_lncRNAs = n_distinct(lncRNA)) %>%
#   arrange(desc(n_lncRNAs))
# 
# print("lncRNAs per mRNA, significant. mean, range:")
# mean(lncRNAs_per_mRNA$n_lncRNAs)
# range(lncRNAs_per_mRNA$n_lncRNAs)
# 
# cat("\n")
# 
# mRNAs_per_lncRNA <- sig_pairs %>%
#   group_by(lncRNA) %>%
#   summarize(n_mRNAs = n_distinct(mRNA)) %>%
#   arrange(desc(n_mRNAs))
# 
# print("mRNAs per lncRNA, significnat. mean, range:")
# mean(mRNAs_per_lncRNA$n_mRNAs)
# range(mRNAs_per_lncRNA$n_mRNAs)
```

Bash version:

``` bash
cd ../output/10.1-Ptuh-mRNA-lncRNA-correlation-PCC/

#lncRNAs per mRNA
awk -F, 'NR > 1 {pair[$1","$2]=1} END {
  for (p in pair) {
    split(p, a, ",")
    count[a[1]]++
  }
  for (k in count) print count[k]
}' sig_pairs.csv > lncRNAs_per_mRNA_counts.txt

# Mean
awk '{sum+=$1} END {print "Mean:", sum/NR}' lncRNAs_per_mRNA_counts.txt

# Range
sort -n lncRNAs_per_mRNA_counts.txt | awk 'NR==1{min=$1} {max=$1} END {print "Range:", min, max}'


# mRNAs per lncRNA
awk -F, 'NR > 1 {pair[$2","$1]=1} END {
  for (p in pair) {
    split(p, a, ",")
    count[a[1]]++
  }
  for (k in count) print count[k]
}' sig_pairs.csv > mRNAs_per_lncRNA_counts.txt

# Mean
awk '{sum+=$1} END {print "Mean:", sum/NR}' mRNAs_per_lncRNA_counts.txt

# Range
sort -n mRNAs_per_lncRNA_counts.txt | awk 'NR==1{min=$1} {max=$1} END {print "Range:", min, max}'
```

    ## Mean: 1009.18
    ## Range: 133 3759
    ## Mean: 1656.23
    ## Range: 463 3266

For the significant pairs (\|PCC\| \> 0.5, pval \< 0.05), the mRNAs are
correlated with 133-3,759 unique lncRNAs (1009.18 on average), while the
lncRNAs are correlated with 463-3,266 unique mRNAs (1656.23 on average).