09.1-Peve-mRNA-lncRNA-correlation-PCC
================
2025-06-23

- [0.0.1 Note](#001-note)

This code will use Pearson’s correlation coefficient to examine possible
correlations between mRNA and lncRNA expression.

Run with updated lncRNA counts matrix 8/5/2025. Counts were updated to
correct for a previous issue of lncRNA with all-zero counts, and then
filtered to only include lncRNA with at least 5 counts in at least 1
sample as valid.

Both matrices (the mRNA counts and lncRNA counts) are prohibitively
large (~30,000 rows, and ~10,000 rows, respectively). I’ve used the
above code to create an R script that will perform the PCC calculations,
`run_mRNA_lncRNA_PCC.R` stored in
`E-Peve/output/21.2-Peve-mRNA-lncRNA-correlation-PCC`. To run the PCC
calcs in the background, allowing continued work in Rstudio, run this
code chunk from the terminal.

``` bash
cd ../output/09.1-Peve-mRNA-lncRNA-correlation-PCC/

Rscript run_mRNA_lncRNA_PCC_chunks100.R > run_mRNA_lncRNA_PCC_chunks100_log.txt 2>&1 &
```

Read in gene counts

``` r
mRNA_counts <- read.csv("../output/06.2-Peve-Hisat/Peve-gene_count_matrix.csv")
rownames(mRNA_counts) <- mRNA_counts$gene_id
mRNA_counts <- mRNA_counts %>% select(-gene_id)
mRNA_counts <- mRNA_counts %>% rename("sample71"=`RNA.POR.71`, 
                       "sample73"=`RNA.POR.73`, 
                       "sample76"=`RNA.POR.76`, 
                       "sample79"=`RNA.POR.79`, 
                       "sample82"=`RNA.POR.82`)

head(mRNA_counts)
```

    ##                    sample71 sample73 sample76 sample79 sample82
    ## gene-Peve_00006864        0        0        3      263       79
    ## gene-Peve_00022042      439      396      552      640      286
    ## gene-Peve_00009109       89      181       33      293     1789
    ## gene-Peve_00009108        0        0        0        0        0
    ## gene-Peve_00004359     4598     2449     6571     2745    10812
    ## gene-Peve_00004358      147      159       24      155      153

``` r
# Remove any mRNAs with 0 for all samples 
mRNA_counts <- mRNA_counts %>%
     mutate(Total = rowSums(.[, 1:5]))%>%
    filter(!Total==0)%>%
    dplyr::select(!Total)
```

Read in lncRNA data. Generated in
`E-Peve/code/32-Peve-lncRNA-matrix.Rmd`, counts matrix available at
<https://raw.githubusercontent.com/urol-e5/deep-dive-expression/refs/heads/main/E-Peve/output/32-Peve-lncRNA-matrix/Peve-lncRNA-counts.txt>

``` r
lncRNA_counts<-read_table(file="../../M-multi-species/output/01.6-lncRNA-pipeline/Peve-lncRNA-counts-filtered.txt") %>%
  rename("lncrna_id"=Geneid, 
         "sample71"=`X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Peve.lncRNA.pipeline.RNA.POR.71.sorted.bam`, 
         "sample73"=`X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Peve.lncRNA.pipeline.RNA.POR.73.sorted.bam`, 
         "sample76"=`X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Peve.lncRNA.pipeline.RNA.POR.76.sorted.bam`, 
         "sample79"=`X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Peve.lncRNA.pipeline.RNA.POR.79.sorted.bam`, 
         "sample82"=`X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Peve.lncRNA.pipeline.RNA.POR.82.sorted.bam`)
```

    ## 
    ## ── Column specification ────────────────────────────────────────────────────────
    ## cols(
    ##   Geneid = col_character(),
    ##   Chr = col_character(),
    ##   Start = col_double(),
    ##   End = col_double(),
    ##   Strand = col_character(),
    ##   Length = col_double(),
    ##   X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Peve.lncRNA.pipeline.RNA.POR.71.sorted.bam = col_double(),
    ##   X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Peve.lncRNA.pipeline.RNA.POR.73.sorted.bam = col_double(),
    ##   X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Peve.lncRNA.pipeline.RNA.POR.76.sorted.bam = col_double(),
    ##   X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Peve.lncRNA.pipeline.RNA.POR.79.sorted.bam = col_double(),
    ##   X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Peve.lncRNA.pipeline.RNA.POR.82.sorted.bam = col_double()
    ## )

``` r
# Change to df 
lncRNA_counts_df <- as.data.frame(lncRNA_counts) %>% select(!c("Chr", "Start", "End", "Strand", "Length"))
row.names(lncRNA_counts_df) <- lncRNA_counts_df[,1]
lncRNA_counts_df <- lncRNA_counts_df[,-1]  # remove the first column (gene names) if needed

# Remove any lncRNAs with 0 for all samples 
lncRNA_counts_df <- lncRNA_counts_df %>%
     mutate(Total = rowSums(.[, 1:5]))%>%
    filter(!Total==0)%>%
    dplyr::select(!Total)
```

Normalize counts

``` r
# Function to normalize counts (simple RPM normalization)
normalize_counts <- function(counts) {
  rpm <- t(t(counts) / colSums(counts)) * 1e6
  return(rpm)
}

# Normalize mRNA and mRNA counts
mRNA_norm <- normalize_counts(mRNA_counts)
#mRNA_norm <- as.matrix(mRNA_counts_filt)

lncRNA_norm <- normalize_counts(lncRNA_counts_df)
#mRNA_norm <- as.matrix(mRNA_counts_filt)
```

Calculate PCC

``` r
# Function to calculate PCC and p-value for a pair of vectors
calc_pcc <- function(x, y) {
  result <- cor.test(x, y, method = "pearson")
  return(c(PCC = result$estimate, p_value = result$p.value))
}

# Create a data frame of all mRNA-lncRNA pairs
pairs <- expand.grid(mRNA = rownames(mRNA_norm), lncRNA = rownames(lncRNA_norm))

# Calculate PCC and p-value for each pair
pcc_results <- pairs %>%
  rowwise() %>%
  mutate(
    pcc_stats = list(calc_pcc(mRNA_norm[mRNA,], lncRNA_norm[lncRNA,]))
  ) %>%
  unnest_wider(pcc_stats)

# Adjust p-values for FDR
pcc_results <- pcc_results %>%
  mutate(adjusted_p_value = p.adjust(p_value, method = "fdr"))

# Save
write.csv(pcc_results, "../output/09.1-Peve-mRNA-lncRNA-correlation-PCC/Peve-PCC_mRNA_lncRNA.csv")
```

Once you’ve generated the full correlation matrix, can toss the “chunks”
of correlations

``` bash
rm -r ../output/09.1-Peve-mRNA-lncRNA-correlation-PCC/chunk*.csv
```

### 0.0.1 Note

Below summary stats originally in R, but the PCC results file is too
large to load to an R object. Instead, duplicated the R logic in bash to
work directly with the CSV file

Read back in PCC results (available in large-file storage at
`https://gannet.fish.washington.edu/kdurkin1/ravenbackups/deep-dive-expression/E-Peve/output/09.1-Peve-mRNA-lncRNA-correlation-PCC/Peve-PCC_mRNA_lncRNA.csv`)

``` r
# pcc_results <- read.csv("https://gannet.fish.washington.edu/kdurkin1/ravenbackups/deep-dive-expression/E-Peve/output/09.1-Peve-mRNA-lncRNA-correlation-PCC/Peve-PCC_mRNA_lncRNA.csv")
```

Bash version:

``` bash
wget -O ../output/09.1-Peve-mRNA-lncRNA-correlation-PCC/Peve-PCC_mRNA_lncRNA.csv https://gannet.fish.washington.edu/kdurkin1/ravenbackups/deep-dive-expression/E-Peve/output/09.1-Peve-mRNA-lncRNA-correlation-PCC/Peve-PCC_mRNA_lncRNA.csv
```

Inspect the data

``` r
# nrow(pcc_results)
# length(unique(pcc_results$mRNA))
# length(unique(pcc_results$lncRNA))
# 
# # Are there any pairs that have a PCC correlation > |0.5| and a p-value < 0.05?
# sig_pairs <- pcc_results %>%
#   filter(abs(PCC.cor) > 0.5 & p_value < 0.05)
# 
# ## Count positive and negative PCC.cor values
# all_count <- nrow(sig_pairs)
# positive_count <- sum(sig_pairs$PCC.cor > 0)
# negative_count <- sum(sig_pairs$PCC.cor < 0)
# cat("Number of correlations with |PCC| > 0.5 and p-val < 0.05:", all_count, "\n")
# cat("Number of rows with positive PCC.cor:", positive_count, "(", (100*positive_count)/all_count, ")", "\n")
# cat("Number of rows with negative PCC.cor:", negative_count, "(", (100*negative_count)/all_count, ")", "\n")
```

Bash version:

``` bash
cd ../output/09.1-Peve-mRNA-lncRNA-correlation-PCC/
#head Peve-PCC_mRNA_lncRNA.csv

# Num of lines
tail -n +2 Peve-PCC_mRNA_lncRNA.csv | wc -l

# Num unique mRNAs and lncRNAs
awk -F, 'NR > 1 {print $1}' Peve-PCC_mRNA_lncRNA.csv | sort | uniq | wc -l
awk -F, 'NR > 1 {print $2}' Peve-PCC_mRNA_lncRNA.csv | sort | uniq | wc -l


# filter for significance
awk -F, 'NR==1 || (sqrt(($3)^2) > 0.5 && $4 < 0.05)' Peve-PCC_mRNA_lncRNA.csv > sig_pairs.csv

# Count significant pairs and split by direction (pos or neg PCC)
# Total
tail -n +2 sig_pairs.csv | wc -l

# Positive PCC
awk -F, 'NR>1 && $3 > 0.5 && $4 < 0.05' sig_pairs.csv | wc -l

# Negative PCC
awk -F, 'NR>1 && $3 < -0.5 && $4 < 0.05' sig_pairs.csv | wc -l
```

    ## 4866282
    ## 31497
    ## 1045
    ## 4866283
    ## 4724217
    ## 142066

31,497 mRNA significantly correlate (p-val \< 0.05) with a total of
1,045 lncRNA, to form a total of 4,866,282 unique pairs with
significantly correlated expression.

Of these, all also had correlation magnitudes of at least 0.5. The vast
majority of correlations (4,724,217 or 97.1%) were positive, though some
(142,066 or 2.9%) were negative.

How many mRNAs per lncRNA and vice versa for the sig pairs?

``` r
# ## sig pairs 
# lncRNAs_per_mRNA <- sig_pairs %>%
#   group_by(mRNA) %>%
#   summarize(n_lncRNAs = n_distinct(lncRNA)) %>%
#   arrange(desc(n_lncRNAs))
# 
# print("lncRNAs per mRNA, significant. mean, range:")
# mean(lncRNAs_per_mRNA$n_lncRNAs)
# range(lncRNAs_per_mRNA$n_lncRNAs)
# 
# cat("\n")
# 
# mRNAs_per_lncRNA <- sig_pairs %>%
#   group_by(lncRNA) %>%
#   summarize(n_mRNAs = n_distinct(mRNA)) %>%
#   arrange(desc(n_mRNAs))
# 
# print("mRNAs per lncRNA, significnat. mean, range:")
# mean(mRNAs_per_lncRNA$n_mRNAs)
# range(mRNAs_per_lncRNA$n_mRNAs)
```

Bash version:

``` bash
cd ../output/09.1-Peve-mRNA-lncRNA-correlation-PCC/

#lncRNAs per mRNA
awk -F, 'NR > 1 {pair[$1","$2]=1} END {
  for (p in pair) {
    split(p, a, ",")
    count[a[1]]++
  }
  for (k in count) print count[k]
}' sig_pairs.csv > lncRNAs_per_mRNA_counts.txt

# Mean
awk '{sum+=$1} END {print "Mean:", sum/NR}' lncRNAs_per_mRNA_counts.txt

# Range
sort -n lncRNAs_per_mRNA_counts.txt | awk 'NR==1{min=$1} {max=$1} END {print "Range:", min, max}'


# mRNAs per lncRNA
awk -F, 'NR > 1 {pair[$2","$1]=1} END {
  for (p in pair) {
    split(p, a, ",")
    count[a[1]]++
  }
  for (k in count) print count[k]
}' sig_pairs.csv > mRNAs_per_lncRNA_counts.txt

# Mean
awk '{sum+=$1} END {print "Mean:", sum/NR}' mRNAs_per_lncRNA_counts.txt

# Range
sort -n mRNAs_per_lncRNA_counts.txt | awk 'NR==1{min=$1} {max=$1} END {print "Range:", min, max}'
```

    ## Mean: 154.5
    ## Range: 1 483
    ## Mean: 4656.73
    ## Range: 359 8813

For the significant pairs (\|PCC\| \> 0.5, pval \< 0.05), the mRNAs are
correlated with 1-483 unique lncRNAs, while the lncRNAs are correlated
with 359-8,813 unique mRNAs. Generally, mRNA interact with far fewer
unique lncRNA (mean = 154.5) than do lncRNA with unique mRNA (mean =
4656.73).