21.1-Apul-mRNA-lncRNA-correlation-PCC
================
2025-06-23

- [0.0.1 Note](#001-note)

This code will use Pearson’s correlation coefficient to examine possible
correlations between mRNA and lncRNA expression.

Run with updated lncRNA counts matrix 8/5/2025. Counts were updated to
correct for a previous issue of lncRNA with all-zero counts, and then
filtered to only include lncRNA with at least 5 counts in at least 1
sample as valid.

Both matrices (the mRNA counts and lncRNA counts) are prohibitively
large (~30,000 rows). I’ve used the below code to create an R script
that will perform the PCC calculations, `run_mRNA_lncRNA_PCC.R` stored
in `D-Apul/output/21.2-Apul-mRNA-lncRNA-correlation-PCC`. To run the PCC
calcs in the background, allowing continued work in Rstudio, run this
code chunk from the terminal:

``` bash
cd ../output/21.1-Apul-mRNA-lncRNA-correlation-PCC/

Rscript run_mRNA_lncRNA_PCC_chunks100.R > run_mRNA_lncRNA_PCC_chunks100_log.txt 2>&1 &
```

Read in gene counts

``` r
mRNA_counts <- read.csv("../output/07-Apul-Hisat/Apul-gene_count_matrix.csv")
rownames(mRNA_counts) <- mRNA_counts$gene_id
mRNA_counts <- mRNA_counts %>% select(-gene_id)
# mRNA_counts %>%
# dplyr::rename("sample140"=`RNA.ACR.140`,
#          "sample145"=`RNA.ACR.145`,
#          "sample150"=`RNA.ACR.150`,
#          "sample173"=`RNA.ACR.173`,
#          "sample178"=`RNA.ACR.178`)

head(mRNA_counts)
```

    ##            RNA.ACR.140 RNA.ACR.145 RNA.ACR.150 RNA.ACR.173 RNA.ACR.178
    ## FUN_035039         553         340         256         485         510
    ## FUN_035038        2486         775         743        1250        1092
    ## FUN_035031          46           6          25          41          29
    ## FUN_035030         183         252          48          78          73
    ## FUN_035033        1519         311         555         990         370
    ## FUN_035032        1764        1297        1035        1763        1360

``` r
# Remove any mRNAs with 0 for all samples 
mRNA_counts <- mRNA_counts %>%
     mutate(Total = rowSums(.[, 1:5]))%>%
    filter(!Total==0)%>%
    dplyr::select(!Total)
```

Read in lncRNA data. Generated in
`/M-multi-species/output/01.6-lncRNA-pipeline/Apul-lncRNA-counts-filtered.txt`,
counts matrix available at
<https://raw.githubusercontent.com/urol-e5/deep-dive-expression/refs/heads/main/M-multi-species/output/01.6-lncRNA-pipeline/Apul-lncRNA-counts-filtered.txt>

``` r
# Updated counts matrix, accessed 8/5/2025
lncRNA_counts<-read_table(file="../../M-multi-species/output/01.6-lncRNA-pipeline/Apul-lncRNA-counts-filtered.txt") %>%
  dplyr::rename("lncrna_id"=Geneid,
         "sample140"=`X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Apul.lncRNA.pipeline.RNA.ACR.140.sorted.bam`,
         "sample145"=`X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Apul.lncRNA.pipeline.RNA.ACR.145.sorted.bam`,
         "sample150"=`X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Apul.lncRNA.pipeline.RNA.ACR.150.sorted.bam`,
         "sample173"=`X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Apul.lncRNA.pipeline.RNA.ACR.173.sorted.bam`,
         "sample178"=`X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Apul.lncRNA.pipeline.RNA.ACR.178.sorted.bam`)
```

    ## 
    ## ── Column specification ────────────────────────────────────────────────────────
    ## cols(
    ##   Geneid = col_character(),
    ##   Chr = col_character(),
    ##   Start = col_double(),
    ##   End = col_double(),
    ##   Strand = col_character(),
    ##   Length = col_double(),
    ##   X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Apul.lncRNA.pipeline.RNA.ACR.140.sorted.bam = col_double(),
    ##   X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Apul.lncRNA.pipeline.RNA.ACR.145.sorted.bam = col_double(),
    ##   X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Apul.lncRNA.pipeline.RNA.ACR.150.sorted.bam = col_double(),
    ##   X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Apul.lncRNA.pipeline.RNA.ACR.173.sorted.bam = col_double(),
    ##   X.home.shared.8TB_HDD_02.zbengt.github.deep.dive.expression.M.multi.species.data.01.6.Apul.lncRNA.pipeline.RNA.ACR.178.sorted.bam = col_double()
    ## )

``` r
# Change to df 
lncRNA_counts_df <- as.data.frame(lncRNA_counts) %>% select(!c("Chr", "Start", "End", "Strand", "Length"))
row.names(lncRNA_counts_df) <- lncRNA_counts_df[,1]
lncRNA_counts_df <- lncRNA_counts_df[,-1]  # remove the first column (gene names) if needed

# Remove any lncRNAs with 0 for all samples 
lncRNA_counts_df <- lncRNA_counts_df %>%
     mutate(Total = rowSums(.[, 1:5]))%>%
    filter(!Total==0)%>%
    dplyr::select(!Total)
```

Number of mRNA and lncRNA included in the correlation analysis

``` r
nrow(mRNA_counts)
```

    ## [1] 33624

``` r
nrow(lncRNA_counts)
```

    ## [1] 31490

33,624 mRNA and 31,490 lncRNA will be included in the PCC analysis.

Normalize counts

``` r
# Function to normalize counts (simple RPM normalization)
normalize_counts <- function(counts) {
  rpm <- t(t(counts) / colSums(counts)) * 1e6
  return(rpm)
}

# Normalize mRNA and mRNA counts
mRNA_norm <- normalize_counts(mRNA_counts)
#mRNA_norm <- as.matrix(mRNA_counts_filt)

lncRNA_norm <- normalize_counts(lncRNA_counts_df)
#mRNA_norm <- as.matrix(mRNA_counts_filt)
```

Calculate PCC

``` r
# Function to calculate PCC and p-value for a pair of vectors
calc_pcc <- function(x, y) {
  result <- cor.test(x, y, method = "pearson")
  return(c(PCC = result$estimate, p_value = result$p.value))
}

# Create a data frame of all mRNA-lncRNA pairs
pairs <- expand.grid(mRNA = rownames(mRNA_norm), lncRNA = rownames(lncRNA_norm))

# Calculate PCC and p-value for each pair
pcc_results <- pairs %>%
  rowwise() %>%
  mutate(
    pcc_stats = list(calc_pcc(mRNA_norm[mRNA,], lncRNA_norm[lncRNA,]))
  ) %>%
  unnest_wider(pcc_stats)

# Adjust p-values for FDR
pcc_results <- pcc_results %>%
  mutate(adjusted_p_value = p.adjust(p_value, method = "fdr"))

# Save
write.csv(pcc_results, "../output/28-Apul-mRNA-lncRNA-interactions/Apul-PCC_mRNA_lncRNA.csv")
```

### 0.0.1 Note

Below summary stats originally in R, but the PCC results file is too
large to load to an R object. Instead, duplicated the R logic in bash to
work directly with the CSV file

Read back in PCC results (available in large-file storage at
`https://gannet.fish.washington.edu/kdurkin1/ravenbackups/deep-dive-expression/D-Apul/output/21.1-Apul-mRNA-lncRNA-correlation-PCC/Apul-PCC_mRNA_lncRNA.csv`)

``` r
# pcc_results <- read.csv("https://gannet.fish.washington.edu/kdurkin1/ravenbackups/deep-dive-expression/D-Apul/output/21.1-Apul-mRNA-lncRNA-correlation-PCC/Apul-PCC_mRNA_lncRNA.csv")
```

Bash version:

``` bash
wget -O ../output/21.1-Apul-mRNA-lncRNA-correlation-PCC/Apul-PCC_mRNA_lncRNA.csv https://gannet.fish.washington.edu/kdurkin1/ravenbackups/deep-dive-expression/D-Apul/output/21.1-Apul-mRNA_lncRNA-correlation-PCC/Apul-PCC_mRNA_lncRNA.csv
```

Inspect the data

``` r
# nrow(pcc_results)
# length(unique(pcc_results$mRNA))
# length(unique(pcc_results$lncRNA))
# 
# # Are there any pairs that have a PCC correlation > |0.5| and a p-value < 0.05?
# sig_pairs <- pcc_results %>%
#   filter(abs(PCC.cor) > 0.5 & p_value < 0.05)
# 
# ## Count positive and negative PCC.cor values
# all_count <- nrow(sig_pairs)
# positive_count <- sum(sig_pairs$PCC.cor > 0)
# negative_count <- sum(sig_pairs$PCC.cor < 0)
# cat("Number of correlations with |PCC| > 0.5 and p-val < 0.05:", all_count, "\n")
# cat("Number of rows with positive PCC.cor:", positive_count, "(", (100*positive_count)/all_count, ")", "\n")
# cat("Number of rows with negative PCC.cor:", negative_count, "(", (100*negative_count)/all_count, ")", "\n")
```

Bash version:

``` bash
cd ../output/21.1-Apul-mRNA-lncRNA-correlation-PCC/
#head Apul-PCC_mRNA_lncRNA.csv

# Num of lines
tail -n +2 Apul-PCC_mRNA_lncRNA.csv | wc -l

# Num unique mRNAs and lncRNAs
awk -F, 'NR > 1 {print $1}' Apul-PCC_mRNA_lncRNA.csv | sort | uniq | wc -l
awk -F, 'NR > 1 {print $2}' Apul-PCC_mRNA_lncRNA.csv | sort | uniq | wc -l


# filter for significance
awk -F, 'NR==1 || (sqrt(($3)^2) > 0.5 && $4 < 0.05)' Apul-PCC_mRNA_lncRNA.csv > sig_pairs.csv

# Count significant pars and split by direction (pos or neg PCC)
# Total
tail -n +2 sig_pairs.csv | wc -l

# Positive PCC
awk -F, 'NR>1 && $3 > 0.5 && $4 < 0.05' sig_pairs.csv | wc -l

# Negative PCC
awk -F, 'NR>1 && $3 < -0.5 && $4 < 0.05' sig_pairs.csv | wc -l
```

    ## 118197142
    ## 33624
    ## 31490
    ## 118197142
    ## 107168482
    ## 11028660

33,624 mRNA significantly correlate (p-val \< 0.05) with a total of
31,490 lncRNA, with a total of 118,197,142 unique pairs with
significantly correlated expression.

Of these, all also had correlation magnitudes of at least 0.5. The
majority of correlations (107,168,482 or 90.7%) were positive, though
some (11,028,660 or 9.3%) were negative.

How many mRNAs per lncRNA and vice versa for the sig pairs?

``` r
# ## sig pairs 
# lncRNAs_per_mRNA <- sig_pairs %>%
#   group_by(mRNA) %>%
#   summarize(n_lncRNAs = n_distinct(lncRNA)) %>%
#   arrange(desc(n_lncRNAs))
# 
# print("lncRNAs per mRNA, significant. mean, range:")
# mean(lncRNAs_per_mRNA$n_lncRNAs)
# range(lncRNAs_per_mRNA$n_lncRNAs)
# 
# cat("\n")
# 
# mRNAs_per_lncRNA <- sig_pairs %>%
#   group_by(lncRNA) %>%
#   summarize(n_mRNAs = n_distinct(mRNA)) %>%
#   arrange(desc(n_mRNAs))
# 
# print("mRNAs per lncRNA, significnat. mean, range:")
# mean(mRNAs_per_lncRNA$n_mRNAs)
# range(mRNAs_per_lncRNA$n_mRNAs)
```

Bash version:

``` bash
cd ../output/21.1-Apul-mRNA-lncRNA-correlation-PCC/

#lncRNAs per mRNA
awk -F, 'NR > 1 {pair[$1","$2]=1} END {
  for (p in pair) {
    split(p, a, ",")
    count[a[1]]++
  }
  for (k in count) print count[k]
}' sig_pairs.csv > lncRNAs_per_mRNA_counts.txt

# Mean
awk '{sum+=$1} END {print "Mean:", sum/NR}' lncRNAs_per_mRNA_counts.txt

# Range
sort -n lncRNAs_per_mRNA_counts.txt | awk 'NR==1{min=$1} {max=$1} END {print "Range:", min, max}'


# mRNAs per lncRNA
awk -F, 'NR > 1 {pair[$2","$1]=1} END {
  for (p in pair) {
    split(p, a, ",")
    count[a[1]]++
  }
  for (k in count) print count[k]
}' sig_pairs.csv > mRNAs_per_lncRNA_counts.txt

# Mean
awk '{sum+=$1} END {print "Mean:", sum/NR}' mRNAs_per_lncRNA_counts.txt

# Range
sort -n mRNAs_per_lncRNA_counts.txt | awk 'NR==1{min=$1} {max=$1} END {print "Range:", min, max}'
```

    ## Mean: 3515.26
    ## Range: 409 13731
    ## Mean: 3753.48
    ## Range: 646 7702

For the significant pairs (\|PCC\| \> 0.5, pval \< 0.05), the mRNAs are
correlated with 490-13,731 unique lncRNAs (3,515 on average), while the
lncRNAs are correlated with 646-7,702 unique mRNAs (3,753 on average).