This file is an initial attempt to run WGCNA on a single crab over time. It is based largely on Yaamini’s script, which is based largely on the official WGCNA tutorial

We will look solely at C. bairdi genes for Crab A

This corresponds to Library IDs 178, 359, and 463 (Days 0, 2, and 17) for that crab

We will first extract the TPM (transcripts per million) counts from the kallisto libraries created earlier in the pipeline. We will then change those to logTPM counts, and then begin the WGCNA analysis

library(tidyverse)
library(WGCNA)
library(DESeq2)

Step 1: Extract TPM Counts from Kallisto Libraries

This portion of the script is based largely off 21_obtaining_TPM_for_DEGs.Rmd

First, set all variables

# Path to kallisto libraries
kallisto_path <- "../output/kallisto_libraries/cbai_transcriptomev4.0/"

# Libraries we want to read in to our TPM matrix
libraries <- c("178", "359", "463")

crabTraits <- data.frame("infected" = c(rep(1, times = 3)),
                         "day" = c(0, 2, 17))

Then, we begin creating our TPM matrix for all transcripts

# Create character vector with all filenames for our libraries
kallisto_files <- paste0(kallisto_path, "id", libraries, "/abundance.tsv")

# Read first kallisto file in to start data frame
TPMcounts <- read.delim(file = kallisto_files[1],
                        header = TRUE,
                        sep = "\t")
# Eliminate all columns except transcript ID and TPM
TPMcounts <- TPMcounts %>%
  select(target_id, tpm)

# Rename columns for consistency and to ID TPM counts
colnames(TPMcounts)[1:2] <- c("Transcript_ID",
                              paste0(libraries[1], "_TPM"))

# Loop through remaining kallisto files, performing full joins to the kallisto file we read in
for (i in 2:length(kallisto_files)){
  idnum <- str_extract(kallisto_files[i], "id[0-9]+")
  kallisto_output <- read.delim(file = kallisto_files[i],
                                header = TRUE,
                                sep = "\t")
  # Select only transcript ID and TPM (transcripts per million) columns
  kallisto_output <- kallisto_output %>%
    select(target_id, tpm)
  # Rename kallisto column names to give ID to count column
  colnames(kallisto_output)[1:2] <- c("Transcript_ID", 
                                      paste0(idnum, "_TPM"))
  # Add TPM value to table of DEGs
  # Perform full join, keeping all transcript IDs
  TPMcounts <- full_join(TPMcounts, kallisto_output, by = "Transcript_ID")
}

WGCNA has several recommendations when it comes to RNAseq data, available in the FAQ here. First, they suggest removing all transcripts with counts below 10 in over 90% of samples. Since we only have 3 samples, we will remove all transcripts that have no counts above 10.

They also suggest a variance-stabilizing transformation or log-transforming the counts using log2(x+1). A variance-stabilizing transformation using DESeq2 would be ideal, but DESeq2 doesn’t work on datasets without replication.

We will change our data to fit both of these recommendations below. After, we will transpose the data frame so samples are rows and transcripts are columns

# Create logical matrix for whole dataframe, comparing values to 10

# Move transcript ID to rownames
TPMcounts <- TPMcounts %>%
  column_to_rownames(var = "Transcript_ID")

# Get initial dimensions of data frame
dim(TPMcounts)
## [1] 88302     3
# Filter out all variables with no counts greater than 10
TPMcounts <- TPMcounts %>%
  filter_all(any_vars(. > 10))

# See how many transcripts we have left
dim(TPMcounts)
## [1] 5946    3
# Log-transform TPM counts
TPMcounts <- log2(TPMcounts + 1)

# Transpose dataframe to format for WGCNA
CrabExpr0 <- as.data.frame(t(TPMcounts))

# Check dataframe was transposed correctly
dim(CrabExpr0)
## [1]    3 5946

We will now begin analysis with WGCNA. Our script is based largely on Yaamini’s WGCNA script, which is based largely on the WGCNA tutorial

Check for Missing Values and Outliers

# Check for genes and samples with too many missing values
# This won't work here, because we only have 3 samples, and WGCNA requires more than 3, but hey - it's just a practice run. We'll therefore continue regardless, with this section commented out

#gsg <- goodSamplesGenes(CrabExpr0, verbose = 3)
#gsg$allOK      # should return TRUE if all genes pass test

sampleTree <- hclust(dist(CrabExpr0), method = "average")
plot(sampleTree)

Create clinical trait data matrix

# Print the crabTraits matrix we made earlier
head(crabTraits)
##   infected day
## 1        1   0
## 2        1   2
## 3        1  17
# Use same rownames as expression data to create analogous  matrix
rownames(crabTraits) <- rownames(CrabExpr0)
# Make sure it looks good
head(crabTraits)
##           infected day
## 178_TPM          1   0
## id359_TPM        1   2
## id463_TPM        1  17
# Create a dendrogram to look at sample and trait clustering
sampleTree2 <- hclust(dist(CrabExpr0), method = "average")
# Convert traits to color values
traitColors <- numbers2colors(crabTraits, signed = FALSE)
## Warning in numbers2colors(crabTraits, signed = FALSE): (some columns in) 'x' are
## constant. Their color will be the color of NA.
# Plot dendrogram
plotDendroAndColors(sampleTree2, traitColors, 
                    groupLabels = names(crabTraits))

Network Connection and Module Detection

Relevant tutorial

Determine soft-thresholding power

# Create set of soft-thresholding powers
powers <- c(c(1:10), seq(from = 12, to = 20, by = 2))
# Use network topology analysis function to eval soft-thresholding power vals
sft <- pickSoftThreshold(CrabExpr0, powerVector = powers, verbose = 5)
## pickSoftThreshold: will use block size 5946.
##  pickSoftThreshold: calculating connectivity for given powers...
##    ..working on genes 1 through 5946 of 5946
## Warning: executing %dopar% sequentially: no parallel backend registered
##    Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
## 1      1   0.5720 2.720          0.829    4900      5270   5350
## 2      2   0.4200 1.440          0.777    4380      4890   5020
## 3      3   0.3060 0.995          0.780    4040      4620   4790
## 4      4   0.2610 0.797          0.777    3790      4400   4610
## 5      5   0.2190 0.669          0.785    3590      4230   4470
## 6      6   0.1930 0.589          0.789    3430      4080   4350
## 7      7   0.1690 0.541          0.817    3300      3940   4240
## 8      8   0.1500 0.486          0.816    3180      3820   4150
## 9      9   0.1360 0.444          0.811    3070      3710   4070
## 10    10   0.1230 0.412          0.812    2980      3620   3990
## 11    12   0.1010 0.353          0.807    2820      3440   3860
## 12    14   0.0800 0.302          0.796    2690      3290   3740
## 13    16   0.0655 0.264          0.788    2580      3160   3640
## 14    18   0.0554 0.235          0.789    2480      3040   3550
## 15    20   0.0466 0.210          0.782    2390      2930   3470
# Plot scale-free topology fit as function of soft-thresholding power
# We will select the lowest power that meets our R2 cutoff

# Again, since we have just 3 samples, none of our values come remotely close to our R2 cutoff, Therefore, we will select 20, as it is the closest.

plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     xlab = "Soft Threshold (power)",
     ylab = "Scale Free Topology Model Fit, signed R^2",
     type = "n",
     main = paste("Scale independence"))
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
                               labels = powers,
                               cex = 1,
                               col = "red")
abline(h = 0.90, col = "red")

# Plot mean connectivity as function of soft-thresholding power
plot(sft$fitIndices[,1],sft$fitIndices[,5],
     xlab = "Soft Threshold (power)",
     ylab = "Mean Connectivity",
     type = "n",
     main = paste("Mean connectivity"))
# Add sft values
text(sft$fitIndices[,1], sft$fitIndices[,5],
     labels = powers,
     cex = 1,
     col = "red")

Co-expression similarity and adjacency, and Topological Overlap Matrix (TOM)

softPower <- 20
adjacency <- adjacency(CrabExpr0, power = softPower)

# Minimize noise and spurious associations by transforming adjacency into TOM
TOM <- TOMsimilarity(adjacency)
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
#Calculate dissimilarity matrix
dissTOM <- 1 - TOM

# Clustering using TOM

# Create hierarchical clustering object
geneTree <- hclust(as.dist(dissTOM), method = "average")
# Plot initial dendrogram. Dissimilarity is based on topological overlap
plot(geneTree, xlab = "", sub = "", 
     main = "Gene clustering on TOM-based dissimilarity",
     labels = FALSE,
     hang = 0.04)

# Set minimum module size, AKA num of genes that need to be in a module. Here, using WGCNA default
minModuleSize <- 30
# Cut branches of dendrogram to ID WGCNA modules
dynamicMods <- cutreeDynamic(dendro =  geneTree,
                             distM = dissTOM,
                             deepSplit = 2,
                             pamRespectsDendro = FALSE,
                             minClusterSize = minModuleSize)
##  ..cutHeight not given, setting it to 0.959  ===>  99% of the (truncated) height range in dendro.
##  ..done.
# Look at table of modules
table(dynamicMods)
## dynamicMods
##    1    2    3    4    5    6    7    8 
## 2217 2056  581  518  298  150   85   41
# Convert module numbers into colors
dynamicColors <- labels2colors(dynamicMods)

# Plot dendrogram with module colors
plotDendroAndColors(geneTree, dynamicColors, "Dynamic Tree Cut",
                    dendroLabels = FALSE,
                    hang = 0.03,
                    addGuide = TRUE,
                    guideHang = 0.05,
                    main = "Gene dendrogram and module colors")

Merge modules with similar expression profiles

Merging lets us combine modules with genes that are highly co-expressed. To do this, we can create and cluster eigengenes

# Calculate eigengenes
MElist <- moduleEigengenes(CrabExpr0, colors = dynamicColors)
# Save eigengenes as new object
MEs <- MElist$eigengenes
# Calculate dissimilarity of eigengenes
MEDiss <- 1-cor(MEs)
# Create cluster object
METree <- hclust(as.dist(MEDiss), method = "average")

# Plot dendrogram of clustered eigengenes
plot(METree, main = "Clustering of module eigengenes",
     xlab = "",
     sub = "")
# ID cut height based on sample number (3)
dynamicMergeCut(3)
## Warning in function dynamicMergeCut: too few observations for the dynamic assignment of the merge threshold.
##     Will set the threshold to .35
## [1] 0.35
MEDissThres <- dynamicMergeCut(3)
## Warning in function dynamicMergeCut: too few observations for the dynamic assignment of the merge threshold.
##     Will set the threshold to .35
abline(h = MEDissThres, col = "red")

merge <- mergeCloseModules(CrabExpr0, dynamicColors,
                           cutHeight = MEDissThres,
                           verbose = 3)
##  mergeCloseModules: Merging modules whose distance is less than 0.35
##    multiSetMEs: Calculating module MEs.
##      Working on set 1 ...
##      moduleEigengenes: Calculating 8 module eigengenes in given set.
##    multiSetMEs: Calculating module MEs.
##      Working on set 1 ...
##      moduleEigengenes: Calculating 3 module eigengenes in given set.
##    Calculating new MEs...
##    multiSetMEs: Calculating module MEs.
##      Working on set 1 ...
##      moduleEigengenes: Calculating 3 module eigengenes in given set.
# Extract merged colors and eigengenes
mergedColors <- merge$colors
mergedMEs <- merge$newMEs

# Plot dendrogram with original and merged eigengenes
plotDendroAndColors(geneTree, cbind(dynamicColors, mergedColors),
                    c("Dynamic Tree Cut", "Merged dynamic"),
                    dendroLabels = FALSE,
                    hang = 0.03,
                    addGuide = TRUE,
                    guideHang = 0.05)

# Rename and save variables for subsequent analysis

moduleColors <- mergedColors
colorOrder <- c("grey", standardColors(50)) # Determine color order
moduleLabels <- match(moduleColors, colorOrder)-1 # Construct numerical labels based on colors
MEs <- mergedMEs # Replace unmerged MEs

Relate modules to external traits

# Count the number of genes and samples
nGenes <- ncol(CrabExpr0)
nSamples <- nrow(CrabExpr0)

# Recalculate MEs with color labels, order MEs based on MEs0
MEs0 <- moduleEigengenes(CrabExpr0, moduleColors)$eigengenes
MEs <- orderMEs(MEs0)

# Calculate trait correlations and obtain p-values
moduleTraitCor <- cor(MEs, crabTraits, use = "p")
## Warning in cor(MEs, crabTraits, use = "p"): the standard deviation is zero
moduleTraitPvalue <- corPvalueStudent(moduleTraitCor, nSamples)

# Create text matrix for correlations and their p-values
textMatrix <- paste(signif(moduleTraitCor, 2), "\n(",
                    signif(moduleTraitPvalue, 1), ")", sep = "")
dim(textMatrix)
## NULL