--- title: "26.2-rank35-optimization-ElasticNet-component24-scaling" author: "Kathleen Durkin" date: "2025-11-12" output: github_document: toc: true number_sections: true html_document: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true bookdown::html_document2: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true bibliography: references.bib --- Steven has run some optimization tests to settle on a rank of 35 for the Barnacle tensor decomposition tool (`26-rank35-optimization`). I'm taking the gene set from component 24 of this run and applying our Elastic Net model to predict expression of this gene set from multiomic epigenetic predictors (miRNA + lncRNA + CpG site methylation), for all 3 species. UPDATES: I'm making 1 major change from the previous iteration (26.1): Scale standardization. Regression techniques require inputs to be on similar scales. That is not the case for our inputs, for two reasons. First, RNA counts are unbounded (could have thousands of transcripts for single RNA), but CpG site methylation is bounded on a 0-100 % scale. I'm transforming methylation from beta values (0% - 100%) to M values (log2 transformation of ratio of unmethyl/methyl). M values are unbounded (like expression counts), and also more appropriate for regression analyses because they remove heterodescasticity without need to run methylation values through a DESeq vst transformation, which may not be appropriate for use on data other than gene counts. Second, to get all predictor blocks (miRNA, lncRNA, and WGBS Mvalues on the same scale, ill be scaling each before combining into the full predictor set.) Inputs: - RNA counts matrix (for gene set, raw): - `https://github.com/urol-e5/timeseries_molecular/raw/refs/heads/main/M-multi-species/output/26-rank35-optimization/lambda_gene_0.2/top_genes_per_component/Component_24_apul_count_matrix.csv` - `https://github.com/urol-e5/timeseries_molecular/raw/refs/heads/main/M-multi-species/output/26-rank35-optimization/lambda_gene_0.2/top_genes_per_component/Component_24_peve_count_matrix.csv` - `https://github.com/urol-e5/timeseries_molecular/raw/refs/heads/main/M-multi-species/output/26-rank35-optimization/lambda_gene_0.2/top_genes_per_component/Component_24_ptua_count_matrix.csv` - miRNA counts matrix (raw): - `https://github.com/urol-e5/timeseries_molecular/raw/refs/heads/main/M-multi-species/output/10.1-format-miRNA-counts-updated/Apul_miRNA_counts_formatted_updated.txt` - `https://github.com/urol-e5/timeseries_molecular/raw/refs/heads/main/M-multi-species/output/10.1-format-miRNA-counts-updated/Peve_miRNA_counts_formatted_updated.txt` - `https://github.com/urol-e5/timeseries_molecular/raw/refs/heads/main/M-multi-species/output/10.1-format-miRNA-counts-updated/Ptuh_miRNA_counts_formatted_updated.txt` - lncRNA counts matrix (raw): - `https://gannet.fish.washington.edu/v1_web/owlshell/bu-github/timeseries_molecular/D-Apul/output/31.5-Apul-lncRNA-discovery/lncRNA_counts.clean.filtered.txt` - `https://gannet.fish.washington.edu/v1_web/owlshell/bu-github/timeseries_molecular/E-Peve/output/12-Peve-lncRNA-discovery/lncRNA_counts.clean.filtered.txt` - `https://gannet.fish.washington.edu/v1_web/owlshell/bu-github/timeseries_molecular/F-Ptua/output/06-Ptua-lncRNA-discovery/lncRNA_counts.clean.filtered.txt` - WGBS data (processed): Shelly's generated a WGBS count tables for a bunch of different parameter combinations (alignment score and sample coverage) (see details in [the `timeseries_molecular` wiki](https://github.com/urol-e5/timeseries_molecular/wiki/02%E2%80%90Expression-Count-Matrices)). For now, let's go with the most conservative parameter combination available for all 3 species, which is an alignment score of -0.8 and coverage in all samples (n=39 in Apul, n=37 in Peve, and n=32 in Ptuh): - `https://gannet.fish.washington.edu/gitrepos/urol-e5/timeseries_molecular/D-Apul/output/22.5-Apul-multiomic-SR/Apul-filtered-WGBS-CpG-counts.csv` - `https://gannet.fish.washington.edu/metacarcinus/E5/Pevermanni/20250821_meth_Peve/merged-WGBS-CpG-counts_filtered.csv` - `https://gannet.fish.washington.edu/metacarcinus/E5/Ptuahiniensis/20250821_meth_Ptua/merged-WGBS-CpG-counts_filtered.csv` - sample metadata: `../../M-multi-species/data/rna_metadata.csv` # Set up ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, cache=TRUE) ``` # Load libraries ```{r, warning=FALSE, message=FALSE} library(tidyverse) library(ggplot2) library(DESeq2) library(igraph) library(psych) library(tidygraph) library(ggraph) library(WGCNA) library(edgeR) library(reshape2) library(ggcorrplot) library(corrplot) library(rvest) library(purrr) library(pheatmap) library(glmnet) library(caret) library(factoextra) library(vegan) library(ggfortify) library(genefilter) library(scales) library(purrr) ``` The model includes random processes, so set a seed for reproducability ```{r} set.seed(703) ``` # Define model functions ```{r} train_models <- function(response_features, predictor_features) { models <- list() for (feature in colnames(response_features)) { y <- response_features[[feature]] # Gene expression X <- as.matrix(predictor_features) # miRNA/lncRNA/methylation as predictors # Train elastic net model (alpha = 0.5 for mix of LASSO & Ridge) # Reducing nfolds from 10 (default) to 5, since I'm using smaller sample sizes of just 10 samples per timepoint model <- cv.glmnet(X, y, alpha = 0.5, nfolds = 3) models[[feature]] <- model } return(models) } get_feature_importance <- function(models) { importance_list <- lapply(models, function(model) { coefs <- as.matrix(coef(model, s = "lambda.min"))[-1, , drop = FALSE] # Convert to regular matrix & remove intercept # Convert to data frame coefs_df <- data.frame(Feature = rownames(coefs), Importance = as.numeric(coefs)) return(coefs_df) }) # Combine feature importance across all predicted genes importance_df <- bind_rows(importance_list) %>% group_by(Feature) %>% summarize(MeanImportance = mean(abs(Importance)), .groups = "drop") %>% arrange(desc(MeanImportance)) return(importance_df) } evaluate_model_performance <- function(models, response_features, predictor_features) { results <- data.frame(Feature = colnames(response_features), R2 = NA) for (feature in colnames(response_features)) { y <- response_features[[feature]] X <- as.matrix(predictor_features) model <- models[[feature]] preds <- predict(model, X, s = "lambda.min") R2 <- cor(y, preds)^2 # R-squared metric results[results$Feature == feature, "R2"] <- R2 } return(results) } ``` # Define training and testing function ```{r} train_models_split <- function(response_features, predictor_features, train_frac = 0.8, alpha = 0.5, nfolds = 10) { # Set up storage models <- list() performance <- data.frame(Feature = colnames(response_features), R2 = NA) # Select fraction of data sample_idx <- sample(seq_len(nrow(predictor_features)), size = floor(train_frac * nrow(predictor_features))) # Split predictor data X_train <- predictor_features[sample_idx, ] X_test <- predictor_features[-sample_idx, ] for (feature in colnames(response_features)) { # Split response data y_train <- response_features[sample_idx, feature] y_test <- response_features[-sample_idx, feature] # Check if y_train is constant if (length(unique(y_train)) == 1) { warning(sprintf("Skipping feature '%s': y_train is constant", feature)) next } # Run model on training data model <- cv.glmnet(as.matrix(X_train), y_train, alpha = alpha, nfolds = nfolds) models[[feature]] <- model # Now apply the trained model to test data preds <- predict(model, newx = as.matrix(X_test), s = "lambda.min") R2 <- cor(y_test, preds)^2 performance[performance$Feature == feature, "R2"] <- R2 } return(list(models = models, performance = performance)) } ``` # A. pulchra ## Load and format data ```{r} ### mRNA ### # raw gene counts data (will filter and variance stabilize) Apul_genes <- as.data.frame(read_csv("https://github.com/urol-e5/timeseries_molecular/raw/refs/heads/main/M-multi-species/output/26-rank35-optimization/lambda_gene_0.2/top_genes_per_component/Component_24_apul_count_matrix.csv")) # format gene IDs as rownames (instead of a column) rownames(Apul_genes) <- Apul_genes$gene_id Apul_genes <- Apul_genes%>%select(!gene_id) ### miRNA ### # raw miRNA counts (will filter and variance stabilize) Apul_miRNA <- read.table(file = "https://github.com/urol-e5/timeseries_molecular/raw/refs/heads/main/M-multi-species/output/10.1-format-miRNA-counts-updated/Apul_miRNA_counts_formatted_updated.txt", header = TRUE, sep = "\t", check.names = FALSE) # format miRNA IDs as rownames (instead of a column) rownames(Apul_miRNA) <- Apul_miRNA$Name Apul_miRNA <- Apul_miRNA%>%select(!Name) ### lncRNA ### # raw lncRNA counts (will filter and variance stabilize) Apul_lncRNA_full <- read.table("https://gannet.fish.washington.edu/v1_web/owlshell/bu-github/timeseries_molecular/D-Apul/output/31.5-Apul-lncRNA-discovery/lncRNA_counts.clean.filtered.txt", header = TRUE, sep = "\t") # Remove info on genomic location, set lncRNA IDs as rownames rownames(Apul_lncRNA_full) <- Apul_lncRNA_full$Geneid Apul_lncRNA <- Apul_lncRNA_full %>% select(-Geneid, -Chr, -Start, -End, -Strand, -Length) # Format sample names to match standard (SPEC-Sample#-TP#) colnames(Apul_lncRNA) <- gsub("\\.", "-", colnames(Apul_lncRNA)) ### WGBS data ### Apul_WGBS <- read.csv("https://gannet.fish.washington.edu/gitrepos/urol-e5/timeseries_molecular/D-Apul/output/22.5-Apul-multiomic-SR/Apul-filtered-WGBS-CpG-counts.csv") # Format sample names to match standard (SPEC-Sample#-TP#) colnames(Apul_WGBS) <- gsub("\\.", "-", colnames(Apul_WGBS)) ### load and format metadata ### metadata <- read_csv("../../M-multi-species/data/rna_metadata.csv")%>%select(AzentaSampleName, ColonyID, Timepoint) %>% filter(grepl("ACR", ColonyID)) metadata$Sample <- paste0(metadata$ColonyID, "-", metadata$Timepoint) rownames(metadata) <- metadata$Sample colonies <- unique(metadata$ColonyID) ``` ### Filter data sets #### WGBS Only keep CpGs that have a non-zero value in all samples. ```{r} Apul_WGBS_filt <- Apul_WGBS %>% filter(if_all(-X, ~ .x > 0)) # Ensure it's formatted as a data frame Apul_WGBS_filt <- as.data.frame(Apul_WGBS_filt) # Set CpG IDs to rownames rownames(Apul_WGBS_filt) <- Apul_WGBS_filt$X Apul_WGBS_filt <- Apul_WGBS_filt %>% select(-X) nrow(Apul_WGBS) nrow(Apul_WGBS_filt) ``` We had 96,785 CpGs before filtering and have 14,872 after filtering. #### RNA Remove the sample with poor WGBS data quality (`ACR-225-TP1`), which was already excluded from the WGBS counts ```{r} Apul_genes <- Apul_genes %>% select(-"ACR-225-TP1") Apul_miRNA <- Apul_miRNA %>% select(-"ACR-225-TP1") Apul_lncRNA <- Apul_lncRNA %>% select(-"ACR-225-TP1") ``` Ensure we only have genes, miRNA, and lncRNA that are present in at least one sample ```{r} # genes Apul_genes_red <- Apul_genes[rowSums(Apul_genes) != 0, ] # miRNA Apul_miRNA_red <- Apul_miRNA[rowSums(Apul_miRNA) != 0, ] # lncRNA Apul_lncRNA_red <- Apul_lncRNA[rowSums(Apul_lncRNA) != 0, ] cat("Retained ", nrow(Apul_genes_red), " of ", nrow(Apul_genes), "genes; ", nrow(Apul_miRNA_red), " of ", nrow(Apul_miRNA), " miRNA; and ", nrow(Apul_lncRNA_red), " of ", nrow(Apul_lncRNA), " lncRNA") ``` *pOverA*: Specifying the minimum count for a proportion of samples for each gene. Setting 3/40 = 0.08. This would retain genes that are only expressed in a single season in a couple of the colonies. Additionally, setting the minimum count so that the minimum number of samples must have a gene count above a certain threshold. genes: ```{r} filt <- filterfun(pOverA(0.08, 5)) #create filter for the counts data gfilt <- genefilter(Apul_genes_red, filt) #identify genes to keep by count filter gkeep <- Apul_genes_red[gfilt,] #identify gene lists gn.keep <- rownames(gkeep) #gene count data filtered in PoverA, P percent of the samples have counts over A Apul_genes_filt <- as.data.frame(Apul_genes_red[which(rownames(Apul_genes_red) %in% gn.keep),]) #How many rows do we have before and after filtering? cat("Pre-filtering:", nrow(Apul_genes_red), "; Post-filtering:", nrow(Apul_genes_filt)) ``` miRNA: ```{r} mifilt <- filterfun(pOverA(0.08, 5)) #create filter for the counts data mifilt <- genefilter(Apul_miRNA_red, mifilt) #identify miRNA to keep by count filter mikeep <- Apul_miRNA_red[mifilt,] #identify miRNA to keep by count filter mikeep <- Apul_miRNA_red[mifilt,] #identify miRNA lists mi.keep <- rownames(mikeep) #miRNA count data filtered in PoverA, P percent of the samples have counts over A Apul_miRNA_filt <- as.data.frame(Apul_miRNA_red[which(rownames(Apul_miRNA_red) %in% mi.keep),]) #How many rows do we have before and after filtering? cat("Pre-filtering:", nrow(Apul_miRNA_red), "; Post-filtering:", nrow(Apul_miRNA_filt)) ``` Of the 51 miRNA, 47 were retained. Which were removed? ```{r} setdiff(rownames(Apul_miRNA_red), rownames(Apul_miRNA_filt)) ``` lncRNA: ```{r} lncfilt <- filterfun(pOverA(0.08, 5)) #create filter for the counts data lncfilt <- genefilter(Apul_lncRNA_red, lncfilt) #identify lncRNA to keep by count filter lnckeep <- Apul_lncRNA_red[lncfilt,] #identify lncRNA to keep by count filter lnckeep <- Apul_lncRNA_red[lncfilt,] #identify lncRNA lists lnc.keep <- rownames(lnckeep) #lncRNA count data filtered in PoverA, P percent of the samples have counts over A Apul_lncRNA_filt <- as.data.frame(Apul_lncRNA_red[which(rownames(Apul_lncRNA_red) %in% lnc.keep),]) #How many rows do we have before and after filtering? cat("Pre-filtering:", nrow(Apul_lncRNA_red), "; Post-filtering:", nrow(Apul_lncRNA_filt)) ``` ### Transform data Remove Sample `ACR-225-TP1` from metadata table ```{r} metadata <- metadata[!(rownames(metadata) == "ACR-225-TP1"), ] rownames(metadata) <- metadata$Sample ``` Set the order of genes, miRNA, lncRNA, wgbs, and metadata to all be the same. ```{r} # Ensure rownames of metadata are used as the desired column order desired_order <- rownames(metadata) # Reorder data frame columns Apul_genes_filt <- Apul_genes_filt[, desired_order] Apul_miRNA_filt <- Apul_miRNA_filt[, desired_order] Apul_lncRNA_filt <- Apul_lncRNA_filt[, desired_order] Apul_WGBS_filt <- Apul_WGBS_filt[, desired_order] # Check they all match identical(rownames(metadata), colnames(Apul_genes_filt)) identical(rownames(metadata), colnames(Apul_miRNA_filt)) identical(rownames(metadata), colnames(Apul_lncRNA_filt)) identical(rownames(metadata), colnames(Apul_WGBS_filt)) ``` Use a variance stabilized transformation for all the RNA data sets. Variance stabilization essentially tries to make variance independent of the mean, and we'll implement via the differential expression analysis package `DESeq2` (Is this the most appropriate design to use?) ```{r} ### genes ### dds_Apul <- DESeqDataSetFromMatrix(countData = Apul_genes_filt, colData = metadata, design = ~Timepoint+ColonyID) # Variance Stabilizing Transformation vsd_Apul_genes <- varianceStabilizingTransformation(dds_Apul, blind = TRUE) # Must use varianceStabilizingTransformation() instead of vst() due to few input genes vsd_Apul_genes <- assay(vsd_Apul_genes) ### miRNA ### dds_miRNA <- DESeqDataSetFromMatrix(countData = Apul_miRNA_filt, colData = metadata, design = ~Timepoint+ColonyID) # Variance Stabilizing Transformation vsd_Apul_miRNA <- varianceStabilizingTransformation(dds_miRNA, blind=TRUE) # Must use varianceStabilizingTransformation() instead of vst() due to few input genes vsd_Apul_miRNA <- assay(vsd_Apul_miRNA) ### lncRNA ### # All values must be integers Apul_lncRNA_filt <- Apul_lncRNA_filt %>% mutate(across(where(is.numeric), round)) dds_lncRNA <- DESeqDataSetFromMatrix(countData = Apul_lncRNA_filt, colData = metadata, design = ~Timepoint+ColonyID) # Variance Stabilizing Transformation vsd_Apul_lncRNA <- assay(vst(dds_lncRNA, blind = TRUE)) ``` For the methylation, we'll convert from beta values to M-values. This will both (a) convert to an unbounded scale (instead of being limited to 0-100) and (b) reduce heterodescasticity, like a vst transformation ```{r} ### WGBS ### # Convert beta values to M-values: M = log2(beta / (1 - beta)) # Convert percent to proportion (0-100 -> 0-1) Apul_WGBS_prop <- Apul_WGBS_filt/100 # Protect against zeros or ones by bounding beta values (as 0s and 1s have disproportionate impact) Apul_WGBS_prop[Apul_WGBS_prop == 0] <- 1e-6 Apul_WGBS_prop[Apul_WGBS_prop == 1] <- 1 - 1e-6 # Again, this is not a vst transformation, I'm just retaining the vsd name prefix for consistency vsd_Apul_WGBS <- log2(Apul_WGBS_prop / (1 - Apul_WGBS_prop)) ``` ### Scale standardization Elastic Net, like many regression models, requires inputs to be on the same scale ```{r} # Scale each predictor block (miRNA, lncRNA, WGBS) vsd_Apul_miRNA_scaled <- scale(vsd_Apul_miRNA) vsd_Apul_lncRNA_scaled <- scale(vsd_Apul_lncRNA) vsd_Apul_WGBS_scaled <- scale(vsd_Apul_WGBS) # Scale response (genes) vsd_Apul_genes_scaled <- scale(vsd_Apul_genes) ``` ### Merge predictor features Create a merged dataset that contains the variance stabilized counts for all miRNA, lncRNA, and CpGs. ```{r} # Triple check that all three data frames have sample names in the same order identical(colnames(vsd_Apul_lncRNA_scaled), colnames(vsd_Apul_miRNA_scaled)) identical(colnames(vsd_Apul_lncRNA_scaled), colnames(vsd_Apul_WGBS_scaled)) # Bind (stack dataframes vertically, so that they match by column/sample) Apul_pred_counts <- rbind(vsd_Apul_lncRNA_scaled, vsd_Apul_miRNA_scaled, vsd_Apul_WGBS_scaled) # Transform so that samples are on rows and features are in columns Apul_pred_counts <- t(Apul_pred_counts) dim(Apul_pred_counts) ``` ### Format ```{r} # Transform the counts matrices so that samples are on the rows and gene IDs on the columns vsd_Apul_genes_t <- t(vsd_Apul_genes_scaled) # Ensure both are formatted as data frames Apul_pred_counts <- as.data.frame(Apul_pred_counts) vsd_Apul_genes_t <- as.data.frame(vsd_Apul_genes_t) # Ensure sample matching between gene and epigenetic dfs common_samples <- intersect(rownames(vsd_Apul_genes_t), rownames(Apul_pred_counts)) Apul_pred_counts <- Apul_pred_counts[common_samples, ] vsd_Apul_genes_t <- vsd_Apul_genes_t[common_samples,] ``` ## Split training and test ## Bootstrapping ```{r, cache=TRUE} n_replicates <- 1 # number of replicates r2_threshold <- 0.5 # threshold for "good" prediction Apul_TM_list <- list() Apul_MP_list <- list() Apul_FI_list <- list() # Perform bootstrapping for (i in 1:n_replicates) { cat("Beginning replicate", i) cat("\n") Apul_result <- train_models_split(vsd_Apul_genes_t, Apul_pred_counts) cat("Extracting results for replicate", i) cat("\n") Apul_trained_models <- Apul_result$models Apul_model_performance <- Apul_result$performance Apul_feature_importance <- get_feature_importance(Apul_trained_models) Apul_TM_list[[i]] <- Apul_trained_models Apul_TM_list[[i]]$Replicate <- i Apul_MP_list[[i]] <- Apul_model_performance Apul_MP_list[[i]]$Replicate <- i Apul_FI_list[[i]] <- Apul_feature_importance Apul_FI_list[[i]]$Replicate <- i } # Combine all results Apul_all_TM <- do.call(rbind, Apul_TM_list) Apul_all_MP <- do.call(rbind, Apul_MP_list) Apul_all_FI <- do.call(rbind, Apul_FI_list) ``` ```{r} # Count how often each feature exceeds the R² threshold Apul_MP_summary <- Apul_all_MP %>% group_by(Feature) %>% summarize( Mean_R2 = mean(R2, na.rm = TRUE), SD_R2 = sd(R2, na.rm = TRUE), SE_R2 = SD_R2/sqrt(n()), High_Perf_Count = sum(R2 >= r2_threshold), .groups = 'drop' ) %>% arrange(desc(High_Perf_Count)) # View top consistently high-performing features print(Apul_MP_summary) # How many genes are consistently well-predicted? sum(!is.na(Apul_MP_summary$Feature) & Apul_MP_summary$Mean_R2 > 0.5, na.rm = TRUE) ``` Sort genes by average R^2 across all replicates, then plot R^2 for all replicates ```{r} # Merge mean R2 info into full replicate-level data Apul_all_MP <- merge(Apul_all_MP, Apul_MP_summary[, c("Feature", "Mean_R2", "SD_R2")], by = "Feature") # Reorder Feature factor by Mean_R2 Apul_all_MP$Feature <- factor(Apul_all_MP$Feature, levels = Apul_MP_summary$Feature[order(Apul_MP_summary$Mean_R2)]) ``` ```{r} # Plot mean R² with error bars # Choose SD to describe variability in model performance ggplot(Apul_all_MP, aes(x = Feature, y = Mean_R2)) + geom_errorbar(aes(ymin = Mean_R2 - SD_R2, ymax = Mean_R2 + SD_R2), width = 0.4, color = "gray40", linewidth = 0.25) + geom_point(size = 2, color = "black") + geom_hline(yintercept = 0.5, linetype = "dashed", color = "blue") + theme_minimal() + labs(title = "Apul: Mean R² with Error Bars Across Gene Features", x = "Gene Expression Feature (Ordered by Mean R²)", y = "Mean R² ± SD") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size=5)) + ylim(-0.2, 1.2) + scale_y_continuous(breaks = c(0, 0.25, 0.5, 0.75, 1)) ``` ```{r} # Plot with features sorted by mean R² ggplot(Apul_all_MP, aes(x = Feature, y = R2)) + geom_point(aes(color = as.factor(Replicate)), size = 1) + geom_point(aes(y = Mean_R2), color = "black", shape = 18, size = 3) + geom_hline(yintercept = mean(Apul_all_MP$R2, na.rm = TRUE), linetype = "dashed", color = "blue") + theme_minimal() + labs(title = "Apul: Model Performance Across Gene Expression", x = "Gene Expression Feature (Ordered by Mean R²)", y = "R² (Variance Explained)", color = "Replicate") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ylim(0, 1) ``` Average importance of all predictive features and plot those with highest average importance across all replicates ```{r} # Average importance across replicates Apul_mean_importance_df <- Apul_all_FI %>% group_by(Feature) %>% summarise(MeanImportance = mean(MeanImportance, na.rm = TRUE)) %>% arrange(desc(MeanImportance)) # Select top 50 features Apul_top_features <- Apul_mean_importance_df %>% top_n(50, MeanImportance) # Plot ggplot(Apul_top_features, aes(x = reorder(Feature, MeanImportance), y = MeanImportance)) + geom_bar(stat = "identity", fill = "steelblue") + coord_flip() + # Flip for readability theme_minimal() + labs(title = "Apul: Top Predictive Features by Mean Importance", x = "Feature", y = "Mean Importance") ``` ## Boostrapping on predictable genes Now I only want to model genes that are well-predicted over many replicates Isolate well-predicted genes ```{r} Apul_well_predicted <- Apul_MP_summary[Apul_MP_summary$Mean_R2 > 0.5,] vsd_Apul_high_perf_t <- vsd_Apul_genes_t[,colnames(vsd_Apul_genes_t) %in% Apul_well_predicted$Feature] dim(vsd_Apul_high_perf_t) ``` Of the original 99 genes annotated for energetic state GO terms, 33 are consistently well-predicted by our model (mean R^2 > 0.5, over 25 replicates). Run bootstrapping on isolated genes ```{r, cache=TRUE} n_replicates <- 1 # number of replicates r2_threshold <- 0.5 # threshold for "good" prediction Apul_TM_list <- list() Apul_MP_list <- list() Apul_FI_list <- list() # Perform bootstrapping for (i in 1:n_replicates) { cat("Beginning replicate", i) cat("\n") Apul_result <- train_models_split(vsd_Apul_high_perf_t, Apul_pred_counts) cat("Extracting results for replicate", i) cat("\n") Apul_trained_models <-Apul_result$models Apul_model_performance <- Apul_result$performance Apul_feature_importance <- get_feature_importance(Apul_trained_models) Apul_TM_list[[i]] <- Apul_trained_models Apul_TM_list[[i]]$Replicate <- i Apul_MP_list[[i]] <- Apul_model_performance Apul_MP_list[[i]]$Replicate <- i Apul_FI_list[[i]] <- Apul_feature_importance Apul_FI_list[[i]]$Replicate <- i } # Combine all results Apul_all_TM_highperf <- do.call(rbind, Apul_TM_list) Apul_all_MP_highperf <- do.call(rbind, Apul_MP_list) Apul_all_FI_highperf <- do.call(rbind, Apul_FI_list) ``` ```{r} # Count how often each feature exceeds the R² threshold Apul_MP_summary_highperf <- Apul_all_MP_highperf %>% group_by(Feature) %>% summarize( Mean_R2 = mean(R2, na.rm = TRUE), SD_R2 = sd(R2, na.rm = TRUE), SE_R2 = SD_R2 / sqrt(n()), High_Perf_Count = sum(R2 >= r2_threshold), .groups = 'drop' ) %>% arrange(desc(High_Perf_Count)) # View top consistently high-performing features print(Apul_MP_summary_highperf) # How many genes are consistently well-predicted? nrow(Apul_MP_summary_highperf[Apul_MP_summary_highperf$Mean_R2 > 0.5,]) ``` When I train the model on only these 33 well-predicted genes, 28 of them are again well-predicted. That's a much higher success rate! This suggests that epigenetic mechanisms may indeed be able to predict expression of these genes Sort genes by average R^2 across all replicates, then plot R^2 for all replicates ```{r} # Merge mean R2 info into full replicate-level data Apul_all_MP_highperf <- merge(Apul_all_MP_highperf, Apul_MP_summary_highperf[, c("Feature", "Mean_R2", "SD_R2")], by = "Feature") # Reorder Feature factor by Mean_R2 Apul_all_MP_highperf$Feature <- factor(Apul_all_MP_highperf$Feature, levels = Apul_MP_summary_highperf$Feature[order(Apul_MP_summary_highperf$Mean_R2)]) ``` ```{r} # Plot mean R² with error bars # Choose SD to describe variability in model performance ggplot(Apul_all_MP_highperf, aes(x = Feature, y = Mean_R2)) + geom_errorbar(aes(ymin = Mean_R2 - SD_R2, ymax = Mean_R2 + SD_R2), width = 0.3, color = "gray40") + geom_point(size = 2, color = "black") + geom_hline(yintercept = 0.5, linetype = "dashed", color = "blue") + theme_minimal() + labs(title = "Apul: Mean R² with Error Bars Across Gene Features", x = "Gene Expression Feature (Ordered by Mean R²)", y = "Mean R² ± SD") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ylim(-0.2, 1.2) + scale_y_continuous(breaks = c(0, 0.25, 0.5, 0.75, 1)) ``` ```{r} # Plot all replicate R2s, with features sorted by mean R² ggplot(Apul_all_MP_highperf, aes(x = Feature, y = R2)) + geom_point(aes(color = as.factor(Replicate)), size = 1.5) + geom_point(aes(y = Mean_R2), color = "black", shape = 18, size = 3) + geom_hline(yintercept = mean(Apul_all_MP_highperf$R2, na.rm = TRUE), linetype = "dashed", color = "blue") + theme_minimal() + labs(title = "Apul: Model Performance Across Gene Expression", x = "Gene Expression Feature (Ordered by Mean R²)", y = "R² (Variance Explained)", color = "Replicate") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ylim(0, 1) ``` ```{r} # Average importance across replicates Apul_mean_importance_df_highperf <- Apul_all_FI_highperf %>% group_by(Feature) %>% summarise(MeanImportance = mean(MeanImportance, na.rm = TRUE)) %>% arrange(desc(MeanImportance)) # Select top 50 features Apul_top_features_highperf <- Apul_mean_importance_df_highperf %>% top_n(50, MeanImportance) # Plot ggplot(Apul_top_features_highperf, aes(x = reorder(Feature, MeanImportance), y = MeanImportance)) + geom_bar(stat = "identity", fill = "steelblue") + coord_flip() + # Flip for readability theme_minimal() + labs(title = "Apul: Top Predictive Features by Mean Importance", x = "Feature", y = "Mean Importance") ``` Look at most predictive epigenetic features for these well-predicted genes ```{r} # Define color assignments for predictor types color_palette <- c(miRNA = "#E69F00", lncRNA = "#0072B2", CpG = "#009E73") # Define well-predicted features Apul_all_features_highR2 <- Apul_all_MP_highperf %>% filter(Mean_R2 > 0.5) %>% pull(Feature) %>% unique() # Function to extract importance from a single model get_feature_importance_for_feature <- function(model) { coefs <- as.matrix(coef(model, s = "lambda.min"))[-1, , drop = FALSE] # remove intercept data.frame(Predictor = rownames(coefs), Importance = abs(as.numeric(coefs))) } # Function to assign predictor type get_predictor_type <- function(predictor_name) { if (startsWith(predictor_name, "Cluster")) return("miRNA") if (startsWith(predictor_name, "lncRNA")) return("lncRNA") if (startsWith(predictor_name, "CpG")) return("CpG") return("Other") } # Initialize list to save top predictors Apul_top_predictors <- list() # Loop over features, aggregate importance across replicates, and plot for (target_feature in Apul_all_features_highR2) { # Extract and combine importance scores across all replicates importance_all_reps <- lapply(Apul_TM_list, function(rep_entry) { model <- rep_entry[[target_feature]] if (!is.null(model)) { get_feature_importance_for_feature(model) } else { NULL } }) %>% bind_rows() # If no importance data was found, skip if (nrow(importance_all_reps) == 0) next # Compute average importance across replicates mean_importance <- importance_all_reps %>% group_by(Predictor) %>% summarise(MeanImportance = mean(Importance, na.rm = TRUE), .groups = "drop") %>% arrange(desc(MeanImportance)) %>% slice_head(n = 20) %>% mutate(Type = sapply(Predictor, get_predictor_type)) # Plot once per target feature plot <- ggplot(mean_importance, aes(x = reorder(Predictor, MeanImportance), y = MeanImportance, fill=Type)) + geom_bar(stat = "identity") + scale_fill_manual(values = color_palette) + coord_flip() + theme_minimal() + labs(title = paste("Top Predictors for", target_feature), x = "Predictor", y = "Mean Importance (across replicates)") # Save top 20 most important predictors for later use top20 <- mean_importance %>% mutate(Feature = target_feature) Apul_top_predictors[[target_feature]] <- top20 print(plot) } ``` ## Plot expression of genes and their predictors ```{r, error=FALSE, warning=FALSE} # Put predictor expression in format with samples in columns Apul_pred_counts_t <- t(Apul_pred_counts) # Loop over each feature in the list for (feature in names(Apul_top_predictors)) { # Get top 5 predictors for this feature top_df <- Apul_top_predictors[[feature]] %>% slice_head(n = 5) predictors <- top_df$Predictor # Combine feature and predictors all_genes <- c(feature, predictors) # Check which genes are actually present present_genes <- all_genes[all_genes %in% rownames(Apul_pred_counts_t) | all_genes %in% rownames(vsd_Apul_genes_scaled)] if (!(feature %in% rownames(vsd_Apul_genes_scaled))) { warning(paste("Feature", feature, "not found in vsd_Apul_genes_scaled Skipping...")) next } # Extract predictor expression (only those present) predictor_expr <- Apul_pred_counts_t[intersect(predictors, rownames(Apul_pred_counts_t)), , drop = FALSE] %>% as.data.frame() # Add the feature expression feature_expr <- vsd_Apul_genes_scaled[feature, , drop = FALSE] %>% as.data.frame() # Combine into one data frame combined_expr <- rbind(predictor_expr, feature_expr) combined_expr$Gene <- rownames(combined_expr) # Convert to long format expr_long <- combined_expr %>% pivot_longer(-Gene, names_to = "Sample", values_to = "Expression") # Join with sample metadata expr_long <- expr_long %>% left_join(metadata, by = "Sample") # Plot, with colonies aggregated by timepoint # p <- ggplot(expr_long, aes(x = interaction(Timepoint), y = Expression, color = Gene, group = Gene)) + # geom_point(alpha = 0.7) + # geom_smooth(se = FALSE, method = "loess") + # labs(title = paste("Expression of", feature, "and Top 5 Predictors"), # x = "Colony-Timepoint", # y = "Expression") + # theme_minimal() + # theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Plot, faceted by colony p <- ggplot(expr_long, aes(x = Timepoint, y = Expression, color = Gene, group = Gene)) + geom_point(alpha = 0.7) + geom_smooth(se = FALSE, method = "loess", linewidth = 0.5) + facet_wrap(~ColonyID) + labs(title = paste("Expression of", feature, "and Top 5 Predictors by Colony"), x = "Timepoint", y = "Expression") + theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) print(p) } ``` Save top predictors and metrics of each well-predicted gene ```{r} # Combine all sub-tibbles into a single data frame with the parent list name as an ID Apul_top_predictors_df <- imap_dfr(Apul_top_predictors, ~ { .x }) # View the resulting data frame print(Apul_top_predictors_df) # Optional: save to CSV write.csv(Apul_top_predictors_df, "../output/26.2-rank35-optimization-ElasticNet-component24-scaling/Apul_top_predictors.csv", row.names = FALSE) ``` # P. evermanni ## Load and format data ```{r} ### mRNA ### # raw gene counts data (will filter and variance stabilize) Peve_genes <- as.data.frame(read_csv("https://github.com/urol-e5/timeseries_molecular/raw/refs/heads/main/M-multi-species/output/26-rank35-optimization/lambda_gene_0.2/top_genes_per_component/Component_24_peve_count_matrix.csv")) # format gene IDs as rownames (instead of a column) rownames(Peve_genes) <- Peve_genes$gene_id Peve_genes <- Peve_genes%>%select(!gene_id) ### miRNA ### # raw miRNA counts (will filter and variance stabilize) Peve_miRNA <- read.table(file = "https://github.com/urol-e5/timeseries_molecular/raw/refs/heads/main/M-multi-species/output/10.1-format-miRNA-counts-updated/Peve_miRNA_counts_formatted_updated.txt", header = TRUE, sep = "\t", check.names = FALSE) # format miRNA IDs as rownames (instead of a column) rownames(Peve_miRNA) <- Peve_miRNA$Name Peve_miRNA <- Peve_miRNA%>%select(!Name) ### lncRNA ### # raw lncRNA counts (will filter and variance stabilize) Peve_lncRNA_full <- read.table("https://gannet.fish.washington.edu/v1_web/owlshell/bu-github/timeseries_molecular/E-Peve/output/12-Peve-lncRNA-discovery/lncRNA_counts.clean.filtered.txt", header = TRUE, sep = "\t") # Remove info on genomic location, set lncRNA IDs as rownames rownames(Peve_lncRNA_full) <- Peve_lncRNA_full$Geneid Peve_lncRNA <- Peve_lncRNA_full %>% select(-Geneid, -Chr, -Start, -End, -Strand, -Length) # Format sample names to match standard (SPEC-Sample#-TP#) colnames(Peve_lncRNA) <- gsub("\\.", "-", colnames(Peve_lncRNA)) ### WGBS data ### Peve_WGBS <- read.csv("https://gannet.fish.washington.edu/metacarcinus/E5/Pevermanni/20250821_meth_Peve/merged-WGBS-CpG-counts_filtered.csv") # Format sample names to match standard (SPEC-Sample#-TP#) colnames(Peve_WGBS) <- gsub("\\.", "-", colnames(Peve_WGBS)) ### load and format metadata ### metadata <- read_csv("../../M-multi-species/data/rna_metadata.csv")%>%select(AzentaSampleName, ColonyID, Timepoint) %>% filter(grepl("POR", ColonyID)) metadata$Sample <- paste0(metadata$ColonyID, "-", metadata$Timepoint) rownames(metadata) <- metadata$Sample colonies <- unique(metadata$ColonyID) ``` ### Filter data sets #### WGBS Only keep CpGs that have a non-zero value in all samples. ```{r} Peve_WGBS_filt <- Peve_WGBS %>% filter(if_all(-CpG, ~ .x > 0)) # Ensure it's formatted as a data frame Peve_WGBS_filt <- as.data.frame(Peve_WGBS_filt) # Set CpG IDs to rownames rownames(Peve_WGBS_filt) <- Peve_WGBS_filt$CpG Peve_WGBS_filt <- Peve_WGBS_filt %>% select(-CpG) nrow(Peve_WGBS) nrow(Peve_WGBS_filt) ``` We had 242,100 CpGs before filtering and have 29,465 after filtering. #### RNA Remove samples which were excluded from sRNA (`POR-69-TP3`) and WGBS counts (`POR-73-TP2`). Also exclude `POR-236-TP2`, which has 0 counts for all miRNA. This makes it both biologically uninformative and creates NA values during scaling. I'm also going to exclude the following samples, which have almost entirely 0 counts for miRNA, suggesting a sample processing/sequencing issue: `POR-74-TP4`, `POR-262-TP4`, `POR-262-TP3`, and `POR-69-TP2 ` ```{r} Peve_genes <- Peve_genes %>% select(-"POR-69-TP3", -"POR-73-TP2", -"POR-236-TP2", -"POR-74-TP4", -"POR-262-TP4", -"POR-262-TP3", -"POR-69-TP2") Peve_miRNA <- Peve_miRNA %>% select(-"POR-73-TP2", -"POR-236-TP2", -"POR-74-TP4", -"POR-262-TP4", -"POR-262-TP3", -"POR-69-TP2") Peve_lncRNA <- Peve_lncRNA %>% select(-"POR-69-TP3", -"POR-73-TP2", -"POR-236-TP2", -"POR-74-TP4", -"POR-262-TP4", -"POR-262-TP3", -"POR-69-TP2") Peve_WGBS <- Peve_WGBS %>% select(-"POR-69-TP3", -"POR-236-TP2", -"POR-74-TP4", -"POR-262-TP4", -"POR-262-TP3", -"POR-69-TP2") ``` Ensure we only have genes, miRNA, and lncRNA that are present in at least one sample ```{r} # genes Peve_genes_red <- Peve_genes[rowSums(Peve_genes) != 0, ] # miRNA Peve_miRNA_red <- Peve_miRNA[rowSums(Peve_miRNA) != 0, ] # lncRNA Peve_lncRNA_red <- Peve_lncRNA[rowSums(Peve_lncRNA) != 0, ] cat("Retained ", nrow(Peve_genes_red), " of ", nrow(Peve_genes), "genes; ", nrow(Peve_miRNA_red), " of ", nrow(Peve_miRNA), " miRNA; and ", nrow(Peve_lncRNA_red), " of ", nrow(Peve_lncRNA), " lncRNA") ``` *pOverA*: Specifying the minimum count for a proportion of samples for each gene. Setting 3/40 = 0.08. This would retain genes that are only expressed in a single season in a couple of the colonies. Additionally, setting the minimum count so that the minimum number of samples must have a gene count above a certain threshold. genes: ```{r} filt <- filterfun(pOverA(0.08, 5)) #create filter for the counts data gfilt <- genefilter(Peve_genes_red, filt) #identify genes to keep by count filter gkeep <- Peve_genes_red[gfilt,] #identify gene lists gn.keep <- rownames(gkeep) #gene count data filtered in PoverA, P percent of the samples have counts over A Peve_genes_filt <- as.data.frame(Peve_genes_red[which(rownames(Peve_genes_red) %in% gn.keep),]) #How many rows do we have before and after filtering? cat("Pre-filtering:", nrow(Peve_genes_red), "; Post-filtering:", nrow(Peve_genes_filt)) ``` miRNA: ```{r} mifilt <- filterfun(pOverA(0.08, 5)) #create filter for the counts data mifilt <- genefilter(Peve_miRNA_red, mifilt) #identify miRNA to keep by count filter mikeep <- Peve_miRNA_red[mifilt,] #identify miRNA to keep by count filter mikeep <- Peve_miRNA_red[mifilt,] #identify miRNA lists mi.keep <- rownames(mikeep) #miRNA count data filtered in PoverA, P percent of the samples have counts over A Peve_miRNA_filt <- as.data.frame(Peve_miRNA_red[which(rownames(Peve_miRNA_red) %in% mi.keep),]) #How many rows do we have before and after filtering? cat("Pre-filtering:", nrow(Peve_miRNA_red), "; Post-filtering:", nrow(Peve_miRNA_filt)) ``` lncRNA: ```{r} lncfilt <- filterfun(pOverA(0.08, 5)) #create filter for the counts data lncfilt <- genefilter(Peve_lncRNA_red, lncfilt) #identify lncRNA to keep by count filter lnckeep <- Peve_lncRNA_red[lncfilt,] #identify lncRNA to keep by count filter lnckeep <- Peve_lncRNA_red[lncfilt,] #identify lncRNA lists lnc.keep <- rownames(lnckeep) #lncRNA count data filtered in PoverA, P percent of the samples have counts over A Peve_lncRNA_filt <- as.data.frame(Peve_lncRNA_red[which(rownames(Peve_lncRNA_red) %in% lnc.keep),]) #How many rows do we have before and after filtering? cat("Pre-filtering:", nrow(Peve_lncRNA_red), "; Post-filtering:", nrow(Peve_lncRNA_filt)) ``` ### Transform data Remove excluded samples from metadata table ```{r} metadata <- metadata[!(rownames(metadata) %in% c("POR-69-TP3", "POR-73-TP2", "POR-236-TP2", "POR-74-TP4", "POR-262-TP4", "POR-262-TP3", "POR-69-TP2")), ] rownames(metadata) <- metadata$Sample ``` Set the order of genes, miRNA, lncRNA, wgbs, and metadata to all be the same. ```{r} # Ensure rownames of metadata are used as the desired column order desired_order <- rownames(metadata) # Reorder data frame columns Peve_genes_filt <- Peve_genes_filt[, desired_order] Peve_miRNA_filt <- Peve_miRNA_filt[, desired_order] Peve_lncRNA_filt <- Peve_lncRNA_filt[, desired_order] Peve_WGBS_filt <- Peve_WGBS_filt[, desired_order] # Check they all match identical(rownames(metadata), colnames(Peve_genes_filt)) identical(rownames(metadata), colnames(Peve_miRNA_filt)) identical(rownames(metadata), colnames(Peve_lncRNA_filt)) identical(rownames(metadata), colnames(Peve_WGBS_filt)) ``` Use a variance stabilized transformation for all the RNA data sets. Variance stabilization essentially tries to make variance independent of the mean, and we'll implement via the differential expression analysis package `DESeq2` (Is this the most appropriate design to use?) ```{r} ### genes ### dds_Peve <- DESeqDataSetFromMatrix(countData = Peve_genes_filt, colData = metadata, design = ~Timepoint+ColonyID) # Variance Stabilizing Transformation vsd_Peve_genes <- varianceStabilizingTransformation(dds_Peve, blind = TRUE) # Must use varianceStabilizingTransformation() instead of vst() due to few input genes vsd_Peve_genes <- assay(vsd_Peve_genes) ### miRNA ### # This is throwing an error: "Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc, :every gene contains at least one zero, cannot compute log geometric means" #To avoid having excessive 0 values that interfere with geometric mean calculations, I'm going to add a pseudo-count of 1 to all counts. This is a simple workaround, but I should revisit later to determine whether its appropriate or I need to do a more in-depth fix. dds_miRNA <- DESeqDataSetFromMatrix(countData = Peve_miRNA_filt + 1, colData = metadata, design = ~Timepoint+ColonyID) # Variance Stabilizing Transformation vsd_Peve_miRNA <- varianceStabilizingTransformation(dds_miRNA, blind=TRUE) # Must use varianceStabilizingTransformation() instead of vst() due to few input genes vsd_Peve_miRNA <- assay(vsd_Peve_miRNA) ### lncRNA ### # All values must be integers Peve_lncRNA_filt <- Peve_lncRNA_filt %>% mutate(across(where(is.numeric), round)) dds_lncRNA <- DESeqDataSetFromMatrix(countData = Peve_lncRNA_filt, colData = metadata, design = ~Timepoint+ColonyID) # Variance Stabilizing Transformation vsd_Peve_lncRNA <- assay(vst(dds_lncRNA, blind = TRUE)) ``` For the methylation, we'll convert from beta values to M-values. This will both (a) convert to an unbounded scale (instead of being limited to 0-100) and (b) reduce heterodescasticity, like a vst transformation ```{r} ### WGBS ### # Convert beta values to M-values: M = log2(beta / (1 - beta)) # Convert percent to proportion (0-100 -> 0-1) Peve_WGBS_prop <- Peve_WGBS_filt/100 # Protect against zeros or ones by bounding beta values (as 0s and 1s have disproportionate impact) Peve_WGBS_prop[Peve_WGBS_prop == 0] <- 1e-6 Peve_WGBS_prop[Peve_WGBS_prop == 1] <- 1 - 1e-6 # Again, this is not a vst transformation, I'm just retaining the vsd name prefix for consistency vsd_Peve_WGBS <- log2(Peve_WGBS_prop / (1 - Peve_WGBS_prop)) ``` ### Scale standardization Elastic Net, like many regression models, requires inputs to be on the same scale ```{r} # Scale each predictor block (miRNA, lncRNA, WGBS) vsd_Peve_miRNA_scaled <- scale(vsd_Peve_miRNA) vsd_Peve_lncRNA_scaled <- scale(vsd_Peve_lncRNA) vsd_Peve_WGBS_scaled <- scale(vsd_Peve_WGBS) # Scale response (genes) vsd_Peve_genes_scaled <- scale(vsd_Peve_genes) ``` ### Merge predictor features Create a merged dataset that contains the variance stabilized counts for all miRNA, lncRNA, and CpGs. ```{r} # Triple check that all three data frames have sample names in the same order identical(colnames(vsd_Peve_lncRNA_scaled), colnames(vsd_Peve_miRNA_scaled)) identical(colnames(vsd_Peve_lncRNA_scaled), colnames(vsd_Peve_WGBS_scaled)) # Bind (stack dataframes vertically, so that they match by column/sample) Peve_pred_counts <- rbind(vsd_Peve_lncRNA_scaled, vsd_Peve_miRNA_scaled, vsd_Peve_WGBS_scaled) # Transform so that samples are on rows and features are in columns Peve_pred_counts <- t(Peve_pred_counts) dim(Peve_pred_counts) ``` ### Format ```{r} # Transform the counts matrices so that samples are on the rows and gene IDs on the columns vsd_Peve_genes_t <- t(vsd_Peve_genes_scaled) # Ensure both are formatted as data frames Peve_pred_counts <- as.data.frame(Peve_pred_counts) vsd_Peve_genes_t <- as.data.frame(vsd_Peve_genes_t) # Ensure sample matching between gene and epigenetic dfs common_samples <- intersect(rownames(vsd_Peve_genes_t), rownames(Peve_pred_counts)) Peve_pred_counts <- Peve_pred_counts[common_samples, ] vsd_Peve_genes_t <- vsd_Peve_genes_t[common_samples,] ``` ## Split training and test ## Bootstrapping Try to remove low-variance predictors that might be messing with model: ```{r} var_threshold <- 0.01 # adjust as appropriate pred_var_scores <- apply(Peve_pred_counts, 2, var) Peve_pred_counts <- as.matrix(Peve_pred_counts)[, pred_var_scores > var_threshold] genes_var_scores <- apply(vsd_Peve_genes_t, 2, var) vsd_Peve_genes_t <- as.matrix(vsd_Peve_genes_t)[, genes_var_scores > var_threshold] ``` ```{r, cache=TRUE} n_replicates <- 10 # number of replicates r2_threshold <- 0.5 # threshold for "good" prediction Peve_TM_list <- list() Peve_MP_list <- list() Peve_FI_list <- list() # Perform bootstrapping for (i in 1:n_replicates) { cat("Beginning replicate", i) cat("\n") Peve_result <- train_models_split(vsd_Peve_genes_t, Peve_pred_counts) cat("Extracting results for replicate", i) cat("\n") Peve_trained_models <- Peve_result$models Peve_model_performance <- Peve_result$performance Peve_feature_importance <- get_feature_importance(Peve_trained_models) Peve_TM_list[[i]] <- Peve_trained_models Peve_TM_list[[i]]$Replicate <- i Peve_MP_list[[i]] <- Peve_model_performance Peve_MP_list[[i]]$Replicate <- i Peve_FI_list[[i]] <- Peve_feature_importance Peve_FI_list[[i]]$Replicate <- i } # Combine all results Peve_all_TM <- do.call(rbind, Peve_TM_list) Peve_all_MP <- do.call(rbind, Peve_MP_list) Peve_all_FI <- do.call(rbind, Peve_FI_list) ``` ```{r} # Count how often each feature exceeds the R² threshold Peve_MP_summary <- Peve_all_MP %>% group_by(Feature) %>% summarize( Mean_R2 = mean(R2, na.rm = TRUE), SD_R2 = sd(R2, na.rm = TRUE), SE_R2 = SD_R2/sqrt(n()), High_Perf_Count = sum(R2 >= r2_threshold), .groups = 'drop' ) %>% arrange(desc(High_Perf_Count)) # View top consistently high-performing features print(Peve_MP_summary) # How many genes are consistently well-predicted? sum(!is.na(Peve_MP_summary$Feature) & Peve_MP_summary$Mean_R2 > 0.5, na.rm = TRUE) ``` Sort genes by average R^2 across all replicates, then plot R^2 for all replicates ```{r} # Merge mean R2 info into full replicate-level data Peve_all_MP <- merge(Peve_all_MP, Peve_MP_summary[, c("Feature", "Mean_R2", "SD_R2")], by = "Feature") # Reorder Feature factor by Mean_R2 Peve_all_MP$Feature <- factor(Peve_all_MP$Feature, levels = Peve_MP_summary$Feature[order(Peve_MP_summary$Mean_R2)]) ``` ```{r} # Plot mean R² with error bars # Choose SD to describe variability in model performance ggplot(Peve_all_MP, aes(x = Feature, y = Mean_R2)) + geom_errorbar(aes(ymin = Mean_R2 - SD_R2, ymax = Mean_R2 + SD_R2), width = 0.4, color = "gray40", linewidth = 0.25) + geom_point(size = 2, color = "black") + geom_hline(yintercept = 0.5, linetype = "dashed", color = "blue") + theme_minimal() + labs(title = "Peve: Mean R² with Error Bars Across Gene Features", x = "Gene Expression Feature (Ordered by Mean R²)", y = "Mean R² ± SD") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size=5)) + ylim(-0.2, 1.2) + scale_y_continuous(breaks = c(0, 0.25, 0.5, 0.75, 1)) ``` ```{r} # Plot with features sorted by mean R² ggplot(Peve_all_MP, aes(x = Feature, y = R2)) + geom_point(aes(color = as.factor(Replicate)), size = 1) + geom_point(aes(y = Mean_R2), color = "black", shape = 18, size = 3) + geom_hline(yintercept = mean(Peve_all_MP$R2, na.rm = TRUE), linetype = "dashed", color = "blue") + theme_minimal() + labs(title = "Peve: Model Performance Across Gene Expression", x = "Gene Expression Feature (Ordered by Mean R²)", y = "R² (Variance Explained)", color = "Replicate") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ylim(0, 1) ``` Average importance of all predictive features and plot those with highest average importance accross all replicates ```{r} # Average importance across replicates Peve_mean_importance_df <- Peve_all_FI %>% group_by(Feature) %>% summarise(MeanImportance = mean(MeanImportance, na.rm = TRUE)) %>% arrange(desc(MeanImportance)) # Select top 50 features Peve_top_features <- Peve_mean_importance_df %>% top_n(50, MeanImportance) # Plot ggplot(Peve_top_features, aes(x = reorder(Feature, MeanImportance), y = MeanImportance)) + geom_bar(stat = "identity", fill = "steelblue") + coord_flip() + # Flip for readability theme_minimal() + labs(title = "Peve: Top Predictive Features by Mean Importance", x = "Feature", y = "Mean Importance") ``` ## Boostrapping on predictable genes Now I only want to model genes that are well-predicted over many replicates Isolate well-predicted genes ```{r} Peve_well_predicted <- Peve_MP_summary[Peve_MP_summary$Mean_R2 > 0.5,] vsd_Peve_high_perf_t <- vsd_Peve_genes_t[,colnames(vsd_Peve_genes_t) %in% Peve_well_predicted$Feature] dim(vsd_Peve_high_perf_t) ``` Of the original 99 genes annotated for energetic state GO terms, 33 are consistently well-predicted by our model (mean R^2 > 0.5, over 25 replicates). Run bootstrapping on isolated genes ```{r, cache=TRUE} n_replicates <- 25 # number of replicates r2_threshold <- 0.5 # threshold for "good" prediction Peve_TM_list <- list() Peve_MP_list <- list() Peve_FI_list <- list() # Perform bootstrapping for (i in 1:n_replicates) { cat("Beginning replicate", i) cat("\n") Peve_result <- train_models_split(vsd_Peve_high_perf_t, Peve_pred_counts) cat("Extracting results for replicate", i) cat("\n") Peve_trained_models <-Peve_result$models Peve_model_performance <- Peve_result$performance Peve_feature_importance <- get_feature_importance(Peve_trained_models) Peve_TM_list[[i]] <- Peve_trained_models Peve_TM_list[[i]]$Replicate <- i Peve_MP_list[[i]] <- Peve_model_performance Peve_MP_list[[i]]$Replicate <- i Peve_FI_list[[i]] <- Peve_feature_importance Peve_FI_list[[i]]$Replicate <- i } # Combine all results Peve_all_TM_highperf <- do.call(rbind, Peve_TM_list) Peve_all_MP_highperf <- do.call(rbind, Peve_MP_list) Peve_all_FI_highperf <- do.call(rbind, Peve_FI_list) ``` ```{r} # Count how often each feature exceeds the R² threshold Peve_MP_summary_highperf <- Peve_all_MP_highperf %>% group_by(Feature) %>% summarize( Mean_R2 = mean(R2, na.rm = TRUE), SD_R2 = sd(R2, na.rm = TRUE), SE_R2 = SD_R2 / sqrt(n()), High_Perf_Count = sum(R2 >= r2_threshold), .groups = 'drop' ) %>% arrange(desc(High_Perf_Count)) # View top consistently high-performing features print(Peve_MP_summary_highperf) # How many genes are consistently well-predicted? nrow(Peve_MP_summary_highperf[Peve_MP_summary_highperf$Mean_R2 > 0.5,]) ``` When I train the model on only these 33 well-predicted genes, 28 of them are again well-predicted. That's a much higher success rate! This suggests that epigenetic mechanisms may indeed be able to predict expression of these genes Sort genes by average R^2 across all replicates, then plot R^2 for all replicates ```{r} # Merge mean R2 info into full replicate-level data Peve_all_MP_highperf <- merge(Peve_all_MP_highperf, Peve_MP_summary_highperf[, c("Feature", "Mean_R2", "SD_R2")], by = "Feature") # Reorder Feature factor by Mean_R2 Peve_all_MP_highperf$Feature <- factor(Peve_all_MP_highperf$Feature, levels = Peve_MP_summary_highperf$Feature[order(Peve_MP_summary_highperf$Mean_R2)]) ``` ```{r} # Plot mean R² with error bars # Choose SD to describe variability in model performance ggplot(Peve_all_MP_highperf, aes(x = Feature, y = Mean_R2)) + geom_errorbar(aes(ymin = Mean_R2 - SD_R2, ymax = Mean_R2 + SD_R2), width = 0.3, color = "gray40") + geom_point(size = 2, color = "black") + geom_hline(yintercept = 0.5, linetype = "dashed", color = "blue") + theme_minimal() + labs(title = "Peve: Mean R² with Error Bars Across Gene Features", x = "Gene Expression Feature (Ordered by Mean R²)", y = "Mean R² ± SD") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ylim(-0.2, 1.2) + scale_y_continuous(breaks = c(0, 0.25, 0.5, 0.75, 1)) ``` ```{r} # Plot all replicate R2s, with features sorted by mean R² ggplot(Peve_all_MP_highperf, aes(x = Feature, y = R2)) + geom_point(aes(color = as.factor(Replicate)), size = 1.5) + geom_point(aes(y = Mean_R2), color = "black", shape = 18, size = 3) + geom_hline(yintercept = mean(Peve_all_MP_highperf$R2, na.rm = TRUE), linetype = "dashed", color = "blue") + theme_minimal() + labs(title = "Peve: Model Performance Across Gene Expression", x = "Gene Expression Feature (Ordered by Mean R²)", y = "R² (Variance Explained)", color = "Replicate") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ylim(0, 1) ``` ```{r} # Average importance across replicates Peve_mean_importance_df_highperf <- Peve_all_FI_highperf %>% group_by(Feature) %>% summarise(MeanImportance = mean(MeanImportance, na.rm = TRUE)) %>% arrange(desc(MeanImportance)) # Select top 50 features Peve_top_features_highperf <- Peve_mean_importance_df_highperf %>% top_n(50, MeanImportance) # Plot ggplot(Peve_top_features_highperf, aes(x = reorder(Feature, MeanImportance), y = MeanImportance)) + geom_bar(stat = "identity", fill = "steelblue") + coord_flip() + # Flip for readability theme_minimal() + labs(title = "Peve: Top Predictive Features by Mean Importance", x = "Feature", y = "Mean Importance") ``` Look at most predictive epigenetic features for these well-predicted genes ```{r} # Define color assignments for predictor types color_palette <- c(miRNA = "#E69F00", lncRNA = "#0072B2", CpG = "#009E73") # Define well-predicted features Peve_all_features_highR2 <- Peve_all_MP_highperf %>% filter(Mean_R2 > 0.5) %>% pull(Feature) %>% unique() # Function to extract importance from a single model get_feature_importance_for_feature <- function(model) { coefs <- as.matrix(coef(model, s = "lambda.min"))[-1, , drop = FALSE] # remove intercept data.frame(Predictor = rownames(coefs), Importance = abs(as.numeric(coefs))) } # Function to assign predictor type get_predictor_type <- function(predictor_name) { if (startsWith(predictor_name, "Cluster")) return("miRNA") if (startsWith(predictor_name, "lncRNA")) return("lncRNA") if (startsWith(predictor_name, "CpG")) return("CpG") return("Other") } # Initialize list to save top predictors Peve_top_predictors <- list() # Loop over features, aggregate importance across replicates, and plot for (target_feature in Peve_all_features_highR2) { # Extract and combine importance scores across all replicates importance_all_reps <- lapply(Peve_TM_list, function(rep_entry) { model <- rep_entry[[target_feature]] if (!is.null(model)) { get_feature_importance_for_feature(model) } else { NULL } }) %>% bind_rows() # If no importance data was found, skip if (nrow(importance_all_reps) == 0) next # Compute average importance across replicates mean_importance <- importance_all_reps %>% group_by(Predictor) %>% summarise(MeanImportance = mean(Importance, na.rm = TRUE), .groups = "drop") %>% arrange(desc(MeanImportance)) %>% slice_head(n = 20) %>% mutate(Type = sapply(Predictor, get_predictor_type)) # Plot once per target feature plot <- ggplot(mean_importance, aes(x = reorder(Predictor, MeanImportance), y = MeanImportance, fill=Type)) + geom_bar(stat = "identity") + scale_fill_manual(values = color_palette) + coord_flip() + theme_minimal() + labs(title = paste("Top Predictors for", target_feature), x = "Predictor", y = "Mean Importance (across replicates)") # Save top 20 most important predictors for later use top20 <- mean_importance %>% mutate(Feature = target_feature) Peve_top_predictors[[target_feature]] <- top20 print(plot) } ``` ## Plot expression of genes and their predictors ```{r, error=FALSE, warning=FALSE} # Put predictor expression in format with samples in columns Peve_pred_counts_t <- t(Peve_pred_counts) # Loop over each feature in the list for (feature in names(Peve_top_predictors)) { # Get top 5 predictors for this feature top_df <- Peve_top_predictors[[feature]] %>% slice_head(n = 5) predictors <- top_df$Predictor # Combine feature and predictors all_genes <- c(feature, predictors) # Check which genes are actually present present_genes <- all_genes[all_genes %in% rownames(Peve_pred_counts_t) | all_genes %in% rownames(vsd_Peve_genes_scaled)] if (!(feature %in% rownames(vsd_Peve_genes_scaled))) { warning(paste("Feature", feature, "not found in vsd_Peve_genes_scaled Skipping...")) next } # Extract predictor expression (only those present) predictor_expr <- Peve_pred_counts_t[intersect(predictors, rownames(Peve_pred_counts_t)), , drop = FALSE] %>% as.data.frame() # Add the feature expression feature_expr <- vsd_Peve_genes_scaled[feature, , drop = FALSE] %>% as.data.frame() # Combine into one data frame combined_expr <- rbind(predictor_expr, feature_expr) combined_expr$Gene <- rownames(combined_expr) # Convert to long format expr_long <- combined_expr %>% pivot_longer(-Gene, names_to = "Sample", values_to = "Expression") # Join with sample metadata expr_long <- expr_long %>% left_join(metadata, by = "Sample") # Plot, with colonies aggregated by timepoint # p <- ggplot(expr_long, aes(x = interaction(Timepoint), y = Expression, color = Gene, group = Gene)) + # geom_point(alpha = 0.7) + # geom_smooth(se = FALSE, method = "loess") + # labs(title = paste("Expression of", feature, "and Top 5 Predictors"), # x = "Colony-Timepoint", # y = "Expression") + # theme_minimal() + # theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Plot, faceted by colony p <- ggplot(expr_long, aes(x = Timepoint, y = Expression, color = Gene, group = Gene)) + geom_point(alpha = 0.7) + geom_smooth(se = FALSE, method = "loess", linewidth = 0.5) + facet_wrap(~ColonyID) + labs(title = paste("Expression of", feature, "and Top 5 Predictors by Colony"), x = "Timepoint", y = "Expression") + theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) print(p) } ``` Save top predictors and metrics of each well-predicted gene ```{r} # Combine all sub-tibbles into a single data frame with the parent list name as an ID Peve_top_predictors_df <- imap_dfr(Peve_top_predictors, ~ { .x }) # View the resulting data frame print(Peve_top_predictors_df) # Optional: save to CSV write.csv(Peve_top_predictors_df, "../output/26.2-rank35-optimization-ElasticNet-component24-scaling/Peve_top_predictors.csv", row.names = FALSE) ``` # P. tuahiniensis ## Load and format data ```{r} ### mRNA ### # raw gene counts data (will filter and variance stabilize) Ptuh_genes <- as.data.frame(read_csv("https://github.com/urol-e5/timeseries_molecular/raw/refs/heads/main/M-multi-species/output/26-rank35-optimization/lambda_gene_0.2/top_genes_per_component/Component_24_ptua_count_matrix.csv")) # format gene IDs as rownames (instead of a column) rownames(Ptuh_genes) <- Ptuh_genes$gene_id Ptuh_genes <- Ptuh_genes%>%select(!gene_id) ### miRNA ### # raw miRNA counts (will filter and variance stabilize) Ptuh_miRNA <- read.table(file = "https://github.com/urol-e5/timeseries_molecular/raw/refs/heads/main/M-multi-species/output/10.1-format-miRNA-counts-updated/Ptuh_miRNA_counts_formatted_updated.txt", header = TRUE, sep = "\t", check.names = FALSE) # format miRNA IDs as rownames (instead of a column) rownames(Ptuh_miRNA) <- Ptuh_miRNA$Name Ptuh_miRNA <- Ptuh_miRNA%>%select(!Name) ### lncRNA ### # raw lncRNA counts (will filter and variance stabilize) Ptuh_lncRNA_full <- read.table("https://gannet.fish.washington.edu/v1_web/owlshell/bu-github/timeseries_molecular/F-Ptua/output/06-Ptua-lncRNA-discovery/lncRNA_counts.clean.filtered.txt", header = TRUE, sep = "\t") # Remove info on genomic location, set lncRNA IDs as rownames rownames(Ptuh_lncRNA_full) <- Ptuh_lncRNA_full$Geneid Ptuh_lncRNA <- Ptuh_lncRNA_full %>% select(-Geneid, -Chr, -Start, -End, -Strand, -Length) # Format sample names to match standard (SPEC-Sample#-TP#) colnames(Ptuh_lncRNA) <- gsub("\\.", "-", colnames(Ptuh_lncRNA)) ### WGBS data ### Ptuh_WGBS <- read.csv(url("https://gannet.fish.washington.edu/metacarcinus/E5/Ptuahiniensis/20250821_meth_Ptua/merged-WGBS-CpG-counts_filtered.csv")) # Format sample names to match standard (SPEC-Sample#-TP#) colnames(Ptuh_WGBS) <- gsub("\\.", "-", colnames(Ptuh_WGBS)) ### load and format metadata ### metadata <- read_csv("../../M-multi-species/data/rna_metadata.csv")%>%select(AzentaSampleName, ColonyID, Timepoint) %>% filter(grepl("POC", ColonyID)) metadata$Sample <- paste0(metadata$ColonyID, "-", metadata$Timepoint) rownames(metadata) <- metadata$Sample colonies <- unique(metadata$ColonyID) ``` ### Filter data sets #### WGBS Only keep CpGs that have a non-zero value in all samples. ```{r} Ptuh_WGBS_filt <- Ptuh_WGBS %>% filter(if_all(-CpG, ~ .x > 0)) # Ensure it's formatted as a data frame Ptuh_WGBS_filt <- as.data.frame(Ptuh_WGBS_filt) # Set CpG IDs to rownames rownames(Ptuh_WGBS_filt) <- Ptuh_WGBS_filt$CpG Ptuh_WGBS_filt <- Ptuh_WGBS_filt %>% select(-CpG) nrow(Ptuh_WGBS) nrow(Ptuh_WGBS_filt) ``` We had 1,991,606 CpGs before filtering and have 58,872 after filtering. #### RNA Remove the samples with poor WGBS data quality (`POC-201-TP3` `POC-222-TP2` `POC-255-TP4` `POC-259-TP3` `POC-259-TP4` `POC-42-TP1` `POC-42-TP3`), which was already excluded from the WGBS counts ```{r} Ptuh_genes <- Ptuh_genes %>% select(-"POC-201-TP3", -"POC-222-TP2", -"POC-255-TP4", -"POC-259-TP3", -"POC-259-TP4", -"POC-42-TP1", -"POC-42-TP3") Ptuh_miRNA <- Ptuh_miRNA %>% select(-"POC-201-TP3", -"POC-222-TP2", -"POC-255-TP4", -"POC-259-TP3", -"POC-259-TP4", -"POC-42-TP1", -"POC-42-TP3") Ptuh_lncRNA <- Ptuh_lncRNA %>% select(-"POC-201-TP3", -"POC-222-TP2", -"POC-255-TP4", -"POC-259-TP3", -"POC-259-TP4", -"POC-42-TP1", -"POC-42-TP3") ``` Ensure we only have genes, miRNA, and lncRNA that are present in at least one sample ```{r} # genes Ptuh_genes_red <- Ptuh_genes[rowSums(Ptuh_genes) != 0, ] # miRNA Ptuh_miRNA_red <- Ptuh_miRNA[rowSums(Ptuh_miRNA) != 0, ] # lncRNA Ptuh_lncRNA_red <- Ptuh_lncRNA[rowSums(Ptuh_lncRNA) != 0, ] cat("Retained ", nrow(Ptuh_genes_red), " of ", nrow(Ptuh_genes), "genes; ", nrow(Ptuh_miRNA_red), " of ", nrow(Ptuh_miRNA), " miRNA; and ", nrow(Ptuh_lncRNA_red), " of ", nrow(Ptuh_lncRNA), " lncRNA") ``` *pOverA*: Specifying the minimum count for a proportion of samples for each gene. Setting 3/40 = 0.08. This would retain genes that are only expressed in a single season in a couple of the colonies. Additionally, setting the minimum count so that the minimum number of samples must have a gene count above a certain threshold. genes: ```{r} filt <- filterfun(pOverA(0.08, 5)) #create filter for the counts data gfilt <- genefilter(Ptuh_genes_red, filt) #identify genes to keep by count filter gkeep <- Ptuh_genes_red[gfilt,] #identify gene lists gn.keep <- rownames(gkeep) #gene count data filtered in PoverA, P percent of the samples have counts over A Ptuh_genes_filt <- as.data.frame(Ptuh_genes_red[which(rownames(Ptuh_genes_red) %in% gn.keep),]) #How many rows do we have before and after filtering? cat("Pre-filtering:", nrow(Ptuh_genes_red), "; Post-filtering:", nrow(Ptuh_genes_filt)) ``` miRNA: ```{r} mifilt <- filterfun(pOverA(0.08, 5)) #create filter for the counts data mifilt <- genefilter(Ptuh_miRNA_red, mifilt) #identify miRNA to keep by count filter mikeep <- Ptuh_miRNA_red[mifilt,] #identify miRNA to keep by count filter mikeep <- Ptuh_miRNA_red[mifilt,] #identify miRNA lists mi.keep <- rownames(mikeep) #miRNA count data filtered in PoverA, P percent of the samples have counts over A Ptuh_miRNA_filt <- as.data.frame(Ptuh_miRNA_red[which(rownames(Ptuh_miRNA_red) %in% mi.keep),]) #How many rows do we have before and after filtering? cat("Pre-filtering:", nrow(Ptuh_miRNA_red), "; Post-filtering:", nrow(Ptuh_miRNA_filt)) ``` lncRNA: ```{r} lncfilt <- filterfun(pOverA(0.08, 5)) #create filter for the counts data lncfilt <- genefilter(Ptuh_lncRNA_red, lncfilt) #identify lncRNA to keep by count filter lnckeep <- Ptuh_lncRNA_red[lncfilt,] #identify lncRNA to keep by count filter lnckeep <- Ptuh_lncRNA_red[lncfilt,] #identify lncRNA lists lnc.keep <- rownames(lnckeep) #lncRNA count data filtered in PoverA, P percent of the samples have counts over A Ptuh_lncRNA_filt <- as.data.frame(Ptuh_lncRNA_red[which(rownames(Ptuh_lncRNA_red) %in% lnc.keep),]) #How many rows do we have before and after filtering? cat("Pre-filtering:", nrow(Ptuh_lncRNA_red), "; Post-filtering:", nrow(Ptuh_lncRNA_filt)) ``` ### Transform data Remove samples from metadata table ```{r} metadata <- metadata[!(rownames(metadata) %in% c("POC-201-TP3", "POC-222-TP2", "POC-255-TP4", "POC-259-TP3", "POC-259-TP4", "POC-42-TP1", "POC-42-TP3")), ] rownames(metadata) <- metadata$Sample ``` Set the order of genes, miRNA, lncRNA, wgbs, and metadata to all be the same. ```{r} # Ensure rownames of metadata are used as the desired column order desired_order <- rownames(metadata) # Reorder data frame columns Ptuh_genes_filt <- Ptuh_genes_filt[, desired_order] Ptuh_miRNA_filt <- Ptuh_miRNA_filt[, desired_order] Ptuh_lncRNA_filt <- Ptuh_lncRNA_filt[, desired_order] Ptuh_WGBS_filt <- Ptuh_WGBS_filt[, desired_order] # Check they all match identical(rownames(metadata), colnames(Ptuh_genes_filt)) identical(rownames(metadata), colnames(Ptuh_miRNA_filt)) identical(rownames(metadata), colnames(Ptuh_lncRNA_filt)) identical(rownames(metadata), colnames(Ptuh_WGBS_filt)) ``` Use a variance stabilized transformation for all the RNA data sets. Variance stabilization essentially tries to make variance independent of the mean, and we'll implement via the differential expression analysis package `DESeq2` (Is this the most appropriate design to use?) ```{r} ### genes ### dds_Ptuh <- DESeqDataSetFromMatrix(countData = Ptuh_genes_filt, colData = metadata, design = ~Timepoint+ColonyID) # Variance Stabilizing Transformation vsd_Ptuh_genes <- varianceStabilizingTransformation(dds_Ptuh, blind = TRUE) # Must use varianceStabilizingTransformation() instead of vst() due to few input genes vsd_Ptuh_genes <- assay(vsd_Ptuh_genes) ### miRNA ### dds_miRNA <- DESeqDataSetFromMatrix(countData = Ptuh_miRNA_filt, colData = metadata, design = ~Timepoint+ColonyID) # Variance Stabilizing Transformation vsd_Ptuh_miRNA <- varianceStabilizingTransformation(dds_miRNA, blind=TRUE) # Must use varianceStabilizingTransformation() instead of vst() due to few input genes vsd_Ptuh_miRNA <- assay(vsd_Ptuh_miRNA) ### lncRNA ### # All values must be integers Ptuh_lncRNA_filt <- Ptuh_lncRNA_filt %>% mutate(across(where(is.numeric), round)) dds_lncRNA <- DESeqDataSetFromMatrix(countData = Ptuh_lncRNA_filt, colData = metadata, design = ~Timepoint+ColonyID) # Variance Stabilizing Transformation vsd_Ptuh_lncRNA <- assay(vst(dds_lncRNA, blind = TRUE)) ``` For the methylation, we'll convert from beta values to M-values. This will both (a) convert to an unbounded scale (instead of being limited to 0-100) and (b) reduce heterodescasticity, like a vst transformation ```{r} ### WGBS ### # Convert beta values to M-values: M = log2(beta / (1 - beta)) # Convert percent to proportion (0-100 -> 0-1) Ptuh_WGBS_prop <- Ptuh_WGBS_filt/100 # Protect against zeros or ones by bounding beta values (as 0s and 1s have disproportionate impact) Ptuh_WGBS_prop[Ptuh_WGBS_prop == 0] <- 1e-6 Ptuh_WGBS_prop[Ptuh_WGBS_prop == 1] <- 1 - 1e-6 # Again, this is not a vst transformation, I'm just retaining the vsd name prefix for consistency vsd_Ptuh_WGBS <- log2(Ptuh_WGBS_prop / (1 - Ptuh_WGBS_prop)) ``` ### Scale standardization Elastic Net, like many regression models, requires inputs to be on the same scale ```{r} # Scale each predictor block (miRNA, lncRNA, WGBS) vsd_Ptuh_miRNA_scaled <- scale(vsd_Ptuh_miRNA) vsd_Ptuh_lncRNA_scaled <- scale(vsd_Ptuh_lncRNA) vsd_Ptuh_WGBS_scaled <- scale(vsd_Ptuh_WGBS) # Scale response (genes) vsd_Ptuh_genes_scaled <- scale(vsd_Ptuh_genes) ``` ### Merge predictor features Create a merged dataset that contains the variance stabilized counts for all miRNA, lncRNA, and CpGs. ```{r} # Triple check that all three data frames have sample names in the same order identical(colnames(vsd_Ptuh_lncRNA_scaled), colnames(vsd_Ptuh_miRNA_scaled)) identical(colnames(vsd_Ptuh_lncRNA_scaled), colnames(vsd_Ptuh_WGBS_scaled)) # Bind (stack dataframes vertically, so that they match by column/sample) Ptuh_pred_counts <- rbind(vsd_Ptuh_lncRNA_scaled, vsd_Ptuh_miRNA_scaled, vsd_Ptuh_WGBS_scaled) # Transform so that samples are on rows and features are in columns Ptuh_pred_counts <- t(Ptuh_pred_counts) dim(Ptuh_pred_counts) ``` ### Format ```{r} # Transform the counts matrices so that samples are on the rows and gene IDs on the columns vsd_Ptuh_genes_t <- t(vsd_Ptuh_genes_scaled) # Ensure both are formatted as data frames Ptuh_pred_counts <- as.data.frame(Ptuh_pred_counts) vsd_Ptuh_genes_t <- as.data.frame(vsd_Ptuh_genes_t) # Ensure sample matching between gene and epigenetic dfs common_samples <- intersect(rownames(vsd_Ptuh_genes_t), rownames(Ptuh_pred_counts)) Ptuh_pred_counts <- Ptuh_pred_counts[common_samples, ] vsd_Ptuh_genes_t <- vsd_Ptuh_genes_t[common_samples,] ``` ## Split training and test ## Bootstrapping ```{r, cache=TRUE} n_replicates <- 10 # number of replicates r2_threshold <- 0.5 # threshold for "good" prediction Ptuh_TM_list <- list() Ptuh_MP_list <- list() Ptuh_FI_list <- list() # Perform bootstrapping for (i in 1:n_replicates) { cat("Beginning replicate", i) cat("\n") Ptuh_result <- train_models_split(vsd_Ptuh_genes_t, Ptuh_pred_counts) cat("Extracting results for replicate", i) cat("\n") Ptuh_trained_models <- Ptuh_result$models Ptuh_model_performance <- Ptuh_result$performance Ptuh_feature_importance <- get_feature_importance(Ptuh_trained_models) Ptuh_TM_list[[i]] <- Ptuh_trained_models Ptuh_TM_list[[i]]$Replicate <- i Ptuh_MP_list[[i]] <- Ptuh_model_performance Ptuh_MP_list[[i]]$Replicate <- i Ptuh_FI_list[[i]] <- Ptuh_feature_importance Ptuh_FI_list[[i]]$Replicate <- i } # Combine all results Ptuh_all_TM <- do.call(rbind, Ptuh_TM_list) Ptuh_all_MP <- do.call(rbind, Ptuh_MP_list) Ptuh_all_FI <- do.call(rbind, Ptuh_FI_list) ``` ```{r} # Count how often each feature exceeds the R² threshold Ptuh_MP_summary <- Ptuh_all_MP %>% group_by(Feature) %>% summarize( Mean_R2 = mean(R2, na.rm = TRUE), SD_R2 = sd(R2, na.rm = TRUE), SE_R2 = SD_R2/sqrt(n()), High_Perf_Count = sum(R2 >= r2_threshold), .groups = 'drop' ) %>% arrange(desc(High_Perf_Count)) # View top consistently high-performing features print(Ptuh_MP_summary) # How many genes are consistently well-predicted? sum(!is.na(Ptuh_MP_summary$Feature) & Ptuh_MP_summary$Mean_R2 > 0.5, na.rm = TRUE) ``` Sort genes by average R^2 across all replicates, then plot R^2 for all replicates ```{r} # Merge mean R2 info into full replicate-level data Ptuh_all_MP <- merge(Ptuh_all_MP, Ptuh_MP_summary[, c("Feature", "Mean_R2", "SD_R2")], by = "Feature") # Reorder Feature factor by Mean_R2 Ptuh_all_MP$Feature <- factor(Ptuh_all_MP$Feature, levels = Ptuh_MP_summary$Feature[order(Ptuh_MP_summary$Mean_R2)]) ``` ```{r} # Plot mean R² with error bars # Choose SD to describe variability in model performance ggplot(Ptuh_all_MP, aes(x = Feature, y = Mean_R2)) + geom_errorbar(aes(ymin = Mean_R2 - SD_R2, ymax = Mean_R2 + SD_R2), width = 0.4, color = "gray40", linewidth = 0.25) + geom_point(size = 2, color = "black") + geom_hline(yintercept = 0.5, linetype = "dashed", color = "blue") + theme_minimal() + labs(title = "Ptuh: Mean R² with Error Bars Across Gene Features", x = "Gene Expression Feature (Ordered by Mean R²)", y = "Mean R² ± SD") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size=5)) + ylim(-0.2, 1.2) + scale_y_continuous(breaks = c(0, 0.25, 0.5, 0.75, 1)) ``` ```{r} # Plot with features sorted by mean R² ggplot(Ptuh_all_MP, aes(x = Feature, y = R2)) + geom_point(aes(color = as.factor(Replicate)), size = 1) + geom_point(aes(y = Mean_R2), color = "black", shape = 18, size = 3) + geom_hline(yintercept = mean(Ptuh_all_MP$R2, na.rm = TRUE), linetype = "dashed", color = "blue") + theme_minimal() + labs(title = "Ptuh: Model Performance Across Gene Expression", x = "Gene Expression Feature (Ordered by Mean R²)", y = "R² (Variance Explained)", color = "Replicate") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ylim(0, 1) ``` Average importance of all predictive features and plot those with highest average importance accross all replicates ```{r} # Average importance across replicates Ptuh_mean_importance_df <- Ptuh_all_FI %>% group_by(Feature) %>% summarise(MeanImportance = mean(MeanImportance, na.rm = TRUE)) %>% arrange(desc(MeanImportance)) # Select top 50 features Ptuh_top_features <- Ptuh_mean_importance_df %>% top_n(50, MeanImportance) # Plot ggplot(Ptuh_top_features, aes(x = reorder(Feature, MeanImportance), y = MeanImportance)) + geom_bar(stat = "identity", fill = "steelblue") + coord_flip() + # Flip for readability theme_minimal() + labs(title = "Ptuh: Top Predictive Features by Mean Importance", x = "Feature", y = "Mean Importance") ``` ## Boostrapping on predictable genes Now I only want to model genes that are well-predicted over many replicates Isolate well-predicted genes ```{r} Ptuh_well_predicted <- Ptuh_MP_summary[Ptuh_MP_summary$Mean_R2 > 0.5,] vsd_Ptuh_high_perf_t <- vsd_Ptuh_genes_t[,colnames(vsd_Ptuh_genes_t) %in% Ptuh_well_predicted$Feature] dim(vsd_Ptuh_high_perf_t) ``` Of the original 99 genes annotated for energetic state GO terms, 33 are consistently well-predicted by our model (mean R^2 > 0.5, over 25 replicates). Run bootstrapping on isolated genes ```{r, cache=TRUE} n_replicates <- 25 # number of replicates r2_threshold <- 0.5 # threshold for "good" prediction Ptuh_TM_list <- list() Ptuh_MP_list <- list() Ptuh_FI_list <- list() # Perform bootstrapping for (i in 1:n_replicates) { cat("Beginning replicate", i) cat("\n") Ptuh_result <- train_models_split(vsd_Ptuh_high_perf_t, Ptuh_pred_counts) cat("Extracting results for replicate", i) cat("\n") Ptuh_trained_models <-Ptuh_result$models Ptuh_model_performance <- Ptuh_result$performance Ptuh_feature_importance <- get_feature_importance(Ptuh_trained_models) Ptuh_TM_list[[i]] <- Ptuh_trained_models Ptuh_TM_list[[i]]$Replicate <- i Ptuh_MP_list[[i]] <- Ptuh_model_performance Ptuh_MP_list[[i]]$Replicate <- i Ptuh_FI_list[[i]] <- Ptuh_feature_importance Ptuh_FI_list[[i]]$Replicate <- i } # Combine all results Ptuh_all_TM_highperf <- do.call(rbind, Ptuh_TM_list) Ptuh_all_MP_highperf <- do.call(rbind, Ptuh_MP_list) Ptuh_all_FI_highperf <- do.call(rbind, Ptuh_FI_list) ``` ```{r} # Count how often each feature exceeds the R² threshold Ptuh_MP_summary_highperf <- Ptuh_all_MP_highperf %>% group_by(Feature) %>% summarize( Mean_R2 = mean(R2, na.rm = TRUE), SD_R2 = sd(R2, na.rm = TRUE), SE_R2 = SD_R2 / sqrt(n()), High_Perf_Count = sum(R2 >= r2_threshold), .groups = 'drop' ) %>% arrange(desc(High_Perf_Count)) # View top consistently high-performing features print(Ptuh_MP_summary_highperf) # How many genes are consistently well-predicted? nrow(Ptuh_MP_summary_highperf[Ptuh_MP_summary_highperf$Mean_R2 > 0.5,]) ``` When I train the model on only these 33 well-predicted genes, 28 of them are again well-predicted. That's a much higher success rate! This suggests that epigenetic mechanisms may indeed be able to predict expression of these genes Sort genes by average R^2 across all replicates, then plot R^2 for all replicates ```{r} # Merge mean R2 info into full replicate-level data Ptuh_all_MP_highperf <- merge(Ptuh_all_MP_highperf, Ptuh_MP_summary_highperf[, c("Feature", "Mean_R2", "SD_R2")], by = "Feature") # Reorder Feature factor by Mean_R2 Ptuh_all_MP_highperf$Feature <- factor(Ptuh_all_MP_highperf$Feature, levels = Ptuh_MP_summary_highperf$Feature[order(Ptuh_MP_summary_highperf$Mean_R2)]) ``` ```{r} # Plot mean R² with error bars # Choose SD to describe variability in model performance ggplot(Ptuh_all_MP_highperf, aes(x = Feature, y = Mean_R2)) + geom_errorbar(aes(ymin = Mean_R2 - SD_R2, ymax = Mean_R2 + SD_R2), width = 0.3, color = "gray40") + geom_point(size = 2, color = "black") + geom_hline(yintercept = 0.5, linetype = "dashed", color = "blue") + theme_minimal() + labs(title = "Ptuh: Mean R² with Error Bars Across Gene Features", x = "Gene Expression Feature (Ordered by Mean R²)", y = "Mean R² ± SD") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ylim(-0.2, 1.2) + scale_y_continuous(breaks = c(0, 0.25, 0.5, 0.75, 1)) ``` ```{r} # Plot all replicate R2s, with features sorted by mean R² ggplot(Ptuh_all_MP_highperf, aes(x = Feature, y = R2)) + geom_point(aes(color = as.factor(Replicate)), size = 1.5) + geom_point(aes(y = Mean_R2), color = "black", shape = 18, size = 3) + geom_hline(yintercept = mean(Ptuh_all_MP_highperf$R2, na.rm = TRUE), linetype = "dashed", color = "blue") + theme_minimal() + labs(title = "Ptuh: Model Performance Across Gene Expression", x = "Gene Expression Feature (Ordered by Mean R²)", y = "R² (Variance Explained)", color = "Replicate") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ylim(0, 1) ``` ```{r} # Average importance across replicates Ptuh_mean_importance_df_highperf <- Ptuh_all_FI_highperf %>% group_by(Feature) %>% summarise(MeanImportance = mean(MeanImportance, na.rm = TRUE)) %>% arrange(desc(MeanImportance)) # Select top 50 features Ptuh_top_features_highperf <- Ptuh_mean_importance_df_highperf %>% top_n(50, MeanImportance) # Plot ggplot(Ptuh_top_features_highperf, aes(x = reorder(Feature, MeanImportance), y = MeanImportance)) + geom_bar(stat = "identity", fill = "steelblue") + coord_flip() + # Flip for readability theme_minimal() + labs(title = "Ptuh: Top Predictive Features by Mean Importance", x = "Feature", y = "Mean Importance") ``` Look at most predictive epigenetic features for these well-predicted genes ```{r} # Define color assignments for predictor types color_palette <- c(miRNA = "#E69F00", lncRNA = "#0072B2", CpG = "#009E73") # Define well-predicted features Ptuh_all_features_highR2 <- Ptuh_all_MP_highperf %>% filter(Mean_R2 > 0.5) %>% pull(Feature) %>% unique() # Function to extract importance from a single model get_feature_importance_for_feature <- function(model) { coefs <- as.matrix(coef(model, s = "lambda.min"))[-1, , drop = FALSE] # remove intercept data.frame(Predictor = rownames(coefs), Importance = abs(as.numeric(coefs))) } # Function to assign predictor type get_predictor_type <- function(predictor_name) { if (startsWith(predictor_name, "Cluster")) return("miRNA") if (startsWith(predictor_name, "lncRNA")) return("lncRNA") if (startsWith(predictor_name, "CpG")) return("CpG") return("Other") } # Initialize list to save top predictors Ptuh_top_predictors <- list() # Loop over features, aggregate importance across replicates, and plot for (target_feature in Ptuh_all_features_highR2) { # Extract and combine importance scores across all replicates importance_all_reps <- lapply(Ptuh_TM_list, function(rep_entry) { model <- rep_entry[[target_feature]] if (!is.null(model)) { get_feature_importance_for_feature(model) } else { NULL } }) %>% bind_rows() # If no importance data was found, skip if (nrow(importance_all_reps) == 0) next # Compute average importance across replicates mean_importance <- importance_all_reps %>% group_by(Predictor) %>% summarise(MeanImportance = mean(Importance, na.rm = TRUE), .groups = "drop") %>% arrange(desc(MeanImportance)) %>% slice_head(n = 20) %>% mutate(Type = sapply(Predictor, get_predictor_type)) # Plot once per target feature plot <- ggplot(mean_importance, aes(x = reorder(Predictor, MeanImportance), y = MeanImportance, fill=Type)) + geom_bar(stat = "identity") + scale_fill_manual(values = color_palette) + coord_flip() + theme_minimal() + labs(title = paste("Top Predictors for", target_feature), x = "Predictor", y = "Mean Importance (across replicates)") # Save top 20 most important predictors for later use top20 <- mean_importance %>% mutate(Feature = target_feature) Ptuh_top_predictors[[target_feature]] <- top20 print(plot) } ``` ## Plot expression of genes and their predictors ```{r, error=FALSE, warning=FALSE} # Put predictor expression in format with samples in columns Ptuh_pred_counts_t <- t(Ptuh_pred_counts) # Loop over each feature in the list for (feature in names(Ptuh_top_predictors)) { # Get top 5 predictors for this feature top_df <- Ptuh_top_predictors[[feature]] %>% slice_head(n = 5) predictors <- top_df$Predictor # Combine feature and predictors all_genes <- c(feature, predictors) # Check which genes are actually present present_genes <- all_genes[all_genes %in% rownames(Ptuh_pred_counts_t) | all_genes %in% rownames(vsd_Ptuh_genes_scaled)] if (!(feature %in% rownames(vsd_Ptuh_genes_scaled))) { warning(paste("Feature", feature, "not found in vsd_Ptuh_genes_scaled Skipping...")) next } # Extract predictor expression (only those present) predictor_expr <- Ptuh_pred_counts_t[intersect(predictors, rownames(Ptuh_pred_counts_t)), , drop = FALSE] %>% as.data.frame() # Add the feature expression feature_expr <- vsd_Ptuh_genes_scaled[feature, , drop = FALSE] %>% as.data.frame() # Combine into one data frame combined_expr <- rbind(predictor_expr, feature_expr) combined_expr$Gene <- rownames(combined_expr) # Convert to long format expr_long <- combined_expr %>% pivot_longer(-Gene, names_to = "Sample", values_to = "Expression") # Join with sample metadata expr_long <- expr_long %>% left_join(metadata, by = "Sample") # Plot, with colonies aggregated by timepoint # p <- ggplot(expr_long, aes(x = interaction(Timepoint), y = Expression, color = Gene, group = Gene)) + # geom_point(alpha = 0.7) + # geom_smooth(se = FALSE, method = "loess") + # labs(title = paste("Expression of", feature, "and Top 5 Predictors"), # x = "Colony-Timepoint", # y = "Expression") + # theme_minimal() + # theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Plot, faceted by colony p <- ggplot(expr_long, aes(x = Timepoint, y = Expression, color = Gene, group = Gene)) + geom_point(alpha = 0.7) + geom_smooth(se = FALSE, method = "loess", linewidth = 0.5) + facet_wrap(~ColonyID) + labs(title = paste("Expression of", feature, "and Top 5 Predictors by Colony"), x = "Timepoint", y = "Expression") + theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) print(p) } ``` Save top predictors and metrics of each well-predicted gene ```{r} # Combine all sub-tibbles into a single data frame with the parent list name as an ID Ptuh_top_predictors_df <- imap_dfr(Ptuh_top_predictors, ~ { .x }) # View the resulting data frame print(Ptuh_top_predictors_df) # Optional: save to CSV write.csv(Ptuh_top_predictors_df, "../output/26.2-rank35-optimization-ElasticNet-component24-scaling/Ptuh_top_predictors.csv", row.names = FALSE) ```