--- title: "13.00-multiomics-barnacle" author: "Sam White" date: "2025-10-03" output: github_document: toc: true number_sections: true bookdown::html_document2: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true html_document: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true bibliography: references.bib citeproc: true --- # BACKGROUND This analysis prepares three-species ortholog expression matrices, normalizes the data, constructs a multi‑omics tensor (genes × combined species-samples × timepoints), and runs a sparse tensor decomposition using the [barnacle](https://github.com/blasks/barnacle) [@blaskowski2024] workflow to discover shared gene/time/species patterns. ## Workflow overview 1. **Data Preparation** - Extract and normalize ortholog expression data 2. **Tensor Construction** - Build 3D tensor combining species, samples, and timepoints 3. **Rank Comparison** - Test multiple ranks systematically to find optimal components 4. **Interpretation** - Choose best rank and examine factors, visualizations, and biological meaning ## Rank selection approach **This analysis uses a systematic rank comparison approach.** The workflow: 1. Configure `ranks_to_test` in the comparison chunk (e.g., `[5, 8, 10, 12, 15, 20]`) 2. Run decompositions for all specified ranks (10-30 minutes) 3. Review comparison metrics and elbow plot 4. Choose optimal rank based on: - Variance explained (aim for >60%) - Diminishing returns in elbow plot - Sparsity patterns - Component interpretability 5. Use chosen rank for final biological interpretation **Key metrics for comparison:** - **Variance explained**: How much of data variation is captured - **Relative error**: Reconstruction accuracy - **Sparsity**: How many genes/samples load strongly per component - **Convergence**: Whether optimization completed successfully - **Component weights**: Distribution of importance across components ## Key steps performed by this script: - Load annotated ortholog groups and per-species count matrices. - Filter ortholog groups to retain complete three‑way matches with available expression data. - Extract per‑species expression matrices mapped to ortholog group IDs. - Normalize counts (preferred: `sctransform`; fallback: log2(CPM + 1)). - Build a 3D tensor combining species/sample and timepoint dimensions. - Run [barnacle](https://github.com/blasks/barnacle) [@blaskowski2024] sparse CP decomposition and save factor matrices, metadata and figures. - Assess rank appropriateness through multiple metrics and visualizations. ## Input files - `../output/12-ortho-annot/ortholog_groups_annotated.csv` : annotated ortholog groups with columns for `group_id`, `apul`, `peve`, `ptua`, `type`, etc. - `../../D-Apul/output/02.20-D-Apul-RNAseq-alignment-HiSat2/apul-gene_count_matrix.csv` : Apul gene count matrix (contains `gene_id` plus sample columns). - `../../E-Peve/output/02.20-E-Peve-RNAseq-alignment-HiSat2/peve-gene_count_matrix.csv` : Peve gene count matrix. - `../../F-Ptua/output/02.20-F-Ptua-RNAseq-alignment-HiSat2/ptua-gene_count_matrix.csv` : Ptua gene count matrix. - (Also referenced) transcript-level matrices: `apul-transcript_count_matrix.csv`, `peve-transcript_count_matrix.csv`, `ptua-transcript_count_matrix.csv` when needed. ## Output files - `apul_ortholog_expression.csv`, `peve_ortholog_expression.csv`, `ptua_ortholog_expression.csv` : per‑species expression matrices aligned to ortholog `group_id`. 
- `apul_normalized_expression.csv`, `peve_normalized_expression.csv`, `ptua_normalized_expression.csv` : normalized expression matrices (sctransform or log2(CPM+1)).
- `multiomics_tensor.npy` : saved NumPy array of the 3D tensor used for decomposition (genes × combined_samples × timepoints).
- `barnacle_factors/` directory containing:
  - `gene_factors.csv` : gene loadings per component (genes × components).
  - `sample_factors.csv` : combined sample (species_sample) loadings per component with `Species` and `Sample_ID` metadata.
  - `time_factors.csv` : timepoint loadings per component.
  - `component_weights.csv` : component weights / importance.
  - `sample_mapping.csv` : mapping of combined sample indices to species and sample IDs.
  - `metadata.json` : analysis parameters and tensor metadata (shape, rank, lambdas, convergence, etc.).
  - `figures/` : generated visualizations (component weights, time loadings, sample heatmap, PCA, top ortholog plots).

## Notes / assumptions:

- Sample column names are parsed expecting a dot-separated format with a `TP#` timepoint token (e.g., `ACR.139.TP1`).
- Apul ortholog IDs in the ortholog table include transcript suffixes (e.g., `-T1`), which are removed for matching to the gene count matrix.
- Missing values in the tensor are handled by substitution (the current workflow fills NaNs with zeros before decomposition).
- Samples must have all four timepoints (TP1, TP2, TP3, TP4) to be included in the analysis.

# SETUP

## Libraries

```{r setup, include=FALSE}
library(knitr)
library(tidyverse)
library(reticulate)
library(matrixStats) # Required for sctransform internal functions
library(sctransform)
library(glmGamPoi)

knitr::opts_chunk$set(
  echo = TRUE,       # Display code chunks
  eval = FALSE,      # Do not evaluate chunks by default; individual chunks opt in with eval=TRUE
  results = "hold",  # Hold outputs and show them after the full code chunk
  warning = FALSE,   # Hide warnings
  collapse = FALSE,  # Keep code and output in separate blocks
  message = FALSE,   # Hide messages
  comment = "##"     # Prefix output lines with '##' so output is visually distinct
)
```

## Set R variables

```{r R-variables, eval=TRUE}
# OUTPUT DIRECTORY
output_dir <- "../output/13.00-multiomics-barnacle"

# INPUT FILE(S)
ortholog_groups_file <- "../output/12-ortho-annot/ortholog_groups_annotated.csv"

# Transcript count matrices for each species
apul_transcript_matrix_file <- "../../D-Apul/output/02.20-D-Apul-RNAseq-alignment-HiSat2/apul-transcript_count_matrix.csv"
peve_transcript_matrix_file <- "../../E-Peve/output/02.20-E-Peve-RNAseq-alignment-HiSat2/peve-transcript_count_matrix.csv"
ptua_transcript_matrix_file <- "../../F-Ptua/output/02.20-F-Ptua-RNAseq-alignment-HiSat2/ptua-transcript_count_matrix.csv"

# CONDA
conda_env_name <- c("/home/sam/programs/mambaforge/envs/barnacle_py311_env")
conda_path <- c("/opt/anaconda/anaconda3/bin/conda")
```

## Load [barnacle](https://github.com/blasks/barnacle) conda environment

If this is successful, the first line of output should show that the Python being used is the one in your [barnacle](https://github.com/blasks/barnacle) [@blaskowski2024] conda environment path. E.g.
`python: /home/sam/programs/mambaforge/envs/barnacle_py311_env/bin/python` ```{r load-barnacle-conda-env, eval=TRUE} use_condaenv(condaenv = conda_env_name, conda = conda_path) py_config() ``` # DATA PREP ## Load ortholog groups data ```{r load-ortholog-data, eval=TRUE} # Read in the ortholog groups data ortholog_groups <- read.csv(ortholog_groups_file) # Display basic info about the data cat("Dimensions of ortholog groups data:", dim(ortholog_groups), "\n\n") cat("Column names:", colnames(ortholog_groups), "\n\n") head(ortholog_groups) str(ortholog_groups) ``` ## Extract ortholog expression data Now let's extract expression data for genes that are present in the ortholog groups. We'll use the gene count matrices with gene IDs to properly map the data. ### Load gene count matrices ```{r load-gene-matrices, eval=TRUE} # Define file paths for gene count matrices apul_gene_matrix_file <- "../../D-Apul/output/02.20-D-Apul-RNAseq-alignment-HiSat2/apul-gene_count_matrix.csv" peve_gene_matrix_file <- "../../E-Peve/output/02.20-E-Peve-RNAseq-alignment-HiSat2/peve-gene_count_matrix.csv" ptua_gene_matrix_file <- "../../F-Ptua/output/02.20-F-Ptua-RNAseq-alignment-HiSat2/ptua-gene_count_matrix.csv" # Load gene count matrices for each species cat("Loading gene count matrices...\n\n") apul_gene_matrix <- read.csv(apul_gene_matrix_file) cat("Apul gene matrix dimensions:", dim(apul_gene_matrix), "\n") peve_gene_matrix <- read.csv(peve_gene_matrix_file) cat("Peve gene matrix dimensions:", dim(peve_gene_matrix), "\n") ptua_gene_matrix <- read.csv(ptua_gene_matrix_file) cat("Ptua gene matrix dimensions:", dim(ptua_gene_matrix), "\n\n") ``` ### Filter ortholog groups for complete three-way matches ```{r filter-ortholog-groups, eval=TRUE} cat("Filtering for complete three-way ortholog groups...\n") # Keep only rows where all three species have entries (no NA values or empty strings) complete_ortholog_groups <- ortholog_groups[nzchar(ortholog_groups$apul) & nzchar(ortholog_groups$peve) & nzchar(ortholog_groups$ptua), ] cat("Total ortholog groups:", nrow(ortholog_groups), "\n") cat("Complete three-way ortholog groups:", nrow(complete_ortholog_groups), "\n") ``` ### Filter for expression data availability ```{r filter-expression-availability, eval=TRUE} cat("Filtering ortholog groups to ensure all genes have expression data...\n") # Clean gene IDs to check against expression data # For Apul: remove -T[n] suffix from ortholog groups to match gene matrix format apul_ortholog_genes_check <- gsub("-T[0-9]+$", "", complete_ortholog_groups$apul) # For Peve and Ptua: use as-is (will clean gene matrix IDs later) peve_ortholog_genes_check <- complete_ortholog_groups$peve ptua_ortholog_genes_check <- complete_ortholog_groups$ptua # Check which genes are present in expression data # (Note: We need to clean gene matrix IDs to match) apul_gene_matrix_ids <- gsub("^gene-", "", apul_gene_matrix$gene_id) # Remove gene- prefix if present peve_gene_matrix_ids <- gsub("^gene-", "", peve_gene_matrix$gene_id) # Remove gene- prefix ptua_gene_matrix_ids <- gsub("^gene-", "", ptua_gene_matrix$gene_id) # Remove gene- prefix # Find which ortholog genes are present in each species' expression data apul_present <- apul_ortholog_genes_check %in% apul_gene_matrix_ids peve_present <- peve_ortholog_genes_check %in% peve_gene_matrix_ids ptua_present <- ptua_ortholog_genes_check %in% ptua_gene_matrix_ids # Keep only ortholog groups where all three species have expression data expression_complete_mask <- apul_present & peve_present & ptua_present 
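# expression_complete_mask is TRUE only for ortholog groups whose Apul, Peve, and Ptua
# IDs were all found in their respective gene count matrices; all other groups are dropped below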
complete_ortholog_groups <- complete_ortholog_groups[expression_complete_mask, ]

cat("Ortholog groups after filtering for expression data availability:", nrow(complete_ortholog_groups), "\n")
```

### Gene ID cleaning examples

```{r gene-id-examples-testing, eval=TRUE}
cat("\n=== GENE ID CLEANING EXAMPLES ===\n")

cat("Apul (clean ortholog groups to match gene matrix):\n")
cat("Ortholog groups original:", head(complete_ortholog_groups$apul, 3), "\n")
cat("Ortholog groups cleaned:", head(gsub("-T[0-9]+$", "", complete_ortholog_groups$apul), 3), "\n")
cat("Gene matrix (target format):", head(apul_gene_matrix$gene_id, 3), "\n\n")

cat("Peve (clean gene matrix to match ortholog groups):\n")
cat("Ortholog groups (target format):", head(complete_ortholog_groups$peve, 3), "\n")
cat("Gene matrix original:", head(peve_gene_matrix$gene_id, 3), "\n")
# gene_id_clean is created in the next chunk, so compute the cleaned IDs inline for this preview
cat("Gene matrix cleaned:", head(gsub("^gene-", "", peve_gene_matrix$gene_id), 3), "\n\n")

cat("Ptua (clean gene matrix to match ortholog groups):\n")
cat("Ortholog groups (target format):", head(complete_ortholog_groups$ptua, 3), "\n")
cat("Gene matrix original:", head(ptua_gene_matrix$gene_id, 3), "\n")
cat("Gene matrix cleaned:", head(gsub("^gene-", "", ptua_gene_matrix$gene_id), 3), "\n\n")
```

### Clean gene IDs for matching

```{r clean-gene-ids, eval=TRUE}
cat("Cleaning gene matrix IDs to match ortholog group format...\n")

# For Apul: ortholog groups have "FUN_000185-T1", gene matrix has "FUN_002326"
# We need to remove "-T1" from ortholog groups to match gene matrix
apul_ortholog_genes <- unique(gsub("-T[0-9]+$", "", complete_ortholog_groups$apul))

# For Peve and Ptua: keep ortholog groups as-is and clean gene matrix
peve_ortholog_genes <- unique(complete_ortholog_groups$peve)
ptua_ortholog_genes <- unique(complete_ortholog_groups$ptua)

# Clean gene matrix IDs accordingly
# Apul: gene matrix already in correct format (no cleaning needed)
apul_gene_matrix$gene_id_clean <- apul_gene_matrix$gene_id

# Peve: gene matrix has "gene-Peve_00000032", ortholog groups have "Peve_00037402"
# So we need to remove "gene-" prefix from gene matrix
peve_gene_matrix$gene_id_clean <- gsub("^gene-", "", peve_gene_matrix$gene_id)

# Ptua: gene matrix has "gene-Pocillopora_meandrina_HIv1___RNAseq.g20905.t1",
# ortholog groups have "Pocillopora_meandrina_HIv1___RNAseq.g28886.t1"
# So we just need to remove "gene-" prefix from gene matrix
ptua_gene_matrix$gene_id_clean <- gsub("^gene-", "", ptua_gene_matrix$gene_id)

cat("Apul ortholog genes (complete groups only):", length(apul_ortholog_genes), "\n")
cat("Peve ortholog genes (complete groups only):", length(peve_ortholog_genes), "\n")
cat("Ptua ortholog genes (complete groups only):", length(ptua_ortholog_genes), "\n\n")
```

### Define sample filtering function

```{r define-sample-filter-function, eval=TRUE}
# Function to filter samples that have all four timepoints (TP1, TP2, TP3, TP4)
filter_complete_samples <- function(gene_matrix, species_name) {

  cat(" Filtering samples with complete timepoints for", species_name, "...\n")

  # Get expression columns (exclude gene_id column, gene_id_clean may not exist yet)
  all_cols <- colnames(gene_matrix)
  id_cols <- c("gene_id", "gene_id_clean")
  expr_cols <- setdiff(all_cols, id_cols)

  if(length(expr_cols) == 0) {
    cat(" ERROR: No expression columns found! Returning original matrix.\n")
    return(gene_matrix)
  }

  # Parse sample column names to identify sample groups and timepoints
  sample_timepoint_map <- list()

  for(col in expr_cols) {
    # Expected formats: ACR.139.TP1, POR.216.TP1, POC.201.TP1, etc.
    # Note: Using dots (.)
as separators, not hyphens (-) parts <- strsplit(col, "\\.")[[1]] # Split on dots, not hyphens if(length(parts) >= 3) { # Extract timepoint from last part (e.g., "TP1", "TP2", etc.) tp_part <- parts[length(parts)] if(grepl("^TP[0-9]+$", tp_part)) { timepoint <- as.numeric(gsub("TP", "", tp_part)) # Sample ID is everything except the timepoint part, joined with dots sample_id <- paste(parts[1:(length(parts)-1)], collapse = ".") if(!sample_id %in% names(sample_timepoint_map)) { sample_timepoint_map[[sample_id]] <- list() } sample_timepoint_map[[sample_id]][[as.character(timepoint)]] <- col } } } # Identify samples with all four timepoints (1, 2, 3, 4) complete_samples <- c() incomplete_samples <- c() required_timepoints <- c("1", "2", "3", "4") for(sample_id in names(sample_timepoint_map)) { available_timepoints <- names(sample_timepoint_map[[sample_id]]) if(all(required_timepoints %in% available_timepoints)) { complete_samples <- c(complete_samples, sample_id) } else { incomplete_samples <- c(incomplete_samples, sample_id) missing_tps <- setdiff(required_timepoints, available_timepoints) cat(" Sample", sample_id, "missing timepoints:", paste(paste0("TP", missing_tps), collapse = ", "), "\n") } } cat(" Complete samples (all 4 timepoints):", length(complete_samples), "\n") cat(" Incomplete samples:", length(incomplete_samples), "\n") if(length(incomplete_samples) > 0) { cat(" Removing incomplete samples:", paste(incomplete_samples, collapse = ", "), "\n") } # Build list of columns to keep (include gene_id columns plus ONLY complete sample columns) if(length(complete_samples) == 0) { cat(" WARNING: No complete samples found! Keeping all expression columns.\n") # Fallback: keep all expression columns keep_cols <- c("gene_id") if("gene_id_clean" %in% colnames(gene_matrix)) { keep_cols <- c(keep_cols, "gene_id_clean") } keep_cols <- c(keep_cols, expr_cols) } else { keep_cols <- c("gene_id") if("gene_id_clean" %in% colnames(gene_matrix)) { keep_cols <- c(keep_cols, "gene_id_clean") } # ONLY add columns from complete samples (this excludes incomplete sample columns) for(sample_id in complete_samples) { for(tp in required_timepoints) { if(tp %in% names(sample_timepoint_map[[sample_id]])) { keep_cols <- c(keep_cols, sample_timepoint_map[[sample_id]][[tp]]) } } } # Double-check: ensure we're not accidentally including incomplete sample columns incomplete_cols <- c() for(sample_id in incomplete_samples) { for(tp in names(sample_timepoint_map[[sample_id]])) { incomplete_cols <- c(incomplete_cols, sample_timepoint_map[[sample_id]][[tp]]) } } if(length(incomplete_cols) > 0) { # Remove any incomplete sample columns that might have been included keep_cols <- setdiff(keep_cols, incomplete_cols) } } # Filter the gene matrix to keep only complete samples keep_cols <- intersect(keep_cols, colnames(gene_matrix)) # Ensure columns exist filtered_matrix <- gene_matrix[, keep_cols, drop = FALSE] # Count ID vs expression columns in final result final_expr_cols <- setdiff(colnames(filtered_matrix), c("gene_id", "gene_id_clean")) cat(" Final expression columns:", length(final_expr_cols), "\n") return(filtered_matrix) } ``` ### Define expression extraction function ```{r define-expression-extraction-function, eval=TRUE} # Function to extract expression data for a species using ortholog group mapping extract_species_expression <- function(ortholog_groups_df, gene_matrix, species_col, species_name) { cat("Processing", species_name, "...\n") # First, filter the gene matrix to keep only samples with complete timepoints 
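  # (Filtering is done per species, so each species may retain a different set of colonies/samples)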
filtered_gene_matrix <- filter_complete_samples(gene_matrix, species_name) # Remove duplicate group_ids, keeping first occurrence of each unique_groups <- ortholog_groups_df[!duplicated(ortholog_groups_df$group_id), ] cat(" Unique ortholog groups:", nrow(unique_groups), "\n") # Create results data frame starting with group_id result_df <- data.frame(group_id = unique_groups$group_id, stringsAsFactors = FALSE) # Get expression columns (exclude gene_id and gene_id_clean columns) expr_cols <- setdiff(colnames(filtered_gene_matrix), c("gene_id", "gene_id_clean")) # Initialize expression columns with NA for(col in expr_cols) { result_df[[col]] <- NA } # For each ortholog group, find the corresponding gene and extract expression for(i in seq_len(nrow(unique_groups))) { target_gene <- unique_groups[[species_col]][i] # Clean target gene for matching if(species_name == "Apul") { # Remove transcript suffix for Apul target_gene_clean <- gsub("-T[0-9]+$", "", target_gene) matching_rows <- which(filtered_gene_matrix$gene_id_clean == target_gene_clean) } else { # For Peve and Ptua, match cleaned gene_id matching_rows <- which(filtered_gene_matrix$gene_id_clean == target_gene) } if(length(matching_rows) == 1) { # Single match - copy expression data for(col in expr_cols) { result_df[i, col] <- filtered_gene_matrix[matching_rows, col] } } else if(length(matching_rows) > 1) { # Multiple matches - take first gene (no averaging) first_match <- matching_rows[1] for(col in expr_cols) { result_df[i, col] <- filtered_gene_matrix[first_match, col] } } else { # No match - leave as NA (will be removed later) } } # Remove rows with all NA expression values expr_na_mask <- apply(result_df[, expr_cols, drop = FALSE], 1, function(x) all(is.na(x))) result_df <- result_df[!expr_na_mask, ] cat(" Final dimensions:", nrow(result_df), "ortholog groups x", ncol(result_df)-1, "samples\n") cat(" Removed", sum(expr_na_mask), "groups with no expression data\n\n") return(result_df) } ``` ### Extract ortholog expression data ```{r extract-ortholog-expression-data, eval=TRUE} cat("Creating ortholog expression data with proper group_id mapping...\n") # Extract expression data for each species using the ortholog group mapping apul_ortholog_expression <- extract_species_expression(complete_ortholog_groups, apul_gene_matrix, "apul", "Apul") peve_ortholog_expression <- extract_species_expression(complete_ortholog_groups, peve_gene_matrix, "peve", "Peve") ptua_ortholog_expression <- extract_species_expression(complete_ortholog_groups, ptua_gene_matrix, "ptua", "Ptua") cat("=== FINAL ORTHOLOG EXPRESSION DATA DIMENSIONS ===\n") cat("Apul:", nrow(apul_ortholog_expression), "ortholog groups x", ncol(apul_ortholog_expression)-1, "samples\n") cat("Peve:", nrow(peve_ortholog_expression), "ortholog groups x", ncol(peve_ortholog_expression)-1, "samples\n") cat("Ptua:", nrow(ptua_ortholog_expression), "ortholog groups x", ncol(ptua_ortholog_expression)-1, "samples\n\n") ``` ### Write ortholog expression data ```{r write-expression-data, eval=TRUE} cat("Exporting ortholog expression data to CSV files...\n") # Define output file paths apul_output_file <- file.path(output_dir, "apul_ortholog_expression.csv") peve_output_file <- file.path(output_dir, "peve_ortholog_expression.csv") ptua_output_file <- file.path(output_dir, "ptua_ortholog_expression.csv") # Write CSV files without quotes write.csv(apul_ortholog_expression, file = apul_output_file, quote = FALSE, row.names = FALSE) cat("Exported Apul ortholog expression data to:", apul_output_file, "\n") 
write.csv(peve_ortholog_expression, file = peve_output_file, quote = FALSE, row.names = FALSE) cat("Exported Peve ortholog expression data to:", peve_output_file, "\n") write.csv(ptua_ortholog_expression, file = ptua_output_file, quote = FALSE, row.names = FALSE) cat("Exported Ptua ortholog expression data to:", ptua_output_file, "\n") cat("\nAll ortholog expression data exported successfully!\n\n") ``` ### Column structure analysis ```{r column-structure-analysis, eval=TRUE} cat("=== COLUMN STRUCTURE ANALYSIS ===\n") cat("Apul columns:", ncol(apul_ortholog_expression), "\n") cat("Apul column names (first 10):", paste(head(colnames(apul_ortholog_expression), 10), collapse = ", "), "\n") cat("Apul column names (last 10):", paste(tail(colnames(apul_ortholog_expression), 10), collapse = ", "), "\n\n") cat("Peve columns:", ncol(peve_ortholog_expression), "\n") cat("Peve column names (first 10):", paste(head(colnames(peve_ortholog_expression), 10), collapse = ", "), "\n") cat("Peve column names (last 10):", paste(tail(colnames(peve_ortholog_expression), 10), collapse = ", "), "\n\n") cat("Ptua columns:", ncol(ptua_ortholog_expression), "\n") cat("Ptua column names (first 10):", paste(head(colnames(ptua_ortholog_expression), 10), collapse = ", "), "\n") cat("Ptua column names (last 10):", paste(tail(colnames(ptua_ortholog_expression), 10), collapse = ", "), "\n\n") ``` ### Summary statistics ```{r summary-statistics, eval=TRUE} cat("=== LINE COUNTS FOR ORTHOLOG EXPRESSION DATA ===\n") cat("Apul ortholog expression with info: ", nrow(apul_ortholog_expression), " rows\n") cat("Peve ortholog expression with info: ", nrow(peve_ortholog_expression), " rows\n") cat("Ptua ortholog expression with info: ", nrow(ptua_ortholog_expression), " rows\n\n") ``` ### Sample filtering verification ```{r sample-filtering-verification, eval=TRUE} cat("=== SAMPLE FILTERING VERIFICATION ===\n") cat("This analysis filters samples to include ONLY those with all four timepoints (TP1, TP2, TP3, TP4)\n\n") # Function to analyze sample completeness in the filtered data analyze_sample_completeness <- function(data_df, species_name) { cat("Analyzing", species_name, "sample completeness:\n") # Get expression columns (exclude group_id) expr_cols <- setdiff(colnames(data_df), "group_id") # Parse sample column names to identify sample groups and timepoints sample_timepoint_map <- list() for(col in expr_cols) { parts <- strsplit(col, "\\.")[[1]] # Split on dots to match actual format if(length(parts) >= 3) { tp_part <- parts[length(parts)] if(grepl("^TP[0-9]+$", tp_part)) { timepoint <- as.numeric(gsub("TP", "", tp_part)) sample_id <- paste(parts[1:(length(parts)-1)], collapse = ".") if(!sample_id %in% names(sample_timepoint_map)) { sample_timepoint_map[[sample_id]] <- c() } sample_timepoint_map[[sample_id]] <- c(sample_timepoint_map[[sample_id]], timepoint) } } } # Check completeness complete_samples <- 0 incomplete_samples <- 0 required_timepoints <- c(1, 2, 3, 4) for(sample_id in names(sample_timepoint_map)) { available_timepoints <- sort(sample_timepoint_map[[sample_id]]) if(all(required_timepoints %in% available_timepoints)) { complete_samples <- complete_samples + 1 } else { incomplete_samples <- incomplete_samples + 1 missing_tps <- setdiff(required_timepoints, available_timepoints) cat(" WARNING: Sample", sample_id, "missing timepoints:", paste(paste0("TP", missing_tps), collapse = ", "), "\n") } } cat(" Total samples:", length(sample_timepoint_map), "\n") cat(" Complete samples (all 4 timepoints):", complete_samples, 
"\n") cat(" Incomplete samples:", incomplete_samples, "\n") if(incomplete_samples > 0) { cat(" ERROR: Filtering did not work correctly!\n") } else { cat(" SUCCESS: All samples have complete timepoints\n") } cat("\n") return(list( total = length(sample_timepoint_map), complete = complete_samples, incomplete = incomplete_samples )) } # Verify each species apul_stats <- analyze_sample_completeness(apul_ortholog_expression, "Apul") peve_stats <- analyze_sample_completeness(peve_ortholog_expression, "Peve") ptua_stats <- analyze_sample_completeness(ptua_ortholog_expression, "Ptua") cat("=== FILTERING SUMMARY ===\n") cat("All samples in the filtered dataset should have exactly 4 timepoints (TP1, TP2, TP3, TP4)\n") cat("Apul: ", apul_stats$complete, "/", apul_stats$total, " complete samples\n") cat("Peve: ", peve_stats$complete, "/", peve_stats$total, " complete samples\n") cat("Ptua: ", ptua_stats$complete, "/", ptua_stats$total, " complete samples\n") total_incomplete <- apul_stats$incomplete + peve_stats$incomplete + ptua_stats$incomplete if(total_incomplete == 0) { cat("SUCCESS: All samples across all species have complete timepoints!\n") } else { cat("ERROR: Found", total_incomplete, "incomplete samples - filtering needs to be fixed!\n") } cat("\n") ``` ### Diagnostic analysis ```{r diagnostic-analysis, eval=TRUE} cat("\n=== VERIFICATION: THREE-WAY ORTHOLOG COUNTS ===\n") cat("After filtering, all species should have identical three_way ortholog counts.\n\n") # Check how many genes we have for each species cat("Genes found in expression data by species:\n") cat("Apul ortholog expression (before adding info):", nrow(apul_ortholog_expression), "\n") cat("Peve ortholog expression (before adding info):", nrow(peve_ortholog_expression), "\n") cat("Ptua ortholog expression (before adding info):", nrow(ptua_ortholog_expression), "\n\n") # Verify all ortholog groups are three-way cat("Complete three-way ortholog groups available:", nrow(complete_ortholog_groups), "\n") cat("All should be type 'three_way'? Check:", table(complete_ortholog_groups$type), "\n\n") # Verify perfect gene coverage (should be 100% for all species now) apul_coverage <- length(intersect(apul_ortholog_genes, apul_gene_matrix$gene_id_clean)) peve_coverage <- length(intersect(peve_ortholog_genes, peve_gene_matrix$gene_id_clean)) ptua_coverage <- length(intersect(ptua_ortholog_genes, ptua_gene_matrix$gene_id_clean)) cat("Gene coverage in expression data (should be 100% for all):\n") cat("Apul: ", apul_coverage, "/", length(apul_ortholog_genes), " (", round(apul_coverage/length(apul_ortholog_genes)*100, 1), "%)\n") cat("Peve: ", peve_coverage, "/", length(peve_ortholog_genes), " (", round(peve_coverage/length(peve_ortholog_genes)*100, 1), "%)\n") cat("Ptua: ", ptua_coverage, "/", length(ptua_ortholog_genes), " (", round(ptua_coverage/length(ptua_ortholog_genes)*100, 1), "%)\n\n") ``` ### Preview expression data ```{r preview-expression-data, eval=TRUE} cat("\n=== PREVIEW OF ORTHOLOG EXPRESSION DATA ===\n") cat("Apul ortholog expression:\n\n") str(apul_ortholog_expression) cat("\n\n") cat("\nPeve ortholog expression:\n") str(peve_ortholog_expression) cat("\n\n") cat("\nPtua ortholog expression:\n") str(ptua_ortholog_expression) cat("\n\n") ``` # BARNACLE ANALYSIS This workflow focuses on comparing multiple ranks to find the optimal number of components. Based on the barnacle workflow, we: 1. Normalize count data with `sctransform` 2. Create tensors for multiomics analysis 3. Compare multiple ranks systematically 4. 
Choose optimal rank based on metrics ## Load expression data ```{r load-normalized-data, eval=TRUE} # Read the exported ortholog expression data apul_expr <- read.csv(file.path(output_dir, "apul_ortholog_expression.csv")) peve_expr <- read.csv(file.path(output_dir, "peve_ortholog_expression.csv")) ptua_expr <- read.csv(file.path(output_dir, "ptua_ortholog_expression.csv")) cat("Loaded expression data:\n") cat("Apul:", nrow(apul_expr), "genes x", ncol(apul_expr)-1, "samples\n") cat("Peve:", nrow(peve_expr), "genes x", ncol(peve_expr)-1, "samples\n") cat("Ptua:", nrow(ptua_expr), "genes x", ncol(ptua_expr)-1, "samples\n") ``` ## Normalize data with sctransform Following the barnacle manuscript approach, we'll use `sctransform` to normalize each species' data. We'll use a bulk RNA-seq appropriate approach. ```{r normalize-sctransform, eval=TRUE} # Function to normalize count data with sctransform for bulk RNA-seq normalize_with_sctransform <- function(count_data, species_name) { cat("Normalizing", species_name, "data with sctransform...\n") # Check if we have group_id or gene_id column id_col <- if("group_id" %in% colnames(count_data)) "group_id" else "gene_id" cat("Using", id_col, "as identifier column\n") # Check for and handle duplicate group_ids/gene_ids duplicate_ids <- count_data[[id_col]][duplicated(count_data[[id_col]])] if(length(duplicate_ids) > 0) { cat("Found", length(unique(duplicate_ids)), "duplicate", id_col, "values:\n") cat(" Examples:", head(unique(duplicate_ids), 10), "\n") cat(" Note: sctransform may fail due to duplicate row names, will fall back to log2(CPM + 1)\n") } else { cat("No duplicate", id_col, "values found\n") } # Use original data without aggregation - let sctransform handle duplicates or fail agg_data <- count_data # Extract count matrix (genes as rows, samples as columns) count_matrix <- as.matrix(agg_data[, -1]) # Remove id column rownames(count_matrix) <- agg_data[[id_col]] # Check for and handle problematic values cat("Checking data quality...\n") cat(" - Zero values:", sum(count_matrix == 0), "/", length(count_matrix), "\n") cat(" - NA values:", sum(is.na(count_matrix)), "\n") cat(" - Infinite values:", sum(is.infinite(count_matrix)), "\n") cat(" - Min value:", min(count_matrix, na.rm = TRUE), "\n") cat(" - Max value:", max(count_matrix, na.rm = TRUE), "\n") # Remove genes with all zeros or very low expression gene_sums <- rowSums(count_matrix) keep_genes <- gene_sums > 10 # Keep genes with total counts > 10 count_matrix_filtered <- count_matrix[keep_genes, , drop = FALSE] cat(" - Filtered to", nrow(count_matrix_filtered), "ortholog groups (from", nrow(count_matrix), ")\n") # Ensure data is numeric and handle any potential issues count_matrix_filtered <- apply(count_matrix_filtered, c(1,2), function(x) { if(is.na(x) || !is.finite(x)) return(0) return(as.numeric(x)) }) # Ensure row and column names are character strings (not factors) rownames(count_matrix_filtered) <- as.character(rownames(count_matrix_filtered)) colnames(count_matrix_filtered) <- as.character(colnames(count_matrix_filtered)) # sctransform expects genes as rows and cells (samples) as columns — # our matrix is already genes x samples (rows=genes, cols=samples), so do NOT transpose. 
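  # Each sample column is treated as a "cell" by sctransform::vst; with few bulk samples the
  # parameter fits can fail, in which case the tryCatch below falls back to log2(CPM + 1)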
count_matrix_for_vst <- count_matrix_filtered # Apply sctransform normalization with bulk RNA-seq appropriate parameters normalized_df <- tryCatch({ # First try with minimal parameters to isolate the issue cat(" Attempting sctransform with minimal parameters...\n") # Pass genes x samples matrix directly normalized <- sctransform::vst( count_matrix_for_vst, verbosity = 1 ) # Extract normalized data (should be genes x samples) normalized_data <- normalized$y # Create output data frame with original gene set (fill missing with zeros) full_normalized_data <- matrix(0, nrow = nrow(count_matrix), ncol = ncol(count_matrix)) rownames(full_normalized_data) <- rownames(count_matrix) colnames(full_normalized_data) <- colnames(count_matrix) # Fill in normalized values for kept genes full_normalized_data[rownames(normalized_data), ] <- normalized_data result_df <- data.frame( group_id = rownames(full_normalized_data), full_normalized_data, stringsAsFactors = FALSE ) cat("sctransform normalization successful for", species_name, "\n") return(result_df) }, error = function(e) { cat("sctransform with minimal parameters failed for", species_name, ":", e$message, "\n") cat("Trying sctransform with glmGamPoi method...\n") # Try again with glmGamPoi method tryCatch({ normalized <- sctransform::vst( count_matrix_for_vst, method = "glmGamPoi", n_genes = min(2000, nrow(count_matrix_for_vst)), return_cell_attr = TRUE, verbosity = 1 ) # Extract normalized data (genes x samples) normalized_data <- normalized$y # Create output data frame with original gene set (fill missing with zeros) full_normalized_data <- matrix(0, nrow = nrow(count_matrix), ncol = ncol(count_matrix)) rownames(full_normalized_data) <- rownames(count_matrix) colnames(full_normalized_data) <- colnames(count_matrix) # Fill in normalized values for kept genes full_normalized_data[rownames(normalized_data), ] <- normalized_data result_df <- data.frame( group_id = rownames(full_normalized_data), full_normalized_data, stringsAsFactors = FALSE ) cat("sctransform normalization with glmGamPoi successful for", species_name, "\n") return(result_df) }, error = function(e2) { cat("sctransform failed for", species_name, ":", e2$message, "\n") cat("Falling back to log2(CPM + 1) normalization...\n") # Fallback: log2(CPM + 1) normalization # Calculate CPM (Counts Per Million) using the filtered count matrix lib_sizes <- colSums(count_matrix_filtered) cpm_matrix <- sweep(count_matrix_filtered, 2, lib_sizes/1e6, FUN = "/") # Log2 transform with pseudocount normalized_data <- log2(cpm_matrix + 1) result_df <- data.frame( group_id = rownames(normalized_data), normalized_data, stringsAsFactors = FALSE ) cat("Log2(CPM + 1) normalization complete for", species_name, "\n") return(result_df) }) }) cat("Input dimensions:", nrow(count_data), "rows x", ncol(count_data)-1, "samples\n") cat("Output dimensions:", nrow(normalized_df), "ortholog groups x", ncol(normalized_df)-1, "samples\n\n") return(normalized_df) } # Normalize each species cat("=== STARTING NORMALIZATION ===\n\n") apul_normalized <- normalize_with_sctransform(apul_expr, "Apul") peve_normalized <- normalize_with_sctransform(peve_expr, "Peve") ptua_normalized <- normalize_with_sctransform(ptua_expr, "Ptua") cat("=== NORMALIZATION COMPLETE ===\n\n") ``` ## Export normalized data for Python analysis ```{r export-normalized-data, eval=TRUE} # Export normalized data for Python processing apul_norm_file <- file.path(output_dir, "apul_normalized_expression.csv") peve_norm_file <- file.path(output_dir, 
"peve_normalized_expression.csv") ptua_norm_file <- file.path(output_dir, "ptua_normalized_expression.csv") write.csv(apul_normalized, apul_norm_file, row.names = FALSE, quote = FALSE) write.csv(peve_normalized, peve_norm_file, row.names = FALSE, quote = FALSE) write.csv(ptua_normalized, ptua_norm_file, row.names = FALSE, quote = FALSE) cat("Exported normalized data:\n") cat("Apul:", apul_norm_file, "\n") cat("Peve:", peve_norm_file, "\n") cat("Ptua:", ptua_norm_file, "\n\n") ``` ## Create tensor dataset in Python Now we'll switch to Python to create the multiomics tensor and run barnacle analysis. ```{python setup-logging, eval=TRUE} import sys import os from datetime import datetime # Set up logging to both console and file class Logger: def __init__(self, log_file): self.terminal = sys.stdout self.log = open(log_file, 'w') def write(self, message): self.terminal.write(message) self.log.write(message) self.log.flush() def flush(self): self.terminal.flush() self.log.flush() # Create log file path output_dir = r.output_dir log_file = os.path.join(output_dir, f'barnacle_analysis_log_{datetime.now().strftime("%Y%m%d_%H%M%S")}.txt') # Redirect stdout to both console and log file sys.stdout = Logger(log_file) sys.stderr = sys.stdout # Also capture error messages print("="*60) print(f"BARNACLE ANALYSIS LOG") print(f"Started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}") print(f"Log file: {log_file}") print("="*60) print() ``` ```{python create-multiomics-tensor, eval=TRUE} import pandas as pd import numpy as np import os from pathlib import Path # Set up paths output_dir = r.output_dir print(f"Working in output directory: {output_dir}") # Load normalized data apul_norm = pd.read_csv(os.path.join(output_dir, "apul_normalized_expression.csv")) peve_norm = pd.read_csv(os.path.join(output_dir, "peve_normalized_expression.csv")) ptua_norm = pd.read_csv(os.path.join(output_dir, "ptua_normalized_expression.csv")) print("Loaded normalized data:") print(f"Apul: {apul_norm.shape}") print(f"Peve: {peve_norm.shape}") print(f"Ptua: {ptua_norm.shape}") # Check which genes are common across all species apul_genes = set(apul_norm['group_id']) peve_genes = set(peve_norm['group_id']) ptua_genes = set(ptua_norm['group_id']) common_genes = apul_genes & peve_genes & ptua_genes print(f"\nCommon genes across all species: {len(common_genes)}") # Filter to common genes and align gene order common_genes_list = sorted(list(common_genes)) apul_common = apul_norm[apul_norm['group_id'].isin(common_genes_list)].set_index('group_id').reindex(common_genes_list) peve_common = peve_norm[peve_norm['group_id'].isin(common_genes_list)].set_index('group_id').reindex(common_genes_list) ptua_common = ptua_norm[ptua_norm['group_id'].isin(common_genes_list)].set_index('group_id').reindex(common_genes_list) print(f"\nFiltered to common genes:") print(f"Apul: {apul_common.shape}") print(f"Peve: {peve_common.shape}") print(f"Ptua: {ptua_common.shape}") ``` ## Parse sample information ```{python parse-sample-info, eval=TRUE} # Parse sample names to extract sample information for each species independently def parse_species_samples(columns, species_name): """Parse sample column names for a specific species""" sample_map = {} sample_ids = [] timepoints = set() for col in columns: # Expected format: PREFIX-NUMBER-TP# (e.g., ACR-139-TP1, POR-216-TP1, POC-201-TP1) # Also support legacy format: PREFIX.NUMBER.TP# for backwards compatibility parts = col.split('-') if '-' in col else col.split('.') if len(parts) >= 3 and 
parts[-1].startswith('TP'): # Extract timepoint from last part (e.g., "TP1", "TP2", etc.) tp_part = parts[-1] timepoint = int(tp_part[2:]) # e.g., 1 from TP1 # Sample ID is everything except the timepoint part sample_id_parts = parts[:-1] sample_id = '-'.join(sample_id_parts) if '-' in col else '.'.join(sample_id_parts) sample_map[(sample_id, timepoint)] = col if sample_id not in sample_ids: sample_ids.append(sample_id) timepoints.add(timepoint) else: print(f"Warning: Could not parse column name: {col}") return sample_map, sample_ids, sorted(timepoints) # Parse sample information for each species independently print("Parsing sample information for each species...") species_data = { 'apul': apul_common, 'peve': peve_common, 'ptua': ptua_common } species_info = {} all_timepoints = set() for species, data in species_data.items(): sample_map, sample_ids, timepoints = parse_species_samples(data.columns, species) species_info[species] = { 'sample_map': sample_map, 'sample_ids': sample_ids, 'timepoints': timepoints, 'n_samples': len(sample_ids) } all_timepoints.update(timepoints) print(f"{species}:") print(f" Samples: {len(sample_ids)} ({sample_ids[:3]}...)") print(f" Timepoints: {timepoints}") common_timepoints = sorted(list(all_timepoints)) print(f"\nTimepoints found across all species: {common_timepoints}") # Find the maximum number of samples to determine tensor dimensions max_samples = max(info['n_samples'] for info in species_info.values()) print(f"Maximum samples in any species: {max_samples}") # Print detailed sample structure print(f"\nDetailed sample structure:") for species, info in species_info.items(): print(f"{species}: {info['n_samples']} samples × {len(info['timepoints'])} timepoints") ``` ## Create 3D tensor (genes × species_samples × timepoints) ```{python create-3d-tensor, eval=TRUE} # Create a 3D tensor: genes × (species_samples) × timepoints # This flattens species and samples into a single dimension that Barnacle can handle print("Creating 3D tensor by combining species and samples...") print("Note: R-level filtering has already removed samples without all 4 timepoints") # First, collect all actual sample-timepoint combinations that have data all_sample_columns = [] sample_labels = [] # Track which sample belongs to which species species_sample_map = {} # Map from combined index to (species, sample_idx, sample_id) sample_idx = 0 for species in ['apul', 'peve', 'ptua']: data = species_data[species] info = species_info[species] print(f"\nProcessing {species}:") for sample_id in info['sample_ids']: # Check if this sample has data for ALL required timepoints available_timepoints = [] sample_timepoint_cols = [] for timepoint in common_timepoints: if (sample_id, timepoint) in info['sample_map']: col_name = info['sample_map'][(sample_id, timepoint)] sample_timepoint_cols.append(col_name) available_timepoints.append(timepoint) # Only include samples that have ALL timepoints (should be all samples due to R filtering) if len(available_timepoints) == len(common_timepoints): all_sample_columns.extend(sample_timepoint_cols) sample_labels.append(f"{species}_{sample_id}") species_sample_map[sample_idx] = { 'species': species, 'sample_id': sample_id, 'sample_idx_in_species': info['sample_ids'].index(sample_id) } sample_idx += 1 print(f" Added {sample_id} with {len(sample_timepoint_cols)} timepoints: {sorted(available_timepoints)}") else: print(f" Skipped {sample_id} - incomplete timepoints: {sorted(available_timepoints)} (expected: {sorted(common_timepoints)})") n_genes = 
len(common_genes_list) n_combined_samples = len(sample_labels) n_timepoints = len(common_timepoints) print(f"\nCreating 3D tensor with shape: ({n_genes}, {n_combined_samples}, {n_timepoints})") print(f"Combined samples from all species: {n_combined_samples}") # Initialize tensor tensor_3d = np.full((n_genes, n_combined_samples, n_timepoints), np.nan) # Fill tensor filled_count = 0 missing_count = 0 for combined_idx, sample_label in enumerate(sample_labels): species_info_map = species_sample_map[combined_idx] species = species_info_map['species'] sample_id = species_info_map['sample_id'] data = species_data[species] info = species_info[species] for time_idx, timepoint in enumerate(common_timepoints): if (sample_id, timepoint) in info['sample_map']: col_name = info['sample_map'][(sample_id, timepoint)] tensor_3d[:, combined_idx, time_idx] = data[col_name].values filled_count += 1 else: missing_count += 1 # Check tensor statistics n_missing = np.sum(np.isnan(tensor_3d)) n_total = tensor_3d.size n_finite = np.sum(np.isfinite(tensor_3d)) print(f"\n=== TENSOR STATISTICS ===") print(f"Tensor shape: {tensor_3d.shape}") print(f"Total elements: {n_total}") print(f"Finite values: {n_finite}") print(f"Missing/NaN values: {n_missing}") print(f"Missing percentage: {n_missing / n_total * 100:.2f}%") print(f"Filled {filled_count} sample-timepoint combinations") print(f"Missing {missing_count} sample-timepoint combinations") # Check non-zero values among finite values finite_values = tensor_3d[np.isfinite(tensor_3d)] n_nonzero = np.sum(finite_values != 0) print(f"Non-zero finite values: {n_nonzero}") print(f"Zero finite values: {len(finite_values) - n_nonzero}") print(f"Sparsity among finite values: {(len(finite_values) - n_nonzero) / len(finite_values) * 100:.2f}%") # Save sample mapping for later interpretation sample_mapping = pd.DataFrame([ { 'combined_index': i, 'sample_label': label, 'species': species_sample_map[i]['species'], 'sample_id': species_sample_map[i]['sample_id'] } for i, label in enumerate(sample_labels) ]) print(f"\nSample mapping:") print(sample_mapping.head(10)) ``` ## Rank Comparison **This workflow systematically compares multiple ranks to find the optimal number of components.** The comparison runs decompositions for each specified rank and generates: 1. Metrics comparison table 2. Elbow plot (variance explained vs rank) 3. Multi-panel summary plots 4. Automated recommendation **Results:** Each rank is saved to `rank_XX/` directory with complete factor matrices. ### Define helper functions for rank comparison ```{python rank-comparison-functions, eval=TRUE} def run_single_rank_decomposition(tensor, rank, lambdas, n_iter_max, random_state=42): """ Run sparse CP decomposition for a given rank. Returns dict with results and metrics. 
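    Parameters
    ----------
    tensor : np.ndarray
        3D array (genes x combined species-samples x timepoints); NaNs are zero-filled before fitting.
    rank : int
        Number of components to fit.
    lambdas : list of float
        Sparsity penalties passed to SparseCP (one value per tensor mode in this workflow).
    n_iter_max : int
        Maximum number of optimization iterations.
    random_state : int
        Seed for the random initialization.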
""" from barnacle.decomposition import SparseCP print(f"\n{'='*60}") print(f"Running decomposition with rank = {rank}") print(f"{'='*60}") # Handle missing values tensor_filled = np.nan_to_num(tensor, nan=0.0) # Create model model = SparseCP( rank=rank, lambdas=lambdas, nonneg_modes=[0], norm_constraint=True, init='random', tol=1e-5, n_iter_max=n_iter_max, random_state=random_state, n_initializations=1 ) # Fit model try: decomposition = model.fit_transform(tensor_filled, verbose=0) # Calculate metrics metrics = calculate_decomposition_metrics(tensor_filled, decomposition, model) results = { 'rank': rank, 'decomposition': decomposition, 'model': model, 'metrics': metrics, 'success': True } print(f"✓ Rank {rank} complete") print(f" Final loss: {metrics['final_loss']:.6f}") print(f" Variance explained: {metrics['variance_explained']*100:.2f}%") print(f" Converged: {metrics['converged']}") return results except Exception as e: print(f"✗ Rank {rank} failed: {e}") return { 'rank': rank, 'success': False, 'error': str(e) } def calculate_decomposition_metrics(tensor, decomposition, model): """ Calculate reconstruction quality and other metrics. """ # Reconstruct tensor gene_factors = decomposition.factors[0] sample_factors = decomposition.factors[1] time_factors = decomposition.factors[2] weights = decomposition.weights rank = len(weights) reconstructed = np.zeros_like(tensor) for r in range(rank): outer_prod = np.outer(gene_factors[:, r], sample_factors[:, r]) outer_prod = np.outer(outer_prod.flatten(), time_factors[:, r]) outer_prod = outer_prod.reshape(tensor.shape) reconstructed += weights[r] * outer_prod # Calculate errors mse = np.mean((tensor - reconstructed) ** 2) rmse = np.sqrt(mse) # Relative error tensor_norm = np.linalg.norm(tensor) error_norm = np.linalg.norm(tensor - reconstructed) relative_error = error_norm / tensor_norm if tensor_norm > 0 else np.inf # Variance explained total_var = np.var(tensor) residual_var = np.var(tensor - reconstructed) variance_explained = 1 - (residual_var / total_var) if total_var > 0 else 0 # Sparsity metrics gene_sparsity = np.mean(np.abs(gene_factors) < 0.1) sample_sparsity = np.mean(np.abs(sample_factors) < 0.1) time_sparsity = np.mean(np.abs(time_factors) < 0.1) # Component weight statistics weight_max = np.max(weights) weight_min = np.min(weights) weight_ratio = weight_max / weight_min if weight_min > 0 else np.inf weight_cv = np.std(weights) / np.mean(weights) if np.mean(weights) > 0 else np.inf # Convergence info # Note: Barnacle's SparseCP class does not have a converged_ attribute # We determine convergence by checking if iterations < max_iter n_iterations = len(model.loss_) if hasattr(model, 'loss_') and model.loss_ is not None else 0 final_loss = model.loss_[-1] if hasattr(model, 'loss_') and model.loss_ is not None else np.nan # Converged if stopped before reaching max iterations (met tolerance criterion) converged = (n_iterations < model.n_iter_max) if n_iterations > 0 else False return { 'mse': mse, 'rmse': rmse, 'relative_error': relative_error, 'variance_explained': variance_explained, 'gene_sparsity': gene_sparsity, 'sample_sparsity': sample_sparsity, 'time_sparsity': time_sparsity, 'weight_max': weight_max, 'weight_min': weight_min, 'weight_ratio': weight_ratio, 'weight_cv': weight_cv, 'converged': converged, 'n_iterations': n_iterations, 'final_loss': final_loss } def save_rank_results(results, output_dir, gene_ids, sample_labels, timepoint_labels, species_sample_map): """ Save decomposition results for a specific rank. 
""" rank = results['rank'] rank_dir = os.path.join(output_dir, f'rank_{rank:02d}') os.makedirs(rank_dir, exist_ok=True) decomposition = results['decomposition'] metrics = results['metrics'] # Save factor matrices gene_factors = pd.DataFrame( decomposition.factors[0], index=gene_ids, columns=[f'Component_{i+1}' for i in range(rank)] ) gene_factors.to_csv(os.path.join(rank_dir, 'gene_factors.csv')) sample_factors = pd.DataFrame( decomposition.factors[1], index=sample_labels, columns=[f'Component_{i+1}' for i in range(rank)] ) sample_factors['Species'] = [species_sample_map[i]['species'] for i in range(len(sample_labels))] sample_factors['Sample_ID'] = [species_sample_map[i]['sample_id'] for i in range(len(sample_labels))] sample_factors.to_csv(os.path.join(rank_dir, 'sample_factors.csv')) time_factors = pd.DataFrame( decomposition.factors[2], index=timepoint_labels, columns=[f'Component_{i+1}' for i in range(rank)] ) time_factors.to_csv(os.path.join(rank_dir, 'time_factors.csv')) # Save weights weights_df = pd.DataFrame({ 'Component': [f'Component_{i+1}' for i in range(rank)], 'Weight': decomposition.weights }) weights_df.to_csv(os.path.join(rank_dir, 'component_weights.csv'), index=False) # Save loss history if hasattr(results['model'], 'loss_') and results['model'].loss_ is not None: loss_df = pd.DataFrame({ 'Iteration': list(range(1, len(results['model'].loss_) + 1)), 'Loss': results['model'].loss_ }) loss_df.to_csv(os.path.join(rank_dir, 'loss_history.csv'), index=False) # Save metadata metadata = { 'rank': rank, 'metrics': {k: float(v) if not isinstance(v, bool) else v for k, v in metrics.items()}, 'tensor_shape': list(decomposition.factors[0].shape[0:1]) + [decomposition.factors[1].shape[0]] + [decomposition.factors[2].shape[0]] } import json with open(os.path.join(rank_dir, 'metadata.json'), 'w') as f: json.dump(metadata, f, indent=2) print(f" Saved results to: {rank_dir}") print("Rank comparison helper functions loaded") ``` ### Run barnacle ```{python barnacle-multiple-ranks, eval=TRUE} # =========================================== # CONFIGURE RANKS TO TEST # =========================================== # NOTE: This chunk creates a machine-readable log file # (rank_comparison_log.json) that subsequent chunks will use. # This allows you to skip re-running this chunk if you've # already generated results. 
# ===========================================
ranks_to_test = [5, 8, 10, 12, 15, 20, 25, 35, 45, 55, 65, 75]  # Modify this list as needed
# ===========================================

import os
import json
import datetime

print("="*60)
print("COMPARING MULTIPLE RANKS")
print("="*60)
print(f"Ranks to test: {ranks_to_test}")
print(f"This will run {len(ranks_to_test)} decompositions")
print(f"Estimated time: {len(ranks_to_test) * 2}-{len(ranks_to_test) * 5} minutes")
print("="*60)

# Run decompositions for each rank
all_rank_results = []

for test_rank in sorted(ranks_to_test):
    result = run_single_rank_decomposition(
        tensor=tensor_3d,
        rank=test_rank,
        lambdas=[0.1, 0.0, 0.1],
        n_iter_max=10000,
        random_state=41
    )
    all_rank_results.append(result)

    # Save individual results
    if result['success']:
        save_rank_results(
            result,
            output_dir,
            common_genes_list,
            sample_labels,
            [f'TP{tp}' for tp in common_timepoints],
            species_sample_map
        )

print(f"\n{'='*60}")
print("ALL RANKS COMPLETE")
print(f"{'='*60}")
print(f"Successful: {sum(1 for r in all_rank_results if r['success'])}/{len(all_rank_results)}")

# ===========================================
# SAVE COMPREHENSIVE LOG FILE
# ===========================================
log_file = os.path.join(output_dir, 'rank_comparison_log.json')

log_data = {
    'timestamp': datetime.datetime.now().isoformat(),
    'ranks_tested': ranks_to_test,
    'total_runs': len(all_rank_results),
    'successful_runs': sum(1 for r in all_rank_results if r['success']),
    'failed_runs': sum(1 for r in all_rank_results if not r['success']),
    'results': []
}

for result in all_rank_results:
    rank_info = {
        'rank': result['rank'],
        'success': result['success']
    }
    if result['success']:
        # Check bools before numerics: Python bools are also instances of int
        rank_info['metrics'] = {
            k: bool(v) if isinstance(v, (bool, np.bool_))
            else float(v) if isinstance(v, (int, float, np.number))
            else v
            for k, v in result['metrics'].items()
        }
        rank_info['output_dir'] = os.path.join(output_dir, f'rank_{result["rank"]:02d}')
        rank_info['has_loss_history'] = hasattr(result['model'], 'loss_') and result['model'].loss_ is not None
    else:
        rank_info['error'] = result.get('error', 'Unknown error')
    log_data['results'].append(rank_info)

with open(log_file, 'w') as f:
    json.dump(log_data, f, indent=2)

print(f"\nLog file saved to: {log_file}")
print("This log can be used by subsequent chunks to avoid re-running decompositions")
```

# VISUALIZATION

## Plot rank comparisons

### 1. Comparison Plots

Located in `rank_comparison/` inside the output directory:

- **`elbow_plot.png`** - Shows variance explained vs rank (MOST IMPORTANT)
  - Look for the "elbow" where gains diminish
  - Marginal gains are annotated on the plot
- **`rank_comparison_summary.png`** - 6-panel overview
- **`loss_trajectories.png`** - Convergence speed comparison

### 2. Metrics Table

File: `rank_comparison/rank_comparison_metrics.csv`

Contains all quantitative metrics for comparison.

### 3. Individual Rank Results

Each rank gets its own directory (`rank_05/`, `rank_10/`, etc.) with full factor matrices.
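As a quick numeric complement to the elbow plot, the optional sketch below (not part of the main workflow, and left unevaluated) loads the metrics table written above and flags the smallest tested rank whose marginal gain in variance explained drops below a threshold. The chunk name `suggest-rank-sketch` and the 2% threshold are illustrative choices, and the sketch assumes the rank comparison chunk has already written `rank_comparison/rank_comparison_metrics.csv`; treat the result as a starting point, not a decision.

```{python suggest-rank-sketch, eval=FALSE}
import os
import numpy as np
import pandas as pd

# Load the metrics table produced by the rank comparison above
metrics = pd.read_csv(os.path.join(output_dir, "rank_comparison", "rank_comparison_metrics.csv"))
metrics = metrics.sort_values("Rank").reset_index(drop=True)

# Marginal gain in variance explained from one tested rank to the next
gains = np.diff(metrics["variance_explained"].values)

# Illustrative heuristic: stop at the first rank where the next gain falls below 2%
threshold = 0.02
suggested_rank = int(metrics["Rank"].iloc[-1])  # default: largest tested rank
for i, gain in enumerate(gains):
    if gain < threshold:
        suggested_rank = int(metrics["Rank"].iloc[i])
        break

print(f"Suggested rank (heuristic only): {suggested_rank}")
print(metrics[["Rank", "variance_explained", "relative_error", "converged"]])
```

Whatever the heuristic suggests, still check convergence, sparsity, and component interpretability before settling on a final rank.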
```{python plot-rank-comparison, eval=TRUE}
# ===========================================
# This chunk reads from rank_comparison_log.json
# and does NOT require variables from previous chunks
# ===========================================
import matplotlib.pyplot as plt
import seaborn as sns
import json
import glob
import os
import pandas as pd
import numpy as np

print(f"\n{'='*60}")
print("Creating rank comparison visualizations")
print(f"{'='*60}")

# Check if output_dir exists, if not get it from R
if 'output_dir' not in globals():
    print("Getting output_dir from R environment...")
    output_dir = r.output_dir
    print(f"  output_dir = {output_dir}")

comparison_dir = os.path.join(output_dir, 'rank_comparison')
os.makedirs(comparison_dir, exist_ok=True)

# ===========================================
# LOAD RESULTS FROM LOG FILE
# ===========================================
log_file = os.path.join(output_dir, 'rank_comparison_log.json')

if os.path.exists(log_file):
    print(f"Loading results from log file: {log_file}")
    with open(log_file, 'r') as f:
        log_data = json.load(f)
    print(f"  Log timestamp: {log_data['timestamp']}")
    print(f"  Total runs: {log_data['total_runs']}")
    print(f"  Successful: {log_data['successful_runs']}")
    print(f"  Failed: {log_data['failed_runs']}")

    # Reconstruct all_rank_results from log
    all_rank_results = []
    for result_info in log_data['results']:
        if result_info['success']:
            all_rank_results.append({
                'rank': result_info['rank'],
                'metrics': result_info['metrics'],
                'success': True
            })
        else:
            all_rank_results.append({
                'rank': result_info['rank'],
                'success': False,
                'error': result_info.get('error', 'Unknown error')
            })
    print(f"  Loaded {len(all_rank_results)} results from log file")

elif 'all_rank_results' not in globals():
    # Fallback: Load from individual metadata files
    print("Log file not found. Loading from individual rank directories...")
    all_rank_results = []

    # Find all rank_* directories
    rank_dirs = sorted(glob.glob(os.path.join(output_dir, 'rank_*')))
    print(f"  Found {len(rank_dirs)} rank directories")
    for rank_dir in rank_dirs:
        metadata_file = os.path.join(rank_dir, 'metadata.json')
        if os.path.exists(metadata_file):
            with open(metadata_file, 'r') as f:
                metadata = json.load(f)
            all_rank_results.append({
                'rank': metadata['rank'],
                'metrics': metadata['metrics'],
                'success': True
            })
    print(f"  Loaded {len(all_rank_results)} results from disk")

else:
    print("Using all_rank_results from memory")

# Extract successful results
successful = [r for r in all_rank_results if r['success']]

if len(successful) == 0:
    print("No successful decompositions to compare!")
else:
    ranks = [r['rank'] for r in successful]

    # Create comparison dataframe
    comparison_df = pd.DataFrame([
        {'Rank': r['rank'], **r['metrics']}
        for r in successful
    ])

    # Save comparison table
    comparison_df.to_csv(os.path.join(comparison_dir, 'rank_comparison_metrics.csv'), index=False)
    print(f"  Saved metrics table")

    # Create multi-panel comparison plot
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))
    fig.suptitle('Rank Comparison: Key Metrics', fontsize=16, fontweight='bold')

    # 1. Variance Explained
    ax = axes[0, 0]
    ax.plot(ranks, comparison_df['variance_explained'], 'o-', linewidth=2, markersize=8, color='steelblue')
    ax.axhline(y=0.8, color='green', linestyle='--', alpha=0.5, label='80% threshold')
    ax.axhline(y=0.6, color='orange', linestyle='--', alpha=0.5, label='60% threshold')
    ax.set_xlabel('Rank (Number of Components)', fontsize=11)
    ax.set_ylabel('Variance Explained', fontsize=11)
    ax.set_title('Variance Explained vs Rank', fontweight='bold')
    ax.grid(True, alpha=0.3)
    ax.legend()
    ax.set_ylim([0, 1.05])

    # 2. Relative Error
    ax = axes[0, 1]
    ax.plot(ranks, comparison_df['relative_error'], 'o-', linewidth=2, markersize=8, color='coral')
    ax.set_xlabel('Rank (Number of Components)', fontsize=11)
    ax.set_ylabel('Relative Reconstruction Error', fontsize=11)
    ax.set_title('Reconstruction Error vs Rank', fontweight='bold')
    ax.grid(True, alpha=0.3)

    # 3. Final Loss
    ax = axes[0, 2]
    ax.plot(ranks, comparison_df['final_loss'], 'o-', linewidth=2, markersize=8, color='purple')
    ax.set_xlabel('Rank (Number of Components)', fontsize=11)
    ax.set_ylabel('Final Loss', fontsize=11)
    ax.set_title('Final Loss vs Rank', fontweight='bold')
    ax.grid(True, alpha=0.3)

    # 4. Gene Sparsity
    ax = axes[1, 0]
    ax.plot(ranks, comparison_df['gene_sparsity'], 'o-', linewidth=2, markersize=8, color='green')
    ax.axhline(y=0.9, color='red', linestyle='--', alpha=0.5, label='90% sparse')
    ax.set_xlabel('Rank (Number of Components)', fontsize=11)
    ax.set_ylabel('Gene Factor Sparsity', fontsize=11)
    ax.set_title('Gene Sparsity vs Rank', fontweight='bold')
    ax.grid(True, alpha=0.3)
    ax.legend()
    ax.set_ylim([0, 1.05])

    # 5. Weight Distribution (CV)
    ax = axes[1, 1]
    ax.plot(ranks, comparison_df['weight_cv'], 'o-', linewidth=2, markersize=8, color='darkorange')
    ax.set_xlabel('Rank (Number of Components)', fontsize=11)
    ax.set_ylabel('Weight Coefficient of Variation', fontsize=11)
    ax.set_title('Component Weight Variability vs Rank', fontweight='bold')
    ax.grid(True, alpha=0.3)

    # 6. Iterations to Convergence
    ax = axes[1, 2]
    colors = ['green' if c else 'red' for c in comparison_df['converged']]
    ax.bar(ranks, comparison_df['n_iterations'], color=colors, alpha=0.7)
    ax.set_xlabel('Rank (Number of Components)', fontsize=11)
    ax.set_ylabel('Iterations to Convergence', fontsize=11)
    ax.set_title('Convergence Speed vs Rank\n(Green=Converged, Red=Max Iterations)', fontweight='bold')
    ax.grid(True, alpha=0.3, axis='y')

    plt.tight_layout()
    output_path = os.path.join(comparison_dir, 'rank_comparison_summary.png')
    plt.savefig(output_path, dpi=150, bbox_inches='tight')
    print(f"  Saved: rank_comparison_summary.png")
    plt.close()

    # Create elbow plot
    fig, ax = plt.subplots(figsize=(10, 6))
    var_exp = comparison_df['variance_explained'].values
    ax.plot(ranks, var_exp, 'o-', linewidth=3, markersize=10, color='steelblue', label='Variance Explained')

    # Calculate marginal gains
    if len(ranks) > 1:
        marginal_gains = np.diff(var_exp)
        for i in range(len(marginal_gains)):
            ax.annotate(f'+{marginal_gains[i]*100:.1f}%',
                        xy=(ranks[i+1], var_exp[i+1]),
                        xytext=(10, -10),
                        textcoords='offset points',
                        fontsize=9, color='darkgreen',
                        bbox=dict(boxstyle='round,pad=0.3', facecolor='yellow', alpha=0.3))

    ax.axhline(y=0.8, color='green', linestyle='--', alpha=0.5, linewidth=2, label='80% threshold (excellent)')
    ax.axhline(y=0.6, color='orange', linestyle='--', alpha=0.5, linewidth=2, label='60% threshold (good)')
    ax.set_xlabel('Rank (Number of Components)', fontsize=13, fontweight='bold')
    ax.set_ylabel('Variance Explained', fontsize=13, fontweight='bold')
    ax.set_title('Elbow Plot: Finding Optimal Rank\n(Look for diminishing returns)',
                 fontsize=14, fontweight='bold')
    ax.grid(True, alpha=0.3)
    ax.legend(fontsize=11)
    ax.set_ylim([0, 1.05])
    ax.set_xticks(ranks)
    plt.tight_layout()
    output_path = os.path.join(comparison_dir, 'elbow_plot.png')
    plt.savefig(output_path, dpi=150, bbox_inches='tight')
    print(f"  Saved: elbow_plot.png")
    plt.close()

    # Loss trajectories
    fig, ax = plt.subplots(figsize=(10, 6))
    loss_plotted = False
    for r in successful:
        # Try to get loss from model object first
        if 'model' in r and hasattr(r['model'], 'loss_') and r['model'].loss_ is not None:
            loss = r['model'].loss_
            ax.plot(range(1, len(loss)+1), loss, linewidth=2, label=f'Rank {r["rank"]}', alpha=0.8)
            loss_plotted = True
        else:
            # Load from disk
            rank_dir = os.path.join(output_dir, f'rank_{r["rank"]:02d}')
            loss_file = os.path.join(rank_dir, 'loss_history.csv')
            if os.path.exists(loss_file):
                loss_df = pd.read_csv(loss_file)
                ax.plot(loss_df['Iteration'], loss_df['Loss'], linewidth=2, label=f'Rank {r["rank"]}', alpha=0.8)
                loss_plotted = True

    if loss_plotted:
        ax.set_xlabel('Iteration', fontsize=12)
        ax.set_ylabel('Loss', fontsize=12)
        ax.set_title('Loss Convergence Trajectories by Rank', fontsize=14, fontweight='bold')
        ax.grid(True, alpha=0.3)
        ax.legend(fontsize=10)
        ax.set_yscale('log')
        plt.tight_layout()
        output_path = os.path.join(comparison_dir, 'loss_trajectories.png')
        plt.savefig(output_path, dpi=150, bbox_inches='tight')
        print(f"  Saved: loss_trajectories.png")
    else:
        print(f"  Skipped loss_trajectories.png (no loss history available)")
    plt.close()

    print(f"\nAll comparison plots saved to: {comparison_dir}")
```

# RANK STABILITY ANALYSIS

## Gene stability across ranks

This analysis determines at what rank the top genes stabilize.
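For each tested rank, the top 100 genes (ranked by the sum of absolute loadings across all components) are extracted, and the top-gene sets of different ranks are compared with Jaccard similarity: the size of the intersection divided by the size of the union. The sketch below is a minimal illustration of that overlap metric using invented gene IDs, not real factor output; the chunk is not evaluated.

```{python jaccard-toy-example, eval=FALSE}
# Toy illustration of the Jaccard overlap metric used in the next chunk.
# Gene IDs are invented; the real analysis compares the top 100 genes
# per rank, ranked by the sum of absolute loadings across components.
top_rank_08 = {"g1", "g2", "g3", "g4", "g5"}
top_rank_10 = {"g1", "g2", "g3", "g6", "g7"}

shared = top_rank_08 & top_rank_10          # {"g1", "g2", "g3"}
union = top_rank_08 | top_rank_10           # 7 unique genes
jaccard = len(shared) / len(union)          # 3 / 7 ≈ 0.43

print(f"Shared genes: {len(shared)}, Jaccard similarity: {jaccard:.2f}")
```

High Jaccard similarity between consecutive ranks indicates that adding components no longer changes which genes dominate the decomposition.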
```{python rank-stability-analysis, eval=TRUE}
# ===========================================
# This chunk analyzes gene stability across ranks
# ===========================================
import os
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

print(f"\n{'='*60}")
print("RANK STABILITY ANALYSIS: TOP GENE CONSISTENCY")
print(f"{'='*60}")

# ===========================================
# LOAD RESULTS FROM LOG FILE IF NEEDED
# ===========================================
if 'output_dir' not in globals():
    print("Getting output_dir from R environment...")
    output_dir = r.output_dir
    print(f"  output_dir = {output_dir}")

# Always reload from log file to ensure consistency
log_file = os.path.join(output_dir, 'rank_comparison_log.json')

if not os.path.exists(log_file):
    print(f"ERROR: Log file not found: {log_file}")
    print("Please run the 'barnacle-multiple-ranks' chunk first!")
    all_rank_results = []
    successful = []
else:
    print(f"Loading results from log file: {log_file}")
    with open(log_file, 'r') as f:
        log_data = json.load(f)
    print(f"  Log timestamp: {log_data['timestamp']}")
    print(f"  Total runs: {log_data['total_runs']}")
    print(f"  Successful: {log_data['successful_runs']}")
    print(f"  Failed: {log_data['failed_runs']}")

    # Reconstruct all_rank_results from log
    all_rank_results = []
    for result_info in log_data['results']:
        if result_info['success']:
            all_rank_results.append({
                'rank': result_info['rank'],
                'metrics': result_info['metrics'],
                'success': True
            })
        else:
            all_rank_results.append({
                'rank': result_info['rank'],
                'success': False,
                'error': result_info.get('error', 'Unknown error')
            })
    print(f"  Loaded {len(all_rank_results)} results from log file")

    successful = [r for r in all_rank_results if r['success']]
    successful = sorted(successful, key=lambda x: x['rank'])

if len(successful) == 0:
    print("No successful decompositions to analyze!")
else:
    # ===========================================
    # EXTRACT TOP GENES FROM EACH RANK
    # ===========================================
    print("\nExtracting top genes from each rank...")
    TOP_N = 100  # Number of top genes to compare

    top_genes_by_rank = {}
    for r in successful:
        rank = r['rank']
        rank_dir = os.path.join(output_dir, f'rank_{rank:02d}')
        gene_factors_file = os.path.join(rank_dir, 'gene_factors.csv')
        if os.path.exists(gene_factors_file):
            gene_factors = pd.read_csv(gene_factors_file, index_col=0)
            # Get genes with highest loadings across ALL components
            # Sum absolute values across all components for each gene
            gene_importance = gene_factors.abs().sum(axis=1)
            top_genes = set(gene_importance.nlargest(TOP_N).index)
            top_genes_by_rank[rank] = {
                'genes': top_genes,
                'importance': gene_importance
            }
            print(f"  Rank {rank:2d}: Top {TOP_N} genes extracted")
        else:
            print(f"  Rank {rank:2d}: Gene factors file not found!")

    ranks = sorted(top_genes_by_rank.keys())

    # ===========================================
    # CALCULATE JACCARD SIMILARITY MATRIX
    # ===========================================
    print(f"\nCalculating pairwise Jaccard similarities...")
    n_ranks = len(ranks)
    similarity_matrix = np.zeros((n_ranks, n_ranks))
    for i, rank1 in enumerate(ranks):
        for j, rank2 in enumerate(ranks):
            genes1 = top_genes_by_rank[rank1]['genes']
            genes2 = top_genes_by_rank[rank2]['genes']
            intersection = len(genes1 & genes2)
            union = len(genes1 | genes2)
            jaccard = intersection / union if union > 0 else 0
            similarity_matrix[i, j] = jaccard

    # ===========================================
    # CALCULATE CONSECUTIVE RANK OVERLAP
    # ===========================================
    print(f"\nCalculating consecutive rank overlaps...")
    consecutive_overlaps = []
    for i in range(len(ranks) - 1):
        rank1, rank2 = ranks[i], ranks[i+1]
        genes1 = top_genes_by_rank[rank1]['genes']
        genes2 = top_genes_by_rank[rank2]['genes']
        overlap = len(genes1 & genes2)
        jaccard = overlap / len(genes1 | genes2)
        percent_overlap = (overlap / TOP_N) * 100
        consecutive_overlaps.append({
            'rank1': rank1,
            'rank2': rank2,
            'overlap_count': overlap,
            'jaccard_similarity': jaccard,
            'percent_overlap': percent_overlap
        })

    overlap_df = pd.DataFrame(consecutive_overlaps)

    # Save to file
    comparison_dir = os.path.join(output_dir, 'rank_comparison')
    overlap_df.to_csv(os.path.join(comparison_dir, 'consecutive_rank_overlap.csv'), index=False)

    print("\nConsecutive Rank Overlap:")
    print(overlap_df.to_string(index=False, float_format='%.3f'))

    # ===========================================
    # IDENTIFY STABILITY THRESHOLD
    # ===========================================
    print(f"\n{'='*60}")
    print("STABILITY ANALYSIS")
    print(f"{'='*60}")

    # Find where overlap stabilizes (high consecutive overlap)
    high_overlap_threshold = 0.80  # 80% overlap
    stable_ranks = overlap_df[overlap_df['jaccard_similarity'] >= high_overlap_threshold]

    if len(stable_ranks) > 0:
        first_stable = stable_ranks.iloc[0]
        print(f"\nFirst stable transition: Rank {int(first_stable['rank1'])} → {int(first_stable['rank2'])}")
        print(f"  Overlap: {first_stable['overlap_count']:.0f}/{TOP_N} genes ({first_stable['percent_overlap']:.1f}%)")
        print(f"  Jaccard: {first_stable['jaccard_similarity']:.3f}")
        print(f"\n→ Model appears to stabilize around rank {int(first_stable['rank2'])}")
    else:
        print(f"\nNo consecutive ranks have ≥{high_overlap_threshold*100:.0f}% overlap")
        print("  Model may not have fully stabilized in tested range")
        print("  Consider testing higher ranks")

    # ===========================================
    # PLOT 1: HEATMAP OF JACCARD SIMILARITIES
    # ===========================================
    print(f"\nCreating similarity heatmap...")
    fig, ax = plt.subplots(figsize=(12, 10))
    sns.heatmap(
        similarity_matrix,
        xticklabels=ranks,
        yticklabels=ranks,
        annot=True,
        fmt='.2f',
        cmap='RdYlGn',
        vmin=0,
        vmax=1,
        square=True,
        cbar_kws={'label': 'Jaccard Similarity'},
        ax=ax
    )
    ax.set_xlabel('Rank', fontsize=12, fontweight='bold')
    ax.set_ylabel('Rank', fontsize=12, fontweight='bold')
    ax.set_title(f'Top {TOP_N} Gene Overlap Between Ranks\n(Jaccard Similarity)',
                 fontsize=14, fontweight='bold', pad=20)
    plt.tight_layout()
    output_path = os.path.join(comparison_dir, 'gene_stability_heatmap.png')
    plt.savefig(output_path, dpi=150, bbox_inches='tight')
    print(f"  Saved: gene_stability_heatmap.png")
    plt.close()

    # ===========================================
    # PLOT 2: CONSECUTIVE OVERLAP LINE PLOT
    # ===========================================
    print(f"Creating consecutive overlap plot...")
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 10))

    # Plot 1: Absolute overlap count
    ax1.plot(range(len(overlap_df)), overlap_df['overlap_count'], 'o-', linewidth=2, markersize=8, color='steelblue')
    ax1.axhline(y=TOP_N*0.8, color='green', linestyle='--', alpha=0.5, label='80% threshold')
    ax1.axhline(y=TOP_N*0.9, color='darkgreen', linestyle='--', alpha=0.5, label='90% threshold')
    ax1.set_ylabel(f'Number of Overlapping Genes (out of {TOP_N})', fontsize=11)
    ax1.set_title(f'Consecutive Rank Gene Overlap\n(Top {TOP_N} Most Important Genes)', fontsize=13, fontweight='bold')
    ax1.grid(True, alpha=0.3)
    ax1.legend()
    ax1.set_xticks(range(len(overlap_df)))
    ax1.set_xticklabels([f'{int(row["rank1"])}→{int(row["rank2"])}' for _, row in overlap_df.iterrows()], rotation=45)

    # Plot 2: Jaccard similarity
    ax2.plot(range(len(overlap_df)), overlap_df['jaccard_similarity'], 'o-', linewidth=2, markersize=8, color='coral')
    ax2.axhline(y=0.8, color='green', linestyle='--', alpha=0.5, label='80% similarity')
    ax2.axhline(y=0.9, color='darkgreen', linestyle='--', alpha=0.5, label='90% similarity')
    ax2.set_xlabel('Rank Transition', fontsize=11)
    ax2.set_ylabel('Jaccard Similarity', fontsize=11)
    ax2.set_title('Jaccard Similarity Between Consecutive Ranks', fontsize=13, fontweight='bold')
    ax2.grid(True, alpha=0.3)
    ax2.legend()
    ax2.set_ylim([0, 1.05])
    ax2.set_xticks(range(len(overlap_df)))
    ax2.set_xticklabels([f'{int(row["rank1"])}→{int(row["rank2"])}' for _, row in overlap_df.iterrows()], rotation=45)

    plt.tight_layout()
    output_path = os.path.join(comparison_dir, 'consecutive_overlap_plot.png')
    plt.savefig(output_path, dpi=150, bbox_inches='tight')
    print(f"  Saved: consecutive_overlap_plot.png")
    plt.close()

    # ===========================================
    # EXPORT TOP GENES FOR EACH RANK
    # ===========================================
    print(f"\nExporting top {TOP_N} genes for each rank...")
    for rank in ranks:
        genes = top_genes_by_rank[rank]['genes']
        importance = top_genes_by_rank[rank]['importance']
        top_genes_df = pd.DataFrame({
            'gene_id': list(genes),
            'total_importance': [importance[g] for g in genes]
        }).sort_values('total_importance', ascending=False)
        rank_dir = os.path.join(output_dir, f'rank_{rank:02d}')
        output_file = os.path.join(rank_dir, f'top_{TOP_N}_genes.csv')
        top_genes_df.to_csv(output_file, index=False)
    print(f"  Saved top_{TOP_N}_genes.csv for each rank")

    # ===========================================
    # EXPORT SHARED GENES FOR CONSECUTIVE RANK COMPARISONS
    # ===========================================
    print(f"\nExporting shared genes for consecutive rank comparisons...")
    for i in range(len(ranks) - 1):
        rank1, rank2 = ranks[i], ranks[i+1]
        genes1 = top_genes_by_rank[rank1]['genes']
        genes2 = top_genes_by_rank[rank2]['genes']

        # Get shared genes
        shared_genes = genes1 & genes2

        # Get importance scores from both ranks
        importance1 = top_genes_by_rank[rank1]['importance']
        importance2 = top_genes_by_rank[rank2]['importance']

        # Create dataframe with shared genes and their importance in both ranks
        shared_df = pd.DataFrame({
            'gene_id': list(shared_genes),
            f'rank_{rank1}_importance': [importance1[g] for g in shared_genes],
            f'rank_{rank2}_importance': [importance2[g] for g in shared_genes]
        })

        # Sort by average importance
        shared_df['avg_importance'] = (shared_df[f'rank_{rank1}_importance'] + shared_df[f'rank_{rank2}_importance']) / 2
        shared_df = shared_df.sort_values('avg_importance', ascending=False)

        # Save to comparison directory
        output_file = os.path.join(comparison_dir, f'shared_genes_rank_{rank1}_vs_{rank2}.csv')
        shared_df.to_csv(output_file, index=False)
        print(f"  Rank {rank1} vs {rank2}: {len(shared_genes)} shared genes")
    print(f"  Saved shared_genes_rank_X_vs_Y.csv files to {comparison_dir}")

    # ===========================================
    # EXPORT ALL PAIRWISE SHARED GENES SUMMARY
    # ===========================================
    print(f"\nCreating pairwise shared genes summary...")
    pairwise_summary = []
    for i, rank1 in enumerate(ranks):
        for j, rank2 in enumerate(ranks):
            if i < j:  # Only upper triangle (avoid duplicates)
                genes1 = top_genes_by_rank[rank1]['genes']
                genes2 = top_genes_by_rank[rank2]['genes']
                shared = genes1 & genes2
                pairwise_summary.append({
                    'rank1': rank1,
                    'rank2': rank2,
                    'shared_count': len(shared),
                    'jaccard_similarity': len(shared) / len(genes1 | genes2) if len(genes1 | genes2) > 0 else 0,
                    'percent_overlap': (len(shared) / TOP_N) * 100
                })

    pairwise_df = pd.DataFrame(pairwise_summary)
    pairwise_file = os.path.join(comparison_dir, 'pairwise_shared_genes_summary.csv')
    pairwise_df.to_csv(pairwise_file, index=False)
    print(f"  Saved pairwise_shared_genes_summary.csv")

    print(f"\n{'='*60}")
    print("STABILITY ANALYSIS COMPLETE")
    print(f"{'='*60}")
    print(f"\nAll stability plots saved to: {comparison_dir}")
    print(f"- gene_stability_heatmap.png")
    print(f"- consecutive_overlap_plot.png")
    print(f"- consecutive_rank_overlap.csv")
    print(f"- shared_genes_rank_X_vs_Y.csv (for consecutive rank pairs)")
    print(f"- pairwise_shared_genes_summary.csv (all pairwise comparisons)")
    print(f"\nTop {TOP_N} genes for each rank saved to: rank_XX/top_{TOP_N}_genes.csv")
```

# INDIVIDUAL RANK VISUALIZATIONS

## Detailed plots for each rank

This section creates comprehensive visualizations for each rank to aid in assessment and comparison.
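One of the per-rank plots produced below is a component correlation heatmap, used to check whether components are redundant: two components whose gene-loading vectors are highly correlated are probably capturing the same signal. The sketch below is a minimal, hypothetical illustration of that check using an invented loading matrix (the real loadings come from each rank's `gene_factors.csv`); the chunk is not evaluated.

```{python component-redundancy-toy, eval=FALSE}
import pandas as pd

# Invented gene-loading matrix: rows = genes, columns = components.
gene_factors = pd.DataFrame(
    {"C1": [2.1, 0.0, 1.5, 0.2],
     "C2": [2.0, 0.1, 1.4, 0.3],   # nearly identical to C1 -> redundant
     "C3": [0.0, 1.8, 0.1, 2.2]},
    index=["gene_a", "gene_b", "gene_c", "gene_d"],
)

# Correlate component loading vectors and flag highly similar pairs
corr = gene_factors.corr()
for i, c1 in enumerate(corr.columns):
    for c2 in corr.columns[i + 1:]:
        if abs(corr.loc[c1, c2]) > 0.9:
            print(f"{c1} and {c2} look redundant (r = {corr.loc[c1, c2]:.2f})")
```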
```{python individual-rank-visualizations, eval=TRUE}
# ===========================================
# CREATE DETAILED VISUALIZATIONS FOR EACH RANK
# ===========================================
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

print(f"\n{'='*60}")
print("CREATING INDIVIDUAL RANK VISUALIZATIONS")
print(f"{'='*60}")

# Get output directory
if 'output_dir' not in globals():
    output_dir = r.output_dir

# Load log file to get ranks
log_file = os.path.join(output_dir, 'rank_comparison_log.json')

if not os.path.exists(log_file):
    print(f"ERROR: Log file not found: {log_file}")
    print("Please run the 'barnacle-multiple-ranks' chunk first!")
else:
    import json
    with open(log_file, 'r') as f:
        log_data = json.load(f)

    # Get successful ranks
    successful_ranks = [r['rank'] for r in log_data['results'] if r['success']]
    print(f"Creating visualizations for {len(successful_ranks)} ranks...")

    # Parameters for visualization
    TOP_GENES_HEATMAP = 50  # Number of top genes to show in heatmap

    for rank in successful_ranks:
        print(f"\n--- Rank {rank} ---")
        rank_dir = os.path.join(output_dir, f'rank_{rank:02d}')

        # Load factor matrices
        gene_factors_file = os.path.join(rank_dir, 'gene_factors.csv')
        time_factors_file = os.path.join(rank_dir, 'time_factors.csv')
        weights_file = os.path.join(rank_dir, 'component_weights.csv')
        if not all([os.path.exists(f) for f in [gene_factors_file, time_factors_file, weights_file]]):
            print(f"  Skipping rank {rank} - missing factor files")
            continue

        gene_factors = pd.read_csv(gene_factors_file, index_col=0)
        time_factors = pd.read_csv(time_factors_file, index_col=0)
        weights = pd.read_csv(weights_file)

        # ===========================================
        # PLOT 1: GENE HEATMAP (Top genes x Components)
        # ===========================================
        print(f"  Creating gene-component heatmap...")

        # Get top genes by total importance
        gene_importance = gene_factors.abs().sum(axis=1)
        top_genes = gene_importance.nlargest(TOP_GENES_HEATMAP).index
        top_gene_factors = gene_factors.loc[top_genes]

        # Create heatmap
        fig, ax = plt.subplots(figsize=(max(12, rank*0.5), 10))
        sns.heatmap(
            top_gene_factors,
            cmap='RdBu_r',
            center=0,
            cbar_kws={'label': 'Component Loading'},
            xticklabels=True,
            yticklabels=True,
            ax=ax
        )
        ax.set_xlabel('Components', fontsize=12, fontweight='bold')
        ax.set_ylabel('Gene ID', fontsize=12, fontweight='bold')
        ax.set_title(f'Rank {rank}: Top {TOP_GENES_HEATMAP} Genes Across Components\n' +
                     f'(Ordered by Total Importance)',
                     fontsize=13, fontweight='bold', pad=15)
        plt.tight_layout()
        output_path = os.path.join(rank_dir, f'gene_component_heatmap_top{TOP_GENES_HEATMAP}.png')
        plt.savefig(output_path, dpi=150, bbox_inches='tight')
        plt.close()
        print(f"    Saved: gene_component_heatmap_top{TOP_GENES_HEATMAP}.png")

        # ===========================================
        # PLOT 2: TIME COMPONENT LINE PLOT
        # ===========================================
        print(f"  Creating time-component line plot...")
        fig, ax = plt.subplots(figsize=(12, 8))

        # Plot each component
        for i, col in enumerate(time_factors.columns):
            component_weights = weights[weights['Component'] == col]['Weight'].values[0]
            ax.plot(time_factors.index, time_factors[col],
                    marker='o', linewidth=2, markersize=8,
                    label=f'{col} (w={component_weights:.3f})', alpha=0.8)

        ax.set_xlabel('Timepoint', fontsize=12, fontweight='bold')
        ax.set_ylabel('Component Loading', fontsize=12, fontweight='bold')
        ax.set_title(f'Rank {rank}: Component Loadings Across Timepoints\n' +
                     '(Each line = one component, weighted by importance)',
                     fontsize=13, fontweight='bold', pad=15)
        ax.grid(True, alpha=0.3)
        ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=9)
        ax.axhline(y=0, color='black', linestyle='-', linewidth=0.5, alpha=0.3)
        plt.tight_layout()
        output_path = os.path.join(rank_dir, 'time_component_lineplot.png')
        plt.savefig(output_path, dpi=150, bbox_inches='tight')
        plt.close()
        print(f"    Saved: time_component_lineplot.png")

        # ===========================================
        # PLOT 3: COMPONENT WEIGHT BAR PLOT
        # ===========================================
        print(f"  Creating component weight bar plot...")
        fig, ax = plt.subplots(figsize=(max(10, rank*0.4), 6))
        bars = ax.bar(range(len(weights)), weights['Weight'], color='steelblue', alpha=0.7, edgecolor='black')

        # Color top 5 components differently
        for i in range(min(5, len(bars))):
            bars[i].set_color('coral')
            bars[i].set_alpha(0.9)

        ax.set_xlabel('Component', fontsize=12, fontweight='bold')
        ax.set_ylabel('Weight', fontsize=12, fontweight='bold')
        ax.set_title(f'Rank {rank}: Component Weights\n(Top 5 highlighted in coral)',
                     fontsize=13, fontweight='bold', pad=15)
        ax.set_xticks(range(len(weights)))
        ax.set_xticklabels([f'C{i+1}' for i in range(len(weights))], rotation=45 if rank > 20 else 0)
        ax.grid(True, alpha=0.3, axis='y')
        plt.tight_layout()
        output_path = os.path.join(rank_dir, 'component_weights_barplot.png')
        plt.savefig(output_path, dpi=150, bbox_inches='tight')
        plt.close()
        print(f"    Saved: component_weights_barplot.png")

        # ===========================================
        # PLOT 4: GENE IMPORTANCE DISTRIBUTION
        # ===========================================
        print(f"  Creating gene importance distribution...")
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

        # Histogram
        ax1.hist(gene_importance, bins=50, color='steelblue', alpha=0.7, edgecolor='black')
        ax1.axvline(gene_importance.nlargest(TOP_GENES_HEATMAP).iloc[-1],
                    color='red', linestyle='--', linewidth=2,
                    label=f'Top {TOP_GENES_HEATMAP} threshold')
        ax1.set_xlabel('Total Importance Score', fontsize=11)
        ax1.set_ylabel('Number of Genes', fontsize=11)
        ax1.set_title('Distribution of Gene Importance Scores', fontweight='bold')
        ax1.grid(True, alpha=0.3, axis='y')
        ax1.legend()

        # Cumulative curve
        sorted_importance = gene_importance.sort_values(ascending=False).values
        cumsum = np.cumsum(sorted_importance)
        cumsum_pct = (cumsum / cumsum[-1]) * 100
        ax2.plot(range(len(cumsum_pct)), cumsum_pct, linewidth=2, color='darkgreen')
        ax2.axhline(y=50, color='orange', linestyle='--', alpha=0.5, label='50% of total importance')
        ax2.axhline(y=80, color='red', linestyle='--', alpha=0.5, label='80% of total importance')
        ax2.set_xlabel('Number of Top Genes', fontsize=11)
        ax2.set_ylabel('Cumulative Importance (%)', fontsize=11)
        ax2.set_title('Cumulative Importance by Gene Rank', fontweight='bold')
        ax2.grid(True, alpha=0.3)
        ax2.legend()

        fig.suptitle(f'Rank {rank}: Gene Importance Analysis', fontsize=14, fontweight='bold', y=1.02)
        plt.tight_layout()
        output_path = os.path.join(rank_dir, 'gene_importance_distribution.png')
        plt.savefig(output_path, dpi=150, bbox_inches='tight')
        plt.close()
        print(f"    Saved: gene_importance_distribution.png")

        # ===========================================
        # PLOT 5: COMPONENT CORRELATION HEATMAP
        # ===========================================
        print(f"  Creating component correlation heatmap...")

        # Calculate correlation between components based on gene loadings
        component_corr = gene_factors.corr()

        fig, ax = plt.subplots(figsize=(max(10, rank*0.4), max(8, rank*0.35)))
        sns.heatmap(
            component_corr,
            annot=True if rank <= 20 else False,
            fmt='.2f' if rank <= 20 else '.1f',
            cmap='coolwarm',
            center=0,
            square=True,
            cbar_kws={'label': 'Correlation'},
            ax=ax
        )
        ax.set_title(f'Rank {rank}: Component Correlation Matrix\n' +
                     '(Based on gene loadings - low correlation = good separation)',
                     fontsize=13, fontweight='bold', pad=15)
        plt.tight_layout()
        output_path = os.path.join(rank_dir, 'component_correlation_heatmap.png')
        plt.savefig(output_path, dpi=150, bbox_inches='tight')
        plt.close()
        print(f"    Saved: component_correlation_heatmap.png")

    print(f"\n{'='*60}")
    print("INDIVIDUAL RANK VISUALIZATIONS COMPLETE")
    print(f"{'='*60}")
    print(f"\nFor each rank, the following plots were created in rank_XX/ directories:")
    print(f"  1. gene_component_heatmap_top{TOP_GENES_HEATMAP}.png - Top genes across components")
    print(f"  2. time_component_lineplot.png - Component patterns over time")
    print(f"  3. component_weights_barplot.png - Relative component importance")
    print(f"  4. gene_importance_distribution.png - Gene importance statistics")
    print(f"  5. component_correlation_heatmap.png - Component independence check")
```

# RANK RECOMMENDATION
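The chunk below combines four metrics into a single heuristic score per rank. The 0.40/0.25/0.20/0.15 weights are ad hoc choices of this workflow (not barnacle defaults) and can be adjusted:

$$
\text{score} = 0.40\,\text{var} + 0.25\,\overline{\text{sparsity}} + 0.20\,\text{conv} + 0.15\,\text{weight}
$$

where $\text{var}$ is variance explained, $\overline{\text{sparsity}}$ is the mean of gene, sample, and time factor sparsities, $\text{conv}$ is 1.0 if the fit converged (0.5 otherwise), and $\text{weight}$ rewards a moderate coefficient of variation in component weights (scoring 1.0 when the CV falls between 0.5 and 2.0). The chunk also flags an elbow at the first rank after which the marginal gain in variance explained drops below 5 percentage points.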
```{python recommend-rank, eval=TRUE}
# ===========================================
# This chunk reads from rank_comparison_log.json
# and does NOT require variables from previous chunks
# ===========================================
import os
import json
import pandas as pd
import numpy as np

print(f"\n{'='*60}")
print("RANK RECOMMENDATION")
print(f"{'='*60}")

# ===========================================
# LOAD RESULTS FROM LOG FILE IF NEEDED
# ===========================================
if 'output_dir' not in globals():
    print("Getting output_dir from R environment...")
    output_dir = r.output_dir
    print(f"  output_dir = {output_dir}")

# Comparison output directory (defined here so this chunk can run standalone)
comparison_dir = os.path.join(output_dir, 'rank_comparison')

if 'all_rank_results' not in globals():
    log_file = os.path.join(output_dir, 'rank_comparison_log.json')
    if os.path.exists(log_file):
        print(f"Loading results from log file: {log_file}")
        with open(log_file, 'r') as f:
            log_data = json.load(f)

        # Reconstruct all_rank_results from log
        all_rank_results = []
        for result_info in log_data['results']:
            if result_info['success']:
                all_rank_results.append({
                    'rank': result_info['rank'],
                    'metrics': result_info['metrics'],
                    'success': True
                })
            else:
                all_rank_results.append({
                    'rank': result_info['rank'],
                    'success': False,
                    'error': result_info.get('error', 'Unknown error')
                })
        print(f"  Loaded {len(all_rank_results)} results from log file")
    else:
        print("ERROR: No log file found and all_rank_results not in memory!")
        print("Please run the barnacle-multiple-ranks chunk first.")
        all_rank_results = []

successful = [r for r in all_rank_results if r['success']]

if len(successful) == 0:
    print("No successful decompositions to analyze!")
else:
    # Sort by rank
    successful = sorted(successful, key=lambda x: x['rank'])

    # Calculate scores for each rank
    scores = []
    for r in successful:
        m = r['metrics']

        # Variance explained score (0-1, higher is better)
        var_score = m['variance_explained']

        # Sparsity score (0-1, higher is better)
        sparsity_score = (m['gene_sparsity'] + m['sample_sparsity'] + m['time_sparsity']) / 3

        # Convergence score
        conv_score = 1.0 if m['converged'] else 0.5

        # Weight distribution score (prefer moderate variation)
        cv = m['weight_cv']
        if 0.5 <= cv <= 2.0:
            weight_score = 1.0
        elif cv < 0.5:
            weight_score = cv / 0.5
        else:
            weight_score = 2.0 / cv

        # Combined score (weighted average)
        combined_score = (
            0.40 * var_score +
            0.25 * sparsity_score +
            0.20 * conv_score +
            0.15 * weight_score
        )

        scores.append({
            'rank': r['rank'],
            'variance_score': var_score,
            'sparsity_score': sparsity_score,
            'convergence_score': conv_score,
            'weight_score': weight_score,
            'combined_score': combined_score
        })

    scores_df = pd.DataFrame(scores)

    # Find elbow in variance explained
    var_exp = [r['metrics']['variance_explained'] for r in successful]
    ranks = [r['rank'] for r in successful]
    elbow_rank = None
    if len(var_exp) > 2:
        marginal_gains = np.diff(var_exp)
        elbow_indices = np.where(marginal_gains < 0.05)[0]
        if len(elbow_indices) > 0:
            elbow_rank = ranks[elbow_indices[0]]

    # Best by combined score
    best_idx = scores_df['combined_score'].idxmax()
    recommended_rank = scores_df.iloc[best_idx]['rank']

    print("\nScoring Summary:")
    print(scores_df.to_string(index=False, float_format='%.3f'))

    print(f"\n{'='*60}")
    print(f"RECOMMENDED RANK: {int(recommended_rank)}")
    print(f"{'='*60}")
    print(f"  Combined score: {scores_df.iloc[best_idx]['combined_score']:.3f}")
    print(f"  Variance explained: {successful[best_idx]['metrics']['variance_explained']*100:.1f}%")
    print(f"  Gene sparsity: {successful[best_idx]['metrics']['gene_sparsity']*100:.1f}%")
    print(f"  Converged: {successful[best_idx]['metrics']['converged']}")

    if elbow_rank is not None:
        print(f"\nElbow point detected at rank: {int(elbow_rank)}")
        print("  (Diminishing returns beyond this point)")

    print("\nReasoning:")
    if recommended_rank <= min(ranks):
        print("  → Lower rank preferred: Model may be overfitting or data is simple")
    elif recommended_rank >= max(ranks):
        print("  → Higher rank preferred: Complex data may need more components")
        print("  → Consider testing even higher ranks (e.g., +5, +10)")
    else:
        print("  → Middle rank optimal: Good balance of fit and interpretability")

    print("\nNext steps:")
    print(f"  1. Review plots in: {comparison_dir}")
    print(f"  2. Examine individual rank results in: {output_dir}/rank_XX/")
    print(f"  3. Verify biological interpretability of top components")
    print(f"  4. Re-run main analysis with chosen rank for final results")
```

## Close log file

```{python close-log, eval=TRUE}
import sys
from datetime import datetime

print()
print("="*60)
print(f"ANALYSIS COMPLETE")
print(f"Finished: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"All output saved to: {output_dir}")
print(f"Log file: {log_file}")
print("="*60)

# Restore stdout and close log file
if hasattr(sys.stdout, 'log'):
    sys.stdout.log.close()
    sys.stdout = sys.stdout.terminal
```

# SYSTEM INFO

```{r system-info, eval=TRUE}
sessionInfo()
```

# REFERENCES