---
title: "13.00-multiomics-barnacle"
author: "Sam White"
date: "2025-10-03"
output: 
  github_document:
    toc: true
    number_sections: true
  bookdown::html_document2:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
  html_document:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
bibliography: references.bib
citeproc: true
---

# BACKGROUND

This analysis prepares three-species ortholog expression matrices, normalizes the data, constructs a multi‑omics tensor (genes × combined species-samples × timepoints), and runs a sparse tensor decomposition using the [barnacle](https://github.com/blasks/barnacle) [@blaskowski2024] workflow to discover shared gene/time/species patterns.

## Workflow overview

1. **Data Preparation** - Extract and normalize ortholog expression data
2. **Tensor Construction** - Build 3D tensor combining species, samples, and timepoints
3. **Rank Comparison** - Test multiple ranks systematically to find optimal components
4. **Interpretation** - Choose best rank and examine factors, visualizations, and biological meaning

## Rank selection approach

**This analysis uses a systematic rank comparison approach.**

The workflow:
1. Configure `ranks_to_test` in the comparison chunk (e.g., `[5, 8, 10, 12, 15, 20]`)
2. Run decompositions for all specified ranks (10-30 minutes)
3. Review comparison metrics and elbow plot
4. Choose optimal rank based on:
   - Variance explained (aim for >60%)
   - Diminishing returns in elbow plot
   - Sparsity patterns
   - Component interpretability
5. Use chosen rank for final biological interpretation

**Key metrics for comparison:**
- **Variance explained**: How much of data variation is captured
- **Relative error**: Reconstruction accuracy
- **Sparsity**: How many genes/samples load strongly per component
- **Convergence**: Whether optimization completed successfully
- **Component weights**: Distribution of importance across components

## Key steps performed by this script:

- Load annotated ortholog groups and per-species count matrices.
- Filter ortholog groups to retain complete three‑way matches with available expression data.
- Extract per‑species expression matrices mapped to ortholog group IDs.
- Normalize counts (preferred: `sctransform`; fallback: log2(CPM + 1)).
- Build a 3D tensor combining species/sample and timepoint dimensions.
- Run [barnacle](https://github.com/blasks/barnacle) [@blaskowski2024] sparse CP decomposition and save factor matrices, metadata and figures.
- Assess rank appropriateness through multiple metrics and visualizations.

## Input files

- `../output/12-ortho-annot/ortholog_groups_annotated.csv` : annotated ortholog groups with columns for `group_id`, `apul`, `peve`, `ptua`, `type`, etc.
- `../../D-Apul/output/02.20-D-Apul-RNAseq-alignment-HiSat2/apul-gene_count_matrix.csv` : Apul gene count matrix (contains `gene_id` plus sample columns).
- `../../E-Peve/output/02.20-E-Peve-RNAseq-alignment-HiSat2/peve-gene_count_matrix.csv` : Peve gene count matrix.
- `../../F-Ptua/output/02.20-F-Ptua-RNAseq-alignment-HiSat2/ptua-gene_count_matrix.csv` : Ptua gene count matrix.
- (Also referenced) transcript-level matrices: `apul-transcript_count_matrix.csv`, `peve-transcript_count_matrix.csv`, `ptua-transcript_count_matrix.csv` when needed.

## Output files

- `apul_ortholog_expression.csv`, `peve_ortholog_expression.csv`, `ptua_ortholog_expression.csv` : per‑species expression matrices aligned to ortholog `group_id`.
- `apul_normalized_expression.csv`, `peve_normalized_expression.csv`, `ptua_normalized_expression.csv` : normalized expression matrices (sctransform or log2(CPM+1)).
- `multiomics_tensor.npy` : saved NumPy array of the 3D tensor used for decomposition (genes × combined_samples × timepoints).
- `barnacle_factors/` directory containing:
    - `gene_factors.csv` : gene loadings per component (genes × components).
    - `sample_factors.csv` : combined sample (species_sample) loadings per component with `Species` and `Sample_ID` metadata.
    - `time_factors.csv` : timepoint loadings per component.
    - `component_weights.csv` : component weights / importance.
    - `sample_mapping.csv` : mapping of combined sample indices to species and sample IDs.
    - `metadata.json` : analysis parameters and tensor metadata (shape, rank, lambdas, convergence, etc.).
    - `figures/` : generated visualizations (component weights, time loadings, sample heatmap, PCA, top ortholog plots).

## Notes / assumptions:

- Sample column names are parsed expecting a dot-separated format with a `TP#` timepoint token (e.g., `ACR.139.TP1`).
- Apul ortholog IDs in the ortholog table include transcript suffixes (e.g., `-T1`) which are removed for matching to the gene count matrix.
- Missing values in the tensor are handled by substitution (current workflow fills NaNs with zeros before decomposition).
- Samples must have all four timepoints (TP1, TP2, TP3, TP4) to be included in the analysis.


# SETUP

## Libraries

```{r setup, include=FALSE}
library(knitr)
library(tidyverse)
library(reticulate)
library(matrixStats)  # Required for sctransform internal functions
library(sctransform)
library(glmGamPoi)

knitr::opts_chunk$set(
  echo = TRUE,         # Display code chunks
    eval = FALSE,        # Evaluate code chunks
    results = "hold",   # Hold outputs and show them after the full code chunk
  warning = FALSE,     # Hide warnings
        collapse = FALSE,    # Keep code and output in separate blocks
        warning = FALSE,     # Hide warnings
        message = FALSE,     # Hide messages
        comment = "##"      # Prefix output lines with '##' so output is visually distinct
)
```

## Set R variables

```{r R-variables, eval=TRUE}
# OUTPUT DIRECTORY
output_dir <- "../output/13.00-multiomics-barnacle"

#INPUT FILE(S)
ortholog_groups_file <- "../output/12-ortho-annot/ortholog_groups_annotated.csv"

# Transcript count matrices for each species
apul_transcript_matrix_file <- "../../D-Apul/output/02.20-D-Apul-RNAseq-alignment-HiSat2/apul-transcript_count_matrix.csv"
peve_transcript_matrix_file <- "../../E-Peve/output/02.20-E-Peve-RNAseq-alignment-HiSat2/peve-transcript_count_matrix.csv"
ptua_transcript_matrix_file <- "../../F-Ptua/output/02.20-F-Ptua-RNAseq-alignment-HiSat2/ptua-transcript_count_matrix.csv"

# CONDA
conda_env_name <- c("/home/sam/programs/mambaforge/envs/barnacle_py311_env")
conda_path <- c("/opt/anaconda/anaconda3/bin/conda")
```

## Load [barnacle](https://github.com/blasks/barnacle) conda environment

If this is successful, the first line of output should show that the Python being used is the one in your [barnacle](https://github.com/blasks/barnacle) [@blaskowski2024] conda environment path.

E.g.

`python:         /home/sam/programs/mambaforge/envs/barnacle_py311_env/bin/python`

```{r load-barnacle-conda-env, eval=TRUE}
use_condaenv(condaenv = conda_env_name, conda = conda_path)
py_config()
```

# DATA PREP

## Load ortholog groups data

```{r load-ortholog-data, eval=TRUE}
# Read in the ortholog groups data
ortholog_groups <- read.csv(ortholog_groups_file)


# Display basic info about the data
cat("Dimensions of ortholog groups data:", dim(ortholog_groups), "\n\n")
cat("Column names:", colnames(ortholog_groups), "\n\n")
head(ortholog_groups)
str(ortholog_groups)
```

## Extract ortholog expression data

Now let's extract expression data for genes that are present in the ortholog groups. We'll use the gene count matrices with gene IDs to properly map the data.

### Load gene count matrices

```{r load-gene-matrices, eval=TRUE}
# Define file paths for gene count matrices
apul_gene_matrix_file <- "../../D-Apul/output/02.20-D-Apul-RNAseq-alignment-HiSat2/apul-gene_count_matrix.csv"
peve_gene_matrix_file <- "../../E-Peve/output/02.20-E-Peve-RNAseq-alignment-HiSat2/peve-gene_count_matrix.csv"
ptua_gene_matrix_file <- "../../F-Ptua/output/02.20-F-Ptua-RNAseq-alignment-HiSat2/ptua-gene_count_matrix.csv"

# Load gene count matrices for each species
cat("Loading gene count matrices...\n\n")

apul_gene_matrix <- read.csv(apul_gene_matrix_file)
cat("Apul gene matrix dimensions:", dim(apul_gene_matrix), "\n")

peve_gene_matrix <- read.csv(peve_gene_matrix_file)
cat("Peve gene matrix dimensions:", dim(peve_gene_matrix), "\n")

ptua_gene_matrix <- read.csv(ptua_gene_matrix_file)
cat("Ptua gene matrix dimensions:", dim(ptua_gene_matrix), "\n\n")
```

### Filter ortholog groups for complete three-way matches

```{r filter-ortholog-groups, eval=TRUE}
cat("Filtering for complete three-way ortholog groups...\n")

# Keep only rows where all three species have entries (no NA values or empty strings)
complete_ortholog_groups <- ortholog_groups[nzchar(ortholog_groups$apul) & 
                                          nzchar(ortholog_groups$peve) & 
                                          nzchar(ortholog_groups$ptua), ]

cat("Total ortholog groups:", nrow(ortholog_groups), "\n")
cat("Complete three-way ortholog groups:", nrow(complete_ortholog_groups), "\n")

```

### Filter for expression data availability

```{r filter-expression-availability, eval=TRUE}
cat("Filtering ortholog groups to ensure all genes have expression data...\n")

# Clean gene IDs to check against expression data
# For Apul: remove -T[n] suffix from ortholog groups to match gene matrix format
apul_ortholog_genes_check <- gsub("-T[0-9]+$", "", complete_ortholog_groups$apul)

# For Peve and Ptua: use as-is (will clean gene matrix IDs later)
peve_ortholog_genes_check <- complete_ortholog_groups$peve
ptua_ortholog_genes_check <- complete_ortholog_groups$ptua

# Check which genes are present in expression data
# (Note: We need to clean gene matrix IDs to match)
apul_gene_matrix_ids <- gsub("^gene-", "", apul_gene_matrix$gene_id)  # Remove gene- prefix if present
peve_gene_matrix_ids <- gsub("^gene-", "", peve_gene_matrix$gene_id)  # Remove gene- prefix
ptua_gene_matrix_ids <- gsub("^gene-", "", ptua_gene_matrix$gene_id)  # Remove gene- prefix

# Find which ortholog genes are present in each species' expression data
apul_present <- apul_ortholog_genes_check %in% apul_gene_matrix_ids
peve_present <- peve_ortholog_genes_check %in% peve_gene_matrix_ids
ptua_present <- ptua_ortholog_genes_check %in% ptua_gene_matrix_ids

# Keep only ortholog groups where all three species have expression data
expression_complete_mask <- apul_present & peve_present & ptua_present
complete_ortholog_groups <- complete_ortholog_groups[expression_complete_mask, ]

cat("Ortholog groups after filtering for expression data availability:", nrow(complete_ortholog_groups), "\n")

```

### Gene ID cleaning examples

```{r gene-id-examples-testing, eval=TRUE}
cat("\n=== GENE ID CLEANING EXAMPLES ===\n")
cat("Apul (clean ortholog groups to match gene matrix):\n")
cat("Ortholog groups original:", head(complete_ortholog_groups$apul, 3), "\n")
cat("Ortholog groups cleaned:", head(gsub("-T[0-9]+$", "", complete_ortholog_groups$apul), 3), "\n")
cat("Gene matrix (target format):", head(apul_gene_matrix$gene_id, 3), "\n\n")

cat("Peve (clean gene matrix to match ortholog groups):\n")
cat("Ortholog groups (target format):", head(complete_ortholog_groups$peve, 3), "\n") 
cat("Gene matrix original:", head(peve_gene_matrix$gene_id, 3), "\n")
cat("Gene matrix cleaned:", head(peve_gene_matrix$gene_id_clean, 3), "\n\n")

cat("Ptua (clean gene matrix to match ortholog groups):\n")
cat("Ortholog groups (target format):", head(complete_ortholog_groups$ptua, 3), "\n")
cat("Gene matrix original:", head(ptua_gene_matrix$gene_id, 3), "\n")
cat("Gene matrix cleaned:", head(ptua_gene_matrix$gene_id_clean, 3), "\n\n")

```

### Clean gene IDs for matching

```{r clean-gene-ids, eval=TRUE}
cat("Cleaning gene matrix IDs to match ortholog group format...\n")

# For Apul: ortholog groups have "FUN_000185-T1", gene matrix has "FUN_002326"
# We need to remove "-T1" from ortholog groups to match gene matrix
apul_ortholog_genes <- unique(gsub("-T[0-9]+$", "", complete_ortholog_groups$apul))

# For Peve and Ptua: keep ortholog groups as-is and clean gene matrix
peve_ortholog_genes <- unique(complete_ortholog_groups$peve)
ptua_ortholog_genes <- unique(complete_ortholog_groups$ptua)

# Clean gene matrix IDs accordingly
# Apul: gene matrix already in correct format (no cleaning needed)
apul_gene_matrix$gene_id_clean <- apul_gene_matrix$gene_id

# Peve: gene matrix has "gene-Peve_00000032", ortholog groups have "Peve_00037402"  
# So we need to remove "gene-" prefix from gene matrix
peve_gene_matrix$gene_id_clean <- gsub("^gene-", "", peve_gene_matrix$gene_id)

# Ptua: gene matrix has "gene-Pocillopora_meandrina_HIv1___RNAseq.g20905.t1", 
# ortholog groups have "Pocillopora_meandrina_HIv1___RNAseq.g28886.t1"
# So we just need to remove "gene-" prefix from gene matrix
ptua_gene_matrix$gene_id_clean <- gsub("^gene-", "", ptua_gene_matrix$gene_id)

cat("Apul ortholog genes (complete groups only):", length(apul_ortholog_genes), "\n")
cat("Peve ortholog genes (complete groups only):", length(peve_ortholog_genes), "\n")
cat("Ptua ortholog genes (complete groups only):", length(ptua_ortholog_genes), "\n\n")
```

### Define sample filtering function

```{r define-sample-filter-function, eval=TRUE}
# Function to filter samples that have all four timepoints (TP1, TP2, TP3, TP4)
filter_complete_samples <- function(gene_matrix, species_name) {
  cat("  Filtering samples with complete timepoints for", species_name, "...\n")
  
  # Get expression columns (exclude gene_id column, gene_id_clean may not exist yet)
  all_cols <- colnames(gene_matrix)
  id_cols <- c("gene_id", "gene_id_clean")
  expr_cols <- setdiff(all_cols, id_cols)
  
  if(length(expr_cols) == 0) {
    cat("  ERROR: No expression columns found! Returning original matrix.\n")
    return(gene_matrix)
  }
  
  # Parse sample column names to identify sample groups and timepoints
  sample_timepoint_map <- list()
  
  for(col in expr_cols) {
    # Expected formats: ACR.139.TP1, POR.216.TP1, POC.201.TP1, etc.
    # Note: Using dots (.) as separators, not hyphens (-)
    parts <- strsplit(col, "\\.")[[1]]  # Split on dots, not hyphens
    
    if(length(parts) >= 3) {
      # Extract timepoint from last part (e.g., "TP1", "TP2", etc.)
      tp_part <- parts[length(parts)]
      
      if(grepl("^TP[0-9]+$", tp_part)) {
        timepoint <- as.numeric(gsub("TP", "", tp_part))
        
        # Sample ID is everything except the timepoint part, joined with dots
        sample_id <- paste(parts[1:(length(parts)-1)], collapse = ".")
        
        if(!sample_id %in% names(sample_timepoint_map)) {
          sample_timepoint_map[[sample_id]] <- list()
        }
        sample_timepoint_map[[sample_id]][[as.character(timepoint)]] <- col
      }
    }
  }
  
  # Identify samples with all four timepoints (1, 2, 3, 4)
  complete_samples <- c()
  incomplete_samples <- c()
  required_timepoints <- c("1", "2", "3", "4")
  
  for(sample_id in names(sample_timepoint_map)) {
    available_timepoints <- names(sample_timepoint_map[[sample_id]])
    
    if(all(required_timepoints %in% available_timepoints)) {
      complete_samples <- c(complete_samples, sample_id)
    } else {
      incomplete_samples <- c(incomplete_samples, sample_id)
      missing_tps <- setdiff(required_timepoints, available_timepoints)
      cat("    Sample", sample_id, "missing timepoints:", paste(paste0("TP", missing_tps), collapse = ", "), "\n")
    }
  }
  
  cat("  Complete samples (all 4 timepoints):", length(complete_samples), "\n")
  cat("  Incomplete samples:", length(incomplete_samples), "\n")
  
  if(length(incomplete_samples) > 0) {
    cat("  Removing incomplete samples:", paste(incomplete_samples, collapse = ", "), "\n")
  }
  
  # Build list of columns to keep (include gene_id columns plus ONLY complete sample columns)
  if(length(complete_samples) == 0) {
    cat("  WARNING: No complete samples found! Keeping all expression columns.\n")
    # Fallback: keep all expression columns
    keep_cols <- c("gene_id")
    if("gene_id_clean" %in% colnames(gene_matrix)) {
      keep_cols <- c(keep_cols, "gene_id_clean")
    }
    keep_cols <- c(keep_cols, expr_cols)
  } else {
    keep_cols <- c("gene_id")
    if("gene_id_clean" %in% colnames(gene_matrix)) {
      keep_cols <- c(keep_cols, "gene_id_clean")
    }
    
    # ONLY add columns from complete samples (this excludes incomplete sample columns)
    for(sample_id in complete_samples) {
      for(tp in required_timepoints) {
        if(tp %in% names(sample_timepoint_map[[sample_id]])) {
          keep_cols <- c(keep_cols, sample_timepoint_map[[sample_id]][[tp]])
        }
      }
    }
    
    # Double-check: ensure we're not accidentally including incomplete sample columns
    incomplete_cols <- c()
    for(sample_id in incomplete_samples) {
      for(tp in names(sample_timepoint_map[[sample_id]])) {
        incomplete_cols <- c(incomplete_cols, sample_timepoint_map[[sample_id]][[tp]])
      }
    }
    
    if(length(incomplete_cols) > 0) {
      # Remove any incomplete sample columns that might have been included
      keep_cols <- setdiff(keep_cols, incomplete_cols)
    }
  }
  
  # Filter the gene matrix to keep only complete samples
  keep_cols <- intersect(keep_cols, colnames(gene_matrix))  # Ensure columns exist
  filtered_matrix <- gene_matrix[, keep_cols, drop = FALSE]
  
  # Count ID vs expression columns in final result
  final_expr_cols <- setdiff(colnames(filtered_matrix), c("gene_id", "gene_id_clean"))
  cat("  Final expression columns:", length(final_expr_cols), "\n")
  
  return(filtered_matrix)
}
```

### Define expression extraction function

```{r define-expression-extraction-function, eval=TRUE}
# Function to extract expression data for a species using ortholog group mapping
extract_species_expression <- function(ortholog_groups_df, gene_matrix, species_col, species_name) {
  cat("Processing", species_name, "...\n")
  
  # First, filter the gene matrix to keep only samples with complete timepoints
  filtered_gene_matrix <- filter_complete_samples(gene_matrix, species_name)
  
  # Remove duplicate group_ids, keeping first occurrence of each
  unique_groups <- ortholog_groups_df[!duplicated(ortholog_groups_df$group_id), ]
  cat("  Unique ortholog groups:", nrow(unique_groups), "\n")
  
  # Create results data frame starting with group_id
  result_df <- data.frame(group_id = unique_groups$group_id, stringsAsFactors = FALSE)
  
  # Get expression columns (exclude gene_id and gene_id_clean columns)
  expr_cols <- setdiff(colnames(filtered_gene_matrix), c("gene_id", "gene_id_clean"))
  
  # Initialize expression columns with NA
  for(col in expr_cols) {
    result_df[[col]] <- NA
  }
  
  # For each ortholog group, find the corresponding gene and extract expression
  for(i in seq_len(nrow(unique_groups))) {
    target_gene <- unique_groups[[species_col]][i]
    
    # Clean target gene for matching
    if(species_name == "Apul") {
      # Remove transcript suffix for Apul
      target_gene_clean <- gsub("-T[0-9]+$", "", target_gene)
      matching_rows <- which(filtered_gene_matrix$gene_id_clean == target_gene_clean)
    } else {
      # For Peve and Ptua, match cleaned gene_id
      matching_rows <- which(filtered_gene_matrix$gene_id_clean == target_gene)
    }
    
    if(length(matching_rows) == 1) {
      # Single match - copy expression data
      for(col in expr_cols) {
        result_df[i, col] <- filtered_gene_matrix[matching_rows, col]
      }
    } else if(length(matching_rows) > 1) {
      # Multiple matches - take first gene (no averaging)
      first_match <- matching_rows[1]
      for(col in expr_cols) {
        result_df[i, col] <- filtered_gene_matrix[first_match, col]
      }
    } else {
      # No match - leave as NA (will be removed later)
    }
  }
  
  # Remove rows with all NA expression values
  expr_na_mask <- apply(result_df[, expr_cols, drop = FALSE], 1, function(x) all(is.na(x)))
  result_df <- result_df[!expr_na_mask, ]
  
  cat("  Final dimensions:", nrow(result_df), "ortholog groups x", ncol(result_df)-1, "samples\n")
  cat("  Removed", sum(expr_na_mask), "groups with no expression data\n\n")
  
  return(result_df)
}
```

### Extract ortholog expression data

```{r extract-ortholog-expression-data, eval=TRUE}
cat("Creating ortholog expression data with proper group_id mapping...\n")

# Extract expression data for each species using the ortholog group mapping
apul_ortholog_expression <- extract_species_expression(complete_ortholog_groups, apul_gene_matrix, "apul", "Apul")
peve_ortholog_expression <- extract_species_expression(complete_ortholog_groups, peve_gene_matrix, "peve", "Peve")
ptua_ortholog_expression <- extract_species_expression(complete_ortholog_groups, ptua_gene_matrix, "ptua", "Ptua")

cat("=== FINAL ORTHOLOG EXPRESSION DATA DIMENSIONS ===\n")
cat("Apul:", nrow(apul_ortholog_expression), "ortholog groups x", ncol(apul_ortholog_expression)-1, "samples\n")
cat("Peve:", nrow(peve_ortholog_expression), "ortholog groups x", ncol(peve_ortholog_expression)-1, "samples\n")
cat("Ptua:", nrow(ptua_ortholog_expression), "ortholog groups x", ncol(ptua_ortholog_expression)-1, "samples\n\n")
```

### Write ortholog expression data

```{r write-expression-data, eval=TRUE}
cat("Exporting ortholog expression data to CSV files...\n")

# Define output file paths
apul_output_file <- file.path(output_dir, "apul_ortholog_expression.csv")
peve_output_file <- file.path(output_dir, "peve_ortholog_expression.csv")
ptua_output_file <- file.path(output_dir, "ptua_ortholog_expression.csv")

# Write CSV files without quotes
write.csv(apul_ortholog_expression, file = apul_output_file, quote = FALSE, row.names = FALSE)
cat("Exported Apul ortholog expression data to:", apul_output_file, "\n")

write.csv(peve_ortholog_expression, file = peve_output_file, quote = FALSE, row.names = FALSE)
cat("Exported Peve ortholog expression data to:", peve_output_file, "\n")

write.csv(ptua_ortholog_expression, file = ptua_output_file, quote = FALSE, row.names = FALSE)
cat("Exported Ptua ortholog expression data to:", ptua_output_file, "\n")

cat("\nAll ortholog expression data exported successfully!\n\n")
```

### Column structure analysis

```{r column-structure-analysis, eval=TRUE}
cat("=== COLUMN STRUCTURE ANALYSIS ===\n")
cat("Apul columns:", ncol(apul_ortholog_expression), "\n")
cat("Apul column names (first 10):", paste(head(colnames(apul_ortholog_expression), 10), collapse = ", "), "\n")
cat("Apul column names (last 10):", paste(tail(colnames(apul_ortholog_expression), 10), collapse = ", "), "\n\n")

cat("Peve columns:", ncol(peve_ortholog_expression), "\n")
cat("Peve column names (first 10):", paste(head(colnames(peve_ortholog_expression), 10), collapse = ", "), "\n")
cat("Peve column names (last 10):", paste(tail(colnames(peve_ortholog_expression), 10), collapse = ", "), "\n\n")

cat("Ptua columns:", ncol(ptua_ortholog_expression), "\n")
cat("Ptua column names (first 10):", paste(head(colnames(ptua_ortholog_expression), 10), collapse = ", "), "\n")
cat("Ptua column names (last 10):", paste(tail(colnames(ptua_ortholog_expression), 10), collapse = ", "), "\n\n")

```

### Summary statistics

```{r summary-statistics, eval=TRUE}
cat("=== LINE COUNTS FOR ORTHOLOG EXPRESSION DATA ===\n")
cat("Apul ortholog expression with info: ", nrow(apul_ortholog_expression), " rows\n")
cat("Peve ortholog expression with info: ", nrow(peve_ortholog_expression), " rows\n")
cat("Ptua ortholog expression with info: ", nrow(ptua_ortholog_expression), " rows\n\n")
```

### Sample filtering verification

```{r sample-filtering-verification, eval=TRUE}
cat("=== SAMPLE FILTERING VERIFICATION ===\n")
cat("This analysis filters samples to include ONLY those with all four timepoints (TP1, TP2, TP3, TP4)\n\n")

# Function to analyze sample completeness in the filtered data
analyze_sample_completeness <- function(data_df, species_name) {
  cat("Analyzing", species_name, "sample completeness:\n")
  
  # Get expression columns (exclude group_id)
  expr_cols <- setdiff(colnames(data_df), "group_id")
  
  # Parse sample column names to identify sample groups and timepoints
  sample_timepoint_map <- list()
  
  for(col in expr_cols) {
    parts <- strsplit(col, "\\.")[[1]]  # Split on dots to match actual format
    
    if(length(parts) >= 3) {
      tp_part <- parts[length(parts)]
      
      if(grepl("^TP[0-9]+$", tp_part)) {
        timepoint <- as.numeric(gsub("TP", "", tp_part))
        sample_id <- paste(parts[1:(length(parts)-1)], collapse = ".")
        
        if(!sample_id %in% names(sample_timepoint_map)) {
          sample_timepoint_map[[sample_id]] <- c()
        }
        sample_timepoint_map[[sample_id]] <- c(sample_timepoint_map[[sample_id]], timepoint)
      }
    }
  }
  
  # Check completeness
  complete_samples <- 0
  incomplete_samples <- 0
  required_timepoints <- c(1, 2, 3, 4)
  
  for(sample_id in names(sample_timepoint_map)) {
    available_timepoints <- sort(sample_timepoint_map[[sample_id]])
    
    if(all(required_timepoints %in% available_timepoints)) {
      complete_samples <- complete_samples + 1
    } else {
      incomplete_samples <- incomplete_samples + 1
      missing_tps <- setdiff(required_timepoints, available_timepoints)
      cat("  WARNING: Sample", sample_id, "missing timepoints:", paste(paste0("TP", missing_tps), collapse = ", "), "\n")
    }
  }
  
  cat("  Total samples:", length(sample_timepoint_map), "\n")
  cat("  Complete samples (all 4 timepoints):", complete_samples, "\n")
  cat("  Incomplete samples:", incomplete_samples, "\n")
  
  if(incomplete_samples > 0) {
    cat("  ERROR: Filtering did not work correctly!\n")
  } else {
    cat("  SUCCESS: All samples have complete timepoints\n")
  }
  
  cat("\n")
  
  return(list(
    total = length(sample_timepoint_map),
    complete = complete_samples,
    incomplete = incomplete_samples
  ))
}

# Verify each species
apul_stats <- analyze_sample_completeness(apul_ortholog_expression, "Apul")
peve_stats <- analyze_sample_completeness(peve_ortholog_expression, "Peve")
ptua_stats <- analyze_sample_completeness(ptua_ortholog_expression, "Ptua")

cat("=== FILTERING SUMMARY ===\n")
cat("All samples in the filtered dataset should have exactly 4 timepoints (TP1, TP2, TP3, TP4)\n")
cat("Apul: ", apul_stats$complete, "/", apul_stats$total, " complete samples\n")
cat("Peve: ", peve_stats$complete, "/", peve_stats$total, " complete samples\n")
cat("Ptua: ", ptua_stats$complete, "/", ptua_stats$total, " complete samples\n")

total_incomplete <- apul_stats$incomplete + peve_stats$incomplete + ptua_stats$incomplete
if(total_incomplete == 0) {
  cat("SUCCESS: All samples across all species have complete timepoints!\n")
} else {
  cat("ERROR: Found", total_incomplete, "incomplete samples - filtering needs to be fixed!\n")
}
cat("\n")
```

### Diagnostic analysis

```{r diagnostic-analysis, eval=TRUE}
cat("\n=== VERIFICATION: THREE-WAY ORTHOLOG COUNTS ===\n")
cat("After filtering, all species should have identical three_way ortholog counts.\n\n")

# Check how many genes we have for each species
cat("Genes found in expression data by species:\n")
cat("Apul ortholog expression (before adding info):", nrow(apul_ortholog_expression), "\n")
cat("Peve ortholog expression (before adding info):", nrow(peve_ortholog_expression), "\n")
cat("Ptua ortholog expression (before adding info):", nrow(ptua_ortholog_expression), "\n\n")

# Verify all ortholog groups are three-way
cat("Complete three-way ortholog groups available:", nrow(complete_ortholog_groups), "\n")
cat("All should be type 'three_way'? Check:", table(complete_ortholog_groups$type), "\n\n")

# Verify perfect gene coverage (should be 100% for all species now)
apul_coverage <- length(intersect(apul_ortholog_genes, apul_gene_matrix$gene_id_clean))
peve_coverage <- length(intersect(peve_ortholog_genes, peve_gene_matrix$gene_id_clean))
ptua_coverage <- length(intersect(ptua_ortholog_genes, ptua_gene_matrix$gene_id_clean))

cat("Gene coverage in expression data (should be 100% for all):\n")
cat("Apul: ", apul_coverage, "/", length(apul_ortholog_genes), " (", round(apul_coverage/length(apul_ortholog_genes)*100, 1), "%)\n")
cat("Peve: ", peve_coverage, "/", length(peve_ortholog_genes), " (", round(peve_coverage/length(peve_ortholog_genes)*100, 1), "%)\n")
cat("Ptua: ", ptua_coverage, "/", length(ptua_ortholog_genes), " (", round(ptua_coverage/length(ptua_ortholog_genes)*100, 1), "%)\n\n")
```

### Preview expression data

```{r preview-expression-data, eval=TRUE}
cat("\n=== PREVIEW OF ORTHOLOG EXPRESSION DATA ===\n")

cat("Apul ortholog expression:\n\n")
str(apul_ortholog_expression)
cat("\n\n")

cat("\nPeve ortholog expression:\n")
str(peve_ortholog_expression)
cat("\n\n")

cat("\nPtua ortholog expression:\n")
str(ptua_ortholog_expression)
cat("\n\n")
```

# BARNACLE ANALYSIS

This workflow focuses on comparing multiple ranks to find the optimal number of components.

Based on the barnacle workflow, we: 1. Normalize count data with `sctransform` 2. Create tensors for multiomics analysis 3. Compare multiple ranks systematically 4. Choose optimal rank based on metrics

## Load expression data

```{r load-normalized-data, eval=TRUE}
# Read the exported ortholog expression data
apul_expr <- read.csv(file.path(output_dir, "apul_ortholog_expression.csv"))
peve_expr <- read.csv(file.path(output_dir, "peve_ortholog_expression.csv"))  
ptua_expr <- read.csv(file.path(output_dir, "ptua_ortholog_expression.csv"))

cat("Loaded expression data:\n")
cat("Apul:", nrow(apul_expr), "genes x", ncol(apul_expr)-1, "samples\n")
cat("Peve:", nrow(peve_expr), "genes x", ncol(peve_expr)-1, "samples\n") 
cat("Ptua:", nrow(ptua_expr), "genes x", ncol(ptua_expr)-1, "samples\n")
```

## Normalize data with sctransform

Following the barnacle manuscript approach, we'll use `sctransform` to normalize each species' data. We'll use a bulk RNA-seq appropriate approach.

```{r normalize-sctransform, eval=TRUE}
# Function to normalize count data with sctransform for bulk RNA-seq
normalize_with_sctransform <- function(count_data, species_name) {
  cat("Normalizing", species_name, "data with sctransform...\n")
  
  # Check if we have group_id or gene_id column
  id_col <- if("group_id" %in% colnames(count_data)) "group_id" else "gene_id"
  cat("Using", id_col, "as identifier column\n")
  
  # Check for and handle duplicate group_ids/gene_ids
  duplicate_ids <- count_data[[id_col]][duplicated(count_data[[id_col]])]
  if(length(duplicate_ids) > 0) {
    cat("Found", length(unique(duplicate_ids)), "duplicate", id_col, "values:\n")
    cat("  Examples:", head(unique(duplicate_ids), 10), "\n")
    cat("  Note: sctransform may fail due to duplicate row names, will fall back to log2(CPM + 1)\n")
  } else {
    cat("No duplicate", id_col, "values found\n")
  }
  
  # Use original data without aggregation - let sctransform handle duplicates or fail
  agg_data <- count_data
  
  # Extract count matrix (genes as rows, samples as columns)
  count_matrix <- as.matrix(agg_data[, -1])  # Remove id column
  rownames(count_matrix) <- agg_data[[id_col]]
  
  # Check for and handle problematic values
  cat("Checking data quality...\n")
  cat("  - Zero values:", sum(count_matrix == 0), "/", length(count_matrix), "\n")
  cat("  - NA values:", sum(is.na(count_matrix)), "\n")
  cat("  - Infinite values:", sum(is.infinite(count_matrix)), "\n")
  cat("  - Min value:", min(count_matrix, na.rm = TRUE), "\n")
  cat("  - Max value:", max(count_matrix, na.rm = TRUE), "\n")
  
  # Remove genes with all zeros or very low expression
  gene_sums <- rowSums(count_matrix)
  keep_genes <- gene_sums > 10  # Keep genes with total counts > 10
  count_matrix_filtered <- count_matrix[keep_genes, , drop = FALSE]
  
  cat("  - Filtered to", nrow(count_matrix_filtered), "ortholog groups (from", nrow(count_matrix), ")\n")
  
  # Ensure data is numeric and handle any potential issues
  count_matrix_filtered <- apply(count_matrix_filtered, c(1,2), function(x) {
    if(is.na(x) || !is.finite(x)) return(0)
    return(as.numeric(x))
  })
  
  # Ensure row and column names are character strings (not factors)
  rownames(count_matrix_filtered) <- as.character(rownames(count_matrix_filtered))
  colnames(count_matrix_filtered) <- as.character(colnames(count_matrix_filtered))
  
  # sctransform expects genes as rows and cells (samples) as columns —
  # our matrix is already genes x samples (rows=genes, cols=samples), so do NOT transpose.
  count_matrix_for_vst <- count_matrix_filtered
  
  # Apply sctransform normalization with bulk RNA-seq appropriate parameters
  normalized_df <- tryCatch({
    # First try with minimal parameters to isolate the issue
    cat("  Attempting sctransform with minimal parameters...\n")
    # Pass genes x samples matrix directly
    normalized <- sctransform::vst(
      count_matrix_for_vst,
      verbosity = 1
    )

    # Extract normalized data (should be genes x samples)
    normalized_data <- normalized$y
    
    # Create output data frame with original gene set (fill missing with zeros)
    full_normalized_data <- matrix(0, nrow = nrow(count_matrix), ncol = ncol(count_matrix))
    rownames(full_normalized_data) <- rownames(count_matrix)
    colnames(full_normalized_data) <- colnames(count_matrix)
    
    # Fill in normalized values for kept genes
    full_normalized_data[rownames(normalized_data), ] <- normalized_data
    
    result_df <- data.frame(
      group_id = rownames(full_normalized_data),
      full_normalized_data,
      stringsAsFactors = FALSE
    )
    
    cat("sctransform normalization successful for", species_name, "\n")
    return(result_df)
    
  }, error = function(e) {
    cat("sctransform with minimal parameters failed for", species_name, ":", e$message, "\n")
    cat("Trying sctransform with glmGamPoi method...\n")
    
    # Try again with glmGamPoi method
    tryCatch({
      normalized <- sctransform::vst(
        count_matrix_for_vst, 
        method = "glmGamPoi",
        n_genes = min(2000, nrow(count_matrix_for_vst)),
        return_cell_attr = TRUE,
        verbosity = 1
      )

      # Extract normalized data (genes x samples)
      normalized_data <- normalized$y
      
      # Create output data frame with original gene set (fill missing with zeros)
      full_normalized_data <- matrix(0, nrow = nrow(count_matrix), ncol = ncol(count_matrix))
      rownames(full_normalized_data) <- rownames(count_matrix)
      colnames(full_normalized_data) <- colnames(count_matrix)
      
      # Fill in normalized values for kept genes
      full_normalized_data[rownames(normalized_data), ] <- normalized_data
      
      result_df <- data.frame(
        group_id = rownames(full_normalized_data),
        full_normalized_data,
        stringsAsFactors = FALSE
      )
      
      cat("sctransform normalization with glmGamPoi successful for", species_name, "\n")
      return(result_df)
      
    }, error = function(e2) {
      cat("sctransform failed for", species_name, ":", e2$message, "\n")
      cat("Falling back to log2(CPM + 1) normalization...\n")
      
  # Fallback: log2(CPM + 1) normalization
  # Calculate CPM (Counts Per Million) using the filtered count matrix
  lib_sizes <- colSums(count_matrix_filtered)
  cpm_matrix <- sweep(count_matrix_filtered, 2, lib_sizes/1e6, FUN = "/")
      
      # Log2 transform with pseudocount
      normalized_data <- log2(cpm_matrix + 1)
      
      result_df <- data.frame(
        group_id = rownames(normalized_data),
        normalized_data,
        stringsAsFactors = FALSE
      )
      
      cat("Log2(CPM + 1) normalization complete for", species_name, "\n")
      return(result_df)
    })
  })
  
  cat("Input dimensions:", nrow(count_data), "rows x", ncol(count_data)-1, "samples\n")
  cat("Output dimensions:", nrow(normalized_df), "ortholog groups x", ncol(normalized_df)-1, "samples\n\n")
  
  return(normalized_df)
}

# Normalize each species
cat("=== STARTING NORMALIZATION ===\n\n")
apul_normalized <- normalize_with_sctransform(apul_expr, "Apul")
peve_normalized <- normalize_with_sctransform(peve_expr, "Peve") 
ptua_normalized <- normalize_with_sctransform(ptua_expr, "Ptua")
cat("=== NORMALIZATION COMPLETE ===\n\n")
```

## Export normalized data for Python analysis

```{r export-normalized-data, eval=TRUE}
# Export normalized data for Python processing
apul_norm_file <- file.path(output_dir, "apul_normalized_expression.csv")
peve_norm_file <- file.path(output_dir, "peve_normalized_expression.csv")
ptua_norm_file <- file.path(output_dir, "ptua_normalized_expression.csv")

write.csv(apul_normalized, apul_norm_file, row.names = FALSE, quote = FALSE)
write.csv(peve_normalized, peve_norm_file, row.names = FALSE, quote = FALSE)
write.csv(ptua_normalized, ptua_norm_file, row.names = FALSE, quote = FALSE)

cat("Exported normalized data:\n")
cat("Apul:", apul_norm_file, "\n")
cat("Peve:", peve_norm_file, "\n")
cat("Ptua:", ptua_norm_file, "\n\n")
```

## Create tensor dataset in Python

Now we'll switch to Python to create the multiomics tensor and run barnacle analysis.

```{python setup-logging, eval=TRUE}
import sys
import os
from datetime import datetime

# Set up logging to both console and file
class Logger:
    def __init__(self, log_file):
        self.terminal = sys.stdout
        self.log = open(log_file, 'w')
        
    def write(self, message):
        self.terminal.write(message)
        self.log.write(message)
        self.log.flush()
        
    def flush(self):
        self.terminal.flush()
        self.log.flush()

# Create log file path
output_dir = r.output_dir
log_file = os.path.join(output_dir, f'barnacle_analysis_log_{datetime.now().strftime("%Y%m%d_%H%M%S")}.txt')

# Redirect stdout to both console and log file
sys.stdout = Logger(log_file)
sys.stderr = sys.stdout  # Also capture error messages

print("="*60)
print(f"BARNACLE ANALYSIS LOG")
print(f"Started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Log file: {log_file}")
print("="*60)
print()
```

```{python create-multiomics-tensor, eval=TRUE}
import pandas as pd
import numpy as np
import os
from pathlib import Path

# Set up paths
output_dir = r.output_dir
print(f"Working in output directory: {output_dir}")

# Load normalized data
apul_norm = pd.read_csv(os.path.join(output_dir, "apul_normalized_expression.csv"))
peve_norm = pd.read_csv(os.path.join(output_dir, "peve_normalized_expression.csv"))
ptua_norm = pd.read_csv(os.path.join(output_dir, "ptua_normalized_expression.csv"))

print("Loaded normalized data:")
print(f"Apul: {apul_norm.shape}")
print(f"Peve: {peve_norm.shape}")  
print(f"Ptua: {ptua_norm.shape}")

# Check which genes are common across all species
apul_genes = set(apul_norm['group_id'])
peve_genes = set(peve_norm['group_id'])
ptua_genes = set(ptua_norm['group_id'])

common_genes = apul_genes & peve_genes & ptua_genes
print(f"\nCommon genes across all species: {len(common_genes)}")

# Filter to common genes and align gene order
common_genes_list = sorted(list(common_genes))

apul_common = apul_norm[apul_norm['group_id'].isin(common_genes_list)].set_index('group_id').reindex(common_genes_list)
peve_common = peve_norm[peve_norm['group_id'].isin(common_genes_list)].set_index('group_id').reindex(common_genes_list)
ptua_common = ptua_norm[ptua_norm['group_id'].isin(common_genes_list)].set_index('group_id').reindex(common_genes_list)

print(f"\nFiltered to common genes:")
print(f"Apul: {apul_common.shape}")
print(f"Peve: {peve_common.shape}")
print(f"Ptua: {ptua_common.shape}")
```

## Parse sample information

```{python parse-sample-info, eval=TRUE}
# Parse sample names to extract sample information for each species independently
def parse_species_samples(columns, species_name):
    """Parse sample column names for a specific species"""
    sample_map = {}
    sample_ids = []
    timepoints = set()
    
    for col in columns:
        # Expected format: PREFIX-NUMBER-TP# (e.g., ACR-139-TP1, POR-216-TP1, POC-201-TP1)
        # Also support legacy format: PREFIX.NUMBER.TP# for backwards compatibility
        parts = col.split('-') if '-' in col else col.split('.')
        
        if len(parts) >= 3 and parts[-1].startswith('TP'):
            # Extract timepoint from last part (e.g., "TP1", "TP2", etc.)
            tp_part = parts[-1]
            timepoint = int(tp_part[2:])  # e.g., 1 from TP1
            
            # Sample ID is everything except the timepoint part
            sample_id_parts = parts[:-1]
            sample_id = '-'.join(sample_id_parts) if '-' in col else '.'.join(sample_id_parts)
            
            sample_map[(sample_id, timepoint)] = col
            if sample_id not in sample_ids:
                sample_ids.append(sample_id)
            timepoints.add(timepoint)
        else:
            print(f"Warning: Could not parse column name: {col}")
    
    return sample_map, sample_ids, sorted(timepoints)

# Parse sample information for each species independently
print("Parsing sample information for each species...")

species_data = {
    'apul': apul_common,
    'peve': peve_common, 
    'ptua': ptua_common
}

species_info = {}
all_timepoints = set()

for species, data in species_data.items():
    sample_map, sample_ids, timepoints = parse_species_samples(data.columns, species)
    species_info[species] = {
        'sample_map': sample_map,
        'sample_ids': sample_ids,
        'timepoints': timepoints,
        'n_samples': len(sample_ids)
    }
    all_timepoints.update(timepoints)
    
    print(f"{species}:")
    print(f"  Samples: {len(sample_ids)} ({sample_ids[:3]}...)")
    print(f"  Timepoints: {timepoints}")

common_timepoints = sorted(list(all_timepoints))
print(f"\nTimepoints found across all species: {common_timepoints}")

# Find the maximum number of samples to determine tensor dimensions
max_samples = max(info['n_samples'] for info in species_info.values())
print(f"Maximum samples in any species: {max_samples}")

# Print detailed sample structure
print(f"\nDetailed sample structure:")
for species, info in species_info.items():
    print(f"{species}: {info['n_samples']} samples × {len(info['timepoints'])} timepoints")
```

## Create 3D tensor (genes × species_samples × timepoints)

```{python create-3d-tensor, eval=TRUE}
# Create a 3D tensor: genes × (species_samples) × timepoints
# This flattens species and samples into a single dimension that Barnacle can handle

print("Creating 3D tensor by combining species and samples...")
print("Note: R-level filtering has already removed samples without all 4 timepoints")

# First, collect all actual sample-timepoint combinations that have data
all_sample_columns = []
sample_labels = []  # Track which sample belongs to which species
species_sample_map = {}  # Map from combined index to (species, sample_idx, sample_id)

sample_idx = 0
for species in ['apul', 'peve', 'ptua']:
    data = species_data[species]
    info = species_info[species]
    
    print(f"\nProcessing {species}:")
    for sample_id in info['sample_ids']:
        # Check if this sample has data for ALL required timepoints
        available_timepoints = []
        sample_timepoint_cols = []
        
        for timepoint in common_timepoints:
            if (sample_id, timepoint) in info['sample_map']:
                col_name = info['sample_map'][(sample_id, timepoint)]
                sample_timepoint_cols.append(col_name)
                available_timepoints.append(timepoint)
        
        # Only include samples that have ALL timepoints (should be all samples due to R filtering)
        if len(available_timepoints) == len(common_timepoints):
            all_sample_columns.extend(sample_timepoint_cols)
            sample_labels.append(f"{species}_{sample_id}")
            species_sample_map[sample_idx] = {
                'species': species,
                'sample_id': sample_id,
                'sample_idx_in_species': info['sample_ids'].index(sample_id)
            }
            sample_idx += 1
            print(f"  Added {sample_id} with {len(sample_timepoint_cols)} timepoints: {sorted(available_timepoints)}")
        else:
            print(f"  Skipped {sample_id} - incomplete timepoints: {sorted(available_timepoints)} (expected: {sorted(common_timepoints)})")

n_genes = len(common_genes_list)
n_combined_samples = len(sample_labels)
n_timepoints = len(common_timepoints)

print(f"\nCreating 3D tensor with shape: ({n_genes}, {n_combined_samples}, {n_timepoints})")
print(f"Combined samples from all species: {n_combined_samples}")

# Initialize tensor
tensor_3d = np.full((n_genes, n_combined_samples, n_timepoints), np.nan)

# Fill tensor
filled_count = 0
missing_count = 0

for combined_idx, sample_label in enumerate(sample_labels):
    species_info_map = species_sample_map[combined_idx]
    species = species_info_map['species']
    sample_id = species_info_map['sample_id']
    
    data = species_data[species]
    info = species_info[species]
    
    for time_idx, timepoint in enumerate(common_timepoints):
        if (sample_id, timepoint) in info['sample_map']:
            col_name = info['sample_map'][(sample_id, timepoint)]
            tensor_3d[:, combined_idx, time_idx] = data[col_name].values
            filled_count += 1
        else:
            missing_count += 1

# Check tensor statistics
n_missing = np.sum(np.isnan(tensor_3d))
n_total = tensor_3d.size
n_finite = np.sum(np.isfinite(tensor_3d))

print(f"\n=== TENSOR STATISTICS ===")
print(f"Tensor shape: {tensor_3d.shape}")
print(f"Total elements: {n_total}")
print(f"Finite values: {n_finite}")
print(f"Missing/NaN values: {n_missing}")
print(f"Missing percentage: {n_missing / n_total * 100:.2f}%")
print(f"Filled {filled_count} sample-timepoint combinations")
print(f"Missing {missing_count} sample-timepoint combinations")

# Check non-zero values among finite values
finite_values = tensor_3d[np.isfinite(tensor_3d)]
n_nonzero = np.sum(finite_values != 0)
print(f"Non-zero finite values: {n_nonzero}")
print(f"Zero finite values: {len(finite_values) - n_nonzero}")
print(f"Sparsity among finite values: {(len(finite_values) - n_nonzero) / len(finite_values) * 100:.2f}%")

# Save sample mapping for later interpretation
sample_mapping = pd.DataFrame([
    {
        'combined_index': i,
        'sample_label': label,
        'species': species_sample_map[i]['species'],
        'sample_id': species_sample_map[i]['sample_id']
    }
    for i, label in enumerate(sample_labels)
])
print(f"\nSample mapping:")
print(sample_mapping.head(10))
```


## Rank Comparison

**This workflow systematically compares multiple ranks to find the optimal number of components.**

The comparison runs decompositions for each specified rank and generates:
1. Metrics comparison table
2. Elbow plot (variance explained vs rank)
3. Multi-panel summary plots
4. Automated recommendation

**Results:** Each rank is saved to `rank_XX/` directory with complete factor matrices.

### Define helper functions for rank comparison

```{python rank-comparison-functions, eval=TRUE}
def run_single_rank_decomposition(tensor, rank, lambdas, n_iter_max, random_state=42):
    """
    Run sparse CP decomposition for a given rank.
    Returns dict with results and metrics.
    """
    from barnacle.decomposition import SparseCP
    
    print(f"\n{'='*60}")
    print(f"Running decomposition with rank = {rank}")
    print(f"{'='*60}")
    
    # Handle missing values
    tensor_filled = np.nan_to_num(tensor, nan=0.0)
    
    # Create model
    model = SparseCP(
        rank=rank,
        lambdas=lambdas,
        nonneg_modes=[0],
        norm_constraint=True,
        init='random',
        tol=1e-5,
        n_iter_max=n_iter_max,
        random_state=random_state,
        n_initializations=1
    )
    
    # Fit model
    try:
        decomposition = model.fit_transform(tensor_filled, verbose=0)
        
        # Calculate metrics
        metrics = calculate_decomposition_metrics(tensor_filled, decomposition, model)
        
        results = {
            'rank': rank,
            'decomposition': decomposition,
            'model': model,
            'metrics': metrics,
            'success': True
        }
        
        print(f"✓ Rank {rank} complete")
        print(f"  Final loss: {metrics['final_loss']:.6f}")
        print(f"  Variance explained: {metrics['variance_explained']*100:.2f}%")
        print(f"  Converged: {metrics['converged']}")
        
        return results
        
    except Exception as e:
        print(f"✗ Rank {rank} failed: {e}")
        return {
            'rank': rank,
            'success': False,
            'error': str(e)
        }


def calculate_decomposition_metrics(tensor, decomposition, model):
    """
    Calculate reconstruction quality and other metrics.
    """
    # Reconstruct tensor
    gene_factors = decomposition.factors[0]
    sample_factors = decomposition.factors[1]
    time_factors = decomposition.factors[2]
    weights = decomposition.weights
    
    rank = len(weights)
    reconstructed = np.zeros_like(tensor)
    
    for r in range(rank):
        outer_prod = np.outer(gene_factors[:, r], sample_factors[:, r])
        outer_prod = np.outer(outer_prod.flatten(), time_factors[:, r])
        outer_prod = outer_prod.reshape(tensor.shape)
        reconstructed += weights[r] * outer_prod
    
    # Calculate errors
    mse = np.mean((tensor - reconstructed) ** 2)
    rmse = np.sqrt(mse)
    
    # Relative error
    tensor_norm = np.linalg.norm(tensor)
    error_norm = np.linalg.norm(tensor - reconstructed)
    relative_error = error_norm / tensor_norm if tensor_norm > 0 else np.inf
    
    # Variance explained
    total_var = np.var(tensor)
    residual_var = np.var(tensor - reconstructed)
    variance_explained = 1 - (residual_var / total_var) if total_var > 0 else 0
    
    # Sparsity metrics
    gene_sparsity = np.mean(np.abs(gene_factors) < 0.1)
    sample_sparsity = np.mean(np.abs(sample_factors) < 0.1)
    time_sparsity = np.mean(np.abs(time_factors) < 0.1)
    
    # Component weight statistics
    weight_max = np.max(weights)
    weight_min = np.min(weights)
    weight_ratio = weight_max / weight_min if weight_min > 0 else np.inf
    weight_cv = np.std(weights) / np.mean(weights) if np.mean(weights) > 0 else np.inf
    
    # Convergence info
    # Note: Barnacle's SparseCP class does not have a converged_ attribute
    # We determine convergence by checking if iterations < max_iter
    n_iterations = len(model.loss_) if hasattr(model, 'loss_') and model.loss_ is not None else 0
    final_loss = model.loss_[-1] if hasattr(model, 'loss_') and model.loss_ is not None else np.nan
    # Converged if stopped before reaching max iterations (met tolerance criterion)
    converged = (n_iterations < model.n_iter_max) if n_iterations > 0 else False
    
    return {
        'mse': mse,
        'rmse': rmse,
        'relative_error': relative_error,
        'variance_explained': variance_explained,
        'gene_sparsity': gene_sparsity,
        'sample_sparsity': sample_sparsity,
        'time_sparsity': time_sparsity,
        'weight_max': weight_max,
        'weight_min': weight_min,
        'weight_ratio': weight_ratio,
        'weight_cv': weight_cv,
        'converged': converged,
        'n_iterations': n_iterations,
        'final_loss': final_loss
    }


def save_rank_results(results, output_dir, gene_ids, sample_labels, timepoint_labels, species_sample_map):
    """
    Save decomposition results for a specific rank.
    """
    rank = results['rank']
    rank_dir = os.path.join(output_dir, f'rank_{rank:02d}')
    os.makedirs(rank_dir, exist_ok=True)
    
    decomposition = results['decomposition']
    metrics = results['metrics']
    
    # Save factor matrices
    gene_factors = pd.DataFrame(
        decomposition.factors[0],
        index=gene_ids,
        columns=[f'Component_{i+1}' for i in range(rank)]
    )
    gene_factors.to_csv(os.path.join(rank_dir, 'gene_factors.csv'))
    
    sample_factors = pd.DataFrame(
        decomposition.factors[1],
        index=sample_labels,
        columns=[f'Component_{i+1}' for i in range(rank)]
    )
    sample_factors['Species'] = [species_sample_map[i]['species'] for i in range(len(sample_labels))]
    sample_factors['Sample_ID'] = [species_sample_map[i]['sample_id'] for i in range(len(sample_labels))]
    sample_factors.to_csv(os.path.join(rank_dir, 'sample_factors.csv'))
    
    time_factors = pd.DataFrame(
        decomposition.factors[2],
        index=timepoint_labels,
        columns=[f'Component_{i+1}' for i in range(rank)]
    )
    time_factors.to_csv(os.path.join(rank_dir, 'time_factors.csv'))
    
    # Save weights
    weights_df = pd.DataFrame({
        'Component': [f'Component_{i+1}' for i in range(rank)],
        'Weight': decomposition.weights
    })
    weights_df.to_csv(os.path.join(rank_dir, 'component_weights.csv'), index=False)
    
    # Save loss history
    if hasattr(results['model'], 'loss_') and results['model'].loss_ is not None:
        loss_df = pd.DataFrame({
            'Iteration': list(range(1, len(results['model'].loss_) + 1)),
            'Loss': results['model'].loss_
        })
        loss_df.to_csv(os.path.join(rank_dir, 'loss_history.csv'), index=False)
    
    # Save metadata
    metadata = {
        'rank': rank,
        'metrics': {k: float(v) if not isinstance(v, bool) else v for k, v in metrics.items()},
        'tensor_shape': list(decomposition.factors[0].shape[0:1]) + 
                       [decomposition.factors[1].shape[0]] + 
                       [decomposition.factors[2].shape[0]]
    }
    
    import json
    with open(os.path.join(rank_dir, 'metadata.json'), 'w') as f:
        json.dump(metadata, f, indent=2)
    
    print(f"  Saved results to: {rank_dir}")


print("Rank comparison helper functions loaded")
```

### Run barnacle

```{python barnacle-multiple-ranks, eval=TRUE}
# ===========================================
# CONFIGURE RANKS TO TEST
# ===========================================
# NOTE: This chunk creates a machine-readable log file
# (rank_comparison_log.json) that subsequent chunks will use.
# This allows you to skip re-running this chunk if you've
# already generated results.
# ===========================================
ranks_to_test = [5, 8, 10, 12, 15, 20, 25, 35, 45, 55, 65, 75]  # Modify this list as needed
# ===========================================

import os
import json
import datetime

print("="*60)
print("COMPARING MULTIPLE RANKS")
print("="*60)
print(f"Ranks to test: {ranks_to_test}")
print(f"This will run {len(ranks_to_test)} decompositions")
print(f"Estimated time: {len(ranks_to_test) * 2}-{len(ranks_to_test) * 5} minutes")
print("="*60)

# Run decompositions for each rank
all_rank_results = []

for test_rank in sorted(ranks_to_test):
    result = run_single_rank_decomposition(
        tensor=tensor_3d,
        rank=test_rank,
        lambdas=[0.1, 0.0, 0.1],
        n_iter_max=10000,
        random_state=41
    )
    all_rank_results.append(result)
    
    # Save individual results
    if result['success']:
        save_rank_results(
            result,
            output_dir,
            common_genes_list,
            sample_labels,
            [f'TP{tp}' for tp in common_timepoints],
            species_sample_map
        )

print(f"\n{'='*60}")
print(f"ALL RANKS COMPLETE")
print(f"{'='*60}")
print(f"Successful: {sum(1 for r in all_rank_results if r['success'])}/{len(all_rank_results)}")

# ===========================================
# SAVE COMPREHENSIVE LOG FILE
# ===========================================
import json
import datetime

log_file = os.path.join(output_dir, 'rank_comparison_log.json')

log_data = {
    'timestamp': datetime.datetime.now().isoformat(),
    'ranks_tested': ranks_to_test,
    'total_runs': len(all_rank_results),
    'successful_runs': sum(1 for r in all_rank_results if r['success']),
    'failed_runs': sum(1 for r in all_rank_results if not r['success']),
    'results': []
}

for result in all_rank_results:
    rank_info = {
        'rank': result['rank'],
        'success': result['success']
    }
    
    if result['success']:
        rank_info['metrics'] = {
            k: float(v) if isinstance(v, (int, float, np.number)) else bool(v) if isinstance(v, (bool, np.bool_)) else v
            for k, v in result['metrics'].items()
        }
        rank_info['output_dir'] = os.path.join(output_dir, f'rank_{result["rank"]:02d}')
        rank_info['has_loss_history'] = hasattr(result['model'], 'loss_') and result['model'].loss_ is not None
    else:
        rank_info['error'] = result.get('error', 'Unknown error')
    
    log_data['results'].append(rank_info)

with open(log_file, 'w') as f:
    json.dump(log_data, f, indent=2)

print(f"\nLog file saved to: {log_file}")
print("This log can be used by subsequent chunks to avoid re-running decompositions")
```

### Run barnacle with zero regularization (lambda = 0.0)

This chunk runs barnacle decomposition with all lambdas set to 0.0 (no regularization) to test how the model performs without sparsity constraints. Results are saved to a separate subdirectory for comparison.

```{python barnacle-zero-lambda, eval=TRUE}
# ===========================================
# CONFIGURE RANKS TO TEST - ZERO LAMBDA
# ===========================================
# This chunk tests decomposition with NO regularization
# All lambda values are set to 0.0
# ===========================================
ranks_to_test_zero_lambda = [10, 20, 35, 40]  # Specific ranks to test
# ===========================================

import os
import json
import datetime

# Create separate output directory for zero-lambda results
zero_lambda_output_dir = os.path.join(output_dir, 'zero_lambda_results')
os.makedirs(zero_lambda_output_dir, exist_ok=True)

print("="*60)
print("BARNACLE DECOMPOSITION - ZERO REGULARIZATION")
print("="*60)
print(f"Ranks to test: {ranks_to_test_zero_lambda}")
print(f"Lambda values: [0.0, 0.0, 0.0] (no regularization)")
print(f"Output directory: {zero_lambda_output_dir}")
print(f"This will run {len(ranks_to_test_zero_lambda)} decompositions")
print(f"Estimated time: {len(ranks_to_test_zero_lambda) * 2}-{len(ranks_to_test_zero_lambda) * 5} minutes")
print("="*60)

# Run decompositions for each rank with zero lambda
all_rank_results_zero_lambda = []

for test_rank in sorted(ranks_to_test_zero_lambda):
    result = run_single_rank_decomposition(
        tensor=tensor_3d,
        rank=test_rank,
        lambdas=[0.0, 0.0, 0.0],  # All lambdas set to 0.0
        n_iter_max=10000,
        random_state=41
    )
    all_rank_results_zero_lambda.append(result)
    
    # Save individual results
    if result['success']:
        save_rank_results(
            result,
            zero_lambda_output_dir,
            common_genes_list,
            sample_labels,
            [f'TP{tp}' for tp in common_timepoints],
            species_sample_map
        )

print(f"\n{'='*60}")
print(f"ZERO LAMBDA RANKS COMPLETE")
print(f"{'='*60}")
print(f"Successful: {sum(1 for r in all_rank_results_zero_lambda if r['success'])}/{len(all_rank_results_zero_lambda)}")

# ===========================================
# SAVE COMPREHENSIVE LOG FILE FOR ZERO LAMBDA
# ===========================================
import json
import datetime

log_file_zero_lambda = os.path.join(zero_lambda_output_dir, 'rank_comparison_log_zero_lambda.json')

log_data_zero_lambda = {
    'timestamp': datetime.datetime.now().isoformat(),
    'lambda_values': [0.0, 0.0, 0.0],
    'ranks_tested': ranks_to_test_zero_lambda,
    'total_runs': len(all_rank_results_zero_lambda),
    'successful_runs': sum(1 for r in all_rank_results_zero_lambda if r['success']),
    'failed_runs': sum(1 for r in all_rank_results_zero_lambda if not r['success']),
    'results': []
}

for result in all_rank_results_zero_lambda:
    rank_info = {
        'rank': result['rank'],
        'success': result['success']
    }
    
    if result['success']:
        rank_info['metrics'] = {
            k: float(v) if isinstance(v, (int, float, np.number)) else bool(v) if isinstance(v, (bool, np.bool_)) else v
            for k, v in result['metrics'].items()
        }
        rank_info['output_dir'] = os.path.join(zero_lambda_output_dir, f'rank_{result["rank"]:02d}')
        rank_info['has_loss_history'] = hasattr(result['model'], 'loss_') and result['model'].loss_ is not None
    else:
        rank_info['error'] = result.get('error', 'Unknown error')
    
    log_data_zero_lambda['results'].append(rank_info)

with open(log_file_zero_lambda, 'w') as f:
    json.dump(log_data_zero_lambda, f, indent=2)

print(f"\nLog file saved to: {log_file_zero_lambda}")
print(f"Results directory: {zero_lambda_output_dir}")
print("This log can be used to compare zero-lambda results with regularized decompositions")
```

### Plot Ranks vs Variance Explained (Zero Lambda Results)

This chunk creates a visualization showing the relationship between rank and variance explained for the zero-lambda decompositions, helping to identify the optimal rank based on the variance explained metric.

```{python plot-ranks-vs-variance-zero-lambda, eval=TRUE}
# ===========================================
# PLOT RANKS VS VARIANCE EXPLAINED - ZERO LAMBDA
# ===========================================
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

print(f"\n{'='*60}")
print("PLOTTING RANKS VS VARIANCE EXPLAINED (ZERO LAMBDA)")
print(f"{'='*60}")

# Extract data from zero-lambda results
ranks_zero_lambda = []
variance_explained_zero_lambda = []
successful_results = []

for result in all_rank_results_zero_lambda:
    if result['success'] and 'metrics' in result:
        ranks_zero_lambda.append(result['rank'])
        variance_explained_zero_lambda.append(result['metrics']['variance_explained'])
        successful_results.append(result)

if len(successful_results) == 0:
    print("WARNING: No successful decompositions found for zero-lambda results!")
    print("Cannot create variance plot.")
else:
    # Sort by rank for plotting
    sorted_data = sorted(zip(ranks_zero_lambda, variance_explained_zero_lambda))
    ranks_zero_lambda, variance_explained_zero_lambda = zip(*sorted_data)
    
    # Create the plot
    plt.figure(figsize=(10, 6))
    plt.plot(ranks_zero_lambda, [v*100 for v in variance_explained_zero_lambda], 
             'o-', linewidth=2.5, markersize=8, color='steelblue', markerfacecolor='lightblue', markeredgecolor='steelblue', markeredgewidth=2)
    
    # Customize the plot
    plt.xlabel('Rank', fontsize=12, fontweight='bold')
    plt.ylabel('Variance Explained (%)', fontsize=12, fontweight='bold')
    plt.title('Variance Explained vs Rank', fontsize=14, fontweight='bold')
    plt.grid(True, alpha=0.3)
    plt.xticks(ranks_zero_lambda)
    
    # Add value labels on points
    for rank, var_exp in zip(ranks_zero_lambda, variance_explained_zero_lambda):
        plt.annotate(f'{var_exp*100:.1f}%', 
                    (rank, var_exp*100), 
                    textcoords="offset points", 
                    xytext=(0,10), 
                    ha='center',
                    fontsize=10,
                    fontweight='bold')
    
    # Formatting
    plt.ylim(bottom=0)
    plt.tight_layout()
    
    # Save the plot
    plot_file = os.path.join(zero_lambda_output_dir, 'ranks_vs_variance_explained_zero_lambda.png')
    plt.savefig(plot_file, dpi=300, bbox_inches='tight')
    print(f"Plot saved to: {plot_file}")
    
    plt.show()
    
    # Print summary statistics
    print(f"\nSUMMARY - Zero Lambda Results:")
    print(f"{'='*50}")
    print(f"Number of successful decompositions: {len(successful_results)}")
    print(f"Ranks tested: {sorted(ranks_zero_lambda)}")
    print(f"Variance explained range: {min(variance_explained_zero_lambda)*100:.1f}% - {max(variance_explained_zero_lambda)*100:.1f}%")
    
    # Find best rank
    best_idx = np.argmax(variance_explained_zero_lambda)
    best_rank = ranks_zero_lambda[best_idx]
    best_variance = variance_explained_zero_lambda[best_idx]
    
    print(f"Best performing rank: {best_rank} (variance explained: {best_variance*100:.1f}%)")
    
    # Create a DataFrame for easy inspection
    results_df = pd.DataFrame({
        'Rank': ranks_zero_lambda,
        'Variance_Explained_Percent': [v*100 for v in variance_explained_zero_lambda]
    })
    
    print(f"\nDetailed Results:")
    print(results_df.to_string(index=False))

print(f"{'='*60}")
```

### Factor Match Score evaluation

This section evaluates each rank's decomposition quality using multiple Factor Match Score (FMS) metrics:

1. **TLViz FMS** (self-comparison): Sanity check only, not useful for rank selection
2. **Reconstruction FMS** (fast): Quick approximation based on tensor reconstruction correlation
3. **Synthetic FMS** (RECOMMENDED): Most rigorous - tests factor recovery from noisy data

The Synthetic FMS is the most reliable metric for determining optimal rank as it validates robustness.

```{python factor-match-score-evaluation, eval=TRUE}
# ===========================================
# FACTOR MATCH SCORE EVALUATION
# ===========================================
import numpy as np
import pandas as pd
import os

print(f"\n{'='*60}")
print("FACTOR MATCH SCORE (FMS) EVALUATION")
print(f"{'='*60}")
print("This analysis evaluates decomposition quality using multiple FMS metrics:")
print("  1. TLViz FMS: Self-comparison (sanity check only)")
print("  2. Reconstruction FMS: Fast approximation (correlation-based)")
print("  3. Synthetic FMS: RECOMMENDED (tests factor recovery robustness)")
print(f"{'='*60}")

def calculate_factor_match_score(tensor, decomposition, rank):
    """
    Calculate Factor Match Score using tlviz library.
    
    WARNING: This implementation compares the decomposition against itself,
    which will always yield FMS ≈ 1.0. This is only useful as a sanity check
    to verify the decomposition is internally consistent.
    
    For actual rank selection, use calculate_reconstruction_fms() instead,
    which compares the reconstruction against the original tensor.
    
    NOTE: May fail with scipy >= 1.14 due to _lazywhere removal.
    """
    try:
        # Import required libraries
        import tensorly as tl
        import tlviz.factor_tools
        
        # IMPORTANT: This creates a "reference" from the decomposition itself
        # This means we're comparing the decomposition to itself, which will
        # always give a perfect or near-perfect score (~1.0).
        # This is NOT a meaningful metric for rank selection!
        # It only validates that the decomposition object is internally consistent.
        reference_cp = tl.cp_tensor.CPTensor((decomposition.weights, decomposition.factors))
        
        # Calculate factor match score between decomposition and itself
        # Expected result: FMS ≈ 1.0 for all ranks (self-comparison)
        fms, perm = tlviz.factor_tools.factor_match_score(
            reference_cp,
            decomposition,
            return_permutation=True,
            allow_smaller_rank=True
        )
        
        return {
            'fms': float(fms),
            'permutation': perm,
            'success': True,
            'method': 'tlviz_self_consistency',
            'warning': 'Self-comparison: FMS ≈ 1.0 expected (not useful for rank selection)'
        }
        
    except ImportError as e:
        # Skip silently - this is expected with newer scipy versions
        error_msg = "SciPy/TLViz compatibility issue (expected with scipy >= 1.14)"
        if '_lazywhere' in str(e):
            error_msg = "SciPy version incompatibility (_lazywhere removed in scipy >= 1.14)"
        return {
            'fms': np.nan,
            'permutation': None,
            'success': False,
            'error': error_msg,
            'method': 'scipy_incompatibility'
        }
    except Exception as e:
        return {
            'fms': np.nan,
            'permutation': None,
            'success': False,
            'error': str(e),
            'method': 'failed'
        }

def calculate_proper_factor_match_score(tensor, decomposition, rank, noise_level=0.1, random_state=42):
    """
    RECOMMENDED METRIC: Synthetic validation approach for FMS.
    
    This creates a synthetic tensor from the decomposition, adds noise,
    re-decomposes it, then measures how well we recover the original factors.
    
    This is the most rigorous validation of decomposition quality because it tests
    the method's ability to recover known factors in the presence of noise.
    
    Higher scores indicate:
    - Better factor recovery from noisy data
    - More robust decomposition
    - Less sensitivity to noise
    - Better rank selection
    
    NOTE: Computationally expensive (requires additional decomposition per rank).
          
    """
    try:
        import tensorly as tl
        import tlviz.factor_tools
        from barnacle.decomposition import SparseCP
        
        # Step 1: Create synthetic "ground truth" tensor from current decomposition
        gene_factors = decomposition.factors[0]
        sample_factors = decomposition.factors[1]
        time_factors = decomposition.factors[2]
        weights = decomposition.weights
        
        # Reconstruct clean synthetic tensor
        synthetic_clean = np.zeros(tensor.shape)
        for r in range(len(weights)):
            outer_prod = np.outer(gene_factors[:, r], sample_factors[:, r])
            outer_prod = np.outer(outer_prod.flatten(), time_factors[:, r])
            outer_prod = outer_prod.reshape(tensor.shape)
            synthetic_clean += weights[r] * outer_prod
        
        # Step 2: Add realistic noise
        np.random.seed(random_state)
        noise_std = noise_level * np.std(synthetic_clean)
        noise = np.random.normal(0, noise_std, synthetic_clean.shape)
        synthetic_noisy = synthetic_clean + noise
        
        # Step 3: Decompose the noisy synthetic tensor
        # This tests: "Can we recover the factors from noisy data?"
        model_synthetic = SparseCP(
            rank=rank,
            lambdas=[0.1, 0.0, 0.1],
            nonneg_modes=[0],
            norm_constraint=True,
            init='random',
            tol=1e-5,
            n_iter_max=5000,  # Fewer iterations for speed
            random_state=random_state + 1,
            n_initializations=1
        )
        
        recovered_decomposition = model_synthetic.fit_transform(synthetic_noisy, verbose=0)
        
        # Step 4: Compare recovered factors to ground truth (original decomposition)
        # Create proper CPTensor objects
        ground_truth_cp = tl.cp_tensor.CPTensor((weights, [gene_factors, sample_factors, time_factors]))
        
        # MockDecomposition for recovered (same structure as before)
        class MockDecomposition:
            def __init__(self, weights, factors):
                self.weights = weights
                self.factors = factors
                self._tuple = (weights, factors)
            def __getitem__(self, index):
                return self._tuple[index]
            def __len__(self):
                return 2
        
        recovered_mock = MockDecomposition(recovered_decomposition.weights, recovered_decomposition.factors)
        
        # Calculate FMS between ground truth and recovered
        fms, perm = tlviz.factor_tools.factor_match_score(
            ground_truth_cp,
            recovered_mock,
            return_permutation=True,
            allow_smaller_rank=True
        )
        
        return {
            'fms': float(fms),
            'permutation': perm,
            'success': True,
            'method': 'synthetic_validation',
            'noise_level': noise_level,
            'recovery_quality': 'excellent' if fms > 0.9 else 'good' if fms > 0.7 else 'fair' if fms > 0.5 else 'poor'
        }
        
    except ImportError as e:
        error_msg = "TLViz not available (scipy incompatibility)"
        if '_lazywhere' in str(e):
            error_msg = "SciPy version incompatibility (_lazywhere removed in scipy >= 1.14)"
        return {
            'fms': np.nan,
            'success': False,
            'error': error_msg,
            'method': 'import_failed'
        }
    except Exception as e:
        return {
            'fms': np.nan,
            'success': False,
            'error': str(e),
            'method': 'failed'
        }

def calculate_reconstruction_fms(tensor, decomposition, rank):
    """
    Calculate FMS based on reconstruction quality.
    
    This compares the original tensor to its reconstruction from the decomposition,
    providing a measure of how well the decomposition captures the data.
    
    Higher scores indicate better reconstruction:
    - 1.0 = perfect reconstruction
    - 0.5 = no correlation
    - 0.0 = perfect negative correlation
    
    Note: This is a fast approximation. For more rigorous validation, use
    synthetic validation FMS (calculate_proper_factor_match_score).
    """
    try:
        # Create reconstructed tensor
        gene_factors = decomposition.factors[0]
        sample_factors = decomposition.factors[1] 
        time_factors = decomposition.factors[2]
        weights = decomposition.weights
        
        # Reconstruct tensor
        reconstructed = np.zeros_like(tensor)
        for r in range(len(weights)):
            outer_prod = np.outer(gene_factors[:, r], sample_factors[:, r])
            outer_prod = np.outer(outer_prod.flatten(), time_factors[:, r])
            outer_prod = outer_prod.reshape(tensor.shape)
            reconstructed += weights[r] * outer_prod
        
        # Calculate correlation-based FMS
        # Flatten tensors and calculate correlation
        orig_flat = tensor.flatten()
        recon_flat = reconstructed.flatten()
        
        # Remove NaN values for correlation calculation
        valid_mask = np.isfinite(orig_flat) & np.isfinite(recon_flat)
        orig_valid = orig_flat[valid_mask]
        recon_valid = recon_flat[valid_mask]
        
        if len(orig_valid) > 0:
            correlation = np.corrcoef(orig_valid, recon_valid)[0, 1]
            # Convert correlation to FMS-like score (0-1 scale)
            fms_score = (correlation + 1) / 2  # Map [-1,1] to [0,1]
        else:
            fms_score = 0.0
        
        return {
            'fms': float(fms_score),
            'correlation': float(correlation) if len(orig_valid) > 0 else np.nan,
            'success': True,
            'method': 'reconstruction_correlation',
            'valid_elements': len(orig_valid)
        }
        
    except Exception as e:
        print(f"  Error calculating reconstruction FMS for rank {rank}: {e}")
        return {
            'fms': np.nan,
            'success': False,
            'error': str(e),
            'method': 'failed'
        }

def evaluate_fms_for_rank(rank, output_dir, tensor, use_synthetic_validation=False):
    """
    Evaluate Factor Match Score for a specific rank.
    
    Parameters:
    -----------
    rank : int
        Rank to evaluate
    output_dir : str
        Directory containing rank results
    tensor : np.ndarray
        Original tensor
    use_synthetic_validation : bool, default=False
        If True, calculate synthetic validation FMS (RECOMMENDED but computationally expensive).
        If False, only calculate reconstruction FMS (fast approximation).
    """
    print(f"\n--- Evaluating FMS for Rank {rank} ---")
    
    rank_dir = os.path.join(output_dir, f'rank_{rank:02d}')
    
    # Check if rank directory exists
    if not os.path.exists(rank_dir):
        print(f"  Rank directory not found: {rank_dir}")
        return None
    
    # Load factor matrices
    try:
        gene_factors = pd.read_csv(os.path.join(rank_dir, 'gene_factors.csv'), index_col=0)
        sample_factors = pd.read_csv(os.path.join(rank_dir, 'sample_factors.csv'), index_col=0)
        time_factors = pd.read_csv(os.path.join(rank_dir, 'time_factors.csv'), index_col=0)
        weights_df = pd.read_csv(os.path.join(rank_dir, 'component_weights.csv'))
        
        # Remove metadata columns from sample_factors if present
        factor_cols = [col for col in sample_factors.columns if col.startswith('Component_')]
        sample_factor_values = sample_factors[factor_cols]
        
        print(f"  Loaded factor matrices - Genes: {gene_factors.shape}, Samples: {sample_factor_values.shape}, Time: {time_factors.shape}")
        
    except Exception as e:
        print(f"  Error loading factor matrices: {e}")
        return None
    
    # Create mock decomposition object that mimics tensorly CPTensor structure
    # TensorLy expects a tuple-like object: (weights, factors)
    # We'll create a simple class that supports both attribute and subscript access
    class MockDecomposition:
        def __init__(self, weights, factors):
            self.weights = weights
            self.factors = factors
            # Make it subscriptable for tensorly compatibility
            self._tuple = (weights, factors)
        
        def __getitem__(self, index):
            """Support subscript access like a tuple"""
            return self._tuple[index]
        
        def __len__(self):
            """Support len() for tuple-like behavior"""
            return 2
    
    weights = weights_df['Weight'].values
    factors = [gene_factors.values, sample_factor_values.values, time_factors.values]
    mock_decomposition = MockDecomposition(weights, factors)
    
    # Calculate FMS metrics
    # 1. TLViz FMS (self-comparison - sanity check only, not useful for rank selection)
    fms_result1 = calculate_factor_match_score(tensor, mock_decomposition, rank)
    
    # 2. Reconstruction FMS (fast approximation - useful for quick comparison)
    fms_result2 = calculate_reconstruction_fms(tensor, mock_decomposition, rank)
    
    # 3. Synthetic validation FMS (RECOMMENDED - most rigorous but computationally expensive)
    if use_synthetic_validation:
        print(f"  Running synthetic validation (this will take a long time)...")
        fms_result3 = calculate_proper_factor_match_score(tensor, mock_decomposition, rank)
    else:
        fms_result3 = {'success': False, 'method': 'skipped', 'fms': np.nan}
    
    # Combine results
    result = {
        'rank': rank,
        'fms_tlviz': fms_result1['fms'] if fms_result1['success'] else np.nan,
        'fms_reconstruction': fms_result2['fms'] if fms_result2['success'] else np.nan,
        'fms_synthetic': fms_result3['fms'] if fms_result3['success'] else np.nan,
        'reconstruction_correlation': fms_result2.get('correlation', np.nan),
        'tlviz_success': fms_result1['success'],
        'reconstruction_success': fms_result2['success'],
        'synthetic_success': fms_result3['success'],
        'tlviz_method': fms_result1.get('method', 'unknown'),
        'reconstruction_method': fms_result2.get('method', 'unknown'),
        'synthetic_method': fms_result3.get('method', 'unknown'),
        'valid_elements': fms_result2.get('valid_elements', 0)
    }
    
    # Print results
    # TLViz FMS note: This compares decomposition to itself (always ~1.0)
    if fms_result1['success']:
        warning_note = fms_result1.get('warning', '')
        if warning_note:
            print(f"  TLViz FMS: {fms_result1['fms']:.4f} ⚠️  (self-consistency check only)")
        else:
            print(f"  TLViz FMS: {fms_result1['fms']:.4f} ✅")
    elif 'scipy_incompatibility' not in fms_result1.get('method', ''):
        # Only print if it's a real error (not the expected scipy issue)
        print(f"  TLViz FMS: Failed - {fms_result1.get('error', 'unknown')} (method: {fms_result1.get('method', 'unknown')})")
    
    # Reconstruction FMS (fast approximation)
    if fms_result2['success']:
        print(f"  📊 RECONSTRUCTION FMS: {fms_result2['fms']:.4f} (correlation: {fms_result2.get('correlation', 'N/A'):.4f})")
        print(f"     ↳ Fast approximation for quick comparison")
        print(f"     ↳ Valid elements used: {fms_result2.get('valid_elements', 0):,}")
    else:
        print(f"  Reconstruction FMS: Failed ({fms_result2.get('error', 'unknown error')})")
    
    # Synthetic validation FMS (RECOMMENDED for rank selection)
    if use_synthetic_validation and fms_result3['success']:
        recovery = fms_result3.get('recovery_quality', 'unknown')
        noise = fms_result3.get('noise_level', 0.1)
        print(f"  ⭐ SYNTHETIC FMS: {fms_result3['fms']:.4f} ({recovery} recovery with {noise*100:.0f}% noise) ⭐")
        print(f"     ↳ RECOMMENDED metric for rank selection")
        print(f"     ↳ Tests factor recovery robustness with noise")
    elif use_synthetic_validation and not fms_result3['success']:
        print(f"  Synthetic FMS: Failed ({fms_result3.get('error', 'unknown error')})")
    
    # Save individual rank FMS results
    fms_file = os.path.join(rank_dir, 'factor_match_score.json')
    import json
    with open(fms_file, 'w') as f:
        json.dump({
            'rank': rank,
            'tlviz_fms': fms_result1,
            'reconstruction_fms': fms_result2,
            'synthetic_fms': fms_result3,
            'summary': result
        }, f, indent=2, default=str)
    
    print(f"  Saved FMS results to: factor_match_score.json")
    
    return result

# ===========================================
# EVALUATE FMS FOR ALL RANKS
# ===========================================

# Get output directory if not available
if 'output_dir' not in globals():
    output_dir = r.output_dir

# Get tensor if not available
if 'tensor_3d' not in globals():
    print("ERROR: tensor_3d not found in memory!")
    print("Please run the tensor creation chunks first.")
    tensor_3d = None

if tensor_3d is not None:
    # Load log file to get successful ranks
    log_file = os.path.join(output_dir, 'rank_comparison_log.json')
    
    if os.path.exists(log_file):
        import json
        with open(log_file, 'r') as f:
            log_data = json.load(f)
        
        successful_ranks = [r['rank'] for r in log_data['results'] if r['success']]
        print(f"Found {len(successful_ranks)} successful ranks to evaluate: {successful_ranks}")
        
    else:
        # Fallback: find rank directories
        import glob
        rank_dirs = sorted(glob.glob(os.path.join(output_dir, 'rank_*')))
        successful_ranks = []
        for rank_dir in rank_dirs:
            try:
                rank_num = int(os.path.basename(rank_dir).split('_')[1])
                successful_ranks.append(rank_num)
            except:
                pass
        print(f"Found {len(successful_ranks)} rank directories: {successful_ranks}")
    
    # ===========================================
    # CONFIGURATION: Synthetic Validation FMS
    # ===========================================
    # Set to True to enable the RECOMMENDED synthetic validation FMS metric.
    # This is the most rigorous method for rank selection as it tests factor
    # recovery from noisy data, providing the best indication of optimal rank.
    # 
    # COMPUTATION TIME: Adds ~1-2 minutes per rank
    # For 12 ranks, expect ~15-30 minutes additional runtime
    # 
    # If False, only reconstruction FMS will be calculated (faster but less rigorous)
    # ===========================================
    USE_SYNTHETIC_VALIDATION = True  # Change to False for faster (but less rigorous) analysis
    # ===========================================
    
    if USE_SYNTHETIC_VALIDATION:
        print("\n" + "="*60)
        print("⭐ SYNTHETIC VALIDATION ENABLED (RECOMMENDED)")
        print("="*60)
        print("This will test factor recovery robustness by:")
        print("  1. Creating synthetic tensor from each decomposition")
        print("  2. Adding noise to test robustness")
        print("  3. Re-decomposing to see if factors are recovered")
        print("  4. Comparing recovered factors to originals")
        print(f"\nExpected additional time: {len(successful_ranks) * 1.5:.0f}-{len(successful_ranks) * 2:.0f} minutes")
        print("This provides the most reliable metric for rank selection.")
        print("="*60 + "\n")
    
    # Evaluate FMS for each rank
    all_fms_results = []
    
    for rank in sorted(successful_ranks):
        fms_result = evaluate_fms_for_rank(rank, output_dir, tensor_3d, 
                                          use_synthetic_validation=USE_SYNTHETIC_VALIDATION)
        if fms_result is not None:
            all_fms_results.append(fms_result)
    
    # ===========================================
    # CREATE FMS COMPARISON TABLE AND PLOTS
    # ===========================================
    if len(all_fms_results) > 0:
        print(f"\n{'='*60}")
        print("FMS RESULTS SUMMARY")
        print(f"{'='*60}")
        
        # Create comparison dataframe
        fms_df = pd.DataFrame(all_fms_results)
        
        # Save comprehensive FMS results
        comparison_dir = os.path.join(output_dir, 'rank_comparison')
        os.makedirs(comparison_dir, exist_ok=True)
        
        fms_output_file = os.path.join(comparison_dir, 'factor_match_scores.csv')
        fms_df.to_csv(fms_output_file, index=False)
        print(f"Saved FMS comparison table to: {fms_output_file}")
        
        # Display results
        # Determine which columns to display based on what was calculated
        display_cols = ['rank', 'fms_tlviz', 'fms_reconstruction', 'reconstruction_correlation']
        if 'fms_synthetic' in fms_df.columns and fms_df['synthetic_success'].any():
            display_cols.append('fms_synthetic')
        
        print("\n" + "="*60)
        print("FACTOR MATCH SCORE RESULTS")
        print("="*60)
        print("⚠️  TLViz FMS: Self-comparison only (always ~1.0, not useful)")
        print("📊 Reconstruction FMS: Fast approximation (useful for quick comparison)")
        if 'fms_synthetic' in display_cols:
            print("⭐ SYNTHETIC FMS: RECOMMENDED for rank selection (tests robustness)")
        print("="*60)
        print(fms_df[display_cols].to_string(index=False, float_format='%.4f'))
        
        # Check if TLViz worked for any ranks
        tlviz_working = fms_df['tlviz_success'].any()
        if not tlviz_working:
            print("\n" + "="*60)
            print("NOTE: TLViz FMS unavailable (scipy version incompatibility)")
            print("="*60)
            print("This is expected with scipy >= 1.14.")
            print("Use Reconstruction FMS (fast) or Synthetic FMS (recommended).")
            print("="*60)
        elif tlviz_working and fms_df.loc[fms_df['tlviz_success'], 'fms_tlviz'].min() > 0.95:
            print("\n" + "="*60)
            print("⚠️  WARNING: TLViz FMS shows all values near 1.0")
            print("="*60)
            print("This indicates self-comparison (not useful for rank selection).")
            print("→ Use Synthetic FMS (recommended) or Reconstruction FMS (fast)")
            print("="*60)
        
        # Create visualization
        import matplotlib.pyplot as plt
        
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
        
        # Plot 1: Both FMS measures (but emphasize reconstruction if TLViz failed)
        valid_tlviz = fms_df['tlviz_success']
        valid_recon = fms_df['reconstruction_success']
        
        if valid_tlviz.any():
            ax1.plot(fms_df.loc[valid_tlviz, 'rank'], 
                    fms_df.loc[valid_tlviz, 'fms_tlviz'], 
                    'o-', linewidth=2, markersize=8, label='TLViz FMS', color='steelblue')
        
        if valid_recon.any():
            ax1.plot(fms_df.loc[valid_recon, 'rank'], 
                    fms_df.loc[valid_recon, 'fms_reconstruction'], 
                    's-', linewidth=2, markersize=8, label='Reconstruction FMS', color='coral')
        
        # Adjust title based on whether TLViz worked
        if not tlviz_working:
            title = 'Factor Match Score vs Rank\n(Reconstruction FMS: Fast Approximation)'
        else:
            title = 'Factor Match Score vs Rank\n(Higher = Better Factor Recovery)'
        
        ax1.set_xlabel('Rank (Number of Components)', fontsize=11)
        ax1.set_ylabel('Factor Match Score', fontsize=11)
        ax1.set_title(title, fontweight='bold')
        ax1.grid(True, alpha=0.3)
        ax1.legend()
        ax1.set_ylim([0, 1.05])
        
        # Plot 2: Reconstruction correlation
        if valid_recon.any():
            ax2.plot(fms_df.loc[valid_recon, 'rank'], 
                    fms_df.loc[valid_recon, 'reconstruction_correlation'], 
                    'o-', linewidth=2, markersize=8, color='green')
            ax2.axhline(y=0.8, color='orange', linestyle='--', alpha=0.5, label='80% correlation')
            ax2.axhline(y=0.9, color='red', linestyle='--', alpha=0.5, label='90% correlation')
        
        ax2.set_xlabel('Rank (Number of Components)', fontsize=11)
        ax2.set_ylabel('Reconstruction Correlation', fontsize=11)
        ax2.set_title('Tensor Reconstruction Quality vs Rank\n(Correlation between original and reconstructed)', fontweight='bold')
        ax2.grid(True, alpha=0.3)
        ax2.legend()
        ax2.set_ylim([-1.05, 1.05])
        
        plt.tight_layout()
        fms_plot_file = os.path.join(comparison_dir, 'factor_match_scores_plot.png')
        plt.savefig(fms_plot_file, dpi=150, bbox_inches='tight')
        print(f"Saved FMS plot to: {fms_plot_file}")
        plt.close()
        
        # ===========================================
        # FMS-BASED RANK RECOMMENDATION
        # ===========================================
        print(f"\n{'='*60}")
        print("FMS-BASED RANK RECOMMENDATION")
        print(f"{'='*60}")
        
        # Find best ranks based on different criteria
        valid_df = fms_df[fms_df['reconstruction_success']]
        
        if len(valid_df) > 0:
            # Check for synthetic FMS results (RECOMMENDED metric)
            synthetic_valid = fms_df[fms_df['synthetic_success']]
            
            if len(synthetic_valid) > 0:
                # Synthetic FMS is available - use it as PRIMARY recommendation
                best_synthetic_idx = synthetic_valid['fms_synthetic'].idxmax()
                best_synthetic_rank = synthetic_valid.iloc[best_synthetic_idx]['rank']
                best_synthetic_score = synthetic_valid.iloc[best_synthetic_idx]['fms_synthetic']
                
                print(f"⭐ RECOMMENDED: Rank {int(best_synthetic_rank)} (Synthetic FMS: {best_synthetic_score:.4f})")
                print(f"   ↳ Based on factor recovery robustness with noise")
                
                # Show top 3 ranks by synthetic FMS
                top_synthetic = synthetic_valid.nlargest(3, 'fms_synthetic')[['rank', 'fms_synthetic']]
                print(f"\nTop 3 ranks by Synthetic FMS:")
                for idx, row in top_synthetic.iterrows():
                    print(f"  Rank {int(row['rank'])}: {row['fms_synthetic']:.4f}")
            else:
                print(f"⚠️  Synthetic FMS not calculated (set USE_SYNTHETIC_VALIDATION=True)")
                print(f"   ↳ Using Reconstruction FMS for recommendation (less rigorous)")
                
                # Fall back to reconstruction FMS
                best_recon_idx = valid_df['fms_reconstruction'].idxmax()
                best_recon_rank = valid_df.iloc[best_recon_idx]['rank']
                best_recon_score = valid_df.iloc[best_recon_idx]['fms_reconstruction']
                
                print(f"\n📊 Rank {int(best_recon_rank)} (Reconstruction FMS: {best_recon_score:.4f})")
                print(f"   ↳ Based on reconstruction correlation (fast approximation)")
            
            # Additional metrics for context
            print(f"\n--- Additional Metrics ---")
            
            # Best reconstruction correlation
            best_corr_idx = valid_df['reconstruction_correlation'].idxmax()
            best_corr_rank = valid_df.iloc[best_corr_idx]['rank']
            best_corr_score = valid_df.iloc[best_corr_idx]['reconstruction_correlation']
            print(f"Best Reconstruction Correlation: Rank {int(best_corr_rank)} ({best_corr_score:.4f})")
            
            # Check for TLViz results (informational only)
            tlviz_valid = fms_df[fms_df['tlviz_success']]
            if len(tlviz_valid) > 0:
                best_tlviz_idx = tlviz_valid['fms_tlviz'].idxmax()
                best_tlviz_rank = tlviz_valid.iloc[best_tlviz_idx]['rank']
                best_tlviz_score = tlviz_valid.iloc[best_tlviz_idx]['fms_tlviz']
                print(f"TLViz FMS (informational): Rank {int(best_tlviz_rank)} ({best_tlviz_score:.4f})")
            
            # Provide interpretation
            print(f"\n--- FMS Metrics Explained ---")
            if len(synthetic_valid) > 0:
                print(f"⭐ Synthetic FMS (RECOMMENDED):")
                print(f"   • Tests factor recovery from noisy data")
                print(f"   • Most rigorous validation of decomposition quality")
                print(f"   • Best metric for determining optimal rank")
                print(f"   • Values > 0.9 = excellent, > 0.7 = good, > 0.5 = fair")
            print(f"📊 Reconstruction FMS (fast approximation):")
            print(f"   • Based on correlation with original tensor")
            print(f"   • Useful for quick comparison")
            print(f"   • Less rigorous than synthetic validation")
            print(f"⚠️  TLViz FMS (self-comparison only):")
            print(f"   • Not useful for rank selection (always ~1.0)")
            print(f"   • Only validates internal consistency")
            
        else:
            print("No valid FMS results available for recommendation.")
    
    else:
        print("No FMS results to analyze.")

else:
    print("Skipping FMS evaluation - tensor not available.")

print(f"\n{'='*60}")
print("FACTOR MATCH SCORE EVALUATION COMPLETE")
print(f"{'='*60}")
if 'all_fms_results' in locals() and len(all_fms_results) > 0:
    working_recon = sum(1 for r in all_fms_results if r['reconstruction_success'])
    working_tlviz = sum(1 for r in all_fms_results if r['tlviz_success'])
    print(f"Summary: {working_recon}/{len(all_fms_results)} ranks have valid Reconstruction FMS")
    print(f"         {working_tlviz}/{len(all_fms_results)} ranks have valid TLViz FMS")
    if working_recon > 0:
        print(f"→ Use Reconstruction FMS and correlation for rank selection")
    else:
        print(f"→ No valid FMS results - check tensor and factor matrices")
```


# VISUALIZATION

## Plot rank comparisons

### Plot Synthetic FMS Elbow (Standalone)

This chunk can be run independently to generate the synthetic FMS elbow plot without re-running the expensive Factor Match Score evaluation. It loads the saved FMS results from the CSV file.

```{python plot-synthetic-fms-elbow-standalone, eval=TRUE}
# ===========================================
# STANDALONE SYNTHETIC FMS ELBOW PLOT
# ===========================================
# This chunk loads saved FMS results and creates the elbow plot
# Can be run independently without re-running FMS evaluation
# ===========================================

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Get output directory
output_dir = r.output_dir if 'r' in dir() else '../output/13.00-multiomics-barnacle'
comparison_dir = os.path.join(output_dir, 'rank_comparison')
fms_csv_file = os.path.join(comparison_dir, 'factor_match_scores.csv')

print(f"\n{'='*60}")
print("SYNTHETIC FMS ELBOW PLOT (STANDALONE)")
print(f"{'='*60}")
print(f"Loading FMS results from: {fms_csv_file}")

# Check if FMS results file exists
if not os.path.exists(fms_csv_file):
    print(f"\nERROR: FMS results file not found!")
    print(f"Expected location: {fms_csv_file}")
    print(f"\nPlease run the Factor Match Score evaluation chunk first.")
    print(f"It should create: {comparison_dir}/factor_match_scores.csv")
else:
    # Load FMS results
    fms_df = pd.read_csv(fms_csv_file)
    
    print(f"Loaded FMS data for {len(fms_df)} ranks")
    print(f"Columns available: {list(fms_df.columns)}")
    
    # Check if synthetic FMS data is available
    if 'fms_synthetic' not in fms_df.columns:
        print(f"\nERROR: No 'fms_synthetic' column found in FMS results!")
        print(f"Synthetic validation was not performed.")
        print(f"\nTo generate synthetic FMS data:")
        print(f"1. Set USE_SYNTHETIC_VALIDATION = True in the FMS evaluation chunk")
        print(f"2. Re-run the Factor Match Score evaluation chunk")
    else:
        # Check if any ranks have valid synthetic FMS
        valid_synthetic = fms_df['synthetic_success'].fillna(False)
        
        if not valid_synthetic.any():
            print(f"\nERROR: No valid synthetic FMS values found!")
            print(f"All synthetic validation attempts failed or were skipped.")
            print(f"\nTo generate synthetic FMS data:")
            print(f"1. Set USE_SYNTHETIC_VALIDATION = True in the FMS evaluation chunk")
            print(f"2. Re-run the Factor Match Score evaluation chunk")
        else:
            print(f"Found {valid_synthetic.sum()} ranks with valid synthetic FMS")
            
            # ===========================================
            # CREATE SYNTHETIC FMS ELBOW PLOT
            # ===========================================
            
            fig, ax = plt.subplots(1, 1, figsize=(10, 7))
            
            # Get data for plotting
            valid_tlviz = fms_df['tlviz_success'].fillna(False)
            valid_recon = fms_df['reconstruction_success'].fillna(False)
            
            # Synthetic FMS (RECOMMENDED - should show elbow)
            synthetic_data = fms_df.loc[valid_synthetic].sort_values('rank')
            ax.plot(synthetic_data['rank'], synthetic_data['fms_synthetic'], 
                    'o-', linewidth=3, markersize=10, label='⭐ Synthetic Validation FMS (RECOMMENDED)', 
                    color='#2ecc71', zorder=3)
            
            # Mark the peak (optimal rank)
            max_synthetic_idx = synthetic_data['fms_synthetic'].idxmax()
            optimal_rank = synthetic_data.loc[max_synthetic_idx, 'rank']
            optimal_score = synthetic_data.loc[max_synthetic_idx, 'fms_synthetic']
            ax.plot(optimal_rank, optimal_score, 'r*', markersize=20, 
                    label=f'★ Optimal: Rank {int(optimal_rank)} (FMS={optimal_score:.4f})',
                    zorder=4)
            
            # Reconstruction FMS (fast approximation - typically increases monotonically)
            if valid_recon.any():
                recon_data = fms_df.loc[valid_recon].sort_values('rank')
                ax.plot(recon_data['rank'], recon_data['fms_reconstruction'], 
                        's--', linewidth=2, markersize=8, label='📊 Reconstruction FMS (fast)', 
                        color='#3498db', alpha=0.7, zorder=2)
            
            # TLViz FMS (if available - self-comparison, typically near 1.0)
            if valid_tlviz.any():
                tlviz_data = fms_df.loc[valid_tlviz].sort_values('rank')
                ax.plot(tlviz_data['rank'], tlviz_data['fms_tlviz'], 
                        '^:', linewidth=1.5, markersize=6, label='⚠️  TLViz FMS (sanity check only)', 
                        color='#95a5a6', alpha=0.5, zorder=1)
            
            ax.set_xlabel('Rank (Number of Components)', fontsize=12, fontweight='bold')
            ax.set_ylabel('Factor Match Score', fontsize=12, fontweight='bold')
            ax.set_title('Factor Match Score Comparison: Finding Optimal Rank\n' +
                         'Synthetic FMS (RECOMMENDED): Peaks at optimal rank, then declines with overfitting',
                         fontsize=13, fontweight='bold')
            ax.grid(True, alpha=0.3, linestyle='--')
            ax.legend(loc='best', fontsize=10)
            ax.set_ylim([0, 1.05])
            
            # Add quality threshold lines
            ax.axhline(y=0.9, color='green', linestyle='--', alpha=0.3, linewidth=1)
            ax.text(ax.get_xlim()[1]*0.95, 0.91, 'Excellent (>0.9)', 
                    ha='right', va='bottom', fontsize=9, color='green', alpha=0.7)
            ax.axhline(y=0.7, color='orange', linestyle='--', alpha=0.3, linewidth=1)
            ax.text(ax.get_xlim()[1]*0.95, 0.71, 'Good (>0.7)', 
                    ha='right', va='bottom', fontsize=9, color='orange', alpha=0.7)
            ax.axhline(y=0.5, color='red', linestyle='--', alpha=0.3, linewidth=1)
            ax.text(ax.get_xlim()[1]*0.95, 0.51, 'Fair (>0.5)', 
                    ha='right', va='bottom', fontsize=9, color='red', alpha=0.7)
            
            plt.tight_layout()
            synthetic_plot_file = os.path.join(comparison_dir, 'synthetic_fms_elbow_plot.png')
            plt.savefig(synthetic_plot_file, dpi=150, bbox_inches='tight')
            print(f"\n✓ Saved Synthetic FMS elbow plot to: {synthetic_plot_file}")
            plt.close()
            
            # Print summary
            print(f"\n{'='*60}")
            print("SYNTHETIC FMS ELBOW ANALYSIS")
            print(f"{'='*60}")
            print(f"Optimal Rank: {int(optimal_rank)} (FMS = {optimal_score:.4f})")
            
            # Show all ranks with their scores
            print(f"\nComplete Synthetic FMS Results:")
            print(f"{'Rank':<8} {'Synthetic FMS':<15} {'Quality':<12} {'Status'}")
            print(f"{'-'*8} {'-'*15} {'-'*12} {'-'*20}")
            for idx, row in synthetic_data.iterrows():
                rank_val = int(row['rank'])
                fms_val = row['fms_synthetic']
                
                if fms_val > 0.9:
                    quality = "Excellent"
                elif fms_val > 0.7:
                    quality = "Good"
                elif fms_val > 0.5:
                    quality = "Fair"
                else:
                    quality = "Poor"
                
                status = "★ OPTIMAL" if rank_val == optimal_rank else ""
                print(f"{rank_val:<8} {fms_val:<15.4f} {quality:<12} {status}")
            
            print(f"\nInterpretation:")
            print(f"  • Synthetic FMS tests factor recovery from noisy data")
            print(f"  • Peak at rank {int(optimal_rank)} indicates best balance:")
            print(f"    - Sufficient complexity to capture patterns")
            print(f"    - Robust enough to avoid overfitting noise")
            print(f"  • Higher ranks show declining FMS (overfitting)")
            print(f"  • Lower ranks show lower FMS (underfitting)")
            
            # Additional statistics
            if len(synthetic_data) > 1:
                decline_after_optimal = after_optimal['fms_synthetic'].mean() if len(after_optimal) > 0 else np.nan
                improvement_to_optimal = optimal_score - before_optimal['fms_synthetic'].max() if len(before_optimal) > 0 else np.nan
                
                print(f"\nStatistics:")
                if not np.isnan(improvement_to_optimal):
                    print(f"  • Improvement to optimal: +{improvement_to_optimal:.4f}")
                if not np.isnan(decline_after_optimal):
                    print(f"  • Average FMS after optimal: {decline_after_optimal:.4f}")
                    print(f"  • Decline from peak: -{(optimal_score - decline_after_optimal):.4f}")
            
            print(f"\n{'='*60}")

print(f"\nSYNTHETIC FMS ELBOW PLOT GENERATION COMPLETE")
print(f"{'='*60}")
```

### 1. Comparison Plots

Located in `barnacle_factors/rank_comparison/`:

- **`elbow_plot.png`** - Shows variance explained vs rank (MOST IMPORTANT)
  - Look for the "elbow" where gains diminish
  - Marginal gains are annotated on the plot
  
- **`rank_comparison_summary.png`** - 6-panel overview
- **`loss_trajectories.png`** - Convergence speed comparison

### 2. Metrics Table

File: `barnacle_factors/rank_comparison/rank_comparison_metrics.csv`

Contains all quantitative metrics for comparison.

### 3. Individual Rank Results

Each rank gets its own directory (`rank_05/`, `rank_10/`, etc.) with full factor matrices.

```{python plot-rank-comparison, eval=TRUE}
# ===========================================
# This chunk reads from rank_comparison_log.json
# and does NOT require variables from previous chunks
# ===========================================
import matplotlib.pyplot as plt
import seaborn as sns
import json
import glob
import os
import pandas as pd
import numpy as np

print(f"\n{'='*60}")
print("Creating rank comparison visualizations")
print(f"{'='*60}")

# Check if output_dir exists, if not get it from R
if 'output_dir' not in globals():
    print("Getting output_dir from R environment...")
    output_dir = r.output_dir
    print(f"  output_dir = {output_dir}")

comparison_dir = os.path.join(output_dir, 'rank_comparison')
os.makedirs(comparison_dir, exist_ok=True)

# ===========================================
# LOAD RESULTS FROM LOG FILE
# ===========================================
log_file = os.path.join(output_dir, 'rank_comparison_log.json')

if os.path.exists(log_file):
    print(f"Loading results from log file: {log_file}")
    with open(log_file, 'r') as f:
        log_data = json.load(f)
    
    print(f"  Log timestamp: {log_data['timestamp']}")
    print(f"  Total runs: {log_data['total_runs']}")
    print(f"  Successful: {log_data['successful_runs']}")
    print(f"  Failed: {log_data['failed_runs']}")
    
    # Reconstruct all_rank_results from log
    all_rank_results = []
    for result_info in log_data['results']:
        if result_info['success']:
            all_rank_results.append({
                'rank': result_info['rank'],
                'metrics': result_info['metrics'],
                'success': True
            })
        else:
            all_rank_results.append({
                'rank': result_info['rank'],
                'success': False,
                'error': result_info.get('error', 'Unknown error')
            })
    
    print(f"  Loaded {len(all_rank_results)} results from log file")
    
elif 'all_rank_results' not in globals():
    # Fallback: Load from individual metadata files
    print("Log file not found. Loading from individual rank directories...")
    all_rank_results = []
    
    # Find all rank_* directories
    rank_dirs = sorted(glob.glob(os.path.join(output_dir, 'rank_*')))
    print(f"  Found {len(rank_dirs)} rank directories")
    
    for rank_dir in rank_dirs:
        metadata_file = os.path.join(rank_dir, 'metadata.json')
        if os.path.exists(metadata_file):
            with open(metadata_file, 'r') as f:
                metadata = json.load(f)
                all_rank_results.append({
                    'rank': metadata['rank'],
                    'metrics': metadata['metrics'],
                    'success': True
                })
    
    print(f"  Loaded {len(all_rank_results)} results from disk")
else:
    print("Using all_rank_results from memory")

# Extract successful results
successful = [r for r in all_rank_results if r['success']]
if len(successful) == 0:
    print("No successful decompositions to compare!")
else:
    ranks = [r['rank'] for r in successful]
    
    # Create comparison dataframe
    comparison_df = pd.DataFrame([
        {
            'Rank': r['rank'],
            **r['metrics']
        }
        for r in successful
    ])
    
    # Save comparison table
    comparison_df.to_csv(os.path.join(comparison_dir, 'rank_comparison_metrics.csv'), index=False)
    print(f"  Saved metrics table")
    
    # Create multi-panel comparison plot
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))
    fig.suptitle('Rank Comparison: Key Metrics', fontsize=16, fontweight='bold')
    
    # 1. Variance Explained
    ax = axes[0, 0]
    ax.plot(ranks, comparison_df['variance_explained'], 'o-', linewidth=2, markersize=8, color='steelblue')
    ax.axhline(y=0.8, color='green', linestyle='--', alpha=0.5, label='80% threshold')
    ax.axhline(y=0.6, color='orange', linestyle='--', alpha=0.5, label='60% threshold')
    ax.set_xlabel('Rank (Number of Components)', fontsize=11)
    ax.set_ylabel('Variance Explained', fontsize=11)
    ax.set_title('Variance Explained vs Rank', fontweight='bold')
    ax.grid(True, alpha=0.3)
    ax.legend()
    ax.set_ylim([0, 1.05])
    
    # 2. Relative Error
    ax = axes[0, 1]
    ax.plot(ranks, comparison_df['relative_error'], 'o-', linewidth=2, markersize=8, color='coral')
    ax.set_xlabel('Rank (Number of Components)', fontsize=11)
    ax.set_ylabel('Relative Reconstruction Error', fontsize=11)
    ax.set_title('Reconstruction Error vs Rank', fontweight='bold')
    ax.grid(True, alpha=0.3)
    
    # 3. Final Loss
    ax = axes[0, 2]
    ax.plot(ranks, comparison_df['final_loss'], 'o-', linewidth=2, markersize=8, color='purple')
    ax.set_xlabel('Rank (Number of Components)', fontsize=11)
    ax.set_ylabel('Final Loss', fontsize=11)
    ax.set_title('Final Loss vs Rank', fontweight='bold')
    ax.grid(True, alpha=0.3)
    
    # 4. Gene Sparsity
    ax = axes[1, 0]
    ax.plot(ranks, comparison_df['gene_sparsity'], 'o-', linewidth=2, markersize=8, color='green')
    ax.axhline(y=0.9, color='red', linestyle='--', alpha=0.5, label='90% sparse')
    ax.set_xlabel('Rank (Number of Components)', fontsize=11)
    ax.set_ylabel('Gene Factor Sparsity', fontsize=11)
    ax.set_title('Gene Sparsity vs Rank', fontweight='bold')
    ax.grid(True, alpha=0.3)
    ax.legend()
    ax.set_ylim([0, 1.05])
    
    # 5. Weight Distribution (CV)
    ax = axes[1, 1]
    ax.plot(ranks, comparison_df['weight_cv'], 'o-', linewidth=2, markersize=8, color='darkorange')
    ax.set_xlabel('Rank (Number of Components)', fontsize=11)
    ax.set_ylabel('Weight Coefficient of Variation', fontsize=11)
    ax.set_title('Component Weight Variability vs Rank', fontweight='bold')
    ax.grid(True, alpha=0.3)
    
    # 6. Iterations to Convergence
    ax = axes[1, 2]
    colors = ['green' if c else 'red' for c in comparison_df['converged']]
    ax.bar(ranks, comparison_df['n_iterations'], color=colors, alpha=0.7)
    ax.set_xlabel('Rank (Number of Components)', fontsize=11)
    ax.set_ylabel('Iterations to Convergence', fontsize=11)
    ax.set_title('Convergence Speed vs Rank\n(Green=Converged, Red=Max Iterations)', fontweight='bold')
    ax.grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    output_path = os.path.join(comparison_dir, 'rank_comparison_summary.png')
    plt.savefig(output_path, dpi=150, bbox_inches='tight')
    print(f"  Saved: rank_comparison_summary.png")
    plt.close()
    
    # Create elbow plot
    fig, ax = plt.subplots(figsize=(10, 6))
    
    var_exp = comparison_df['variance_explained'].values
    ax.plot(ranks, var_exp, 'o-', linewidth=3, markersize=10, color='steelblue', label='Variance Explained')
    
    # Calculate marginal gains
    if len(ranks) > 1:
        marginal_gains = np.diff(var_exp)
        for i in range(len(marginal_gains)):
            ax.annotate(f'+{marginal_gains[i]*100:.1f}%',
                       xy=(ranks[i+1], var_exp[i+1]),
                       xytext=(10, -10),
                       textcoords='offset points',
                       fontsize=9,
                       color='darkgreen',
                       bbox=dict(boxstyle='round,pad=0.3', facecolor='yellow', alpha=0.3))
    
    ax.axhline(y=0.8, color='green', linestyle='--', alpha=0.5, linewidth=2, label='80% threshold (excellent)')
    ax.axhline(y=0.6, color='orange', linestyle='--', alpha=0.5, linewidth=2, label='60% threshold (good)')
    ax.set_xlabel('Rank (Number of Components)', fontsize=13, fontweight='bold')
    ax.set_ylabel('Variance Explained', fontsize=13, fontweight='bold')
    ax.set_title('Elbow Plot: Finding Optimal Rank\n(Look for diminishing returns)', 
                fontsize=14, fontweight='bold')
    ax.grid(True, alpha=0.3)
    ax.legend(fontsize=11)
    ax.set_ylim([0, 1.05])
    ax.set_xticks(ranks)
    
    plt.tight_layout()
    output_path = os.path.join(comparison_dir, 'elbow_plot.png')
    plt.savefig(output_path, dpi=150, bbox_inches='tight')
    print(f"  Saved: elbow_plot.png")
    plt.close()
    
    # Loss trajectories
    fig, ax = plt.subplots(figsize=(10, 6))
    
    loss_plotted = False
    for r in successful:
        # Try to get loss from model object first
        if 'model' in r and hasattr(r['model'], 'loss_') and r['model'].loss_ is not None:
            loss = r['model'].loss_
            ax.plot(range(1, len(loss)+1), loss, linewidth=2, label=f'Rank {r["rank"]}', alpha=0.8)
            loss_plotted = True
        else:
            # Load from disk
            rank_dir = os.path.join(output_dir, f'rank_{r["rank"]:02d}')
            loss_file = os.path.join(rank_dir, 'loss_history.csv')
            if os.path.exists(loss_file):
                loss_df = pd.read_csv(loss_file)
                ax.plot(loss_df['Iteration'], loss_df['Loss'], linewidth=2, label=f'Rank {r["rank"]}', alpha=0.8)
                loss_plotted = True
    
    if loss_plotted:
        ax.set_xlabel('Iteration', fontsize=12)
        ax.set_ylabel('Loss', fontsize=12)
        ax.set_title('Loss Convergence Trajectories by Rank', fontsize=14, fontweight='bold')
        ax.grid(True, alpha=0.3)
        ax.legend(fontsize=10)
        ax.set_yscale('log')
        
        plt.tight_layout()
        output_path = os.path.join(comparison_dir, 'loss_trajectories.png')
        plt.savefig(output_path, dpi=150, bbox_inches='tight')
        print(f"  Saved: loss_trajectories.png")
    else:
        print(f"  Skipped loss_trajectories.png (no loss history available)")
    
    plt.close()
    
    print(f"\nAll comparison plots saved to: {comparison_dir}")
```


# INDIVIDUAL RANK VISUALIZATIONS

## Detailed plots for Rank 10

This section creates comprehensive visualizations for rank 10 (the optimal rank determined from previous analysis).

```{python individual-rank-visualizations, eval=TRUE}
# ===========================================
# CREATE DETAILED VISUALIZATIONS FOR RANK 10
# ===========================================
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

print(f"\n{'='*60}")
print("CREATING RANK 10 VISUALIZATIONS")
print(f"{'='*60}")

# Get output directory
if 'output_dir' not in globals():
    output_dir = r.output_dir

# Set rank to analyze
SELECTED_RANK = 10
TOP_GENES_HEATMAP = 50  # Number of top genes to show in heatmap

# Process only rank 10
rank = SELECTED_RANK
print(f"\n--- Rank {rank} ---")
rank_dir = os.path.join(output_dir, f'rank_{rank:02d}')

# Load factor matrices
gene_factors_file = os.path.join(rank_dir, 'gene_factors.csv')
time_factors_file = os.path.join(rank_dir, 'time_factors.csv')
weights_file = os.path.join(rank_dir, 'component_weights.csv')

if not all([os.path.exists(f) for f in [gene_factors_file, time_factors_file, weights_file]]):
    print(f"ERROR: Missing factor files for rank {rank}")
    print(f"Please ensure barnacle decomposition has been run for rank {rank}")
else:
    gene_factors = pd.read_csv(gene_factors_file, index_col=0)
    time_factors = pd.read_csv(time_factors_file, index_col=0)
    weights = pd.read_csv(weights_file)
    
    # ===========================================
    # PLOT 1: GENE HEATMAPS (Multiple selection criteria)
    # ===========================================
    print(f"  Creating gene-component heatmaps with different selection criteria...")
    
    # ===========================================
    # 1A: Sum of absolute loadings (broadly active genes)
    # ===========================================
    print(f"    1A: Sum of absolute loadings (broadly active genes)...")
    gene_importance_sum = gene_factors.abs().sum(axis=1)
    top_genes_sum = gene_importance_sum.nlargest(TOP_GENES_HEATMAP).index
    top_gene_factors_sum = gene_factors.loc[top_genes_sum]
    
    # Save data to CSV
    csv_path = os.path.join(rank_dir, f'gene_heatmap_1A_sum_top{TOP_GENES_HEATMAP}_data.csv')
    top_gene_factors_sum.to_csv(csv_path)
    print(f"      Saved data: gene_heatmap_1A_sum_top{TOP_GENES_HEATMAP}_data.csv")
    
    fig, ax = plt.subplots(figsize=(max(12, rank*0.5), 10))
    sns.heatmap(
        top_gene_factors_sum,
        cmap='RdBu_r',
        center=0,
        cbar_kws={'label': 'Component Loading'},
        xticklabels=True,
        yticklabels=True,
        ax=ax
    )
    ax.set_xlabel('Components', fontsize=12, fontweight='bold')
    ax.set_ylabel('Gene ID', fontsize=12, fontweight='bold')
    ax.set_title(f'Rank {rank}: Top {TOP_GENES_HEATMAP} Genes by Sum of Absolute Loadings\n' + 
                f'(Genes that participate across multiple components)',
                fontsize=13, fontweight='bold', pad=15)
    plt.tight_layout()
    output_path = os.path.join(rank_dir, f'gene_heatmap_1A_sum_top{TOP_GENES_HEATMAP}.png')
    plt.savefig(output_path, dpi=150, bbox_inches='tight')
    plt.close()
    print(f"      Saved plot: gene_heatmap_1A_sum_top{TOP_GENES_HEATMAP}.png")
    
    # ===========================================
    # 1B: Maximum absolute loading (component-specific genes)
    # ===========================================
    print(f"    1B: Maximum absolute loading (component-specific genes)...")
    gene_importance_max = gene_factors.abs().max(axis=1)
    top_genes_max = gene_importance_max.nlargest(TOP_GENES_HEATMAP).index
    top_gene_factors_max = gene_factors.loc[top_genes_max]
    
    # Save data to CSV
    csv_path = os.path.join(rank_dir, f'gene_heatmap_1B_max_top{TOP_GENES_HEATMAP}_data.csv')
    top_gene_factors_max.to_csv(csv_path)
    print(f"      Saved data: gene_heatmap_1B_max_top{TOP_GENES_HEATMAP}_data.csv")
    
    fig, ax = plt.subplots(figsize=(max(12, rank*0.5), 10))
    sns.heatmap(
        top_gene_factors_max,
        cmap='RdBu_r',
        center=0,
        cbar_kws={'label': 'Component Loading'},
        xticklabels=True,
        yticklabels=True,
        ax=ax
    )
    ax.set_xlabel('Components', fontsize=12, fontweight='bold')
    ax.set_ylabel('Gene ID', fontsize=12, fontweight='bold')
    ax.set_title(f'Rank {rank}: Top {TOP_GENES_HEATMAP} Genes by Maximum Loading\n' + 
                f'(Genes highly specific to individual components)',
                fontsize=13, fontweight='bold', pad=15)
    plt.tight_layout()
    output_path = os.path.join(rank_dir, f'gene_heatmap_1B_max_top{TOP_GENES_HEATMAP}.png')
    plt.savefig(output_path, dpi=150, bbox_inches='tight')
    plt.close()
    print(f"      Saved plot: gene_heatmap_1B_max_top{TOP_GENES_HEATMAP}.png")
    
    # ===========================================
    # 1C: Mean absolute loading (consistently active genes)
    # ===========================================
    print(f"    1C: Mean absolute loading (consistently active genes)...")
    gene_importance_mean = gene_factors.abs().mean(axis=1)
    top_genes_mean = gene_importance_mean.nlargest(TOP_GENES_HEATMAP).index
    top_gene_factors_mean = gene_factors.loc[top_genes_mean]
    
    # Save data to CSV
    csv_path = os.path.join(rank_dir, f'gene_heatmap_1C_mean_top{TOP_GENES_HEATMAP}_data.csv')
    top_gene_factors_mean.to_csv(csv_path)
    print(f"      Saved data: gene_heatmap_1C_mean_top{TOP_GENES_HEATMAP}_data.csv")
    
    fig, ax = plt.subplots(figsize=(max(12, rank*0.5), 10))
    sns.heatmap(
        top_gene_factors_mean,
        cmap='RdBu_r',
        center=0,
        cbar_kws={'label': 'Component Loading'},
        xticklabels=True,
        yticklabels=True,
        ax=ax
    )
    ax.set_xlabel('Components', fontsize=12, fontweight='bold')
    ax.set_ylabel('Gene ID', fontsize=12, fontweight='bold')
    ax.set_title(f'Rank {rank}: Top {TOP_GENES_HEATMAP} Genes by Mean Absolute Loading\n' + 
                f'(Genes with consistent moderate activity across components)',
                fontsize=13, fontweight='bold', pad=15)
    plt.tight_layout()
    output_path = os.path.join(rank_dir, f'gene_heatmap_1C_mean_top{TOP_GENES_HEATMAP}.png')
    plt.savefig(output_path, dpi=150, bbox_inches='tight')
    plt.close()
    print(f"      Saved plot: gene_heatmap_1C_mean_top{TOP_GENES_HEATMAP}.png")
    
    # ===========================================
    # 1D: Variance of loadings (genes with high specificity)
    # ===========================================
    print(f"    1D: Variance of loadings (component-specific markers)...")
    gene_variance = gene_factors.var(axis=1)
    top_genes_var = gene_variance.nlargest(TOP_GENES_HEATMAP).index
    top_gene_factors_var = gene_factors.loc[top_genes_var]
    
    # Save data to CSV
    csv_path = os.path.join(rank_dir, f'gene_heatmap_1D_variance_top{TOP_GENES_HEATMAP}_data.csv')
    top_gene_factors_var.to_csv(csv_path)
    print(f"      Saved data: gene_heatmap_1D_variance_top{TOP_GENES_HEATMAP}_data.csv")
    
    fig, ax = plt.subplots(figsize=(max(12, rank*0.5), 10))
    sns.heatmap(
        top_gene_factors_var,
        cmap='RdBu_r',
        center=0,
        cbar_kws={'label': 'Component Loading'},
        xticklabels=True,
        yticklabels=True,
        ax=ax
    )
    ax.set_xlabel('Components', fontsize=12, fontweight='bold')
    ax.set_ylabel('Gene ID', fontsize=12, fontweight='bold')
    ax.set_title(f'Rank {rank}: Top {TOP_GENES_HEATMAP} Genes by Loading Variance\n' + 
                f'(Genes with high specificity - strong in some components, weak in others)',
                fontsize=13, fontweight='bold', pad=15)
    plt.tight_layout()
    output_path = os.path.join(rank_dir, f'gene_heatmap_1D_variance_top{TOP_GENES_HEATMAP}.png')
    plt.savefig(output_path, dpi=150, bbox_inches='tight')
    plt.close()
    print(f"      Saved plot: gene_heatmap_1D_variance_top{TOP_GENES_HEATMAP}.png")
    
    # ===========================================
    # Summary comparison
    # ===========================================
    print(f"\n    Gene selection summary:")
    print(f"      Sum method:      Favors genes active across many components")
    print(f"      Max method:      Favors genes strongly specific to one component")
    print(f"      Mean method:     Favors genes with consistent moderate activity")
    print(f"      Variance method: Favors genes with high component specificity")
    
    # Check overlap between methods
    overlap_sum_max = len(set(top_genes_sum) & set(top_genes_max))
    overlap_sum_mean = len(set(top_genes_sum) & set(top_genes_mean))
    overlap_max_var = len(set(top_genes_max) & set(top_genes_var))
    
    print(f"\n    Gene overlap between methods:")
    print(f"      Sum & Max:      {overlap_sum_max}/{TOP_GENES_HEATMAP} genes in common")
    print(f"      Sum & Mean:     {overlap_sum_mean}/{TOP_GENES_HEATMAP} genes in common")
    print(f"      Max & Variance: {overlap_max_var}/{TOP_GENES_HEATMAP} genes in common")
    
    # ===========================================
    # EXPORT TOP 500 GENES PER COMPONENT
    # ===========================================
    print(f"\n    Exporting top 500 genes per component...")
    TOP_GENES_PER_COMPONENT = 500
    
    for component in gene_factors.columns:
        # Get top genes by absolute loading for this component
        component_loadings = gene_factors[component].abs().sort_values(ascending=False)
        top_genes_for_component = component_loadings.head(TOP_GENES_PER_COMPONENT)
        
        # Create dataframe with gene ID and loading value (both absolute and signed)
        top_genes_df = pd.DataFrame({
            'gene_id': top_genes_for_component.index,
            'absolute_loading': top_genes_for_component.values,
            'signed_loading': gene_factors.loc[top_genes_for_component.index, component].values
        })
        
        # Save to CSV
        csv_filename = f'top_{TOP_GENES_PER_COMPONENT}_genes_{component}.csv'
        csv_path = os.path.join(rank_dir, csv_filename)
        top_genes_df.to_csv(csv_path, index=False)
        
    print(f"      Saved top {TOP_GENES_PER_COMPONENT} genes for each of {len(gene_factors.columns)} components")
    print(f"      Files: top_{TOP_GENES_PER_COMPONENT}_genes_Component_X.csv")
    
    # ===========================================
    # PLOT 2: TIME COMPONENT LINE PLOT
    # ===========================================
    print(f"  Creating time-component line plot...")
    
    fig, ax = plt.subplots(figsize=(12, 8))
    
    # Define different marker shapes for each component
    markers = ['o', 's', '^', 'D', 'v', '<', '>', 'p', '*', 'h', '+', 'x', '8', 'H', 'd', '|', '_']
    
    # Plot each component
    for i, col in enumerate(time_factors.columns):
        marker_style = markers[i % len(markers)]  # Cycle through markers if more components than markers
        ax.plot(time_factors.index, time_factors[col], 
               marker=marker_style, linewidth=2, markersize=8,
               label=f'{col}', alpha=0.8)
    
    ax.set_xlabel('Timepoint', fontsize=12, fontweight='bold')
    ax.set_ylabel('Component Loading', fontsize=12, fontweight='bold')
    ax.set_title(f'Rank {rank}: Component Loadings Across Timepoints',
                fontsize=13, fontweight='bold', pad=15)
    ax.grid(True, alpha=0.3)
    ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=9)
    ax.axhline(y=0, color='black', linestyle='-', linewidth=0.5, alpha=0.3)
    
    plt.tight_layout()
    output_path = os.path.join(rank_dir, 'time_component_lineplot.png')
    plt.savefig(output_path, dpi=150, bbox_inches='tight')
    plt.close()
    print(f"    Saved: time_component_lineplot.png")
    
    # ===========================================
    # PLOT 3: SAMPLE HEATMAP (Samples x Components)
    # ===========================================
    print(f"  Creating sample-component heatmap...")
    
    # Load sample factors with metadata
    sample_factors_file = os.path.join(rank_dir, 'sample_factors.csv')
    sample_factors = pd.read_csv(sample_factors_file, index_col=0)
    
    # Extract only the component columns
    component_cols = [col for col in sample_factors.columns if col.startswith('Component_')]
    sample_component_data = sample_factors[component_cols]
    
    # Create labels that include species and sample info
    sample_labels = [f"{row['Species']}_{row['Sample_ID']}" for idx, row in sample_factors.iterrows()]
    
    # Add sample labels as index for clustering
    sample_component_data.index = sample_labels
    
    # Create clustered heatmap using seaborn's clustermap
    # This performs hierarchical clustering on samples (rows) based on their component loadings
    print(f"    Performing hierarchical clustering on samples...")
    
    g = sns.clustermap(
        sample_component_data,
        cmap='RdBu_r',
        center=0,
        figsize=(max(12, rank*0.5), max(10, len(sample_labels)*0.3)),
        cbar_kws={'label': 'Component Loading'},
        row_cluster=True,   # Cluster samples (rows)
        col_cluster=False,  # Don't cluster components (columns) - keep original order
        xticklabels=True,
        yticklabels=True,
        dendrogram_ratio=0.15,
        cbar_pos=(0.02, 0.8, 0.03, 0.15),
        linewidths=0,
        method='average',  # Linkage method for hierarchical clustering
        metric='euclidean'  # Distance metric
    )
    
    # Adjust labels and title
    g.ax_heatmap.set_xlabel('Components', fontsize=12, fontweight='bold')
    g.ax_heatmap.set_ylabel('Sample (Species_SampleID)', fontsize=12, fontweight='bold')
    g.fig.suptitle(f'Rank {rank}: Sample Loadings Across Components\n' + 
                   f'(Hierarchically clustered by component loading similarity)',
                   fontsize=13, fontweight='bold', y=0.98)
    
    plt.tight_layout()
    output_path = os.path.join(rank_dir, 'sample_component_heatmap.png')
    plt.savefig(output_path, dpi=150, bbox_inches='tight')
    plt.close()
    print(f"    Saved: sample_component_heatmap.png")
    
    # ===========================================
    # PLOT 4: COMPONENT CORRELATION HEATMAP
    # ===========================================
    print(f"  Creating component correlation heatmap...")
    
    # Calculate correlation between components based on gene loadings
    component_corr = gene_factors.corr()
    
    fig, ax = plt.subplots(figsize=(max(10, rank*0.4), max(8, rank*0.35)))
    
    sns.heatmap(
        component_corr,
        annot=True if rank <= 20 else False,
        fmt='.2f' if rank <= 20 else '.1f',
        cmap='coolwarm',
        center=0,
        square=True,
        cbar_kws={'label': 'Correlation'},
        ax=ax
    )
    
    ax.set_title(f'Rank {rank}: Component Correlation Matrix\n' +
                '(Based on gene loadings - low correlation = good separation)',
                fontsize=13, fontweight='bold', pad=15)
    
    plt.tight_layout()
    output_path = os.path.join(rank_dir, 'component_correlation_heatmap.png')
    plt.savefig(output_path, dpi=150, bbox_inches='tight')
    plt.close()
    print(f"    Saved: component_correlation_heatmap.png")
    
    print(f"\n{'='*60}")
    print("RANK 10 VISUALIZATIONS COMPLETE")
    print(f"{'='*60}")
    print(f"\nThe following plots were created in rank_{rank:02d}/ directory:")
    print(f"  1A. gene_heatmap_1A_sum_top{TOP_GENES_HEATMAP}.png - Broadly active genes (sum)")
    print(f"  1B. gene_heatmap_1B_max_top{TOP_GENES_HEATMAP}.png - Component-specific genes (max)")
    print(f"  1C. gene_heatmap_1C_mean_top{TOP_GENES_HEATMAP}.png - Consistently active genes (mean)")
    print(f"  1D. gene_heatmap_1D_variance_top{TOP_GENES_HEATMAP}.png - High specificity genes (variance)")
    print(f"  2. time_component_lineplot.png - Component patterns over time")
    print(f"  3. sample_component_heatmap.png - Individual samples across components")
    print(f"  4. component_correlation_heatmap.png - Component independence check")
```


## Close log file

```{python close-log, eval=TRUE}
from datetime import datetime

print()
print("="*60)
print(f"ANALYSIS COMPLETE")
print(f"Finished: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"All output saved to: {output_dir}")
print(f"Log file: {log_file}")
print("="*60)

# Restore stdout and close log file
if hasattr(sys.stdout, 'log'):
    sys.stdout.log.close()
    sys.stdout = sys.stdout.terminal
```

# GOSLIM ANNOTATION ANALYSIS

This section annotates the top 500 genes for components 5, 6, 7, and 9 with GOslim Biological Process terms from the rank_10 directory.

**Note:** Only Biological Process (BP) GOslim terms are included in this analysis. Cellular Component and Molecular Function GOslim terms are excluded.

## Setup and load annotations

```{r goslim-setup, eval=TRUE}
library(tidyverse)
library(ggplot2)

# Define paths
rank_10_dir <- file.path(output_dir, "rank_10")
goslim_output_dir <- file.path(output_dir, "rank_10", "goslim_analysis")
dir.create(goslim_output_dir, showWarnings = FALSE, recursive = TRUE)

# Path to goslim_generic.obo file
goslim_obo_file <- "../output/12-ortho-annot/run_20250831_172744/goslim_generic.obo"

# Load the ortholog groups annotation file
ortho_annot <- read.csv(ortholog_groups_file, stringsAsFactors = FALSE)

cat(strrep("=", 60), "\n")
cat("GOSLIM ANNOTATION SETUP\n")
cat(strrep("=", 60), "\n\n")
cat("Rank 10 directory:", rank_10_dir, "\n")
cat("GOslim output directory:", goslim_output_dir, "\n")
cat("Loaded ortholog annotations:", nrow(ortho_annot), "ortholog groups\n\n")
```

## Extract Biological Process GOslim IDs

```{r extract-bp-goslims, eval=TRUE}
# Function to parse OBO file and extract Biological Process GOslim IDs
extract_bp_goslim_ids <- function(obo_file) {
  if (!file.exists(obo_file)) {
    cat("WARNING: GOslim OBO file not found:", obo_file, "\n")
    cat("Proceeding without BP filtering (all GOslims will be included)\n")
    return(NULL)
  }
  
  # Read OBO file
  obo_lines <- readLines(obo_file)
  
  # Parse OBO file
  bp_goslim_ids <- c()
  current_term <- list(id = NULL, namespace = NULL, is_goslim = FALSE)
  
  for (line in obo_lines) {
    line <- trimws(line)
    
    if (line == "[Term]") {
      # Save previous term if it's a BP GOslim
      if (!is.null(current_term$id) && 
          current_term$namespace == "biological_process" && 
          current_term$is_goslim) {
        bp_goslim_ids <- c(bp_goslim_ids, current_term$id)
      }
      # Reset for new term
      current_term <- list(id = NULL, namespace = NULL, is_goslim = FALSE)
      
    } else if (grepl("^id:", line)) {
      current_term$id <- sub("^id:\\s*", "", line)
      
    } else if (grepl("^namespace:", line)) {
      current_term$namespace <- sub("^namespace:\\s*", "", line)
      
    } else if (grepl("^subset:\\s*goslim_generic", line)) {
      current_term$is_goslim <- TRUE
    }
  }
  
  # Don't forget the last term
  if (!is.null(current_term$id) && 
      current_term$namespace == "biological_process" && 
      current_term$is_goslim) {
    bp_goslim_ids <- c(bp_goslim_ids, current_term$id)
  }
  
  return(unique(bp_goslim_ids))
}

# Extract BP GOslim IDs
bp_goslim_ids <- extract_bp_goslim_ids(goslim_obo_file)

if (!is.null(bp_goslim_ids)) {
  cat("Extracted Biological Process GOslim terms:", length(bp_goslim_ids), "\n")
  cat("First 10 BP GOslim IDs:", paste(head(bp_goslim_ids, 10), collapse = ", "), "\n\n")
} else {
  cat("NOTE: BP filtering disabled - all GOslim terms will be used\n\n")
}
```

## Load top 500 genes for components 5, 6, 7, and 9

```{r load-top-genes, eval=TRUE}
# Components to analyze
components_to_analyze <- c(5, 6, 7, 9)

# Load top 500 genes for each component
top_genes_list <- list()

cat(strrep("=", 60), "\n")
cat("LOADING TOP 500 GENES\n")
cat(strrep("=", 60), "\n\n")

for (comp_num in components_to_analyze) {
  file_path <- file.path(rank_10_dir, 
                         paste0("top_500_genes_Component_", comp_num, ".csv"))
  
  if (!file.exists(file_path)) {
    cat("WARNING: File not found:", file_path, "\n")
    next
  }
  
  # Load the top genes file
  top_genes_df <- read.csv(file_path, stringsAsFactors = FALSE)
  
  # Rename gene_id column to group_id for consistency with annotation file
  if ("gene_id" %in% colnames(top_genes_df)) {
    top_genes_df <- top_genes_df %>%
      rename(group_id = gene_id)
  }
  
  top_genes_list[[paste0("Component_", comp_num)]] <- top_genes_df
  
  cat("Component", comp_num, ":\n")
  cat("  Loaded", nrow(top_genes_df), "genes\n")
  cat("  Loading range:", 
      round(min(top_genes_df$absolute_loading), 4), "to", 
      round(max(top_genes_df$absolute_loading), 4), "\n")
  cat("  Top 3 genes:", paste(head(top_genes_df$group_id, 3), collapse = ", "), "\n\n")
}

cat("Total components loaded:", length(top_genes_list), "\n\n")
```

## Annotate with GOslim terms

```{r annotate-goslim, eval=TRUE}
# Function to parse GOslim IDs and names, filtering for BP terms only
parse_goslim <- function(goslim_ids_str, goslim_names_str, bp_filter_ids = NULL) {
  if (is.na(goslim_ids_str) || goslim_ids_str == "" || 
      is.na(goslim_names_str) || goslim_names_str == "") {
    return(NULL)
  }
  
  # Split GOslim IDs and names
  ids <- strsplit(goslim_ids_str, ";\\s*")[[1]]
  names <- strsplit(goslim_names_str, ";\\s*")[[1]]
  
  # Clean up whitespace
  ids <- trimws(ids)
  names <- trimws(names)
  
  # Return data frame if lengths match
  if (length(ids) == length(names) && length(ids) > 0) {
    result_df <- data.frame(
      goslim_id = ids,
      goslim_name = names,
      stringsAsFactors = FALSE
    )
    
    # Filter for Biological Process GOslims if bp_filter_ids is provided
    if (!is.null(bp_filter_ids)) {
      result_df <- result_df %>%
        filter(goslim_id %in% bp_filter_ids)
      
      # Return NULL if no BP terms remain after filtering
      if (nrow(result_df) == 0) {
        return(NULL)
      }
    }
    
    return(result_df)
  }
  
  return(NULL)
}

# Annotate each component's top genes
annotated_components <- list()

cat(strrep("=", 60), "\n")
cat("ANNOTATING WITH GOSLIM TERMS (Biological Process only)\n")
cat(strrep("=", 60), "\n\n")

for (comp_num in components_to_analyze) {
  comp_col <- paste0("Component_", comp_num)
  
  if (!comp_col %in% names(top_genes_list)) {
    cat("Component", comp_num, ": Skipping (not loaded)\n\n")
    next
  }
  
  top_genes_df <- top_genes_list[[comp_col]]
  
  cat("Component", comp_num, ":\n")
  
  # Merge with annotation data
  annotated_df <- top_genes_df %>%
    left_join(ortho_annot %>% 
                select(group_id, goslim_ids, goslim_names),
              by = "group_id")
  
  # Expand GOslim terms (one row per GOslim term)
  # Filter for Biological Process terms only
  goslim_expanded <- list()
  
  for (i in seq_len(nrow(annotated_df))) {
    row <- annotated_df[i, ]
    # Pass bp_goslim_ids to filter for BP terms only
    goslim_data <- parse_goslim(row$goslim_ids, row$goslim_names, bp_goslim_ids)
    
    if (!is.null(goslim_data)) {
      goslim_data$group_id <- row$group_id
      goslim_data$loading <- row$absolute_loading
      goslim_expanded[[i]] <- goslim_data
    }
  }
  
  # Combine all GOslim annotations
  if (length(goslim_expanded) > 0) {
    goslim_all <- bind_rows(goslim_expanded)
    
    # Count occurrences of each GOslim term
    goslim_counts <- goslim_all %>%
      group_by(goslim_id, goslim_name) %>%
      summarize(
        count = n(),
        mean_loading = mean(loading),
        .groups = "drop"
      ) %>%
      arrange(desc(count))
    
    annotated_components[[comp_col]] <- list(
      top_genes = annotated_df,
      goslim_counts = goslim_counts
    )
    
    cat("  Genes with BP GOslim:", 
        length(unique(goslim_all$group_id)), "/", nrow(top_genes_df), "\n")
    cat("  Unique BP GOslim terms:", nrow(goslim_counts), "\n")
    cat("  Top 5 BP GOslim terms:\n")
    top5 <- head(goslim_counts, 5)
    for (j in seq_len(nrow(top5))) {
      cat("    ", j, ". ", top5$goslim_name[j], " (", top5$goslim_id[j], ", n=", top5$count[j], ")\n", sep = "")
    }
  } else {
    cat("  WARNING: No BP GOslim annotations found!\n")
    annotated_components[[comp_col]] <- list(
      top_genes = annotated_df,
      goslim_counts = data.frame()
    )
  }
  cat("\n")
}
```

## Save GOslim annotation tables

```{r save-goslim-tables, eval=TRUE}
cat(strrep("=", 60), "\n")
cat("SAVING GOSLIM TABLES (Biological Process only)\n")
cat(strrep("=", 60), "\n\n")

for (comp_num in components_to_analyze) {
  comp_col <- paste0("Component_", comp_num)
  
  if (!comp_col %in% names(annotated_components)) {
    next
  }
  
  # Save top genes with annotations
  top_genes_file <- file.path(goslim_output_dir, 
                               paste0("component_", comp_num, "_top500_genes_goslim_BP_annotated.csv"))
  write.csv(annotated_components[[comp_col]]$top_genes, 
            top_genes_file, 
            row.names = FALSE)
  cat("Saved:", basename(top_genes_file), "\n")
  
  # Save GOslim counts
  if (nrow(annotated_components[[comp_col]]$goslim_counts) > 0) {
    goslim_counts_file <- file.path(goslim_output_dir, 
                                    paste0("component_", comp_num, "_goslim_BP_counts.csv"))
    write.csv(annotated_components[[comp_col]]$goslim_counts, 
              goslim_counts_file, 
              row.names = FALSE)
    cat("Saved:", basename(goslim_counts_file), "\n")
  }
}

cat("\nAll tables saved to:", goslim_output_dir, "\n\n")
```

## Create GOslim bar charts

```{r plot-goslim-barcharts, eval=TRUE, fig.width=10, fig.height=8}
cat(strrep("=", 60), "\n")
cat("CREATING GOSLIM BAR CHARTS (Biological Process only)\n")
cat(strrep("=", 60), "\n\n")

# Color palette for components
component_colors <- c(
  "5" = "#E64B35",
  "6" = "#4DBBD5", 
  "7" = "#00A087",
  "9" = "#F39B7F"
)

for (comp_num in components_to_analyze) {
  comp_col <- paste0("Component_", comp_num)
  
  if (!comp_col %in% names(annotated_components)) {
    next
  }
  
  goslim_counts <- annotated_components[[comp_col]]$goslim_counts
  
  if (nrow(goslim_counts) == 0) {
    cat("Component", comp_num, ": No BP GOslim data to plot\n")
    next
  }
  
  # Take top 20 GOslim terms for visualization
  top_goslim <- goslim_counts %>%
    head(20) %>%
    mutate(goslim_name = reorder(goslim_name, count))
  
  # Get total number of genes for subtitle
  top_genes <- annotated_components[[comp_col]]$top_genes
  total_genes <- nrow(top_genes)
  
  # Create bar chart
  p <- ggplot(top_goslim, aes(x = goslim_name, y = count)) +
    geom_bar(stat = "identity", fill = component_colors[as.character(comp_num)], 
             alpha = 0.8) +
    geom_text(aes(label = count), hjust = -0.2, size = 3.5) +
    coord_flip() +
    labs(
      title = paste0("Component ", comp_num, ": Top 20 GOslim Biological Process Terms"),
      subtitle = paste0("Top 500 genes (genes can have multiple BP GOslim annotations)"),
      x = "GOslim Biological Process Term",
      y = "Number of Ortholog Groups"
    ) +
    theme_bw() +
    theme(
      plot.title = element_text(size = 14, face = "bold"),
      plot.subtitle = element_text(size = 11, color = "gray30"),
      axis.text.y = element_text(size = 10),
      axis.text.x = element_text(size = 10),
      axis.title = element_text(size = 11, face = "bold"),
      panel.grid.major.y = element_blank(),
      panel.grid.minor = element_blank()
    ) +
    scale_y_continuous(expand = expansion(mult = c(0, 0.15)))
  
  # Save plot
  plot_file <- file.path(goslim_output_dir, 
                         paste0("component_", comp_num, "_goslim_BP_barchart.png"))
  ggsave(plot_file, p, width = 10, height = 8, dpi = 300)
  cat("Saved:", basename(plot_file), "\n")
}

cat("\nAll bar charts saved to:", goslim_output_dir, "\n\n")
```

## Create GOslim percentage bar charts

```{r plot-goslim-percentage-barcharts, eval=TRUE, fig.width=10, fig.height=8}
cat(strrep("=", 60), "\n")
cat("CREATING GOSLIM PERCENTAGE BAR CHARTS (Biological Process only)\n")
cat(strrep("=", 60), "\n\n")

# Color palette for components
component_colors <- c(
  "5" = "#E64B35",
  "6" = "#4DBBD5", 
  "7" = "#00A087",
  "9" = "#F39B7F"
)

for (comp_num in components_to_analyze) {
  comp_col <- paste0("Component_", comp_num)
  
  if (!comp_col %in% names(annotated_components)) {
    next
  }
  
  goslim_counts <- annotated_components[[comp_col]]$goslim_counts
  
  if (nrow(goslim_counts) == 0) {
    cat("Component", comp_num, ": No BP GOslim data to plot\n")
    next
  }
  
  # Calculate total genes with BP GOslim annotations
  # Note: We're using the full top 500 genes as the denominator
  top_genes <- annotated_components[[comp_col]]$top_genes
  total_genes <- nrow(top_genes)
  
  # Calculate percentage for each GOslim term
  # (genes can have multiple GOslim terms, so percentages may sum to >100%)
  goslim_percentages <- goslim_counts %>%
    mutate(percentage = (count / total_genes) * 100) %>%
    head(20) %>%
    mutate(goslim_name = reorder(goslim_name, percentage))
  
  # Create percentage bar chart
  p <- ggplot(goslim_percentages, aes(x = goslim_name, y = percentage)) +
    geom_bar(stat = "identity", fill = component_colors[as.character(comp_num)], 
             alpha = 0.8) +
    geom_text(aes(label = sprintf("%.1f%%", percentage)), hjust = -0.2, size = 3.5) +
    coord_flip() +
    labs(
      title = paste0("Component ", comp_num, ": Top 20 GOslim Biological Process Terms (%)"),
      subtitle = paste0("Percentage of 500 genes with each BP GOslim (genes can have multiple annotations)"),
      x = "GOslim Biological Process Term",
      y = "Percentage of Ortholog Groups (%)"
    ) +
    theme_bw() +
    theme(
      plot.title = element_text(size = 14, face = "bold"),
      plot.subtitle = element_text(size = 11, color = "gray30"),
      axis.text.y = element_text(size = 10),
      axis.text.x = element_text(size = 10),
      axis.title = element_text(size = 11, face = "bold"),
      panel.grid.major.y = element_blank(),
      panel.grid.minor = element_blank()
    ) +
    scale_y_continuous(expand = expansion(mult = c(0, 0.15)))
  
  # Save plot
  plot_file <- file.path(goslim_output_dir, 
                         paste0("component_", comp_num, "_goslim_BP_barchart_percentage.png"))
  ggsave(plot_file, p, width = 10, height = 8, dpi = 300)
  cat("Saved:", basename(plot_file), "\n")
  
  # Also save the percentage data
  percentage_file <- file.path(goslim_output_dir, 
                               paste0("component_", comp_num, "_goslim_BP_percentages.csv"))
  goslim_percentages_full <- goslim_counts %>%
    mutate(percentage = (count / total_genes) * 100) %>%
    arrange(desc(percentage))
  write.csv(goslim_percentages_full, percentage_file, row.names = FALSE)
  cat("Saved:", basename(percentage_file), "\n\n")
}

cat("All percentage bar charts saved to:", goslim_output_dir, "\n\n")
```

## Create comprehensive GOslim percentage bar charts (all 72 BP terms)

```{r plot-goslim-all-percentage-barcharts, eval=TRUE, fig.width=12, fig.height=22}
cat(strrep("=", 60), "\n")
cat("CREATING COMPREHENSIVE GOSLIM PERCENTAGE BAR CHARTS (All 72 BP GOslim terms)\n")
cat(strrep("=", 60), "\n\n")

# Color palette for components
component_colors <- c(
  "5" = "#E64B35",
  "6" = "#4DBBD5", 
  "7" = "#00A087",
  "9" = "#F39B7F"
)

# Create a reference dataframe with ALL 72 BP GOslim terms
# Use the bp_goslim_ids that were already extracted, and get names from OBO file
extract_bp_goslim_names <- function(obo_file, bp_ids) {
  if (!file.exists(obo_file)) {
    return(data.frame(goslim_id = bp_ids, goslim_name = bp_ids, stringsAsFactors = FALSE))
  }
  
  obo_lines <- readLines(obo_file)
  id_to_name <- list()
  current_id <- NULL
  current_name <- NULL
  
  for (line in obo_lines) {
    line <- trimws(line)
    
    if (line == "[Term]") {
      # Save previous term
      if (!is.null(current_id) && !is.null(current_name)) {
        id_to_name[[current_id]] <- current_name
      }
      current_id <- NULL
      current_name <- NULL
      
    } else if (grepl("^id:", line)) {
      current_id <- sub("^id:\\s*", "", line)
      
    } else if (grepl("^name:", line)) {
      current_name <- sub("^name:\\s*", "", line)
    }
  }
  
  # Don't forget the last term
  if (!is.null(current_id) && !is.null(current_name)) {
    id_to_name[[current_id]] <- current_name
  }
  
  # Create dataframe with all BP GOslim IDs and their names
  data.frame(
    goslim_id = bp_ids,
    goslim_name = sapply(bp_ids, function(id) {
      if (id %in% names(id_to_name)) {
        id_to_name[[id]]
      } else {
        id  # Use ID as name if name not found
      }
    }, USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Get names for all 72 BP GOslim IDs
all_bp_goslim_ref <- extract_bp_goslim_names(goslim_obo_file, bp_goslim_ids)

cat("BP GOslim reference table created with", nrow(all_bp_goslim_ref), "terms\n\n")

for (comp_num in components_to_analyze) {
  comp_col <- paste0("Component_", comp_num)
  
  if (!comp_col %in% names(annotated_components)) {
    next
  }
  
  goslim_counts <- annotated_components[[comp_col]]$goslim_counts
  
  # Get total number of genes
  top_genes <- annotated_components[[comp_col]]$top_genes
  total_genes <- nrow(top_genes)
  
  # Create complete dataframe with ALL 72 BP GOslim terms
  # Join with actual counts, filling missing terms with 0
  goslim_all_percentages <- all_bp_goslim_ref %>%
    left_join(goslim_counts, by = c("goslim_id", "goslim_name")) %>%
    mutate(
      count = replace_na(count, 0),
      percentage = (count / total_genes) * 100
    ) %>%
    arrange(desc(percentage)) %>%
    mutate(goslim_name = reorder(goslim_name, percentage))
  
  # Count how many terms have representation
  terms_with_data <- sum(goslim_all_percentages$count > 0)
  
  # Create comprehensive percentage bar chart with all 72 terms
  p <- ggplot(goslim_all_percentages, aes(x = goslim_name, y = percentage)) +
    geom_bar(stat = "identity", fill = component_colors[as.character(comp_num)], 
             alpha = 0.8) +
    geom_text(data = subset(goslim_all_percentages, percentage > 0),
              aes(label = sprintf("%.1f%%", percentage)), hjust = -0.2, size = 2.5) +
    coord_flip() +
    labs(
      title = paste0("Component ", comp_num, ": All 72 GOslim Biological Process Terms (%)"),
      subtitle = paste0(terms_with_data, " of 72 BP GOslim terms found in top 500 genes (genes can have multiple annotations)"),
      x = "GOslim Biological Process Term",
      y = "Percentage of Ortholog Groups (%)"
    ) +
    theme_bw() +
    theme(
      plot.title = element_text(size = 14, face = "bold"),
      plot.subtitle = element_text(size = 11, color = "gray30"),
      axis.text.y = element_text(size = 8),
      axis.text.x = element_text(size = 10),
      axis.title = element_text(size = 11, face = "bold"),
      panel.grid.major.y = element_blank(),
      panel.grid.minor = element_blank()
    ) +
    scale_y_continuous(expand = expansion(mult = c(0, 0.15)))
  
  # Save plot with fixed height for 72 terms
  plot_file <- file.path(goslim_output_dir, 
                         paste0("component_", comp_num, "_goslim_BP_barchart_all_percentage.png"))
  ggsave(plot_file, p, width = 12, height = 22, dpi = 300)
  cat("Saved:", basename(plot_file), "(", terms_with_data, "of 72 BP GOslim terms represented)\n")
}

cat("\nAll comprehensive percentage bar charts saved to:", goslim_output_dir, "\n\n")
```

## Summary statistics

```{r goslim-summary, eval=TRUE}
cat(strrep("=", 60), "\n")
cat("GOSLIM ANNOTATION SUMMARY (Biological Process only)\n")
cat(strrep("=", 60), "\n\n")

summary_data <- list()

for (comp_num in components_to_analyze) {
  comp_col <- paste0("Component_", comp_num)
  
  if (!comp_col %in% names(annotated_components)) {
    next
  }
  
  top_genes <- annotated_components[[comp_col]]$top_genes
  goslim_counts <- annotated_components[[comp_col]]$goslim_counts
  
  genes_with_goslim <- sum(!is.na(top_genes$goslim_ids) & top_genes$goslim_ids != "")
  
  if (nrow(goslim_counts) > 0) {
    top_term <- goslim_counts$goslim_name[1]
    top_term_id <- goslim_counts$goslim_id[1]
    top_count <- goslim_counts$count[1]
  } else {
    top_term <- NA
    top_term_id <- NA
    top_count <- NA
  }
  
  summary_data[[comp_col]] <- data.frame(
    Component = comp_num,
    Total_Genes = nrow(top_genes),
    Genes_With_BP_GOslim = length(unique(top_genes$group_id[!is.na(top_genes$goslim_ids) & top_genes$goslim_ids != ""])),
    Percent_Annotated = round(genes_with_goslim / nrow(top_genes) * 100, 1),
    Unique_BP_GOslim_Terms = nrow(goslim_counts),
    Top_BP_GOslim_Term = top_term,
    Top_BP_GOslim_ID = top_term_id,
    Top_BP_GOslim_Count = top_count,
    stringsAsFactors = FALSE
  )
}

summary_df <- bind_rows(summary_data)

print(summary_df)

# Save summary table
summary_file <- file.path(goslim_output_dir, "goslim_BP_annotation_summary.csv")
write.csv(summary_df, summary_file, row.names = FALSE)
cat("\nSaved summary table:", basename(summary_file), "\n")

cat("\n", strrep("=", 60), "\n")
cat("GOSLIM ANNOTATION ANALYSIS COMPLETE\n")
cat("Output directory:", goslim_output_dir, "\n")
cat(strrep("=", 60), "\n\n")

cat("Output files created:\n")
cat("  - component_[5,6,7,9]_top500_genes_goslim_BP_annotated.csv\n")
cat("  - component_[5,6,7,9]_goslim_BP_counts.csv\n")
cat("  - component_[5,6,7,9]_goslim_BP_percentages.csv\n")
cat("  - component_[5,6,7,9]_goslim_BP_barchart.png (counts)\n")
cat("  - component_[5,6,7,9]_goslim_BP_barchart_percentage.png (percentages)\n")
cat("  - goslim_BP_annotation_summary.csv\n")
cat("\nNote: Only Biological Process GOslim terms are included in this analysis.\n\n")
```

# SYSTEM INFO

```{r system-info, eval=TRUE}
sessionInfo()
```

# REFERENCES