# Barnacle Multi-Species Tensor Decomposition: Executive Summary

**Date:** December 3, 2025  
**Analysis:** 13.00-multiomics-barnacle  
**Method:** Sparse Canonical Polyadic (CP) Tensor Decomposition using Barnacle [@blaskowski2024]

---

## Overview

This analysis identifies shared gene expression patterns across three coral species (*Acropora pulchra*, *Porites evermanni*, *Pocillopora tuahiniensis*) during heat stress using sparse tensor decomposition. The workflow decomposes a 3D tensor (genes × samples × timepoints) into interpretable latent components representing coordinated biological responses.

**Key Innovation:** Integrates orthologous genes across species to discover conserved stress response patterns while accounting for species-specific and colony-specific variation.

---

## Data Sources

### Input Data
- **Ortholog groups:** 9,801 three-way ortholog groups with complete matches across all species
- **Expression data:** RNA-seq gene counts from heat stress timeseries experiment
  - *A. pulchra*: 10 colonies × 4 timepoints = 40 samples
  - *P. evermanni*: 9 colonies × 4 timepoints = 36 samples  
  - *P. tuahiniensis*: 8 colonies × 4 timepoints = 32 samples
- **Timepoints:** TP1 (baseline), TP2, TP3, TP4 (progressive heat stress)
- **Total samples:** 27 colonies with complete temporal profiles

### Data Quality
- **Gene coverage:** 100% of ortholog groups present in all species expression data
- **Completeness:** Only samples with all 4 timepoints included (filtered in R)
- **Replicates:** Multiple biological replicates per species (colonies as replicates)

---

## Workflow Steps

### 1. Data Preparation & Normalization

**Ortholog Alignment**
- Loaded annotated three-way ortholog groups from comparative genomics analysis
- Extracted expression matrices for each species using ortholog group IDs as common identifiers
- Verified 100% gene coverage across all three species

**Normalization Strategy**
- **Method:** `sctransform` (Hafemeister & Satija, 2019)
- **Rationale:** Variance-stabilizing transformation appropriate for count data
- **Fallback:** log₂(CPM + 1) if sctransform fails due to data structure issues
- **Filtering:** Genes with total counts ≤ 10 across samples removed before normalization
- **Output:** Normalized expression matrices preserving biological variation while removing technical noise

### 2. Tensor Construction

**Tensor Structure:** `genes × combined_samples × timepoints`

- **Dimension 1 (Genes):** 9,801 ortholog groups
- **Dimension 2 (Samples):** 27 colonies (10 apul + 9 peve + 8 ptua)
  - Samples combined across species in a single dimension
  - Each colony represented by 4 timepoint measurements
  - Species identity tracked in metadata mapping
- **Dimension 3 (Timepoints):** 4 measurement times (TP1-TP4)

**Final Tensor Shape:** `(9801, 27, 4)` = ~1.06M data points

**Handling Missing Values:** All samples pre-filtered to have complete temporal profiles; remaining NAs filled with zeros

### 3. Parameter Selection (Dissertation-Validated Method)

**Implementation:** Full Grid Search approach following Blaskowski (2024), Section 1.2.3

**Configuration**
```python
GRID_RANK_RANGE = [5, 10, 15, 20, 25, 30, 35]        # 7 ranks tested
GRID_LAMBDA_VALUES = [0.0, 0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0]  # 8 lambda values
Total combinations: 7 × 8 = 56
Cross-validation: Leave-one-species-out (3 folds)
Total models fit: 56 combinations × 3 folds = 168 models
```

**Two-Stage Selection Process:**

**Stage 1: Rank Selection (R)**
- **Criterion:** Minimum cross-validated sum of squared errors (CV-SSE) at λ=0.0
- **Rationale:** Identifies optimal model complexity without regularization bias
- **Validation:** Each rank tested with leave-one-species-out CV
  - Train on 2 species, test on held-out species
  - Mean CV-SSE calculated across 3 folds
  - Lower CV-SSE = better generalization to unseen species
- **Result:** Rank 35 selected (lowest CV-SSE among tested ranks)

**Stage 2: Lambda Selection (λ)**
- **Criterion:** Maximum lambda where CV-FMS ≥ (max_FMS - 1SE)
- **Rationale:** Balances sparsity with factor quality using one-standard-error rule
- **Metric:** Factor Match Score (FMS) measures consistency between models from different CV folds
  - FMS range: [0, 1] with 1 = perfect factor alignment
  - Calculated pairwise between all CV fold models
- **Result:** Optimal lambda determined at selected rank

**Cross-Validation Approach:**
- **Species-level CV** (RECOMMENDED - used in current analysis)
  - 3 folds: hold out one species at a time
  - Tests generalization to phylogenetically distinct species
  - Question: Can patterns learned from 2 species predict the 3rd?

- **Colony-level CV** (alternative, not used)
  - ~27 folds: hold out one colony at a time
  - Tests generalization to new biological replicates
  - Question: Can patterns learned from N-1 colonies predict the Nth?

### 4. Decomposition Execution

**Sparse CP Decomposition**
- **Algorithm:** Alternating Least Squares (ALS) with L1 regularization
- **Regularization:** Separate lambda values for each mode [λ_genes, λ_samples, λ_time]
- **Convergence:** Maximum 10,000 iterations or tolerance threshold
- **Random initialization:** Seed=42 for reproducibility
- **Multiple ranks tested:** Systematic comparison across rank range

**Optimization Details:**
- Each component extracted iteratively
- Sparsity induced through L1 penalty on factor loadings
- Non-negativity constraints optional (not enforced in current analysis)
- Convergence monitored through reconstruction error

### 5. Multiple Rank Comparison

**Systematic Rank Testing**
- **Ranks evaluated:** [5, 8, 10, 12, 15, 20, 25, 35, 45, 55, 65, 75]
- **Purpose:** Identify optimal number of latent components
- **Approach:** Run full decomposition at each rank with consistent parameters

**Evaluation Metrics (per rank):**

1. **Synthetic FMS** (MOST IMPORTANT)
   - Tests factor recovery from synthetic noisy data
   - Measures factor identifiability and stability
   - Higher values = more reliable factor structure
   - Used to identify "elbow" point where adding components yields diminishing returns

2. **Variance Explained**
   - Proportion of total tensor variation captured by decomposition
   - Target: >60% for meaningful biological interpretation
   - Calculated as: 1 - (reconstruction_error / total_variance)

3. **Relative Reconstruction Error**
   - Normalized difference between original and reconstructed tensor
   - Lower = better fit to data
   - Formula: ||X - X̂||_F / ||X||_F

4. **Sparsity Metrics**
   - Percentage of near-zero loadings in each factor mode
   - Indicates how many genes/samples strongly contribute to each component
   - Higher sparsity = more interpretable components

5. **Convergence Status**
   - Whether optimization reached convergence criteria
   - Failed convergence may indicate rank too high for data

6. **Component Weights**
   - Relative importance of each discovered component
   - Uniform weights = all components equally important
   - Skewed weights = some components dominate signal

**Selection Criteria:**
- Primary: Synthetic FMS elbow (curve flattens)
- Secondary: Variance explained (>60%)
- Tertiary: Convergence success
- Qualitative: Component interpretability

### 6. Results Storage & Organization

**Directory Structure:**
```
output/13.00-multiomics-barnacle/
├── apul_ortholog_expression.csv          # Per-species aligned expression
├── peve_ortholog_expression.csv
├── ptua_ortholog_expression.csv
├── apul_normalized_expression.csv        # Normalized data
├── peve_normalized_expression.csv
├── ptua_normalized_expression.csv
├── multiomics_tensor.npy                 # Final 3D tensor
├── dissertation_grid_search/             # Parameter selection results
│   ├── full_grid_results.csv            # All 56 combinations with metrics
│   ├── optimal_parameters.json          # Selected rank and lambda
│   ├── grid_search_sse_heatmap.png      # CV-SSE across rank×lambda grid
│   ├── grid_search_fms_heatmap.png      # CV-FMS across rank×lambda grid
│   ├── rank_selection_lambda_zero.png   # Stage 1: Rank selection plot
│   └── optimal_rank_lambda_selection.png # Stage 2: Lambda selection plot
├── rank_05/                              # Results for each tested rank
│   ├── barnacle_factors/
│   │   ├── gene_factors.csv             # Gene loadings (9801 × R)
│   │   ├── sample_factors.csv           # Sample loadings (27 × R)
│   │   ├── time_factors.csv             # Time loadings (4 × R)
│   │   ├── component_weights.csv        # Component importance (R × 1)
│   │   ├── sample_mapping.csv           # Sample metadata
│   │   └── metadata.json                # Analysis parameters
│   ├── figures/
│   │   ├── component_weights.png        # Bar plot of weights
│   │   ├── time_loadings.png            # Temporal patterns per component
│   │   ├── sample_heatmap.png           # Sample × component heatmap
│   │   └── top_genes_by_component.png   # Highest loading genes
│   └── loss_history.csv                 # Optimization convergence
├── rank_10/
├── rank_15/
│   ... (repeated for each tested rank)
└── rank_comparison_log.json             # Summary of all rank runs
```

---

## Key Results from Current Analysis

### Parameter Selection Outcome

**Selected Parameters:**
- **Optimal Rank:** 35 components
- **Selection basis:** Minimum CV-SSE = 488,328 at λ=0.0
- **Optimal Lambda:** [Determined by Stage 2 - see `optimal_parameters.json`]

**Rank Selection Details:**
- Tested ranks 5-35 (7 values)
- CV-SSE showed monotonic decrease across range
- Rank 35 (highest tested) had lowest error
- **IMPORTANT CAVEAT:** Monotonic decrease suggests true optimal rank may be higher than 35, or that a statistical significance criterion (1SE rule) should be applied to prefer lower ranks

**Cross-Validation Performance:**
| Rank | Mean CV-SSE | Std CV-SSE | Successful Folds |
|------|-------------|------------|------------------|
| 5    | 509,692     | 41,853     | 3/3              |
| 10   | 495,049     | 29,922     | 3/3              |
| 15   | 492,031     | 30,408     | 3/3              |
| 20   | 491,723     | 29,053     | 3/3              |
| 25   | 489,845     | 29,885     | 3/3              |
| 30   | 489,372     | 28,991     | 3/3              |
| **35** | **488,328** | **29,505** | **3/3**      |

**Observations:**
- All ranks converged successfully across all CV folds
- Error reduction diminishes between ranks 20-35 (~0.7% total improvement)
- Standard errors overlap substantially across ranks 20-35
- Consider testing higher ranks (40, 45, 50) to detect potential elbow point

### Decomposition Results

**35-Component Model:**
- 35 latent components discovered representing shared patterns
- Each component = combination of:
  - Gene loadings: Which genes co-vary in this pattern?
  - Sample loadings: Which colonies/species show this pattern?
  - Time loadings: How does this pattern change over timepoints?
  - Component weight: Overall importance of this pattern

**Interpretation Approach:**
1. Examine component weights → Identify most important patterns
2. Analyze time loadings → Understand temporal dynamics
3. Inspect gene loadings → Determine biological functions (GO enrichment)
4. Review sample loadings → Species/colony-specific responses
5. Cross-reference with metadata → Link to experimental conditions

---

## Biological Interpretation Framework

### Component Analysis Strategy

**1. Temporal Dynamics**
- Time factor plots show pattern evolution across TP1-TP4
- Identify: Baseline patterns, Early response, Progressive stress, Late response
- Compare temporal trajectories across components

**2. Species/Colony Patterns**
- Sample loadings reveal which colonies contribute to each component
- Species-level patterns: Conserved vs. species-specific responses
- Colony-level variation: Genetic diversity or condition-dependent responses

**3. Genetic Architecture**
- Gene loadings identify co-regulated gene sets
- High-loading genes = core members of pattern
- Sparse loadings = component specificity

**4. Functional Enrichment**
- Export top genes per component for GO enrichment analysis
- Link components to biological processes (stress response, metabolism, immunity)
- Cross-reference with coral stress literature

### Next Steps for Interpretation

1. **Component Prioritization**
   - Focus on components with highest weights
   - Examine components with clear temporal trends
   - Identify species-conserved vs. species-specific components

2. **Gene Set Analysis**
   - Extract top 100-500 genes per component
   - Run GO enrichment, KEGG pathway analysis
   - Compare to known stress response genes

3. **Validation**
   - Compare patterns to single-species analyses
   - Cross-reference with physiological measurements
   - Literature comparison with published coral stress studies

4. **Hypothesis Generation**
   - Which components represent adaptive vs. damage responses?
   - Are there phylogenetic patterns (Acropora vs. Porites vs. Pocillopora)?
   - Do colony-level patterns correlate with survival/bleaching severity?

---

## Methodological Considerations

### Strengths

1. **Multi-species integration**
   - Ortholog-based approach enables direct cross-species comparison
   - Discovers conserved evolutionary responses to heat stress
   - Accounts for species-specific variation

2. **Tensor structure preserves relationships**
   - Maintains temporal ordering within colonies
   - Separates colony-level and time-level variation
   - Enables three-way interactions (gene × sample × time)

3. **Sparse decomposition enhances interpretability**
   - L1 regularization produces sparse factor loadings
   - Fewer genes per component = clearer biological interpretation
   - Reduces false positives from noise

4. **Rigorous parameter selection**
   - Dissertation-validated methodology
   - Cross-validation prevents overfitting
   - Systematic comparison across parameter space

5. **Reproducibility**
   - Fixed random seeds
   - Complete provenance tracking
   - Saved intermediate results

### Limitations

1. **Rank selection challenge**
   - Monotonic CV-SSE decrease complicates rank choice
   - May need higher ranks or alternative selection criterion (1SE rule)
   - Simulation studies (with known true rank) would validate approach

2. **Ortholog filtering**
   - Limited to three-way complete ortholog groups (9,801 genes)
   - Excludes species-specific genes and lineage expansions
   - May miss important species-specific stress responses

3. **Sample size**
   - 27 colonies total across 3 species
   - Species-level CV has only 3 folds (limited statistical power)
   - Some colony-level effects may be underestimated

4. **Temporal resolution**
   - Only 4 timepoints captures coarse temporal dynamics
   - May miss rapid early responses or oscillatory patterns
   - Assumes monotonic or simple temporal trajectories

5. **Technical confounding**
   - Batch effects across species (different sequencing runs?)
   - Normalization method may not fully remove technical variation
   - Colony-level biological replicates have inherent genetic variation

### Validation Recommendations

1. **Test additional ranks** (40, 45, 50, 60, 75, 100)
   - Determine if CV-SSE plateaus or shows elbow
   - Apply 1SE rule to prefer simpler models

2. **Alternative CV strategies**
   - Run colony-level CV for comparison
   - Leave-one-timepoint-out CV (tests temporal generalization)
   - Nested CV (outer: species; inner: colony)

3. **Sensitivity analyses**
   - Vary normalization methods (DESeq2, TMM, quantile)
   - Test different lambda values (more granular grid)
   - Bootstrap resampling for factor stability

4. **External validation**
   - Compare to single-species differential expression results
   - Validate top genes with qPCR or independent datasets
   - Link components to physiological/bleaching phenotypes

---

## Software & Methods

### Key Dependencies
- **barnacle** (Blaskowski 2024): Sparse CP tensor decomposition
- **sctransform** (Hafemeister & Satija 2019): Variance-stabilizing normalization
- **tensorly**: Tensor operations and factor match score calculations
- **NumPy/Pandas**: Data manipulation and numerical computing
- **Matplotlib/Seaborn**: Visualization

### Computational Environment
- **Language:** Python 3.x via reticulate in R Markdown
- **Conda environment:** barnacle (with all dependencies)
- **Hardware:** Multi-core CPU (parallelization possible for CV folds)
- **Storage:** ~500 MB per rank tested

### Citation
If using this workflow, please cite:
- Blaskowski, S. (2024). Barnacle: Sparse tensor decomposition methods. GitHub repository.
- Hafemeister, C. & Satija, R. (2019). Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. *Genome Biology*, 20, 296.

---

## Summary

This analysis successfully implements a sparse tensor decomposition workflow to identify shared gene expression patterns across three coral species during heat stress. The dissertation-validated parameter selection approach provides a principled framework for choosing model complexity (rank) and sparsity (lambda). Key findings include identification of 35 latent components capturing coordinated temporal responses, with results organized for downstream biological interpretation through gene ontology enrichment and pathway analysis.

**Critical next steps:**
1. Address rank selection ambiguity (test higher ranks or apply 1SE rule)
2. Perform GO enrichment on top genes per component
3. Validate patterns against known coral stress biology
4. Link components to physiological outcomes (bleaching, survival)

The complete workflow is documented in `13.00-multiomics-barnacle.Rmd` with intermediate results saved in `output/13.00-multiomics-barnacle/` for independent analysis and visualization.

---

**Document generated:** December 3, 2025  
**Analysis contact:** Sam White  
**Code repository:** urol-e5/timeseries_molecular (branch: barnacle-sjw)