# Rank Selection Methodology: Dissertation vs Current Implementation

## Date: November 19, 2025

## Executive Summary

This document compares the rank selection methodology described in Blaskowski's dissertation (2024) with the current implementation in `13.00-multiomics-barnacle.Rmd`.

**Key Finding**: The current implementation DIFFERS significantly from the dissertation's recommended approach and needs to be updated to match the validated methodology.

---

## Dissertation Methodology (Validated Approach)

### Section 1.2.3: Parameter Selection (pg. 15)

**Method: Cross-Validated Grid Search with Sample Replicates**

#### Key Components:

1. **Data Structure Required**:
   - Sample replicates (typically 3 replicates per sampling condition)
   - Split the tensor along the sample axis to create replicate subtensors

2. **Grid Search Parameters**:
   - **Rank (R)**: Test multiple values [1, 2, 3, ..., 12]
   - **Lambda (λ)**: Test multiple values [0.0, 0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8]

3. **Evaluation Metrics**:
   - **Cross-validated SSE (Sum of Squared Errors)**:
     - Fit a model to one replicate subset
     - Calculate SSE against the two held-out replicate subsets
     - Results in 6 cross-validated SSE scores per parameter set
   - **Cross-validated FMS (Factor Match Score)**:
     - Compare components between each pair of replicate models
     - Results in 3 cross-validated FMS scores per parameter set
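The replicate-based evaluation above (6 cross-validated SSE scores and 3 cross-validated FMS scores for 3 replicate groups) can be sketched as bookkeeping around a pluggable model fit. This is a minimal sketch, not code from the repository: `fit` and `fms` are hypothetical callables standing in for a sparse CP decomposition and a factor-match-score routine.

```python
import itertools

import numpy as np

def cross_validated_scores(subtensors, fit, fms):
    """Pairwise cross-validated SSE and FMS over replicate subtensors.

    subtensors : list of same-shape arrays, one per replicate group
    fit        : callable(tensor) -> (reconstruction, factors)  [placeholder]
    fms        : callable(factors_a, factors_b) -> float        [placeholder]
    """
    models = [fit(t) for t in subtensors]
    # Fit on replicate i, score against each held-out replicate j:
    # 3 replicates -> 6 ordered pairs -> 6 cross-validated SSE scores
    cv_sse = {(i, j): float(np.sum((models[i][0] - subtensors[j]) ** 2))
              for i, j in itertools.permutations(range(len(subtensors)), 2)}
    # Compare factors between each pair of replicate models:
    # 3 replicates -> 3 unordered pairs -> 3 cross-validated FMS scores
    cv_fms = {(i, j): fms(models[i][1], models[j][1])
              for i, j in itertools.combinations(range(len(models)), 2)}
    return cv_sse, cv_fms
```

With 3 replicate groups this yields exactly the 6 SSE and 3 FMS scores per parameter set described above.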
4. **Selection Criteria**:

   **Step 1 - Select Optimal Rank (R)**:
   - Examine cross-validated SSE scores for models with λ = 0.0 (no regularization)
   - Choose the R that minimizes cross-validated SSE
   - **Quote from dissertation**: "We selected the R value of best fit based on the minimum cross-validated SSE"

   **Step 2 - Select Optimal Lambda (λ)**:
   - Fix R at the optimal value from Step 1
   - Apply the **1SE rule** for parsimonious model selection
   - Choose the maximum λ at which cross-validated FMS remains within one standard error of the maximum FMS
   - **Quote from dissertation**: "we then selected the best fit sparsity coefficient as the maximum λ value at which the cross-validated FMS remained within one standard error of the maximum FMS, a variation on the 1SE rule for parsimonious sparse model selection"

5. **Validation Results** (from 100 simulated tensors):
   - R matched the true number of components in 86/100 simulations (86% accuracy)
   - R was within ±1 component in 92/100 simulations (92% accuracy)
   - λ matched the optimal value in 46/100 simulations (46% accuracy)
   - λ was within 2-fold in 80/100 simulations (80% accuracy)
   - The method worked even at a noise-to-signal ratio of 10:1

6. **Robustness to Mis-specification** (Section 1.2.4):
   - Models are generally robust to R mis-specification
   - λ mis-specification affects the precision/recall tradeoff:
     - Underestimated λ: lower precision, higher recall
     - Overestimated λ: higher precision, lower recall
   - High-sparsity models provide greater confidence in cluster composition

---

## Current Implementation

### What the code currently does:

1. **No Cross-Validation**:
   - ❌ Does NOT use sample replicates for cross-validation
   - ❌ Does NOT calculate cross-validated SSE
   - ❌ Does NOT calculate cross-validated FMS between replicates

2. **Manual Rank Testing**:
   - Tests user-specified ranks: `[5, 8, 10, 12, 15, 20, 25, 35, 45, 55, 65, 75]`
   - Runs each rank independently with a fixed random_state
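The 1SE rule quoted from the dissertation's selection criteria above can be sketched as follows. This is an illustrative sketch, not the dissertation's code: the function name and the score layout (a mapping from each λ to its list of cross-validated FMS scores) are assumptions.

```python
import numpy as np

def select_lambda_1se(lambdas, fms_scores):
    """1SE-rule λ selection: the largest λ whose mean cross-validated FMS
    stays within one standard error of the best mean FMS.

    lambdas    : sequence of λ values tested
    fms_scores : mapping λ -> list of cross-validated FMS scores
    """
    means = {lam: float(np.mean(fms_scores[lam])) for lam in lambdas}
    best = max(means, key=means.get)
    # Standard error of the FMS scores at the best-scoring λ
    scores = np.asarray(fms_scores[best], dtype=float)
    se = scores.std(ddof=1) / np.sqrt(len(scores))
    threshold = means[best] - se
    # Most parsimonious (largest) λ still within one SE of the best
    return max(lam for lam in lambdas if means[lam] >= threshold)
```

Favoring the largest qualifying λ pushes the model toward sparsity without a statistically meaningful loss of factor stability.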
3. **Evaluation Metrics Computed**:
   - ✅ Variance explained (reconstruction accuracy)
   - ✅ Relative error
   - ✅ Component sparsity metrics
   - ✅ Component weight statistics
   - ✅ Convergence information
   - ⚠️ Synthetic FMS (synthetic validation approach, NOT cross-validated FMS)

4. **Rank Selection Approach**:
   - **Manual inspection** of:
     - Variance explained elbow plots
     - Synthetic FMS elbow plots (added recently)
     - Sparsity patterns
     - Component interpretability
   - ❌ No automated rank selection based on cross-validated SSE
   - ❌ No 1SE rule for lambda selection

5. **Recent Additions** (Multi-Random-State Analysis):
   - Tests multiple random states [41, 42, 43, 44, 45]
   - Evaluates synthetic FMS consistency across initializations
   - Checks whether the optimal rank is stable across random states
   - **Note**: This is a different validation approach than the dissertation's

---

## Key Differences

| Aspect | Dissertation Method | Current Implementation | Status |
|--------|--------------------|-----------------------|--------|
| **Primary Metric** | Cross-validated SSE (between replicates) | Variance explained + Synthetic FMS | ⚠️ Different |
| **Data Requirement** | Sample replicates | No replicates needed | ⚠️ Different |
| **Rank Selection** | Minimize cross-validated SSE | Manual inspection of elbow plots | ❌ Not implemented |
| **Lambda Selection** | 1SE rule with cross-validated FMS | Fixed at [0.1, 0.0, 0.1] | ❌ Not implemented |
| **Automation** | Automated grid search | Manual parameter specification | ❌ Not implemented |
| **Validation** | Cross-validation between replicates | Synthetic validation + multi-random-state | ⚠️ Different |

---

## Recommendations for Code Updates

### HIGH PRIORITY - Core Methodology Alignment
1. **Implement Cross-Validated Grid Search** (if replicates are available):

   ```python
   def cross_validated_grid_search(tensor, replicate_groups,
                                   rank_range=[5, 10, 15, 20, 25, 30],
                                   lambda_range=[0.0, 0.05, 0.1, 0.2, 0.4, 0.8]):
       """
       Dissertation-validated approach for rank and lambda selection.

       Parameters
       ----------
       tensor : 3D array
           Full tensor (genes × samples × timepoints)
       replicate_groups : list of lists
           Sample indices for each replicate group,
           e.g. [[0, 1, 2], [3, 4, 5], [6, 7, 8]] for 3 replicates

       Returns
       -------
       optimal_R : int
           Rank that minimizes cross-validated SSE
       optimal_lambda : float
           Lambda selected by the 1SE rule
       """
       # Steps 1-2: test all R values with lambda=0.0, keep the R with
       # minimum cross-validated SSE
       # (cross_validated_sse is a helper still to be implemented)
       cv_sse = {R: cross_validated_sse(tensor, replicate_groups, R, lam=0.0)
                 for R in rank_range}
       optimal_R = min(cv_sse, key=cv_sse.get)

       # Steps 3-4: fix R, test lambda values, then apply the 1SE rule
       # (cross_validated_fms and one_se_rule are helpers to be implemented)
       cv_fms = {lam: cross_validated_fms(tensor, replicate_groups, optimal_R, lam)
                 for lam in lambda_range}
       optimal_lambda = one_se_rule(cv_fms)

       return optimal_R, optimal_lambda
   ```

2. **Check for Sample Replicates in the Data**:
   - Examine whether the current dataset has technical or biological replicates
   - If yes: implement the full cross-validated grid search
   - If no: continue with alternative validation approaches

### MEDIUM PRIORITY - Enhanced Validation

3. **Keep Multi-Random-State Analysis** (complementary approach):
   - The current synthetic FMS across random states is a valuable addition
   - Not in the dissertation, but provides orthogonal validation
   - Helps assess rank stability

4. **Document Methodology Differences**:
   - Add clear comments explaining why cross-validation can't be used (if no replicates)
   - Explain synthetic FMS as an alternative validation metric
   - Reference the dissertation methodology in comments

### LOW PRIORITY - Code Organization

5. **Create a Rank Selection Function**:

   ```python
   def select_optimal_rank(results, method='synthetic_fms_elbow'):
       """
       Automated rank selection from comparison results.

       Methods
       -------
       - 'synthetic_fms_elbow': find the elbow in the synthetic FMS curve
       - 'variance_elbow': find the elbow in the variance explained curve
       - 'cv_sse_min': minimize cross-validated SSE (requires replicates)
       """
   ```

6. **Add an Elbow Detection Algorithm**:
   - Automate identification of the "elbow" in metric curves
   - Common approaches:
     - Kneedle algorithm
     - L-curve method
     - Second derivative test

---

## Data Structure Analysis

### Current Data Structure:

```
Tensor shape: (genes, combined_samples, timepoints)
- genes: ~X ortholog groups
- combined_samples: Apul + Peve + Ptua samples (species_sample IDs)
- timepoints: 4 timepoints (TP1, TP2, TP3, TP4)
```

### Question: Are there replicates?

- The code mentions: "all samples with complete timepoints"
- Need to check whether multiple biological or technical replicates exist per (species, condition, timepoint)
- **Action Item**: Investigate the sample metadata to determine whether replicates are available

---

## Synthetic FMS vs Cross-Validated FMS

### Synthetic FMS (Current Implementation):

- **Process**:
  1. Create a synthetic tensor from the decomposition
  2. Add noise
  3. Re-decompose the noisy synthetic tensor
  4. Compare recovered factors to the originals
- **Tests**: Can factors be recovered from noisy data?
- **Advantage**: Doesn't require replicates
- **Limitation**: Tests recovery from synthetic data, not real data variability

### Cross-Validated FMS (Dissertation):

- **Process**:
  1. Decompose replicate A
  2. Decompose replicate B
  3. Compare factors between A and B
- **Tests**: Are factors consistent across real replicates?
- **Advantage**: Tests real biological/technical variability
- **Limitation**: Requires sample replicates

### Verdict:

- Both are valid validation approaches
- Cross-validated FMS is **preferred** when replicates are available (validated in the dissertation)
- Synthetic FMS is an acceptable **alternative** when no replicates exist

---

## Implementation Priority

### Immediate Actions:

1. ✅ Document the methodology differences (this file)
2. ⏭️ Check whether sample replicates exist in the data
3. ⏭️ If replicates exist: implement the cross-validated grid search
4. ⏭️ If no replicates: document why synthetic FMS is being used as an alternative

### Future Enhancements:

- Automated elbow detection
- More sophisticated rank selection algorithms
- Integration of multiple metrics (SSE + FMS + variance explained)

---

## References

- Blaskowski, S. (2024). *Inference of In Situ Microbial Physiologies via Sparse Tensor Decomposition of Metatranscriptomes*. Doctoral dissertation, University of Washington.
  - Section 1.2.3: Parameter Selection (pg. 15)
  - Section 1.2.4: Robustness to Mis-specification (pg. 16)
  - Section 2.4.5: Parameter Selection Methods (pg. 47-48)
- Related methodology:
  - 1SE rule for sparse model selection (reference [49] in the dissertation)
  - Cross-validation for tensor decomposition
  - Factor Match Score metric

---

## Conclusion

**The current implementation does NOT follow the validated rank selection methodology from the dissertation.**

**Critical Missing Components**:
1. Cross-validated grid search
2. Automated rank selection based on cross-validated SSE
3. 1SE rule for lambda selection

**Next Steps**:
1. Determine whether sample replicates are available
2. If yes: implement the dissertation methodology
3. If no: continue with synthetic validation, but document the difference
4. Consider implementing automated elbow detection for the current approach

**Bottom Line**: The code needs updates to align with the validated methodology, OR clear documentation explaining why the alternative approach is being used and what the tradeoffs are.
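As a concrete starting point for the automated elbow detection listed among the next steps, here is a minimal sketch of the second-derivative test (one of the approaches named under the low-priority recommendations). The function name and inputs are illustrative, not from the codebase.

```python
import numpy as np

def elbow_by_second_derivative(ranks, metric):
    """Pick the rank at the 'elbow' of a metric-vs-rank curve using a
    discrete second-derivative test.

    ranks  : increasing sequence of tested ranks
    metric : metric value at each rank (e.g. variance explained)
    """
    y = np.asarray(metric, dtype=float)
    # Discrete second difference; large magnitudes mark curvature changes
    curvature = y[:-2] - 2 * y[1:-1] + y[2:]
    # The elbow is the interior point with the strongest curvature
    return ranks[int(np.argmax(np.abs(curvature))) + 1]
```

This simple detector assumes an evenly spaced, fairly smooth curve; the Kneedle algorithm is a more robust alternative when the metric is noisy.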