# Rank Selection Methodology: Dissertation vs Current Implementation

## Date: November 19, 2025

## Executive Summary

This document compares the rank selection methodology described in Blaskowski's dissertation (2024) with the current implementation in `13.00-multiomics-barnacle.Rmd`.

**Key Finding**: The current implementation DIFFERS significantly from the dissertation's recommended approach and needs to be updated to match the validated methodology.

---

## Dissertation Methodology (Validated Approach)

### Section 1.2.3: Parameter Selection (pg. 15)

**Method: Cross-Validated Grid Search with Sample Replicates**

#### Key Components:

1. **Data Structure Required**:
   - Sample replicates (typically 3 replicates per sampling condition)
   - Split the tensor along the sample axis to create replicate subtensors

2. **Grid Search Parameters**:
   - **Rank (R)**: Test multiple values [1, 2, 3, ..., 12]
   - **Lambda (λ)**: Test multiple values [0.0, 0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8]

3. **Evaluation Metrics**:
   - **Cross-validated SSE (Sum of Squared Errors)**:
     - Fit a model to one replicate subset
     - Calculate SSE against the two held-out replicate subsets
     - Results in 6 cross-validated SSE scores per parameter set
   - **Cross-validated FMS (Factor Match Score)**:
     - Compare components between each pair of replicate models
     - Results in 3 cross-validated FMS scores per parameter set
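The replicate-based evaluation above (6 cross-validated SSE scores and 3 cross-validated FMS scores for 3 replicate groups) can be sketched as bookkeeping around a pluggable model fit. This is a minimal sketch, not code from the repository: `fit` and `fms` are hypothetical callables standing in for a sparse CP decomposition and a factor-match-score routine.

```python
import itertools

import numpy as np

def cross_validated_scores(subtensors, fit, fms):
    """Pairwise cross-validated SSE and FMS over replicate subtensors.

    subtensors : list of same-shape arrays, one per replicate group
    fit        : callable(tensor) -> (reconstruction, factors)  [placeholder]
    fms        : callable(factors_a, factors_b) -> float        [placeholder]
    """
    models = [fit(t) for t in subtensors]
    # Fit on replicate i, score against each held-out replicate j:
    # 3 replicates -> 6 ordered pairs -> 6 cross-validated SSE scores
    cv_sse = {(i, j): float(np.sum((models[i][0] - subtensors[j]) ** 2))
              for i, j in itertools.permutations(range(len(subtensors)), 2)}
    # Compare factors between each pair of replicate models:
    # 3 replicates -> 3 unordered pairs -> 3 cross-validated FMS scores
    cv_fms = {(i, j): fms(models[i][1], models[j][1])
              for i, j in itertools.combinations(range(len(models)), 2)}
    return cv_sse, cv_fms
```

With 3 replicate groups this yields exactly the 6 SSE and 3 FMS scores per parameter set described above.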
4. **Selection Criteria**:

   **Step 1 - Select Optimal Rank (R)**:
   - Examine cross-validated SSE scores for models with λ = 0.0 (no regularization)
   - Choose the R that minimizes cross-validated SSE
   - **Quote from dissertation**: "We selected the R value of best fit based on the minimum cross-validated SSE"

   **Step 2 - Select Optimal Lambda (λ)**:
   - Fix R at the optimal value from Step 1
   - Apply the **1SE rule** for parsimonious model selection
   - Choose the maximum λ at which cross-validated FMS remains within one standard error of the maximum FMS
   - **Quote from dissertation**: "we then selected the best fit sparsity coefficient as the maximum λ value at which the cross-validated FMS remained within one standard error of the maximum FMS, a variation on the 1SE rule for parsimonious sparse model selection"

5. **Validation Results** (from 100 simulated tensors):
   - R matched the true number of components in 86/100 simulations (86% accuracy)
   - R was within ±1 component in 92/100 simulations (92% accuracy)
   - λ matched the optimal value in 46/100 simulations (46% accuracy)
   - λ was within 2-fold in 80/100 simulations (80% accuracy)
   - The method worked even at a noise-to-signal ratio of 10:1

6. **Robustness to Mis-specification** (Section 1.2.4):
   - Models are generally robust to R mis-specification
   - λ mis-specification affects the precision/recall tradeoff:
     - Underestimated λ: lower precision, higher recall
     - Overestimated λ: higher precision, lower recall
   - High-sparsity models provide greater confidence in cluster composition

---

## Current Implementation

### What the code currently does:

1. **No Cross-Validation**:
   - ❌ Does NOT use sample replicates for cross-validation
   - ❌ Does NOT calculate cross-validated SSE
   - ❌ Does NOT calculate cross-validated FMS between replicates

2. **Manual Rank Testing**:
   - Tests user-specified ranks: `[5, 8, 10, 12, 15, 20, 25, 35, 45, 55, 65, 75]`
   - Runs each rank independently with a fixed random_state
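The 1SE rule quoted from the dissertation's selection criteria above can be sketched as follows. This is an illustrative sketch, not the dissertation's code: the function name and the score layout (a mapping from each λ to its list of cross-validated FMS scores) are assumptions.

```python
import numpy as np

def select_lambda_1se(lambdas, fms_scores):
    """1SE-rule λ selection: the largest λ whose mean cross-validated FMS
    stays within one standard error of the best mean FMS.

    lambdas    : sequence of λ values tested
    fms_scores : mapping λ -> list of cross-validated FMS scores
    """
    means = {lam: float(np.mean(fms_scores[lam])) for lam in lambdas}
    best = max(means, key=means.get)
    # Standard error of the FMS scores at the best-scoring λ
    scores = np.asarray(fms_scores[best], dtype=float)
    se = scores.std(ddof=1) / np.sqrt(len(scores))
    threshold = means[best] - se
    # Most parsimonious (largest) λ still within one SE of the best
    return max(lam for lam in lambdas if means[lam] >= threshold)
```

Favoring the largest qualifying λ pushes the model toward sparsity without a statistically meaningful loss of factor stability.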
3. **Evaluation Metrics Computed**:
   - ✅ Variance explained (reconstruction accuracy)
   - ✅ Relative error
   - ✅ Component sparsity metrics
   - ✅ Component weight statistics
   - ✅ Convergence information
   - ⚠️ Synthetic FMS (synthetic validation approach, NOT cross-validated FMS)

4. **Rank Selection Approach**:
   - **Manual inspection** of:
     - Variance explained elbow plots
     - Synthetic FMS elbow plots (added recently)
     - Sparsity patterns
     - Component interpretability
   - ❌ No automated rank selection based on cross-validated SSE
   - ❌ No 1SE rule for lambda selection

5. **Recent Additions** (Multi-Random-State Analysis):
   - Tests multiple random states [41, 42, 43, 44, 45]
   - Evaluates synthetic FMS consistency across initializations
   - Checks whether the optimal rank is stable across random states
   - **Note**: This is a different validation approach than the dissertation's

---

## Key Differences

| Aspect | Dissertation Method | Current Implementation | Status |
|--------|--------------------|-----------------------|--------|
| **Primary Metric** | Cross-validated SSE (between replicates) | Variance explained + Synthetic FMS | ⚠️ Different |
| **Data Requirement** | Sample replicates | No replicates needed | ⚠️ Different |
| **Rank Selection** | Minimize cross-validated SSE | Manual inspection of elbow plots | ❌ Not implemented |
| **Lambda Selection** | 1SE rule with cross-validated FMS | Fixed at [0.1, 0.0, 0.1] | ❌ Not implemented |
| **Automation** | Automated grid search | Manual parameter specification | ❌ Not implemented |
| **Validation** | Cross-validation between replicates | Synthetic validation + multi-random-state | ⚠️ Different |

---

## Recommendations for Code Updates

### HIGH PRIORITY - Core Methodology Alignment
1. **Implement Cross-Validated Grid Search** (if replicates are available):

   ```python
   def cross_validated_grid_search(tensor, replicate_groups,
                                   rank_range=[5, 10, 15, 20, 25, 30],
                                   lambda_range=[0.0, 0.05, 0.1, 0.2, 0.4, 0.8]):
       """
       Dissertation-validated approach for rank and lambda selection.

       Parameters
       ----------
       tensor : 3D array
           Full tensor (genes × samples × timepoints)
       replicate_groups : list of lists
           Sample indices for each replicate group,
           e.g. [[0, 1, 2], [3, 4, 5], [6, 7, 8]] for 3 replicates

       Returns
       -------
       optimal_R : int
           Rank that minimizes cross-validated SSE
       optimal_lambda : float
           Lambda selected by the 1SE rule
       """
       # Steps 1-2: test all R values with lambda=0.0, keep the R with
       # minimum cross-validated SSE
       # (cross_validated_sse is a helper still to be implemented)
       cv_sse = {R: cross_validated_sse(tensor, replicate_groups, R, lam=0.0)
                 for R in rank_range}
       optimal_R = min(cv_sse, key=cv_sse.get)

       # Steps 3-4: fix R, test lambda values, then apply the 1SE rule
       # (cross_validated_fms and one_se_rule are helpers to be implemented)
       cv_fms = {lam: cross_validated_fms(tensor, replicate_groups, optimal_R, lam)
                 for lam in lambda_range}
       optimal_lambda = one_se_rule(cv_fms)

       return optimal_R, optimal_lambda
   ```

2. **Check for Sample Replicates in the Data**:
   - Examine whether the current dataset has technical or biological replicates
   - If yes: implement the full cross-validated grid search
   - If no: continue with alternative validation approaches

### MEDIUM PRIORITY - Enhanced Validation

3. **Keep Multi-Random-State Analysis** (complementary approach):
   - The current synthetic FMS across random states is a valuable addition
   - Not in the dissertation, but provides orthogonal validation
   - Helps assess rank stability

4. **Document Methodology Differences**:
   - Add clear comments explaining why cross-validation can't be used (if no replicates)
   - Explain synthetic FMS as an alternative validation metric
   - Reference the dissertation methodology in comments

### LOW PRIORITY - Code Organization

5. **Create a Rank Selection Function**:

   ```python
   def select_optimal_rank(results, method='synthetic_fms_elbow'):
       """
       Automated rank selection from comparison results.

       Methods
       -------
       - 'synthetic_fms_elbow': find the elbow in the synthetic FMS curve
       - 'variance_elbow': find the elbow in the variance explained curve
       - 'cv_sse_min': minimize cross-validated SSE (requires replicates)
       """
   ```

6. **Add an Elbow Detection Algorithm**:
   - Automate identification of the "elbow" in metric curves
   - Common approaches:
     - Kneedle algorithm
     - L-curve method
     - Second derivative test

---

## Data Structure Analysis

### Current Data Structure:

```
Tensor shape: (genes, combined_samples, timepoints)
- genes: ~X ortholog groups
- combined_samples: Apul + Peve + Ptua samples (species_sample IDs)
- timepoints: 4 timepoints (TP1, TP2, TP3, TP4)
```

### Question: Are there replicates?

- The code mentions: "all samples with complete timepoints"
- Need to check whether multiple biological or technical replicates exist per (species, condition, timepoint)
- **Action Item**: Investigate the sample metadata to determine whether replicates are available

---

## Synthetic FMS vs Cross-Validated FMS

### Synthetic FMS (Current Implementation):

- **Process**:
  1. Create a synthetic tensor from the decomposition
  2. Add noise
  3. Re-decompose the noisy synthetic tensor
  4. Compare recovered factors to the originals
- **Tests**: Can factors be recovered from noisy data?
- **Advantage**: Doesn't require replicates
- **Limitation**: Tests recovery from synthetic data, not real data variability

### Cross-Validated FMS (Dissertation):

- **Process**:
  1. Decompose replicate A
  2. Decompose replicate B
  3. Compare factors between A and B
- **Tests**: Are factors consistent across real replicates?
- **Advantage**: Tests real biological/technical variability
- **Limitation**: Requires sample replicates

### Verdict:

- Both are valid validation approaches
- Cross-validated FMS is **preferred** when replicates are available (validated in the dissertation)
- Synthetic FMS is an acceptable **alternative** when no replicates exist

---

## Implementation Priority

### Immediate Actions:

1. ✅ Document the methodology differences (this file)
2. ⏭️ Check whether sample replicates exist in the data
3. ⏭️ If replicates exist: implement the cross-validated grid search
4. ⏭️ If no replicates: document why synthetic FMS is being used as an alternative

### Future Enhancements:

- Automated elbow detection
- More sophisticated rank selection algorithms
- Integration of multiple metrics (SSE + FMS + variance explained)

---

## References

- Blaskowski, S. (2024). *Inference of In Situ Microbial Physiologies via Sparse Tensor Decomposition of Metatranscriptomes*. Doctoral dissertation, University of Washington.
  - Section 1.2.3: Parameter Selection (pg. 15)
  - Section 1.2.4: Robustness to Mis-specification (pg. 16)
  - Section 2.4.5: Parameter Selection Methods (pg. 47-48)
- Related methodology:
  - 1SE rule for sparse model selection (reference [49] in the dissertation)
  - Cross-validation for tensor decomposition
  - Factor Match Score metric

---

## Conclusion

**The current implementation does NOT follow the validated rank selection methodology from the dissertation.**

**Critical Missing Components**:
1. Cross-validated grid search
2. Automated rank selection based on cross-validated SSE
3. 1SE rule for lambda selection

**Next Steps**:
1. Determine whether sample replicates are available
2. If yes: implement the dissertation methodology
3. If no: continue with synthetic validation, but document the difference
4. Consider implementing automated elbow detection for the current approach

**Bottom Line**: The code needs updates to align with the validated methodology, OR clear documentation explaining why the alternative approach is being used and what the tradeoffs are.
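As a concrete starting point for the automated elbow detection listed among the next steps, here is a minimal sketch of the second-derivative test (one of the approaches named under the low-priority recommendations). The function name and inputs are illustrative, not from the codebase.

```python
import numpy as np

def elbow_by_second_derivative(ranks, metric):
    """Pick the rank at the 'elbow' of a metric-vs-rank curve using a
    discrete second-derivative test.

    ranks  : increasing sequence of tested ranks
    metric : metric value at each rank (e.g. variance explained)
    """
    y = np.asarray(metric, dtype=float)
    # Discrete second difference; large magnitudes mark curvature changes
    curvature = y[:-2] - 2 * y[1:-1] + y[2:]
    # The elbow is the interior point with the strongest curvature
    return ranks[int(np.argmax(np.abs(curvature))) + 1]
```

This simple detector assumes an evenly spaced, fairly smooth curve; the Kneedle algorithm is a more robust alternative when the metric is noisy.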