# Recommendations for Rank Selection Implementation

## Date: November 19, 2025

## Summary

After reviewing the Blaskowski dissertation (2024) and comparing it to the current implementation, **I have good news and specific recommendations**.

---

## KEY FINDING: Your Data HAS Replicates! ✅

Based on the `master-molecular-metadata.csv` file, your experimental design includes:

**Biological Replicates**:
- Multiple colonies per species (e.g., ACR-139, ACR-145, ACR-150, ACR-173, ACR-186...)
- Each colony represents an independent biological replicate
- Each colony has measurements at 4 timepoints (TP1, TP2, TP3, TP4)

**Example**:
```
Acropora pulchra colonies at Manava site:
- ACR-139 (TP1, TP2, TP3, TP4)
- ACR-145 (TP1, TP2, TP3, TP4)
- ACR-150 (TP1, TP2, TP3, TP4)
- ACR-173 (TP1, TP2, TP3, TP4)
- ACR-186 (TP1, TP2, TP3, TP4)
```

This means you can implement the **dissertation's validated cross-validation approach**.

---

## Current Situation

### What You're Doing Now:
1. ❌ NOT using cross-validation between biological replicates
2. ❌ NOT implementing the dissertation's rank selection methodology
3. ✅ Testing multiple ranks manually
4. ⚠️ Using synthetic FMS (alternative validation)
5. ⚠️ Using multiple random states (complementary validation)

### What You SHOULD Be Doing (According to the Dissertation):
1. ✅ Cross-validate between biological replicates
2. ✅ Use cross-validated SSE to select the optimal rank
3. ✅ Use the 1SE rule with cross-validated FMS to select lambda
4. ✅ Automate parameter selection

---

## Critical Differences

| Component | Dissertation Method | Your Current Method | Impact |
|-----------|---------------------|---------------------|--------|
| **Rank Selection** | Min cross-validated SSE (between replicates) | Manual inspection of elbow plots | High - missing validated approach |
| **Lambda Selection** | 1SE rule with CV-FMS | Fixed at [0.1, 0.0, 0.1] | High - not optimizing sparsity |
| **Validation** | Cross-validation using real data variability | Synthetic validation | Medium - different assumptions |
| **Automation** | Automated grid search | Manual rank testing | Medium - more time-consuming |

---

## Recommended Implementation Plan

### Phase 1: Immediate (Cross-Validated Rank Selection)

**GOAL**: Implement the dissertation's validated methodology

#### Step 1A: Organize Replicates

Your tensor structure should group replicates:

```python
# Current structure: (genes, combined_samples, timepoints)
# Tensor shape: (14733, 52, 4)

# Need to identify which samples are replicates of the same condition.
# Example replicate groups (at the same timepoint):
#   Group 1: ACR-139-TP1, ACR-145-TP1, ACR-150-TP1, ... (all Apul at TP1)
#   Group 2: ACR-139-TP2, ACR-145-TP2, ACR-150-TP2, ... (all Apul at TP2)
#   etc.
```

#### Step 1B: Implement Cross-Validated Grid Search

```python
import numpy as np
import pandas as pd
# SparseCP: sparse CP decomposition class from your existing pipeline


def cross_validated_rank_selection(
    tensor,
    replicate_groups,
    rank_range=[5, 10, 15, 20, 25, 30, 35, 40],
    lambda_values=[0.0, 0.05, 0.1, 0.2, 0.4, 0.8]
):
    """
    Dissertation methodology (Section 1.2.3, pg. 15).

    Parameters
    ----------
    tensor : 3D array
        Full tensor (genes × samples × timepoints)
    replicate_groups : list of lists
        Each sublist contains sample indices that are replicates,
        e.g., [[0, 1, 2], [3, 4, 5], ...] for 3 biological replicates

    Returns
    -------
    optimal_R : int
        Rank minimizing cross-validated SSE (with lambda=0)
    optimal_lambda : float
        Lambda selected by the 1SE rule
    cv_results : DataFrame
        Full cross-validation results for all parameter combinations
    """
    results = []

    # STEP 1: Test all rank values with lambda=0.0 first
    print("=" * 60)
    print("STEP 1: Select Optimal Rank (R)")
    print("Testing with lambda=0.0 (no regularization)")
    print("=" * 60)

    for R in rank_range:
        # For each replicate group, hold one out
        cv_sse_scores = []

        for holdout_idx in range(len(replicate_groups)):
            # Create training and test sets
            train_indices = [idx for i, group in enumerate(replicate_groups)
                             if i != holdout_idx for idx in group]
            test_indices = replicate_groups[holdout_idx]

            # Fit model on training data
            train_tensor = tensor[:, train_indices, :]
            model = SparseCP(rank=R, lambdas=[0.0, 0.0, 0.0])  # ... remaining kwargs elided
            decomp = model.fit_transform(train_tensor)

            # Calculate SSE on the held-out test data.
            # reconstruct_tensor is a placeholder: project the held-out
            # samples onto the trained gene/time factors, then reconstruct.
            test_tensor = tensor[:, test_indices, :]
            reconstruction = reconstruct_tensor(decomp, test_indices)
            sse = np.sum((test_tensor - reconstruction) ** 2)
            cv_sse_scores.append(sse)

        # Store mean CV-SSE across all folds
        mean_cv_sse = np.mean(cv_sse_scores)
        std_cv_sse = np.std(cv_sse_scores)
        results.append({
            'R': R,
            'lambda': 0.0,
            'mean_cv_sse': mean_cv_sse,
            'std_cv_sse': std_cv_sse
        })
        print(f"R={R}: CV-SSE = {mean_cv_sse:.2f} ± {std_cv_sse:.2f}")

    # Select R with minimum CV-SSE
    cv_results_df = pd.DataFrame(results)
    optimal_R = cv_results_df.loc[cv_results_df['mean_cv_sse'].idxmin(), 'R']
    print(f"\n→ Optimal Rank (R): {int(optimal_R)}")
    print("  (Minimum CV-SSE)")

    # STEP 2: Select lambda using the 1SE rule with cross-validated FMS
    print(f"\n{'=' * 60}")
    print("STEP 2: Select Optimal Lambda (λ)")
    print(f"Fixed R = {int(optimal_R)}")
    print("Using 1SE rule with cross-validated FMS")
    print("=" * 60)

    lambda_results = []

    for lambda_val in lambda_values:
        # Fit a model to each replicate group
        replicate_decomps = []
        for rep_indices in replicate_groups:
            rep_tensor = tensor[:, rep_indices, :]
            model = SparseCP(rank=int(optimal_R),
                             lambdas=[lambda_val, 0.0, lambda_val])  # ... remaining kwargs elided
            decomp = model.fit_transform(rep_tensor)
            replicate_decomps.append(decomp)

        # Calculate FMS between all pairs of replicates
        # (calculate_fms is a placeholder for your factor match score function)
        fms_scores = []
        for i in range(len(replicate_decomps)):
            for j in range(i + 1, len(replicate_decomps)):
                fms = calculate_fms(replicate_decomps[i], replicate_decomps[j])
                fms_scores.append(fms)

        mean_fms = np.mean(fms_scores)
        std_fms = np.std(fms_scores)
        lambda_results.append({
            'lambda': lambda_val,
            'mean_cv_fms': mean_fms,
            'std_cv_fms': std_fms
        })
        print(f"λ={lambda_val:.2f}: CV-FMS = {mean_fms:.4f} ± {std_fms:.4f}")

    # Apply the 1SE rule
    lambda_df = pd.DataFrame(lambda_results)
    max_fms_idx = lambda_df['mean_cv_fms'].idxmax()
    max_fms = lambda_df.loc[max_fms_idx, 'mean_cv_fms']
    max_fms_se = lambda_df.loc[max_fms_idx, 'std_cv_fms']

    # Find the maximum lambda within 1 SE of the maximum FMS
    within_1se = lambda_df['mean_cv_fms'] >= (max_fms - max_fms_se)
    optimal_lambda = lambda_df.loc[within_1se, 'lambda'].max()

    print(f"\n→ Optimal Lambda (λ): {optimal_lambda}")
    print("  (1SE rule: max λ within 1 SE of max FMS)")
    print(f"  Max FMS: {max_fms:.4f} ± {max_fms_se:.4f}")

    return int(optimal_R), optimal_lambda, cv_results_df
```

#### Step 1C: Define Replicate Groups

```python
# Group samples by (species, timepoint).
# Each group contains multiple colonies (biological replicates).

def identify_replicate_groups(sample_labels):
    """
    Group samples that are biological replicates.

    For timeseries data, replicates are:
    - Same species
    - Same timepoint
    - Different colony IDs

    Example:
    - ACR-139-TP1, ACR-145-TP1, ACR-150-TP1 → replicate group
    - ACR-139-TP2, ACR-145-TP2, ACR-150-TP2 → replicate group
    """
    groups = {}

    for idx, label in enumerate(sample_labels):
        # Parse: "apul_ACR-139-TP1" or "apul_ACR.139.TP1"
        parts = label.split('_')
        species = parts[0]
        sample_parts = parts[1].replace('.', '-').split('-')

        # Extract timepoint (last part)
        timepoint = sample_parts[-1]  # e.g., "TP1"

        # Create key: (species, timepoint)
        key = (species, timepoint)
        if key not in groups:
            groups[key] = []
        groups[key].append(idx)

    # Convert to list of lists
    replicate_groups = list(groups.values())
    return replicate_groups
```

### Phase 2: Enhanced (Keep Current Validations)

**GOAL**: Keep your current validation methods as complementary analyses

Your current approaches are valuable:
1. **Synthetic FMS** - tests factor recovery from noisy data
2. **Multi-random-state** - tests stability across initializations

These are NOT in the dissertation but provide additional validation.
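For intuition, both validations above hinge on a factor match score (FMS) between two sets of factors. The `calculate_fms` helper referenced earlier is not defined in this document, so here is a minimal, hedged sketch of a *single-mode* FMS: mean absolute cosine similarity between optimally paired factor columns (your actual FMS may combine multiple modes differently):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def simple_fms(A, B):
    """Single-mode factor match score: mean |cosine similarity|
    between optimally paired columns of A and B."""
    A = A / np.linalg.norm(A, axis=0)
    B = B / np.linalg.norm(B, axis=0)
    sim = np.abs(A.T @ B)                     # (R, R) column-similarity matrix
    rows, cols = linear_sum_assignment(-sim)  # best one-to-one column pairing
    return sim[rows, cols].mean()

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))               # "true" factors (toy data)
noisy = A + 0.1 * rng.normal(size=A.shape)  # lightly perturbed copy
shuffled = noisy[:, [2, 0, 4, 1, 3]]        # column order should not matter
unrelated = rng.normal(size=(100, 5))       # unrelated factors

print(round(simple_fms(A, shuffled), 3))    # near 1.0 for small noise
print(round(simple_fms(A, unrelated), 3))   # much lower
```

The optimal pairing step matters because CP factors are only identified up to column permutation (and sign), which the absolute value and assignment handle here.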
**Keep them!**

```python
# Full validation pipeline

def comprehensive_validation(tensor, replicate_groups):
    """
    Combine the dissertation methodology with additional validations.
    """
    # METHOD 1: Dissertation cross-validation (PRIMARY)
    optimal_R, optimal_lambda, cv_results = cross_validated_rank_selection(
        tensor, replicate_groups
    )
    print(f"\n{'=' * 60}")
    print("PRIMARY RESULT (Dissertation Method)")
    print("=" * 60)
    print(f"Optimal Rank: {optimal_R}")
    print(f"Optimal Lambda: {optimal_lambda}")

    # METHOD 2: Synthetic FMS validation (SUPPLEMENTARY)
    print(f"\n{'=' * 60}")
    print("SUPPLEMENTARY VALIDATION #1: Synthetic FMS")
    print("=" * 60)
    synthetic_fms_results = evaluate_synthetic_fms_across_ranks(
        tensor, rank_range=[optimal_R - 5, optimal_R, optimal_R + 5]
    )

    # METHOD 3: Multi-random-state stability (SUPPLEMENTARY)
    print(f"\n{'=' * 60}")
    print("SUPPLEMENTARY VALIDATION #2: Random State Stability")
    print("=" * 60)
    stability_results = evaluate_rank_stability_across_random_states(
        tensor,
        rank_range=[optimal_R - 5, optimal_R, optimal_R + 5],
        random_states=[41, 42, 43, 44, 45]
    )

    # FINAL RECOMMENDATION
    print(f"\n{'=' * 60}")
    print("FINAL RECOMMENDATION")
    print("=" * 60)
    print("Based on validated dissertation methodology:")
    print(f"  → Rank: {optimal_R}")
    print(f"  → Lambda: {optimal_lambda}")
    print("\nSupporting evidence:")
    print(f"  Synthetic FMS: {synthetic_fms_results['optimal_rank']}")
    print(f"  Random state consensus: {stability_results['consensus_rank']}")

    if all([optimal_R == synthetic_fms_results['optimal_rank'],
            optimal_R == stability_results['consensus_rank']]):
        print(f"\n✓ STRONG CONFIDENCE: All methods agree on rank {optimal_R}")
    else:
        print("\n⚠ MODERATE CONFIDENCE: Methods show some disagreement")
        candidates = sorted([optimal_R,
                             synthetic_fms_results['optimal_rank'],
                             stability_results['consensus_rank']])
        print(f"  Recommend examining ranks: {candidates}")
```

---

## Why This Matters

### Problems with the Current Approach:
1. **No validated methodology**: Manual inspection is subjective
2. **Lambda not optimized**: Fixed values may not be ideal for your data
3. **Missing the dissertation's validation**: The method was developed and tested on similar data

### Benefits of the Dissertation Approach:
1. **Validated on 100 simulated datasets**: 86% accuracy in rank selection
2. **Works with high noise**: Effective even at a 10:1 noise-to-signal ratio
3. **Objective criteria**: Clear mathematical rules for parameter selection
4. **Uses your replicates**: Leverages the biological variability in your data

---

## Timeline Recommendation

### Week 1 (Immediate):
- [x] Review dissertation methodology (DONE - this document)
- [ ] Identify replicate groups in your data
- [ ] Implement the `identify_replicate_groups()` function
- [ ] Test with a subset of ranks

### Week 2:
- [ ] Implement the full `cross_validated_rank_selection()` function
- [ ] Run the grid search on your data
- [ ] Compare results with the current manual approach

### Week 3:
- [ ] Integrate all validation methods (CV + Synthetic + Multi-RS)
- [ ] Generate a comprehensive report
- [ ] Make the final rank decision

---

## Questions to Address

1. **How many biological replicates do you have per (species, timepoint)?**
   - Need at least 3 for reliable cross-validation
   - More is better
2. **Are all replicates suitable for analysis?**
   - Check for outliers
   - Verify data quality
3. **What lambda range to test?**
   - Dissertation used: [0.0, 0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8]
   - Start with a subset: [0.0, 0.1, 0.2, 0.4, 0.8]

---

## Expected Outcomes

After implementing the dissertation methodology, you will have:
1. **Objective rank selection**: Based on cross-validated SSE
2. **Optimized lambda**: Based on the 1SE rule
3. **Confidence metrics**:
   - Cross-validation error
   - Factor match scores
   - Consistency across replicates
4. **Validated results**: Using a method proven on similar data

---

## Bottom Line

**You SHOULD implement the dissertation methodology because:**
1. ✅ Your data HAS the required replicates
2. ✅ The method is validated on similar data (metatranscriptomics)
3. ✅ It provides objective criteria for rank selection
4. ✅ It will give you more confidence in the final results
5. ✅ Reviewers will appreciate a validated methodology

**The current approach is acceptable as:**
- Exploratory analysis
- Supplementary validation
- A fallback when NO replicates are available

**But for publication-quality analysis:**
- Use the dissertation's cross-validation approach (PRIMARY)
- Keep your current validations (SUPPLEMENTARY)
- Report all methods for transparency

---

## References

Blaskowski, S. (2024). *Inference of In Situ Microbial Physiologies via Sparse Tensor Decomposition of Metatranscriptomes*. Doctoral dissertation, University of Washington.
- Section 1.2.3: Parameter Selection (pg. 15-16)
- Section 2.4.5: Parameter Selection Methods (pg. 47-48)
- Figure 1.4: Cross-validated grid search results

---

## Next Action

**Immediate**: Create a new code chunk implementing the `cross_validated_rank_selection()` function.

Would you like me to help you implement this?
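As a concrete starting point for that code chunk, the 1SE selection step (STEP 2 above) can be tested in isolation before any decompositions are run. This sketch uses made-up CV-FMS numbers purely for illustration; the column names match the `lambda_df` built inside `cross_validated_rank_selection()`:

```python
import pandas as pd

# Toy CV-FMS results: values are illustrative only
lambda_df = pd.DataFrame({
    'lambda':      [0.0,  0.05, 0.1,  0.2,  0.4],
    'mean_cv_fms': [0.80, 0.85, 0.84, 0.82, 0.60],
    'std_cv_fms':  [0.02, 0.03, 0.02, 0.02, 0.05],
})

# Locate the maximum mean CV-FMS and its standard error
max_idx = lambda_df['mean_cv_fms'].idxmax()
max_fms = lambda_df.loc[max_idx, 'mean_cv_fms']   # 0.85 at λ=0.05
max_se = lambda_df.loc[max_idx, 'std_cv_fms']     # 0.03

# 1SE rule: the largest (sparsest) lambda whose mean CV-FMS is
# within one standard error of the maximum (threshold 0.82 here)
within_1se = lambda_df['mean_cv_fms'] >= (max_fms - max_se)
optimal_lambda = lambda_df.loc[within_1se, 'lambda'].max()
print(optimal_lambda)  # → 0.2
```

Note the rule deliberately prefers a *larger* lambda than the FMS maximum: any solution statistically indistinguishable from the best is acceptable, so the sparsest such solution wins.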