# Recommendations for Rank Selection Implementation

## Date: November 19, 2025

## Summary

After reviewing the Blaskowski dissertation (2024) and comparing it to the current implementation, **I have good news and specific recommendations**.

---

## KEY FINDING: Your Data HAS Replicates! ✅

Based on the `master-molecular-metadata.csv` file, your experimental design includes:

**Biological Replicates**:
- Multiple colonies per species (e.g., ACR-139, ACR-145, ACR-150, ACR-173, ACR-186...)
- Each colony represents an independent biological replicate
- Each colony has measurements at 4 timepoints (TP1, TP2, TP3, TP4)

**Example**:
```
Acropora pulchra colonies at Manava site:
- ACR-139 (TP1, TP2, TP3, TP4)
- ACR-145 (TP1, TP2, TP3, TP4)
- ACR-150 (TP1, TP2, TP3, TP4)
- ACR-173 (TP1, TP2, TP3, TP4)
- ACR-186 (TP1, TP2, TP3, TP4)
```

This means you can implement the **dissertation's validated cross-validation approach**.

---

## Current Situation

### What You're Doing Now:
1. ❌ NOT using cross-validation between biological replicates
2. ❌ NOT implementing the dissertation's rank selection methodology
3. ✅ Testing multiple ranks manually
4. ⚠️ Using synthetic FMS (alternative validation)
5. ⚠️ Using multiple random states (complementary validation)

### What You SHOULD Be Doing (According to the Dissertation):
1. ✅ Cross-validate between biological replicates
2. ✅ Use cross-validated SSE to select the optimal rank
3. ✅ Use the 1SE rule with cross-validated FMS to select lambda
4. ✅ Automate parameter selection

---

## Critical Differences

| Component | Dissertation Method | Your Current Method | Impact |
|-----------|---------------------|---------------------|--------|
| **Rank Selection** | Min cross-validated SSE (between replicates) | Manual inspection of elbow plots | High - missing validated approach |
| **Lambda Selection** | 1SE rule with CV-FMS | Fixed at [0.1, 0.0, 0.1] | High - not optimizing sparsity |
| **Validation** | Cross-validation using real data variability | Synthetic validation | Medium - different assumptions |
| **Automation** | Automated grid search | Manual rank testing | Medium - more time-consuming |

---

## Recommended Implementation Plan

### Phase 1: Immediate (Cross-Validated Rank Selection)

**GOAL**: Implement the dissertation's validated methodology

#### Step 1A: Organize Replicates

Your tensor structure should group replicates:

```python
# Current structure: (genes, combined_samples, timepoints)
# Tensor shape: (14733, 52, 4)

# Need to identify which samples are replicates of the same condition.
# Example replicate groups (at the same timepoint):
#   Group 1: ACR-139-TP1, ACR-145-TP1, ACR-150-TP1, ... (all Apul at TP1)
#   Group 2: ACR-139-TP2, ACR-145-TP2, ACR-150-TP2, ... (all Apul at TP2)
#   etc.
```

#### Step 1B: Implement Cross-Validated Grid Search

```python
import numpy as np
import pandas as pd
# SparseCP: sparse CP decomposition class from your existing pipeline


def cross_validated_rank_selection(
    tensor,
    replicate_groups,
    rank_range=[5, 10, 15, 20, 25, 30, 35, 40],
    lambda_values=[0.0, 0.05, 0.1, 0.2, 0.4, 0.8]
):
    """
    Dissertation methodology (Section 1.2.3, pg. 15).

    Parameters
    ----------
    tensor : 3D array
        Full tensor (genes × samples × timepoints)
    replicate_groups : list of lists
        Each sublist contains sample indices that are replicates,
        e.g., [[0, 1, 2], [3, 4, 5], ...] for 3 biological replicates

    Returns
    -------
    optimal_R : int
        Rank minimizing cross-validated SSE (with lambda=0)
    optimal_lambda : float
        Lambda selected by the 1SE rule
    cv_results : DataFrame
        Full cross-validation results for all parameter combinations
    """
    results = []

    # STEP 1: Test all rank values with lambda=0.0 first
    print("=" * 60)
    print("STEP 1: Select Optimal Rank (R)")
    print("Testing with lambda=0.0 (no regularization)")
    print("=" * 60)

    for R in rank_range:
        # For each replicate group, hold one out
        cv_sse_scores = []

        for holdout_idx in range(len(replicate_groups)):
            # Create training and test sets
            train_indices = [idx for i, group in enumerate(replicate_groups)
                             if i != holdout_idx for idx in group]
            test_indices = replicate_groups[holdout_idx]

            # Fit model on training data
            train_tensor = tensor[:, train_indices, :]
            model = SparseCP(rank=R, lambdas=[0.0, 0.0, 0.0])  # ... remaining kwargs elided
            decomp = model.fit_transform(train_tensor)

            # Calculate SSE on the held-out test data.
            # reconstruct_tensor is a placeholder: project the held-out
            # samples onto the trained gene/time factors, then reconstruct.
            test_tensor = tensor[:, test_indices, :]
            reconstruction = reconstruct_tensor(decomp, test_indices)
            sse = np.sum((test_tensor - reconstruction) ** 2)
            cv_sse_scores.append(sse)

        # Store mean CV-SSE across all folds
        mean_cv_sse = np.mean(cv_sse_scores)
        std_cv_sse = np.std(cv_sse_scores)
        results.append({
            'R': R,
            'lambda': 0.0,
            'mean_cv_sse': mean_cv_sse,
            'std_cv_sse': std_cv_sse
        })
        print(f"R={R}: CV-SSE = {mean_cv_sse:.2f} ± {std_cv_sse:.2f}")

    # Select R with minimum CV-SSE
    cv_results_df = pd.DataFrame(results)
    optimal_R = cv_results_df.loc[cv_results_df['mean_cv_sse'].idxmin(), 'R']
    print(f"\n→ Optimal Rank (R): {int(optimal_R)}")
    print("  (Minimum CV-SSE)")

    # STEP 2: Select lambda using the 1SE rule with cross-validated FMS
    print(f"\n{'=' * 60}")
    print("STEP 2: Select Optimal Lambda (λ)")
    print(f"Fixed R = {int(optimal_R)}")
    print("Using 1SE rule with cross-validated FMS")
    print("=" * 60)

    lambda_results = []

    for lambda_val in lambda_values:
        # Fit a model to each replicate group
        replicate_decomps = []
        for rep_indices in replicate_groups:
            rep_tensor = tensor[:, rep_indices, :]
            model = SparseCP(rank=int(optimal_R),
                             lambdas=[lambda_val, 0.0, lambda_val])  # ... remaining kwargs elided
            decomp = model.fit_transform(rep_tensor)
            replicate_decomps.append(decomp)

        # Calculate FMS between all pairs of replicates
        # (calculate_fms is a placeholder for your factor match score function)
        fms_scores = []
        for i in range(len(replicate_decomps)):
            for j in range(i + 1, len(replicate_decomps)):
                fms = calculate_fms(replicate_decomps[i], replicate_decomps[j])
                fms_scores.append(fms)

        mean_fms = np.mean(fms_scores)
        std_fms = np.std(fms_scores)
        lambda_results.append({
            'lambda': lambda_val,
            'mean_cv_fms': mean_fms,
            'std_cv_fms': std_fms
        })
        print(f"λ={lambda_val:.2f}: CV-FMS = {mean_fms:.4f} ± {std_fms:.4f}")

    # Apply the 1SE rule
    lambda_df = pd.DataFrame(lambda_results)
    max_fms_idx = lambda_df['mean_cv_fms'].idxmax()
    max_fms = lambda_df.loc[max_fms_idx, 'mean_cv_fms']
    max_fms_se = lambda_df.loc[max_fms_idx, 'std_cv_fms']

    # Find the maximum lambda within 1 SE of the maximum FMS
    within_1se = lambda_df['mean_cv_fms'] >= (max_fms - max_fms_se)
    optimal_lambda = lambda_df.loc[within_1se, 'lambda'].max()

    print(f"\n→ Optimal Lambda (λ): {optimal_lambda}")
    print("  (1SE rule: max λ within 1 SE of max FMS)")
    print(f"  Max FMS: {max_fms:.4f} ± {max_fms_se:.4f}")

    return int(optimal_R), optimal_lambda, cv_results_df
```

#### Step 1C: Define Replicate Groups

```python
# Group samples by (species, timepoint).
# Each group contains multiple colonies (biological replicates).

def identify_replicate_groups(sample_labels):
    """
    Group samples that are biological replicates.

    For timeseries data, replicates are:
    - Same species
    - Same timepoint
    - Different colony IDs

    Example:
    - ACR-139-TP1, ACR-145-TP1, ACR-150-TP1 → replicate group
    - ACR-139-TP2, ACR-145-TP2, ACR-150-TP2 → replicate group
    """
    groups = {}

    for idx, label in enumerate(sample_labels):
        # Parse: "apul_ACR-139-TP1" or "apul_ACR.139.TP1"
        parts = label.split('_')
        species = parts[0]
        sample_parts = parts[1].replace('.', '-').split('-')

        # Extract timepoint (last part)
        timepoint = sample_parts[-1]  # e.g., "TP1"

        # Create key: (species, timepoint)
        key = (species, timepoint)
        if key not in groups:
            groups[key] = []
        groups[key].append(idx)

    # Convert to list of lists
    replicate_groups = list(groups.values())
    return replicate_groups
```

### Phase 2: Enhanced (Keep Current Validations)

**GOAL**: Keep your current validation methods as complementary analyses

Your current approaches are valuable:
1. **Synthetic FMS** - tests factor recovery from noisy data
2. **Multi-random-state** - tests stability across initializations

These are NOT in the dissertation but provide additional validation.
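For intuition, both validations above hinge on a factor match score (FMS) between two sets of factors. The `calculate_fms` helper referenced earlier is not defined in this document, so here is a minimal, hedged sketch of a *single-mode* FMS: mean absolute cosine similarity between optimally paired factor columns (your actual FMS may combine multiple modes differently):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def simple_fms(A, B):
    """Single-mode factor match score: mean |cosine similarity|
    between optimally paired columns of A and B."""
    A = A / np.linalg.norm(A, axis=0)
    B = B / np.linalg.norm(B, axis=0)
    sim = np.abs(A.T @ B)                     # (R, R) column-similarity matrix
    rows, cols = linear_sum_assignment(-sim)  # best one-to-one column pairing
    return sim[rows, cols].mean()

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))               # "true" factors (toy data)
noisy = A + 0.1 * rng.normal(size=A.shape)  # lightly perturbed copy
shuffled = noisy[:, [2, 0, 4, 1, 3]]        # column order should not matter
unrelated = rng.normal(size=(100, 5))       # unrelated factors

print(round(simple_fms(A, shuffled), 3))    # near 1.0 for small noise
print(round(simple_fms(A, unrelated), 3))   # much lower
```

The optimal pairing step matters because CP factors are only identified up to column permutation (and sign), which the absolute value and assignment handle here.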
**Keep them!**

```python
# Full validation pipeline

def comprehensive_validation(tensor, replicate_groups):
    """
    Combine the dissertation methodology with additional validations.
    """
    # METHOD 1: Dissertation cross-validation (PRIMARY)
    optimal_R, optimal_lambda, cv_results = cross_validated_rank_selection(
        tensor, replicate_groups
    )
    print(f"\n{'=' * 60}")
    print("PRIMARY RESULT (Dissertation Method)")
    print("=" * 60)
    print(f"Optimal Rank: {optimal_R}")
    print(f"Optimal Lambda: {optimal_lambda}")

    # METHOD 2: Synthetic FMS validation (SUPPLEMENTARY)
    print(f"\n{'=' * 60}")
    print("SUPPLEMENTARY VALIDATION #1: Synthetic FMS")
    print("=" * 60)
    synthetic_fms_results = evaluate_synthetic_fms_across_ranks(
        tensor, rank_range=[optimal_R - 5, optimal_R, optimal_R + 5]
    )

    # METHOD 3: Multi-random-state stability (SUPPLEMENTARY)
    print(f"\n{'=' * 60}")
    print("SUPPLEMENTARY VALIDATION #2: Random State Stability")
    print("=" * 60)
    stability_results = evaluate_rank_stability_across_random_states(
        tensor,
        rank_range=[optimal_R - 5, optimal_R, optimal_R + 5],
        random_states=[41, 42, 43, 44, 45]
    )

    # FINAL RECOMMENDATION
    print(f"\n{'=' * 60}")
    print("FINAL RECOMMENDATION")
    print("=" * 60)
    print("Based on validated dissertation methodology:")
    print(f"  → Rank: {optimal_R}")
    print(f"  → Lambda: {optimal_lambda}")
    print("\nSupporting evidence:")
    print(f"  Synthetic FMS: {synthetic_fms_results['optimal_rank']}")
    print(f"  Random state consensus: {stability_results['consensus_rank']}")

    if all([optimal_R == synthetic_fms_results['optimal_rank'],
            optimal_R == stability_results['consensus_rank']]):
        print(f"\n✓ STRONG CONFIDENCE: All methods agree on rank {optimal_R}")
    else:
        print("\n⚠ MODERATE CONFIDENCE: Methods show some disagreement")
        candidates = sorted([optimal_R,
                             synthetic_fms_results['optimal_rank'],
                             stability_results['consensus_rank']])
        print(f"  Recommend examining ranks: {candidates}")
```

---

## Why This Matters

### Problems with the Current Approach:
1. **No validated methodology**: Manual inspection is subjective
2. **Lambda not optimized**: Fixed values may not be ideal for your data
3. **Missing the dissertation's validation**: The method was developed and tested on similar data

### Benefits of the Dissertation Approach:
1. **Validated on 100 simulated datasets**: 86% accuracy in rank selection
2. **Works with high noise**: Effective even at a 10:1 noise-to-signal ratio
3. **Objective criteria**: Clear mathematical rules for parameter selection
4. **Uses your replicates**: Leverages the biological variability in your data

---

## Timeline Recommendation

### Week 1 (Immediate):
- [x] Review dissertation methodology (DONE - this document)
- [ ] Identify replicate groups in your data
- [ ] Implement the `identify_replicate_groups()` function
- [ ] Test with a subset of ranks

### Week 2:
- [ ] Implement the full `cross_validated_rank_selection()` function
- [ ] Run the grid search on your data
- [ ] Compare results with the current manual approach

### Week 3:
- [ ] Integrate all validation methods (CV + Synthetic + Multi-RS)
- [ ] Generate a comprehensive report
- [ ] Make the final rank decision

---

## Questions to Address

1. **How many biological replicates do you have per (species, timepoint)?**
   - Need at least 3 for reliable cross-validation
   - More is better
2. **Are all replicates suitable for analysis?**
   - Check for outliers
   - Verify data quality
3. **What lambda range to test?**
   - Dissertation used: [0.0, 0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8]
   - Start with a subset: [0.0, 0.1, 0.2, 0.4, 0.8]

---

## Expected Outcomes

After implementing the dissertation methodology, you will have:
1. **Objective rank selection**: Based on cross-validated SSE
2. **Optimized lambda**: Based on the 1SE rule
3. **Confidence metrics**:
   - Cross-validation error
   - Factor match scores
   - Consistency across replicates
4. **Validated results**: Using a method proven on similar data

---

## Bottom Line

**You SHOULD implement the dissertation methodology because:**
1. ✅ Your data HAS the required replicates
2. ✅ The method is validated on similar data (metatranscriptomics)
3. ✅ It provides objective criteria for rank selection
4. ✅ It will give you more confidence in the final results
5. ✅ Reviewers will appreciate a validated methodology

**The current approach is acceptable as:**
- Exploratory analysis
- Supplementary validation
- A fallback when NO replicates are available

**But for publication-quality analysis:**
- Use the dissertation's cross-validation approach (PRIMARY)
- Keep your current validations (SUPPLEMENTARY)
- Report all methods for transparency

---

## References

Blaskowski, S. (2024). *Inference of In Situ Microbial Physiologies via Sparse Tensor Decomposition of Metatranscriptomes*. Doctoral dissertation, University of Washington.
- Section 1.2.3: Parameter Selection (pg. 15-16)
- Section 2.4.5: Parameter Selection Methods (pg. 47-48)
- Figure 1.4: Cross-validated grid search results

---

## Next Action

**Immediate**: Create a new code chunk implementing the `cross_validated_rank_selection()` function.

Would you like me to help you implement this?
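As a concrete starting point for that code chunk, the 1SE selection step (STEP 2 above) can be tested in isolation before any decompositions are run. This sketch uses made-up CV-FMS numbers purely for illustration; the column names match the `lambda_df` built inside `cross_validated_rank_selection()`:

```python
import pandas as pd

# Toy CV-FMS results: values are illustrative only
lambda_df = pd.DataFrame({
    'lambda':      [0.0,  0.05, 0.1,  0.2,  0.4],
    'mean_cv_fms': [0.80, 0.85, 0.84, 0.82, 0.60],
    'std_cv_fms':  [0.02, 0.03, 0.02, 0.02, 0.05],
})

# Locate the maximum mean CV-FMS and its standard error
max_idx = lambda_df['mean_cv_fms'].idxmax()
max_fms = lambda_df.loc[max_idx, 'mean_cv_fms']   # 0.85 at λ=0.05
max_se = lambda_df.loc[max_idx, 'std_cv_fms']     # 0.03

# 1SE rule: the largest (sparsest) lambda whose mean CV-FMS is
# within one standard error of the maximum (threshold 0.82 here)
within_1se = lambda_df['mean_cv_fms'] >= (max_fms - max_se)
optimal_lambda = lambda_df.loc[within_1se, 'lambda'].max()
print(optimal_lambda)  # → 0.2
```

Note the rule deliberately prefers a *larger* lambda than the FMS maximum: any solution statistically indistinguishable from the best is acceptable, so the sparsest such solution wins.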