# Implementation Roadmap: Dissertation-Validated Rank Selection

## Quick Summary

**Status**: Your current code does NOT implement the dissertation's validated rank selection methodology.

**Impact**: MEDIUM-HIGH. You're using a reasonable alternative, but you're missing the validated approach that would give stronger confidence in the results.

**Solution**: Implement cross-validated grid search using your biological replicates.

**Timeline**: 2-3 weeks for full implementation + testing

---

## What Needs to Change

### Currently Missing:

1. ❌ **Cross-validated SSE for rank selection**
   - Dissertation: "We selected the R value of best fit based on the minimum cross-validated SSE"
   - You: Manual inspection of elbow plots
2. ❌ **Cross-validated FMS for lambda selection**
   - Dissertation: "maximum λ value at which the cross-validated FMS remained within one standard error of the maximum FMS"
   - You: Fixed lambda=[0.1, 0.0, 0.1]
3. ❌ **Automated parameter selection**
   - Dissertation: Grid search with mathematical criteria
   - You: Manual testing and inspection

### Currently Doing (Good, but different):

1. ⚠️ **Synthetic FMS validation**
   - Tests factor recovery from synthetic noisy data
   - NOT the same as cross-validation between replicates
   - A valid alternative, but not the dissertation method
2. ⚠️ **Multi-random-state analysis**
   - Tests consistency across initializations
   - Good supplementary validation
   - Not mentioned in the dissertation

---

## Detailed Implementation Steps

### STEP 1: Data Preparation (1-2 days)

#### A. Identify Your Replicate Structure

Based on your metadata, you have:

```
Species:              Apul (ACR-), Peve (POR-), Ptua (POC-)
Colonies per species: ~10-15 (e.g., ACR-139, ACR-145, ACR-150...)
Timepoints:           4 (TP1, TP2, TP3, TP4)

Replicate groups:
- All Apul colonies at TP1 → one replicate group
- All Apul colonies at TP2 → one replicate group
- All Peve colonies at TP1 → one replicate group
- etc.

Total replicate groups: 3 species × 4 timepoints = 12 groups
Replicates per group:   ~10-15 colonies
```

(Note: in the tensor itself, timepoints form the third mode rather than separate samples, so the mapping function below groups by species only, yielding 3 replicate groups — one CV fold per species.)

#### B. Create Replicate Mapping Function

Add this to your Python code section:

```python
from collections import defaultdict

def identify_replicate_groups(sample_labels):
    """
    Identify biological replicate groups in the tensor.

    For this timeseries data:
    - Replicates = same species + different colony
      (each sample spans all four timepoints via the tensor's
      time mode, so timepoints do not define separate groups)

    Parameters
    ----------
    sample_labels : list
        Sample labels from sample_mapping.
        Format: 'apul_ACR-139' or 'apul_ACR.139'

    Returns
    -------
    replicate_groups : dict
        Keys: species; Values: list of sample indices
    """
    groups = defaultdict(list)

    # The parsing below needs to be adapted to your actual label structure
    for idx, label in enumerate(sample_labels):
        # Parse label: 'apul_ACR-139' or 'apul_ACR.139'
        parts = label.split('_')
        species = parts[0]    # 'apul', 'peve', or 'ptua'
        colony_id = parts[1]  # 'ACR-139' or 'ACR.139' (kept for reference)

        # Each sample carries all 4 timepoints in the time mode,
        # so replicate groups are keyed by species only
        groups[species].append(idx)

    print("Replicate Groups:")
    for key, indices in groups.items():
        print(f"  {key}: {len(indices)} samples")

    return groups

# Usage in your code:
replicate_groups = identify_replicate_groups(sample_labels)
```

#### C. Verify Replicate Structure
Add a verification chunk:

```python
print("="*60)
print("VERIFYING REPLICATE STRUCTURE")
print("="*60)

replicate_groups = identify_replicate_groups(sample_labels)

for group_name, sample_indices in replicate_groups.items():
    print(f"\nGroup: {group_name}")
    print(f"  Number of samples: {len(sample_indices)}")
    print(f"  Sample labels: {[sample_labels[i] for i in sample_indices[:5]]}...")

    # Check that all groups slice the tensor consistently
    group_tensor = tensor_3d[:, sample_indices, :]
    print(f"  Tensor shape: {group_tensor.shape}")

print(f"\nTotal replicate groups: {len(replicate_groups)}")
print(f"Suitable for {len(replicate_groups)}-fold cross-validation")
```

---

### STEP 2: Implement Cross-Validation Functions (3-5 days)

#### A. Create Core CV Function

Create a new Python chunk in your Rmd:

```python
def calculate_cv_sse(tensor, decomposition, test_indices):
    """
    Calculate the sum of squared errors (SSE) on held-out test samples.

    The decomposition is fit on training samples only, so held-out
    samples have no rows in its sample factor matrix. Here we project
    the held-out samples onto the fixed gene and time factors by least
    squares, then score the reconstruction. (Projection is one standard
    way to evaluate held-out slices of a CP model.)

    Parameters
    ----------
    tensor : 3D array
        Full tensor (genes × samples × timepoints)
    decomposition : barnacle decomposition object
        Decomposition fitted on the training samples
    test_indices : list
        Sample indices (into the full tensor) for the test set

    Returns
    -------
    sse : float
        Sum of squared errors on the test samples
    """
    gene_factors = decomposition.factors[0]  # genes × rank
    time_factors = decomposition.factors[2]  # timepoints × rank
    weights = decomposition.weights

    test_tensor = tensor[:, test_indices, :]
    n_genes, n_test, n_time = test_tensor.shape

    # Weighted design matrix with rows indexed by (gene, time) pairs
    design = np.einsum('ir,kr->ikr', gene_factors, time_factors) * weights
    design = design.reshape(n_genes * n_time, len(weights))

    # Unfold the test tensor along the sample mode: n_test × (genes*time)
    unfolded = np.moveaxis(test_tensor, 1, 0).reshape(n_test, -1)

    # Least-squares projection of each held-out sample onto the factors
    test_sample_factors, *_ = np.linalg.lstsq(design, unfolded.T, rcond=None)

    reconstructed = design @ test_sample_factors
    sse = np.sum((unfolded.T - reconstructed) ** 2)
    return float(sse)


def cross_validated_rank_selection_v1(
    tensor,
    replicate_groups,
    rank_range=[5, 10, 15, 20, 25, 30, 35],
    lambdas=[0.0],  # Start with just rank selection
    random_state=42,
):
    """
    Dissertation-validated rank selection via cross-validation.

    Implementation of Section 1.2.3 (pg. 15-16).
    """
    from barnacle.decomposition import SparseCP

    print("="*60)
    print("CROSS-VALIDATED RANK SELECTION")
    print("Dissertation Method (Blaskowski 2024)")
    print("="*60)
    print(f"Testing ranks: {rank_range}")
    print(f"Lambda: {lambdas[0]} (no regularization)")
    print(f"Number of CV folds: {len(replicate_groups)}")
    print("="*60)

    cv_results = []

    # For each rank, do leave-one-group-out cross-validation
    for R in rank_range:
        print(f"\nTesting Rank {R}...")
        fold_sse_scores = []

        # Leave-one-group-out CV
        for fold_idx, (group_name, test_indices) in enumerate(replicate_groups.items()):
            print(f"  Fold {fold_idx+1}/{len(replicate_groups)}: Hold out {group_name}")

            # Create the train/test split
            train_indices = []
            for other_name, other_indices in replicate_groups.items():
                if other_name != group_name:
                    train_indices.extend(other_indices)

            # Fit the model on training data
            train_tensor = tensor[:, train_indices, :]
            model = SparseCP(
                rank=R,
                lambdas=[lambdas[0], 0.0, lambdas[0]],
                nonneg_modes=[0],
                norm_constraint=True,
                init='random',
                tol=1e-5,
                n_iter_max=10000,
                random_state=random_state,
                n_initializations=1,
            )

            try:
                decomposition = model.fit_transform(train_tensor, verbose=0)
                # Calculate SSE on the held-out data
                sse = calculate_cv_sse(tensor, decomposition, test_indices)
                fold_sse_scores.append(sse)
                print(f"    SSE: {sse:.2e}")
            except Exception as e:
                print(f"    Error: {e}")
                fold_sse_scores.append(np.nan)

        # Calculate mean and std of CV-SSE across folds
        valid_scores = [s for s in fold_sse_scores if not np.isnan(s)]
        if len(valid_scores) > 0:
            mean_cv_sse = np.mean(valid_scores)
            std_cv_sse = np.std(valid_scores)
        else:
            mean_cv_sse = np.inf
            std_cv_sse = np.inf

        cv_results.append({
            'rank': R,
            'lambda': lambdas[0],
            'mean_cv_sse': mean_cv_sse,
            'std_cv_sse': std_cv_sse,
            'n_successful_folds': len(valid_scores),
        })
        print(f"  → Mean CV-SSE: {mean_cv_sse:.2e} ± {std_cv_sse:.2e}")

    # Convert to DataFrame
    cv_df = pd.DataFrame(cv_results)

    # Select the optimal rank (minimum CV-SSE)
    optimal_idx = cv_df['mean_cv_sse'].idxmin()
    optimal_rank = int(cv_df.loc[optimal_idx, 'rank'])
    optimal_sse = cv_df.loc[optimal_idx, 'mean_cv_sse']

    print(f"\n{'='*60}")
    print("OPTIMAL RANK SELECTION")
    print(f"{'='*60}")
    print(f"Optimal Rank: {optimal_rank}")
    print(f"CV-SSE: {optimal_sse:.2e}")
    print(f"\nFull Results:")
    print(cv_df.to_string(index=False))

    return optimal_rank, cv_df
```

#### B. Add Lambda Selection (1SE Rule)

```python
def select_optimal_lambda_1se_rule(
    tensor,
    optimal_rank,
    replicate_groups,
    lambda_range=[0.0, 0.05, 0.1, 0.2, 0.4, 0.8],
    random_state=42,
):
    """
    Select the optimal lambda using the 1SE rule.

    Implementation of Section 1.2.3 (pg. 15):
    "maximum λ value at which the cross-validated FMS remained
    within one standard error of the maximum FMS"
    """
    from barnacle.decomposition import SparseCP
    import tensorly as tl
    import tlviz.factor_tools

    print("="*60)
    print("LAMBDA SELECTION - 1SE RULE")
    print(f"Fixed Rank: {optimal_rank}")
    print(f"Testing lambdas: {lambda_range}")
    print("="*60)

    lambda_results = []

    for lambda_val in lambda_range:
        print(f"\nTesting Lambda {lambda_val}...")

        # Fit a model to EACH replicate group
        group_decompositions = []
        for group_name, group_indices in replicate_groups.items():
            group_tensor = tensor[:, group_indices, :]
            model = SparseCP(
                rank=optimal_rank,
                lambdas=[lambda_val, 0.0, lambda_val],
                nonneg_modes=[0],
                norm_constraint=True,
                init='random',
                tol=1e-5,
                n_iter_max=10000,
                random_state=random_state,
                n_initializations=1,
            )
            try:
                decomp = model.fit_transform(group_tensor, verbose=0)
                group_decompositions.append((group_name, decomp))
            except Exception as e:
                print(f"  Error fitting {group_name}: {e}")

        # Calculate FMS between all pairs of replicate-group fits
        fms_scores = []
        for i in range(len(group_decompositions)):
            for j in range(i+1, len(group_decompositions)):
                name_i, decomp_i = group_decompositions[i]
                name_j, decomp_j = group_decompositions[j]
                try:
                    # Wrap as CPTensor objects
                    cp_i = tl.cp_tensor.CPTensor((decomp_i.weights, decomp_i.factors))
                    cp_j = tl.cp_tensor.CPTensor((decomp_j.weights, decomp_j.factors))
                    # Skip the sample mode: each group contains a different
                    # (and differently sized) set of samples
                    fms = tlviz.factor_tools.factor_match_score(
                        cp_i, cp_j, consider_weights=True, skip_mode=1
                    )
                    fms_scores.append(fms)
                except Exception as e:
                    print(f"  Error calculating FMS: {e}")

        if len(fms_scores) > 0:
            mean_fms = np.mean(fms_scores)
            std_fms = np.std(fms_scores)
        else:
            mean_fms = 0.0
            std_fms = 0.0

        lambda_results.append({
            'lambda': lambda_val,
            'mean_cv_fms': mean_fms,
            'std_cv_fms': std_fms,
            'n_comparisons': len(fms_scores),
        })
        print(f"  → Mean CV-FMS: {mean_fms:.4f} ± {std_fms:.4f}")

    # Apply the 1SE rule
    lambda_df = pd.DataFrame(lambda_results)

    # Find the maximum FMS
    max_fms_idx = lambda_df['mean_cv_fms'].idxmax()
    max_fms = lambda_df.loc[max_fms_idx, 'mean_cv_fms']
    max_fms_se = lambda_df.loc[max_fms_idx, 'std_cv_fms']

    # Find the maximum lambda where FMS is within 1 SE of the maximum
    threshold = max_fms - max_fms_se
    within_1se = lambda_df['mean_cv_fms'] >= threshold
    if within_1se.any():
        optimal_lambda = lambda_df.loc[within_1se, 'lambda'].max()
    else:
        optimal_lambda = lambda_df.loc[max_fms_idx, 'lambda']

    print(f"\n{'='*60}")
    print("OPTIMAL LAMBDA SELECTION (1SE RULE)")
    print(f"{'='*60}")
    print(f"Maximum FMS: {max_fms:.4f} ± {max_fms_se:.4f}")
    print(f"1 SE threshold: {threshold:.4f}")
    print(f"Optimal Lambda: {optimal_lambda}")
    print(f"  (Maximum λ within 1 SE of maximum FMS)")

    return optimal_lambda, lambda_df
```

---

### STEP 3: Integrate into Your Workflow (2-3 days)

#### Add New Code Chunk to Rmd

```{python barnacle-dissertation-method, eval=TRUE}
# ===========================================
# DISSERTATION-VALIDATED RANK SELECTION
# ===========================================
# Implementation of Blaskowski (2024) methodology
# Section 1.2.3: Parameter Selection
# ===========================================

print("="*80)
print("BARNACLE RANK SELECTION - DISSERTATION METHOD")
print("="*80)

# STEP 1: Identify replicate groups
replicate_groups = identify_replicate_groups(sample_labels)

# STEP 2: Select the optimal rank via cross-validation
optimal_rank, cv_rank_results = cross_validated_rank_selection_v1(
    tensor=tensor_3d,
    replicate_groups=replicate_groups,
    rank_range=[5, 10, 15, 20, 25, 30, 35, 40],
    lambdas=[0.0],  # No regularization for rank selection
    random_state=41,
)

# STEP 3: Select the optimal lambda via the 1SE rule
optimal_lambda, cv_lambda_results = select_optimal_lambda_1se_rule(
    tensor=tensor_3d,
    optimal_rank=optimal_rank,
    replicate_groups=replicate_groups,
    lambda_range=[0.0, 0.05, 0.1, 0.2, 0.4, 0.8],
    random_state=41,
)

# Save results
results_dir = os.path.join(output_dir, 'dissertation_method_results')
os.makedirs(results_dir, exist_ok=True)

cv_rank_results.to_csv(
    os.path.join(results_dir, 'cv_rank_selection.csv'), index=False
)
cv_lambda_results.to_csv(
    os.path.join(results_dir, 'cv_lambda_selection.csv'), index=False
)

# Save the optimal parameters
with open(os.path.join(results_dir, 'optimal_parameters.json'), 'w') as f:
    json.dump({
        'method': 'dissertation_cross_validation',
        'reference': 'Blaskowski 2024, Section 1.2.3',
        'optimal_rank': int(optimal_rank),
        'optimal_lambda': float(optimal_lambda),
        'rank_selection_criterion': 'minimum_cv_sse',
        'lambda_selection_criterion': '1se_rule_cv_fms',
        'n_replicate_groups': len(replicate_groups),
        'ranks_tested': [5, 10, 15, 20, 25, 30, 35, 40],
        'lambdas_tested': [0.0, 0.05, 0.1, 0.2, 0.4, 0.8],
    }, f, indent=2)

print(f"\n{'='*80}")
print("FINAL PARAMETERS")
print(f"{'='*80}")
print(f"Optimal Rank (R): {optimal_rank}")
print(f"Optimal Lambda (λ): {optimal_lambda}")
print(f"\nResults saved to: {results_dir}")
```

---

### STEP 4: Compare Methods (1 day)

Create a comparison chunk:

```{python}
# ===========================================
# COMPARE RANK SELECTION METHODS
# ===========================================

print("="*80)
print("COMPARISON: DISSERTATION vs CURRENT METHODS")
print("="*80)

# Method 1: Dissertation cross-validation
dissertation_rank = optimal_rank  # From above

# Method 2: Synthetic FMS (your current approach)
# Load from previous results
synthetic_fms_file = os.path.join(
    output_dir, 'rank_comparison/synthetic_fms_all_random_states.csv'
)

if os.path.exists(synthetic_fms_file):
    synthetic_df = pd.read_csv(synthetic_fms_file)
    # Find the rank with the highest mean synthetic FMS
    mean_fms_by_rank = synthetic_df.groupby('rank')['fms_synthetic'].mean()
    synthetic_optimal_rank = mean_fms_by_rank.idxmax()
else:
    synthetic_optimal_rank = None

# Method 3: Variance-explained elbow
# (would need to implement elbow detection)

print(f"\nOptimal Rank by Method:")
print(f"  Dissertation (CV-SSE): Rank {dissertation_rank}")
if synthetic_optimal_rank:
    print(f"  Synthetic FMS: Rank {int(synthetic_optimal_rank)}")
print(f"  Manual inspection (current): Rank ??? (your choice)")

# Agreement analysis
print(f"\nMethod Agreement:")
if synthetic_optimal_rank and dissertation_rank == synthetic_optimal_rank:
    print(f"  ✓ Methods AGREE on rank {dissertation_rank}")
    print(f"    HIGH confidence in this rank")
elif synthetic_optimal_rank:
    diff = abs(dissertation_rank - synthetic_optimal_rank)
    print(f"  ⚠ Methods DISAGREE by {diff} ranks")
    print(f"    Dissertation: {dissertation_rank}")
    print(f"    Synthetic FMS: {int(synthetic_optimal_rank)}")
    print(f"    MODERATE confidence - examine both ranks")

# Recommendation
print(f"\n{'='*80}")
print("RECOMMENDATION")
print(f"{'='*80}")
print(f"Primary: Use Rank {dissertation_rank}")
print(f"  (Dissertation-validated cross-validation method)")
print(f"\nSupplementary validation:")
if synthetic_optimal_rank:
    print(f"  Synthetic FMS supports rank {int(synthetic_optimal_rank)}")
print(f"  Random state stability: (check previous results)")
```

---

## Success Criteria

You'll know the implementation is successful when:

1. ✅ Cross-validation runs without errors
2. ✅ All replicate groups are used in CV folds
3. ✅ Optimal rank is identified automatically
4. ✅ 1SE rule selects the optimal lambda
5. ✅ Results are reproducible
6. ✅ Agreement (or reasonable disagreement) with other methods
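The STEP 4 comparison leaves "Method 3: variance-explained elbow" unimplemented. One common heuristic is to normalize both axes and pick the point farthest from the chord joining the curve's endpoints (a "kneedle"-style rule). A minimal sketch — the `ranks` and `variance_explained` inputs are hypothetical arrays you would supply from your existing elbow-plot data:

```python
import numpy as np

def detect_elbow(ranks, variance_explained):
    """
    Simple elbow heuristic: normalize both axes to [0, 1], then return
    the rank whose point lies farthest from the straight line joining
    the first and last points of the variance-explained curve.
    """
    x = np.asarray(ranks, dtype=float)
    y = np.asarray(variance_explained, dtype=float)

    # Normalize to [0, 1] so neither axis dominates the distance
    x_n = (x - x.min()) / (x.max() - x.min())
    y_n = (y - y.min()) / (y.max() - y.min())

    # Unit vector along the chord from the first to the last point
    chord = np.array([x_n[-1] - x_n[0], y_n[-1] - y_n[0]])
    chord = chord / np.linalg.norm(chord)

    # Perpendicular distance of each point from the chord
    points = np.stack([x_n - x_n[0], y_n - y_n[0]], axis=1)
    proj = points @ chord
    perp = points - np.outer(proj, chord)
    distances = np.linalg.norm(perp, axis=1)

    return int(np.asarray(ranks)[int(np.argmax(distances))])

# For a typical saturating curve this picks the bend:
detect_elbow([5, 10, 15, 20, 25, 30],
             [0.20, 0.50, 0.65, 0.70, 0.72, 0.73])  # → 15
```

This is only one of several elbow heuristics; for noisy curves you may prefer smoothing the variance-explained values first.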
---

## Troubleshooting

### Issue: CV-SSE is very high

**Solution**:
- Check tensor normalization
- Verify replicate groups are correctly identified
- Try a different convergence tolerance

### Issue: Models fail to converge

**Solution**:
- Increase `n_iter_max`
- Try a different `random_state`
- Check for extreme values in the tensor

### Issue: FMS calculation fails

**Solution**:
- Check the tensorly/tlviz installation
- Verify decomposition outputs have the correct structure
- Try an alternative FMS calculation

### Issue: Different methods give very different ranks

**Solution**:
- This can be normal! The methods test different things
- The dissertation method should be primary
- Use the other methods to validate
- You may need to examine multiple nearby ranks

---

## Final Checklist

Before considering the implementation complete:

- [ ] Dissertation method implemented
- [ ] Cross-validation runs successfully
- [ ] Optimal parameters saved
- [ ] Comparison with other methods done
- [ ] Results documented
- [ ] Final rank decision made
- [ ] Analysis documentation updated with the methodology

---

## Getting Help

If you get stuck:

1. Check dissertation Sections 1.2.3 (pg. 15-16) and 2.4.5 (pg. 47-48)
2. Review the barnacle GitHub examples
3. Check the tensorly documentation
4. Verify your data structure matches the assumptions

---

## Timeline Summary

| Task | Est. Time | Priority |
|------|-----------|----------|
| Data preparation | 1-2 days | HIGH |
| CV functions | 3-5 days | HIGH |
| Integration | 2-3 days | HIGH |
| Comparison | 1 day | MEDIUM |
| Documentation | 1 day | MEDIUM |
| **TOTAL** | **8-12 days** | |

---

## Bottom Line

**You need to implement the dissertation methodology** because:

1. Your data HAS the required replicates
2. The method is validated on similar data
3. It provides objective parameter selection
4. Reviewers will expect validated methodology
5. It will increase confidence in your results

**Start with**: Identifying your replicate groups and implementing basic cross-validation for rank selection. Lambda selection can come later.
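The suggested starting point — replicate groups plus basic cross-validation — can be reduced to a tiny, model-free skeleton first, so the leave-one-group-out splits can be sanity-checked before any barnacle fitting. A sketch, assuming `replicate_groups` has the dict-of-index-lists shape produced by `identify_replicate_groups`:

```python
def leave_one_group_out_splits(replicate_groups):
    """
    Yield one CV fold per replicate group:
    (held_out_group_name, train_indices, test_indices).
    """
    for held_out, test_indices in replicate_groups.items():
        train_indices = [
            idx
            for name, indices in replicate_groups.items()
            if name != held_out
            for idx in indices
        ]
        yield held_out, sorted(train_indices), list(test_indices)

# Example with a toy grouping (three species, a few samples each):
toy_groups = {'apul': [0, 1, 2], 'peve': [3, 4], 'ptua': [5, 6, 7]}
for name, train, test in leave_one_group_out_splits(toy_groups):
    print(f"hold out {name}: train={train} test={test}")
```

Printing the folds for your real `replicate_groups` before running any decomposition is a cheap way to confirm that every sample appears in exactly one test set.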