# Implementation Roadmap: Dissertation-Validated Rank Selection

## Quick Summary

**Status**: Your current code does NOT implement the dissertation's validated rank selection methodology.

**Impact**: MEDIUM-HIGH. You're using a reasonable alternative, but you're missing the validated approach that would give stronger confidence in the results.

**Solution**: Implement cross-validated grid search using your biological replicates.

**Timeline**: 2-3 weeks for full implementation + testing

---

## What Needs to Change

### Currently Missing:

1. ❌ **Cross-validated SSE for rank selection**
   - Dissertation: "We selected the R value of best fit based on the minimum cross-validated SSE"
   - You: Manual inspection of elbow plots
2. ❌ **Cross-validated FMS for lambda selection**
   - Dissertation: "maximum λ value at which the cross-validated FMS remained within one standard error of the maximum FMS"
   - You: Fixed lambda=[0.1, 0.0, 0.1]
3. ❌ **Automated parameter selection**
   - Dissertation: Grid search with mathematical criteria
   - You: Manual testing and inspection

### Currently Doing (Good, but different):

1. ⚠️ **Synthetic FMS validation**
   - Tests factor recovery from synthetic noisy data
   - NOT the same as cross-validation between replicates
   - A valid alternative, but not the dissertation method
2. ⚠️ **Multi-random-state analysis**
   - Tests consistency across initializations
   - Good supplementary validation
   - Not mentioned in the dissertation

---

## Detailed Implementation Steps

### STEP 1: Data Preparation (1-2 days)

#### A. Identify Your Replicate Structure

Based on your metadata, you have:

```
Species:              Apul (ACR-), Peve (POR-), Ptua (POC-)
Colonies per species: ~10-15 (e.g., ACR-139, ACR-145, ACR-150...)
Timepoints:           4 (TP1, TP2, TP3, TP4)

Replicate groups:
- All Apul colonies at TP1 → one replicate group
- All Apul colonies at TP2 → one replicate group
- All Peve colonies at TP1 → one replicate group
- etc.

Total replicate groups: 3 species × 4 timepoints = 12 groups
Replicates per group:   ~10-15 colonies
```

(Note: in the tensor itself, timepoints form the third mode rather than separate samples, so the mapping function below groups by species only, yielding 3 replicate groups — one CV fold per species.)

#### B. Create Replicate Mapping Function

Add this to your Python code section:

```python
from collections import defaultdict

def identify_replicate_groups(sample_labels):
    """
    Identify biological replicate groups in the tensor.

    For this timeseries data:
    - Replicates = same species + different colony
      (each sample spans all four timepoints via the tensor's
      time mode, so timepoints do not define separate groups)

    Parameters
    ----------
    sample_labels : list
        Sample labels from sample_mapping.
        Format: 'apul_ACR-139' or 'apul_ACR.139'

    Returns
    -------
    replicate_groups : dict
        Keys: species; Values: list of sample indices
    """
    groups = defaultdict(list)

    # The parsing below needs to be adapted to your actual label structure
    for idx, label in enumerate(sample_labels):
        # Parse label: 'apul_ACR-139' or 'apul_ACR.139'
        parts = label.split('_')
        species = parts[0]    # 'apul', 'peve', or 'ptua'
        colony_id = parts[1]  # 'ACR-139' or 'ACR.139' (kept for reference)

        # Each sample carries all 4 timepoints in the time mode,
        # so replicate groups are keyed by species only
        groups[species].append(idx)

    print("Replicate Groups:")
    for key, indices in groups.items():
        print(f"  {key}: {len(indices)} samples")

    return groups

# Usage in your code:
replicate_groups = identify_replicate_groups(sample_labels)
```

#### C. Verify Replicate Structure
Add a verification chunk:

```python
print("="*60)
print("VERIFYING REPLICATE STRUCTURE")
print("="*60)

replicate_groups = identify_replicate_groups(sample_labels)

for group_name, sample_indices in replicate_groups.items():
    print(f"\nGroup: {group_name}")
    print(f"  Number of samples: {len(sample_indices)}")
    print(f"  Sample labels: {[sample_labels[i] for i in sample_indices[:5]]}...")

    # Check that all groups slice the tensor consistently
    group_tensor = tensor_3d[:, sample_indices, :]
    print(f"  Tensor shape: {group_tensor.shape}")

print(f"\nTotal replicate groups: {len(replicate_groups)}")
print(f"Suitable for {len(replicate_groups)}-fold cross-validation")
```

---

### STEP 2: Implement Cross-Validation Functions (3-5 days)

#### A. Create Core CV Function

Create a new Python chunk in your Rmd:

```python
def calculate_cv_sse(tensor, decomposition, test_indices):
    """
    Calculate the sum of squared errors (SSE) on held-out test samples.

    The decomposition is fit on training samples only, so held-out
    samples have no rows in its sample factor matrix. Here we project
    the held-out samples onto the fixed gene and time factors by least
    squares, then score the reconstruction. (Projection is one standard
    way to evaluate held-out slices of a CP model.)

    Parameters
    ----------
    tensor : 3D array
        Full tensor (genes × samples × timepoints)
    decomposition : barnacle decomposition object
        Decomposition fitted on the training samples
    test_indices : list
        Sample indices (into the full tensor) for the test set

    Returns
    -------
    sse : float
        Sum of squared errors on the test samples
    """
    gene_factors = decomposition.factors[0]  # genes × rank
    time_factors = decomposition.factors[2]  # timepoints × rank
    weights = decomposition.weights

    test_tensor = tensor[:, test_indices, :]
    n_genes, n_test, n_time = test_tensor.shape

    # Weighted design matrix with rows indexed by (gene, time) pairs
    design = np.einsum('ir,kr->ikr', gene_factors, time_factors) * weights
    design = design.reshape(n_genes * n_time, len(weights))

    # Unfold the test tensor along the sample mode: n_test × (genes*time)
    unfolded = np.moveaxis(test_tensor, 1, 0).reshape(n_test, -1)

    # Least-squares projection of each held-out sample onto the factors
    test_sample_factors, *_ = np.linalg.lstsq(design, unfolded.T, rcond=None)

    reconstructed = design @ test_sample_factors
    sse = np.sum((unfolded.T - reconstructed) ** 2)
    return float(sse)


def cross_validated_rank_selection_v1(
    tensor,
    replicate_groups,
    rank_range=[5, 10, 15, 20, 25, 30, 35],
    lambdas=[0.0],  # Start with just rank selection
    random_state=42,
):
    """
    Dissertation-validated rank selection via cross-validation.

    Implementation of Section 1.2.3 (pg. 15-16).
    """
    from barnacle.decomposition import SparseCP

    print("="*60)
    print("CROSS-VALIDATED RANK SELECTION")
    print("Dissertation Method (Blaskowski 2024)")
    print("="*60)
    print(f"Testing ranks: {rank_range}")
    print(f"Lambda: {lambdas[0]} (no regularization)")
    print(f"Number of CV folds: {len(replicate_groups)}")
    print("="*60)

    cv_results = []

    # For each rank, do leave-one-group-out cross-validation
    for R in rank_range:
        print(f"\nTesting Rank {R}...")
        fold_sse_scores = []

        # Leave-one-group-out CV
        for fold_idx, (group_name, test_indices) in enumerate(replicate_groups.items()):
            print(f"  Fold {fold_idx+1}/{len(replicate_groups)}: Hold out {group_name}")

            # Create the train/test split
            train_indices = []
            for other_name, other_indices in replicate_groups.items():
                if other_name != group_name:
                    train_indices.extend(other_indices)

            # Fit the model on training data
            train_tensor = tensor[:, train_indices, :]
            model = SparseCP(
                rank=R,
                lambdas=[lambdas[0], 0.0, lambdas[0]],
                nonneg_modes=[0],
                norm_constraint=True,
                init='random',
                tol=1e-5,
                n_iter_max=10000,
                random_state=random_state,
                n_initializations=1,
            )

            try:
                decomposition = model.fit_transform(train_tensor, verbose=0)
                # Calculate SSE on the held-out data
                sse = calculate_cv_sse(tensor, decomposition, test_indices)
                fold_sse_scores.append(sse)
                print(f"    SSE: {sse:.2e}")
            except Exception as e:
                print(f"    Error: {e}")
                fold_sse_scores.append(np.nan)

        # Calculate mean and std of CV-SSE across folds
        valid_scores = [s for s in fold_sse_scores if not np.isnan(s)]
        if len(valid_scores) > 0:
            mean_cv_sse = np.mean(valid_scores)
            std_cv_sse = np.std(valid_scores)
        else:
            mean_cv_sse = np.inf
            std_cv_sse = np.inf

        cv_results.append({
            'rank': R,
            'lambda': lambdas[0],
            'mean_cv_sse': mean_cv_sse,
            'std_cv_sse': std_cv_sse,
            'n_successful_folds': len(valid_scores),
        })
        print(f"  → Mean CV-SSE: {mean_cv_sse:.2e} ± {std_cv_sse:.2e}")

    # Convert to DataFrame
    cv_df = pd.DataFrame(cv_results)

    # Select the optimal rank (minimum CV-SSE)
    optimal_idx = cv_df['mean_cv_sse'].idxmin()
    optimal_rank = int(cv_df.loc[optimal_idx, 'rank'])
    optimal_sse = cv_df.loc[optimal_idx, 'mean_cv_sse']

    print(f"\n{'='*60}")
    print("OPTIMAL RANK SELECTION")
    print(f"{'='*60}")
    print(f"Optimal Rank: {optimal_rank}")
    print(f"CV-SSE: {optimal_sse:.2e}")
    print(f"\nFull Results:")
    print(cv_df.to_string(index=False))

    return optimal_rank, cv_df
```

#### B. Add Lambda Selection (1SE Rule)

```python
def select_optimal_lambda_1se_rule(
    tensor,
    optimal_rank,
    replicate_groups,
    lambda_range=[0.0, 0.05, 0.1, 0.2, 0.4, 0.8],
    random_state=42,
):
    """
    Select the optimal lambda using the 1SE rule.

    Implementation of Section 1.2.3 (pg. 15):
    "maximum λ value at which the cross-validated FMS remained
    within one standard error of the maximum FMS"
    """
    from barnacle.decomposition import SparseCP
    import tensorly as tl
    import tlviz.factor_tools

    print("="*60)
    print("LAMBDA SELECTION - 1SE RULE")
    print(f"Fixed Rank: {optimal_rank}")
    print(f"Testing lambdas: {lambda_range}")
    print("="*60)

    lambda_results = []

    for lambda_val in lambda_range:
        print(f"\nTesting Lambda {lambda_val}...")

        # Fit a model to EACH replicate group
        group_decompositions = []
        for group_name, group_indices in replicate_groups.items():
            group_tensor = tensor[:, group_indices, :]
            model = SparseCP(
                rank=optimal_rank,
                lambdas=[lambda_val, 0.0, lambda_val],
                nonneg_modes=[0],
                norm_constraint=True,
                init='random',
                tol=1e-5,
                n_iter_max=10000,
                random_state=random_state,
                n_initializations=1,
            )
            try:
                decomp = model.fit_transform(group_tensor, verbose=0)
                group_decompositions.append((group_name, decomp))
            except Exception as e:
                print(f"  Error fitting {group_name}: {e}")

        # Calculate FMS between all pairs of replicate-group fits
        fms_scores = []
        for i in range(len(group_decompositions)):
            for j in range(i+1, len(group_decompositions)):
                name_i, decomp_i = group_decompositions[i]
                name_j, decomp_j = group_decompositions[j]
                try:
                    # Wrap as CPTensor objects
                    cp_i = tl.cp_tensor.CPTensor((decomp_i.weights, decomp_i.factors))
                    cp_j = tl.cp_tensor.CPTensor((decomp_j.weights, decomp_j.factors))
                    # Skip the sample mode: each group contains a different
                    # (and differently sized) set of samples
                    fms = tlviz.factor_tools.factor_match_score(
                        cp_i, cp_j, consider_weights=True, skip_mode=1
                    )
                    fms_scores.append(fms)
                except Exception as e:
                    print(f"  Error calculating FMS: {e}")

        if len(fms_scores) > 0:
            mean_fms = np.mean(fms_scores)
            std_fms = np.std(fms_scores)
        else:
            mean_fms = 0.0
            std_fms = 0.0

        lambda_results.append({
            'lambda': lambda_val,
            'mean_cv_fms': mean_fms,
            'std_cv_fms': std_fms,
            'n_comparisons': len(fms_scores),
        })
        print(f"  → Mean CV-FMS: {mean_fms:.4f} ± {std_fms:.4f}")

    # Apply the 1SE rule
    lambda_df = pd.DataFrame(lambda_results)

    # Find the maximum FMS
    max_fms_idx = lambda_df['mean_cv_fms'].idxmax()
    max_fms = lambda_df.loc[max_fms_idx, 'mean_cv_fms']
    max_fms_se = lambda_df.loc[max_fms_idx, 'std_cv_fms']

    # Find the maximum lambda where FMS is within 1 SE of the maximum
    threshold = max_fms - max_fms_se
    within_1se = lambda_df['mean_cv_fms'] >= threshold
    if within_1se.any():
        optimal_lambda = lambda_df.loc[within_1se, 'lambda'].max()
    else:
        optimal_lambda = lambda_df.loc[max_fms_idx, 'lambda']

    print(f"\n{'='*60}")
    print("OPTIMAL LAMBDA SELECTION (1SE RULE)")
    print(f"{'='*60}")
    print(f"Maximum FMS: {max_fms:.4f} ± {max_fms_se:.4f}")
    print(f"1 SE threshold: {threshold:.4f}")
    print(f"Optimal Lambda: {optimal_lambda}")
    print(f"  (Maximum λ within 1 SE of maximum FMS)")

    return optimal_lambda, lambda_df
```

---

### STEP 3: Integrate into Your Workflow (2-3 days)

#### Add New Code Chunk to Rmd

```{python barnacle-dissertation-method, eval=TRUE}
# ===========================================
# DISSERTATION-VALIDATED RANK SELECTION
# ===========================================
# Implementation of Blaskowski (2024) methodology
# Section 1.2.3: Parameter Selection
# ===========================================

print("="*80)
print("BARNACLE RANK SELECTION - DISSERTATION METHOD")
print("="*80)

# STEP 1: Identify replicate groups
replicate_groups = identify_replicate_groups(sample_labels)

# STEP 2: Select the optimal rank via cross-validation
optimal_rank, cv_rank_results = cross_validated_rank_selection_v1(
    tensor=tensor_3d,
    replicate_groups=replicate_groups,
    rank_range=[5, 10, 15, 20, 25, 30, 35, 40],
    lambdas=[0.0],  # No regularization for rank selection
    random_state=41,
)

# STEP 3: Select the optimal lambda via the 1SE rule
optimal_lambda, cv_lambda_results = select_optimal_lambda_1se_rule(
    tensor=tensor_3d,
    optimal_rank=optimal_rank,
    replicate_groups=replicate_groups,
    lambda_range=[0.0, 0.05, 0.1, 0.2, 0.4, 0.8],
    random_state=41,
)

# Save results
results_dir = os.path.join(output_dir, 'dissertation_method_results')
os.makedirs(results_dir, exist_ok=True)

cv_rank_results.to_csv(
    os.path.join(results_dir, 'cv_rank_selection.csv'), index=False
)
cv_lambda_results.to_csv(
    os.path.join(results_dir, 'cv_lambda_selection.csv'), index=False
)

# Save the optimal parameters
with open(os.path.join(results_dir, 'optimal_parameters.json'), 'w') as f:
    json.dump({
        'method': 'dissertation_cross_validation',
        'reference': 'Blaskowski 2024, Section 1.2.3',
        'optimal_rank': int(optimal_rank),
        'optimal_lambda': float(optimal_lambda),
        'rank_selection_criterion': 'minimum_cv_sse',
        'lambda_selection_criterion': '1se_rule_cv_fms',
        'n_replicate_groups': len(replicate_groups),
        'ranks_tested': [5, 10, 15, 20, 25, 30, 35, 40],
        'lambdas_tested': [0.0, 0.05, 0.1, 0.2, 0.4, 0.8],
    }, f, indent=2)

print(f"\n{'='*80}")
print("FINAL PARAMETERS")
print(f"{'='*80}")
print(f"Optimal Rank (R): {optimal_rank}")
print(f"Optimal Lambda (λ): {optimal_lambda}")
print(f"\nResults saved to: {results_dir}")
```

---

### STEP 4: Compare Methods (1 day)

Create a comparison chunk:

```{python}
# ===========================================
# COMPARE RANK SELECTION METHODS
# ===========================================

print("="*80)
print("COMPARISON: DISSERTATION vs CURRENT METHODS")
print("="*80)

# Method 1: Dissertation cross-validation
dissertation_rank = optimal_rank  # From above

# Method 2: Synthetic FMS (your current approach)
# Load from previous results
synthetic_fms_file = os.path.join(
    output_dir, 'rank_comparison/synthetic_fms_all_random_states.csv'
)

if os.path.exists(synthetic_fms_file):
    synthetic_df = pd.read_csv(synthetic_fms_file)
    # Find the rank with the highest mean synthetic FMS
    mean_fms_by_rank = synthetic_df.groupby('rank')['fms_synthetic'].mean()
    synthetic_optimal_rank = mean_fms_by_rank.idxmax()
else:
    synthetic_optimal_rank = None

# Method 3: Variance-explained elbow
# (would need to implement elbow detection)

print(f"\nOptimal Rank by Method:")
print(f"  Dissertation (CV-SSE): Rank {dissertation_rank}")
if synthetic_optimal_rank:
    print(f"  Synthetic FMS: Rank {int(synthetic_optimal_rank)}")
print(f"  Manual inspection (current): Rank ??? (your choice)")

# Agreement analysis
print(f"\nMethod Agreement:")
if synthetic_optimal_rank and dissertation_rank == synthetic_optimal_rank:
    print(f"  ✓ Methods AGREE on rank {dissertation_rank}")
    print(f"    HIGH confidence in this rank")
elif synthetic_optimal_rank:
    diff = abs(dissertation_rank - synthetic_optimal_rank)
    print(f"  ⚠ Methods DISAGREE by {diff} ranks")
    print(f"    Dissertation: {dissertation_rank}")
    print(f"    Synthetic FMS: {int(synthetic_optimal_rank)}")
    print(f"    MODERATE confidence - examine both ranks")

# Recommendation
print(f"\n{'='*80}")
print("RECOMMENDATION")
print(f"{'='*80}")
print(f"Primary: Use Rank {dissertation_rank}")
print(f"  (Dissertation-validated cross-validation method)")
print(f"\nSupplementary validation:")
if synthetic_optimal_rank:
    print(f"  Synthetic FMS supports rank {int(synthetic_optimal_rank)}")
print(f"  Random state stability: (check previous results)")
```

---

## Success Criteria

You'll know the implementation is successful when:

1. ✅ Cross-validation runs without errors
2. ✅ All replicate groups are used in CV folds
3. ✅ Optimal rank is identified automatically
4. ✅ 1SE rule selects the optimal lambda
5. ✅ Results are reproducible
6. ✅ Agreement (or reasonable disagreement) with other methods
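The STEP 4 comparison leaves "Method 3: variance-explained elbow" unimplemented. One common heuristic is to normalize both axes and pick the point farthest from the chord joining the curve's endpoints (a "kneedle"-style rule). A minimal sketch — the `ranks` and `variance_explained` inputs are hypothetical arrays you would supply from your existing elbow-plot data:

```python
import numpy as np

def detect_elbow(ranks, variance_explained):
    """
    Simple elbow heuristic: normalize both axes to [0, 1], then return
    the rank whose point lies farthest from the straight line joining
    the first and last points of the variance-explained curve.
    """
    x = np.asarray(ranks, dtype=float)
    y = np.asarray(variance_explained, dtype=float)

    # Normalize to [0, 1] so neither axis dominates the distance
    x_n = (x - x.min()) / (x.max() - x.min())
    y_n = (y - y.min()) / (y.max() - y.min())

    # Unit vector along the chord from the first to the last point
    chord = np.array([x_n[-1] - x_n[0], y_n[-1] - y_n[0]])
    chord = chord / np.linalg.norm(chord)

    # Perpendicular distance of each point from the chord
    points = np.stack([x_n - x_n[0], y_n - y_n[0]], axis=1)
    proj = points @ chord
    perp = points - np.outer(proj, chord)
    distances = np.linalg.norm(perp, axis=1)

    return int(np.asarray(ranks)[int(np.argmax(distances))])

# For a typical saturating curve this picks the bend:
detect_elbow([5, 10, 15, 20, 25, 30],
             [0.20, 0.50, 0.65, 0.70, 0.72, 0.73])  # → 15
```

This is only one of several elbow heuristics; for noisy curves you may prefer smoothing the variance-explained values first.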
---

## Troubleshooting

### Issue: CV-SSE is very high

**Solution**:
- Check tensor normalization
- Verify replicate groups are correctly identified
- Try a different convergence tolerance

### Issue: Models fail to converge

**Solution**:
- Increase `n_iter_max`
- Try a different `random_state`
- Check for extreme values in the tensor

### Issue: FMS calculation fails

**Solution**:
- Check the tensorly/tlviz installation
- Verify decomposition outputs have the correct structure
- Try an alternative FMS calculation

### Issue: Different methods give very different ranks

**Solution**:
- This can be normal! The methods test different things
- The dissertation method should be primary
- Use the other methods to validate
- You may need to examine multiple nearby ranks

---

## Final Checklist

Before considering the implementation complete:

- [ ] Dissertation method implemented
- [ ] Cross-validation runs successfully
- [ ] Optimal parameters saved
- [ ] Comparison with other methods done
- [ ] Results documented
- [ ] Final rank decision made
- [ ] Analysis documentation updated with the methodology

---

## Getting Help

If you get stuck:

1. Check dissertation Sections 1.2.3 (pg. 15-16) and 2.4.5 (pg. 47-48)
2. Review the barnacle GitHub examples
3. Check the tensorly documentation
4. Verify your data structure matches the assumptions

---

## Timeline Summary

| Task | Est. Time | Priority |
|------|-----------|----------|
| Data preparation | 1-2 days | HIGH |
| CV functions | 3-5 days | HIGH |
| Integration | 2-3 days | HIGH |
| Comparison | 1 day | MEDIUM |
| Documentation | 1 day | MEDIUM |
| **TOTAL** | **8-12 days** | |

---

## Bottom Line

**You need to implement the dissertation methodology** because:

1. Your data HAS the required replicates
2. The method is validated on similar data
3. It provides objective parameter selection
4. Reviewers will expect validated methodology
5. It will increase confidence in your results

**Start with**: Identifying your replicate groups and implementing basic cross-validation for rank selection. Lambda selection can come later.
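The suggested starting point — replicate groups plus basic cross-validation — can be reduced to a tiny, model-free skeleton first, so the leave-one-group-out splits can be sanity-checked before any barnacle fitting. A sketch, assuming `replicate_groups` has the dict-of-index-lists shape produced by `identify_replicate_groups`:

```python
def leave_one_group_out_splits(replicate_groups):
    """
    Yield one CV fold per replicate group:
    (held_out_group_name, train_indices, test_indices).
    """
    for held_out, test_indices in replicate_groups.items():
        train_indices = [
            idx
            for name, indices in replicate_groups.items()
            if name != held_out
            for idx in indices
        ]
        yield held_out, sorted(train_indices), list(test_indices)

# Example with a toy grouping (three species, a few samples each):
toy_groups = {'apul': [0, 1, 2], 'peve': [3, 4], 'ptua': [5, 6, 7]}
for name, train, test in leave_one_group_out_splits(toy_groups):
    print(f"hold out {name}: train={train} test={test}")
```

Printing the folds for your real `replicate_groups` before running any decomposition is a cheap way to confirm that every sample appears in exactly one test set.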