# Cross-Validation and Biological Replicates: Important Clarification

**Date:** November 19, 2025

**Issue:** Misunderstanding about biological replicates in timeseries tensor CV

**Status:** CORRECTED

---

## The Question

> "I think biological replicates are ONLY if they exist at the same time point. Does the code process that correctly?"

**Answer:** You are **absolutely correct** about the definition of biological replicates. The original code had a conceptual issue that has now been fixed.

---

## The Problem

### What Are True Biological Replicates?

**Biological replicates** = Independent samples measured under the **same conditions**

For this timeseries experiment:

- **TRUE REPLICATES**: Different colonies at the **SAME timepoint**
  - Example: `ACR-139_TP1`, `ACR-145_TP1`, `ACR-150_TP1` (all measured at TP1)
- **NOT REPLICATES**: Same colony at different timepoints
  - Example: `ACR-139_TP1` vs `ACR-139_TP2` (same colony, different times)

### The Tensor Structure Issue

The tensor in this analysis has structure: **genes × samples × timepoints**

Where:

- Each "sample" dimension = **ONE COLONY** measured across **ALL timepoints**
  - Example: `ACR-139` occupies one sample dimension with data at [TP1, TP2, TP3, TP4]

**Key insight:** In this tensor structure, we cannot separate individual timepoints because each colony's timeseries is stored as a single sample unit.

---

## What the Original Code Did (WRONG)

The original `identify_replicate_groups_for_cv()` function created groups like:

```
Group 1 (apul): [ACR-139, ACR-145, ACR-150, ACR-173, ACR-186, ...]
Group 2 (peve): [POR-216, POR-221, POR-224, ...]
Group 3 (ptua): [POC-40, POC-42, POC-43, ...]
```

This performed **"leave-one-species-out" cross-validation**:

- Train on: peve + ptua (all colonies, all timepoints)
- Test on: apul (all colonies, all timepoints)

**Problem:** This tests **species generalization**, not replicate validation within the same experimental conditions.

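The difference between the two notions of "group" can be made concrete with a small sketch. It groups sample labels of the `COLONY_TPn` form once by (species, timepoint), which matches the definition of true replicates above, and once by species alone, which is what the original grouping amounted to. The helper names (`parse`, `group_by_condition`, `group_by_species`) and the prefix-to-species mapping are assumptions made for illustration, inferred from the example groups; this is not the project's actual code.

```python
from collections import defaultdict

# Sample labels in the 'COLONY_TPn' form used in the examples above.
samples = [
    "ACR-139_TP1", "ACR-145_TP1", "ACR-150_TP1",
    "ACR-139_TP2", "ACR-145_TP2", "ACR-150_TP2",
    "POR-216_TP1", "POR-221_TP1",
]

# Assumed mapping, inferred from the example groups (ACR -> apul, etc.).
PREFIX_TO_SPECIES = {"ACR": "apul", "POR": "peve", "POC": "ptua"}


def parse(sample):
    """Split 'ACR-139_TP1' into (species, colony, timepoint)."""
    colony, timepoint = sample.rsplit("_", 1)
    species = PREFIX_TO_SPECIES[colony.split("-")[0]]
    return species, colony, timepoint


def group_by_condition(sample_ids):
    """True replicates: different colonies, same species AND same timepoint."""
    groups = defaultdict(list)
    for sample in sample_ids:
        species, colony, timepoint = parse(sample)
        groups[(species, timepoint)].append(colony)
    return dict(groups)


def group_by_species(sample_ids):
    """Species-level grouping: every colony of a species, timepoints ignored."""
    groups = defaultdict(set)
    for sample in sample_ids:
        species, colony, _ = parse(sample)
        groups[species].add(colony)
    return {species: sorted(colonies) for species, colonies in groups.items()}


true_replicates = group_by_condition(samples)
by_species = group_by_species(samples)

print("replicate groups:", sorted(true_replicates))
print("species groups:  ", sorted(by_species))
```

Grouping by condition yields one group per (species, timepoint) pair, whereas grouping by species collapses all timepoints into one group per species. Holding out a species-level group therefore removes an entire species rather than one replicate within a condition.
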

---

## What the Code Does Now (CORRECTED)

The code now provides **TWO options** with clear explanations:

### Option 1: Species-Level CV (RECOMMENDED)

```python
species_groups = identify_species_groups_for_cv(sample_labels, species_sample_map)
```

**What it does:**

- Groups: One per species (3 groups: apul, peve, ptua)
- CV: Leave-one-species-out (3 folds)
- Tests: "Can the decomposition generalize to a NEW SPECIES?"

**Why use this:**

- ✓ Fast (only 3 folds)
- ✓ Practical for rank selection
- ✓ Tests a meaningful biological question (cross-species patterns)
- ✓ Similar to the dissertation's leave-one-dataset-out approach
- ✓ Honest about what it's testing (species generalization, not replicate validation)

### Option 2: Colony-Level CV (Comprehensive)

```python
replicate_groups = identify_replicate_groups_for_cv(sample_labels, species_sample_map)
```

**What it does:**

- Groups: One per colony (~30-40 groups)
- CV: Leave-one-colony-out (~30-40 folds)
- Tests: "Can the decomposition generalize to a NEW COLONY?"

**Why use this:**

- ✓ More comprehensive validation
- ✓ Tests within-species generalization
- ✗ Computationally expensive (10-20x slower than species-level CV)
- ✗ Still not testing "replicates at same timepoint" due to tensor structure

---

## Important Clarification: Why Neither Is "True Replicate Validation"

**You are correct that biological replicates must be at the same timepoint.**

However, in this tensor structure:

1. **Data is organized by colony timeseries**, not by individual timepoint measurements
2. Each "sample" = complete timeseries for one colony across all timepoints
3. We **cannot split timepoints** within a sample (it's a structural constraint of the tensor)

Therefore:

- **Species-level CV** = Tests if patterns work across species (honest labeling)
- **Colony-level CV** = Tests if patterns work across colonies (honest labeling)
- **Neither** is true biological replicate validation at the same timepoint (impossible with this tensor structure)

---

## What Would True Replicate Validation Look Like?

To do proper biological replicate validation at the same timepoint, you would need a **different tensor structure**:

### Alternative Structure: genes × replicates × (species×timepoints)

Example dimensions:

- Genes: 5000
- Replicates: 10 (max number of colonies per species)
- Conditions: 12 (3 species × 4 timepoints)

Then you could do:

- Leave-one-replicate-out CV
- Train on 9 colonies, test on 1 colony
- All at the **same species×timepoint condition**

**BUT** this would require restructuring the entire analysis and handling missing data differently (since not all species have the same number of colonies).

---

## Recommendation

Given the current tensor structure (**genes × samples × timepoints**):

### For Rank Selection:

**Use Species-Level CV** (`species_groups`)

- Fast and practical
- Tests cross-species generalization
- Appropriate for parameter selection
- Honest about what it's validating

### For Comprehensive Validation:

**Use Colony-Level CV** (`replicate_groups`)

- More thorough testing
- Tests within-species generalization
- Use if computational time is not a concern

### For True Biological Replicate Validation:

**Not possible with current tensor structure**

- Would require restructuring the entire analysis
- Consider whether this level of validation is necessary for your research question

---

## Code Changes Made

1. **Updated `identify_replicate_groups_for_cv()`**
   - Now creates colony-level groups
   - Clear documentation about what it tests
   - Warning about computational cost
2. **Added `identify_species_groups_for_cv()`**
   - Creates species-level groups
   - Recommended for practical use
   - Fast and interpretable
3. **Updated execution chunk**
   - User can choose which approach
   - Clear labeling of CV method
   - Saves metadata about which approach was used
4. **Added comprehensive documentation**
   - Explains tensor structure constraints
   - Clarifies what each CV approach tests
   - Honest about limitations

---

## Summary

- ✓ **Your concern was valid**: Biological replicates must be at the same timepoint
- ✓ **Code has been corrected**: Now uses appropriate CV approaches for the tensor structure
- ✓ **Honest labeling**: Code now clearly states what each CV method is testing
- ✓ **Practical recommendation**: Use species-level CV for rank selection (fast, meaningful)
- ✓ **Limitation acknowledged**: True replicate validation not possible with this tensor structure

The code now correctly reflects the experimental design and is honest about what validation is being performed.
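
As a rough illustration of how the two CV options differ in fold count, and of the kind of metadata that can record which approach was used, here is a minimal sketch. The helper names (`species_level_groups`, `colony_level_groups`, `describe_cv`), the metadata keys, and the dict-of-lists shape assumed for `species_sample_map` are all hypothetical; they are not the actual implementations of `identify_species_groups_for_cv()` or `identify_replicate_groups_for_cv()`.

```python
def species_level_groups(species_sample_map):
    """One CV group per species: {species: [colony, ...]}."""
    return {species: list(colonies) for species, colonies in species_sample_map.items()}


def colony_level_groups(species_sample_map):
    """One CV group per colony: {colony: [colony]}."""
    return {
        colony: [colony]
        for colonies in species_sample_map.values()
        for colony in colonies
    }


def describe_cv(method, groups):
    """Hypothetical metadata record: which CV method ran and how many folds."""
    return {"cv_method": method, "n_folds": len(groups)}


# Assumed structure: species name -> list of colony IDs (truncated example data).
species_sample_map = {
    "apul": ["ACR-139", "ACR-145", "ACR-150", "ACR-173", "ACR-186"],
    "peve": ["POR-216", "POR-221", "POR-224"],
    "ptua": ["POC-40", "POC-42", "POC-43"],
}

species_meta = describe_cv(
    "leave-one-species-out", species_level_groups(species_sample_map)
)
colony_meta = describe_cv(
    "leave-one-colony-out", colony_level_groups(species_sample_map)
)

print(species_meta)
print(colony_meta)
print("fold ratio:", colony_meta["n_folds"] / species_meta["n_folds"])
```

The number of folds scales with the number of groups, so colony-level CV needs one decomposition fit per colony while species-level CV needs only three. Neither grouping isolates replicates at a single timepoint, because each group still carries complete colony timeseries.
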