# Bootstrap Incremental Saving, Resume Capability & Parallel Execution ## Overview The `split_half_bootstrap_grid_search()` function now includes: 1. **Incremental saving** - prevents data loss during long runs 2. **Resume capability** - continue from last checkpoint if interrupted 3. **Parallel execution** - use multiple CPU cores to speed up grid search ## Features ### 1. Parallel Execution (NEW) **Grid search within each bootstrap iteration runs in parallel** using `joblib`: ```python optimal_rank, optimal_lambda, results_df, bootstrap_raw = split_half_bootstrap_grid_search( tensor=tensor_3d, species_sample_map=species_sample_map, rank_range=[5, 10, 15, 20, 25, 30, 35], lambda_values=[0.0, 0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0], n_bootstrap=10, n_jobs=-1 # Use all available cores (default) ) ``` **Performance Impact:** - `n_jobs=-1`: Use all CPU cores (fastest, recommended) - `n_jobs=8`: Use 8 specific cores - `n_jobs=1`: Sequential execution (no parallelization) **Speed Improvement Example:** - Sequential: 56 models × 5 min = 280 minutes (~4.7 hours) per bootstrap iteration - Parallel (8 cores): 280 / 8 = 35 minutes per bootstrap iteration - **~7x faster** with 8 cores ### 2. Incremental Checkpointing After each bootstrap iteration completes (all rank × lambda combinations), results are immediately saved to a checkpoint file: ``` output_dir/bootstrap_grid_search/ ├── bootstrap_checkpoint_iter000.csv (iteration 0) ├── bootstrap_checkpoint_iter001.csv (iteration 1) ├── bootstrap_checkpoint_iter002.csv (iteration 2) ... ├── bootstrap_checkpoint_iter009.csv (iteration 9) ├── bootstrap_aggregated_results.csv (final aggregated results) ├── bootstrap_raw_iterations.csv (all raw data consolidated) └── optimal_parameters.json (selected parameters) ``` ### Resume Logic When the function starts: 1. **Check for existing checkpoints** in the output directory 2. **Load completed iterations** from checkpoint files 3. **Resume from next incomplete iteration** 4. **Skip re-running** already completed work ### Usage Examples #### Standard Usage (with incremental saving) ```python optimal_rank, optimal_lambda, results_df, bootstrap_raw = split_half_bootstrap_grid_search( tensor=tensor_3d, species_sample_map=species_sample_map, rank_range=[5, 10, 15, 20, 25, 30, 35], lambda_values=[0.0, 0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0], n_bootstrap=10, output_dir='output/bootstrap_grid_search', # REQUIRED for saving resume=True # Default: load existing checkpoints ) ``` #### Resume After Interruption If the process crashes at iteration 6 out of 10: ```python # Simply re-run the same command optimal_rank, optimal_lambda, results_df, bootstrap_raw = split_half_bootstrap_grid_search( tensor=tensor_3d, species_sample_map=species_sample_map, rank_range=[5, 10, 15, 20, 25, 30, 35], lambda_values=[0.0, 0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0], n_bootstrap=10, output_dir='output/bootstrap_grid_search', resume=True # Will find iterations 0-5 and resume from 6 ) ``` Output: ``` Checking for existing checkpoint files... ✓ Found 6 completed iterations: [0, 1, 2, 3, 4, 5] → Resuming from iteration 6 Running bootstrap iterations 6 to 9... ``` #### Start Fresh (delete existing checkpoints) ```python optimal_rank, optimal_lambda, results_df, bootstrap_raw = split_half_bootstrap_grid_search( tensor=tensor_3d, species_sample_map=species_sample_map, rank_range=[5, 10, 15, 20, 25, 30, 35], lambda_values=[0.0, 0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0], n_bootstrap=10, output_dir='output/bootstrap_grid_search', resume=False # Delete existing checkpoints and start over ) ``` #### No Saving (in-memory only) ```python # For quick testing or small runs optimal_rank, optimal_lambda, results_df, bootstrap_raw = split_half_bootstrap_grid_search( tensor=tensor_3d, species_sample_map=species_sample_map, rank_range=[5, 10], # Small test lambda_values=[0.0, 0.1], n_bootstrap=2, output_dir=None, # No saving resume=False ) ``` ## Checkpoint File Format Each checkpoint file is a CSV containing results for one bootstrap iteration: ```csv bootstrap_iter,boot_seed,rank,lambda,sse_test,fms,n_train,n_test,converged 0,42,5,0.0,1.234e+08,0.8523,13,14,True 0,42,5,0.01,1.245e+08,0.8501,13,14,True 0,42,5,0.05,1.267e+08,0.8456,13,14,True ... ``` - **bootstrap_iter**: Bootstrap iteration number (0-indexed) - **boot_seed**: Random seed for this iteration - **rank**: Tensor rank tested - **lambda**: Sparsity parameter tested - **sse_test**: Sum of squared errors on test set - **fms**: Factor match score between train and test models - **n_train**: Number of training samples - **n_test**: Number of test samples - **converged**: Whether model optimization converged ## Benefits ### Time Savings - **Without checkpoints**: If crash at iteration 8/10, lose ~80% of 8 hours = 6.4 hours wasted - **With checkpoints**: Resume from iteration 8, only lose work since last checkpoint (~30-60 minutes) ### Monitoring Progress - Check checkpoint files to see how many iterations completed - View intermediate results even while analysis is running - Estimate remaining runtime based on completed iterations ### Flexibility - Can manually stop analysis after sufficient iterations - Can run partial analysis (e.g., 5 iterations) and extend later - Can load and visualize results without re-running ## Technical Details ### File Naming Convention - Checkpoint files use zero-padded 3-digit iteration numbers: `iter000`, `iter001`, ..., `iter009` - This ensures proper alphabetical sorting - Supports up to 999 bootstrap iterations ### Thread Safety - Current implementation assumes single-process execution - Each iteration is atomic (fully completes before saving) - No concurrent writes to same checkpoint file ### Storage Requirements - Each checkpoint file: ~50-200 KB (depending on grid size) - 10 iterations × 8 ranks × 8 lambdas = ~1-2 MB total for checkpoints - Final aggregated file is similar size ### Backward Compatibility - If `output_dir=None`, function works exactly as before (in-memory only) - Existing code without `output_dir` parameter continues to work - No changes required to visualization chunk ## Troubleshooting ### "All iterations already completed" If you see this message, all checkpoint files exist. The function will skip computation and load results from files. **Solution:** - If you want to re-run, set `resume=False` - Or manually delete checkpoint files ### Checkpoint files from different parameter settings If you change `rank_range` or `lambda_values` but checkpoint files exist from previous run: **Solution:** - Set `resume=False` to delete old checkpoints - Or use a different `output_dir` - Or manually delete old checkpoint files ### Partial checkpoint from interrupted iteration If the process crashes mid-iteration (before checkpoint is saved), that iteration's partial work is lost. But all previous complete iterations are safe. **Solution:** - Simply re-run with `resume=True` - The incomplete iteration will be re-computed - All previous iterations will be loaded from checkpoints ## Updated Execution Workflow ### In 13.00-multiomics-barnacle.Rmd The execution chunk now automatically: 1. Creates the output directory 2. Passes `output_dir` to the function (enables saving) 3. Sets `resume=True` (default, loads existing checkpoints) 4. Saves final aggregated results after completion No additional user action required - incremental saving is automatic! ## Performance Impact ### Checkpoint Saving Time - Writing each checkpoint file: <1 second - Negligible compared to model fitting time (minutes per model) - Total overhead: <10 seconds for 10 iterations ### Loading Checkpoints - Reading 10 checkpoint files: ~1 second - Only happens at function start - Minimal impact on total runtime ## Comparison to Alternatives ### Alternative 1: Manual Periodic Saving - User must remember to stop and save - Risk of forgetting - Difficult to resume mid-iteration ### Alternative 2: Save After Every Model - Too frequent I/O operations - Hundreds of checkpoint files - Harder to manage and resume ### Alternative 3: No Saving (Original) - Simplest implementation - **All progress lost on interruption** - Not viable for long-running analyses **Our approach (save after each iteration) is the sweet spot:** - Frequent enough to minimize data loss - Infrequent enough to minimize overhead - Natural resumption points (complete iterations) ## Future Enhancements Possible future improvements: 1. **Progress Bar**: Add tqdm progress bar showing iteration completion 2. **Time Estimation**: Predict total runtime based on completed iterations 3. **Cloud Storage**: Save checkpoints to S3/cloud for distributed computing 4. **Compression**: Compress checkpoint files to save disk space 5. **Metadata**: Add timestamp and system info to checkpoints 6. **Validation**: Verify checkpoint integrity before loading ## References - Blaskowski, S. (2024). Barnacle: Sparse CP Tensor Decomposition. PhD Dissertation. - User request: "Since the runtimes for this are long, please review the bootstrapping process and ensure data are saved along the way" - Implementation date: December 2025 ## Summary **Key Takeaway:** The bootstrap analysis now automatically saves progress after each iteration, allowing you to resume interrupted runs without losing hours of computation. Simply re-run the same command with `resume=True` (default) and it will continue from where it left off.