# Bootstrap Incremental Saving, Resume Capability & Parallel Execution

## Overview

The `split_half_bootstrap_grid_search()` function now includes:
1. **Incremental saving** - prevents data loss during long runs
2. **Resume capability** - continue from last checkpoint if interrupted  
3. **Parallel execution** - use multiple CPU cores to speed up grid search

## Features

### 1. Parallel Execution (NEW)

**Grid search within each bootstrap iteration runs in parallel** using `joblib`:

```python
optimal_rank, optimal_lambda, results_df, bootstrap_raw = split_half_bootstrap_grid_search(
    tensor=tensor_3d,
    species_sample_map=species_sample_map,
    rank_range=[5, 10, 15, 20, 25, 30, 35],
    lambda_values=[0.0, 0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0],
    n_bootstrap=10,
    n_jobs=-1  # Use all available cores (default)
)
```

**Performance Impact:**
- `n_jobs=-1`: Use all CPU cores (fastest, recommended)
- `n_jobs=8`: Use 8 specific cores
- `n_jobs=1`: Sequential execution (no parallelization)

**Speed Improvement Example:**
- Sequential: 56 models × 5 min = 280 minutes (~4.7 hours) per bootstrap iteration
- Parallel (8 cores): 280 / 8 = 35 minutes per bootstrap iteration
- **~7x faster** with 8 cores

### 2. Incremental Checkpointing

After each bootstrap iteration completes (all rank × lambda combinations), results are immediately saved to a checkpoint file:

```
output_dir/bootstrap_grid_search/
├── bootstrap_checkpoint_iter000.csv  (iteration 0)
├── bootstrap_checkpoint_iter001.csv  (iteration 1)
├── bootstrap_checkpoint_iter002.csv  (iteration 2)
...
├── bootstrap_checkpoint_iter009.csv  (iteration 9)
├── bootstrap_aggregated_results.csv  (final aggregated results)
├── bootstrap_raw_iterations.csv      (all raw data consolidated)
└── optimal_parameters.json            (selected parameters)
```

### Resume Logic

When the function starts:

1. **Check for existing checkpoints** in the output directory
2. **Load completed iterations** from checkpoint files
3. **Resume from next incomplete iteration**
4. **Skip re-running** already completed work

### Usage Examples

#### Standard Usage (with incremental saving)

```python
optimal_rank, optimal_lambda, results_df, bootstrap_raw = split_half_bootstrap_grid_search(
    tensor=tensor_3d,
    species_sample_map=species_sample_map,
    rank_range=[5, 10, 15, 20, 25, 30, 35],
    lambda_values=[0.0, 0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0],
    n_bootstrap=10,
    output_dir='output/bootstrap_grid_search',  # REQUIRED for saving
    resume=True  # Default: load existing checkpoints
)
```

#### Resume After Interruption

If the process crashes at iteration 6 out of 10:

```python
# Simply re-run the same command
optimal_rank, optimal_lambda, results_df, bootstrap_raw = split_half_bootstrap_grid_search(
    tensor=tensor_3d,
    species_sample_map=species_sample_map,
    rank_range=[5, 10, 15, 20, 25, 30, 35],
    lambda_values=[0.0, 0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0],
    n_bootstrap=10,
    output_dir='output/bootstrap_grid_search',
    resume=True  # Will find iterations 0-5 and resume from 6
)
```

Output:
```
Checking for existing checkpoint files...
  ✓ Found 6 completed iterations: [0, 1, 2, 3, 4, 5]
  → Resuming from iteration 6

Running bootstrap iterations 6 to 9...
```

#### Start Fresh (delete existing checkpoints)

```python
optimal_rank, optimal_lambda, results_df, bootstrap_raw = split_half_bootstrap_grid_search(
    tensor=tensor_3d,
    species_sample_map=species_sample_map,
    rank_range=[5, 10, 15, 20, 25, 30, 35],
    lambda_values=[0.0, 0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0],
    n_bootstrap=10,
    output_dir='output/bootstrap_grid_search',
    resume=False  # Delete existing checkpoints and start over
)
```

#### No Saving (in-memory only)

```python
# For quick testing or small runs
optimal_rank, optimal_lambda, results_df, bootstrap_raw = split_half_bootstrap_grid_search(
    tensor=tensor_3d,
    species_sample_map=species_sample_map,
    rank_range=[5, 10],  # Small test
    lambda_values=[0.0, 0.1],
    n_bootstrap=2,
    output_dir=None,  # No saving
    resume=False
)
```

## Checkpoint File Format

Each checkpoint file is a CSV containing results for one bootstrap iteration:

```csv
bootstrap_iter,boot_seed,rank,lambda,sse_test,fms,n_train,n_test,converged
0,42,5,0.0,1.234e+08,0.8523,13,14,True
0,42,5,0.01,1.245e+08,0.8501,13,14,True
0,42,5,0.05,1.267e+08,0.8456,13,14,True
...
```

- **bootstrap_iter**: Bootstrap iteration number (0-indexed)
- **boot_seed**: Random seed for this iteration
- **rank**: Tensor rank tested
- **lambda**: Sparsity parameter tested
- **sse_test**: Sum of squared errors on test set
- **fms**: Factor match score between train and test models
- **n_train**: Number of training samples
- **n_test**: Number of test samples
- **converged**: Whether model optimization converged

## Benefits

### Time Savings
- **Without checkpoints**: If crash at iteration 8/10, lose ~80% of 8 hours = 6.4 hours wasted
- **With checkpoints**: Resume from iteration 8, only lose work since last checkpoint (~30-60 minutes)

### Monitoring Progress
- Check checkpoint files to see how many iterations completed
- View intermediate results even while analysis is running
- Estimate remaining runtime based on completed iterations

### Flexibility
- Can manually stop analysis after sufficient iterations
- Can run partial analysis (e.g., 5 iterations) and extend later
- Can load and visualize results without re-running

## Technical Details

### File Naming Convention
- Checkpoint files use zero-padded 3-digit iteration numbers: `iter000`, `iter001`, ..., `iter009`
- This ensures proper alphabetical sorting
- Supports up to 999 bootstrap iterations

### Thread Safety
- Current implementation assumes single-process execution
- Each iteration is atomic (fully completes before saving)
- No concurrent writes to same checkpoint file

### Storage Requirements
- Each checkpoint file: ~50-200 KB (depending on grid size)
- 10 iterations × 8 ranks × 8 lambdas = ~1-2 MB total for checkpoints
- Final aggregated file is similar size

### Backward Compatibility
- If `output_dir=None`, function works exactly as before (in-memory only)
- Existing code without `output_dir` parameter continues to work
- No changes required to visualization chunk

## Troubleshooting

### "All iterations already completed"
If you see this message, all checkpoint files exist. The function will skip computation and load results from files.

**Solution:**
- If you want to re-run, set `resume=False`
- Or manually delete checkpoint files

### Checkpoint files from different parameter settings
If you change `rank_range` or `lambda_values` but checkpoint files exist from previous run:

**Solution:**
- Set `resume=False` to delete old checkpoints
- Or use a different `output_dir`
- Or manually delete old checkpoint files

### Partial checkpoint from interrupted iteration
If the process crashes mid-iteration (before checkpoint is saved), that iteration's partial work is lost. But all previous complete iterations are safe.

**Solution:**
- Simply re-run with `resume=True`
- The incomplete iteration will be re-computed
- All previous iterations will be loaded from checkpoints

## Updated Execution Workflow

### In 13.00-multiomics-barnacle.Rmd

The execution chunk now automatically:

1. Creates the output directory
2. Passes `output_dir` to the function (enables saving)
3. Sets `resume=True` (default, loads existing checkpoints)
4. Saves final aggregated results after completion

No additional user action required - incremental saving is automatic!

## Performance Impact

### Checkpoint Saving Time
- Writing each checkpoint file: <1 second
- Negligible compared to model fitting time (minutes per model)
- Total overhead: <10 seconds for 10 iterations

### Loading Checkpoints
- Reading 10 checkpoint files: ~1 second
- Only happens at function start
- Minimal impact on total runtime

## Comparison to Alternatives

### Alternative 1: Manual Periodic Saving
- User must remember to stop and save
- Risk of forgetting
- Difficult to resume mid-iteration

### Alternative 2: Save After Every Model
- Too frequent I/O operations
- Hundreds of checkpoint files
- Harder to manage and resume

### Alternative 3: No Saving (Original)
- Simplest implementation
- **All progress lost on interruption**
- Not viable for long-running analyses

**Our approach (save after each iteration) is the sweet spot:**
- Frequent enough to minimize data loss
- Infrequent enough to minimize overhead
- Natural resumption points (complete iterations)

## Future Enhancements

Possible future improvements:

1. **Progress Bar**: Add tqdm progress bar showing iteration completion
2. **Time Estimation**: Predict total runtime based on completed iterations
3. **Cloud Storage**: Save checkpoints to S3/cloud for distributed computing
4. **Compression**: Compress checkpoint files to save disk space
5. **Metadata**: Add timestamp and system info to checkpoints
6. **Validation**: Verify checkpoint integrity before loading

## References

- Blaskowski, S. (2024). Barnacle: Sparse CP Tensor Decomposition. PhD Dissertation.
- User request: "Since the runtimes for this are long, please review the bootstrapping process and ensure data are saved along the way"
- Implementation date: December 2025

## Summary

**Key Takeaway:** The bootstrap analysis now automatically saves progress after each iteration, allowing you to resume interrupted runs without losing hours of computation. Simply re-run the same command with `resume=True` (default) and it will continue from where it left off.