# 1SE Rule for Rank and Lambda Selection

## Problem: SSE Decreases with Increasing Rank

**Observation:** SSE monotonically decreases as rank increases. This is **expected behavior**, but it makes selecting the "optimal" rank difficult.

**Why this happens:**

- Higher rank = more parameters = better fit to training data
- SSE will always be lowest at the highest rank tested
- This doesn't mean the highest rank is best (risk of overfitting)

**The Challenge:** How do we select a rank that balances:

1. **Model fit** (low SSE)
2. **Model complexity** (low rank)
3. **Generalization** (stable across train/test splits)

## Solution: The 1SE Rule

The **"one standard error rule"** is a parsimony principle from statistical learning (Breiman et al., 1984; Hastie et al., 2009):

> *"Select the simplest model whose performance is within one standard error of the best model."*

### Applied to Rank Selection

**Step 1:** Find the minimum mean SSE across bootstrap iterations

```
Rank 5:  SSE = 1.50e8 ± 0.10e8
Rank 10: SSE = 1.20e8 ± 0.08e8
Rank 15: SSE = 1.10e8 ± 0.07e8
Rank 20: SSE = 1.05e8 ± 0.09e8
Rank 25: SSE = 1.03e8 ± 0.12e8
Rank 30: SSE = 1.01e8 ± 0.15e8  ← Minimum, but higher uncertainty
```

**Step 2:** Calculate the 1SE threshold

```
Min SSE = 1.01e8 (at rank 30)
SE = 0.15e8
1SE Threshold = 1.01e8 + 0.15e8 = 1.16e8
```

**Step 3:** Find all ranks within the threshold

```
Rank 10: 1.20e8 > 1.16e8  ✗ Above threshold
Rank 15: 1.10e8 ≤ 1.16e8  ✓ Within 1SE
Rank 20: 1.05e8 ≤ 1.16e8  ✓ Within 1SE
Rank 25: 1.03e8 ≤ 1.16e8  ✓ Within 1SE
Rank 30: 1.01e8 ≤ 1.16e8  ✓ Within 1SE
```

**Step 4:** Select the smallest rank within the threshold

```
✓ Selected Rank = 15 (most parsimonious within 1SE)
```

### Why This Works

1. **Statistical justification:** Models within 1SE are not significantly different in performance
2. **Parsimony:** Prefer simpler (lower-rank) models when performance is comparable
3. **Generalization:** Reduces overfitting risk
4.
**Stability:** Accounts for uncertainty in cross-validation estimates

### Example Scenarios

#### Scenario A: Clear Winner (Rare)

```
Rank 10: 1.50e8 ± 0.05e8
Rank 15: 1.10e8 ± 0.04e8  ← Min (1SE = 1.14e8)
Rank 20: 1.25e8 ± 0.06e8  ✗ Above threshold

→ Select Rank 15 (unambiguous best)
```

#### Scenario B: Monotonic Decrease (Common in Your Data)

```
Rank 5:  1.50e8 ± 0.10e8  ✗ Above threshold
Rank 10: 1.30e8 ± 0.08e8  ✗ Above threshold
Rank 15: 1.15e8 ± 0.07e8  ✓ Within 1SE
Rank 20: 1.10e8 ± 0.09e8  ✓ Within 1SE
Rank 25: 1.08e8 ± 0.11e8  ✓ Within 1SE
Rank 30: 1.06e8 ± 0.14e8  ← Min (1SE = 1.20e8)

→ Select Rank 15 (smallest within 1SE, despite not having the absolute minimum)
```

#### Scenario C: High Variance at High Ranks

```
Rank 10: 1.30e8 ± 0.05e8  ✗ Above threshold
Rank 15: 1.10e8 ± 0.06e8  ✓ Within 1SE
Rank 20: 1.08e8 ± 0.12e8  ✓ Within 1SE (but high variance)
Rank 25: 1.05e8 ± 0.18e8  ← Min (1SE = 1.23e8, very high variance)

→ Select Rank 15 (both parsimonious AND more stable)
```

## Applied to Lambda Selection

The **same principle** applies to the regularization parameter:

**Step 1:** At the optimal rank, find the maximum mean FMS

```
λ=0.0:  FMS = 0.85 ± 0.05  ← Maximum
λ=0.01: FMS = 0.84 ± 0.04
λ=0.05: FMS = 0.82 ± 0.04
λ=0.1:  FMS = 0.78 ± 0.06
λ=0.5:  FMS = 0.65 ± 0.08
```

**Step 2:** Calculate the 1SE threshold

```
Max FMS = 0.85
SE = 0.05
1SE Threshold = 0.85 - 0.05 = 0.80
```

**Step 3:** Find all lambdas within the threshold

```
λ=0.0:  0.85 ≥ 0.80  ✓
λ=0.01: 0.84 ≥ 0.80  ✓
λ=0.05: 0.82 ≥ 0.80  ✓
λ=0.1:  0.78 < 0.80  ✗
```

**Step 4:** Select the maximum lambda within the threshold

```
✓ Selected λ = 0.05 (most sparse while maintaining factor stability)
```

### Why This Works for Lambda

1. **Sparsity goal:** Higher λ → more zeros in the factors
2. **Stability requirement:** FMS measures factor consistency
3. **Trade-off:** Maximum sparsity without sacrificing factor quality
4. **Interpretation:** Factors with fewer non-zero elements are easier to interpret

## Visualization Updates

The updated rank selection plot now shows:

1.
**Mean SSE with error bars** (bootstrap uncertainty)
2. **1SE threshold line** (horizontal orange dashed line)
3. **Minimum SSE rank** (vertical blue dotted line)
4. **Selected rank by 1SE rule** (vertical red dashed line)
5. **Green circles** around ranks within 1SE
6. **Coefficient of variation panel** (shows stability)

This makes it visually clear:

- Which ranks are statistically comparable
- Why a lower rank was selected despite higher ranks having lower SSE
- How stable each rank is across bootstrap iterations

## Theoretical Background

### Statistical Learning Theory

The 1SE rule comes from **cross-validation theory** (Breiman et al., 1984).

**Key insight:** Cross-validation estimates have uncertainty. Models whose CV scores differ by less than 1 SE are not significantly different.

**Mathematical formulation:**

```
Model A is selected over Model B if:
1. CV_score(A) is within 1×SE(best_score) of best_score
2. Complexity(A) < Complexity(B)
```

### References

1. **Breiman, L., Friedman, J., Stone, C.J., & Olshen, R.A. (1984).** *Classification and Regression Trees.* Wadsworth. (Original CART algorithm with the 1SE rule)
2. **Hastie, T., Tibshirani, R., & Friedman, J. (2009).** *The Elements of Statistical Learning* (2nd ed.). Springer. (Section 7.10: Cross-Validation)
3. **James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013).** *An Introduction to Statistical Learning.* Springer.
(Section 5.1: Cross-Validation)

### Parsimony Principle

**Occam's Razor in statistics:**

> "When multiple models explain the data equally well, prefer the simplest."

**Applied here:**

- Rank = model complexity (more factors = more complex)
- Lambda = feature complexity (lower λ = denser factors)
- 1SE rule = operational definition of "equally well"

## Implementation Details

### Code Changes

**Before (minimum SSE):**

```python
optimal_rank_idx = lambda_zero_df['mean_sse'].idxmin()
optimal_rank = int(lambda_zero_df.loc[optimal_rank_idx, 'rank'])
```

**After (1SE rule):**

```python
# Find the minimum SSE and its standard error
min_sse_idx = lambda_zero_df['mean_sse'].idxmin()
min_sse = lambda_zero_df.loc[min_sse_idx, 'mean_sse']
min_sse_se = lambda_zero_df.loc[min_sse_idx, 'se_sse']

# Calculate the 1SE threshold
sse_1se_threshold = min_sse + min_sse_se

# Select the smallest rank within the threshold
within_1se = lambda_zero_df[lambda_zero_df['mean_sse'] <= sse_1se_threshold]
optimal_rank = int(within_1se.sort_values('rank').iloc[0]['rank'])
```

### Features Maintained

✅ **Parallel execution** (`n_jobs=-1` for all cores)
✅ **Incremental checkpointing** (saves after each bootstrap iteration)
✅ **Resume capability** (loads from checkpoints)
✅ **Bootstrap stability assessment** (CV, convergence rates)
✅ **Comprehensive visualization** (4 diagnostic plots)

### New Output

The terminal output now includes:

```
STAGE 1: RANK SELECTION (1SE RULE)
Criterion: Smallest rank within 1SE of minimum SSE at λ=0.0
================================================================================
Minimum SSE: 1.06e+08 ± 1.40e+07 (at rank=30)
1SE Threshold: 1.20e+08

Applying 1SE rule (parsimony principle):
Select smallest rank where SSE ≤ 1.20e+08

✓ OPTIMAL RANK (1SE rule): 15
  Mean SSE: 1.15e+08 ± 7.00e+06
  (smallest rank with SSE within 1SE of minimum)

Note: Rank 30 had the absolute minimum SSE, but rank 15 was selected
by the 1SE rule (more parsimonious)

Ranks within 1SE of minimum:
  rank  mean_sse     se_sse    n_converged
  15    1.15000e+08  7.00e+06  10
  20    1.10000e+08  9.00e+06  10
  25    1.08000e+08  1.10e+07  10
  30    1.06000e+08  1.40e+07  10
```

## Practical Recommendations

### When to Use the 1SE Rule

✅ **Use the 1SE rule when:**

- SSE monotonically decreases with rank
- You want parsimonious models
- High ranks show increased variance
- Interpretability is important
- Avoiding overfitting is a priority

❌ **Consider alternatives when:**

- There is a clear elbow point in the SSE curve
- A dramatic SSE improvement occurs at a specific rank
- External validation shows a different optimal rank
- Domain knowledge suggests a specific rank

### Interpreting Results

**If selected rank = minimum SSE rank:**

- Unambiguous winner
- The 1SE rule confirms the minimum-SSE choice

**If selected rank < minimum SSE rank:**

- Parsimony principle applied
- The selected rank is "good enough" but simpler
- This is the intended behavior!

**If most ranks are within 1SE:**

- High uncertainty in rank selection
- Consider: more bootstrap iterations, a different rank range, or examining data quality

### Troubleshooting

**Problem:** All ranks are within 1SE (rank 5 is selected)
**Solution:** SSE differences are too small relative to the uncertainty. Either:

1. Increase bootstrap iterations (more precise SE estimates)
2. Expand the rank range downward (test ranks 3 and 4)
3. Accept rank 5 as adequate

**Problem:** No ranks are within 1SE except the minimum
**Solution:** A clear winner was found. The 1SE rule reduces to the minimum-SSE rule in this case.

**Problem:** Very high SE at the optimal rank
**Solution:** The model is unstable across train/test splits. Consider:

1. A different rank
2. Data preprocessing (outliers, normalization)
3. More bootstrap iterations to confirm

## Summary

The **1SE rule** provides a principled, statistically justified method for selecting rank when SSE decreases monotonically:

1. **Addresses overfitting** by preferring simpler models
2. **Accounts for uncertainty** in bootstrap CV estimates
3. **Follows established precedent** from the statistical learning literature
4. **Maintains all existing features** (parallelization, checkpointing, resume)
5.
**Provides clear visualization** of selection rationale

**Bottom line:** Instead of always selecting the highest rank (lowest SSE), the 1SE rule selects the **simplest model that performs comparably**, which is more likely to generalize to new data.
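As a self-contained sketch, both selection rules described above (smallest rank within 1SE of the minimum SSE; largest λ within 1SE of the maximum FMS) can be written as two small helpers and run against the worked-example numbers. The column names (`rank`, `mean_sse`, `se_sse`, `lam`, `mean_fms`, `se_fms`) are illustrative assumptions and may differ from the actual pipeline's DataFrames.

```python
import pandas as pd

def select_rank_1se(df: pd.DataFrame) -> int:
    """Smallest rank whose mean SSE is within 1 SE of the minimum mean SSE."""
    min_idx = df['mean_sse'].idxmin()
    threshold = df.loc[min_idx, 'mean_sse'] + df.loc[min_idx, 'se_sse']
    within = df[df['mean_sse'] <= threshold]
    return int(within['rank'].min())

def select_lambda_1se(df: pd.DataFrame) -> float:
    """Largest lambda whose mean FMS is within 1 SE of the maximum mean FMS."""
    max_idx = df['mean_fms'].idxmax()
    threshold = df.loc[max_idx, 'mean_fms'] - df.loc[max_idx, 'se_fms']
    within = df[df['mean_fms'] >= threshold]
    return float(within['lam'].max())

# Numbers from the worked examples in this document
ranks = pd.DataFrame({
    'rank':     [5, 10, 15, 20, 25, 30],
    'mean_sse': [1.50e8, 1.20e8, 1.10e8, 1.05e8, 1.03e8, 1.01e8],
    'se_sse':   [0.10e8, 0.08e8, 0.07e8, 0.09e8, 0.12e8, 0.15e8],
})
lams = pd.DataFrame({
    'lam':      [0.0, 0.01, 0.05, 0.1, 0.5],
    'mean_fms': [0.85, 0.84, 0.82, 0.78, 0.65],
    'se_fms':   [0.05, 0.04, 0.04, 0.06, 0.08],
})

print(select_rank_1se(ranks))    # → 15
print(select_lambda_1se(lams))   # → 0.05
```

Note that both rules share one shape — compute the best score, widen it by one SE, then take the most parsimonious candidate inside that band — differing only in the direction of "best" (minimize SSE vs. maximize FMS) and of "parsimonious" (smallest rank vs. largest λ).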