# Synthetic Data Generation for Barnacle

This directory includes a script to generate synthetic gene expression data for testing the Barnacle sparse CP decomposition.

## Purpose

The synthetic data generator creates gene expression datasets with known underlying structure, which:
- Helps validate the tensor decomposition approach
- Provides data with high probability of convergence
- Allows testing different parameter configurations
- Enables comparison between recovered and ground-truth factors

## Data Characteristics

The synthetic data mimics the real barnacle multi-species dataset:

- **Three species**: `apul` (Acropora pulchra), `peve` (Porites evermanni), `ptua` (Pocillopora tuahiniensis)
- **Common genes**: All species share the same set of ortholog groups (OG_00001, OG_00002, etc.)
- **Samples**: Configurable number of samples per species (default: 10)
- **Timepoints**: 4 timepoints (TP1-TP4) per sample
- **Format**: CSV files with `group_id` column and columns named `SAMPLE.TP#`

## Synthetic Data Structure

The synthetic data is generated from a controlled CP decomposition:

1. **Gene factors**: Sparse matrix where each component involves a subset of genes
2. **Sample factors**: Mixed component weights for each sample (biological variation)
3. **Time factors**: Distinct temporal patterns (increasing, decreasing, peaked, etc.)
4. **Noise**: Gaussian noise added to simulate measurement error
5. **Non-negativity**: All values are non-negative (like real expression data)

This structure ensures:
- Clear component separation for easier convergence
- Realistic sparsity patterns
- Interpretable temporal dynamics

## Usage

### Generate Synthetic Data

```bash
python M-multi-species/scripts/14-barnacle/generate_synthetic_data.py \
  --output-dir M-multi-species/output/14-barnacle-synthetic \
  --n-genes 10223 \
  --n-samples-per-species 10 \
  --n-components 5 \
  --noise-level 0.1 \
  --seed 42
```

### Parameters

- `--output-dir`: Directory to save synthetic CSV files (required)
- `--n-genes`: Number of genes (default: 10223, matching real data)
- `--n-samples-per-species`: Samples per species (default: 10)
- `--n-components`: Number of underlying components (default: 5)
- `--noise-level`: Noise level as fraction of signal (default: 0.1)
- `--seed`: Random seed for reproducibility (default: 42)

### Run Tensor Decomposition on Synthetic Data

After generating the data, run the standard pipeline:

```bash
python M-multi-species/scripts/14-barnacle/build_tensor_and_run.py \
  --input-dir M-multi-species/output/14-barnacle-synthetic \
  --output-dir M-multi-species/output/14-barnacle-synthetic-results \
  --rank 5 \
  --lambda-gene 0.1 \
  --lambda-sample 0.1 \
  --lambda-time 0.05 \
  --max-iter 1000 \
  --tol 1e-5 \
  --seed 42
```

## Outputs

The script generates:

1. **Three species CSV files**:
   - `apul_normalized_expression.csv`
   - `peve_normalized_expression.csv`
   - `ptua_normalized_expression.csv`

2. **Ground truth factors** (in `ground_truth/` subdirectory):
   - `true_gene_factors.csv`: True gene loadings
   - `true_sample_factors.csv`: True sample loadings
   - `true_time_factors.csv`: True temporal patterns

You can compare the recovered factors from the decomposition with these ground truth factors to validate the method.

## Example: Full Workflow

```bash
# 1. Generate synthetic data
python M-multi-species/scripts/14-barnacle/generate_synthetic_data.py \
  --output-dir M-multi-species/output/14-barnacle-synthetic \
  --n-genes 1000 \
  --n-samples-per-species 8 \
  --n-components 5 \
  --noise-level 0.05 \
  --seed 123

# 2. Run tensor decomposition
python M-multi-species/scripts/14-barnacle/build_tensor_and_run.py \
  --input-dir M-multi-species/output/14-barnacle-synthetic \
  --output-dir M-multi-species/output/14-barnacle-synthetic-results \
  --rank 5 \
  --lambda-gene 0.1 \
  --lambda-sample 0.1 \
  --lambda-time 0.05 \
  --max-iter 2000 \
  --tol 1e-4 \
  --seed 42

# 3. Check convergence in results
cat M-multi-species/output/14-barnacle-synthetic-results/barnacle_factors/metadata.json
```

## Why This Data Converges Well

The synthetic data is designed for convergence because:

1. **Clear structure**: Components have distinct temporal patterns
2. **Sparsity**: Most genes are zero for most components
3. **Non-negativity**: Data and factors are non-negative (natural for gene expression)
4. **Sufficient signal**: Signal-to-noise ratio is high enough
5. **No missing values**: Complete data for all samples and timepoints
6. **Realistic scale**: Values are in a similar range to real normalized expression data

This makes it an ideal test case for parameter tuning and validation.