# Getting Started with STDM

Welcome to STDM! This guide will help you get up and running with tensor decomposition analysis of your gene expression data.

## Table of Contents

1.  [Quick Setup](#quick-setup)
2.  [Your First Analysis](#your-first-analysis)
3.  [Understanding Your Results](#understanding-your-results)
4.  [Common Workflows](#common-workflows)
5.  [Troubleshooting](#troubleshooting)

## Quick Setup {#quick-setup}

### Option 1: Automated Setup (Recommended)

``` bash
cd /Users/sr320/Documents/GitHub/workflow-stdm-05
./setup.sh
```

This script will: - ✅ Detect and use `uv` (or fall back to `pip`) - ✅ Create a virtual environment - ✅ Install all dependencies - ✅ Verify the installation - ✅ Check for data files

### Option 2: Manual Setup

``` bash
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and activate virtual environment
uv venv
source .venv/bin/activate  # macOS/Linux
# .venv\Scripts\activate   # Windows

# Install package
uv pip install -e .
```

### Verify Installation

``` bash
# Test import
python -c "import stdm; print(f'STDM v{stdm.__version__} is ready!')"

# Test CLI
stdm-fit --help
```

## Your First Analysis {#your-first-analysis}

### Step 1: Explore Your Data

``` python
from stdm import DataLoader
from pathlib import Path

# Load your data
data_dir = Path("data/merge-vst")
loader = DataLoader(data_dir)
data = loader.load_merged_data()

print(f"Loaded: {data.shape[0]} genes × {data.shape[1]} samples")
print(f"Sample names: {data.columns[:5].tolist()}")
```

**Expected Output**:

```         
Loaded merged data: 10225 genes × 112 samples
Loaded: 10225 genes × 112 samples
Sample names: ['ACR-139-TP1', 'ACR-139-TP2', 'ACR-139-TP3', 'ACR-139-TP4', 'ACR-145-TP1']
```

### Step 2: Preprocess

``` python
# Filter low expression genes
data_filtered = loader.filter_low_expression(
    data, 
    min_expression=5.0,  # Minimum expression value
    min_samples=10       # Must be expressed in at least 10 samples
)

# Normalize (z-score across samples)
data_normalized = loader.normalize_data(
    data_filtered,
    method="zscore",
    axis=1  # Normalize each gene across samples
)

print(f"After filtering: {data_normalized.shape[0]} genes")
```

### Step 3: Build Tensor

``` python
from stdm import TensorBuilder

builder = TensorBuilder()
tensor = builder.from_merged_data(data_normalized)
metadata = builder.get_metadata()

print(f"\nTensor shape: {tensor.shape}")
print(f"Dimensions:")
print(f"  Genes: {len(metadata['genes'])}")
print(f"  Individuals: {len(metadata['individuals'])}")
print(f"  Time points: {metadata['timepoints']}")
print(f"  Species: {metadata['species']}")
```

**Expected Output**:

```         
Built tensor with shape: (9792, 28, 4, 3)
Dimensions:
  Genes: 9792
  Individuals: 28
  Time points: [1, 2, 3, 4]
  Species: ['ACR', 'POC', 'POR']
```

### Step 4: Fit Model

``` python
from stdm import CPDecomposition

# Create model
model = CPDecomposition(
    rank=10,           # Number of components
    random_state=42    # For reproducibility
)

# Fit to data
model.fit(tensor, n_iter_max=100, verbose=True)

# Get results
factors = model.get_factors()
error = model.compute_reconstruction_error(tensor)

print(f"\nReconstruction error: {error:.6f}")
```

### Step 5: Visualize

``` python
from stdm.visualization import (
    plot_components,
    plot_temporal_patterns,
    plot_species_comparison
)

# Plot all components
plot_components(
    factors,
    mode_names=['Genes', 'Individuals', 'Time', 'Species'],
    n_components=5
)

# Plot temporal patterns
plot_temporal_patterns(
    factors[2],  # Time mode factor
    metadata['timepoints']
)

# Compare species
plot_species_comparison(
    factors[3],  # Species mode factor
    metadata['species']
)
```

## Understanding Your Results {#understanding-your-results}

### What are the Factors?

After decomposition, you get 4 factor matrices:

1.  **Gene Factor** (shape: genes × rank)
    -   Shows which genes contribute to each component
    -   High values = gene is strongly associated with that component
2.  **Individual Factor** (shape: individuals × rank)
    -   Shows expression patterns for each individual
    -   Captures individual-specific variation
3.  **Time Factor** (shape: time points × rank)
    -   Shows temporal dynamics of each component
    -   Reveals patterns that change over time
4.  **Species Factor** (shape: species × rank)
    -   Shows species-specific patterns
    -   Identifies conserved vs. species-specific components

### Interpreting Components

Each component (column in the factors) represents a pattern in the data:

``` python
# Analyze component 0
print("Component 1 Analysis:")

# Temporal pattern
time_pattern = factors[2][:, 0]
print(f"\nTemporal pattern:")
for tp, val in zip(metadata['timepoints'], time_pattern):
    print(f"  TP{tp}: {val:.3f}")

# Species pattern
species_pattern = factors[3][:, 0]
print(f"\nSpecies pattern:")
for species, val in zip(metadata['species'], species_pattern):
    print(f"  {species}: {val:.3f}")

# Top genes
from stdm.utils import identify_marker_genes
markers = identify_marker_genes(
    factors[0],
    metadata['genes'],
    component_idx=0,
    top_n=10
)
print(f"\nTop 10 marker genes:")
for gene, loading in markers:
    print(f"  {gene}: {loading:.4f}")
```

## Common Workflows {#common-workflows}

### Workflow 1: Find Optimal Rank

``` python
from stdm.visualization import plot_reconstruction_error

# Test different ranks
ranks = [5, 10, 15, 20, 25, 30]
errors = {}

for rank in ranks:
    model = CPDecomposition(rank=rank, random_state=42)
    model.fit(tensor, n_iter_max=100, verbose=False)
    errors[rank] = model.compute_reconstruction_error(tensor)

# Plot results
plot_reconstruction_error(errors, save_path="results/rank_comparison.png")

# Choose rank where error plateaus (elbow method)
```

### Workflow 2: Compare CP vs Tucker

``` python
from stdm import TuckerDecomposition

# CP decomposition
cp_model = CPDecomposition(rank=10)
cp_model.fit(tensor, n_iter_max=100, verbose=False)
cp_error = cp_model.compute_reconstruction_error(tensor)

# Tucker decomposition
tucker_model = TuckerDecomposition(rank=(100, 20, 4, 3))
tucker_model.fit(tensor, n_iter_max=50, verbose=False)
tucker_error = tucker_model.compute_reconstruction_error(tensor)

print(f"CP error: {cp_error:.6f}")
print(f"Tucker error: {tucker_error:.6f}")
```

### Workflow 3: Analyze Separate Species

``` python
# Load separate species data
loader = DataLoader("data/separate-log")
species_data = loader.load_separate_data()

# Build tensor with gene alignment
builder = TensorBuilder()
tensor = builder.from_separate_data(
    species_data,
    align_genes=True  # Use common genes only
)

# Fit model
model = CPDecomposition(rank=10)
model.fit(tensor)
```

### Workflow 4: Export Results

``` python
from stdm.utils import save_results

# Save all factors and metadata
save_results(
    factors,
    metadata,
    output_dir="results/my_analysis",
    prefix="cp_rank10"
)

# This creates:
# - cp_rank10_factor_genes.csv
# - cp_rank10_factor_individuals.csv
# - cp_rank10_factor_timepoints.csv
# - cp_rank10_factor_species.csv
# - cp_rank10_metadata.json
```

## Using the Command-Line Interface

For quick analyses without writing code:

``` bash
# Basic analysis
stdm-fit \
    --data-path data/separate-log \
    --data-type merged \
    --method cp \
    --rank 15 \
    --output-dir results/cli_test

# With normalization and non-negativity
stdm-fit \
    --data-dir data/merge-vst \
    --data-type merged \
    --method cp \
    --rank 15 \
    --normalize \
    --non-negative \
    --output-dir results/cli_test_nn

# Tucker decomposition
stdm-fit \
    --data-dir data/separate-log \
    --data-type separate \
    --method tucker \
    --rank 10 \
    --normalize \
    --output-dir results/tucker_test
```

## Running Example Scripts

The `examples/` directory contains complete analysis pipelines:

``` bash
# Activate virtual environment first
source .venv/bin/activate

# Run merged data analysis
python examples/fit_merged_data.py

# Run separate species analysis  
python examples/fit_separate_data.py

# Compare different models and ranks
python examples/compare_models.py
```

## Working with Jupyter Notebooks

``` bash
# Install Jupyter (if not already installed)
uv pip install jupyter

# Start Jupyter
jupyter notebook

# Open examples/quickstart_notebook.ipynb
```

## Troubleshooting {#troubleshooting}

### Issue: "Module not found" error

``` bash
# Make sure virtual environment is activated
source .venv/bin/activate

# Reinstall package
uv pip install -e .
```

### Issue: TensorLy convergence warnings

``` python
# Increase iterations
model.fit(tensor, n_iter_max=200)

# Try different initialization
model.fit(tensor, init='svd')
```

### Issue: Out of memory

``` python
# Filter more aggressively
data = loader.filter_low_expression(data, min_expression=10.0)

# Use lower rank
model = CPDecomposition(rank=5)  # instead of rank=20
```

### Issue: Poor reconstruction

``` python
# Try non-negative decomposition
model = CPDecomposition(rank=10, non_negative=True)

# Increase rank
model = CPDecomposition(rank=20)

# Try Tucker instead of CP
model = TuckerDecomposition(rank=(100, 20, 4, 3))
```

## Next Steps

1.  **Read the full documentation**: [README.md](README.md)
2.  **Study the examples**: Explore `examples/` directory
3.  **Run tests**: `pytest tests/ -v`
4.  **Check package summary**: [PACKAGE_SUMMARY.md](PACKAGE_SUMMARY.md)
5.  **Read installation guide**: [INSTALL.md](INSTALL.md)

## Getting Help

-   **Documentation**: README.md, INSTALL.md, PACKAGE_SUMMARY.md
-   **Examples**: examples/ directory
-   **Tests**: tests/ directory for usage patterns
-   **GitHub Issues**: Report bugs or request features
-   **Email**: [your.email\@example.com](mailto:your.email@example.com){.email}

## Quick Reference

### Common Imports

``` python
# Core functionality
from stdm import DataLoader, TensorBuilder, CPDecomposition, TuckerDecomposition

# Visualization
from stdm.visualization import (
    plot_components,
    plot_temporal_patterns,
    plot_species_comparison,
    plot_gene_loadings
)

# Utilities
from stdm.utils import (
    save_results,
    compute_explained_variance,
    identify_marker_genes
)
```

### Typical Analysis Pipeline

``` python
# 1. Load data
loader = DataLoader("data/merge-vst")
data = loader.load_merged_data()

# 2. Preprocess
data = loader.filter_low_expression(data)
data = loader.normalize_data(data, method="zscore")

# 3. Build tensor
builder = TensorBuilder()
tensor = builder.from_merged_data(data)

# 4. Fit model
model = CPDecomposition(rank=10)
model.fit(tensor)

# 5. Analyze
factors = model.get_factors()
error = model.compute_reconstruction_error(tensor)

# 6. Visualize
plot_components(factors)

# 7. Save
save_results(factors, builder.get_metadata(), "results/")
```

Happy analyzing! 🧬📊