---
title: "13.00-multiomics-stdm-02"
author: "Sam White"
date: "2025-10-15"
output: 
  github_document:
    toc: true
    number_sections: true
  bookdown::html_document2:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
  html_document:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
bibliography: references.bib
citeproc: true
---

# BACKGROUND

This notebook performs tensor decomposition analysis on multi-species gene expression time-series data using Steven's [`workflow-stdm-02`](https://github.com/sr320/workflow-stdm-02) (GitHub repo) tensor decomposition project.

This package was auto-generated based on instructions provided by Steven. He created a few different packages and wanted them tested. The goal of this notebook is primarily to see how easily this package runs and whether the output is useful (i.e., useful visualizations, human-readable results tables, etc.).

## Input Data

The `workflow-stdm-02` package processes a VST (Variance Stabilizing Transformation) counts matrix containing multi-species gene expression data across time points. The input file structure is:

- **Input file**: `vst_counts_matrix.csv` (9,802 rows × 113 columns)
- **Genes**: 8,820 ortholog groups (OG_XXXXX identifiers)
- **Species**: 3 coral species (ACR, POR, POC)
- **Time points**: 4 time points (TP1, TP2, TP3, TP4)
- **Samples**: Multiple biological replicates per species-timepoint combination
- **Data format**: VST-normalized gene expression values

The package performs tensor decomposition to identify latent factors that capture patterns across the gene × species × time dimensions, using sparse CP (CANDECOMP/PARAFAC) decomposition with L1 regularization.
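To make the model concrete: a rank-*R* CP decomposition approximates the genes × species × time tensor *X* as a sum of *R* rank-one outer products, one per latent factor. The NumPy sketch below is purely illustrative (toy dimensions, random factor matrices), not the package's actual code:

```{python cp-sketch, eval=FALSE}
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_species, n_time, rank = 50, 3, 4, 5

# Factor matrices: A (genes x rank), B (species x rank), C (time x rank)
A = rng.random((n_genes, rank))
B = rng.random((n_species, rank))
C = rng.random((n_time, rank))

# CP reconstruction: X_hat[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r]
X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
print(X_hat.shape)  # (50, 3, 4)
```

The L1 penalties applied by `workflow-stdm-02` push entries of A, B, and C toward zero, so each factor loads on a sparse subset of genes.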
## Output Structure

The workflow generates comprehensive output organized into four main directories:

```
output/
├── tensor/                              # 3D tensor construction
│   ├── tensor.npz                       # Compressed 3D tensor (genes × species × timepoints)
│   ├── genes.csv                        # Gene ID mappings and indices
│   ├── species.csv                      # Species code mappings
│   ├── timepoints.csv                   # Timepoint mappings
│   └── tensor_shapes.json               # Tensor dimensions and metadata
├── optimization/                        # Hyperparameter optimization results
│   ├── best_params.json                 # Optimal rank and L1 penalties
│   ├── top_trials.csv                   # Best performing parameter combinations
│   ├── trials.csv                       # All optimization trials
│   └── optuna_study.pkl                 # Optuna study object
├── fit/                                 # Model fitting results
│   ├── gene_factors.csv                 # Gene factor loadings (A matrix)
│   ├── species_factors.csv              # Species factor loadings (B matrix)
│   ├── time_factors.csv                 # Time factor loadings (C matrix)
│   ├── loss_history.csv                 # Convergence history
│   └── fit_metrics.json                 # Model performance metrics
└── export/                              # Human-readable results with annotations
    ├── gene_factors_with_ids.csv        # Gene factors with OG identifiers
    ├── species_factors_with_codes.csv   # Species factors with species codes
    ├── time_factors_with_labels.csv     # Time factors with timepoint labels
    ├── decomposition_results.json       # Complete results summary
    └── factor_heatmaps.png              # Factor visualization plots
```

Key output metrics include reconstruction error, explained variance, sparsity levels, and convergence status. The final decomposition identifies latent factors that capture coordinated patterns of gene expression across species and time.
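For reference, these metrics have standard definitions; the sketch below shows the usual formulas (not verified against the package's own implementation), using a synthetic tensor and a stand-in reconstruction:

```{python metrics-sketch, eval=FALSE}
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((20, 3, 4))               # observed tensor (toy data)
X_hat = X + 0.01 * rng.random(X.shape)   # stand-in for the model reconstruction

# Relative reconstruction error: ||X - X_hat||_F / ||X||_F
rel_error = np.linalg.norm(X - X_hat) / np.linalg.norm(X)

# Explained variance: 1 - ||X - X_hat||^2 / ||X||^2
explained_var = 1 - np.linalg.norm(X - X_hat) ** 2 / np.linalg.norm(X) ** 2

# Sparsity of a factor matrix: fraction of (near-)zero entries
A = rng.random((20, 5))
A[A < 0.5] = 0.0
sparsity = np.mean(np.isclose(A, 0.0))

print(rel_error, explained_var, sparsity)
```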
# SETUP

## Libraries

```{r setup, include=FALSE}
library(knitr)
library(tidyverse)
library(reticulate)

knitr::opts_chunk$set(
  echo = TRUE,        # Display code chunks
  eval = FALSE,       # Do not evaluate chunks by default (individual chunks set eval=TRUE)
  results = "hold",   # Hold outputs and show them after the full code chunk
  collapse = FALSE,   # Keep code and output in separate blocks
  warning = FALSE,    # Hide warnings
  message = FALSE,    # Hide messages
  comment = "##"      # Prefix output lines with '##' so output is visually distinct
)
```

## Set R variables

```{r R-variables, eval=TRUE}
# OUTPUT DIRECTORY
output_dir <- "../output/13.00-multiomics-stdm"

# INPUT FILE(S)

# PYTHON ENVIRONMENT - Use the sr320-stdm-02 uv environment
python_path <- "/home/shared/16TB_HDD_01/sam/gitrepos/sr320/workflow-stdm-02/sr320-stdm-02-env/bin/python"
```

## Load project Python environment

The project uses the `sr320-stdm-02-env` uv virtual environment with the `new-tensor` package and all required dependencies. If this is successful, the Python path should show the `sr320-stdm-02-env` environment.

```{r load-python-env, eval=TRUE}
# Use the sr320-stdm-02-env uv environment
library(reticulate)

# Set Python path to the sr320-stdm-02-env environment
use_python(python_path, required = TRUE)

# Show final configuration
py_config()
```

# TENSOR DECOMPOSITION WORKFLOW

## Step 1: Build Tensor

Build a 3D tensor from the VST counts matrix with replicate aggregation. The input data is already VST-normalized, so no additional normalization is applied.
```{python build-tensor, eval=TRUE}
import subprocess
import os

# Define paths
input_file = "../output/14-pca-orthologs/vst_counts_matrix.csv"
output_base = r.output_dir  # Use R variable

# Create output directory structure
tensor_dir = os.path.join(output_base, "tensor")
os.makedirs(tensor_dir, exist_ok=True)

# Build tensor using the new-tensor CLI
# All parameters with their values (defaults noted in comments)
cmd = [
    "/home/shared/16TB_HDD_01/sam/gitrepos/sr320/workflow-stdm-02/sr320-stdm-02-env/bin/new-tensor",
    "build-tensor",
    "--input-path", input_file,           # Required: path to VST counts matrix
    "--output-dir", tensor_dir,           # Custom output directory
    "--aggregation-method", "median",     # Default: median (alternatives: mean, trimmed_mean)
    "--normalization", "none",            # Custom: none (default: zscore; alternatives: minmax, log1p)
    "--min-expression", "1.0",            # Default: 1.0 (minimum expression threshold)
    "--min-variance-percentile", "10.0"   # Default: 10.0 (variance percentile threshold)
]

print("Building tensor...")
print(f"Command: {' '.join(cmd)}")

result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
if result.stderr:
    print("Errors/Warnings:", result.stderr)

print(f"\nTensor saved to: {tensor_dir}")
```

## Step 2: Optimize Hyperparameters

Run hyperparameter optimization using Optuna to find the optimal rank and L1 penalty parameters.
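Before launching a long optimization run, the tensor built in Step 1 can be sanity-checked. The helper below is a hedged sketch: the array key stored inside `tensor.npz` is not documented here, so it simply takes the first array in the archive (an assumption; list `npz.files` to find the real key), and the example uses a synthetic archive rather than the real output file:

```{python check-tensor-sketch, eval=FALSE}
import os
import numpy as np

def check_tensor(npz_path):
    """Basic sanity checks on a tensor .npz archive: stored array names,
    tensor shape, and NaN count. Assumes the first stored array is the tensor."""
    npz = np.load(npz_path)
    tensor = npz[npz.files[0]]
    return {
        "arrays": list(npz.files),
        "shape": tensor.shape,
        "n_nan": int(np.isnan(tensor).sum()),
    }

# Example with a synthetic archive; the real call would point at
# os.path.join(tensor_dir, "tensor.npz"):
np.savez_compressed("demo_tensor.npz", tensor=np.zeros((8, 3, 4)))
print(check_tensor("demo_tensor.npz"))
os.remove("demo_tensor.npz")
```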
```{python optimize-hyperparameters, eval=TRUE}
import subprocess
import os

# Define paths
tensor_file = os.path.join(r.output_dir, "tensor", "tensor.npz")
optimization_dir = os.path.join(r.output_dir, "optimization")
os.makedirs(optimization_dir, exist_ok=True)

# Optimize hyperparameters using the new-tensor CLI
# All parameters with their values (defaults noted in comments)
cmd = [
    "/home/shared/16TB_HDD_01/sam/gitrepos/sr320/workflow-stdm-02/sr320-stdm-02-env/bin/new-tensor",
    "optimize",
    "--tensor-path", tensor_file,       # Required: path to tensor file
    "--output-dir", optimization_dir,   # Custom output directory
    "--n-trials", "100",                # Default: 100 (number of optimization trials)
    "--rank-min", "4",                  # Custom: 4 (default: 2; minimum rank to try)
    "--rank-max", "30",                 # Custom: 30 (default: 12; maximum rank to try)
    "--lambda-min", "0.0001",           # Default: 0.0001 (minimum L1 penalty)
    "--lambda-max", "1.0"               # Default: 1.0 (maximum L1 penalty)
]

print("Optimizing hyperparameters...")
print(f"Command: {' '.join(cmd)}")

result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
if result.stderr:
    print("Errors/Warnings:", result.stderr)

print(f"\nOptimization results saved to: {optimization_dir}")
print("\nCheck output/optimization/best_params.json for optimal rank.")
```

## Step 3: Review Optimization Results

Load and display the best hyperparameters found during optimization.
```{python review-optimization, eval=TRUE}
import json
import os

import pandas as pd

# Load best parameters
best_params_file = os.path.join(r.output_dir, "optimization", "best_params.json")
with open(best_params_file, 'r') as f:
    best_params = json.load(f)

print("Best Hyperparameters:")
print(json.dumps(best_params, indent=2))

# Load top trials
top_trials_file = os.path.join(r.output_dir, "optimization", "top_trials.csv")
if os.path.exists(top_trials_file):
    top_trials = pd.read_csv(top_trials_file)
    print("\n\nTop 10 Trials:")
    print(top_trials.head(10))
```

## Step 4: Fit Model

Fit the CP decomposition model with the optimal parameters identified during optimization. The optimal rank is automatically loaded from the optimization results.

```{python fit-model, eval=TRUE}
import subprocess
import os
import json

# Load optimal rank from optimization results
best_params_file = os.path.join(r.output_dir, "optimization", "best_params.json")
with open(best_params_file, 'r') as f:
    best_params = json.load(f)

optimal_rank = str(best_params['rank'])
print(f"Using optimal rank from optimization: {optimal_rank}")

# Define paths
tensor_file = os.path.join(r.output_dir, "tensor", "tensor.npz")
fit_dir = os.path.join(r.output_dir, "fit")
os.makedirs(fit_dir, exist_ok=True)

# Fit model using the new-tensor CLI with optimal rank
# All parameters with their values (defaults noted in comments)
cmd = [
    "/home/shared/16TB_HDD_01/sam/gitrepos/sr320/workflow-stdm-02/sr320-stdm-02-env/bin/new-tensor",
    "fit",
    "--tensor-path", tensor_file,   # Required: path to tensor file
    "--output-dir", fit_dir,        # Custom output directory
    "--rank", optimal_rank,         # Required: automatically loaded from optimization results
    "--lambda-a", "0.1",            # Default: 0.1 (L1 penalty for gene factors)
    "--lambda-b", "0.1",            # Default: 0.1 (L1 penalty for species factors)
    "--lambda-c", "0.1",            # Default: 0.1 (L1 penalty for time factors)
    "--non-negative",               # Custom: enforces non-negativity (default: no-non-negative)
    "--max-iter", "100"             # Default: 100 (maximum iterations)
]

print("Fitting model...")
print(f"Command: {' '.join(cmd)}")

result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
if result.stderr:
    print("Errors/Warnings:", result.stderr)

print(f"\nFit results saved to: {fit_dir}")
```

## Step 5: Export Results

Export the decomposition results with gene IDs, species codes, and time labels.

```{python export-results, eval=TRUE}
import subprocess
import os

# Define paths
fit_dir = os.path.join(r.output_dir, "fit")
mappings_dir = os.path.join(r.output_dir, "tensor")
export_dir = os.path.join(r.output_dir, "export")
os.makedirs(export_dir, exist_ok=True)

# Export results using the new-tensor CLI
# All parameters with their values (defaults noted in comments)
cmd = [
    "/home/shared/16TB_HDD_01/sam/gitrepos/sr320/workflow-stdm-02/sr320-stdm-02-env/bin/new-tensor",
    "export",
    "--fit-dir", fit_dir,             # Required: directory containing fit results
    "--mappings-dir", mappings_dir,   # Required: directory containing tensor mappings
    "--output-dir", export_dir,       # Custom output directory
    "--plot-heatmaps"                 # Default: plot-heatmaps (generate factor heatmaps)
]

print("Exporting results...")
print(f"Command: {' '.join(cmd)}")

result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
if result.stderr:
    print("Errors/Warnings:", result.stderr)

print(f"\nExported results saved to: {export_dir}")
```

## Step 6: Summary of Results

Load and display key results from the decomposition analysis.
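One useful check on the fit is whether the loss actually flattened out before `--max-iter` was reached, rather than the run being cut off. The sketch below assumes `loss_history.csv` has a numeric `loss` column (an assumption about the file's header, not verified against the package), and demonstrates the check on synthetic data:

```{python convergence-sketch, eval=FALSE}
import pandas as pd

def converged(loss_history: pd.DataFrame, loss_col: str = "loss",
              rel_tol: float = 1e-4) -> bool:
    """Treat the fit as converged if the final relative loss drop is tiny."""
    losses = loss_history[loss_col].to_numpy()
    if len(losses) < 2:
        return False
    return abs(losses[-2] - losses[-1]) / abs(losses[-2]) < rel_tol

# Synthetic example; the real call would read output/fit/loss_history.csv:
demo = pd.DataFrame({"iteration": range(5),
                     "loss": [10.0, 4.0, 2.0, 1.9999, 1.99989]})
print(converged(demo))  # True: the last step barely moved the loss
```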
```{python summary-results, eval=TRUE}
import json
import os

import pandas as pd

export_dir = os.path.join(r.output_dir, "export")

# Load decomposition results summary
results_file = os.path.join(export_dir, "decomposition_results.json")
with open(results_file, 'r') as f:
    results = json.load(f)

print("Decomposition Results Summary:")
print(json.dumps(results, indent=2))

# Load gene factors
gene_factors_file = os.path.join(export_dir, "gene_factors_with_ids.csv")
gene_factors = pd.read_csv(gene_factors_file)
print(f"\n\nGene Factors Shape: {gene_factors.shape}")
print("\nFirst few gene factors:")
print(gene_factors.head())

# Load species factors
species_factors_file = os.path.join(export_dir, "species_factors_with_codes.csv")
species_factors = pd.read_csv(species_factors_file)
print(f"\n\nSpecies Factors Shape: {species_factors.shape}")
print("\nSpecies factors:")
print(species_factors)

# Load time factors
time_factors_file = os.path.join(export_dir, "time_factors_with_labels.csv")
time_factors = pd.read_csv(time_factors_file)
print(f"\n\nTime Factors Shape: {time_factors.shape}")
print("\nTime factors:")
print(time_factors)
```
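A natural follow-up to the summary above is ranking genes by loading to see which orthologs drive each latent factor. The sketch below assumes the exported gene-factor table has one gene-ID column plus one numeric column per factor (the column names `gene_id`, `factor_1`, etc. are illustrative, not confirmed against the export format):

```{python top-genes-sketch, eval=FALSE}
import pandas as pd

def top_genes_per_factor(gene_factors: pd.DataFrame, id_col: str,
                         n_top: int = 5) -> dict:
    """Return the n_top gene IDs with the largest loading in each factor column."""
    factor_cols = [c for c in gene_factors.columns if c != id_col]
    return {
        col: gene_factors.nlargest(n_top, col)[id_col].tolist()
        for col in factor_cols
    }

# Synthetic example; the real call would pass the gene_factors dataframe
# loaded from gene_factors_with_ids.csv:
demo = pd.DataFrame({
    "gene_id": ["OG_00001", "OG_00002", "OG_00003", "OG_00004"],
    "factor_1": [0.9, 0.1, 0.0, 0.4],
    "factor_2": [0.0, 0.8, 0.7, 0.1],
})
print(top_genes_per_factor(demo, "gene_id", n_top=2))
# {'factor_1': ['OG_00001', 'OG_00004'], 'factor_2': ['OG_00002', 'OG_00003']}
```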