# Context-Dependent Analysis Output Tables: Data Flow & Schema

**Purpose:** This document explains how the output CSV tables generated by `subset_context_dependent_analysis.py` are produced: inputs → processing stages → result objects → saved tables. Includes lineage, column definitions, and a Mermaid diagram for quick reference.

---

## 1. Source Inputs

| Data Type | File | Shape (features × samples) | Notes |
|-----------|------|-----------------------------|-------|
| Gene expression | `data/cleaned_datasets/gene_counts_cleaned.csv` | Loaded at runtime | Target variables for modeling |
| lncRNA expression | `data/cleaned_datasets/lncrna_counts_cleaned.csv` | Loaded at runtime | Regulator candidates |
| miRNA expression | `data/cleaned_datasets/mirna_counts_cleaned.csv` | Loaded at runtime | Regulator candidates & context stratifier |
| DNA methylation | `data/cleaned_datasets/wgbs_counts_cleaned.csv` | Loaded at runtime | Regulator candidates |

All four are loaded in parallel, sample alignment is verified, and NumPy arrays are cached.

---

## 2. High-Level Pipeline

1. Load & align datasets (`load_datasets`, `verify_sample_alignment`).
2. For each analysis branch (4 total) sample a subset of genes.
3. Select top regulators per gene via Pearson correlation (p < 0.1, rank by \|corr\|).
4. Fit nested linear models (single → pair → +interaction) or multi-regulator models.
5. Compute ΔR² improvements and F-test p-values.
6. Derive context metrics (conditional correlations under high vs low regulator2 strata) where applicable.
7. Infer context-specific networks via correlation under subsetted samples (context masks).
8. Aggregate results into `self.results['context_dependent']`.
9. Persist tables in `output/subset_context_dependent_analysis_/tables/` via `save_context_results()`.

---

## 3. Output Tables Generated

| Table | Source Branch | Granularity | Row Meaning |
|-------|---------------|-------------|-------------|
| `methylation_mirna_context.csv` | Methylation–miRNA interaction modeling | Gene–(methylation site, miRNA) triple | Context-dependent test for methylation effect modulated by miRNA |
| `lncrna_mirna_context.csv` | lncRNA–miRNA interaction modeling | Gene–(lncRNA, miRNA) triple | Context-dependent test for lncRNA effect modulated by miRNA |
| `multi_way_interactions.csv` | Multi-regulator modeling | Gene | Improvement from adding all selected regulators vs single baseline |
| `{context}_gene_mirna_correlations.csv` | Context networks | Gene–miRNA pair | Pearson correlation inside context subset |
| `{context}_gene_lncrna_correlations.csv` | Context networks | Gene–lncRNA pair | Pearson correlation inside context subset |
| `{context}_gene_methylation_correlations.csv` | Context networks | Gene–CpG site pair | Pearson correlation inside context subset |

Contexts currently: `high_mirna`, `low_mirna`, `high_methylation`.

---

## 4. Column Definitions

### Interaction Context Tables (`*_mirna_context.csv`)

| Column | Meaning |
|--------|---------|
| `interaction_type` | Either `methylation_mirna` or `lncrna_mirna` |
| `target` | Gene ID analyzed |
| `regulator1` | Primary regulator (methylation site or lncRNA) label prefixed in code (e.g., `methylation_`) |
| `regulator2` | miRNA regulator label (`mirna_`) |
| `r2_regulator1_only` | R² of model: target ~ regulator1 |
| `r2_regulator1_regulator2` | R² of model: target ~ regulator1 + regulator2 |
| `r2_with_interaction` | R² of model: target ~ regulator1 + regulator2 + regulator1*regulator2 |
| `improvement_from_regulator2` | ΔR² adding regulator2 (second minus first model) |
| `improvement_from_interaction` | ΔR² adding interaction term (third minus second model) |
| `context_dependent` | Boolean: interaction term F-test p < 0.05 |
| `corr_high_regulator2` | Pearson(gene, regulator1) in samples where standardized regulator2 > 0.5 |
| `corr_low_regulator2` | Same but regulator2 < -0.5 |
| `context_strength` | \|corr_high - corr_low\| |
| `context_direction` | `positive` if corr_high > corr_low, else `negative` (or `NA`) |
| `interaction_p_value` | F-test p-value for interaction term |

### Multi-Way Table (`multi_way_interactions.csv`)

| Column | Meaning |
|--------|---------|
| `gene` | Gene ID |
| `n_regulators` | Count of selected top regulators combined across types |
| `r2_base_model` | R² using only first regulator (ordered insertion) |
| `r2_full_model` | R² using all regulators |
| `improvement_from_regulators` | ΔR² (full - base) |
| `has_significant_interactions` | F-test p < 0.05 comparing base vs full |
| `interaction_p_value` | F-test p-value |
| `regulator_types` | JSON-like list of regulator feature labels |

### Context Correlation Tables (`{context}_gene__correlations.csv`)

| Column | Meaning |
|--------|---------|
| `gene` | Gene ID |
| `mirna` / `lncrna` / `cpg` | Regulator ID depending on file type |
| `correlation` | Pearson correlation in context subset |
| `p_value` | Associated Pearson p-value (filtered at < 0.1) |

---

## 5. Detailed Processing Branches

### 5.1 Methylation–miRNA & lncRNA–miRNA Branches

- Sample up to 500 genes.
- For each gene: compute Pearson correlations to all candidate regulators; retain those with p < 0.1 and rank by \|corr\|.
- Select top 10 per regulator class; restrict to top 5 × top 5 combinations (25 pairs) to model.
- Build standardized dataframe with interaction term and fit nested linear models.
- Perform nested F-test for interaction significance.
- Stratify by regulator2 z-score to obtain conditional correlations and context metrics.

### 5.2 Multi-Way Branch

- Sample up to 200 genes.
- For each gene: pick top (≈ n_regulators/3 per class) regulators across miRNA, lncRNA, methylation.
- Fit baseline (first regulator) vs full (all) linear model, compute ΔR², F-test.

### 5.3 Context-Specific Network Branch

- Define contexts using sentinel feature z-score thresholds.
  - `high_mirna` / `low_mirna`: first miRNA feature as sentinel.
  - `high_methylation`: first methylation feature as sentinel.
- Subset samples meeting context mask; require ≥10 samples else skip.
- Within subset: compute gene–regulator correlations for sampled genes (≤200) vs every feature of each regulator class, retain p < 0.1.

---

## 6. Execution to Persistence

| Stage | Function(s) | Output Artifact |
|-------|-------------|-----------------|
| Analysis orchestration | `run_complete_context_analysis` | Populates `self.results` |
| Branch computations | `parallel_analyze_*`, `_process_*` helpers | In-memory DataFrames/lists |
| Result consolidation | `analyze_context_dependent_regulation` | `self.results['context_dependent']` |
| Table writing | `save_context_results` | CSV files in timestamped `tables/` directory |
| Reporting | `generate_markdown_report`, `generate_html_report` | Markdown & HTML summaries |

---

## 7. Mermaid Data Flow Diagram

```mermaid
flowchart TD
    A["Cleaned Input CSVs<br/>Gene / lncRNA / miRNA / Methylation"] --> B[Load + Align Samples]
    B --> C[Precompute Arrays]
    C --> D1[Methylation–miRNA Branch]
    C --> D2[lncRNA–miRNA Branch]
    C --> D3[Multi-Way Branch]
    C --> D4[Context Networks Branch]

    subgraph S1[Methylation–miRNA]
        D1 --> E1["Top Correlations (gene vs regulators)"] --> F1[Pair Grid]
        F1 --> G1[Nested Linear Models + F-test]
        G1 --> H1[Stratified Correlations]
        H1 --> I1[methylation_mirna_context.csv]
    end

    subgraph S2[lncRNA–miRNA]
        D2 --> E2[Top Correlations] --> F2[Pair Grid]
        F2 --> G2[Models + Interaction F-test]
        G2 --> H2[Stratified Metrics]
        H2 --> I2[lncrna_mirna_context.csv]
    end

    subgraph S3[Multi-Way]
        D3 --> E3[Select Top Mixed Regulators]
        E3 --> F3[Baseline vs Full Model]
        F3 --> G3[F-test]
        G3 --> I3[multi_way_interactions.csv]
    end

    subgraph S4[Context Networks]
        D4 --> E4[Define Context Masks]
        E4 --> F4[Subset Samples]
        F4 --> G4[Within-Context Correlations]
        G4 --> I4[Context Correlation Tables]
    end

    I1 --> R["results['context_dependent']"]
    I2 --> R
    I3 --> R
    I4 --> R
    R --> W["save_context_results() -> CSVs"]
```

---

## 8. Key Statistical Decisions

- Correlation filter: p < 0.1 pre-selection.
- Interaction significance: F-test p < 0.05.
- Context stratification thresholds: standardized regulator2 > 0.5 vs < -0.5.
- Minimum samples for context network correlation: > 5 per pair, ≥10 per context overall.

---

## 9. Potential Enhancements

| Area | Idea |
|------|------|
| Multiple testing | Apply FDR (Benjamini–Hochberg) to interaction_p_value columns |
| Effect robustness | Bootstrap ΔR² or conditional correlations |
| Context definition | Derive contexts from clustering or metadata instead of sentinel feature |
| Model complexity | Incorporate elastic net for regulator selection |
| Output schema | Add standardized effect sizes (partial correlations) |

---

## 10. Quick Reference (Cheat Sheet)

- Interaction tables = per gene × (primary regulator, miRNA) pairs with nested model stats.
- Multi-way table = per gene aggregated multi-regulator improvement.
- Context correlation tables = raw within-context gene–regulator associations.
- All tables live in a timestamped `output/.../tables/` directory.

---

**Document generated manually to accompany `subset_context_dependent_analysis.py`.**
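
The regulator pre-selection step from Section 5.1 (keep Pearson p < 0.1, rank by \|corr\|, take the top k) could look roughly like the sketch below. This is an illustration, not the script's code: `top_regulators`, the toy arrays, and the `mirna_a`-style labels are hypothetical, and the feature matrix is assumed to be laid out as features × samples.

```python
import numpy as np
from scipy.stats import pearsonr


def top_regulators(gene, regulators, names, p_cutoff=0.1, top_k=10):
    """Return up to top_k (name, r, p) tuples, strongest |r| first.

    gene:        1-D array of expression values, one per sample.
    regulators:  2-D array, assumed shape (n_features, n_samples).
    """
    hits = []
    for name, row in zip(names, regulators):
        r, p = pearsonr(gene, row)
        if p < p_cutoff:  # pre-selection filter
            hits.append((name, float(r), float(p)))
    hits.sort(key=lambda h: abs(h[1]), reverse=True)  # rank by |corr|
    return hits[:top_k]


# Toy data (hypothetical): two regulators track the gene, two are noise.
rng = np.random.default_rng(1)
gene = rng.normal(size=40)
regulators = np.vstack([
    gene + 0.1 * rng.normal(size=40),   # strong positive association
    -gene + 0.5 * rng.normal(size=40),  # weaker negative association
    rng.normal(size=40),
    rng.normal(size=40),
])
names = ["mirna_a", "mirna_b", "mirna_c", "mirna_d"]

selected = top_regulators(gene, regulators, names, top_k=2)
for name, r, p in selected:
    print(f"{name}: r={r:+.3f}, p={p:.2e}")
```

Ranking on the absolute correlation keeps strong negative regulators in the candidate list, which matters for miRNAs, whose expected effect on targets is often repressive.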
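
The nested-model columns in Section 4 (`r2_regulator1_only` through `interaction_p_value`) can be reproduced with plain least squares, as in this minimal sketch. It assumes all three variables are z-scored before the product term is formed and that the interaction F-test compares the second and third models with one numerator degree of freedom; the script may differ on both points. `r_squared`, `nested_interaction_test`, and the simulated data are hypothetical.

```python
import numpy as np
from scipy import stats


def r_squared(y, predictors):
    """R² of an ordinary least-squares fit of y on predictors plus an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


def _zscore(v):
    return (v - v.mean()) / v.std()


def nested_interaction_test(y, reg1, reg2, alpha=0.05):
    """Fit the three nested models and F-test the interaction term."""
    y, reg1, reg2 = _zscore(y), _zscore(reg1), _zscore(reg2)
    r2_single = r_squared(y, [reg1])
    r2_pair = r_squared(y, [reg1, reg2])
    r2_inter = r_squared(y, [reg1, reg2, reg1 * reg2])
    # Full model has 4 parameters (intercept + 3 slopes); 1 extra vs the pair model.
    df_resid = len(y) - 4
    f_stat = (r2_inter - r2_pair) / ((1.0 - r2_inter) / df_resid)
    p_value = float(stats.f.sf(f_stat, 1, df_resid))
    return {
        "r2_regulator1_only": r2_single,
        "r2_regulator1_regulator2": r2_pair,
        "r2_with_interaction": r2_inter,
        "improvement_from_regulator2": r2_pair - r2_single,
        "improvement_from_interaction": r2_inter - r2_pair,
        "interaction_p_value": p_value,
        "context_dependent": bool(p_value < alpha),
    }


# Simulated example (hypothetical): the gene depends on the reg1 x reg2 product.
rng = np.random.default_rng(0)
reg1, reg2 = rng.normal(size=60), rng.normal(size=60)
y = 0.5 * reg1 + 0.3 * reg2 + 0.8 * reg1 * reg2 + 0.3 * rng.normal(size=60)

result = nested_interaction_test(y, reg1, reg2)
for key, value in result.items():
    print(f"{key}: {value}")
```

Because the models are nested, R² cannot decrease from one model to the next, so both improvement columns are non-negative up to floating-point error.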