{ "cells": [ { "cell_type": "markdown", "id": "e260a5a1-686a-491d-8b6c-dc7fcb17787a", "metadata": {}, "source": [ "# Step 4: Fit Barnacle model to your data\n", "\n", "Use this notebook to fit the Barnacle model to your normalized data tensor. Fitting Barnacle to data requires tuning two model parameters: \n", "1. `R` -- the number of components\n", "1. `lambda` -- the sparsity parameter\n", "\n", "There are many methods for fitting model parameters. The cross-validated parameter search strategy here is the method used to fit Barnacle to metatranscriptomic data in the [original Barnacle manuscript](https://doi.org/10.1101/2024.07.15.603627). This strategy aims to reduce resource costs by fitting `R` first and then `lambda`, rather than both parameters simultaneously. It also depends on sample replicates for performing cross validation. If your data does not have sample replicates, you might instead consider trying split-half analysis for parameter selection, but this method is not supported by this notebook.\n", "\n", "Please refer to the notebook [3-tensorize-data.ipynb](https://github.com/blasks/barnacle-boilerplate/blob/main/3-tensorize-data.ipynb) for proper formatting of your input data tensor. Note that in order to facilitate bootstrapping, sample ID and replicate ID are combined into a unique identifier called `'sample_replicate_id'` (how creative). This will be the name of the third mode of your tensor. The sample ID and replicate ID information is preserved in separate metadata arrays in the dataset. The script will use this information to shuffle replicates between bootstraps, which enables more robust parameter selection, and confidence intervals in the final model." ] }, { "cell_type": "code", "execution_count": 1, "id": "14e7b2fc", "metadata": {}, "outputs": [], "source": [ "# imports\n", "\n", "import os\n", "import pandas as pd\n", "import seaborn as sns\n", "import tomli_w\n", "import warnings\n", "import xarray as xr\n", "\n", "from matplotlib import pyplot as plt\n", "\n", "# suppress unhelpful warnings\n", "warnings.simplefilter(action='ignore', category=FutureWarning)\n", "warnings.simplefilter(action='ignore', category=RuntimeWarning)\n", "\n", "# set color palette\n", "sns.set_palette(sns.color_palette([\n", " '#9B5DE5', '#FFAC69', '#00C9AE', '#FD3F92', '#0F0A0A', '#959AB1', '#FFDB66', '#FFB1CA', '#63B9FF', '#4F1DD7'\n", "]))\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "9ef448ea", "metadata": {}, "outputs": [], "source": [ "# USER INPUTS -- edit these variables as needed\n", "\n", "datapath = 'data/data-tensor.nc' # filepath of your input data tensor \n", "outdir = 'data/barnacle' # output directory where produced files will be saved\n", "sparse_modes = 1 # How many modes sparsity will be applied to (0/1/2/3) -- if 1 or 2, should be the first 1/2 modes\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "da00f1e7-fc34-4c74-a31d-52e18de160fc", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sparsity will be applied to mode 1-KOfam\n" ] }, { "data": { "text/html": [ "
<xarray.Dataset> Size: 20MB\n", "Dimensions: (KOfam: 10829, phylum: 8, sample_replicate_id: 28)\n", "Coordinates:\n", " * KOfam (KOfam) <U6 260kB 'K00001' 'K00002' ... 'K26180'\n", " * phylum (phylum) <U16 512B 'Bacillariophyta' ... 'Pelagophyc...\n", " * sample_replicate_id (sample_replicate_id) <U17 2kB 'G3.UW.ALL.L25S1_A' ....\n", "Data variables:\n", " data (KOfam, phylum, sample_replicate_id) float64 19MB -9...\n", " sample_id (sample_replicate_id) <U15 2kB 'G3.UW.ALL.L25S1' ......\n", " replicate_id (sample_replicate_id) <U1 112B 'A' 'B' 'C' ... 'B' 'C'
\n", " | lambda | \n", "rank | \n", "sse | \n", "sse_sem | \n", "fms | \n", "fms_sem | \n", "mean_cluster_size | \n", "n_bootstraps | \n", "
---|---|---|---|---|---|---|---|---|
0 | \n", "0.000 | \n", "5.0 | \n", "0.952651 | \n", "0.002233 | \n", "0.259790 | \n", "0.026131 | \n", "20069.000000 | \n", "10 | \n", "
1 | \n", "0.001 | \n", "5.0 | \n", "0.948178 | \n", "0.001931 | \n", "0.343389 | \n", "0.030911 | \n", "19601.833332 | \n", "10 | \n", "
2 | \n", "0.010 | \n", "5.0 | \n", "0.950024 | \n", "0.002104 | \n", "0.389775 | \n", "0.032836 | \n", "17733.666667 | \n", "10 | \n", "
3 | \n", "0.100 | \n", "5.0 | \n", "0.949620 | \n", "0.002078 | \n", "0.432266 | \n", "0.024360 | \n", "13076.800001 | \n", "10 | \n", "
4 | \n", "1.000 | \n", "5.0 | \n", "0.951096 | \n", "0.001609 | \n", "0.484748 | \n", "0.022969 | \n", "6634.006666 | \n", "10 | \n", "
5 | \n", "10.000 | \n", "5.0 | \n", "0.990993 | \n", "0.000211 | \n", "0.414750 | \n", "0.023995 | \n", "194.473334 | \n", "10 | \n", "