{ "cells": [ { "cell_type": "markdown", "id": "45cc2c09-621d-4def-9eb8-feee9dd994b8", "metadata": {}, "source": [ "# Step 3: Build your data tensor\n", "\n", "Use this notebook to arrange your normalized data into tensor format using the [xarray](https://docs.xarray.dev/en/stable/) package.\n", "\n", "Your input data should already be normalized and saved in a CSV file in [tidy format](https://tidyr.tidyverse.org/articles/tidy-data.html). At a minimum, your input CSV should have five columns:\n", "1. A column that corresponds to the first mode of your tensor. In metatranscriptomic data this column might indicate gene ID.\n", "   - This first mode should generally be the longest in your tensor, and the one that corresponds to the variable you want clustered (e.g. genes in the case of metatranscriptomic data). The sparsity penalty (`lambda`) will be applied to this mode.\n", "1. A column that corresponds to the second mode of your tensor. In metatranscriptomic data this column might indicate taxon ID.\n", "1. A column that corresponds to the third mode of your tensor. This column should indicate sample ID.\n", "   - **IMPORTANT: Sample IDs should be identical for different replicates of the same sample condition (see example below).**\n", "1. A column that indicates the replicate ID of the sample.\n", "1. A column that corresponds to the normalized data. If you normalized your data with sctransform (as laid out in the Jupyter notebook [2-normalize-sctransform.ipynb](https://github.com/blasks/barnacle-boilerplate/blob/main/2-normalize-sctransform.ipynb)), this column will correspond to the residual.\n", "\n", "Here's a snippet of how an example CSV might be arranged:\n", "\n", "| gene_id | taxon_id | sample_id | replicate | residual |\n", "|---------|------------|-----------|-----------|----------|\n", "| K03839 | P. marinus | sample1 | A | 3.02 |\n", "| K03839 | P. marinus | sample1 | B | 3.31 |\n", "| K03839 | P. 
marinus | sample1 | C | 3.18 |\n", "| K03839 | P. marinus | sample2 | A | -1.24 |\n", "| ... | ... | ... | ... | ... |\n", "| K03320 | S. marinus | sample9 | C | 0.05 |\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "acdb6e9d-0e79-4c84-8aaf-337ddc34bf09", "metadata": {}, "outputs": [], "source": [ "# imports\n", "\n", "import itertools\n", "import os\n", "import pandas as pd\n", "import plotly.express as px\n", "import random\n", "import xarray as xr\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "6d325a04-d0af-4383-9fc2-89a789469dde", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "| | KOfam | phylum | sample_replicate_id | sample_id | replicate_id | data |\n",
"|---|---|---|---|---|---|---|\n",
"| 0 | K00001 | Bacillariophyta | G3.UW.ALL.L25S1_A | G3.UW.ALL.L25S1 | A | -9.222961e-13 |\n",
"| 1 | K00002 | Bacillariophyta | G3.UW.ALL.L25S1_A | G3.UW.ALL.L25S1 | A | 2.092099e+00 |\n",
"| 2 | K00003 | Bacillariophyta | G3.UW.ALL.L25S1_A | G3.UW.ALL.L25S1 | A | -1.128763e+00 |\n",
"| 3 | K00004 | Bacillariophyta | G3.UW.ALL.L25S1_A | G3.UW.ALL.L25S1 | A | -1.635693e-02 |\n",
"| 4 | K00006 | Bacillariophyta | G3.UW.ALL.L25S1_A | G3.UW.ALL.L25S1 | A | -6.474356e-01 |\n",
"| ... | ... | ... | ... | ... | ... | ... |\n",
"| 1284467 | K26159 | Pelagophyceae | G3.UW.ALL.L40S2_C | G3.UW.ALL.L40S2 | C | -1.360729e-01 |\n",
"| 1284468 | K26163 | Pelagophyceae | G3.UW.ALL.L40S2_C | G3.UW.ALL.L40S2 | C | 1.537219e+00 |\n",
"| 1284469 | K26165 | Pelagophyceae | G3.UW.ALL.L40S2_C | G3.UW.ALL.L40S2 | C | 6.104757e-01 |\n",
"| 1284470 | K26167 | Pelagophyceae | G3.UW.ALL.L40S2_C | G3.UW.ALL.L40S2 | C | 8.077639e-01 |\n",
"| 1284471 | K26171 | Pelagophyceae | G3.UW.ALL.L40S2_C | G3.UW.ALL.L40S2 | C | -3.022025e-01 |\n",
"\n",
"1284472 rows × 6 columns
\n", "<xarray.Dataset> Size: 19MB\n", "Dimensions: (KOfam: 10829, phylum: 8, sample_replicate_id: 28)\n", "Coordinates:\n", " * KOfam (KOfam) object 87kB 'K00001' 'K00002' ... 'K26180'\n", " * phylum (phylum) object 64B 'Bacillariophyta' ... 'Pelagophy...\n", " * sample_replicate_id (sample_replicate_id) object 224B 'G3.UW.ALL.L25S1_A...\n", "Data variables:\n", " data (KOfam, phylum, sample_replicate_id) float64 19MB -9...\n", " sample_id (sample_replicate_id) object 224B 'G3.UW.ALL.L25S1' ...\n", " replicate_id (sample_replicate_id) object 224B 'A' 'B' ... 'B' 'C'
<xarray.DataArray 'data' (KOfam: 50, phylum: 8, sample_replicate_id: 28)> Size: 90kB\n", "array([[[-1.64628585e-10, 7.69230203e-01, -2.50929152e-10, ...,\n", " -7.23918202e-03, -1.12051957e-11, -1.73352168e-02],\n", " [ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,\n", " 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],\n", " [ 2.81322758e-01, -6.42805859e-03, -1.75183542e-01, ...,\n", " -5.78476544e-01, 2.40013860e-01, 5.71970806e-01],\n", " ...,\n", " [-4.35489078e-11, -5.52214020e-11, -4.55228581e-11, ...,\n", " -1.19494802e-02, -4.22079785e-11, -3.65491197e-01],\n", " [-1.47167672e-02, -4.67013672e-01, -7.74064479e-04, ...,\n", " -1.35010786e+00, 3.01074092e-01, 1.35045724e+00],\n", " [-6.13419070e-01, -2.74562471e-01, -3.58977187e-01, ...,\n", " -1.13599447e+00, 3.95033179e-01, -8.67057218e-01]],\n", "\n", " [[-9.05984634e-01, -4.48669437e-01, 3.64424192e+00, ...,\n", " 6.06246040e-01, 1.35700630e+00, -3.82847867e-01],\n", " [ 1.37078552e+00, -1.20521982e-01, 1.61430650e+00, ...,\n", " -9.65207398e-01, -7.35669930e-01, -6.67981850e-01],\n", " [-7.07203366e-02, 1.52861606e+00, 1.16469289e-01, ...,\n", " -3.83950600e-01, -9.28490014e-02, 1.28734640e+00],\n", "...\n", " [ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,\n", " 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],\n", " [ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,\n", " 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],\n", " [ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,\n", " 0.00000000e+00, 0.00000000e+00, 0.00000000e+00]],\n", "\n", " [[-8.76196152e-01, -6.74484172e-01, 3.20138929e-01, ...,\n", " 1.61504440e+00, 2.25175360e+00, -1.06789192e+00],\n", " [-3.37561230e-01, 1.76373671e+00, -6.24937610e-01, ...,\n", " -1.18795420e+00, 1.13316513e-01, -3.78057002e-02],\n", " [-1.45201496e-02, -1.72506067e+00, 6.97145749e-01, ...,\n", " -8.37613055e-01, 6.92059773e-01, 2.94454587e-01],\n", " ...,\n", " [ 1.57844594e-01, -1.05503774e+00, 2.66180635e+00, ...,\n", " 3.83072471e+00, 
4.54346103e+00, -5.83866397e-01],\n", " [-3.65795076e-02, -1.38375006e+00, -1.27272218e+00, ...,\n", " 8.01074700e-01, 6.83282090e-01, -4.29440631e-01],\n", " [ 3.20800060e-01, -2.14208923e-02, -8.95054553e-01, ...,\n", " -4.83708091e-01, -6.40752015e-01, -2.95629764e-01]]])\n", "Coordinates:\n", " * KOfam (KOfam) object 400B 'K14842' 'K25166' ... 'K04984'\n", " * phylum (phylum) object 64B 'Bacillariophyta' ... 'Haptophyta'\n", " * sample_replicate_id (sample_replicate_id) object 224B 'G3.UW.ALL.L31S2_B...