{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Selecting SRM Targets\n",
    "\n",
    "This notebook is a continuation of the process I started in [this lab notebook entry](). Here, I document the process I used to select protein targets for a future SRM assay. I'll follow a process similar to the ones described [here](https://github.com/RobertsLab/project-oyster-oa/blob/master/notebooks/2017-03-21-Preliminary-Proteomic-Data-Analyses.ipynb) and [here](https://github.com/sr320/nb-2017/blob/master/C_gigas/05-YV-shortlist-poster.ipynb)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Step 1: Download Skyline output from OWL"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'/Users/yaamini/Documents/project-oyster-oa/notebooks'"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pwd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/Users/yaamini/Documents/project-oyster-oa/analyses\n"
     ]
    }
   ],
   "source": [
    "cd /Users/yaamini/Documents/project-oyster-oa/analyses"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "mkdir DNR_Skyline_20170511"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[34m2018-02-28-PECAN\u001b[m\u001b[m/                  \u001b[34mDNR_Skyline_20170314\u001b[m\u001b[m/\r\n",
      "\u001b[34mBCA_analysis\u001b[m\u001b[m/                      \u001b[34mDNR_Skyline_20170511\u001b[m\u001b[m/\r\n",
      "\u001b[34mDNR_MSConvert_20170412\u001b[m\u001b[m/            \u001b[34mMZratios_larval_samples\u001b[m\u001b[m/\r\n",
      "\u001b[34mDNR_PECAN_RUN_2_20170307\u001b[m\u001b[m/          README.md\r\n",
      "\u001b[34mDNR_PECAN_Run_3_20170308\u001b[m\u001b[m/          \u001b[34mraw_data_visualizations\u001b[m\u001b[m/\r\n",
      "\u001b[34mDNR_Preliminary_Analyses_20170321\u001b[m\u001b[m/\r\n"
     ]
    }
   ],
   "source": [
    "ls /Users/yaamini/Documents/project-oyster-oa/analyses"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/Users/yaamini/Documents/project-oyster-oa/analyses/DNR_Skyline_20170511\n"
     ]
    }
   ],
   "source": [
    "cd /Users/yaamini/Documents/project-oyster-oa/analyses/DNR_Skyline_20170511"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\n",
      "100  524M  100  524M    0     0  30.8M      0  0:00:17  0:00:17 --:--:-- 6718k\n"
     ]
    }
   ],
   "source": [
    "!curl http://owl.fish.washington.edu/spartina/DNR_Skyline_20170505/2017-05-11-transition-results.csv > /Users/yaamini/Documents/project-oyster-oa/analyses/DNR_Skyline_20170511/2017-05-11-transition-results.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2017-05-11-transition-results.csv\r\n"
     ]
    }
   ],
   "source": [
    "ls /Users/yaamini/Documents/project-oyster-oa/analyses/DNR_Skyline_20170511"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Protein Name,Protein Accession,Protein Gene,Replicate Name,Analyte Concentration,Area,Precursor Mz,Precursor Charge,Product Mz,Product Charge,Fragment Ion,Retention Time,Background,Peak Rank\r",
      "\r\n",
      "CHOYP_043R.5.5|m.64252,,,1,,14370792,628.864077,2,628.864077,2,precursor,56.27,0,1\r",
      "\r\n",
      "CHOYP_043R.5.5|m.64252,,,2,,23302408,628.864077,2,628.864077,2,precursor,55.94,914460,2\r",
      "\r\n",
      "CHOYP_043R.5.5|m.64252,,,3,,#N/A,628.864077,2,628.864077,2,precursor,#N/A,#N/A,#N/A\r",
      "\r\n",
      "CHOYP_043R.5.5|m.64252,,,4,,12472846,628.864077,2,628.864077,2,precursor,55.45,1729182,1\r",
      "\r\n",
      "CHOYP_043R.5.5|m.64252,,,5,,9034603,628.864077,2,628.864077,2,precursor,55.92,0,2\r",
      "\r\n",
      "CHOYP_043R.5.5|m.64252,,,6,,20703680,628.864077,2,628.864077,2,precursor,56,0,2\r",
      "\r\n",
      "CHOYP_043R.5.5|m.64252,,,7,,12086430,628.864077,2,628.864077,2,precursor,55.71,0,2\r",
      "\r\n",
      "CHOYP_043R.5.5|m.64252,,,8,,22076828,628.864077,2,628.864077,2,precursor,55.35,929857,2\r",
      "\r\n",
      "CHOYP_043R.5.5|m.64252,,,9,,23673296,628.864077,2,628.864077,2,precursor,55.96,4077997,1\r",
      "\r\n"
     ]
    }
   ],
   "source": [
    "!head 2017-05-11-transition-results.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Format Skyline output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first method I'll use to identify SRM targets is to see which proteins vary the most between all of my treatments."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, I opened my data in Excel.\n",
    "\n",
    "![screen shot 2017-05-11 at 4 22 11 pm](https://cloud.githubusercontent.com/assets/22335838/25975922/0fd8cdc4-3666-11e7-867b-5e35bedd5b15.png)\n",
    "\n",
    "There are no data under the columns \"Protein Accession,\" \"Protein Gene,\" and \"Analyte Concentration.\" This makes sense because none of the Skyline inputs had this information. I deleted these columns from the spreadsheet.\n",
    "\n",
    "Now, I have to reorganize my table. I first want to reformat my data, such that I have Replicate across the top, and Peak Area for each protein. I couldn't figure out a good way to do this in Excel, so I simply changed my Skyline export settings and reexported my report.\n",
    "\n",
    "![image](https://cloud.githubusercontent.com/assets/22335838/25980325/a1cf9714-3682-11e7-8fdb-ecd699cb4249.png)\n",
    "\n",
    "My report can be found [here](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Skyline_20170511/2017-05-11-peak-areas.csv)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2017-05-11-peak-areas.csv          2017-05-11-transition-results.csv\r\n"
     ]
    }
   ],
   "source": [
    "ls"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Protein Name,Peptide,Peptide Sequence,1 Replicate Name,1 Area,2 Replicate Name,2 Area,3 Replicate Name,3 Area,4 Replicate Name,4 Area,5 Replicate Name,5 Area,6 Replicate Name,6 Area,7 Replicate Name,7 Area,8 Replicate Name,8 Area,9 Replicate Name,9 Area,10 Replicate Name,10 Area,11 Replicate Name,11 Area,12 Replicate Name,12 Area,13 Replicate Name,13 Area,14 Replicate Name,14 Area,15 Replicate Name,15 Area,16 Replicate Name,16 Area,18 Replicate Name,18 Area,19 Replicate Name,19 Area,20 Replicate Name,20 Area,24 Replicate Name,24 Area,25 Replicate Name,25 Area\r",
      "\r\n",
      "CHOYP_043R.5.5|m.64252,IISQDTPTILR,IISQDTPTILR,1,14370792,2,23302408,3,#N/A,4,12472846,5,9034603,6,20703680,7,12086430,8,22076828,9,23673296,10,20894318,11,#N/A,12,#N/A,13,6366412,14,12113536,15,10475430,16,#N/A,18,11134167,19,11065488,20,14701463,24,89815600,25,85115320\r",
      "\r\n",
      "CHOYP_043R.5.5|m.64252,IISQDTPTILR,IISQDTPTILR,1,#N/A,2,23607888,3,#N/A,4,3931135,5,13043461,6,22820524,7,16789534,8,24926454,9,19369368,10,3053905,11,#N/A,12,11666986,13,6345001,14,6513484,15,6384737,16,19693062,18,3876843,19,3925207,20,9569519,24,30647556,25,34328456\r",
      "\r\n",
      "CHOYP_043R.5.5|m.64252,IISQDTPTILR,IISQDTPTILR,1,#N/A,2,6128333,3,#N/A,4,2542174,5,1640338,6,4559408,7,1843708,8,4468453,9,5002278,10,4651351,11,2632482,12,3600363,13,909673,14,2968306,15,2928264,16,#N/A,18,2694840,19,1965639,20,#N/A,24,18499426,25,20361696\r",
      "\r\n",
      "CHOYP_043R.5.5|m.64252,IISQDTPTILR,IISQDTPTILR,1,666795,2,914045,3,401482,4,250334,5,210997,6,653934,7,468056,8,981552,9,843474,10,#N/A,11,702872,12,561447,13,118468,14,432349,15,295746,16,#N/A,18,434492,19,481225,20,266202,24,2934301,25,3449614\r",
      "\r\n",
      "CHOYP_043R.5.5|m.64252,IISQDTPTILR,IISQDTPTILR,1,1706240,2,2483548,3,1032042,4,1613916,5,1075939,6,1875415,7,1334136,8,2584988,9,2989490,10,2647167,11,1639070,12,1714728,13,927117,14,1688378,15,1744254,16,#N/A,18,1221818,19,1352305,20,1617136,24,10782925,25,11312201\r",
      "\r\n",
      "CHOYP_043R.5.5|m.64252,IISQDTPTILR,IISQDTPTILR,1,1148035,2,2291994,3,841748,4,1407156,5,673015,6,2201076,7,1294640,8,2082576,9,2297001,10,2480154,11,1379714,12,1322377,13,725444,14,1308286,15,1254850,16,#N/A,18,1287800,19,#N/A,20,1083003,24,9581740,25,9666057\r",
      "\r\n",
      "CHOYP_043R.5.5|m.64252,IISQDTPTILR,IISQDTPTILR,1,1412826,2,2047944,3,#N/A,4,745458,5,793645,6,1512449,7,490220,8,1526305,9,1741906,10,2134953,11,#N/A,12,1462891,13,921934,14,1261259,15,1171364,16,#N/A,18,967556,19,1108264,20,1272781,24,7968852,25,8771542\r",
      "\r\n",
      "CHOYP_043R.5.5|m.64252,IISQDTPTILR,IISQDTPTILR,1,#N/A,2,#N/A,3,#N/A,4,#N/A,5,#N/A,6,#N/A,7,#N/A,8,#N/A,9,#N/A,10,#N/A,11,#N/A,12,#N/A,13,#N/A,14,#N/A,15,#N/A,16,#N/A,18,#N/A,19,#N/A,20,#N/A,24,#N/A,25,#N/A\r",
      "\r\n",
      "CHOYP_043R.5.5|m.64252,IISQDTPTILR,IISQDTPTILR,1,#N/A,2,#N/A,3,#N/A,4,#N/A,5,#N/A,6,#N/A,7,#N/A,8,#N/A,9,#N/A,10,#N/A,11,#N/A,12,#N/A,13,#N/A,14,#N/A,15,#N/A,16,#N/A,18,#N/A,19,#N/A,20,#N/A,24,#N/A,25,#N/A\r",
      "\r\n"
     ]
    }
   ],
   "source": [
    "!head 2017-05-11-peak-areas.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I've still got some extraneous columns, so I deleted them in Excel. My final spreadsheet looks like this:\n",
    "\n",
    "![untitled](https://cloud.githubusercontent.com/assets/22335838/25980469/95e52526-3683-11e7-9fab-92ca944c0e03.png)\n",
    "\n",
    "### Normalize Peak Areas with Total Ion Current (TIC)\n",
    "\n",
    "It is important to normalize the peak area data (proxy for protein abundance) with TIC values. TIC values correspond with how much of our samples we loaded onto the machine. Loading more sample on the machine would naturally lead to higher protein abundance and higher peak areas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\r\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\r\n",
      "\r",
      "  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\r",
      "100 35914  100 35914    0     0  1248k      0 --:--:-- --:--:-- --:--:-- 1524k\r\n"
     ]
    }
   ],
   "source": [
    "!curl http://owl.fish.washington.edu/spartina/January_2017_DNR_Raw_Data/2017_January_23_TIC_values.xlsx > 2017_January_23_TIC_values.xlsx"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2017-02-28-DIA-Analysis-PECAN.ipynb\r\n",
      "2017-03-07-Reconvert-mzML-Files.ipynb\r\n",
      "2017-03-08-Formatting-PECAN-Inputs.ipynb\r\n",
      "2017-03-14-Skyline-Test-Run.ipynb\r\n",
      "2017-03-21-Preliminary-Proteomic-Data-Analyses.ipynb\r\n",
      "2017-04-12-Demultiplex-Raw-Files.ipynb\r\n",
      "2017-05-11-Selecting-SRM-Targets.ipynb\r\n",
      "2017_January_23_TIC_values.xlsx\r\n"
     ]
    }
   ],
   "source": [
    "ls"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "rm 2017_January_23_TIC_values.xlsx"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2017-02-28-DIA-Analysis-PECAN.ipynb\r\n",
      "2017-03-07-Reconvert-mzML-Files.ipynb\r\n",
      "2017-03-08-Formatting-PECAN-Inputs.ipynb\r\n",
      "2017-03-14-Skyline-Test-Run.ipynb\r\n",
      "2017-03-21-Preliminary-Proteomic-Data-Analyses.ipynb\r\n",
      "2017-04-12-Demultiplex-Raw-Files.ipynb\r\n",
      "2017-05-11-Selecting-SRM-Targets.ipynb\r\n"
     ]
    }
   ],
   "source": [
    "ls"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/Users/yaamini/Documents/project-oyster-oa/analyses/DNR_Skyline_20170511\n"
     ]
    }
   ],
   "source": [
    "cd /Users/yaamini/Documents/project-oyster-oa/analyses/DNR_Skyline_20170511"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\r\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\r\n",
      "\r",
      "  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\r",
      "100 35914  100 35914    0     0  1263k      0 --:--:-- --:--:-- --:--:-- 1461k\r\n"
     ]
    }
   ],
   "source": [
    "!curl http://owl.fish.washington.edu/spartina/January_2017_DNR_Raw_Data/2017_January_23_TIC_values.xlsx > 2017_January_23_TIC_values.xlsx"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2017-05-11-peak-areas.csv          2017_January_23_TIC_values.xlsx\r\n",
      "2017-05-11-transition-results.csv\r\n"
     ]
    }
   ],
   "source": [
    "ls"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I will divide all peak area entries for each replicate with TIC values in the spreadsheet. You can see what I did in [this R script](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Skyline_20170511/2017-05-11-Normalizing-Peak-Areas.R).\n",
    "\n",
    "After normalizing peak areas, I did one last thing. I renamed my column headers so they describe site and eelgrass condition, as opposed to just listing a sample name. I referred to [the methods file](http://owl.fish.washington.edu/spartina/January_2017_DNR_Raw_Data/2017_January_23.csv) to correlate sample numbers from the mass spectrometer with vial numbers that DNR used during sample collection. [My lab notebook](https://yaaminiv.github.io/First-Day/) has information relating DNR vial numbers with sample site and eelgrass condition.\n",
    "\n",
    "Here's my file with [normalized peak areas and renamed column headers](https://raw.githubusercontent.com/RobertsLab/project-oyster-oa/master/analyses/DNR_Skyline_20170511/2017-05-11-normalized-peak-areas.csv)."
   ]
  },
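  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The division step can also be sketched in Python. This is not the linked R script, only a minimal illustration that assumes one TIC value per replicate. The function name `normalize_by_tic` is hypothetical, and the TIC values in the example are invented for illustration.\n",
    "\n",
    "```python\n",
    "def normalize_by_tic(areas, tic):\n",
    "    # areas: one row of peak areas keyed by replicate name (None = missing, i.e. #N/A)\n",
    "    # tic: assumed TIC value for each replicate, keyed the same way\n",
    "    normalized = {}\n",
    "    for replicate, area in areas.items():\n",
    "        if area is None:\n",
    "            normalized[replicate] = None  # keep missing values missing\n",
    "        else:\n",
    "            normalized[replicate] = area / tic[replicate]\n",
    "    return normalized\n",
    "\n",
    "\n",
    "# Peak areas come from the first row of the peak area report; TIC values are invented\n",
    "row = {'1': 14370792, '2': 23302408, '3': None}\n",
    "tic = {'1': 2.0e9, '2': 4.0e9, '3': 3.0e9}\n",
    "print(normalize_by_tic(row, tic))\n",
    "```\n",
    "\n",
    "Each replicate is divided by its own TIC value, so replicates with more material loaded are scaled down more."
   ]
  },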
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\"\",\"protein-name\",\"peptide\",\"peptide-sequence\",\"O127-wb-bare-1\",\"O107-sk-eelgrass-1\",\"O07-ci-eelgrass-1\",\"O77-pg-eelgrass-1\",\"O47-fb-bare-1\",\"O55-pg-bare-1\",\"O37-fb-eelgrass-1\",\"O15-ci-eelgrass-1\",\"O142-wb-eelgrass-1\",\"O119-sk-bare-1\",\"O47-fb-bare-2\",\"O127-wb-bare-2\",\"O37-fb-eelgrass-2\",\"O55-pg-bare-2\",\"O15-ci-bare-2\",\"O119-sk-bare-2\",\"O142-wb-eelgrass-2\",\"O07-ci-eelgrass-2\",\"O77-pg-eelgrass-2\",\"O107-2-sk-eegrass-1\",\"O107-2-sk-eelgrass-2\"\r\n",
      "\"1\",\"CHOYP_043R.5.5|m.64252\",\"IISQDTPTILR\",\"IISQDTPTILR\",14370792,7185396,4790264,3592698,2874158.4,2395132,2052970.28571429,1796349,1596754.66666667,1437079.2,1306435.63636364,1197566,1105445.53846154,1026485.14285714,958052.8,898174.5,798377.333333333,756357.473684211,718539.6,598783,574831.68\r\n",
      "\"2\",\"CHOYP_043R.5.5|m.64252\",\"IISQDTPTILR\",\"IISQDTPTILR\",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA\r\n",
      "\"3\",\"CHOYP_043R.5.5|m.64252\",\"IISQDTPTILR\",\"IISQDTPTILR\",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA\r\n",
      "\"4\",\"CHOYP_043R.5.5|m.64252\",\"IISQDTPTILR\",\"IISQDTPTILR\",666795,333397.5,222265,166698.75,133359,111132.5,95256.4285714286,83349.375,74088.3333333333,66679.5,60617.7272727273,55566.25,51291.9230769231,47628.2142857143,44453,41674.6875,37044.1666666667,35094.4736842105,33339.75,27783.125,26671.8\r\n",
      "\"5\",\"CHOYP_043R.5.5|m.64252\",\"IISQDTPTILR\",\"IISQDTPTILR\",1706240,853120,568746.666666667,426560,341248,284373.333333333,243748.571428571,213280,189582.222222222,170624,155112.727272727,142186.666666667,131249.230769231,121874.285714286,113749.333333333,106640,94791.1111111111,89802.1052631579,85312,71093.3333333333,68249.6\r\n",
      "\"6\",\"CHOYP_043R.5.5|m.64252\",\"IISQDTPTILR\",\"IISQDTPTILR\",1148035,574017.5,382678.333333333,287008.75,229607,191339.166666667,164005,143504.375,127559.444444444,114803.5,104366.818181818,95669.5833333333,88310.3846153846,82002.5,76535.6666666667,71752.1875,63779.7222222222,60422.8947368421,57401.75,47834.7916666667,45921.4\r\n",
      "\"7\",\"CHOYP_043R.5.5|m.64252\",\"IISQDTPTILR\",\"IISQDTPTILR\",1412826,706413,470942,353206.5,282565.2,235471,201832.285714286,176603.25,156980.666666667,141282.6,128438.727272727,117735.5,108678.923076923,100916.142857143,94188.4,88301.625,78490.3333333333,74359.2631578947,70641.3,58867.75,56513.04\r\n",
      "\"8\",\"CHOYP_043R.5.5|m.64252\",\"IISQDTPTILR\",\"IISQDTPTILR\",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA\r\n",
      "\"9\",\"CHOYP_043R.5.5|m.64252\",\"IISQDTPTILR\",\"IISQDTPTILR\",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA\r\n"
     ]
    }
   ],
   "source": [
    "!head 2017-05-11-normalized-peak-areas.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Identify proteins with high variation across all treatment conditions\n",
    "\n",
    "To do this, I'm going to calculate coefficient of variations for each sample site and eelgrass condition. I created a [new spreadsheet](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Skyline_20170511/2017-05-11-high-variation-treatments.csv) that I will be working from.\n",
    "\n",
    "I first calculated the averages for peak area between each replicate.\n",
    "![untitled](https://cloud.githubusercontent.com/assets/22335838/25983366/41204d02-3699-11e7-9e25-14c6507f066d.png)\n",
    "\n",
    "After about two hours of struggle bussing trying to calculate coefficients of variation due to several files crashing/obtaining the same coefficient of variation for each protein, I realized that my normalized data will always have the same coefficient of variation because **I NORMALIZED IT**. #mathfail\n",
    "\n",
    "So to really understand the variation in my data, I have to play with the nonnormalized data.\n",
    "\n",
    "Using the same R Script as before, I renamed column headers without normalizing data. Here is the [resultant file](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Skyline_20170511/2017-05-11-nonnormalized-peak-areas.csv).\n",
    "\n",
    "Because I'm working in Excel and not R, I replaced all \"#N/A\" values with blank cells so Excel formulas would still calculate parameters like standard deviation and means. I calculated the mean for each replicate (using AVERAGE). Between those means, I then calculated the standard deviation (using STDEV) and mean (using AVERAGE). Finally, I calculated the coefficient of variation by dividing the standard deviation with the mean.\n",
    "\n",
    "Then, I sorted my spreadsheet such that the rows with the highest coefficient of variance was at the top. I deleted all of the rows where I could not calculate my CV due to missing values. I was left with 98156 rows of data. My lowest CV was 0.126116578 and my highest CV was 3.103972971.\n",
    "\n",
    "I then identified outliers for CV to pinpoint proteins that varied more or less than expected. Using the QUARTILE function in Excel, I calculated the IQR for my data.\n",
    "\n",
    "Q1 = 0.460219756\n",
    "\n",
    "Q3 = 0.740687047\n",
    "\n",
    "IQR = 0.280467291\n",
    "\n",
    "I considered any coefficient of variation with a value greater than 1.161387982 as high variation, and anything with value lower than 0.03951882 as low variation. My work can be found in [this spreadsheet](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Skyline_20170511/2017-05-11-nonnormalized-peak-areas-all-treatment-quartiles.csv)."
   ]
  },
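  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Excel calculations above can be reproduced with a short Python sketch. This is an illustration rather than the workflow actually used: the function names `coefficient_of_variation` and `iqr_cutoffs` are hypothetical, and the quartiles are passed in as reported instead of being recomputed, since Excel's QUARTILE interpolation may differ from other tools.\n",
    "\n",
    "```python\n",
    "import statistics\n",
    "\n",
    "\n",
    "def coefficient_of_variation(values):\n",
    "    # Sample standard deviation (like Excel's STDEV) divided by the mean\n",
    "    return statistics.stdev(values) / statistics.mean(values)\n",
    "\n",
    "\n",
    "def iqr_cutoffs(q1, q3):\n",
    "    # Lower and upper outlier cutoffs: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR\n",
    "    iqr = q3 - q1\n",
    "    return q1 - 1.5 * iqr, q3 + 1.5 * iqr\n",
    "\n",
    "\n",
    "# Quartiles reported above for the all-treatment comparison\n",
    "lower, upper = iqr_cutoffs(0.460219756, 0.740687047)\n",
    "print(lower, upper)\n",
    "```\n",
    "\n",
    "Rows with a CV above the upper cutoff count as high variation, and rows below the lower cutoff as low variation."
   ]
  },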
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Compare sample sites"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "My next step is to essentially repeat the workflow above, but this time pooling each site when calculating an average, regardless of eelgrass condition. This time, I had 201,309 rows of data.\n",
    "\n",
    "Q1 = 0.497903746\n",
    "\n",
    "Q3 = 0.83721613\n",
    "\n",
    "IQR = 0.339312384\n",
    "\n",
    "LOWER = -0.01106483\n",
    "\n",
    "UPPER = 1.346184706\n",
    "\n",
    "[Spreadsheet](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Skyline_20170511/2017-05-11-nonnormalized-peak-areas-quartiles-sites.xlsx)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5: Compare eelgrass conditions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, I'll do the same thing by comparing eelgrass conditions regardless of sample site. I had 261,633 rows of data.\n",
    "\n",
    "Q1 = 0.138417957\n",
    "\n",
    "Q3 = 0.449766071\n",
    "\n",
    "IQR = 0.311348114\n",
    "\n",
    "LOWER = -0.328604213\n",
    "\n",
    "UPPER = 0.916788241\n",
    "\n",
    "[Spreadsheet](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Skyline_20170511/2017-05-11-nonnormalized-peak-areas-quartiles-eelgrass.xlsx)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6: Consolidate potential targets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I copied and pasted all rows deemed as outliers by IQR analysis into [this tab-delimited file](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Skyline_20170511/2017-05-11-high-variation-targets.txt). It's interesting to note that only high variation targets were selected using this method."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 7: Merge with protein information"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I now have information about how variable certain proteins are, but I don't have any information on protein function! I'm going to merge my list of high variation targets with [this list](https://raw.githubusercontent.com/RobertsLab/project-oyster-oa/master/analyses/DNR_Preliminary_Analyses_20170321/all-proteins-go-terms/Proteins-GO-terms.tabular) that has function and GO term information.\n",
    "\n",
    "I started by uploading my potential targets and GO terms to [Galaxy](usegalaxy.com).\n",
    "\n",
    "![untitled](https://cloud.githubusercontent.com/assets/22335838/25994051/20d8869a-36c1-11e7-8ea5-46e9ff0ec5e6.png)"
   ]
  },
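  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The merge was set up in Galaxy, but the same join idea can be sketched in plain Python. This is only an illustration: the function name `merge_on_protein` is hypothetical, the rows are toy placeholders rather than real results, and it assumes the protein name is the first column of both tables.\n",
    "\n",
    "```python\n",
    "def merge_on_protein(targets, annotations):\n",
    "    # targets and annotations are lists of rows; column 0 of each is assumed to be the protein name\n",
    "    lookup = {row[0]: row[1:] for row in annotations}\n",
    "    merged = []\n",
    "    for row in targets:\n",
    "        # Keep every target; proteins without an annotation get no extra columns\n",
    "        merged.append(list(row) + list(lookup.get(row[0], [])))\n",
    "    return merged\n",
    "\n",
    "\n",
    "# Toy rows: the names and values are placeholders, not real results\n",
    "targets = [['protein_A', '1.2'], ['protein_B', '1.4']]\n",
    "annotations = [['protein_A', 'placeholder annotation']]\n",
    "print(merge_on_protein(targets, annotations))\n",
    "```\n",
    "\n",
    "Keeping unannotated targets in the output makes it easy to see which proteins still lack function information."
   ]
  },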
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Update 2017-05-12 9 a.m.**: After talking with Emma this morning, I realized that I need to use MSStats to find proteins differentially expressed instead of this CV method. I will start a new lab notebook for this."
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}