{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Preliminary Proteomic Data Analyses\n",
    "\n",
    "Using [data from Skyline](https://github.com/RobertsLab/project-oyster-oa/blob/master/notebooks/2017-03-14-Skyline-Test-Run.ipynb), I will analyze and visualize my data. I will first assess which proteins are differentially present in samples across sites and inside or outside of eelgrass beds. Then, I will visualize my data in three different ways."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data Exploration"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, I want to see what data I'm working with. I will work with the data for [average peak area](https://raw.githubusercontent.com/RobertsLab/project-oyster-oa/master/analyses/DNR_Skyline_20170314/Oyster-AverageArea-Proteinbased.csv) for proteins across samples."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![screen shot 2017-03-21 at 6 38 09 pm](https://cloud.githubusercontent.com/assets/22335838/24178348/ce486518-0e65-11e7-92a5-999f56a87274.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "My \"Row Labels\" column is a list of protein IDs. The other columns pertain to the mass spectrometry sample IDs. To make this usable, I will use an R script to add more informative column names. Additionally, I need to remove two columns relating to sample O107. There was a bubble in the sample column, which lead to [poor mass spectrometer readings](https://yaaminiv.github.io/Mass-Spec-Updates/). I reran the sample twice, so I need to use those runs for analysis instead of the initial poor runs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's the [R script](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Preliminary_Analyses_20170321/DNR-Reformat-Preliminary-Data.R) I used to reformat my data.\n",
    "\n",
    "The data can be found [here](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Preliminary_Analyses_20170321/Oyster-AverageAdjustedMergedArea.csv)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\"\",\"averageAreaAdjusted.proteins\",\"bareCaseInlet\",\"bareFidalgoBay\",\"bareWillapaBay\",\"bareSkokomishRiver\",\"barePortGamble\",\"eelgrassCaseInlet\",\"eelgrassFidalgoBay\",\"eelgrassWillapaBay\",\"eelgrassSkokomishRiver\",\"eelgrassPortGamble\"\r\n",
      "\"1\",\"CHOYP_14332.1.2|m.5643\",3293219.556,1726699.5,2716275.545,13976900.44,2243615.1,4909780.357,3073463.222,1943599.9,5308420.417,1708631.273\r\n",
      "\"2\",\"CHOYP_14332.2.2|m.61737\",81172.33333,101834.6667,2605333.25,143657,94883.2778,883664.4286,47877.6,859956.1429,7453332.5,181803\r\n",
      "\"3\",\"CHOYP_1433E.1.2|m.3638\",9599115.391,22026926.25,5985057.944,29630617.52,2209625.5,2888550.125,24439963.26,17286007.32,49787884.24,4732493.542\r\n",
      "\"4\",\"CHOYP_1433E.2.2|m.63376\",9599115.391,22026926.25,5985057.944,29630617.52,2209625.5,2888550.125,24439963.26,17286007.32,49787884.24,4732493.542\r\n",
      "\"5\",\"CHOYP_1433G.1.2|m.8906\",4875530.333,2873835,4671105.667,20887905.17,3481416.667,5906856.286,4141798.333,2896918.8,6873987.556,271279.25\r\n",
      "\"6\",\"CHOYP_1433G.2.2|m.63450\",10460977.8,7264823.438,5228882.182,11503021.89,5483118.813,7031340,10340272.32,9302575,27924467.09,12652430.6\r\n",
      "\"7\",\"CHOYP_1433G.2.2|m.63451\",229456.5556,780323.1111,351988.2222,755481.9091,4747114,2972555,112558.8571,362153,3853072.929,878322\r\n",
      "\"8\",\"CHOYP_1433Z.1.1|m.17045\",148181.6667,115980.5,3138431.571,180230.875,155703.7143,346657.3333,245099.125,126245.5714,2087374.3,337927.4286\r\n",
      "\"9\",\"CHOYP_2A5D.2.3|m.32240\",245588.8,926314,137453.3333,175999.6667,81883.5,326141,92516.5,352734.6,478352.3333,232660.6667\r\n"
     ]
    }
   ],
   "source": [
    "!head /Users/yaaminivenkataraman/Documents/School/project-oyster-oa/analyses/DNR_Preliminary_Analyses_20170321/Oyster-AverageAdjustedMergedArea.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the same R script, I reformatted my maximum peak area data. The reformatted data can be found [here](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Preliminary_Analyses_20170321/Oyster-MaxAdjustedMergedArea.csv)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ratio Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I will now use basic ratios to determine which proteins are potentially differentially expressed between eelgrass and bare patches, and between the five different sites.\n",
    "\n",
    "Within the formatted data files, I calculated averages across sites for proteins expression in bare sites versus eelgrass patches. I then took the ratio of eelgrass:bare and sorted the data from smallest to largest ratio."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "**Average area**:\n",
    "\n",
    "![average1](https://cloud.githubusercontent.com/assets/22335838/24216092/2eb7143e-0ef8-11e7-80c1-07ba1ded3fc0.png)\n",
    "![average2](https://cloud.githubusercontent.com/assets/22335838/24216093/2eb7b362-0ef8-11e7-8c7f-ad17d9a800e8.png)\n",
    "\n",
    "**Maximum area**:\n",
    "\n",
    "![max1](https://cloud.githubusercontent.com/assets/22335838/24216097/306f1dda-0ef8-11e7-8096-665217a846d1.png)\n",
    "![max2](https://cloud.githubusercontent.com/assets/22335838/24216096/306b3e86-0ef8-11e7-9234-67481b239536.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "There's quite a range of eelgrass:bare ratios. Keeping this in mind, I'll proceed with my entire dataset for my intial visualizations. After seeing what I get, I'll pare down the dataset based on my ratio analysis."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Nonmetric Multidimensional Scaling Plot\n",
    "\n",
    "Based on my data exploration, it seems like most of my differences between eelgrass and bare patches are driven by site-specific interactions. To better visualize this, I'll first create an NMDS plot.\n",
    "\n",
    "[R Script](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Preliminary_Analyses_20170321/DNR-Preliminary-Data-Analysis.R)\n",
    "\n",
    "![preliminary NMDS](https://raw.githubusercontent.com/RobertsLab/project-oyster-oa/master/analyses/DNR_Preliminary_Analyses_20170321/preliminaryNMDS.png)\n",
    "\n",
    "I couldn't figure out how to add a legend properly, so the preliminary figure it missing it. However, I did color each site differently. Bare patches are squares, eelgrass are circles. Without knowing which site is which, they don't seem to be clustering together in any way that makes sense.\n",
    "\n",
    "I showed this plot to Emma and she said there was something weird about the axes. She modified my code, removing the \"score\" column from my analysis.\n",
    "\n",
    "![modified NMDS](https://raw.githubusercontent.com/RobertsLab/project-oyster-oa/master/analyses/DNR_Preliminary_Analyses_20170321/finalNMDS.png)\n",
    "\n",
    "There are still no stark differences between sites or eelgrass vs. bare patches."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Heatmap\n",
    "\n",
    "My next step was to create a heatmap to visualize differences in protein expression across sites and eelgrass conditions.\n",
    "\n",
    "[R Script](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Preliminary_Analyses_20170321/DNR-Preliminary-Data-Analysis.R)\n",
    "\n",
    "![preliminary heatmap](https://raw.githubusercontent.com/RobertsLab/project-oyster-oa/master/analyses/DNR_Preliminary_Analyses_20170321/preliminaryHeatmap.png)\n",
    "\n",
    "From the heatmap, it's apparent that most of the proteins expressed are consistent between sites. There are only certain proteins on the \"edges\" of the heatmap that are different. It is likely that these proteins are the same as those identified by my ratio analysis. Either way, I'll need to pare down my dataset to get anything meaningful."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Enrichment Analysis Preparation\n",
    "\n",
    "The end goal is to use REVIGO to create plots similar to the [ones Steven created for Laura's data](https://sr320.github.io/Proteomic-Visualization/). Before I can do that, I need to obtain Uniprot accessions for the proteins in my background proteome and my Skyline data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Isolate accession codes for background proteome\n",
    "\n",
    "The accession codes for the *C. gigas* background proteome I'm using can be found in Rhonda's Github [here](https://raw.githubusercontent.com/Ellior2/Fish-546-Bioinformatics/master/analyses/gigas_prot/blastoutput4.txt). I'm going to download the file myself and then modify it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CHOYP_043R.1.5|m.16874\tsp|Q06852|SLAP1_CLOTH\t56.944\t216\t61\t18\t10\t197\t1388\t1599\t5.11e-008\t55.8\r\n",
      "CHOYP_043R.5.5|m.64252\tsp|Q06852|SLAP1_CLOTH\t52.381\t294\t80\t24\t575\t816\t1351\t1636\t1.98e-016\t88.2\r\n",
      "CHOYP_14332.1.2|m.5643\tsp|Q2F637|1433Z_BOMMO\t66.031\t262\t74\t2\t19\t280\t1\t247\t2.73e-119\t344\r\n",
      "CHOYP_14332.1.2|m.5644\tsp|P62325|BTG1_MOUSE\t47.205\t161\t80\t2\t1\t156\t11\t171\t2.18e-047\t155\r\n",
      "CHOYP_14332.2.2|m.61737\tsp|Q2F637|1433Z_BOMMO\t67.331\t251\t78\t1\t1\t251\t1\t247\t4.15e-119\t342\r\n",
      "CHOYP_1433E.1.2|m.3639\tsp|Q9CWP8|DPOD4_MOUSE\t38.235\t102\t61\t2\t30\t130\t7\t107\t9.56e-019\t78.6\r\n",
      "CHOYP_1433E.1.2|m.3638\tsp|P92177|1433E_DROME\t77.692\t260\t53\t1\t1\t255\t1\t260\t1.91e-149\t420\r\n",
      "CHOYP_1433E.2.2|m.63376\tsp|P92177|1433E_DROME\t77.692\t260\t53\t1\t1\t255\t1\t260\t1.91e-149\t420\r\n",
      "CHOYP_1433E.2.2|m.63378\tsp|Q6MG82|PRRT1_RAT\t44.444\t99\t53\t1\t41\t137\t196\t294\t1.81e-018\t82.8\r\n",
      "CHOYP_1433G.1.2|m.8906\tsp|Q1HR36|1433Z_AEDAE\t72.180\t133\t37\t0\t18\t150\t116\t248\t1.86e-067\t207\r\n"
     ]
    }
   ],
   "source": [
    "!head /Users/yaamini/Documents/project-oyster-oa/analyses/DNR_Preliminary_Analyses_20170321/background-proteome-accession.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the first line, the accession code is \"Q06852,\" but it is currently nested inbetween other content in the second column.\n",
    "\n",
    "I will use `bash` to convert the \"|\" (pipes) to tabs, and then redirect that output to a new file. I will run this in the terminal, because Jupyter Notebook web sockets tend to ping out after a few minutes. The code is as follows:\n",
    "\n",
    "**Background proteome**: `tr '|' '\\t' < /Users/yaamini/Documents/project-oyster-oa/analyses/DNR_Preliminary_Analyses_20170321/background-proteome-accession.txt > /Users/yaamini/Documents/project-oyster-oa/analyses/DNR_Preliminary_Analyses_20170321/background-proteome-accession-no-pipes.txt`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CHOYP_043R.1.5\tm.16874\tsp\tQ06852\tSLAP1_CLOTH\t56.944\t216\t61\t18\t10\t197\t1388\t1599\t5.11e-008\t55.8\r\n",
      "CHOYP_043R.5.5\tm.64252\tsp\tQ06852\tSLAP1_CLOTH\t52.381\t294\t80\t24\t575\t816\t1351\t1636\t1.98e-016\t88.2\r\n",
      "CHOYP_14332.1.2\tm.5643\tsp\tQ2F637\t1433Z_BOMMO\t66.031\t262\t74\t2\t19\t280\t1\t247\t2.73e-119\t344\r\n",
      "CHOYP_14332.1.2\tm.5644\tsp\tP62325\tBTG1_MOUSE\t47.205\t161\t80\t2\t1\t156\t11\t171\t2.18e-047\t155\r\n",
      "CHOYP_14332.2.2\tm.61737\tsp\tQ2F637\t1433Z_BOMMO\t67.331\t251\t78\t1\t1\t251\t1\t247\t4.15e-119\t342\r\n",
      "CHOYP_1433E.1.2\tm.3639\tsp\tQ9CWP8\tDPOD4_MOUSE\t38.235\t102\t61\t2\t30\t130\t7\t107\t9.56e-019\t78.6\r\n",
      "CHOYP_1433E.1.2\tm.3638\tsp\tP92177\t1433E_DROME\t77.692\t260\t53\t1\t1\t255\t1\t260\t1.91e-149\t420\r\n",
      "CHOYP_1433E.2.2\tm.63376\tsp\tP92177\t1433E_DROME\t77.692\t260\t53\t1\t1\t255\t1\t260\t1.91e-149\t420\r\n",
      "CHOYP_1433E.2.2\tm.63378\tsp\tQ6MG82\tPRRT1_RAT\t44.444\t99\t53\t1\t41\t137\t196\t294\t1.81e-018\t82.8\r\n",
      "CHOYP_1433G.1.2\tm.8906\tsp\tQ1HR36\t1433Z_AEDAE\t72.180\t133\t37\t0\t18\t150\t116\t248\t1.86e-067\t207\r\n"
     ]
    }
   ],
   "source": [
    "!head /Users/yaamini/Documents/project-oyster-oa/analyses/DNR_Preliminary_Analyses_20170321/background-proteome-accession-no-pipes.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great! Now all of my background accession codes are in a separate column, easy to extract. \n",
    "\n",
    "#### Get accession codes for Skyline output\n",
    "\n",
    "My Skyline output does not have any accession codes, so I need to merge the information in my accession code table with the information in my Skyline output. To do this, I'll use an R script modified from [one I've used previously](https://github.com/yaaminiv/yaaminiv-fish546-2016/blob/master/analyses/oly_oa_gonad_GO_enrichment/2016-12-2-merging-tables.R).\n",
    "\n",
    "[R script](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Preliminary_Analyses_20170321/DNR-Merging-Skyline-Accession.R)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\"CHOYP_14332.1.2|m.5643\" 3293219.556 1726699.5 2716275.545 13976900.44 2243615.1 4909780.357 3073463.222 1943599.9 5308420.417 1708631.273 \"0.707271369\" \"sp|Q2F637|1433Z_BOMMO\"\r\n",
      "\"CHOYP_14332.2.2|m.61737\" 81172.33333 101834.6667 2605333.25 143657 94883.2778 883664.4286 47877.6 859956.1429 7453332.5 181803 \"3.11430649\" \"sp|Q2F637|1433Z_BOMMO\"\r\n",
      "\"CHOYP_1433E.1.2|m.3638\" 9599115.391 22026926.25 5985057.944 29630617.52 2209625.5 2888550.125 24439963.26 17286007.32 49787884.24 4732493.542 \"1.427400749\" \"sp|P92177|1433E_DROME\"\r\n",
      "\"CHOYP_1433E.2.2|m.63376\" 9599115.391 22026926.25 5985057.944 29630617.52 2209625.5 2888550.125 24439963.26 17286007.32 49787884.24 4732493.542 \"1.427400749\" \"sp|P92177|1433E_DROME\"\r\n",
      "\"CHOYP_1433G.1.2|m.8906\" 4875530.333 2873835 4671105.667 20887905.17 3481416.667 5906856.286 4141798.333 2896918.8 6873987.556 271279.25 \"0.546098216\" \"sp|Q1HR36|1433Z_AEDAE\"\r\n",
      "\"CHOYP_1433G.2.2|m.63450\" 10460977.8 7264823.438 5228882.182 11503021.89 5483118.813 7031340 10340272.32 9302575 27924467.09 12652430.6 \"1.683768087\" \"sp|P68252|1433G_BOVIN\"\r\n",
      "\"CHOYP_1433G.2.2|m.63451\" 229456.5556 780323.1111 351988.2222 755481.9091 4747114 2972555 112558.8571 362153 3853072.929 878322 \"1.191466832\" \"sp|Q4R979|RBM4_MACFA\"\r\n",
      "\"CHOYP_1433Z.1.1|m.17045\" 148181.6667 115980.5 3138431.571 180230.875 155703.7143 346657.3333 245099.125 126245.5714 2087374.3 337927.4286 \"0.840786396\" \"sp|Q2F637|1433Z_BOMMO\"\r\n",
      "\"CHOYP_2A5D.2.3|m.32240\" 245588.8 926314 137453.3333 175999.6667 81883.5 326141 92516.5 352734.6 478352.3333 232660.6667 \"0.945870296\" \"sp|Q14738|2A5D_HUMAN\"\r\n",
      "\"CHOYP_2A5D.3.3|m.37716\" 3215208.909 414344.9 1231461.375 406817.7778 248851.1429 12436748.27 535174.6667 1152859.833 883051.75 1879546.636 \"3.061147029\" \"sp|Q14738|2A5D_HUMAN\"\r\n"
     ]
    }
   ],
   "source": [
    "!head /Users/yaamini/Documents/project-oyster-oa/analyses/DNR_Preliminary_Analyses_20170321/Skyline-ProteinAccession-nohead.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once again, my accession codes are nested between pipes. I'll use the following code in the terminal to remove pipes and isolate the accession codes.\n",
    "\n",
    "**Skyline output**: `tr '|' '\\t' < /Users/yaamini/Documents/project-oyster-oa/analyses/DNR_Preliminary_Analyses_20170321/Skyline-ProteinAccession-nohead.txt > /Users/yaamini/Documents/project-oyster-oa/analyses/DNR_Preliminary_Analyses_20170321/Skyline-ProteinAccession-nohead-nopipes.txt`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "O54692\t\"ZW10_MOUSE\"\"\"\t\"CHOYP_ZW10.1.1\tm.35287\"\t53686.5\t142385.5\t209814.3333\t86507\t154063.6667\t253388\t232862.6667\t996650.25\t59871.66667\t300669\t2.851607428\tsp"
     ]
    }
   ],
   "source": [
    "!head /Users/yaamini/Documents/project-oyster-oa/analyses/DNR_Preliminary_Analyses_20170321/Skyline-ProteinAccession-nohead-nopipes.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I then copied and pasted all of my accession codes from my background proteome and my Skyline output into a [new spreadsheet](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Preliminary_Analyses_20170321/DAVID-accession-codes.xlsx)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### DAVID\n",
    "\n",
    "I can now use my accession codes in [DAVID](https://david.ncifcrf.gov/tools.jsp) to get gene ontology terms.\n",
    "\n",
    "#### Upload Skyline output accession codes\n",
    "\n",
    "I copied and pasted my Skyline accesssion codes from my spreadsheet into DAVID.\n",
    "\n",
    "![skylineoutput](https://cloud.githubusercontent.com/assets/22335838/24221177/7cb9b1d4-0f0a-11e7-85dd-7f273d34a541.png)\n",
    "\n",
    "![skylineoutput2](https://cloud.githubusercontent.com/assets/22335838/24221178/7cce0a9e-0f0a-11e7-868b-ad384fa55572.png)\n",
    "\n",
    "Multiple species were detected, but that's okay becuase I had yet to specify my background list.\n",
    "\n",
    "#### Upload background proteome accession codes\n",
    "\n",
    "Did the same for my background proteome.\n",
    "\n",
    "![backgroundaccession](https://cloud.githubusercontent.com/assets/22335838/24220972/aa60e46e-0f09-11e7-83b4-ad93ea1d6f27.png)\n",
    "\n",
    "#### Pull interesting information from DAVID\n",
    "\n",
    "I had the following options using the Functional Annotation Tool:\n",
    "\n",
    "![annotationoptions](https://cloud.githubusercontent.com/assets/22335838/24221206/a0514f4e-0f0a-11e7-9e3b-929e4243f9cf.png)\n",
    "\n",
    "To see what was overrepresented across all of my samples, I pulled out [Biological Processes](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Preliminary_Analyses_20170321/DAVID-files/DNR-Biological-Processes.txt), [Cellular Component](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Preliminary_Analyses_20170321/DAVID-files/DNR-Cellular-Processes.txt) and [Molecular Function](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Preliminary_Analyses_20170321/DAVID-files/DNR-Molecular-Function.txt) information.\n",
    "\n",
    "![biologicalprocesseschart](https://cloud.githubusercontent.com/assets/22335838/24223012/8af7ddfa-0f11-11e7-89a3-a5d37c16be19.png)\n",
    "![cellularprocesses](https://cloud.githubusercontent.com/assets/22335838/24223010/8af6c38e-0f11-11e7-8a73-036513a49165.png)\n",
    "![molecularfunction](https://cloud.githubusercontent.com/assets/22335838/24223009/8af5e18a-0f11-11e7-9d6a-b511fadcdf7a.png)\n",
    "\n",
    "I also got information on the [Kegg Pathway](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Preliminary_Analyses_20170321/DAVID-files/DNR-Kegg.txt). Below is the carbohydrate pathway that was overrepresented in all of my Skyline data.\n",
    "\n",
    "![carbon metabolism](https://raw.githubusercontent.com/RobertsLab/project-oyster-oa/master/analyses/DNR_Preliminary_Analyses_20170321/DAVID-files/DNR-carbon-metabolism.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### REVIGO\n",
    "\n",
    "For now, I'm just going to make a REVIGO plot using the Biological Processes information. I used the first 44 entries of the table to limit the amount of information in the plot.\n",
    "\n",
    "![revigo](https://cloud.githubusercontent.com/assets/22335838/24228094/2a6f0670-0f2f-11e7-9977-13b3b291161f.png)\n",
    "\n",
    "![biological processes](https://raw.githubusercontent.com/RobertsLab/project-oyster-oa/master/analyses/DNR_Preliminary_Analyses_20170321/DAVID-files/overrepresented-biological-processes.jpg)\n",
    "\n",
    "Looking at this figure, I can see which processes are overrepresented amongst all of my samples. I now need to identify proteins of interest and see how they vary between my sites and eelgrass conditions. Together, this will give us a hollistic picture of protein expression."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Subsetting data\n",
    "\n",
    "To carry out my next analyses, I first need to get a subset of my Skyline output. Ideally, I want proteins of interest and average peak areas across sites associated with GO terms. I can then created the table mentioned above as well as create a new REVIGO plot. If I have time, I can also redo my NMDS plot and heatmap with the subsetted data.\n",
    "\n",
    "First, I downloaded a [table from Rhonda's Github](https://github.com/Ellior2/Fish-546-Bioinformatics/blob/master/analyses/gigas_prot/Galaxy41-%5BSort_on_data_40%5D.tabular) with *C. gigas* proteins and associated GO terms. I then reformatted the file and my Skyline output file so that they were both tabular and had protein names in the second column.\n",
    "\n",
    "[*C. gigas* proteome with GO terms](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Preliminary_Analyses_20170321/all-proteins-go-terms/Proteins-GO-terms.tabular)\n",
    "\n",
    "[Reformatted Skyline output](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Preliminary_Analyses_20170321/all-proteins-go-terms/Oyster-AverageAdjustedMergedArea.tabular)\n",
    "\n",
    "I used [Galaxy](usegalaxy.com) to join the two tables using the protein name.\n",
    "\n",
    "![untitled](https://cloud.githubusercontent.com/assets/22335838/24228692/0e7d498c-0f33-11e7-88bd-8fb6ce2acdee.png)\n",
    "\n",
    "[Here](https://raw.githubusercontent.com/RobertsLab/project-oyster-oa/master/analyses/DNR_Preliminary_Analyses_20170321/all-proteins-go-terms/skyline-proteins-go-terms.tabular) is the final table of protein names and GO terms. Peak area information is also in the spreadsheet.\n",
    "\n",
    "Steven identified proteins of interest and created a [smaller table](https://raw.githubusercontent.com/RobertsLab/project-oyster-oa/master/analyses/DNR_Preliminary_Analyses_20170321/short-list-analysis/stress-short-list.tab.txt). I will use this table moving forward."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Coefficient of Variance Table for Data Subset\n",
    "\n",
    "I calculated the coefficiant of variance for all proteins identified in the subset to understand how much average peak area, and therefore abundance, varied between the ten combinations of sample sites and eelgrass conditions.\n",
    "\n",
    "![untitled](https://cloud.githubusercontent.com/assets/22335838/24229657/f9d406a6-0f37-11e7-96d2-bf4f1e4628ac.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Protein Name\tCoefficient of Variation\r",
      "ATP-binding cassette sub-family A member 1 (ATP-binding cassette transporter 1) (ABC-1) (ATP-binding cassette 1)\t1.253314851\r",
      "\"Apoptosis-inducing factor 1, mitochondrial (EC 1.1.1.-) (Programmed cell death protein 8)\"\t0.921483289\r",
      "\"Peroxiredoxin-5, mitochondrial (EC 1.11.1.15) (Alu corepressor 1) (Antioxidant enzyme B166) (AOEB166) (Liver tissue 2D-page spot 71B) (PLP) (Peroxiredoxin V) (Prx-V) (Peroxisomal antioxidant enzyme) (TPx type VI) (Thioredoxin peroxidase PMP20) (Thioredoxin reductase)\"\t0.989366319\r",
      "Metabotropic glutamate receptor 7 (mGluR7)\t2.674513807\r",
      "Catalase (EC 1.11.1.6)\t0.458616818\r",
      "Heat shock protein beta-1 (HspB1) (28 kDa heat shock protein) (Estrogen-regulated 24 kDa protein) (Heat shock 27 kDa protein) (HSP 27) (Stress-responsive protein 27) (SRP27)\t1.196391219\r",
      "\"DnaJ homolog subfamily C member 3 (Interferon-induced, double-stranded RNA-activated protein kinase inhibitor) (Protein kinase inhibitor of 58 kDa) (Protein kinase inhibitor p58)\"\t1.069669022\r",
      "Heat shock protein beta-1 (HspB1) (28 kDa heat shock protein) (Estrogen-regulated 24 kDa protein) (Heat shock 27 kDa protein) (HSP 27) (Stress-responsive protein 27) (SRP27)\t1.168746732\r",
      "Heat shock protein beta-1 (HspB1) (Heat shock 27 kDa protein) (HSP 27)\t1.303305422\r",
      "Glucose-6-phosphate 1-dehydrogenase (G6PD) (EC 1.1.1.49)\t1.140978726\r",
      "Glycogen synthase kinase-3 beta (GSK-3 beta) (EC 2.7.11.26) (Serine/threonine-protein kinase GSK3B) (EC 2.7.11.1)\t0.857598721\r",
      "Peroxidasin (EC 1.11.1.7)\t0.929128752\r",
      "Heat shock 70 kDa protein 4L (Heat shock 70-related protein APG-1) (Osmotic stress protein 94)\t0.800052181\r",
      "Heat shock protein 83 (HSP 82)\t1.286826464\r",
      "Heat shock protein HSP 90-alpha 1\t1.28808981\r",
      "Hypoxia up-regulated protein 1 (150 kDa oxygen-regulated protein) (ORP-150) (170 kDa glucose-regulated protein) (GRP-170)\t1.869121031\r",
      "Superoxide dismutase [Cu-Zn] (EC 1.15.1.1)\t1.361991698\r",
      "Peroxiredoxin (EC 1.11.1.15) (AsPrx) (Thioredoxin peroxidase)\t0.705391595\r",
      "Bcl-2-like protein 1 (Bcl2-L-1) (Apoptosis regulator Bcl-X)\t1.452370643\r",
      "Histone deacetylase 4 (HD4) (EC 3.5.1.98)\t1.810861403\r",
      "Insulin-like growth factor 2 mRNA-binding protein 1 (IGF2 mRNA-binding protein 1) (IMP-1) (IGF-II mRNA-binding protein 1) (VICKZ family member 1) (Zip-code binding polypeptide) (Zipcode-binding protein 1) (ZBP-1)\t0.643417737\r",
      "Universal stress protein YxiE (USP YxiE)\t1.922524858\r",
      "Universal stress protein Sll1388 (USP Sll1388)\t1.402272923\r",
      "Putative universal stress protein SAOUHSC_01819\t2.049485734\r",
      "Universal stress protein MSMEG_3950/MSMEI_3859 (USP MSMEG_3950)\t2.696616574\r",
      "Universal stress protein in QAH/OAS sulfhydrylase 3'region (USP)\t1.857172894\r",
      "Putative universal stress protein SAOUHSC_01819\t2.049485734\r",
      "Universal stress protein A-like protein\t1.024429849\r",
      "Stress response protein NhaX\t0.903562525\r",
      "Heat shock protein beta-1 (HspB1) (28 kDa heat shock protein) (Estrogen-regulated 24 kDa protein) (Heat shock 27 kDa protein) (HSP 27) (Stress-responsive protein 27) (SRP27)\t1.051346873\r",
      "Cytochrome P450 1A1 (EC 1.14.14.1) (CYPIA1) (Cytochrome P450-P1)\t1.476627112\r",
      "DnaJ homolog subfamily A member 1 (DnaJ protein homolog 2) (HSDJ) (Heat shock 40 kDa protein 4) (Heat shock protein J2) (HSJ-2) (Human DnaJ protein 2) (hDj-2)\t1.445522413\r",
      "Heat shock protein beta-1 (HspB1) (28 kDa heat shock protein) (Estrogen-regulated 24 kDa protein) (Heat shock 27 kDa protein) (HSP 27) (Stress-responsive protein 27) (SRP27)\t0.984186386\r",
      "Hepatocyte growth factor receptor (HGF receptor) (EC 2.7.10.1) (HGF/SF receptor) (Proto-oncogene c-Met) (Scatter factor receptor) (SF receptor) (Tyrosine-protein kinase Met)\t1.816442791\r",
      "Multidrug resistance-associated protein 1 (ATP-binding cassette sub-family C member 1) (Leukotriene C(4) transporter) (LTC4 transporter)\t0.898224035\r",
      "Heat shock protein beta-1 (HspB1) (28 kDa heat shock protein) (Estrogen-regulated 24 kDa protein) (Heat shock 27 kDa protein) (HSP 27) (Stress-responsive protein 27) (SRP27)\t1.758080064\r",
      "Heat shock protein HSP 90-beta\t1.227667935\r",
      "Heat shock protein beta-1 (HspB1) (Heat shock 27 kDa protein) (HSP 27)\t1.303305422\r",
      "\"Superoxide dismutase [Mn], mitochondrial (EC 1.15.1.1)\"\t1.206078094\r",
      "Stress-induced-phosphoprotein 1 (STI1) (Hsc70/Hsp90-organizing protein) (Hop) (Renal carcinoma antigen NY-REN-11) (Transformation-sensitive protein IEF SSP 3521)\t1.90652305\r",
      "\"Heat shock protein 75 kDa, mitochondrial (HSP 75) (TNFR-associated protein 1) (Tumor necrosis factor type 1 receptor-associated protein) (TRAP-1)\"\t0.990699714\r",
      "Universal stress protein in QAH/OAS sulfhydrylase 3'region (USP)\t1.555452384\r",
      "Extracellular superoxide dismutase [Cu-Zn] (EC-SOD) (EC 1.15.1.1) (Superoxide dismutase B)\t0.566196417"
     ]
    }
   ],
   "source": [
    "!head /Users/yaamini/Documents/project-oyster-oa/analyses/DNR_Preliminary_Analyses_20170321/short-list-analysis/coefficient-of-variance-stress-short-list.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Enrichment for Data Subset\n",
    "\n",
    "I will use DAVID to find proteins overrepresented in my data subset.\n",
    "\n",
    "[Uniprot accession codes for DAVID](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Preliminary_Analyses_20170321/DAVID-accession-codes.xlsx)\n",
    "\n",
    "![1-annotationoptions](https://cloud.githubusercontent.com/assets/22335838/24230264/50371f58-0f3b-11e7-8258-7dc355bcc564.png)\n",
    "\n",
    "[Biological Processes](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Preliminary_Analyses_20170321/short-list-analysis/DAVID-files/short-list-GOBP.txt)\n",
    "\n",
    "[Cellular Components](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Preliminary_Analyses_20170321/short-list-analysis/DAVID-files/short-list-GOCC.txt)\n",
    "\n",
    "[Molecular Function](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Preliminary_Analyses_20170321/short-list-analysis/DAVID-files/short-list-GOMF.txt)\n",
    "\n",
    "[Kegg Pathway](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Preliminary_Analyses_20170321/short-list-analysis/DAVID-files/short-list-kegg.txt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### REVIGO for Data Subset\n",
    "\n",
    "Using REVIGO, I will create plots for my data subset using both p-values from DAVID and coefficient of variance from my previous analysis.\n",
    "\n",
    "#### Coefficient of Variation\n",
    "\n",
    "The first thing I needed to do was modify the [stress short list table] and include coefficient of variation. I also needed to unfold the GO IDs, as they were all in one column.\n",
    "\n",
    "![1-reformattable](https://cloud.githubusercontent.com/assets/22335838/24230657/6fc5ba1c-0f3d-11e7-99e3-ddf46b8c5ece.png)\n",
    "\n",
    "Because a combination of GO IDs could refer to the same protein, I used only the first GO ID listed for each protein in REVIGO with an accompanying covariance.\n",
    "\n",
    "![2-revigosettings](https://cloud.githubusercontent.com/assets/22335838/24230864/8691ca28-0f3e-11e7-870d-9f1c612061a6.png)\n",
    "\n",
    "![coefficient-of-variation-REVIGO](https://raw.githubusercontent.com/RobertsLab/project-oyster-oa/master/analyses/DNR_Preliminary_Analyses_20170321/short-list-analysis/REVIGO/coefficient-of-variation-biological-processes.png)\n",
    "\n",
    "#### DAVID p-values\n",
    "\n",
    "I did the same thing using p-values less than 0.05 and associated GO IDs from DAVID.\n",
    "\n",
    "![1-revigosettings](https://cloud.githubusercontent.com/assets/22335838/24231063/a1a9f1f4-0f3f-11e7-838a-6e05cb8d4b8a.png)\n",
    "\n",
    "![p-value-REVIGO](https://raw.githubusercontent.com/RobertsLab/project-oyster-oa/master/analyses/DNR_Preliminary_Analyses_20170321/short-list-analysis/REVIGO/p-value-biological-processes.png)\n",
    "\n",
    "The p-value REVIGO plot has more information, while the coefficient of variation REVIGO plot shows how consistent stress preotin abundance is between stites."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### NMDS Plot for Data Subset\n",
    "\n",
    "Using my data subset of just stress-related proteins, I'm going to remake my NMDS plot and see if that changes anything.\n",
    "\n",
    "Here is the [R script](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Preliminary_Analyses_20170321/short-list-analysis/R-analyses/DNR-Preliminary-Data-Analysis-Subset.R).\n",
    "\n",
    "![subsetNMDS](https://raw.githubusercontent.com/RobertsLab/project-oyster-oa/master/analyses/DNR_Preliminary_Analyses_20170321/short-list-analysis/R-analyses/subsetNMDS-revised.png)\n",
    "\n",
    "Wow, these proteins are actually clustering more than the NMDS from my entire protein list! There doesn't seem to be a distinct directionality with temperature based on [summary data](https://github.com/RobertsLab/project-oyster-oa/blob/master/data/Environmental%20Summary%20Data%20for%20Proteomics%20Project.csv)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Heatmap for Data Subset\n",
    "\n",
    "Here is the [R Script](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Preliminary_Analyses_20170321/short-list-analysis/R-analyses/DNR-Preliminary-Data-Analysis-Subset.R).\n",
    "\n",
    "![subset heatmap](https://raw.githubusercontent.com/RobertsLab/project-oyster-oa/master/analyses/DNR_Preliminary_Analyses_20170321/short-list-analysis/R-analyses/subsetHeatmap.png)\n",
    "\n",
    "Once again, this one doesn't show much variation, but it is much cleaner than my previous heatmap."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This concludes all of my preliminary analysis work! Now I'll get started on that NSA poster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}