# Preliminary Analysis for Full Skyline Document

After loading in a [revised .blib](https://yaaminiv.github.io/Skyline-Attempt-3/) and [error checking](https://yaaminiv.github.io/Skyline-Error-Checking-Round2/), I noticed that I had 9047 proteins instead of the 6000 that were identified during my [preliminary analysis](https://yaaminiv.github.io/Preliminary-Data-Analysis/) for my oyster seed .blib. In this notebook, I run through the same pipeline to see how the new Skyline document differs from the old one.

## A few differences between the Skyline documents

- New .blib is demultiplexed
  - Skyline would have an easier time differentiating peaks
- New .blib made from demultiplexed data, not oyster seed spectra

## Step 1: Modify data

The [data I exported from Skyline after error checking](http://owl.fish.washington.edu/spartina/DNR_Skyline_20170524/2017-06-10-protein-areas-only-error-checked.csv) has protein and area information, but it's still broken down by peptide transition. The first thing I need to do is average the peak area for each mass spectrometer across transitions.

I used the `aggregate` function in an [R script](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Skyline_20170524/2017-06-13-Full-Skyline-Preliminary-Analysis.R) to average peak areas across transitions for each mass spectrometer. I then added more informative column headers that specified site and eelgrass condition. Finally, I averaged areas across mass spectrometer replicates. In my final spreadsheet, each site and eelgrass condition had only one column of associated protein peak areas.
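As a rough sketch of the averaging described above (not the script's exact code), the snippet below runs base R's `aggregate` on a small made-up data frame. The column names `Protein`, `Sample`, `Transition`, and `Area`, and the mapping from samples to conditions, are assumptions for illustration only.

```r
# Toy long-format export: one row per protein, sample (mass spec run), and transition.
# All names and values here are made up for illustration.
areas <- data.frame(
  Protein    = rep(c("P1", "P2"), each = 8),
  Sample     = rep(rep(c("S1", "S2", "S3", "S4"), each = 2), times = 2),
  Transition = rep(c("y4", "y5"), times = 8),
  Area       = c(10, 20, 30, 40, 50, 70, 90, 110,
                 2, 4, 6, 8, 10, 14, 18, 22),
  stringsAsFactors = FALSE
)

# 1. Average peak area across transitions for each protein in each sample.
by_sample <- aggregate(Area ~ Protein + Sample, data = areas, FUN = mean)

# 2. Label each sample with its site and eelgrass condition.
#    Hypothetical mapping: S1/S2 are replicates of one condition, S3/S4 of another.
conditions <- c(S1 = "CI_bare", S2 = "CI_bare", S3 = "CI_eelgrass", S4 = "CI_eelgrass")
by_sample$Condition <- unname(conditions[as.character(by_sample$Sample)])

# 3. Average across replicates, then spread to one column per condition.
by_condition <- aggregate(Area ~ Protein + Condition, data = by_sample, FUN = mean)
wide <- reshape(by_condition, idvar = "Protein", timevar = "Condition", direction = "wide")
print(wide)
```

The result has one row per protein and one `Area.<condition>` column per site and eelgrass condition, which matches the shape of the final spreadsheet described above.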




## Step 2: Create NMDS Plot

My first step in comparing my full dataset is to see if there is any site or eelgrass condition clustering. Last time, I created an NMDS plot for my preliminary dataset but found no clustering pattern. The code I used is in the same R script as **Step 1**, but can also be found below.
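For reference, an NMDS ordination of this kind can be sketched with `metaMDS` from the `vegan` package. This is an illustration on simulated data, not the code from the script; the use of `vegan`, the Bray-Curtis distance, the site abbreviations, and all object names are assumptions.

```r
library(vegan)

set.seed(42)

# Simulated area matrix: 10 samples (rows) by 50 proteins (columns).
n_samples <- 10
n_proteins <- 50
areas <- matrix(rlnorm(n_samples * n_proteins, meanlog = 10, sdlog = 1),
                nrow = n_samples,
                dimnames = list(paste0("Sample", 1:n_samples),
                                paste0("Protein", 1:n_proteins)))

# Hypothetical site labels, two samples per site.
site <- factor(rep(c("CI", "FB", "PG", "SK", "WB"), each = 2))

# Two-dimensional NMDS on Bray-Curtis dissimilarities.
# autotransform = FALSE stops vegan from transforming the areas on its own.
fit <- metaMDS(areas, distance = "bray", k = 2, autotransform = FALSE, trace = 0)

# One row of coordinates per sample; color points by site to look for clustering.
coords <- scores(fit, display = "sites")
plot(coords, col = as.integer(site), pch = 19, xlab = "NMDS1", ylab = "NMDS2")
legend("topright", legend = levels(site), col = seq_along(levels(site)), pch = 19)
```

With random input like this, no clustering is expected; points from the same site landing close together in a real dataset is what suggests a site effect.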



My NMDS plot this time shows very clear site clustering for Port Gamble and Case Inlet. Willapa Bay and Fidalgo Bay are more loosely clustered, while the Skokomish River Delta samples do not cluster together at all.

![NMDS](https://raw.githubusercontent.com/RobertsLab/project-oyster-oa/master/analyses/DNR_Skyline_20170524/full-NMDS.jpeg)

## Step 3: Create Heatmap

Using `pheatmap` in R, I made a heatmap for all 9,047 proteins across all 10 site and eelgrass conditions.
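A minimal sketch of a `pheatmap` call on a matrix shaped like this one is shown below. The data are simulated, and the log transform, row scaling, and condition labels are illustrative assumptions rather than the settings used for the figure.

```r
library(pheatmap)

set.seed(1)

# Simulated matrix: proteins in rows, 10 site-by-eelgrass conditions in columns.
conditions <- paste(rep(c("CI", "FB", "PG", "SK", "WB"), each = 2),
                    rep(c("bare", "eelgrass"), times = 5), sep = "_")
areas <- matrix(rlnorm(30 * 10, meanlog = 10, sdlog = 1), nrow = 30,
                dimnames = list(paste0("Protein", 1:30), conditions))

# Log-transform the skewed areas and scale each protein (row) so that colors
# reflect relative differences between conditions. Both are illustrative choices.
hm <- pheatmap(log2(areas + 1), scale = "row", show_rownames = FALSE)
```

With thousands of proteins, hiding row names (`show_rownames = FALSE`) keeps the figure legible; the column dendrogram then shows which conditions have the most similar area profiles.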




Like last time, there are no clear patterns. However, it is important to note that the Skokomish River eelgrass condition shows different colors from the other conditions across the board. This could be because the eelgrass location at this site was in lower-salinity water than the other locations.

![heatmap](https://raw.githubusercontent.com/RobertsLab/project-oyster-oa/master/analyses/DNR_Skyline_20170524/fullHeatmap.png)

## Step 4: Merge Protein Areas with GO Terms

This was done in the same [R script](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Skyline_20170524/2017-06-13-Full-Skyline-Preliminary-Analysis.R) as the three previous steps.



I copied and pasted the GO terms from the Skyline output in [this spreadsheet](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Skyline_20170524/2017-06-13-DAVID-accession-codes.xlsx).
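A merge of this kind can be sketched with base R's `merge`. The two data frames below are made up, and their column names (`Protein`, `GO`, and the area columns) are assumptions, not the names used in the script.

```r
# Made-up protein areas, one column per site and eelgrass condition.
areas <- data.frame(
  Protein     = c("P1", "P2", "P3"),
  CI_bare     = c(25, 5, 12),
  CI_eelgrass = c(80, 16, 9),
  stringsAsFactors = FALSE
)

# Made-up annotation table linking proteins to GO terms.
annotations <- data.frame(
  Protein = c("P1", "P2", "P4"),
  GO      = c("GO:0006412", "GO:0006457", "GO:0006094"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every protein that has an area,
# even when it has no GO term (P3 gets NA in the GO column).
merged <- merge(areas, annotations, by = "Protein", all.x = TRUE)
print(merged)
```

Keeping unannotated proteins in the merged table makes it easy to count how many proteins drop out before enrichment analysis.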

## Step 5: Gene Enrichment in DAVID

Copied and pasted GO terms for gene list, then background proteome, into DAVID.









Downloaded [biological processes](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Skyline_20170524/207-06-13-full-biological-processes.txt), [cellular components](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Skyline_20170524/207-06-13-full-cellular-components.txt), and [molecular function](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Skyline_20170524/207-06-13-full-molecular-function.txt) functional annotation tables.







Also downloaded the [kegg pathway](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Skyline_20170524/207-06-13-full-kegg-pathway.txt) functional annotation table, as there was no interesting accompanying graphic.

## Step 6: REVIGO

Using REVIGO, I created a similar plot for the biological processes overexpressed in my new data set. To limit the amount of information in the graphic, I used the first 44 GO terms provided by DAVID. This is the same restriction I used last time.
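The 44-term cutoff could be applied in R before pasting into REVIGO, roughly as sketched below. The `david` data frame is a made-up stand-in for a DAVID table; its `Term` and `PValue` column names and the `GO:id~name` term format are assumptions about the export.

```r
# Made-up stand-in for a DAVID table (only four rows here).
david <- data.frame(
  Term = c("GO:0006412~translation", "GO:0006457~protein folding",
           "GO:0006094~gluconeogenesis", "GO:0008380~RNA splicing"),
  PValue = c(1e-8, 3e-5, 2e-4, 4e-3),
  stringsAsFactors = FALSE
)

n_terms <- 44  # same cutoff as in the text

# Keep the first n_terms rows in the order DAVID listed them.
top <- head(david, n_terms)

# Strip the "~name" suffix to leave only the GO ID.
revigo_input <- data.frame(
  GO     = sub("~.*$", "", top$Term),
  PValue = top$PValue,
  stringsAsFactors = FALSE
)

# Tab-separated, no header: one GO ID and p-value per line, printed to the console.
write.table(revigo_input, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```

Using the same cutoff for both data sets keeps the two REVIGO plots comparable in size.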



Final REVIGO plot:



There are different biological processes expressed in my new data set!

Terms common between both REVIGO plots: 

- oxidation-reduction
- cell-cell adhesion
- metabolism
- gluconeogenesis
- mRNA processing
- protein folding

Similar terms:

In each pair, the top term is from the previous REVIGO plot and the indented term below it is from this REVIGO plot.

- translation
  - translational initiation
- cell redox homeostasis
  - oxidation-reduction process
- protein homotetramerization
  - protein folding
- reproduction
  - embryo development ending in birth or egg hatching
- lipid homeostasis
  - fatty acid beta oxidation

Not present:

Terms in the previous REVIGO but not in this one:

- response to stress

Different:

Terms present in this REVIGO but not the previous one:

- regulation of endocytosis
- mRNA processing
- RNA splicing
- negative regulation of mRNA splicing, via spliceosome
- **miRNA MEDIATED INHIBITION OF TRANSLATION <-- THIS IS THE MOST EXCITING GO TERM (in my opinion)**
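The shared and unique lists above can be produced with base R set operations. The two vectors below are small illustrative subsets of the terms, not the full REVIGO outputs, and the object names are hypothetical.

```r
# Illustrative subsets of the GO terms from each REVIGO plot.
previous <- c("translation", "protein folding", "gluconeogenesis", "response to stress")
current  <- c("translational initiation", "protein folding", "gluconeogenesis", "RNA splicing")

shared        <- intersect(previous, current)  # terms in both plots
only_previous <- setdiff(previous, current)    # terms dropped in the new plot
only_current  <- setdiff(current, previous)    # terms new to this plot

print(shared)
print(only_previous)
print(only_current)
```

Exact string matching cannot recognize related terms such as "translation" and "translational initiation", so the "similar terms" pairs still need to be matched by hand.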

My next steps are to work out the kinks in MSstats and find a list of potential targets, then walk through the same analysis pipeline.