# Data-Independent Mass Spectrometry: From RAW files to Analyses

In this notebook, I'll outline the workflow necessary to take my *Crassostrea gigas* Data-Independent Mass Spectrometry (DIA) data from `RAW` files off of the mass spectrometer to a `.csv` usable in R Studio. DIA allows us to identify all peptides in each oyster sample without having any prior knowledge of what we may find.

*Note: This pipeline relies heavily on Windows-based programs*

## Step 1: Collect Materials

**Software Dependencies**:

- [Skyline daily](https://skyline.ms/project/home/software/Skyline/daily/register-form/begin.view?)
 - This version of Skyline updates as the software is modified. You need permission to download it. The version of Skyline daily used in this notebook is Skyline-daily (64-bit) 3.7.1.11446 (as of 2017-10-11).
- [MSConvert]()
 - Special version of MSConvert modified by Austin in Genome Sciences on 2017-04-18.
- [R](https://www.r-project.org) and [R Studio](https://www.rstudio.com)
 - R version 3.4.0 (2017-04-21) -- "You Stupid Darkness"
 - R Studio Version 1.0.143
 
**Specific files (in order used)**:

- [config file for MSConvert](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_MSConvert_20170412/config_fix_20170413.txt)
 - This file contains all of the information MSConvert needs to demultiplex RAW files. It includes an argument specific to the isolation scheme used by the mass spectrometer to collect data.
- [.blib file from `pecanpie`](http://owl.fish.washington.edu/spartina/DNR_Skyline_20170524/2017-05-23-oyster-desearleinated.blib)
 - A .blib file is a library of all peptides you want to identify in your samples. For this experiment, the .blib contains the entire *C. gigas* proteome. This is the same .blib file used in my DIA analyses. This file was created by Emma Timmins-Schiffman and her team in `brecan`, a version of `pecanpie` produced by Brian Searle in Genome Sciences. Jarrett Egertson wrote a script to get rid of incompatibilities generated by `brecan`. The final compatible document is found above. For more information regarding .blib creation, see this [lab notebook entry](https://yaaminiv.github.io/Skyline-Attempt-3/).
- [Undigested **FASTA** background proteome, including QC protein sequence](http://owl.fish.washington.edu/spartina/DNR_Skyline_20170427/Combined-gigas-QC.fasta)
- [RAW files](http://owl.fish.washington.edu/spartina/January_2017_DNR_Raw_Data/Oyster_raw_files/)
- [Key for samples](http://owl.fish.washington.edu/spartina/January_2017_DNR_Raw_Data/2017_January_23.csv)

## Step 2: Demultiplex RAW data in MSConvert

DIA mass spectrometry uses overlapping mass-to-charge (m/z) windows to capture all peptides in a sample. To discern which peptides and transitions were found in each window, RAW files are demultiplexed. This allows us to parse out peptide abundance data from each individual window. Demultiplexing requires the MSConvert command line interface.

### Step 2a: Install the MSConvert CLI

Unzip the MSConvert file from Step 1 into your desired working directory. This should be the same directory that contains the RAW files to convert.

### Step 2b: Convert RAW files

Use the following code to demultiplex all RAW files and convert them to .mzML files:

`msconvert.exe -c config_fix_20170413.txt *.raw`
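After a batch conversion it can be worth confirming that every RAW file produced an output. The sketch below is one way to do that in Git Bash. It assumes MSConvert writes `<name>.mzML` next to `<name>.raw`, which may not hold if your config file sets a different output directory. The function name `missing_mzml` is my own, not part of MSConvert.

```shell
# List RAW files in a directory that have no non-empty .mzML counterpart.
# Assumption: outputs are written alongside inputs with the same base name.
missing_mzml() {
  dir="${1:-.}"
  for raw in "$dir"/*.raw; do
    [ -e "$raw" ] || continue            # glob matched nothing
    mzml="${raw%.raw}.mzML"              # swap the extension
    [ -s "$mzml" ] || printf '%s\n' "$(basename "$raw")"
  done
}
```

Running `missing_mzml .` in the working directory prints nothing when every RAW file has a matching, non-empty .mzML file.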

## Step 3: Prepare Peptide Database

### Step 3a. Merge reference proteome with Quality Standards.

Concatenate the reference proteome and quality standard FASTA files into a single file.

[C. gigas proteome](http://owl.fish.washington.edu/halfshell/bu-git-repos/nb-2017/C_gigas/data/Cg_Gigaton_proteins.fa). 
A tabular version of the QC peptide list for the January 2017 Lumos run can be found [here](http://owl.fish.washington.edu/generosa/Generosa_DNR/Pierce_PRTC.tabular).
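A minimal sketch of the concatenation, assuming the QC sequences are already in FASTA format (the tabular list linked above would need to be converted first). The function names and any file names you pass in are placeholders of mine. Using `awk 1` instead of plain `cat` guarantees each input ends with a newline, so a header line from the second file cannot fuse with the last sequence of the first.

```shell
# Concatenate FASTA files into one output file.
# Usage: merge_fasta combined.fasta proteome.fa qc.fasta
merge_fasta() {
  out="$1"; shift
  awk 1 "$@" > "$out"      # prints every line, adding a final newline if missing
}

# Count records (header lines) in a FASTA file as a quick sanity check.
count_fasta_records() {
  grep -c '^>' "$1"
}
```

The record count of the merged file should equal the sum of the record counts of the inputs.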

### Step 3b. Run in silico tryptic digestion of reference proteome with quality standards.

This is done with [Protein Digestion Simulator](https://omics.pnl.gov/software/protein-digestion-simulator), a Windows program.

The settings for the in silico tryptic digest, as discussed on [Issue #483](https://github.com/sr320/LabDocs/issues/483), are shown in the screenshots below. Each screenshot refers to a different tab in the program window. The screenshots are sequential (Tab #1-Tab #4).

Ensure the output, a digested reference proteome, is in a .tabular format.

![PDS tab 1](https://github.com/RobertsLab/Paper-DNR-Proteomics/blob/master/images/2017-02-19_final-Digest-Settings1.png?raw=true)

*Note: "Delimited Input File Options" depends on the format of your fasta file. This one only has protein ID and sequence.*

![PDS tab 2](https://github.com/RobertsLab/Paper-DNR-Proteomics/blob/master/images/2017-02-19_final-Digest-Settings2.png?raw=true)

![PDS tab 3](https://github.com/RobertsLab/Paper-DNR-Proteomics/blob/master/images/2017-02-19_final-Digest-Settings3.png?raw=true)

*Note: Use default settings*

![PDS tab 4](https://github.com/RobertsLab/Paper-DNR-Proteomics/blob/master/images/2017-02-19_final-Digest-Settings4.png?raw=true)

*Note: Use default settings*

To execute the digest, select "Parse and Digest" on Tab #2.

### Step 3c. Modify .tabular digested proteome

The digested proteome contains columns that PECAN does not need.

![full-digested-proteome](https://cloud.githubusercontent.com/assets/22335838/23740214/790b2008-0457-11e7-8c26-e3aea0881759.png)

The only columns needed for PECAN are the first two: "Protein_Name" and "Sequence". These columns can be extracted using `awk` or in [Galaxy](https://usegalaxy.org). The final .tabular digested proteome should look like this:

![modified-digested-proteome](https://cloud.githubusercontent.com/assets/22335838/23740215/790e76e0-0457-11e7-911f-aa65f6069b61.png)

Example on Windows using Git Bash:
```

srlab@swan MINGW64 ~
$ cd Desktop/

srlab@swan MINGW64 ~/Desktop
$ cd grace/

srlab@swan MINGW64 ~/Desktop/grace
$ head Cg_Giga_cont_prtc_AA_digested_Mass400to6000.txt
Protein_Name Sequence Unique_ID Monoisotopic_Mass Predicted_NET Tryptic_Name
CHOYP_043R.5.5|m.64252 SPSEDPDAPIENILQTNSVYKPK 1 2541.2598016 0.3655 t2.1
CHOYP_043R.5.5|m.64252 SPSEDPDAPIENILQTNSVYKPKK 2 2669.3547598 0.3414 t2.2
CHOYP_043R.5.5|m.64252 SPSEDPDAPIENILQTNSVYKPKKEPTYDENVVVK 3 3942.973762 0.3449 t2.3
CHOYP_043R.5.5|m.64252 SPSEDPDAPIENILQTNSVYKPKKEPTYDENVVVKIISQDTPTILR 4 5180.676764 0.5144 t2.4
CHOYP_043R.5.5|m.64252 KEPTYDENVVVK 5 1419.7245246 0.2186 t3.2
CHOYP_043R.5.5|m.64252 KEPTYDENVVVKIISQDTPTILR 6 2657.4275266 0.4593 t3.3
CHOYP_043R.5.5|m.64252 KEPTYDENVVVKIISQDTPTILRVSFTVNR 7 3460.8564928 0.5658 t3.4
CHOYP_043R.5.5|m.64252 EPTYDENVVVK 8 1291.6295664 0.2301 t4.1
CHOYP_043R.5.5|m.64252 EPTYDENVVVKIISQDTPTILR 9 2529.3325684 0.4402 t4.2

srlab@swan MINGW64 ~/Desktop/grace
$ awk '{print $1,$2}' Cg_Giga_cont_prtc_AA_digested_Mass400to6000.txt | head
Protein_Name Sequence
CHOYP_043R.5.5|m.64252 SPSEDPDAPIENILQTNSVYKPK
CHOYP_043R.5.5|m.64252 SPSEDPDAPIENILQTNSVYKPKK
CHOYP_043R.5.5|m.64252 SPSEDPDAPIENILQTNSVYKPKKEPTYDENVVVK
CHOYP_043R.5.5|m.64252 SPSEDPDAPIENILQTNSVYKPKKEPTYDENVVVKIISQDTPTILR
CHOYP_043R.5.5|m.64252 KEPTYDENVVVK
CHOYP_043R.5.5|m.64252 KEPTYDENVVVKIISQDTPTILR
CHOYP_043R.5.5|m.64252 KEPTYDENVVVKIISQDTPTILRVSFTVNR
CHOYP_043R.5.5|m.64252 EPTYDENVVVK
CHOYP_043R.5.5|m.64252 EPTYDENVVVKIISQDTPTILR

srlab@swan MINGW64 ~/Desktop/grace
$ awk '{print $1,$2}' Cg_Giga_cont_prtc_AA_digested_Mass400to6000.txt \
> > Cg_Giga_cont_prtc_AA_M400-6000-2c.txt
```
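The `awk` call in the transcript above splits on any whitespace, which works for this file but would misplace columns if a protein name ever contained a space. A slightly more defensive sketch, assuming the digest output is tab-delimited (as the .tabular format suggests), is below. The function name `first_two_columns` is hypothetical. The output separator defaults to a single space to match the transcript above; I have not confirmed which delimiter PECAN prefers.

```shell
# Keep only the first two tab-separated columns (Protein_Name, Sequence).
# Usage: first_two_columns digested.txt [output separator, default one space]
first_two_columns() {
  awk -F'\t' -v OFS="${2:- }" '{ print $1, $2 }' "$1"
}
```

For example, `first_two_columns Cg_Giga_cont_prtc_AA_digested_Mass400to6000.txt > Cg_Giga_cont_prtc_AA_M400-6000-2c.txt` would write the same two columns as the transcript, and passing `'\t'` as the second argument keeps the output tab-delimited.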

## Step 4. Run PECAN

[PECAN](https://bitbucket.org/maccosslab/pecan/overview) correlates your acquired peptide spectra to a database of known sequences and creates a library of proteins and peptides that you detected in your experiment. PECAN requires several inputs, each of which must be prepared before running PECAN in the command line.

### Step 4a. Create two text files that indicate path for 1) the reference peptide list (in silico protein digestion) and 2) spectra data (mzML files)

The background proteome should be a .txt list with protein names and sequences. It can be the same file as the one generated by the in silico protein digestion step above.

[example peptide path file](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_PECAN_Run_3_20170308/2017-03-08-background-peptides-path-list.txt)

Similar to the peptide path file, the spectra path file should be a .txt file listing the path to each `.mzML` file you want to analyze.

[example .mzML path file](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_PECAN_Run_3_20170308/2017-03-08-mzML-file-path-list.txt)
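Rather than typing each path by hand, the spectra path file could be generated from the directory contents. This is a sketch under two assumptions: that one absolute path per line is acceptable (compare against the example file above), and that all .mzML files live in one directory. The function name `write_mzml_path_list` is my own.

```shell
# Write the absolute path of every .mzML file in a directory to a list file,
# one path per line.
# Usage: write_mzml_path_list /path/to/mzML/dir mzML-file-path-list.txt
write_mzml_path_list() {
  dir="$(cd "$1" && pwd)" || return 1    # resolve to an absolute path
  out="$2"
  : > "$out"                             # start with an empty list file
  for f in "$dir"/*.mzML; do
    if [ -e "$f" ]; then
      printf '%s\n' "$f" >> "$out"
    fi
  done
}
```

The number of lines in the resulting file should match the number of samples you intend to analyze.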

### Step 4b. Generate isolation scheme file

The isolation scheme is a list of DIA precursor windows with m/z ratios used to analyze samples. The isolation schemes are experiment-dependent, so you need to look at your mass spectrometry method file to figure it out. The final isolation scheme should be formatted as a .csv file. The m/z ratios should be paired.

[example isolation scheme](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/2018-02-28-PECAN/PECAN-inputs/2017-03-03-isolation-windows.csv)
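A quick check of the isolation scheme file before handing it to PECAN might look like the sketch below. It assumes each line holds exactly one comma-separated pair, lower bound then upper bound, with no header row. That layout is my reading of "paired" above, so compare against the example file before relying on it. The function name `check_isolation_scheme` is hypothetical.

```shell
# Print every line that is not a "lower,upper" pair of numbers with lower < upper.
# Prints nothing when all lines pass.
check_isolation_scheme() {
  awk -F',' '
    { sub(/\r$/, "") }     # tolerate Windows line endings
    NF != 2 || $1 !~ /^[0-9]+(\.[0-9]+)?$/ || $2 !~ /^[0-9]+(\.[0-9]+)?$/ || $1 + 0 >= $2 + 0 {
      printf "line %d: %s\n", NR, $0
    }
  ' "$1"
}
```

Any line it reports is worth checking against the mass spectrometry method file.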

### Step 4c. Add Background Proteome to PECAN

PECAN needs to be compiled with the background proteome. To do this:

1. Copy the background proteome to `/home/shared/pecan/PECAN/PecanUtil/`

2. Add the background proteome to the config file via `gedit /home/shared/pecan/PECAN/PecanUtil/config`. Add it to the end of the file in the format `speciesname = file name`, e.g. `geoduck = geoduck-background-proteome.csv`

3. Recompile PECAN via `python /home/shared/pecan/setup.py install`
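The first two steps can also be scripted instead of editing the config by hand. The sketch below is an assumption-laden convenience, not part of PECAN: `register_background` is a hypothetical helper, and it assumes the config file already ends with a newline. It appends the entry only once, so rerunning it does not duplicate lines.

```shell
# Copy a background proteome into PecanUtil and register it in the config file.
# Usage: register_background geoduck geoduck-background-proteome.csv /home/shared/pecan/PECAN/PecanUtil
register_background() {
  species="$1"; proteome="$2"; util_dir="$3"
  entry="$species = $(basename "$proteome")"
  cp "$proteome" "$util_dir"/ || return 1
  # Append only if the exact line is not already in the config.
  grep -qxF "$entry" "$util_dir/config" 2>/dev/null || printf '%s\n' "$entry" >> "$util_dir/config"
}
```

PECAN still needs to be recompiled afterwards, as in item 3 above.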

### Step 4d. Create a peptide spectral library from your data

This will be done using the `pecanpie` command and the code below:

```
# -o: directory to create; you will navigate to it afterwards to run the search.
# -n: name of the .blib file you will use in Skyline (no .blib extension needed).
# -s: species; it may need to be configured beforehand every time this command is run.
# --pecanMemRequest: GB estimate of the memory you need. If your estimate is too low,
#   the program will suggest a higher memory for you to use.
pecanpie -o [directory to create] \
-b [digested background proteome] \
-n [blib file name] \
-s [species] \
--isolationSchemeType BOARDER \
--pecanMemRequest [GB estimate] \
[filename for mzML file path list] \
[filename for peptide file path] \
[filename for isolation scheme file name] \
--fido --jointPercolator
```

This step will be fast because it is just setting up the real PECAN run. Navigate to your newly created directory (specified by the `-o` argument) and run the actual search:

```
cd [directory specified by the -o argument for pecanpie]
./run_search.sh
```

The search takes a while to run, especially if you are looking for all possible peptides in your proteome. You can check the status of your job with the following code:

```
qstat -f
```

### Step 4e. Confirm creation of .blib file

When `pecanpie` is done running, navigate to the `pecan2blib` folder within your output directory. Inside that folder, you'll find the .blib file, the name of which is specified by `pecanpie -n`.

The .blib file you create will be used in the Skyline workflow.
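Before moving to Skyline, a quick sanity check on the library file can catch an empty or truncated output. BiblioSpec .blib libraries are SQLite databases, so the file should begin with the standard SQLite header string. The sketch below only checks that header; it does not validate the library contents. The function name `looks_like_blib` is hypothetical.

```shell
# Succeed if the file is non-empty and starts with the SQLite header string.
# Usage: looks_like_blib pecan2blib/[blib file name].blib && echo "library looks OK"
looks_like_blib() {
  [ -s "$1" ] && [ "$(head -c 15 "$1")" = "SQLite format 3" ]
}
```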

## Step 3: Create a new Skyline Document

### Step 3a: Import the Spectral Library

1. Open a new Skyline file ("Blank Document")
2. Under Settings >> Peptide Settings >> Library, click "Edit List." The *C. gigas* .blib should already be on the list (2017-05-23-oyster-desearlinated).
3. If the .blib is not already on the list, click "Add"
 - Name the library and select the .blib file
 - After clicking "OK," select the correct library from the list
4. Under "Pick peptides matching," select "Library"

### Step 3b: Add Background Proteome

1. Settings >> Peptide Settings >> Digestion
2. Select "Gigas-4-27" under "Background proteome"
3. If "Gigas-4-27" is not an option
 - Select "Add" under "Background proteome"
 - Name background
 - Click "Create" under "Proteome file" to choose where to save the background
 - Click "Add file" under "FASTA files", and select your background proteome fasta file
 - After the FASTA loads, Skyline may prompt you about deleting repeated protein sequences. Select "OK"
 
 ![unnamed-1](https://user-images.githubusercontent.com/22335838/31412862-93c9d204-adcb-11e7-9fd4-68eca9957ee7.png) 

4. Select "Trypsin [KR | P]" under "Enzyme"

### Step 3c: Adjust Peptide Settings

1. Under the Prediction tab, make sure "none" is selected for retention time predictor

2. Under the Filter tab, select "Auto-select all matching peptides" and ensure the Min and Max lengths are 2 and 25. Under "Exclude N-Terminal AAs" enter zero.

3. Under the Modification tab, select "Carbamidomethyl (C)" under structural modifications, "heavy" Isotope label type, and "light" Internal standard type.

4. Use the following settings under the Quantification tab

 *[Screenshot: Quantification tab settings]*

### Step 3d: Populate Analyte Tree

1. Under File >> Import, select "Transition List"
2. Select the transition list (.csv) outlined in Step 1. This will populate the analyte tree only with targeted proteins and quality control (PRTC) peptides.

Skyline will keep the proteins, peptides, and transitions that match what it finds in the spectral library imported in Step 3a (Import the Spectral Library).

### Step 3e: Adjust Transition Settings

1. **Settings >> Transition Settings >> Prediction**
 - Precursor mass: Monoisotopic
 - Product ion mass: Monoisotopic
 - Collision energy: Thermo TSQ Vantage
 - Declustering potential: None
 - Optimization library: None
 - Compensation voltage: None
 - DO NOT select "Use optimization values when present"

2. **Settings >> Transition Settings >> Filter >> Peptides**
 - Precursor charges: 2, 3
 - Ion charges: 1, 2
 - Ion types: y
 - Product ion selection
   - From: ion 2
   - To: last ion
 - DO NOT select any Special ions
 - DO NOT specify any Precursor m/z exclusion window
 - DO select "Auto-select all matching transitions"
 
3. **Transition Settings >> Library**
 - Ion match tolerance: 0.5 m/z
 - DO NOT select "If a library spectrum is available, pick its most intense ions"
 
4. **Transition Settings >> Instrument**
 - Min m/z: 100
 - Max m/z: 2000
 - DO NOT select "Dynamic min product m/z"
 - Method match tolerance m/z: 0.055 m/z
 - DO NOT specify any other settings on this tab
 
5. **Transition Settings >> Full-Scan**
 - Isotope peaks included: Count
 - Precursor mass analyzer: Orbitrap
 - Peaks: 3
 - Resolving power: 60,000
 - At: 400 m/z
 - Isotope labeling enrichment: Default
 - Acquisition method: Targeted
 - Product mass analyzer: Centroided
 - Isolation scheme: None
 - Mass Accuracy: 20 ppm
 - DO NOT select "Use high-sensitivity extraction"
 - Select "Use only scans within 2 minutes of MS/MS IDs"

## Step 5: Clean Data

In this step, files without any data will be removed and peaks in Skyline will be verified against predicted retention times.

### Step 5a: Check that all QC peptides are chosen correctly

### Step 5b: Spot check peptides

### Step 5c: Export data

To proceed with downstream analyses, data must be exported from Skyline as a .csv.

Under File >> Export >> Report, use the settings shown below to export Skyline data as a .csv, but do not include the "Total Ion Current Area" option.

![30132381-03f87a94-9305-11e7-8dfa-812e738abbd0](https://user-images.githubusercontent.com/22335838/30983237-ad0f3056-a43e-11e7-8d51-d99207d262e1.png)

Exported data can be found [here](http://owl.fish.washington.edu/spartina/DNR_SRM_20170728/Analyses/2017-09-12-Gigas-SRM-ReplicatesOnly-PostDilutionCurve-NoPivot-RevisedSettings-Report.csv).

### Step 5d: Normalize peak areas