# DIA Analysis Part 1: PECAN

In this notebook, I'll walk through how I prepared Pacific oyster (*Crassostrea gigas*) [proteomic data](https://yaaminiv.github.io/Mass-Spec-Start/) collected for the [DNR project](https://yaaminiv.github.io/DNRprojectintroduction/) for [PECAN](https://bitbucket.org/maccosslab/pecan/overview). PECAN is the first step in the [Data-Independent Acquisition (DIA) mass spectrometry](https://github.com/sr320/LabDocs/wiki/DIA-data-Analyses) analysis pipeline. In general, DIA is a bottom-up proteomics method that gathers MS survey (MS1) spectra and MS/MS (MS2) spectra separately.

PECAN correlates your acquired peptide spectra to a database of known sequences and creates a library of the proteins and peptides detected in your experiment. PECAN requires several inputs, each of which must be prepared before running PECAN from the command line.

The first step is to install PECAN, [MSConvert](http://proteowizard.sourceforge.net/tools.shtml) and a [Protein Digestion Simulator](https://omics.pnl.gov/software/protein-digestion-simulator) on the same Windows machine.

I then obtained the .raw files from the mass spectrometer. They can be found [here](https://owl.fish.washington.edu/spartina/January_2017_DNR_Raw_Data).

### 1. MSConvert

Output files from the mass spectrometer are in the .raw format, but PECAN requires mzML files. MSConvert generates these files; for PECAN they need centroided peaks, 64-bit binary encoding, and zlib compression. Using the settings outlined in the [DIA Wiki](https://github.com/sr320/LabDocs/wiki/DIA-Data-Analyses), Steven ran the MSConvert GUI on my .raw files since I didn't have access to a Windows computer with the program.

Converted files can be found [here](http://owl.fish.washington.edu/halfshell/index.php?dir=working-directory%2F17-02-15%2Forf%2F).
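
MSConvert also has a command-line version. As a rough sketch of what the GUI settings above might correspond to, the snippet below builds an `msconvert` argument list. The `.raw` file name and output folder are assumptions for illustration, and the exact filter settings on the DIA Wiki may differ from the ones I guess at here.

```python
import shlex


def build_msconvert_command(raw_file, out_dir):
    """Return an msconvert argument list for one .raw file (illustrative settings)."""
    return [
        "msconvert", raw_file,
        "--mzML",                           # write mzML output
        "--64",                             # 64-bit binary encoding precision
        "--zlib",                           # zlib-compress the binary data arrays
        "--filter", "peakPicking true 1-",  # centroid spectra at all MS levels
        "-o", out_dir,                      # output folder (hypothetical name below)
    ]


# Hypothetical input name, assumed to match the mzML file names used later.
cmd = build_msconvert_command("2017_January_23_envtstress_oyster1.raw", "mzML-files")
print(shlex.join(cmd))
```

On a Windows machine with ProteoWizard installed, the list could be passed to `subprocess.run(cmd, check=True)` to do the actual conversion.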

### 2. Protein Digestion Simulator

This step requires a list of all peptides I'm interested in identifying in my sample and a list of QC peptides. I'm interested in all possible peptides in my sample, so I'll use a *C. gigas* proteome.

The proteome I give PECAN ultimately needs to be tab-delimited, with protein names and sequences, so I first checked the format of the file I downloaded.

In [2]:
!curl http://owl.fish.washington.edu/halfshell/bu-git-repos/nb-2017/C_gigas/data/Cg_Gigaton_proteins.fa \
> /Users/yaamini/Documents/project-oyster-oa/analyses/2018-02-28-PECAN/PECAN-inputs/Cg_Gigaton_proteins.fa

 % Total % Received % Xferd Average Speed Time Time Time Current
 Dload Upload Total Spent Left Speed
100 20.7M 100 20.7M 0 0 19.0M 0 0:00:01 0:00:01 --:--:-- 19.8M


In [4]:
!head /Users/yaamini/Documents/project-oyster-oa/analyses/2018-02-28-PECAN/PECAN-inputs/Cg_Gigaton_proteins.fa

>CHOYP_043R.1.5|m.16874
TPSGPTPSGPTPSVTPTPSGPTPSVTPTPSGSTPSGPTPSVTPTPSGPTPSGPTPSVTPT
PSGPTPSVTPTPSVPTPSGPTPSVTPTPSGPTPSVTPTPSGPTPSGPTPSVTPTPSGPTP
SGPTPSVTPTPSVTPTPSGPTPSVTSTPSAPTPSGPTPSGPTPSVTPTPSGPTPSGPTPS
VTPTPSGPTPSVTPTPSG
>CHOYP_043R.5.5|m.64252
SRPTPSVTPTPSGPTPSVTPTPSVSTPSGPTPSVTPTPSGPSPSVTPTPSGPSPSGPTPS
ATPTPSGPTPSGTTPSGSTPSATITTISTPSTTVCSYVDIGPEQAIDVSLRSPSEDPDAP
IENILQTNSVYKPKKEPTYDENVVVKIISQDTPTILRVSFTVNRADTVGLEYLTDYKQKI
ITQNNETVEFVFAAGIITDNFTINIRSDSAEQPEISNLKIRACYKPVIGQPSTTTPNPSI


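Using the `!head` command, I confirmed my proteome file has protein names and sequences. Note that the file is in FASTA format (a `>` header line followed by sequence lines) rather than tab-delimited. The tab-delimited version is produced later by the Protein Digestion Simulator.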
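
A quick way to check which format a sequence file is in, without reading the `head` output by eye, is to look at its first non-empty line. This is a minimal sketch; the function name is my own and the rule is only a heuristic.

```python
def guess_sequence_format(lines):
    """Guess whether text lines look like FASTA or tab-delimited records."""
    # Use the first non-empty line as the evidence.
    first = next((line for line in lines if line.strip()), "")
    if first.startswith(">"):
        return "fasta"      # FASTA headers start with '>'
    if "\t" in first:
        return "tabular"    # name<TAB>sequence style records
    return "unknown"


print(guess_sequence_format([">CHOYP_043R.1.5|m.16874", "TPSGPTPSGPTPSV"]))  # fasta
```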

Next, I'll append the list of Quality Control (QC) peptides to the proteome. To do this, I will use [Galaxy](https://usegalaxy.org).

In [7]:
!curl https://owl.fish.washington.edu/generosa/Generosa_DNR/Pierce_PRTC.tabular -k \
> /Users/yaamini/Documents/project-oyster-oa/analyses/2018-02-28-PECAN/PECAN-inputs/Pierce_PRTC.tabular

 % Total % Received % Xferd Average Speed Time Time Time Current
 Dload Upload Total Spent Left Speed
100 230 100 230 0 0 64 0 0:00:03 0:00:03 --:--:-- 64


In [8]:
!head /Users/yaamini/Documents/project-oyster-oa/analyses/2018-02-28-PECAN/PECAN-inputs/Pierce_PRTC.tabular

P00000 Pierce Peptide Retention Time Calibration Mixture	SSAAPPPPPRGISNEGQNASIKHVLTSIGEKDIPVPKPKIGDYAGIKTASEFDSAIAQDKSAAGAFGPELSRELGQSGVDTYLQTKGLILVGGYGTRGILFVGSGVSGGEEGARSFANQPLEVVYSKLTILEELRNGFILDGFPRELASGLSFPVGFKLSSEAPALFQFDLK


First, I converted my `.tabular` file to a FASTA file.

![tab-to-FASTA-conversion](https://raw.githubusercontent.com/RobertsLab/project-oyster-oa/master/analyses/2018-02-28-PECAN/PECAN-inputs/01-tab-to-fasta.png)

Then, I merged the two files.

![concatenate-FASTA-files](https://raw.githubusercontent.com/RobertsLab/project-oyster-oa/master/analyses/2018-02-28-PECAN/PECAN-inputs/02-concatenate.png)
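
I did both of these steps in Galaxy, but the same two operations can be sketched in a few lines of Python. This is not what Galaxy runs internally, just an illustration of the conversion and merge; the function names and the short example sequences are made up.

```python
def tabular_to_fasta(tabular_text):
    """Convert 'name<TAB>sequence' lines into FASTA records."""
    records = []
    for line in tabular_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, sequence = line.split("\t", 1)
        records.append(f">{name}\n{sequence.strip()}")
    return "\n".join(records) + "\n"


def concatenate_fasta(*fasta_texts):
    """Join FASTA texts, making sure each one ends with a newline."""
    parts = [t if t.endswith("\n") else t + "\n" for t in fasta_texts if t]
    return "".join(parts)


# Toy inputs standing in for the proteome and the QC peptide file.
proteome = ">protein_1\nMGKSNEEDQ\n"
qc = tabular_to_fasta("P00000 Pierce Peptide Retention Time Calibration Mixture\tSSAAPPPPPR\n")
print(concatenate_fasta(proteome, qc), end="")
```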

In [11]:
!head /Users/yaamini/Documents/project-oyster-oa/analyses/2018-02-28-PECAN/PECAN-inputs/Combined-gigas-QC.fasta

>CHOYP_043R.1.5|m.16874
TPSGPTPSGPTPSVTPTPSGPTPSVTPTPSGSTPSGPTPSVTPTPSGPTPSGPTPSVTPT
PSGPTPSVTPTPSVPTPSGPTPSVTPTPSGPTPSVTPTPSGPTPSGPTPSVTPTPSGPTP
SGPTPSVTPTPSVTPTPSGPTPSVTSTPSAPTPSGPTPSGPTPSVTPTPSGPTPSGPTPS
VTPTPSGPTPSVTPTPSG
>CHOYP_043R.5.5|m.64252
SRPTPSVTPTPSGPTPSVTPTPSVSTPSGPTPSVTPTPSGPSPSVTPTPSGPSPSGPTPS
ATPTPSGPTPSGTTPSGSTPSATITTISTPSTTVCSYVDIGPEQAIDVSLRSPSEDPDAP
IENILQTNSVYKPKKEPTYDENVVVKIISQDTPTILRVSFTVNRADTVGLEYLTDYKQKI
ITQNNETVEFVFAAGIITDNFTINIRSDSAEQPEISNLKIRACYKPVIGQPSTTTPNPSI


In [12]:
!tail /Users/yaamini/Documents/project-oyster-oa/analyses/2018-02-28-PECAN/PECAN-inputs/Combined-gigas-QC.fasta

QTQIKLVIIAVANNIGLMIMMCFLYLYCELQSRRSTSLNLGQGRPGVYTP*
>CHOYP_contig_056607|m.67058
LAASISDSLQYGFQRQSQKTDLTETEVIIIGVICAVILLSLIIVLIVIICRRRTNAKDYG
NSSTTVSRPVPNLQTNGNSVGPPKPQRGPDNQDLKNEAYYGVHNGSPSTPRKDPH
>CHOYP_contig_056609|m.67060
MGKSNEEDQHQNISRMPTVKIGNNKHISDSEIVSEQTDHLADSAEMDHLAVQKGYLFLLE
HIHEELQLVDYLCQLCESHLSDDEKKDMRDGKGYFKKRELLKCLISKGESACKEFLEKFK
CYENLYSQFRNAINSVTNADGI
>P00000 Pierce Peptide Retention Time Calibration Mixture
SSAAPPPPPRGISNEGQNASIKHVLTSIGEKDIPVPKPKIGDYAGIKTASEFDSAIAQDKSAAGAFGPELSRELGQSGVDTYLQTKGLILVGGYGTRGILFVGSGVSGGEEGARSFANQPLEVVYSKLTILEELRNGFILDGFPRELASGLSFPVGFKLSSEAPALFQFDLK


Merging my files was successful! Now, I can run the in silico tryptic digest. The settings I used for the digest are below, but can also be found on the [DIA Wiki](https://github.com/sr320/LabDocs/wiki/DIA-Data-Analyses).

![tab-1](https://raw.githubusercontent.com/RobertsLab/Paper-DNR-Proteomics/master/images/2017-02-19_final-Digest-Settings1.png)

![tab-2](https://raw.githubusercontent.com/RobertsLab/Paper-DNR-Proteomics/master/images/2017-02-19_final-Digest-Settings2.png)

![tab-3](https://raw.githubusercontent.com/RobertsLab/Paper-DNR-Proteomics/master/images/2017-02-19_final-Digest-Settings3.png)

![tab-4](https://raw.githubusercontent.com/RobertsLab/Paper-DNR-Proteomics/master/images/2017-02-19_final-Digest-Settings4.png)
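
For intuition about what an in silico tryptic digest does, here is a minimal sketch of the usual trypsin rule: cleave after K or R unless the next residue is P. It is not the Protein Digestion Simulator's implementation and ignores the options in the screenshots above (missed cleavages, length or mass filters). I tried it on the start of the PRTC sequence.

```python
def tryptic_digest(sequence):
    """Split a protein sequence after K or R, except when followed by P."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # trailing peptide with no K/R at its end
    return peptides


prtc_start = "SSAAPPPPPRGISNEGQNASIKHVLTSIGEKDIPVPKPKIGDYAGIK"
print(tryptic_digest(prtc_start))
# ['SSAAPPPPPR', 'GISNEGQNASIK', 'HVLTSIGEK', 'DIPVPKPK', 'IGDYAGIK']
```

Note that `DIPVPKPK` stays in one piece because its internal K is followed by P.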

The digest is complete! Here is what it looks like:

In [18]:
!head /Users/yaamini/Documents/project-oyster-oa/analyses/2018-02-28-PECAN/PECAN-inputs/Combined-gigas-QC.txt

ProteinName	Description	Sequence
CHOYP_043R.1.5|m.16874	CHOYP_043R.1.5|m.16874	TPSGPTPSGPTPSVTPTPSGPTPSVTPTPSGSTPSGPTPSVTPTPSGPTPSGPTPSVTPTPSGPTPSVTPTPSVPTPSGPTPSVTPTPSGPTPSVTPTPSGPTPSGPTPSVTPTPSGPTPSGPTPSVTPTPSVTPTPSGPTPSVTSTPSAPTPSGPTPSGPTPSVTPTPSGPTPSGPTPSVTPTPSGPTPSVTPTPSG
CHOYP_043R.5.5|m.64252	CHOYP_043R.5.5|m.64252	SRPTPSVTPTPSGPTPSVTPTPSVSTPSGPTPSVTPTPSGPSPSVTPTPSGPSPSGPTPSATPTPSGPTPSGTTPSGSTPSATITTISTPSTTVCSYVDIGPEQAIDVSLRSPSEDPDAPIENILQTNSVYKPKKEPTYDENVVVKIISQDTPTILRVSFTVNRADTVGLEYLTDYKQKIITQNNETVEFVFAAGIITDNFTINIRSDSAEQPEISNLKIRACYKPVIGQPSTTTPNPSITSGTTTSVLTTTYQCPPTTIPCSKEPICYLTSEICDGKCDCLVHCDDEKDCKETTTKTPPTTTSGVPSVTTPTSTPSVPTSTPSGTVTPTPSVTSSTPYIPSETPTITPTPSLTPSATTPTVTSTVTPTPSGPTPSVTPTPSEPTPSVTPTPSGPPPSVTPTTSGTTPSVTQVTSTPTPSQTTILSTVPSETPSQTFTPSITPSLTTAYTTANPCREVNGMLDATIIPATSITLSEPAIQPNVDQIRNGPLIVPADITTFTVTIDLPGDIQLGSINLGSFTNVKAFEVNIRKPTDTQPVLYKEVTDSNILVFPAGTIADQIQIVLLEKNDVSQGYQLQIDLRACFETGTTSVQPQTTPISTGVISTTPSVTNTPSQQTPSVTPTPSGPTPSGPTPSVTPTPSGPTPSGPTPSVTPTPSGPTPSVTPTPSGPT

In [19]:
!tail /Users/yaamini/Documents/project-oyster-oa/analyses/2018-02-28-PECAN/PECAN-inputs/Combined-gigas-QC.txt

CHOYP_contig_056515|m.67007	CHOYP_contig_056515|m.67007	MELYRLAFVCLTLIAFSKVFVNSNKCNDNSAATAQIVKSCPQNHKEWIKAAARKGCEQMAHFCSSVEYHCVINAWGNETIEVCAPKLQIVGNNCAEYSQGGKRIQRNGIVPCKNCPSHYFSNETFKYQECYEHVKNAKTAHTTQLTTESISVKSTEENVYQSTSVTPMENSARLFQNIDNQNTPSRIIIICVCVVVVLAGILIVFTVKQRSWANKMCSHFKRIVLQSEESKMTNQESAIEIVEEGHDVQNCLLE
CHOYP_contig_056520|m.67008	CHOYP_contig_056520|m.67008	FCHVNLCKPCVVDHISDGYHKHVIVPFQKRRSTLIYPKCGTHTHKNCEFQCKDCNNIFVCSSCMASEQHGRHRFVEVAEVFKTKKDEIIKDTKELENHISPTYEEIARDLENQLANLDGGYEKITTTISKQGEQCHKEIDIVINKMKTEINEKKAKHRDILKKHLNEIKQTQSLIKQTIQAIRKIENSTEVSPTIEYSSKITKFSKLPPTVQVTLPTFIPKPIDRNRLYTLVGEITPLSTATEEVSQQNQPNTSVRELLDEPEVVATVQTNHTRLCSVTFLGKDKIW
CHOYP_contig_056524|m.67011	CHOYP_contig_056524|m.67011	MFINTQNKPSRIQSAPSTVRPVDKNREKKKSYHIQRQSSCEELITKLLAGKKSEDSVKSLKGEKRKNQTSGFRWLRPFRNYKVQTTLYLPGDDVNFEKQKQSWVKDSNPNVLKKHCSEEPVETTKGKKIEMCAQVHYYCSIPLKA
CHOYP_contig_056539|m.67025	CHOYP_contig_056539|m.67025	ENTEQPKTTLSSTTTLKAKNKNIGLLNMDSLSTDKVTPSSYLPSDERTKQSKSSPFLPFRENTEQPKTTQSSTTTLKAKNK

### 3. File Path Lists

To run PECAN, I need two different path lists: one for my converted mzML files, and another for my background proteome. These path lists need to be `.txt` files that contain nothing but file paths, one per line.

To create the `.txt` file for my mzML files, I first need to download my files of interest and reupload them to a separate Owl folder. That folder will then be downloaded onto the machine I'm using, since PECAN requires that all inputs are in the same folder.

For my first PECAN run, I will just examine five oyster samples.

In [13]:
!curl http://owl.fish.washington.edu/halfshell/working-directory/17-02-15/orf/2017_January_23_envtstress_oyster1.mzML \
> /Users/yaamini/Documents/project-oyster-oa/analyses/2018-02-28-PECAN/PECAN-inputs/mzML-files/2017_January_23_envtstress_oyster1.mzML

 % Total % Received % Xferd Average Speed Time Time Time Current
 Dload Upload Total Spent Left Speed
100 1001M 100 1001M 0 0 21.5M 0 0:00:46 0:00:46 --:--:-- 53.2M


In [14]:
!curl http://owl.fish.washington.edu/halfshell/working-directory/17-02-15/orf/2017_January_23_envtstress_oyster2.mzML \
> /Users/yaamini/Documents/project-oyster-oa/analyses/2018-02-28-PECAN/PECAN-inputs/mzML-files/2017_January_23_envtstress_oyster2.mzML

 % Total % Received % Xferd Average Speed Time Time Time Current
 Dload Upload Total Spent Left Speed
100 1008M 100 1008M 0 0 52.7M 0 0:00:19 0:00:19 --:--:-- 43.2M


In [15]:
!curl http://owl.fish.washington.edu/halfshell/working-directory/17-02-15/orf/2017_January_23_envtstress_oyster3.mzML \
> /Users/yaamini/Documents/project-oyster-oa/analyses/2018-02-28-PECAN/PECAN-inputs/mzML-files/2017_January_23_envtstress_oyster3.mzML

 % Total % Received % Xferd Average Speed Time Time Time Current
 Dload Upload Total Spent Left Speed
100 1037M 100 1037M 0 0 74.2M 0 0:00:13 0:00:13 --:--:-- 75.7M


In [16]:
!curl http://owl.fish.washington.edu/halfshell/working-directory/17-02-15/orf/2017_January_23_envtstress_oyster4.mzML \
> /Users/yaamini/Documents/project-oyster-oa/analyses/2018-02-28-PECAN/PECAN-inputs/mzML-files/2017_January_23_envtstress_oyster4.mzML

 % Total % Received % Xferd Average Speed Time Time Time Current
 Dload Upload Total Spent Left Speed
100 1018M 100 1018M 0 0 20.1M 0 0:00:50 0:00:50 --:--:-- 45.4M


In [17]:
!curl http://owl.fish.washington.edu/halfshell/working-directory/17-02-15/orf/2017_January_23_envtstress_oyster5.mzML \
> /Users/yaamini/Documents/project-oyster-oa/analyses/2018-02-28-PECAN/PECAN-inputs/mzML-files/2017_January_23_envtstress_oyster5.mzML

 % Total % Received % Xferd Average Speed Time Time Time Current
 Dload Upload Total Spent Left Speed
100 942M 100 942M 0 0 56.5M 0 0:00:16 0:00:16 --:--:-- 46.9M
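
The five `curl` cells above differ only in the sample number, so they could be generated in a loop. This sketch just builds the URL and destination pairs and prints the equivalent commands; it does not download anything. The function name is my own.

```python
BASE_URL = "http://owl.fish.washington.edu/halfshell/working-directory/17-02-15/orf"
DEST_DIR = ("/Users/yaamini/Documents/project-oyster-oa/analyses/"
            "2018-02-28-PECAN/PECAN-inputs/mzML-files")


def mzml_downloads(sample_numbers):
    """Return (url, destination) pairs for the given oyster sample numbers."""
    pairs = []
    for n in sample_numbers:
        name = f"2017_January_23_envtstress_oyster{n}.mzML"
        pairs.append((f"{BASE_URL}/{name}", f"{DEST_DIR}/{name}"))
    return pairs


for url, dest in mzml_downloads(range(1, 6)):
    print(f"curl {url} > {dest}")
```

To actually fetch the files from Python, each pair could be passed to `urllib.request.urlretrieve(url, dest)`.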


My files downloaded, so I reuploaded them to Owl. Now, I can copy the paths for the files and paste them into a `.txt` file. I will create a similar `.txt` file for my background proteome, which is the file I obtained from the protein digest. Below are the path lists I'm using for my `.mzML` files and background proteome.

In [24]:
!head /Users/yaamini/Documents/project-oyster-oa/analyses/2018-02-28-PECAN/PECAN-inputs/2017-03-03-mzML-file-path-list.txt

/home/srlab/Documents/DNR_PECAN_Run_1_20170303/2017_January_23_envtstress_oyster1.mzML
/home/srlab/Documents/DNR_PECAN_Run_1_20170303/2017_January_23_envtstress_oyster2.mzML
/home/srlab/Documents/DNR_PECAN_Run_1_20170303/2017_January_23_envtstress_oyster3.mzML
/home/srlab/Documents/DNR_PECAN_Run_1_20170303/2017_January_23_envtstress_oyster4.mzML
/home/srlab/Documents/DNR_PECAN_Run_1_20170303/2017_January_23_envtstress_oyster5.mzML

In [25]:
!head /Users/yaamini/Documents/project-oyster-oa/analyses/2018-02-28-PECAN/PECAN-inputs/2017-03-03-background-peptides-path-list.txt

/home/srlab/Documents/DNR_PECAN_Run_1_20170303/Combined-gigas-QC.txt
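
I made these path lists by copying and pasting, but they could also be generated. A minimal sketch, assuming the run folder on Roadrunner shown above; the function name is hypothetical.

```python
from pathlib import PurePosixPath

RUN_DIR = PurePosixPath("/home/srlab/Documents/DNR_PECAN_Run_1_20170303")


def path_list(file_names, run_dir=RUN_DIR):
    """Return path-list text: one absolute path per line."""
    return "".join(f"{run_dir / name}\n" for name in file_names)


mzml_names = [f"2017_January_23_envtstress_oyster{n}.mzML" for n in range(1, 6)]
print(path_list(mzml_names), end="")
print(path_list(["Combined-gigas-QC.txt"]), end="")
```

Writing the returned text to a file with `open(..., "w")` would produce the two `.txt` files.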

### 4. Isolation Scheme

The last thing I need to obtain is my isolation scheme. The isolation scheme represents all of the m/z windows we used to analyze our samples. I need to ensure that the file is a `.csv` with paired values, meaning each row holds the lower and upper m/z bounds of one isolation window.

My isolation scheme can be seen below.

In [23]:
!head /Users/yaamini/Documents/project-oyster-oa/analyses/2018-02-28-PECAN/PECAN-inputs/2017-03-03-isolation-windows.csv

444.4519,456.4574
450.4546,462.4601
456.4574,468.4628
462.4601,474.4656
468.4628,480.4683
474.4656,486.471
480.4683,492.4737
486.471,498.4765
492.4737,504.4792
498.4765,510.4819
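
The "paired values" check can also be done programmatically. This sketch reads CSV text and raises an error if any row doesn't have exactly two numbers with the lower bound first. It only checks the layout I described above, not anything PECAN-specific.

```python
import csv
import io


def read_isolation_windows(csv_text):
    """Parse 'lower,upper' rows and check each row is a valid pair."""
    windows = []
    for row_number, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if not row:
            continue  # ignore blank lines
        if len(row) != 2:
            raise ValueError(f"row {row_number}: expected 2 values, got {len(row)}")
        lower, upper = float(row[0]), float(row[1])
        if lower >= upper:
            raise ValueError(f"row {row_number}: lower bound is not below upper bound")
        windows.append((lower, upper))
    return windows


sample = "444.4519,456.4574\n450.4546,462.4601\n456.4574,468.4628\n"
windows = read_isolation_windows(sample)
print(len(windows), windows[0])  # 3 (444.4519, 456.4574)
```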


Looks good! All of my files are ready to go for PECAN. Because PECAN requires all of my files to be in the same folder when I want to use them, I uploaded everything to the same [owl folder](https://owl.fish.washington.edu//web/spartina/DNR_PECAN_Run_1_20170303). I then downloaded this folder onto Roadrunner, the machine where PECAN is installed.
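
Since everything has to live in one folder, a small pre-flight check on the paths can catch mistakes before starting a run. This is a sketch that only compares path strings; it doesn't touch the file system, and the function names are my own.

```python
from pathlib import PurePosixPath


def parent_folders(paths):
    """Return the set of parent folders for the given path strings."""
    return {str(PurePosixPath(p.strip()).parent) for p in paths if p.strip()}


def all_in_one_folder(paths):
    """True if every path shares the same parent folder."""
    return len(parent_folders(paths)) == 1


run_dir = "/home/srlab/Documents/DNR_PECAN_Run_1_20170303"
inputs = [
    f"{run_dir}/2017_January_23_envtstress_oyster1.mzML",
    f"{run_dir}/Combined-gigas-QC.txt",
    f"{run_dir}/2017-03-03-isolation-windows.csv",
]
print(all_in_one_folder(inputs))  # True
```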

Now I'm ready to use PECAN.

### 5. PECAN

Here's the code I'm using for PECAN.

pecanpie -o [directory to create] \

-b [background proteome.txt] \

-n [blib file name] \

-s [species] \

--isolationSchemeType BOARDER \

--pecanMemRequest [GB estimate] \

[mzML file path list name] \

[peptide file path name] \

[isolation scheme file name] \

--fido --jointPercolator

In [None]:
pecanpie -o /home/srlab/Documents/DNR_PECAN_Run_1_20170303_Output \
-b /home/srlab/Documents/DNR_PECAN_Run_1_20170303/Combined-gigas-QC.txt \
-n DNR_PECAN_Run_1_20170303_SpLibrary \
-s gigas \
--isolationSchemeType BOARDER \
--pecanMemRequest 35 \
/home/srlab/Documents/DNR_PECAN_Run_1_20170303/2017-03-03-mzML-file-path-list.txt \
/home/srlab/Documents/DNR_PECAN_Run_1_20170303/2017-03-03-background-peptides-path-list.txt \
/home/srlab/Documents/DNR_PECAN_Run_1_20170303/2017-03-03-isolation-windows.csv \
--fido --jointPercolator

This step sets up the real PECAN run. Now I need to run the search. To do this, I navigated to my new directory.

In [26]:
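cd /home/srlab/Documents/DNR_PECAN_Run_1_20170303_Output
./run_search.sh

Note: these two commands need to be run in a Terminal on Roadrunner. When I ran them as a single notebook cell on my own computer with a trailing `\` after the path, the `\` joined both lines into one argument to `cd`, and the cell failed:

[Errno 2] No such file or directory: '/home/srlab/Documents/DNR_PECAN_Run_1_20170303_Output ./run_search.sh'
/Users/yaamini/Documents/project-oyster-oa/notebooks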


I can use `qstat -f` to check the status of my jobs if I `ssh` into Roadrunner! Here's what the Terminal looks like:

![PECAN-roadrunner](https://raw.githubusercontent.com/RobertsLab/project-oyster-oa/master/analyses/2018-02-28-PECAN/PECAN-inputs/PECAN-run-1.png)

**March 4 morning**: I used `ssh` to log into Roadrunner this morning and saw that my `.blib` directory was created!

![PECAN-run-1-complete](https://raw.githubusercontent.com/RobertsLab/project-oyster-oa/master/analyses/2018-02-28-PECAN/PECAN-inputs/PECAN-Run-1-complete.png)

Looks like the only way I can open this is in Skyline, so I asked Emma what to do next. I'm hoping that my code works fine so I can prep some more samples to run over the weekend.

**March 4 afternoon**

After talking to Sean, it seems like there might have been a problem with my analyses. He [looked into the PECAN output](https://genefish.wordpress.com/2017/03/04/pecan-on-roadrunner-isnt-working-correctly/). It seems like PECAN found 0 MS2 scans and then quit. I'm not sure what this means.

Based on [discussions with Sam, Sean and Steven](https://github.com/sr320/LabDocs/issues/508), I'm going to play around with the `.blib` file in Skyline while simultaneously trying to convert one `.mzML` file through the command line MSConvert option and see if PECAN produces the same message in the log file.

**March 7 afternoon**

Laura and I are going to play around with the .blib file generated by this run. First, I uploaded all of my files to Owl.

[PECAN Run 1 Inputs](http://owl.fish.washington.edu/spartina/DNR_PECAN_Run_1_20170303/)

[Pecan Run 1 Outputs](http://owl.fish.washington.edu/spartina/DNR_PECAN_Run_1_20170303_Output/)

We're following the instructions outlined in this [powerpoint](https://github.com/RobertsLab/project-pacific.oyster-larvae/blob/master/Skyline-example-files-ETS.sky/slides01.pdf). We were able to upload the proteome successfully, but we got this error when we uploaded the .blib file:

![skyline-error](https://cloud.githubusercontent.com/assets/22335838/23686193/eebd0ee8-035c-11e7-8c6b-c4612579f46a.png)

Looks like nothing was ever generated by PECAN because it couldn't find what it considered to be valid MS2 data, so it stopped there. This did not happen to Laura's files, which is weird because our files went through the same MSConvert process. I will reconvert one raw file to an mzML file and rerun PECAN. Hopefully that will yield a usable `.blib` file!
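
One way to check a converted file before rerunning PECAN is to count the spectra at each MS level directly from the mzML XML. This sketch looks for the PSI-MS "ms level" term (accession `MS:1000511`) inside each `<spectrum>` element. It assumes the term is written directly in the spectrum rather than through a referenced parameter group, and it says nothing about what PECAN itself counts as valid MS2 data. The tiny XML document is made up for the demo.

```python
import io
import xml.etree.ElementTree as ET

MS_LEVEL_ACCESSION = "MS:1000511"  # PSI-MS controlled-vocabulary term "ms level"


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def count_spectra_by_ms_level(mzml_source):
    """Return {ms_level: spectrum_count} for a file path or binary file object."""
    counts = {}
    for _event, element in ET.iterparse(mzml_source, events=("end",)):
        if local_name(element.tag) != "spectrum":
            continue
        for child in element.iter():
            if (local_name(child.tag) == "cvParam"
                    and child.get("accession") == MS_LEVEL_ACCESSION):
                level = int(child.get("value"))
                counts[level] = counts.get(level, 0) + 1
        element.clear()  # free memory; real mzML files are about 1 GB each
    return counts


# Made-up miniature document with one MS1 and two MS2 spectra.
tiny = b"""<mzML><run><spectrumList>
<spectrum id="s1"><cvParam accession="MS:1000511" name="ms level" value="1"/></spectrum>
<spectrum id="s2"><cvParam accession="MS:1000511" name="ms level" value="2"/></spectrum>
<spectrum id="s3"><cvParam accession="MS:1000511" name="ms level" value="2"/></spectrum>
</spectrumList></run></mzML>"""
counts = count_spectra_by_ms_level(io.BytesIO(tiny))
print("MS2 spectra:", counts.get(2, 0))  # MS2 spectra: 2
```

Passing the path of a real `.mzML` file instead of the `BytesIO` object would give the counts for that file. A result with no level-2 entries would be consistent with the "0 MS2 scans" message.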