{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Annotation of *Acropora hyacinthus* transcriptome" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This workflow details the annotation of an *Acropora hyacinthus* [transcriptome](http://palumbi.stanford.edu/data/33496_Ahyacinthus_CoralContigs.fasta.zip).\n", "\n", "The notebook requires the following:\n", "- [NCBI Blast: 2.2.3](ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/)\n", "- [SQLShare](https://sqlshare.escience.washington.edu/accounts/login/?next=/sqlshare/%3F__hash__%3D)\n", "\n", "The annotation also requires a UniProt/Swiss-Prot BLAST database. Instructions for setting up this database can be found [here](https://github.com/jldimond/Coral-CpG-ratio-MS/blob/master/README.md).\n", "\n", "The original analysis was carried out on Mac OS X v10.10.3 running Python 2.7.9 and IPython 3.1.0.\n", "\n", "This workflow is structured so that anyone can reproduce the analysis by downloading the repository locally and executing this notebook." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/jd/Documents/Projects/Coral-CpG-ratio-MS/data/Ahya\n" ] } ], "source": [ "cd ../data/Ahya" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 4601k 100 4601k 0 0 3749k 0 0:00:01 0:00:01 --:--:-- 3750k\n" ] } ], "source": [ "#Obtain FASTA file\n", "!curl -O http://palumbi.stanford.edu/data/33496_Ahyacinthus_CoralContigs.fasta.zip" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Archive: 33496_Ahyacinthus_CoralContigs.fasta.zip\n", " inflating: 33496_Ahyacinthus_CoralContigs.fasta \n" ] } ], "source": [ "!unzip -o 33496_Ahyacinthus_CoralContigs.fasta.zip " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ">contig27\r\n", "CAAAATTCCAGCACTCCGTTTTGCATGGTAAACTTGTCTTAGTAGGACACTGTGGAAGATGTACAGCGCAAGACATCACAGTTGCAAGCGCCGACGAACAGCTGTTAAACTCTCCTCTCATATTCTCGAACAAACCAAATATTTCTTCCTCTCTGTTGTTGCTAACCTTTGAATATATGAAGCTGGCATTAGCACAGGACTCAAAGTTTCCGCCGAGCAGTTT\r\n", ">contig88\r\n", "TGTCCTGTGTTAGAGGCCAGCTTCAACCTCTTGCTTTCCCTGTCAGCCGAGTTTTCTTCTCCTTCAATAAGCTGGGATTTTCGATCTCTACTCAATGTTTCCATCAAACACCTGAGAGTTAAATCTGCCAGATAACGAAGAAATCCTCTTGCTAGAATACTTTTCAAAAGCCCTTCTTCATACATTGATCTTATCCCATTGCAAATTGCGTTGG\r\n", ">contig100\r\n", "TTCAGAACTATTCTCCGCCACACAGGGATAAATGGTCTTCACTTTCTTAGGTGTTTTTGTCTGTGGTGATGGTGTGGGTTCCTCCTGTGGGGGAGGTTCTCTTGGGAGTGGGGGTGGTCGTGTTCCCCATGACTGTATCCCCCTGTTTAGGCTCCCACCCTGACTGCTGACACGGCGTATTGCATGGGCAGAGCCCTTGTCATTCCTGCCCTTGTCATTC\r\n", ">contig211\r\n", 
"TGGGGCGATCAGGTCACCAACGAAATTGTCCGACAAGTCATGGAGATGAAAGGGTTTTATAGTCTCGACAAACCCGGTGAATTTACCAGCATAGTGGATCTTCAGTTTGTGGCAGCTATGATCCAGCCCGGTGGGGGCCGCAATGACATCCCAAGTCGTCTGAAAAGGCAGTTTACCATCCTCAACTGCACGCTTCCCGCAAATGCGTCCATCGATAAGATCTTTAGCTCTATCGGCTGTGGTTACTTCAACACTGAGCGCGGTTTCC\r\n", ">contig405\r\n", "ATTTTTAATTATTAAAGTTAGTTTTCCTTCTTTGTGAAGACTCAGTGCTTACTATCACATTTGTTTGTCATGCTCGTATCAAGACCACCGAGGTCTTGAGTGAGTGATACCATATGTACTTTCTGTACCTCATTTCCTCAATTTCATGACAGACCGCCTATTCTTCTTGATGTGCTTGATTTGAAATGGCATCGGATGGCAATACTTGATGAAATTTCCGGTGTGGCGTGAAGAACTTTGAA\r\n" ] } ], "source": [ "!head 33496_Ahyacinthus_CoralContigs.fasta" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "33496\r\n" ] } ], "source": [ "#Count number of seqs\n", "!fgrep -c \">\" 33496_Ahyacinthus_CoralContigs.fasta" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Blastx query" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!blastx \\\n", "-query 33496_Ahyacinthus_CoralContigs.fasta \\ #FASTA file\n", "-db ~blast/db/uniprot_sprot \\ #Use your blastx database address\n", "-max_target_seqs 1 \\ #maximum number of target sequences = 1\n", "-max_hsps 1 \\ #maximum number of high-scoring pairs = 1\n", "-outfmt 6 \\ #output format = tabular\n", "-evalue 1E-05 \\ #E-value = 10^-5\n", "-num_threads 8 \\ #number of threads = 8\n", "-out ../../analyses/Ahya/Ahya_blastx_uniprot.tab \\ #Direct output to analyses directory\n", "2> ../../analyses/Ahya/Ahya_blastx_uniprot.error #Direct standard error output to its own file" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/jd/Documents/Projects/Coral-CpG-ratio-MS/analyses/Ahya\n" ] } ], "source": [ "cd ../../analyses/Ahya" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", 
"text": [ "head: ../../analyses/Ahya/Ahya_blastx_uniprot.tab: No such file or directory\r\n" ] } ], "source": [ "#Checking head and tail of the output file.\n", "!head -10 Ahya_blastx_uniprot.tab" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "wc: Ahya_blastx_uniprot.tab: open: No such file or directory\r\n" ] } ], "source": [ "!wc Ahya_blastx_uniprot.tab" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "contig211\tsp|Q96JB1|DYH8_HUMAN\t77.53\t89\t20\t0\t1\t267\t2533\t2621\t4e-44\t 158\n", "SQLShare ready version has Pipes converted to Tabs ....\n", "contig211\tsp\tQ96JB1\tDYH8_HUMAN\t77.53\t89\t20\t0\t1\t267\t2533\t2621\t4e-44\t 158\n" ] } ], "source": [ "#Removing pipes and converted to tab-delimited file\n", "!tr '|' \"\\t\" Ahya_blastx_uniprot_sql.tab\n", "!head -1 Ahya_blastx_uniprot.tab\n", "!echo SQLShare ready version has Pipes converted to Tabs ....\n", "!head -1 Ahya_blastx_uniprot_sql.tab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Joining with GOSlim using SQLShare" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###First upload dataset\n", "![screen shot1](https://github.com/jldimond/Coral-CpG-ratio-MS/blob/master/images/Screen%20Shot%202015-09-25%20at%2012.01.38%20PM.png?raw=true)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Then find the dataset, execute query, and download the new dataset\n", "![screen shot](https://github.com/jldimond/Coral-CpG-ratio-MS/blob/master/images/Screen%20Shot%202015-09-25%20at%2012.29.18%20PM.png?raw=true)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Query (note: insert your SQLShare account instead of jldimond@washington.edu)\n", "`SELECT Distinct Column2 as ContigID, GOSlim_bin\n", " FROM [jldimond@washington.edu].[Ahya_blastx_uniprot_sql.tab]anno\n", " left join 
[sr320@washington.edu].[SPID and GO Numbers]go\n", " on anno.Column7=go.SPID left join [sr320@washington.edu].[GO_to_GOslim]slim\n", " on go.GOID=slim.GO_id where aspect like 'P'`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Output file downloaded to ./analyses/Ahya: Ahya_GOSlim.csv" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 11594 162316 899613 Ahya_blastx_uniprot_sql.tab\r\n" ] } ], "source": [ "!wc Ahya_blastx_uniprot_sql.tab" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Traceback (most recent call last):\r\n", " File \"/Users/jd/sqlshare-pythonclient-master/tools/singleupload.py\", line 6, in <module>\r\n", " import sqlshare\r\n", "ImportError: No module named sqlshare\r\n" ] } ], "source": [ "!python /Users/jd/sqlshare-pythonclient-master/tools/singleupload.py \\\n", "-d Ahya_uniprot \\\n", "Ahya_blastx_uniprot_sql.tab\n", "\n", "#Uploads the tab-delimited BLAST file created above to SQLShare\n", "\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#### Output file downloaded to ./analyses/Ahya: Ahya_GOSlim.csv" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Replace commas with tabs\n", "!tr ',' \"\\t\" < Ahya_GOSlim.csv > Ahya_GOSlim.tab" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ContigID\tGOSlim_bin\r", "\r\n", "contig135011_153678_153601\tcell organization and biogenesis\r", "\r\n", "contig135011_153678_153601\tother biological processes\r", "\r\n", "contig135011_153678_153601\tdevelopmental processes\r", "\r\n", "contig69684\tprotein metabolism\r", "\r\n", "contig113621\tprotein metabolism\r", "\r\n", "contig97647\tprotein metabolism\r", "\r\n", 
"contig199902\tprotein metabolism\r", "\r\n", "contig78855\tother biological processes\r", "\r\n", "contig8505_94477\tDNA metabolism\r", "\r\n" ] } ], "source": [ "!head -10 Ahya_GOSlim.tab" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 1 }