{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using Blast at the commandline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "NCBI offers BLAST as a service online however there might be some instances when you need to blast thousands to 100 of thousands of sequences. In these cases it is often best to do this on your own computer. This tutorial will take you through the process. This is done using the Mac OS. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-15T00:47:27.549862Z",
     "start_time": "2018-06-15T00:47:27.541108Z"
    }
   },
   "source": [
    "## Step 1: Download Blast "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The latest version of the Blast executable can be found at ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once installed you should able to run a type of blast (blastp) and the -h arguement (help) to verify you have the program installed "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-15T00:50:19.970764Z",
     "start_time": "2018-06-15T00:50:19.823722Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "USAGE\r\n",
      "  blastp [-h] [-help] [-import_search_strategy filename]\r\n",
      "    [-export_search_strategy filename] [-task task_name] [-db database_name]\r\n",
      "    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]\r\n",
      "    [-negative_gilist filename] [-negative_seqidlist filename]\r\n",
      "    [-entrez_query entrez_query] [-db_soft_mask filtering_algorithm]\r\n",
      "    [-db_hard_mask filtering_algorithm] [-subject subject_input_file]\r\n",
      "    [-subject_loc range] [-query input_file] [-out output_file]\r\n",
      "    [-evalue evalue] [-word_size int_value] [-gapopen open_penalty]\r\n",
      "    [-gapextend extend_penalty] [-qcov_hsp_perc float_value]\r\n",
      "    [-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value]\r\n",
      "    [-xdrop_gap_final float_value] [-searchsp int_value]\r\n",
      "    [-sum_stats bool_value] [-seg SEG_options] [-soft_masking soft_masking]\r\n",
      "    [-matrix matrix_name] [-threshold float_value] [-culling_limit int_value]\r\n",
      "    [-best_hit_overhang float_value] [-best_hit_score_edge float_value]\r\n",
      "    [-window_size int_value] [-lcase_masking] [-query_loc range]\r\n",
      "    [-parse_deflines] [-outfmt format] [-show_gis]\r\n",
      "    [-num_descriptions int_value] [-num_alignments int_value]\r\n",
      "    [-line_length line_length] [-html] [-max_target_seqs num_sequences]\r\n",
      "    [-num_threads int_value] [-ungapped] [-remote] [-comp_based_stats compo]\r\n",
      "    [-use_sw_tback] [-version]\r\n",
      "\r\n",
      "DESCRIPTION\r\n",
      "   Protein-Protein BLAST 2.7.1+\r\n",
      "\r\n",
      "Use '-help' to print detailed descriptions of command line arguments\r\n"
     ]
    }
   ],
   "source": [
    "!/Applications/bioinfo/ncbi-blast-2.7.1+/bin/blastp -h"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Create a blast database"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the example we will use Swiss-Prot protein database found http://www.uniprot.org/downloads\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-15T01:06:49.004880Z",
     "start_time": "2018-06-15T01:04:40.373914Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2018-06-14 18:04:40--  ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz\n",
      "           => ‘data/uniprot_sprot.fasta.gz’\n",
      "Resolving ftp.uniprot.org (ftp.uniprot.org)... 141.161.180.197\n",
      "Connecting to ftp.uniprot.org (ftp.uniprot.org)|141.161.180.197|:21... connected.\n",
      "Logging in as anonymous ... Logged in!\n",
      "==> SYST ... done.    ==> PWD ... done.\n",
      "==> TYPE I ... done.  ==> CWD (1) /pub/databases/uniprot/current_release/knowledgebase/complete ... done.\n",
      "==> SIZE uniprot_sprot.fasta.gz ... 87920815\n",
      "==> PASV ... done.    ==> RETR uniprot_sprot.fasta.gz ... done.\n",
      "Length: 87920815 (84M) (unauthoritative)\n",
      "\n",
      "uniprot_sprot.fasta 100%[===================>]  83.85M   220KB/s    in 2m 7s   \n",
      "\n",
      "2018-06-14 18:06:48 (675 KB/s) - ‘data/uniprot_sprot.fasta.gz’ saved [87920815]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz \\\n",
    "-O data/uniprot_sprot.fasta.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-15T01:06:49.163542Z",
     "start_time": "2018-06-15T01:06:49.010822Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "data/uniprot_sprot.fasta.gz\r\n"
     ]
    }
   ],
   "source": [
    "!ls -h data/uniprot_sprot.fasta.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-15T01:06:50.469685Z",
     "start_time": "2018-06-15T01:06:49.170432Z"
    }
   },
   "outputs": [],
   "source": [
    "!gunzip data/uniprot_sprot.fasta.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-15T00:58:59.880981Z",
     "start_time": "2018-06-15T00:58:59.728416Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "USAGE\r\n",
      "  makeblastdb [-h] [-help] [-in input_file] [-input_type type]\r\n",
      "    -dbtype molecule_type [-title database_title] [-parse_seqids]\r\n",
      "    [-hash_index] [-mask_data mask_data_files] [-mask_id mask_algo_ids]\r\n",
      "    [-mask_desc mask_algo_descriptions] [-gi_mask]\r\n",
      "    [-gi_mask_name gi_based_mask_names] [-out database_name]\r\n",
      "    [-max_file_sz number_of_bytes] [-logfile File_Name] [-taxid TaxID]\r\n",
      "    [-taxid_map TaxIDMapFile] [-version]\r\n",
      "\r\n",
      "DESCRIPTION\r\n",
      "   Application to create BLAST databases, version 2.7.1+\r\n",
      "\r\n",
      "Use '-help' to print detailed descriptions of command line arguments\r\n"
     ]
    }
   ],
   "source": [
    "!/Applications/bioinfo/ncbi-blast-2.7.1+/bin/makeblastdb -h"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-15T01:07:11.615853Z",
     "start_time": "2018-06-15T01:06:54.982523Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "Building a new DB, current time: 06/14/2018 18:06:55\n",
      "New DB name:   /Users/sr320/Documents/GitHub/nb-2018/C_virginica/data/uniprot_sprot\n",
      "New DB title:  data/uniprot_sprot.fasta\n",
      "Sequence type: Protein\n",
      "Keep MBits: T\n",
      "Maximum file size: 1000000000B\n",
      "Adding sequences from FASTA; added 557491 sequences in 16.3846 seconds.\n"
     ]
    }
   ],
   "source": [
    "!/Applications/bioinfo/ncbi-blast-2.7.1+/bin/makeblastdb \\\n",
    "-in data/uniprot_sprot.fasta \\\n",
    "-dbtype prot \\\n",
    "-out data/uniprot_sprot\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-15T01:07:26.809825Z",
     "start_time": "2018-06-15T01:07:26.679679Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "data/uniprot_sprot.fasta data/uniprot_sprot.pin\r\n",
      "data/uniprot_sprot.phr   data/uniprot_sprot.psq\r\n"
     ]
    }
   ],
   "source": [
    "!ls data/u*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Determine query sequences "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-15T01:07:36.322573Z",
     "start_time": "2018-06-15T01:07:36.312991Z"
    }
   },
   "source": [
    "Will be using protein sequences from C. virginica "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-15T01:15:05.573762Z",
     "start_time": "2018-06-15T01:14:53.500302Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2018-06-14 18:14:53--  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/022/765/GCF_002022765.2_C_virginica-3.0/GCF_002022765.2_C_virginica-3.0_protein.faa.gz\n",
      "           => ‘data/GCF_002022765.2_C_virginica-3.0_protein.fa.gz’\n",
      "Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 2607:f220:41e:250::10, 165.112.9.228\n",
      "Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|2607:f220:41e:250::10|:21... connected.\n",
      "Logging in as anonymous ... Logged in!\n",
      "==> SYST ... done.    ==> PWD ... done.\n",
      "==> TYPE I ... done.  ==> CWD (1) /genomes/all/GCF/002/022/765/GCF_002022765.2_C_virginica-3.0 ... done.\n",
      "==> SIZE GCF_002022765.2_C_virginica-3.0_protein.faa.gz ... 12708638\n",
      "==> EPSV ... done.    ==> RETR GCF_002022765.2_C_virginica-3.0_protein.faa.gz ... done.\n",
      "Length: 12708638 (12M) (unauthoritative)\n",
      "\n",
      "GCF_002022765.2_C_v 100%[===================>]  12.12M  1.56MB/s    in 7.7s    \n",
      "\n",
      "2018-06-14 18:15:05 (1.58 MB/s) - ‘data/GCF_002022765.2_C_virginica-3.0_protein.fa.gz’ saved [12708638]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/022/765/GCF_002022765.2_C_virginica-3.0/GCF_002022765.2_C_virginica-3.0_protein.faa.gz \\\n",
    " -O data/GCF_002022765.2_C_virginica-3.0_protein.fa.gz   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-15T01:15:16.704712Z",
     "start_time": "2018-06-15T01:15:16.379179Z"
    }
   },
   "outputs": [],
   "source": [
    "!gunzip data/GCF_002022765.2_C_virginica-3.0_protein.fa.gz  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-15T01:15:26.915657Z",
     "start_time": "2018-06-15T01:15:26.786333Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">XP_022286047.1 purine nucleoside phosphorylase-like isoform X3 [Crassostrea virginica]\r\n",
      "MHTQGSLEEVEALTEQIRSRIRHEPRVGIVCGSGLGGLADVVEDKQMIKYSELEGMPVSTVPGHVGQFVFGLLKGIPVML\r\n",
      "MQGRIHVYEGYPLHRVVLPIRVMWKFGIKTLILTNAAGGINPEYKVGDIMIMKDHLNLPGFCGVNPLIGVNDDRIGPRFP\r\n",
      "PMSDAYSKKYRDIAKQTAQELGYSEFVREGVYSFLTGPTFETVTECRLLRMIGSDATGMSTVPEATVARHCGMEVLAMSL\r\n",
      "ITNSCVMEFDSDQTANHEEVLETGRARSQDMQNLITKIVQKIVS\r\n",
      ">XP_022286048.1 uncharacterized protein LOC111099031 isoform X1 [Crassostrea virginica]\r\n",
      "MSTRVWYGGLLLFVLYGAGGTCDLCTFMNQTTSVSSCSASCGGGQQYYYRYVCCLNLTRSDCESHCNLNLTLTKSIKHFI\r\n",
      "GCGQECYGGGAFNETTESCLCPQGTTGPCCDPIIEPQNTTTETSPEQKNFNTTRETSPEQKYLNTTTETLPEQKYLNTTT\r\n",
      "ETSPEQKNFNTTKETSPEQKNVNTTTETSPEQKYRYTSTETSPQQKSQNTIKGSLYVSVGVPLGLISTLALVLAIIYFIC\r\n",
      "MKRKRNHRVDAAENVPDSDEIGTNRSRDTGQHKGEFT\r\n"
     ]
    }
   ],
   "source": [
    "!head data/GCF_002022765.2_C_virginica-3.0_protein.fa"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-15T01:15:47.803820Z",
     "start_time": "2018-06-15T01:15:47.435804Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "60213\r\n"
     ]
    }
   ],
   "source": [
    "!fgrep -c \">\" data/GCF_002022765.2_C_virginica-3.0_protein.fa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4 Blastp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-15T01:20:02.752625Z",
     "start_time": "2018-06-15T01:20:02.484708Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "USAGE\r\n",
      "  blastp [-h] [-help] [-import_search_strategy filename]\r\n",
      "    [-export_search_strategy filename] [-task task_name] [-db database_name]\r\n",
      "    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]\r\n",
      "    [-negative_gilist filename] [-negative_seqidlist filename]\r\n",
      "    [-entrez_query entrez_query] [-db_soft_mask filtering_algorithm]\r\n",
      "    [-db_hard_mask filtering_algorithm] [-subject subject_input_file]\r\n",
      "    [-subject_loc range] [-query input_file] [-out output_file]\r\n",
      "    [-evalue evalue] [-word_size int_value] [-gapopen open_penalty]\r\n",
      "    [-gapextend extend_penalty] [-qcov_hsp_perc float_value]\r\n",
      "    [-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value]\r\n",
      "    [-xdrop_gap_final float_value] [-searchsp int_value]\r\n",
      "    [-sum_stats bool_value] [-seg SEG_options] [-soft_masking soft_masking]\r\n",
      "    [-matrix matrix_name] [-threshold float_value] [-culling_limit int_value]\r\n",
      "    [-best_hit_overhang float_value] [-best_hit_score_edge float_value]\r\n",
      "    [-window_size int_value] [-lcase_masking] [-query_loc range]\r\n",
      "    [-parse_deflines] [-outfmt format] [-show_gis]\r\n",
      "    [-num_descriptions int_value] [-num_alignments int_value]\r\n",
      "    [-line_length line_length] [-html] [-max_target_seqs num_sequences]\r\n",
      "    [-num_threads int_value] [-ungapped] [-remote] [-comp_based_stats compo]\r\n",
      "    [-use_sw_tback] [-version]\r\n",
      "\r\n",
      "DESCRIPTION\r\n",
      "   Protein-Protein BLAST 2.7.1+\r\n",
      "\r\n",
      "OPTIONAL ARGUMENTS\r\n",
      " -h\r\n",
      "   Print USAGE and DESCRIPTION;  ignore all other parameters\r\n",
      " -help\r\n",
      "   Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters\r\n",
      " -version\r\n",
      "   Print version number;  ignore other arguments\r\n",
      "\r\n",
      " *** Input query options\r\n",
      " -query <File_In>\r\n",
      "   Input file name\r\n",
      "   Default = `-'\r\n",
      " -query_loc <String>\r\n",
      "   Location on the query sequence in 1-based offsets (Format: start-stop)\r\n",
      "\r\n",
      " *** General search options\r\n",
      " -task <String, Permissible values: 'blastp' 'blastp-fast' 'blastp-short' >\r\n",
      "   Task to execute\r\n",
      "   Default = `blastp'\r\n",
      " -db <String>\r\n",
      "   BLAST database name\r\n",
      "    * Incompatible with:  subject, subject_loc\r\n",
      " -out <File_Out>\r\n",
      "   Output file name\r\n",
      "   Default = `-'\r\n",
      " -evalue <Real>\r\n",
      "   Expectation value (E) threshold for saving hits \r\n",
      "   Default = `10'\r\n",
      " -word_size <Integer, >=2>\r\n",
      "   Word size for wordfinder algorithm\r\n",
      " -gapopen <Integer>\r\n",
      "   Cost to open a gap\r\n",
      " -gapextend <Integer>\r\n",
      "   Cost to extend a gap\r\n",
      " -matrix <String>\r\n",
      "   Scoring matrix name (normally BLOSUM62)\r\n",
      " -threshold <Real, >=0>\r\n",
      "   Minimum word score such that the word is added to the BLAST lookup table\r\n",
      " -comp_based_stats <String>\r\n",
      "   Use composition-based statistics:\r\n",
      "       D or d: default (equivalent to 2 )\r\n",
      "       0 or F or f: No composition-based statistics\r\n",
      "       1: Composition-based statistics as in NAR 29:2994-3005, 2001\r\n",
      "       2 or T or t : Composition-based score adjustment as in Bioinformatics\r\n",
      "   21:902-911,\r\n",
      "       2005, conditioned on sequence properties\r\n",
      "       3: Composition-based score adjustment as in Bioinformatics 21:902-911,\r\n",
      "       2005, unconditionally\r\n",
      "   Default = `2'\r\n",
      "\r\n",
      " *** BLAST-2-Sequences options\r\n",
      " -subject <File_In>\r\n",
      "   Subject sequence(s) to search\r\n",
      "    * Incompatible with:  db, gilist, seqidlist, negative_gilist,\r\n",
      "   negative_seqidlist, db_soft_mask, db_hard_mask\r\n",
      " -subject_loc <String>\r\n",
      "   Location on the subject sequence in 1-based offsets (Format: start-stop)\r\n",
      "    * Incompatible with:  db, gilist, seqidlist, negative_gilist,\r\n",
      "   negative_seqidlist, db_soft_mask, db_hard_mask, remote\r\n",
      "\r\n",
      " *** Formatting options\r\n",
      " -outfmt <String>\r\n",
      "   alignment view options:\r\n",
      "     0 = Pairwise,\r\n",
      "     1 = Query-anchored showing identities,\r\n",
      "     2 = Query-anchored no identities,\r\n",
      "     3 = Flat query-anchored showing identities,\r\n",
      "     4 = Flat query-anchored no identities,\r\n",
      "     5 = BLAST XML,\r\n",
      "     6 = Tabular,\r\n",
      "     7 = Tabular with comment lines,\r\n",
      "     8 = Seqalign (Text ASN.1),\r\n",
      "     9 = Seqalign (Binary ASN.1),\r\n",
      "    10 = Comma-separated values,\r\n",
      "    11 = BLAST archive (ASN.1),\r\n",
      "    12 = Seqalign (JSON),\r\n",
      "    13 = Multiple-file BLAST JSON,\r\n",
      "    14 = Multiple-file BLAST XML2,\r\n",
      "    15 = Single-file BLAST JSON,\r\n",
      "    16 = Single-file BLAST XML2,\r\n",
      "    18 = Organism Report\r\n",
      "   \r\n",
      "   Options 6, 7 and 10 can be additionally configured to produce\r\n",
      "   a custom format specified by space delimited format specifiers.\r\n",
      "   The supported format specifiers are:\r\n",
      "   \t    qseqid means Query Seq-id\r\n",
      "   \t       qgi means Query GI\r\n",
      "   \t      qacc means Query accesion\r\n",
      "   \t   qaccver means Query accesion.version\r\n",
      "   \t      qlen means Query sequence length\r\n",
      "   \t    sseqid means Subject Seq-id\r\n",
      "   \t sallseqid means All subject Seq-id(s), separated by a ';'\r\n",
      "   \t       sgi means Subject GI\r\n",
      "   \t    sallgi means All subject GIs\r\n",
      "   \t      sacc means Subject accession\r\n",
      "   \t   saccver means Subject accession.version\r\n",
      "   \t   sallacc means All subject accessions\r\n",
      "   \t      slen means Subject sequence length\r\n",
      "   \t    qstart means Start of alignment in query\r\n",
      "   \t      qend means End of alignment in query\r\n",
      "   \t    sstart means Start of alignment in subject\r\n",
      "   \t      send means End of alignment in subject\r\n",
      "   \t      qseq means Aligned part of query sequence\r\n",
      "   \t      sseq means Aligned part of subject sequence\r\n",
      "   \t    evalue means Expect value\r\n",
      "   \t  bitscore means Bit score\r\n",
      "   \t     score means Raw score\r\n",
      "   \t    length means Alignment length\r\n",
      "   \t    pident means Percentage of identical matches\r\n",
      "   \t    nident means Number of identical matches\r\n",
      "   \t  mismatch means Number of mismatches\r\n",
      "   \t  positive means Number of positive-scoring matches\r\n",
      "   \t   gapopen means Number of gap openings\r\n",
      "   \t      gaps means Total number of gaps\r\n",
      "   \t      ppos means Percentage of positive-scoring matches\r\n",
      "   \t    frames means Query and subject frames separated by a '/'\r\n",
      "   \t    qframe means Query frame\r\n",
      "   \t    sframe means Subject frame\r\n",
      "   \t      btop means Blast traceback operations (BTOP)\r\n",
      "   \t    staxid means Subject Taxonomy ID\r\n",
      "   \t  ssciname means Subject Scientific Name\r\n",
      "   \t  scomname means Subject Common Name\r\n",
      "   \tsblastname means Subject Blast Name\r\n",
      "   \t sskingdom means Subject Super Kingdom\r\n",
      "   \t   staxids means unique Subject Taxonomy ID(s), separated by a ';'\r\n",
      "   \t\t\t (in numerical order)\r\n",
      "   \t sscinames means unique Subject Scientific Name(s), separated by a ';'\r\n",
      "   \t scomnames means unique Subject Common Name(s), separated by a ';'\r\n",
      "   \tsblastnames means unique Subject Blast Name(s), separated by a ';'\r\n",
      "   \t\t\t (in alphabetical order)\r\n",
      "   \tsskingdoms means unique Subject Super Kingdom(s), separated by a ';'\r\n",
      "   \t\t\t (in alphabetical order) \r\n",
      "   \t    stitle means Subject Title\r\n",
      "   \tsalltitles means All Subject Title(s), separated by a '<>'\r\n",
      "   \t   sstrand means Subject Strand\r\n",
      "   \t     qcovs means Query Coverage Per Subject\r\n",
      "   \t   qcovhsp means Query Coverage Per HSP\r\n",
      "   \t    qcovus means Query Coverage Per Unique Subject (blastn only)\r\n",
      "   When not provided, the default value is:\r\n",
      "   'qaccver saccver pident length mismatch gapopen qstart qend sstart send\r\n",
      "   evalue bitscore', which is equivalent to the keyword 'std'\r\n",
      "   Default = `0'\r\n",
      " -show_gis\r\n",
      "   Show NCBI GIs in deflines?\r\n",
      " -num_descriptions <Integer, >=0>\r\n",
      "   Number of database sequences to show one-line descriptions for\r\n",
      "   Not applicable for outfmt > 4\r\n",
      "   Default = `500'\r\n",
      "    * Incompatible with:  max_target_seqs\r\n",
      " -num_alignments <Integer, >=0>\r\n",
      "   Number of database sequences to show alignments for\r\n",
      "   Default = `250'\r\n",
      "    * Incompatible with:  max_target_seqs\r\n",
      " -line_length <Integer, >=1>\r\n",
      "   Line length for formatting alignments\r\n",
      "   Not applicable for outfmt > 4\r\n",
      "   Default = `60'\r\n",
      " -html\r\n",
      "   Produce HTML output?\r\n",
      "\r\n",
      " *** Query filtering options\r\n",
      " -seg <String>\r\n",
      "   Filter query sequence with SEG (Format: 'yes', 'window locut hicut', or\r\n",
      "   'no' to disable)\r\n",
      "   Default = `no'\r\n",
      " -soft_masking <Boolean>\r\n",
      "   Apply filtering locations as soft masks\r\n",
      "   Default = `false'\r\n",
      " -lcase_masking\r\n",
      "   Use lower case filtering in query and subject sequence(s)?\r\n",
      "\r\n",
      " *** Restrict search or results\r\n",
      " -gilist <String>\r\n",
      "   Restrict search of database to list of GI's\r\n",
      "    * Incompatible with:  negative_gilist, seqidlist, negative_seqidlist,\r\n",
      "   remote, subject, subject_loc\r\n",
      " -seqidlist <String>\r\n",
      "   Restrict search of database to list of SeqId's\r\n",
      "    * Incompatible with:  gilist, negative_gilist, negative_seqidlist, remote,\r\n",
      "   subject, subject_loc\r\n",
      " -negative_gilist <String>\r\n",
      "   Restrict search of database to everything except the listed GIs\r\n",
      "    * Incompatible with:  gilist, seqidlist, remote, subject, subject_loc\r\n",
      " -negative_seqidlist <String>\r\n",
      "   Restrict search of database to everything except the listed SeqIDs\r\n",
      "    * Incompatible with:  gilist, seqidlist, remote, subject, subject_loc\r\n",
      " -entrez_query <String>\r\n",
      "   Restrict search with the given Entrez query\r\n",
      "    * Requires:  remote\r\n",
      " -db_soft_mask <String>\r\n",
      "   Filtering algorithm ID to apply to the BLAST database as soft masking\r\n",
      "    * Incompatible with:  db_hard_mask, subject, subject_loc\r\n",
      " -db_hard_mask <String>\r\n",
      "   Filtering algorithm ID to apply to the BLAST database as hard masking\r\n",
      "    * Incompatible with:  db_soft_mask, subject, subject_loc\r\n",
      " -qcov_hsp_perc <Real, 0..100>\r\n",
      "   Percent query coverage per hsp\r\n",
      " -max_hsps <Integer, >=1>\r\n",
      "   Set maximum number of HSPs per subject sequence to save for each query\r\n",
      " -culling_limit <Integer, >=0>\r\n",
      "   If the query range of a hit is enveloped by that of at least this many\r\n",
      "   higher-scoring hits, delete the hit\r\n",
      "    * Incompatible with:  best_hit_overhang, best_hit_score_edge\r\n",
      " -best_hit_overhang <Real, (>0 and <0.5)>\r\n",
      "   Best Hit algorithm overhang value (recommended value: 0.1)\r\n",
      "    * Incompatible with:  culling_limit\r\n",
      " -best_hit_score_edge <Real, (>0 and <0.5)>\r\n",
      "   Best Hit algorithm score edge value (recommended value: 0.1)\r\n",
      "    * Incompatible with:  culling_limit\r\n",
      " -max_target_seqs <Integer, >=1>\r\n",
      "   Maximum number of aligned sequences to keep \r\n",
      "   Not applicable for outfmt <= 4\r\n",
      "   Default = `500'\r\n",
      "    * Incompatible with:  num_descriptions, num_alignments\r\n",
      "\r\n",
      " *** Statistical options\r\n",
      " -dbsize <Int8>\r\n",
      "   Effective length of the database \r\n",
      " -searchsp <Int8, >=0>\r\n",
      "   Effective length of the search space\r\n",
      " -sum_stats <Boolean>\r\n",
      "   Use sum statistics\r\n",
      "\r\n",
      " *** Search strategy options\r\n",
      " -import_search_strategy <File_In>\r\n",
      "   Search strategy to use\r\n",
      "    * Incompatible with:  export_search_strategy\r\n",
      " -export_search_strategy <File_Out>\r\n",
      "   File name to record the search strategy used\r\n",
      "    * Incompatible with:  import_search_strategy\r\n",
      "\r\n",
      " *** Extension options\r\n",
      " -xdrop_ungap <Real>\r\n",
      "   X-dropoff value (in bits) for ungapped extensions\r\n",
      " -xdrop_gap <Real>\r\n",
      "   X-dropoff value (in bits) for preliminary gapped extensions\r\n",
      " -xdrop_gap_final <Real>\r\n",
      "   X-dropoff value (in bits) for final gapped alignment\r\n",
      " -window_size <Integer, >=0>\r\n",
      "   Multiple hits window size, use 0 to specify 1-hit algorithm\r\n",
      " -ungapped\r\n",
      "   Perform ungapped alignment only?\r\n",
      "\r\n",
      " *** Miscellaneous options\r\n",
      " -parse_deflines\r\n",
      "   Should the query and subject defline(s) be parsed?\r\n",
      " -num_threads <Integer, (>=1 and =<4)>\r\n",
      "   Number of threads (CPUs) to use in the BLAST search\r\n",
      "   Default = `1'\r\n",
      "    * Incompatible with:  remote\r\n",
      " -remote\r\n",
      "   Execute search remotely?\r\n",
      "    * Incompatible with:  gilist, seqidlist, negative_gilist,\r\n",
      "   negative_seqidlist, subject_loc, num_threads\r\n",
      " -use_sw_tback\r\n",
      "   Compute locally optimal Smith-Waterman alignments?\r\n",
      "\r\n"
     ]
    }
   ],
   "source": [
    "!/Applications/bioinfo/ncbi-blast-2.7.1+/bin/blastp -help"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-15T13:10:54.389440Z",
     "start_time": "2018-06-15T01:21:18.836455Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "^C\r\n"
     ]
    }
   ],
   "source": [
    "!/Applications/bioinfo/ncbi-blast-2.7.1+/bin/blastp \\\n",
    "-query data/GCF_002022765.2_C_virginica-3.0_protein.fa \\\n",
    "-db data/uniprot_sprot \\\n",
    "-max_target_seqs 1 \\\n",
    "-evalue 1E-20 \\\n",
    "-outfmt 6 \\\n",
    "-num_threads 4 \\\n",
    "-out data/Cv_sprot.blastout\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "estimated time - 33 hours"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Note on running on cluster with scheduler"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "#!/bin/bash\n",
    "## Job Name\n",
    "#SBATCH --job-name=blastp\n",
    "## Allocation Definition\n",
    "#SBATCH --account=srlab\n",
    "#SBATCH --partition=srlab\n",
    "## Resources\n",
    "## Nodes (We only get 1, so this is fixed)\n",
    "#SBATCH --nodes=1\n",
    "## Walltime (days-hours:minutes:seconds format)\n",
    "#SBATCH --time=10-100:00:00\n",
    "## Memory per node\n",
    "#SBATCH --mem=70G\n",
    "#SBATCH --mail-type=ALL\n",
    "#SBATCH --mail-user=sr320@uw.edu\n",
    "## Specify the working directory for this job\n",
    "#SBATCH --workdir=/gscratch/srlab/sr320/analyses/0614\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "source /gscratch/srlab/programs/scripts/paths.sh\n",
    "\n",
    "\n",
    "/gscratch/srlab/programs/ncbi-blast-2.6.0+/bin/blastp  \\\n",
    "-query /gscratch/srlab/sr320/query/GCF_002022765.2_C_virginica-3.0_protein.faa \\\n",
    "-db /gscratch/srlab/sr320/blastdb/uniprot_sprot_080917 \\\n",
    "-max_target_seqs 1 \\\n",
    "-evalue 1E-20 \\\n",
    "-outfmt 6 \\\n",
    "-num_threads 28 \\\n",
    "-out Cv_sprot.blastout\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Time on Mox = 3 hours 40 minutes\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}