{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Notebook to annotate the BLASTX output from the _P. helianthoides_ genome gene list FASTA with uniprot SP GO"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'/Users/graciecrandall/Documents/GitHub/paper-pycno-sswd-2021/code'"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pwd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "01-FastQC_pre-trim.Rmd\r\n",
      "02-20220810_pycno_fastp.sh\r\n",
      "03-kallisto-summer21-phelgenomegenelist-20240117.Rmd\r\n",
      "04-PCAplots.Rmd\r\n",
      "05-deseq2.Rmd\r\n",
      "06-BLAST.Rmd\r\n",
      "07-BLAST_annot-SP_GO.ipynb\r\n",
      "07-deglist_annot.Rmd\r\n",
      "readme.md\r\n"
     ]
    }
   ],
   "source": [
    "ls"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#change working directory to analyses folder for this notebook\n",
    "wd = \"../analyses/06-BLAST/\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/Users/graciecrandall/Documents/GitHub/paper-pycno-sswd-2021/analyses/06-BLAST\n"
     ]
    }
   ],
   "source": [
    "cd $wd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'/Users/graciecrandall/Documents/GitHub/paper-pycno-sswd-2021/analyses/06-BLAST'"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pwd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "GO-GOslim.sorted               uniprot-SP-GO.sorted\r\n",
      "summer2021-uniprot_blastx.tab\r\n"
     ]
    }
   ],
   "source": [
    "ls"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I already copied over the files needed to annotate:     \n",
    "1. summer2021-uniprot_blastx.tab was in here from the Rmd 06-BLAST.Rmd that was run on Raven to get the BLAST output\n",
    "2. uniprot-SP-GO.sorted was copied over from paper-crab repository\n",
    "3. GO-GOslim.sorted was also copied over from paper-crab repository"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!sort -u -k1,1 --field-separator $'\\t' summer2021-uniprot_blastx.tab > blastout"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   13606 blastout\r\n"
     ]
    }
   ],
   "source": [
    "!wc -l blastout"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#set `blast` output file as variable\n",
    "blastout=\"summer2021-uniprot_blastx.tab\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "g1000.t1\tsp\tQ6AY85\tALG14_RAT\t53.846\t182\t81\t1\t130\t666\t35\t216\t1.85E-73\t224\r",
      "\r\n",
      "g10000.t1\tsp\tQ8C163\tEXOG_MOUSE\t38.281\t256\t150\t5\t34\t792\t16\t266\t1.39E-52\t178\r",
      "\r\n"
     ]
    }
   ],
   "source": [
    "!head -2 blastout"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Normally here would be a code chunk to convert pipes to tabs, but I did that in excel already. If I hadn't code would be:  \n",
    "```\n",
    "!tr '|' '\\t' < blastout \\\n",
    "> _blast-sep.tab\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#reduce the number of columns and sort\n",
    "!awk -v OFS='\\t' '{print $3, $1, $13}' < blastout | sort \\\n",
    "> _blastout_sort.tab"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A0A061ACU2\tg2123.t1\t0\r\n",
      "A0A061ACU2\tg2123.t2\t0\r\n"
     ]
    }
   ],
   "source": [
    "!head -2 _blastout_sort.tab"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A0A023GPI8\tLECA_CANBL\treviewed\tLectin alpha chain (CboL) [Cleaved into: Lectin beta chain; Lectin gamma chain]\t\tCanavalia boliviana\t237\t\t\tmannose binding [GO:0005537]; metal ion binding [GO:0046872]\tmannose binding [GO:0005537]; metal ion binding [GO:0046872]\tGO:0005537; GO:0046872\r\n",
      "A0A023GPJ0\tCDII_ENTCC\treviewed\tImmunity protein CdiI\tcdiI ECL_04450.1\tEnterobacter cloacae subsp. cloacae (strain ATCC 13047 / DSM 30054 / NBRC 13535 / NCDC 279-56)\t145\t\t\t\t\t\r\n"
     ]
    }
   ],
   "source": [
    "!head -2 uniprot-SP-GO.sorted"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "#join blastout_sort.tab with uniprot-SP-GO.sorted and reduce to three columns: UniprotID, QueryID, All Go Terms\n",
    "!join -t $'\\t' \\\n",
    "_blastout_sort.tab \\\n",
    "uniprot-SP-GO.sorted \\\n",
    "| cut -f1,2,14 \\\n",
    "> blast-annot.tab"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A0A0D2YG10\tg13111.t1\tGO:0016491; GO:0016746; GO:0031177\r\n",
      "A0A0D3QS98\tg5914.t1\tGO:0005576; GO:0016055; GO:0030178; GO:1990697; GO:1990699\r\n"
     ]
    }
   ],
   "source": [
    "!head -2 blast-annot.tab"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following is a script modified by Sam White"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash \n",
    "\n",
    "# This script was originally written to address a specific problem that Rhonda was having\n",
    "\n",
    "\n",
    "\n",
    "# input_file is the initial, \"problem\" file\n",
    "# file is an intermediate file that most of the program works upon\n",
    "# output_file is the final file produced by the script\n",
    "input_file=\"blast-annot.tab\"\n",
    "file=\"_intermediate.file\"\n",
    "output_file=\"_blast-GO-unfolded.tab\"\n",
    "\n",
    "# sed command substitutes the \"; \" sequence to a tab and writes the new format to a new file.\n",
    "# This character sequence is how the GO terms are delimited in their field.\n",
    "sed $'s/; /\\t/g' \"$input_file\" > \"$file\"\n",
    "\n",
    "# Identify first field containing a GO term.\n",
    "# Search file with grep for \"GO:\" and pipe to awk.\n",
    "# Awk sets tab as field delimiter (-F'\\t'), runs a for loop that looks for \"GO:\" (~/GO:/), and then prints the field number).\n",
    "# Awk results are piped to sort, which sorts unique by number (-ug).\n",
    "# Sort results are piped to head to retrieve the lowest value (i.e. the top of the list; \"-n1\").\n",
    "begin_goterms=$(grep \"GO:\" \"$file\" | awk -F'\\t' '{for (i=1;i<=NF;i++) if($i ~/GO:/) print i}' | sort -ug | head -n1)\n",
    "\n",
    "# While loop to process each line of the input file.\n",
    "while read -r line\n",
    "\tdo\n",
    "\t\n",
    "\t# Send contents of the current line to awk.\n",
    "\t# Set the field separator as a tab (-F'\\t') and print the number of fields in that line.\n",
    "\t# Save the results of the echo/awk pipe (i.e. number of fields) to the variable \"max_field\".\n",
    "\tmax_field=$(echo \"$line\" | awk -F'\\t' '{print NF}')\n",
    "\n",
    "\t# Send contents of current line to cut.\n",
    "\t# Cut fields (i.e. retain those fields) 1-12.\n",
    "\t# Save the results of the echo/cut pipe (i.e. fields 1-12) to the variable \"fixed_fields\"\n",
    "\tfixed_fields=$(echo \"$line\" | cut -f1-2)\n",
    "\n",
    "\t# Since not all the lines contain the same number of fields (e.g. may not have GO terms),\n",
    "\t# evaluate the number of fields in each line to determine how to handle current line.\n",
    "\n",
    "\t# If the value in max_field is less than the field number where the GO terms begin,\n",
    "\t# then just print the current line (%s) followed by a newline (\\n).\n",
    "\tif (( \"$max_field\" < \"$begin_goterms\" ))\n",
    "\t\tthen printf \"%s\\n\" \"$line\"\n",
    "\t\t\telse\n",
    "\n",
    "\t\t\t# Send contents of current line (which contains GO terms) to cut.\n",
    "\t\t\t# Cut fields (i.e. retain those fields) 13 to whatever the last field is in the curent line.\n",
    "\t\t\t# Save the results of the echo/cut pipe (i.e. all the GO terms fields) to the variable \"goterms\".\n",
    "\t\t\tgoterms=$(echo \"$line\" | cut -f\"$begin_goterms\"-\"$max_field\")\n",
    "\t\t\t\n",
    "\t\t\t# Assign values in the variable \"goterms\" to a new indexed array (called \"array\"), \n",
    "\t\t\t# with tab delimiter (IFS=$'\\t')\n",
    "\t\t\tIFS=$'\\t' read -r -a array <<<\"$goterms\"\n",
    "\t\t\t\n",
    "\t\t\t# Iterate through each element of the array.\n",
    "\t\t\t# Print the first 12 fields (i.e. the fields stored in \"fixed_fields\") followed by a tab (%s\\t).\n",
    "\t\t\t# Print the current element in the array (i.e. the current GO term) followed by a new line (%s\\n).\n",
    "\t\t\tfor element in \"${!array[@]}\"\t\n",
    "\t\t\t\tdo printf \"%s\\t%s\\n\" \"$fixed_fields\" \"${array[$element]}\"\n",
    "\t\t\tdone\n",
    "\tfi\n",
    "\n",
    "# Send the input file into the while loop and send the output to a file named \"rhonda_fixed.txt\".\n",
    "done < \"$file\" > \"$output_file\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A0A0D2YG10\tg13111.t1\tGO:0016491\r\n",
      "A0A0D2YG10\tg13111.t1\tGO:0016746\r\n",
      "A0A0D2YG10\tg13111.t1\tGO:0031177\r\n",
      "A0A0D3QS98\tg5914.t1\tGO:0005576\r\n",
      "A0A0D3QS98\tg5914.t1\tGO:0016055\r\n",
      "A0A0D3QS98\tg5914.t1\tGO:0030178\r\n",
      "A0A0D3QS98\tg5914.t1\tGO:1990697\r\n",
      "A0A0D3QS98\tg5914.t1\tGO:1990699\r\n",
      "A0A0F7Z3J2\tg4163.t1\tGO:0005179\r\n",
      "A0A0F7Z3J2\tg4163.t1\tGO:0005576\r\n"
     ]
    }
   ],
   "source": [
    "!head _blast-GO-unfolded.tab"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#get rid of lines with no GOIDs\n",
    "!sort -k3 _blast-GO-unfolded.tab | grep \"GO:\" > _blast-GO-unfolded.sorted"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#joining files to get GOslim for each query (with duplicate GOslim / query removed)\n",
    "!join -1 3 -2 1 -t $'\\t' \\\n",
    "_blast-GO-unfolded.sorted \\\n",
    "GO-GOslim.sorted \\\n",
    "| uniq | awk -F'\\t' -v OFS='\\t' '{print $3, $1, $5, $6}' \\\n",
    "> Blastquery-GOslim.tab"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "g9026.t1\tGO:0000002\tcell organization and biogenesis\tP\r\n",
      "g1868.t1\tGO:0000002\tcell organization and biogenesis\tP\r\n",
      "g9382.t1\tGO:0000002\tcell organization and biogenesis\tP\r\n",
      "g16208.t1\tGO:0000002\tcell organization and biogenesis\tP\r\n",
      "g4483.t1\tGO:0000002\tcell organization and biogenesis\tP\r\n",
      "g4483.t2\tGO:0000002\tcell organization and biogenesis\tP\r\n",
      "g22532.t1\tGO:0000003\tother biological processes\tP\r\n",
      "g21574.t1\tGO:0000009\tother molecular function\tF\r\n",
      "g15378.t1\tGO:0000010\tother molecular function\tF\r\n",
      "g5617.t1\tGO:0000010\tother molecular function\tF\r\n"
     ]
    }
   ],
   "source": [
    "!head Blastquery-GOslim.tab"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  144162 Blastquery-GOslim.tab\r\n"
     ]
    }
   ],
   "source": [
    "#check number of rows \n",
    "!wc -l Blastquery-GOslim.tab"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Push Blastquery-GOslim.tab to GitHub repo for later joining with DEGlists"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Blastquery-GOslim.tab         blast-annot.tab\r\n",
      "GO-GOslim.sorted              blastout\r\n",
      "_blast-GO-unfolded.sorted     blastout_sort.tab\r\n",
      "_blast-GO-unfolded.tab        summer2021-uniprot_blastx.tab\r\n",
      "_blastout_sort.tab            uniprot-SP-GO.sorted\r\n",
      "_intermediate.file\r\n"
     ]
    }
   ],
   "source": [
    "#check directory contents\n",
    "!ls"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "64005 \r\n"
     ]
    }
   ],
   "source": [
    "#count unique P rows (P = biological process)\n",
    "!cat Blastquery-GOslim.tab | grep \"\tP\" | awk -F $'\\t' '{print $10}' | sort | uniq -c"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Original blast output has 13606 rows. More rows in Blastquery-GOslim.tab becasue there are multiple hits per query. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "64005 \r\n"
     ]
    }
   ],
   "source": [
    "!cat Blastquery-GOslim.tab | grep \"\tP\" | awk -F $'\\t' '{print $9}' | sort | uniq -c | sort -r"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}