{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Notebook to annotate the BLASTX output from the _P. helianthoides_ genome gene list FASTA with uniprot SP GO" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'/Users/graciecrandall/Documents/GitHub/paper-pycno-sswd-2021/code'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pwd" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "01-FastQC_pre-trim.Rmd\r\n", "02-20220810_pycno_fastp.sh\r\n", "03-kallisto-summer21-phelgenomegenelist-20240117.Rmd\r\n", "04-PCAplots.Rmd\r\n", "05-deseq2.Rmd\r\n", "06-BLAST.Rmd\r\n", "07-BLAST_annot-SP_GO.ipynb\r\n", "07-deglist_annot.Rmd\r\n", "readme.md\r\n" ] } ], "source": [ "ls" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#change working directory to analyses folder for this notebook\n", "wd = \"../analyses/06-BLAST/\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/graciecrandall/Documents/GitHub/paper-pycno-sswd-2021/analyses/06-BLAST\n" ] } ], "source": [ "cd $wd" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'/Users/graciecrandall/Documents/GitHub/paper-pycno-sswd-2021/analyses/06-BLAST'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pwd" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GO-GOslim.sorted uniprot-SP-GO.sorted\r\n", "summer2021-uniprot_blastx.tab\r\n" ] } ], "source": [ "ls" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I already copied over the files needed to annotate: \n", "1. summer2021-uniprot_blastx.tab was in here from the Rmd 06-BLAST.Rmd that was run on Raven to get the BLAST output\n", "2. uniprot-SP-GO.sorted was copied over from paper-crab repository\n", "3. GO-GOslim.sorted was also copied over from paper-crab repository" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "!sort -u -k1,1 --field-separator $'\\t' summer2021-uniprot_blastx.tab > blastout" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 13606 blastout\r\n" ] } ], "source": [ "!wc -l blastout" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#set `blast` output file as variable\n", "blastout=\"summer2021-uniprot_blastx.tab\"" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "g1000.t1\tsp\tQ6AY85\tALG14_RAT\t53.846\t182\t81\t1\t130\t666\t35\t216\t1.85E-73\t224\r", "\r\n", "g10000.t1\tsp\tQ8C163\tEXOG_MOUSE\t38.281\t256\t150\t5\t34\t792\t16\t266\t1.39E-52\t178\r", "\r\n" ] } ], "source": [ "!head -2 blastout" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Normally here would be a code chunk to convert pipes to tabs, but I did that in excel already. If I hadn't code would be: \n", "```\n", "!tr '|' '\\t' < blastout \\\n", "> _blast-sep.tab\n", "```" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#reduce the number of columns and sort\n", "!awk -v OFS='\\t' '{print $3, $1, $13}' < blastout | sort \\\n", "> _blastout_sort.tab" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A0A061ACU2\tg2123.t1\t0\r\n", "A0A061ACU2\tg2123.t2\t0\r\n" ] } ], "source": [ "!head -2 _blastout_sort.tab" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A0A023GPI8\tLECA_CANBL\treviewed\tLectin alpha chain (CboL) [Cleaved into: Lectin beta chain; Lectin gamma chain]\t\tCanavalia boliviana\t237\t\t\tmannose binding [GO:0005537]; metal ion binding [GO:0046872]\tmannose binding [GO:0005537]; metal ion binding [GO:0046872]\tGO:0005537; GO:0046872\r\n", "A0A023GPJ0\tCDII_ENTCC\treviewed\tImmunity protein CdiI\tcdiI ECL_04450.1\tEnterobacter cloacae subsp. cloacae (strain ATCC 13047 / DSM 30054 / NBRC 13535 / NCDC 279-56)\t145\t\t\t\t\t\r\n" ] } ], "source": [ "!head -2 uniprot-SP-GO.sorted" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "#join blastout_sort.tab with uniprot-SP-GO.sorted and reduce to three columns: UniprotID, QueryID, All Go Terms\n", "!join -t $'\\t' \\\n", "_blastout_sort.tab \\\n", "uniprot-SP-GO.sorted \\\n", "| cut -f1,2,14 \\\n", "> blast-annot.tab" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A0A0D2YG10\tg13111.t1\tGO:0016491; GO:0016746; GO:0031177\r\n", "A0A0D3QS98\tg5914.t1\tGO:0005576; GO:0016055; GO:0030178; GO:1990697; GO:1990699\r\n" ] } ], "source": [ "!head -2 blast-annot.tab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following is a script modified by Sam White" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "%%bash \n", "\n", "# This script was originally written to address a specific problem that Rhonda was having\n", "\n", "\n", "\n", "# input_file is the initial, \"problem\" file\n", "# file is an intermediate file that most of the program works upon\n", "# output_file is the final file produced by the script\n", "input_file=\"blast-annot.tab\"\n", "file=\"_intermediate.file\"\n", "output_file=\"_blast-GO-unfolded.tab\"\n", "\n", "# sed command substitutes the \"; \" sequence to a tab and writes the new format to a new file.\n", "# This character sequence is how the GO terms are delimited in their field.\n", "sed $'s/; /\\t/g' \"$input_file\" > \"$file\"\n", "\n", "# Identify first field containing a GO term.\n", "# Search file with grep for \"GO:\" and pipe to awk.\n", "# Awk sets tab as field delimiter (-F'\\t'), runs a for loop that looks for \"GO:\" (~/GO:/), and then prints the field number).\n", "# Awk results are piped to sort, which sorts unique by number (-ug).\n", "# Sort results are piped to head to retrieve the lowest value (i.e. the top of the list; \"-n1\").\n", "begin_goterms=$(grep \"GO:\" \"$file\" | awk -F'\\t' '{for (i=1;i<=NF;i++) if($i ~/GO:/) print i}' | sort -ug | head -n1)\n", "\n", "# While loop to process each line of the input file.\n", "while read -r line\n", "\tdo\n", "\t\n", "\t# Send contents of the current line to awk.\n", "\t# Set the field separator as a tab (-F'\\t') and print the number of fields in that line.\n", "\t# Save the results of the echo/awk pipe (i.e. number of fields) to the variable \"max_field\".\n", "\tmax_field=$(echo \"$line\" | awk -F'\\t' '{print NF}')\n", "\n", "\t# Send contents of current line to cut.\n", "\t# Cut fields (i.e. retain those fields) 1-12.\n", "\t# Save the results of the echo/cut pipe (i.e. fields 1-12) to the variable \"fixed_fields\"\n", "\tfixed_fields=$(echo \"$line\" | cut -f1-2)\n", "\n", "\t# Since not all the lines contain the same number of fields (e.g. may not have GO terms),\n", "\t# evaluate the number of fields in each line to determine how to handle current line.\n", "\n", "\t# If the value in max_field is less than the field number where the GO terms begin,\n", "\t# then just print the current line (%s) followed by a newline (\\n).\n", "\tif (( \"$max_field\" < \"$begin_goterms\" ))\n", "\t\tthen printf \"%s\\n\" \"$line\"\n", "\t\t\telse\n", "\n", "\t\t\t# Send contents of current line (which contains GO terms) to cut.\n", "\t\t\t# Cut fields (i.e. retain those fields) 13 to whatever the last field is in the curent line.\n", "\t\t\t# Save the results of the echo/cut pipe (i.e. all the GO terms fields) to the variable \"goterms\".\n", "\t\t\tgoterms=$(echo \"$line\" | cut -f\"$begin_goterms\"-\"$max_field\")\n", "\t\t\t\n", "\t\t\t# Assign values in the variable \"goterms\" to a new indexed array (called \"array\"), \n", "\t\t\t# with tab delimiter (IFS=$'\\t')\n", "\t\t\tIFS=$'\\t' read -r -a array <<<\"$goterms\"\n", "\t\t\t\n", "\t\t\t# Iterate through each element of the array.\n", "\t\t\t# Print the first 12 fields (i.e. the fields stored in \"fixed_fields\") followed by a tab (%s\\t).\n", "\t\t\t# Print the current element in the array (i.e. the current GO term) followed by a new line (%s\\n).\n", "\t\t\tfor element in \"${!array[@]}\"\t\n", "\t\t\t\tdo printf \"%s\\t%s\\n\" \"$fixed_fields\" \"${array[$element]}\"\n", "\t\t\tdone\n", "\tfi\n", "\n", "# Send the input file into the while loop and send the output to a file named \"rhonda_fixed.txt\".\n", "done < \"$file\" > \"$output_file\"" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A0A0D2YG10\tg13111.t1\tGO:0016491\r\n", "A0A0D2YG10\tg13111.t1\tGO:0016746\r\n", "A0A0D2YG10\tg13111.t1\tGO:0031177\r\n", "A0A0D3QS98\tg5914.t1\tGO:0005576\r\n", "A0A0D3QS98\tg5914.t1\tGO:0016055\r\n", "A0A0D3QS98\tg5914.t1\tGO:0030178\r\n", "A0A0D3QS98\tg5914.t1\tGO:1990697\r\n", "A0A0D3QS98\tg5914.t1\tGO:1990699\r\n", "A0A0F7Z3J2\tg4163.t1\tGO:0005179\r\n", "A0A0F7Z3J2\tg4163.t1\tGO:0005576\r\n" ] } ], "source": [ "!head _blast-GO-unfolded.tab" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#get rid of lines with no GOIDs\n", "!sort -k3 _blast-GO-unfolded.tab | grep \"GO:\" > _blast-GO-unfolded.sorted" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#joining files to get GOslim for each query (with duplicate GOslim / query removed)\n", "!join -1 3 -2 1 -t $'\\t' \\\n", "_blast-GO-unfolded.sorted \\\n", "GO-GOslim.sorted \\\n", "| uniq | awk -F'\\t' -v OFS='\\t' '{print $3, $1, $5, $6}' \\\n", "> Blastquery-GOslim.tab" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "g9026.t1\tGO:0000002\tcell organization and biogenesis\tP\r\n", "g1868.t1\tGO:0000002\tcell organization and biogenesis\tP\r\n", "g9382.t1\tGO:0000002\tcell organization and biogenesis\tP\r\n", "g16208.t1\tGO:0000002\tcell organization and biogenesis\tP\r\n", "g4483.t1\tGO:0000002\tcell organization and biogenesis\tP\r\n", "g4483.t2\tGO:0000002\tcell organization and biogenesis\tP\r\n", "g22532.t1\tGO:0000003\tother biological processes\tP\r\n", "g21574.t1\tGO:0000009\tother molecular function\tF\r\n", "g15378.t1\tGO:0000010\tother molecular function\tF\r\n", "g5617.t1\tGO:0000010\tother molecular function\tF\r\n" ] } ], "source": [ "!head Blastquery-GOslim.tab" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 144162 Blastquery-GOslim.tab\r\n" ] } ], "source": [ "#check number of rows \n", "!wc -l Blastquery-GOslim.tab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Push Blastquery-GOslim.tab to GitHub repo for later joining with DEGlists" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Blastquery-GOslim.tab blast-annot.tab\r\n", "GO-GOslim.sorted blastout\r\n", "_blast-GO-unfolded.sorted blastout_sort.tab\r\n", "_blast-GO-unfolded.tab summer2021-uniprot_blastx.tab\r\n", "_blastout_sort.tab uniprot-SP-GO.sorted\r\n", "_intermediate.file\r\n" ] } ], "source": [ "#check directory contents\n", "!ls" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "64005 \r\n" ] } ], "source": [ "#count unique P rows (P = biological process)\n", "!cat Blastquery-GOslim.tab | grep \"\tP\" | awk -F $'\\t' '{print $10}' | sort | uniq -c" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Original blast output has 13606 rows. More rows in Blastquery-GOslim.tab becasue there are multiple hits per query. " ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "64005 \r\n" ] } ], "source": [ "!cat Blastquery-GOslim.tab | grep \"\tP\" | awk -F $'\\t' '{print $9}' | sort | uniq -c | sort -r" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.6" } }, "nbformat": 4, "nbformat_minor": 2 }