# `blastx` and Uniprot File Merging

In this notebook, we will merge `blastx` output for *Zostera marina* and *Labyrinthula zosterae* with information from the [Uniprot-SwissProt database](https://www.uniprot.org/uniprot/?query=reviewed:yes).

## Step 0. Set working directory

In [1]:
!pwd

/Users/yaaminivenkataraman/Documents/project-EWD-transcriptomics/scripts


In [2]:
cd ../data/

/Users/yaaminivenkataraman/Documents/project-EWD-transcriptomics/data


In [4]:
ls -F

2019-07-09-Merging-Script-Troubleshooting/
README.md
Zostera-blast-sep.tab
Zostera-blast-sort.tab
Zostera_SwissProt_e5_output*
Zostera_contigs.fasta*
nonZostera-blast-sep.tab
nonZostera-blast-sort.tab
nonZostera_SwissProt_e5_outputBOX.txt*
nonZostera_contigs.fasta*
uniprot-SP-GO-sorted.tab
uniprot-reviewed_yes.tab


## Step 1. Format `blastx` output

### Step 1a. *Z. marina*

In [11]:
!head -n2 Zostera_SwissProt_e5_output

TRINITY_DN31278_c0_g1_i1	sp|Q54J75|RPB2_DICDI	64.0	111	40	0	1	333	950	1060	6.9e-40	164.5
TRINITY_DN31239_c0_g1_i1	sp|Q9SIT6|AB5G_ARATH	29.6	125	88	0	58	432	440	564	1.4e-12	74.3


In [16]:
#Convert pipe delimiters to tab delimiters using tr (tr means translate)
!tr '|' '\t' < Zostera_SwissProt_e5_output \
> Zostera-blast-sep.tab

In [17]:
!head -n2 Zostera-blast-sep.tab

TRINITY_DN31278_c0_g1_i1	sp	Q54J75	RPB2_DICDI	64.0	111	40	0	1	333	950	1060	6.9e-40	164.5
TRINITY_DN31239_c0_g1_i1	sp	Q9SIT6	AB5G_ARATH	29.6	125	88	0	58	432	440	564	1.4e-12	74.3


In [26]:
#Keep the Uniprot accession ($3), isoform ID ($1), and e-value ($13) using awk. Sort by accession, and save as a new file.
!awk -v OFS='\t' '{print $3, $1, $13}' < Zostera-blast-sep.tab | sort \
> Zostera-blast-sort.tab

In [27]:
!head Zostera-blast-sort.tab

A0A024B7I0	TRINITY_DN278019_c0_g1_i1	6.4e-133
A0A067XMP1	TRINITY_DN166310_c0_g1_i1	2.0e-17
A0A067XMP1	TRINITY_DN17396_c0_g1_i1	6.1e-17
A0A067XMP1	TRINITY_DN309320_c0_g7_i1	3.2e-07
A0A068FIK2	TRINITY_DN241620_c0_g1_i1	1.6e-37
A0A068FIK2	TRINITY_DN285385_c0_g3_i3	2.4e-29
A0A068FIK2	TRINITY_DN308379_c0_g1_i1	3.7e-207
A0A068FIK2	TRINITY_DN308379_c0_g1_i4	0.0e+00
A0A068FIK2	TRINITY_DN308379_c0_g1_i5	4.5e-226
A0A068FIK2	TRINITY_DN308379_c0_g1_i5	7.7e-133


In [9]:
!wc Zostera-blast-sort.tab
!echo "Zostera transcripts"

 270061 810183 11135448 Zostera-blast-sort.tab
Zostera transcripts
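
For readers who prefer to check the reformatting logic outside the shell, the `tr` and `awk` steps above can be sketched in Python. This is only an illustration under stated assumptions: `parse_blast_line` is a hypothetical helper name that does not appear elsewhere in this notebook, and the example line is the first `blastx` hit shown in Step 1a.

```python
# Hypothetical helper for illustration; not part of the original workflow.
def parse_blast_line(line):
    """Return (accession, isoform_id, evalue) from one blastx tabular line."""
    fields = line.rstrip("\n").split("\t")
    # The subject ID looks like "sp|Q54J75|RPB2_DICDI"; the accession is the middle part.
    _database, accession, _entry_name = fields[1].split("|")
    # In the unsplit line the e-value is the 11th tab-delimited field (index 10).
    # It becomes column 13 only after the pipes are converted to tabs.
    return accession, fields[0], fields[10]


example = (
    "TRINITY_DN31278_c0_g1_i1\tsp|Q54J75|RPB2_DICDI\t64.0\t111\t40\t0"
    "\t1\t333\t950\t1060\t6.9e-40\t164.5"
)
record = parse_blast_line(example)
print("\t".join(record))
```

The tuple order (accession, isoform ID, e-value) mirrors the `awk` column order, so sorting a list of these tuples orders the hits by accession first.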


### Step 1b. *L. zosterae*

We assume that every contig not assigned to *Z. marina* belongs to *L. zosterae*.

In [28]:
!head -n2 nonZostera_SwissProt_e5_outputBOX.txt

TRINITY_DN31224_c0_g1_i1	sp|Q54T06|Y8206_DICDI	52.1	96	40	1	4	273	458	553	4.1e-25	115.2
TRINITY_DN31259_c0_g1_i1	sp|P15374|UCHL3_HUMAN	49.3	73	37	0	8	226	144	216	1.3e-13	76.6


In [29]:
#Convert pipe delimiters to tab delimiters using tr (tr means translate)
!tr '|' '\t' < nonZostera_SwissProt_e5_outputBOX.txt \
> nonZostera-blast-sep.tab

In [30]:
!head -n2 nonZostera-blast-sep.tab

TRINITY_DN31224_c0_g1_i1	sp	Q54T06	Y8206_DICDI	52.1	96	40	1	4	273	458	553	4.1e-25	115.2
TRINITY_DN31259_c0_g1_i1	sp	P15374	UCHL3_HUMAN	49.3	73	37	0	8	226	144	216	1.3e-13	76.6


In [31]:
#Keep the Uniprot accession ($3), isoform ID ($1), and e-value ($13) using awk. Sort by accession, and save as a new file.
!awk -v OFS='\t' '{print $3, $1, $13}' < nonZostera-blast-sep.tab | sort \
> nonZostera-blast-sort.tab

In [32]:
!head nonZostera-blast-sort.tab

A0A024RXP8	TRINITY_DN416168_c0_g1_i1	7.8e-07
A0A024SMV2	TRINITY_DN174741_c0_g1_i1	8.9e-10
A0A024SMV2	TRINITY_DN192522_c0_g1_i1	3.6e-11
A0A060X6Z0	TRINITY_DN123691_c0_g1_i1	3.5e-19
A0A067XMP1	TRINITY_DN166166_c0_g1_i1	1.1e-10
A0A067XMP1	TRINITY_DN166166_c0_g1_i2	2.0e-10
A0A067XMP1	TRINITY_DN245901_c0_g1_i3	4.4e-06
A0A067XMP1	TRINITY_DN336889_c0_g1_i1	3.6e-06
A0A067XMP1	TRINITY_DN341637_c0_g1_i1	1.8e-12
A0A067XMP1	TRINITY_DN393221_c0_g1_i1	4.0e-06


In [10]:
!wc -l nonZostera-blast-sort.tab
!echo "nonZostera transcripts"

 87550 nonZostera-blast-sort.tab
nonZostera transcripts


## Step 2. Format Uniprot-SwissProt database

The Uniprot annotation file was downloaded from [this link](https://www.uniprot.org/uniprot/?query=reviewed:yes) on 2019-07-10. The following information was included as separate columns:

- Entry (Uniprot Accession code)
- Protein Names
- Gene ontology (biological process)
- Gene ontology (cellular component)
- Gene ontology (molecular function)
- Gene ontology IDs
- Status (Reviewed or not reviewed)
- Organism

In [34]:
!head -n2 uniprot-reviewed_yes.tab

Entry	Protein names	Gene ontology (biological process)	Gene ontology (cellular component)	Gene ontology (molecular function)	Gene ontology IDs	Status	Organism
Q0ATK2	Acetyl-coenzyme A carboxylase carboxyl transferase subunit beta (ACCase subunit beta) (Acetyl-CoA carboxylase carboxyltransferase subunit beta) (EC 2.1.3.15)	fatty acid biosynthetic process [GO:0006633]; malonyl-CoA biosynthetic process [GO:2001295]	acetyl-CoA carboxylase complex [GO:0009317]	acetyl-CoA carboxylase activity [GO:0003989]; ATP binding [GO:0005524]; carboxyl- or carbamoyltransferase activity [GO:0016743]	GO:0003989; GO:0005524; GO:0006633; GO:0009317; GO:0016743; GO:2001295	reviewed	Maricaulis maris (strain MCS10)


In [36]:
#Sort file by the first column (-k 1,1), which is the Uniprot Entry (Uniprot Accession Code)
#Note: the header line is sorted along with the data rows
!sort -k 1,1 uniprot-reviewed_yes.tab > uniprot-SP-GO-sorted.tab

In [23]:
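#Count the fields on the first line for reference
#Note: without -F '\t', awk splits on any whitespace, so 25 is the number of whitespace-separated tokens on that line, not the number of tab-delimited columns (8)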
!awk '{print NF; exit}' uniprot-SP-GO-sorted.tab

25
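
The difference between whitespace-delimited and tab-delimited field counts can be checked with a short Python sketch. The string below is the first line of `uniprot-SP-GO-sorted.tab` as printed in the next cell; the variable names are only illustrative.

```python
# First line of uniprot-SP-GO-sorted.tab, copied from the notebook output.
first_line = (
    "A0A023GPI8\tLectin alpha chain (CboL) [Cleaved into: Lectin beta chain; "
    "Lectin gamma chain]\t\t\tmannose binding [GO:0005537]; metal ion binding "
    "[GO:0046872]\tGO:0005537; GO:0046872\treviewed\tCanavalia boliviana"
)

# Splitting on any whitespace mimics awk's default field separator.
whitespace_tokens = len(first_line.split())

# Splitting on tabs only gives the real column count, including empty columns.
tab_columns = len(first_line.split("\t"))

print(whitespace_tokens, tab_columns)
```

The tab-based count matches the eight columns listed at the start of this step, with the two empty gene ontology columns counted as fields.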


In [38]:
!head -n2 uniprot-SP-GO-sorted.tab

A0A023GPI8	Lectin alpha chain (CboL) [Cleaved into: Lectin beta chain; Lectin gamma chain]			mannose binding [GO:0005537]; metal ion binding [GO:0046872]	GO:0005537; GO:0046872	reviewed	Canavalia boliviana
A0A023GPJ0	Immunity protein CdiI					reviewed	Enterobacter cloacae subsp. cloacae (strain ATCC 13047 / DSM 30054 / NBRC 13535 / NCDC 279-56)


## Step 3. Join `blastx` output with Uniprot annotation file

### Step 3a. *Z. marina*

In [5]:
#Join the first column in the first file with the first column in the second file
#join expects both files to be sorted on the join column
#The files are tab delimited, and the output should also be tab delimited (-t $'\t')
!join -1 1 -2 1 -t $'\t' \
Zostera-blast-sort.tab \
uniprot-SP-GO-sorted.tab \
> Zostera-blast-annot.tab

In [7]:
!head -n2 Zostera-blast-annot.tab

A0A024B7I0	TRINITY_DN278019_c0_g1_i1	6.4e-133	SCARECROW-LIKE protein 7 (PeSCL7)		nucleus [GO:0005634]	DNA binding [GO:0003677]	GO:0003677; GO:0005634	reviewed	Populus euphratica (Euphrates poplar)
A0A067XMP1	TRINITY_DN166310_c0_g1_i1	2.0e-17	Oxidoreductase ptaL (EC 1.-.-.-) (Pestheic acid biosynthesis cluster protein L)			oxidoreductase activity [GO:0016491]	GO:0016491	reviewed	Pestalotiopsis fici (strain W106-1 / CGMCC3.15140)


In [8]:
!wc -l Zostera-blast-annot.tab
!echo "annotated Zostera transcripts"

 270014 Zostera-blast-annot.tab
annotated Zostera transcripts
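
The annotated file has 47 fewer lines than `Zostera-blast-sort.tab` (270,061 versus 270,014) because `join` only prints lines whose accession appears in both files. A minimal Python sketch of the same join that also keeps track of the unmatched hits is given below. `annotate_hits` is a hypothetical helper, and the accession `NOT_IN_TABLE` is made up purely to show the unmatched case.

```python
# Hypothetical helper for illustration; not part of the original workflow.
def annotate_hits(hits, annotations):
    """Join blast hits to annotations by accession.

    hits: list of (accession, isoform_id, evalue) tuples.
    annotations: dict mapping accession -> list of annotation columns.
    Returns (annotated_rows, unmatched_hits).
    """
    annotated, unmatched = [], []
    for accession, isoform_id, evalue in hits:
        if accession in annotations:
            annotated.append([accession, isoform_id, evalue, *annotations[accession]])
        else:
            unmatched.append((accession, isoform_id, evalue))
    return annotated, unmatched


annotations = {
    "A0A024B7I0": [
        "SCARECROW-LIKE protein 7 (PeSCL7)", "", "nucleus [GO:0005634]",
        "DNA binding [GO:0003677]", "GO:0003677; GO:0005634", "reviewed",
        "Populus euphratica (Euphrates poplar)",
    ],
}
hits = [
    ("A0A024B7I0", "TRINITY_DN278019_c0_g1_i1", "6.4e-133"),
    ("NOT_IN_TABLE", "TRINITY_DN0_c0_g1_i1", "1.0e-05"),  # made-up example hit
]
annotated, unmatched = annotate_hits(hits, annotations)
print(len(annotated), "annotated;", len(unmatched), "unmatched")
```

Inspecting the unmatched list is one way to see which accessions are missing from the downloaded annotation table.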


### Step 3b. *L. zosterae*

In [11]:
#Join the first column in the first file with the first column in the second file
#join expects both files to be sorted on the join column
#The files are tab delimited, and the output should also be tab delimited (-t $'\t')
!join -1 1 -2 1 -t $'\t' \
nonZostera-blast-sort.tab \
uniprot-SP-GO-sorted.tab \
> nonZostera-blast-annot.tab

In [12]:
!head -n2 nonZostera-blast-annot.tab

A0A024RXP8	TRINITY_DN416168_c0_g1_i1	7.8e-07	Exoglucanase 1 (EC 3.2.1.91) (1,4-beta-cellobiohydrolase) (Cellobiohydrolase 7A) (Cel7A) (Exocellobiohydrolase I) (CBHI) (Exoglucanase I)	cellulose catabolic process [GO:0030245]	extracellular region [GO:0005576]	cellulose 1,4-beta-cellobiosidase activity [GO:0016162]; cellulose binding [GO:0030248]	GO:0005576; GO:0016162; GO:0030245; GO:0030248	reviewed	Hypocrea jecorina (strain ATCC 56765 / BCRC 32924 / NRRL 11460 / Rut C-30) (Trichoderma reesei)
A0A024SMV2	TRINITY_DN174741_c0_g1_i1	8.9e-10	D-xylose 1-dehydrogenase (NADP(+)) (XDH) (EC 1.1.1.179) (D-xylose-NADP dehydrogenase) (NADP(+)-dependent D-xylose dehydrogenase)			D-xylose 1-dehydrogenase (NADP+) activity [GO:0047837]	GO:0047837	reviewed	Hypocrea jecorina (strain ATCC 56765 / BCRC 32924 / NRRL 11460 / Rut C-30) (Trichoderma reesei)


In [13]:
!wc -l nonZostera-blast-annot.tab
!echo "annotated nonZostera transcripts"

 87532 nonZostera-blast-annot.tab
annotated nonZostera transcripts


## Step 4. Isolate gene IDs

`blastx` was performed using isoform data. Currently, each line in the annotated file is identified by an isoform ID (e.g., TRINITY_DN416168_c0_g1_i1). The gene IDs are the same as the isoform IDs without the isoform suffix: they carry contig and gene information, but no isoform information (e.g., TRINITY_DN416168_c0_g1). Differential expression analyses will be conducted in `edgeR` at the gene level, so gene IDs are needed in the annotation files.
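
The cells in this step remove a fixed three characters from the end of each isoform ID, which works when the isoform number has a single digit. As a hedged alternative, the suffix can be removed with a pattern instead of a fixed width. The sketch below is in Python; `gene_id` is a hypothetical helper, and the `_i12` ID is an invented example used only to show the multi-digit case.

```python
import re

# Matches the isoform suffix at the end of a Trinity ID, e.g. "_i1" or "_i12".
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def gene_id(isoform_id):
    """Return the gene-level ID by removing the trailing isoform suffix."""
    return ISOFORM_SUFFIX.sub("", isoform_id)


single_digit = "TRINITY_DN416168_c0_g1_i1"
multi_digit = "TRINITY_DN416168_c0_g1_i12"  # invented ID for illustration

print(gene_id(single_digit))
print(gene_id(multi_digit))

# Fixed-width trimming, for comparison: a trailing underscore is left behind
# when the isoform number has two digits.
print(multi_digit[:-3])
```

Both calls to `gene_id` return the same gene ID, whereas the fixed-width trim of the two-digit case ends in an underscore and would not match the gene IDs in the count matrix.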

### Step 4a. *Z. marina*

In [17]:
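#Isolate the isoform ID column with cut
#Flip order of characters with rev
#Delete the last three characters of the original ID (e.g., _i1) by keeping character 4 onward of the reversed ID
#Note: this assumes a single-digit isoform number; an ID ending in _i10 would keep a trailing underscore
#Flip order of characters with rev
#Add information as a new column to annotated table with paste (done in a later cell)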

!cut -f2 Zostera-blast-annot.tab \
| rev \
| cut -c 4- \
| rev \
> Zostera-blast-annot-geneIDOnly.tab

In [18]:
!head Zostera-blast-annot-geneIDOnly.tab

TRINITY_DN278019_c0_g1
TRINITY_DN166310_c0_g1
TRINITY_DN17396_c0_g1
TRINITY_DN309320_c0_g7
TRINITY_DN241620_c0_g1
TRINITY_DN285385_c0_g3
TRINITY_DN308379_c0_g1
TRINITY_DN308379_c0_g1
TRINITY_DN308379_c0_g1
TRINITY_DN308379_c0_g1


In [19]:
#Line count matches line count of original file
!wc -l Zostera-blast-annot-geneIDOnly.tab

 270014 Zostera-blast-annot-geneIDOnly.tab


In [21]:
!paste Zostera-blast-annot-geneIDOnly.tab Zostera-blast-annot.tab \
> Zostera-blast-annot-withGeneID.tab

In [22]:
!head -n2 Zostera-blast-annot-withGeneID.tab

TRINITY_DN278019_c0_g1	A0A024B7I0	TRINITY_DN278019_c0_g1_i1	6.4e-133	SCARECROW-LIKE protein 7 (PeSCL7)		nucleus [GO:0005634]	DNA binding [GO:0003677]	GO:0003677; GO:0005634	reviewed	Populus euphratica (Euphrates poplar)
TRINITY_DN166310_c0_g1	A0A067XMP1	TRINITY_DN166310_c0_g1_i1	2.0e-17	Oxidoreductase ptaL (EC 1.-.-.-) (Pestheic acid biosynthesis cluster protein L)			oxidoreductase activity [GO:0016491]	GO:0016491	reviewed	Pestalotiopsis fici (strain W106-1 / CGMCC3.15140)


### Step 4b. *L. zosterae*

In [24]:
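#Isolate the isoform ID column with cut
#Flip order of characters with rev
#Delete the last three characters of the original ID (e.g., _i1) by keeping character 4 onward of the reversed ID
#Note: this assumes a single-digit isoform number; an ID ending in _i10 would keep a trailing underscore
#Flip order of characters with rev
#Add information as a new column to annotated table with paste (done in a later cell)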

!cut -f2 nonZostera-blast-annot.tab \
| rev \
| cut -c 4- \
| rev \
> nonZostera-blast-annot-geneIDOnly.tab

In [25]:
!head nonZostera-blast-annot-geneIDOnly.tab

TRINITY_DN416168_c0_g1
TRINITY_DN174741_c0_g1
TRINITY_DN192522_c0_g1
TRINITY_DN123691_c0_g1
TRINITY_DN166166_c0_g1
TRINITY_DN166166_c0_g1
TRINITY_DN245901_c0_g1
TRINITY_DN336889_c0_g1
TRINITY_DN341637_c0_g1
TRINITY_DN393221_c0_g1


In [26]:
#Line count matches line count of original file
!wc -l nonZostera-blast-annot-geneIDOnly.tab

 87532 nonZostera-blast-annot-geneIDOnly.tab


In [27]:
!paste nonZostera-blast-annot-geneIDOnly.tab nonZostera-blast-annot.tab \
> nonZostera-blast-annot-withGeneID.tab

In [28]:
!head -n2 nonZostera-blast-annot-withGeneID.tab

TRINITY_DN416168_c0_g1	A0A024RXP8	TRINITY_DN416168_c0_g1_i1	7.8e-07	Exoglucanase 1 (EC 3.2.1.91) (1,4-beta-cellobiohydrolase) (Cellobiohydrolase 7A) (Cel7A) (Exocellobiohydrolase I) (CBHI) (Exoglucanase I)	cellulose catabolic process [GO:0030245]	extracellular region [GO:0005576]	cellulose 1,4-beta-cellobiosidase activity [GO:0016162]; cellulose binding [GO:0030248]	GO:0005576; GO:0016162; GO:0030245; GO:0030248	reviewed	Hypocrea jecorina (strain ATCC 56765 / BCRC 32924 / NRRL 11460 / Rut C-30) (Trichoderma reesei)
TRINITY_DN174741_c0_g1	A0A024SMV2	TRINITY_DN174741_c0_g1_i1	8.9e-10	D-xylose 1-dehydrogenase (NADP(+)) (XDH) (EC 1.1.1.179) (D-xylose-NADP dehydrogenase) (NADP(+)-dependent D-xylose dehydrogenase)			D-xylose 1-dehydrogenase (NADP+) activity [GO:0047837]	GO:0047837	reviewed	Hypocrea jecorina (strain ATCC 56765 / BCRC 32924 / NRRL 11460 / Rut C-30) (Trichoderma reesei)


## Step 5. Retain one line per gene ID

Since differential expression analysis will be conducted at the gene level, each gene should have only one associated Uniprot annotation. This is complicated by the fact that some genes have multiple isoforms, and some isoforms have multiple hits. We assume that the isoforms of a gene map to the same protein (otherwise they would be different genes rather than isoforms), so we trim the list to retain one line per gene ID, typically the first line that `sort --unique` encounters for that ID. Because the input is ordered by Uniprot accession, the retained line is not necessarily the hit with the lowest e-value.
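
If the hit with the lowest e-value is preferred for each gene, the selection can be made explicit. The Python sketch below uses a different selection rule from the `sort --unique` cells in this step, and `best_hit_per_gene` is a hypothetical helper. The example rows are the four *Z. marina* hits for TRINITY_DN308379_c0_g1 shown in Step 1a.

```python
# Hypothetical helper for illustration; not the method used in this notebook.
def best_hit_per_gene(rows):
    """Keep the row with the lowest e-value for each gene ID.

    rows: iterable of (gene_id, accession, isoform_id, evalue_string).
    Ties keep the first row encountered.
    """
    best = {}
    for row in rows:
        gene, evalue = row[0], float(row[3])
        if gene not in best or evalue < float(best[gene][3]):
            best[gene] = row
    return best


rows = [
    ("TRINITY_DN308379_c0_g1", "A0A068FIK2", "TRINITY_DN308379_c0_g1_i1", "3.7e-207"),
    ("TRINITY_DN308379_c0_g1", "A0A068FIK2", "TRINITY_DN308379_c0_g1_i4", "0.0e+00"),
    ("TRINITY_DN308379_c0_g1", "A0A068FIK2", "TRINITY_DN308379_c0_g1_i5", "4.5e-226"),
    ("TRINITY_DN308379_c0_g1", "A0A068FIK2", "TRINITY_DN308379_c0_g1_i5", "7.7e-133"),
]
best = best_hit_per_gene(rows)
print(best["TRINITY_DN308379_c0_g1"])
```

In this example all four hits share one accession, so either rule yields the same annotation; the choice matters only for genes whose isoforms hit different proteins.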

### Step 5a. *Z. marina*

In [29]:
#Sort file by the first column (--key=1,1) and only retain unique IDs (--unique). Save output to a new file.
!sort --unique --key=1,1 Zostera-blast-annot-withGeneID.tab \
> Zostera-blast-annot-withGeneID-noIsoforms.tab

In [31]:
!head -n2 Zostera-blast-annot-withGeneID-noIsoforms.tab

TRINITY_DN100001_c0_g1	Q54EW8	TRINITY_DN100001_c0_g1_i1	1.2e-19	Dihydrolipoyl dehydrogenase, mitochondrial (EC 1.8.1.4) (Dihydrolipoamide dehydrogenase) (Glycine cleavage system L protein)	cell redox homeostasis [GO:0045454]; glycine catabolic process [GO:0006546]; isoleucine catabolic process [GO:0006550]; leucine catabolic process [GO:0006552]; L-serine biosynthetic process [GO:0006564]; valine catabolic process [GO:0006574]	extracellular matrix [GO:0031012]; mitochondrial matrix [GO:0005759]; mitochondrial pyruvate dehydrogenase complex [GO:0005967]; phagocytic vesicle [GO:0045335]	dihydrolipoyl dehydrogenase activity [GO:0004148]; electron transfer activity [GO:0009055]; flavin adenine dinucleotide binding [GO:0050660]	GO:0004148; GO:0005759; GO:0005967; GO:0006546; GO:0006550; GO:0006552; GO:0006564; GO:0006574; GO:0009055; GO:0031012; GO:0045335; GO:0045454; GO:0050660	reviewed	Dictyostelium discoideum (Slime mold)
TRINITY_DN100015_c0_g1	P16894	TRINITY_DN100015_c0_g1_i1	1.2e-21	

In [33]:
!wc -l Zostera-blast-annot-withGeneID-noIsoforms.tab
!echo "unique Zostera gene IDs"

 138394 Zostera-blast-annot-withGeneID-noIsoforms.tab
unique Zostera gene IDs


### Step 5b. *L. zosterae*

In [34]:
#Sort file by the first column (--key=1,1) and only retain unique IDs (--unique). Save output to a new file.
!sort --unique --key=1,1 nonZostera-blast-annot-withGeneID.tab \
> nonZostera-blast-annot-withGeneID-noIsoforms.tab

In [35]:
!head -n2 nonZostera-blast-annot-withGeneID-noIsoforms.tab

TRINITY_DN100016_c0_g1	Q8GYA6	TRINITY_DN100016_c0_g1_i1	1.2e-08	26S proteasome non-ATPase regulatory subunit 13 homolog B (26S proteasome regulatory subunit RPN9b) (AtRNP9b) (26S proteasome regulatory subunit S11 homolog B)	proteasome assembly [GO:0043248]; protein catabolic process [GO:0030163]; ubiquitin-dependent protein catabolic process [GO:0006511]	cytosol [GO:0005829]; nucleus [GO:0005634]; proteasome complex [GO:0000502]; proteasome regulatory particle, lid subcomplex [GO:0008541]	structural molecule activity [GO:0005198]	GO:0000502; GO:0005198; GO:0005634; GO:0005829; GO:0006511; GO:0008541; GO:0030163; GO:0043248	reviewed	Arabidopsis thaliana (Mouse-ear cress)
TRINITY_DN100076_c0_g1	Q59118	TRINITY_DN100076_c0_g1_i1	2.3e-09	Histamine oxidase (EC 1.4.3.22) (Copper amine oxidase)	amine metabolic process [GO:0009308]; cellular response to azide [GO:0097185]; oxidation-reduction process [GO:0055114]	cytoplasm [GO:0005737]	copper ion binding [GO:0005507]; diamine oxidase activity 

In [36]:
!wc -l nonZostera-blast-annot-withGeneID-noIsoforms.tab
!echo "unique nonZostera gene IDs"

 71191 nonZostera-blast-annot-withGeneID-noIsoforms.tab
unique nonZostera gene IDs


## Step 6. Join with `trinity` gene matrix

The last step is to join the annotated genes for each species with the gene count matrix output from `trinity`. This will allow us to conduct separate `edgeR` analyses for host (*Z. marina*) and pathogen (*L. zosterae*).
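
Because `join` relies on both inputs being sorted on the join column under the same collation rules, it is worth checking the result against a join that does not depend on sort order. The Python sketch below is one such check, under the assumption that both tables fit in memory. `join_with_counts` is a hypothetical helper, the annotation row is shortened to its first four columns, and the count rows are copied from the matrix preview in this step.

```python
# Hypothetical helper for illustration; not part of the original workflow.
def join_with_counts(annotation_rows, count_rows):
    """Append per-sample counts to each annotation row with a matching gene ID.

    Both arguments are lists of lists whose first element is the gene ID.
    Row order in either input does not matter.
    """
    counts = {row[0]: row[1:] for row in count_rows}
    return [ann + counts[ann[0]] for ann in annotation_rows if ann[0] in counts]


annotation_rows = [
    ["TRINITY_DN100001_c0_g1", "Q54EW8", "TRINITY_DN100001_c0_g1_i1", "1.2e-19"],
]
count_rows = [
    "TRINITY_DN100_c0_g1\t0\t1\t0\t0\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0".split("\t"),
    "TRINITY_DN10000_c0_g1\t0\t2\t0\t0\t6\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0".split("\t"),
    "TRINITY_DN100001_c0_g1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t1\t0\t0".split("\t"),
]
joined = join_with_counts(annotation_rows, count_rows)
print(len(joined[0]), "columns in the joined row")

# Byte-order sorting places DN10000 before DN100_, the reverse of the order
# shown in the count matrix preview.
byte_order = sorted(["TRINITY_DN100_c0_g1", "TRINITY_DN10000_c0_g1"])
print(byte_order)
```

The final lines show that the count matrix preview is not in byte order, so if the annotation files were sorted under different collation rules, `join` could skip genes that are present in both files. Comparing the number of rows from a sketch like this one with the `wc -l` output of the joined files would indicate whether that happened.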

In [39]:
#Although the first column header is "GRASS_GENE," the list includes both Zostera and nonZostera genes.
!head genes.counts.matrix.txt

GRASS_GENE	S_10B	S_9A	S_13A	S_42A	S_46B	S_47B	S_48B	S_2A	S_2B	S_7B	S_8B	S_33A	S_36A	S_38A	S_40A
TRINITY_DN0_c0_g1	0	1	0	0	0	0	0	1	0	0	0	0	0	0	0
TRINITY_DN100_c0_g1	0	1	0	0	1	0	0	0	0	0	0	0	0	0	0
TRINITY_DN10000_c0_g1	0	2	0	0	6	1	0	0	0	0	0	0	0	0	0
TRINITY_DN100000_c0_g1	0	2	0	1	0	0	0	0	0	0	0	0	0	0	0
TRINITY_DN100001_c0_g1	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0
TRINITY_DN100003_c0_g1	0	0	0	0	1	0	0	0	0	0	0	0	5	0	0
TRINITY_DN100006_c0_g1	0	1	0	0	0	0	0	1	0	3	0	0	0	0	0
TRINITY_DN100007_c0_g1	0	0	0	0	0	0	0	3	0	0	0	1	0	0	0
TRINITY_DN10001_c0_g1	0	5	2	1	9	0	1	1	0	1	0	0	1	0	0


### Step 6a. *Z. marina*

In [60]:
#Join the first column in the first file with the first column in the second file. 
#The files are tab delimited, and the output should also be tab delimited (-t $'\t')
!join -1 1 -2 1 -t $'\t' \
Zostera-blast-annot-withGeneID-noIsoforms.tab \
genes.counts.matrix.txt \
> Zostera-blast-annot-withGeneID-noIsoforms-geneCounts.tab

The columns are as follows:

- Gene ID
- Uniprot Accession code
- Isoform ID 
- e-value
- Protein Names
- Gene ontology (biological process)
- Gene ontology (cellular component)
- Gene ontology (molecular function)
- Gene ontology IDs
- Status (Reviewed or not reviewed)
- Organism
- S_10B counts
- S_9A counts
- S_13A counts
- S_42A counts
- S_46B counts
- S_47B counts
- S_48B counts
- S_2A counts
- S_2B counts
- S_7B counts
- S_8B counts 
- S_33A counts
- S_36A counts
- S_38A counts
- S_40A counts

In [63]:
!head -n2 Zostera-blast-annot-withGeneID-noIsoforms-geneCounts.tab

TRINITY_DN102005_c0_g1	Q1DZQ0	TRINITY_DN102005_c0_g1_i1	1.3e-37	Protein transport protein SEC13	COPII-coated vesicle budding [GO:0090114]; mRNA transport [GO:0051028]; positive regulation of TORC1 signaling [GO:1904263]; protein transport [GO:0015031]	COPII vesicle coat [GO:0030127]; endoplasmic reticulum membrane [GO:0005789]; Golgi membrane [GO:0000139]; nuclear pore [GO:0005643]	structural molecule activity [GO:0005198]	GO:0000139; GO:0005198; GO:0005643; GO:0005789; GO:0015031; GO:0030127; GO:0051028; GO:0090114; GO:1904263	reviewed	Coccidioides immitis (strain RS) (Valley fever fungus)	0	0	1	1	0	0	0	0	0	4	0	1	3	0	0
TRINITY_DN102006_c0_g1	Q54NS9	TRINITY_DN102006_c0_g1_i1	3.0e-21	Apoptosis-inducing factor homolog A (EC 1.-.-.-)		cytoplasm [GO:0005737]; lipid droplet [GO:0005811]	electron-transferring-flavoprotein dehydrogenase activity [GO:0004174]; flavin adenine dinucleotide binding [GO:0050660]	GO:0004174; GO:0005737; GO:0005811; GO:0050660	reviewed	Dictyostelium discoideum (Sl

In [62]:
!wc -l Zostera-blast-annot-withGeneID-noIsoforms-geneCounts.tab

 5159 Zostera-blast-annot-withGeneID-noIsoforms-geneCounts.tab


### Step 6b. *L. zosterae*

In [64]:
#Join the first column in the first file with the first column in the second file. 
#The files are tab delimited, and the output should also be tab delimited (-t $'\t')
!join -1 1 -2 1 -t $'\t' \
nonZostera-blast-annot-withGeneID-noIsoforms.tab \
genes.counts.matrix.txt \
> nonZostera-blast-annot-withGeneID-noIsoforms-geneCounts.tab

The columns are as follows:

- Gene ID
- Uniprot Accession code
- Isoform ID 
- e-value
- Protein Names
- Gene ontology (biological process)
- Gene ontology (cellular component)
- Gene ontology (molecular function)
- Gene ontology IDs
- Status (Reviewed or not reviewed)
- Organism
- S_10B counts
- S_9A counts
- S_13A counts
- S_42A counts
- S_46B counts
- S_47B counts
- S_48B counts
- S_2A counts
- S_2B counts
- S_7B counts
- S_8B counts 
- S_33A counts
- S_36A counts
- S_38A counts
- S_40A counts

In [65]:
!head -n2 nonZostera-blast-annot-withGeneID-noIsoforms-geneCounts.tab

TRINITY_DN102004_c0_g1	Q54BM8	TRINITY_DN102004_c0_g1_i1	6.5e-19	UPF0652 protein			zinc ion binding [GO:0008270]	GO:0008270	reviewed	Dictyostelium discoideum (Slime mold)	0	1	0	0	0	0	0	0	0	4	0	0	6	0	0
TRINITY_DN102011_c0_g1	Q84M24	TRINITY_DN102011_c0_g1_i1	2.5e-14	ABC transporter A family member 1 (ABC transporter ABCA.1) (AtABCA1) (ABC one homolog protein 1) (AtAOH1)	lipid transport [GO:0006869]	integral component of membrane [GO:0016021]; intracellular membrane-bounded organelle [GO:0043231]; vacuolar membrane [GO:0005774]	ATPase activity [GO:0016887]; ATPase-coupled transmembrane transporter activity [GO:0042626]; ATP binding [GO:0005524]; lipid transporter activity [GO:0005319]	GO:0005319; GO:0005524; GO:0005774; GO:0006869; GO:0016021; GO:0016887; GO:0042626; GO:0043231	reviewed	Arabidopsis thaliana (Mouse-ear cress)	0	11	0	0	0	0	0	0	0	0	0	0	0	0	0


In [66]:
!wc -l nonZostera-blast-annot-withGeneID-noIsoforms-geneCounts.tab

 2402 nonZostera-blast-annot-withGeneID-noIsoforms-geneCounts.tab
