Gene annotation

Author

Steven Roberts

Published

August 30, 2024

Summary

Here I will try to annotate the Manila clam genes. I know NCBI has annotation, and I will also BLAST against Swiss-Prot (SP) to get GO information.

What does NCBI have?

https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_026571515.1/

The download command below comes from clicking the ‘datasets’ button on that page.

cd ../data

/home/shared/datasets \
download genome accession GCF_026571515.1 --include gff3,rna,cds,protein,genome,seq-report
cd ../data
unzip ncbi_dataset.zip
cd ../data/ncbi_dataset/data/GCF_026571515.1/
ls *
cds_from_genomic.fna
GCF_026571515.1_ASM2657151v2_genomic.fna
genomic.gff
protein.faa
rna.fna
sequence_report.jsonl
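
As a quick sanity check on the download, the number of records in each FASTA file can be counted and compared across files. This is a minimal sketch; the helper name `count_fasta_records` is hypothetical and not part of the original notebook.

```shell
# Hypothetical helper (not from the original notebook):
# print "<file name><TAB><number of FASTA records>" for each file given.
count_fasta_records() {
  for f in "$@"; do
    # A record is any line that starts with ">".
    printf '%s\t%s\n' "$(basename "$f")" "$(grep -c '^>' "$f")"
  done
}

# Example call, run only if the download directory from above exists.
dir="../data/ncbi_dataset/data/GCF_026571515.1"
if [ -d "$dir" ]; then
  count_fasta_records "$dir"/*.fna "$dir"/*.faa
fi
```

Anchoring the pattern with `^>` counts header lines only, so a `>` character elsewhere in a description cannot affect the result.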
head ../data/ncbi_dataset/data/GCF_026571515.1/*
==> ../data/ncbi_dataset/data/GCF_026571515.1/cds_from_genomic.fna <==
>lcl|NW_026851514.1_cds_XP_060579799.1_1 [gene=LOC132712676] [db_xref=GeneID:132712676] [protein=uncharacterized protein LOC132712676] [protein_id=XP_060579799.1] [location=complement(join(8570..8679,9698..9800,9934..10108,10375..10520,10828..10970,11352..11870,12239..12377,12599..12643))] [gbkey=CDS]
ATGGATAATCCATGTTCGTATGAATCAGTTAAAAGAGTTATCAAAGAAATACGAAAAACTAGCGGAGTGACAGAAGATAA
CTTAAAAACATGGTGTATTATGGGATGCGACGGCCTTCCCTATACGCTAGGATCGAGACTAATTGAAAAAAATAAAGATA
TGCAAAATATCTTACTTATTCCCGGACACGGGCACATAGAGATGAATGTGGTAAAAGCTGCTTTTAAGTTACTGTGGGAG
CCCATTCTACAGGACTTAAGTAAGGAATGTGGTTTCAAGTCACCAAGGGCACAAGTTGCGGCACAATCGTGTACTGATCA
CCACAAATCATACATGCTTTTGGAAATAATGTTTGAAAGTGCTTTGCAGGAGATCATGACAACTTTTATCATAAATAGTG
TGCAGAATTCAATAACACCTAACATAGCCACTTTTTTTGATTATATAAAATCTTCAAAAGACAAAAATTATAGATTTATG
TGTGACGCAATAATAAACTTTATTTTTCCAATATTTCTATATAGAGCCGGTGTTAGGAGAAACAATTTCGGATATATATC
TGCAGCTAAAGCTAAATTTTTTAAACTATTTTTTTCTGGCGGAATGAAAAATTATCAGCAACTTATTATGAAAGATATTA
AAACATACATCCTAGCACCTCCAGAAGTGAAGCATTTCTTACAGAAAACCCAATCTTTTACTGTCAGTGGACACCCTTCA

==> ../data/ncbi_dataset/data/GCF_026571515.1/GCF_026571515.1_ASM2657151v2_genomic.fna <==
>NW_026851514.1 Ruditapes philippinarum isolate M1 unplaced genomic scaffold, ASM2657151v2 ctg11145_1_1_1_1, whole genome shotgun sequence
AAAAAaacgttcttttttttttttttcaaataaacattaatcacttcagcgcgcgcgcaattatcgttttcaaattaaag
tttcgtaagatatgttaatgacttcccgatgcgatcgactgtaagtaatttgtatatgtatgcggttttttggggttttt
tttttgtttgttttttaattaaagcttcttattctttaatattttttatgtatctcataatctgtattgaaaatttgaat
tttattacacatggactattcaaattatgatgcgttgccctagaattgaataaactttcttaattacacaatacggggtc
atactgatctttgcgtctttctctgaaataattatttccttgttaaaacatatttggaccacatatttctgtcaattgtt
cactttacttccgatttttgttgctacggaaacacatgtttctatttgtgttttctaatcaattcgaatctataatcagt
ttaatgggatcaccaccgggtaaaacgactactaggtactttggtcatgtttttaccccaaatatttgaatgtctctatt
gggattaaaatgagcatagttaactaaacaactggttgtataaatccctgacctgtctaattctaattatgagattagtg
aatatataagactttcactacaaaatattcctaaacaaaagagcacatgtttaaatttattcctatacaaataaaagatc

==> ../data/ncbi_dataset/data/GCF_026571515.1/genomic.gff <==
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ASM2657151v2
#!genome-build-accession NCBI_Assembly:GCF_026571515.1
#!annotation-date 10/30/2023
#!annotation-source NCBI RefSeq GCF_026571515.1-RS_2023_10
##sequence-region NW_026851514.1 1 38808
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=129788
NW_026851514.1  RefSeq  region  1   38808   .   +   .   ID=NW_026851514.1:1..38808;Dbxref=taxon:129788;Name=Unknown;chromosome=Unknown;collection-date=2014-10;country=USA: Puget Sound region%2C Pacific Northwest;dev-stage=adult clam;gbkey=Src;genome=genomic;isolate=M1;isolation-source=lagoon;mol_type=genomic DNA;sex=male;tissue-type=mantle

==> ../data/ncbi_dataset/data/GCF_026571515.1/protein.faa <==
>XP_060551064.1 uncharacterized protein LOC132760673 [Ruditapes philippinarum]
MPKRKPSASTSGKGKATTSSTSDDHRLASIIANALVQNKSALKEVAALLPPMTIEPGIRPDGTEAELTEPDHAVQKQPRL
GNRGFGTGSSPKQGPCTNSTRNVTMDIAHVKKSLLHSSLAPGTHKAYDRFWERFLCFVSTSVVHFSPLPATPDCISDFVA
HLHILAFAPSSISSHLSAISHFHNISGFTDPCENFITRKMLVGCRKINLRSDTRKPLLNNHIQLLCQAVKEMFAHVPYLK
YLYMALILTAFNGFFRLGELLPATISSADKVVQITDLSSSSKSVRLKLLNHKTNKSDKPTLILMKSLQTNCPVKALNNYL
SMRGQSAGPLFLLANNSPLTLPSFREVFKLLLRLANLSPVHYKLHSFRIGACTQAILSGTPENEVMRMGRWKSNAFKRYI
RMPVVNATH
>XP_060551065.1 transmembrane cell adhesion receptor mua-3-like isoform X1 [Ruditapes philippinarum]
MGIPGIMRHAYIAILLMNLVLRGGSVTDICQTATADIVFIVDSSGSVGSSGYDDEIDFIKAIVNELVIDPNEVRMGLIDY
STSVHTSLGFNLNDPNFDTNAEVIAKLNSLPYSGGSTRTDLAIQAAKNMFSGPGNRPDVPDVLFTLTDGETNDGGQDLLD

==> ../data/ncbi_dataset/data/GCF_026571515.1/rna.fna <==
>XM_060695081.1 PREDICTED: Ruditapes philippinarum uncharacterized LOC132760673 (LOC132760673), mRNA
TAGCTTTATTTAGTGAAACATATAAAGATGTTTATATTTGTAAACTTATTATCATTGCTTACTCTAAGAGATTGGAGGAG
GAAGGGCCAGGATTTTCGGTTATATGAACTTTCCGGAAATTGTTAACCATTTCGGTTTTCTAGATGAGCCTCGCTTTGAA
AAGCGTCACTTTCGGACCACAGACTTAGTCATTTCAGAGAAGTTTGCTGTCACTTTGAATTGTTGAACAAAAATCAGTGC
ATAGATTGACATATTTTGAATTGTATGTGTTGAGTGATATTTTCTACATTGCCGGACGAGCTTCCACTATTTACAATGCC
AAAAAGAAAACCCTCCGCGTCTACCAGTGGTAAAGGCAAGGCAACAACATCATCCACTAGCGACGACCACAGGTTGGCGA
GCATCATCGCCAATGCTTTAGTGCAGAATAAGTCTGCACTGAAGGAAGTAGCCGCCTTATTACCACCAATGACGATAGAA
CCAGGCATCCGCCCCGACGGTACAGAGGCGGAATTAACAGAACCGGATCATGCCGTCCAAAAGCAGCCGCGGCTGGGTAA
CAGAGGCTTTGGAACTGGCTCCTCACCTAAACAGGGACCCTGTACCAATTCCACGAGAAATGTCACCATGGATATTGCTC
ACGTGAAGAAGTCCTTGTTGCATTCTTCTCTGGCTCCGGGAACCCATAAAGCTTACGATCGGTTTTGGGAACGGTTTCTA

==> ../data/ncbi_dataset/data/GCF_026571515.1/sequence_report.jsonl <==
{"assemblyAccession":"GCF_026571515.1","assemblyUnit":"Primary Assembly","assignedMoleculeLocationType":"Chromosome","chrName":"Un","genbankAccession":"JAKTTH010000001.1","length":38808,"refseqAccession":"NW_026851514.1","role":"unplaced-scaffold","sequenceName":"ctg11145_1_1_1_1"}
{"assemblyAccession":"GCF_026571515.1","assemblyUnit":"Primary Assembly","assignedMoleculeLocationType":"Chromosome","chrName":"Un","genbankAccession":"JAKTTH010000002.1","length":239892,"refseqAccession":"NW_026851515.1","role":"unplaced-scaffold","sequenceName":"ctg3581_1_1_1_1"}
{"assemblyAccession":"GCF_026571515.1","assemblyUnit":"Primary Assembly","assignedMoleculeLocationType":"Chromosome","chrName":"Un","genbankAccession":"JAKTTH010000003.1","length":379583,"refseqAccession":"NW_026851516.1","role":"unplaced-scaffold","sequenceName":"ctg1911_1_1_1_1"}
{"assemblyAccession":"GCF_026571515.1","assemblyUnit":"Primary Assembly","assignedMoleculeLocationType":"Chromosome","chrName":"Un","genbankAccession":"JAKTTH010000004.1","length":285079,"refseqAccession":"NW_026851517.1","role":"unplaced-scaffold","sequenceName":"ctg1746_1_1_1_1"}
{"assemblyAccession":"GCF_026571515.1","assemblyUnit":"Primary Assembly","assignedMoleculeLocationType":"Chromosome","chrName":"Un","genbankAccession":"JAKTTH010000005.1","length":344284,"refseqAccession":"NW_026851518.1","role":"unplaced-scaffold","sequenceName":"ctg1952_1_1_1_1"}
{"assemblyAccession":"GCF_026571515.1","assemblyUnit":"Primary Assembly","assignedMoleculeLocationType":"Chromosome","chrName":"Un","genbankAccession":"JAKTTH010000006.1","length":15851,"refseqAccession":"NW_026851519.1","role":"unplaced-scaffold","sequenceName":"ctg18844_1_1_1_1"}
{"assemblyAccession":"GCF_026571515.1","assemblyUnit":"Primary Assembly","assignedMoleculeLocationType":"Chromosome","chrName":"Un","genbankAccession":"JAKTTH010000007.1","length":57002,"refseqAccession":"NW_026851520.1","role":"unplaced-scaffold","sequenceName":"ctg7485_1_1_1_1"}
{"assemblyAccession":"GCF_026571515.1","assemblyUnit":"Primary Assembly","assignedMoleculeLocationType":"Chromosome","chrName":"Un","genbankAccession":"JAKTTH010000008.1","length":134014,"refseqAccession":"NW_026851521.1","role":"unplaced-scaffold","sequenceName":"ctg10092_1_1_1_1"}
{"assemblyAccession":"GCF_026571515.1","assemblyUnit":"Primary Assembly","assignedMoleculeLocationType":"Chromosome","chrName":"Un","genbankAccession":"JAKTTH010000009.1","length":35506,"refseqAccession":"NW_026851522.1","role":"unplaced-scaffold","sequenceName":"ctg9815_1_1_1_1"}
{"assemblyAccession":"GCF_026571515.1","assemblyUnit":"Primary Assembly","assignedMoleculeLocationType":"Chromosome","chrName":"Un","genbankAccession":"JAKTTH010000010.1","length":49325,"refseqAccession":"NW_026851523.1","role":"unplaced-scaffold","sequenceName":"ctg6631_1_1_1_1"}
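
The `cds_from_genomic.fna` headers shown above carry `gene`, `protein`, and `protein_id` tags in square brackets, so a small lookup table can be pulled straight out of them. The sketch below assumes every header follows the bracketed `[key=value]` layout seen in the first record; `cds_header_table` is a hypothetical name, and a product name that itself contains `]` would be cut short.

```shell
# Hypothetical helper (not from the original notebook):
# turn CDS FASTA headers into a table: seq_id, gene, protein_id, protein.
cds_header_table() {
  awk '
    /^>/ {
      id = substr($1, 2)            # sequence ID without the leading ">"
      gene = pid = prot = "NA"      # placeholders when a tag is missing
      if (match($0, /\[gene=[^]]*\]/))       gene = substr($0, RSTART + 6,  RLENGTH - 7)
      if (match($0, /\[protein_id=[^]]*\]/)) pid  = substr($0, RSTART + 12, RLENGTH - 13)
      if (match($0, /\[protein=[^]]*\]/))    prot = substr($0, RSTART + 9,  RLENGTH - 10)
      print id "\t" gene "\t" pid "\t" prot
    }
  ' "$1"
}

# Example call, run only if the CDS file from above exists.
cds="../data/ncbi_dataset/data/GCF_026571515.1/cds_from_genomic.fna"
if [ -f "$cds" ]; then
  cds_header_table "$cds" | head -n 3
fi
```

A table like this gives a way to link each CDS record back to its NCBI gene symbol and protein accession later on.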
tail ../data/ncbi_dataset/data/GCF_026571515.1/*gff
NC_031332.1 RefSeq  exon    19162   19569   .   +   .   ID=exon-BJM09_gp03-1;Parent=rna-BJM09_gp03;Dbxref=GeneID:29141298;gbkey=mRNA;gene=ND4L;locus_tag=BJM09_gp03
NC_031332.1 RefSeq  CDS 19162   19569   .   +   0   ID=cds-YP_009305282.1;Parent=rna-BJM09_gp03;Dbxref=GenBank:YP_009305282.1,GeneID:29141298;Name=YP_009305282.1;gbkey=CDS;gene=ND4L;locus_tag=BJM09_gp03;product=NADH dehydrogenase subunit 4L;protein_id=YP_009305282.1;transl_table=5
NC_031332.1 RefSeq  gene    21431   21496   .   +   .   ID=gene-BJM09_gt24;Dbxref=GeneID:29141299;Name=BJM09_gt24;gbkey=Gene;gene_biotype=tRNA;locus_tag=BJM09_gt24
NC_031332.1 RefSeq  tRNA    21431   21496   .   +   .   ID=rna-BJM09_gt24;Parent=gene-BJM09_gt24;Dbxref=GeneID:29141299;gbkey=tRNA;locus_tag=BJM09_gt24;product=tRNA-Ile
NC_031332.1 RefSeq  exon    21431   21496   .   +   .   ID=exon-BJM09_gt24-1;Parent=rna-BJM09_gt24;Dbxref=GeneID:29141299;gbkey=tRNA;locus_tag=BJM09_gt24;product=tRNA-Ile
NC_031332.1 RefSeq  gene    21551   23521   .   +   .   ID=gene-BJM09_gp01;Dbxref=GeneID:29141326;Name=COX2;gbkey=Gene;gene=COX2;gene_biotype=protein_coding;locus_tag=BJM09_gp01
NC_031332.1 RefSeq  mRNA    21551   23521   .   +   .   ID=rna-BJM09_gp01;Parent=gene-BJM09_gp01;Dbxref=GeneID:29141326;gbkey=mRNA;gene=COX2;locus_tag=BJM09_gp01
NC_031332.1 RefSeq  exon    21551   23521   .   +   .   ID=exon-BJM09_gp01-1;Parent=rna-BJM09_gp01;Dbxref=GeneID:29141326;gbkey=mRNA;gene=COX2;locus_tag=BJM09_gp01
NC_031332.1 RefSeq  CDS 21551   23521   .   +   0   ID=cds-YP_009305270.1;Parent=rna-BJM09_gp01;Dbxref=GenBank:YP_009305270.1,GeneID:29141326;Name=YP_009305270.1;gbkey=CDS;gene=COX2;locus_tag=BJM09_gp01;product=cytochrome c oxidase subunit II;protein_id=YP_009305270.1;transl_table=5
###
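
To see what the NCBI annotation actually contains, the feature types in column 3 of the GFF (gene, mRNA, exon, CDS, tRNA, ...) can be tallied. This is a minimal sketch assuming a standard tab-delimited GFF3 file; `count_gff_features` is a hypothetical helper name.

```shell
# Hypothetical helper (not from the original notebook):
# count how often each feature type (GFF3 column 3) occurs.
count_gff_features() {
  # Skip comment/directive lines (starting with "#") and malformed rows.
  awk -F'\t' '!/^#/ && NF >= 9 { n[$3]++ }
              END { for (t in n) print t "\t" n[t] }' "$1" | sort
}

# Example call, run only if the GFF from above exists.
gff="../data/ncbi_dataset/data/GCF_026571515.1/genomic.gff"
if [ -f "$gff" ]; then
  count_gff_features "$gff"
fi
```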

Swiss-Prot Blast

Create Blast Database

cd ../data/blast_dbs
curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
mv uniprot_sprot.fasta.gz uniprot_sprot_r2024_04.fasta.gz
gunzip -k uniprot_sprot_r2024_04.fasta.gz
head ../data/blast_dbs/uniprot_sprot_r2024_04.fasta
echo "Number of Sequences"
grep -c ">" ../data/blast_dbs/uniprot_sprot_r2024_04.fasta
>sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R OS=Frog virus 3 (isolate Goorha) OX=654924 GN=FV3-001R PE=4 SV=1
MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPS
EKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLD
AKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHL
EKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDD
SFRKIYTDLGWKFTPL
>sp|Q6GZX3|002L_FRG3G Uncharacterized protein 002L OS=Frog virus 3 (isolate Goorha) OX=654924 GN=FV3-002L PE=4 SV=1
MSIIGATRLQNDKSDTYSAGPCYAGGCSAFTPRGTCGKDWDLGEQTCASGFCTSQPLCAR
IKKTQVCGLRYSSKGKDPLVSAEWDSRGAPYVRCTYDADLIDTQAQVDQFVSMFGESPSL
AERYCMRGVKNTAGELVSRVSSDADPAGGWCRKWYSAHRGPDQDAALGSFCIKNPGAADC
Number of Sequences
571864
/home/shared/ncbi-blast-2.15.0+/bin/makeblastdb \
-in ../data/blast_dbs/uniprot_sprot_r2024_04.fasta \
-dbtype prot \
-out ../data/blast_dbs/uniprot_sprot_r2024_04
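# Query: NCBI CDS nucleotide sequences, searched as translated protein against Swiss-Prot.
fasta="../data/ncbi_dataset/data/GCF_026571515.1/cds_from_genomic.fna"

# Make sure the output directory exists before blastx writes to it.
mkdir -p ../output/14-gene-annotation

# -max_target_seqs 1 and -max_hsps 1 keep at most one hit and one HSP per query;
# -outfmt 6 writes tab-separated results.
/home/shared/ncbi-blast-2.15.0+/bin/blastx \
-query "$fasta" \
-db ../data/blast_dbs/uniprot_sprot_r2024_04 \
-out ../output/14-gene-annotation/cds_blastx_sp.tab \
-evalue 1E-05 \
-num_threads 48 \
-max_target_seqs 1 \
-max_hsps 1 \
-outfmt 6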
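
With `-outfmt 6`, each hit is one tab-separated line with the default columns qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore. To get from there to GO information, the UniProt accession has to be pulled out of the subject ID. The sketch below assumes subject IDs keep the `sp|ACCESSION|ENTRY_NAME` layout seen in the Swiss-Prot FASTA headers; `blast_to_accessions` is a hypothetical helper name.

```shell
# Hypothetical helper (not from the original notebook):
# reduce BLAST tabular output to: query ID, UniProt accession, e-value.
blast_to_accessions() {
  awk -F'\t' '{
    n = split($2, part, "|")          # e.g. sp|Q6GZX4|001R_FRG3G
    acc = (n >= 2) ? part[2] : $2     # fall back to the raw ID if there is no "|"
    print $1 "\t" acc "\t" $11        # column 11 is the e-value
  }' "$1"
}

# Example call, run only if the blastx output from above exists.
blast_out="../output/14-gene-annotation/cds_blastx_sp.tab"
if [ -f "$blast_out" ]; then
  blast_to_accessions "$blast_out" | head
fi
```

The resulting accession column is what would be matched against a UniProt table carrying GO terms.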