03.2-genome-annotation
================
Kathleen Durkin
2024-05-21

We need to generate a databse that associates gene IDs with Uniprot
Acession numbers and GO terms, for use with our gene-level count
matrices generated through Hisat2/featureCounts. We already have a
transcriptome database (generated in `03-transcriptome-annotation`), and
the transcriptome has annotated headers that include gene IDs. We can
use these headers to add gene IDs to our existing database.

First, load our existing transcriptome database, which associates
transcript IDs with Uniprot Acession numbers and GO terms (generated in
`03-transcriptome-annotation`)

``` r
# Load our transcript ID -> UniprotID/GO term database
rna_id <- read.csv("../output/03-transcriptome-annotation/G_macrocephalus_rna_IDmapping_2024_04_17.tab", sep="\t")
```

Now let’s view the header format from our original transcriptome file

``` bash
grep "^>" ../data/GCF_031168955.1_ASM3116895v1_rna.fna | head -5
```

    ## >XM_060035408.1 PREDICTED: Gadus macrocephalus putative helicase MOV-10 (LOC132473465), mRNA
    ## >XM_060035409.1 PREDICTED: Gadus macrocephalus uncharacterized LOC132445431 (LOC132445431), transcript variant X2, mRNA
    ## >XM_060035410.1 PREDICTED: Gadus macrocephalus uncharacterized LOC132445431 (LOC132445431), transcript variant X3, mRNA
    ## >XM_060035411.1 PREDICTED: Gadus macrocephalus uncharacterized LOC132445431 (LOC132445431), transcript variant X4, mRNA
    ## >XM_060035412.1 PREDICTED: Gadus macrocephalus ATP-sensitive inward rectifier potassium channel 12-like (LOC132475957), transcript variant X2, mRNA

We need to extract three pieces of information from these headers: the
transcript ID (e.g., “XM_060035408.1”), the associated gene ID (e.g.,
“LOC132473465”), and the type (e.g., “mRNA”).

``` bash
grep "^>" ../data/GCF_031168955.1_ASM3116895v1_rna.fna | \
while read -r line; do
    # Extract the sequence ID
    seq_id=$(echo "$line" | awk -F'>| ' '{print $2}')
    # Extract the gene ID
    gene_ID=$(echo "$line" | grep -oP '\([^()]*\)(?=,[^()]*$)' | grep -oP '(?<=\().*(?=\))')
    # Extract the sequence type
    seq_type=$(echo "$line" | awk -F'), ' '{print $NF}')
    # Print the desired tab-delimited output format
    echo -e "${seq_id}\t${gene_ID}\t${seq_type}"
done > ../output/03.2-genome-annotation/transcriptome_headers_formatted.txt
```

``` bash
# View some formatted headers
head -5 ../output/03.2-genome-annotation/transcriptome_headers_formatted.txt

echo ""
echo ""

# Check that we have the same number of formatted header lines as original header lines
echo "# headers in original transcriptome file:"
grep "^>" ../data/GCF_031168955.1_ASM3116895v1_rna.fna | wc -l
echo""
echo "# headers in formatted file:"
wc -l < ../output/03.2-genome-annotation/transcriptome_headers_formatted.txt
```

    ## XM_060035408.1   LOC132473465    mRNA
    ## XM_060035409.1   LOC132445431    transcript variant X2, mRNA
    ## XM_060035410.1   LOC132445431    transcript variant X3, mRNA
    ## XM_060035411.1   LOC132445431    transcript variant X4, mRNA
    ## XM_060035412.1   LOC132475957    transcript variant X2, mRNA
    ## 
    ## 
    ## # headers in original transcriptome file:
    ## 49417
    ## 
    ## # headers in formatted file:
    ## 49417

``` r
headers <- read.csv("../output/03.2-genome-annotation/transcriptome_headers_formatted.txt", sep="\t", header=FALSE, col.names=c("transcript_ID", "gene_ID", "transcript_type"))
```

``` r
# Join
genes_IDmap <- left_join(rna_id, headers, by=c("V1"="transcript_ID"))

# View output (columns gene_ID and transcript_type)
genes_IDmap %>%
  select(gene_ID, transcript_type) %>%
  head()
```

    ##        gene_ID             transcript_type
    ## 1 LOC132473465                        mRNA
    ## 2 LOC132475957 transcript variant X2, mRNA
    ## 3 LOC132445432 transcript variant X1, mRNA
    ## 4 LOC132445432 transcript variant X2, mRNA
    ## 5 LOC132445432 transcript variant X3, mRNA
    ## 6 LOC132445433                        mRNA

The output indicates we have gene_ID duplicates due to multiple
transcript variants in the transcriptome file. We only want *one* entry
for each gene_ID, since this is the column which will be used to join
with our gene-level counts matrix. We need to reduce this ID mapping to
only unique gene_IDs

``` r
# Reduce to unique gene_IDs 
genes_IDmap_unique <- genes_IDmap %>%
  distinct(gene_ID, .keep_all=TRUE)

# Check
genes_IDmap_unique %>%
  select(gene_ID, transcript_type) %>%
  head()
```

    ##        gene_ID             transcript_type
    ## 1 LOC132473465                        mRNA
    ## 2 LOC132475957 transcript variant X2, mRNA
    ## 3 LOC132445432 transcript variant X1, mRNA
    ## 4 LOC132445433                        mRNA
    ## 5        rbm46                        mRNA
    ## 6 LOC132445437                        mRNA

``` r
# Save to file
write.table(genes_IDmap_unique, "../output/03.2-genome-annotation/G_macrocephalus_genes_IDmapping_2024_05_22.tab", sep="\t")
```