03.2-genome-annotation ================ Kathleen Durkin 2024-05-21 We need to generate a databse that associates gene IDs with Uniprot Acession numbers and GO terms, for use with our gene-level count matrices generated through Hisat2/featureCounts. We already have a transcriptome database (generated in `03-transcriptome-annotation`), and the transcriptome has annotated headers that include gene IDs. We can use these headers to add gene IDs to our existing database. First, load our existing transcriptome database, which associates transcript IDs with Uniprot Acession numbers and GO terms (generated in `03-transcriptome-annotation`) ``` r # Load our transcript ID -> UniprotID/GO term database rna_id <- read.csv("../output/03-transcriptome-annotation/G_macrocephalus_rna_IDmapping_2024_04_17.tab", sep="\t") ``` Now let’s view the header format from our original transcriptome file ``` bash grep "^>" ../data/GCF_031168955.1_ASM3116895v1_rna.fna | head -5 ``` ## >XM_060035408.1 PREDICTED: Gadus macrocephalus putative helicase MOV-10 (LOC132473465), mRNA ## >XM_060035409.1 PREDICTED: Gadus macrocephalus uncharacterized LOC132445431 (LOC132445431), transcript variant X2, mRNA ## >XM_060035410.1 PREDICTED: Gadus macrocephalus uncharacterized LOC132445431 (LOC132445431), transcript variant X3, mRNA ## >XM_060035411.1 PREDICTED: Gadus macrocephalus uncharacterized LOC132445431 (LOC132445431), transcript variant X4, mRNA ## >XM_060035412.1 PREDICTED: Gadus macrocephalus ATP-sensitive inward rectifier potassium channel 12-like (LOC132475957), transcript variant X2, mRNA We need to extract three pieces of information from these headers: the transcript ID (e.g., “XM_060035408.1”), the associated gene ID (e.g., “LOC132473465”), and the type (e.g., “mRNA”). ``` bash grep "^>" ../data/GCF_031168955.1_ASM3116895v1_rna.fna | \ while read -r line; do # Extract the sequence ID seq_id=$(echo "$line" | awk -F'>| ' '{print $2}') # Extract the gene ID gene_ID=$(echo "$line" | grep -oP '\([^()]*\)(?=,[^()]*$)' | grep -oP '(?<=\().*(?=\))') # Extract the sequence type seq_type=$(echo "$line" | awk -F'), ' '{print $NF}') # Print the desired tab-delimited output format echo -e "${seq_id}\t${gene_ID}\t${seq_type}" done > ../output/03.2-genome-annotation/transcriptome_headers_formatted.txt ``` ``` bash # View some formatted headers head -5 ../output/03.2-genome-annotation/transcriptome_headers_formatted.txt echo "" echo "" # Check that we have the same number of formatted header lines as original header lines echo "# headers in original transcriptome file:" grep "^>" ../data/GCF_031168955.1_ASM3116895v1_rna.fna | wc -l echo"" echo "# headers in formatted file:" wc -l < ../output/03.2-genome-annotation/transcriptome_headers_formatted.txt ``` ## XM_060035408.1 LOC132473465 mRNA ## XM_060035409.1 LOC132445431 transcript variant X2, mRNA ## XM_060035410.1 LOC132445431 transcript variant X3, mRNA ## XM_060035411.1 LOC132445431 transcript variant X4, mRNA ## XM_060035412.1 LOC132475957 transcript variant X2, mRNA ## ## ## # headers in original transcriptome file: ## 49417 ## ## # headers in formatted file: ## 49417 ``` r headers <- read.csv("../output/03.2-genome-annotation/transcriptome_headers_formatted.txt", sep="\t", header=FALSE, col.names=c("transcript_ID", "gene_ID", "transcript_type")) ``` ``` r # Join genes_IDmap <- left_join(rna_id, headers, by=c("V1"="transcript_ID")) # View output (columns gene_ID and transcript_type) genes_IDmap %>% select(gene_ID, transcript_type) %>% head() ``` ## gene_ID transcript_type ## 1 LOC132473465 mRNA ## 2 LOC132475957 transcript variant X2, mRNA ## 3 LOC132445432 transcript variant X1, mRNA ## 4 LOC132445432 transcript variant X2, mRNA ## 5 LOC132445432 transcript variant X3, mRNA ## 6 LOC132445433 mRNA The output indicates we have gene_ID duplicates due to multiple transcript variants in the transcriptome file. We only want *one* entry for each gene_ID, since this is the column which will be used to join with our gene-level counts matrix. We need to reduce this ID mapping to only unique gene_IDs ``` r # Reduce to unique gene_IDs genes_IDmap_unique <- genes_IDmap %>% distinct(gene_ID, .keep_all=TRUE) # Check genes_IDmap_unique %>% select(gene_ID, transcript_type) %>% head() ``` ## gene_ID transcript_type ## 1 LOC132473465 mRNA ## 2 LOC132475957 transcript variant X2, mRNA ## 3 LOC132445432 transcript variant X1, mRNA ## 4 LOC132445433 mRNA ## 5 rbm46 mRNA ## 6 LOC132445437 mRNA ``` r # Save to file write.table(genes_IDmap_unique, "../output/03.2-genome-annotation/G_macrocephalus_genes_IDmapping_2024_05_22.tab", sep="\t") ```