---
title: "Accessing data"
output: html_document
---
```{r}
library(dplyr)
#install.packages("BiocManager")
#BiocManager::install("Biostrings")
library(Biostrings)
```
#Objective
RNAseq analysis was run but there are several unannotated genes that are significant and currently not included in our analysis. The goal is to manually blast sequences of unannotated genes to find potential annotations of closely related species.
# Background
These samples were sequenced by NovoGene in paired-end mode, 150 bp
reads. We used
[fastqc](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
to assess the overall quality for the [FASTQ](multiqc_report.html)
files. The link will open a summary of the fastqc results, which
indicate that the raw data are of high quality.
We aligned reads to the Botryllus_tozio_genes.fasta transcript file, using the salmon aligner, and then summarized to the gene level using the
Bioconductor tximport package .
The FASTA transcripts file appears to have been generated by the
Trinity aligner, which starts by identifying transcripts and then
collapsing the transcripts to genes. The resulting identifier formats
are something like g1.t1, which in this case indicates that the given
transcript is the 'first' (t1) transcript for the 'first' (g1) gene.
The transcripts were annotated using both eggNOG and OmicsBox. the
eggNOG annotations were more complete, so we used those to map the
genes to gene symbols and GO terms.
# Read in RNAseq Results
Read in the results file for the RNAseq analysis and filter for unannotated genes
```{r}
# Read the CSV file
rnaseq_results <- read.csv("../data/Gardell_pid2462_results.csv", stringsAsFactors = FALSE)
# Filter for genes with significant adjusted p-values (also known as the false discovery rate of < 0.1) and no gene symbols
noname_sig_genes <- rnaseq_results %>%
filter(adj.P.Val < 0.1, SYMBOL == Gene) # SYMBOL equals Gene for unannotated genes
# To view all the genes that have been annotated with eggNOG
sig_genes <- rnaseq_results %>%
filter(adj.P.Val < 0.1, SYMBOL != Gene)
# Write a new csv that just contains the results of the significant unannotated genes
write.csv(noname_sig_genes, file = "noname.csv")
```
There are 25 annotated genes and 29 unannotated genes.
We will need to use this file to make a gene list to extract all of the sequences of the genes in that csv file.
Take a look at the fasta file for the transcriptome provided.
```{r, engine='bash'}
pwd
cd ../data
head -n 20 Botryllus_tozio_genes.fasta
```
```{r, engine='bash'}
cd ../data
echo "How many sequences are there?"
grep -c ">" Botryllus_tozio_genes.fasta
```
The next step will involve making a new fasta file with just those sequences in the gene list
```{r, engine='bash'}
# Extract the 'Gene' column while skipping the header
awk -F 'NR > 1 {print $2}' ../data/noname.csv > ../data/gene_ids.txt
```
Now we need a transcript id list. this is going to be important for when we try to extract each sequence of each transcript from the fasta file and make a new one using seqtk tools.
```{r, engine='bash'}
#!/bin/bash
# Input FASTA file containing the full transcriptome
INPUT_FASTA="/home/shared/8TB_HDD_02/cvaldi/Botryllus-Nickel-Tox/data/Botryllus_tozio_genes.fasta"
# Output file for the new list of transcript IDs (with suffixes like .t1, .t2, etc.)
TRANSCRIPT_IDS="/home/shared/8TB_HDD_02/cvaldi/Botryllus-Nickel-Tox/data/transcript_ids.txt"
# Create (or clear) the transcript_ids.txt file
> $TRANSCRIPT_IDS
# Directly execute multiple awk commands for each gene and append to the output file
awk '/^>g10908[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g11811[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g11813[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g1196[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g12050[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g12111[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g12775[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g1294[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g12982[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g13014[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g14195[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g14703[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g14965[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g15728[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g15986[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g16191[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g16471[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g1703[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g1744[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g2472[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g2473[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g4208[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g5661[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g6611[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g7259[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g7613[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g8170[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g9341[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
awk '/^>g9597[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS
echo "Transcript IDs for all genes extracted and saved to $TRANSCRIPT_IDS"
# Optionally, view the transcript IDs to confirm they were extracted
cat $TRANSCRIPT_IDS
```