--- title: "Accessing data" output: html_document --- ```{r} library(dplyr) #install.packages("BiocManager") #BiocManager::install("Biostrings") library(Biostrings) ``` #Objective RNAseq analysis was run but there are several unannotated genes that are significant and currently not included in our analysis. The goal is to manually blast sequences of unannotated genes to find potential annotations of closely related species. # Background These samples were sequenced by NovoGene in paired-end mode, 150 bp reads. We used [fastqc](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to assess the overall quality for the [FASTQ](multiqc_report.html) files. The link will open a summary of the fastqc results, which indicate that the raw data are of high quality. We aligned reads to the Botryllus_tozio_genes.fasta transcript file, using the salmon aligner, and then summarized to the gene level using the Bioconductor tximport package . The FASTA transcripts file appears to have been generated by the Trinity aligner, which starts by identifying transcripts and then collapsing the transcripts to genes. The resulting identifier formats are something like g1.t1, which in this case indicates that the given transcript is the 'first' (t1) transcript for the 'first' (g1) gene. The transcripts were annotated using both eggNOG and OmicsBox. the eggNOG annotations were more complete, so we used those to map the genes to gene symbols and GO terms. # Read in RNAseq Results Read in the results file for the RNAseq analysis and filter for unannotated genes ```{r} # Read the CSV file rnaseq_results <- read.csv("../data/Gardell_pid2462_results.csv", stringsAsFactors = FALSE) # Filter for genes with significant adjusted p-values (also known as the false discovery rate of < 0.1) and no gene symbols noname_sig_genes <- rnaseq_results %>% filter(adj.P.Val < 0.1, SYMBOL == Gene) # SYMBOL equals Gene for unannotated genes # To view all the genes that have been annotated with eggNOG sig_genes <- rnaseq_results %>% filter(adj.P.Val < 0.1, SYMBOL != Gene) # Write a new csv that just contains the results of the significant unannotated genes write.csv(noname_sig_genes, file = "noname.csv") ``` There are 25 annotated genes and 29 unannotated genes. We will need to use this file to make a gene list to extract all of the sequences of the genes in that csv file. Take a look at the fasta file for the transcriptome provided. ```{r, engine='bash'} pwd cd ../data head -n 20 Botryllus_tozio_genes.fasta ``` ```{r, engine='bash'} cd ../data echo "How many sequences are there?" grep -c ">" Botryllus_tozio_genes.fasta ``` The next step will involve making a new fasta file with just those sequences in the gene list ```{r, engine='bash'} # Extract the 'Gene' column while skipping the header awk -F 'NR > 1 {print $2}' ../data/noname.csv > ../data/gene_ids.txt ``` Now we need a transcript id list. this is going to be important for when we try to extract each sequence of each transcript from the fasta file and make a new one using seqtk tools. ```{r, engine='bash'} #!/bin/bash # Input FASTA file containing the full transcriptome INPUT_FASTA="/home/shared/8TB_HDD_02/cvaldi/Botryllus-Nickel-Tox/data/Botryllus_tozio_genes.fasta" # Output file for the new list of transcript IDs (with suffixes like .t1, .t2, etc.) TRANSCRIPT_IDS="/home/shared/8TB_HDD_02/cvaldi/Botryllus-Nickel-Tox/data/transcript_ids.txt" # Create (or clear) the transcript_ids.txt file > $TRANSCRIPT_IDS # Directly execute multiple awk commands for each gene and append to the output file awk '/^>g10908[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g11811[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g11813[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g1196[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g12050[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g12111[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g12775[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g1294[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g12982[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g13014[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g14195[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g14703[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g14965[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g15728[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g15986[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g16191[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g16471[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g1703[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g1744[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g2472[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g2473[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g4208[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g5661[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g6611[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g7259[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g7613[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g8170[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g9341[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS awk '/^>g9597[^0-9]/{print substr($1, 2)}' $INPUT_FASTA >> $TRANSCRIPT_IDS echo "Transcript IDs for all genes extracted and saved to $TRANSCRIPT_IDS" # Optionally, view the transcript IDs to confirm they were extracted cat $TRANSCRIPT_IDS ```