01-get-NCBI-sequences

Author

Sarah Tanja

Published

April 26, 2023

Environment pre-reqs:

Install Packages

Code
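# install BiocManager first, since SRAdb is distributed through Bioconductor
if (!requireNamespace("tidyverse", quietly = TRUE)) install.packages("tidyverse")
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
if (!requireNamespace("SRAdb", quietly = TRUE)) BiocManager::install("SRAdb")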

Load packages

Code
library(SRAdb)
library(tidyverse)

Get sample metadata

First we want to get an idea of the files we are downloading and the samples that generated the data. We will start by looking at the metadata for the samples.

Code
# navigate to data directory
cd ../data

# download metadata from the git repo into the data directory
curl -O https://raw.githubusercontent.com/AHuffmyer/EarlyLifeHistory_Energetics/master/Mcap2020/Data/TagSeq/Sample_Info.csv  
Code
#pull data into R and rename it metadata 
metadata <- read_csv("../data/Sample_Info.csv")

Check file integrity with md5sum

Code
cd ../data
md5sum Sample_Info.csv > md5.transferred
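Code
cd ../data
md5sum -c md5.transferred

md5sum -c recomputes the checksum of Sample_Info.csv and compares it with the one recorded in md5.transferred. Running cmp on Sample_Info.csv and md5.transferred instead compares the CSV with the checksum file itself, so it reports a difference at byte 1; that is the position of the first mismatch, not a Windows vs Unix issue. Because the checksum was generated from the downloaded copy, this check only confirms that the file has not changed since then.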
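A minimal sketch of a record-then-verify checksum check, assuming GNU coreutils md5sum is available. The helper names record_md5 and verify_md5 and the toy demo.csv file are hypothetical, introduced only for illustration.

```shell
# Sketch: record a checksum once, then verify the file against it later.
# record_md5 and verify_md5 are hypothetical helper names.

record_md5() {
  # $1 = directory, $2 = file name; writes "<hash>  <file>" to <file>.md5
  (cd "$1" && md5sum "$2" > "$2.md5")
}

verify_md5() {
  # $1 = directory, $2 = file name; returns 0 only if the file still
  # matches the checksum recorded in <file>.md5
  (cd "$1" && md5sum --check --quiet "$2.md5")
}

# Demo on a throwaway file (toy content, not real sample metadata).
workdir=$(mktemp -d)
printf 'sample,stage\nAH1,demo\n' > "$workdir/demo.csv"
record_md5 "$workdir" demo.csv
verify_md5 "$workdir" demo.csv && echo "checksum matches"
```

This only detects changes made after the checksum was recorded. Verifying the download itself would need a checksum published alongside the source file.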

Code
#look at the metadata
head(metadata)

There are 39 samples (rows, AH1 - AH39) with 8 metadata columns in this tibble. The samples are Montipora capitata corals collected at different life stages (denoted by the columns time-stage and code), with RNA extracted and sequenced using TagSeq.
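The row and column counts can be cross-checked from the shell. The sketch below assumes a simple CSV with a single header line and no commas or line breaks inside quoted fields; csv_dims is a hypothetical helper name and the demo file is toy data.

```shell
# Sketch: print "<data rows> <columns>" for a simple CSV file.
# csv_dims is a hypothetical helper; it assumes one header line and
# no commas or line breaks inside quoted fields.
csv_dims() {
  awk -F',' 'NR == 1 { cols = NF } END { print NR - 1, cols }' "$1"
}

# Demo on a toy file with 2 data rows and 3 columns.
demo_csv=$(mktemp)
printf 'id,stage,code\nX1,early,E\nX2,late,L\n' > "$demo_csv"
csv_dims "$demo_csv"
```

If Sample_Info.csv meets those assumptions, running csv_dims on it should report the same 39 rows and 8 columns seen in the tibble.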

M. capitata development diagram by A. Huffmyer

Get TagSeq FASTQ files from National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA)

Info

Check out the background info on the Sequence Read Archive (SRA) and the SRA fact sheet.

NCBI sequences have been trimmed and cleaned and are ready for alignment and analysis.

Use SRA Run Selector to make a .txt file list of Run (SRR) numbers

“Results are called runs (SRR). Runs comprise the data gathered for a sample or sample bundle and refer to a defining experiment.”

First, we need to obtain the RNA-seq files from NCBI project PRJNA900235. To do this, we need a list of SRR numbers that identify the specific sequence files. Go to the SRA Run Selector for BioProject 900235 and select the files of interest.

To make things easier and faster, we’re going to compare 2 life-history groups, with an n=4 for each group. In this case, let’s compare 4 Embryos (an early stage) to 4 Attached Recruits (a later stage).

SRA run selector, downloading 8 fastq files (4 Embryo, 4 Attached Recruits)

Select the files of interest and click ‘Accession List’. The list of run IDs will be downloaded as a text file named SRR_Acc_list.txt. Upload this file into the data folder.
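Before using the accession list, it can help to normalize it, since a file downloaded through a browser may carry Windows line endings or stray whitespace. This is a minimal sketch; clean_accessions is a hypothetical helper name, and the pattern assumes every run ID has the form SRR followed by digits.

```shell
# Sketch: print only well-formed SRR run IDs from an accession list.
# clean_accessions is a hypothetical helper name.
clean_accessions() {
  # $1 = accession list; strip carriage returns and surrounding
  # whitespace, then keep lines that look like SRR run IDs.
  tr -d '\r' < "$1" \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | grep -E '^SRR[0-9]+$'
}

# Demo: a list with a CRLF line ending, a blank line, padding, and a stray word.
demo_acc=$(mktemp)
printf 'SRR22293447\r\n\n  SRR22293448 \nRun\n' > "$demo_acc"
clean_accessions "$demo_acc"
```

The cleaned output could be redirected to a new file and used in place of the raw download.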

Download the fastq files

Roberts Lab Resources GitHub issue #1569 thread

Using sratoolkit.3.0.2-ubuntu64, which is already downloaded in the /home/shared folder.

Danger

The following code will take some time. Run it and go take a wee break.

Code
/home/shared/sratoolkit.3.0.2-ubuntu64/bin/fasterq-dump \
--outdir /home/shared/8TB_HDD_01/mcap \
--progress \
SRR22293447 \
SRR22293448 \
SRR22293449 \
SRR22293450 \
SRR22293451 \
SRR22293452 \
SRR22293453 \
SRR22293454
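Instead of typing each run ID by hand, the same command could be driven from the accession list. The following is a sketch under assumptions: for_each_run is a hypothetical helper, the list is assumed to hold one SRR ID per line, and the demo only echoes the commands (a dry run) rather than calling fasterq-dump. The tool path, --outdir and --progress are taken from the command used here.

```shell
# Sketch: run a command once per accession in a list.
# for_each_run is a hypothetical helper name.
for_each_run() {
  # $1 = accession list; remaining arguments = command to run,
  # with each accession appended as the last argument.
  list="$1"; shift
  tr -d '\r' < "$list" | grep -E '^SRR[0-9]+$' | while IFS= read -r acc; do
    # Stop at the first failure; /dev/null keeps the command from
    # consuming the accession list on stdin.
    "$@" "$acc" < /dev/null || exit 1
  done
}

FASTERQ_DUMP=/home/shared/sratoolkit.3.0.2-ubuntu64/bin/fasterq-dump
OUTDIR=/home/shared/8TB_HDD_01/mcap

# Dry run on a two-line demo list: echo prints each command instead of running it.
demo_list=$(mktemp)
printf 'SRR22293447\nSRR22293448\n' > "$demo_list"
for_each_run "$demo_list" echo "$FASTERQ_DUMP" --outdir "$OUTDIR" --progress
```

Dropping the echo and pointing the first argument at the real accession list would run the downloads one accession at a time.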

Absolute path to fastq files in raven:

/home/shared/8TB_HDD_01/mcap/

Relative path to fastq files in raven:

cd ../../../../../8TB_HDD_01/mcap/

Check that the fastq files are downloaded:

Code
cd /home/shared/8TB_HDD_01/mcap/
ls
Tip

The fastq files have been downloaded from NCBI!
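A quick way to confirm that every requested run produced a file is to compare the accession list against the output directory. This sketch assumes fasterq-dump writes single-end runs as <SRR>.fastq and paired-end runs as <SRR>_1.fastq and <SRR>_2.fastq; missing_fastq is a hypothetical helper name, and the demo uses a throwaway directory with placeholder content.

```shell
# Sketch: list accessions that have no non-empty FASTQ file in a directory.
# missing_fastq is a hypothetical helper name. Assumed file naming:
# <SRR>.fastq (single-end) or <SRR>_1.fastq (first mate of a pair).
missing_fastq() {
  # $1 = accession list, $2 = directory holding the FASTQ files
  tr -d '\r' < "$1" | grep -E '^SRR[0-9]+$' | while IFS= read -r acc; do
    if [ ! -s "$2/$acc.fastq" ] && [ ! -s "$2/${acc}_1.fastq" ]; then
      echo "$acc"
    fi
  done
}

# Demo: two accessions requested, only one (placeholder) file present.
demo_dir=$(mktemp -d)
printf 'SRR22293447\nSRR22293448\n' > "$demo_dir/list.txt"
printf '@read1\nACGT\n+\nIIII\n' > "$demo_dir/SRR22293447.fastq"
missing_fastq "$demo_dir/list.txt" "$demo_dir"
```

An empty result means every accession in the list has a matching non-empty file under the assumed naming scheme.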