---
title: "01-get-data"
author: "Sarah Tanja"
date: "4/11/2023"
output: md_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,         # Display code chunks
  eval = FALSE,        # Do not evaluate code chunks
  warning = FALSE,     # Hide warnings
  message = FALSE,     # Hide messages
  fig.width = 6,       # Set plot width in inches
  fig.height = 4,      # Set plot height in inches
  fig.align = "center" # Align plots to the center
)
```

# Getting RNAseq FASTQ Data Files

------------------------------------------------------------------------

## Advanced Prep

-   have sequence data files deposited on NCBI
-   operate within RStudio in a cloud environment (JupyterHub or Roberts Lab Raven server) with lots of drive space! These fastq files are very large, so you don't want to download them onto your local computer

### Install packages

```{r install-packages, cache=TRUE}
# install BiocManager and SRAdb only if they are not already available
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
if (!requireNamespace("SRAdb", quietly = TRUE)) BiocManager::install("SRAdb")
```

### Load packages

```{r load-packages}
library(SRAdb)
library(tidyverse)
```

First we want to get an idea of the files we are downloading and the samples that generated the data. We will start by getting and looking at the metadata for the samples.

## Get sample metadata

```{r, engine='bash'}
# navigate to data directory
cd ../data

# download metadata from the git repo into the data directory
# `-O` indicates to save using the same file name `Sample_Info.csv`
curl -O https://raw.githubusercontent.com/AHuffmyer/EarlyLifeHistory_Energetics/master/Mcap2020/Data/TagSeq/Sample_Info.csv
```

```{r}
# pull data into R and name it metadata
metadata <- read_csv("../data/Sample_Info.csv")
```

### Practice using hashes

Use an `md5sum` hash to check that the file we saved (and read into R as `metadata`) is the same as the original file.
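Before running the R version, the same idea can be sketched in the shell: two files are identical if their MD5 hashes are identical. This is a minimal illustration, assuming a Linux system where the `md5sum` command is available. The temporary directory and both file names are hypothetical stand-ins, not the real `Sample_Info.csv`.

```shell
# Hypothetical working directory and placeholder file (not the real metadata)
workdir=$(mktemp -d)
printf 'placeholder contents\n' > "$workdir/original.csv"
cp "$workdir/original.csv" "$workdir/copy.csv"

# md5sum prints "<hash>  <file>"; keep only the hash field
hash_original=$(md5sum "$workdir/original.csv" | cut -d ' ' -f 1)
hash_copy=$(md5sum "$workdir/copy.csv" | cut -d ' ' -f 1)

# Identical hashes mean the copy matches the original
if [ "$hash_original" = "$hash_copy" ]; then
  md5_status="match"
else
  md5_status="MISMATCH"
fi
echo "checksums: $md5_status"
```

A single hash on its own proves nothing. The check only becomes meaningful when the hash is compared against one computed from (or published with) the original file.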
```{r}
# md5sum() lives in the 'tools' package, which is not attached by default
?tools::md5sum
tools::md5sum("../data/Sample_Info.csv")
```

### Understanding the metadata

```{r}
# look at the metadata
head(metadata)
```

This tibble has 39 samples (rows, AH1 to AH39) and 8 metadata columns. The samples are *Montipora capitata* corals collected at different life stages (recorded in the `time-stage` and `code` columns), from which RNA was extracted and sequenced using TagSeq.

[![M. capitata development diagram by A. Huffmyer](https://user-images.githubusercontent.com/32178010/211181816-cf21abb7-7038-4f86-9aca-3ca326a958ce.png)](https://github.com/AHuffmyer/EarlyLifeHistory_Energetics)

***All 39 TagSeq `fastq.gz` files in the dataset:***

| **TagSeq seq ID** | **TagSeq File**               | **TagSeq SRA run accession #** |
|-------------------|-------------------------------|--------------------------------|
| AH1               | AH1_S32_L002_R1_001.fastq.gz  | SRR22293483                    |
| AH2               | AH2_S33_L002_R1_001.fastq.gz  | SRR22293482                    |
| AH3               | AH3_S34_L002_R1_001.fastq.gz  | SRR22293471                    |
| AH4               | AH4_S35_L002_R1_001.fastq.gz  | SRR22293460                    |
| AH5               | AH5_S36_L002_R1_001.fastq.gz  | SRR22293450                    |
| AH6               | AH6_S37_L002_R1_001.fastq.gz  | SRR22293449                    |
| AH7               | AH7_S38_L002_R1_001.fastq.gz  | SRR22293448                    |
| AH8               | AH8_S39_L002_R1_001.fastq.gz  | SRR22293447                    |
| AH9               | AH9_S40_L002_R1_001.fastq.gz  | SRR22293446                    |
| AH10              | AH10_S41_L002_R1_001.fastq.gz | SRR22293445                    |
| AH11              | AH11_S42_L002_R1_001.fastq.gz | SRR22293481                    |
| AH12              | AH12_S43_L002_R1_001.fastq.gz | SRR22293480                    |
| AH13              | AH13_S44_L002_R1_001.fastq.gz | SRR22293479                    |
| AH14              | AH14_S45_L002_R1_001.fastq.gz | SRR22293478                    |
| AH15              | AH15_S46_L002_R1_001.fastq.gz | SRR22293477                    |
| AH16              | AH16_S47_L002_R1_001.fastq.gz | SRR22293476                    |
| AH17              | AH17_S48_L002_R1_001.fastq.gz | SRR22293475                    |
| AH18              | AH18_S49_L002_R1_001.fastq.gz | SRR22293474                    |
| AH19              | AH19_S50_L002_R1_001.fastq.gz | SRR22293473                    |
| AH20              | AH20_S51_L002_R1_001.fastq.gz | SRR22293472                    |
| AH21              | AH21_S52_L002_R1_001.fastq.gz | SRR22293470                    |
| AH22              | AH22_S53_L002_R1_001.fastq.gz | SRR22293469                    |
| AH23              | AH23_S54_L002_R1_001.fastq.gz | SRR22293468                    |
| AH24              | AH24_S55_L002_R1_001.fastq.gz | SRR22293467                    |
| AH25              | AH25_S56_L002_R1_001.fastq.gz | SRR22293466                    |
| AH26              | AH26_S57_L002_R1_001.fastq.gz | SRR22293465                    |
| AH27              | AH27_S58_L002_R1_001.fastq.gz | SRR22293464                    |
| AH28              | AH28_S59_L002_R1_001.fastq.gz | SRR22293463                    |
| AH29              | AH29_S60_L002_R1_001.fastq.gz | SRR22293462                    |
| AH30              | AH30_S61_L002_R1_001.fastq.gz | SRR22293461                    |
| AH31              | AH31_S62_L002_R1_001.fastq.gz | SRR22293459                    |
| AH32              | AH32_S63_L002_R1_001.fastq.gz | SRR22293458                    |
| AH33              | AH33_S64_L002_R1_001.fastq.gz | SRR22293457                    |
| AH34              | AH34_S65_L002_R1_001.fastq.gz | SRR22293456                    |
| AH35              | AH35_S66_L002_R1_001.fastq.gz | SRR22293455                    |
| AH36              | AH36_S67_L002_R1_001.fastq.gz | SRR22293454                    |
| AH37              | AH37_S68_L002_R1_001.fastq.gz | SRR22293453                    |
| AH38              | AH38_S69_L002_R1_001.fastq.gz | SRR22293452                    |
| AH39              | AH39_S70_L002_R1_001.fastq.gz | SRR22293451                    |

: All 39 samples in the dataset and their fastq file names

### Select a sub-sample of files

To make things easier and faster, we're going to select just a few of these files (six runs in the download step) to scaffold this code.

### Make a .txt file list of Run (SRR) numbers

> *"Results are called runs (SRR). Runs comprise the data gathered for a sample or sample bundle and refer to a defining experiment."*

First, we need to obtain the RNAseq files from NCBI for the project [PRJNA900235](https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA900235). To do this we need a list of the SRR numbers that identify the specific RNAseq sequence files.

Go to the [SRA links](https://www.ncbi.nlm.nih.gov/sra?LinkName=bioproject_sra_all&from_uid=900235) for BioProject 900235 and select all the RNAseq files of interest. In the upper right-hand corner, I selected "Send to" and then "File". This outputs a .txt file with a list of all SRR numbers.
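As a rough sketch of what such a list looks like, the snippet below writes a small accession list by hand and sanity-checks it. The six accessions are the ones passed to `fasterq-dump` later in this document (all of them appear in the table above). The temporary directory is a hypothetical location, and the variable names are made up for illustration.

```shell
# Hypothetical output location; the file name follows SRR_Acc_list.txt
list_dir=$(mktemp -d)
acc_list="$list_dir/SRR_Acc_list.txt"

# One run accession per line (subset taken from the table above)
printf '%s\n' \
  SRR22293447 SRR22293448 SRR22293449 \
  SRR22293464 SRR22293466 SRR22293467 > "$acc_list"

# Count lines that do NOT look like an SRR accession
# (grep -c exits non-zero when the count is 0, hence the || true)
bad_lines=$(grep -v -c -E '^SRR[0-9]+$' "$acc_list" || true)

# Count how many accessions were written
n_runs=$(wc -l < "$acc_list" | tr -d ' ')

echo "$n_runs accessions, $bad_lines malformed"
```

Checking the format up front is cheap and catches stray header lines or blank lines before a long download is started.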
[![screen capture of SRA link webpage for BioProject 900235](images/ncbi-sra-links.png)](https://www.ncbi.nlm.nih.gov/sra?LinkName=bioproject_sra_all&from_uid=900235)

## Do we use SRAdb R package? ...NO

```{r}
browseVignettes("SRAdb")
```

We have to get the SRAdb SQLite file from its online location. The download and uncompress steps are done automatically with a single command, `getSRAdbFile( )`, which has defaults `getSRAdbFile( destdir = getwd(), destfile = "SRAmetadb.sqlite.gz", method )`

```{r}
getSRAdbFile(destdir = "../data")
```

## SRA Download

Roberts Lab Resources GitHub [issue#1569](https://github.com/RobertsLab/resources/issues/1569) thread

[SRA Run Selector](https://www.ncbi.nlm.nih.gov/Traces/study/?query_key=9&WebEnv=MCID_64358c3e49f6486f57b185c1&o=acc_s%3Aa)

SRR_Acc_list.txt

```{r, engine='bash'}
# move to large data hard drive
cd /home/shared/8TB_HDD_01

# make a new directory named 'mcap', short for Montipora capitata
# (-p avoids an error if the directory already exists)
mkdir -p mcap
```

```{r, engine='bash', cache=TRUE}
/home/shared/sratoolkit.3.0.2-ubuntu64/bin/./fasterq-dump \
--outdir /home/shared/8TB_HDD_01/mcap \
--progress \
SRR22293447 \
SRR22293448 \
SRR22293449 \
SRR22293464 \
SRR22293466 \
SRR22293467
```

Path to download fastq files in Raven: `/home/shared/8TB_HDD_01/`

## NCBI Data Download

## Get RNAseq files from National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA)

> 👀 Check out some background info on [The Sequence Read Archive (SRA)](https://linsalrob.github.io/ComputationalGenomicsManual/Databases/SRA.html#:~:text=A%20Study%20(SRP)%20has%20one,want%20to%20download%20from%20NCBI.) and the [SRA factsheet](https://www.ncbi.nlm.nih.gov/core/assets/sra/files/Factsheet_SRA.pdf)
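One possible way to drive the download from an accession list, rather than typing each run by hand, is sketched below. It is not the method used above. It assumes `fasterq-dump` is on the `PATH` (the chunk above calls it by its full path instead), and the `fetch_runs` function, the `DRY_RUN` switch, and the two-line list are all hypothetical. With `DRY_RUN=1` it only prints the commands it would run, so nothing is downloaded.

```shell
# Output directory from the download step above; the list file is a
# hypothetical stand-in for SRR_Acc_list.txt
outdir="/home/shared/8TB_HDD_01/mcap"
acc_list=$(mktemp)
printf '%s\n' SRR22293447 SRR22293448 > "$acc_list"

# DRY_RUN=1 only prints the commands; set DRY_RUN=0 to actually download
DRY_RUN=1

# Read one accession per line and fetch (or print) each run in turn
fetch_runs() {
  while IFS= read -r acc; do
    # skip blank lines
    if [ -z "$acc" ]; then
      continue
    fi
    if [ "$DRY_RUN" -eq 1 ]; then
      echo "fasterq-dump --outdir $outdir --progress $acc"
    else
      fasterq-dump --outdir "$outdir" --progress "$acc"
    fi
  done < "$1"
}

planned=$(fetch_runs "$acc_list")
echo "$planned"
```

Processing one accession per call means a failed run can be retried on its own, at the cost of starting the tool once per file.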