--- title: "Week 02: Accessing SRA Data from NCBI" author: "Celeste Valdivia" date: "2023-04-06" output: md_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, # Display code chunks eval = FALSE, # Evaluate code chunks warning = FALSE, # Hide warnings message = FALSE, # Hide messages fig.width = 6, # Set plot width in inches fig.height = 4, # Set plot height in inches fig.align = "center" # Align plots to the center ) ``` ## 1. Identify the RNAseq data files (accession numbers) intend to use I will be using transcriptome data from the BioProject [PRJNA579844](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA579844). - On this page, select the numeric value next to "SRA Experiments" which indicates the number of runs in this project. PRJNA579844 has 90 sequence reads available. We will focus on four reads for the sake of simplicity. - On the top corner select the drop down menu "send to" and select "Run Selector" - Run selector will show you all the runs available for the BioProject of interest with all the metadata. You can download metadata or accession list (SRR numbers) to keep track of your runs of interest. I downloaded the metadata and added it to this project repo's ReadMe. Below is a simplified list of the sequences of interest. We will be evaluating two replicates of early and late blastogenic stages in the marine model tunicate *Botryllus schlosseri*. | Run | BioSample | dev_stage | Experiment | Sample Name | |------|------|------|------|------| | SRR10352205 | SAMN13134439 | b5-b6 | SRX7061986 | HISeq-Sample_3411_IL2574-4 | | SRR10352206 | SAMN13134437 | b5-b6 | SRX7061984 | 4017_944axbyd196_stageC1to2_SecBud | | SRR10352224 | SAMN13134463 | TO | SRX7061961 | 3902_Sc109e_olds_zooid_D\_early_mid_kp | | SRR10352254 | SAMN13134462 | TO | SRX7061960 | 3890_Sc109e_old_zooid_D\_mid_Late_kp | ## Identify where you want to store your data ```{r, engine='bash'} # move to large data hardrive in Raven cd /home/shared/8TB_HDD_01 # if you don't already have a folder in this harddrive run the code below to make a new directory with a unique name #NOTE: IF YOU DON'T MAKE THE DIRECTORY YOURSELF YOU WILL NOT BE ABLE TO MOVE OR ADD FILES #mkdir cvaldi/tunicate-fish546/raw ``` ## 2. Accessing data In the below code chunk we are using the fasterq-dump command in the NCBI SRA Toolkit which is pre-installed in Raven. Please see these directions for [installing](https://github.com/ncbi/sra-tools/wiki/02.-Installing-SRA-Toolkit) the SRA ToolKit if you have not done so in your working environment already. The --outdir portion of the code will indicate the location in which your fastq files will be downloaded. Below that are the SRR I am using for this project. ```{r, engine = 'bash'} # here we can bypass the prefetch command in the SRA Toolkit by using the fasterq-dump command directly, I believe bypassing this step will just make the download time longer /home/shared/sratoolkit.3.0.2-ubuntu64/bin/./fasterq-dump \ --outdir /home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw \ --progress \ SRR10352205 \ SRR10352206 \ SRR10352224 \ SRR10352254 ``` ## 3. Generate checksums For the purpose of maintaining data integrity, we will generate checksums which are unique strings of numerical and character values that will inform us if the contents of each file remain the same over the upcoming workflow. ```{r, engine = 'bash'} cd raw md5sum *.fastq > md5.original.download.20230413 ``` The following code chunk will show us the checksums in the new md5 file we made. ```{r, engine ='bash'} cat md5.original.download.20230413 ``` ## 4. Checking what's in your directory This is just as a reminder that you can check how much space your files are taking up with the following command: ```{r, engine ='bash'} cd /home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw du -h --max-depth=1 ```