---
title: "RNAseq with Ariana"
author: "Celeste Valdivia"
date: '2023-04-06'
output: md_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

Questions to answer with RNAseq: the type of question changes the design and sequencing approach.

- What are the novel or rare transcripts in the transition from in vivo to in vitro?
- What are the effects of stressors over that same time scale?

# How do we design an RNAseq experiment?

Things to consider:

### Type of sequencing

- Total RNAseq is the gold standard -> we will probably use this for our project
- Single-cell RNAseq could potentially be interesting for us?
- On top of the different types of RNAseq, you can also choose single-end or paired-end sequencing. Paired-end adds a reverse read of each fragment, which gives higher accuracy. That is useful when looking for rare transcripts or when you need to know the details of the sequence itself, but it is more expensive!
- Extracting RNA:
    - PolyA selection or ribosomal RNA (rRNA) depletion:
        - Depends on your sequencing target and what you are looking for
        - What method does the QIAGEN RNA extraction kit use?
- TagSeq instead of RNAseq:
    - Efficient alternative to RNAseq that adds a unique molecular identifier for each transcript

### Sampling protocols

- It is important to randomize samples across any technical divisions of your process (like RNA extraction date)
- Use consistent storage methods and take detailed notes about them so that you can look for batch effects (e.g., evaluating how RNA abundance changes over the days you did your work)
- Consider how sampling may influence rapid molecular responses
    - Does sampling itself influence expression (e.g., biopsies, stress on the organism)?
    - What is the length of time between treatment and preservation?
    - Sequencing date? How was the RNA handled by the core facility?
- Include extra samples outside of your core ones so that you can send them off for preliminary tests that give you an idea of the range of the data (alternative plan: replicate highly enough that you can afford to lose those samples)
    - If you use your true samples for this, you may introduce batch effects and have to pull them from the study

### Replication and sequence depth

#### Types of Replication:

- Technical Replication
    - Replicates drawn from, for example, the same cell line, same dish, same tank, or same individual (replicates of one biological replicate)
    - Duplicate sampling to test assay accuracy
- Biological Replication
    - Different individual, time point, treatment, tank
- Pseudo-replication
    - Individual observations are dependent on each other
    - Multiple samples from the same individual, multiple samples from the same tank
    - These effects are difficult to get rid of, so use statistical methods to account for them (e.g., linear mixed modeling)

#### Power analysis:

- You should run power analyses before doing an experiment

If you don't have the following things you will need to do a shot-in-the-dark power analysis.
- Expected effect size
    - Use previous literature or preliminary studies to identify the difference between treatments
- Tell the program your power threshold
- Give it the coefficient of variation (CV) that you expect your population to have

#### Sequencing Depth:

- On the sequencing end, your statistical power will also depend on your **sequencing depth**
- Plan for sufficient sequencing depth, but don't spend money on more depth than you need; put it towards more biological replication instead
    - If looking for rare transcripts you will need higher depth

### Batch effects

- Randomize across all possible aspects of your experiment and keep track of every process so you can check for batch effects later

### Cost

$$
\text{cost (USD)} = n_{\text{replicates}} \times \text{library prep cost} + n_{\text{replicates}} \times \text{sequencing depth per replicate} \sim \text{sequencing length}
$$

- Cost of sequencing is often estimated by lane. You can share a lane with other projects to save money.
- Total RNAseq = ~$200-300 per sample (but this varies)

# Bioinformatics Workflow & Core Analysis

### 1. QA/QC

Within a MultiQC report:

Phred score = how confident the instrument was that it read the correct nucleotide.

- A dip at the beginning or end of the read is normal; you can trim that data off

GC content

- Each organism has a peak at a characteristic percentage, so contamination would show up as two peaks on the plot.

N content

- The machine produces an N when it is not sure what base is there

### 2. Trimming and cleaning

You can use fastp to do so

### 3. Quality control of trimmed sequences

### 4. Align sequences to reference

Read [Conesa et al. 2016](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8) to identify what type of mapping can be done.

### 5. Assemble sequences

### 6. Generate a gene count matrix

# Comparisons of your gene count matrix

DESeq2 is typically used for pairwise comparisons (e.g., control vs some treatment). For my work, however, I want to compare expression over time, so maybe a [**Weighted Gene Co-Expression Network Analysis (WGCNA)**](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-559) would be better.

- Can identify co-expression, then construct a network of regulatory genes.
- Identify modules (pathways) for module-based analysis
- Relate modules to external information
    - Array info: clinical data, SNPs, proteomics
    - Gene info: ontology, functional enrichment
- Study module relationships
- Find key drivers
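
To make the co-expression steps listed above more concrete, here is a minimal sketch in base R. It is not the WGCNA package itself, only a simplified illustration of the same idea: correlate genes, group them into modules, summarize each module, and relate it to an external variable. The data are simulated, and every name here (`sampling_day`, `time_program`, `simulate_genes`, and so on) is hypothetical. The soft-threshold power of 6 and the choice of two modules are assumptions for this toy data; in a real analysis they would be chosen from the data.

```r
# Simplified co-expression sketch in base R (NOT the WGCNA package).
# All data are simulated and all names are hypothetical.
set.seed(42)

sampling_day <- rep(1:4, each = 3)   # assumed design: 4 days x 3 replicates
genes_per_program <- 8

# Two hidden expression programs: one follows time, one is unrelated to time.
time_program  <- sampling_day - mean(sampling_day)
other_program <- rep(c(-1, 0, 1), times = 4)

# Each simulated gene is its program plus a little noise (samples x genes).
simulate_genes <- function(program, n_genes) {
  sapply(seq_len(n_genes), function(i) program + rnorm(length(program), sd = 0.3))
}
expr <- cbind(simulate_genes(time_program, genes_per_program),
              simulate_genes(other_program, genes_per_program))
colnames(expr) <- paste0("gene", seq_len(ncol(expr)))

# 1. Co-expression: absolute correlation raised to a soft-threshold power.
soft_power <- 6                                   # assumed value
adjacency  <- abs(cor(expr))^soft_power

# 2. Modules: hierarchical clustering on dissimilarity, cut into groups.
gene_tree <- hclust(as.dist(1 - adjacency), method = "average")
modules   <- cutree(gene_tree, k = 2)             # k = 2 assumed for toy data

# 3. Module summary: first principal component of each module's expression.
module_ids <- sort(unique(modules))
eigengenes <- sapply(module_ids, function(m) {
  prcomp(expr[, modules == m, drop = FALSE], scale. = TRUE)$x[, 1]
})
colnames(eigengenes) <- paste0("module", module_ids)

# 4. Relate modules to external information (here, sampling day).
module_day_cor <- cor(eigengenes, sampling_day)

print(table(modules))
print(round(module_day_cor, 2))
```

The sign of a principal component is arbitrary, so only the magnitude of each module-to-day correlation is meaningful here. With real data, the WGCNA package handles choosing the soft threshold and detecting modules far more carefully than this sketch does.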