--- title: "15-hisat2-deseq2-summer2022" output: html_document date: "2024-09-02" --- Follow code from 03-hisat2-summer2021_2022.Rmd. A lot of steps were done in the beginning that don't need to be repeated. Summer 2022 `/home/shared/8TB_HDD_02/graceac9/data/pycno2022` ```{r, engine='bash'} cd ../data ``` ```{r, engine='bash', eval=TRUE} head ../data/augustus.hints.gtf ``` For `stringtie` later, the gtf needs to be in gff format. There isn't currently one in existence that works. Steven figured out how to fix this issue (https://sr320.github.io/tumbling-oysters/posts/sr320-20-seastar/) Won't run code because Steven did this already and the new file lives in: ` ../analyses/12-fix-gff/mod_augustus.gtf` ```{r} library(knitr) library(tidyverse) library(Biostrings) library(pheatmap) library(DESeq2) library(tidyverse) ``` # 1. Build an index Follow this code: https://htmlpreview.github.io/?https://github.com/urol-e5/deep-dive/blob/8fd4ad4546d1d95464952f0509406efd9e42ffa0/D-Apul/code/04-Apulcra-hisat.html ```{bash} pwd ``` From the gtf, get exon list ```{bash} /home/shared/hisat2-2.2.1/hisat2_extract_exons.py \ ../analyses/12-fix-gff/mod_augustus.gtf \ > ../analyses/15-hisat2-deseq2-summer2022/m_exon.tab ``` ```{bash} head ../analyses/15-hisat2-deseq2-summer2022/m_exon.tab ``` from the gtf, get the splice sites ```{bash} #!/bin/bash # This script will extract splice sites from the gtf file # This is the command to extract splice sites from the gtf file /home/shared/hisat2-2.2.1/hisat2_extract_splice_sites.py \ ../analyses/12-fix-gff/mod_augustus.gtf \ > ../analyses/15-hisat2-deseq2-summer2022/m_splice_sites.tab ``` use the genome fasta to make an index for alignment ```{bash} # build an index /home/shared/hisat2-2.2.1/hisat2-build \ ../data/ncbi_dataset/data/GCA_032158295.1/GCA_032158295.1_ASM3215829v1_genomic.fna \ ../data/Phel_genome_index \ --exon ../analyses/15-hisat2-deseq2-summer2022/m_exon.tab \ --ss ../analyses/15-hisat2-deseq2-summer2022/m_splice_sites.tab \ -p 40 \ ../analyses/12-fix-gff/mod_augustus.gtf \ 2> ../analyses/15-hisat2-deseq2-summer2022/hisat2-build_stats.txt ``` ```{bash} tail ../analyses/03-hisat2/hisat2-build_stats.txt ``` # 2. ALIGNMENT Trimmed data in `/home/shared/8TB-`/home/shared/8TB_HDD_02/graceac9/data/pycno2022` ```{bash} pwd ``` ```{bash} for file in ../../../data/pycno2022/*_R1*fq.gz; do # Remove the part to get the base name base=$(basename "$file" _R1_001.fastq.gz.fastp-trim.20231101.fq.gz) # Construct the names of the pair of files file1=${base}_R1_001.fastq.gz.fastp-trim.20231101.fq.gz file2=${base}_R2_001.fastq.gz.fastp-trim.20231101.fq.gz # Run the hisat2 command /home/shared/hisat2-2.2.1/hisat2 \ -x ../data/Phel_genome_index \ --dta \ -p 20 \ -1 ../../../data/pycno2022/$file1 \ -2 ../../../data/pycno2022/$file2 \ -S ../analyses/15-hisat2-deseq2-summer2022/${base}.sam \ 2> ../analyses/15-hisat2-deseq2-summer2022/${base}-hisat.out done ``` ```{bash} cat ../analyses/15-hisat2-deseq2-summer2022/*-hisat.out ``` Convert sam to bam ```{bash} for samfile in ../analyses/15-hisat2-deseq2-summer2022/*.sam; do bamfile="${samfile%.sam}.bam" sorted_bamfile="${samfile%.sam}.sorted.bam" /home/shared/samtools-1.12/samtools view -bS -@ 20 "$samfile" > "$bamfile" /home/shared/samtools-1.12/samtools sort -@ 20 "$bamfile" -o "$sorted_bamfile" /home/shared/samtools-1.12/samtools index -@ 20 "$sorted_bamfile" done ``` ```{bash} ls ../analyses/15-hisat2-deseq2-summer2022/*sorted.bam | wc -l ``` ```{r, engine='bash', eval=TRUE} cat ../analyses/15-hisat2-deseq2-summer2022/*hisat.out \ | grep "overall alignment rate" ``` # `stringtie` ```{bash} find ../analyses/15-hisat2-deseq2-summer2022/*sorted.bam \ | xargs basename -s .sorted.bam | xargs -I{} \ /home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \ -p 36 \ -eB \ -G ../analyses/13-hisat-deseq2/mod_augustus.gff \ -o ../analyses/15-hisat2-deseq2-summer2022/{}.gtf \ ../analyses/15-hisat2-deseq2-summer2022/{}.sorted.bam ``` ```{bash} eval "$(/opt/anaconda/anaconda3/bin/conda shell.bash hook)" conda activate which multiqc multiqc ../analyses/15-hisat2-deseq2-summer2022/ \ -o ../analyses/15-hisat2-deseq2-summer2022/ ``` # get a count matrix ```{bash} ls ../analyses/15-hisat2-deseq2-summer2022/*gtf ``` copy the above and put into a txt file write library name to left of path ```{bash} cat ../analyses/15-hisat2-deseq2-summer2022/list01.txt ``` ```{r, engine='bash'} python /home/shared/stringtie-2.2.1.Linux_x86_64/prepDE.py \ -i ../analyses/15-hisat2-deseq2-summer2022/list01.txt \ -g ../data/gene_count_matrix_2022.csv \ -t ../data/transcript_count_matrix_2022.csv ``` ```{bash} head ../data/*matrix_2022.csv ``` # `DESEQ2` for Experiment C ```{r} library(DESeq2) library(dplyr) library(ggplot2) library(pheatmap) library("pcaExplorer") library("airway") ``` Install 2024-09-06 - `pcaExplorer`: ```{r} #if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") #BiocManager::install("pcaExplorer") ``` ```{r} #if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") #BiocManager::install("airway") ``` ## Explore all the samples with `pcaExplorer` made a coldata frame in google sheets https://docs.google.com/spreadsheets/d/1tZuBu8wFnKgh_mvZY8jpleF4nzMsQh-NOHZ1YKoeyqU/edit?gid=0#gid=0 downloaded as .csv, then copy-pasted into data directory and pushed to github in `pcaExplorere`, can select the count matrices to import it's call coldata2022.csv in data in this repo ```{r} pcaExplorer() ``` saved a bunch of plots can be found in analyses/15-hisat2-deseq2-summer2022/pcaExplorer_plots ## Use `DESeq2` to compare the samples that `pcaExplorer` showed would be interesting Want a balanced comparison, so I'll be comparing the libraries shown in the plot [noday0_norepeatstars_size_disease_signs_3pergroup_samplesPca_sampleout.pdf](https://github.com/grace-ac/paper-pycno-sswd-2021-2022/blob/main/analyses/15-hisat2-deseq2-summer2022/pcaExplorer_plots/noday0_norepeatstars_size_disease_signs_3pergroup_samplesPca_sampleout.pdf) I'm not sure why the sample IDs got all cut-off, but below is a table detailing the samples: | library_ID | star_ID | star_age_class | treatment | experiment_day | total_armdrop | total_armtwist | disease_sign | |------------|---------|----------------|-----------|----------------|---------------|----------------|---------------| | PSC.0186 | 15 | adult | control | 11 | 0 | 0 | healthy | | PSC.0203 | 31 | adult | control | 12 | 0 | 0 | healthy | | PSC.0209 | 9 | adult | control | 13 | 0 | 0 | healthy | | PSC.0177 | 34 | juvenile | control | 11 | 0 | 0 | healthy | | PSC.0219 | 63 | juvenile | control | 14 | 0 | 0 | healthy | | PSC.0230 | 65 | juvenile | control | 15 | 0 | 0 | healthy | | PSC.0174 | 30 | adult | exposed | 10 | 1 | 6 | first_armdrop | | PSC.0190 | 10 | adult | exposed | 11 | 1 | 0 | first_armdrop | | PSC.0231 | 13 | adult | exposed | 15 | 1 | 5 | first_armdrop | | PSC.0187 | 57 | juvenile | exposed | 11 | 1 | 1 | first_armdrop | | PSC.0188 | 38 | juvenile | exposed | 11 | 1 | 0 | first_armdrop | | PSC.0228 | 61 | juvenile | exposed | 14 | 1 | 0 | first_armdrop | ## `DESeq2` --> Differentially expressed genes and transcripts for the above library comparisons ### DIFFERENTIALLY EXPRESSED TRANSCRIPTS Read in counts: ```{r} counts2022 <- read.csv("../data/transcript_count_matrix_2022.csv", row.names = "transcript_id") head(counts2022) ``` subset just the counts for the libraries specified in the table above: ```{r} counts2022sub <- select(counts2022, PSC.0228, PSC.0187, PSC.0188, PSC.0174, PSC.0190, PSC.0231, PSC.0230, PSC.0219, PSC.0177, PSC.0186, PSC.0209, PSC.0203) head(counts2022sub) ``` create a text file called "conditions.txt" in `/analyses/15-hisat2-deseq2-summer2022/` and put the metadata for the libraries in it, with tabs between each column ```{r} deseq2.colData <- data.frame(condition=factor(c("exposed", "exposed", "exposed", "exposed", "exposed", "exposed", "control", "control", "control", "control", "control", "control")), type=factor(rep("paired-end", 12)), age=factor(c("juvenile", "juvenile", "juvenile", "adult", "adult", "adult", "juvenile", "juvenile", "juvenile", "adult", "adult", "adult"))) rownames(deseq2.colData) <- colnames(counts2022sub) ``` set counts as matrix: ```{r} counts2022sub <- as.matrix(counts2022sub) ``` ```{r} # Check all sample IDs in colData are also in CountData and match their orders all(rownames(deseq2.colData) %in% colnames(counts2022sub)) #should return TRUE ``` ```{r} counts2022sub <- counts2022sub[, rownames(deseq2.colData)] all(rownames(deseq2.colData) == colnames(counts2022sub)) # This should also return TRUE ``` modelled after crab paper https://github.com/RobertsLab/paper-tanner-crab/blob/master/scripts/DESeq.Rmd ```{r} # Create a DESeqDataSet from count matrix and labels, with the design taking age into account dds <- DESeqDataSetFromMatrix(countData = counts2022sub, colData = deseq2.colData, design = ~ condition + age) ``` Check levels of age and condition ```{r} levels(dds$age) ``` ```{r} levels(dds$condition) ``` note: `DESeq2` automatically puts the levels in alphabetical order and the first listed level is the reference level for the factor. We want control to be the reference. ```{r} dds$condition = relevel(dds$condition, "control") levels(dds$condition) ``` The following will pull the results from `condition` because that is our variable of interest. This tells us how age contributes to the DEGs ```{r} design(dds) <- formula(~ age + condition) dds <- DESeq(dds) ``` Access results: ```{r} deseq2.res <- results(dds) head(deseq2.res) ``` ```{r} summary(deseq2.res) ``` ```{r} # Count number of hits with adjusted p-value less then 0.05 dim(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ]) ``` 6189 dets between control and exposed, impacted by age.... ```{r} inf <- deseq2.res # The main plot plot(inf$baseMean, inf$log2FoldChange, pch=20, cex=0.45, ylim=c(-15, 15), log="x", col="darkgray", #main="Infection Status (pval