--- title: "20-hisat2-genecount-matrices" output: html_document date: "2024-09-27" --- Rmd to use `hisat2` to get gene count matrices for use in `DESeq2` and to annotate with gene IDs retrieved in [paper-pycno-sswd-2021-2022/code/19-get-gene-annotations.Rmd](https://github.com/grace-ac/paper-pycno-sswd-2021-2022/blob/main/code/19-get-gene-annotations.Rmd) Modelling after [paper-pycno-sswd-2021-2022/code/03-hisat2-summer_2021_2022.Rmd](https://github.com/grace-ac/paper-pycno-sswd-2021-2022/blob/main/code/03-hisat2-summer_2021_2022.Rmd) All trimmed reads from summer 2021 and 2022 live on Raven: Summer 2021 `/home/shared/8TB_HDD_02/graceac9/data/pycno2021` Summer 2022 `/home/shared/8TB_HDD_02/graceac9/data/pycno2022` Get genome from NCBI into data directory on Raven: https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_032158295.1/ ```{r, engine='bash'} cd ../data ``` ```{r, engine='bash', eval=TRUE} head ../data/augustus.hints.gtf ``` For `stringtie` later, the gtf needs to be in gff format. There isn't currently one in existence that works. Steven figured out how to fix this issue (https://sr320.github.io/tumbling-oysters/posts/sr320-20-seastar/) Won't run code because Steven did this already and the new file lives in: ` ../analyses/12-fix-gff/mod_augustus.gtf` ```{r} library(knitr) library(tidyverse) library(Biostrings) library(pheatmap) library(DESeq2) library(tidyverse) ``` # 1. BUILD AN INDEX Follow this code: https://htmlpreview.github.io/?https://github.com/urol-e5/deep-dive/blob/8fd4ad4546d1d95464952f0509406efd9e42ffa0/D-Apul/code/04-Apulcra-hisat.html ```{bash} pwd ``` From the gtf, get exon list ```{bash} /home/shared/hisat2-2.2.1/hisat2_extract_exons.py \ ../analyses/12-fix-gff/mod_augustus.gtf \ > ../analyses/20-hisat2/m_exon.tab ``` ```{bash} head ../analyses/20-hisat2/m_exon.tab ``` from the gtf, get the splice sites ```{bash} #!/bin/bash # This script will extract splice sites from the gtf file # This is the command to extract splice sites from the gtf file /home/shared/hisat2-2.2.1/hisat2_extract_splice_sites.py \ ../analyses/12-fix-gff/mod_augustus.gtf \ > ../analyses/20-hisat2/m_splice_sites.tab ``` use the genome fasta to make an index for alignment ```{bash} # build an index /home/shared/hisat2-2.2.1/hisat2-build \ ../data/ncbi_dataset/data/GCA_032158295.1/GCA_032158295.1_ASM3215829v1_genomic.fna \ ../data/Phel_genome_index \ --exon ../analyses/20-hisat2/m_exon.tab \ --ss ../analyses/20-hisat2/m_splice_sites.tab \ -p 40 \ ../analyses/12-fix-gff/mod_augustus.gtf \ 2> ../analyses/20-hisat2/hisat2-build_stats.txt ``` ```{bash} tail ../analyses/20-hisat2/hisat2-build_stats.txt ``` # 2. ALIGNMENT 2021 RNAseq Data ```{bash} for file in ../data/2021_trimmed/*_R1*fq.gz; do # Remove the part to get the base name base=$(basename "$file" _R1_001.fastq.gz.fastp-trim.20220810.fq.gz) # Construct the names of the pair of files file1=${base}_R1_001.fastq.gz.fastp-trim.20220810.fq.gz file2=${base}_R2_001.fastq.gz.fastp-trim.20220810.fq.gz # Run the hisat2 command /home/shared/hisat2-2.2.1/hisat2 \ -x ../data/Phel_genome_index \ --dta \ -p 20 \ -1 ../data/2021_trimmed/$file1 \ -2 ../data/2021_trimmed/$file2 \ -S ../analyses/20-hisat2/${base}.sam \ 2> ../analyses/20-hisat2/${base}-hisat.out done ``` ```{bash} cat ../analyses/20-hisat2/*-hisat.out ``` ```{bash} pwd ``` Convert sam to bam ```{bash} for samfile in ../analyses/20-hisat2/*.sam; do bamfile="${samfile%.sam}.bam" sorted_bamfile="${samfile%.sam}.sorted.bam" /home/shared/samtools-1.12/samtools view -bS -@ 20 "$samfile" > "$bamfile" /home/shared/samtools-1.12/samtools sort -@ 20 "$bamfile" -o "$sorted_bamfile" /home/shared/samtools-1.12/samtools index -@ 20 "$sorted_bamfile" done ``` ```{bash} ls ../analyses/20-hisat2/*sorted.bam | wc -l ``` Woo! 32 .bam files converted from the 32 .sam files. ```{bash} ls ../analyses/20-hisat2/*sam | wc -l ``` ```{r, engine='bash', eval=TRUE} cat ../analyses/20-hisat2/*hisat.out \ | grep "overall alignment rate" ``` ## `stringtie` ```{bash} find ../analyses/20-hisat2/*sorted.bam \ | xargs basename -s .sorted.bam | xargs -I{} \ /home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \ -p 36 \ -eB \ -G ../data/augustus.hints.gtf \ -o ../analyses/20-hisat2/{}.gtf \ ../analyses/20-hisat2/{}.sorted.bam ``` steven did this to get the gtf into gff for `stringtie` use ```{bash} #/home/shared/gffread-0.12.7.Linux_x86_64/gffread \ #../analyses/12-fix-gff/mod_augustus.gtf \ #-T \ #-o ../analyses/13-hisat-deseq2/mod_augustus.gff ``` ```{bash} find ../analyses/20-hisat2/*sorted.bam \ | xargs basename -s .sorted.bam | xargs -I{} \ /home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \ -p 36 \ -eB \ -G ../analyses/13-hisat-deseq2/mod_augustus.gff \ -o ../analyses/20-hisat2/{}.gtf \ ../analyses/20-hisat2/{}.sorted.bam ``` ```{bash} eval "$(/opt/anaconda/anaconda3/bin/conda shell.bash hook)" conda activate which multiqc multiqc ../analyses/20-hisat2/ \ -o ../analyses/20-hisat2/multiqc2021/ ``` ## get a count matrix ```{bash} ls ../analyses/20-hisat2/*gtf ``` copy the above and put into a txt file, add the sample ID before each path called 2021.txt ```{bash} cat ../analyses/20-hisat2/2021.txt ``` ```{r, engine='bash'} python /home/shared/stringtie-2.2.1.Linux_x86_64/prepDE.py \ -i ../analyses/20-hisat2/2021.txt \ -g ../data/gene_count_matrix_2021.csv \ -t ../data/transcript_count_matrix_2021.csv ``` ```{bash} head ../data/*matrix_2021.csv ``` # 3. ALIGNMENT 2022 RNAseq Data pycno2022 trimmed data is in: `/home/shared/8TB_HDD_02/graceac9/data/pycno2022` on Raven moving it into the data/2022_trimmed directory in this repo: ``` rsync --archive --progress --verbose PSC*fq.gz graceac9@raven.fish.washington.edu:/home/shared/8TB_HDD_02/graceac9/GitHub/paper-pycno-sswd-2021-2022/data/2022_trimmed ``` run `rsync` while in the `/home/shared/8TB_HDD_02/graceac9/data/pycno2022` directory. ```{bash} for file in ../data/2022_trimmed/*_R1*fq.gz; do # Remove the part to get the base name base=$(basename "$file" _R1_001.fastq.gz.fastp-trim.20231101.fq.gz) # Construct the names of the pair of files file1=${base}_R1_001.fastq.gz.fastp-trim.20231101.fq.gz file2=${base}_R2_001.fastq.gz.fastp-trim.20231101.fq.gz # Run the hisat2 command /home/shared/hisat2-2.2.1/hisat2 \ -x ../data/Phel_genome_index \ --dta \ -p 20 \ -1 ../data/2022_trimmed/$file1 \ -2 ../data/2022_trimmed/$file2 \ -S ../analyses/20-hisat2/2022/${base}.sam \ 2> ../analyses/20-hisat2/2022/${base}-hisat.out done ``` ```{bash} cat ../analyses/20-hisat2/2022/*-hisat.out ``` ```{bash} ls ../analyses/20-hisat2/2022/*sam | wc -l ``` Convert sam to bam ```{bash} for samfile in ../analyses/20-hisat2/2022/*.sam; do bamfile="${samfile%.sam}.bam" sorted_bamfile="${samfile%.sam}.sorted.bam" /home/shared/samtools-1.12/samtools view -bS -@ 20 "$samfile" > "$bamfile" /home/shared/samtools-1.12/samtools sort -@ 20 "$bamfile" -o "$sorted_bamfile" /home/shared/samtools-1.12/samtools index -@ 20 "$sorted_bamfile" done ``` ```{bash} ls ../analyses/20-hisat2/2022/*sorted.bam | wc -l ``` woooo ```{r, engine='bash', eval=TRUE} cat ../analyses/20-hisat2/2022/*hisat.out \ | grep "overall alignment rate" ``` ## `stringtie` ```{bash} find ../analyses/20-hisat2/2022/*sorted.bam \ | xargs basename -s .sorted.bam | xargs -I{} \ /home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \ -p 36 \ -eB \ -G ../data/augustus.hints.gtf \ -o ../analyses/20-hisat2/2022/{}.gtf \ ../analyses/20-hisat2/2022/{}.sorted.bam ``` ```{bash} ls ../analyses/20-hisat2/2022/*sorted.bam | wc -l ``` ```{bash} find ../analyses/20-hisat2/2022/*sorted.bam \ | xargs basename -s .sorted.bam | xargs -I{} \ /home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \ -p 36 \ -eB \ -G ../analyses/13-hisat-deseq2/mod_augustus.gff \ -o ../analyses/20-hisat2/2022/{}.gtf \ ../analyses/20-hisat2/2022/{}.sorted.bam ``` ```{bash} eval "$(/opt/anaconda/anaconda3/bin/conda shell.bash hook)" conda activate which multiqc multiqc ../analyses/20-hisat2/2022/ \ -o ../analyses/20-hisat2/multiqc2022/ ``` ## get a count matrix ```{bash} ls ../analyses/20-hisat2/2022/*gtf ``` copy the above and put into a txt file, add the sample ID before each path called 2022.txt ```{bash} cat ../analyses/20-hisat2/2022/2022.txt ``` ```{r, engine='bash'} python /home/shared/stringtie-2.2.1.Linux_x86_64/prepDE.py \ -i ../analyses/20-hisat2/2022/2022.txt \ -g ../data/gene_count_matrix_2022.csv \ -t ../data/transcript_count_matrix_2022.csv ``` ```{bash} head ../data/*matrix_2022.csv ```