--- title: "14-hisat2_derm" output: html_document date: "2025-02-20" --- Rmd to get gene and transcript count matrices for _Dermasterias imbricata_ from day 12 from the multispecies experiment. Modelling after [paper-pycno-sswd-2021-2022/code/03-hisat2-summer_2021_2022.Rmd](https://github.com/grace-ac/paper-pycno-sswd-2021-2022/blob/main/code/03-hisat2-summer_2021_2022.Rmd) Untrimmed reads: `/home/shared/8TB_HDD_02/graceac9/multispecies2023` Trimmed reads: `/home/shared/8TB_HDD_02/graceac9/GitHub/project-pycno-multispecies-2023/output/11-multi-fastp/derm` | Library_ID | Bin_Number | |------------|------------| | 517 | 1 | | 523 | 2 | | 529 | 3 | | 535 | 4 | | 547 | 6 | | 559 | 8 | # KEY FILES Key files and genomes can be found linked in this wikipage: [project-pycno-multispecies-2023/wiki/Key-Files](https://github.com/grace-ac/project-pycno-multispecies-2023/wiki/Key-Files) Need to get genome fasta into `/home/shared/8TB_HDD_02/graceac9/GitHub/project-pycno-multispecies-2023/data/` Downloaded fasta file from Melissa DeBiasse microsoft folder. Then `rsync` from my personal laptop onto raven into the `/home/shared/8TB_HDD_02/graceac9/GitHub/project-pycno-multispecies-2023/data/`: navigate in terminal on laptop to my Downloads folder, then run: `rsync --archive --progress --verbose derm_imbr_genome.fa graceac9@raven.fish.washington.edu:/home/shared/8TB_HDD_02/graceac9/GitHub/project-pycno-multispecies-2023/data/Newest_Derm_files` File is now in data, and it is called: `derm_imbr_genome.fa` ```{bash} #head ../data/Newest_Derm_files/derm_imbr_genome.fa ``` ```{bash} grep -c ">" ../data/Newest_Derm_files/derm_imbr_genome.fa ``` ```{bash} pwd ``` ```{bash} gunzip -k ../data/Newest_Derm_files/Dermasterias_imbricata/augustus.hints.gtf.gz head -3 ../data/Newest_Derm_files/Dermasterias_imbricata/augustus.hints.gtf ``` # LOAD PACKAGES ```{r} library(knitr) library(tidyverse) library(Biostrings) library(pheatmap) library(DESeq2) library(tidyverse) ``` # 1. BUILD AN INDEX Follow this code: https://htmlpreview.github.io/?https://github.com/urol-e5/deep-dive/blob/8fd4ad4546d1d95464952f0509406efd9e42ffa0/D-Apul/code/04-Apulcra-hisat.html ```{bash} pwd ``` From the gtf, get exon list ```{bash} /home/shared/hisat2-2.2.1/hisat2_extract_exons.py \ ../data/Newest_Derm_files/Dermasterias_imbricata/augustus.hints.gtf \ > ../output/14-hisat2_derm/m_exon.tab ``` ```{bash} head ../output/14-hisat2_derm/m_exon.tab ``` from the gtf, get the splice sites ```{bash} #!/bin/bash # This script will extract splice sites from the gtf file # This is the command to extract splice sites from the gtf file /home/shared/hisat2-2.2.1/hisat2_extract_splice_sites.py \ ../data/Newest_Derm_files/Dermasterias_imbricata/augustus.hints.gtf \ > ../output/14-hisat2_derm/m_splice_sites.tab ``` ```{bash} head ../output/14-hisat2_derm/m_splice_sites.tab ``` use the genome fasta to make an index for alignment ```{bash} # build an index /home/shared/hisat2-2.2.1/hisat2-build \ ../data/Newest_Derm_files/derm_imbr_genome.fa \ ../data/Derm_genome_index \ --exon ../output/14-hisat2_derm/m_exon.tab \ --ss ../output/14-hisat2_derm/m_splice_sites.tab \ -p 40 \ 2> ../output/14-hisat2_derm/hisat2-build_stats.txt ``` ```{bash} tail ../output/14-hisat2_derm/hisat2-build_stats.txt ``` Need to move the RNAseq files for _D. imbricata_ into own folder: now live in: `/home/shared/8TB_HDD_02/graceac9/GitHub/project-pycno-multispecies-2023/output/11-multi-fastp/derm` ```{bash} for file in ../output/11-multi-fastp/derm/*_R1*fq.gz; do # Remove the part to get the base name base=$(basename "$file" _R1.fastp-trim.fq.gz) # Construct the names of the pair of files file1=${base}_R1.fastp-trim.fq.gz file2=${base}_R2.fastp-trim.fq.gz # Run the hisat2 command /home/shared/hisat2-2.2.1/hisat2 \ -x ../data/Derm_genome_index \ --dta \ -p 20 \ -1 ../output/11-multi-fastp/derm/$file1 \ -2 ../output/11-multi-fastp/derm/$file2 \ -S ../output/14-hisat2_derm/${base}.sam \ 2> ../output/14-hisat2_derm/${base}-hisat.out done ``` ```{bash} cat ../output/14-hisat2_derm/*-hisat.out ``` ```{bash} pwd ``` Convert sam to bam ```{bash} for samfile in ../output/14-hisat2_derm/*.sam; do bamfile="${samfile%.sam}.bam" sorted_bamfile="${samfile%.sam}.sorted.bam" /home/shared/samtools-1.12/samtools view -bS -@ 20 "$samfile" > "$bamfile" /home/shared/samtools-1.12/samtools sort -@ 20 "$bamfile" -o "$sorted_bamfile" /home/shared/samtools-1.12/samtools index -@ 20 "$sorted_bamfile" done ``` ```{bash} ls ../output/14-hisat2_derm/*sorted.bam | wc -l ``` ```{r, engine='bash', eval=TRUE} cat ../output/14-hisat2_derm/*hisat.out \ | grep "overall alignment rate" ``` alignment rates look good!! do this from Steven to get gtf into gff: steven did this to get the gtf into gff for `stringtie` use ```{bash} /home/shared/gffread-0.12.7.Linux_x86_64/gffread \ ../data/Newest_Derm_files/Dermasterias_imbricata/augustus.hints.gtf \ -T \ -o ../data/Newest_Derm_files/Dermasterias_imbricata/mod.augustus.hints.gff ``` ```{bash} head ../data/Newest_Derm_files/Dermasterias_imbricata/mod.augustus.hints.gff ``` ## `stringtie` ```{bash} find ../output/14-hisat2_derm/*sorted.bam \ | xargs basename -s .sorted.bam | xargs -I{} \ /home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \ -p 36 \ -eB \ -G ../data/Newest_Derm_files/Dermasterias_imbricata/augustus.hints.gtf \ -o ../output/14-hisat2_derm/{}.gtf \ ../output/14-hisat2_derm/{}.sorted.bam ``` get a multiqc file on the alignment: ```{bash} eval "$(/opt/anaconda/anaconda3/bin/conda shell.bash hook)" conda activate which multiqc multiqc ../output/14-hisat2_derm/ \ -o ../output/14-hisat2_derm/multiqc_derm/ ``` navigate into folder: `~/GitHub/project-pycno-multispecies-2023/output/14-hisat2_derm/multiqc_derm` copy alignment multiqc report to OWL: `rsync --archive --progress --verbose multiqc_report_1.html grace@owl.fish.washington.edu:/volume1/web/gcrandall/multispeciesSSWD/QCreports` on Owl-- rename report to dermasterias_2023_alignment_multiqc_report.html # GET A COUNT MATRIX ```{bash} ls ../output/14-hisat2_derm/*gtf ``` copy the above and put into a txt file, add the sample ID before each path called derm2023.txt ```{bash} cat ../output/14-hisat2_derm/derm2023.txt ``` ```{bash} head -5 ../output/14-hisat2_derm/PSC-0529.gtf ``` ```{r, engine='bash'} python /home/shared/stringtie-2.2.1.Linux_x86_64/prepDE.py \ -i ../output/14-hisat2_derm/derm2023.txt \ -g ../data/derm_gene_count_matrix_2023.csv \ -t ../data/derm_transcript_count_matrix_2023.csv ``` ```{bash} head ../data/derm*matrix_2023.csv ```