--- title: "13-hisat2_pisaster" output: html_document date: "2025-02-20" --- Rmd to get gene and transcript count matrices for _Pisaster ochraceus_ from day 12 from the multispecies experiment. Modelling after [paper-pycno-sswd-2021-2022/code/03-hisat2-summer_2021_2022.Rmd](https://github.com/grace-ac/paper-pycno-sswd-2021-2022/blob/main/code/03-hisat2-summer_2021_2022.Rmd) Untrimmed reads: `/home/shared/8TB_HDD_02/graceac9/multispecies2023` Trimmed reads: `/home/shared/8TB_HDD_02/graceac9/GitHub/project-pycno-multispecies-2023/output/11-multi-fastp/pisaster` Only want the libraries from samples from _P. ochraceus_ at Day 12: | Library_ID | Bin_Number | |------------|------------| | 518 | 1 | | 524 | 2 | | 530 | 3 | | 536 | 4 | | 548 | 6 | | 560 | 8 | # KEY FILES Key files and genomes can be found linked in this wikipage: [project-pycno-multispecies-2023/wiki/Key-Files](https://github.com/grace-ac/project-pycno-multispecies-2023/wiki/Key-Files) Need to get genome fasta into `/home/shared/8TB_HDD_02/graceac9/GitHub/project-pycno-multispecies-2023/data/` Downloaded from https://ucmerced.app.box.com/s/uyq6kx52awjuiy0mfepcrwowt44yjqpf onto my laptop. Then `rsync` from my personal laptop onto raven into the `/home/shared/8TB_HDD_02/graceac9/GitHub/project-pycno-multispecies-2023/data/`: navigate in terminal on laptop to my Downloads folder, then run: `rsync --archive --progress --verbose jordan-uni3279-mb-hirise-slc3v__06-07-2022__final_assembly.fasta graceac9@raven.fish.washington.edu:/home/shared/8TB_HDD_02/graceac9/GitHub/project-pycno-multispecies-2023/data` File is now in data, and it is called: `jordan-uni3279-mb-hirise-slc3v__06-07-2022__final_assembly.fasta` ```{bash} head ../data/jordan-uni3279-mb-hirise-slc3v__06-07-2022__final_assembly.fasta ``` Get the .gtf file into `../data/` Download from [https://ucmerced.app.box.com/s/vsqqkvbl0ie39nu1457efi5emln7qggu](https://ucmerced.app.box.com/s/vsqqkvbl0ie39nu1457efi5emln7qggu) file called augustus.hints.gtf.gz rename to add pisaster: pisaster_augustus.hints.gtf.gz `rsync` from laptop to raven `../data/` folder: `rsync --archive --progress --verbose pisaster_augustus.hints.gtf.gz graceac9@raven.fish.washington.edu:/home/shared/8TB_HDD_02/graceac9/GitHub/project-pycno-multispecies-2023/data` unzip the file: ```{r, engine='bash'} #cd ../data #gunzip pisaster_augustus.hints.gtf.gz ``` ```{bash} head ../data/pisaster_augustus.hints.gtf ``` # LOAD PACKAGES ```{r} library(knitr) library(tidyverse) library(Biostrings) library(pheatmap) library(DESeq2) library(tidyverse) ``` # 1. BUILD AN INDEX Follow this code: https://htmlpreview.github.io/?https://github.com/urol-e5/deep-dive/blob/8fd4ad4546d1d95464952f0509406efd9e42ffa0/D-Apul/code/04-Apulcra-hisat.html ```{bash} pwd ``` From the gtf, get exon list ```{bash} #/home/shared/hisat2-2.2.1/hisat2_extract_exons.py \ #../data/pisaster_augustus.hints.gtf \ #> ../output/13-hisat2_pisaster/m_exon.tab ``` ```{bash} head ../output/13-hisat2_pisaster/m_exon.tab ``` from the gtf, get the splice sites ```{bash} #!/bin/bash # This script will extract splice sites from the gtf file # This is the command to extract splice sites from the gtf file /home/shared/hisat2-2.2.1/hisat2_extract_splice_sites.py \ ../data/pisaster_augustus.hints.gtf \ > ../output/13-hisat2_pisaster/m_splice_sites.tab ``` use the genome fasta to make an index for alignment ```{bash} # build an index /home/shared/hisat2-2.2.1/hisat2-build \ ../data/jordan-uni3279-mb-hirise-slc3v__06-07-2022__final_assembly.fasta \ ../data/Pisaster_genome_index \ --exon ../output/13-hisat2_pisaster/m_exon.tab \ --ss ../output/13-hisat2_pisaster/m_splice_sites.tab \ -p 40 \ ../data/pisaster_augustus.hints.gtf \ 2> ../output/13-hisat2_pisaster/hisat2-build_stats.txt ``` ```{bash} tail ../output/13-hisat2_pisaster/hisat2-build_stats.txt ``` ```{bash} for file in ../output/11-multi-fastp/pisaster/*_R1*fq.gz; do # Remove the part to get the base name base=$(basename "$file" _R1.fastp-trim.fq.gz) # Construct the names of the pair of files file1=${base}_R1.fastp-trim.fq.gz file2=${base}_R2.fastp-trim.fq.gz # Run the hisat2 command /home/shared/hisat2-2.2.1/hisat2 \ -x ../data/Pisaster_genome_index \ --dta \ -p 20 \ -1 ../output/11-multi-fastp/pisaster/$file1 \ -2 ../output/11-multi-fastp/pisaster/$file2 \ -S ../output/13-hisat2_pisaster/${base}.sam \ 2> ../output/13-hisat2_pisaster/${base}-hisat.out done ``` ```{bash} cat ../output/13-hisat2_pisaster/*-hisat.out ``` ```{bash} pwd ``` Convert sam to bam ```{bash} for samfile in ../output/13-hisat2_pisaster/*.sam; do bamfile="${samfile%.sam}.bam" sorted_bamfile="${samfile%.sam}.sorted.bam" /home/shared/samtools-1.12/samtools view -bS -@ 20 "$samfile" > "$bamfile" /home/shared/samtools-1.12/samtools sort -@ 20 "$bamfile" -o "$sorted_bamfile" /home/shared/samtools-1.12/samtools index -@ 20 "$sorted_bamfile" done ``` ```{bash} ls ../output/13-hisat2_pisaster/*sorted.bam | wc -l ``` ```{r, engine='bash', eval=TRUE} cat ../output/13-hisat2_pisaster/*hisat.out \ | grep "overall alignment rate" ``` alignment rates look good!! do this from Steven to get gtf into gff: steven did this to get the gtf into gff for `stringtie` use ```{bash} /home/shared/gffread-0.12.7.Linux_x86_64/gffread \ ../data/pisaster_augustus.hints.gtf \ -T \ -o ../data/pisaster_augustus.hints.gff ``` ## `stringtie` ```{bash} find ../output/13-hisat2_pisaster/*sorted.bam \ | xargs basename -s .sorted.bam | xargs -I{} \ /home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \ -p 36 \ -eB \ -G ../data/pisaster_augustus.hints.gff \ -o ../output/13-hisat2_pisaster/{}.gff \ ../output/13-hisat2_pisaster/{}.sorted.bam ``` get a multiqc file on the alignment: ```{bash} eval "$(/opt/anaconda/anaconda3/bin/conda shell.bash hook)" conda activate which multiqc multiqc ../output/13-hisat2_pisaster/ \ -o ../output/13-hisat2_pisaster/multiqc_pisaster/ ``` navigate into folder: `~/GitHub/project-pycno-multispecies-2023/output/13-hisat2_pisaster/multiqc_pisaster` copy alignment multiqc report to OWL: `rsync --archive --progress --verbose multiqc_report.html grace@owl.fish.washington.edu:/volume1/web/gcrandall/multispeciesSSWD/QCreports` on Owl-- rename report to pisaster_2023_alignment_multiqc_report.html # GET A COUNT MATRIX ```{bash} ls ../output/13-hisat2_pisaster/*gff ``` copy the above and put into a txt file, add the sample ID before each path called pisaster2023.txt ```{bash} cat ../output/13-hisat2_pisaster/pisaster2023.txt ``` ```{bash} head -5 ../output/13-hisat2_pisaster/PSC-0518.gff tail -5 ../output/13-hisat2_pisaster/PSC-0518.gff ``` ```{r, engine='bash'} python /home/shared/stringtie-2.2.1.Linux_x86_64/prepDE.py \ -i ../output/13-hisat2_pisaster/pisaster2023.txt \ -g ../data/pisaster_gene_count_matrix_2023.csv \ -t ../data/pisaster_transcript_count_matrix_2023.csv ``` ```{bash} head ../data/pisaster*matrix_2023.csv ```