--- title: "02-align-hisat" author: "Sarah Tanja" date: "`r format(Sys.time(), '%d %B, %Y')`" editor: visual format: gfm: default html: df-print: paged toc: true smooth-scroll: true link-external-icon: true link-external-newwindow: true code-fold: show code-tools: true code-copy: true highlight-style: arrow code-overflow: wrap theme: light: sandstone dark: vapor --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, eval = FALSE, warning = FALSE, message = FALSE) ``` ## Overview - sample sequence files located at absolute path `/home/shared/8TB_HDD_01/mcap/` - reference genome V3 files located at `../data/` Look at the GFF (generic feature format) genomic annotation files and the FASTA genome file ```{r, engine='bash'} cd ../data tail -n 10 Montipora_capitata_HIv3.genes.gff3 tail -n 10 Montipora_capitata_HIv3.genes_fixed.gff3 tail -n 10 Montipora_capitata_HIv3.assembly.fasta ``` ## Build an index file from the reference genome with HISAT2 ::: callout-info title="ChatGPT Info on 'Indexing with HISAT2'" An index file built from a genome is a data structure that enables fast searching and retrieval of information from the genome. Here we use `HISAT2` to break down the genome into smaller fragments and build an index file that stores the location and sequence of these fragments. This index file is created using a reference genome as a template. HISAT2 (Hierarchical Indexing for Spliced Alignment of Transcripts 2) is a popular RNA-seq aligner that uses a type of index file called a Hierarchical Graph FM Index (HGFM). The HGFM index is an extension of the FM-index data structure that is used in many other aligners such as Bowtie, BWA, and HISAT. The HGFM index builds on the FM-index by adding a hierarchical structure that allows for more efficient searching of splice junctions in RNA-seq data. In HISAT2, the HGFM index is constructed from a reference genome and a gene annotation file, which together define the locations of splice junctions. The HGFM index stores a hierarchical graph of the reference genome that includes both the exonic and intronic regions, along with the splice junctions between them. This allows HISAT2 to align RNA-seq reads across splice junctions with high sensitivity and accuracy. Overall, the HGFM index used by HISAT2 is optimized for aligning RNA-seq reads, which have unique challenges compared to genomic DNA sequencing. By incorporating splice junction information into the index, HISAT2 can accurately align RNA-seq reads to the genome and help identify gene expression and alternative splicing events. ::: When using HISAT2's hisat2-build command to build an index from a genome, the input file should be a FASTA file containing the reference genome sequence. in this case, `Montipora_capitata_HIv3.assembly.fasta`. The `-f` option soecifies the input is a fasta file (which can come in the fasta, fna, or fa file extension format. We then title and create an index `Mcapv3_assembly.index` which is stored into the output folder. ```{r, engine='bash'} /home/shared/hisat2-2.2.1/hisat2-build \ -f ../data/Montipora_capitata_HIv3.assembly.fasta \ ../output/Mcapv3_assembly.index ``` Or should I use the cds.fna file to build the index? CDS coding sequences .fna denotes FASTA Nucleotides file... Build index with cds.fna as well... ```{r, engine='bash'} /home/shared/hisat2-2.2.1/hisat2-build \ -f ../data/Montipora_capitata_HIv3.genes.cds.fna \ ../output/Mcapv3_cds.index ``` # Align to Index (making SAM files) ``` ```{r, engine='bash'} hisat2 -x path/to/hisat2_index/genome\ -U path/to/read1.fastq\ -S output.sam\ --threads num_threads ``` ``` - specify the path to the single-end fastq file with the -U option ```{r, engine='bash'} # Set input and output directories input_dir=/home/shared/8TB_HDD_01/mcap output_dir=../output # Set Hisat2 options hisat2_options="-x /home/shared/8TB_HDD_02/stanja/sarahtanja/sarahtanja-coralRNA/output/Mcapv3_assembly.index --threads 7" # Loop through fastq files in input directory for f in ${input_dir}/*; do # Get base filename base=$(basename ${f} .fastq) # Align with Hisat2 /home/shared/hisat2-2.2.1/hisat2 \ ${hisat2_options} \ -U ${f} \ -S ${output_dir}/${base}.sam \ done ``` ## Convert SAM file to sorted BAM file The following code chunk will iteratively take all sam files in the output directory and convert them to bam files. ```{r, engine='bash'} #!/bin/bash # Directory containing SAM files sam_dir="../output/" # Iterate over all SAM files in the directory for sam_file in ${sam_dir}*.sam do # Generate output file names base_name=$(basename ${sam_file} .sam) bam_file="${sam_dir}${base_name}.sorted.bam" index_file="${bam_file}.bai" # Convert SAM to BAM and sort /home/shared/samtools-1.12/samtools view \ -@ 7 \ -Su ${sam_file} \ | /home/shared/samtools-1.12/samtools sort - \ -@ 7 \ -o ${bam_file} # Index BAM file /home/shared/samtools-1.12/samtools index ${bam_file} done ``` ## Generate a reference transcriptome annotation to input into trinity? I can't make abundance estimates until I have gtf it seems. Here is an example of the code chunk that ChatGPT spit out for me. ```{r, engine='bash'} #!/bin/bash # Set input and output directories input_dir=../output output_dir=../output/stringtie # Loop through the sorted BAM files in input directory for f in ${input_dir}/*.sorted.bam; do # Get base filename base=$(basename ${f} .sorted.bam) # Run StringTie to estimate transcript abundances, botznik gff not accepted as a valid reference transcript... :( /home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \ ${input_dir}/${base}.sorted.bam \ -G ../data/genome-alignment/botznik-chr-all.gff \ -o ${output_dir}/${base}.gtf \ -p 7 done # Merge GTF files with StringTie /home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \ --merge ${output_dir}/mergelist.txt \ -G ../data/botznik-chr-ctg.gff \ -o ${output_dir}/merged.gtf \ -p 7 ```