--- title: "Step 4: Align sequences to annotated *M. capitata* genome" subtitle: "Using `HISAT2`" author: "Sarah Tanja" date: 10/21/2024 format: gfm: default # or html if you want to render in HTML toc: true toc-depth: 3 link-external-icon: true link-external-newwindow: true reference-location: margin citation-location: margin --- # 1 \| INTRODUCTION This notebook will align trimmed M. capitata RNA-seq data to the M. capitata genome using hierarchical indexing for spliced alignment of transcripts HISAT2 [@zhang_rapid_2021]. Followed by StringTie (Pertea et al. 2016, 2015) for transcript assembly/identification and count matrices for downstream expression analysis with DESeq2. Input(s) - Trimmed FastQ files, with format: `*fastp-trim.fq.gz` are located in `output/03_qaqc/trim_reads_fastp` - HISAT 2 genome index: - Genome GTF: - Genome [Version 3](http://cyanophora.rutgers.edu/montipora/) [@stephens2022] - [GFF](http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.genes.gff3.gz) from Rutgers (or [GFF fixed](https://github.com/AHuffmyer/EarlyLifeHistory_Energetics/raw/master/Mcap2020/Data/TagSeq/Montipora_capitata_HIv3.genes_fixed.gff3.gz) from AHuffmyer?) - [genomes, indexes, & feature tracks](https://robertslab.github.io/resources/Genomic-Resources/#montipora-capitata) from Roberts Lab Handbook - Sample metadata: `~metadata/rna_metadata.csv` # 2 \| Genome downloads ## Annotated Reference Genome for *Montipora capitata* Deep Dive project with genomes of interest: https://github.com/urol-e5/deep-dive *Montipora capitata* Genome version V3, Rutgers University: Genome publication: Nucleotide Coding Sequence (CDS): This code grabs the *Montipora capitata* fasta file (rna.fna) of genes. ```{r, genome-download, engine = 'bash'} # change from code directory to work in data directory cd ../data # make mcapgenome directory a subdirectory of data if not already present mkdir -p mcapgenome # change directory to ~data/mcapgenome cd mcapgenome # download the annotated genomes from the rutgers site if not already present [-f Montipora_capitata_HIv3.assembly.fasta.gz] || wget http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.assembly.fasta.gz ``` - [`Montipora_capitata_HIv3.assembly.fasta`](https://owl.fish.washington.edu/halfshell/genomic-databank/Montipora_capitata_HIv3.assembly.fasta) (745MB) - MD5 checksum: `99819eadba1b13ed569bb902eef8da08` - Downloaded 2023017: ```{r, genome-checksums, engine = 'bash'} # change to work in data genome folder cd ../data/mcapgenome # generate checksum for the genome assembly file md5sum *assembly* ``` ## HISAT Index - [`Montipora_capitata_HIv3-hisat2-indices.tar.gz`](https://gannet.fish.washington.edu/Atumefaciens/20230131-mcap-HIv3-hisat2-build-index/Montipora_capitata_HIv3-hisat2-indices.tar.gz) (tarball gzip; 1.2GB) - MD5 checksum: `c8accb6c54e843198c776f0d6f0c603d` - Needs to be unpacked before use! ```{r, hisat-download, engine = 'bash'} # change to work in data genome directory cd ../data/mcapgenome # download the hisat index from Robert's Lab gannet server [-f Montipora_capitata_HIv3-hisat2-indices.tar.gz] || wget https://gannet.fish.washington.edu/Atumefaciens/20230131-mcap-HIv3-hisat2-build-index/Montipora_capitata_HIv3-hisat2-indices.tar.gz ``` ```{r, hisat-checksum, engine = 'bash'} # change to work in data genome directory cd ../data/mcapgenome # generate checksum for the hisat index zip file md5sum *hisat2* ``` Unpack the tar.gz hisat index file using `tar -xvzf` - `-x`: Extracts the contents of the archive. - `-v`: Verbose, shows the files being extracted. - `-z`: Tells `tar` that the archive is compressed with `gzip` (for `.tar.gz` files). - `-f`: Specifies the file name of the archive to extract. This command will extract the contents of the `.tar.gz` file into the current directory: ```{r, hisat-unzip, engine = 'bash'} cd ../data/mcapgenome tar -xvzf Montipora_capitata_HIv3-hisat2-indices.tar.gz ``` ## Genome Feature Tracks [Generic Feature Format (GFF3)](https://gmod.org/wiki/GFF3) - [`Montipora_capitata_HIv3.genes.gff3`](https://owl.fish.washington.edu/halfshell/genomic-databank/Montipora_capitata_HIv3.genes.gff3) (67MB) - MD5 checksum: `5f6b80ba2885471c8c1534932ccb7e84` - Downloaded 2023017: - [`Montipora_capitata_HIv3.genes.gtf`](https://gannet.fish.washington.edu/Atumefaciens/20230127-mcap-gff_to_gtf/Montipora_capitata_HIv3.genes.gtf) (101MB) - MD5 checksum: `ceef8eca945199415b23d2f1f0dd2066` - Created 2023017: ```{r, feature-tracks-download, engine = 'bash'} # change to work in data genome directory cd ../data/mcapgenome wget http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.genes.gff3.gz wget https://gannet.fish.washington.edu/Atumefaciens/20230127-mcap-gff_to_gtf/Montipora_capitata_HIv3.genes.gtf ``` ```{r, feature-tracks-checksum, engine = 'bash'} cd ../data/mcapgenome md5sum *.gff3.gz *.gtf *gff3 ``` ```{r, engine = 'bash'} cd ../data/mcapgenome head -10 Montipora_capitata_HIv3.genes.gff3 ``` Load libraries and data. ```{r} if ("tidyverse" %in% rownames(installed.packages()) == 'FALSE') install.packages('tidyverse') if ("tidyverse" %in% rownames(installed.packages()) == 'FALSE') install.packages('tidyverse') library(tidyverse) library(R.utils) ``` ```{r} gff <- read.csv(file="../data/mcapgenome/Montipora_capitata_HIv3.genes.gff3", header=FALSE, sep="\t") gff_fixed <- read.csv(file="../data/mcapgenome/Montipora_capitata_HIv3.genes_fixed.gff3", header=FALSE, sep="\t") gtf <- ``` # 3 \| Create a `bash` variables file This file will overwrite any existing `.bashvars` file that is in the code directory ```{r, engine= 'bash'} { echo "#### Assign Variables ####" echo "" echo "# Data directories" echo 'export project_dir=/home/shared/8TB_HDD_02/stanja/sarahtanja/coral-embryo-leachate' echo 'export genome_dir="${project_dir}/data/mcapgenome"' echo 'export genome_index_dir="${genome_dir}/hisat-index"' echo 'export output_dir_top=${project_dir}/output' echo 'export output_dir_align=${output_dir_top}/04_align' echo 'export trimmed_fastqs_dir=${output_dir_top}/03_qaqc/trim_reads_fastp' echo 'export raw_reads_dir=${project_dir}/rawfastq/00_fastq' echo "" echo "# Location of Hisat2 index files" echo "# Must keep variable name formatting, as it's used by HiSat2" echo 'export HISAT2_INDEXES="${genome_index_dir}"' echo "# Input files" #echo 'export exons="${output_dir_top}/Apulchra-genome_hisat2_exons.tab"' echo 'export genome_index_name="Montipora_capitata_HIv3"' echo 'export genome_gff="${genome_dir}/Montipora_capitata_HIv3.genes_fixed.gff3"' echo 'export genome_fasta="${genome_dir}/Montipora_capitata_HIv3.assembly.fasta"' #echo 'export splice_sites="${output_dir_top}/Apulchra-genome_hisat2_splice_sites.tab"' echo 'export transcripts_gtf="${genome_dir}/Montipora_capitata_HIv3.genes.gtf"' echo "# Output files" echo 'export gtf_list="${output_dir_align}/gtf_list.txt"' echo 'export merged_bam="${output_dir_align}/sorted-bams-merged.bam"' echo "" echo "# Paths to programs" echo 'export programs_dir="/home/shared"' echo 'export hisat2_dir="${programs_dir}/hisat2-2.2.1"' echo 'export hisat2="${hisat2_dir}/hisat2"' echo 'export multiqc=/home/sam/programs/mambaforge/bin/multiqc' echo 'export samtools="${programs_dir}/samtools-1.12/samtools"' echo 'export prepDE="${programs_dir}/stringtie-2.2.1.Linux_x86_64/prepDE.py3"' echo 'export stringtie="${programs_dir}/stringtie-2.2.1.Linux_x86_64/stringtie"' echo "" echo "# Set FastQ filename patterns" echo "export R1_fastq_pattern='*_R1_*.fq.gz'" echo "export R2_fastq_pattern='*_R2_*.fq.gz'" echo "export trimmed_fastq_pattern='*fastp-trim.fq.gz'" echo "" echo "# Set number of CPUs to use" echo 'export threads=40' echo "" echo "# Set average read length - for StringTie prepDE.py" echo 'export read_length=150' echo "" echo "## Initialize arrays" echo 'export fastq_array_R1=()' echo 'export fastq_array_R2=()' echo 'export R1_names_array=()' echo 'export R2_names_array=()' echo "declare -A sample_info_map" echo "" echo "# Programs associative array" echo "declare -A programs_array" echo "programs_array=(" echo '[hisat2]="${hisat2}" \' echo '[multiqc]="${multiqc}" \' echo '[prepDE]="${prepDE}" \' echo '[samtools]="${samtools}" \' echo '[stringtie]="${stringtie}" \' echo ")" echo "" echo "# Print formatting" echo 'export line="--------------------------------------------------------"' echo "" } > .bashvars cat .bashvars ``` # 4 \| Align reads using HISAT 2 ## Sam's Loop This requires usage of the `mcap_RNAseq_simplified_metadata.csv` This step has a lengthy, semi-complex workflow: 1. Parse `mcap_RNAseq_simplified_metadata.csv` for *M. capitata* sample names, leachate treatments, and embryonic phase. This info will be used for downstream file naming and for passing the treatment variables to the read group (`SM:`) in the alignment file. 2. Loop through all samples and perform individual alignments using HISAT2. 3. HISAT2 output is piped to through multiple samtools tools: flagstat (stats aggregation), sort (creates/sorts BAM), index (creates BAM index). Piping saves time and disk space, by avoiding the generation of large SAM files. 4. Loop continues and runs StringTie on sorted BAM file to produce individual GTF file. 5. Loop continues and adds GTF path/filename to a list file, which will be used downstream. ```{r} metadata <- read.csv("../metadata/mcap_RNAseq_simplified_metadata.csv") head(metadata) ``` ```{r, engine = 'bash'} # Load bash variables into memory source .bashvars # Make output directories, if they don't exist mkdir --parents "${output_dir_align}" # Change to ouput directory cd "${output_dir_align}" # Create associative array with sample and timepoint metadata="../../metadata/mcap_RNAseq_simplified_metadata.csv" # Declare the associative array to store all the sample information declare -A sample_info_map # Read the metadata file line by line while IFS=',' read -r sample_no sample_name organism collection_date pvc_leachate_level hours_post_fertilization plate well sample_type sample_buffer; do # Add the sample name as the key and the combined info as the value sample_info_map["${sample_name}"]="${collection_date},${pvc_leachate_level},${hours_post_fertilization}" done < <(tail -n +2 "${metadata}") # Skip the header ## Populate trimmed reads arrays fastq_array_R1=("${trimmed_fastqs_dir}"/${R1_fastq_pattern}) fastq_array_R2=("${trimmed_fastqs_dir}"/${R2_fastq_pattern}) ############## BEGIN HISAT2 ALIGNMENTS ############## for sample in "${!sample_info_map[@]}" do # Create and switch to dedicated sample directory mkdir --parents "${sample}" && cd "$_" # Create HISAT2 list of fastq R1 files # and generated MD5 checksums file. for fastq in "${fastq_array_R1[@]}" do # Parse sample name from FastQ filename fastq_sample=$(echo "${fastq##*/}" | awk -F"[_]" '{print $1}') # Generate checksum/list of input files used md5sum "${fastq}" >> input_fastqs_checksums.md5 # Create comma-separated lists of FastQs for HISAT2 printf -v joined_R1 '%s,' "${fastq}" fastq_list_R1=$(echo "${joined_R1%,}") done # Create HISAT2 list of fastq R1 files # and generated MD5 checksums file. for fastq in "${fastq_array_R2[@]}" do # Parse sample name from FastQ filename fastq_sample=$(echo "${fastq##*/}" | awk -F"[_]" '{print $1}') # Generate checksum/list of input files used md5sum "${fastq}" >> input_fastqs_checksums.md5 # Create comma-separated lists of FastQs for HISAT2 printf -v joined_R2 '%s,' "${fastq}" fastq_list_R2=$(echo "${joined_R2%,}") done # HISAT2 alignments # Sets read group info (RG) using samples array "${programs_array[hisat2]}" \ -x "${genome_index_name}" \ -1 "${fastq_list_R1}" \ -2 "${fastq_list_R2}" \ --threads "${threads}" \ --rg-id "${sample}" \ --rg "SM:""${sample_info_map[$sample]}" \ 2> "${sample}"_hisat2.stats \ | tee >(${programs_array[samtools]} flagstat - > "${sample}"-hisat2_output.flagstat) \ | ${programs_array[samtools]} sort - -@ "${threads}" -O BAM \ | tee "${sample}".sorted.bam \ | ${programs_array[samtools]} index - "${sample}".sorted.bam.bai # Run stringtie on alignments # Uses "-B" option to output tables intended for use in Ballgown # Uses "-e" option; recommended when using "-B" option. # Limits analysis to only reads alignments matching reference. "${programs_array[stringtie]}" "${sample}".sorted.bam \ -p "${threads}" \ -o "${sample}".gtf \ -G "${genome_gff}" \ -C "${sample}.cov_refs.gtf" \ -B \ -e # Add GTFs to list file, only if non-empty # Identifies GTF files that only have header gtf_lines=$(wc -l < "${sample}".gtf ) if [ "${gtf_lines}" -gt 2 ]; then echo "$(pwd)/${sample}.gtf" >> "${gtf_list}" fi # Generate checksums find ./ -type f -not -name "*.md5" -exec md5sum {} \; > ${sample}_checksums.md5 # Move up to orig. working directory cd .. done ``` ## Broken into chunks ```{r, engine = 'bash'} # Load bash variables into memory source .bashvars # Make output directories, if they don't exist mkdir --parents "${output_dir_align}" # Change to output directory cd "${output_dir_align}" # Create associative array with sample and timepoint metadata="../../metadata/mcap_RNAseq_simplified_metadata.csv" # Declare the associative array to store all the sample information declare -A sample_info_map # Read the metadata file line by line while IFS=',' read -r sample_no sample_name organism collection_date development_stage pvc_leachate_level hours_post_fertilization pvc_leachate_concentration_mg_L plate well sample_type sample_buffer; do # Add the sample name as the key and the combined info as the value sample_info_map["${sample_name}"]="${collection_date},${development_stage},${pvc_leachate_level},${hours_post_fertilization},${pvc_leachate_concentration_mg_L}" done < <(tail -n +2 "${metadata}") # Skip the header ## Populate trimmed reads arrays fastq_array_R1=("${trimmed_fastqs_dir}"/${R1_fastq_pattern}) fastq_array_R2=("${trimmed_fastqs_dir}"/${R2_fastq_pattern}) ``` ```{r, engine = 'bash'} # Load bash variables into memory source .bashvars # Change to output directory cd "${output_dir_align}" ############## BEGIN HISAT2 ALIGNMENTS ############## for sample in "${!sample_info_map[@]}" do # Create and switch to dedicated sample directory mkdir --parents "${sample}" && cd "$_" # Create HISAT2 list of fastq R1 files # and generated MD5 checksums file. for fastq in "${fastq_array_R1[@]}" do # Parse sample name from FastQ filename fastq_sample=$(echo "${fastq##*/}" | awk -F"[_]" '{print $1}') # Generate checksum/list of input files used md5sum "${fastq}" >> input_fastqs_checksums.md5 # Create comma-separated lists of FastQs for HISAT2 printf -v joined_R1 '%s,' "${fastq}" fastq_list_R1=$(echo "${joined_R1%,}") done # Create HISAT2 list of fastq R1 files # and generated MD5 checksums file. for fastq in "${fastq_array_R2[@]}" do # Parse sample name from FastQ filename fastq_sample=$(echo "${fastq##*/}" | awk -F"[_]" '{print $1}') # Generate checksum/list of input files used md5sum "${fastq}" >> input_fastqs_checksums.md5 # Create comma-separated lists of FastQs for HISAT2 printf -v joined_R2 '%s,' "${fastq}" fastq_list_R2=$(echo "${joined_R2%,}") done ``` ```{r, engine = 'bash'} # Load bash variables into memory source .bashvars # Change to output directory cd "${output_dir_align}" # HISAT2 alignments # Sets read group info (RG) using samples array "${programs_array[hisat2]}" \ -x "${genome_index_name}" \ -1 "${fastq_list_R1}" \ -2 "${fastq_list_R2}" \ --threads "${threads}" \ --rg-id "${sample}" \ --rg "SM:""${sample_info_map[$sample]}" \ 2> "${sample}"_hisat2.stats \ | tee >(${programs_array[samtools]} flagstat - > "${sample}"-hisat2_output.flagstat) \ | ${programs_array[samtools]} sort - -@ "${threads}" -O BAM \ | tee "${sample}".sorted.bam \ | ${programs_array[samtools]} index - "${sample}".sorted.bam.bai ``` ```{r, engine = 'bash'} # Load bash variables into memory source .bashvars # Change to output directory cd "${output_dir_align}" # Run stringtie on alignments # Uses "-B" option to output tables intended for use in Ballgown # Uses "-e" option; recommended when using "-B" option. # Limits analysis to only reads alignments matching reference. "${programs_array[stringtie]}" "${sample}".sorted.bam \ -p "${threads}" \ -o "${sample}".gtf \ -G "${genome_gff}" \ -C "${sample}.cov_refs.gtf" \ -B \ -e ``` ```{r, engine = 'bash'} # Load bash variables into memory source .bashvars # Change to output directory cd "${output_dir_align}" # Add GTFs to list file, only if non-empty # Identifies GTF files that only have header gtf_lines=$(wc -l < "${sample}".gtf ) if [ "${gtf_lines}" -gt 2 ]; then echo "$(pwd)/${sample}.gtf" >> "${gtf_list}" fi # Generate checksums find ./ -type f -not -name "*.md5" -exec md5sum {} \; > ${sample}_checksums.md5 # Move up to orig. working directory cd .. done ```