--- title: "02.00-D-Apul-RNAseq-gff-to-gtf" author: "Sam White" date: "2024-10-08" output: bookdown::html_document2: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true github_document: toc: true number_sections: true html_document: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true bibliography: references.bib --- ```{r setup, include=FALSE} library(knitr) knitr::opts_chunk$set( echo = TRUE, # Display code chunks eval = FALSE, # Evaluate code chunks warning = FALSE, # Hide warnings message = FALSE, # Hide messages comment = "" # Prevents appending '##' to beginning of lines in code output ) ``` This notebook will utilize [gffread](https://ccb.jhu.edu/software/stringtie/gff.shtml) [@pertea2020] to create an _A.pulchras_ GTF file from the _A.pulchra_ genome GFF, which is needed for downstream analysis with [HISAT2](https://github.com/DaehwanKimLab/hisat2); specifically for identification of exons and splice sites. # Create a Bash variables file This allows usage of Bash variables across R Markdown chunks. ```{r save-bash-variables-to-rvars-file, engine='bash', eval=TRUE} { echo "#### Assign Variables ####" echo "" echo "# Data directories" echo 'export timeseries_dir=/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular' echo 'export genome_dir="${timeseries_dir}/D-Apul/data"' echo 'export output_dir_top=${timeseries_dir}/D-Apul/output/02.00-D-Apul-RNAseq-gff-to-gtf' echo "" echo "# Input/output files" echo 'export genome_gff="${genome_dir}/Apulchra-genome.gff"' echo 'export transcripts_gtf="Apulchra-genome.gtf"' echo "# Paths to programs" echo 'export programs_dir="/home/shared"' echo 'export gffread="${programs_dir}/gffread-0.12.7.Linux_x86_64/gffread"' echo "# Set number of CPUs to use" echo 'export threads=40' echo "" echo "# Programs associative array" echo "declare -A programs_array" echo "programs_array=(" echo '[gffread]="${gffread}" \' echo ")" echo "" echo "# Print formatting" echo 'export line="--------------------------------------------------------"' echo "" } > .bashvars cat .bashvars ``` # Preview and Validate Genome GFF ## Inspect GFF ```{r preview-genome-gff, engine='bash', eval=TRUE} # Load bash variables into memory source .bashvars # Make directories, if they don't exist mkdir --parents "${output_dir_top}" head -n 20 "${genome_gff}" ``` ## Valdiate GFF This identifies if there are rows with >9 fields (which there shouldn't be in a [GFF3](http://gmod.org/wiki/GFF3)). Additionally, it provides a preview of all rows lengths identified. ```{r validate-genome-gff, engine='bash', eval=TRUE} # Load bash variables into memory source .bashvars # Capture number of fields (NF) in each row in array. field_count_array=($(awk -F "\t" '{print NF}' "${genome_gff}" | sort --unique)) # Check array contents echo "List of number of fields in ${genome_gff}:" echo "" for field_count in "${field_count_array[@]}" do echo "${field_count}" done echo "" echo "${line}" echo "" # Preview of each line "type" with a given number of fields # Check array contents echo "" for field_count in "${field_count_array[@]}" do echo "Preview of lines with ${field_count} field(s):" echo "" awk \ -v field_count="${field_count}" \ -F "\t" \ 'NF == field_count' \ "${genome_gff}" \ | head echo "" echo "${line}" echo "" done ``` Great! This looks like a valid GFF. Can proceed with GTF generation. # Generate GTF ```{r generate-genome-gtf, engine='bash', eval=FALSE} # Load bash variables into memory source .bashvars ${programs_array[gffread]} -E \ "${genome_gff}" -T \ 1> ${output_dir_top}/"${transcripts_gtf}" \ 2> ${output_dir_top}/gffread-gff_to_gtf.stderr ``` # Inspect GTF ```{r inspect-genome-gtf, engine='bash', eval=TRUE} # Load bash variables into memory source .bashvars head ${output_dir_top}/"${transcripts_gtf}" ``` ## Copy GTF to `D-Apul/data` To help make this easier to locate, will copy to the `D-Apul/data` directory, which also contains the genome FastA, genome FastA index, and the genome GFF files. ```{r copy-genome-gtf, engine='bash', eval=FALSE} # Load bash variables into memory source .bashvars cp ${output_dir_top}/"${transcripts_gtf}" "${genome_dir}" ```