--- title: "P.tuahiniensis GFF reformatting" author: "Sam White" date: "2025-01-17" output: bookdown::html_document2: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true github_document: toc: true number_sections: true html_document: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true bibliography: references.bib --- # Background This notebook reformats the original *P.tuahiniensis* GFF ([Pocillopora_meandrina_HIv1.genes.gff3](../data/Pocillopora_meandrina_HIv1.genes.gff3)), to be compliant with the GFF3 standard. The genome was originally downloaded from here: - [@stephens2022] Despite the naming convention of the file, there are no designated `gene` regions. The reformatting will add a `gene` feature, based solely on the `transcript` coordinates. There aren't any other features beyoned `CDS` and `exon`, so adding `gene` features which match the `transcript` regions is primarily just to meet the GFF3 standard. Additionally, this notebook will generate a GTF for downstream use with HiSat2 for creating a genome index which incorporates exon/intron junctions and splice sites. ::: callout-note Unlike other scripts, this will output to [`F-Ptuh/data`](../data), instead of an output directory in `../output`. ::: ## Software requirements Requires [genometools](https://github.com/genometools/genometools) (GitHub repo) to be installed and in the system `$PATH` and is used for GFF3 validation/formatting. Requires [gffread](https://ccb.jhu.edu/software/stringtie/gff.shtml#gffread) [@pertea2020] which is used for GFF3-to-GTF conversion. ## Inputs - [`Pocillopora_meandrina_HIv1.genes.gff3`](../data/Pocillopora_meandrina_HIv1.genes.gff3) ## Outputs - [`Pocillopora_meandrina_HIv1.genes-validated.gff3`](../data/Pocillopora_meandrina_HIv1.genes-validated.gff3) - [`Pocillopora_meandrina_HIv1.genes-validated.gtf`](../data/Pocillopora_meandrina_HIv1.genes-validated.gtf) ```{r setup, include=FALSE} library(knitr) knitr::opts_chunk$set( echo = TRUE, # Display code chunks eval = FALSE, # Evaluate code chunks warning = FALSE, # Hide warnings message = FALSE, # Hide messages comment = "" # Prevents appending '##' to beginning of lines in code output ) ``` # Load libraries ```{r load-libraries, eval=TRUE} library(tidyverse) ``` # Create a Bash variables file This allows usage of Bash variables across R Markdown chunks. ```{r save-bash-variables-to-rvars-file, engine='bash', eval=TRUE} { echo "#### Assign Variables ####" echo "" echo "# Programs" echo 'export gffread=~/programs/gffread/gffread' echo "# Data directories" echo 'export repo_dir=~/gitrepos/urol-e5/deep-dive-expression' echo 'export data_dir=${repo_dir}/F-Ptuh/data' echo "" echo "# Input files" echo 'export original_gff="Pocillopora_meandrina_HIv1.genes.gff3"' echo 'export intermediate_gff="intermediate.gff3"' echo 'export validated_gff="Pocillopora_meandrina_HIv1.genes-validated.gff3"' echo 'export gtf="Pocillopora_meandrina_HIv1.genes-validated.gtf"' echo "" echo "# Print formatting" echo 'export line="--------------------------------------------------------"' echo "" } > .bashvars cat .bashvars ``` # Read GFF3 file ```{r load-input-gff, eval=TRUE} gff <- read_delim("../data/Pocillopora_meandrina_HIv1.genes.gff3", delim="\t", col_names=c("seqid", "source", "type", "start", "end", "score", "strand", "phase", "attributes"), show_col_types = FALSE) str(gff) ``` # Get unique types ```{r unqique-features, eval=TRUE} # Get and print unique types, one per line cat("Unique feature types in original GFF3:\n") gff %>% distinct(type) %>% pull(type) %>% walk(~cat("-", .x, "\n")) ``` # Create gene entries from transcript entries ```{r create-gene-entries, eval=TRUE} # Create gene entries gene_entries <- gff %>% filter(type == "transcript") %>% mutate( type = "gene", attributes = paste0("ID=gene-", str_remove(attributes, "^ID=")) ) str(gene_entries) ``` # Create mRNA features ```{r mrna-features, eval=TRUE} # Create mRNA entries mrna_entries <- gff %>% filter(type == "transcript") %>% mutate( type = "mRNA", gene_id = paste0("gene-", str_remove(attributes, "^ID=")), attributes = paste0("ID=mrna-", str_remove(attributes, "^ID="), ";Parent=", gene_id) ) %>% select(-gene_id) str(mrna_entries) ``` # Other features Prepends a feature ID (e.g. `cds-`) to the corresponding attribute ID. Also increments each `exon` feature type within each mRNA, starting at \`. ```{r other-features, eval=TRUE} # Process remaining features other_features <- gff %>% group_by(seqid, group_id = cumsum(type == "transcript")) %>% mutate( mrna_id = first(if_else(type == "transcript", paste0("mrna-", str_remove(attributes, "^ID=")), NA_character_)), # Calculate exon number within each group exon_num = if_else(type == "exon", dense_rank(if_else(type == "exon", row_number(), NA_real_)), NA_real_), # Format attributes attributes = if_else( type != "transcript", paste0( "ID=", tolower(type), "-", str_remove(first(if_else(type == "transcript", attributes, NA_character_)), "^ID="), if_else(!is.na(exon_num), paste0("-", exon_num), ""), ";Parent=", mrna_id ), attributes ) ) %>% ungroup() %>% filter(type != "transcript") %>% select(-group_id, -mrna_id, -exon_num) str(other_features) ``` # Combine and sort ```{r combine-and-sort, eval=TRUE} intermediate_gff <- bind_rows( gene_entries, mrna_entries, other_features ) %>% arrange(seqid, start) str(intermediate_gff) ``` # Unique features ```{r unqique-features-updated-gff, eval=TRUE} # Get and print unique types, one per line cat("Unique feature types in final GFF3:\n") intermediate_gff %>% distinct(type) %>% pull(type) %>% walk(~cat("-", .x, "\n")) ``` # Write output ```{r write-reformatted-gff, eval=TRUE} write_delim(intermediate_gff, "../data/intermediate.gff3", delim="\t", col_names=FALSE) ``` # Validate GFF Validate GFF using genometools `gff3`. - `-tidy`: Attempts to clean/fix any potential issues. - `-checkids`: Checks IDs. - `-retainids`: Retains IDs from input GFF instead of assigning new ones. ```{r validate-gff, engine='bash', eval=TRUE} source .bashvars gt gff3 \ -tidy \ -checkids \ -retainids \ "${data_dir}"/"${intermediate_gff}" \ > "${data_dir}"/"${validated_gff}" \ 2> "${data_dir}"/gt_gff3_validation_errors.log ``` ## Check for error(s) in validation Process would stop if error occurred, so only need to check end of file. ::: callout-note Warnings are expected, as this program warns the user when it is inserting text. In this instance, when it is inserting GFF3-compliant comment lines denoting regions. ::: ```{r validate-gff-error-check, engine='bash', eval=TRUE} source .bashvars tail "${data_dir}"/gt_gff3_validation_errors.log ``` ## Inspect Validated GFF ```{r validated-gff, engine='bash', eval=TRUE} source .bashvars head "${data_dir}"/"${validated_gff}" ``` ## Remove intermediate GFF ```{r rm-intermediate-gff, engine='bash', eval=TRUE} source .bashvars rm "${data_dir}"/"${intermediate_gff}" ``` # Convert to GTF ```{r create-GTF, engine='bash', eval=TRUE} source .bashvars ${gffread} \ -E \ -T \ "${data_dir}"/"${validated_gff}" \ > "${data_dir}"/"${gtf}" \ 2> "${data_dir}"/gffread-gtf.log ``` ## Inspect GTF ```{r inspect-GTF, engine='bash', eval=TRUE} source .bashvars head "${data_dir}"/"${gtf}" ``` ## Check GTF Log ```{r check-GTF-log, engine='bash', eval=TRUE} source .bashvars head "${data_dir}"/gffread-gtf.log ``` # REFERENCES