---
title: "P.tuahiniensis GFF reformatting"
author: "Sam White"
date: "2025-01-17"
output: 
  bookdown::html_document2:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
  github_document:
    toc: true
    number_sections: true
  html_document:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
bibliography: references.bib
---

# Background

This notebook reformats the original *P.tuahiniensis* GFF ([Pocillopora_meandrina_HIv1.genes.gff3](../data/Pocillopora_meandrina_HIv1.genes.gff3)), to be compliant with the GFF3 standard. The genome was originally downloaded from here:

-   <http://cyanophora.rutgers.edu/Pocillopora_meandrina/> [@stephens2022]

Despite the naming convention of the file, there are no designated `gene` regions. The reformatting will add a `gene` feature, based solely on the `transcript` coordinates. There aren't any other features beyoned `CDS` and `exon`, so adding `gene` features which match the `transcript` regions is primarily just to meet the GFF3 standard.

Additionally, this notebook will generate a GTF for downstream use with HiSat2 for creating a genome index which incorporates exon/intron junctions and splice sites.

::: callout-note
Unlike other scripts, this will output to [`F-Ptuh/data`](../data), instead of an output directory in `../output`.
:::

## Software requirements

Requires [genometools](https://github.com/genometools/genometools) (GitHub repo) to be installed and in the system `$PATH` and is used for GFF3 validation/formatting.

Requires [gffread](https://ccb.jhu.edu/software/stringtie/gff.shtml#gffread) [@pertea2020] which is used for GFF3-to-GTF conversion.

## Inputs

-   [`Pocillopora_meandrina_HIv1.genes.gff3`](../data/Pocillopora_meandrina_HIv1.genes.gff3)

## Outputs

-   [`Pocillopora_meandrina_HIv1.genes-validated.gff3`](../data/Pocillopora_meandrina_HIv1.genes-validated.gff3)

-   [`Pocillopora_meandrina_HIv1.genes-validated.gtf`](../data/Pocillopora_meandrina_HIv1.genes-validated.gtf)

```{r setup, include=FALSE}
library(knitr)
knitr::opts_chunk$set(
  echo = TRUE,         # Display code chunks
  eval = FALSE,        # Evaluate code chunks
  warning = FALSE,     # Hide warnings
  message = FALSE,     # Hide messages
  comment = ""         # Prevents appending '##' to beginning of lines in code output
)
```

# Load libraries

```{r load-libraries, eval=TRUE}
library(tidyverse)
```

# Create a Bash variables file

This allows usage of Bash variables across R Markdown chunks.

```{r save-bash-variables-to-rvars-file, engine='bash', eval=TRUE}
{
echo "#### Assign Variables ####"
echo ""

echo "# Programs"
echo 'export gffread=~/programs/gffread/gffread'

echo "# Data directories"
echo 'export repo_dir=~/gitrepos/urol-e5/deep-dive-expression'
echo 'export data_dir=${repo_dir}/F-Ptuh/data'
echo ""

echo "# Input files"
echo 'export original_gff="Pocillopora_meandrina_HIv1.genes.gff3"'
echo 'export intermediate_gff="intermediate.gff3"'
echo 'export validated_gff="Pocillopora_meandrina_HIv1.genes-validated.gff3"'
echo 'export gtf="Pocillopora_meandrina_HIv1.genes-validated.gtf"'
echo ""

echo "# Print formatting"
echo 'export line="--------------------------------------------------------"'
echo ""
} > .bashvars

cat .bashvars
```

# Read GFF3 file

```{r load-input-gff, eval=TRUE}
gff <- read_delim("../data/Pocillopora_meandrina_HIv1.genes.gff3", 
                  delim="\t", 
                  col_names=c("seqid", "source", "type", "start", "end", 
                            "score", "strand", "phase", "attributes"),
                  show_col_types = FALSE)
str(gff)
```

# Get unique types

```{r unqique-features, eval=TRUE}
# Get and print unique types, one per line
cat("Unique feature types in original GFF3:\n")
gff %>%
  distinct(type) %>%
  pull(type) %>%
  walk(~cat("-", .x, "\n"))
```

# Create gene entries from transcript entries

```{r create-gene-entries, eval=TRUE}
# Create gene entries
gene_entries <- gff %>%
  filter(type == "transcript") %>%
  mutate(
    type = "gene",
    attributes = paste0("ID=gene-", str_remove(attributes, "^ID="))
  )

str(gene_entries)
```

# Create mRNA features

```{r mrna-features, eval=TRUE}
# Create mRNA entries
mrna_entries <- gff %>%
  filter(type == "transcript") %>%
  mutate(
    type = "mRNA",
    gene_id = paste0("gene-", str_remove(attributes, "^ID=")),
    attributes = paste0("ID=mrna-", str_remove(attributes, "^ID="), 
                       ";Parent=", gene_id)
  ) %>%
  select(-gene_id)

str(mrna_entries)
```

# Other features

Prepends a feature ID (e.g. `cds-`) to the corresponding attribute ID.

Also increments each `exon` feature type within each mRNA, starting at \`.

```{r other-features, eval=TRUE}
# Process remaining features
other_features <- gff %>%
  group_by(seqid, group_id = cumsum(type == "transcript")) %>%
  mutate(
    mrna_id = first(if_else(type == "transcript", 
                           paste0("mrna-", str_remove(attributes, "^ID=")), 
                           NA_character_)),
    # Calculate exon number within each group
    exon_num = if_else(type == "exon",
                      dense_rank(if_else(type == "exon", row_number(), NA_real_)),
                      NA_real_),
    # Format attributes
    attributes = if_else(
      type != "transcript",
      paste0(
        "ID=", tolower(type), "-", str_remove(first(if_else(type == "transcript", 
                                                           attributes, 
                                                           NA_character_)), "^ID="),
        if_else(!is.na(exon_num),
                paste0("-", exon_num), ""),
        ";Parent=", mrna_id
      ),
      attributes
    )
  ) %>%
  ungroup() %>%
  filter(type != "transcript") %>%
  select(-group_id, -mrna_id, -exon_num)

str(other_features)
```

# Combine and sort

```{r combine-and-sort, eval=TRUE}
intermediate_gff <- bind_rows(
  gene_entries,
  mrna_entries,
  other_features
) %>%
  arrange(seqid, start)

str(intermediate_gff)
```

# Unique features

```{r unqique-features-updated-gff, eval=TRUE}
# Get and print unique types, one per line
cat("Unique feature types in final GFF3:\n")
intermediate_gff %>%
  distinct(type) %>%
  pull(type) %>%
  walk(~cat("-", .x, "\n"))
```

# Write output

```{r write-reformatted-gff, eval=TRUE}
write_delim(intermediate_gff, 
            "../data/intermediate.gff3", 
            delim="\t",
            col_names=FALSE)
```

# Validate GFF

Validate GFF using genometools `gff3`.

-   `-tidy`: Attempts to clean/fix any potential issues.

-   `-checkids`: Checks IDs.

-   `-retainids`: Retains IDs from input GFF instead of assigning new ones.

```{r validate-gff, engine='bash', eval=TRUE}
source .bashvars

gt gff3 \
-tidy \
-checkids \
-retainids \
"${data_dir}"/"${intermediate_gff}" \
> "${data_dir}"/"${validated_gff}" \
2> "${data_dir}"/gt_gff3_validation_errors.log
```

## Check for error(s) in validation

Process would stop if error occurred, so only need to check end of file.

::: callout-note
Warnings are expected, as this program warns the user when it is inserting text.

In this instance, when it is inserting GFF3-compliant comment lines denoting regions.
:::

```{r validate-gff-error-check, engine='bash', eval=TRUE}
source .bashvars

tail "${data_dir}"/gt_gff3_validation_errors.log
```

## Inspect Validated GFF

```{r validated-gff, engine='bash', eval=TRUE}
source .bashvars

head "${data_dir}"/"${validated_gff}"
```

## Remove intermediate GFF

```{r rm-intermediate-gff, engine='bash', eval=TRUE}
source .bashvars

rm "${data_dir}"/"${intermediate_gff}"
```

# Convert to GTF

```{r create-GTF, engine='bash', eval=TRUE}
source .bashvars

${gffread} \
-E \
-T \
"${data_dir}"/"${validated_gff}" \
> "${data_dir}"/"${gtf}" \
2> "${data_dir}"/gffread-gtf.log

```

## Inspect GTF

```{r inspect-GTF, engine='bash', eval=TRUE}
source .bashvars

head "${data_dir}"/"${gtf}"

```

## Check GTF Log

```{r check-GTF-log, engine='bash', eval=TRUE}
source .bashvars

head "${data_dir}"/gffread-gtf.log

```

# REFERENCES