---
title: "02.00-D-Apul-RNAseq-gff-to-gtf"
author: "Sam White"
date: "2024-10-08"
output: 
  bookdown::html_document2:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
  github_document:
    toc: true
    number_sections: true
  html_document:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
bibliography: references.bib
---

```{r setup, include=FALSE}
library(knitr)
knitr::opts_chunk$set(
  echo = TRUE,         # Display code chunks
  eval = FALSE,        # Evaluate code chunks
  warning = FALSE,     # Hide warnings
  message = FALSE,     # Hide messages
  comment = ""         # Prevents appending '##' to beginning of lines in code output
)
```

This notebook will utilize [gffread](https://ccb.jhu.edu/software/stringtie/gff.shtml) [@pertea2020] to create an _A.pulchras_ GTF file from the _A.pulchra_ genome GFF, which is needed for downstream analysis with [HISAT2](https://github.com/DaehwanKimLab/hisat2); specifically for identification of exons and splice sites.

# Create a Bash variables file

This allows usage of Bash variables across R Markdown chunks.

```{r save-bash-variables-to-rvars-file, engine='bash', eval=TRUE}
{
echo "#### Assign Variables ####"
echo ""

echo "# Data directories"
echo 'export timeseries_dir=/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular'
echo 'export genome_dir="${timeseries_dir}/D-Apul/data"'
echo 'export output_dir_top=${timeseries_dir}/D-Apul/output/02.00-D-Apul-RNAseq-gff-to-gtf'
echo ""

echo "# Input/output files"

echo 'export genome_gff="${genome_dir}/Apulchra-genome.gff"'
echo 'export transcripts_gtf="Apulchra-genome.gtf"'

echo "# Paths to programs"
echo 'export programs_dir="/home/shared"'
echo 'export gffread="${programs_dir}/gffread-0.12.7.Linux_x86_64/gffread"'

echo "# Set number of CPUs to use"
echo 'export threads=40'
echo ""


echo "# Programs associative array"
echo "declare -A programs_array"
echo "programs_array=("
echo '[gffread]="${gffread}" \'
echo ")"
echo ""

echo "# Print formatting"
echo 'export line="--------------------------------------------------------"'
echo ""
} > .bashvars

cat .bashvars
```

# Preview and Validate Genome GFF

## Inspect GFF
```{r preview-genome-gff, engine='bash', eval=TRUE}
# Load bash variables into memory
source .bashvars

# Make directories, if they don't exist
mkdir --parents "${output_dir_top}"

head -n 20 "${genome_gff}"
```

## Valdiate GFF

This identifies if there are rows with >9 fields (which there shouldn't be in a [GFF3](http://gmod.org/wiki/GFF3)).

Additionally, it provides a preview of all rows lengths identified.

```{r validate-genome-gff, engine='bash', eval=TRUE}
# Load bash variables into memory
source .bashvars

# Capture number of fields (NF) in each row in array.
field_count_array=($(awk -F "\t" '{print NF}' "${genome_gff}" | sort --unique))


# Check array contents
echo "List of number of fields in ${genome_gff}:"
echo ""
for field_count in "${field_count_array[@]}"
do
  echo "${field_count}"
done

echo ""
echo "${line}"
echo ""

# Preview of each line "type" with a given number of fields
# Check array contents
echo ""
for field_count in "${field_count_array[@]}"
do
  echo "Preview of lines with ${field_count} field(s):"
  echo ""
  awk \
    -v field_count="${field_count}" \
    -F "\t" \
    'NF == field_count' \
    "${genome_gff}" \
    | head
  echo ""
  echo "${line}"
  echo ""
done
```
Great! This looks like a valid GFF. Can proceed with GTF generation.

# Generate GTF

```{r generate-genome-gtf, engine='bash', eval=FALSE}
# Load bash variables into memory
source .bashvars

${programs_array[gffread]} -E \
"${genome_gff}" -T \
1> ${output_dir_top}/"${transcripts_gtf}" \
2> ${output_dir_top}/gffread-gff_to_gtf.stderr

```

# Inspect GTF

```{r inspect-genome-gtf, engine='bash', eval=TRUE}
# Load bash variables into memory
source .bashvars

head ${output_dir_top}/"${transcripts_gtf}"

```

## Copy GTF to `D-Apul/data`

To help make this easier to locate, will copy to the `D-Apul/data` directory, which also contains
the genome FastA, genome FastA index, and the genome GFF files.

```{r copy-genome-gtf, engine='bash', eval=FALSE}
# Load bash variables into memory
source .bashvars

cp ${output_dir_top}/"${transcripts_gtf}" "${genome_dir}"

```