---
title: "Example exon expression table"
author: "Sam White"
date: "2023-04-04"
output: html_document
---
# R Markdown file to generate an example exon expression (read count) table per [this GitHub Issue](https://github.com/RobertsLab/resources/issues/1609).

The expected output file:

> row is genes, and columns are exon count data..(eg exon01, exon02, econ03 ....)


## Load libraries
```{r load-libraries}
library("tidyverse")
```


# Reformat Ballgown exon expression table to match BED formatting

## View Ballgown exon table
```{r view-ballgown-exon-table}
system("head '../data/ballgown/S12M/e_data.ctab'")

```
## Create BED format

Moves `e_id` column to represent `name` column, and moves `rcount` to `score` column.
Also removes header.
```{bash view-ballgown-exon-table}
awk '{print $2"\t"$4"\t"$5"\t"$1"\t"$6"\t"$3}' ../data/ballgown/S12M/e_data.ctab | tail -n +2 \
> ../output/19-exon-expression/S12M-e_data.bed

head ../output/19-exon-expression/S12M-e_data.bed

```

## Perform exon/gene intersection using bedtools
```{bash exon-gene-intersetion-bedtools}
# Use bedtools to intersect the exons BED file with the genes BED file
/home/shared/bedtools2/bin/bedtools intersect \
-b ../output/19-exon-expression/S12M-e_data.bed \
-a ../data/C_virginica-3.0_Gnomon_genes.bed -wo \
> ../output/19-exon-expression/S12M-exon_gene_intersect.txt

head ../output/19-exon-expression/S12M-exon_gene_intersect.txt
```

## Reorient data to be organized by gene
```{r}
# Read in the combined BED file as a data frame
# Set column names
combined_df <- read.table("../output/19-exon-expression/S12M-exon_gene_intersect.txt", header = FALSE, col.names = c("chrom", "start", "end", "gene_name", "gene_score", "gene_strand","exon_chrom", "exon_start", "exon_end", "exon_name", "exon_score", "exon_strand", "overlap_length"))

# Reorder the columns
combined_df <- combined_df[, c("chrom", "start", "end", "gene_name", "gene_score", "gene_strand", "exon_score")]

# Group by gene name
grouped_df <- combined_df %>%
  group_by(gene_name)

# Add a new column with exon numbering for each gene
grouped_df <- grouped_df %>%
  mutate(exon_num = row_number())

# Pivot the table to have exon numbers as column headers
pivoted_df <- grouped_df %>%
  pivot_wider(names_from = exon_num, values_from = exon_score, names_prefix = "exon_")

# Remove unneeded columns
final_table <- pivoted_df %>%
  select(-chrom, -start, -end, -gene_score, -gene_strand)

str(final_table)

```

## Write to file
```{r write-to-file}
write.table(final_table, file = "../output/19-exon-expression/S12M-exon_expression.tab", sep = "\t", quote = FALSE, row.names = FALSE)
```