---
title: "8_7_merged_immune_cluster_counts.Rmd"
author: "Aspen Coyle"
date: "7/19/2021"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Introduction

In scripts 8_3 and 8_4, we manually clustered expression of all transcripts with GO terms linked to immune response from 2 of our 3 transcriptomes. Reminder: they are as follows:

cbai_transcriptomev2.0: unfiltered

cbai_transcriptomev4.0: filtered to include only likely _Chionoecetes_ sequences

hemat_transcriptomev1.6: filtered to include only likely _Alveolata_ sequences

We didn't create clusters for hemat_transcriptomev1.6, as it only had 5 transcripts with GO terms linked to immune response.

However, once we named our modules, we had some with duplicate names. Modules describing the same expression patterns (as determined by names and assigned in 8_3 and 8_4) were merged in scripts 5_5 and 8_6. At the end of each of those scripts, we wrote the line count - which is equal to the number of genes in each module - of the merged modules for each crab to a file. 

In this script, we will take those word counts and turn them into tables for optimal presentation and reproducibility


```{r libraries, message=FALSE, warning=FALSE}
# Add all required libraries here
list.of.packages <- c("tidyverse", "janitor")
# Get names of all required packages that aren't installed
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
# Install all new packages
if(length(new.packages)) install.packages(new.packages)


# Load all required libraries
lapply(list.of.packages, FUN = function(X) {
  do.call("require", list(X))
})


```

### Reading in data

```{r}
# Get the path of all relevant files
file_list <- Sys.glob("../output/manual_clustering/*/immune_genes/merged_modules_raw_counts.txt")

# In each iteration of the for loop, we'll choose a different transcriptome's raw counts to examine, create two neat summary tables - one with percentages, one with counts - and write as CSVs
for (i in 1:length(file_list)) {
  counts <- read.table(file_list[i])

  # Remove the last line - we can figure out the total on our own
  counts <- head(counts, -1)

  # Split the path column by slashes
  counts <- separate(counts, 2, into = c("A", "B", "C", "D", "E", "F", "G", "H"), sep = "/")

  # Remove columns without multiple values. Should leave us with columns for counts, crab, and module type
  counts <- counts[vapply(counts, function(x) length(unique(x)) > 1, logical(1L))]
  
  # Rename existing columns
  colnames(counts) <- c("Genes", "Crab", "Module")
  
  # Remove the _merged.txt part of each Module column
  counts$Module <- str_replace(counts$Module, "_merged.txt", "")
  
  # Pivot wider so that each module type is its own column
  counts <- counts %>%
    pivot_wider(names_from = Module, values_from = Genes)
  
  # Create another table with percentage of module membership for each crab (each crab should sum to 100%)
  percentages <- adorn_percentages(counts, denominator = "row", na.rm = TRUE)
  
  # Move crab column to rowname for both tables
  counts <- column_to_rownames(counts, var = "Crab")
  percentages <- column_to_rownames(percentages, var = "Crab")
  
  # Round percentages to the nearest few digits
  percentages <- round(percentages, digits = 3)
  
  # Get the path for that transcriptome
  path <- file_list[i]
  
  # Remove the ending part of the path
  path <- str_replace(path, "merged_modules_raw_counts.txt", "")
  
  # Write our counts table
  write.csv(counts, file = paste0(path, "merged_modules_counts_table.csv"))
  # Write our percentages table
  write.csv(percentages, file = paste0(path, "merged_modules_percentages_table.csv"))
}
```