---
title: "Step 5: Create a Count Matrix and run Differential Gene Expression"
subtitle: "Using `DESeq2`"
author: "Sarah Tanja"
date: 11/14/2024
format:
  gfm: default  # or html if you want to render in HTML
toc: true
toc-depth: 3
link-external-icon: true
link-external-newwindow: true
reference-location: margin
citation-location: margin
---

> DESeq2 and edgeR are two popular Bioconductor packages for analyzing differential expression, which take as input a matrix of read counts mapped to particular genomic features (e.g., genes). We provide a Python script (`prepDE.py`, or the Python 3 version: `prepDE.py3`) that can be used to extract this read count information directly from the files generated by StringTie (run with the `-e` parameter).

> `prepDE.py` derives hypothetical read counts for each transcript from the coverage values estimated by StringTie for each transcript, by using this simple formula: `reads_per_transcript = coverage * transcript_len / read_len`
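
To make the formula concrete, here is a minimal sketch of the arithmetic. The function name `reads_per_transcript` and the numbers are illustrative assumptions, not values from this dataset, and the real script may round the result. As far as I know, `prepDE.py3` takes the read length from its `-l` option and falls back to 75 when it is not given, which is the case in the command used later in this step.

```python
def reads_per_transcript(coverage: float, transcript_len: int, read_len: int) -> float:
    """Hypothetical helper: turn a StringTie coverage value into an estimated read count."""
    if read_len <= 0:
        raise ValueError("read_len must be positive")
    # Total covered bases (coverage * transcript_len) divided by bases per read.
    return coverage * transcript_len / read_len


# Illustrative values only: a 1,500 bp transcript at 10x mean coverage with 75 bp reads.
example = reads_per_transcript(coverage=10.0, transcript_len=1500, read_len=75)
print(example)  # 200.0
```

Because the read length sits in the denominator, a `read_len` that does not match the actual reads rescales every count by the same factor.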

# Count Matrix

```{r, engine='bash'}
ls ../output/04_align/stringtie/*gtf 
```

```{r, engine='bash'}
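# Paths only; prepDE.py3 expects a sample ID and a path on each line,
# so the next chunk overwrites this file with the two-column version.
ls ../output/04_align/stringtie/*gtf > ../output/05_count/gtf_list.txt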
```

```{r, engine='bash'}
# Build "sample_id<TAB>path" lines; the sample ID is the file name without its .gtf extension
ls ../output/04_align/stringtie/*.gtf | awk -F'/' '{path=$0; file=$NF; gsub(/\.gtf$/, "", file); print file "\t" path}' > ../output/05_count/gtf_list.txt
```

```{r, engine='bash'}
cat ../output/05_count/gtf_list.txt
```
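
Before handing the list to `prepDE.py3`, it can help to confirm that every line really has the two-column shape built above. The following is a rough sketch of such a check; `parse_sample_list` and the sample names in the example are hypothetical, and it assumes the fields are separated by a single tab, as the `awk` command writes them.

```python
def parse_sample_list(text: str) -> list[tuple[str, str]]:
    """Hypothetical check: parse 'sample_id<TAB>path' lines and fail loudly on malformed ones."""
    samples = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {line_no}: expected 2 tab-separated fields, got {len(fields)}"
            )
        sample_id, path = fields
        if not path.endswith(".gtf"):
            raise ValueError(f"line {line_no}: path does not end in .gtf: {path}")
        samples.append((sample_id, path))
    return samples


# Made-up sample names, for illustration only.
example = (
    "sampleA\t../output/04_align/stringtie/sampleA.gtf\n"
    "sampleB\t../output/04_align/stringtie/sampleB.gtf\n"
)
for sample_id, path in parse_sample_list(example):
    print(sample_id, path)
```

To run it on the real list, read `../output/05_count/gtf_list.txt` into a string and pass that in.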

```{r, engine='bash'}
python3 /home/shared/stringtie-2.2.1.Linux_x86_64/prepDE.py3 \
-i ../output/05_count/gtf_list.txt \
-g ../output/05_count/gene_count_matrix.csv \
-t ../output/05_count/transcript_count_matrix.csv
```
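
A quick sanity check on the resulting matrix is to total the counts per sample, since a sample with a very low total usually points to a problem upstream. This is only a sketch: `library_sizes` is a hypothetical helper, the tiny matrix is made up, and it assumes the CSV layout is a header row with `gene_id` followed by one column of integer counts per sample.

```python
import csv
import io


def library_sizes(csv_text: str) -> dict[str, int]:
    """Hypothetical helper: sum the counts in each sample column of a count matrix."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    sample_ids = header[1:]  # first column holds the gene ID
    totals = {sample_id: 0 for sample_id in sample_ids}
    for row in reader:
        if not row:
            continue  # skip empty lines
        for sample_id, value in zip(sample_ids, row[1:]):
            totals[sample_id] += int(value)
    return totals


# Made-up matrix, for illustration only.
example = "gene_id,sampleA,sampleB\ngene1,10,0\ngene2,5,7\n"
print(library_sizes(example))  # {'sampleA': 15, 'sampleB': 7}
```

For the real output, pass in the contents of `../output/05_count/gene_count_matrix.csv`.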

::: callout-important
###### Don't forget to always back up with rsync!

```         
rsync -avz /media/4TB_JPG_ext/stanja/gitprojects \
stanja@gannet.fish.washington.edu:/volume2/web/stanja/ravenbackup
```
:::