---
title: "Step 5: Create a Count Matrix and run Differential Gene Expression"
subtitle: "Using `DESeq2`"
author: "Sarah Tanja"
date: 11/14/2024
format:
  gfm: default  # or html if you want to render in HTML
toc: true
toc-depth: 3
link-external-icon: true
link-external-newwindow: true
reference-location: margin
citation-location: margin
---

> DESeq2 and edgeR are two popular Bioconductor packages for analyzing differential expression, which take as input a matrix of read counts mapped to particular genomic features (e.g., genes). We provide a Python script (prepDE.py, or the Python 3 version: prepDE.py3) that can be used to extract this read count information directly from the files generated by StringTie (run with the -e parameter).

> prepDE.py derives hypothetical read counts for each transcript from the coverage values estimated by StringTie for each transcript, by using this simple formula: reads_per_transcript = coverage \* transcript_len / read_len

# Count Matrix

```{r, engine='bash'}
# List the GTF files produced by StringTie in the alignment step
ls ../output/04_align/stringtie/*gtf
```

```{r, engine='bash'}
# Write the bare list of GTF paths (overwritten by the next chunk, which adds sample IDs)
ls ../output/04_align/stringtie/*gtf > ../output/05_count/gtf_list.txt
```

```{r, engine='bash'}
# Build the two-column, tab-separated list: sample ID (file name without .gtf) and path
ls ../output/04_align/stringtie/*.gtf \
  | awk -F'/' '{path=$0; file=$NF; gsub(/\.gtf$/, "", file); print file "\t" path}' \
  > ../output/05_count/gtf_list.txt
```

```{r, engine='bash'}
# Check the sample list before passing it to prepDE.py3
cat ../output/05_count/gtf_list.txt
```

```{r, engine='bash'}
# Generate the gene-level and transcript-level count matrices
python3 /home/shared/stringtie-2.2.1.Linux_x86_64/prepDE.py3 \
-i ../output/05_count/gtf_list.txt \
-g ../output/05_count/gene_count_matrix.csv \
-t ../output/05_count/transcript_count_matrix.csv
```

::: callout-important
###### Don't forget to always rsync backup!

```
rsync -avz /media/4TB_JPG_ext/stanja/gitprojects \
stanja@gannet.fish.washington.edu:/volume2/web/stanja/ravenbackup
```
:::
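
As a quick sanity check on the formula quoted at the top of this step, the arithmetic can be worked through by hand for a single transcript. The sketch below is only an illustration: the function name `reads_per_transcript` and the input values (coverage 12.5, transcript length 1500 bp, read length 150 bp) are made up, and it is not taken from prepDE.py3, which may round the result or assume a different read length.

```bash
#!/usr/bin/env bash
# Illustrative sketch of the quoted formula:
#   reads_per_transcript = coverage * transcript_len / read_len
# The function name and all input values are hypothetical.

reads_per_transcript() {
  local coverage="$1"        # per-base coverage estimated by StringTie
  local transcript_len="$2"  # transcript length in bases
  local read_len="$3"        # average read length in bases
  # awk handles the floating-point arithmetic that bash lacks
  awk -v c="$coverage" -v t="$transcript_len" -v r="$read_len" \
    'BEGIN { printf "%.2f\n", c * t / r }'
}

# 12.5 * 1500 / 150
reads_per_transcript 12.5 1500 150   # prints 125.00
```

Comparing a value like this against the matching row of `transcript_count_matrix.csv` is one way to confirm that the read length assumed by the script matches the library being analyzed.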