---
title: "Step 5: Create a Count Matrix"
subtitle: "Using a Python script"
author: "Sarah Tanja"
date: 11/14/2024
format:
  gfm: default  # or html if you want to render in HTML
toc: true
toc-depth: 3
link-external-icon: true
link-external-newwindow: true
reference-location: margin
citation-location: margin
---

> DESeq2 and edgeR are two popular Bioconductor packages for analyzing differential expression, which take as input a matrix of read counts mapped to particular genomic features (e.g., genes). We provide a Python script (prepDE.py, or the Python 3 version: prepDE.py3) that can be used to extract this read count information directly from the files generated by StringTie (run with the -e parameter).
>
> prepDE.py derives hypothetical read counts for each transcript from the coverage values estimated by StringTie for each transcript, by using this simple formula:
>
> `reads_per_transcript = coverage * transcript_len / read_len`

# Generate Count Matrix

List the GTF files that StringTie wrote in the previous step:

```{r, engine='bash'}
ls ../output/04_align/stringtie/*.gtf
```

Count them:

```{r, engine="bash"}
cd ../output/04_align/stringtie
find . -type f -name "*.gtf" | wc -l
```

This should print 63, one GTF file per sample!

Save the list of basenames (sample names) and their respective file paths to `gtf_list.txt`, as two tab-separated columns:

```{r, engine='bash'}
# Column 1: file name with the .gtf extension stripped; column 2: path to the file
ls ../output/04_align/stringtie/*.gtf | \
awk -F'/' '{path=$0; file=$NF; sub(/\.gtf$/, "", file); print file "\t" path}' \
> ../output/05_count/gtf_list.txt
```

Look at the `gtf_list.txt` we just made to make sure there is a basename and a file path for each sample:

```{r, engine='bash'}
cat ../output/05_count/gtf_list.txt
```

Run the Python script `prepDE.py3`, which takes the basenames and file paths of the GTF files and churns out a gene count matrix and a transcript count matrix.
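
Before running it, the eyeball check of `gtf_list.txt` can also be done programmatically. The sketch below is not part of StringTie or `prepDE.py3`; `check_gtf_list` is a hypothetical helper that only assumes the two-column, tab-separated layout produced above (sample name, then path) and optionally compares the row count to the expected number of samples.

```python
from pathlib import Path


def check_gtf_list(list_path, expected_samples=None):
    """Parse a two-column, tab-separated sample list and return (sample, path) pairs.

    Hypothetical helper: raises ValueError on malformed lines, duplicate
    sample names, or an unexpected number of samples.
    """
    rows = []
    seen = set()
    lines = Path(list_path).read_text().splitlines()
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {n}: expected 2 tab-separated fields, got {len(fields)}"
            )
        sample, gtf_path = fields
        if sample in seen:
            raise ValueError(f"line {n}: duplicate sample name {sample!r}")
        seen.add(sample)
        rows.append((sample, gtf_path))
    if expected_samples is not None and len(rows) != expected_samples:
        raise ValueError(
            f"expected {expected_samples} samples, found {len(rows)}"
        )
    return rows
```

For this project the call would look like `check_gtf_list("../output/05_count/gtf_list.txt", expected_samples=63)`; it returns the parsed pairs if everything lines up and raises otherwise.
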

```{r, engine='bash'}
python3 /home/shared/stringtie-2.2.1.Linux_x86_64/prepDE.py3 \
-i ../output/05_count/gtf_list.txt \
-g ../output/05_count/gene_count_matrix.csv \
-t ../output/05_count/transcript_count_matrix.csv
```

::: callout-important
###### Don't forget to always rsync backup!

```
rsync -avz /media/4TB_JPG_ext/stanja/gitprojects \
stanja@gannet.fish.washington.edu:/volume2/web/stanja/ravenbackup
```
:::
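
For intuition about the numbers in these matrices, the formula quoted at the top of this page can be worked through by hand. The sketch below only illustrates that formula and is not the script's actual code: the function mirrors the quoted variable names, the input values are made up, and any rounding the real script applies is not modeled. The read length the script assumes is one of its parameters, so it is worth checking `prepDE.py3 --help` against the actual read length of the libraries.

```python
def reads_per_transcript(coverage, transcript_len, read_len):
    """Hypothetical read count: coverage * transcript_len / read_len."""
    if read_len <= 0:
        raise ValueError("read_len must be positive")
    return coverage * transcript_len / read_len


# Made-up example values: 12.5x coverage over an 1800 bp transcript, 150 bp reads
example = reads_per_transcript(coverage=12.5, transcript_len=1800, read_len=150)
print(example)  # 150.0
```

Doubling the assumed read length halves the derived count, which is why a mismatch between the assumed and true read length scales every count in the matrix by the same factor.
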