06.2-cod-RNAseq-alignment-genome
================
Kathleen Durkin
2024-04-16
- 1 Create a Bash variables
file
- 2 Align to reference genome
(Hisat2)
- 2.1 Retrieving the
reference genome and gff
- 2.2 Verify genome FastA MD5
checksum
- 2.3 Building
Index
- 2.4 Alignment
- 3 Read
Summarization
- 3.1 Exons
- 3.2 Genes
``` r
library(tidyverse)
library(dplyr)
library(magrittr)
library(knitr)
library(ggplot2)
library(plotly)
```
Code for aligning RNAseq data to reference genome, to be used on
[Pacific cod RNAseq
data](https://shedurkin.github.io/Roberts-LabNotebook/posts/projects/pacific_cod/2023_12_13_pacific_cod.html).
- Raw reads found
[here](https://owl.fish.washington.edu/nightingales/G_macrocephalus/30-943133806/)
- Trimmed reads:
- Genome downloaded from
[NCBI](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_031168955.1/),
stored
[here](https://owl.fish.washington.edu/halfshell/genomic-databank/GCF_031168955.1_ASM3116895v1_rna.fna)
as a part of lab [genomic
resources](https://robertslab.github.io/resources/Genomic-Resources/#gadus-macrocephalus-pacific-cod)
- Genome GTF downloaded from
[NCBI](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_031168955.1/)
# 1 Create a Bash variables file
This allows usage of Bash variables (e.g. paths to common directories)
across R Markdown chunks.
``` bash
{
echo "#### Assign Variables ####"
echo ""
echo "# Data directories"
echo 'export cod_dir=/home/shared/8TB_HDD_02/shedurkin/project-cod-temperature'
echo 'export output_dir_top=${cod_dir}/output/06.2-cod-RNAseq-alignment-genome'
echo 'export genome_fasta_dir=${cod_dir}/data'
echo 'export trimmed_reads_dir=${cod_dir}/output/05-cod-RNAseq-trimming/trimmed-reads'
echo ""
echo "# Input/Output files"
echo 'export genome_fasta_name="GCF_031168955.1_ASM3116895v1_genomic"'
echo 'export genome_fasta="${genome_fasta_dir}/${genome_fasta_name}"'
echo 'export genome_gtf_name="genomic"'
echo 'export genome_gtf="${genome_fasta_dir}/${genome_gtf_name}"'
echo 'export hisat2_exon_name="G_macrocephalus_exon"'
echo 'export hisat2_exon="${hisat2_output_dir}/${hisat_exon_name}"'
echo 'export hisat2_splice_sites_name="G_macrocephalus_splice_sites"'
echo 'export hisat2_splice_sites="${hisat2_output_dir}/${hisat2_splice_sites_name}"'
echo 'export hisat2_index_name="G_macrocephalus_Hisat2_index"'
echo 'export hisat2_index="${hisat2_output_dir}/${hisat2_index}"'
echo "# External data URLs and checksums"
echo 'export genome_fasta_url="https://owl.fish.washington.edu/halfshell/genomic-databank/GCF_031168955.1_ASM3116895v1_genomic.fna"'
echo 'export genome_fasta_checksum="5144890d4eceb0b258d92db3f35c681e"'
echo 'export genome_gtf_url="https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_031168955.1/download?include_annotation_type=GENOME_GTF"'
#echo 'export genome_gtf_checksum="173fb3c159e474391c5c4aa1f7230024"'
echo "# Paths to programs"
echo 'export hisat2_exons=/home/shared/hisat2-2.2.1/hisat2_extract_exons.py'
echo 'export hisat2_splice_sites=/home/shared/hisat2-2.2.1/hisat2_extract_splice_sites.py'
echo 'export hisat2_build=/home/shared/hisat2-2.2.1/hisat2-build'
echo ""
echo "# Set number of CPUs to use"
echo 'export threads=20'
echo ""
echo "# Programs associative array"
echo "declare -A programs_array"
echo "programs_array=("
echo '[hisat2_exons]="${hisat2_exons}" \'
echo '[hisat2_splice_sites]="${hisat2_splice_sites}" \'
echo '[hisat2_build]="${hisat2_build}" \'
echo '[trinity_abund_to_matrix]="${trinity_abund_to_matrix}" \'
echo ")"
} > .bashvars
cat .bashvars
```
#### Assign Variables ####
# Data directories
export cod_dir=/home/shared/8TB_HDD_02/shedurkin/project-cod-temperature
export output_dir_top=${cod_dir}/output/06.2-cod-RNAseq-alignment-genome
export genome_fasta_dir=${cod_dir}/data
export trimmed_reads_dir=${cod_dir}/output/05-cod-RNAseq-trimming/trimmed-reads
# Input/Output files
export genome_fasta_name="GCF_031168955.1_ASM3116895v1_genomic"
export genome_fasta="${genome_fasta_dir}/${genome_fasta_name}"
export genome_gtf_name="genomic"
export genome_gtf="${genome_fasta_dir}/${genome_gtf_name}"
export hisat2_exon_name="G_macrocephalus_exon"
export hisat2_exon="${hisat2_output_dir}/${hisat_exon_name}"
export hisat2_splice_sites_name="G_macrocephalus_splice_sites"
export hisat2_splice_sites="${hisat2_output_dir}/${hisat2_splice_sites_name}"
export hisat2_index_name="G_macrocephalus_Hisat2_index"
export hisat2_index="${hisat2_output_dir}/${hisat2_index}"
# External data URLs and checksums
export genome_fasta_url="https://owl.fish.washington.edu/halfshell/genomic-databank/GCF_031168955.1_ASM3116895v1_genomic.fna"
export genome_fasta_checksum="5144890d4eceb0b258d92db3f35c681e"
export genome_gtf_url="https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_031168955.1/download?include_annotation_type=GENOME_GTF"
# Paths to programs
export hisat2_exons=/home/shared/hisat2-2.2.1/hisat2_extract_exons.py
export hisat2_splice_sites=/home/shared/hisat2-2.2.1/hisat2_extract_splice_sites.py
export hisat2_build=/home/shared/hisat2-2.2.1/hisat2-build
# Set number of CPUs to use
export threads=20
# Programs associative array
declare -A programs_array
programs_array=(
[hisat2_exons]="${hisat2_exons}" \
[hisat2_splice_sites]="${hisat2_splice_sites}" \
[hisat2_build]="${hisat2_build}" \
[trinity_abund_to_matrix]="${trinity_abund_to_matrix}" \
)
# 2 Align to reference genome (Hisat2)
## 2.1 Retrieving the reference genome and gff
``` bash
# Load bash variables into memory
source .bashvars
wget \
--directory-prefix ${genome_fasta_dir} \
--recursive \
--no-check-certificate \
--continue \
--no-host-directories \
--no-directories \
--no-parent \
--quiet \
--execute robots=off \
--accept "${genome_fasta_name}.fna" ${genome_fasta_url}
```
v NOT CURRENTLY WORKING v, had to download locally and then upload to
server for use
``` bash
# Load bash variables into memory
source .bashvars
cd ../data
curl -O "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_031168955.1/download?include_annotation_type=GENOME_GTF"
#curl -O "${genome_gtf_url}"
```
``` bash
# Load bash variables into memory
source .bashvars
ls -lh "${genome_fasta_dir}"
```
total 1.9G
drwxr-xr-x 3 shedurkin labmembers 4.0K Mar 4 11:05 05-cod-RNAseq-trimming
-rw-r--r-- 1 shedurkin labmembers 13K Dec 27 15:45 Cod_RNAseq_NGS_Template_File.xlsx
-rw-r--r-- 1 shedurkin labmembers 2.1K Mar 20 20:55 DESeq2_Sample_Information.csv
-rw-r--r-- 1 shedurkin labmembers 38M Oct 25 2023 Gadus_macrocephalus.coding.gene.V1.cds
-rw-r--r-- 1 shedurkin labmembers 537M Oct 16 2023 GCF_031168955.1_ASM3116895v1_genomic.fna
-rw-r--r-- 1 shedurkin labmembers 351M Oct 16 2023 GCF_031168955.1_ASM3116895v1.gff
-rw-r--r-- 1 shedurkin labmembers 169M Oct 16 2023 GCF_031168955.1_ASM3116895v1_rna.fna
-rw-r--r-- 1 shedurkin labmembers 404M Apr 23 14:29 genomic.gtf
-rw-r--r-- 1 shedurkin labmembers 47K Oct 25 2023 Pcod Temp Growth experiment 2022-23 DATA.xlsx
-rw-r--r-- 1 shedurkin labmembers 231K Mar 4 17:41 Sample.QC.report.of_30-943133806_240118025106.pdf
-rw-r--r-- 1 shedurkin labmembers 12K Mar 4 17:41 Sample.QC.report.of_30-943133806_240118025106.xlsx
-rw-r--r-- 1 shedurkin labmembers 12K Oct 25 2023 temp-experiment.csv
-rw-r--r-- 1 shedurkin labmembers 271M Oct 25 2023 uniprot_sprot_r2023_04.fasta
-rw-r--r-- 1 shedurkin labmembers 88M Apr 17 11:54 uniprot_sprot_r2023_04.fasta.gz
## 2.2 Verify genome FastA MD5 checksum
``` bash
# Load bash variables into memory
source .bashvars
cd "${genome_fasta_dir}"
# Checksums file contains other files, so this just looks for the sRNAseq files.
md5sum --check <<< "${genome_fasta_checksum} ${genome_fasta_name}.fna"
```
``` bash
# Load bash variables into memory
source .bashvars
cd "${genome_fasta_dir}"
# Checksums file contains other files, so this just looks for the sRNAseq files.
md5sum --check <<< "${genome_gtf_checksum} ${genome_gtf_name}.gtf"
```
## 2.3 Building Index
``` bash
# Load bash variables into memory
source .bashvars
# Create Hisat2 exons tab file
/home/shared/hisat2-2.2.1/hisat2_extract_exons.py \
../data/genomic.gtf \
> ../output/06.2-cod-RNAseq-alignment-genome/hisat2/G_macrocephalus_exon.tab
# Create Hisat2 exons tab file
#"${programs_array[hisat2_exons]}" \
#"${genome_gtf}.gtf" \
#> "${hisat2_exon}.tab"
```
``` bash
# Create Hisat2 splice sites tab file
/home/shared/hisat2-2.2.1/hisat2_extract_splice_sites.py \
../data/genomic.gtf \
> ../output/06.2-cod-RNAseq-alignment-genome/hisat2/G_macrocephalus_splice_sites.tab
#"${programs_array[hisat2_splice_sites]}" \
#"${genome_gtf}.gtf" \
#> "${hisat2_splice_sites}.tab"
```
``` bash
# Build Hisat2 reference index using splice sites and exons
/home/shared/hisat2-2.2.1/hisat2-build \
../data/GCF_031168955.1_ASM3116895v1_genomic.fna \
../output/06.2-cod-RNAseq-alignment-genome/hisat2/G_macrocephalus_index.idx \
--exon ../output/06.2-cod-RNAseq-alignment-genome/hisat2/G_macrocephalus_exon.tab \
--ss ../output/06.2-cod-RNAseq-alignment-genome/hisat2/G_macrocephalus_splice_sites.tab \
-p 20 \
2> ../output/06.2-cod-RNAseq-alignment-genome/hisat2/hisat2-build_stats.txt
#"${programs_array[hisat2_build]}" \
#"${genome_fasta}.fna" \
#"${hisat2_index}.idx" \
#--exon "${hisat2_exon}.tab" \
#--ss "${hisat2_splice_sites}.tab" \
#-p "${threads}" \
#2> "${output_dir_top}/hisat2-build_stats.txt"
```
``` bash
# Load bash variables into memory
source .bashvars
ls -lh "${output_dir_top}"
```
total 56K
drwxr-xr-x 3 shedurkin labmembers 4.0K Apr 30 15:04 featureCounts-exon
drwxr-xr-x 3 shedurkin labmembers 36K May 1 15:34 featureCounts-gene
drwxr-xr-x 2 shedurkin labmembers 12K Apr 30 14:23 hisat2
## 2.4 Alignment
``` bash
# Load bash variables into memory
source .bashvars
## Sample Quantification
# Hisat2 alignments
find ../output/05-cod-RNAseq-trimming/trimmed-reads/*.gz \
| xargs basename -s .flexbar_trim.R_1.fastq.gz | xargs -I{} \
/home/shared/hisat2-2.2.1/hisat2 \
-x ../output/06.2-cod-RNAseq-alignment-genome/hisat2/G_macrocephalus_index.idx \
-p 20 \
-1 ../output/05-cod-RNAseq-trimming/trimmed-reads/{}.flexbar_trim.R_1.fastq.gz \
-2 ../output/05-cod-RNAseq-trimming/trimmed-reads/{}.flexbar_trim.R_2.fastq.gz \
-S ../output/06.2-cod-RNAseq-alignment-genome/hisat2/{}.sam
&> ../output/06.2-cod-RNAseq-alignment-genome/hisat2/{}_hisat2.log
# # Hisat2 alignments
# find ${trimmed_reads_dir}/*.gz \
# | xargs basename -s .flexbar_trim.R_1.fastq.gz | xargs -I{} \
# "${programs_array[hisat2]}" \
# -x "${hisat2_index}.idx" \
# -p 20 \
# -1 ${trimmed_reads_dir}/{}.flexbar_trim.R_1.fastq.gz \
# -2 ${trimmed_reads_dir}/{}.flexbar_trim.R_2.fastq.gz \
# -S ${output_dir_top}/{}.sam
#
```
``` bash
# Sort SAM files, convert to BAM, and index
for samfile in ../output/06.2-cod-RNAseq-alignment-genome/hisat2/*.sam; do
bamfile="${samfile%.sam}.bam"
sorted_bamfile="${samfile%.sam}.sorted.bam"
# Check if the output file already exists
if [[ ! -e "$sorted_bamfile" ]]; then
# Convert SAM to BAM
/home/shared/samtools-1.12/samtools view -bS -@ 10 "$samfile" > "$bamfile"
# Sort BAM
/home/shared/samtools-1.12/samtools sort -@ 10 "$bamfile" -o "$sorted_bamfile"
# Index sorted BAM
/home/shared/samtools-1.12/samtools index -@ 10 "$sorted_bamfile"
fi
done
```
``` bash
# Count the number of samples for which we have sorted bam files -- I should have 79 (one for each input sample)
find ../output/06.2-cod-RNAseq-alignment-genome/hisat2/ -type f -name "*.sorted.bam" | wc -l
```
79
``` bash
# Delete unneccessary index files
rm ../output/06.2-cod-RNAseq-alignment-genome/hisat2/*.ht2
# Delete unneeded SAM files
rm ../output/06.2-cod-RNAseq-alignment-genome/hisat2/*.sam
# # Sort SAM files, convert to BAM, and index
# ${programs_array[samtools_view]} \
# -@ "${threads}" \
# -Su "${sample_name}".sam \
# | ${programs_array[samtools_sort]} - \
# -@ "${threads}" \
# -o "${sample_name}".sorted.bam
# ${programs_array[samtools_index]} "${sample_name}".sorted.bam
#
#
# # Delete unneccessary index files
# rm "${genome_index_name}"*.ht2
#
# # Delete unneeded SAM files
# rm ./*.sam
```
``` bash
# View the output files
ls -lh ../output/06.2-cod-RNAseq-alignment-genome/hisat2/
```
total 364G
-rw-r--r-- 1 shedurkin labmembers 2.7G Apr 24 12:27 100.bam
-rw-r--r-- 1 shedurkin labmembers 1.6G Apr 24 12:28 100.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.1M Apr 24 12:28 100.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.6G Apr 24 12:29 107.bam
-rw-r--r-- 1 shedurkin labmembers 1.5G Apr 24 12:30 107.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 852K Apr 24 12:30 107.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.3G Apr 24 12:32 108.bam
-rw-r--r-- 1 shedurkin labmembers 2.1G Apr 24 12:35 108.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.3M Apr 24 12:35 108.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.4G Apr 24 12:39 109.bam
-rw-r--r-- 1 shedurkin labmembers 2.0G Apr 24 12:41 109.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.2M Apr 24 12:41 109.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.9G Apr 24 12:48 10.bam
-rw-r--r-- 1 shedurkin labmembers 1.8G Apr 24 12:49 10.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.2M Apr 24 12:49 10.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.2G Apr 24 12:58 110.bam
-rw-r--r-- 1 shedurkin labmembers 1.9G Apr 24 13:00 110.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.1M Apr 24 13:01 110.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.3G Apr 24 13:09 117.bam
-rw-r--r-- 1 shedurkin labmembers 2.0G Apr 24 13:12 117.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.3M Apr 24 13:12 117.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.3G Apr 24 13:21 118.bam
-rw-r--r-- 1 shedurkin labmembers 2.0G Apr 24 13:24 118.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.3M Apr 24 13:24 118.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.4G Apr 24 13:27 119.bam
-rw-r--r-- 1 shedurkin labmembers 2.0G Apr 24 13:30 119.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.1M Apr 24 13:30 119.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.3G Apr 24 13:33 11.bam
-rw-r--r-- 1 shedurkin labmembers 2.0G Apr 24 13:35 11.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.2M Apr 24 13:35 11.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.1G Apr 24 13:38 120.bam
-rw-r--r-- 1 shedurkin labmembers 1.9G Apr 24 13:40 120.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.2M Apr 24 13:40 120.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.2G Apr 24 13:43 121.bam
-rw-r--r-- 1 shedurkin labmembers 2.0G Apr 24 13:45 121.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 816K Apr 24 13:45 121.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.6G Apr 24 13:53 127.bam
-rw-r--r-- 1 shedurkin labmembers 2.2G Apr 24 13:55 127.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.2M Apr 24 13:55 127.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 1.6G Apr 24 13:57 128.bam
-rw-r--r-- 1 shedurkin labmembers 940M Apr 24 13:58 128.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 713K Apr 24 13:58 128.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 1.4G Apr 24 14:01 129.bam
-rw-r--r-- 1 shedurkin labmembers 1.1G Apr 24 14:02 129.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.9M Apr 24 14:02 129.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.1G Apr 24 14:07 12.bam
-rw-r--r-- 1 shedurkin labmembers 1.9G Apr 24 14:09 12.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.2M Apr 24 14:09 12.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 1.4G Apr 24 14:12 131.bam
-rw-r--r-- 1 shedurkin labmembers 883M Apr 24 14:12 131.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 736K Apr 24 14:12 131.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 1.3G Apr 24 14:15 137.bam
-rw-r--r-- 1 shedurkin labmembers 780M Apr 24 14:15 137.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 659K Apr 24 14:15 137.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 1.3G Apr 24 14:16 138.bam
-rw-r--r-- 1 shedurkin labmembers 773M Apr 24 14:17 138.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 656K Apr 24 14:17 138.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 1.5G Apr 24 14:18 139.bam
-rw-r--r-- 1 shedurkin labmembers 906M Apr 24 14:18 139.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 786K Apr 24 14:19 139.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.8G Apr 24 14:20 13.bam
-rw-r--r-- 1 shedurkin labmembers 1.7G Apr 24 14:22 13.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.2M Apr 24 14:22 13.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 1.5G Apr 24 14:23 140.bam
-rw-r--r-- 1 shedurkin labmembers 904M Apr 24 14:23 140.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 749K Apr 24 14:23 140.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 1.3G Apr 24 14:24 147.bam
-rw-r--r-- 1 shedurkin labmembers 812M Apr 24 14:25 147.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 692K Apr 24 14:25 147.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.6G Apr 24 14:27 148.bam
-rw-r--r-- 1 shedurkin labmembers 2.2G Apr 24 14:30 148.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.1M Apr 24 14:30 148.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 15G Apr 24 14:43 149.bam
-rw-r--r-- 1 shedurkin labmembers 8.9G Apr 24 14:54 149.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 7.1M Apr 24 14:55 149.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.8G Apr 24 14:58 150.bam
-rw-r--r-- 1 shedurkin labmembers 2.4G Apr 24 15:00 150.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.3M Apr 24 15:00 150.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.7G Apr 24 15:03 18.bam
-rw-r--r-- 1 shedurkin labmembers 1.6G Apr 24 15:05 18.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.2M Apr 24 15:05 18.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.7G Apr 24 15:16 19.bam
-rw-r--r-- 1 shedurkin labmembers 3.5G Apr 24 15:09 19-G.bam
-rw-r--r-- 1 shedurkin labmembers 2.3G Apr 24 15:12 19-G.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.1M Apr 24 15:12 19-G.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.7G Apr 24 15:21 19-S.bam
-rw-r--r-- 1 shedurkin labmembers 1.6G Apr 24 15:17 19.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1023K Apr 24 15:17 19.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.4G Apr 24 15:24 19-S.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.2M Apr 24 15:25 19-S.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.2G Apr 24 15:27 1.bam
-rw-r--r-- 1 shedurkin labmembers 1.9G Apr 24 15:29 1.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.1M Apr 24 15:29 1.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.6G Apr 24 15:37 20.bam
-rw-r--r-- 1 shedurkin labmembers 3.6G Apr 24 15:32 20-G.bam
-rw-r--r-- 1 shedurkin labmembers 2.4G Apr 24 15:34 20-G.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.1M Apr 24 15:35 20-G.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.5G Apr 24 15:40 20-S.bam
-rw-r--r-- 1 shedurkin labmembers 1.6G Apr 24 15:38 20.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.1M Apr 24 15:38 20.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 1.7G Apr 24 15:41 20-S.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 935K Apr 24 15:41 20-S.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.7G Apr 24 15:43 21.bam
-rw-r--r-- 1 shedurkin labmembers 1.6G Apr 24 15:44 21.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.2M Apr 24 15:44 21.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.8G Apr 24 15:46 28.bam
-rw-r--r-- 1 shedurkin labmembers 1.7G Apr 24 15:47 28.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.1M Apr 24 15:47 28.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.8G Apr 24 15:49 29.bam
-rw-r--r-- 1 shedurkin labmembers 1.8G Apr 24 15:51 29.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.2M Apr 24 15:51 29.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.0G Apr 24 15:53 2.bam
-rw-r--r-- 1 shedurkin labmembers 1.8G Apr 24 15:54 2.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 990K Apr 24 15:54 2.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.0G Apr 24 15:56 30.bam
-rw-r--r-- 1 shedurkin labmembers 1.8G Apr 24 15:58 30.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.1M Apr 24 15:58 30.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.0G Apr 24 16:00 31.bam
-rw-r--r-- 1 shedurkin labmembers 1.7G Apr 24 16:01 31.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 839K Apr 24 16:01 31.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.4G Apr 24 16:03 37.bam
-rw-r--r-- 1 shedurkin labmembers 1.4G Apr 24 16:04 37.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 847K Apr 24 16:04 37.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.7G Apr 24 16:06 38.bam
-rw-r--r-- 1 shedurkin labmembers 1.7G Apr 24 16:08 38.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.3M Apr 24 16:08 38.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.7G Apr 24 16:10 39.bam
-rw-r--r-- 1 shedurkin labmembers 1.6G Apr 24 16:11 39.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 936K Apr 24 16:11 39.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.0G Apr 24 16:13 3.bam
-rw-r--r-- 1 shedurkin labmembers 1.9G Apr 24 16:15 3.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.3M Apr 24 16:15 3.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.4G Apr 24 16:17 40.bam
-rw-r--r-- 1 shedurkin labmembers 1.5G Apr 24 16:18 40.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.1M Apr 24 16:18 40.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.4G Apr 24 16:20 41.bam
-rw-r--r-- 1 shedurkin labmembers 1.6G Apr 24 16:22 41.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.3M Apr 24 16:22 41.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.7G Apr 26 16:47 47.bam
-rw-r--r-- 1 shedurkin labmembers 1.7G Apr 26 16:49 47.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 989K Apr 26 16:49 47.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.9G Apr 26 16:51 48.bam
-rw-r--r-- 1 shedurkin labmembers 1.7G Apr 26 16:53 48.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 927K Apr 26 16:53 48.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.0G Apr 26 16:55 49.bam
-rw-r--r-- 1 shedurkin labmembers 1.8G Apr 26 16:57 49.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.1M Apr 26 16:57 49.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.3G Apr 26 16:59 4.bam
-rw-r--r-- 1 shedurkin labmembers 1.9G Apr 26 17:02 4.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.1M Apr 26 17:02 4.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.9G Apr 26 17:04 50.bam
-rw-r--r-- 1 shedurkin labmembers 1.7G Apr 26 17:05 50.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.1M Apr 26 17:05 50.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.9G Apr 26 17:14 57.bam
-rw-r--r-- 1 shedurkin labmembers 4.3G Apr 26 17:09 57-G.bam
-rw-r--r-- 1 shedurkin labmembers 2.8G Apr 26 17:11 57-G.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.2M Apr 26 17:11 57-G.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 1.8G Apr 26 17:17 57-S.bam
-rw-r--r-- 1 shedurkin labmembers 1.8G Apr 26 17:15 57.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.1M Apr 26 17:15 57.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 1.4G Apr 26 17:18 57-S.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1001K Apr 26 17:18 57-S.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.9G Apr 26 17:27 58.bam
-rw-r--r-- 1 shedurkin labmembers 4.4G Apr 26 17:22 58-G.bam
-rw-r--r-- 1 shedurkin labmembers 2.8G Apr 26 17:24 58-G.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.3M Apr 26 17:25 58-G.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.2G Apr 26 17:30 58-S.bam
-rw-r--r-- 1 shedurkin labmembers 1.8G Apr 26 17:28 58.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.1M Apr 26 17:29 58.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 1.4G Apr 26 17:32 58-S.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 745K Apr 26 17:32 58-S.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.9G Apr 26 17:34 59.bam
-rw-r--r-- 1 shedurkin labmembers 1.7G Apr 26 17:36 59.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 899K Apr 26 17:36 59.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.0G Apr 26 17:38 5.bam
-rw-r--r-- 1 shedurkin labmembers 1.8G Apr 26 17:40 5.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.3M Apr 26 17:40 5.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.1G Apr 26 17:43 60.bam
-rw-r--r-- 1 shedurkin labmembers 1.9G Apr 26 17:45 60.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.2M Apr 26 17:45 60.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.2G Apr 26 17:47 67.bam
-rw-r--r-- 1 shedurkin labmembers 1.9G Apr 26 17:49 67.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.2M Apr 26 17:49 67.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.1G Apr 26 17:52 68.bam
-rw-r--r-- 1 shedurkin labmembers 1.9G Apr 26 17:54 68.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.2M Apr 26 17:54 68.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.6G Apr 26 17:56 69.bam
-rw-r--r-- 1 shedurkin labmembers 1.6G Apr 26 17:58 69.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 944K Apr 26 17:58 69.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.2G Apr 26 17:59 70.bam
-rw-r--r-- 1 shedurkin labmembers 1.4G Apr 26 18:01 70.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 914K Apr 26 18:01 70.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.6G Apr 26 18:03 78.bam
-rw-r--r-- 1 shedurkin labmembers 1.6G Apr 26 18:04 78.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.2M Apr 26 18:05 78.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.8G Apr 26 18:07 79.bam
-rw-r--r-- 1 shedurkin labmembers 1.7G Apr 26 18:08 79.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.1M Apr 26 18:09 79.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.2G Apr 26 18:10 80.bam
-rw-r--r-- 1 shedurkin labmembers 1.4G Apr 26 18:12 80.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.1M Apr 26 18:12 80.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.4G Apr 26 18:14 83.bam
-rw-r--r-- 1 shedurkin labmembers 1.4G Apr 26 18:15 83.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 967K Apr 26 18:15 83.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.8G Apr 26 18:17 88.bam
-rw-r--r-- 1 shedurkin labmembers 1.8G Apr 26 18:19 88.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.2M Apr 26 18:19 88.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.6G Apr 26 18:21 90.bam
-rw-r--r-- 1 shedurkin labmembers 1.6G Apr 26 18:23 90.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.2M Apr 26 18:23 90.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.0G Apr 26 18:25 91.bam
-rw-r--r-- 1 shedurkin labmembers 1.9G Apr 26 18:27 91.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.2M Apr 26 18:27 91.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.5G Apr 26 18:29 97.bam
-rw-r--r-- 1 shedurkin labmembers 1.5G Apr 26 18:31 97.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 943K Apr 26 18:31 97.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 3.3G Apr 26 18:34 98.bam
-rw-r--r-- 1 shedurkin labmembers 2.0G Apr 26 18:36 98.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.5M Apr 26 18:36 98.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.5G Apr 26 18:38 99.bam
-rw-r--r-- 1 shedurkin labmembers 1.6G Apr 26 18:40 99.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.2M Apr 26 18:40 99.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 7.4M Apr 23 14:34 G_macrocephalus_exon.tab
-rw-r--r-- 1 shedurkin labmembers 7.1M Apr 23 14:40 G_macrocephalus_splice_sites.tab
-rw-r--r-- 1 shedurkin labmembers 12K Apr 23 14:54 hisat2-build_stats.txt
-rw-r--r-- 1 shedurkin labmembers 2.4G Apr 26 18:42 RESUB-116.bam
-rw-r--r-- 1 shedurkin labmembers 1.5G Apr 26 18:43 RESUB-116.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1.1M Apr 26 18:43 RESUB-116.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.5G Apr 26 18:45 RESUB-156.bam
-rw-r--r-- 1 shedurkin labmembers 1.5G Apr 26 18:47 RESUB-156.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 974K Apr 26 18:47 RESUB-156.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.3G Apr 26 18:48 RESUB-36.bam
-rw-r--r-- 1 shedurkin labmembers 1.4G Apr 26 18:50 RESUB-36.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 852K Apr 26 18:50 RESUB-36.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.2G Apr 26 18:52 RESUB-76.bam
-rw-r--r-- 1 shedurkin labmembers 1.3G Apr 26 18:53 RESUB-76.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 895K Apr 26 18:53 RESUB-76.sorted.bam.bai
-rw-r--r-- 1 shedurkin labmembers 2.6G Apr 26 18:55 RESUB-94.bam
-rw-r--r-- 1 shedurkin labmembers 1.6G Apr 26 18:57 RESUB-94.sorted.bam
-rw-r--r-- 1 shedurkin labmembers 1014K Apr 26 18:57 RESUB-94.sorted.bam.bai
# 3 Read Summarization
Will be summarizing reads using
[featureCounts](https://www.rdocumentation.org/packages/Rsubread/versions/1.22.2/topics/featureCounts)
in the
[Rsubread](https://bioconductor.org/packages/release/bioc/vignettes/Rsubread/inst/doc/SubreadUsersGuide.pdf)
Bioconductor package
## 3.1 Exons
``` bash
/home/shared/subread-2.0.5-Linux-x86_64/bin/featureCounts \
-p --countReadPairs \
-T 5 \
-t exon \
-g gene_id \
-a ../data/genomic.gtf \
-o ../output/06.2-cod-RNAseq-alignment-genome/featureCounts-exon/featureCounts_exon_matrix.txt \
../output/06.2-cod-RNAseq-alignment-genome/hisat2/*.sorted.bam \
&> ../output/06.2-cod-RNAseq-alignment-genome/featureCounts-exon/featureCounts_exon.log
```
Save a version of the featureCounts count file with no header (aka
remove line 1, which contains the program and command info)
``` bash
sed '1d' ../output/06.2-cod-RNAseq-alignment-genome/featureCounts-exon/featureCounts_exon_matrix.txt > ../output/06.2-cod-RNAseq-alignment-genome/featureCounts-exon/featureCounts_exon_matrix_noheader.txt
```
``` bash
/home/sam/programs/mambaforge/bin/multiqc \
../output/06.2-cod-RNAseq-alignment-genome/featureCounts-exon/featureCounts_exon_matrix.txt.summary \
-o ../output/06.2-cod-RNAseq-alignment-genome/featureCounts-exon
# View directory contents
ls -lh ../output/06.2-cod-RNAseq-alignment-genome/featureCounts-exon
```
I also want to include the treatment/tank info when plotting alignment
rates across samples
``` r
# Load multiqc stats
featureCounts_exon_multiqc <- read.csv("../output/06.2-cod-RNAseq-alignment-genome/featureCounts-exon/multiqc_data/multiqc_featureCounts.txt", sep = '\t')
# Adjust sample name formatting (to prep for join)
featureCounts_exon_multiqc$Sample <- paste("sample_", featureCounts_exon_multiqc$Sample, sep = "")
# Load experimental data
cod_sample_info_OG <- read.csv("../data/DESeq2_Sample_Information.csv")
featureCounts_exon_multiqc_plustreatment <- left_join(cod_sample_info_OG, featureCounts_exon_multiqc, by = c("sample_name" = "Sample")) %>%
na.omit()
featureCounts_exon_multiqc_plustreatment <- featureCounts_exon_multiqc_plustreatment[order(featureCounts_exon_multiqc_plustreatment$sample_number),]
ggplot(featureCounts_exon_multiqc_plustreatment,
aes(x=reorder(sample_name, sample_number), y=percent_assigned, fill=as.factor(temp_treatment))) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust=1, size=7))
```
![](06.2-cod-RNAseq-alignment-genome_files/figure-gfm/plot-alignment-rates-exon-1.png)
``` r
ggplot(featureCounts_exon_multiqc_plustreatment,
aes(x=reorder(sample_name, sample_number), y=Total, fill=as.factor(temp_treatment))) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust=1, size=7))
```
![](06.2-cod-RNAseq-alignment-genome_files/figure-gfm/plot-alignment-rates-exon-2.png)
``` r
# sample149 is kind of throwing off the visualization, so lets remove and redo
ggplot(featureCounts_exon_multiqc_plustreatment[featureCounts_exon_multiqc_plustreatment$sample_name != "sample_149", ],
aes(x=reorder(sample_name, sample_number), y=Total, fill=as.factor(temp_treatment))) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust=1, size=7))
```
![](06.2-cod-RNAseq-alignment-genome_files/figure-gfm/plot-alignment-rates-exon-3.png)
## 3.2 Genes
``` bash
/home/shared/subread-2.0.5-Linux-x86_64/bin/featureCounts \
-p --countReadPairs \
-T 5 \
-t gene \
-g gene_id \
-a ../data/genomic.gtf \
-o ../output/06.2-cod-RNAseq-alignment-genome/featureCounts-gene/featureCounts_gene_matrix.txt \
../output/06.2-cod-RNAseq-alignment-genome/hisat2/*.sorted.bam \
&> ../output/06.2-cod-RNAseq-alignment-genome/featureCounts-gene/featureCounts_gene.log
```
Save a version of the featureCounts count file with no header (aka
remove line 1, which contains the program and command info)
``` bash
sed '1d' ../output/06.2-cod-RNAseq-alignment-genome/featureCounts-gene/featureCounts_gene_matrix.txt > ../output/06.2-cod-RNAseq-alignment-genome/featureCounts-gene/featureCounts_gene_matrix_noheader.txt
```
``` bash
/home/sam/programs/mambaforge/bin/multiqc \
../output/06.2-cod-RNAseq-alignment-genome/featureCounts-gene/featureCounts_gene_matrix.txt.summary \
-o ../output/06.2-cod-RNAseq-alignment-genome/featureCounts-gene
# View directory contents
ls -lh ../output/06.2-cod-RNAseq-alignment-genome/featureCounts-gene
```
I also want to include the treatment/tank info when plotting alignment
rates across samples
``` r
# Load multiqc stats
featureCounts_gene_multiqc <- read.csv("../output/06.2-cod-RNAseq-alignment-genome/featureCounts-gene/multiqc_data/multiqc_featureCounts.txt", sep = '\t')
# Adjust sample name formatting (to prep for join)
featureCounts_gene_multiqc$Sample <- paste("sample_", featureCounts_gene_multiqc$Sample, sep = "")
featureCounts_gene_multiqc_plustreatment <- left_join(cod_sample_info_OG, featureCounts_gene_multiqc, by = c("sample_name" = "Sample")) %>%
na.omit()
featureCounts_gene_multiqc_plustreatment <- featureCounts_gene_multiqc_plustreatment[order(featureCounts_gene_multiqc_plustreatment$sample_number),]
ggplot(featureCounts_gene_multiqc_plustreatment,
aes(x=reorder(sample_name, sample_number), y=percent_assigned, fill=as.factor(temp_treatment))) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust=1, size=7))
```
![](06.2-cod-RNAseq-alignment-genome_files/figure-gfm/plot-alignment-rates-gene-1.png)
``` r
ggplot(featureCounts_gene_multiqc_plustreatment,
aes(x=reorder(sample_name, sample_number), y=Total, fill=as.factor(temp_treatment))) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust=1, size=7))
```
![](06.2-cod-RNAseq-alignment-genome_files/figure-gfm/plot-alignment-rates-gene-2.png)
``` r
# sample149 is kind of throwing off the visualization, so lets remove and redo
ggplot(featureCounts_gene_multiqc_plustreatment[featureCounts_gene_multiqc_plustreatment$sample_name != "sample_149", ],
aes(x=reorder(sample_name, sample_number), y=Total, fill=as.factor(temp_treatment))) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust=1, size=7))
```
![](06.2-cod-RNAseq-alignment-genome_files/figure-gfm/plot-alignment-rates-gene-3.png)