--- author: Sam White toc-title: Contents toc-depth: 5 toc-location: left layout: post title: Gene Expression - Hematodinium MEGAN6 with Trinity and EdgeR date: '2020-02-07 13:17' tags: - Trinity - EdgeR - Hematodinium - gene expression - GOseq - gene ontology - GO - enrichment categories: - 2020 - Miscellaneous --- After completing [annotation of the _Hematodinium_ MEGAN6 taxonomic-specific Trinity assembly using Trinotate on 20200126](https://robertslab.github.io/sams-notebook/posts/2020/2020-01-26-Transcriptome-Annotation---Trinotate-Hematodinium-MEGAN6-Taxonomic-specific-Trinity-Assembly-on-Mox/), I performed differential gene expression analysis and gene ontology (GO) term enrichment analysis using Trinity's scripts to run EdgeR and GOseq, respectively. The comparison listed below is the only comparison possible, as there were no reads present in the uninfected _Hematodinium_ extractions. It should be noted that this comparison does not have any replicate samples. I made a weak attempt to coerce some results from these by setting a `dispersion` value in the edgeR command. However, I'm not expecting much, nor am I certain I would really trust the results from those particular comparisons. SBATCH script (GitHub): - [D12-vs-D26](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20200207_hemat_DEG_D12-vs-D26.sh) ```shell #!/bin/bash ## Job Name #SBATCH --job-name=DEG_hemat_D12-vs-D26 ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=01-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200207_hemat_DEG/D12-vs-D26 # Exit script if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Document programs in PATH (primarily for program version ID) { date echo "" echo "System PATH for $SLURM_JOB_ID" echo "" printf "%0.s-" {1..10} echo "${PATH}" | tr : \\n } >> system_path.log wd="$(pwd)" timestamp=$(date +%Y%m%d) species="hemat" threads=28 comparison="${wd##*/}" fasta_prefix="20200122.hemat.megan.Trinity" ## Set input file locations trimmed_reads_dir="/gscratch/srlab/sam/data/Hematodinium/RNAseq" salmon_out_dir="${wd}" transcriptome_dir="/gscratch/srlab/sam/data/Hematodinium/transcriptomes" transcriptome="${transcriptome_dir}/${fasta_prefix}.fasta" fasta_seq_lengths="${transcriptome_dir}/${fasta_prefix}.fasta.seq_lens" samples="${wd}/${comparison}.samples.txt" trinotate_feature_map="/gscratch/scrubbed/samwhite/outputs/20200126_hemat_trinotate_megan/20200126.hemat.trinotate.annotation_feature_map.txt" gene_map="${transcriptome_dir}/${fasta_prefix}.fasta.gene_trans_map" salmon_gene_matrix="${salmon_out_dir}/salmon.gene.TMM.EXPR.matrix" salmon_iso_matrix="${salmon_out_dir}/salmon.isoform.TMM.EXPR.matrix" go_annotations="${transcriptome_dir}/20200126.hemat.trinotate.go_annotations.txt" # Standard output/error files diff_expr_stdout="diff_expr_stdout.txt" diff_expr_stderr="diff_expr_stderr.txt" matrix_stdout="matrix_stdout.txt" matrix_stderr="matrix_stderr.txt" salmon_stdout="salmon_stdout.txt" salmon_stderr="salmon_stderr.txt" tpm_length_stdout="tpm_length_stdout.txt" tpm_length_stderr="tpm_length_stderr.txt" trinity_DE_stdout="trinity_DE_stdout.txt" trinity_DE_stderr="trinity_DE_stderr.txt" edgeR_dir="" #programs trinity_home=/gscratch/srlab/programs/trinityrnaseq-v2.9.0 trinity_annotate_matrix="${trinity_home}/Analysis/DifferentialExpression/rename_matrix_feature_identifiers.pl" trinity_abundance=${trinity_home}/util/align_and_estimate_abundance.pl trinity_matrix=${trinity_home}/util/abundance_estimates_to_matrix.pl trinity_DE=${trinity_home}/Analysis/DifferentialExpression/run_DE_analysis.pl diff_expr=${trinity_home}/Analysis/DifferentialExpression/analyze_diff_expr.pl trinity_tpm_length=${trinity_home}/util/misc/TPM_weighted_gene_length.py # Create directory/sample list for ${trinity_matrix} command trin_matrix_list=$(awk '{printf "%s%s", $2, "/quant.sf " }' "${samples}") cd ${trimmed_reads_dir} time ${trinity_abundance} \ --output_dir "${salmon_out_dir}" \ --transcripts ${transcriptome} \ --seqType fq \ --samples_file "${samples}" \ --SS_lib_type RF \ --est_method salmon \ --aln_method bowtie2 \ --gene_trans_map "${gene_map}" \ --prep_reference \ --thread_count "${threads}" \ 1> "${salmon_out_dir}"/${salmon_stdout} \ 2> "${salmon_out_dir}"/${salmon_stderr} # Move output folders mv ${trimmed_reads_dir}/D* \ "${salmon_out_dir}" cd "${salmon_out_dir}" # Convert abundance estimates to matrix ${trinity_matrix} \ --est_method salmon \ --gene_trans_map ${gene_map} \ --out_prefix salmon \ --name_sample_by_basedir \ ${trin_matrix_list} 1> ${matrix_stdout} \ 2> ${matrix_stderr} # Integrate functional Trinotate functional annotations "${trinity_annotate_matrix}" \ "${trinotate_feature_map}" \ salmon.gene.counts.matrix \ > salmon.gene.counts.annotated.matrix # Generate weighted gene lengths "${trinity_tpm_length}" \ --gene_trans_map "${gene_map}" \ --trans_lengths "${fasta_seq_lengths}" \ --TPM_matrix "${salmon_iso_matrix}" \ > Trinity.gene_lengths.txt \ 2> ${tpm_length_stderr} # Differential expression analysis cd ${transcriptome_dir} ${trinity_DE} \ --matrix "${salmon_out_dir}"/salmon.gene.counts.matrix \ --method edgeR \ --samples_file "${samples}" \ 1> ${trinity_DE_stdout} \ 2> ${trinity_DE_stderr} mv edgeR* "${salmon_out_dir}" # Run differential expression on edgeR output matrix # Set fold difference to 2-fold (ie. -C 1 = 2^1) # P value <= 0.05 # Has to run from edgeR output directory # Pulls edgeR directory name and removes leading ./ in find output cd "${salmon_out_dir}" edgeR_dir=$(find . -type d -name "edgeR*" | sed 's%./%%') cd "${edgeR_dir}" mv "${transcriptome_dir}/${trinity_DE_stdout}" . mv "${transcriptome_dir}/${trinity_DE_stderr}" . ${diff_expr} \ --matrix "${salmon_gene_matrix}" \ --samples "${samples}" \ --examine_GO_enrichment \ --GO_annots "${go_annotations}" \ --include_GOplot \ --gene_lengths "${salmon_out_dir}"/Trinity.gene_lengths.txt \ -C 1 \ -P 0.05 \ 1> ${diff_expr_stdout} \ 2> ${diff_expr_stderr} ``` --- # RESULTS Output folder: - [20200207_hemat_DEG/](https://gannet.fish.washington.edu/Atumefaciens/20200207_hemat_DEG/) D12-vs-D26 Took ~17mins to run: ![D12 vs D26 runtime](https://raw.githubusercontent.com/RobertsLab/sams-notebook/master/images/screencaps/20200207_hemat_DEG_D12-vs-D26_runtime.png) - [D12-vs-D26/](https://gannet.fish.washington.edu/Atumefaciens/20200207_hemat_DEG/D12-vs-D26) No differentially expressed genes between these two groups. NOTE: Since no DEGs, that's why this run shows as `FAILED` in the above runtime screencap. This log file captures the error message that kills the job and generates the `FAILED` indicator: - [20200207_hemat_DEG/D12-vs-D26/edgeR.28118.dir/diff_expr_stderr.txt](https://gannet.fish.washington.edu/Atumefaciens/20200207_hemat_DEG/D12-vs-D26/edgeR.28118.dir/diff_expr_stderr.txt) `Error, no differentially expressed transcripts identified at cuttoffs: P:0.05, C:1 at /gscratch/srlab/programs/trinityrnaseq-v2.9.0/Analysis/DifferentialExpression/analyze_diff_expr.pl line 203.`