This notebook will align trimmed A.pulchra RNA-seq data to the A.pulchra genome using HISAT2 (Kim et al. 2019). Follwed by StringTie (Pertea et al. 2015, 2016) for transcript assembly/identification and count matrices for downstream expression analysis with DESeq2 and/or [Ballgown](https://github.com/alyssafrazee/ballgown.
Since the BAM files produced by this notebook are too large for GitHub, they can be accessed on our server here:
Input(s)
*fastp-trim.fq.gzApulcrha-genomeApulchra-genome.gtfM-multi-species/data/rna_metadata.csvOutputs:
Primary:
checksums.md5: MD5 checksum for all files in this directory. Excludes subdirectories.
apul-gene_count_matrix.csv: Gene count matrix for use in DESeq2.
apul-transcript_count_matrix.csv: Transcript count matrix for use in DESeq2.
prepDE-sample_list.txt: Sample file list provided as input to StringTie for DESeq2 count matrix generation. Also serves as documentation of which files were used for this step.
Apulchra-genome.stringtie.gtf: Canonical StringTie GTF file compiled from all individual sample GTFs.
sorted-bams-merged.bam: Merged (and sorted) BAM consisting of all individual sample BAMs.
sorted-bams-merged.bam.bai: BAM index file. Useful for visualizing assemblies in IGV.
sorted_bams.list: List file needed for merging of BAMS with samtools. Also serves as documentation of which files were used for this step.
multiqc_report.html: MultiQC report aggregating all individual HISAT2 alignment stats and samtools flagstats.
gtf_list.txt: List file needed for merging of GTF files with StringTie. Also serves as documentation of which files were used for this step.
Individuals:
Each subdirectory is labelled based on sample name and each contains individual HISAT2 alignment and StringTie output files.
<sample_name>_checksums.md5: MD5 checksums for all files in the directory.
*.ctab: Data tables formatted for import into Ballgown.
<sample_name>.cov_refs.gtf: StringTie genome reference sequnce coverage GTF.
<sample_name>.gtf: StringTie GTF.
<sample_name>.sorted.bam: HISAT2 assembly BAM.
<sample_name>.sorted.bam.bai: BAM index file. Useful for visualizing assemblies in IGV.
<sample_name>-hisat2_output.flagstat: samtools flagstat output file.
<sample_name>_hisat2.stats: HISAT2 assembly stats.
input_fastqs_checksums.md5: MD5 checksums of files used as input for assembly. Primarily serves as documentation to track/verify which files were actually used.
This allows usage of Bash variables across R Markdown chunks.
{
echo "#### Assign Variables ####"
echo ""
echo "# Data directories"
echo 'export timeseries_dir=/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular'
echo 'export genome_dir="${timeseries_dir}/D-Apul/data"'
echo 'export genome_index_dir="${timeseries_dir}/D-Apul/output/02.10-D-Apul-RNAseq-genome-index-HiSat2"'
echo 'export output_dir_top="${timeseries_dir}/D-Apul/output/02.20-D-Apul-RNAseq-alignment-HiSat2"'
echo 'export trimmed_fastqs_dir="${timeseries_dir}/D-Apul/output/01.00-D-Apul-RNAseq-trimming-fastp-FastQC-MultiQC/trimmed-fastqs"'
echo 'export trimmed_reads_url="https://gannet.fish.washington.edu/Atumefaciens/gitrepos/urol-e5/timeseries_molecular/D-Apul/output/01.00-D-Apul-RNAseq-trimming-fastp-FastQC-MultiQC/trimmed-fastqs/"'
echo ""
echo "# Location of Hisat2 index files"
echo "# Must keep variable name formatting, as it's used by HiSat2"
echo 'export HISAT2_INDEXES="${genome_index_dir}"'
echo "# Input files"
echo 'export exons="${output_dir_top}/Apulchra-genome_hisat2_exons.tab"'
echo 'export genome_index_name="Apulchra-genome"'
echo 'export genome_gff="${genome_dir}/Apulchra-genome.gff"'
echo 'export genome_fasta="${genome_dir}/Apulchra-genome.fa"'
echo 'export splice_sites="${output_dir_top}/Apulchra-genome_hisat2_splice_sites.tab"'
echo 'export transcripts_gtf="${genome_dir}/Apulchra-genome.gtf"'
echo "# Output files"
echo 'export gtf_list="${output_dir_top}/gtf_list.txt"'
echo 'export merged_bam="${output_dir_top}/sorted-bams-merged.bam"'
echo ""
echo "# Paths to programs"
echo 'export programs_dir="/home/shared"'
echo 'export hisat2_dir="${programs_dir}/hisat2-2.2.1"'
echo 'export hisat2="${hisat2_dir}/hisat2"'
echo 'export multiqc=/home/sam/programs/mambaforge/bin/multiqc'
echo 'export samtools="${programs_dir}/samtools-1.12/samtools"'
echo 'export prepDE="${programs_dir}/stringtie-2.2.1.Linux_x86_64/prepDE.py3"'
echo 'export stringtie="${programs_dir}/stringtie-2.2.1.Linux_x86_64/stringtie"'
echo ""
echo "# Set FastQ filename patterns"
echo "export R1_fastq_pattern='*_R1_*.fq.gz'"
echo "export R2_fastq_pattern='*_R2_*.fq.gz'"
echo "export trimmed_fastq_pattern='*fastp-trim.fq.gz'"
echo ""
echo "# Set number of CPUs to use"
echo 'export threads=40'
echo ""
echo "# Set average read length - for StringTie prepDE.py"
echo 'export read_length=125'
echo ""
echo "## Initialize arrays"
echo 'export fastq_array_R1=()'
echo 'export fastq_array_R2=()'
echo 'export R1_names_array=()'
echo 'export R2_names_array=()'
echo "declare -A sample_timepoint_map"
echo ""
echo "# Programs associative array"
echo "declare -A programs_array"
echo "programs_array=("
echo '[hisat2]="${hisat2}" \'
echo '[multiqc]="${multiqc}" \'
echo '[prepDE]="${prepDE}" \'
echo '[samtools]="${samtools}" \'
echo '[stringtie]="${stringtie}" \'
echo ")"
echo ""
echo "# Print formatting"
echo 'export line="--------------------------------------------------------"'
echo ""
} > .bashvars
cat .bashvars
#### Assign Variables ####
# Data directories
export timeseries_dir=/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular
export genome_dir="${timeseries_dir}/D-Apul/data"
export genome_index_dir="${timeseries_dir}/D-Apul/output/02.10-D-Apul-RNAseq-genome-index-HiSat2"
export output_dir_top="${timeseries_dir}/D-Apul/output/02.20-D-Apul-RNAseq-alignment-HiSat2"
export trimmed_fastqs_dir="${timeseries_dir}/D-Apul/output/01.00-D-Apul-RNAseq-trimming-fastp-FastQC-MultiQC/trimmed-fastqs"
export trimmed_reads_url="https://gannet.fish.washington.edu/Atumefaciens/gitrepos/urol-e5/timeseries_molecular/D-Apul/output/01.00-D-Apul-RNAseq-trimming-fastp-FastQC-MultiQC/trimmed-fastqs/"
# Location of Hisat2 index files
# Must keep variable name formatting, as it's used by HiSat2
export HISAT2_INDEXES="${genome_index_dir}"
# Input files
export exons="${output_dir_top}/Apulchra-genome_hisat2_exons.tab"
export genome_index_name="Apulchra-genome"
export genome_gff="${genome_dir}/Apulchra-genome.gff"
export genome_fasta="${genome_dir}/Apulchra-genome.fa"
export splice_sites="${output_dir_top}/Apulchra-genome_hisat2_splice_sites.tab"
export transcripts_gtf="${genome_dir}/Apulchra-genome.gtf"
# Output files
export gtf_list="${output_dir_top}/gtf_list.txt"
export merged_bam="${output_dir_top}/sorted-bams-merged.bam"
# Paths to programs
export programs_dir="/home/shared"
export hisat2_dir="${programs_dir}/hisat2-2.2.1"
export hisat2="${hisat2_dir}/hisat2"
export multiqc=/home/sam/programs/mambaforge/bin/multiqc
export samtools="${programs_dir}/samtools-1.12/samtools"
export prepDE="${programs_dir}/stringtie-2.2.1.Linux_x86_64/prepDE.py3"
export stringtie="${programs_dir}/stringtie-2.2.1.Linux_x86_64/stringtie"
# Set FastQ filename patterns
export R1_fastq_pattern='*_R1_*.fq.gz'
export R2_fastq_pattern='*_R2_*.fq.gz'
export trimmed_fastq_pattern='*fastp-trim.fq.gz'
# Set number of CPUs to use
export threads=40
# Set average read length - for StringTie prepDE.py
export read_length=125
## Initialize arrays
export fastq_array_R1=()
export fastq_array_R2=()
export R1_names_array=()
export R2_names_array=()
declare -A sample_timepoint_map
# Programs associative array
declare -A programs_array
programs_array=(
[hisat2]="${hisat2}" \
[multiqc]="${multiqc}" \
[prepDE]="${prepDE}" \
[samtools]="${samtools}" \
[stringtie]="${stringtie}" \
)
# Print formatting
export line="--------------------------------------------------------"
If needed, download raw RNA-seq.
Change eval=FALSE to eval=TRUE to execute the next two chunks to download RNA-seq and then verify MD5 checksums.
# Load bash variables into memory
source .bashvars
# Make output directory if it doesn't exist
mkdir --parents ${trimmed_fastqs_dir}
# Run wget to retrieve FastQs and MD5 files
wget \
--directory-prefix ${trimmed_fastqs_dir} \
--recursive \
--no-check-certificate \
--continue \
--cut-dirs 3 \
--no-host-directories \
--no-parent \
--quiet \
--accept="*fastp-trim*, *.md5"
${trimmed_reads_url}
ls -lh "${trimmed_fastqs_dir}"
Verify raw read checksums
# Load bash variables into memory
source .bashvars
cd "${trimmed_fastqs_dir}"
# Verify checksums
for file in *.md5
do
  md5sum --check "${file}"
done
This requires usage of the rna_metadata.csv
This step has a lengthy, semi-complex workflow:
rna_metadata.csv for A.pulchra sample names and time point. This info will be used for downstream file naming and to assing the time point to the read group (SM:) in the alignment file.# Load bash variables into memory
source .bashvars
# Make output directories, if they don't exist
mkdir --parents "${output_dir_top}"
# Change to ouput directory
cd "${output_dir_top}"
# Create associative array with sample and timepoint
metadata="../../../M-multi-species/data/rna_metadata.csv"
# Declare the array
declare -A sample_timepoint_map
# Read the metadata file line by line
while IFS=',' read -r sample_number sample_name plate well_number azenta_sample_name colony_id timepoint sample_type species_strain SampleBuffer; do
    # Check if the species is "Acropora pulchra"
    if [[ "${species_strain}" == "Acropora pulchra" ]]; then
        # Add the Azenta sample name as the key and Timepoint as the value in the associative array
        sample_timepoint_map["${azenta_sample_name}"]="${timepoint}"
    fi
done < <(tail -n +2 "${metadata}") # Skip the header
## Populate trimmed reads arrays
fastq_array_R1=("${trimmed_fastqs_dir}"/${R1_fastq_pattern})
fastq_array_R2=("${trimmed_fastqs_dir}"/${R2_fastq_pattern})
############## BEGIN HISAT2 ALIGNMENTS ##############
for sample in "${!sample_timepoint_map[@]}"
do
  # Create and switch to dedicated sample directory
  mkdir --parents "${sample}" && cd "$_"
  # Create HISAT2 list of fastq R1 files
  # and generated MD5 checksums file.
  for fastq in "${fastq_array_R1[@]}"
  do
    # Parse sample name from FastQ filename
    fastq_sample=$(echo "${fastq##*/}" | awk -F"[_-]" '{print $3}')
    
    # Process matching FastQ file, based on sample name
    if [ "${fastq_sample}" == "${sample}" ]; then
      
      # Generate checksum/list of input files used
      md5sum "${fastq}" >> input_fastqs_checksums.md5
      
      # Create comma-separated lists of FastQs for HISAT2
      printf -v joined_R1 '%s,' "${fastq}"
      fastq_list_R1=$(echo "${joined_R1%,}")
    fi
  done
  # Create HISAT2 list of fastq R1 files
  # and generated MD5 checksums file.
  for fastq in "${fastq_array_R2[@]}"
  do
    # Parse sample name from FastQ filename
    fastq_sample=$(echo "${fastq##*/}" | awk -F"[_-]" '{print $3}')
    
    # Process matching FastQ file, based on sample name
    if [ "${fastq_sample}" == "${sample}" ]; then
      
      # Generate checksum/list of input files used
      md5sum "${fastq}" >> input_fastqs_checksums.md5
      # Create comma-separated lists of FastQs for HISAT2
      printf -v joined_R2 '%s,' "${fastq}"
      fastq_list_R2=$(echo "${joined_R2%,}")
    fi
  done
  # HISAT2 alignments
  # Sets read group info (RG) using samples array
  "${programs_array[hisat2]}" \
  -x "${genome_index_name}" \
  -1 "${fastq_list_R1}" \
  -2 "${fastq_list_R2}" \
  --threads "${threads}" \
  --rg-id "${sample}" \
  --rg "SM:""${sample_timepoint_map[$sample]}" \
  2> "${sample}"_hisat2.stats \
  | tee >(${programs_array[samtools]} flagstat - > "${sample}"-hisat2_output.flagstat) \
  | ${programs_array[samtools]} sort - -@ "${threads}" -O BAM \
  | tee "${sample}".sorted.bam \
  | ${programs_array[samtools]} index - "${sample}".sorted.bam.bai
  
  
  # Run stringtie on alignments
  # Uses "-B" option to output tables intended for use in Ballgown
  # Uses "-e" option; recommended when using "-B" option.
  # Limits analysis to only reads alignments matching reference.
  "${programs_array[stringtie]}" "${sample}".sorted.bam \
  -p "${threads}" \
  -o "${sample}".gtf \
  -G "${genome_gff}" \
  -C "${sample}.cov_refs.gtf" \
  -B \
  -e 
  
  
  # Add GTFs to list file, only if non-empty
  # Identifies GTF files that only have header
  gtf_lines=$(wc -l < "${sample}".gtf )
  if [ "${gtf_lines}" -gt 2 ]; then
    echo "$(pwd)/${sample}.gtf" >> "${gtf_list}"
  fi 
  # Generate checksums
  find ./ -type f -not -name "*.md5" -exec md5sum {} \; > ${sample}_checksums.md5
  # Move up to orig. working directory
  cd ..
done
View the resulting directory structure of resulting from the HISAT2/StringTie process.
# Load bash variables into memory
source .bashvars
# Change to ouput directory
cd "${output_dir_top}"
# Display HISAT2 output directory structure
# with directory (--du) and file sizes (-h)
tree --du -h
.
├── [1.8G]  1A1
│   ├── [ 553]  1A1_checksums.md5
│   ├── [4.8M]  1A1.cov_refs.gtf
│   ├── [ 34M]  1A1.gtf
│   ├── [ 451]  1A1-hisat2_output.flagstat
│   ├── [ 637]  1A1_hisat2.stats
│   ├── [1.7G]  1A1.sorted.bam
│   ├── [936K]  1A1.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   ├── [ 420]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.8G]  1A10
│   ├── [ 559]  1A10_checksums.md5
│   ├── [3.5M]  1A10.cov_refs.gtf
│   ├── [ 34M]  1A10.gtf
│   ├── [ 449]  1A10-hisat2_output.flagstat
│   ├── [ 638]  1A10_hisat2.stats
│   ├── [1.8G]  1A10.sorted.bam
│   ├── [779K]  1A10.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   ├── [ 424]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.6G]  1A12
│   ├── [ 559]  1A12_checksums.md5
│   ├── [8.3M]  1A12.cov_refs.gtf
│   ├── [ 34M]  1A12.gtf
│   ├── [ 449]  1A12-hisat2_output.flagstat
│   ├── [ 636]  1A12_hisat2.stats
│   ├── [1.5G]  1A12.sorted.bam
│   ├── [777K]  1A12.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 424]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.8G]  1A2
│   ├── [ 553]  1A2_checksums.md5
│   ├── [3.1M]  1A2.cov_refs.gtf
│   ├── [ 34M]  1A2.gtf
│   ├── [ 449]  1A2-hisat2_output.flagstat
│   ├── [ 640]  1A2_hisat2.stats
│   ├── [1.7G]  1A2.sorted.bam
│   ├── [864K]  1A2.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   ├── [ 420]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.5G]  1A8
│   ├── [ 553]  1A8_checksums.md5
│   ├── [6.9M]  1A8.cov_refs.gtf
│   ├── [ 34M]  1A8.gtf
│   ├── [ 449]  1A8-hisat2_output.flagstat
│   ├── [ 637]  1A8_hisat2.stats
│   ├── [1.5G]  1A8.sorted.bam
│   ├── [789K]  1A8.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 422]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.9G]  1A9
│   ├── [ 553]  1A9_checksums.md5
│   ├── [7.2M]  1A9.cov_refs.gtf
│   ├── [ 34M]  1A9.gtf
│   ├── [ 450]  1A9-hisat2_output.flagstat
│   ├── [ 638]  1A9_hisat2.stats
│   ├── [1.9G]  1A9.sorted.bam
│   ├── [795K]  1A9.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 422]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.8G]  1B1
│   ├── [ 553]  1B1_checksums.md5
│   ├── [6.9M]  1B1.cov_refs.gtf
│   ├── [ 34M]  1B1.gtf
│   ├── [ 449]  1B1-hisat2_output.flagstat
│   ├── [ 638]  1B1_hisat2.stats
│   ├── [1.8G]  1B1.sorted.bam
│   ├── [824K]  1B1.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 420]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.8G]  1B10
│   ├── [ 559]  1B10_checksums.md5
│   ├── [7.0M]  1B10.cov_refs.gtf
│   ├── [ 34M]  1B10.gtf
│   ├── [ 449]  1B10-hisat2_output.flagstat
│   ├── [ 637]  1B10_hisat2.stats
│   ├── [1.7G]  1B10.sorted.bam
│   ├── [819K]  1B10.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 424]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.8G]  1B2
│   ├── [ 553]  1B2_checksums.md5
│   ├── [5.0M]  1B2.cov_refs.gtf
│   ├── [ 34M]  1B2.gtf
│   ├── [ 449]  1B2-hisat2_output.flagstat
│   ├── [ 638]  1B2_hisat2.stats
│   ├── [1.7G]  1B2.sorted.bam
│   ├── [814K]  1B2.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 422]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [2.0G]  1B5
│   ├── [ 553]  1B5_checksums.md5
│   ├── [7.8M]  1B5.cov_refs.gtf
│   ├── [ 34M]  1B5.gtf
│   ├── [ 450]  1B5-hisat2_output.flagstat
│   ├── [ 637]  1B5_hisat2.stats
│   ├── [1.9G]  1B5.sorted.bam
│   ├── [919K]  1B5.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 422]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [2.0G]  1B9
│   ├── [ 553]  1B9_checksums.md5
│   ├── [3.8M]  1B9.cov_refs.gtf
│   ├── [ 34M]  1B9.gtf
│   ├── [ 449]  1B9-hisat2_output.flagstat
│   ├── [ 640]  1B9_hisat2.stats
│   ├── [1.9G]  1B9.sorted.bam
│   ├── [832K]  1B9.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   ├── [ 422]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.5G]  1C10
│   ├── [ 559]  1C10_checksums.md5
│   ├── [4.0M]  1C10.cov_refs.gtf
│   ├── [ 34M]  1C10.gtf
│   ├── [ 449]  1C10-hisat2_output.flagstat
│   ├── [ 636]  1C10_hisat2.stats
│   ├── [1.4G]  1C10.sorted.bam
│   ├── [714K]  1C10.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   ├── [ 424]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.8G]  1C4
│   ├── [ 553]  1C4_checksums.md5
│   ├── [6.8M]  1C4.cov_refs.gtf
│   ├── [ 34M]  1C4.gtf
│   ├── [ 450]  1C4-hisat2_output.flagstat
│   ├── [ 638]  1C4_hisat2.stats
│   ├── [1.8G]  1C4.sorted.bam
│   ├── [837K]  1C4.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 422]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.9G]  1D10
│   ├── [ 559]  1D10_checksums.md5
│   ├── [6.3M]  1D10.cov_refs.gtf
│   ├── [ 34M]  1D10.gtf
│   ├── [ 450]  1D10-hisat2_output.flagstat
│   ├── [ 637]  1D10_hisat2.stats
│   ├── [1.8G]  1D10.sorted.bam
│   ├── [848K]  1D10.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 424]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.4G]  1D3
│   ├── [ 553]  1D3_checksums.md5
│   ├── [4.1M]  1D3.cov_refs.gtf
│   ├── [ 34M]  1D3.gtf
│   ├── [ 449]  1D3-hisat2_output.flagstat
│   ├── [ 636]  1D3_hisat2.stats
│   ├── [1.3G]  1D3.sorted.bam
│   ├── [713K]  1D3.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   ├── [ 422]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [2.0G]  1D4
│   ├── [ 553]  1D4_checksums.md5
│   ├── [4.5M]  1D4.cov_refs.gtf
│   ├── [ 34M]  1D4.gtf
│   ├── [ 449]  1D4-hisat2_output.flagstat
│   ├── [ 638]  1D4_hisat2.stats
│   ├── [1.9G]  1D4.sorted.bam
│   ├── [884K]  1D4.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   ├── [ 422]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.6G]  1D6
│   ├── [ 553]  1D6_checksums.md5
│   ├── [6.8M]  1D6.cov_refs.gtf
│   ├── [ 34M]  1D6.gtf
│   ├── [ 449]  1D6-hisat2_output.flagstat
│   ├── [ 637]  1D6_hisat2.stats
│   ├── [1.6G]  1D6.sorted.bam
│   ├── [778K]  1D6.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 422]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.5G]  1D8
│   ├── [ 553]  1D8_checksums.md5
│   ├── [5.6M]  1D8.cov_refs.gtf
│   ├── [ 34M]  1D8.gtf
│   ├── [ 449]  1D8-hisat2_output.flagstat
│   ├── [ 637]  1D8_hisat2.stats
│   ├── [1.4G]  1D8.sorted.bam
│   ├── [688K]  1D8.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 422]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.9G]  1D9
│   ├── [ 553]  1D9_checksums.md5
│   ├── [2.4M]  1D9.cov_refs.gtf
│   ├── [ 34M]  1D9.gtf
│   ├── [ 449]  1D9-hisat2_output.flagstat
│   ├── [ 640]  1D9_hisat2.stats
│   ├── [1.8G]  1D9.sorted.bam
│   ├── [799K]  1D9.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   ├── [ 422]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [2.0G]  1E1
│   ├── [ 553]  1E1_checksums.md5
│   ├── [5.6M]  1E1.cov_refs.gtf
│   ├── [ 34M]  1E1.gtf
│   ├── [ 450]  1E1-hisat2_output.flagstat
│   ├── [ 638]  1E1_hisat2.stats
│   ├── [1.9G]  1E1.sorted.bam
│   ├── [919K]  1E1.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 420]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.6G]  1E3
│   ├── [ 553]  1E3_checksums.md5
│   ├── [7.4M]  1E3.cov_refs.gtf
│   ├── [ 34M]  1E3.gtf
│   ├── [ 449]  1E3-hisat2_output.flagstat
│   ├── [ 637]  1E3_hisat2.stats
│   ├── [1.6G]  1E3.sorted.bam
│   ├── [783K]  1E3.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 422]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.6G]  1E5
│   ├── [ 553]  1E5_checksums.md5
│   ├── [6.2M]  1E5.cov_refs.gtf
│   ├── [ 34M]  1E5.gtf
│   ├── [ 449]  1E5-hisat2_output.flagstat
│   ├── [ 637]  1E5_hisat2.stats
│   ├── [1.5G]  1E5.sorted.bam
│   ├── [822K]  1E5.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 422]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.8G]  1E9
│   ├── [ 553]  1E9_checksums.md5
│   ├── [4.4M]  1E9.cov_refs.gtf
│   ├── [ 34M]  1E9.gtf
│   ├── [ 449]  1E9-hisat2_output.flagstat
│   ├── [ 638]  1E9_hisat2.stats
│   ├── [1.8G]  1E9.sorted.bam
│   ├── [814K]  1E9.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   ├── [ 422]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.6G]  1F11
│   ├── [ 559]  1F11_checksums.md5
│   ├── [7.9M]  1F11.cov_refs.gtf
│   ├── [ 34M]  1F11.gtf
│   ├── [ 449]  1F11-hisat2_output.flagstat
│   ├── [ 636]  1F11_hisat2.stats
│   ├── [1.5G]  1F11.sorted.bam
│   ├── [745K]  1F11.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 424]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.6G]  1F4
│   ├── [ 553]  1F4_checksums.md5
│   ├── [7.9M]  1F4.cov_refs.gtf
│   ├── [ 34M]  1F4.gtf
│   ├── [ 449]  1F4-hisat2_output.flagstat
│   ├── [ 637]  1F4_hisat2.stats
│   ├── [1.5G]  1F4.sorted.bam
│   ├── [786K]  1F4.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 422]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.8G]  1F8
│   ├── [ 553]  1F8_checksums.md5
│   ├── [2.9M]  1F8.cov_refs.gtf
│   ├── [ 34M]  1F8.gtf
│   ├── [ 449]  1F8-hisat2_output.flagstat
│   ├── [ 640]  1F8_hisat2.stats
│   ├── [1.8G]  1F8.sorted.bam
│   ├── [784K]  1F8.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   ├── [ 422]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [2.1G]  1G5
│   ├── [ 553]  1G5_checksums.md5
│   ├── [8.5M]  1G5.cov_refs.gtf
│   ├── [ 34M]  1G5.gtf
│   ├── [ 450]  1G5-hisat2_output.flagstat
│   ├── [ 638]  1G5_hisat2.stats
│   ├── [2.0G]  1G5.sorted.bam
│   ├── [988K]  1G5.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 422]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.6G]  1H11
│   ├── [ 559]  1H11_checksums.md5
│   ├── [8.3M]  1H11.cov_refs.gtf
│   ├── [ 34M]  1H11.gtf
│   ├── [ 449]  1H11-hisat2_output.flagstat
│   ├── [ 636]  1H11_hisat2.stats
│   ├── [1.5G]  1H11.sorted.bam
│   ├── [778K]  1H11.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   └── [3.7M]  t_data.ctab
├── [1.3G]  1H12
│   ├── [ 559]  1H12_checksums.md5
│   ├── [3.5M]  1H12.cov_refs.gtf
│   ├── [ 34M]  1H12.gtf
│   ├── [ 449]  1H12-hisat2_output.flagstat
│   ├── [ 636]  1H12_hisat2.stats
│   ├── [1.3G]  1H12.sorted.bam
│   ├── [698K]  1H12.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   ├── [ 424]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.8G]  1H6
│   ├── [ 553]  1H6_checksums.md5
│   ├── [2.9M]  1H6.cov_refs.gtf
│   ├── [ 34M]  1H6.gtf
│   ├── [ 449]  1H6-hisat2_output.flagstat
│   ├── [ 640]  1H6_hisat2.stats
│   ├── [1.8G]  1H6.sorted.bam
│   ├── [784K]  1H6.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   └── [3.7M]  t_data.ctab
├── [1.5G]  1H7
│   ├── [ 553]  1H7_checksums.md5
│   ├── [7.4M]  1H7.cov_refs.gtf
│   ├── [ 34M]  1H7.gtf
│   ├── [ 449]  1H7-hisat2_output.flagstat
│   ├── [ 637]  1H7_hisat2.stats
│   ├── [1.5G]  1H7.sorted.bam
│   ├── [780K]  1H7.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 422]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.6G]  1H8
│   ├── [ 553]  1H8_checksums.md5
│   ├── [6.6M]  1H8.cov_refs.gtf
│   ├── [ 34M]  1H8.gtf
│   ├── [ 449]  1H8-hisat2_output.flagstat
│   ├── [ 637]  1H8_hisat2.stats
│   ├── [1.6G]  1H8.sorted.bam
│   ├── [791K]  1H8.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 422]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [2.1G]  2B2
│   ├── [ 553]  2B2_checksums.md5
│   ├── [5.3M]  2B2.cov_refs.gtf
│   ├── [ 34M]  2B2.gtf
│   ├── [ 450]  2B2-hisat2_output.flagstat
│   ├── [ 639]  2B2_hisat2.stats
│   ├── [2.0G]  2B2.sorted.bam
│   ├── [961K]  2B2.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 424]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.8G]  2B3
│   ├── [ 553]  2B3_checksums.md5
│   ├── [8.2M]  2B3.cov_refs.gtf
│   ├── [ 34M]  2B3.gtf
│   ├── [ 449]  2B3-hisat2_output.flagstat
│   ├── [ 638]  2B3_hisat2.stats
│   ├── [1.7G]  2B3.sorted.bam
│   ├── [818K]  2B3.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 424]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.8G]  2C1
│   ├── [ 553]  2C1_checksums.md5
│   ├── [4.0M]  2C1.cov_refs.gtf
│   ├── [ 34M]  2C1.gtf
│   ├── [ 448]  2C1-hisat2_output.flagstat
│   ├── [ 638]  2C1_hisat2.stats
│   ├── [1.7G]  2C1.sorted.bam
│   ├── [636K]  2C1.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   ├── [ 422]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.6G]  2C2
│   ├── [ 553]  2C2_checksums.md5
│   ├── [7.6M]  2C2.cov_refs.gtf
│   ├── [ 34M]  2C2.gtf
│   ├── [ 450]  2C2-hisat2_output.flagstat
│   ├── [ 637]  2C2_hisat2.stats
│   ├── [1.5G]  2C2.sorted.bam
│   ├── [794K]  2C2.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 424]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [2.0G]  2D2
│   ├── [ 553]  2D2_checksums.md5
│   ├── [2.9M]  2D2.cov_refs.gtf
│   ├── [ 34M]  2D2.gtf
│   ├── [ 449]  2D2-hisat2_output.flagstat
│   ├── [ 640]  2D2_hisat2.stats
│   ├── [1.9G]  2D2.sorted.bam
│   ├── [795K]  2D2.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   ├── [ 424]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.4G]  2E2
│   ├── [ 553]  2E2_checksums.md5
│   ├── [2.2M]  2E2.cov_refs.gtf
│   ├── [ 33M]  2E2.gtf
│   ├── [ 449]  2E2-hisat2_output.flagstat
│   ├── [ 639]  2E2_hisat2.stats
│   ├── [1.4G]  2E2.sorted.bam
│   ├── [632K]  2E2.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   ├── [ 424]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [2.6G]  2F1
│   ├── [ 553]  2F1_checksums.md5
│   ├── [2.9M]  2F1.cov_refs.gtf
│   ├── [ 34M]  2F1.gtf
│   ├── [ 451]  2F1-hisat2_output.flagstat
│   ├── [ 642]  2F1_hisat2.stats
│   ├── [2.5G]  2F1.sorted.bam
│   ├── [980K]  2F1.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.2M]  i_data.ctab
│   ├── [ 424]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [1.6G]  2G1
│   ├── [ 553]  2G1_checksums.md5
│   ├── [6.1M]  2G1.cov_refs.gtf
│   ├── [ 34M]  2G1.gtf
│   ├── [ 449]  2G1-hisat2_output.flagstat
│   ├── [ 637]  2G1_hisat2.stats
│   ├── [1.6G]  2G1.sorted.bam
│   ├── [724K]  2G1.sorted.bam.bai
│   ├── [2.4M]  e2t.ctab
│   ├── [ 15M]  e_data.ctab
│   ├── [1.9M]  i2t.ctab
│   ├── [7.3M]  i_data.ctab
│   ├── [ 424]  input_fastqs_checksums.md5
│   └── [3.7M]  t_data.ctab
├── [ 35M]  Apulchra-genome.stringtie.gtf
├── [5.1M]  apul-gene_count_matrix.csv
├── [5.2M]  apul-transcript_count_matrix.csv
├── [ 587]  checksums.md5
├── [5.1K]  gtf_list.txt
├── [306K]  multiqc_data
│   ├── [5.0K]  multiqc_bowtie2.txt
│   ├── [ 307]  multiqc_citations.txt
│   ├── [268K]  multiqc_data.json
│   ├── [2.7K]  multiqc_general_stats.txt
│   ├── [4.1K]  multiqc.log
│   ├── [7.7K]  multiqc_samtools_flagstat.txt
│   └── [ 14K]  multiqc_sources.txt
├── [1.1M]  multiqc_report.html
├── [5.2K]  prepDE-sample_list.txt
├── [ 856]  sorted_bams.list
├── [ 64G]  sorted-bams-merged.bam
└── [ 13M]  sorted-bams-merged.bam.bai
 135G used in 41 directories, 535 files
# Load bash variables into memory
source .bashvars
# Change to ouput directory
cd "${output_dir_top}"
${programs_array[multiqc]} .
Merge all BAMs to singular BAM for use in transcriptome assembly later, if needed.
# Load bash variables into memory
source .bashvars
# Change to ouput directory
cd "${output_dir_top}"
## Create list of sorted BAMs for merging
find . -name "*sorted.bam" > sorted_bams.list
## Merge sorted BAMs
${programs_array[samtools]} merge \
-b sorted_bams.list \
${merged_bam} \
--threads ${threads}
## Index merged BAM
${programs_array[samtools]} index ${merged_bam}
# Load bash variables into memory
source .bashvars
# Change to ouput directory
cd "${output_dir_top}"
# Create singular transcript file, using GTF list file
"${programs_array[stringtie]}" --merge \
"${gtf_list}" \
-p "${threads}" \
-G "${genome_gff}" \
-o "${genome_index_name}".stringtie.gtf
# Load bash variables into memory
source .bashvars
# Change to ouput directory
cd "${output_dir_top}"
# Create file list for prepDE.py
while read -r line
do
  sample_no_path=${line##*/}
  sample=${sample_no_path%.*}
  echo ${sample} ${line}
done < gtf_list.txt >> prepDE-sample_list.txt
# Create count matrices for genes and transcripts
# Compatible with import to DESeq2
python3 "${programs_array[prepDE]}" \
--input=prepDE-sample_list.txt \
-g apul-gene_count_matrix.csv \
-t apul-transcript_count_matrix.csv \
--length=${read_length}
# Load bash variables into memory
source .bashvars
# Change to ouput directory
cd "${output_dir_top}"
# Uses find command to avoid passing
# directory names to the md5sum command.
find . -maxdepth 1 -type f -exec md5sum {} + \
| tee --append checksums.md5