---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
layout: post
title: Genome Analysis - Identification of Potential Contaminating Sequences in Panopea-generosa-v1.0 Assembly Using BlobToolKit on Mox
date: '2021-07-12 05:46'
tags: 
  - blobtoolkit
  - blobtools
  - Panopea-generosa-v1.0
  - genome
  - Panopea generosa
  - Pacific geoduck
  - mox
categories: 
  - 2021
  - Geoduck Genome Sequencing
---

As part of our [_Panopea generosa_ (Pacific geoduck)](http://en.wikipedia.org/wiki/Geoduck) genome sequencing efforts, Steven came across a tool designed to help identify if there are any contaminating sequences in your assembly. The software is [BlobToolKit](https://blobtoolkit.genomehubs.org/). The software is actually a complex pipeline of separate tools ([minimap2](https://github.com/lh3/minimap2), [`BLAST`](https://www.ncbi.nlm.nih.gov/books/NBK279690/), [`DIAMOND`](https://github.com/bbuchfink/diamond) BLAST, and [BUSCO](https://busco.ezlab.org/)) that aligns sequencing reads, assigns taxonomy to the reads, and marks regions of the assembly with various taxonomic assignments.

The pipeline has been regularly updated, and when I initially tried to run it, it failed because the pipeline required an internet connection. Side note: the tool was really written to analyze all of the genome assemblies that exist in NCBI, so it was designed to automatically download assemblies/reads while running. Considering the amount of data I had, this could only be run on Mox, which does _not_ have internet access for any submitted jobs. The developer has been remarkably helpful and worked to update the pipeline to remove this barrier, allowing for local (i.e. offline) analysis. There were some other bumps along the way and the developer addressed those issues, too!

One _great_ aspect of this pipeline is that it's written as a Snakemake pipeline, which incorporates checkpoints. This allowed me to continue the pipeline each time it died due to some weird bug here and there. This job would've taken much longer if I had had to restart it from the beginning each time.

The job was run on Mox.

SBATCH script (GitHub):

- [20210712_pgen_blobtools_Panopea-generosa-v1.0.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20210712_pgen_blobtools_Panopea-generosa-v1.0.sh)

```shell
#!/bin/bash
## Job Name
#SBATCH --job-name=20210712_pgen_blobtools_Panopea-generosa-v1.0
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=200G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20210712_pgen_blobtools_Panopea-generosa-v1.0


### Script to run the Blobtools2 Pipeline
### on trimmed 10x Genomics/HiC FastQs from 20210401.
### Used to identify sequencing contaminants in Panopea-generosa-v1.0 genome assembly.

### Generates a Snakemake config file.
### Outputs Blobtools2 JSON files for use in the Blobtools2 viewer.
### Utilizes NCBI taxonomy dump and customized UniProt database for DIAMOND BLASTx.

### Requires Anaconda to be in system $PATH!
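### NOTE: Because the BlobToolKit pipeline is a Snakemake workflow with checkpoints,
### a killed/failed job can be resumed by re-submitting this script; the snakemake
### call below uses --rerun-incomplete, so previously completed steps are not redone.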
### Follows instructions for release v2.6.0 (https://github.com/blobtoolkit/pipeline/tree/release/v2.6.0)
### and changes noted in this GitHub Issue: https://github.com/blobtoolkit/pipeline/issues/13#issuecomment-868510883

### MODIFIED HARD-CODED THREAD COUNTS IN rules/run_minimap2_index.smk and rules/run_minimap2_align.smk


###################################################################################
# These variables need to be set by user

# Load blobtoolkit module to add btk to PATH
module load blobtoolkit-v2.6.1.module

# Set working directory
wd=$(pwd)
echo "Working directory is ${wd}."
echo ""

# Set base directory for blobtools structure
base_dir=${wd}/blobtoolkit

# Pipeline options
## BLASTn evalue
evalue="1.0e-10"

## NCBI Tax ID and genus/species
ncbi_tax_id=1049056
species="Panopea generosa"

## NCBI Taxonomy Root ID to begin from
root=1

# Set number of CPUs to use
threads=40

# Input/output files
assembly_name=Panopea_generosa_v1
orig_fasta=/gscratch/srlab/sam/data/P_generosa/genomes/Panopea-generosa-v1.0.fa
fastq_checksums=fastq_checksums.md5
trimmed_reads_dir=/gscratch/scrubbed/samwhite/outputs/20210401_pgen_fastp_10x-genomics

## New genome name for BTK filename requirements (no periods)
genome_fasta=${wd}/Panopea_generosa_v1.fasta.gz

# Programs
## Blobtools2 directory
blobtools2=/gscratch/srlab/programs/blobtoolkit-v2.6.1/blobtools2

## BTK pipeline directory
btk_pipeline=/gscratch/srlab/programs/blobtoolkit-v2.6.1/pipeline

## Conda environment directory
conda_dir=/gscratch/srlab/programs/anaconda3/envs/btk_env

# Databases
## BUSCO lineage database directory
busco_dbs=/gscratch/srlab/sam/data/databases/BUSCO

## Blobtools NCBI taxonomy database directory
btk_ncbi_tax_dir=/gscratch/srlab/blastdbs/20210401_ncbi_taxonomy

## NCBI nt database dir
ncbi_db=/gscratch/srlab/blastdbs/20210401_ncbi_nt
ncbi_db_name="nt"

## Uniprot DIAMOND database dir
uniprot_db=/gscratch/srlab/blastdbs/20210401_uniprot_btk
uniprot_db_name=reference_proteomes

# Programs associative array
declare -A programs_array
programs_array=()

###################################################################################

# Exit script if any command fails
set -e

# Load Anaconda
# Unknown why this is needed, but Anaconda will not run if this line is not included.
. "/gscratch/srlab/programs/anaconda3/etc/profile.d/conda.sh"


# Gzip FastA - needed for blobtoolkit to run properly
if [ ! -f "${genome_fasta}" ]; then
  gzip -c ${orig_fasta} > "${genome_fasta}"
fi

# Generate checksum for "new" FastA
if [ ! -f "genome_fasta.md5" ]; then
  md5sum "${genome_fasta}" > genome_fasta.md5
fi

# Concatenate all R1 reads
if [ ! -f "reads_1.fastq.gz" ]; then
  for fastq in "${trimmed_reads_dir}"/*R1*.fq.gz
  do
    echo ""
    echo "Generating checksum for ${fastq}"
    md5sum "${fastq}" >> ${fastq_checksums}
    echo "Checksum generated for ${fastq}."
    echo ""
    echo "Concatenating ${fastq} to reads_1.fastq.gz"
    cat "${fastq}" >> reads_1.fastq.gz
    echo "Finished concatenating ${fastq} to reads_1.fastq.gz"
  done
fi

# Concatenate all R2 reads
if [ ! -f "reads_2.fastq.gz" ]; then
  for fastq in "${trimmed_reads_dir}"/*R2*.fq.gz
  do
    echo ""
    echo "Generating checksum for ${fastq}"
    md5sum "${fastq}" >> ${fastq_checksums}
    echo "Checksum generated for ${fastq}."
echo "" echo "Concatenating ${fastq} to reads_2.fastq.gz" cat "${fastq}" >> reads_2.fastq.gz echo "Finished concatenating ${fastq} to reads_2.fastq.gz" done fi # Count scaffolds in assembly scaffold_count=$(grep -c ">" "${genome_fasta}") # Count nucleotides in assembly genome_nucleotides_count=$(grep -v ">" "${genome_fasta}" | wc | awk '{print $3-$1}') # Create BTK config YAML # BUSCO lineage order is important! List in order of most taxonomic specific to most general. if [ -f "config.yaml" ]; then rm "config.yaml" fi { printf "%s\n" "assembly:" printf "%2s%s\n" "" "accession: draft" "" "file: ${genome_fasta}" "" "level: scaffold" "" "scaffold-count: ${scaffold_count}" "" "span: ${genome_nucleotides_count}" printf "%2s%s\n" "" "prefix: ${assembly_name}" printf "%s\n" "busco:" printf "%2s%s\n" "" "download_dir: ${busco_dbs}" printf "%2s%s\n" "" "lineages:" printf "%4s%s\n" "" "- arthropoda_odb10" "" "- eukaryota_odb10" "" "- metazoa_odb10" "" "- bacteria_odb10" "" "- archaea_odb10" printf "%2s%s\n" "" "basal_lineages:" printf "%4s%s\n" "" "- archaea_odb10" "" "- bacteria_odb10" "" "- eukaryota_odb10" printf "%s\n" "reads:" printf "%2s%s\n" "" "paired:" printf "%4s%s\n" "" "- prefix: reads" printf "%6s%s\n" "" "platform: ILLUMINA" "" "file: ${wd}/reads_1.fastq.gz;${wd}/reads_2.fastq.gz" printf "%s\n" "settings:" printf "%2s%s\n" "" "taxdump: ${btk_ncbi_tax_dir}" printf "%2s%s\n" "" "blast_chunk: 100000" printf "%2s%s\n" "" "blast_max_chunks: 10" printf "%2s%s\n" "" "blast_overlap: 0" printf "%2s%s\n" "" "blast_min_length: 1000" printf "%s\n" "similarity:" printf "%2s%s\n" "" "defaults:" printf "%4s%s\n" "" "evalue: ${evalue}" "" "max_target_seqs: 10" "" "import_evalue: 1.0e-25" "" "taxrule: buscogenes" printf "%2s%s\n" "" "diamond_blastx:" printf "%4s%s\n" "" "name: ${uniprot_db_name}" "" "path: ${uniprot_db}" printf "%2s%s\n" "" "diamond_blastp:" printf "%4s%s\n" "" "name: ${uniprot_db_name}" "" "path: ${uniprot_db}" "" "import_max_target_seqs: 100000" printf "%2s%s\n" "" "blastn:" printf "%4s%s\n" "" "name: ${ncbi_db_name}" "" "path: ${ncbi_db}" printf "%s\n" "taxon:" printf "%2s%s\n" "" "name: ${species}" "" "taxid: '${ncbi_tax_id}'" } >> config.yaml # Activate blobtoolkit conda environment conda activate btk_env # Run snakemake, btk pipeline snakemake -p \ --use-conda \ --conda-prefix ${conda_dir} \ --directory "${base_dir}" \ --configfile "${wd}"/config.yaml \ --stats ${assembly_name}.blobtoolkit.stats \ -j ${threads} \ --rerun-incomplete \ -s ${btk_pipeline}/blobtoolkit.smk \ --resources btk=1 ################################################################################### # Capture program options if [[ "${#programs_array[@]}" -gt 0 ]]; then echo "Logging program options..." for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" # Handle samtools help menus if [[ "${program}" == "samtools_index" ]] \ || [[ "${program}" == "samtools_sort" ]] \ || [[ "${program}" == "samtools_view" ]] then ${programs_array[$program]} # Handle DIAMOND BLAST menu elif [[ "${program}" == "diamond" ]]; then ${programs_array[$program]} help # Handle NCBI BLASTx menu elif [[ "${program}" == "blastx" ]]; then ${programs_array[$program]} -help fi ${programs_array[$program]} -h echo "" echo "" echo "----------------------------------------------" echo "" echo "" } &>> program_options.log || true # If MultiQC is in programs_array, copy the config file to this directory. 
if [[ "${program}" == "multiqc" ]]; then cp --preserve ~/.multiqc_config.yaml multiqc_config.yaml fi done fi # Document programs in PATH (primarily for program version ID) { date echo "" echo "System PATH for $SLURM_JOB_ID" echo "" printf "%0.s-" {1..10} echo "${PATH}" | tr : \\n } >> system_path.log echo "Finished logging system PATH" ``` --- # RESULTS Usually, I post the run time for my Mox jobs here. However, this was stopped/started numerous times, making it difficult to actually determine the cumulative runtime. Output folder: - [20210712_pgen_blobtools_Panopea-generosa-v1.0/](https://gannet.fish.washington.edu/Atumefaciens/20210712_pgen_blobtools_Panopea-generosa-v1.0/) - #### JSON output directory - [20210712_pgen_blobtools_Panopea-generosa-v1.0/blobtools/Panopea_generosa_v1/](https://gannet.fish.washington.edu/Atumefaciens/20210712_pgen_blobtools_Panopea-generosa-v1.0/blobtools/Panopea_generosa_v1/) - _These files are needed to use with the Blobtools viewer software._ Next up will be loading the JSON output files into the Blobtools viewer and analysing taxonomic assignments of reads/regions throughout our genome assembly.