---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
layout: post
title: Data Received - Anthopleura elegantissima - aggregating anemone - NanoPore Genome Sequence from Jay Dimond
date: '2021-02-01 12:09'
tags:
  - nanopore
  - ont
  - Anthopleura elegantissima
  - anenome
  - DNA
  - genome
categories:
  - 2021
  - Miscellaneous
---
Jay asked me to help get his _A. elegantissima_ (aggregating anemone) NanoPore gDNA sequencing data submitted to [NCBI Sequence Read Archive (SRA)](https://www.ncbi.nlm.nih.gov/sra). He sent a hard drive (HDD) with all the NanoPore sequencing Fast5 files. The HDD was received on 2/2/2021. Here are the details provided in the readme file in the Ae_ONT directory.

Readme file:

```
samb@computer:/media/samb/SeagatePortableDrive/Ae_ONT$ cat readme.txt
This directory contains genomic data for the sea anemone Anthopleura elegantissima from multiple Oxford Nanopore MinION DNA sequencing runs, as well as a genome assembly.

Folders with the 'bar' prefix contain data for a particular barcoded sample. Each barcode folder contains the merged fastq file plus a folder with raw fast5 data. Three libraries with one aposymbiotic individual and one symbiotic individual each, both individually barcoded, were prepared using the PCR-free ONT Ligation Sequencing Kit with the Native Barcoding Expansion Kit (SQK-LSK109 and EXP-NBD103), following the manufacturer’s 1D native barcoding gDNA protocol. Each library was run on a separate FLO-MIN 106D R9.4 flow cell. Basecalling and demultiplexing was performed with ONT Albacore Sequencing Pipeline Software v. 2.3.3, and sequencing adapters were removed with Porechop v. 0.2.4 (Wick et al. 2017).

Sample info
bar01 = anemone A4 (aposymbiotic)
bar02 = anemone G2 (symbiotic)
bar03 = anemone A3 (aposymbiotic)
bar04 = anemone G4 (symbiotic)
bar05 = anemone A1 (aposymbiotic)
bar06 = anemone G3 (symbiotic)

The 'Genome_data' folder contains data specific to the genome assembly, which was generated using exclusively aposymbiotic anemone samples. The file 'merged2.fq.zip' is a fastq file containing all reads from aposymbiotic anemones used for the genome assembly. However, in addition to containing data from the Ligation sequencing libraries described above, this file contains additional reads from two flow cell runs of another aposymbiotic sample prepared using the PCR-free, transposase-based ONT Rapid Sequencing Kit (SQK-RAD004) following manufacturer guidelines. Sequencing was done on two FLO-MIN 106D R9.4 flow cells.

Within the 'Genome_data' folder, the folder 'wtdbg2_genome' contains the genome assembly generated by the software program wtdbg2. The wtdbg2-generated draft genome comprises 243 Mb, including 5359 contigs with an N50 of 87 kb and N90 of 19.2 kb. All aposymbiotic sequences were used to generate the draft genome, providing a total of 5.6 Gb and an estimated coverage of 23x.
```

Due to a few factors (no write permissions on HDD, insufficient space on local HDD), it took me a while to get this data processed. I ended up rsync-ing just the Fast5 files to dedicated directories on Gannet. After that, I compressed them to gzipped tarballs (`tar.gz`), per the requirements for submitting ONT Fast5 files to [NCBI Sequence Read Archive (SRA)](https://www.ncbi.nlm.nih.gov/sra). I generated checksums for these gzipped tarballs and then rsync'ed them to our [Nightingales sequencing repository on Owl](https://owl.fish.washington.edu/nightingales/A_elegantissima/). Additionally, the file transfers themselves took quite some time, as they constitute a large amount of data (>300GB).

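The compression and checksum commands themselves aren't recorded in this post. As a rough sketch of what that step could look like, assuming GNU `tar` and `md5sum` are available: the `archive_samples` helper and the `checksums.md5` manifest name are hypothetical, and the demo runs on throwaway placeholder data rather than the real Fast5 directories.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not the commands actually run for this post):
# archive each sample directory under <parent_dir> as <sample>.tar.gz
# and record the MD5 checksum of each tarball in checksums.md5.
archive_samples() {
  local parent="$1"
  (
    cd "${parent}"
    : > checksums.md5
    for sample_dir in */; do
      # Skip the literal pattern if no directories matched the glob
      [ -d "${sample_dir}" ] || continue
      sample="${sample_dir%/}"
      tar -czf "${sample}.tar.gz" "${sample}"
      md5sum "${sample}.tar.gz" >> checksums.md5
    done
  )
}

# Demo on placeholder data in a temporary directory.
demo_dir="$(mktemp -d)"
mkdir "${demo_dir}/aele_A4_aposymb"
printf 'not real fast5 data\n' > "${demo_dir}/aele_A4_aposymb/read_0.fast5"

archive_samples "${demo_dir}"
cat "${demo_dir}/checksums.md5"
```

Writing all checksums to a single manifest keeps the later verification on the destination server to one `md5sum --check` call.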
`rsync` command:

```bash
samb@computer:/media/samb/SeagatePortableDrive/Ae_ONT$ time for dir in bar0*
do
  cd "${dir}" || exit
  if [ "${dir}" = "bar01" ]; then
    # Print dividing line
    printf '=%.0s' {1..50}
    echo ""
    echo "Syncing ${dir}."
    # Run rsync to only sync Fast5 files.
    # Utilizes process substitution to send output of find as a files list for rsync to use
    # The command will only copy files and will not replicate directory structure.
    rsync -0t --no-r --no-R --no-dirs --files-from=<(find . -type f -name "*.fast5" -print0) "$PWD"/ "gannet:/volume2/web/tmp/aele_A4_aposymb/"
    echo "Finished syncing ${dir}."
    echo ""
  elif [ "${dir}" = "bar02" ]; then
    # Print dividing line
    printf '=%.0s' {1..50}
    echo ""
    echo "Syncing ${dir}"
    # Run rsync to only sync Fast5 files.
    # Utilizes process substitution to send output of find as a files list for rsync to use
    # The command will only copy files and will not replicate directory structure.
    rsync -0t --no-r --no-R --no-dirs --files-from=<(find . -type f -name "*.fast5" -print0) "$PWD"/ "gannet:/volume2/web/tmp/aele_G2_symb/"
    echo "Finished syncing ${dir}."
    echo ""
  elif [ "${dir}" = "bar03" ]; then
    # Print dividing line
    printf '=%.0s' {1..50}
    echo ""
    echo "Syncing ${dir}"
    # Run rsync to only sync Fast5 files.
    # Utilizes process substitution to send output of find as a files list for rsync to use
    # The command will only copy files and will not replicate directory structure.
    rsync -0t --no-r --no-R --no-dirs --files-from=<(find . -type f -name "*.fast5" -print0) "$PWD"/ "gannet:/volume2/web/tmp/aele_A3_aposymb/"
    echo "Finished syncing ${dir}."
    echo ""
  elif [ "${dir}" = "bar04" ]; then
    # Print dividing line
    printf '=%.0s' {1..50}
    echo ""
    echo "Syncing ${dir}"
    # Run rsync to only sync Fast5 files.
    # Utilizes process substitution to send output of find as a files list for rsync to use
    # The command will only copy files and will not replicate directory structure.
    rsync -0t --no-r --no-R --no-dirs --files-from=<(find . -type f -name "*.fast5" -print0) "$PWD"/ "gannet:/volume2/web/tmp/aele_G4_symb/"
    echo "Finished syncing ${dir}."
    echo ""
  elif [ "${dir}" = "bar05" ]; then
    # Print dividing line
    printf '=%.0s' {1..50}
    echo ""
    echo "Syncing ${dir}"
    # Run rsync to only sync Fast5 files.
    # Utilizes process substitution to send output of find as a files list for rsync to use
    # The command will only copy files and will not replicate directory structure.
    rsync -0t --no-r --no-R --no-dirs --files-from=<(find . -type f -name "*.fast5" -print0) "$PWD"/ "gannet:/volume2/web/tmp/aele_A1_aposymb/"
    echo "Finished syncing ${dir}."
    echo ""
  elif [ "${dir}" = "bar06" ]; then
    # Print dividing line
    printf '=%.0s' {1..50}
    echo ""
    echo "Syncing ${dir}"
    # Run rsync to only sync Fast5 files.
    # Utilizes process substitution to send output of find as a files list for rsync to use
    # The command will only copy files and will not replicate directory structure.
    rsync -0t --no-r --no-R --no-dirs --files-from=<(find . -type f -name "*.fast5" -print0) "$PWD"/ "gannet:/volume2/web/tmp/aele_G3_symb/"
    echo "Finished syncing ${dir}."
    echo ""
  fi
  cd "${ext_hdd}" || exit
done
==================================================
Syncing bar01.
Finished syncing bar01.

==================================================
Syncing bar02
Finished syncing bar02.

==================================================
Syncing bar03
Finished syncing bar03.

==================================================
Syncing bar04
Finished syncing bar04.

==================================================
Syncing bar05
Finished syncing bar05.

==================================================
Syncing bar06
Finished syncing bar06.


real    885m51.779s
user    24m10.217s
sys     78m51.111s
```

Files were compressed into gzipped tarballs, MD5 checksums generated, rsync'd to Owl, and checksums verified:

![Screencap of checksum verification after rsync to Owl](https://raw.githubusercontent.com/RobertsLab/sams-notebook/master/images/screencaps/20210201_aele_data-received_checksum-verification.png)

I updated our [Nightingales Google Sheet](http://b.link/nightingales).
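
The checksum verification after the transfer to Owl is only shown as a screencap. A minimal sketch of that kind of check follows, assuming GNU `md5sum` and a manifest file that was copied alongside the tarballs. The `verify_checksums` helper, the `checksums.md5` name, and the placeholder file are all hypothetical and are not the commands actually run.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: verify every file listed in an MD5 manifest.
# Returns non-zero if any file is missing or its checksum differs.
verify_checksums() {
  local dir="$1"
  local manifest="${2:-checksums.md5}"
  ( cd "${dir}" && md5sum --check --quiet "${manifest}" )
}

# Demo on a placeholder file standing in for a transferred tarball.
demo_dir="$(mktemp -d)"
printf 'placeholder archive contents\n' > "${demo_dir}/aele_A4_aposymb.tar.gz"
( cd "${demo_dir}" && md5sum aele_A4_aposymb.tar.gz > checksums.md5 )

if verify_checksums "${demo_dir}"; then
  echo "All checksums match."
fi

# Simulate a corrupted transfer by appending bytes to the file.
printf 'corrupted\n' >> "${demo_dir}/aele_A4_aposymb.tar.gz"
if ! verify_checksums "${demo_dir}" > /dev/null 2>&1; then
  echo "Checksum mismatch detected."
fi
```

With `--quiet`, `md5sum` prints nothing for files that pass, so any output on a real run points directly at the files that need to be re-transferred.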