---
title: "Step 1: Get Sequence Data from Azenta through `sftp`"
author: "Sarah Tanja"
date: 10/09/2024
date-format: long
date-modified: today
format: gfm
toc: true
toc-depth: 3
link-external-icon: true
link-external-newwindow: true
---

# Goals

- Transfer project files to the Roberts Lab raven server as a working directory
- Transfer project files to the Roberts Lab gannet server for backup
- Verify checksums using `md5sum`

Environment:

- raven: RStudio Server hosted on a Unix OS
- gannet: Linux, command line

# Get sequences

1. First, access the server that you want to download sequences to. Make sure there is plenty of space on the server for these large files.
2. From within that server, use `cd` to navigate to the directory that you want the files copied to.
3. Here, we used Azenta for sequencing services, and their server holds the raw files we need. Once inside your target directory, connect to the Azenta server with `sftp username@azenta.genewiz.com`
4. Type in the password for the Azenta server (sent to you in an email from Azenta when they notified you that the sequences were generated)

# Check file integrity

We check file integrity with `md5sum`.

What is an MD5 checksum? An MD5 checksum is a string of 32 hexadecimal characters (the digits 0-9 and the letters a-f) computed from a file's contents by the MD5 hash algorithm. It's used to verify that a file is an accurate copy of the original and hasn't been modified or corrupted.

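
To make the idea concrete, here is a minimal sketch that generates and checks a checksum on a throwaway file. The file name `demo.txt` and the scratch directory are made up for illustration and are not part of the project data; the sketch assumes GNU coreutils `md5sum`, as found on most Linux systems.

```bash
# Work in a scratch directory so no project files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Create a small throwaway file (hypothetical example data).
printf 'hello\n' > demo.txt

# Generate the checksum and save it next to the file.
# Each line of the .md5 file holds the 32-character hash and the file name.
md5sum demo.txt > demo.txt.md5
cat demo.txt.md5

# An unchanged file verifies as OK.
md5sum -c demo.txt.md5

# Change a single character and the check fails.
printf 'hellp\n' > demo.txt
md5sum -c demo.txt.md5 || echo "checksum mismatch detected"
```

Even a one-character change produces a different hash, which is why comparing checksums catches truncated or corrupted transfers.
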
::: callout-info
[Learn How to Generate and Verify Files with MD5 Checksum in Linux](https://www.tecmint.com/generate-verify-check-files-md5-checksum-linux/)
:::

`6C14_R1_001.fastq.gz.md5` is an MD5 checksum file that Azenta generated. It looks like:

```{bash}
cd ../rawfastq/00_fastq/
# print the checksum file without wrapping long lines
less -S 6C14_R1_001.fastq.gz.md5
```

`986886738a844beca568362da97600c9 ./6C14_R1_001.fastq.gz`

The `md5sum` command will **generate an MD5 checksum** for the file I downloaded from the Azenta server:

```{r, engine='bash'}
md5sum ../rawfastq/00_fastq/6C14_R1_001.fastq.gz
```

Success! The checksums are the same. So now we can automate this process with `md5sum -c`

```{bash}
# list the available options, including -c (check)
md5sum --help
```

The following command will:

- read the `.md5` file generated by Azenta
- generate an MD5 checksum for the `.fastq.gz` file that the `.md5` file points to
- compare the MD5 checksum from the Azenta-provided `.md5` file to the newly generated MD5 checksum

```{bash}
cd ../rawfastq/00_fastq # move to the directory that has both fastq.gz and md5 files
md5sum -c 6C14_R1_001.fastq.gz.md5
```

::: callout-caution
**Handling File Paths**: Ensure that the paths inside the `.md5` files correctly reference their associated files, whether relative or absolute. `md5sum -c` relies on these paths to locate the files. Here the paths are relative (`./6C14_R1_001.fastq.gz`), so you should be running `md5sum -c` from within the directory that contains both the `.md5` and `fastq.gz` files!
:::

Now let's do this for all the transferred files.

Use the `md5sum -c` command to compare the downloaded files against all of Azenta's generated checksums (the files ending in `.md5`):

```{r, engine = 'bash'}
cd ../rawfastq/00_fastq
# check every .md5 file in this directory against its fastq.gz file
md5sum -c *.md5
```

::: {.callout-tip appearance="minimal" icon="false"}
Each file should show `OK`!
:::

# Summary
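
As a recap of the verification step, here is a minimal sketch of a helper that checks every `.md5` file in a directory and prints only the failures. `verify_dir` is a hypothetical function name, not part of the original workflow, and the `--quiet` option assumes GNU coreutils `md5sum`.

```bash
# verify_dir: hypothetical helper that checks every .md5 file in a directory.
# Returns 0 only if every listed file matches its checksum.
verify_dir() {
  # Run in a subshell so the caller's working directory is unchanged.
  (
    cd "$1" || exit 1
    # --quiet (GNU coreutils) suppresses the per-file OK lines,
    # so only failures are printed.
    md5sum -c --quiet *.md5
  )
}

# Example usage with the directory from this notebook:
# verify_dir ../rawfastq/00_fastq && echo "all checksums OK"
```

Suppressing the `OK` lines can make a failure easier to spot when there are many files, and the exit status lets the check gate later pipeline steps.
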