# Step 1: Get Sequence Data from Azenta through `sftp`
Sarah Tanja
October 9, 2024
- [Goals](#goals)
- [Get sequences from Azenta server](#get-sequences-from-azenta-server)
- [Check file integrity with
`md5sum`](#check-file-integrity-with-md5sum)
- [Summary](#summary)
# Goals
- Transfer project files to Roberts Lab raven server as a working
directory
- Transfer project files to Roberts Lab gannet server for backup
- Verify checksums using `md5sum`
Environment:
- raven: RStudio Server hosted on a Unix OS
- gannet: Linux, command line
# Get sequences from Azenta server
1. First, access the server that you want to download the sequences to.
To do this, `ssh` into the server.
2. From within the server, use `cd` to navigate to the
directory that you want the files copied to.
3. Once inside your target directory, connect to the Azenta server
using `sftp username@azenta.genewiz.com`.
4. Type in the password to access the Azenta server (sent to you in an
email from Azenta when they notified you that the sequences were
generated).
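The steps above can be sketched as a short session. This is only an illustration under assumptions: the login `your_user@raven.example.edu`, the target path, and the remote folder name `00_fastq` are hypothetical placeholders, and the interactive parts (which prompt for a password) are shown as comments rather than run.

```bash
# Step 1 (run on your own computer): log in to the destination server.
# Hypothetical login; substitute your own user name and host.
#   ssh your_user@raven.example.edu

# Step 2 (run on the destination server): create and enter the target directory.
# Hypothetical path; use wherever the raw reads should live.
target_dir="${HOME}/project/rawfastq"
mkdir -p "$target_dir"
cd "$target_dir"

# Steps 3 and 4: open an interactive SFTP session from inside the target
# directory. sftp prompts for the password that Azenta emailed.
#   sftp username@azenta.genewiz.com
#
# At the sftp> prompt (remote folder name is an assumption):
#   ls                # list what is available on the Azenta server
#   get -r 00_fastq   # recursively download the folder into the current directory
#   bye               # close the session
```

Downloads land in whatever directory `sftp` was started from, which is why step 2 comes before step 3.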
# Check file integrity with `md5sum`
What is an MD5 checksum? An MD5 checksum is a string of 32 hexadecimal
characters (a 128-bit hash) that the MD5 algorithm computes from a
file’s contents. It’s used to verify that a file is an accurate copy of
the original and hasn’t been modified or corrupted.
[Learn How to Generate and Verify Files with MD5 Checksum in
Linux](https://www.tecmint.com/generate-verify-check-files-md5-checksum-linux/)
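As a small illustration of the idea (a sketch using two throwaway files in a temporary directory; the file names are made up), `md5sum` turns a file's contents into a 32-character hexadecimal digest, and changing even one character of the contents produces a different digest:

```bash
# Work in a scratch directory so nothing real is touched.
scratch=$(mktemp -d)

# Two hypothetical files whose contents differ by a single character.
printf 'ACGTACGT\n' > "$scratch/original.txt"
printf 'ACGTACGA\n' > "$scratch/modified.txt"

# md5sum prints "<digest>  <file name>"; keep only the digest.
digest_original=$(md5sum "$scratch/original.txt" | cut -d' ' -f1)
digest_modified=$(md5sum "$scratch/modified.txt" | cut -d' ' -f1)

echo "original: $digest_original (${#digest_original} characters)"
echo "modified: $digest_modified (${#digest_modified} characters)"
```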
`6C14_R1_001.fastq.gz.md5` is an MD5 checksum file that Azenta
generated. It looks like this:
``` bash
cd ../rawfastq/00_fastq/
less -S 6C14_R1_001.fastq.gz.md5
```
986886738a844beca568362da97600c9 ./6C14_R1_001.fastq.gz
The `md5sum` command will **generate an MD5 checksum** for the file I
downloaded from the Azenta server:
``` bash
md5sum ../rawfastq/00_fastq/6C14_R1_001.fastq.gz
```
986886738a844beca568362da97600c9 ../rawfastq/00_fastq/6C14_R1_001.fastq.gz
Success! The checksums are the same.
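The manual comparison can also be written as a short script. This is a minimal sketch that runs on a stand-in file in a temporary directory (`sample.fastq.gz` and its `.md5` file are hypothetical and created here only for illustration); with real data, the `.md5` file would be the one supplied by Azenta:

```bash
scratch=$(mktemp -d)
cd "$scratch"

# Stand-in for a downloaded file and the vendor-provided checksum file.
printf 'stand-in sequence data\n' > sample.fastq.gz
md5sum ./sample.fastq.gz > sample.fastq.gz.md5

# Expected digest: the first field of the .md5 file.
expected=$(cut -d' ' -f1 sample.fastq.gz.md5)
# Actual digest: recomputed from the file on disk.
actual=$(md5sum ./sample.fastq.gz | cut -d' ' -f1)

if [ "$expected" = "$actual" ]; then
  result="match"
else
  result="MISMATCH"
fi
echo "sample.fastq.gz: $result"
```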
Now we can automate this process with `md5sum -c`:
``` bash
md5sum --help
```
    Usage: md5sum [OPTION]... [FILE]...
    Print or check MD5 (128-bit) checksums.
    With no FILE, or when FILE is -, read standard input.
      -b, --binary          read in binary mode
      -c, --check           read MD5 sums from the FILEs and check them
          --tag             create a BSD-style checksum
      -t, --text            read in text mode (default)
    The following five options are useful only when verifying checksums:
          --ignore-missing  don't fail or report status for missing files
          --quiet           don't print OK for each successfully verified file
          --status          don't output anything, status code shows success
          --strict          exit non-zero for improperly formatted checksum lines
      -w, --warn            warn about improperly formatted checksum lines
          --help            display this help and exit
          --version         output version information and exit
    The sums are computed as described in RFC 1321. When checking, the input
    should be a former output of this program. The default mode is to print a
    line with checksum, a space, a character indicating input mode ('*' for binary,
    ' ' for text or where binary is insignificant), and name for each FILE.
    GNU coreutils online help:
    Full documentation at:
    or available locally via: info '(coreutils) md5sum invocation'
The following command will:
- read the `.md5` file generated by Azenta
- generate an MD5 checksum for the `.fastq.gz` file that the `.md5` file
points to
- compare the MD5 checksum from the Azenta-provided `.md5` file to the
generated MD5 checksum
``` bash
cd ../rawfastq/00_fastq # move to the directory that has both fastq.gz and md5 files
md5sum -c 6C14_R1_001.fastq.gz.md5
```
./6C14_R1_001.fastq.gz: OK
> [!CAUTION]
>
> **Handling File Paths**: Ensure that the paths inside the `.md5` files
> correctly reference their associated files, whether relative or
> absolute. `md5sum -c` relies on these paths to locate the files. You
> should be running `md5sum -c` from within the directory that contains
> both the `.md5` and `fastq.gz` files!
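To see why the working directory matters, here is a small sketch with a hypothetical file in a temporary directory. The `.md5` file records the relative path `./reads.fastq.gz`, so the check succeeds only when `md5sum -c` runs from the directory that contains that file:

```bash
scratch=$(mktemp -d)
mkdir "$scratch/00_fastq"

# Hypothetical stand-in file plus a checksum file that records a relative path.
printf 'stand-in sequence data\n' > "$scratch/00_fastq/reads.fastq.gz"
(cd "$scratch/00_fastq" && md5sum ./reads.fastq.gz > reads.fastq.gz.md5)

# Run from the data directory: the relative path resolves, so the check passes.
if (cd "$scratch/00_fastq" && md5sum -c reads.fastq.gz.md5); then
  inside="ok"
else
  inside="failed"
fi

# Run from the parent directory: ./reads.fastq.gz cannot be found, so it fails.
if (cd "$scratch" && md5sum -c 00_fastq/reads.fastq.gz.md5 2>/dev/null); then
  outside="ok"
else
  outside="failed"
fi

echo "from data directory: $inside; from parent directory: $outside"
```

The subshells `( ... )` keep each `cd` from changing the directory of the surrounding script.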
Now let’s do this for all of the transferred files.
Use the `md5sum -c` command to compare the downloaded files against all of
Azenta’s generated checksums (the files ending in `.md5`):
``` bash
cd ../rawfastq/00_fastq
md5sum -c *.md5
```
./101112C14_R1_001.fastq.gz: OK
./101112C14_R2_001.fastq.gz: OK
./101112C4_R1_001.fastq.gz: OK
./101112C4_R2_001.fastq.gz: OK
./101112C9_R1_001.fastq.gz: OK
./101112C9_R2_001.fastq.gz: OK
./101112H14_R1_001.fastq.gz: OK
./101112H14_R2_001.fastq.gz: OK
./101112H4_R1_001.fastq.gz: OK
./101112H4_R2_001.fastq.gz: OK
./101112H9_R1_001.fastq.gz: OK
./101112H9_R2_001.fastq.gz: OK
./101112L14_R1_001.fastq.gz: OK
./101112L14_R2_001.fastq.gz: OK
./101112L4_R1_001.fastq.gz: OK
./101112L4_R2_001.fastq.gz: OK
./101112L9_R1_001.fastq.gz: OK
./101112L9_R2_001.fastq.gz: OK
./101112M14_R1_001.fastq.gz: OK
./101112M14_R2_001.fastq.gz: OK
./101112M4_R1_001.fastq.gz: OK
./101112M4_R2_001.fastq.gz: OK
./101112M9_R1_001.fastq.gz: OK
./101112M9_R2_001.fastq.gz: OK
./123C14_R1_001.fastq.gz: OK
./123C14_R2_001.fastq.gz: OK
./123C4_R1_001.fastq.gz: OK
./123C4_R2_001.fastq.gz: OK
./123C9_R1_001.fastq.gz: OK
./123C9_R2_001.fastq.gz: OK
./123H14_R1_001.fastq.gz: OK
./123H14_R2_001.fastq.gz: OK
./123H4_R1_001.fastq.gz: OK
./123H4_R2_001.fastq.gz: OK
./123H9_R1_001.fastq.gz: OK
./123H9_R2_001.fastq.gz: OK
./123L14_R1_001.fastq.gz: OK
./123L14_R2_001.fastq.gz: OK
./123L4_R1_001.fastq.gz: OK
./123L4_R2_001.fastq.gz: OK
./123L9_R1_001.fastq.gz: OK
./123L9_R2_001.fastq.gz: OK
./123M4_R1_001.fastq.gz: OK
./123M4_R2_001.fastq.gz: OK
./123M9_R1_001.fastq.gz: OK
./123M9_R2_001.fastq.gz: OK
./131415C14_R1_001.fastq.gz: OK
./131415C14_R2_001.fastq.gz: OK
./131415C4_R1_001.fastq.gz: OK
./131415C4_R2_001.fastq.gz: OK
./131415C9_R1_001.fastq.gz: OK
./131415C9_R2_001.fastq.gz: OK
./131415H14_R1_001.fastq.gz: OK
./131415H14_R2_001.fastq.gz: OK
./131415H4_R1_001.fastq.gz: OK
./131415H4_R2_001.fastq.gz: OK
./131415H9_R1_001.fastq.gz: OK
./131415H9_R2_001.fastq.gz: OK
./131415L14_R1_001.fastq.gz: OK
./131415L14_R2_001.fastq.gz: OK
./131415L4_R1_001.fastq.gz: OK
./131415L4_R2_001.fastq.gz: OK
./131415L9_R1_001.fastq.gz: OK
./131415L9_R2_001.fastq.gz: OK
./131415M14_R1_001.fastq.gz: OK
./131415M14_R2_001.fastq.gz: OK
./131415M4_R1_001.fastq.gz: OK
./131415M4_R2_001.fastq.gz: OK
./131415M9_R1_001.fastq.gz: OK
./131415M9_R2_001.fastq.gz: OK
./13M14_R1_001.fastq.gz: OK
./13M14_R2_001.fastq.gz: OK
./456C4_R1_001.fastq.gz: OK
./456C4_R2_001.fastq.gz: OK
./456H4_R1_001.fastq.gz: OK
./456H4_R2_001.fastq.gz: OK
./456H9_R1_001.fastq.gz: OK
./456H9_R2_001.fastq.gz: OK
./456L14_R1_001.fastq.gz: OK
./456L14_R2_001.fastq.gz: OK
./456L4_R1_001.fastq.gz: OK
./456L4_R2_001.fastq.gz: OK
./456L9_R1_001.fastq.gz: OK
./456L9_R2_001.fastq.gz: OK
./456M14_R1_001.fastq.gz: OK
./456M14_R2_001.fastq.gz: OK
./456M4_R1_001.fastq.gz: OK
./456M4_R2_001.fastq.gz: OK
./456M9_R1_001.fastq.gz: OK
./456M9_R2_001.fastq.gz: OK
./45C14_R1_001.fastq.gz: OK
./45C14_R2_001.fastq.gz: OK
./45C9_R1_001.fastq.gz: OK
./45C9_R2_001.fastq.gz: OK
./45H14_R1_001.fastq.gz: OK
./45H14_R2_001.fastq.gz: OK
./67C9_R1_001.fastq.gz: OK
./67C9_R2_001.fastq.gz: OK
./6C14_R1_001.fastq.gz: OK
./6C14_R2_001.fastq.gz: OK
./789C4_R1_001.fastq.gz: OK
./789C4_R2_001.fastq.gz: OK
./789H14_R1_001.fastq.gz: OK
./789H14_R2_001.fastq.gz: OK
./789H4_R1_001.fastq.gz: OK
./789H4_R2_001.fastq.gz: OK
./789H9_R1_001.fastq.gz: OK
./789H9_R2_001.fastq.gz: OK
./789L14_R1_001.fastq.gz: OK
./789L14_R2_001.fastq.gz: OK
./789L4_R1_001.fastq.gz: OK
./789L4_R2_001.fastq.gz: OK
./789L9_R1_001.fastq.gz: OK
./789L9_R2_001.fastq.gz: OK
./789M14_R1_001.fastq.gz: OK
./789M14_R2_001.fastq.gz: OK
./789M4_R1_001.fastq.gz: OK
./789M4_R2_001.fastq.gz: OK
./7C14_R1_001.fastq.gz: OK
./7C14_R2_001.fastq.gz: OK
./89C14_R1_001.fastq.gz: OK
./89C14_R2_001.fastq.gz: OK
./89C9_R1_001.fastq.gz: OK
./89C9_R2_001.fastq.gz: OK
./89M9_R1_001.fastq.gz: OK
./89M9_R2_001.fastq.gz: OK
> [!TIP]
>
> Each file should show `OK`!
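With this many files it is easy to miss a single failure in the output. One option, sketched here on hypothetical stand-in files in a temporary directory, is the `--quiet` flag listed in the `md5sum --help` output, which prints only the files that fail, combined with the exit status of `md5sum -c` (zero only when every file verifies):

```bash
scratch=$(mktemp -d)
cd "$scratch"

# Three hypothetical stand-in files, each with its own .md5 file.
for name in a_R1 a_R2 b_R1; do
  printf 'data for %s\n' "$name" > "${name}.fastq.gz"
  md5sum "./${name}.fastq.gz" > "${name}.fastq.gz.md5"
done

# Simulate a corrupted download by changing one file after its checksum was made.
printf 'corrupted\n' >> b_R1.fastq.gz

# --quiet suppresses the OK lines, so only problem files are captured.
if failures=$(md5sum -c --quiet *.md5 2>/dev/null); then
  status="all files OK"
else
  status="some files FAILED"
fi

echo "$status"
echo "$failures"
```

On real data, the same `md5sum -c --quiet *.md5` call would be run from the directory holding the `.fastq.gz` and `.md5` files; no output and a zero exit status mean every file verified.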
# Summary