---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
date: 2016-01-09 04:17:12+00:00
layout: post
slug: data-received-bisulfite-treated-illumina-sequencing-from-genewiz
title: Data Received - Bisulfite-treated Illumina Sequencing from Genewiz
categories:
- 2016
- BS-seq Libraries for Sequencing at Genewiz
- Olympia oyster reciprocal transplant
tags:
- BS-seq
- Crassostrea gigas
- Katie Lotterhos
- olympia oyster
- Ostrea lurida
- Pacific oyster
- wget
---
Received notice from Genewiz that the sequencing data was ready for the samples submitted [20151222](https://robertslab.github.io/sams-notebook/posts/2015/2015-12-22-sample-submission-bs-seq-library-pool-to-genewiz/).
Downloaded the FASTQ files from the Genewiz project directory:
wget -r -np -nc -A "*.gz" ftp://username:password@ftp2.genewiz.com/Project_BS1512183
Since two species were sequenced (_C.gigas_ & _O.lurida_), the corresponding files are in the following locations:
[https://owl.fish.washington.edu/nightingales/O_lurida/](http://owl.fish.washington.edu/nightingales/O_lurida/)
[https://owl.fish.washington.edu/nightingales/C_gigas/](http://owl.fish.washington.edu/nightingales/C_gigas/)
In order to process the files, I needed to identify just the FASTQ files from this project and save the list of files to a bash variable called 'bsseq':
bsseq=$(ls | grep '^[0-9]\{1\}_*' | grep -v "2bRAD")
Explanation:
bsseq=
* This creates a variable called "bsseq" and sets it to the output of the command substitution that follows the equals sign.
$(ls | grep '^[0-9]\{1\}_*' | grep -v "2bRAD")
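* This lists (ls) all files and pipes (|) the list to grep, which keeps the names that begin with (^) a single digit ([0-9]\{1\}) followed by zero or more underscores (_*). Because the underscore is optional, this matches any name that starts with a digit, which covers both the one-digit and two-digit sample prefixes. Those results are piped (|) to a second grep command, which excludes (-v) any names containing the text "2bRAD".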
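A stricter variant of the filter, as a minimal sketch: it requires one or two digits followed by a literal underscore. The scratch directory and the `2bRAD_example.fastq.gz` file name are hypothetical stand-ins for illustration, not the real contents of the data directory.

```shell
# Hypothetical scratch directory with a few illustrative file names.
demo_dir=$(mktemp -d)
touch "$demo_dir/1_ATCACG_L001_R1_001.fastq.gz" \
      "$demo_dir/10_TAGCTT_L001_R1_001.fastq.gz" \
      "$demo_dir/2bRAD_example.fastq.gz" \
      "$demo_dir/readme.md"

# Original pattern: a digit followed by zero or more underscores,
# so it also matches the hypothetical 2bRAD file.
loose=$(ls "$demo_dir" | grep '^[0-9]\{1\}_*')

# Stricter pattern: one or two digits, then a literal underscore.
# The 2bRAD exclusion is kept as a safety net.
strict=$(ls "$demo_dir" | grep '^[0-9]\{1,2\}_' | grep -v "2bRAD")

echo "Loose pattern:";  echo "$loose"
echo "Strict pattern:"; echo "$strict"

# Remove the scratch directory when done: rm -r "$demo_dir"
```

With the stricter pattern the second grep is no longer doing the real work, but keeping it guards against 2bRAD files that happen to use the same numeric prefix.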
| FILENAME | SAMPLE NAME | SPECIES |
|---|---|---|
| 1_ATCACG_L001_R1_001.fastq.gz | 1NF11 | O.lurida |
| 2_CGATGT_L001_R1_001.fastq.gz | 1NF15 | O.lurida |
| 3_TTAGGC_L001_R1_001.fastq.gz | 1NF16 | O.lurida |
| 4_TGACCA_L001_R1_001.fastq.gz | 1NF17 | O.lurida |
| 5_ACAGTG_L001_R1_001.fastq.gz | 2NF5 | O.lurida |
| 6_GCCAAT_L001_R1_001.fastq.gz | 2NF6 | O.lurida |
| 7_CAGATC_L001_R1_001.fastq.gz | 2NF7 | O.lurida |
| 8_ACTTGA_L001_R1_001.fastq.gz | 2NF8 | O.lurida |
| 9_GATCAG_L001_R1_001.fastq.gz | M2 | C.gigas |
| 10_TAGCTT_L001_R1_001.fastq.gz | M3 | C.gigas |
| 11_GGCTAC_L001_R1_001.fastq.gz | NF2_6 | O.lurida |
| 12_CTTGTA_L001_R1_001.fastq.gz | NF_18 | O.lurida |
totalreads=0; for i in $bsseq; do linecount=`gunzip -c "$i" | wc -l`; readcount=$((linecount/4)); totalreads=$((readcount+totalreads)); done; echo $totalreads
Total reads = 138,530,448
_C.gigas_ reads: 22,249,631
_O.lurida_ reads: 116,280,817
Code explanation:
totalreads=0;
* Creates a variable called "totalreads" and initializes its value to 0.
for i in $bsseq;
* Starts a for loop that processes each file in the list stored in the $bsseq variable. The FASTQ files have been compressed with gzip and end with the .gz extension.
do linecount=
* Creates a variable called "linecount" that stores the output of the following command:
`gunzip -c "$i" | wc -l`;
* Decompresses the current file ($i) to stdout (-c), leaving the compressed file on disk unchanged. The output is piped to the word count command with the line flag (wc -l) to count the number of lines in the file.
readcount=$((linecount/4));
* Divides the value stored in linecount by 4. This is because an entry for a single Illumina read comprises four lines. This value is stored in the "readcount" variable.
totalreads=$((readcount+totalreads));
* Adds the readcount for the current file to the running total stored in totalreads.
done;
* Ends the for loop.
echo $totalreads
* Prints the value of totalreads to the screen.
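The same counting logic can be tried on synthetic data, as a minimal sketch. The two tiny FASTQ files and the scratch directory are made up for illustration, and the loop iterates over a quoted glob rather than over `$bsseq`, which is an assumption that also tolerates file names containing spaces.

```shell
# Hypothetical scratch directory with two tiny gzipped FASTQ files.
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip > "$demo_dir/demo1_R1.fastq.gz"
printf '@read3\nGGCC\n+\nIIII\n' | gzip > "$demo_dir/demo2_R1.fastq.gz"

totalreads=0
for i in "$demo_dir"/*.fastq.gz; do
  # Stream the decompressed file to wc; nothing is written to disk.
  linecount=$(gunzip -c "$i" | wc -l)
  # Each FASTQ record spans four lines.
  readcount=$((linecount / 4))
  totalreads=$((readcount + totalreads))
done
echo "$totalreads"

# Remove the scratch directory when done: rm -r "$demo_dir"
```

The first file holds two four-line records and the second holds one, so the loop should report three reads in total.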
Next, I wanted to generate a list of the FASTQ files and their corresponding read counts, and append this information to the readme file.
for i in $bsseq; do linecount=`gunzip -c "$i" | wc -l`; readcount=$(($linecount/4)); printf "%s\t%s\n\n" "$i" "$readcount" >> readme.md; done
Code explanation:
for i in $bsseq; do linecount=`gunzip -c "$i" | wc -l`; readcount=$(($linecount/4));
* Same for loop as above that calculates the number of reads in each FASTQ file.
printf "%s\t%s\n\n" "$i" "$readcount" >> readme.md;
* This formats the printed output. The "%s\t%s\n\n" portion prints the value in $i as a string (%s), followed by a tab (\t), followed by the value in $readcount as a string (%s), followed by two consecutive newlines (\n\n) to provide an empty line between the entries. See the readme file linked above to see how the output looks.
>> readme.md; done
* This appends the result from each loop to the readme.md file and ends the for loop (done).
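A minimal sketch of the per-file report, run against a synthetic file so that the output format is easy to inspect. The scratch directory, the demo FASTQ file, and the use of `basename` to strip the directory from the name are assumptions for illustration; the original loop ran in the data directory, so its file names had no path to strip.

```shell
# Hypothetical scratch directory with one tiny gzipped FASTQ file.
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip > "$demo_dir/demo_R1.fastq.gz"

readme="$demo_dir/readme.md"
for i in "$demo_dir"/*.fastq.gz; do
  linecount=$(gunzip -c "$i" | wc -l)
  readcount=$((linecount / 4))
  # File name, tab, read count, then a blank line between entries.
  printf "%s\t%s\n\n" "$(basename "$i")" "$readcount" >> "$readme"
done

cat "$readme"

# Remove the scratch directory when done: rm -r "$demo_dir"
```

Because `>>` appends, rerunning the loop adds duplicate entries to the readme rather than replacing the earlier ones.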