--- author: Sam White toc-title: Contents toc-depth: 5 toc-location: left date: 2015-12-31 20:57:48+00:00 layout: post slug: data-received-oly-2brad-illumina-sequencing-from-genewiz title: Data Received - Oly 2bRAD Illumina Sequencing from Genewiz categories: - 2015 - 2bRAD Library Tests for Sequencing at Genewiz - Olympia oyster reciprocal transplant tags: - 2bRAD - bash - code - command line - olympia oyster - Ostrea lurida - RAD - shell --- The data was made available to use on 20151224 and took two days to download. The full list of samples (and the individual samples/libraries/indexes) submitted to Genewiz for this project by Katherine Silliman & me can be seen here (Google Sheet): [White_BS1511196_R2_barcodes](https://docs.google.com/spreadsheets/d/1DJP4zpF3OcISOAQ-MM8bW85WcJqdB5EvcExs2wGvzcg/edit?usp=sharing) The data supplied were all of the Illumina output files (currently not entirely sure where/how we want to store all of this, but we'll probably want to use them for attempting our own demultiplexing since there were a significant amount of reads that Genewiz was unable to demultiplex), in addition to demultiplexed FASTQ files. The FASTQ files were buried in inconvenient locations, and there are over 300 of them, so I used the power of the command line to find them and copy them to a single location: https://owl.fish.washington.edu/nightingales/O_lurida/2bRAD_Dec2015 Find and copy all FASTQ files: find /run/user/1000/gvfs/smb-share\:server\=owl.fish.washington.edu\,share\=home/ -name '*.fastq.*' -exec cp -n '{}' /run/user/1000/gvfs/smb-share\:server\=owl.fish.washington.edu\,share\=web/nightingales/O_lurida/ \; Code explanation: find * Command line program used for searching for files /run/user/1000/gvfs/smb-share\:server\=owl.fish.washington.edu\,share\=home/ * Location of the files I wanted to search through. The path looks a little crazy because I was working remotely and had the server share mounted. -name '*.fastq.*' * The name argument tells the find command to look for filenames that have ".fastq" in them. -exec cp -n '{}' * The exec option tells the find command to execute a subsequent action upon finding a match. In this case, I'm using the copy command (cp) and telling the program not to overwrite (clobber, -n) any duplicate files. /run/user/1000/gvfs/smb-share\:server\=owl.fish.washington.edu\,share\=web/nightingales/O_lurida/2bRAD_Dec2015 \; * The location where I want the matched files copied. I created a readme file in the directory directory with these files: [readme.md](https://owl.fish.washington.edu/nightingales/O_lurida/2bRAD_Dec2015/readme.md) I wanted to add some information about the project to the readme file, like total number of sequencing reads generated and the number of reads in each FASTQ file. Here's how to count the total of all reads generated in this project totalreads=0; for i in *.gz; do linecount=`gunzip -c "$i" | wc -l`; readcount=$((linecount/4)); totalreads=$((readcount+totalreads)); done; echo $totalreads Total Reads: 588,396,334 Code explanation: totalreads=0; * Creates variable called "totalreads" and initializes value to 0. for i in *.gz; * Initiates a for loop to process any filenames that end with ".gz". The FASTQ files have been compressed with gzip and end with the .gz extension. do linecount= * Creates variable called "linecount" that stores the results of the following command: `gunzip -c "$i" | wc -l`; * Unzips the files ($i) to stdout (-c) instead of actually uncompressing them. This is piped to the word count command, with the line flag (wc -l) to count the number of lines in the files. readcount=$((linecount/4)); * Divides the value stored in linecount by 4. This is because an entry for a single Illumina read comprises four lines. This value is stored in the "readcount" variable. totalreads=$((readcount+totalreads)); * Adds the readcount for the current file and adds the value to totalreads. done; * End the for loop. echo $totalreads * Prints the value of totalreads to the screen. Next, I wanted to generate list of the FASTQ files and corresponding read counts, and append this information to the readme file. for i in *.gz; do linecount=`gunzip -c "$i" | wc -l`; readcount=$(($linecount/4)); printf "%s\t%s\n%s\t\t\n" "$i" "$readcount" >> readme.md; done Code explanation: for i in *.gz; do linecount=`gunzip -c "$i" | wc -l`; readcount=$(($linecount/4)); * Same for loop as above that calculates the number of reads in each FASTQ file. printf "%s\t%s\n\n" "$i" "$readcount" >> readme.md; * This formats the the printed output. The "%s\t%s\n\n" portion prints the value in $i as a string (%s), followed by a tab (\t), followed by the value in $readcount as a string (%s), followed by two consecutive newlines (\n\n) to provide an empty line between the entries. See the readme file linked above to see how the output looks. >> readme.md; done * This appends the result from each loop to the readme.md file and ends the for loop (done).