Welcome to a tale of many trials and few triumphs!
Here is a picture of a copper rockfish to get the energy up:
I mean, just look at it!!
I will be taking RAD-seq data from three species of closely-related rockfish species (Sebastes spp.) through a bioinformatics processing pipeline and producing a PCA plot that displays population structure clusters.
process_radtags
program and remove PCR duplicates with clone_filter
bowtie2
gstacks
and populations
modules in STACKS to perform SNP calling and basic statisticsadegenet
, an R package, to examine inter- and intraspecific population structure patterns & clusteringFastQC sequence counts
FastQC sequence quality
Demultiplexing is normally a pretty standard step in the pipeline, but my data has decided to send me on a journey instead :)
So I will be sharing where I’m at and how I got there!
Here is what a peek at my raw sequence file - R1, or the forward read - looks like using the head
command:
[lizboggs@klone-login03 radseq_data]$ zcat GC3F-LB-10209---8853_S0_L003_R1_001.fastq.gz | head
@A01335:621:H5WGLDSXF:3:1101:1036:1000 1:N:0:GCCAAT+AGATCT
GGACAGCAGATGCAGGCATGCGCCATGGAGATTAGCTACACCGCAGTGGCGTGGAACACATAACAGGCTCATTAGGTAGCATATTCTCCAGGAGGCATGCGCCTATGGAGAATAGCCACACCGCAGTGGCGTGGAACACATAACAGGCTCACTAGGTAT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFF:FFF,FFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFF,FFFFFF:FFFFFFFFFFFFFF:,FFFFFFFFFFFF:FF::,
@A01335:621:H5WGLDSXF:3:1101:1072:1000 1:N:0:CGATGT+AGATCT
CGGAGGAGCAGGTGTTGATGGGGCCGTTCTTCATCTGCTCCAGAGTGTCCAGCACCACGTCGGGAGCCGGCCGGCAGCCCTGCCGCTCCACCTGCCTCAACTCGCCCACAGCCACCGTCACCGTCACCTCGAAGTCCTGCATGTCTATCCCAGATCGGA
+
F:FFFFF:F:FFFF,::F:FF:,F,FFF::F:FF:F,F,::FFFF,,F,FF,FFFFF:,FFFFFF,FF:FF:,:FF,FFF,::F,F,:F:,FF:FF,:FFF,,,,FF,,F,:::::,,FFFF,,:,:F::,F,:,F,,,F:,:F:,F,,FFFFFF,,,,
@A01335:621:H5WGLDSXF:3:1101:1090:1000 1:N:0:ATCACG+AGATCT
CCACCACACAGCCGCCCTCCAGCTCACCACGGCCCTCAGCCTCTTCACGAGCTGCAACCACCACACAGCTGCCCTCCACCTCACCACGGCCCTCAGCCTCTTTGCGAGCTGCCCCCTCCACACAGCTGCTCTCCACCTCACCACGGCCCTCAGCCTCTT
The first line is the sequence header (unique to each read), second line is the read itself, third is a +
separator line, and fourth is a Phred quality score for each base
Here is the script I am using to run STACKS process_radtags
on my data:
#!/bin/bash
#SBATCH --job-name=process_radtags_rfnewbarcode_lib5_712fix
#SBATCH --account=merlab
#SBATCH --partition=compute-hugemem
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --time=2-16:00:00
#SBATCH --mem=500G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=lizboggs@uw.edu
##### ENVIRONMENT SETUP ##########
module purge
source /gscratch/merlab/software/miniconda3/etc/profile.d/conda.sh
conda activate stacks_env
# Set paths
RAWDIR=/mmfs1/gscratch/merlab/lizboggs/radseq_data
OUTDIR=/mmfs1/gscratch/scrubbed/lizboggs/newbarcodedemux_lib5indx712
BARCODE_DIR=/mmfs1/gscratch/merlab/lizboggs/barcodes
process_radtags \
-1 ${RAWDIR}/GC3F-LB-10209---8853_S0_L003_R1_001.fastq.gz \
-2 ${RAWDIR}/GC3F-LB-10209---8853_S0_L003_R2_001.fastq.gz \
-b ${BARCODE_DIR}/allbarcodes_new_withlib5indx712fix.txt \
-o ${OUTDIR} \
--renz_1 sbfI \
--inline_index \
-i gzfastq \
-y fastq \
-E phred33 \
--bestrad \
-q \
--filter_illumina \
-c \
-t 130 \
--adapter_1 AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
--adapter_2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
-r \
--barcode_dist_1 1 \
--barcode_dist_2 1
# Close environment
conda deactivate
6228064294 total reads; -0 failed Illumina reads; -3580073710 ambiguous barcodes; -28122894 ambiguous RAD-Tags; +83758324 recovered; -5554816 low quality reads; 2563948807 retained reads.
50364067 reads with adapter sequence.
Closing files, flushing buffers...done.
6228064294 total sequences
1494217429 reads were transposed [based on the BestRAD flag] (24.0%)
0 failed Illumina filtered reads (0.0%)
50364067 reads contained adapter sequence (0.8%)
3580073710 barcode not found drops (57.5%)
5554816 low quality read drops (0.1%)
28122894 RAD cutsite not found drops (0.5%)
2563948807 retained reads (41.2%)
Only 41.2% of reads retained and 57.5% “barcode not found” drops… something is up.
Some important background: reads are demultiplexed based on their indexes. In order for process_radtags
to know which reads are what, it needs to be told where to find the indexes.
This is done with program flags - in my case, I used the flag --inline_index
, as my 10 base-pair individual barcodes are inline, ie. they are found in the actual read. My library index - ie. which library the individual belonged to - is found in the header.
I’ll show on the next slide why this organization is causing an issue in my processing.
When using the command grep -A 20 "barcodes_not_recorded" process_radtags.radseq_data.log
, which pulls all “barcodes not recorded” (ie. dropped), it reports this:
BEGIN barcodes_not_recorded
# A list and count of barcodes found that were not specified in the barcodes input file.
Barcode Total
GGGAGCTGAA-AGATCT 104988366
GGAAACATCG-AGATCT 88023132
GGCCGTGAGA-AGATCT 55435788
GGGACAGTGC-AGATCT 52357626
GGCGAACTTA-AGATCT 51776254
GGACCACTGT-AGATCT 50442810
GGGATAGACA-AGATCT 50208624
GGCATCAAGT-AGATCT 46791976
GGAACGCTTA-AGATCT 44114520
GGATCCTGTA-AGATCT 44070132
GGTCCGTCTA-AGATCT 42453748
GGAGAGTCAA-AGATCT 40700258
GGCCTCCTGA-AGATCT 40518226
GGCGCATACA-AGATCT 40145892
GGTGGTGGTA-AGATCT 38694062
GGCAAGACTA-AGATCT 38661102
GGTGAAGAGA-AGATCT 37881860
GGAATCCGTC-AGATCT 37656988
--
END barcodes_not_recorded
It seems to be matching a handful of the 10bp individual barcodes with the 6bp i5 Illumina barcode that is on every read, and thus… it can’t assign those reads to a single individual, because 4-5 individuals will have both of those barcodes attached to them.
Reviewing this section of my R1 file:
@A01335:621:H5WGLDSXF:3:1101:1036:1000 1:N:0:GCCAAT+AGATCT
GGACAGCAGATGCAGGCATGCGCCATGGAGATTAGCTACACCGCAGTGGCGTGGAACACATAACAGGCTCATTAGGTAGCATATTCTCCAGGAGGCATGCGCCTATGGAGAATAGCCACACCGCAGTGGCGTGGAACACATAACAGGCTCACTAGGTAT
The library index is the first in the index+index pair; here it is GCCAAT, which is the index for my library 5. The second in that pair is that Illumina i5 index that is on every read. The 10bp individual barcode is the first 10bp of the actual read. So while process_radtags
is correctly reading at least 40% of reads correctly - pulling the first index barcode and pairing it with its respective individual barcode - it has also paired the wrong ones together and doesn’t know what to do with them, so it drops them as “not found.”
1. Cry (check! Just kidding… maybe)
Field as much help as possible, from lab mates to the internet, ie. Bioinformatics Stack Exchange (check!)
Contact the sequencing facility and make it their problem by sending my data back for THEM to demux (very close to doing this…)
Once data is finally demux’d in whatever way… continue with pipeline!!!
I still need to prove my proficiency in making a Quarto presentation, so let’s do some other fun things with my metadata instead
Species | # of Samples |
---|---|
Brown | 97 |
Copper | 181 |
Quillback | 217 |
This is in total, accounting for combining my rockfish with those from a previous study (not done by me). Lots of fish!
Plot of my rockfish samples by species!
I hope this demonstrated my ability to work with Quarto and showed off my very neat very fun coding adventure… hopefully have a good result soon!