QC Analysis Report
Contract Information | Contract Content |
---|---|
{{item}} | {% endfor %}
A. Library Preparation and Sequencing
From the DNA samples to the final data, each step, including sample testing, library preparation, and sequencing, influences the quality of the data, and data quality directly affects the analysis results. To guarantee the reliability of the data, quality control (QC) is performed at each step of the procedure. The workflow is as follows:
1 Sample Quality Control
Please refer to Novogene’s QC report for methods of sample quality control.
2 Library Construction, Quality Control and Sequencing
The genomic DNA was randomly sheared into short fragments. The fragments were end-repaired, A-tailed, and ligated to Illumina adapters. The adapter-ligated fragments were then PCR amplified, size selected, and purified. The workflow of library construction is as follows:
The library was quantified with Qubit and real-time PCR, and its size distribution was checked with a bioanalyzer. Quantified libraries were pooled and sequenced on Illumina platforms according to the effective library concentration and the amount of data required.
B. Results and Instructions
1 Data Quality Control
1.1 Distribution of Sequencing Quality
Here e is the sequencing error rate and Qphred is the base quality value, related by Qphred = -10 × log10(e). The relationship between sequencing error rate (e) and sequencing base quality value (Qphred) is shown in the table below:
Phred score | Error rate | Base call accuracy | Q-score |
---|---|---|---|
10 | 1/10 | 90% | Q10 |
20 | 1/100 | 99% | Q20 |
30 | 1/1000 | 99.9% | Q30 |
40 | 1/10000 | 99.99% | Q40 |
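The conversion behind this table can be sketched in a few lines of Python. The function names below are illustrative only and are not part of any pipeline described in this report.

```python
import math


def phred_to_error(q: float) -> float:
    """Error probability e for a Phred score Q, from Q = -10 * log10(e)."""
    return 10 ** (-q / 10)


def error_to_phred(e: float) -> float:
    """Phred score Q for an error probability e, with 0 < e <= 1."""
    if not 0 < e <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(e)


# Reproduce the rows of the table for Q10 to Q40.
for q in (10, 20, 30, 40):
    e = phred_to_error(q)
    print(f"Q{q}: error rate {e:g}, accuracy {(1 - e) * 100:g}%")
```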
The distribution of quality score is shown in Fig.1:
Fig.1 Distribution of Sequencing Quality
The base position is on the horizontal axis and the sequencing quality is on the vertical axis.
1.2 Distribution of Sequencing Error Rate
For Illumina SBS technology, the distribution of sequencing error rate has two features:
(1) The error rate increases along the read as sequencing proceeds, because sequencing reagents are consumed during the run. This is common on Illumina high-throughput sequencing platforms (Erlich Y. et al. 2008; Jiang et al. 2011).
(2) The first six bases show a higher error rate because the random hexamer primers bind incompletely to the RNA template during cDNA synthesis (Jiang et al. 2011).
The error rate of this project is shown in Fig.2:
Fig.2 Error Rate Distribution
The base position is on the horizontal axis and the single-base error rate is on the vertical axis.
1.3 Distribution of A/T/G/C Base
This analysis checks for separation between the A and T curves and between the G and C curves by examining the base composition at each position. According to the principle of base complementarity, the A and T contents, and likewise the G and C contents, should be equal at each sequencing cycle and remain stable throughout the sequencing run. In practice, primer amplification bias and other factors cause the first 6 to 7 nucleotides to fluctuate, which is normal.
The distribution of GC content is shown in Fig.3:
Fig.3 A/T/G/C Distribution
The base position is on the horizontal axis and the single-base percentage is on the vertical axis.
1.4 Results of Raw Data Filtering
The sequenced reads (raw reads) often contain low-quality reads and adapters, which affect the quality of the analysis. The raw reads are therefore filtered to obtain clean reads. The filtering process is as follows:
(1) Remove reads containing adapters.
(2) Remove reads in which more than 10% of the bases are N (N represents a base that cannot be determined).
(3) Remove reads in which more than 50% of the bases are low quality (Qscore <= 5).
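The three filtering rules above can be sketched as follows. This is a simplified illustration, not the production filter: it assumes Phred+33 quality encoding and exact substring matching of adapters, whereas real trimming tools usually tolerate mismatches and partial adapters. All function names are hypothetical.

```python
from typing import Sequence


def n_fraction(seq: str) -> float:
    """Fraction of undetermined bases (N) in a read."""
    return seq.upper().count("N") / len(seq) if seq else 0.0


def low_quality_fraction(quality: str, threshold: int = 5, offset: int = 33) -> float:
    """Fraction of bases with Qscore <= threshold (Phred+33 encoding assumed)."""
    if not quality:
        return 0.0
    low = sum(1 for ch in quality if ord(ch) - offset <= threshold)
    return low / len(quality)


def classify_read(seq: str, quality: str, adapters: Sequence[str]) -> str:
    """Return the reason a read is removed, or 'clean' if it passes all rules."""
    # Rule (1): exact adapter match only (an assumption for this sketch).
    if any(adapter in seq for adapter in adapters):
        return "adapter"
    # Rule (2): more than 10% N.
    if n_fraction(seq) > 0.10:
        return "containing_N"
    # Rule (3): more than 50% of bases with Qscore <= 5.
    if low_quality_fraction(quality) > 0.50:
        return "low_quality"
    return "clean"
```

For paired-end data, a pair would be kept only if both reads are classified as "clean".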
Sequences of adapter
{% if single_cell_V %}  5' Adapter:
  5'-CTGTCTCTTATACACATCTGACGCTGCCGACGAGTGTAGATCTCGGTGGTCGCCGTATCATT-3'
  3' Adapter:
  5'-CTGTCTCTTATACACATCTCCGAGCCCACGAGAC-3'
{% else %}  5' Adapter:
  5'-AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT-3'
  3' Adapter:
  5'-GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGATGACTATCTCGTATGCCGTCTTCTGCTTG-3'
{% endif %} {% if single_cell_type %}  SMART-Seq PCR primer:
  5'-AAGCAGTGGTATCAACGCAGAGTAC-3'
{% endif %}The data filtering results for this project are shown in Fig.4:
Fig.4 Composition of Raw Data
Different colors represent different components:
(1) Adapter related: (reads containing adapter) / (total raw reads)
(2) Containing N: (reads with more than 10% N) / (total raw reads)
(3) Low quality: (reads of low quality) / (total raw reads)
(4) Clean reads: (clean reads) / (total raw reads)
{% if single_cell_type %}(5)PCR Primer Contamination: (reads containing PCR Primer) / (total raw reads)
{% endif %}2 Summary of Sequencing Data Information
The total output of data on the sequencer: Raw data {{total_raw}} G.
The detailed statistics on sequencing data quality are shown in Table 1.
Table 1 Data Quality Summary
{{item}} | {% endfor %}||||||||
---|---|---|---|---|---|---|---|---|
{{x.0}} | {{x.1}} | {{x.2}} | {{x.3}} | {{x.4}} | {{x.5}} | {{x.6}} | {{x.7}} | {{x.8}} |
{{x.0}} | {{x.1}} | {{x.2}} | {{x.3}} | {{x.4}} | {{x.5}} | {{x.6}} | {{x.7}} |
Sample: sample name
Library_Flowcell_Lane: Library ID_Flowcell ID_lane ID, for raw data file naming.
Raw reads: total number of reads in the raw data, where every four lines of the FASTQ file make up one read. For paired-end sequencing, it is the sum of the read1 and read2 counts; for single-end sequencing, it is the read1 count.
Raw data: (Raw reads) * (sequence length), expressed in G (gigabases). For paired-end sequencing such as PE150, the sequence length is 150; for single-end sequencing such as SE50, it is 50.
Effective: (Clean reads/Raw reads)*100%
Error: base error rate
Q20, Q30: (Number of bases with Phred value > 20 or > 30) / (Total base count)
GC: (G & C base count) / (Total base count)
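As a rough illustration of how the columns defined above relate to the reads, the sketch below computes them from (sequence, quality) pairs. It assumes Phred+33 encoding, and it estimates the error rate as the mean per-base error probability implied by the quality scores, which is an assumption of this sketch and not necessarily how the Error column in Table 1 is produced. The function names are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def summarize_reads(reads: Iterable[Tuple[str, str]], offset: int = 33) -> Dict[str, float]:
    """Compute Table 1 style statistics from (sequence, quality) pairs."""
    n_reads = total_bases = q20 = q30 = gc = 0
    error_sum = 0.0
    for seq, qual in reads:
        n_reads += 1
        total_bases += len(seq)
        gc += sum(1 for base in seq.upper() if base in "GC")
        for ch in qual:
            q = ord(ch) - offset
            q20 += q > 20  # booleans add as 0 or 1
            q30 += q > 30
            error_sum += 10 ** (-q / 10)  # assumed estimate of the error rate
    if total_bases == 0:
        raise ValueError("no bases to summarize")
    return {
        "raw_reads": n_reads,
        "raw_data_G": total_bases / 1e9,
        "error_pct": 100 * error_sum / total_bases,
        "Q20_pct": 100 * q20 / total_bases,
        "Q30_pct": 100 * q30 / total_bases,
        "GC_pct": 100 * gc / total_bases,
    }


def effective_rate(clean_reads: int, raw_reads: int) -> float:
    """Effective: (Clean reads / Raw reads) * 100%."""
    return 100 * clean_reads / raw_reads
```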
C. Appendix
1 Introduction of Sequencing Data Format
The original output of the Illumina platform is converted by base calling into sequenced reads, known as raw data or raw reads. Raw data are stored in FASTQ files, which contain the sequencing reads and their corresponding base qualities. Every read in FASTQ format is stored in four lines, as shown below (Cock P.J.A. et al. 2010):
@HWI-ST1276:71:C1162ACXX:1:1101:1208:2458 1:N:0:CGATGT
NAAGAACACGTTCGGTCACCTCAGCACACTTGTGAATGTCATGGGATCCAT
+
#55???BBBBB?BA@DEEFFCFFHHFFCFFHHHHHHHFAE0ECFFD/AEHH
Line 1 begins with a '@' character and is followed by the Illumina Sequence Identifiers and an optional description. The details of Illumina sequence identifier are listed in the table below.
Identifier | Meaning |
---|---|
HWI-ST1276 | Instrument – unique identifier of the sequencer |
71 | Run number – Run number on instrument |
C1162ACXX | Flow Cell ID – ID of flow cell |
1 | Lane Number – positive integer |
1101 | Tile Number – positive integer |
1208 | X – x coordinate of the spot. Integer which can be negative |
2458 | Y – y coordinate of the spot. Integer which can be negative |
1 | Read number. 1 for a single read or Read 1 of paired-end; 2 for Read 2 of paired-end. |
N | Y if the read is filtered (did not pass), N otherwise. |
0 | Control number - 0 when none of the control bits are on, otherwise it is an even number |
CGATGT | Illumina index sequences |
Line 2 is the raw sequence of the read.
Line 3 begins with a '+' character and is optionally followed by the same sequence identifiers and descriptions as in Line 1.
Line 4 encodes the quality values for the bases in Line 2 and contains the same number of characters as there are bases in the read (Cock P.J.A. et al. 2010).
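The four-line layout and the identifier fields described above can be illustrated with a short parsing sketch. It assumes the Phred+33 quality encoding (quality = ASCII code minus 33); the field names in the dictionaries are chosen here for illustration only.

```python
from typing import Dict, List


def parse_header(line: str) -> Dict[str, str]:
    """Split a FASTQ header line into the fields listed in the identifier table."""
    if not line.startswith("@"):
        raise ValueError("FASTQ header must start with '@'")
    identifier, _, description = line[1:].partition(" ")
    id_fields = identifier.split(":")
    if len(id_fields) != 7:
        raise ValueError("expected 7 colon-separated identifier fields")
    id_keys = ["instrument", "run", "flowcell", "lane", "tile", "x", "y"]
    desc_keys = ["read", "filtered", "control", "index"]
    record = dict(zip(id_keys, id_fields))
    if description:
        record.update(zip(desc_keys, description.split(":")))
    return record


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert a quality string to Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


# The example record shown above.
record = [
    "@HWI-ST1276:71:C1162ACXX:1:1101:1208:2458 1:N:0:CGATGT",
    "NAAGAACACGTTCGGTCACCTCAGCACACTTGTGAATGTCATGGGATCCAT",
    "+",
    "#55???BBBBB?BA@DEEFFCFFHHFFCFFHHHHHHHFAE0ECFFD/AEHH",
]
header = parse_header(record[0])
scores = decode_quality(record[3])
print(header["flowcell"], header["lane"], header["index"], scores[:3])
```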
2 Explanation of Sequencing Data Related
(1) The data are delivered as compressed files in '.fq.gz' format. Before delivery, we calculate the md5 value of each compressed file; please check it when you receive the data. There are two ways to check the md5 value. In a Linux environment, you can run the 'md5sum -c <*md5.txt>' command in the data directory. In a Windows environment, you can use a checksum tool, e.g. hashmyfiles. If the md5 value of a compressed file doesn't match the one provided in the md5 file in the data directory, the file may have been damaged during transfer.
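Where neither tool is at hand, the same check can be sketched with Python's standard library. The sketch assumes the md5 file uses the usual md5sum layout (one "digest  file name" pair per line, with file names relative to the md5 file's directory); the function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, Union


def md5_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5_list(md5_file: Union[str, Path]) -> Dict[str, bool]:
    """Check every entry of an md5sum-style list; True means the file matches."""
    md5_file = Path(md5_file)
    results = {}
    for line in md5_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # md5sum marks binary mode with '*'
        target = md5_file.parent / name
        results[name] = target.exists() and md5_of_file(target) == expected.lower()
    return results
```

A file reported as False is either missing or differs from the delivered copy and should be transferred again.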
(2) For paired-end (PE) sequencing, every sample should have 2 data files (a read1 file and a read2 file). These 2 files have the same number of lines; in a Linux environment, you can check this by running the 'wc -l' command on the decompressed files. The number of lines divided by 4 is the number of reads.
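The same check can be sketched in Python directly on the compressed files, without decompressing them to disk first. The function names are illustrative.

```python
import gzip


def count_reads(path: str) -> int:
    """Count reads in a gzip-compressed FASTQ file (4 lines per read)."""
    with gzip.open(path, "rt") as handle:
        lines = sum(1 for _ in handle)
    if lines % 4 != 0:
        raise ValueError(f"{path}: line count {lines} is not a multiple of 4")
    return lines // 4


def check_pair(read1_path: str, read2_path: str) -> int:
    """Return the shared read count, or raise if read1 and read2 differ."""
    n1, n2 = count_reads(read1_path), count_reads(read2_path)
    if n1 != n2:
        raise ValueError(f"read counts differ: {n1} vs {n2}")
    return n1
```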
(3) The data size is the space the data occupy on the hard disk. It depends on the disk format and the compression ratio, and does not reflect the number of sequenced bases. Thus, the size of the read1 file may differ from that of the read2 file.
(4) When a customer's samples require a large amount of data, e.g. whole genome sequencing data, we sequence across separate lanes to ensure data quality, so one sample may have several sets of sequencing data. For example, if sample 1 has two read1 files, sample1_L1_1.fq.gz and sample1_L2_1.fq.gz, the sample was sequenced on different lanes.
(5) If delivery of clean data is agreed before the project starts, we filter the data strictly according to the standard to obtain high-quality clean data that can be used for further research and publication. We discard a read pair in any of the following situations: either read contains adapter contamination; either read contains more than 10 percent uncertain nucleotides; either read contains more than 50 percent low-quality nucleotides (base quality of 5 or less). Analysis results based on clean data filtered by this standard can be accepted by high-impact journals (Yan L.Y. et al. 2013). For more information, please refer to the official website of Novogene (www.novogene.com).
(6) Except for special libraries, the index is normally located in the middle of the adapter during library preparation and sequencing. The Read1 and Read2 sequences are assigned according to the index read. They consist entirely of sample sequence, so it is not necessary to trim the beginning or end of reads in downstream analysis (e.g. mapping).
(7) We delete outdated data 30 days after data delivery, so please back up your data properly. If you have any questions or concerns, please contact us as soon as possible.
3 References
Cock P.J.A. et al. (2010). The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research 38, 1767-1771.
Hansen K.D. et al. (2010). Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Research 38, e131.
Erlich Y. et al. (2008). Alta-Cyclic: a self-optimizing base caller for next-generation sequencing. Nature Methods 5, 679-682.
Jiang L.C. et al. (2011). Synthetic spike-in standards for RNA-seq experiments. Genome Research 21, 1543-1551.
Yan L.Y. et al. (2013). Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells. Nat Struct Mol Biol.