QC Analysis Report

{% for col in title_project %} {% for item in col %} {% endfor %} {% endfor %}

Contract Information	Contract Content
{{item}}

Novogene Co., Ltd

A. Library Preparation and Sequencing

From the DNA samples to the final data, each step, including sample testing, library preparation, and sequencing, influences data quality, and data quality directly affects the analysis results. To guarantee the reliability of the data, quality control (QC) is performed at each step of the procedure. The workflow is as follows:

1 Sample Quality Control

Please refer to Novogene’s QC report for methods of sample quality control.

2 Library Construction, Quality Control and Sequencing

The genomic DNA was randomly sheared into short fragments. The fragments were end-repaired, A-tailed, and ligated to Illumina adapters. The adapter-ligated fragments were then PCR-amplified, size-selected, and purified. The workflow of library construction is as follows:

The library was quantified with Qubit and real-time PCR, and its size distribution was checked on a bioanalyzer. Quantified libraries were pooled and sequenced on Illumina platforms according to the effective library concentration and the amount of data required.

B. Results and Instructions

1 Data Quality Control

1.1 Distribution of Sequencing Quality

Here e denotes the sequencing error rate and Q_phred denotes the base quality value, where Q_phred = -10 log₁₀(e). The relationship between the sequencing error rate (e) and the base quality value (Q_phred) is shown in the table below:

Phred score	Error rate	Base call accuracy	Q-score
10	1/10	90%	Q10
20	1/100	99%	Q20
30	1/1000	99.9%	Q30
40	1/10000	99.99%	Q40
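
The conversion in the table can be reproduced with a short Python sketch. The function names are illustrative only and are not part of any Novogene pipeline.

```python
import math


def phred_to_error(q: float) -> float:
    """Return the error rate e for a Phred quality Q, from Q = -10 * log10(e)."""
    return 10 ** (-q / 10)


def error_to_phred(e: float) -> float:
    """Return the Phred quality Q for an error rate e, with 0 < e <= 1."""
    if not 0 < e <= 1:
        raise ValueError("error rate must be in (0, 1]")
    return -10 * math.log10(e)


if __name__ == "__main__":
    # Print the error rate and base call accuracy for the table rows.
    for q in (10, 20, 30, 40):
        e = phred_to_error(q)
        print(f"Q{q}: error rate {e:g}, accuracy {100 * (1 - e):g}%")
```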

The distribution of quality score is shown in Fig.1:

{% for item in figure_qscore %}

{% endfor %}

{% for item in figure_qscore %} {{item.0}} {% endfor %}

Fig.1 Distribution of Sequencing Quality

The base position is on the horizontal axis and the sequencing quality is on the vertical axis.

1.2 Distribution of Sequencing Error Rate

For Illumina SBS (sequencing by synthesis) technology, the distribution of the sequencing error rate has two features:

(1) The error rate increases along the read as sequencing proceeds, because the sequencing reagents are consumed. This phenomenon is common on Illumina high-throughput sequencing platforms (Erlich Y. et al. 2008; Jiang et al. 2011).

(2) The first six bases show a higher error rate because the random hexamer primers bind incompletely to the RNA template during cDNA synthesis (Jiang et al. 2011).

The error rate of this project is shown in Fig.2:

{% for item in figure_error %}

{% endfor %}

{% for item in figure_error %} {{item.0}} {% endfor %}

Fig.2 Error Rate Distribution

The base position is on the horizontal axis and the single base error rate is on the vertical axis.

1.3 Distribution of A/T/G/C Base

This analysis checks the distribution of GC content to detect separation between AT and GC. According to the principle of complementary base pairing, the A and T contents, and likewise the G and C contents, should be equal at each sequencing cycle and remain stable over the whole sequencing run. In practice, because of primer amplification bias and other factors, the first 6 to 7 nucleotides fluctuate, which is normal.

The distribution of GC content is shown in Fig.3:

{% for item in figure_gc %}

{% endfor %}

{% for item in figure_gc %} {{item.0}} {% endfor %}

Fig.3 A/T/G/C Distribution

The base position is on the horizontal axis and the single base percentage is on the vertical axis.
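
The per-position base percentages plotted in such a figure can be computed roughly as in the Python sketch below. It assumes the reads are already loaded as plain strings, and the function name is hypothetical; it is not the code used to produce the figure.

```python
from collections import Counter
from typing import Dict, Iterable, List


def base_composition(reads: Iterable[str]) -> List[Dict[str, float]]:
    """Return, for each read position, the percentage of A, T, G, C and N."""
    counts: List[Counter] = []
    for read in reads:
        for pos, base in enumerate(read.upper()):
            # Grow the per-position list the first time a position is seen.
            if pos == len(counts):
                counts.append(Counter())
            counts[pos][base] += 1
    result = []
    for counter in counts:
        total = sum(counter.values())
        result.append({b: 100.0 * counter[b] / total for b in "ATGCN"})
    return result


if __name__ == "__main__":
    for pos, pct in enumerate(base_composition(["ACGT", "ACGA", "TCG"]), start=1):
        print(pos, pct)
```

With balanced data the A and T curves, and the G and C curves, should stay close to each other after the first few positions.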

1.4 Results of Raw Data Filtering

The sequenced reads (raw reads) often contain low-quality reads and adapter sequences, which affect the quality of downstream analysis. The raw reads are therefore filtered to obtain clean reads. The filtering process is as follows:

(1) Remove reads containing adapters.

(2) Remove reads in which more than 10% of the bases are N (N denotes a base that could not be determined).

(3) Remove reads in which low-quality bases (Qscore <= 5) make up more than 50% of the read.
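
The three filtering rules can be sketched in Python as follows. This is a simplified illustration, not the production filter: adapter detection is reduced to an exact substring match, the quality encoding is assumed to be Phred+33, and the function names are hypothetical.

```python
from typing import Sequence


def classify_read(seq: str, qual: str, adapters: Sequence[str],
                  phred_offset: int = 33) -> str:
    """Classify one read as 'adapter', 'N', 'low_quality' or 'clean'."""
    # (1) Adapter check, simplified here to an exact substring match.
    if any(adapter in seq for adapter in adapters):
        return "adapter"
    # (2) More than 10% of the bases are undetermined (N).
    if seq.upper().count("N") > 0.10 * len(seq):
        return "N"
    # (3) Bases with Qscore <= 5 make up more than 50% of the read.
    low = sum(1 for ch in qual if ord(ch) - phred_offset <= 5)
    if low > 0.50 * len(qual):
        return "low_quality"
    return "clean"


def keep_pair(read1: tuple, read2: tuple, adapters: Sequence[str]) -> bool:
    """Keep a read pair only if both (seq, qual) reads are clean."""
    return all(classify_read(seq, qual, adapters) == "clean"
               for seq, qual in (read1, read2))
```

For paired-end data, `keep_pair` discards the whole pair when either read fails a rule.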

Adapter sequences

{% if single_cell_V %}

5' Adapter:

5'-CTGTCTCTTATACACATCTGACGCTGCCGACGAGTGTAGATCTCGGTGGTCGCCGTATCATT-3'

3' Adapter:

5'-CTGTCTCTTATACACATCTCCGAGCCCACGAGAC-3'

{% else %}

5' Adapter:

5'-AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT-3'

3' Adapter:

5'-GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGATGACTATCTCGTATGCCGTCTTCTGCTTG-3'

{% endif %} {% if single_cell_type %}

SMART-Seq PCR primer:

5'-AAGCAGTGGTATCAACGCAGAGTAC-3'

{% endif %}

The data filtering results for this project are shown in Fig.4:

{% for item in figure_C %}

{% endfor %}

{% for item in figure_C %} {{item.0}} {% endfor %}

Fig.4 Composition of Raw Data

The colors represent the following components:

(1) Adapter related: (reads containing adapter) / (total raw reads)

(2) Containing N: (reads with more than 10% N) / (total raw reads)

(3) Low quality: (reads of low quality) / (total raw reads)

(4) Clean reads: (clean reads) / (total raw reads)

{% if single_cell_type %}

(5) PCR primer contamination: (reads containing PCR primer) / (total raw reads)

{% endif %}

2 Summary of Sequencing Data Information

The total output of data on the sequencer: Raw data {{total_raw}} G.

Detailed statistics on the quality of the sequencing data are shown in Table 1.

Table 1 Data Quality Summary

{% for item in table_qc_head %}{% endfor %} {% for each in table_qc %} {% for x in each %} {% if forloop.first %} {% else %} {% endif %} {% endfor %} {% endfor %}

{{item}}
{{x.0}}	{{x.1}}	{{x.2}}	{{x.3}}	{{x.4}}	{{x.5}}	{{x.6}}	{{x.7}}	{{x.8}}
{{x.0}}	{{x.1}}	{{x.2}}	{{x.3}}	{{x.4}}	{{x.5}}	{{x.6}}	{{x.7}}

Sample: sample name.
Library_Flowcell_Lane: Library ID_Flowcell ID_Lane ID, used for naming raw data files.
Raw reads: total number of reads in the raw data, counting every four lines of the FASTQ file as one read. For paired-end sequencing, it is the sum of read1 and read2; for single-end sequencing, it is the number of read1.
Raw data: (Raw reads) * (sequence length), expressed in G. The sequence length is 150 for paired-end sequencing such as PE150 and 50 for single-end sequencing such as SE50.
Effective: (Clean reads / Raw reads) * 100%.
Error: base error rate.
Q20, Q30: (Number of bases with Phred value > 20 or > 30) / (Total base count).
GC: (G and C base count) / (Total base count).
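
The definitions above can be turned into a small Python sketch. Several points are assumptions rather than statements about how Table 1 is actually produced: qualities are taken to be Phred+33 encoded, "G" is read as 10^9 bases, the error rate is estimated as the mean per-base error probability derived from the quality values, and all names are hypothetical.

```python
from typing import Dict, List, Tuple


def summarize(records: List[Tuple[str, str]], clean_reads: int,
              phred_offset: int = 33) -> Dict[str, float]:
    """Compute Table 1 style metrics from (sequence, quality string) records."""
    if not records:
        raise ValueError("at least one read is required")
    quals = [ord(ch) - phred_offset for _, qual in records for ch in qual]
    # Equals (raw reads) * (sequence length) when all reads have the same length.
    total_bases = sum(len(seq) for seq, _ in records)
    gc_bases = sum(seq.upper().count("G") + seq.upper().count("C")
                   for seq, _ in records)
    return {
        "raw_reads": len(records),
        "raw_data_g": total_bases / 1e9,  # assumption: G = 10^9 bases
        "effective_pct": 100.0 * clean_reads / len(records),
        # Assumption: error rate as the mean of 10^(-Q/10) over all bases.
        "error_pct": 100.0 * sum(10 ** (-q / 10) for q in quals) / len(quals),
        "q20_pct": 100.0 * sum(q > 20 for q in quals) / len(quals),
        "q30_pct": 100.0 * sum(q > 30 for q in quals) / len(quals),
        "gc_pct": 100.0 * gc_bases / total_bases,
    }
```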

C. Appendix

1 Introduction of Sequencing Data Format

The original raw data from the Illumina platform are converted by base calling into sequenced reads, known as raw data or raw reads. Raw data are stored in FASTQ files, which contain the sequencing reads and their corresponding base qualities. Each read in FASTQ format occupies four lines, as shown below (Cock P.J.A. et al. 2010):

@HWI-ST1276:71:C1162ACXX:1:1101:1208:2458 1:N:0:CGATGT
NAAGAACACGTTCGGTCACCTCAGCACACTTGTGAATGTCATGGGATCCAT
+
#55???BBBBB?BA@DEEFFCFFHHFFCFFHHHHHHHFAE0ECFFD/AEHH

Line 1 begins with a '@' character, followed by the Illumina sequence identifier and an optional description. The fields of the Illumina sequence identifier are listed in the table below.

Identifier	Meaning
HWI-ST1276	Instrument – unique identifier of the sequencer
71	Run number – Run number on instrument
C1162ACXX	Flow Cell ID – ID of flow cell
1	Lane Number – positive integer
1101	Tile Number – positive integer
1208	X – x coordinate of the spot. Integer which can be negative
2458	Y – y coordinate of the spot. Integer which can be negative
1	Read number. 1 for a single read or Read 1 of a paired-end run; 2 for Read 2 of a paired-end run.
N	Y if the read is filtered (did not pass), N otherwise.
0	Control number - 0 when none of the control bits are on, otherwise it is an even number
CGATGT	Illumina index sequences

Line 2 is the raw sequence of the read.

Line 3 begins with a '+' character and is optionally followed by the same sequence identifiers and descriptions as in Line 1.

Line 4 encodes the quality values for the bases in Line 2 and contains the same number of characters as there are bases in the read (Cock P.J.A. et al. 2010).
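
As an illustration, the Python sketch below splits the identifier line of the example record into the fields listed in the table and decodes the quality string. The Phred+33 offset is an assumption (it is the usual encoding for this identifier style), and the function and key names are hypothetical.

```python
from typing import Dict, List


def parse_illumina_header(line: str) -> Dict[str, str]:
    """Split an Illumina FASTQ identifier line into named fields."""
    if not line.startswith("@"):
        raise ValueError("FASTQ identifier line must start with '@'")
    identifier, description = line[1:].strip().split(" ", 1)
    id_keys = ["instrument", "run", "flowcell", "lane", "tile", "x", "y"]
    desc_keys = ["read", "filtered", "control", "index"]
    fields = dict(zip(id_keys, identifier.split(":")))
    fields.update(zip(desc_keys, description.split(":")))
    return fields


def decode_quality(qual: str, phred_offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - phred_offset for ch in qual]


if __name__ == "__main__":
    header = "@HWI-ST1276:71:C1162ACXX:1:1101:1208:2458 1:N:0:CGATGT"
    print(parse_illumina_header(header))
    print(decode_quality("#55???BBBBB"))
```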

2 Notes on the Sequencing Data

(1) The data are delivered as compressed files in '.fq.gz' format. Before delivery, we calculate the md5 value of each compressed file; please verify it when you receive the data. There are two ways to check the md5 value. In a Linux environment, run 'md5sum -c <*md5.txt>' in the data directory. In a Windows environment, use a checksum tool such as HashMyFiles. If the md5 value of a compressed file does not match the value in the md5 file provided in the data directory, the file may have been damaged during transfer.

(2) For paired-end (PE) sequencing, every sample should have two data files (a read1 file and a read2 file). The two files have the same number of lines, which can be checked with the 'wc -l' command on the decompressed files in a Linux environment. The number of lines divided by 4 is the number of reads.

(3) The data size is the disk space occupied by the file. It depends on the disk format and the compression ratio, and does not reflect the number of sequenced bases. The read1 and read2 files may therefore differ in size.

(4) When a sample requires a large amount of data, e.g. for whole genome sequencing, we sequence it across separate lanes to ensure data quality, so one sample may have several sets of sequencing data. For example, if sample 1 has two read1 files, sample1_L1_1.fq.gz and sample1_L2_1.fq.gz, the sample was sequenced on different lanes.

(5) If delivery of clean data is agreed before the project starts, we filter the data strictly according to the standard to obtain high-quality clean data suitable for further research and publication. A read pair is discarded when either read contains adapter contamination, when either read contains more than 10 percent uncertain nucleotides, or when either read contains more than 50 percent low-quality nucleotides (Qscore <= 5). Analysis results based on clean data filtered to this standard have been accepted by high-impact journals (Yan L.Y. et al. 2013). For more information, please refer to the official Novogene website (www.novogene.com).

(6) Except for special libraries, the index is normally located in the middle of the adapter during library preparation and sequencing. The Read1 and Read2 sequences are identified according to the index read. Both reads consist entirely of sample sequence, so there is no need to trim the beginning or end of the reads in downstream analysis (e.g. mapping).

(7) Outdated data are deleted 30 days after data delivery, so please store your data safely. If you have any questions or concerns, please contact us as soon as possible.
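
The checks described in notes (1) and (2) can also be scripted. The Python sketch below is a minimal example under stated assumptions: the md5 list follows the usual md5sum layout ('<md5>  <filename>' per line, with file names relative to the list), and the function names are hypothetical.

```python
import gzip
import hashlib
from pathlib import Path
from typing import Dict


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the md5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5_list(md5_file) -> Dict[str, bool]:
    """Check every '<md5>  <filename>' line of an md5sum-style list."""
    base = Path(md5_file).parent
    results = {}
    for line in Path(md5_file).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.strip().split(maxsplit=1)
        name = name.lstrip("*")  # md5sum marks binary mode with a leading '*'
        results[name] = md5_of_file(base / name) == expected.lower()
    return results


def count_reads(fastq_gz) -> int:
    """Count the reads in a gzipped FASTQ file (number of lines / 4)."""
    with gzip.open(fastq_gz, "rt") as handle:
        lines = sum(1 for _ in handle)
    if lines % 4:
        raise ValueError(f"{fastq_gz}: line count {lines} is not a multiple of 4")
    return lines // 4
```

For a paired-end sample, `count_reads` should return the same value for the read1 and read2 files, and every entry returned by `verify_md5_list` should be `True`.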

3 References

Cock P.J.A. et al (2010). The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic acids research 38, 1767-1771.

Hansen K.D. et al (2010). Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic acids research 38, e131-e131.

Erlich Y. et al (2008). Alta-Cyclic: a self-optimizing base caller for next-generation sequencing. Nature methods 5, 679-682.

Jiang L.C. et al (2011). Synthetic spike-in standards for RNA-seq experiments. Genome research 21, 1543-1551.

Yan L.Y. et al (2013). Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells. Nat Struct Mol Biol.