Read Distribution

These curves show how coverage is distributed amongst reads. Ideally, the cumulative proportion of reads will transition sharply from low to high.

Portions to the left of the transition correspond to reads that are represented relatively infrequently, and might roughly reflect sequencing or sample processing errors. 10-15% of reads in a typical Genome Analyzer 'control' lane fall in this category.

Portions to the right of the transition represent reads that are over-represented compared to expectation. These might include inadvertently sequenced primer or adapter sequences, sequencing or base calling artifacts (e.g., poly-A reads), or features of the sample DNA (highly repeated regions) not adequately removed during sample preparation. About 5% of Genome Analyzer 'control' lane reads fall in this category.

Broad transitions from low to high cumulative proportion of reads may reflect sequencing bias or (perhaps intentional) features of sample preparation resulting in non-uniform coverage. The transition is about 5 times as wide as expected from uniform sampling across the Genome Analyzer 'control' lane.

  df <- qa[["sequenceDistribution"]]
  ShortRead:::.plotReadOccurrences(df[df$type=="read",], cex=.5)
@READ_OCCURRENCES_FIGURE@
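
As a rough illustration of what the curve represents, the sketch below computes a cumulative proportion of reads from a small occurrence table using base R only. The column names nOccurrences and nReads are assumed here, the values are made up, and this is not necessarily how the plotting helper above does its calculation.

```r
# Toy occurrence table (hypothetical values; column names are an assumption).
# nOccurrences: number of times a distinct sequence was observed
# nReads:       number of distinct sequences observed that many times
toy <- data.frame(
    nOccurrences = c(1, 2, 3, 10, 100),
    nReads       = c(500, 200, 100, 10, 2)
)

# Order by occurrence count so the cumulative sum runs from rare to common
toy <- toy[order(toy$nOccurrences), ]

# Total number of reads contributed by each occurrence class
readsPerClass <- toy$nOccurrences * toy$nReads

# Cumulative proportion of all reads, from rarely to frequently seen sequences
cumulativeProportion <- cumsum(readsPerClass) / sum(readsPerClass)

result <- data.frame(nOccurrences = toy$nOccurrences,
                     cumulative = cumulativeProportion)
print(result)
```

Plotting cumulative against log10(nOccurrences) gives a curve of the kind discussed above; a sharp rise over a narrow range of nOccurrences corresponds to a sharp transition.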

Common duplicate reads might provide clues to the source of over-represented sequences. Some of these reads are filtered by the alignment algorithms; other duplicate reads might point to sample preparation issues.

  ShortRead:::.freqSequences(qa, "read")
@FREQUENT_SEQUENCES_READ@
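
The idea behind a table of common duplicate reads can be sketched in base R as follows. The reads vector is a made-up example and the variable names are hypothetical; the helper used above may compute its table differently.

```r
# Hypothetical reads; real input would come from the sequencing run
reads <- c("ACGT", "AAAA", "ACGT", "TTTT", "AAAA", "AAAA")

# Count how often each distinct sequence occurs, most frequent first
counts <- sort(table(reads), decreasing = TRUE)

# Keep the two most frequent sequences as a plain data frame
top <- head(data.frame(sequence = names(counts),
                       count = as.integer(counts),
                       stringsAsFactors = FALSE), 2)
print(top)
```

Sequences at the top of such a table (for example poly-A reads or adapter sequences) are the first candidates to check when tracing over-representation.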

Common duplicate reads after filtering are

  ShortRead:::.freqSequences(qa, "filtered")
@FREQUENT_SEQUENCES_FILTERED@

Common aligned duplicate reads are

  ShortRead:::.freqSequences(qa, "aligned")
@FREQUENT_SEQUENCES_ALIGNED@