Next generation sequencing
What is next generation DNA sequencing and how does it work?
This progress has been achieved through a variety of incremental developments, such as the ability to array and copy single DNA molecules on beads and solid surfaces, and the development of chemically or enzymatically reversible DNA chain terminators. The latter allow DNA to be sequenced by detecting which bases are added to an elongating DNA chain that is physically anchored to a glass slide or an array of beads. This removes the need for gel electrophoresis.

Schematic illustrating principles of next-generation sequencing
In the first step (top left) DNA molecules are anchored to a solid surface: beads or glass slides, for example. The dilution and anchoring process ensures that only one template is tethered per location. The anchored DNA templates are copied in situ (in some devices with a different physical configuration, truly single molecules are sequenced). The distribution of millions of DNA templates over a solid surface allows each to be sequenced in parallel with the others. Sequencing occurs by flowing reagents stepwise onto the device. The first step in each cycle is the templated addition of a reversible terminator, which results in a single base extension of each template. After the unincorporated nucleotides are removed, the base that was added is detected by a laser and a CCD camera, or a similar device, that scans the reaction chamber. In the last step, the termination moiety of the incorporated base is reversed, chemically or enzymatically, and the cycle can start again. The number of cycles determines the length of the sequence read.
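The cycle described above can be sketched as a toy simulation. This is only an illustration under simplifying assumptions: every template is written in the order the polymerase traverses it, incorporation and detection are error-free, and the function name `sequence_by_synthesis` is hypothetical rather than part of any instrument's software.

```python
# Toy model of cyclic reversible-terminator sequencing.
# Assumption: error-free chemistry and imaging; names are illustrative only.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sequence_by_synthesis(templates, n_cycles):
    """Return one read per template; each cycle adds and records one base."""
    reads = ["" for _ in templates]
    for cycle in range(n_cycles):
        # All templates are extended in parallel within a cycle.
        for i, template in enumerate(templates):
            if cycle >= len(template):
                continue  # this template has been fully copied
            # Templated addition of a single reversible terminator.
            incorporated = COMPLEMENT[template[cycle]]
            # Detection step: the incorporated base is imaged and recorded.
            reads[i] += incorporated
            # Reversal of the terminator is implicit: the next cycle proceeds.
    return reads


if __name__ == "__main__":
    print(sequence_by_synthesis(["ACGT", "TTGCA"], n_cycles=3))
```

Running more cycles lengthens every read by the same amount, which is why read length is set by the number of cycles rather than by the template.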
The first generation approaches, for example Roche 454 sequencing, currently achieve 200-400 base pair reads over hundreds of thousands of templates. The current generation of machines, for example the Illumina GA devices, ABI SOLiD machines (which use ligation-mediated sequencing), and the truly single-molecule HeliScope, are capable of sequencing tens of millions of individual templates in parallel, with sequence read lengths now approaching 100 nucleotides. At the time of writing, 30Gb of DNA sequence (about 10 haploid human genome equivalents) can be obtained in a week for approximately $15,000. By comparison, the draft human genome reference sequence (3Gb), completed in 2001 to about 6-fold redundancy, took over 5 years of sequencing effort by multiple laboratories and cost several billion dollars.
In addition to scale, massively parallel short DNA sequence reads have two important novel characteristics beyond those offered by conventional sequencing. First, the nature of the process means that each template sequence originates from a single DNA molecule, rather than from a population of molecules as occurs in PCR amplicon based sequencing. This means that the abundance of a given DNA sequence in a population can be determined by counting the number of templates sequenced.

Mutation frequency analysis from next-generation sequencing
Next generation devices produce sequence from single templates. Thus the abundance of a nucleotide at any given position gives a direct measure of its frequency in the population. In genomic DNA sequencing, this can be used to infer the presence of subpopulations of cells bearing mutations. In RNA-seq it can be used to infer allelic expression. In contrast, with PCR amplicon Sanger sequencing the whole population of molecules is amplified and the sequence is read as a ratio of two peaks from electrophoresis. This limits the detection of minor alleles to frequencies above ~15%.
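The counting idea can be illustrated with a minimal sketch. It assumes reads have already been aligned to the reference without gaps and are given as hypothetical `(start, sequence)` pairs with 0-based start coordinates; the function name `allele_frequencies` is likewise only illustrative, and base quality and alignment errors are ignored.

```python
from collections import Counter


def allele_frequencies(reads, position):
    """Fraction of reads showing each base at a reference position.

    reads: iterable of (start, sequence) pairs, gap-free alignments,
           0-based start coordinate (an assumed, simplified input format).
    """
    counts = Counter()
    for start, seq in reads:
        offset = position - start
        if 0 <= offset < len(seq):  # the read covers the position
            counts[seq[offset]] += 1
    depth = sum(counts.values())
    if depth == 0:
        return {}
    return {base: n / depth for base, n in counts.items()}


if __name__ == "__main__":
    aligned = [(0, "ACGT"), (1, "CGTA"), (2, "GTAC"), (2, "TTAC")]
    print(allele_frequencies(aligned, position=2))
```

Because each read derives from one template molecule, the sensitivity for a minor allele in such a scheme is governed mainly by read depth and error rate rather than by peak ratios.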
Moreover, mixtures of DNA templates can be sampled. This will, for example, have major implications for the understanding of tumour mutational heterogeneity. The ability to identify mutations and other genomic features in a background of normal (wild-type) alleles potentially alleviates some of the requirements for laser capture and other microdissection techniques. This will greatly advance the study of cancers, such as lobular breast cancer or in situ malignancy, that are characterized by individual cells of interest buried in stroma rather than by contiguous sheets of tumour cells.
Second, most devices now sequence from both ends of each single DNA template. This is known as mate-pair or paired-end sequencing, and it allows the investigator to determine whether the DNA template is chimeric by comparing the locations on a reference genome of the sequences from each end. This approach has recently been used to search for gene fusion events in cancers.

Paired-end read sequencing
Multiple aberration types can be inferred from DNA sequencing or cDNA sequencing (RNA-seq). After random fragmentation, each end of the DNA molecule is sequenced. Alignment to a reference sequence allows inference of the aberrations present. Single nucleotide variants in DNA or RNA can be inferred from single-end alignment. Using the mate-pair information, however, allows pairs whose ends do not map to the expected co-location to be identified. Depending on the context, this can be used to infer translocations/fusions, alternative splicing or inversions.
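A minimal sketch of how mate-pair information might be used to flag candidate aberrations follows. It assumes a library in which correctly mapped mates lie on the same chromosome, on opposite strands, and within a maximum separation; the `MateAlignment` record, the `classify_pair` function, the category labels and the 500-base default are all hypothetical choices for illustration, not the behaviour of any particular aligner.

```python
from dataclasses import dataclass


@dataclass
class MateAlignment:
    chrom: str
    pos: int      # leftmost aligned coordinate on the reference
    strand: str   # "+" or "-"


def classify_pair(m1, m2, max_separation=500):
    """Label a read pair by how its two ends map relative to each other.

    Assumption: concordant mates map to the same chromosome, on opposite
    strands, no further apart than max_separation bases.
    """
    if m1.chrom != m2.chrom:
        return "different_chromosomes"   # candidate translocation/fusion
    if m1.strand == m2.strand:
        return "same_strand"             # candidate inversion
    if abs(m2.pos - m1.pos) > max_separation:
        return "too_far_apart"           # e.g. splicing in cDNA data
    return "concordant"


if __name__ == "__main__":
    a = MateAlignment("chr9", 133_600_000, "+")
    b = MateAlignment("chr22", 23_290_000, "-")
    print(classify_pair(a, b))
```

In practice a single discordant pair is weak evidence; candidate events are usually supported by several independent pairs pointing to the same locations.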
In practice, entire human genomes can now be re-sequenced in a matter of weeks by these devices, although sequencing to effective saturation with short reads requires high redundancy: for a diploid human genome, a theoretical estimate suggests that 27-fold haploid coverage (i.e. 27 x 2.8Gb of alignable human genome), or about 76Gb, is needed to have 95% certainty that all alleles are captured. These are quantities of sequence that were unimaginable at the time the first draft human genome sequence was reported in 2001, but they can now be generated in about 2 weeks by a single one of the most advanced next generation devices. Using these methods, the complete re-sequencing of several individual germline genomes has now been reported. An ambitious project to sample human genetic diversity by light-coverage re-sequencing of 1000 individual human germline genomes is underway and will be complete within 2 years. Of relevance to cancer pathology, the sequencing of tumour genomes is now well within reach and many groups are reporting the first results.
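The coverage arithmetic can be checked with a short calculation. The sketch below only multiplies fold coverage by genome size and divides by read length; the 100-nucleotide read length is an assumed round figure in line with the read lengths mentioned earlier, and the function names are hypothetical.

```python
import math


def required_bases(fold_coverage, genome_bases):
    """Total bases of sequence needed for a given fold coverage."""
    return fold_coverage * genome_bases


def reads_needed(total_bases, read_length):
    """Number of reads of a fixed length needed to supply total_bases."""
    return math.ceil(total_bases / read_length)


if __name__ == "__main__":
    alignable_genome = 2_800_000_000   # 2.8Gb alignable human genome
    total = required_bases(27, alignable_genome)
    print(f"{total / 1e9:.1f} Gb of sequence")
    # Assumed read length of 100 nucleotides, for illustration only.
    print(f"{reads_needed(total, 100):,} reads of 100 nt")
```

This reproduces the figure of about 76Gb and shows that, at an assumed 100 nucleotides per read, several hundred million reads are required.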
Sequencing of genomic DNA requires random fragmentation of the sample to DNA lengths compatible with sequencing library construction. Sequencing libraries can now be built in multiwell plates, reducing the requirements for input DNA and increasing efficiency. Given sufficient depth of genomic DNA sequence, all aberrations in the genome, from segmental copy number changes down to single nucleotide variants, can be reconstructed from the sequence information.
Genomic DNA can be sequenced directly. For transcriptome sequencing, however, the RNA must first be converted to cDNA, which is then fragmented and entered into library construction. Alignment of short reads from cDNA to the genome is non-trivial, because introns are absent from cDNA and intron-spanning reads would therefore be rejected by most short read alignment methods. Our approach to this has been to create comprehensive junction library databases that are used for alignment.
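The general idea of a junction library can be sketched as follows. This is not the pipeline referred to above, only a minimal illustration under stated assumptions: exons are given as hypothetical 0-based, half-open `(start, end)` coordinates on a single reference string, every ordered pair of exons is joined, and reads are matched by exact substring search. The names `build_junction_library` and `find_junctions` are illustrative.

```python
from itertools import combinations


def build_junction_library(genome, exons, read_length):
    """Join the flanks of every ordered exon pair into a junction sequence.

    Each flank is at most read_length - 1 bases, so a read of length
    read_length that matches a junction sequence must cross the junction.
    """
    flank = read_length - 1
    library = {}
    for (i, (s1, e1)), (j, (s2, e2)) in combinations(enumerate(exons), 2):
        left = genome[max(s1, e1 - flank):e1]    # end of upstream exon
        right = genome[s2:min(e2, s2 + flank)]   # start of downstream exon
        library[(i, j)] = left + right
    return library


def find_junctions(read, library):
    """Return the exon pairs whose junction sequence contains the read."""
    return [pair for pair, seq in library.items() if read in seq]


if __name__ == "__main__":
    # Toy reference: exon, intron, exon.
    genome = "GATTACA" + "GTCCCCAG" + "TGCATGC"
    exons = [(0, 7), (15, 22)]
    lib = build_junction_library(genome, exons, read_length=6)
    print(lib)
    print(find_junctions("ACATGC", lib))
```

A read such as `ACATGC` in this toy example does not occur anywhere in the reference itself, but it is recovered once the exon-exon junction sequence is available as an alignment target.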
Massively parallel sequencing of RNA species is now an efficient method of very deeply sampling the expressed genome, including microRNAs and other small RNA species. Not only can the abundance of RNAs be determined accurately, but expressed mutations (point mutations, indels, inversions), alternative splicing, gene fusion events, RNA editing and novel splice events can all be detected from transcriptome libraries at once. Transcriptome sequencing coupled with ChIP-seq now allows the accurate mapping of chromatin modifications and transcription factor binding sites in conjunction with the transcriptome.