Next Generation Sequencing

The Department of Molecular Oncology uses the Illumina Genome Analyzer devices at the Michael Smith Genome Sciences Centre for DNA and RNA sequencing. Several of our ongoing projects (described in detail below) use this technology to analyze cancer genomes and transcriptomes.

What is next generation DNA sequencing and how does it work?

The rate-limiting step in conventional DNA sequencing arises from the need to separate randomly terminated DNA polymers by gel electrophoresis. Next generation sequencing devices bypass this limitation by physically arraying DNA molecules on solid surfaces and determining the DNA sequence in situ, without the need for gel separation. This progress has been achieved through a series of incremental developments, such as the ability to array and copy single DNA molecules on beads and solid surfaces, and the development of chemically or enzymatically reversible DNA chain terminators. The latter allow DNA to be sequenced by measuring which bases are added to an elongating DNA chain that is physically anchored to a glass slide or an array of beads.

The first generation approaches, for example Roche 454 sequencing, currently achieve 200-400 base pair reads over hundreds of thousands of templates. The current generation of machines, for example the Illumina GA devices, ABI SOLiD machines (which use ligation-mediated sequencing), and the truly single molecule Helicos HeliScope, are capable of sequencing tens of millions of individual templates in parallel, with sequence read lengths now approaching 100 nucleotides. At the time of writing, 30 Gb of DNA sequence (about 10 haploid human genome equivalents) can be obtained in a week for approximately $15,000. By comparison, the draft human genome reference sequence (3 Gb), completed in 2001 to about 6-fold redundancy, took over 5 years of sequencing effort by multiple laboratories and cost several billion dollars.

In addition to scale, massively parallel short DNA sequence reads have two important novel characteristics beyond those offered by conventional sequencing. First, the nature of the process means that each template sequence originates from a single DNA molecule, rather than from a population of molecules as occurs in PCR amplicon based sequencing. This means that the abundance of a given DNA sequence in a population can be determined by counting the number of templates sequenced.

Next generation devices produce sequence from single templates. Thus the abundance of a nucleotide at any given position gives a direct measure of its frequency in the population. In genomic DNA sequencing, this can be used to infer the presence of subpopulations of cells bearing mutations. In RNA-seq, it can be used to infer allelic expression. In contrast, with PCR amplicon Sanger sequencing, the whole population of molecules is amplified and the sequence is read as a ratio of two peaks from electrophoresis. This limits the detection of minor alleles to frequencies above ~15%.
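The counting idea above can be illustrated with a minimal Python sketch. The pileup string and the function name `allele_frequencies` are hypothetical, chosen only for illustration; a real analysis would also need to account for sequencing error and alignment artefacts, which this sketch ignores.

```python
from collections import Counter


def allele_frequencies(bases):
    """Return the fraction of reads supporting each base at one position.

    `bases` holds one base call per read covering the position, so each
    entry corresponds to a single sequenced template.
    """
    counts = Counter(bases)
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


# Hypothetical pileup: 50 reads cover the position, 5 carry a variant base.
pileup = "A" * 45 + "G" * 5
freqs = allele_frequencies(pileup)

# The least frequent base is the candidate minor allele.
minor_allele, minor_freq = min(freqs.items(), key=lambda kv: kv[1])
print(minor_allele, minor_freq)
```

In this made-up example the variant is supported by 5 of 50 reads, a frequency of 10%, which is below the ~15% limit quoted above for Sanger sequencing of PCR amplicons.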

Moreover, mixtures of DNA templates can be sampled. This will, for example, have major implications for the understanding of tumour mutational heterogeneity. The ability to identify mutations and other genomic features in a background of normal (wild-type) alleles potentially alleviates some of the requirements for laser capture and other microdissection techniques. This will greatly advance the study of cancers such as lobular breast cancer or in situ malignancy, which are characterized by individual cells of interest buried in stroma rather than by contiguous sheets of tumour cells.

Second, most devices now sequence from both ends of each single DNA template. This is known as mate-pair or paired-end sequencing, and it allows the investigator to determine whether the DNA template is chimeric by comparing the locations of the sequences from each end on a reference genome. This approach has recently been used to search for gene fusion events in cancers.

Multiple aberration types can be inferred from DNA sequencing or cDNA sequencing (RNA-seq). After random fragmentation, each end of the DNA molecule is sequenced. Alignment to a reference sequence allows the aberrations present to be inferred. Single nucleotide variants in DNA or RNA can be inferred from single-end alignment. Using the mate-pair information, however, allows sequences that do not map to the expected co-location to be identified. Depending on the context, this can be used to infer translocations/fusions, alternative splicing or inversions.
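As a rough illustration of how mate-pair information can flag unexpected co-locations, the following Python sketch classifies one aligned read pair. The `AlignedPair` fields, the `max_insert` threshold and the returned labels are assumptions made for this example; they are not taken from any particular aligner or from the Department's pipeline, and real callers use considerably more evidence than a single pair.

```python
from dataclasses import dataclass


@dataclass
class AlignedPair:
    """Hypothetical record of where the two ends of one template aligned."""
    chrom1: str
    pos1: int
    strand1: str  # "+" or "-"
    chrom2: str
    pos2: int
    strand2: str


def classify_pair(pair, max_insert=500):
    """Label a read pair by how its two ends sit on the reference.

    Assumes a library in which the two ends of a normal fragment align
    to opposite strands of the same chromosome, at most `max_insert`
    bases apart.
    """
    if pair.chrom1 != pair.chrom2:
        return "translocation_or_fusion"
    if pair.strand1 == pair.strand2:
        return "inversion"
    if abs(pair.pos2 - pair.pos1) > max_insert:
        # In genomic DNA this suggests a deletion; in RNA-seq aligned
        # to the genome it may simply reflect splicing.
        return "deletion_or_splicing"
    return "concordant"


normal = AlignedPair("chr1", 1000, "+", "chr1", 1300, "-")
chimeric = AlignedPair("chr16", 5000, "+", "chr3", 7000, "-")
print(classify_pair(normal), classify_pair(chimeric))
```

A single discordant pair is weak evidence on its own; in practice one would look for clusters of pairs that support the same event.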

In practice, entire human genomes can now be re-sequenced in a matter of weeks by these devices, although sequencing to effective saturation with short reads requires high redundancy: for a diploid human genome, a theoretical estimate suggests 27-fold haploid coverage (i.e. 27 × 2.8 Gb of alignable human genome), or about 76 Gb, to have 95% certainty that all alleles are captured. These are quantities of sequence that were unimaginable at the time the first draft human genome sequence was reported in 2001, but they can now be generated in about 2 weeks by a single one of the most advanced next generation devices. Using these methods, the complete re-sequencing of several individual germline genomes has now been reported. An ambitious project to sample human genetic diversity by light coverage re-sequencing of 1000 individual human germline genomes is underway and will be complete within 2 years. Of relevance to cancer pathology, the sequencing of tumour genomes is now well within reach and many groups are reporting the first results.
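The coverage arithmetic above can be checked with a few lines of Python. The function names are illustrative only, and the throughput of 30 Gb per week is the figure quoted earlier on this page rather than a property of any specific instrument run.

```python
def required_sequence_gb(fold_coverage, genome_size_gb):
    """Total sequence needed to cover a genome to the given redundancy."""
    return fold_coverage * genome_size_gb


def weeks_needed(total_gb, gb_per_week):
    """Instrument time needed at a constant weekly throughput."""
    return total_gb / gb_per_week


# 27-fold coverage of the 2.8 Gb alignable human genome.
total_gb = required_sequence_gb(27, 2.8)

# Assumed throughput: 30 Gb per device per week.
weeks = weeks_needed(total_gb, 30)

print(f"{total_gb:.1f} Gb, {weeks:.1f} weeks")
```

Under these assumptions the requirement comes to roughly 76 Gb, or about two and a half weeks on a single device, broadly in line with the estimate given above.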

Sequencing of genomic DNA requires random fragmentation of the sample to DNA lengths compatible with sequencing library construction. Sequencing libraries can now be built in multiwell plates, reducing the requirements for input DNA and increasing efficiency. Given sufficient depth of genomic DNA sequence, all aberrations in the genome, from segmental copy number changes down to single nucleotide variants, can be reconstructed from the sequence information.

Genomic DNA can be sequenced directly; for transcriptome sequencing, however, the RNA must first be converted to cDNA, which is then fragmented and entered into library construction. Alignment of short reads from cDNA is non-trivial, because the presence of introns means that intron-spanning reads would be rejected by most short read alignment methods. Our approach to this has been to create comprehensive junction library databases that are used for alignment.
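To make the junction library idea concrete, here is a minimal Python sketch that joins the flanks of consecutive exons into short junction sequences. It is an illustration under simplifying assumptions, not the Department's implementation: the function name, the half-open coordinate convention and the toy sequence are all hypothetical, and only adjacent exons of a single transcript are joined.

```python
def build_junction_library(genome, exons, read_length):
    """Build exon-exon junction sequences for one transcript.

    genome      -- reference sequence as a string
    exons       -- ordered list of (start, end) half-open coordinates
    read_length -- length of the short reads to be aligned

    Each junction keeps at most read_length - 1 bases on either side,
    so any read aligning fully to it must cross the junction.
    """
    flank = read_length - 1
    junctions = {}
    for (s1, e1), (s2, e2) in zip(exons, exons[1:]):
        left = genome[max(s1, e1 - flank):e1]     # end of upstream exon
        right = genome[s2:min(e2, s2 + flank)]    # start of downstream exon
        junctions[(e1, s2)] = left + right
    return junctions


# Toy reference: exon, intron (GT...AG), exon.
genome = "ACGTACGT" + "GTAAGGAG" + "TTGGCCAA"
exons = [(0, 8), (16, 24)]
library = build_junction_library(genome, exons, read_length=4)

read = "GTTT"  # hypothetical read spanning the spliced junction
print(read in genome, read in library[(8, 16)])
```

In this toy case the read is absent from the genomic sequence but present in the junction sequence, which is why aligning against such a library can recover reads that a genome-only alignment would reject.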

Massively parallel sequencing of RNA species is now an efficient method of very deeply sampling the expressed genome, including microRNAs and other small RNA species. Not only can the abundance of RNAs be determined accurately, but expressed mutations (point mutations, indels, inversions), alternative splicing, gene fusion events, RNA editing and novel splice events can all be detected from transcriptome libraries at once. Transcriptome sequencing coupled with ChIP-seq now allows the accurate mapping of chromatin modifications and transcription factor binding sites in conjunction with the transcriptome.

Recent related papers from the Department of Molecular Oncology

  • Bell D, Berchuck A, Birrer M, Chien J, Cramer DW, Dao F, Dhir R, Disaia P, Gabra H, Glenn P, Godwin AK, Gross J, Hartmann L, Huang M, Huntsman DG, Iacocca M, Imielinski M, Kalloger S, Karlan BY, Levine DA, Mills GB, Morrison C, Mutch D, Olvera N, Orsulic S, Park K, Petrelli N, Rabeno B, Rader JS, Sikic BI, Smith-McCune K, Sood AK, Bowtell D, Penny R, Testa JR, Chang K, Dinh HH, Drummond JA, Fowler G, Gunaratne P, Hawes AC, Kovar CL, Lewis LR, Morgan MB, Newsham IF, Santibanez J, Reid JG, Trevino LR, Wu YQ, Wang M, Muzny DM, Wheeler DA, Gibbs RA, Getz G, Lawrence MS, Cibulskis K, Sivachenko AY, Sougnez C, Voet D, Wilkinson J, Bloom T, Ardlie K, Fennell T, Baldwin J, Gabriel S, Lander ES, Ding L, Fulton RS, Koboldt DC, McLellan MD, Wylie T, Walker J, O’Laughlin M, Dooling DJ, Fulton L, Abbott R, Dees ND, Zhang Q, Kandoth C, Wendl M, Schierding W, Shen D, Harris CC, Schmidt H, Kalicki J, Delehaunty KD, Fronick CC, Demeter R, Cook L, Wallis JW, Lin L, Magrini VJ, Hodges JS, Eldred JM, Smith SM, Pohl CS, Vandin F, Raphael BJ, Weinstock GM, Mardis ER, Wilson RK, Meyerson M, Winckler W, Getz G, Verhaak RG, Carter SL, Mermel CH, Saksena G, Nguyen H, Onofrio RC, Lawrence MS, Hubbard D, Gupta S, Crenshaw A, Ramos AH, Ardlie K, Chin L, Protopopov A, Zhang J, Kim TM, Perna I, Xiao Y, Zhang H, Ren G, Sathiamoorthy N, Park RW, Lee E, Park PJ, Kucherlapati R, Absher DM, Waite L, Sherlock G, Brooks JD, Li JZ, Xu J, Myers RM, Laird PW, Cope L, Herman JG, Shen H, Weisenberger DJ, Noushmehr H, Pan F, Triche Jr T, Berman BP, Van Den Berg DJ, Buckley J, Baylin SB, Spellman PT, Purdom E, Neuvial P, Bengtsson H, Jakkula LR, Durinck S, Han J, Dorton S, Marr H, Choi YG, Wang V, Wang NJ, Ngai J, Conboy JG, Parvin B, Feiler HS, Speed TP, Gray JW, Levine DA, Socci ND, Liang Y, Taylor BS, Schultz N, Borsu L, Lash AE, Brennan C, Viale A, Sander C, Ladanyi M, Hoadley KA, Meng S, Du Y, Shi Y, Li L, Turman YJ, Zang D, Helms EB, Balu S, Zhou X, Wu J, Topal MD, Hayes DN, Perou CM, Getz G, Voet D, Saksena G, Zhang J, Zhang H, Wu CJ, Shukla S, Cibulskis K, Lawrence MS, Sivachenko A, Jing R, Park RW, Liu Y, Park PJ, Noble M, Chin L, Carter H, Kim D, Karchin R, Spellman PT, Purdom E, Neuvial P, Bengtsson H, Durinck S, Han J, Korkola JE, Heiser LM, Cho RJ, Hu Z, Parvin B, Speed TP, Gray JW, Schultz N, Cerami E, Taylor BS, Olshen A, Reva B, Antipin Y, Shen R, Mankoo P, Sheridan R, Ciriello G, Chang WK, Bernanke JA, Borsu L, Levine DA, Ladanyi M, Sander C, Haussler D, Benz CC, Stuart JM, Benz SC, Sanborn JZ, Vaske CJ, Zhu J, Szeto C, Scott GK, Yau C, Hoadley KA, Du Y, Balu S, Hayes DN, Perou CM, Wilkerson MD, Zhang N, Akbani R, Baggerly KA, Yung WK, Mills GB, Weinstein JN, Penny R, Shelton T, Grimm D, Hatfield M, Morris S, Yena P, Rhodes P, Sherman M, Paulauskis J, Millis S, Kahn A, Greene JM, Sfeir R, Jensen MA, Chen J, Whitmore J, Alonso S, Jordan J, Chu A, Zhang J, Barker A, Compton C, Eley G, Ferguson M, Fielding P, Gerhard DS, Myles R, Schaefer C, Mills Shaw KR, Vaught J, Vockley JB, Good PJ, Guyer MS, Ozenberger B, Peterson J, Thomson E. Integrated genomic analyses of ovarian carcinoma. Nature 2011: 474(7353):609-615.
  • Schrader KA, Heravi-Moussavi A, Waters PJ, Senz J, Whelan J, Ha G, Eydoux P, Nielsen T, Gallagher B, Oloumi A, Boyd N, Fernandez BA, Young TL, Jones SJM, Hirst M, Shah SP, Marra MA, Green J, Huntsman DG. Using next-generation sequencing for the diagnosis of rare disorders: a family with retinitis pigmentosa and skeletal abnormalities. J Pathol, in press.
  • McPherson A, Hormozdiari F, Zayed A, Giuliany R, Ha G, Sun MG, Griffith M, Heravi Moussavi A, Senz J, Melnyk N, Pacheco M, Marra MA, Hirst M, Nielsen TO, Sahinalp SC, Huntsman D, Shah SP. deFuse: An Algorithm for Gene Fusion Discovery in Tumor RNA-Seq Data. PLoS Comput Biol. 2011: 7(5):e1001138.
  • McPherson A, Wu C, Hajirasouliha I, Hormozdiari F, Hach F, Lapuk A, Volik S, Shah S, Collins C, Sahinalp SC. Comrad: detection of expressed rearrangements by integrated analysis of RNA-Seq and low coverage genome sequence data. Bioinformatics 2011: 27(11):1481-8.
  • McConechy MK, Anglesio MS, Kalloger SE, Yang W, Senz J, Chow C, Heravi-Moussavi A, Morin GB, Mes-Masson AM; Australian Ovarian Cancer Study Group, Carey MS, McAlpine JN, Kwon JS, Prentice LM, Boyd N, Shah SP, Gilks CB, Huntsman DG. Subtype-specific mutation of PPP2R1A in endometrial and ovarian carcinomas. J Pathol. 2011: 223(5):567-73.
  • Steidl C, Shah SP, Woolcock BW, Rui L, Kawahara M, Farinha P, Johnson NA, Zhao Y, Telenius A, Neriah SB, McPherson A, Meissner B, Okoye UC, Diepstra A, van den Berg A, Sun M, Leung M, Jones SJ, Connors JM, Huntsman DG, Savage KJ, Rimsza LM, Horsman DE, Staudt LM, Steidl U, Marra MA, Gascoyne RD. MHC class II transactivator CIITA is a recurrent gene fusion partner in lymphoid cancers. Nature, in press. doi:10.1038/nature09754
  • Wiegand KC, Shah SP, Al-Agha OM, Zhao Y, Tse K, Zeng T, Senz J, McConechy M, Anglesio MS, Kalloger SE, Yang W, Heravi-Moussavi A, Giuliany R, Chow C, Fee J, Zayed A, Melnyk N, Turashvili G, Delaney A, Madore J, Yip S, McPherson AW, Ha G, Bell L, Fereday S, Tam A, Galletta L, Tonin PN, Provencher D, Miller D, Jones S, Moore RA, Morin GB, Oloumi A, Boyd N, Aparicio SA, Shih IM, Mes-Masson AM, Bowtell D, Hirst M, Gilks CB, Marra MA, Huntsman DG. ARID1A Gene Mutations in Endometriosis-Associated Ovarian Carcinomas. N Engl J Med. 2010: 363(16):1532-43.
  • Morin RD, Johnson NA, Severson TM, Mungall AJ, An J, Goya R, Paul JE, Boyle M, Woolcock BW, Kuchenbauer F, Yap D, Humphries RK, Griffith OL, Shah S, Zhu H, Kimbara M, Shashkin P, Charlot JF, Tcherpakov M, Corbett R, Tam A, Varhol R, Smailus D, Moksa M, Zhao Y, Delaney A, Qian H, Birol I, Schein J, Moore R, Holt R, Horsman DE, Connors JM, Jones S, Aparicio S, Hirst M, Gascoyne RD, Marra MA. Somatic mutations altering EZH2 (Tyr641) in follicular and diffuse large B-cell lymphomas of germinal-center origin. Nat Genet 2010: 42: 181-185.
  • Shah SP, Morin RD, Khattra J, Prentice L, Pugh T, Burleigh A, Delaney A, Gelmon K, Guliany R, Senz J, Steidl C, Holt RA, Jones S, Sun M, Leung G, Moore R, Severson T, Taylor GA, Teschendorff AE, Tse K, Turashvili G, Varhol R, Warren RL, Watson P, Zhao Y, Caldas C, Huntsman D, Hirst M, Marra MA, Aparicio S. Mutational evolution in a lobular breast tumour profiled at single nucleotide resolution. Nature 2009: 461: 809-813
  • Shah S, Köbel M, Senz J, Morin R, Wiegand K, Kalloger S, Sun M, Guiliany R, Yorida E, Swenerton K, Miller D, Clement P, Crane C, Madore J, Provencher D, Leung P, DeFazio A, Turashvili G, Zhao Y, Zeng T, Glover M, Vanderhyden B, Mes-Masson AM, Brenton J, Aparicio S, Boyd N, Hirst M, Gilks CB, Marra M, Huntsman D. Mutation of the FOXL2 gene in granulosa cell tumors of the ovary. N Engl J Med. 2009: 360(26):2719-29.