Genome-wide association studies (GWAS) using SNP markers have been widely used to discover genetic risk factors for human disease. However, this method reveals only risk factors that are common in the population. A more comprehensive method such as whole-genome sequencing would reveal all de novo and low-frequency alleles that contribute to genetic risk for disease, but at this time it is too expensive to perform on large sample sizes. A recently developed technique fills the niche between GWAS and whole-genome sequencing: capture and sequencing of only the exons in the genome. Exons comprise roughly 2-3% of the genome, so the approach reduces the sequencing needed by about 97% yet retains the most likely sources of genetic disease risk. The method is a cost-effective way to study families affected by inherited disease or large numbers of family trios. Custom panels can also be synthesized to cover a limited gene set or to target model organisms.
The method is well established; three major companies produce commercial products for whole-exome capture and sequencing: Roche NimbleGen (SeqCap EZ), Agilent Technologies (SureSelect) and Illumina (TruSeq Exome Enrichment). All are variants on the following strategy. DNA is sheared and then ligated to adapters that enable sequencing on the Illumina platform. The library is amplified by PCR and then hybridized to a pool of biotinylated oligos specific for exons (“baits”). The genome region captured is generally 30-60 megabases in size. Streptavidin magnetic beads are used to separate genomic DNA-bait hybrids from the rest of the library; this fraction should be enriched for exons (~3% of the genome). A second round of PCR amplifies the library to levels sufficient for sequencing. The enriched library is checked by real-time PCR to confirm that exon enrichment was successful. The library is then sequenced with paired-end reads (100 nt x 2 on the HiSeq 2000) to achieve a depth of 20-30x per base.
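The relationship between target size, read length and depth quoted above can be sketched with simple arithmetic. The following is a minimal illustration, not part of any vendor protocol: the function name and the on_target_fraction parameter are our own assumptions, and the 0.5 on-target value in the example is a placeholder rather than a measured capture efficiency. It estimates mean depth only and ignores PCR duplicates and uneven coverage across baits.

```python
import math


def read_pairs_needed(target_bp, mean_depth, read_len=100, on_target_fraction=1.0):
    """Estimate read pairs required for a given mean depth over the capture target.

    target_bp          -- size of the captured region in bases (30-60 Mb in the text)
    mean_depth         -- desired mean depth per base (20-30x in the text)
    read_len           -- length of each read in a pair (100 nt in the text)
    on_target_fraction -- ASSUMPTION: share of sequenced bases that fall on target
    """
    if not 0 < on_target_fraction <= 1:
        raise ValueError("on_target_fraction must be in (0, 1]")
    # Total bases to sequence, inflated to account for off-target reads.
    bases_needed = target_bp * mean_depth / on_target_fraction
    # Each pair contributes two reads of read_len bases.
    return math.ceil(bases_needed / (2 * read_len))


if __name__ == "__main__":
    # Hypothetical example: 50 Mb target, 30x depth, half of all bases on target.
    pairs = read_pairs_needed(50_000_000, 30, read_len=100, on_target_fraction=0.5)
    print(f"Approximate read pairs per sample: {pairs:,}")
```

Dividing the read pairs a lane actually yields by this estimate gives a rough upper bound on how many samples can share a lane.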
Genomic Library Preparation
Three steps are required to prepare an exon-enriched gDNA library for massively parallel sequencing: (1) a library of genomic DNA is created, (2) the library is enriched for exon targets, and (3) library quality and enrichment are verified.
Illumina sequencing performs best when DNA strands are <500 bp in size; thus, high-molecular-weight genomic DNA must be sheared. We use the Covaris Acoustic Disruptor (E210) to achieve a narrow 200-300 nt size distribution of chromosomal DNA. The Covaris instrument has major advantages over standard shearing techniques: it can be used with low quantities of DNA (250 ng to 5 µg), sample recovery is higher than with nebulization, and samples are sheared in individual tubes, which reduces the risk of cross-contamination from a shared sonication probe. DNA derived from blood, saliva or tissue may be used, and samples may be processed in a 96-well format.
Each molecule in the DNA library must carry two specific sequences at its ends to be sequenced on the Illumina platform. These sequences are introduced as follows. After shearing, the library fragments have ends with 5´ and 3´ overhangs; these are repaired using T4 DNA polymerase and T4 polynucleotide kinase, resulting in blunt-ended DNA molecules. A deoxyadenosine (dA) is then added to the 3´ ends of the DNA strands using the Klenow fragment (exo-). Double-stranded DNA adapters with 3´ thymidine overhangs are ligated to the dA-tailed library using T4 ligase; the adapters contain the sequences needed for binding to the sequencing flow cell, as well as the sequencing primer binding sites. Sequencing adapter dimers wastes reagents; to exclude them from the sequencing pool, the library must be size-selected by agarose gel electrophoresis or with SPRI beads (Beckman Coulter AMPure XP). The DNA library is then enriched for molecules carrying adapters at both the 5´ and 3´ ends by PCR with primers complementary to the adapter sequences (ligation-mediated PCR, LM-PCR).
Capturing Library Strands Containing Exons
Genomic DNA libraries prepared as above contain ~2-3% exon-derived sequences. Changes in amino acid sequence are often the cause of phenotypic variation. Thus, exon sequencing is cost-effective compared to whole-genome sequencing because it is likely to yield informative variants with a fraction of the sequencing effort. Three commercial platforms are currently available for exon sequence enrichment from DNA libraries: Agilent Technologies’ “SureSelect”, Roche NimbleGen’s “SeqCap EZ” and Illumina’s “TruSeq Exome Enrichment”. All three methods involve the following: a tiling exon oligonucleotide microarray is synthesized and used to create a pool of biotinylated nucleic acids, or “baits”. The baits may be RNA (Agilent Technologies) or DNA (Roche NimbleGen) and vary in length from 60 to 120 nt. Baits are incubated with adapter-ligated DNA libraries in solution for 24-72 hours, allowing the baits to hybridize to their target sequences. The bait-target hybrids are captured on streptavidin magnetic microbeads, the beads are washed to remove non-specifically bound DNA, and the captured library is eluted. A secondary LM-PCR is performed to generate sufficient DNA for sequencing.
Enriched libraries are then checked for size distribution on an Agilent Bioanalyzer. Before a library is sequenced, we perform real-time SYBR Green PCR on six exon-intron pairs, comparing equal masses of the whole-genome and exon-captured libraries. An exon-capture experiment is considered successful if all six exons are enriched and all six introns are depleted; this proxy method allows failed capture experiments to be detected before sequencing is initiated.
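The pass/fail rule above can be expressed as a short calculation on Ct values. This is a minimal sketch under stated assumptions, not our validated analysis: it assumes perfect amplification efficiency (doubling each cycle), treats any fold change above 1 as "enriched" and any below 1 as "depleted", and uses made-up Ct values purely as placeholders. The function names are hypothetical.

```python
def fold_change(ct_genomic, ct_captured, efficiency=2.0):
    """Fold change of an amplicon in the captured library relative to the
    whole-genome library, given equal input mass.

    ASSUMPTION: efficiency=2.0 means the product doubles every cycle.
    A lower Ct in the captured library gives a fold change above 1.
    """
    return efficiency ** (ct_genomic - ct_captured)


def capture_succeeded(exon_cts, intron_cts, efficiency=2.0):
    """Apply the rule from the text: every exon enriched, every intron depleted.

    exon_cts, intron_cts -- lists of (ct_genomic, ct_captured) tuples
    ASSUMPTION: the enriched/depleted cutoff is a fold change of exactly 1.
    """
    exons_enriched = all(
        fold_change(g, c, efficiency) > 1.0 for g, c in exon_cts
    )
    introns_depleted = all(
        fold_change(g, c, efficiency) < 1.0 for g, c in intron_cts
    )
    return exons_enriched and introns_depleted


if __name__ == "__main__":
    # Placeholder Ct values for six exon-intron pairs (illustrative, not measured).
    exons = [(25.0, 19.5), (24.8, 19.0), (25.3, 20.1),
             (25.1, 19.7), (24.9, 19.2), (25.2, 19.9)]
    introns = [(25.0, 29.0), (24.7, 28.5), (25.4, 29.8),
               (25.0, 28.9), (25.1, 29.3), (24.9, 28.7)]
    for i, (g, c) in enumerate(exons, start=1):
        print(f"exon {i}: {fold_change(g, c):.1f}-fold")
    print("capture succeeded:", capture_succeeded(exons, introns))
```

In practice a stricter cutoff than 1 may be preferable, since small Ct differences fall within the run-to-run noise of real-time PCR.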
Samples are then ready for sequencing on the Illumina HiSeq 2000. As of March 2012, sufficient coverage for SNP calling is achieved by sequencing three samples per lane.