The Icahn Institute for Data Science and Genomic Technology and the Department of Genetics and Genomic Sciences oversee whole-genome sequencing. Genome-wide association studies (GWAS) using SNP markers have been widely used to discover genetic risk factors for human disease. However, this method reveals only common risk factors. A more comprehensive method, such as whole-genome sequencing, would better reveal the de novo and low-frequency alleles that contribute to genetic risk for disease.
At this time, whole-genome sequencing is too expensive to perform on large sample sizes. A recently developed technique, which captures and sequences only the exons in the genome, has filled the niche between GWAS and whole-genome sequencing. Exons comprise roughly 2 to 3 percent of the genome, so the approach reduces the sequencing needed by about 97 percent yet retains the most likely sources of genetic disease risk. The method is a cost-effective way to study families affected by inherited disease or a large number of family trios. Custom panels can also be synthesized to cover a limited gene set or to target model organisms.
The method is well established, with three major companies producing commercial products: Roche NimbleGen (SeqCap EZ), Agilent Technologies (SureSelect), and Illumina (TruSeq Exome Enrichment). All methods are variants on the following strategy:
- DNA is sheared and then ligated to adapters that enable sequencing on the Illumina platform.
- The library is amplified by PCR and then hybridized to a pool of biotinylated oligos specific for exons ("baits").
- The size of the genome region captured is generally 30 to 60 megabases.
- Streptavidin magnetic beads are used to separate genomic DNA-bait hybrids from the rest of the library. This captured fraction should be enriched for exons (~3 percent of genome).
- A second round of PCR is used to amplify the library to sufficient levels to sequence.
- The enriched library is checked by real-time PCR to determine whether the exon enrichment was successful.
- The library is sequenced with paired-end reads (2 x 100 nt on the HiSeq 2500) to achieve a depth of 20-30x per base.
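As a rough planning aid, the depth target in the last step can be related to read count with the standard approximation: mean depth = (read pairs x 2 x read length x on-target fraction) / target size. The sketch below is illustrative only. The function name, the 45 Mb example target, and the 70 percent on-target fraction are assumptions made for the example, not values from this protocol.

```python
import math

def read_pairs_needed(target_bp, depth, read_len=100, on_target=1.0):
    """Estimate read pairs needed for a mean depth over a captured target.

    Uses: depth = pairs * 2 * read_len * on_target / target_bp.
    on_target is the assumed fraction of sequenced bases that fall
    inside the captured region (a hypothetical planning parameter).
    """
    if not 0 < on_target <= 1:
        raise ValueError("on_target must be in (0, 1]")
    # Usable on-target bases contributed by one read pair.
    bases_per_pair = 2 * read_len * on_target
    return math.ceil(target_bp * depth / bases_per_pair)

# Example: hypothetical 45 Mb capture, 30x mean depth, 2 x 100 nt reads,
# assuming 70 percent of sequenced bases land on target.
pairs = read_pairs_needed(45_000_000, 30, read_len=100, on_target=0.7)
print(f"{pairs:,} read pairs")
```

This gives a mean depth only. Coverage across individual exons is uneven in practice, so a real run plan would normally add a margin on top of this estimate.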
Genomic Library Preparation
Three steps are required for preparation of an exon-enriched gDNA library for massively parallel sequencing:
- A library of genomic DNA is created.
- The library is enriched for exon targets.
- The library quality and enrichment are verified.
Illumina sequencing performs best when DNA strands are <500 bp in size; thus, high molecular weight genomic DNA must be sheared. We use the Covaris Acoustic Disruptor (E210) to achieve a narrow 200 to 300 nt size distribution of chromosomal DNA. The Covaris instrument has major advantages over standard shearing techniques: it can be used with low quantities of DNA (250 ng to 5 µg), sample recovery is higher than with nebulization, and the samples are sheared in individual tubes, reducing the risk of cross-contamination by a sonication probe.
DNA derived from blood, saliva, or tissue may be used, and samples may be processed in a 96-well format. Each molecule in the DNA library must contain two specific sequences at its ends to be sequenced on the Illumina platform. These sequences are introduced by the following methods:
- After shearing, the library fragments have 5´ and 3´ overhangs. These are converted to blunt ends with T4 DNA polymerase, which fills in 5´ overhangs and trims 3´ overhangs, while T4 polynucleotide kinase phosphorylates the 5´ ends.
- A single deoxyadenosine (dA) is added to the 3´ ends of the DNA strands using the Klenow fragment (exo-).
- Double-stranded DNA adapters with 3´ thymidine overhangs are ligated to the dA-tailed library using T4 DNA ligase. The adapters contain the sequences needed for binding to the sequencing flow cell and the sequencing primer binding sites.
- Sequencing adapter dimers wastes reagents, so they are excluded from the sequencing pool by size-selecting the library, either by agarose gel electrophoresis or with SPRI beads (Beckman Coulter AMPure XP).
- The DNA library is then enriched for sequences with 5´ and 3´ adapters by PCR with primers complementary to the adapter sequences (ligation-mediated PCR, LM-PCR).
Capturing Library Strands Containing Exons
Genomic DNA libraries prepared as above contain approximately 2 to 3 percent exon-derived sequences. Changes in amino acid sequence are often the cause of phenotypic variation. Exon sequencing is cost-effective compared to whole-genome sequencing because it yields informative variants with a fraction of the sequencing effort.
Three commercial platforms are currently available for exon sequence enrichment from DNA libraries: Agilent Technologies' SureSelect, Roche NimbleGen's SeqCap EZ, and Illumina's TruSeq Exome Enrichment. These methods involve synthesizing oligonucleotides that tile the exons on a microarray to create a pool of biotinylated nucleic acids or “baits.” The capture reagent baits may be RNA (Agilent Technologies) or DNA (Roche NimbleGen) and may vary in length from 60 to 120 nt. Baits are incubated with adapter-ligated DNA libraries in solution for 24 to 72 hours, allowing the baits to hybridize to their target sequences. The bait-target hybrids are captured by streptavidin magnetic microbeads; the beads are then washed to remove non-specifically bound DNA, and the captured DNA is eluted. A secondary LM-PCR is performed to generate sufficient DNA for sequencing.
Enriched libraries are then validated for size distribution on an Agilent Bioanalyzer. Before the library is sequenced, we perform a real-time SYBR Green PCR using six exon-intron pairs to compare equal masses of whole-genome and exon-captured libraries. An exon-capture experiment is considered successful if all six exons are enriched and all six introns are depleted. This proxy method allows detection of failed capture experiments before sequencing is initiated. Samples are then ready for sequencing on the Illumina HiSeq 2500.
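The pass/fail rule for the real-time PCR check can be written down as a small calculation. The following is a minimal sketch, not the facility's analysis script. It assumes roughly 100 percent PCR efficiency, so that one cycle of difference in threshold cycle (Ct) corresponds to a twofold difference in template. The function names and all Ct values are hypothetical.

```python
def fold_change(ct_genomic, ct_captured):
    """Relative abundance of an amplicon in the captured library versus
    the whole-genome library, for equal input mass.

    Assumes ~100% PCR efficiency: reaching threshold one cycle earlier
    means twice as much starting template.
    """
    return 2.0 ** (ct_genomic - ct_captured)

def capture_succeeded(exon_cts, intron_cts):
    """Apply the rule: every exon enriched (>1x), every intron depleted (<1x).

    Each argument is a list of (ct_genomic, ct_captured) tuples.
    """
    exons_ok = all(fold_change(g, c) > 1.0 for g, c in exon_cts)
    introns_ok = all(fold_change(g, c) < 1.0 for g, c in intron_cts)
    return exons_ok and introns_ok

# Hypothetical Ct values for six exon-intron pairs (illustrative only).
exons = [(26.1, 20.3), (25.8, 19.9), (26.4, 20.8),
         (25.5, 19.7), (26.0, 20.1), (26.2, 20.6)]
introns = [(26.0, 29.4), (25.9, 30.1), (26.3, 29.8),
           (25.7, 29.0), (26.1, 30.5), (26.2, 29.9)]
print(capture_succeeded(exons, introns))
```

A lower Ct in the captured library means the amplicon is more abundant there, so exons should show a fold change above 1 and introns below 1. A stricter minimum fold enrichment could be substituted for the 1x cutoff if a laboratory has an established threshold.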