Gene Transcript Profiling

The Icahn Institute for Data Science and Genomic Technology and the Department of Genetics and Genomic Sciences handle gene transcript profiling. mRNA-Seq is becoming the preferred technology for transcript profiling because of its high reproducibility, dynamic range, and richness of data. Microarray expression profiling, by contrast, suffers from three key flaws: it is limited to known genes and splice variants, it is prone to hybridization artifacts, and its results are difficult to reproduce.

Traditional digital gene expression techniques, such as serial analysis of gene expression (SAGE), generate information similar to mRNA-Seq, but are limited by the relatively high per-base cost of Sanger sequencing and the need for bacterial cloning steps. mRNA-Seq, by contrast, carries several advantages, such as:

  • High dynamic range: mRNA-Seq is estimated to span 5 orders of magnitude, which is significantly higher than microarray platforms (Wang, Gerstein et al. 2009).
  • High reproducibility: Excellent concordance between identical samples has been observed, reducing the need for technical replicates (Marioni, Mason et al. 2008).
  • RNA splice patterns: The relative abundance of splice variants can be characterized using paired-end reads (Trapnell, Pachter et al. 2009).
  • SNP typing and allelic expression: Highly expressed transcripts can be used to catalog patient haplotypes directly from RNA (Verlaan, Ge et al. 2009). This method has been extended to identify allelic imbalances in expression (Heap, Yang et al. 2010).
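
As an illustration of the allelic expression point, the sketch below tests whether the reads covering one heterozygous SNP depart from an even split between the two alleles. It is a minimal example of an exact two-sided binomial test, not necessarily the method used in the cited studies; the function name allelic_imbalance_p and the default expected reference fraction of 0.5 are assumptions made for this example. Real analyses would also need to account for mapping bias and overdispersion.

```python
from math import exp, lgamma, log


def allelic_imbalance_p(ref_reads: int, alt_reads: int, p_ref: float = 0.5) -> float:
    """Exact two-sided binomial p-value for ref vs. alt read counts at one SNP.

    p_ref is the expected fraction of reference-allele reads under balanced
    expression (0.5 is an assumption; mapping bias can shift it in practice).
    """
    n = ref_reads + alt_reads
    if ref_reads < 0 or alt_reads < 0 or n == 0:
        raise ValueError("need non-negative counts and at least one read")
    if not 0.0 < p_ref < 1.0:
        raise ValueError("p_ref must be strictly between 0 and 1")

    def pmf(k: int) -> float:
        # Binomial probability computed in log space so deep coverage
        # does not overflow.
        log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        return exp(log_choose + k * log(p_ref) + (n - k) * log(1.0 - p_ref))

    observed = pmf(ref_reads)
    # Sum the probability of every outcome at most as likely as the observed
    # one; the small tolerance absorbs floating-point ties.
    tail = sum(p for p in map(pmf, range(n + 1)) if p <= observed * (1 + 1e-9))
    return min(1.0, tail)
```

For example, allelic_imbalance_p(10, 0) gives 2 x 0.5^10, or about 0.002, while an even 5/5 split gives a p-value of 1.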

Detection of rare transcripts is limited only by sequencing depth. Early studies of 8M reads yielded novel insights (Sultan, Schulz et al. 2008), but sequencing technology has advanced significantly. An ultra-deep (10 gigabase) mouse transcript sequencing study revealed that 80M reads are sufficient to detect the vast majority of unique sequence tags, with most of the information derived from the first 10 to 40M reads (Cloonan, Forrest et al. 2008; Wang, Gerstein et al. 2009).

Our HiSeq 2500 produces around 120 to 150M sequence reads per lane and supports paired-end sequencing with reads of up to 100 nt.
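
The depth and lane-yield figures above lend themselves to a quick planning calculation. The sketch below is a rough estimate only: it assumes libraries are pooled evenly on a lane and that every read is usable, neither of which holds exactly in practice. The function names samples_per_lane and lanes_needed are hypothetical and used only for this example.

```python
import math


def samples_per_lane(reads_per_lane: int, reads_per_sample: int) -> int:
    """Number of evenly pooled libraries one lane can hold at a target depth."""
    if reads_per_lane <= 0 or reads_per_sample <= 0:
        raise ValueError("read counts must be positive")
    return reads_per_lane // reads_per_sample


def lanes_needed(n_samples: int, reads_per_lane: int, reads_per_sample: int) -> int:
    """Lanes required to give every sample at least the target depth."""
    per_lane = samples_per_lane(reads_per_lane, reads_per_sample)
    if per_lane == 0:
        raise ValueError("target depth exceeds the yield of a single lane")
    return math.ceil(n_samples / per_lane)


# Conservative case: lowest lane yield (120M) and highest target depth (40M).
print(samples_per_lane(120_000_000, 40_000_000))   # 3 samples per lane
print(lanes_needed(24, 120_000_000, 40_000_000))   # 8 lanes for 24 samples
```

Using the low end of the lane yield and the high end of the per-sample depth gives a conservative plan; relaxing either assumption raises the number of samples that fit on a lane.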

Library Preparation From RNA Samples

This technique selects polyadenylated mRNA transcripts from total RNA, fragments them, and then converts them to dsDNA for sequencing. The method proceeds as follows:

  • 1 µg of high-quality total RNA is incubated with oligo(dT) magnetic beads (SeraMag or Dynal) to enrich for mRNA with poly-A tails.
  • The eluted RNA is incubated at 94°C in Tris buffer with potassium acetate and magnesium acetate. This yields fragmented RNA in the 200-500 nt range.
  • The RNA is ethanol precipitated with sodium acetate, then resuspended in water.
  • Reverse transcription with random oligonucleotide hexamers (Invitrogen SuperScript III) is performed to generate cDNA.
  • The RNA is then degraded by the addition of RNase, and DNA polymerase is added to generate the second strand. The resulting dsDNA is ready for standard Illumina adapter ligation for sequencing (Mortazavi, Williams et al. 2008).

The Illumina platform employs an in-situ amplification technique followed by dye-terminator sequencing (Bentley, Balasubramanian et al. 2008). Short oligonucleotides covalently bound to the sequencing flow cell are used to immobilize the DNA strand being sequenced. Each molecule in the DNA library must contain two specific sequences at its ends. These are introduced by a DNA ligase. This method is executed as follows:

  • The dsDNA produced by second strand synthesis has ends with 5´ and 3´ overhangs, which are filled in using T4 polymerase and T4 polynucleotide kinase, resulting in a blunt-ended DNA molecule.
  • A deoxyadenosine (dA) tail is added to the 3´ end of each DNA strand using the Klenow fragment (exo-).
  • Double-stranded DNA adapters with 3´ thymidine overhangs are ligated to the dA-tailed library using T4 ligase. The adapters contain the sequences needed for binding to the flow cell and the sequencing primer binding sites.

The mRNA-Seq protocol takes approximately two days to complete. The first day is spent on RNA preparation: checking RNA quality with BioAnalyzer, selecting poly-A RNA with oligo dT beads, first and second strand synthesis with reverse transcriptase, and DNA cleanup. The second day is for library preparation, including end-polishing of dsDNA, ligating adapters, size selection by agarose gel or SPRI beads, enrichment of the library by PCR, and validation of the library by BioAnalyzer.

RNA sample quality is important. mRNA-Seq uses oligo dT beads to select polyadenylated RNA from a total RNA population. If the RNA molecules are fragmented, only the fragments that retain the poly-A tail are captured, which results in a strong 3' bias in the sequencing reads. This can be mitigated in part by selecting high-quality RNA before sequencing (strong 18S/28S peaks on BioAnalyzer). Downstream bioinformatic analysis attempts to correct for this bias when reads are counted and per-gene expression is computed. Reverse transcriptase efficiency may vary with the template, which may be another source of bias in read counts that is difficult to eliminate. Reverse transcriptase template switching can also be a source of error. This is usually dealt with by software that identifies chimeric reads with low counts and excludes them.
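
The low-count chimeric read filter mentioned above can be sketched as follows. This is an illustration of the general idea rather than the behavior of any particular tool: it assumes chimeric candidates have already been called upstream, and the ChimericRead record, the split_by_support function, and the default threshold of 3 supporting reads are all hypothetical choices for this example.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class ChimericRead(NamedTuple):
    """A read whose two segments map to different genes (hypothetical record)."""
    read_id: str
    gene_a: str
    gene_b: str


def split_by_support(
    reads: Iterable[ChimericRead], min_support: int = 3
) -> Tuple[List[ChimericRead], List[ChimericRead]]:
    """Separate chimeric reads into (kept, excluded) by junction support.

    A gene pair seen in fewer than min_support reads is treated as a likely
    template-switching artifact and its reads are excluded.
    """
    reads = list(reads)  # the reads are traversed twice below

    def pair(read: ChimericRead) -> frozenset:
        # Order-independent key, so A-B and B-A count as the same junction.
        return frozenset((read.gene_a, read.gene_b))

    support = Counter(pair(r) for r in reads)
    kept = [r for r in reads if support[pair(r)] >= min_support]
    excluded = [r for r in reads if support[pair(r)] < min_support]
    return kept, excluded
```

The threshold trades sensitivity for specificity: a higher value removes more artifacts but can also discard genuine fusion transcripts that are expressed at low levels.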

microRNA Profiling

mRNA-Seq can capture pre-microRNAs that are polyadenylated. However, it cannot detect cleaved microRNAs because of the oligo dT selection step. In addition, the size of sequencing adapter dimers is close to the size of true microRNA library sequences with ligated adapters. If total RNA is used as input, this can make the true library fragments substantially difficult to separate from adapter dimers.

A separate microRNA-specific sample preparation method has been developed by Illumina that differs substantially from their mRNA-Seq protocol. The method uses RNA-RNA ligation to add the adapter sequences. The sequences are then converted to DNA with reverse transcriptase for sequencing. To ensure a high yield for microRNA profiling, the RNA-ligation method should be used. MicroRNA-Seq library preparation should be carried out in parallel with mRNA-Seq, although the two libraries could potentially be pooled and sequenced in the same reaction.