Institute for Genomics and Multiscale Biology

The Institute for Genomics and Multiscale Biology at the Icahn School of Medicine is a "core technology institute" that enables Mount Sinai researchers to carry out cutting-edge genomics-based basic and translational research. It brings state-of-the-art, high-throughput genomic technologies under one roof, providing a centralized resource for large-scale genomic studies. The Institute has three core units: the Genomics Core, the Proteomics/Metabolomics Core, and the Computational Genomics Core. Each core has state-of-the-art equipment and technical and analytical infrastructure, run by highly skilled technical and scientific staff, to facilitate sophisticated genomic, proteomic, and metabolomic research. Eric Schadt, PhD, Chair of the Department of Genetics and Genomic Sciences, directs the Institute. Andrew Kasarskis, PhD, is the Co-Director of the Institute.

The Genomics Core:
In addition to housing the standard Sanger sequencing facility, the Genomics Core provides the latest next-generation sequencing and microarray technologies. The Genomics Core currently operates multiple next-generation sequencing platforms: four Illumina HiSeqs, one MiSeq, and one PacBio RS system. Two Astro platforms, which are newer and more advanced models than the RS system, will be commissioned by the end of 2012. In addition, there are plans for an early-access purchase of the latest Ion Proton sequencers from Ion Torrent/Life Technologies, which have the potential to sequence a genome for $1,000 in one day. The Genomics Core facility is directed by Dr. Milind C. Mahajan and has 12 full-time staff, including three Ph.D.-level staff appointments. The day-to-day production, R&D operations, and scientific and business oversight of the Genomics Core are handled by Drs. Milind Mahajan and Andrew Kasarskis. All necessary infrastructure, such as high-performance computing and bioinformatics support, is already in place. High-throughput sequencing projects routinely carried out include sequencing library preparation followed by sequencing for RNA-seq, ChIP-seq, small RNA-seq, targeted gene sequencing, and whole-exome and whole-genome sequencing of a variety of clinical and non-clinical samples from different biological sources.

Microarray technology is offered for genome-wide association studies (GWAS), pharmacogenetics, genomics, methylation, candidate gene, and genotyping analyses. The Core has extensive experience and expertise in genotyping and expression array analysis on Affymetrix and Illumina platforms. Currently, the Core uses the Illumina platform for genotyping, employing the GoldenGate assay as well as OmniExpress, OmniQuad2.5, and Omni2.5-8 arrays. The Core has all the necessary robotic liquid-handling automation systems, along with a HiScan and a BeadXpress system, to handle a wide range of sample volumes. With the existing equipment and facilities, the Core can currently process up to 600 samples per week.

Quantitative PCR analysis for small-scale gene expression quantitation and detection of single nucleotide polymorphisms (SNPs) is available on an Applied Biosystems 7900HT, with a Biomek FX providing automated liquid handling for accurate reaction set-up.

Automated, high-throughput nucleic acid purification from blood, tissue, or bacteria is performed in a 96-well format using a Qiagen Universal BioRobot. To assess the quality of nucleic acids, the Institute has an Agilent Bioanalyzer 2100.

The Institute has ample computing power and in-house software, described below and listed in the Major Equipment document. Data are processed on a 360-core compute cluster with 128 GB of RAM on each node (~50 cores per node). Additional large-memory servers bring the total available RAM to 1.2 TB. A 400 TB networked storage device from Isilon provides ample storage for the sequencing data. The storage and compute clusters are connected through a 10G interconnect. A dedicated server from Avere provides tiered data access, caching frequently used files for rapid access. An externally accessible website (http://katahdin.mssm.edu) is hosted on two mirrored servers, with load balancing handled by an F5 load balancer. Nodes on the compute cluster are pre-loaded with all of the software needed for routine analysis of high-throughput sequencing data and are regularly updated with the latest public genomic data resources. Several tools for analysis and data exploration have been developed in-house, including Kismeth (Gruntman et al. 2008) and Geoseq (Gurtowski et al. 2010). These are deployed as browser-based tools backed by distributed web services that help maintain performance and availability. The servers are housed in two 42U racks with ample power and air conditioning, both of which have backup to ensure 24/7 operation. A dedicated system administrator keeps the software up to date. In addition, two programmers/analysts develop software and analyze data.
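
As a rough cross-check of the cluster figures quoted above, the following sketch estimates the node count and aggregate cluster RAM. It is only an estimate: the "~50 cores per node" figure is approximate, the conversion 1 TB = 1000 GB is an assumption, and the names used here are illustrative rather than taken from any Institute system.

```python
# Back-of-the-envelope estimate from the figures quoted in the text.
CLUSTER_CORES = 360      # total cores in the compute cluster (from the text)
CORES_PER_NODE = 50      # "~50 cores per node" in the text; approximate
RAM_PER_NODE_GB = 128    # RAM per node (from the text)
TOTAL_RAM_GB = 1200      # 1.2 TB total, assuming 1 TB = 1000 GB


def estimate_nodes(total_cores: int, cores_per_node: int) -> int:
    """Estimate the node count by rounding to the nearest whole node."""
    return round(total_cores / cores_per_node)


nodes = estimate_nodes(CLUSTER_CORES, CORES_PER_NODE)
cluster_ram_gb = nodes * RAM_PER_NODE_GB
# RAM that would have to come from the additional large-memory servers.
additional_ram_gb = TOTAL_RAM_GB - cluster_ram_gb

print(f"Estimated nodes: {nodes}")
print(f"Estimated cluster RAM: {cluster_ram_gb} GB")
print(f"Implied RAM on additional servers: {additional_ram_gb} GB")
```

Under these assumptions the cluster itself accounts for roughly three quarters of the 1.2 TB total, with the remainder supplied by the additional servers.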

Scientific Computing Infrastructure:
Mount Sinai has committed over $50M to its scientific computational and data cyberinfrastructure, recognizing that a well-designed and well-managed infrastructure empowers its scientists and researchers to be more productive and effective. This significant investment includes professional staff, expertise, hardware, software, and a new computational and data facility dedicated to scientific computing. Mount Sinai collaborates and partners closely with other facilities and vendors to keep its cyberinfrastructure and services state-of-the-art. The staff follow the community’s best practices and procedures to ensure that the computing and data services are as efficient and effective as possible for the researchers. To guide the development of scientific computing at Mount Sinai, we have recruited Patricia Kovatch as Associate Dean, Scientific Computing. Ms. Kovatch is an expert in high performance computing, and most recently served as Director of the University of Tennessee’s $110M National Institute for Computational Sciences (NICS) at Oak Ridge National Laboratory. NICS is the National Science Foundation (NSF)’s #1 provider of high performance computing resources, providing over 70% of all compute cycles to NSF’s national base of scientists and researchers. Its resource portfolio includes Kraken, a 1.2 petaflop Cray XT5 with 112,896 cores and 147 terabytes of memory. Kraken is the largest academic supercomputer, and was the third fastest in the world in November 2009.

Mount Sinai’s robust computing and data cyberinfrastructure has been designed for the rapid and accurate ingest of the sequencer output and high performance post-processing and analysis by the computational cluster and affiliated file systems and storage.  The cyberinfrastructure resources have been tailored specifically to handle the computational and data workflow from the sequencer, including a high bandwidth network.

The computing and data center at Mount Sinai contains over 10,000 square feet of space. It offers several complementary resources, including an SGI Altix 1300 cluster with 816 CPU cores, 64 Nehalem-class nodes with Quad Data Rate (QDR) Infiniband connectivity, and 38 nodes with Double Data Rate (DDR) Infiniband. All of those nodes are attached to 65 terabytes of Lustre-based high-speed shared storage. The software and programming environments are best of breed and include community standards such as Linux and MPI. The clusters also run resource managers and schedulers tuned to the job workload, so as to process as many jobs as possible with the highest overall machine utilization, job throughput, and job success rate. The clusters are operated with over 95% uptime, using scalable and reproducible configuration management techniques. The machines are also monitored for security.

High Performance Computing (HPC) Cluster:
The new High Performance Computing (HPC) cluster and storage will be installed at the beginning of March 2012, and we expect to open the cluster for general access in mid-April 2012. The resource expansion is specifically designed for applications in learning and modeling biological networks, the application of Bayesian analysis frameworks more generally, second- and third-generation sequence analysis, and modeling the kinetics of DNA synthesis to detect 6-methyladenosine, 4- and 5-methylcytosine, 8-oxoguanine, and other modified nucleotides. The hardware available for analysis includes systems optimized for embarrassingly parallel jobs that are CPU bound, such as QTL analyses, as well as parallel jobs that are memory bound, such as Bayesian network reconstruction. In addition, there is support for jobs requiring substantial shared memory, such as all-by-all comparisons of splice-form-specific RNA-seq results to generate isoform-specific coexpression networks, an n² problem with n estimated between 100,000 and 200,000 for most tissues. High-availability storage is provided to support both very large files and very large numbers of small files in these analyses. In summary, the compute infrastructure is designed to be flexible, allowing both exact mathematical models of high complexity and heuristic machine learning approaches to be explored in addressing biological hypotheses in an iterative and complementary way.
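
To illustrate why the all-by-all comparison described above calls for substantial shared memory, the following minimal sketch estimates the size of a dense pairwise matrix for the quoted range of n. It assumes each pairwise value is stored as an 8-byte double and that the matrix is held densely in memory; the function name and the storage choices are illustrative assumptions, not a description of the Institute's actual pipeline.

```python
def pairwise_matrix_bytes(n: int, bytes_per_value: int = 8,
                          symmetric: bool = False) -> int:
    """Bytes needed to store one value for every pair among n items.

    With symmetric=True, only the upper triangle (including the diagonal)
    is counted, since a symmetric measure such as correlation needs each
    pair only once.
    """
    pairs = n * (n + 1) // 2 if symmetric else n * n
    return pairs * bytes_per_value


for n in (100_000, 200_000):
    full_gb = pairwise_matrix_bytes(n) / 1e9
    half_gb = pairwise_matrix_bytes(n, symmetric=True) / 1e9
    print(f"n = {n:,}: full matrix {full_gb:,.0f} GB, "
          f"upper triangle {half_gb:,.0f} GB")
```

Under these assumptions, a full matrix needs about 80 GB at n = 100,000 and about 320 GB at n = 200,000. Storing only the upper triangle roughly halves that, which is still well beyond the memory of a typical single compute node.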

For more information:
Visit:  http://www.mssm.edu/research/institutes/genomics-institute
Email:  peter.warburton@mssm.edu
Last Update:  April 16, 2012