The Charles Bronfman Institute for Personalized Medicine

The Center for Genomic Data Analytics

The Center for Genomic Data Analytics (CGDA) is part of The Charles Bronfman Institute for Personalized Medicine (CBIPM) at the Icahn School of Medicine at Mount Sinai (ISMMS). The overarching mission of the Center for Genomic Data Analytics is to develop and apply innovative tools and techniques for genomic data science that take advantage of large population-scale biobanks.

Our goals at the CGDA are:

  • To develop and apply new tools and approaches for genomic data science, particularly in maximizing the use of large-scale population-based biobanks.
  • To build user-friendly web-based graphical user interfaces and repositories that facilitate sharing of genomic tools, outputs, and other resources.
  • To become an analytical hub for interdisciplinary collaboration among researchers and clinicians.

Within the Center for Genomic Data Analytics, we have highlighted selected areas of ongoing research:

Defining and analyzing new digital biomarkers with machine learning. We apply state-of-the-art machine learning and AI methods to build new digital biomarkers for disease from electronic health records (EHR). In the hands of physicians, these biomarkers have the potential to accelerate diagnosis and improve treatment choices. They also provide valuable insights into disease biology and pathogenesis. We are currently developing digital biomarkers for additional diseases and analyzing their properties across diseases. We also focus on developing methods to incorporate multiple modalities of data into digital biomarker analyses, including data from whole genome and whole exome sequences, proteomics, metabolomics, and pharmacogenomics. Our goal is to develop a set of digital biomarkers and a sophisticated suite of downstream genetic analyses powered by them.

Fine-grained annotation of genetic variation. We design sophisticated and innovative statistical approaches to characterize the effects of genetic variation beyond a simple pathogenic-benign or constrained-unconstrained annotation. Using population-scale genomic data, we develop and publish variant-level annotations of penetrance, mode of inheritance, and horizontal pleiotropy across the human genome, and apply these annotations to characterize the population-wide impact of these features. We continue to refine and improve these annotations and the methods behind them to take advantage of new population-scale genomic data that is constantly becoming available. Our goal is to expand the scope of computational prediction of variant effects from simple binary predictions to detailed personalized phenotypic profiles.

Scaling causal inference to population-scale data. We develop and apply statistical methods to infer biological causation from genomic data, including approaches based on Mendelian Randomization and colocalization. By adapting these approaches and implementing scalable pipelines that apply them to very large genome-wide and phenome-wide datasets, we can leverage population-scale data to produce new and unexpected insights into biology. Our goal is to build tools for causal analysis that both scale easily to dense data from millions of samples and power innovative analyses.

Informing drug outcomes using human genetics. We develop tools to prioritize drug targets with genetic evidence. We have developed web interfaces where users can search for gene targets that are linked to specific drug outcomes and supported by genetic association evidence. These prioritization schemes can inform the selection of targets for drug discovery.

The Center for Genomic Data Analytics is composed of a team dedicated to research.

Ron Do, PhD, Director
Ron Do, PhD, is a Professor in the Charles Bronfman Institute for Personalized Medicine, Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai. His research lies at the intersection of human genetics, statistical genetics, and population genetics, with a focus on understanding the genetic, biological, and clinical basis of human diseases.

Daniel Jordan, PhD, Director of Computational Genomics
Daniel Jordan is a computational biologist who studies the complex relationship between genotype, phenotype, and natural selection. He received his PhD from Harvard in 2015 and completed postdoctoral training at Icahn School of Medicine at Mount Sinai in 2020. His research focuses on developing methods to predict the effects of genetic variants on disease and explore the hidden causal structure underlying disease biology using statistical inference, evolutionary theory, and machine learning.

Ghislain Rocheleau, PhD, Director of Statistical Genetics
Ghislain holds a PhD in Statistics from the University of Montreal. During his postdoctoral training at McGill University, he actively took part in the first genome-wide association study (GWAS) in type 2 diabetes, and later coauthored several papers in genomic endocrinology. From 2011 to 2016, he was a member of the European Genomic Institute for Diabetes and Maître de Conférences at the University of Lille, France. His research interests focus on the development of new statistical analysis methods applied to genomic data, mainly those generated by genetic association studies.

Ha My Vy, PhD, Research Scientist
Ha My completed her bachelor’s in Theoretical Physics at Hanoi National University of Education in Vietnam and earned her PhD in population genetics at Ewha Womans University in South Korea. During her PhD, Ha My worked on developing theoretical tools for detecting incomplete selective sweeps from sequence polymorphism and applying them to Drosophila population genomic data. Her current research focuses on investigating the impact of natural selection on human biology and disease and investigating polygenic risk scores for disease.

Ben Omega Petrazzini, B.Sc., Associate Bioinformatician
Ben Omega is a Uruguayan biologist developing machine learning models to predict and characterize complex diseases using clinical records from the Mount Sinai Hospital. His current work combines cutting-edge AI models with genetic association studies.

Members of The Center for Genomic Data Analytics have developed the following tools and resources:

MR-PRESSO. The MR-PRESSO R package evaluates horizontal pleiotropy in multi-instrument Mendelian Randomization using genome-wide association study summary statistics. MR-PRESSO detects horizontal pleiotropy, corrects for it by removing outliers (invalid instruments), and tests whether the causal estimate is significantly distorted by comparing it before and after outlier removal.
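The detect, remove, and compare steps described above can be illustrated with a small Python sketch. This is a simplified illustration under stated assumptions, not the MR-PRESSO implementation or its R interface: it uses a plain inverse-variance weighted estimate, scores each variant by its leave-one-out residual, and drops only the single worst-fitting variant. It performs no simulation-based significance testing, and all function names are hypothetical.

```python
def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Inverse-variance weighted causal estimate (weighted regression through the origin)."""
    weights = [1.0 / (se * se) for se in se_outcome]
    numerator = sum(w * bx * by for w, bx, by in zip(weights, beta_exposure, beta_outcome))
    denominator = sum(w * bx * bx for w, bx in zip(weights, beta_exposure))
    return numerator / denominator


def leave_one_out_scores(beta_exposure, beta_outcome, se_outcome):
    """Squared standardized residual of each variant against the estimate fitted without it."""
    scores = []
    for i in range(len(beta_exposure)):
        estimate = ivw_estimate(
            beta_exposure[:i] + beta_exposure[i + 1:],
            beta_outcome[:i] + beta_outcome[i + 1:],
            se_outcome[:i] + se_outcome[i + 1:],
        )
        residual = beta_outcome[i] - estimate * beta_exposure[i]
        scores.append((residual / se_outcome[i]) ** 2)
    return scores


def drop_largest_outlier(beta_exposure, beta_outcome, se_outcome):
    """Remove the worst-fitting variant and report the estimate before and after.

    Assumption: a single candidate outlier is dropped unconditionally; no
    null distribution is simulated, so nothing here is a significance test.
    Raises ZeroDivisionError if the estimate after removal is exactly zero.
    """
    scores = leave_one_out_scores(beta_exposure, beta_outcome, se_outcome)
    worst = max(range(len(scores)), key=scores.__getitem__)
    keep = [i for i in range(len(scores)) if i != worst]
    before = ivw_estimate(beta_exposure, beta_outcome, se_outcome)
    after = ivw_estimate(
        [beta_exposure[i] for i in keep],
        [beta_outcome[i] for i in keep],
        [se_outcome[i] for i in keep],
    )
    # Relative change in the estimate caused by removing the outlier.
    distortion = (before - after) / abs(after)
    return worst, before, after, distortion


if __name__ == "__main__":
    # Made-up summary statistics for four variants; the last one is inconsistent.
    bx = [0.1, 0.2, 0.3, 0.4]
    by = [0.05, 0.10, 0.15, 0.60]
    se = [0.01, 0.01, 0.01, 0.01]
    print(drop_largest_outlier(bx, by, se))
```

Inputs are plain Python lists of per-variant effect sizes and standard errors, so the sketch needs no third-party packages.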

HOPS. The HOPS R package computes a horizontal pleiotropy score for each variant in GWAS summary statistics. The package includes a Shiny tool to visualize and download the full set of HOPS results obtained for ~770,000 variants in 372 heritable traits of the UK Biobank project.

biPheMap. biPheMap allows the exploration of a phenome-wide network map of colocalized genes and phenotypes from the UK Biobank project. Colocalized signals were uncovered by combining GTEx eQTLs in 48 tissues and GWAS loci of ~1,400 phenotypes. Clusters of biologically related genes and phenotypes can be visualized in an R Shiny app.

srMLGenes. srMLGenes is an interactive web interface for investigating recessive selection in human population data. It allows the exploration of inferences about dominance and selection from simulations, as well as gene enrichments in different dominance and selection categories. srMLGenes can also be installed to run locally on your computer.

Disease characterization and risk prediction using EHR. This machine learning-based workflow characterizes complex diseases on a spectrum and predicts risk at a desired time before diagnosis. The workflow uses vital measurements, laboratory test results, medication prescriptions, and diagnostic measurements from EHR to train tree-based models.

Disease characterization and risk prediction using metabolite data. This machine learning-based workflow characterizes complex diseases on a spectrum and predicts risk at a desired time before diagnosis. The workflow uses metabolite data from a blood test panel to train tree-based models.

MOI-Pred. MOI-Pred is a machine learning-based predictor of recessive inheritance in human missense variants. A reduced set of only 19 evolutionary and functional features is needed to predict recessive inheritance for any missense variant in the human genome. Pre-computed recessive inheritance predictions for 71M missense variants in the hg38 human reference genome are available at
