Mount Sinai Center for Bioinformatics


At the Mount Sinai Center for Bioinformatics, we develop algorithms, pipelines, web-based software systems, and databases that help experimental biologists analyze their data and unravel the regulatory networks within mammalian cells. We use a variety of mathematical and computational methods, such as machine learning and dimensionality reduction, to organize data for further discovery and to make predictions.

We were awarded funding from the NIH to establish the following centers and resources:

The NIH Common Fund (CF) programs have produced transformative datasets, databases, methods, bioinformatics tools, and workflows that are significantly advancing biomedical research in the United States and worldwide. Currently, CF programs are mostly isolated from one another. However, integrating data across CF programs has the potential to enable synergistic discoveries. To address this challenge, the NIH established the Common Fund Data Ecosystem (CFDE) program. Our team was selected to establish the Data Resource Center (DRC) for the CFDE. We are tasked with producing two main products: the CFDE information portal and the CFDE data resource portal. The CFDE data resource portal contains the metadata, data, workflows, and tools produced by the CF programs and their data coordination centers (DCCs). The portal provides processed data in various formats, including: 1) knowledge graph assertions; 2) gene, drug, metabolite, and other set libraries; 3) data matrices ready for machine learning and other AI applications; and 4) uniformly formatted metadata. In addition, the portal provides the extract, transform, and load (ETL) scripts used to process the data into these formats. To achieve these goals, we work collaboratively with the other CFDE centers, the participating CFDE DCCs, the CFDE NIH team, and relevant external entities and potential consumers of these resources, with the aim of developing a lively and productive Common Fund Data Ecosystem.

Learn more about the award

CFDE Data Portal

Many independent cancer-related studies that employ bulk and single-cell RNA-seq remain under-used because of their limited findability, accessibility, interoperability, and reusability. The data from these studies can be found in the Gene Expression Omnibus (GEO), but they are provided mostly as raw FASTQ files with non-uniform metadata annotations. While some studies provide aligned read files, these are processed non-uniformly. This shortcoming makes it difficult to query and integrate these data across studies and with additional external data. To bridge the gap between RNA-seq data generation and RNA-seq data processing and reuse, we developed the resource All RNA-seq and ChIP-Seq Sample and Signature Search (ARCHS4). ARCHS4 provides processed RNA-seq data from GEO to support retrospective data analyses and reuse. ARCHS4 caters to users with different levels of computational expertise and has already been employed for many post-hoc analyses and projects. The goals of the project go far beyond providing researchers with direct access to RNA-seq data through a web-based user interface. We use the ARCHS4 data for numerous applications, including transforming other transcriptomics data into RNA-seq-like profiles with deep learning; identifying pathogenic sequences in human RNA-seq samples; predicting gene function from co-expression data, including ways to modulate the expression of long non-coding RNAs with small molecules; and, most importantly, using the cost-effective ARCHS4 infrastructure to provide a free FASTQ alignment service to the community.

Learn more about the award

ARCHS4: Uniform alignment of all human and mouse RNA-seq samples from the Gene Expression Omnibus (GEO)


Massive mining of publicly available RNA-seq data from human and mouse. Nature Communications 2018 Apr 10;9(1):1366.

The diabetes research community is producing omics datasets at a rapidly growing rate. However, such published data are underutilized for knowledge discovery. To make bioinformatics tools and published omics datasets from the diabetes field more accessible to biomedical researchers, we developed the Diabetes Data and Hypothesis Hub (D2H2). D2H2 contains hundreds of high-quality curated transcriptomics datasets relevant to diabetes, accessible via a user-friendly web-based portal. The collected and processed datasets are curated from the Gene Expression Omnibus (GEO). Each curated study has a dedicated page that provides data visualization, differential gene expression analysis, and single-gene queries. To enable the investigation of these curated datasets and to provide easy access to bioinformatics tools that serve gene and gene set-related knowledge, we developed the D2H2 chatbot. The chatbot, which utilizes GPT, prompts users to enter free text about their data analysis needs. By parsing the user prompt together with information about all of the tools and workflows available in D2H2, we answer user queries by invoking the most relevant tools via their APIs. D2H2 also has a hypothesis generation module in which gene sets are randomly selected from the precomputed bulk RNA-seq signatures. We then find highly overlapping gene sets extracted from publications listed in PubMed Central, selecting those with dissimilar abstracts. With the help of GPT, we hypothesize about a possible explanation for the high overlap between the identified gene sets. Overall, D2H2 is a platform that provides a suite of bioinformatics tools and curated transcriptomics datasets for hypothesis generation for the diabetes research community.
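The gene set overlap step in the hypothesis generation module can be illustrated with a minimal sketch. The text above does not say which overlap statistic D2H2 uses, so this example assumes two common choices: the Jaccard index and a right-tailed hypergeometric p-value. The function names, the example gene sets, and the background size of 20,000 genes are all hypothetical, and the abstract dissimilarity filter is not modeled here.

```python
from math import comb


def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_p_value(a, b, background_size):
    """Right-tailed hypergeometric p-value for the overlap of two gene sets.

    Probability of seeing at least the observed overlap if set b were drawn
    at random from a background of `background_size` genes (an assumed value).
    """
    a, b = set(a), set(b)
    k = len(a & b)
    n_a, n_b = len(a), len(b)
    total = comb(background_size, n_b)
    tail = sum(
        comb(n_a, i) * comb(background_size - n_a, n_b - i)
        for i in range(k, min(n_a, n_b) + 1)
    )
    return tail / total


def rank_overlaps(query, library, background_size=20000):
    """Score every library gene set against the query, most significant first."""
    rows = [
        (name, jaccard(query, genes), overlap_p_value(query, genes, background_size))
        for name, genes in library.items()
    ]
    return sorted(rows, key=lambda row: row[2])


# Hypothetical signature and publication-derived gene sets, for illustration only.
query = {"INS", "GCG", "PDX1", "MAFA"}
library = {
    "publication_set_a": {"INS", "PDX1", "MAFA", "NKX6-1"},
    "publication_set_b": {"TP53", "MYC"},
}
ranked = rank_overlaps(query, library)
for name, score, p in ranked:
    print(f"{name}\tjaccard={score:.2f}\tp={p:.3g}")
```

In a setting like the one described, the top-ranked sets would then be filtered by abstract dissimilarity before being passed on for hypothesis generation; that filtering is outside the scope of this sketch.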

Learn more about the award

D2H2: Platform to facilitate data-driven hypotheses for the diabetes research community


D2H2: diabetes data and hypothesis hub. Bioinformatics Advances 2023 Dec 4;3(1):vbad178.