Mount Sinai Center for Bioinformatics


We collaborate with researchers at the Icahn School of Medicine at Mount Sinai and at other institutions by analyzing their data with tools and pipelines that we developed. In particular, the Center addresses the need to analyze, visualize, and mine data from omics studies, such as transcriptomics, epigenomics, proteomics, and metabolomics, for drug discovery.

Software Tools

We have developed several powerful and popular web-based software tools that can be used to discover new knowledge from data and to predict small molecules as novel leads in a variety of projects involving different data types.

Gene-List Enrichment Analysis

Enrichr is an integrative web-based and mobile gene-list enrichment analysis tool. It includes 156 gene-set libraries, an alternative approach to ranking enriched terms, and interactive visualizations of enrichment results built with the JavaScript library Data-Driven Documents (D3). Enrichr is open source and freely available online. Users can embed it in any tool that performs gene-list analysis.

Enrichr: a comprehensive gene set enrichment analysis web server 2016 update
Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool
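
To make the idea of gene-list enrichment concrete, here is a minimal sketch of a standard over-representation test (a hypergeometric tail probability) applied to a gene list and a small gene-set library. It is not Enrichr's own ranking method, which this page does not describe. The library contents, term names, and the background size of 20,000 genes are assumptions made for illustration.

```python
from math import comb


def hypergeom_pvalue(overlap, list_size, set_size, background_size):
    """P(X >= overlap) when list_size genes are drawn from a background
    that contains set_size genes annotated to the term."""
    max_k = min(list_size, set_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(set_size, k) * comb(background_size - set_size, list_size - k)
        for k in range(overlap, max_k + 1)
    )
    return tail / total


def enrich(gene_list, library, background_size=20000):
    """Return (term, p-value, overlapping genes) sorted by p-value."""
    genes = {g.upper() for g in gene_list}
    results = []
    for term, members in library.items():
        hits = genes & members
        if not hits:
            continue  # skip terms with no overlap
        p = hypergeom_pvalue(len(hits), len(genes), len(members), background_size)
        results.append((term, p, sorted(hits)))
    return sorted(results, key=lambda r: r[1])


# Hypothetical toy library; real gene-set libraries hold many more terms.
toy_library = {
    "toy_cell_cycle": {"CDK1", "CCNB1", "CDC20", "PLK1", "BUB1"},
    "toy_apoptosis": {"CASP3", "BAX", "BCL2", "TP53", "FAS"},
}

results = enrich(["cdk1", "CCNB1", "PLK1", "BAX"], toy_library)
for term, p, hits in results:
    print(f"{term}\t{p:.3g}\t{','.join(hits)}")
```

The term sharing three of the four input genes ranks first. A production tool would also correct the p-values for multiple testing across all terms in the library.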

modEnrichr: A Suite of Gene Set Enrichment Analysis Tools for Model Organisms

modEnrichr is an expansion of Enrichr for four model organisms: fish, fly, worm, and yeast. The modEnrichr suite can convert gene lists across species using an ortholog conversion tool that automatically detects the species.

modEnrichr: a suite of gene set enrichment analysis tools for model organisms
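
The ortholog conversion with automatic species detection can be pictured with the sketch below. It assumes one simple detection rule, namely picking the species whose table recognizes the most input genes; the rule modEnrichr actually uses is not stated here. The tables are tiny, simplified, and purely illustrative.

```python
# Hypothetical, simplified ortholog tables: model-organism gene -> human symbol.
ORTHOLOGS = {
    "fly": {"dpp": "BMP4", "wg": "WNT1", "hh": "SHH"},
    "worm": {"lin-12": "NOTCH1", "daf-2": "INSR", "let-60": "KRAS"},
    "yeast": {"CDC28": "CDK1", "RAS2": "KRAS", "TOR1": "MTOR"},
}


def detect_species(genes, tables=ORTHOLOGS):
    """Guess the species as the table that recognizes the most input genes."""
    counts = {sp: sum(g in table for g in genes) for sp, table in tables.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def convert(genes, tables=ORTHOLOGS):
    """Return (detected species, human orthologs of the recognized genes)."""
    species = detect_species(genes, tables)
    if species is None:
        return None, []
    table = tables[species]
    return species, [table[g] for g in genes if g in table]


species, human_genes = convert(["daf-2", "let-60", "not-a-gene"])
print(species, human_genes)
```

Unrecognized identifiers are dropped silently in this sketch; a real tool would likely report them back to the user.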

Biological Knowledge Engine

The Harmonizome is a biological knowledge engine built on information about genes and proteins from 114 datasets. To create it, we distilled information from the original datasets into attribute tables that define significant associations between genes and attributes. Depending on the dataset, attributes can be genes, proteins, cell lines, tissues, experimental perturbations, diseases, phenotypes, or drugs. Gene and protein identifiers were mapped to NCBI Entrez Gene Symbols, and attributes were mapped to appropriate ontologies. We also computed gene-gene and attribute-attribute similarity networks from the attribute tables. These attribute tables and similarity networks can be integrated to perform many types of computational analyses for knowledge discovery and hypothesis generation.

The harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins
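
As a rough illustration of how a similarity network can be derived from an attribute table, the sketch below represents the table as a mapping from each gene to its set of associated attributes and links genes whose attribute sets overlap strongly. The Jaccard index and the 0.5 threshold are assumptions chosen for simplicity; the similarity measure used for the Harmonizome is not specified on this page. Gene and attribute names are placeholders.

```python
from itertools import combinations

# Hypothetical attribute table: gene -> set of associated attributes (tissues here).
attribute_table = {
    "GENE_A": {"liver", "kidney"},
    "GENE_B": {"liver", "kidney", "lung"},
    "GENE_C": {"brain"},
}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_network(table, threshold=0.5):
    """Return edges (node1, node2, score) for pairs scoring at or above threshold."""
    edges = []
    for n1, n2 in combinations(sorted(table), 2):
        score = jaccard(table[n1], table[n2])
        if score >= threshold:
            edges.append((n1, n2, score))
    return edges


def transpose(table):
    """Flip gene -> attributes into attribute -> genes."""
    flipped = {}
    for gene, attrs in table.items():
        for attr in attrs:
            flipped.setdefault(attr, set()).add(gene)
    return flipped


gene_edges = similarity_network(attribute_table)
attribute_edges = similarity_network(transpose(attribute_table))
print(gene_edges)
print(attribute_edges)
```

Running the same function on the transposed table yields the attribute-attribute network, so one routine covers both network types.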

All RNA-seq and ChIP-seq Signature Search Space

ARCHS4 provides access to gene counts from HiSeq 2000, HiSeq 2500 and NextSeq 500 platforms for human and mouse experiments from GEO and SRA. The website offers the data for download in H5 format for programmatic access, as well as a three-dimensional view of the sample and gene spaces. Search features allow users to browse the data by metadata annotation, submit their own up and down gene sets, and explore matching samples enriched for annotated gene sets. Selected sample sets can be downloaded into a tab-separated text file through auto-generated R scripts for further analysis. Reads are aligned with Kallisto using a custom cloud computing platform. Human samples are aligned against the GRCh38 human reference genome, and mouse samples against the GRCm38 mouse reference genome.

Massive Mining of Publicly Available RNA-seq Data from Human and Mouse

L1000 Characteristic Direction Signature Search Engine

L1000CDS2 finds consensus signatures that match a user’s input gene lists or input signatures. The underlying dataset is the LINCS L1000 small molecule expression profiles generated at the Broad Institute by the Connectivity Map team. We computed the differentially expressed genes for these profiles using our multivariate method, the Characteristic Direction.

L1000CDS2: LINCS L1000 characteristic direction signatures search engine

L1000 Fireworks Display

L1000FWD is a web application that provides interactive visualization of over 17,000 drug and small-molecule induced gene expression signatures. Signatures can be colored by attributes such as cell type, time point, and concentration, as well as by drug attributes such as mechanism of action (MOA) and clinical phase. A signature similarity search finds signatures that mimic or oppose an input pair of up and down gene sets. Each point on the L1000FWD interactive map is linked to a signature landing page, which provides multifaceted knowledge from various sources about the signature and the drug. Notably, this information includes the most frequent diagnoses, co-prescribed drugs, and the age distribution of prescriptions, as extracted from the Mount Sinai Health System electronic medical records (EMR). Overall, L1000FWD serves as a platform for identifying functions for novel small molecules using unsupervised clustering, as well as for exploring drug MOA.

L1000FWD: fireworks visualization of drug-induced transcriptomic signatures
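
The search for mimicking or opposing signatures from up and down gene sets can be sketched with a simple signed overlap score. This is an illustrative scoring scheme, not necessarily the one L1000FWD implements. The signature identifiers and gene names below are placeholders.

```python
# Hypothetical signature database: id -> up-regulated and down-regulated gene sets.
signatures = {
    "sig_001": {"up": {"A", "B", "C"}, "down": {"X", "Y"}},
    "sig_002": {"up": {"X", "Y"}, "down": {"A", "B"}},
    "sig_003": {"up": {"M"}, "down": {"N"}},
}


def signed_overlap(query_up, query_down, sig):
    """Score in [-1, 1]: positive means mimicking, negative means opposing."""
    agree = len(query_up & sig["up"]) + len(query_down & sig["down"])
    oppose = len(query_up & sig["down"]) + len(query_down & sig["up"])
    total = len(query_up) + len(query_down)
    return (agree - oppose) / total if total else 0.0


def search(query_up, query_down, database):
    """Rank all signatures from most mimicking to most opposing."""
    scored = [
        (sig_id, signed_overlap(query_up, query_down, sig))
        for sig_id, sig in database.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranked = search({"A", "B"}, {"X"}, signatures)
for sig_id, score in ranked:
    print(sig_id, score)
```

The top of the ranked list holds candidate mimickers and the bottom holds candidate reversers, so a single pass serves both kinds of query.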

Automatically Generate RNA-seq Data Analysis Notebooks

BioJupies is a web server that enables automated creation, storage, and deployment of Jupyter Notebooks containing RNA-seq data analyses. Through an intuitive interface, novice users can rapidly generate tailored reports that analyze and visualize their own raw sequencing files, their own gene expression tables, or data fetched from >9,000 published studies containing >300,000 preprocessed RNA-seq samples. Generated notebooks contain the executable code of the entire pipeline, rich narrative text, interactive data visualizations, and differential expression and enrichment analyses. The notebooks are permanently stored in the cloud and made available online through a persistent URL.

BioJupies: Automated Generation of Interactive Notebooks for RNA-Seq Data Analysis in the Cloud

Identify Drugs and Small Molecules to Regulate Expression of Target Genes

Drug Gene Budger (DGB) is a web-based and mobile application that helps investigators prioritize small molecules predicted to maximally influence the expression of a target gene of interest. With DGB, users enter a gene symbol and specify whether they want to upregulate or downregulate its expression. The output is a ranked list of small molecules that have been experimentally determined to produce the desired expression effect. For each small molecule, the table reports the log-transformed fold change, p-value, and q-value, which indicate the significance of differential expression as determined by the limma method.

Drug Gene Budger (DGB): An application for ranking drugs to modulate a specific gene based on transcriptomic signatures
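
The kind of ranking described for DGB can be sketched as a filter and sort over per-drug statistics for one target gene. The ordering rule (largest absolute log fold change first) and the q-value cutoff of 0.05 are assumptions; the page does not state how DGB orders its table. Compound names and numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class DrugEffect:
    drug: str
    log_fc: float   # log-transformed fold change of the target gene
    p_value: float
    q_value: float  # p-value adjusted for multiple testing


def rank_drugs(rows, direction, max_q=0.05):
    """Keep significant drugs that move the gene in the requested direction,
    ordered by effect size (largest absolute log fold change first)."""
    if direction not in ("up", "down"):
        raise ValueError("direction must be 'up' or 'down'")
    sign = 1 if direction == "up" else -1
    hits = [r for r in rows if sign * r.log_fc > 0 and r.q_value <= max_q]
    return sorted(hits, key=lambda r: abs(r.log_fc), reverse=True)


# Hypothetical statistics for a single target gene.
rows = [
    DrugEffect("compound_A", 2.1, 0.0001, 0.001),
    DrugEffect("compound_B", -1.5, 0.002, 0.01),
    DrugEffect("compound_C", 0.8, 0.09, 0.2),
    DrugEffect("compound_D", 1.2, 0.01, 0.03),
]

up_hits = rank_drugs(rows, "up")
down_hits = rank_drugs(rows, "down")
print([r.drug for r in up_hits], [r.drug for r in down_hits])
```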

Linking Expression Signatures to Upstream Cell Signaling Networks

X2K Web infers upstream regulatory networks from signatures of differentially expressed genes. By combining transcription factor enrichment analysis, protein-protein interaction network expansion, and kinase enrichment analysis, X2K Web produces inferred networks of transcription factors, proteins, and kinases predicted to regulate the expression of the input gene list. X2K Web provides the results as tables and interactive vector graphic figures that can be readily embedded within publications.

eXpression2Kinases (X2K) Web: linking expression signatures to upstream cell signaling networks
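
The three-step pipeline can be outlined in code as follows. This sketch substitutes plain overlap counts for the enrichment statistics a real analysis would use, and it expands the network only to direct interaction partners. All library contents, node names, and cutoffs are hypothetical.

```python
def top_by_overlap(query, library, n):
    """Rank library entries by how many of their members appear in query."""
    scored = [(name, len(query & members)) for name, members in library.items()]
    scored = [item for item in scored if item[1] > 0]
    scored.sort(key=lambda item: (-item[1], item[0]))  # ties broken by name
    return [name for name, _ in scored[:n]]


def expand_network(seeds, ppi_edges):
    """Add every protein that interacts directly with a seed protein."""
    nodes = set(seeds)
    for a, b in ppi_edges:
        if a in seeds or b in seeds:
            nodes.update((a, b))
    return nodes


def x2k_sketch(genes, tf_library, ppi_edges, kinase_library, n_tfs=2, n_kinases=2):
    tfs = top_by_overlap(set(genes), tf_library, n_tfs)      # step 1: TFs
    network = expand_network(set(tfs), ppi_edges)            # step 2: PPI expansion
    kinases = top_by_overlap(network, kinase_library, n_kinases)  # step 3: kinases
    return {"tfs": tfs, "network": network, "kinases": kinases}


# Hypothetical toy inputs.
tf_library = {"TF1": {"g1", "g2", "g3"}, "TF2": {"g3", "g4"}, "TF3": {"g9"}}
ppi_edges = [("TF1", "P1"), ("TF2", "P1"), ("TF3", "P2"), ("P1", "P3")]
kinase_library = {"K1": {"TF1", "P1"}, "K2": {"P2"}, "K3": {"TF2"}}

result = x2k_sketch(["g1", "g2", "g3", "g4"], tf_library, ppi_edges, kinase_library)
print(result)
```

Each step feeds the next: the top transcription factors seed the network expansion, and the expanded network becomes the query for the kinase step.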

Visualization and Analysis Tool for High-Dimensional Biological Data

Clustergrammer is a web-based visualization tool with interactive features such as zooming, panning, filtering, reordering, sharing, enrichment analysis, and dynamic gene annotations. Clustergrammer can generate shareable interactive visualizations from a data table uploaded to the website, or it can be embedded in Jupyter Notebooks.

Clustergrammer, a web-based heatmap visualization and analysis tool for high-dimensional biological data