Mount Sinai Center for Bioinformatics

Resources

We collaborate with researchers both within the Icahn School of Medicine at Mount Sinai and elsewhere by analyzing their data with the tools and pipelines we developed. In particular, the Center focuses on the strong need for analysis, visualization, and mining of data from omics studies such as transcriptomics, epigenomics, proteomics, and metabolomics for drug discovery.

Software Tools

We have developed several powerful and popular web-based software tools that can be used to discover new knowledge from data, and predict small molecules as novel leads, for a variety of projects involving different data types.

Gene-List Enrichment Analysis

Enrichr is an integrative web-based and mobile gene-list enrichment analysis tool that includes 232 gene-set libraries, an alternative approach to rank enriched terms, and various interactive visualization approaches to display enrichment results using the JavaScript library Data-Driven Documents (D3). Enrichr is open source and freely available online. Users can easily embed this software into any tool that performs gene list analysis.

Publications:
Gene set knowledge discovery with Enrichr
Enrichr: a comprehensive gene set enrichment analysis web server 2016 update
Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool

modEnrichr: A Suite of Gene Set Enrichment Analysis Tools for Model Organisms

An expansion of Enrichr for four model organisms: fish, fly, worm and yeast. The modEnrichr suite of tools provides the ability to convert gene lists across species using an ortholog conversion tool that automatically detects the species.

Publication:
modEnrichr: a suite of gene set enrichment analysis tools for model organisms

Biological Knowledge Engine

The Harmonizome is a biological knowledge engine built on information about genes and proteins from 140 datasets. To create it, we distilled information from the original datasets into attribute tables that define significant associations between genes and attributes, where attributes could be genes, proteins, cell lines, tissues, experimental perturbations, diseases, phenotypes, or drugs, depending on the dataset. Gene and protein identifiers were mapped to NCBI Entrez Gene Symbols, and attributes were mapped to appropriate ontologies. We also computed gene-gene and attribute-attribute similarity networks from the attribute tables. These attribute tables and similarity networks can be integrated to perform many types of computational analyses for knowledge discovery and hypothesis generation.
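As a rough illustration of how a gene-gene similarity network can be derived from an attribute table, the sketch below compares genes by the attributes they share. The Jaccard index, the 0.5 threshold, and the gene and attribute names are assumptions chosen for illustration; the text above does not state which similarity measure the Harmonizome uses.

```python
from itertools import combinations

# Toy attribute table: gene -> set of associated attributes.
# All names are illustrative placeholders, not Harmonizome data.
attribute_table = {
    "GENE_A": {"liver", "kidney", "drug_x"},
    "GENE_B": {"liver", "kidney"},
    "GENE_C": {"brain"},
}


def jaccard(a, b):
    """Jaccard index of two sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def gene_similarity_network(table, threshold=0.5):
    """Return edges (gene1, gene2, similarity) with similarity >= threshold."""
    edges = []
    for g1, g2 in combinations(sorted(table), 2):
        sim = jaccard(table[g1], table[g2])
        if sim >= threshold:
            edges.append((g1, g2, sim))
    return edges


edges = gene_similarity_network(attribute_table)
for g1, g2, sim in edges:
    print(f"{g1} -- {g2}: {sim:.2f}")
```

An attribute-attribute network could be built the same way after inverting the table so that each attribute maps to its set of genes.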

Publication:
The harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins

Collection of Web-based Applications to Execute Bioinformatics Workflows

Appyters extend Jupyter notebooks to broaden their accessibility by turning Jupyter notebooks into fully functional standalone web-based bioinformatics applications. Each Appyter presents to users an entry form enabling them to upload their data and set various parameters for executing a multitude of bioinformatics analysis pipelines. Once the form is filled, the Appyter executes the corresponding notebook in the cloud, producing a report without requiring the user to interact directly with the code. Appyters can be applied to a variety of workflows including building customized machine learning pipelines, analyzing omics data, and producing publishable figures.

Publication:
Appyters: Turning Jupyter Notebooks into data-driven web apps

All RNA-seq and ChIP-seq Signature Search Space

ARCHS4 provides access to gene counts from HiSeq 2000, HiSeq 2500 and NextSeq 500 platforms for human and mouse experiments from GEO and SRA. The website enables downloading of the data in H5 format for programmatic access, and offers a 3-dimensional view of the sample and gene spaces. Search features allow users to browse the data by metadata annotation, submit their own up and down gene sets, and explore matching samples enriched for annotated gene sets. Selected sample sets can be downloaded into a tab-separated text file through auto-generated R scripts for further analysis. Reads are aligned with Kallisto using a custom cloud computing platform. Human samples are aligned against the GRCh38 human reference genome, and mouse samples against the GRCm38 mouse reference genome.

Publication:
Massive Mining of Publicly Available RNA-seq Data from Human and Mouse

Massive Mining of Gene Expression Signatures from the Gene Expression Omnibus (GEO)

RummaGEO is a web server application that enables gene expression signature search against all human and mouse RNA-seq studies deposited into GEO. To enable such a search engine, we performed offline automatic identification of conditions from uniformly aligned GEO studies available from ARCHS4, and then computed differential expression signatures to extract gene sets from these signatures.
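The step from a differential expression signature to gene sets can be sketched as follows. This is a simplified illustration only: the log2 fold change of group means, the pseudocount, the `top_n` cutoff, and the toy expression values are all assumptions, and the description above does not specify the differential expression method RummaGEO actually applies.

```python
from math import log2
from statistics import mean


def signature(control, perturbed, pseudocount=1.0):
    """Log2 fold change per gene between perturbed and control group means."""
    return {
        gene: log2((mean(perturbed[gene]) + pseudocount)
                   / (mean(control[gene]) + pseudocount))
        for gene in control
        if gene in perturbed
    }


def extract_gene_sets(sig, top_n=2):
    """Return (up, down) gene sets: the top_n most increased and decreased genes."""
    ranked = sorted(sig, key=sig.get)  # ascending by fold change
    down = [g for g in ranked[:top_n] if sig[g] < 0]
    up = [g for g in reversed(ranked[-top_n:]) if sig[g] > 0]
    return up, down


# Toy expression values (hypothetical genes and counts).
control = {"G1": [1, 1], "G2": [7, 7], "G3": [3, 3], "G4": [3, 3]}
perturbed = {"G1": [7, 7], "G2": [1, 1], "G3": [3, 3], "G4": [7, 7]}

sig = signature(control, perturbed)
up_genes, down_genes = extract_gene_sets(sig)
print("up:", up_genes, "down:", down_genes)
```

Genes with no change are excluded from both sets even when they fall inside the top-ranked window, which keeps small signatures from producing overlapping up and down sets.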

Publication:
RummaGEO: Automatic mining of human and mouse gene sets from GEO

ChIP-X Enrichment Analysis Version 3

A transcription factor enrichment analysis tool that ranks TFs associated with user-submitted gene sets. The ChEA3 background database contains a collection of gene set libraries generated from multiple sources including TF–gene co-expression from RNA-seq studies, TF–target associations from ChIP-seq experiments, and TF–gene co-occurrence computed from crowd-submitted gene lists.

Publication:
ChEA3: transcription factor enrichment analysis by orthogonal omics integration

Kinase Enrichment Analysis 3

Infers upstream kinases whose putative substrates are overrepresented in a user-inputted list of genes or differentially phosphorylated proteins. The KEA3 database contains putative kinase-substrate interactions collected from publicly available datasets. Gene sets of putative kinase substrates are used as the primary units of analysis in KEA3. These gene sets are organized in gene set "libraries". Libraries are supersets of kinase substrate sets that are aggregated based on the database from which they are derived.
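To make "overrepresented" concrete, here is a minimal sketch of a generic overrepresentation test applied to one library of kinase substrate sets. It uses the upper tail of the hypergeometric distribution, which is a common choice for this kind of analysis; whether KEA3 uses exactly this test is not stated above, and the kinase names, gene names, and background size are hypothetical.

```python
from math import comb


def hypergeom_pvalue(overlap, set_size, list_size, background_size):
    """P(X >= overlap) when list_size genes are drawn from a background of
    background_size genes, set_size of which are substrates of the kinase."""
    max_k = min(set_size, list_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(set_size, k) * comb(background_size - set_size, list_size - k)
        for k in range(overlap, max_k + 1)
    )
    return tail / total


def rank_kinases(input_genes, library, background_size):
    """Rank kinases in one library by overrepresentation p-value (ascending).
    Assumes every input gene and substrate belongs to the background."""
    input_genes = set(input_genes)
    results = []
    for kinase, substrates in library.items():
        overlap = len(input_genes & substrates)
        p = hypergeom_pvalue(overlap, len(substrates),
                             len(input_genes), background_size)
        results.append((kinase, overlap, p))
    return sorted(results, key=lambda r: r[2])


# Hypothetical library: kinase -> set of putative substrates.
library = {
    "KINASE_1": {"G1", "G2", "G3", "G4"},
    "KINASE_2": {"G5", "G6", "G7", "G8"},
}
ranked = rank_kinases(["G1", "G2", "G3", "G9"], library, background_size=20)
for kinase, overlap, p in ranked:
    print(f"{kinase}: overlap={overlap}, p={p:.4f}")
```

Running the same ranking separately for each library keeps results from different source databases distinguishable, in line with the library organization described above.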

Publication:
KEA3: improved kinase enrichment analysis via data integration

Automatically Generate RNA-seq Data Analysis Notebooks

BioJupies is a web server that enables automated creation, storage, and deployment of Jupyter Notebooks containing RNA-seq data analyses. Through an intuitive interface, novice users can rapidly generate tailored reports to analyze and visualize their own raw sequencing files, their gene expression tables, or fetch data from >9,000 published studies containing >300,000 preprocessed RNA-seq samples. Generated notebooks have the executable code of the entire pipeline, rich narrative text, interactive data visualizations, differential expression, and enrichment analyses. The notebooks are permanently stored in the cloud and made available online through a persistent URL.

Publication:
BioJupies: Automated Generation of Interactive Notebooks for RNA-Seq Data Analysis in the Cloud

Tool to Identify Cell Surface Immunotherapeutic Targets

TargetRanger is a web server application that identifies targets from user-inputted RNA-seq samples collected from the cells they wish to target. By comparing the inputted samples with processed RNA-seq and proteomics data from several atlases, TargetRanger identifies genes that are highly expressed in the target cells while lowly expressed across normal human cell types, tissues, and cell lines.
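The selection idea described above (high in the target cells, low across normal tissues and cell types) can be sketched with a simple threshold filter. The thresholds, the use of the mean for target samples and the maximum for normal samples, and all gene names and values are assumptions for illustration; the statistical procedure TargetRanger actually applies is not described here.

```python
from statistics import mean


def candidate_targets(target_expr, normal_expr, min_target=10.0, max_normal=1.0):
    """Return genes whose mean expression in the target samples is at least
    min_target while their highest expression in any normal sample is at
    most max_normal. Both inputs map gene -> list of expression values."""
    candidates = []
    for gene, values in target_expr.items():
        normal_values = normal_expr.get(gene)
        if not normal_values:
            continue  # no normal reference available for this gene
        target_mean = mean(values)
        normal_max = max(normal_values)
        if target_mean >= min_target and normal_max <= max_normal:
            candidates.append((gene, target_mean, normal_max))
    # Highest target expression first.
    return sorted(candidates, key=lambda c: -c[1])


# Hypothetical expression values (arbitrary units).
target_expr = {"G1": [50.0, 60.0], "G2": [40.0, 42.0], "G3": [0.5, 0.7]}
normal_expr = {"G1": [0.1, 0.4, 0.9], "G2": [5.0, 30.0, 2.0], "G3": [0.2, 0.3]}

targets = candidate_targets(target_expr, normal_expr)
print(targets)
```

Using the maximum over normal samples is a deliberately conservative choice in this sketch: a gene is rejected if it is expressed in even one normal tissue or cell type.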

Publication:
GeneRanger and TargetRanger: processed gene and protein expression levels across cells and tissues for target discovery

Massive Mining of Gene Sets from Supporting Materials of Biomedical Research Publications

Rummagene is a web server application that provides access to hundreds of thousands of human and mouse gene sets extracted from supporting materials of publications listed on PubMed Central (PMC). Users of Rummagene can submit their own gene sets to find matching gene sets ranked by their overlap with the input gene set. In addition to providing the extracted gene sets for search, we investigated the massive corpus of these gene sets for statistical patterns. We show how Rummagene can be used for transcription factor and kinase enrichment analyses, for universal predictions of cell types for single cell RNA-seq data, and for gene function predictions.
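Ranking stored gene sets by their overlap with a query can be sketched in a few lines. The function below ranks by raw overlap size, which is the simplest reading of the description above; the set names and genes are hypothetical, and the production service may well add a significance test on top of the raw count.

```python
def rank_by_overlap(query, gene_sets, top_n=10):
    """Return up to top_n (name, overlap_size, shared_genes) tuples,
    largest overlap first; ties are broken alphabetically by name."""
    query = set(query)
    hits = []
    for name, genes in gene_sets.items():
        shared = query & set(genes)
        if shared:
            hits.append((name, len(shared), sorted(shared)))
    hits.sort(key=lambda h: (-h[1], h[0]))
    return hits[:top_n]


# Hypothetical gene sets, keyed by an illustrative "paper-table" label.
gene_sets = {
    "paper1_table_s1": ["G1", "G2", "G3"],
    "paper2_table_s4": ["G2", "G9"],
    "paper3_table_s2": ["G7", "G8"],
}

hits = rank_by_overlap(["G1", "G2", "G3", "G4"], gene_sets)
for name, size, shared in hits:
    print(name, size, shared)
```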

Publication:
Rummagene: massive mining of gene sets from supporting materials of biomedical research publications

Linking Expression Signatures to Upstream Cell Signaling Networks

X2K Web infers upstream regulatory networks from signatures of differentially expressed genes. By combining transcription factor enrichment analysis, protein-protein interaction network expansion, and kinase enrichment analysis, X2K Web produces inferred networks of transcription factors, proteins, and kinases predicted to regulate the expression of the inputted gene list. X2K Web provides the results as tables and interactive vector graphic figures that can be readily embedded within publications.
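The middle step of this pipeline, protein-protein interaction network expansion, can be illustrated with a small sketch. Here the enriched transcription factors serve as seeds, and any protein that interacts with at least two seeds is added as an intermediate. The expansion rule, the `min_seed_neighbors` parameter, and the protein names are assumptions for illustration rather than the documented X2K algorithm.

```python
from collections import defaultdict


def expand_network(seeds, ppi_edges, min_seed_neighbors=2):
    """Add proteins that directly interact with at least min_seed_neighbors
    seed proteins, then return (intermediates, edges of the expanded network)."""
    seeds = set(seeds)
    neighbors = defaultdict(set)
    for a, b in ppi_edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    intermediates = {
        protein
        for protein, nbrs in neighbors.items()
        if protein not in seeds and len(nbrs & seeds) >= min_seed_neighbors
    }
    nodes = seeds | intermediates
    # Keep only interactions between retained nodes, as undirected edges.
    edges = sorted(
        {tuple(sorted((a, b))) for a, b in ppi_edges if a in nodes and b in nodes}
    )
    return intermediates, edges


# Hypothetical interaction list; TF* are seed transcription factors.
ppi_edges = [("TF1", "P1"), ("TF2", "P1"), ("TF3", "P2"),
             ("P2", "P3"), ("TF1", "TF2")]
intermediates, edges = expand_network(["TF1", "TF2", "TF3"], ppi_edges)
print(intermediates, edges)
```

The node list of the expanded network (seeds plus intermediates) would then be the input gene list for the kinase enrichment step.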

Publication:
eXpression2Kinases (X2K) Web: linking expression signatures to upstream cell signaling networks

Data and Metadata Search Engine for a Million Gene Expression Signatures

SigCom LINCS is a web server that serves over a million gene expression signatures processed, analyzed, and visualized from LINCS, GTEx, and GEO. SigCom LINCS is built with Signature Commons, a cloud-agnostic skeleton Data Commons with a focus on serving searchable signatures. SigCom LINCS provides a rapid signature similarity search for mimickers and reversers given sets of up and down genes, a gene set, a single gene, or any search term.
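The notion of mimickers and reversers can be made concrete with a simple directional overlap score: a stored signature mimics the query when its up and down genes agree with the query's, and reverses it when they are swapped. The scoring rule and all signature and gene names below are illustrative assumptions, not the method SigCom LINCS implements.

```python
def direction_score(query_up, query_down, sig_up, sig_down):
    """Concordant overlaps minus discordant overlaps between query and signature."""
    concordant = len(query_up & sig_up) + len(query_down & sig_down)
    discordant = len(query_up & sig_down) + len(query_down & sig_up)
    return concordant - discordant


def mimickers_and_reversers(query_up, query_down, signatures):
    """Split signatures into mimickers (score > 0) and reversers (score < 0),
    each sorted with the strongest match first."""
    scored = [
        (name, direction_score(query_up, query_down, s["up"], s["down"]))
        for name, s in signatures.items()
    ]
    mimickers = sorted((x for x in scored if x[1] > 0), key=lambda x: -x[1])
    reversers = sorted((x for x in scored if x[1] < 0), key=lambda x: x[1])
    return mimickers, reversers


# Hypothetical signatures with up and down gene sets.
signatures = {
    "sig_same": {"up": {"A", "B"}, "down": {"C"}},
    "sig_opposite": {"up": {"C", "D"}, "down": {"A"}},
    "sig_unrelated": {"up": {"X"}, "down": {"Y"}},
}

mimickers, reversers = mimickers_and_reversers({"A", "B"}, {"C", "D"}, signatures)
print("mimickers:", mimickers)
print("reversers:", reversers)
```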

Publication:
SigCom LINCS: data and metadata search engine for a million gene expression signatures