The Center for Disease Neurogenomics (CDN) aims to identify the intermediate steps along the chain of causality from DNA to high-level phenotype that result in neuropsychiatric disease. Using a functional genomics approach, the CDN generates large-scale datasets from the human brain, assaying gene expression and epigenetic variation across individuals, disease states, cell types, and brain regions. The complexity and scale of these functional genomics datasets require novel analytical approaches to address biological questions about neuropsychiatric disease. The role of the Statistical Neurogenomics Group, under the direction of Gabriel E. Hoffman, PhD, is to develop new statistical methods and software that can scale to these large datasets. Because of the size of the data and the complexity of the statistical models we develop, we rely on Minerva, Mount Sinai’s high-performance computing resource.
Innovative Statistical Methods to Track Dynamic Molecular Processes
Molecular processes in the brain are extremely dynamic, and understanding their role in neuropsychiatric disease requires decoupling multiple sources of variation. Gene expression varies for many reasons. Genetics drives expression variation, and so can the disease itself. Donors’ age, sex, and other biological factors also contribute, as do technical and stochastic factors. As a result, sophisticated statistical models are needed to separate the sources of variation we are studying from those we are not specifically interested in.
To address these issues, we have developed the open-source software package variancePartition. Widely used at the CDN and within the broader scientific community, variancePartition enables rapid interpretation of complex gene expression studies as well as other high-throughput genomics assays. Using a linear mixed model, variancePartition quantifies the variation in each expression trait attributable to disease status, sex, cell or tissue type, ancestry, genetic background, experimental stimulus, or technical variables. This statistical framework has served as a core part of other statistical models and software packages we have developed over the past few years.
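To make the idea of attributing expression variation to named sources concrete, here is a minimal Python sketch. It is not the variancePartition package and does not fit a linear mixed model. As a simplifying assumption, it fits an ordinary least squares model for a single gene and reports the share of variance carried by each covariate's fitted contribution. The function name, covariate names, and simulated values are all hypothetical.

```python
import numpy as np

def variance_fractions(y, covariates):
    """Share of variance in y attributed to each covariate and to residuals.

    Simplified illustration: ordinary least squares with numeric covariates,
    not a linear mixed model. Covariance between covariates is ignored and
    the shares are normalized to sum to 1.
    """
    y = np.asarray(y, dtype=float)
    names = list(covariates)
    # Design matrix: intercept column followed by one column per covariate.
    columns = [np.asarray(covariates[n], dtype=float) for n in names]
    X = np.column_stack([np.ones(len(y))] + columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    # Variance of each covariate's fitted contribution, plus residual variance.
    parts = {n: np.var(X[:, i + 1] * beta[i + 1]) for i, n in enumerate(names)}
    parts["Residuals"] = np.var(residuals)
    total = sum(parts.values())
    return {name: value / total for name, value in parts.items()}

# Simulated expression for one gene (assumed effect sizes, for illustration only).
rng = np.random.default_rng(0)
n_samples = 500
disease = rng.integers(0, 2, n_samples)   # 0 = control, 1 = case
age = rng.normal(60, 10, n_samples)       # donor age in years
expression = 2.0 * disease + 0.05 * age + rng.normal(0, 1, n_samples)

fractions = variance_fractions(expression, {"disease": disease, "age": age})
for name, share in fractions.items():
    print(f"{name}: {share:.2f}")
```

In this simulation, disease status accounts for a larger share of variance than age because its simulated effect is larger. A mixed model additionally lets categorical variables with many levels, such as donor or batch, enter as random effects, which this sketch leaves out.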
Isolating the Factors That Drive Disease
One of the challenges of identifying the genetic variants that cause neuropsychiatric disease is that the variants occur in clusters, called linkage disequilibrium blocks. Because they are physically located so close together, these variants are highly correlated, and standard analysis cannot identify the specific variant or variants responsible for disease. Using a fine-mapping approach, we have developed a statistical method that analyzes each variant within the cluster and estimates the probability that it is the true causal variant. Based on these probabilities, we can invest with confidence in further experimentation.
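The notion of a per-variant probability of being causal can be illustrated with a textbook simplification. The sketch below is not the group's method. It assumes exactly one causal variant in the block, an equal prior for every variant, and Wakefield's approximate Bayes factor computed from each variant's association z-score and standard error. The function names, the prior standard deviation of 0.15, and the example statistics are hypothetical.

```python
import math

def posterior_inclusion_probabilities(z_scores, std_errors, prior_sd=0.15):
    """Posterior probability that each variant is the single causal one.

    Assumptions: one causal variant per block, equal priors, and a normal
    prior on the effect size with standard deviation prior_sd.
    """
    log_abf = []
    for z, se in zip(z_scores, std_errors):
        # Shrinkage factor: prior variance relative to total variance.
        r = prior_sd ** 2 / (se ** 2 + prior_sd ** 2)
        # Log of Wakefield's approximate Bayes factor (alternative vs null).
        log_abf.append(0.5 * math.log(1.0 - r) + 0.5 * r * z * z)
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_abf)
    weights = [math.exp(value - peak) for value in log_abf]
    total = sum(weights)
    return [w / total for w in weights]

def credible_set(probabilities, coverage=0.95):
    """Smallest set of variant indices whose probabilities reach the coverage."""
    order = sorted(range(len(probabilities)), key=lambda i: -probabilities[i])
    chosen, cumulative = [], 0.0
    for index in order:
        chosen.append(index)
        cumulative += probabilities[index]
        if cumulative >= coverage:
            break
    return chosen

# Three correlated variants in one block (made-up summary statistics).
pips = posterior_inclusion_probabilities([5.2, 5.0, 1.0], [0.05, 0.05, 0.05])
print([round(p, 3) for p in pips])
print(credible_set(pips))
```

Two variants with similar z-scores split most of the probability between them, which is why a credible set, rather than a single top variant, is a natural way to prioritize follow-up experiments.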
By linking genetic variants, gene expression traits, and disease phenotype through integrative analyses, we can trace the chain of causality and identify important intermediate biological processes. By integrating data at these three levels through joint statistical fine mapping, we can identify variants that, when modified, affect both gene expression and disease risk.
We recently developed a new statistical approach, the multivariate multiple eQTL, which increases the sample size available in these analyses by integrating donors with diverse ancestries. This approach increases power by enabling us to use bulk brain tissue yet still focus on specific brain regions, cell types, genes, and variants through statistical modeling. With this sophisticated linear regression model, we are able to increase fine-mapping resolution and identify conditional eQTLs that are enriched for cell-type regulatory effects. In a recent study, we performed a trans-ethnic eQTL meta-analysis of 3,188 RNA-seq samples from 2,029 donors, with an effective sample size of 2,974, to produce the largest resource to date characterizing the genetics of gene expression in the human brain.
In this study, our statistical approach indicated that a variant in the APH1B gene associated with Alzheimer’s disease risk acts by regulating expression, even though it also disrupts the protein sequence. Regulatory function (when, where, and how much a gene is expressed) is a common theme for most of the neuropsychiatric diseases we study.
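The gain in power from combining cohorts in an eQTL meta-analysis can be illustrated with the simplest standard scheme, fixed-effects inverse-variance weighting. This is not the approach described above. The sketch assumes each cohort is independent, so it ignores repeated samples from the same donor and relatedness between donors, both of which a real analysis has to model. The function name and the example effect sizes are hypothetical.

```python
import math

def fixed_effects_meta(betas, std_errors):
    """Combine per-cohort effect estimates by inverse-variance weighting.

    Assumption: cohorts are independent and share one true effect size.
    Returns the pooled effect, its standard error, and the z-score.
    """
    # More precise cohorts (smaller standard error) receive more weight.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled_beta, pooled_se, pooled_beta / pooled_se

# One variant-gene pair measured in two cohorts (made-up estimates).
beta, se, z = fixed_effects_meta([0.2, 0.4], [0.1, 0.1])
print(f"pooled effect {beta:.2f}, standard error {se:.3f}, z = {z:.2f}")
```

The pooled standard error is smaller than either cohort's own standard error, which is the source of the additional power and, in turn, of sharper fine-mapping.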
Sharing Our Software With the Scientific Community
We are committed to distributing our open-source software packages to the broader scientific community. While our method development focuses on datasets and biological questions of interest to the CDN, our colleagues, and our collaborators, we aim to contribute to the broader genomics and scientific communities. By making our tools publicly available, we help the scientific community solve problems we all share.