Statistical Methods for Human Genetics

We have a strong interest in developing methods for systems genetics. This emerging field studies how DNA variations affect molecular and cellular phenotypes, such as gene expression, metabolites, DNA methylation and cellular composition. Such studies provide a natural bridge from genetic changes to phenotypes, and can thus help elucidate the mechanisms by which variants affect disease risk. A major focus of the lab is to develop methods to extract information from systems genetics data. One method we developed, called Sherlock, combines expression QTL (eQTL) data with data from genome-wide association studies (GWAS) in a novel fashion. It can take advantage of loci linked to gene expression in trans (i.e., distant from the gene itself), and can discover genes that would be impossible to find by GWAS alone. A key challenge of systems genetics is to establish the causal roles of molecular and cellular traits in phenotypes. Mendelian Randomization (MR) is often used for such analyses. However, MR makes strong assumptions and can lead to false positive findings when those assumptions are violated. We developed a method, CAUSE, to address these challenges.
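To make concrete what an MR analysis computes from GWAS summary statistics, here is a minimal sketch of the standard inverse-variance weighted (IVW) estimator. This is not CAUSE and not code from the lab; the function name and inputs are illustrative assumptions. It assumes every variant is an independent, valid instrument, which is exactly the kind of strong assumption discussed above.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate of the causal effect of exposure on outcome.

    beta_exposure: per-variant effect sizes on the exposure (e.g. expression)
    beta_outcome:  per-variant effect sizes on the outcome (e.g. disease)
    se_outcome:    standard errors of the outcome effect sizes
    Assumes independent variants and no pleiotropy (illustrative only).
    """
    numerator = 0.0
    denominator = 0.0
    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome):
        weight = 1.0 / se ** 2          # inverse variance of the outcome effect
        numerator += weight * bx * by
        denominator += weight * bx * bx
    estimate = numerator / denominator
    std_error = math.sqrt(1.0 / denominator)
    return estimate, std_error


# Hypothetical summary statistics for three variants.
effect, se = ivw_estimate([0.1, 0.2, 0.3], [0.05, 0.10, 0.15], [0.01, 0.01, 0.01])
print(effect, se)
```

If some variants affect the outcome through paths other than the exposure, this estimator can be biased, which is the failure mode that motivates methods that model pleiotropy explicitly.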


Sherlock: Detecting Gene-Disease Associations by Matching Patterns of Expression QTL and GWAS. He X, Fuller CK, Song Y, Meng Q, Zhang B, Yang X, Li H. Am J Hum Genet, 2013 May 2;92(5):667-80

Mendelian Randomization Accounting for Correlated and Uncorrelated Pleiotropic Effects Using Genome-Wide Summary Statistics. Morrison J, Knoblauch N, Marcus JH, Stephens M, He X. Nature Genetics, 2020 May 25

Regulatory Variations and Gene Mapping of Complex Traits

Most variants associated with complex traits are located in non-coding regions. Identifying functional variants in such regions in a tissue-specific manner is thus critical for mapping causal variants of complex traits. We have been working with experimental collaborators to identify regulatory variants and to leverage such findings to study human genetics. We found that a particular class of variants, those that affect mRNA modifications (m6A), contributes significantly to the heritability of complex traits. These variants work largely independently of transcription or splicing, representing a novel path from genetic to phenotypic variation. We have also used chromatin accessibility profiles in iPS cell-derived neurons to narrow down putative causal variants of neuropsychiatric disorders.


Genetic Analyses Support the Contribution of mRNA N6-methyladenosine (m6A) Modification to Human Disease Heritability. Zhang Z, Luo K, Zou Z, Qiu M, Tian J, Sieh L, Shi H, Zou Y, Wang G, Morrison J, Zhu A, Qiao M, Li Z, Stephens M*, He X*, He C*. Nature Genetics, 2020 Jun 29

Allele-specific open chromatin in human iPSC neurons elucidates functional disease variants. Zhang S, Zhang H, Zhou Y, Qiao M, Zhao S, Kozlova A, Shi J, Sanders A, Wang G, Luo K, Sengupta S, West S, Qian S, Stret M, Avramopoulos D, Cowan C, Chen M, Pang Z, Gejman P, He X*, Duan J*. Science 2020 Jul 31;369(6503):561-565

Statistical Genetics of Rare Variants

While GWAS have been highly successful, most of the associations found have small effects. There is growing interest in identifying rare, large-effect disease variants through whole-exome or whole-genome sequencing studies. The statistical challenge is that the power to detect rare risk variants is often low. We have been developing methods to address this challenge by integrating multiple types of data. We have developed a model that effectively combines de novo mutations (those occurring spontaneously during reproduction) with inherited, standing variants to test the role of a gene. This method (TADA) empowered some of the largest sequencing studies of autism. We have extended TADA in several ways, by combining it with copy number variations and by incorporating functional annotations of variants.
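The idea of pooling de novo and inherited evidence for a gene can be sketched with Bayes factors. The code below is a simplified illustration in the spirit of such an integrated model, not the TADA implementation. It assumes the de novo count in a gene is Poisson with mean 2 x (number of families) x (mutation rate) x (relative risk), with relative risk fixed at 1 under the null and given a Gamma prior under the alternative. All function names, parameter names and numbers are hypothetical.

```python
import math


def log_bf_de_novo(count, n_families, mutation_rate, mean_rel_risk, beta):
    """Log Bayes factor for a gene's de novo mutation count.

    Null: count ~ Poisson(lam), lam = 2 * n_families * mutation_rate.
    Alternative: count ~ Poisson(lam * gamma), gamma ~ Gamma(shape, rate=beta),
    with shape chosen so that the prior mean of gamma is mean_rel_risk.
    The marginal under the alternative is negative binomial, so the ratio
    has the closed form used below.
    """
    lam = 2.0 * n_families * mutation_rate
    shape = mean_rel_risk * beta
    return (math.lgamma(count + shape) - math.lgamma(shape)
            + shape * math.log(beta)
            - (shape + count) * math.log(beta + lam)
            + lam)


def posterior_risk_probability(log_bayes_factors, prior):
    """Combine evidence sources, assumed independent given the gene's status,
    by summing their log Bayes factors, then convert to a posterior probability."""
    log_odds = math.log(prior / (1.0 - prior)) + sum(log_bayes_factors)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical gene: 3 de novo hits in 1000 families, plus an assumed
# log Bayes factor of 0.7 from inherited (case-control) variants.
log_bf_dn = log_bf_de_novo(3, n_families=1000, mutation_rate=2e-5,
                           mean_rel_risk=20.0, beta=1.0)
print(posterior_risk_probability([log_bf_dn, 0.7], prior=0.05))
```

Because the evidence enters as a product of Bayes factors, a gene with modest support from each data type can still reach a high posterior probability, which is where the gain in power comes from.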

Figure 2

Integrated model of de novo and inherited genetic variants yields greater power to identify risk genes. He X, Sanders SJ, Liu L, De Rubeis S, Lim ET, Sutcliffe JS, Schellenberg GD, Gibbs RA, Daly MJ, Buxbaum JD, State MW, Devlin B, Roeder K. PLoS Genetics, 2013 Aug;9(8):e1003671

Synaptic, transcriptional, and chromatin genes disrupted in autism. De Rubeis S, He X, Goldberg A, Poultney C, Samocha K, et al. Nature, 2014 Nov 13;515(7526):209

Insights Into Autism Spectrum Disorder Genomic Architecture and Biology From 71 Risk Loci. Sanders S, He X, Willsey J, Ercan-Sencicek G, Samocha K, et al. Neuron, 2015 Sep 23;87(6):1215-33

A Statistical Framework for Mapping Risk Genes From De Novo Mutations in Whole-Genome-Sequencing Studies. Liu Y, Liang Y, Cicek AE, Li Z, Li J, Muhle RA, Krenzer M, Mei Y, Wang Y, Knoblauch N, Morrison J, Zhao S, Jiang Y, Geller E, Ionita-Laza I, Wu J, Xia K, Noonan JP, Sun ZS, He X. Am J Hum Genet, 2018 Jun 7;102(6):1031-1047

Cancer Genomics

Cancer is largely a genetic disease, in which somatic mutations give cancer cells survival advantages and drive tumorigenesis. Identifying driver events, and understanding how they link to changes in cellular behavior and the tumor microenvironment, are major challenges of the field. We have developed a method (DriverMAPS) that models the complex pattern of positive selection acting on cancer driver genes, leading to much better detection of driver genes than existing methods. We are also interested in how these driver events lead to downstream effects, in particular escape from the immune system.


Detailed modeling of positive selection improves detection of cancer driver genes. Zhao S, Liu J, Nanga P, Liu Y, Cicek AE, Knoblauch N, He C, Stephens M*, He X*. Nature Communications, 2019 Jul 30;10(1):3399

Regulatory Sequences and Evolution

Gene expression is controlled by enhancer sequences, which read information about the cellular environment to drive expression patterns appropriate for specific cellular conditions or cell types. Variations in noncoding enhancer sequences are a major driver of phenotypic variation and evolution. A major research challenge of the field is to decipher the rules that govern this process and to use such knowledge to improve our ability to interpret DNA variations in non-coding sequences. We have developed quantitative models of how enhancer sequences interact with regulatory proteins (transcription factors) to drive gene expression. We have also studied the evolution of cis-regulatory sequences in the context of fruit fly development. We found that the basic units of these sequences, called transcription factor binding sites, can turn over rather rapidly even when the sequences generate similar patterns of gene expression.
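As a rough illustration of what a quantitative, thermodynamics-based view of an enhancer looks like, the sketch below enumerates every bound/unbound configuration of a few binding sites, assigns each a Boltzmann-style statistical weight, and reports the probability that each site is occupied. It is a generic textbook construction, not the lab's published model; the function name, the weights and the single cooperativity factor for adjacent bound sites are assumptions made for the example.

```python
from itertools import product


def site_occupancies(site_weights, cooperativity=1.0):
    """Probability that each binding site is occupied at equilibrium.

    site_weights:  statistical weight of each site when bound, e.g.
                   (binding constant) x (transcription factor concentration)
    cooperativity: extra factor applied when two adjacent sites are both
                   bound (1.0 means independent binding)
    """
    n_sites = len(site_weights)
    partition = 0.0
    bound_weight = [0.0] * n_sites
    # Enumerate all 2^n configurations; fine for a handful of sites.
    for state in product((0, 1), repeat=n_sites):
        weight = 1.0
        for i, bound in enumerate(state):
            if bound:
                weight *= site_weights[i]
                if i > 0 and state[i - 1]:
                    weight *= cooperativity
        partition += weight
        for i, bound in enumerate(state):
            if bound:
                bound_weight[i] += weight
    return [w / partition for w in bound_weight]


# Two equally strong sites, without and with cooperative binding.
print(site_occupancies([1.0, 1.0]))
print(site_occupancies([1.0, 1.0], cooperativity=3.0))
```

In a model of this kind, a sequence change alters a site's weight, so its predicted effect on occupancy, and in turn on expression, can be computed rather than guessed.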


Evolution of regulatory sequences in 12 Drosophila species. Kim J*, He X*, Sinha S. PLoS Genet, 2009 Jan;5(1):e1000330

Thermodynamics-based models of transcriptional regulation by enhancers: the roles of synergistic activation, cooperative binding and short-range repression. He X, Samee MA, Blatti C, Sinha S. PLoS Comput Biol, 2010 Sep 16;6(9):e1000935

Evolutionary Origins of Transcription Factor Binding Site Clusters. He X, Duque TS, Sinha S. Mol Biol Evol, 2012, 29(3):1059-70