We develop machine learning approaches and bioinformatics tools to interpret functional genomics and molecular mechanisms at both single-cell and tissue levels, and to improve genotype-phenotype prediction in complex biological systems, especially brains and brain diseases. In particular, our current applied research areas include:
* Functional genomics, Computational genomics, Comparative genomics
* Brains, Neurodevelopment, Neuropsychiatric & Neurodegenerative diseases, Cancers, Bioenergy
Interpretable machine learning for multi-modal data integration and phenotype prediction: We have developed machine learning methods, such as scGRNom [Genome Medicine, 13, 95, 2021], to predict cell-type and bulk gene regulatory networks in human diseases. Further, we have developed interpretable deep learning models, including Varmole [Bioinformatics, 37 (12), 1772-1775, 2021] and deepManReg [Nature Computational Science, 2, 38–46, 2022], that significantly improved the prediction of disease phenotypes in complex brains. As multi-modal data become increasingly available, we developed a multiview empirical risk minimization (MV-ERM) framework that embeds biological knowledge into learning models for discovering functional multi-omics [PLoS Computational Biology, 16(4): e1007677, 2020]. Using this framework, our manifold learning method, ManiNetCluster, identified gene functions linking different conditions [BMC Genomics 20, 1003, 2019]. Another method, ECMarker, is a semi-restricted Boltzmann machine model that predicts biomarker genes for early-stage diseases [Bioinformatics, 37 (8), 1115-1124, 2021]. All our methods are open source and available in R/Python at our GitHub site for general use, along with README files that include tutorials, workflows, and demos.
Multi-omics data analysis and functional genomics: We have built a comprehensive functional genomic resource for the human brain across 1866 individuals (resource.psychencode.org) using multi-omics data from PsychENCODE and other large consortia. It contains ~79K brain-active enhancers, sets of Hi-C linkages and TADs, single-cell expression profiles for many cell types, expression QTLs, and further QTLs associated with chromatin, splicing, and cell-type proportions. We deconvolved the bulk tissue expression across individuals using single-cell data and found that varying cell-type proportions largely account for the cross-population variation in expression (with >88% reconstruction accuracy). Leveraging our QTLs and Hi-C datasets, we predicted a full regulatory network, linking GWAS variants to genes (e.g., 321 for schizophrenia). We embedded this network into an interpretable deep-learning model, which improves disease prediction ~6X vs. polygenic risk scores and identifies key genes and pathways in psychiatric disorders. [Science 362, eaat8464, 2018]
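As a rough illustration of the deconvolution idea described above, the following sketch estimates cell-type fractions for one bulk sample by non-negative least squares, solved here with plain projected gradient descent. It is a minimal toy under stated assumptions, not the pipeline used in the cited study: the function name `deconvolve`, the synthetic signature matrix, and the use of R² as a stand-in for "reconstruction accuracy" are all assumptions made for this example.

```python
import numpy as np

def deconvolve(bulk, signatures, n_iter=2000):
    """Estimate non-negative cell-type weights w so that signatures @ w ~ bulk.

    bulk:       expression vector of one bulk sample (genes,)
    signatures: mean single-cell expression per cell type (genes x cell types)
    Hypothetical helper for illustration only.
    """
    S = np.asarray(signatures, dtype=float)
    b = np.asarray(bulk, dtype=float)
    # Step size 1/L, where L is the Lipschitz constant of the gradient
    step = 1.0 / np.linalg.norm(S, 2) ** 2
    w = np.full(S.shape[1], 1.0 / S.shape[1])
    for _ in range(n_iter):
        grad = S.T @ (S @ w - b)
        # Gradient step followed by projection onto the non-negative orthant
        w = np.maximum(w - step * grad, 0.0)
    total = w.sum()
    fractions = w / total if total > 0 else w
    # Assumed accuracy measure: variance in bulk expression explained (R^2)
    recon = S @ w
    r2 = 1.0 - np.sum((b - recon) ** 2) / np.sum((b - b.mean()) ** 2)
    return fractions, r2

# Synthetic toy data: 50 genes, 3 cell types, known mixing fractions
rng = np.random.default_rng(0)
signatures = rng.uniform(0.0, 10.0, size=(50, 3))
true_fractions = np.array([0.6, 0.3, 0.1])
bulk = signatures @ true_fractions

fractions, r2 = deconvolve(bulk, signatures)
print(np.round(fractions, 3), round(float(r2), 3))
```

Applied per individual, the estimated fractions can then be compared across the cohort to ask how much of the expression variation they account for.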
Comparative network biology: We designed a novel cross-species clustering algorithm to identify conserved and species-specific gene and non-coding RNA regulatory modules during embryonic development in C. elegans and D. melanogaster. We found that in both species, orthologous genes work together more closely during the phylotypic developmental stage (aka the vertebrate body plan stage) than during other developmental stages. This lays the groundwork for studying evolutionary expression patterns during embryogenesis and enabled us to systematically study interactions between evolutionarily conserved and species-specific functions during development. [Nature 512, 445–448, 2014; Genome Biology 15:R100, 2014]
Dynamic models in biological systems: We developed computational methods that identify the principal gene expression patterns of complex biological processes such as embryogenesis, integrating the state-space model with dimensionality reduction by matrix factorization for the first time. This approach produced an entirely new analytical platform that promises to open new avenues of investigation into systematic and robust dynamic patterns in high-dimensional, complex, and noisy gene expression data [PLoS Computational Biology, 12(10): e1005146, 2016; PLoS ONE 7(1): e28805, 2012; IEEE/ACM Transactions on Computational Biology and Bioinformatics, 430-437, 2012].
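One simple way to picture the combination of matrix factorization and a state-space model is sketched below: a truncated SVD reduces a genes-by-time expression matrix to a few latent trajectories, and a linear transition matrix is then fitted to those trajectories by least squares. This is only an illustrative sketch under our own assumptions (a noise-free linear model, the hypothetical function `fit_latent_dynamics`, synthetic data); it is not the published method.

```python
import numpy as np

def fit_latent_dynamics(expr, k):
    """Reduce expression to k latent states and fit z[t+1] ~ A @ z[t].

    expr: genes x time-points matrix. Hypothetical helper for illustration.
    """
    X = np.asarray(expr, dtype=float)
    # Matrix factorization step: truncated SVD, X ~ U_k diag(s_k) Vt_k
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Z = s[:k, None] * Vt[:k]              # latent trajectories (k x T)
    # State-space step: least-squares fit of the linear transition matrix
    A = Z[:, 1:] @ np.linalg.pinv(Z[:, :-1])
    # Eigenvalues of A summarize the dynamic patterns (decay, growth, oscillation)
    eigvals = np.linalg.eigvals(A)
    return U[:, :k], Z, A, eigvals

# Synthetic toy data: 2 hidden patterns decaying at rates 0.9 and 0.5,
# observed through 20 genes over 10 time points
rng = np.random.default_rng(1)
A_true = np.diag([0.9, 0.5])
z = np.array([1.0, 1.0])
states = []
for _ in range(10):
    states.append(z)
    z = A_true @ z
Z_true = np.array(states).T               # 2 x 10
loadings = rng.normal(size=(20, 2))       # genes x hidden patterns
expr = loadings @ Z_true

gene_patterns, Z, A, eigvals = fit_latent_dynamics(expr, k=2)
print(np.sort(eigvals.real))
```

In this toy setting the eigenvalues of the fitted transition matrix recover the decay rates of the hidden patterns, while the columns of `gene_patterns` indicate how strongly each gene loads on each pattern. Real data would additionally require handling noise, for example with a Kalman-filter style model.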
Gene regulatory logics: We developed a computational method that integrates ENCODE and TCGA data to identify the genome-wide regulatory logic of transcription factors and microRNAs, and we reported the logic patterns observed in leukemia. Until this point, similar logics had only been reported in simple organisms like yeast. These results provided unprecedented insights into the logic of gene regulatory circuits in more complex biological systems such as cancer [PLoS Computational Biology 11(4): e1004132, 2015].
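To make the notion of regulatory logic concrete, the sketch below scores how well each of the 16 possible two-input Boolean functions explains a target gene's binarized expression given two regulators (for example, a transcription factor and a microRNA). The function name `best_logic_gate`, the binarization, and the toy data are assumptions for illustration; this is not a description of the published method.

```python
from itertools import product

def best_logic_gate(reg1, reg2, target):
    """Return the two-input truth table that best matches the target.

    reg1, reg2, target: equal-length sequences of 0/1 values across samples.
    Hypothetical helper for illustration only.
    """
    inputs = list(product([0, 1], repeat=2))       # (0,0), (0,1), (1,0), (1,1)
    best_table, best_score = None, -1.0
    for outputs in product([0, 1], repeat=4):      # all 16 Boolean functions
        table = dict(zip(inputs, outputs))
        matches = sum(table[(a, b)] == t for a, b, t in zip(reg1, reg2, target))
        score = matches / len(target)              # fraction of samples explained
        if score > best_score:
            best_table, best_score = table, score
    return best_table, best_score

# Toy data: the target is ON only when the TF is ON and the miRNA is OFF
tf     = [0, 0, 1, 1, 0, 1, 1, 0]
mirna  = [0, 1, 0, 1, 1, 0, 1, 0]
target = [0, 0, 1, 0, 0, 1, 0, 0]

table, score = best_logic_gate(tf, mirna, target)
print(table, score)
```

Here the best-matching truth table corresponds to "TF AND NOT miRNA". With real expression data, the scores would need to be assessed against a null distribution before calling a gate for any regulator-target triplet.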
Cross-disciplinary network transferability: Our recent review compared the characteristics of biological networks with those of networks in other disciplines and discussed the cross-disciplinary transferability of network formalisms to help gain novel biological insights at the system level. We illustrated how these comparisons benefit the field with a few specific examples related to network growth, organizational hierarchies, and the evolution of adaptive systems [Cell Systems, 2, 147-157, 2016].
Academic social network: We analyzed the academic social networks driven by large scientific consortia (Big Science), which revealed temporal dynamics of collaborative patterns between consortia members and non-member users [Trends in Genetics, 32, 251-253, 2016].