Research Overview

Our research is driven by the question: How do we optimally apply computational statistics and machine learning to big biological data to produce actionable insights and predictions?

In practice, our work involves core research projects, management and analysis of big biological data, and development of new analysis methods and associated algorithms. A major component of our work is data wrangling: processing data sets into analyzable forms. The data we work with fall into three groups:

Genomic and other high-throughput molecular data - spanning genomic, epigenomic, metabolic, and proteomic data, from cell-free to single-cell to spatial, collected on next-generation sequencing (NGS), array, and mass spectrometry platforms

Medical and disease imaging data - spanning digital pathology to magnetic resonance imaging (MRI)

Clinical health data - spanning clinical trial electronic data capture (EDC), electronic health record (EHR), and real-world evidence (RWE) data

We also regularly develop computational statistics and machine learning methods for specific applications, where our output includes:

Theory - deriving theorems that inform our understanding of the potential of computational analysis methodology and the properties of algorithms

Computational statistics - spanning development of regularized / penalized generalized linear mixed models (GLMMs), hierarchical mixture prior Bayesian models, and simpler forms of these models

Machine learning - spanning development of non-linear dimension reduction and clustering, regression trees and random forests, support vector machines (SVMs), probabilistic graphical models, and convolutional neural networks (CNNs)

Algorithms - including Expectation-Maximization (EM), variational Bayes, and Markov chain Monte Carlo (MCMC)

Our core research projects are mostly within four major research areas:

Genome Variation - our published work includes papers on inferring relatedness, whole-genome analysis of human migration, and analysis of genome admixture

Statistical Genetics - our published work includes papers on association analysis of phenotypes ranging from molecular traits, including expression quantitative trait loci (eQTL) and related traits, to complex diseases, including pedigree and bioinformatics analysis of rare disease and genome-wide association studies (GWAS)

Network Discovery - our published work includes papers on network analysis of mixed genomic data types, where our methods have been used to rank top gene candidates for drug development, and papers on causal modeling methods, which have been applied to build hypotheses of drug mechanism of action

Disease Risk Prediction - our published work includes papers on behavioral and environmental biomarkers developed from miRNA and genome-wide gene expression data, identification of disease subtypes from proteomic data, assessments of disease impacts from methylation and single-cell gene expression data, improvement of polygenic risk scores (PRS), and clinical predictors of cancer risk

Our current projects include development of phylogenetic methods for microbiome analysis, application of human pedigree information to improve polygenic risk scores, epistatic analysis of complex diseases, development of multi-omics cancer detection diagnostics from cell-free assays, single-cell analysis of the impacts of gene therapies, development of medical image biomarkers of disease severity, and mining of electronic health records for predictors of drug response.

Please see our publications or contact us for more information on our current work.