Session 14 - Department of Statistics - Purdue University

Challenges and Opportunities in Statistical Bioinformatics

Speaker(s)

  • Bhramar Mukherjee (University of Michigan, Ann Arbor)
  • Hongzhe Li (University of Pennsylvania)
  • Seyoung Kim (Carnegie Mellon University)
  • Daniela Witten (University of Washington)
  • George Michailidis (University of Michigan, Ann Arbor)
  • Curtis Huttenhower (Harvard University)
  • Colin N. Dewey (University of Wisconsin, Madison)
  • Jeffrey Leek (Johns Hopkins University)

Description

Tremendous progress in measurement technologies, and in the experimental workflows that these technologies support, has had a profound impact on modern biological and biomedical investigations. These investigations are increasingly quantitative and high throughput, ask broader scientific questions, and rely on complementary data from heterogeneous sources. As a result of this change, life scientists are becoming increasingly aware of the key role that statistical methodology plays in designing informative experiments and in reaching objective and reproducible conclusions. At the same time, new scientific questions and data structures create exciting opportunities for methodological development.

The session brings together leading national and international experts in statistical bioinformatics. The speakers will discuss statistical problems and solutions in areas such as design and analysis of next-generation sequencing experiments, integration of data from heterogeneous sources, feature selection for clustering and classification, and associations between molecular biomarkers and disease. The session will be of interest both to statisticians working on or interested in the analysis of biological data, and to biologists interested in expanding their statistical analysis toolbox.

Schedule

Sat, June 23 - Location: STEW 322

Time Speaker Title
8:30-9:10AM Bhramar Mukherjee Incorporating Auxiliary Information for Improved Prediction in High Dimensional Datasets: An Ensemble of Shrinkage Approaches
Abstract: With advances in genomic technologies, it is common for two high-dimensional datasets to be available, both measuring the same underlying biological phenomenon with different techniques. We consider predicting a continuous outcome Y using X, a set of p markers which is the best measure of the underlying biological process. This same biological process may also be measured by W, coming from prior technology but correlated with X. On a moderately sized sample we have (Y,X,W), and on a larger sample we have (Y,W). We utilize the data on W to boost prediction of Y by X. When p is large and the subsample containing X is small, this is a p>n situation. When p is small, this is akin to the classical measurement error problem; however, ours is not the typical goal of calibrating W for use in future studies. We propose to shrink the regression coefficients of Y on X towards different targets that use information derived from W in the larger dataset, comparing these with the classical ridge regression of Y on X, which does not use W. We also unify all of these methods as targeted ridge estimators. Finally, we propose a hybrid estimator which is a linear combination of multiple estimators of the regression coefficients and balances efficiency and robustness in a data-adaptive way to theoretically yield smaller prediction error than any of its constituents. The methods are evaluated via simulation studies. We also apply them to a gene-expression dataset. mRNA expression of 91 genes is measured by quantitative real-time polymerase chain reaction (qRT-PCR) and microarray technology on 47 lung cancer patients, with microarray measurements available on an additional 392 patients. The goal is to predict survival time using qRT-PCR. The methods are evaluated on an independent sample of 101 patients. This is joint work with Jeremy Taylor and Philip S. Boonstra.
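The idea of shrinking toward a target rather than toward zero can be illustrated with a minimal sketch. It assumes a generic penalized least-squares form, minimizing ||y - Xb||^2 + lam * ||b - target||^2, which has a closed-form solution. The function name targeted_ridge, the penalty value lam, the simulated data, and the "informed" target are all hypothetical; the abstract does not specify how the targets are built from W, so this is not the authors' estimator.

```python
import numpy as np

def targeted_ridge(X, y, target, lam):
    """Closed-form minimizer of ||y - X b||^2 + lam * ||b - target||^2.

    With target = 0 this reduces to classical ridge regression.
    """
    p = X.shape[1]
    lhs = X.T @ X + lam * np.eye(p)
    rhs = X.T @ y + lam * target
    return np.linalg.solve(lhs, rhs)

# Simulated p > n data (purely illustrative).
rng = np.random.default_rng(0)
n, p = 20, 50
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Classical ridge: shrink toward zero.
beta_ridge = targeted_ridge(X, y, np.zeros(p), lam=1.0)

# Hypothetical informed target, standing in for information derived from W.
informed_target = beta_true + rng.normal(scale=0.1, size=p)
beta_targeted = targeted_ridge(X, y, informed_target, lam=1.0)

err_ridge = np.linalg.norm(beta_ridge - beta_true)
err_targeted = np.linalg.norm(beta_targeted - beta_true)
print(f"ridge error: {err_ridge:.3f}, targeted error: {err_targeted:.3f}")
```

When the target is close to the truth, the directions that X cannot identify (the null space of X when p > n) are filled in by the target instead of by zero, which is where the gain comes from in this toy setting.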
9:15-9:55AM Hongzhe Li Sparse Dirichlet-Multinomial Regression for Analysis of Microbiome Data
Abstract: With the development of next-generation sequencing technology, researchers are now able to study microbiome composition using direct sequencing, whose output is a set of bacterial taxa counts for each microbiome sample. One goal of microbiome studies is to associate the microbiome composition with environmental covariates. We propose to model the taxa counts using a Dirichlet-multinomial (DM) regression model in order to account for overdispersion of observed counts. The DM regression model can be used for testing the association between taxa composition and covariates using the likelihood ratio test. However, when the number of covariates is large, multiple testing can lead to loss of power. To deal with the high dimensionality of the problem, we develop a penalized likelihood approach to estimate the regression parameters and to select the variables by imposing a sparse group $\ell_1$ penalty to encourage both group-level and within-group sparsity. Such a variable selection procedure can lead to selection of the relevant covariates and their associated bacterial taxa. An efficient block-coordinate descent algorithm is developed to solve the optimization problem. We present extensive simulations to demonstrate that sparse DM regression can result in better identification of the microbiome-associated covariates than models that ignore overdispersion or only consider the proportions. We demonstrate the power of our method in an analysis of a data set evaluating the effects of nutrient intake on human gut microbiome composition. Our results clearly show that nutrient intake is strongly associated with the human gut microbiome.
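The two ingredients named in the abstract, a Dirichlet-multinomial likelihood and a sparse group penalty, can be sketched as follows. The DM log-likelihood is the standard one; the log link from covariates to the DM parameters, the grouping of coefficients by covariate, and all function and parameter names (dm_alpha, dm_log_likelihood, sparse_group_l1, lam_group, lam_l1) are assumptions made for illustration and are not taken from the talk.

```python
import math

def dm_alpha(x, intercepts, B):
    """Assumed log link: alpha_j = exp(intercept_j + sum_k x_k * B[k][j]).

    B has one row per covariate and one column per taxon.
    """
    return [
        math.exp(b0 + sum(xk * row[j] for xk, row in zip(x, B)))
        for j, b0 in enumerate(intercepts)
    ]

def dm_log_likelihood(counts, alpha):
    """Log-probability of a vector of taxa counts under Dirichlet-multinomial(alpha)."""
    n = sum(counts)
    a = sum(alpha)
    ll = math.lgamma(n + 1) - sum(math.lgamma(y + 1) for y in counts)
    ll += math.lgamma(a) - math.lgamma(n + a)
    ll += sum(math.lgamma(y + al) - math.lgamma(al) for y, al in zip(counts, alpha))
    return ll

def sparse_group_l1(B, lam_group, lam_l1):
    """Group (Euclidean norm per covariate row) plus elementwise l1 penalty."""
    total = 0.0
    for row in B:
        total += lam_group * math.sqrt(sum(b * b for b in row))
        total += lam_l1 * sum(abs(b) for b in row)
    return total

# Toy example: 2 covariates, 3 taxa; the second covariate has no effect.
B = [[0.5, -0.5, 0.0], [0.0, 0.0, 0.0]]
alpha = dm_alpha(x=[1.0, 2.0], intercepts=[0.0, 0.0, 0.0], B=B)
objective = -dm_log_likelihood([10, 3, 7], alpha) + sparse_group_l1(B, 1.0, 0.5)
print(objective)
```

The group term can zero out an entire covariate (a whole row of B), while the elementwise term can zero out individual taxa within a covariate that is retained, which matches the "group-level and within-group sparsity" described above.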
10:00-10:30AM Break
10:30-11:10AM Seyoung Kim Sparse learning methods for dissecting the genetic control of biological systems
Abstract: Since the completion of genome sequencing projects for various organisms, including humans and various model organisms, the fundamental goal of research in computational genomics, systems biology, and genetics has been to gain a complete understanding of how the instruction sets encoded in genomes are executed within a cell system and organism. Recent advances in high-throughput technologies such as next-generation sequencing have allowed researchers to collect a large amount of data on genomes and various other aspects of a cell system. Such datasets hold the key to understanding the detailed mechanisms of the genetic control of a biological system and to further deepening our knowledge of cell biology, with potential applications to medicine. In this talk, I will present statistical methods that we have developed for learning from high-dimensional genomic data to dissect the genetic control of biological systems. I will focus on sparse learning methods that range from sparse regression methods to sparse probabilistic graphical models, and describe how such methods can be used to effectively extract complex epistatic and pleiotropic interactions among various entities in a cell system. In addition, I will discuss efficient learning algorithms for these methods that allow for analysis of large-scale genome-wide datasets. Using a yeast genotype and gene-expression dataset, I will demonstrate how our methods can lead to new insights into the activities of genes in a cell as well as the perturbation of gene expression by genetic variation, and discuss experimental validation that confirmed our new findings.
11:15-11:55AM Daniela Witten Fast graphical model estimation and its applications
Abstract: The graphical lasso, recently proposed for Gaussian graphical modeling in high dimensions, involves estimating an inverse covariance matrix under a multivariate normal model by maximizing the L1-penalized log likelihood. I will begin by presenting a very simple but previously unknown necessary and sufficient condition that can be used to identify the connected components in the graphical lasso solution. This condition can be used to achieve massive computational gains: computing the graphical lasso solution with 20,000 features now takes minutes on a standard desktop machine, whereas previously the computations were prohibitive. This opens up new doors for rigorous network analysis of high-dimensional biological data. As a specific example, I will discuss estimation of graphical models under distinct biological conditions, in which we expect some, but not all, aspects of the networks to differ between conditions. An extension of the necessary and sufficient condition developed for the graphical lasso allows for extremely fast network estimation in this setting. Parts of this work are joint with Jerry Friedman, Noah Simon, Pei Wang, and Patrick Danaher.
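The abstract does not state the condition itself. The sketch below assumes the screening rule from the published literature on this topic: with sample covariance S and penalty lam, two features lie in the same connected component of the graphical lasso solution exactly when they are connected in the graph that has an edge (i, j) whenever |S_ij| > lam. The function name screened_components and the toy matrix are hypothetical. Under this assumption, each component can be passed to a graphical lasso solver separately, which is the source of the speedup.

```python
import numpy as np

def screened_components(S, lam):
    """Label connected components of the graph with an edge (i, j) when |S_ij| > lam.

    Returns a list of integer labels, one per feature.
    """
    p = S.shape[0]
    adj = np.abs(S) > lam
    labels = [-1] * p
    current = 0
    for start in range(p):
        if labels[start] != -1:
            continue
        # Depth-first search from an unlabeled feature.
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(adj[i]):
                if labels[j] == -1:
                    labels[j] = current
                    stack.append(int(j))
        current += 1
    return labels

# Toy sample covariance with two strongly correlated pairs and weak cross-links.
S = np.array([
    [1.00, 0.60, 0.05, 0.00],
    [0.60, 1.00, 0.00, 0.02],
    [0.05, 0.00, 1.00, 0.70],
    [0.00, 0.02, 0.70, 1.00],
])
print(screened_components(S, lam=0.1))
```

At lam = 0.1 the weak cross-links fall below the threshold, so the four features split into two blocks that could be estimated independently; a smaller penalty such as lam = 0.01 merges them back into a single component.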
12:00-1:30PM Lunch
1:30-2:10PM George Michailidis Pathway Enrichment Analysis: Current Approaches and Outstanding Challenges
Abstract: Pathway enrichment analysis has become an important tool for biomedical researchers for gaining insight into the underlying biology of differentially expressed genes/proteins/metabolites, as it reduces complexity and enhances explanatory power. We provide a brief overview of available methods, discuss their strengths and weaknesses, and outline methodological challenges. Particular attention is paid to network-based pathway enrichment analysis methods due to their superior performance. The methods are illustrated on a number of data sets from diverse omics technologies.
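For readers new to the topic, one of the simplest and most widely used forms of pathway enrichment is over-representation analysis with a hypergeometric test. The sketch below is a generic illustration of that baseline, not of the network-based methods the talk emphasizes; the function name enrichment_p_value and the example numbers are hypothetical.

```python
from math import comb

def enrichment_p_value(n_universe, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric null.

    n_universe: number of genes measured
    n_pathway:  number of measured genes annotated to the pathway
    n_selected: number of differentially expressed genes
    n_overlap:  number of differentially expressed genes in the pathway
    """
    upper = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 40 of 200 selected genes fall in a 300-gene pathway,
# out of 10,000 genes measured.
p_value = enrichment_p_value(10_000, 300, 200, 40)
print(p_value)
```

This test treats genes as exchangeable and ignores the interactions among them, which is one of the limitations that network-based approaches aim to address.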
2:15-2:55PM Curtis Huttenhower Bug bytes: bioinformatics for metagenomics and microbial community analysis
Abstract: Among many surprising insights, the genomic revolution has helped us to realize that we're never alone and, in fact, barely human. For most of our lives, we share our bodies with some ten times as many microbes as human cells; these are resident in our gut and on nearly every body surface, and they are responsible for a tremendous diversity of metabolic activity, immunomodulation, and intercellular signaling.

These microbial communities have only recently become well described using high-throughput sequencing, requiring analyses that simultaneously apply techniques from genomics, "big data" mining, and molecular epidemiology. I will discuss emerging end-to-end bioinformatics approaches for metagenomics, including initial handling of sequence data for mixed microbial communities, its reconstruction into metabolic pathways, and biomarker discovery in disease. In particular, computational processing is key in identifying unique markers for microbial taxonomy and phylogeny, and in identifying genes and pathways significantly disrupted in inflammatory conditions such as Crohn's disease and ulcerative colitis.
3:00-3:30PM Break
3:30-4:10PM Colin N. Dewey Inference of alternative splicing from RNA-Seq data with probabilistic splice graphs
Abstract: Alternative splicing and other processes that allow for different transcripts to be derived from the same gene are significant forces in the eukaryotic cell. RNA-Seq is a promising technology for analyzing alternative transcripts, as it does not require prior knowledge of transcript structures or genome sequences. However, analysis of RNA-Seq data in the presence of genes with large numbers of alternative transcripts is currently challenging due to efficiency, identifiability, and representation issues. We present RNA-Seq models and associated inference algorithms based on the concept of probabilistic splice graphs, which alleviate these issues. We prove that our models are often identifiable and demonstrate that our inference methods for quantification and differential splicing detection are efficient and accurate.
4:15-4:55PM Jeffrey Leek Dissecting variation in RNA-sequencing data
Abstract: RNA-sequencing is rapidly replacing microarrays as the most common approach to measure gene expression. This transition is being driven by the rapid decline in cost of sequencing and the flexibility of sequencing technology to measure transcription of novel regions, allele specific transcription, and alternative transcription. In this talk, I will examine variability in expression measurements from RNA-sequencing, discuss the effects of technical and biological variation on expression experiments, and discuss parallels to early microarray analyses.

Purdue Department of Statistics, 150 N. University St, West Lafayette, IN 47907

Phone: (765) 494-6030, Fax: (765) 494-0558
