Bayesian/Empirical Bayes Inference on High-Dimensional Data Analysis
Speaker(s)
- Martin T. Wells (Cornell University) Canceled
- Dongchu Sun (University of Missouri, Columbia)
- Sanat K. Sarkar (Temple University)
- Malgorzata Bogdan (Wroclaw University of Technology, Poland)
- Yuan Alan Qi (Purdue University)
- Feng Liang (University of Illinois, Urbana-Champaign)
- Mahlet G. Tadesse (Georgetown University)
- Liping Tong (Loyola University, Chicago)
Description
An explosion of high-dimensional data generated by recent technological advances poses new challenges to standard statistical inference and has motivated strong interest in dealing with the "curse of dimensionality". Many approaches have been proposed; however, most of the issues remain largely unsolved, or more efficient methods are still needed. Bayesian and empirical Bayes methods are playing increasingly important roles in analyzing high-dimensional data by incorporating important features of the subjects under investigation. This session will bring together experts in Bayesian and empirical Bayes inference to share their insights on high-dimensional data analysis and to promote discussion of future directions in this field.
Schedule
Fri, June 22 - Location: STEW 322
Time | Speaker | Title |
---|---|---|
9:15 - 10:00 | Dongchu Sun | Bayesian Model Selection for a Linear Model with Grouped Covariates |
Abstract: Model selection for normal linear regression models with grouped covariates is considered under a class of Zellner's (1986) $g$-priors. The marginal likelihood function is derived under the proposed priors, and a simplified closed-form expression is given assuming the commutativity of the projection matrices from the design matrices. As an illustration, the marginal likelihood functions of balanced m-way ANOVA models, either solely with main effects or with all interaction effects, are calculated using the closed-form expression. The performance of the proposed priors in model comparison problems is demonstrated by simulation studies on two-way ANOVA models and by two real-data studies.
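The closed-form marginal likelihood comparisons this abstract mentions can be illustrated with the standard (non-grouped) Zellner g-prior Bayes factor for a model with an intercept; a minimal sketch (the function name is illustrative, and the grouped-covariates extension in the talk may differ):

```python
import numpy as np

def g_prior_log_bf(y, X, g):
    """Log Bayes factor of the model with design X (plus intercept)
    against the intercept-only null, under Zellner's g-prior.
    Uses the closed form
        log BF = ((n - 1 - k)/2) log(1 + g) - ((n - 1)/2) log(1 + g (1 - R^2)).
    """
    n, k = X.shape
    yc = y - y.mean()                 # center to absorb the intercept
    Xc = X - X.mean(axis=0)
    beta, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    resid = yc - Xc @ beta
    r2 = 1.0 - (resid @ resid) / (yc @ yc)   # coefficient of determination
    return 0.5 * (n - 1 - k) * np.log1p(g) \
        - 0.5 * (n - 1) * np.log1p(g * (1.0 - r2))
```

A positive log Bayes factor favors the full model over the null; the single hyperparameter g controls how diffuse the prior on the regression coefficients is.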
10:00 - 10:30 | Break | |
10:30 - 11:15 | Sanat K. Sarkar | Capturing the Severity of Type II Errors in High-Dimensional Multiple Testing |
Abstract: Multiple testing methods controlling false discoveries are useful statistical tools for analyzing data from many modern scientific investigations, such as brain imaging, microarray analysis, astronomy, and atmospheric science. A number of such methods have been proposed in the literature. However, they have been developed without addressing an issue that is important in high-dimensional multiple testing with sparse signals: failing to detect a strong signal is often a more severe error than failing to detect a weak signal, with the severity growing as the signal gets stronger. In this talk, an optimal multiple testing method controlling false discoveries is proposed that addresses this issue from a Bayesian decision-theoretic point of view.
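The severity idea can be made concrete with a toy loss in which a missed signal costs its magnitude rather than a flat 1; a hypothetical illustration only, not the decision-theoretic method of the talk:

```python
import numpy as np

def weighted_type2_loss(mu, rejected, weight=np.abs):
    """Total loss from missed signals, where missing a signal of effect
    size mu costs weight(mu) instead of a flat 1 (illustrative weight
    choice; any increasing severity function could be plugged in)."""
    mu = np.asarray(mu, float)
    missed = (mu != 0) & ~np.asarray(rejected, bool)
    return float(weight(mu[missed]).sum())
```

Under a flat loss, missing the strong signal and missing the weak one would be equally bad; under the weighted loss, missing the strong one costs more, which is exactly the asymmetry the abstract argues should be built into the testing procedure.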
11:15 - 12:00 | Malgorzata Bogdan | Modified versions of BIC for sparse high-dimensional regression |
Abstract: The problem of high-dimensional model selection attracts a lot of attention due to the multitude of real-life problems that require searching large databases for influential factors. In this context one usually assumes sparsity, i.e., that only a small proportion of all available explanatory variables has an influence on the response. The assumption of sparsity can be used implicitly in Bayesian model selection by imposing an informative prior on the model dimension. Based on this idea, several modifications of the Bayesian Information Criterion have recently been proposed. In this talk we will present these criteria, discuss the related choices of prior distributions, and discuss their relationship to some standard procedures for multiple testing. We will also present theoretical results on the consistency and asymptotic optimality of these criteria in the Bayesian context. Finally, we will present some applications in the context of gene mapping. The applications use an efficient version of the genetic algorithm (a so-called memetic algorithm), which allows a thorough, directed search through the space of available models and the application of the model selection criteria to estimate posterior probabilities of inclusion for each explanatory variable. The memetic algorithm can also easily be used to estimate these probabilities via exact Bayes calculations, with conjugate priors on the regression model parameters.
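For concreteness, one commonly cited modification adds a dimension penalty of the form 2k log(p/c) to BIC, where p is the number of candidate variables and c is a prior guess at the number of true effects; a hedged sketch (the criteria presented in the talk may use a different penalty):

```python
import numpy as np

def mbic(rss, n, k, p, c=4.0):
    """Modified-BIC-style criterion for Gaussian linear regression:
    n log(RSS/n) approximates -2 log-likelihood, k log n is the usual
    BIC penalty, and 2k log(p/c) is the extra sparsity penalty implied
    by an informative prior on model dimension (c = 4 is an assumed
    default here, not necessarily the speaker's choice)."""
    return n * np.log(rss / n) + k * np.log(n) + 2.0 * k * np.log(p / c)
```

With p in the thousands, the extra term heavily penalizes each added variable, so a larger model must reduce the residual sum of squares substantially before it is preferred.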
12:00 - 1:30 | Lunch | |
1:30 - 2:15 | Yuan Alan Qi | Scalable Bayesian learning for complex data |
Abstract: Data are being generated at an unprecedented pace in various scientific and engineering areas, including biomedical engineering, materials science, and social science. These data provide us with precious opportunities to reveal hidden relationships in natural or synthetic systems and to predict their functions and properties. With growing complexity, however, the data impose new computational challenges, for example, how to handle the high dimensionality, nonlinear interactions, and massive volume of the data. To address these challenges, I have been developing advanced Bayesian models to capture the data complexity and designing scalable algorithms to learn the models efficiently from data. In this talk, I will describe three of my recent works along this line: (1) efficient learning of novel nonparametric models on tensors to discover communities in social networks and predict who should be your friends on Facebook; (2) a novel sparse Bayesian model that integrates generative and conditional models to select correlated variables, such as whole-genome SNPs; and (3) a Bayesian online learning algorithm that learns a dynamic, compact summarization of massive data and uses this summarization to make principled predictions. I will present experimental results on real-world data demonstrating the superior predictive performance of the proposed approaches and discuss other applications, such as predicting patient drug responses.
2:15 - 3:00 | Feng Liang | A Few Things on the One-Way ANOVA Model with Diverging Dimensions |
Abstract: Asymptotic studies of models with a diverging number of parameters have received increasing attention in statistics. A simple example of such a model is a one-way ANOVA model in which the number of replicates is fixed but the number of groups goes to infinity. We consider Zellner's (1986) g-prior and its variants (such as mixed g-priors and empirical Bayes), and provide asymptotic results on the performance (in terms of model selection and estimation) of these prior choices. (This is joint work with Bin Li from UIUC.)
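For a balanced (orthogonal) design, the g-prior posterior mean of the group means has a particularly simple form: the least-squares group effects shrunk toward the grand mean by the factor g/(1+g). A minimal sketch (the function name is illustrative, and the talk's mixed and empirical Bayes variants choose g in more sophisticated ways):

```python
import numpy as np

def g_prior_shrinkage(ybar_groups, grand_mean, g):
    """Posterior mean of the group means in a balanced one-way ANOVA
    under a g-prior centered at the grand mean: each least-squares
    effect (group mean minus grand mean) is shrunk by g/(1+g)."""
    shrink = g / (1.0 + g)
    return grand_mean + shrink * (np.asarray(ybar_groups, float) - grand_mean)
```

As g grows the prior becomes diffuse and the estimates approach the raw group means; small g pulls every group toward the grand mean, which matters when the number of groups diverges while replicates per group stay fixed.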
3:00 - 3:30 | Break | |
3:30 - 4:15 | Mahlet G. Tadesse | A stochastic partitioning method to associate high-dimensional datasets |
Abstract: In recent years, there has been a growing interest in relating data sets in which both the number of regressors and response variables are substantially larger than the sample size. For example, in the context of genomic studies, a common goal is to identify groups of correlated gene expression levels that are modulated by sets of DNA sequence variations. This may give insights into molecular processes underlying various phenotypes. We propose a Bayesian stochastic partitioning method that combines ideas of mixtures of regression models and variable selection methods to identify cluster structures and relationships across high-dimensional data sets. We illustrate the method with applications to genomic studies.
4:15 - 5:00 | Liping Tong | Bayesian Parameter Estimations in the Joint Co-Evolution Model of Social Behavior and Network |
Abstract: An individual's behaviors may be influenced by the behaviors of friends, such as hours spent watching television, playing sports, and unhealthy eating habits. However, preferences for these behaviors may also influence the choice of friends; for example, two children who enjoy playing the same sport are more likely to become friends. To study the interdependence of social networks and behavior, Snijders et al. developed actor-based stochastic modeling (ABSM) methods for the co-evolution process, which are useful for longitudinal social network and behavior data when the behavior variables are discrete and have a limited number of possible values. Unfortunately, since the evolution function for the behavior variable is exponential in form, the ABSM can generate unrealistic results when the behavior variable is continuous or has a large range. To model continuous behavior variables realistically, we propose a co-evolution process in which the network evolution is based on an exponential random graph model and the behavior evolution is based on a linear model. We have developed a procedure based on a Markov chain Monte Carlo EM algorithm to find the maximum likelihood estimates (MLEs) of the parameters. However, it is computationally intensive, which limits the application of our methods to small data sets. To improve computation, we further developed Bayesian methods, which turned out to be more efficient and stable computationally.
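A linear behavior-evolution step of the kind described can be sketched as follows; this is an illustrative form with made-up coefficients, not the authors' exact specification:

```python
import numpy as np

def behavior_step(y, adj, a=0.6, b=0.3, sigma=0.1, rng=None):
    """One step of a toy linear behavior-evolution model: each actor's
    new behavior is a weighted combination of their own current value
    (weight a), the mean behavior of their friends (weight b), and
    Gaussian noise. Isolated actors fall back on their own value.
    The coefficients a, b, sigma are hypothetical defaults."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, float)
    adj = np.asarray(adj, float)          # symmetric friendship matrix
    deg = adj.sum(axis=1)
    peer = np.where(deg > 0, adj @ y / np.maximum(deg, 1.0), y)
    return a * y + b * peer + rng.normal(0.0, sigma, size=y.size)
```

Because the update is linear rather than exponential in the behavior value, it remains well behaved for continuous or wide-ranging behavior variables, which is the motivation the abstract gives for replacing the exponential evolution function.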