Mathematical Statistics, Fall 2009

Schedule

Friday, August 21, 2009, 03:30 PM in BRNG 1268
Professor Bimal Roy, Indian Statistical Institute, Calcutta
Counting Tigers thru pugmarks: a statistical approach

The problem of estimating the population total of animals from imperfectly characterizing animal signs poses a number of interesting statistical questions. A case in point is the estimation of the tiger population from pugmark (footprint) measurements. We assume a random effects model for the multivariate observations (features extracted from the pugmark). The matrices representing the "within-tiger" and "between-tiger" covariances are estimated and, assuming these to be known, the number of tigers is estimated using the maximum likelihood method.

Wednesday, August 26, 2009, 03:30 PM in REC 315
Professor Jayanta Ghosh, Department of Statistics, Purdue University
Organizational Meeting

All registered students should attend the first meeting. All interested faculty are also invited to attend. We will have a discussion on what we should do. After that I will present a survey of some topics of current interest.

Wednesday, September 2, 2009, 03:30 PM in REC 315
Professor Jayanta Ghosh, Department of Statistics, Purdue University
Multiple Tests for Microarrays, Mixture Models, Bayes Oracle and (Asymptotic) Optimality of the Benjamini Hochberg Test

I will introduce the problem of high dimensional multiple testing, which has become popular in microarrays and many other areas; see, for example, Efron (Statistical Science, 2008). Efron's model is nonparametric. I will instead use another popular parametric model and discuss the use of FDR (False Discovery Rates), the Benjamini Hochberg rule for multiple testing, and a host of other tests. Many concrete applications, an unpublished talk by Benjamini at the last Purdue Symposium (2003), and some theory based on estimation, coming from Stanford, show the importance of the BH multiple test. On the other hand, Bayesian decision theorists have pointed out that the test is not easy to justify from a decision theoretic approach to testing.

I will discuss these issues briefly and then, drawing on new joint work with Professors Malgorzata Bogdan, Arijit Chakrabarti and Florian Frommlet, will present a popular mixture model approach to multiple testing. This will lead to what we call a Bayes Oracle and results that show the BH test is asymptotically as good as the Bayes Oracle (under appropriate conditions). This will be done in two lectures on September 2 and September 9. On September 16, Professor Bogdan will continue the discussion.

My presentations will stress heuristics and intuition rather than technical details.
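For readers unfamiliar with the Benjamini Hochberg rule named in this abstract, the following is a minimal sketch of the standard step-up procedure. The function name, the p-values, and the level `alpha` are illustrative assumptions and are not taken from the talk.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected by the BH step-up rule."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        # Remember the largest rank whose p-value is below alpha * rank / m.
        if p_values[idx] <= alpha * rank / m:
            k = rank
    # Reject every hypothesis up to that rank (this is the step-up part).
    return sorted(order[:k])


# Hypothetical p-values, for illustration only.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p, alpha=0.05)
print(rejected)
```

Because the rule is step-up, a hypothesis can be rejected even when its own p-value exceeds its own cutoff, as long as a larger p-value further down the ordering meets its cutoff.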

Wednesday, September 9, 2009, 03:30 PM in REC 315
Professor Jayanta Ghosh, Department of Statistics, Purdue University
The Two Groups Model Revisited, Insights about the Benjamini Hochberg Rule, The Bayes Oracle and Asymptotic Optimality of the BH Rule

I will go over the two formulations for microarrays and provide more insight on the BH test in the first 20 minutes. In the rest of the talk I will explain and calculate the Bayes Oracle, state one optimality theorem carefully, and sketch the main steps in the proof. The discussion of Multiple Tests will continue on September 16. The speaker will be Professor Malgorzata Bogdan, our Visiting Professor this fall. That will be the last talk on Multiple Tests.
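As a rough illustration of what calculating a Bayes oracle can look like, the sketch below assumes a two-groups normal scale mixture: each statistic is N(0, sigma2) under the null and N(0, sigma2 + tau2) under the alternative, with prior probability `p` of an alternative and separate losses for type I and type II errors. The abstract does not specify the model, so the model, the function names, and the numbers are all assumptions made for this example.

```python
import math


def normal_pdf(x, variance):
    """Density of a N(0, variance) distribution at x."""
    return math.exp(-x * x / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def oracle_rejects(x, p, sigma2, tau2, loss_type1=1.0, loss_type2=1.0):
    """Reject the null when rejecting has the smaller posterior expected loss."""
    null_term = loss_type1 * (1 - p) * normal_pdf(x, sigma2)
    alt_term = loss_type2 * p * normal_pdf(x, sigma2 + tau2)
    return alt_term > null_term


def oracle_threshold_sq(p, sigma2, tau2, loss_type1=1.0, loss_type2=1.0):
    """Cutoff c^2 such that the rule above rejects iff x^2 / sigma2 > c^2.

    Obtained by solving the inequality in oracle_rejects for x^2.
    """
    u = tau2 / sigma2
    f = (1 - p) / p
    delta = loss_type1 / loss_type2
    return (1 + 1 / u) * math.log(f * f * delta * delta * (1 + u))


# Hypothetical sparse setting: 1% alternatives, equal losses.
c2 = oracle_threshold_sq(p=0.01, sigma2=1.0, tau2=10.0)
print(math.sqrt(c2))
```

In this assumed setting the oracle is a fixed threshold rule on |x|, and the threshold grows as the proportion of alternatives `p` shrinks.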

Wednesday, September 16, 2009, 03:30 PM in REC 315
Professor Malgorzata Bogdan, Visiting Professor, Department of Statistics, Purdue University
On the asymptotic optimality of the Benjamini and Hochberg procedure

We will continue the discussion on the asymptotic optimality of multiple testing rules within the framework of Bayesian Decision Theory. As in [1], our main interest is in the asymptotic scheme under which the proportion of "true" alternatives converges to zero as the number of tests increases to infinity. According to our definition, a multiple testing rule is asymptotically optimal if the ratio of its Bayes risk to that of the Bayes oracle (a rule which minimizes the Bayes risk) converges to one within this asymptotic framework. We characterize the set of fixed threshold multiple testing rules which are asymptotically optimal and, based on this characterization, discuss some important scenarios under which the Bonferroni correction and the fixed threshold rules controlling the Bayesian False Discovery Rate (BFDR) are asymptotically optimal. Finally, we present results on the convergence of the random threshold of the Benjamini and Hochberg procedure to the threshold of the BFDR controlling rule, and on the asymptotic optimality of the Benjamini and Hochberg procedure under a wide range of sparsity levels and relatively mild assumptions on the ratio of losses for type I and type II errors. As far as we know, this is the first result on the decision theoretic optimality of the Benjamini-Hochberg rule in the context of hypothesis testing.

References

  1. Abramovich F., Benjamini Y., Donoho D. L. and Johnstone I. M. (2006). Adapting to unknown sparsity by controlling the false discovery rate. Ann. Statist., 34, 584-653.
  2. Bogdan M., Chakrabarti A. and Ghosh J. K. (2009). Bayes Oracle and the Asymptotic Optimality of the Multiple Testing Procedures Under Sparsity. Tech Report 02/09, Purdue University.
  3. Bogdan M., Chakrabarti A., Frommlet F. and Ghosh J. K. (2009). On the Asymptotic Optimality of the Multiple Testing Rules Under Sparsity. In preparation.

Wednesday, September 23, 2009, 03:30 PM in REC 315
Professor Jose Figueroa-Lopez, Department of Statistics, Purdue University
Some estimation problems related to time-changed Lévy models

Time-changed Lévy models are known to capture several stylized features of asset prices such as leptokurtic distributions and volatility clustering. In this talk we study the problem of estimating the parameters controlling the jump behavior of the process as well as the underlying random clock. We obtain consistent estimators of the relevant parameters when both the sampling time-horizon and frequency get larger, and prove central limit theorems for our estimators. The performance of the estimators is also tested numerically for a variance Gamma Lévy process time-changed by a CIR diffusion model.

Wednesday, September 30, 2009, 03:30 PM in REC 315
Professor Sergey Kirshner, Department of Statistics, Purdue University
Crash Course on Graphical Models: Part I, Theory

Over the last 20 years, graphical models have become an incredibly important tool in dealing with problems in high-dimensional structured domains. Representing the set of conditional independence relations between the observations as a graph, graphical models provide a framework for inference and parameter estimation with the computational complexity dependent only on the properties of the underlying graph.

In the first part of a two part series, I will introduce two of the most commonly used types of graphical models, Bayesian networks (directed) and Markov networks (undirected). After describing the semantics for both types of models, I will provide a brief overview of related inference and parameter estimation methods.

The second part of the series will focus on applications of such models.
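To make the semantics of a directed model concrete, here is a small sketch of a Bayesian network on three binary variables arranged as a chain A -> B -> C. The joint distribution factorizes into one conditional table per node, and the graph implies that C is independent of A given B. The probability tables are hypothetical numbers chosen only for illustration.

```python
from itertools import product

# Hypothetical conditional probability tables for the chain A -> B -> C.
p_a = {0: 0.7, 1: 0.3}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}
p_c_given_b = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.25, 1: 0.75}}


def joint(a, b, c):
    """Joint probability from the factorization P(a) P(b | a) P(c | b)."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]


def conditional_c(c, a, b):
    """P(C = c | A = a, B = b), computed from the joint by normalization."""
    return joint(a, b, c) / sum(joint(a, b, k) for k in (0, 1))


# The factorized joint sums to one over all eight configurations.
total = sum(joint(a, b, c) for a, b, c in product((0, 1), repeat=3))
print(total)

# Conditioning on B makes A irrelevant for C: both values agree.
print(conditional_c(1, a=0, b=1), conditional_c(1, a=1, b=1))
```

The full joint over three binary variables would need seven free parameters, while this factorization uses five, which is the kind of saving that the graph structure buys.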

Wednesday, October 7, 2009, 03:30 PM in REC 315
Professor Sergey Kirshner, Department of Statistics, Purdue University
Crash Course in Graphical Models, Part II, Theory and Applications

The second part of the series will continue where the first part left off. I will introduce Markov networks and their relation to log-linear models, and will provide an overview of inference methods in both directed and undirected graphical models. Time permitting, I will present several applications of such techniques.

Wednesday, October 14, 2009, 03:30 PM in REC 315
Bowei Xi, Department of Statistics, Purdue University
Adversarial Classification

Many applications, ranging from spam filtering to intrusion detection, are faced with active adversaries. In all these applications, the future datasets and the training dataset are not from the same population, due to the transformations employed by the adversaries. Hence a main assumption of existing classification techniques no longer holds, and initially successful classifiers will degrade easily. This becomes a game between the adversary and the data miner: the adversary modifies its strategy to avoid being detected by the current classifier; the data miner then updates its classifier based on the new threats. We investigate the possibility of an equilibrium in this seemingly never-ending game, where neither party has an incentive to change. Modifying the classifier causes too many false positives with too little increase in true positives; changes by the adversary decrease the utility of the false negative items that are not detected. We develop a game theoretic framework in which the equilibrium behavior of adversarial classification applications can be analyzed, and provide a solution for finding an equilibrium point. A classifier's equilibrium performance indicates its eventual success or failure. The data miner could then select attributes based on their equilibrium performance, and construct an effective classifier.

Wednesday, October 21, 2009, 03:30 PM in REC 315
Professor Jun Xie, Department of Statistics, Purdue University
Statistical Challenges in Analysis of Large Scale SNP Data and Gene Expression Data

Despite progress in statistical analyses of genomic data, more specifically SNP and gene expression data, many statistical challenges in these data sets are unsolved. SNPs are single base differences in DNA sequence among individuals. The data type is categorical, with three possible genotypes at a single SNP. But when we consider a block of 10 SNPs, there are about 60,000 categories, far more than a typical sample size. Besides this difficulty, multiple testing is always an issue. I will present some preliminary analysis for large scale SNP data and introduce a new concept of hypothesis testing motivated by Dempster-Shafer theory for inference. I will also mention another data set of gene expression, with a challenging goal of classifying patients' responses to a drug.

This is joint work with Professor Chuanhai Liu in the Department of Statistics.
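The figure of about 60,000 categories in the abstract follows from counting joint genotype configurations: with three genotypes per SNP and a block of 10 SNPs, there are 3^10 combinations. A one-line check, with variable names chosen for this example:

```python
genotypes_per_snp = 3   # three possible genotypes at a single SNP
block_size = 10         # number of SNPs considered jointly

# Each SNP contributes an independent factor of 3 to the number of cells.
n_categories = genotypes_per_snp ** block_size
print(n_categories)  # 59049
```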

Wednesday, October 28, 2009, 03:30 PM in REC 315
Professor Michael Levine, Department of Statistics, Purdue University
Mixing Density and Mixture Density Estimation: Estimation Methods and Possible Connections

It is well known that when the true mixing distribution is continuous, its nonparametric maximum likelihood estimate is degenerate. I will discuss an alternative method that maximizes a penalized likelihood instead. The resulting estimate is called the nonparametric maximum penalized likelihood estimate (NPMPLE). A functional EM algorithm is proposed for computing the NPMPLE of the continuous mixing density. This is joint work with Michael Y. Zhu.

Time permitting, I will also discuss possible connections between mixing density estimation and the estimation of finite mixtures of nonparametric components. This will be a discussion of current joint work with David Hunter and Didier Chauveau.

Wednesday, November 4, 2009, 03:30 PM in REC 315
Professor Jayanta Ghosh, Department of Statistics, Purdue University
Wellner's Talk on the Future Directions of Statistics at the Last JSM

I will go over Wellner's slides, providing some explanation of the topics, comments on what we can learn from predictions of future directions of research, and comments on and augmentation of his suggested topics. Wellner's presentation of citations of hot topics of the present and the recent past, and of a conference on predictions in 1967, is especially interesting.

Wednesday, November 11, 2009, 03:30 PM in REC 315
Professor Sergey Kirshner, Department of Statistics, Purdue University
Crash Course in Graphical Models, Part III, Theory and Applications

The third part of the series will provide an overview of inference methods in both directed and undirected graphical models. Time permitting, I will present several applications of such techniques.

Wednesday, November 18, 2009, 03:30 PM in REC 315
Professor Lingsong Zhang, Department of Statistics, Purdue University
Sparse Linear Discriminant Analysis Method for Genetic Pathways

An increasing challenge in the analysis of genomic data is how to interpret and gain biological insight from profiles of thousands of genes. There is growing interest in analyzing genomic data by incorporating prior biological knowledge using gene sets and genomic pathways, which consist of groups of biologically similar genes. Such approaches allow one to study the joint effects of a group of genes. Existing methods include over-representation analysis, gene set enrichment analysis, principal component analysis, global test, and kernel machine. However, these pathway analysis methods do not provide a selection of important genes in the pathway, and the analysis can be dominated by the noise from noninformative genes. We propose sparse linear discriminant analysis (SLDA) for genetic pathway data, which allows us to study the joint effects of genes within a pathway while selecting important genes that drive the differences. We provide an efficient path algorithm to obtain the solution. We illustrate these methods by application to a type II diabetes data set and a metal fume exposure data set.

Related Paper:

Wu, M. C., Zhang, L., Wang, Z., Christiani, D. C. and Lin, X. (2009), "Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection", Bioinformatics, 25(9), pp. 1145-1151.

Wednesday, November 25, 2009
No Mathematical Statistics Seminar

Wednesday, December 2, 2009, 03:30 PM in REC 315
Professor Guang Cheng, Department of Statistics, Purdue University
Bootstrap Consistency for General Semiparametric M-estimation

Consider M-estimation in a semiparametric model that is characterized by a Euclidean parameter of interest and a nuisance function parameter. The bootstrap is a widely used resampling method applied to draw inferences in the context of semiparametric M-estimation. We show that, under general conditions, the bootstrap is asymptotically consistent in estimating the distribution of the M-estimate of the Euclidean parameter; that is, the bootstrap distribution asymptotically imitates the distribution of the M-estimate. We also show that the bootstrap confidence set has the asymptotically correct coverage probability. These general conclusions hold, in particular, when the nuisance parameter is not estimable at root-n rate. Our results provide a theoretical justification for the use of the bootstrap as an inference tool in semiparametric modelling and apply to a broad class of bootstrap methods with exchangeable bootstrap weights. In this paper, we will also apply this general theory to several popular semiparametric models, e.g., the Cox regression model with survival data.
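As a toy illustration of bootstrap inference for an M-estimate, the sketch below applies Efron's nonparametric bootstrap (resampling with replacement, which corresponds to exchangeable multinomial weights) to the sample mean, the simplest M-estimator. It is a fully parametric-free toy with no nuisance function, so it does not reproduce the semiparametric setting of the talk; the simulated data, seeds, and function names are assumptions made for the example.

```python
import random
import statistics


def bootstrap_means(sample, n_boot=2000, seed=0):
    """Recompute the sample mean on n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_boot)]


def percentile_interval(draws, level=0.95):
    """Percentile bootstrap confidence interval from the resampled estimates."""
    ordered = sorted(draws)
    lo_idx = int((1 - level) / 2 * len(ordered))
    hi_idx = int((1 + level) / 2 * len(ordered)) - 1
    return ordered[lo_idx], ordered[hi_idx]


# Simulated data, for illustration only: 200 draws from N(5, 2^2).
data_rng = random.Random(1)
sample = [data_rng.gauss(5.0, 2.0) for _ in range(200)]

draws = bootstrap_means(sample)
lower, upper = percentile_interval(draws)
print(lower, statistics.fmean(sample), upper)
```

The spread of `draws` around the original estimate plays the role of the sampling distribution of the estimator, which is the sense in which the bootstrap distribution imitates the distribution of the M-estimate.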