Session 010 - Department of Statistics - Purdue University

Recent Developments in Data Integration

Organizer: Fei Xue, Assistant Professor of Statistics

Speakers

  • Annie Qu, Chancellor's Professor, Department of Statistics, University of California-Irvine
  • Rui Duan, Assistant Professor of Biostatistics, Department of Statistics and the Department of Epidemiology, Harvard 
  • Gen Li, Associate Professor of Biostatistics, Department of Statistics, University of Michigan
  • Antik Chakraborty, Assistant Professor of Statistics, Department of Statistics, Purdue University

Speaker Title
Annie Qu Crowdsourcing Utilizing Subgroup Structure of Latent Factor Modeling

Abstract: Crowdsourcing has emerged as an alternative solution for collecting large-scale labels. However, the majority of recruited workers are not domain experts, so the labels they contribute can be noisy. In this talk, we propose a two-stage model to predict the true labels for multicategory classification tasks in crowdsourcing. In the first stage, we fit the observed labels with a latent factor model and incorporate subgroup structures for both tasks and workers through a multi-centroid grouping penalty. Group-specific rotations are introduced to align workers with different task categories to solve multicategory crowdsourcing tasks. In the second stage, we propose a concordance-based approach to identify high-quality worker subgroups, which are then relied upon to assign labels to tasks. In theory, we show the estimation consistency of the latent factors and the prediction consistency of the proposed method. Simulation studies show that the proposed method outperforms existing competitive methods when subgroup structures are present within tasks and workers. We also demonstrate the application of the proposed method to real-world problems and show its superiority.

Rui Duan Federated and transfer learning for healthcare data integration

Abstract: The growing availability and variety of healthcare data sources has provided unique opportunities for data integration and evidence synthesis, which can potentially accelerate knowledge discovery and improve clinical decision-making. However, many practical and technical challenges, such as data privacy, high dimensionality, and heterogeneity across different datasets, remain to be addressed. In this talk, I will introduce several methods for the effective and efficient integration of multiple healthcare datasets in order to train statistical or machine learning models with improved generalizability and transferability. Specifically, we develop communication-efficient federated learning algorithms for jointly analyzing multiple datasets without the need to share patient-level data, as well as transfer learning approaches that leverage shared knowledge learned across multiple datasets to improve the performance of statistical models in target populations of interest. I will discuss both the theoretical properties and examples of implementation of our methods in real-world research networks and data consortia.
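As a minimal illustration of analyzing multiple datasets without sharing patient-level data, the sketch below fits a simple linear regression from per-site aggregate totals only. It is not the speaker's algorithm, which the abstract does not specify; the names `SiteSummary`, `summarize_site`, and `fit_from_summaries` and the toy numbers are hypothetical. For least squares a single round of communication reproduces the pooled fit exactly; other models generally need iterative or approximate schemes.

```python
from dataclasses import dataclass


@dataclass
class SiteSummary:
    """Aggregate totals a site shares instead of patient-level rows."""
    n: int
    sum_x: float
    sum_y: float
    sum_xx: float
    sum_xy: float


def summarize_site(x, y):
    """Run locally at each site; only these five totals leave the site."""
    return SiteSummary(
        n=len(x),
        sum_x=sum(x),
        sum_y=sum(y),
        sum_xx=sum(a * a for a in x),
        sum_xy=sum(a * b for a, b in zip(x, y)),
    )


def fit_from_summaries(summaries):
    """Add up the site totals and solve the least-squares normal equations."""
    n = sum(s.n for s in summaries)
    sx = sum(s.sum_x for s in summaries)
    sy = sum(s.sum_y for s in summaries)
    sxx = sum(s.sum_xx for s in summaries)
    sxy = sum(s.sum_xy for s in summaries)
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Hypothetical toy data for two sites (not from any real study).
site_a = ([1.0, 2.0, 3.0], [2.1, 3.9, 6.2])
site_b = ([4.0, 5.0], [8.1, 9.8])

intercept, slope = fit_from_summaries(
    [summarize_site(*site_a), summarize_site(*site_b)]
)
```

Because the totals are additive across sites, the coordinating center obtains the same coefficients it would get from the stacked data while each site keeps its rows private.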

Gen Li Multi-view Graphical Models via Multivariate Neighborhood Selection

Abstract: Multi-view data are frequently encountered in biomedical research, such as in multi-omics studies. It is of particular interest to characterize the conditional dependence structure between features, both within the same view and across different views. Gaussian graphical models are popular tools for identifying such conditional dependencies. However, most existing methods only apply to a single data set and cannot adequately accommodate the between-view heterogeneity and structural constraints of multi-view data. In this talk, we will introduce a new multivariate neighborhood selection method that flexibly estimates both within- and between-view conditional dependence structures in multi-view data. In particular, the within-view components are estimated by covariate-adjusted graphical models, and the between-view components are efficiently estimated via high-dimensional regularized regression. We will demonstrate the efficacy of the proposed method in synthetic and real data examples.

Antik Chakraborty Optimal linear shrinkage for multi-response regression

Abstract: We focus on linear shrinkage estimators for the coefficient matrix of regression models with different outcome variables. To borrow information across the multiple outcomes, we assign a common multivariate Gaussian prior to the coefficient vector of each single-outcome regression model. Under this prior, the corresponding estimator of the coefficient matrix is linear in the estimators based on each outcome separately, as noted by Efron and Morris (1972). This problem is equivalent to estimating the covariance matrix under what is known as the relative savings loss, which leads us to study the estimation of covariance matrices under this loss. We consider the class of rotation-invariant estimators of the covariance matrix, which leads us to the optimal linear shrinkage estimator of the coefficient matrix. We apply the proposed procedure to a study in genetics.
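To make the linearity remark concrete, here is a sketch under simplifying assumptions that are not stated in the abstract: a common design matrix $X$ for all $q$ outcomes, a known common error variance $\sigma^2$, and a zero-mean Gaussian prior with covariance $\Lambda$. The symbols $V$, $\Lambda$, $p$, $q$, and $\hat\beta_j$ are notation introduced here for illustration only.

```latex
\begin{aligned}
\hat\beta_j \mid \beta_j &\sim N_p(\beta_j, V), \qquad V = \sigma^2 (X^\top X)^{-1}, \\
\beta_j &\sim N_p(0, \Lambda), \qquad j = 1, \dots, q, \\
E[\beta_j \mid \hat\beta_j] &= \Lambda (\Lambda + V)^{-1} \hat\beta_j
  = \bigl( I_p - V (\Lambda + V)^{-1} \bigr) \hat\beta_j .
\end{aligned}
```

Under these assumptions the posterior mean applies one fixed matrix to each per-outcome least-squares estimate, so the shrinkage is linear. The unknown $\Lambda$ enters only through $(\Lambda + V)^{-1}$, the inverse of the marginal covariance of $\hat\beta_j$, which suggests why the problem can be recast as one of covariance estimation.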

Purdue Department of Statistics, 150 N. University St, West Lafayette, IN 47907

Phone: (765) 494-6030, Fax: (765) 494-0558
