Big Data in Plant Science II - Department of Statistics - Purdue University

Big Data in Plant Science II

Co-organizers: Min Zhang, Professor of Statistics, Department of Statistics, Purdue University; Jianming Yu, Professor and Pioneer Distinguished Chair in Maize Breeding, Department of Agronomy, Iowa State University; Siva Prasad Kumpatla, Global Leader, Data Science, Corteva Agriscience, Agriculture Division of DowDuPont

Chair: Jianming Yu, Professor and Pioneer Distinguished Chair in Maize Breeding, Department of Agronomy, Iowa State University

Speakers

  • Rebecca W. Doerge; Glen de Vries Dean of the Mellon College of Science, Professor of Statistics, and Professor of Biology, Carnegie Mellon University
  • Zhen Zhang, Statistician, Data Science & Informatics, Corteva Agriscience, Agriculture Division of DowDuPont
  • Min Zhang, Professor of Statistics, Department of Statistics, Purdue University
  • Karl W. Broman, Professor, Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison
Schedule

Thursday, June 7, 1:30-3:30 p.m. in STEW 214 AB

Time Speaker Title
1:30-2:00 p.m. Rebecca W. Doerge The Future of Statistical Bioinformatics and Genomics in the Automated World of Agriculture
Abstract: The world population is expected to reach 9.6 billion by 2050, and crop yields are not keeping pace fast enough to avoid widespread food shortages. Modern molecular breeding programs are extremely effective, and typically result in annual yield increases of around 4%, but are not scalable because of the human labor and expertise they require. Employing robots in agriculture has great potential to address chronic issues that are challenging food systems worldwide. While quantitative trait locus (QTL) mapping was one of the first approaches to empower molecular breeding, the power of the analysis has been limited by small sample sizes (i.e., low numbers of phenotyped individuals) and by the complexity of the genetic architecture. Addressing these issues via automation has exciting potential, but it also presents new challenges for the statistical bioinformatics and genomics communities in how automated data are collected, stored, and analyzed to understand the relationships between genetic composition and phenotype. With automation, the number of individuals in the breeding population that can be evaluated is expected to increase dramatically, by a factor of 100 or more. Although automated phenotyping is a transformational technology that can be applied worldwide, how it integrates with existing statistical methodology and tools remains an open question. Following a short overview of the evolution of genetic and genomic data, analytic issues and approaches (e.g., QTL, microarray, e-QTL, and single-cell analyses), and the nonlinear and unorthodox path of my education and career (i.e., lessons learned), I will outline the challenges presented by automated phenotyping, with the hope of initiating an engaging discussion. This talk will be accessible to a broad scientific audience; an in-depth understanding of statistics, biology, or computing is not required.
2:00-2:30 p.m. Zhen Zhang Semiparametric Bayesian Analysis of Big Data with Censoring Observations
Abstract: A semiparametric Bayesian analysis was developed for regularized estimation of the regression parameters in a flexible accelerated failure time (AFT) model. The novelties of the proposed method lie in modeling the error distribution of the accelerated failure time nonparametrically, modeling the variance as a function of the mean, and adopting the Bayesian LASSO technique in modeling the mean. The proposed method identifies a set of important regression parameters and estimates survival probabilities with credible intervals. Moreover, a semiparametric Bayesian approach was developed to analyze the median regression model in which some regression coefficients vary with an index variable. The novel features of the proposal include 1) modeling the error structure in the median regression via an independent sampler of the Dirichlet process mixture that is suitable for large data sets, and 2) a flexible Bayesian P-spline model for the varying regression coefficients. Both approaches were applied to large cancer data sets with a high censoring rate from the Surveillance, Epidemiology, and End Results (SEER) Program. The proposed method helps in identifying the influential prognostic factors whose effect on the median survival time evolves with the age at diagnosis, and in predicting the 5-year survival probabilities.

2:30-3:00 p.m. Min Zhang Training and new statistical methods for big data in plant research
Abstract: With the development of modern high-throughput biotechnologies, massive amounts of omics data have been generated in the life sciences. It is therefore increasingly crucial to extract information from these data to better understand biological mechanisms and to apply the results in practice. To accelerate the translation process, we have developed a training course for biomedical researchers with minimal statistical and computational skills. The goals and outcomes of the course will be introduced. While the course focuses on basic statistical methods, I will also talk about the advanced methods that were recently developed in our group for various types of big data analysis.
3:00-3:30 p.m. Karl W. Broman R/qtl2: QTL analysis in multi-parent populations
Abstract: For nearly 20 years, I have been developing an R package, R/qtl, for mapping quantitative trait loci (QTL) in experimental crosses. The goal is to identify genomic regions that contribute to variation in a quantitative trait, such as blood pressure or serum insulin level. I am currently working on R/qtl2, a complete reimplementation to better handle high-dimensional data and complex cross designs (particularly crosses that include multiple founder strains, such as Collaborative Cross and Diversity Outbred mice, and MAGIC plant populations). I will summarize the challenges of maintaining and supporting R/qtl, describe my efforts on the new software, and discuss challenges and opportunities in genetic studies with high-dimensional phenotypes. https://kbroman.org/qtl2
Related paper: Broman KW (2014) Fourteen years of R/qtl: Just barely sustainable. J Open Res Softw 2(1):e11 (https://doi.org/10.5334/jors.at)
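
Illustrative sketch (not part of the talk and not R/qtl2 code): the goal stated in the abstract above, finding genomic regions that contribute to variation in a quantitative trait, can be illustrated with a single-marker scan. The Python sketch below assumes a simulated backcross with genotypes coded 0/1, five independent markers, and a made-up effect at marker index 2; the function name lod_score and all data are hypothetical. It regresses the phenotype on each marker and reports the LOD score, (n/2) * log10(RSS_null / RSS_alt).

```python
import math
import random


def lod_score(genotypes, phenotypes):
    """LOD score for one marker from a simple linear regression
    of phenotype on genotype (marker regression)."""
    n = len(phenotypes)
    mean_y = sum(phenotypes) / n
    mean_g = sum(genotypes) / n
    # Residual sum of squares under the null model (no QTL): mean only.
    rss_null = sum((y - mean_y) ** 2 for y in phenotypes)
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0 or rss_null == 0:
        return 0.0  # marker or phenotype does not vary: no evidence
    sxy = sum((g - mean_g) * (y - mean_y)
              for g, y in zip(genotypes, phenotypes))
    # Residual sum of squares after fitting the marker effect.
    rss_alt = max(rss_null - sxy ** 2 / sxx, 0.0)
    if rss_alt == 0:
        return math.inf  # perfect fit
    return (n / 2) * math.log10(rss_null / rss_alt)


# Simulated data (assumption): 200 individuals, 5 unlinked markers,
# and a true effect of 1.0 phenotype unit at marker index 2.
random.seed(1)
n_individuals, n_markers, qtl_index = 200, 5, 2
genotype_matrix = [[random.randint(0, 1) for _ in range(n_markers)]
                   for _ in range(n_individuals)]
phenotypes = [1.0 * row[qtl_index] + random.gauss(0, 1)
              for row in genotype_matrix]

# Scan: one LOD score per marker.
lods = [lod_score([row[m] for row in genotype_matrix], phenotypes)
        for m in range(n_markers)]

for m, lod in enumerate(lods):
    print(f"marker {m}: LOD = {lod:.2f}")
```

In this toy setting the largest LOD score should appear at the simulated QTL marker. Real analyses, including those in R/qtl and R/qtl2, also handle genotype uncertainty between markers, covariates, relatedness, and significance thresholds, none of which is shown here.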

Purdue Department of Statistics, 150 N. University St, West Lafayette, IN 47907

Phone: (765) 494-6030, Fax: (765) 494-0558
