Jun Xie

Written by: Allison Cummins, M.S. candidate in Statistics

In recent years, technological advances in biological fields such as genomics and proteomics have generated high-dimensional data sets. Statisticians face the challenge of developing models and efficient computational algorithms that can handle these data sets and help provide insight into biological mechanisms. Because of the high dimensionality and other complexities of the data, one can no longer rely on traditional statistical methods alone to analyze it adequately.

Professor Jun Xie began her work in mathematical statistics as a graduate student at the University of California at Los Angeles (UCLA), where she studied bioinformatics, particularly protein sequence analysis. After coming to Purdue in 2001, Xie continued to work on DNA and protein sequence analysis in collaboration with other Purdue faculty members and two of her Ph.D. students, Nak-Kyeong Kim and Lingmin Zeng. She and her students used Markov chains to build probability models that identify important patterns in long sequences. Stochastic versions of the EM algorithm and Markov chain Monte Carlo (MCMC) were used to estimate parameters and to infer missing data such as the locations of sequence motifs.

More recently, Xie has been working on the analysis of high-dimensional data, including:

  1. Gene expression microarray data, which has been used to identify gene expression profiles associated with diseases, e.g., cancer.

  2. Single nucleotide polymorphism (SNP) data, which records variations at single sites in a DNA sequence and is used to identify genetic variants associated with phenotypes.

Gene expression microarray data has ~10,000 variables, while an SNP array has ~500,000. With this many variables, a large number of signals will appear significant purely by chance, so false positives are frequent.
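To make the scale of the problem concrete, the Python sketch below counts how many tests would pass a significance cutoff by chance alone when no variable has any real effect. The 0.05 cutoff, the function names, and the Bonferroni adjustment at the end are illustrative assumptions, not details taken from Xie's work.

```python
import random


def expected_false_positives(num_tests: int, alpha: float) -> float:
    """Expected number of chance 'hits' when every null hypothesis is true."""
    return num_tests * alpha


def simulate_null_hits(num_tests: int, alpha: float, seed: int = 0) -> int:
    """Draw one p-value per test under the null (uniform on [0, 1))
    and count how many fall below the cutoff alpha."""
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(num_tests))


def bonferroni_threshold(num_tests: int, alpha: float) -> float:
    """Per-test cutoff that keeps the chance of any false positive
    at or below alpha (a standard, conservative correction)."""
    return alpha / num_tests


# Variable counts quoted in the text; alpha = 0.05 is an assumed cutoff.
for label, num_tests in [("microarray", 10_000), ("SNP array", 500_000)]:
    print(label,
          expected_false_positives(num_tests, 0.05),
          simulate_null_hits(num_tests, 0.05),
          bonferroni_threshold(num_tests, 0.05))
```

At a 0.05 cutoff, roughly 500 of 10,000 microarray variables and roughly 25,000 of 500,000 SNP variables would be flagged even if none were truly associated with the outcome, which is why analyses at this scale need some form of multiple-testing control or variable selection.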

Together with Purdue Ph.D. graduate Lingmin Zeng, Xie helped develop a new variable selection method to handle correlation and collinearity among variables. The method identifies groups of variables that are correlated with one another and uses one variable to represent each group. A penalized regression was designed to favor group selection. Their analysis of group variable selection in genomic data with dependent structures uses both dimension reduction and variable selection, each taking dependency into account, with the aim of moving beyond current statistical analysis and finding new approaches that meet the higher demands of such data.
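The grouping idea can be illustrated with a small Python sketch. This is not the method developed by Xie and Zeng: it is a simple greedy grouping by pairwise Pearson correlation, the penalized regression step is omitted, and the 0.9 threshold, the function names, and the toy data are all assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def group_correlated(columns, threshold=0.9):
    """Greedily group variables: each variable joins the first group whose
    representative (the group's first member) it is strongly correlated
    with; otherwise it starts a new group of its own."""
    groups = []
    for name, values in columns.items():
        for group in groups:
            representative = group[0]
            if abs(pearson(columns[representative], values)) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical toy data: g2 is nearly a multiple of g1, g3 is unrelated.
columns = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [2.1, 3.9, 6.2, 8.0, 9.9],
    "g3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
groups = group_correlated(columns)
representatives = [group[0] for group in groups]
print(groups, representatives)
```

Here g1 and g2 fall into one group represented by g1, and g3 stands alone, so three variables are reduced to two representatives. A real analysis would choose the grouping rule and the representative more carefully and would fit the regression on the reduced set.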

Xie plans to apply mathematical statistics to more complicated data sets and to improve existing methods that work in theory but need various adjustments when applied to real data. She also plans work on classification and prediction in data mining, in search of the best method for a given data set. Xie favors data-driven research that begins from the data at hand rather than from an attempt to make a particular data set fit a statistical model. Frequently, the data from an experiment does not look the way the statistical model predicted, and therefore will not always fit as planned.

Xie received her B.S. and M.S. in Probability and Statistics from Peking University and her Ph.D. in Statistics from the University of California at Los Angeles. She currently teaches Mathematical Statistics (STAT 528). She also serves as the Graduate Chair of the Department of Statistics. For more information, please visit her homepage.

May 2009