Session 15 (Part 2) - Department of Statistics - Purdue University

Interactions Between Omics and Statistics: Analyzing High Dimensional Data

Speaker(s)

  • C. Robin Buell (Michigan State University)
  • Shizhong Xu (University of California, Riverside)
  • Lauren McIntyre (University of Florida)
  • Nathan Springer (University of Minnesota)
  • Jianming Yu (Kansas State University)
  • Lu Lu (University of Tennessee Health Science Center)

Description

With recent developments in high-throughput technologies, more and more high-dimensional data are being generated in animal, plant, and human studies. These data present new challenges to researchers in statistics and call for novel, computationally efficient approaches. The session is organized as an interactive forum for leading researchers in statistics and other disciplines, with the aim of gaining insight into the issues associated with high-dimensional data analysis in a variety of real-world applications.

Schedule

Sun, June 24 - Location: STEW 310

Time | Speaker | Title
8:30 - 9:00AM | C. Robin Buell | Using RNA-Seq to reveal expression patterns and diversity in maize
[PDF Slides] Abstract: Maize is rich in genetic and phenotypic diversity. Understanding the sequence, structural, and expression variation that contributes to phenotypic diversity would facilitate more efficient varietal improvement. RNA-based sequencing (RNA-seq) is a powerful approach for transcriptional analysis, assessing sequence variation, and identifying novel transcript sequences, particularly in large, complex, repetitive genomes such as maize. We first surveyed expression profiles in a panel of reproductive tissues to benchmark RNA-seq for expression analyses in maize. Second, we sequenced RNA from whole seedlings of 21 maize inbred lines representing diverse North American and exotic germplasm to identify single nucleotide polymorphisms (SNPs) and presence/absence variants that could be associated with phenotype. Ample SNPs (351,710) were identified, which revealed tight clustering of the two distinct heterotic groups and exotic lines. Transcript abundance analysis revealed minimal variation in the total number of genes expressed across these 21 lines; however, the composition of the transcribed gene set varied among the 21 lines. Presence/absence variants were identified in the transcriptomes through de novo assembly of unmapped RNA-seq reads; 1,321 high-confidence novel transcripts were identified. Of these, 564 loci were present in all 21 lines, including B73, suggesting these are absent from the reference genome assembly. Intriguingly, 145 of the novel de novo assembled loci were present in lines from only one of the two heterotic groups, consistent with the hypothesis that, in addition to sequence polymorphisms and transcript abundance, transcript presence/absence variation is present and a potential mechanism contributing to the genetic basis of heterosis.
9:00 - 9:30AM | Shizhong Xu | Marker Based Infinitesimal Model for Quantitative Trait Analysis
[PDF Slides] Abstract: We developed a marker based infinitesimal model for quantitative trait analysis. In contrast to the classical infinitesimal model, we now have new information about the segregation of every individual locus of the entire genome. Under this new model, we propose that the genetic effect of an individual locus is a function of the genome location (a continuous quantity). The overall genetic value of an individual is the weighted integral of the genetic effect function along the genome. Numerical integration is performed to find the integral, which requires partitioning the entire genome into a finite number of bins. Each bin may contain many markers. The integral is approximated by the weighted sum of all the bin effects. We thus turn the problem of marker analysis into bin analysis, so that the model dimension decreases from virtually infinite to a finite number of bins. This new approach can efficiently handle a virtually unlimited number of markers without marker selection. The marker based infinitesimal model requires high linkage disequilibrium among all markers within a bin. For populations with low or no linkage disequilibrium, we develop an adaptive infinitesimal model. Both the original and the adaptive models are tested using simulated data and beef cattle data. Results of the beef cattle data analysis indicate that the new method can increase the predictability from 10% (marker analysis) to 33% (bin analysis). The marker based infinitesimal model paves the way towards genetic mapping using whole genome sequence data.
9:30 - 10:00AM | Lauren McIntyre | Genotype-phenotype mapping in a post GWAS world
[PDF Slides] Abstract: Understanding how environmental conditions interact with metabolic reactions, cell signaling, and developmental pathways to translate an organism's genome into its phenotype is a grand challenge. Genome wide association studies (GWAS) connect genotypes to phenotypes but do not necessarily reflect known molecular interactions. Molecular biology approaches tie gene functions together in networks (GRN) but do not necessarily reflect genetic variation. Using natural variation in allele-specific expression, GWAS and GRN approaches can be combined. Leveraging existing data from protein interactions can further elucidate the impacts of regulatory variation. By using a population genetics framework with whole genome data, molecular pathways that underlie phenotypic variation can be elucidated. Method development as well as real data analysis will be discussed.
10:00 - 10:30AM | Break
10:30 - 11:00AM | Nathan Springer | Application of RNAseq to understand transcriptional regulation
[PDF Slides] Abstract: RNAseq has been touted as a powerful tool for enabling more precise understanding of transcriptomes. While there are several key advantages of RNAseq relative to microarray analyses, there are a number of important issues that are unique to RNAseq analyses. My group uses RNAseq data to study several questions. One set of experiments has directly compared RNAseq and microarray data for the generation of co-expression networks. While the overall networks are quite similar, there are some key differences that highlight important filtering steps. Another set of experiments has utilized RNAseq from heterozygous individuals to study allele-specific expression patterns. In particular, we have used RNAseq to identify imprinted genes. RNAseq has provided a powerful tool to address several questions but, as with microarrays, the experimental design is critical for our ability to provide biologically meaningful findings.
11:00 - 11:30AM | Jianming Yu | Opportunities and Challenges of Statistical Genetics in Genome-wide Association Studies
[PDF Slides] Abstract: Advances in genomic technologies have made it possible to conduct genome-wide association studies (GWAS) and genome-wide selection (GS). While GWAS allows us to address some of the fundamental questions in the genetic architecture of complex traits, GS facilitates the breeding of superior genotypes by maximizing gain per unit of time. However, many emerging challenges need to be addressed by a combination of genomic technology, functional analysis, genetic design, statistical and computational analysis, genome annotation, and gene networks. With our recently proposed composite resequencing-based GWAS (CR-GWAS) strategy and an Arabidopsis data set, we showed how function prediction, genome databases, and network information can be integrated into the process of identifying robust associations. In a second maize GWAS study, we found that both genic and non-genic polymorphisms contribute to the phenotypic variation of quantitative traits, and trait-associated SNPs are enriched in the non-genic and promoter regions. Our findings suggest that evolutionary alterations in protein sequence may be quantitatively less important than changes in gene regulation in shaping the wide natural variation observed in maize. Genotyping or sequencing technologies that capture polymorphisms in both genic and promoter regions are likely to significantly increase the power and efficiency of routine GWAS scans in species with complex genomes.
11:30AM - 12:00PM | Lu Lu | Discovering new Alzheimer disease related genes and gene networks through systems biology methods
[PDF Slides] Abstract: Alzheimer's disease (AD) is the most common neurodegenerative disorder and the fourth leading cause of death in adults. It is now understood that genetic factors play a crucial role in the risk of developing AD; however, the molecular mechanisms of AD are still not fully understood. In this study, we use a systems biology approach to collect candidate genes related to AD, identify their upstream regulatory and downstream target genes, and construct an AD related gene network. We first used literature-based methods to identify 366 AD related genes. We then combined global hippocampal mRNA expression profiles generated on the Affymetrix Mouse Expression 430 2.0 arrays with linkage analysis to map expression quantitative trait loci (eQTL) using 67 BXD recombinant inbred (BXD RI) strains to define the genetic regulatory relationship of the selected candidate genes. Among the AD candidate genes, we identified 64 whose expression is controlled by local sequence variants (cis-eQTLs) and 66 whose expression is controlled by other regions of the genome (trans-eQTLs). Allele-specific expression (ASE) analysis and RNA-seq analysis were used to validate cis-eQTLs. A genetic and molecular network based on covariance of expression patterns captures many of the key relationships among these AD candidate genes. One strong candidate with a validated cis-eQTL is the anti-oxidant gene glutathione S-transferase omega 1 (Gsto1). This gene has previously been nominated as associated with Parkinson's disease, Huntington's disease, and arsenic toxicity. Its association with AD in some human samples has not reached statistical significance (Ozturk et al., 2005). Our work suggests that the GSTO1 gene should be reexamined closely as a possible AD candidate in larger human cohorts. This study provides baseline data to infer functional genetic covariance networks associated with AD. The entire networks or select members may be of considerable value in understanding molecular vulnerability to AD.
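
Illustrative sketch for the bin analysis described in Shizhong Xu's abstract: the Python code below collapses marker genotypes into bin-level covariates and estimates one effect per bin, so the model dimension equals the number of bins rather than the number of markers. It is a sketch under stated assumptions, not the speaker's implementation. Equal-width bins, the mean marker code as the bin covariate, and ridge regression as the estimator are choices made here for illustration, and the function names `bin_genotypes` and `ridge_fit` are hypothetical. With equal-width bins, the integration weights are constant and are absorbed into the estimated bin effects.

```python
import numpy as np


def bin_genotypes(genotypes, positions, n_bins):
    """Collapse an (individuals x markers) matrix into (individuals x bins).

    Assumption: bins have equal width and each bin covariate is the mean
    genotype code of the markers that fall inside the bin.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    positions = np.asarray(positions, dtype=float)
    # Equal-width bin edges spanning the observed marker positions.
    edges = np.linspace(positions.min(), positions.max(), n_bins + 1)
    # Interior edges map every marker to a bin index in 0 .. n_bins - 1.
    bin_index = np.digitize(positions, edges[1:-1])
    binned = np.zeros((genotypes.shape[0], n_bins))
    for k in range(n_bins):
        in_bin = bin_index == k
        if in_bin.any():  # bins without markers keep a zero covariate
            binned[:, k] = genotypes[:, in_bin].mean(axis=1)
    return binned, edges


def ridge_fit(x, y, penalty=1.0):
    """Closed-form ridge regression with an unpenalized intercept.

    Assumption: ridge regression stands in for whatever estimator is
    actually used for the bin effects.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean = x.mean(axis=0)
    y_mean = y.mean()
    centered = x - x_mean
    gram = centered.T @ centered + penalty * np.eye(x.shape[1])
    effects = np.linalg.solve(gram, centered.T @ (y - y_mean))
    intercept = y_mean - x_mean @ effects
    return intercept, effects


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Simulated toy data: 50 individuals, 1,000 markers coded 0/1/2.
    marker_codes = rng.integers(0, 3, size=(50, 1000))
    marker_positions = np.sort(rng.uniform(0.0, 100.0, size=1000))
    phenotype = rng.normal(size=50)
    bin_covariates, _ = bin_genotypes(marker_codes, marker_positions, 20)
    intercept, bin_effects = ridge_fit(bin_covariates, phenotype, penalty=5.0)
    print(bin_covariates.shape, bin_effects.shape)
```

Averaging marker codes within a bin is only reasonable when the markers in that bin are in high linkage disequilibrium, which is the requirement stated in the abstract; the adaptive variant for low linkage disequilibrium is not covered by this sketch.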

Purdue Department of Statistics, 150 N. University St, West Lafayette, IN 47907

Phone: (765) 494-6030, Fax: (765) 494-0558
