Course Description: Stat 598T for Spring 2002


Statistical Protein Motif Alignment Methods and Stochastic Models in Genetics and Evolution

The first half of this course explores statistical aspects of protein motif alignment; the second half explores some of the ways in which genetic and evolutionary processes are modelled mathematically.

The specific biological areas to be considered include genetic processes at both the evolutionary (between species) and population genetic (within species) levels. Protein and DNA sequence alignment is central to identifying homologous regions that share an evolutionary and functional relationship. At the evolutionary level, the field of phylogenetic analysis attempts to reconstruct the ancestral relationships between present-day species, based on aligned DNA or protein sequence data. At the population genetic level, the ancestral relationships between individuals are modelled with a coalescent process, which allows the sampling distribution of individual DNA sequence variation to be studied and inferences to be made about populations.

In this course, students will read selected journal articles in these areas. The focus will be primarily on the modelling process and resulting statistical inference methods, rather than on a detailed analysis of the underlying models.

The first half of the course will cover alignment methods for protein sequence motifs and the construction of phylogenetic relationships. Protein motif alignment is the main approach to predicting protein function when no high overall homology can be detected. We will study papers on methods for finding motifs. Most of these algorithms have been implemented in software that we can access through the web. Multiple protein alignment is often guided by a phylogenetic tree. We will discuss papers on probabilistic approaches to phylogeny and learn how trees can be inferred from sets of sequences, either by maximum likelihood or by other methods. Phylogenetic analysis involves tree construction algorithms, which connect it to the second half of the course.

The second half of the course will cover coalescent methods in population genetics. There, the end goal is not to identify a specific ancestral tree but rather to identify the probability distribution of the ancestral tree, which is treated as an unknown random variable. The distribution of the tree must be considered in order to understand the sampling distribution of DNA sequences from a single population. Starting with the simplest models and working towards the more complex, we will study the coalescent process in detail. Emphasis will be placed on the biological assumptions underlying the model, and on its relationship to the goal of statistical inference through computational and simulation methods.

While no specific mathematical background is assumed, an understanding of the fundamentals of statistical inference (e.g. Stat 517/528) would be helpful. An interest in genetic analysis is assumed, but much of the specific biological background will be covered as necessary. The course is intended primarily for Statistics graduate students, but statistically literate students from biological fields are also welcome.