STAT 598C

Statistical Methods For Bioinformatics and Computational Biology

Fall 2008

Instructor: Olga Vitek

Office: HAAS 120

Phone: (765) 496-9544

Email: ovitek@stat.purdue.edu

Course information

The course discusses statistical methods and algorithms for the analysis of high-throughput experiments in molecular biology, using the analysis of gene expression microarrays as a leading example. The target audience is graduate students in statistics, as well as graduate students in the life sciences who have previously taken a statistics class. The objectives of the course are:


  1. Introduce relevant biological concepts, and describe the existing high-throughput technologies and biological questions that these technologies can help answer.


  2. Discuss statistical methods that have become standard practice for analyzing gene expression data, their practical use, and open research problems in this field.


  3. Discuss data structures and the implementation of these statistical methods in the R-based open source project Bioconductor. Although prior exposure to R is desirable, the course is self-contained. Life sciences students who have previous exposure to statistical methods but have never used R will be able to learn all the necessary concepts during the course.


The course is project-driven and provides hands-on experience with data analysis, critical review of the literature, and communication of results. By the end of the course, students will be able to perform independent analysis of biological data in an interdisciplinary environment such as a pharmaceutical company or a computational biology research lab.



Tentative topics:

Module 1: Introduction to statistical methods in molecular biology

- concepts in molecular biology and scientific questions

- high-throughput technologies

- tools and data structures in R and Bioconductor


Module 2: Statistical analysis of gene expression microarrays

- signal processing

- finding differentially expressed genes

- multiple comparisons

- planning a new experiment


Module 3: Machine learning concepts and tools

- exploratory multivariate analysis

- bi-clustering

- supervised classification


Module 4: Biological annotations and databases

- Gene Ontology: structures and visualization

- gene set enrichment analysis

- protein interaction networks




Workload and expectations:

Homework: approximately seven sets of homework problems will be given during the semester.


Projects: The course will include a two-part project. In the first part, students will read, summarize, and critically assess several papers on one of the topics presented in the course. In the second part, students will perform a data analysis or a computational experiment, give a 10-15 minute presentation of the problem and the analysis in class, and summarize the results in a final report.



Pre-requisites:

At least one of STAT 514, STAT 524, or STAT 525, and experience with R. Knowledge of basic biological concepts is desirable but not required.



References:

Bioinformatics and Computational Biology Solutions Using R and Bioconductor

Editors: Robert Gentleman, Vince Carey, Wolfgang Huber, Rafael Irizarry, Sandrine Dudoit

Springer 2005