Statistical Bioinformatics, Fall 2009

The Bioinformatics Seminar meets on Tuesday afternoons from 4:30-5:20 PM in HORT 117.

Offered as STAT 598B 1-1, 1 credit
Coordinator: Dr. R. W. Doerge

Dr. R.W. Doerge's Seminar Page is here.

Schedule

Tuesday, August 25, 2009, 04:30 PM in HORT 117
Professor R. W. Doerge, Department of Statistics and Department of Agronomy, Purdue University
Organizational Meeting

This is an organizational meeting. All registered students must attend. If you need a registration form signed, please bring it to class.

Tuesday, September 1, 2009, 04:30 PM in HORT 117
Professor Rick Westerman, Bioinformatics Specialist, Genomics Facility, Purdue University
Second generation sequencing: An overview plus some case studies

The advent of second generation (2nd gen) sequencing during the last couple of years has been rapidly displacing the monopoly enjoyed by Sanger sequencing. The current 2nd gen systems generate much more data at a lower cost than previous sequencers. This massive amount of data is presenting new challenges in data analysis as well as spurring on innovative methods in genomic research. The Genomics Facility at Purdue has two of the three common 2nd gen "sequencing by synthesis" systems: the Roche FLX/454 system and the ABI SOLiD system (we do not have the Illumina/Solexa system). My talk will present an overview of 2nd gen sequencing technology, cover some of the innovative algorithms that are being produced worldwide, and then delve into some of the real-life problems and solutions that we have encountered while using our 2nd gen systems.

Associated Reading:

Next-generation DNA sequencing methods. Elaine R. Mardis. Annual Review of Genomics and Human Genetics. 2008. 9:387-402.

Click here for a full schedule of BIOINFORMATICS SEMINARS, past and present.

Tuesday, September 8, 2009, 04:30 PM in HORT 117
Paul Livermore Auer, Department of Statistics, Purdue University
Statistical issues in next-generation sequencing — An overview and case study

Next-generation or "second-generation" sequencing has emerged as an accurate new tool that has already lent itself to a large number of applications (e.g., variant discovery, profiling of histone modifications, identifying transcription factor binding sites, resequencing, and transcriptome characterization). Specifically, in RNA-Sequencing (RNA-Seq) experiments, the Illumina/Solexa Genome Analyzer (a next-generation sequencing technology) has been used with great success. Even though the technology is a success, there are still large domains of unsolved statistical issues that need to be addressed (e.g., understanding errors in the sequencing process and modeling gene expression in the downstream analysis). Additionally, these problems are compounded by both the size and complexity of RNA-Seq data. In this talk, I will provide an overview of the Solexa sequencing technology, introduce some of the statistical and computational issues involved, and detail a specific RNA-Seq data analysis that we have done for Scott Jackson's lab at Purdue University.

Recommended Reading:

Wang, Z., M. Gerstein and M. Snyder, 2009 RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics 10: 57-63.

Tuesday, September 15, 2009, 04:30 PM in HORT 117
Scott Jackson, Department of Agronomy, Purdue University
Dealing with duplicated genes in plant genomes

In contrast to animals, plants can have multiple copies of each gene in their genome with little or no deleterious effect. This is referred to as polyploidy. In fact, many of the plants that we eat (potato, banana, watermelon, wheat, etc.) are polyploid; that is, they have multiple copies of their genomes in the same nucleus. One major question in biology is how newly formed polyploids deal with having multiple copies of each gene. In humans, for instance, if you have more than one copy of any single chromosome (except for three chromosomes), it is lethal. However, plants have mechanisms to deal with multiple gene copies, and we are using the recently sequenced soybean genome to understand how the genome has compensated for having multiple copies of most of its genes. I will present some analyses that have been done using next generation sequencing technologies and how we are using these to understand the biology of this important organism.

Recommended Reading:

Lex E. Flagel and Jonathan F. Wendel. 2009. Gene duplication and evolutionary novelty in plants. New Phytologist.

Tuesday, September 22, 2009,
Bioinformatics Seminar

No Seminar

Tuesday, September 29, 2009, 04:30 PM in HORT 117
Matthew Palakal, Associate Dean for Research and Graduate Program in the School of Informatics and Professor of Computer Science in the School of Science at Indiana University Purdue University, Indianapolis
Bibliomics, Literature Mining, and Biomarker Discovery

Bibliomics has an important role in Systems Biology research along with the other "omics" such as genomics, transcriptomics, proteomics, metabolomics, etc. Biological literature databases continue to grow rapidly with vital information that is important for conducting sound biomedical research. As the data and information space continues to grow exponentially, the need for rapidly surveying the published literature, synthesizing it, and discovering the embedded "knowledge" is becoming critical to allow researchers to conduct "informed" work, avoid repetition, and generate new hypotheses. Knowledge, in this case, is defined as one-to-many and many-to-many relationships among biological entities such as gene, protein, drug, disease, etc. In this talk, we present a literature mining system called BioMAP. The BioMAP tool can carry out large-scale biomedical literature mining that could enhance the ability of biological researchers to formulate methods for the analysis of biological data, such as identifying biological pathways, and provide support for disease target and new biomarker discovery. Results from large-scale literature mining on documents related to colorectal cancer will be presented to illustrate that novel pathways and biomarkers can be found if exhaustive mining is used instead of relying on a limited set of manually curated literature documents.

Associated Reading:

M. Palakal, T. Sebastian and D. L. Stocum. Discovering implicit protein-protein interactions in the Cell Cycle using bioinformatics approaches, Journal of Biomedical Science, 15(3): 317-331, 2008.

Tuesday, October 6, 2009,
No Bioinformatics Seminar

Tuesday, October 13, 2009,
No Bioinformatics Seminar

October Break

Tuesday, October 20, 2009, 04:30 PM in HORT 117
Michelle Lacey, Department of Mathematics at Tulane University, New Orleans
Statistical Modeling of Methylation Patterns in Ovarian Carcinomas

Changes in cytosine methylation at CpG nucleotides are observed in many cancers, but the biological mechanisms responsible for these changes are not yet fully understood. Previously developed stochastic models for cancer-related methylation change have either treated CpG sites independently or employed a context dependent approach to adjust model parameters according to regional methylation levels. However, our analyses of double-stranded methylation patterns in 0.2 kb regions of the tandem repeats Sat2 and NBL2 have detected small clusters of identically methylated sites in close proximity that could not be explained by random variation. These findings suggest a high degree of site-to-site dependence, and we have developed a neighboring sites model for methylation change as an alternative approach. We have compared the independent sites, context dependent, and neighboring sites models in their ability to generate simulated sequences statistically similar to our Sat2 and NBL2 carcinoma samples, and we demonstrate that the neighboring sites model is preferred in the majority

Associated Reading:

Lacey, Michelle R. and Ehrlich, Melanie (2009) "Modeling Dependence in Methylation Patterns with Application to Ovarian Carcinomas," Statistical Applications in Genetics and Molecular Biology: Vol. 8 : Iss. 1, Article 40. DOI: 10.2202/1544-6115.1489 Available at: http://www.bepress.com/sagmb/vol8/iss1/art40.

Tuesday, October 27, 2009, 04:30 PM in HORT 117
George Casella, Department of Statistics, University of Florida, Gainesville, FL
From R. A. Fisher to Microarrays: Why 70 year old theory is relevant today

The theory and practice of design of experiments has its roots in agriculture (pun intended), with the major developments at Rothamsted by people such as Fisher and Yates. They developed the theory of blocking, components of variance (split plot designs) and incomplete block designs, among other things. Most all of this theory is still relevant today, and translates almost seamlessly to modern applications such as microarray experiments.

We review these designs and their applications today, pointing out how the 70-year old theory guides us to good microarray designs. The easy availability of computer packages, and their default analyses, can often result in incorrect test statistics and confidence intervals. We show how to recognize and avoid this, and look at a number of examples of both good and bad experiments. We also look at some of the designs that have arisen as a result of microarrays (reference and loops) and see what the 70-year old theory has to say.

There is nothing new in this talk, and probably nothing that you have not seen before. However, I hope to remind you of some things that you may have forgotten.

Associated Reading:

Kerr and Churchill. 2001. Statistical design and the analysis of gene expression microarray data. Genet. Res., Camb. 77, pp. 123-128.

Tuesday, November 3, 2009, 04:30 PM in HORT 117
John Marioni, Department of Human Genetics, University of Chicago
Using RNA-seq to understand variation in the human transcriptome

Understanding the genetic mechanisms that underlie natural variation in gene expression is a central goal of both medical and evolutionary genetics. Recently, advances in next generation sequencing technology have allowed transcript variation to be studied at unprecedented resolution. To take advantage of this new resource, we sequenced RNA from 69 lymphoblastoid cell lines (LCLs) derived from unrelated Nigerian individuals that have been extensively genotyped by the International HapMap project.

In this talk I will begin by providing the biological motivation for our study, before briefly outlining the sequencing technology and experimental design that we used. Subsequently, I will focus on a major technical hurdle that arises when mapping short sequencing reads to a reference genome, namely, the impact of SNP variation on the reliability of read mapping. At heterozygous SNPs, our results show that there is a significant bias towards higher mapping rates of the reference allele and, perhaps surprisingly, masking known SNP locations in the reference sequence does not lead to more reliable results overall. Overcoming this problem by filtering out inherently biased SNPs removes 40% of the top signals of allele specific expression (ASE). Further, we find that the remaining SNPs showing ASE are enriched in genes known to harbour cis regulatory variation or known to show uniparental imprinting. To conclude, I will describe the results of our analysis of the entire dataset. By pooling all individuals, we identify extensive use of unannotated polyadenylation sites and over 100 novel protein coding exons. Further, using genotype information, we find many genetic variants that influence overall levels of expression and splicing. Overall, our results show the power of high throughput sequencing for the joint analysis of variation in transcription, splicing, and allele specific expression across individuals.

Associated Reading:

Degner J.F., Marioni J.C., Pai A.A., Pickrell J.K., Nkadori E., Gilad Y., Pritchard J.K. Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics 2009 (advance access online).

Tuesday, November 10, 2009, 04:30 PM in HORT 117
Alex Lipka, Department of Statistics, Purdue University
CANCELLED
Associating Single Nucleotide Polymorphisms (SNPs) with Binary Traits

Association mapping uses statistical analyses to test for relationships between genomic markers that are called single nucleotide polymorphisms (SNPs) and traits. A statistically significant association between a SNP and a trait suggests that there exists a biological association between a nearby genomic region and the trait. This research focuses on the use of logistic regression to assess the additive, dominance, and epistatic effects when investigating associations between SNPs and binary traits, such as disease status. A very specific phenomenon, called quasi-separation of points (QSP), can arise in association mapping data, resulting in infinite maximum likelihood estimates (MLEs) of logistic regression parameters. One solution to this problem is to use Firth's MLE, which provides finite estimates in the presence of QSP. Two simulation studies are conducted to investigate the use of Firth's MLE in a QSP setting, as well as to assess the similarity between Firth's MLE and the traditional MLE when QSP is not present. Two published association mapping studies in humans are reanalyzed to demonstrate the implementation of Firth's MLE in real data settings.

Balding DJ (2006) "A Tutorial on Statistical Methods for Population Association Studies," Nature Reviews Genetics 2006: Vol. 7 Iss. 10:781-791.

Tuesday, November 17, 2009, 04:30 PM in HORT 117
John M. C. Danku, Department of Horticulture & Landscape Architecture (Salt Lab), Purdue University
High-throughput analytical technology for the ionomics of model organisms

Technology, Analysis, and Goals in the Ionomics of Arabidopsis

(Part 1 of 3 lecture series)


Ionomics, the study of the ionome, involves the quantitative and simultaneous measurement of the elemental composition of living organisms and changes in this composition in response to physiological stimuli, developmental state, and genetic modifications. Ionomics requires the application of high-throughput elemental analysis technologies and their integration with both bioinformatic and genetic tools. Most ionomic analyses are generally comparative and performed over a timescale of hours to years. Analytical standardization across time and distance is crucial right from the growth stage through chemical analysis. In this presentation we highlight high-throughput elemental profiling methodologies for the analyses of yeast, Arabidopsis and rice.

Associated Reading:

David E. Salt, Ivan Baxter and Brett Lahner. Ionomics and the Study of the Plant Ionome. Annu. Rev. Plant Biol. 2008, 59, 709-733.



Tuesday, November 24, 2009,
No Bioinformatics Seminar

Tuesday, December 1, 2009, 04:30 PM in HORT 117
Tilman Achberger, Department of Statistics, Purdue University
Ionomic QTL analysis in Arabidopsis thaliana

Technology, Analysis, and Goals in the Ionomics of Arabidopsis

(Part 2 of 3 lecture series)



Much effort has been made in recent years by biologists, computer scientists, statisticians and others in the study of the plant model organism Arabidopsis thaliana. One approach to studying complex processes in Arabidopsis thaliana is to examine its uptake of mineral nutrients, such as calcium, sodium, potassium and sulfur. The study of an organism's mineral nutrient levels is the study of its ionome. Finding genetic determinants of complex traits, such as ionomic traits, can be done by performing QTL mapping, a common statistical method used to locate chromosomal regions associated with the trait of interest. In this talk I will provide a general overview of QTL mapping, with particular emphasis on its application to ionomic data on a population of 411 Recombinant Inbred Lines (RIL) from a cross between the accessions Bay-0 and Shahdara, collected by Dr. David E. Salt's lab at Purdue University.

Recommended Reading:

Artak Ghandilyan, et al. A strong effect of growth medium and organ type on the identification of QTLs for phytate and mineral concentrations in three Arabidopsis thaliana RIL populations. Journal of Experimental Botany. 2009, Vol. 60, No. 5, p. 1409-1425.

Tuesday, December 8, 2009, 04:30 PM in HORT 117
David Salt, Department of Horticulture & Landscape Architecture, Purdue University
TBA

Technology, Analysis, and Goals in the Ionomics of Arabidopsis

(Part 3 of 3 lecture series)