Title: Strategy for the Multiple Whole Genome Alignment of five strains of E. coli using modified suffix arrays
Bob Mau, Aaron Darling, Nicole Perna, and Fred Blattner

Abstract

We present a time- and space-efficient algorithm that aligns large sub-intervals of DNA sequence from closely related organisms into segments we call backbone. Backbone represents sequence that can putatively be identified with the genome of the most recent common ancestor of the aligned organisms.

The method is ideally suited to aligning a group of genomes that share large regions of high sequence similarity punctuated by numerous instances of horizontally transferred sequence. Not surprisingly, this fits the profile of Enterobacteriaceae, the primary focus of the Blattner Lab's sequencing effort. Our methodology automatically identifies and places large relative inversions, translocations, and inverted translocations. A recent implementation has successfully aligned five strains of E. coli: K-12 MG1655, K-12 W3110, O157:H7 EDL933, O157:H7 Sakai (Enterohaemorrhagic PEC), and CFT073 (Uropathogenic EC). The prototype software finished in under 5 CPU minutes and used less than 7 MB of memory. The critical software and algorithmic constructs will be described in some detail. Key among them is a novel representation of the genomic match coordinates that partitions maximal matches into disjoint groups called Canonical Maximal Exact Match (MEM) Equivalence Classes.
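
To make the idea of exact matches found through a suffix array concrete, the following is a minimal sketch for two short sequences on the forward strand only. It is not the implementation described here: it uses a naive suffix array, it reports only matches that are unique in both sequences, and it does not handle inversions, more than two genomes, or MEM equivalence classes. The function names (suffix_array, common_prefix_len, maximal_unique_matches) and the min_len parameter are hypothetical and chosen for illustration.

```python
def suffix_array(s):
    # Naive construction by sorting suffixes; adequate only for toy inputs.
    return sorted(range(len(s)), key=lambda i: s[i:])


def common_prefix_len(s, i, j):
    # Length of the longest common prefix of the suffixes starting at i and j.
    n = 0
    while i + n < len(s) and j + n < len(s) and s[i + n] == s[j + n]:
        n += 1
    return n


def maximal_unique_matches(a, b, min_len=3):
    """Return sorted (pos_in_a, pos_in_b, length) for exact matches that occur
    once in each sequence and cannot be extended left or right.

    Assumes '#' and '$' do not occur in either input sequence.
    """
    s = a + "#" + b + "$"
    split = len(a)  # index of the '#' separator
    sa = suffix_array(s)
    lcps = [common_prefix_len(s, sa[k], sa[k + 1]) for k in range(len(sa) - 1)]

    matches = []
    for k, length in enumerate(lcps):
        i, j = sorted((sa[k], sa[k + 1]))
        if not (i < split < j):
            continue  # need one suffix from a and one from b
        if length < min_len:
            continue
        prev_lcp = lcps[k - 1] if k > 0 else 0
        next_lcp = lcps[k + 1] if k + 1 < len(lcps) else 0
        if length <= max(prev_lcp, next_lcp):
            continue  # the matched string occurs more than twice
        if i > 0 and s[i - 1] == s[j - 1]:
            continue  # extendable to the left, so not maximal
        matches.append((i, j - split - 1, length))
    return sorted(matches)


if __name__ == "__main__":
    print(maximal_unique_matches("GATTACAGG", "CCATTACATT"))
```

In this toy input the only qualifying match is ATTACA, at offset 1 in the first sequence and offset 2 in the second. Grouping such matches across several genomes and both strands is the part that the equivalence-class representation addresses, and the sketch does not attempt it.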

We sketch how to finish this "rough draft" into a full-fledged multiple alignment. Furthermore, we describe how some not-so-minor modifications will allow the basic method to be extended to more divergent genomes.