[Home]

[Curriculum Vita]

[Research Interests]

[Teaching]

[Personal]






I analyze algorithms and data structures using analytic combinatorics and probabilistic methods. I am also greatly interested in information theory (see "Dissertation" below) and game theory (see my "Teaching" page).

My research is currently supported by NSF grant 0603821: Asymptotic Enumeration, Reinforcement, and Effective Limit Theory (2006--2009). Robin Pemantle is the PI; I am the co-PI. Our goal is the asymptotic estimation of constants for which a multivariate generating function is known. Our research in this direction can be roughly divided into two parts. We use computational symbolic algebra to implement complex analytic methods of analysis. We also continue to apply further techniques of integration of meromorphic forms so that the above analytic methods can be applied to a wider class of generating functions.

Overview and Objectives from My NSF CAREER Application (Submitted July 2008)

My long-term research objective is to develop a methodology for the comprehensive probabilistic analysis of algorithms and data structures for massive data sequences. Trees (compacted, suffix, Patricia, B-trees, digital search trees, etc.), suffix arrays, directed acyclic graphs, and automata, are just a handful of the data structures commonly used to support algorithms for efficient pattern matching in sequences. A key goal in many applications is to elucidate the underlying structure from a plain text sequence. The Google search engine, BLAST tools for the Genbank DNA database, Lempel-Ziv data compression schemes, and online Library text archives are all well-recognized examples that depend on such pattern matching data structures.

Although the overall popularity and utility of sequence processing algorithms and data structures are increasing, significant gaps remain in the current analysis:

1.) A fundamental limitation of pattern matching analysis is that it is routinely performed in a deterministic (enumerative) or worst-case setting. Many stochastic properties of algorithms and data structures for text processing have not yet been analyzed in the average case. Moreover, extensions of current methods to include the variance, higher moments, and limiting distribution are very formidable.

2.) The probabilistic analysis is often performed using assumptions about the underlying distribution that are too simple for real-world applications. For instance, consecutive characters of a data sequence are often assumed to be independent or have Markov dependence of order 1. In contrast, data sequences arising in practice are often best modeled by a high order Markov chain or a hidden Markov model.

3.) Analysis of sequences depends on the ability to efficiently compare two or more overlapping subsequences of a common subsequence. An efficient method for multiple sequence comparisons would be considered by many to be the Holy Grail of sequence analysis. Unfortunately, overlapping subsequences result in dependencies that must be carefully handled in stochastic analysis. Effective probabilistic methods for overlaps have only recently been introduced; this is an active area of growth in applied probability theory.

Suffix trees have long been recognized as fundamental for processing sequences of data. The proposed research will begin with a thorough determination of the probabilistic characteristics of suffix trees using probabilistic, combinatorial, and analytic techniques, as described in the narrative. Later in the proposed research, I will analyze other related data structures, including other types of trees (compacted retrieval trees, Patricia trees, B-trees, etc.), suffix arrays, directed acyclic graphs, automata, and also text-processing algorithms that depend on these data structures. I will treat sequences with underlying stochastic structure sufficiently complex for practical applications, and will also corroborate the theoretical results with benchmarks of real-world data sequences. I will expand on my current research collaborations, and begin new collaborations, with several international colleagues. The goals of the proposed research include the following:

Research Goal #1. Establish a comprehensive probabilistic analysis of the suffix tree profile parameter (i.e., number of nodes per level of the tree) that encompasses the variance, all higher moments, and the limiting distribution, including fluctuations. The profile has been difficult to analyze in other types of trees, so this will be a challenging task, but results about the profile parameter yield many implications about the following parameters: depth (height), the multiplicity matching parameter (duplication of longest matches), the number of k-cousins (a generalization of siblings), the fill-up level (number of complete levels), the size (number of nodes), etc. If some aspects of the profile prove too difficult to analyze, then the relevant aspects of each of these other parameters can instead be analyzed directly.

Research Goal #2. Discover methods to make suffix tree analysis possible when the characters of the underlying sequence follow a Markov model of order higher than 1 or follow a hidden Markov model. Most of the current methods are applicable only in the case where the underlying sequence has characters that are independently generated or follow a Markov model of order 1. In contrast, massive data sequences that arise in many practical applications follow a Markov model of high order or a hidden Markov model.

Research Goal #3. Apply the new methodology from above to other related data structures, including other tree structures (retrieval trees, Patricia trees, B-trees, etc.), automata, suffix arrays, and directed acyclic graphs.

Research Goal #4. Apply the methods for the data structures in the goals above to the relevant algorithms that utilize such structures, including dynamic text indexing algorithms, algorithms for multiple sequence comparisons, biological motif filtering algorithms, etc.

Research Goal #5. Incorporate approximate pattern matching analysis in sequences, using Hamming and Levenshtein distance metrics. The acceptable degrees of approximation will be empirical, and consistent with applications in which approximate pattern matches are allowable, e.g., misspellings or near-matches in Google, mutations in GenBank, errors in Lempel-Ziv data compression.

Research Goal #6. Corroborate all theoretical analysis with data analysis using massive data sequences from practical applications. For instance, test the applicability of results using files from the Canterbury/Calgary corpus for data compression and from the GenBank genetic sequence database.

The anticipated results are expected to be among the first for data structures, such as suffix trees and suffix arrays, that are able to accurately treat not only the average behavior, but also the variance, higher moments, limiting distributions, and fluctuations. In particular, this project features a creative and original use of mathematical tools in order to identify probabilistic characteristics concerning suffix trees (currently one of the most sophisticated and practical types of randomly-generated tree structures). The proposed research is both novel and practically beneficial in several regards. For instance, the probabilistic methodology developed for analyzing suffix trees should be applicable to related classes of randomly-generated trees and stochastic branching structures built from overlapping subsequences of a common, longer sequence. Also, scientists who deal with massive data will benefit, as the stochastic runtime and storage characteristics of algorithms and data structures for processing sequences of data are precisely analyzed. New algorithms and data structures for sequence data are constantly being introduced; the analogous and fundamental stochastic theory should continue to be developed too.