![[Home]](images/sideHome.gif)
![[Curriculum Vita]](images/sideVita.gif)
![[Research Interests]](images/sideResearch.gif)
![[Teaching]](images/sideTeaching.gif)
![[Personal]](images/sidePersonal.gif)
|
 |


I analyze algorithms and data
structures using analytic combinatorics and probabilistic methods.
I am also greatly interested in information theory (see "Dissertation" below)
and game theory (see my "Teaching" page).
My research is currently supported by
NSF grant 0603821: Asymptotic Enumeration, Reinforcement, and Effective
Limit Theory (2006--2009). Robin Pemantle is the PI; I am the co-PI.
Our goal is the asymptotic estimation
of constants for which a multivariate generating function is known.
Our research in this direction can be roughly divided into two parts.
We use computational symbolic algebra to
implement complex analytic methods of analysis.
We also continue to apply further
techniques of integration of meromorphic forms so that the above
analytic methods can be applied to a wider class of generating
functions.
Overview and Objectives from My NSF CAREER Application (Submitted July 2008)
My long-term research objective is to develop a methodology for the
comprehensive probabilistic analysis of algorithms
and data structures for massive data sequences.
Trees (compacted, suffix, Patricia, B-trees, digital search trees, etc.),
suffix arrays, directed acyclic graphs, and
automata, are just a handful of the data structures
commonly used to support algorithms for efficient pattern matching
in
sequences.
A key goal in many applications is to elucidate
the underlying structure from a plain text sequence.
The Google search engine, BLAST tools for the Genbank DNA database,
Lempel-Ziv data compression schemes, and online
Library text archives are all well-recognized
examples that depend on such pattern matching data structures.
Although the
overall popularity and utility of sequence processing algorithms and
data structures are increasing, significant gaps remain in the current
analysis:
1.) A fundamental limitation of pattern matching analysis is
that it is routinely performed in a deterministic (enumerative) or
worst-case setting. Many stochastic properties of algorithms and
data structures for text processing have not yet been analyzed in
the average case. Moreover, extensions of current methods to include the
variance, higher moments, and limiting distribution are very
formidable.
2.) The probabilistic analysis is often performed using
assumptions about the underlying distribution that are too simple for
real-world applications. For instance, consecutive characters of a
data sequence are often assumed to be independent or have Markov
dependence of order 1.
In contrast, data
sequences arising in practice are often best modeled by a high order
Markov chain or a hidden Markov model.
3.) Analysis of sequences depends on the ability to efficiently
compare two or more overlapping subsequences of a common subsequence.
An efficient method for multiple sequence comparisons would be considered
by many to be the Holy Grail of sequence
analysis. Unfortunately, overlapping subsequences result in
dependencies that must be carefully handled in stochastic analysis. Effective
probabilistic methods for overlaps have only recently been
introduced; this is an active area of growth in applied probability
theory.
Suffix trees have long been recognized as fundamental for processing
sequences of data. The proposed research will begin with a thorough
determination of the probabilistic characteristics of suffix trees
using probabilistic, combinatorial, and analytic techniques, as
described in the narrative. Later in
the proposed research, I will analyze other related data
structures, including other types of trees (compacted retrieval trees, Patricia
trees, B-trees, etc.), suffix arrays, directed acyclic graphs,
automata, and also text-processing algorithms that depend
on these data structures.
I will treat sequences with underlying
stochastic structure sufficiently complex for practical applications,
and will also corroborate the theoretical results with benchmarks
of real-world data sequences. I will expand on my current research
collaborations, and begin new collaborations, with several
international colleagues.
The goals of the proposed research include the following:
Research Goal #1. Establish a comprehensive probabilistic analysis
of the suffix tree profile parameter (i.e., number of nodes per level
of the tree) that encompasses the variance, all higher moments, and
the limiting distribution, including fluctuations.
The profile has been difficult to analyze in other types of trees, so
this will be a challenging task, but results about the profile
parameter yield many
implications about the following parameters: depth (height),
the multiplicity matching parameter (duplication of longest
matches), the number of k-cousins (a generalization of siblings),
the fill-up level (number of complete levels),
the size (number of nodes), etc. If
some aspects of the profile prove too difficult to analyze,
then the relevant aspects of each of these other parameters can
instead be analyzed directly.
Research Goal #2. Discover methods to make suffix tree analysis
possible when the characters of the underlying sequence follow a
Markov model of order higher than 1 or follow a hidden Markov model.
Most of the current methods are applicable only in
the case where the underlying sequence has characters that are
independently generated or follow a Markov model of order 1. In
contrast, massive data sequences that arise in many practical
applications follow a Markov model of high order or a hidden Markov model.
Research Goal #3. Apply the new methodology
from above to other related data structures, including
other tree structures (retrieval trees, Patricia
trees, B-trees, etc.),
automata, suffix arrays,
and directed acyclic graphs.
Research Goal #4. Apply the methods for the data structures in
the goals above to the relevant algorithms that utilize such
structures, including dynamic text indexing algorithms, algorithms for multiple sequence comparisons, biological motif filtering algorithms, etc.
Research Goal #5. Incorporate approximate pattern matching analysis
in sequences, using Hamming and Levenshtein distance metrics.
The acceptable degrees of approximation will be empirical,
and consistent with applications in which approximate pattern matches
are allowable, e.g., misspellings or near-matches in Google, mutations in
GenBank, errors in Lempel-Ziv data compression.
Research Goal #6. Corroborate all theoretical analysis with
data analysis using massive data sequences from practical
applications.
For instance, test the applicability of results using files
from the Canterbury/Calgary corpus for data compression
and from the GenBank genetic sequence database.
The anticipated results are expected to be among
the first for data structures, such as suffix trees and suffix arrays,
that are able to accurately treat not only the average behavior, but also the
variance, higher moments, limiting distributions, and fluctuations.
In particular, this project features a
creative and original
use of mathematical tools in order to identify probabilistic
characteristics concerning suffix trees (currently one of the most
sophisticated and practical types of randomly-generated tree structures).
The proposed research is both novel and practically
beneficial in several regards. For instance,
the probabilistic methodology developed for analyzing suffix
trees should be applicable to related classes of randomly-generated trees and
stochastic branching structures built from overlapping subsequences of
a common, longer sequence.
Also, scientists who deal with massive data
will benefit, as the stochastic runtime and storage characteristics of
algorithms
and data structures for processing sequences of data are precisely
analyzed. New algorithms and data structures for sequence data are
constantly being introduced; the analogous and fundamental stochastic
theory should continue to be developed too.
|
|