Sequential Document Representations and Simplicial Curves
People: Guy Lebanon, Yi Mao, and Joshua Dillon
Description: The popular bag-of-words assumption represents a document as a histogram of word occurrences. While computationally efficient, this representation discards all sequential information. We present and analyze a continuous, differentiable sequential document representation that goes beyond the bag-of-words assumption yet remains efficient and effective. The representation uses smooth curves in the multinomial simplex to capture sequential information. In contrast to n-grams, it can robustly model long-range sequential trends in a document. We study the representation and its geometric properties and demonstrate its applicability to text classification.
Ongoing research applies this framework to sequential visualization of documents. In contrast to other methods that visualize a collection of documents as a cloud of points, we visualize the sequential content of a single document.
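As a rough illustration of how a document can trace a curve in the multinomial simplex, the sketch below computes a position-dependent word distribution by weighting word occurrences with a smoothing kernel centered at a normalized position mu in [0, 1]. The function name `simplicial_curve`, the Gaussian kernel, the `bandwidth` value, and the small additive `smoothing` term are all assumptions made for this example, not details taken from the description above; the published method may differ in its kernel and normalization.

```python
import math

def simplicial_curve(tokens, vocab, num_points=5, bandwidth=0.1, smoothing=0.01):
    """Return num_points word distributions sampled along the document.

    Hypothetical sketch: each returned list is aligned with `vocab` and is
    a point in the multinomial simplex (non-negative entries summing to 1).
    Every token is assumed to appear in `vocab`.
    """
    index = {word: i for i, word in enumerate(vocab)}
    n = len(tokens)
    # Normalized position of each token in [0, 1].
    positions = [(j + 0.5) / n for j in range(n)]
    curve = []
    for k in range(num_points):
        mu = k / (num_points - 1) if num_points > 1 else 0.5
        # A small uniform mass keeps every point strictly inside the simplex
        # (an assumed smoothing choice).
        hist = [smoothing] * len(vocab)
        for pos, tok in zip(positions, tokens):
            # Gaussian kernel weight: words near mu count more (assumed kernel).
            weight = math.exp(-0.5 * ((pos - mu) / bandwidth) ** 2)
            hist[index[tok]] += weight
        total = sum(hist)
        curve.append([h / total for h in hist])
    return curve

# Toy document: the first half is all "a", the second half is all "b".
tokens = "a a a a b b b b".split()
vocab = ["a", "b"]
curve = simplicial_curve(tokens, vocab)
```

With a very large bandwidth every position receives nearly the same weights, so the curve collapses toward a single point resembling the ordinary bag-of-words histogram; a small bandwidth retains more local sequential detail. For the toy document, the curve moves from a distribution dominated by "a" to one dominated by "b".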
Publication:
G. Lebanon. Sequential Document Representations and Simplicial Curves. Proc. of the 22nd Conference on Uncertainty in Artificial Intelligence, 2006.

Representation of a document in one dimension.

Representation of a document in two dimensions.

Screenshot of the visualization tool.