Research Profile
Guy Lebanon - Constructing Sequential Document Models
Guy Lebanon, Josh Dillon, Yi Mao
The approach introduced by Professor Lebanon constructs a local statistical model at different document positions using non-parametric kernel smoothing. The collection of local models is equivalent to a smooth curve in a high-dimensional space whose shape reflects the sequential progression of the local word content. One of the greatest challenges of this approach is reducing the dimensionality of these curves to two or three dimensions so that sequential trends within documents can be visualized.
Professor Lebanon, in collaboration with Y. Mao and J. Dillon, has explored several dimensionality reduction approaches that embed the collection of local models in low-dimensional spaces. These include differential operators such as the gradient of the curve (see Figure 3 (left) in [1]), which reflects the rate of change of the local word histogram, and principal component analysis (PCA) (see Figure 3 (right) in [1]), which embeds the curve in an easily visualized two-dimensional space. These two visualization techniques, along with others, facilitate the graphical exploration of sequential trends in documents, including segment boundaries and topic consistency. Notice how the local maxima of the gradient norm in Figure 3 (left) of [1] correspond to news story boundaries, and how the different clusters in Figure 3 (right) of [1] correspond to different news stories.
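The three ingredients described above (kernel-smoothed local word histograms, the gradient norm of the resulting curve, and a two-dimensional PCA embedding) can be illustrated with a minimal sketch. It is not the implementation used in [1] or [2]: the Gaussian kernel, the bandwidth, the evenly spaced grid, the toy three-segment document, and all function names are assumptions chosen for illustration only.

```python
import numpy as np


def local_word_histograms(word_ids, vocab_size, n_positions=61, bandwidth=0.05):
    """Kernel-smoothed word histograms at evenly spaced positions in [0, 1].

    Assumption: a Gaussian kernel over relative word position; the source
    only specifies "non-parametric kernel smoothing".
    """
    word_ids = np.asarray(word_ids)
    n_words = len(word_ids)
    # Relative position of each word in the document, in (0, 1).
    word_pos = (np.arange(n_words) + 0.5) / n_words
    grid = np.linspace(0.0, 1.0, n_positions)
    # Kernel weights: one row per grid position, one column per word.
    weights = np.exp(-0.5 * ((grid[:, None] - word_pos[None, :]) / bandwidth) ** 2)
    weights /= weights.sum(axis=1, keepdims=True)
    # One-hot encoding of the words; each smoothed histogram sums to 1.
    one_hot = np.zeros((n_words, vocab_size))
    one_hot[np.arange(n_words), word_ids] = 1.0
    return grid, weights @ one_hot


def gradient_norm(curve, grid):
    """Norm of the finite-difference derivative of the curve along the document."""
    velocity = np.gradient(curve, grid, axis=0)
    return np.linalg.norm(velocity, axis=1)


def pca_embed(curve, n_components=2):
    """Project the centered curve onto its leading principal components."""
    centered = curve - curve.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Hypothetical toy document: three 200-word "stories", each using its own
# block of 10 word ids, so the true boundaries sit at 1/3 and 2/3.
word_ids = np.concatenate([np.arange(200) % 10 + 10 * k for k in range(3)])

grid, curve = local_word_histograms(word_ids, vocab_size=30)
norms = gradient_norm(curve, grid)   # peaks suggest segment boundaries
embedding = pca_embed(curve)         # 2-D points suitable for plotting
```

In this toy setting the gradient norm is largest near the two segment boundaries, and the embedded points from the same segment lie close together. A smaller bandwidth would give sharper peaks but noisier local histograms on real text.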
Professor Lebanon hopes that his approach can eventually lead to a novel browsing technology for documents that will enable visualization, automatic summarization, segmentation, and classification of each document (see Figure 8 in [1] for an example of a framework for this technology). Such technology could help users easily locate parts of interest within a long document, or even help them understand a document written in another language.
Professor Lebanon joined the Department of Statistics in 2005, and is currently Assistant Professor of Statistics, Assistant Professor of Electrical and Computer Engineering, and Assistant Professor of Computer Science (Courtesy). Besides modeling text documents, he is also interested in statistical approaches to privacy preservation and in modeling partially ranked data. For more information about Professor Lebanon, please visit his homepage.
[1] Y. Mao, J. Dillon, and G. Lebanon. Sequential Document Visualization. IEEE Transactions on Visualization and Computer Graphics, 13(6), 2007.
[2] G. Lebanon, Y. Mao, and J. Dillon. The Locally Weighted Bag of Words Framework for Document Representation. Journal of Machine Learning Research (to appear).
Figure 3 (left) in [1]. Local maxima in the gradient norm of the local document model correspond, almost precisely, to topic or subtopic boundaries in a stream of news stories.
Figure 3 (right) in [1]. The three clusters in the curve correspond to three consecutive news stories in a Reuters news stream (RCV1 data).
Figure 8 in [1]. The original text (left) is augmented by a visual summary reflecting sequential trends within the document.
