Guy Lebanon

Written by: Andrea Rau, Ph.D. candidate in Statistics

Photos: Guy Lebanon, Josh Dillon, Yi Mao

The automated analysis and visualization of text documents are crucial to the effective use of large text archives such as news stories, email collections, and the World Wide Web. To date, most attempts have concentrated on modeling collections of documents while assuming word exchangeability within each document. Such models typically represent documents by simply counting the number of times particular word stems appear within a document. Professor Guy Lebanon, in collaboration with doctoral students Yi Mao and Joshua Dillon from the School of Electrical and Computer Engineering, has been working to construct sequential document models which capture the sequential nature of documents without assuming word independence or exchangeability.

The approach introduced by Professor Lebanon is based on constructing a local statistical model at different document positions using non-parametric kernel smoothing. The collection of local models is equivalent to a smooth curve in a high-dimensional space whose shape reflects the sequential progression of the local word content. One of the greatest challenges of such an approach is reducing the dimensionality of these curves to two or three dimensions in order to facilitate the visualization of sequential trends within documents. Professor Lebanon, in collaboration with Y. Mao and J. Dillon, has explored several dimensionality reduction approaches that embed the collection of local models in low-dimensional spaces. These approaches include differential operators such as the gradient of the curve (see Figure 3 (left) in [1]), which reflects the rate of change in the local word histogram, and principal component analysis (PCA) (see Figure 3 (right) in [1]), which embeds the curve in an easily visualized two-dimensional space. These two visualization techniques, along with others, facilitate the graphical exploration of sequential trends in documents, including segment boundaries and topic consistency. In Figure 3 (left) of [1], the local maxima in the gradient norm correspond to news story boundaries; in Figure 3 (right) of [1], the different clusters correspond to different news stories.
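The idea of a local word histogram and its gradient norm can be illustrated with a minimal sketch. This is not the implementation from [1]: the Gaussian kernel, the bandwidth sigma, the finite-difference approximation of the gradient, the function names, and the toy two-story word stream are all assumptions chosen for illustration, and the PCA embedding is not covered.

```python
import math

def local_histograms(words, vocab, positions, sigma=0.1):
    """Kernel-smoothed word histograms at normalized positions in [0, 1].

    Assumption: a Gaussian kernel with bandwidth sigma; other kernels
    could be substituted.
    """
    n = len(words)
    index = {w: i for i, w in enumerate(vocab)}
    # Normalized location of each word token in [0, 1].
    locs = [(i + 0.5) / n for i in range(n)]
    curve = []
    for mu in positions:
        weights = [math.exp(-0.5 * ((t - mu) / sigma) ** 2) for t in locs]
        total = sum(weights)
        hist = [0.0] * len(vocab)
        for w, k in zip(words, weights):
            hist[index[w]] += k / total  # each histogram sums to 1
        curve.append(hist)
    return curve

def gradient_norms(curve, positions):
    """Finite-difference estimate of the curve's rate of change
    between adjacent positions."""
    norms = []
    for a, b, ta, tb in zip(curve, curve[1:], positions, positions[1:]):
        diff = math.sqrt(sum((y - x) ** 2 for x, y in zip(a, b)))
        norms.append(diff / (tb - ta))
    return norms

# Toy word stream (hypothetical): two "stories" with disjoint vocabularies.
words = ["stock", "market", "stock", "price"] * 5 + ["rain", "storm", "wind", "rain"] * 5
vocab = sorted(set(words))
positions = [i / 20 for i in range(21)]

curve = local_histograms(words, vocab, positions, sigma=0.05)
norms = gradient_norms(curve, positions)

# The interval with the largest rate of change suggests a segment boundary.
peak = max(range(len(norms)), key=norms.__getitem__)
boundary = (positions[peak] + positions[peak + 1]) / 2
print(f"largest change near position {boundary:.3f}")
```

In this toy stream the two stories meet at the midpoint of the document, so the largest gradient norm should fall near position 0.5. A smaller sigma localizes boundaries more sharply but makes the curve noisier.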

Professor Lebanon hopes that his approach can eventually lead towards a novel browsing technology for documents that will enable visualization, automatic summarization, segmentation, and classification of each document (see Figure 8 in [1] for an example of a framework for this technology). Such technology could help the user easily locate parts of interest within a long document, or even help the user understand a document written in another language.

Professor Lebanon joined the Department of Statistics in 2005, and is currently Assistant Professor of Statistics, Assistant Professor of Electrical and Computer Engineering, and Assistant Professor of Computer Science (Courtesy). Besides modeling text documents, he is also interested in statistical approaches to privacy preservation and in modeling partially ranked data. For more information about Professor Lebanon.

[1] Y. Mao, J. Dillon, and G. Lebanon. Sequential Document Visualization. IEEE Transactions on Visualization and Computer Graphics, 13(6), 2007.

[2] G. Lebanon, Y. Mao, and J. Dillon. The Locally Weighted Bag of Words Framework for Document Representation. Journal of Machine Learning Research (to appear).

Figure 3 (left) in [1]. Local maxima in the gradient norm of the local document model correspond, almost precisely, to topic or subtopic boundaries in a stream of news stories.

Figure 3 (right) in [1]. The three clusters in the curve correspond to three consecutive news stories in a Reuters news stream (RCV1 data).

Figure 8 in [1]. The original text (left) is augmented by a visual summary reflecting sequential trends within the document.