Markov Models for Text Analysis

In this activity, we take a preliminary look at how to model text using a Markov chain. What is a Markov chain? It is a stochastic (random) model that describes how a process moves from state to state.

For example, suppose that we want to analyze the sentence:

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do

If we use a Markov model of order 3, then each sequence of 3 consecutive characters is a state, and the Markov process transitions from state to state as the text is read. For instance, as this sample text is processed, the system makes the following transitions:

"Ali" -> "lic" -> "ice" -> "ce " -> "e w" -> " wa" -> "was" -> "as " -> "s b" -> " be" -> "beg" -> "egi" -> "gin" -> "inn" -> "nni" -> "nin" - > "ing" -> ...

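One way to generate this state sequence is to slide a window of width 3 across the text. A minimal Python sketch (the string and variable names here are just for illustration):

    # A minimal sketch: list the order-3 states of a piece of text.
    text = "Alice was beginning"
    order = 3
    states = [text[i:i + order] for i in range(len(text) - order + 1)]
    print(" -> ".join('"' + s + '"' for s in states))
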
For this lab, you are welcome to use a Markov chain of any order you want. We give examples for order 1 and order 3; they are easy to generalize.

Since this text is about a girl named Alice, we might expect that, when we start at the state "Ali", we move to the state "lic" with high probability.

As another example, suppose we are using a Markov model of order 1 and the current state is "q". When analyzing English text, the next state will be "u" with high probability.

So the transitions for Markov order 1 look like:

"A" -> "l" -> "i" -> "c" -> "e" -> " " -> "w" -> "a" -> "s" -> " " -> "b" -> "e" -> "g" -> "i" -> "n" -> "n" -> "i" -> "n" -> "g"


If we analyze a large text, we can use frequencies to estimate probabilities. For instance, if we reach the state "l" 100 times in the text, the next state might be "i" in 33 of those 100 cases, because the word Alice occurs so many times in the text.
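
With these (hypothetical) counts, the estimated transition probability would be

    P["l","i"] = (times "l" is followed by "i") / (times "l" occurs) = 33/100 = 0.33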

In this laboratory, we use the text of Alice in Wonderland, already suitably "cleaned", i.e., all letters transformed to lowercase and all punctuation removed. We will build a Markov chain model from the first portion of the file. The order of the Markov chain (i.e., the number of characters in each state) should be chosen first.



Version 1:
  1. We will use the first half of Alice in Wonderland to build the model, and we will save the second half for later (for testing, if you like).
  2. Build a matrix that approximates the character-to-character transition probabilities in Alice in Wonderland. You can do this by going through all pairs of adjacent characters in the text. A basic sketch that might help you accomplish this is given below.

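  A minimal sketch in Python, assuming the cleaned first half of the text sits in a file (the filename is hypothetical):

      # A minimal sketch: estimate transition probabilities P[x][y] by
      # counting adjacent character pairs and normalizing each row.
      from collections import defaultdict

      with open("alice_first_half.txt") as f:   # hypothetical filename
          first_half = f.read()

      counts = defaultdict(lambda: defaultdict(int))
      for x, y in zip(first_half, first_half[1:]):
          counts[x][y] += 1                     # how often x is followed by y

      P = {}
      for x, row in counts.items():
          total = sum(row.values())             # times x occurs (with a successor)
          P[x] = {y: c / total for y, c in row.items()}
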
  • Test your Markov model on the second half of the text. How do the probabilities compare? As a test, you might want to do the following (or you can create something more inventive if you like!):
    1. Keep track of a score: initialize the score to 0.
    2. Keep track of the number of attempts, namely, the number of times that we try to predict the next letter. Initialize the total number of attempts to 0.
    3. For each i from 1 to (the length of the second text) - 1, let x be the ith character in the text and y be the (i+1)st character.
    4. According to the model built from the first half of the text, x is followed by y with probability P[x,y]. Treat this as a score: set score = score + P[x,y], and also increment the number of attempts, i.e., set attempts = attempts + 1.
    5. Your average score per attempt is score/attempts; the closer it is to 1, the better the model predicts the next character. How well does the first half of the text tell us something about the second half? A sketch of this evaluation appears below.

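    A minimal sketch of this scoring loop, reusing the dictionary P from the earlier sketch; character pairs never seen in the first half are scored as 0, which is an assumption the exercise leaves open:

        # A minimal sketch of the evaluation described above.
        with open("alice_second_half.txt") as f:  # hypothetical filename
            second_half = f.read()

        score = 0.0
        attempts = 0
        for x, y in zip(second_half, second_half[1:]):
            score += P.get(x, {}).get(y, 0.0)     # P[x,y], or 0 if unseen
            attempts += 1
        print("average score per attempt:", score / attempts)
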
    Now think about visual representations of the text. People have done lots of things with this. The possibilities are endless, but you only have a couple of hours in the laboratory, so we leave these ideas to you.


    As another exercise, if you already know about Markov chains and you finished the laboratory above, try to model the first half of the text using a higher-order Markov chain. For instance, suppose that the order is fixed at 3. Then the Markov chain consists of the following:

    1. A list of all 26^3 = 17576 triples of letters. (Note: this assumes we make the text more uniform by removing spaces and extra punctuation and keeping only lowercase letters.)

    2. A matrix of transition probabilities. Note that, from each state, there are only 26 possible transitions.

    For instance, if the Markov chain is currently at "lic", the possible states that could come next are
    ica, icb, icc, icd, ice, ..., icz

    So the matrix of transition probabilities should have 26^3 * 26 = 26^4 = 456976 entries. In general, if each state has n letters, then there are 26^n states, and the matrix of transition probabilities needs 26^(n+1) entries.

    You can recreate the whole exercise using these higher-order Markov chains and see if the performance increases!
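
    The counting step generalizes directly. In the sketch below, a dictionary keyed by 3-character states stands in for the full 26^4-entry matrix; this sparse representation is our choice, not something the exercise prescribes, and it reuses the first_half string from the earlier sketch:

        # A minimal sketch: order-3 transition counts, stored sparsely.
        from collections import defaultdict

        order = 3
        counts3 = defaultdict(lambda: defaultdict(int))
        for i in range(len(first_half) - order):
            state = first_half[i:i + order]   # e.g. "lic"
            nxt = first_half[i + order]       # the single character that follows
            counts3[state][nxt] += 1

        P3 = {}
        for state, row in counts3.items():
            total = sum(row.values())
            P3[state] = {c: n / total for c, n in row.items()}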

    There are many other potential textual data files that one could use from the web. For instance, the Canterbury corpus is a standard text corpus for data compression applications. There are descriptions of the files in the corpus, and the files can be downloaded as cantrbry.tar.gz or as cantrbry.zip.

    If you know how to clean text (it is easy with Perl or Python, etc.), you can try some other files as well! A minimal cleaning sketch is given below.
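
    A minimal Python sketch of the kind of cleaning used above, keeping only lowercase letters and spaces (the filenames are hypothetical):

        # A minimal sketch: keep only lowercase letters and spaces.
        import re

        with open("raw.txt") as f:                # hypothetical input file
            raw = f.read().lower()
        cleaned = re.sub(r"[^a-z ]+", " ", raw)   # replace everything else with a space
        cleaned = re.sub(r" +", " ", cleaned)     # collapse runs of spaces
        with open("cleaned.txt", "w") as f:       # hypothetical output file
            f.write(cleaned)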