Markov Models for Text Analysis
In this activity, we take a preliminary look at how to model text
using a Markov chain.
What is a Markov chain? It is a stochastic (random) model for
describing the way that a process moves from state to state.
For example, suppose that we want to analyze the sentence:
Alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do
If we use a Markov model of order 3, then each sequence of 3 letters
is a state, and the Markov process transitions from state to state as
the text is read. For instance, as this sample text is processed,
the system makes the following transitions:
"Ali" -> "lic" -> "ice" -> "ce " -> "e w" -> " wa" -> "was" -> "as "
-> "s b" -> " be" -> "beg" -> "egi" -> "gin" -> "inn" -> "nni" ->
"nin" -> "ing" -> ...
For this lab, you are welcome to use any order of Markov chain that
you want. We give examples for Markov order 1 and Markov order 3,
both of which are easy to generalize.
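The sliding-window idea above can be sketched in a few lines of Python
(the function name `states` is just an illustrative choice):

```python
def states(text, order):
    # Slide a window of `order` characters across the text, one step at a time.
    return [text[i:i + order] for i in range(len(text) - order + 1)]

print(states("Alice was beginning", 3)[:5])  # ['Ali', 'lic', 'ice', 'ce ', 'e w']
print(states("Alice", 1))                    # ['A', 'l', 'i', 'c', 'e']
```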
Since this text is about a girl named Alice, we might expect that,
when we start at the state "Ali", we move to the state "lic" with
high probability.
As another example, if we are using a Markov model of order 1 and the
current state is "q", then when analyzing English text the next state
will be "u" with high probability.
So the transitions for Markov order 1 look like:
"A" -> "l" -> "i" -> "c" -> "e" -> " " -> "w" -> "a" -> "s" -> " " ->
"b" -> "e" -> "g" -> "i" -> "n" -> "n" -> "i" -> "n" -> "g"
If we analyze a large text, we can use frequencies to estimate
probabilities. For instance, if we reach the state "l" 100 times in
the text, and 33 of those times the next state is "i" (because the
word Alice occurs so many times in the text), then we estimate the
transition probability from "l" to "i" as 33/100 = 0.33.
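As a toy illustration of turning such counts into probabilities, here
is a sketch in Python (the short sample string is made up, not taken
from the actual text):

```python
from collections import Counter

sample = "alice looked at the little golden key"  # a made-up snippet
# Count which letter follows each occurrence of "l".
followers = Counter(sample[i + 1] for i in range(len(sample) - 1)
                    if sample[i] == "l")
total = sum(followers.values())
probs = {letter: count / total for letter, count in followers.items()}
# "l" is followed by "i" 2 times out of 5, so probs["i"] is 0.4.
```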
In this laboratory, we use the text of
Alice in Wonderland, already suitably "cleaned",
i.e., all letters transformed to lowercase,
and all punctuation removed. We will
build a model for a Markov chain from the first portion of the file. The
order of the Markov chain (i.e., the number of letters in each state)
should be chosen first.
Test your Markov model on the second half of the text. How do the
probabilities compare? As a test, you might want to do the
following (or you can create something more inventive if you like!)
- We will use the first half of Alice in Wonderland to build our model,
  and we will save the second half of Alice in Wonderland for later
  (for testing, if you like).
- Build a matrix that approximates the transition probabilities in
Alice in Wonderland from character to character. You can do this by
just going through all pairs of characters in the text. Here is a
basic algorithm that might help you accomplish this:
- Build (and initialize to all-zeroes)
a 26 by 26 matrix M to store your transitions.
The entry M[i,j] will be used to keep track of the number of times
that the ith letter of the alphabet is followed by the jth letter of
the alphabet.
- For i from "1" to "the length of the first text - 1", let x be the ith character
in the text and y be the (i+1)st character in the text. Then increment M[x,y].
- Now the matrix M contains the counts of all transitions. We want
to turn these counts into probabilities. Here is a method that can
do it. For each i from 1 to 26, sum the entries on the ith row,
i.e., let counter[i] = M[i,1] + M[i,2] + M[i,3] + ... + M[i,26]
- Now define P[i,j] = M[i,j] / counter[i] for all pairs i,j. This
just gives a matrix of probabilities. In other words, now P[i,j] is
the probability of making a transition from letter i to letter j.
- Hooray, you are done! You now have a matrix of probabilities that
describes a Markov model of order 1 for the first half of Alice in Wonderland!
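The recipe above might look like the following sketch in Python (the
name `transition_matrix` is our own choice; the code skips any pair of
characters that is not two lowercase letters, on the assumption that
the text has already been cleaned as described):

```python
def transition_matrix(text):
    # Count transitions between the lowercase letters a-z; pairs
    # containing any other character (e.g. a space) are skipped.
    M = [[0] * 26 for _ in range(26)]
    for x, y in zip(text, text[1:]):
        if "a" <= x <= "z" and "a" <= y <= "z":
            M[ord(x) - ord("a")][ord(y) - ord("a")] += 1
    # Normalize each row of counts into a row of probabilities.
    P = []
    for row in M:
        total = sum(row)
        P.append([count / total if total else 0.0 for count in row])
    return M, P
```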
- Keep track of a score, namely, initialize the "score" to 0.
- Keep track of the number of "attempts", namely, the number of times
that we try to predict the next letter. Initialize the total
number of attempts to 0.
- For each i from "1" to "the length of the second text - 1", let x be
the ith character in the text and y be the (i+1)st character in the text.
- In the first half of the text, x is followed by y
with probability P[x,y].
Let's consider this as a score, so increment the score by P[x,y],
i.e., set score = score + P[x,y]. Also increment the number of
attempts, i.e., set attempts = attempts + 1.
- Your overall average probability per attempt is score/attempts. How
  well does the first half of the text tell us something about the
  second half of the text?
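The scoring loop above could be sketched as follows (assuming `P` is
the 26-by-26 matrix of probabilities built in the earlier steps; the
function name `average_score` is hypothetical):

```python
def average_score(P, text):
    # Sum P[x][y] over consecutive lowercase-letter pairs in `text`,
    # then divide by the number of prediction attempts.
    score, attempts = 0.0, 0
    for x, y in zip(text, text[1:]):
        if "a" <= x <= "z" and "a" <= y <= "z":
            score += P[ord(x) - ord("a")][ord(y) - ord("a")]
            attempts += 1
    return score / attempts if attempts else 0.0
```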
Now think about visual representations of the text. People have done
lots of things with this. The possibilities are endless, but you only
have a couple of hours in the laboratory, so we leave these ideas to you.
As another exercise, if you already know about Markov chains and you
finished the laboratory above, try to model the first half of the text
using a higher-order Markov chain.
For instance, suppose that the chosen order is fixed as 3.
Then a Markov chain consists of the following:
- A list of all 26^3 = 17576 triples of letters.
(Note: This assumes that we make the text uniform by removing spaces
and punctuation, and by using only lower-case letters.)
- A matrix of transition probabilities. Note that, from each
state, there are only 26 possible transitions.
For instance, if the
Markov chain is currently at "lic", the possible states that could
come next are
ica, icb, icc, icd, ice, ..., icz
So the matrix of transition probabilities should have 26^3 * 26 = 26^4
= 456976 entries. In general, if each state has n letters, then there
are 26^n states, and the matrix of transition probabilities needs
26^n * 26 = 26^(n+1) entries.
You can recreate the whole exercise using these higher-order Markov
chains and see if the performance increases!
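One way to sketch a higher-order model is to store the counts sparsely
in a dictionary keyed by state, so we never have to allocate all 26^n
states explicitly (the names below are our own choices):

```python
from collections import Counter, defaultdict

def higher_order_model(text, order=3):
    # Map each `order`-letter state to a Counter of the letters that
    # follow it, then normalize each Counter into probabilities.
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        state = text[i:i + order]
        counts[state][text[i + order]] += 1
    return {state: {c: n / sum(followers.values())
                    for c, n in followers.items()}
            for state, followers in counts.items()}
```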
There are many other potential textual data files that one could use
from the web. For instance, the Canterbury corpus is a very standard
text corpus for data compression applications, and its files are
freely available for download.
If you know how to clean text (it is easy with Perl or Python, etc.),
you can try some other files as well!