Generating pseudo random text with Markov chains using Python

alxv · on June 16, 2009

In "The Practice of Programming", Brian Kernighan and Rob Pike use the Markov Chain Algorithm to illustrate a number of lessons about software design and implementation. I used it as my "Hello world" program when learning a new programming language.

In Python, the implementation is surprisingly succinct:

    import sys
    import random
    import collections

    MAXGEN = 10000
    NONWORD = ""
    suffixchain = collections.defaultdict(list)

    w1 = w2 = NONWORD
    for line in sys.stdin:
        for word in line.split():
            suffixchain[(w1, w2)].append(word)
            w1, w2 = w2, word
    suffixchain[(w1, w2)].append(NONWORD)

    w1 = w2 = NONWORD
    for i in range(MAXGEN):
        suffixes = suffixchain[(w1, w2)]
        next = random.choice(suffixes)
        if next is NONWORD:
            break
        print next,
        w1, w2 = w2, next

To use it, just run:

  $ python markov.py < english_corpus.txt

davidw · on June 16, 2009

This is relevant, if you want to do something fancier:

http://megahal.alioth.debian.org/How.html

BTW, has anything better than Megahal turned up in the meantime, that is also open source?

Caligula · on June 16, 2009

This is a big reason why speech recognition works. You have a list of probabilities of sequence of words occuring(language model) and it decides on the next word based on the last few previous words.

So if you have the word Super, odds are the next word is Man and not Microphone.

SuperMicrophone · on June 16, 2009

I resent that.

jcsalterego · on June 16, 2009

Not ground-breaking by any means, but a nice, concrete application of Markov chains.