Hacker News
Generating domain names using Markov Chains (codeismightier.com)
65 points by codeismightier on Oct 19, 2008 | 25 comments



The new Web 2.0-esque (or dare I say Web 3.0-esque) nomenclature is dominated by "double vowels" (Wii, xumii, mobee) or vowel combos (cuil) not often found in normal English. So this approach won't give you a really innovative name. You'd probably need to perturb the Markov chain to achieve that, or burn a lot of free time mining your own cognitive faculties!
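One way to "perturb" the chain along those lines: flatten the transition counts with a temperature-style exponent, so rarer letter combinations get picked more often. A minimal sketch; the `temperature` knob and the toy counts are my own assumptions, not from the thread:

```python
def perturb(counts, temperature=2.0):
    # Raise each count to 1/temperature: temperature > 1 flattens the
    # distribution, boosting rare transitions relative to common ones.
    return {c: n ** (1.0 / temperature) for c, n in counts.items()}

# Toy example: 'a' is 100x more common than the unusual 'ii' digraph.
weights = perturb({'a': 100.0, 'ii': 1.0})
```

After perturbing, 'ii' is only 10x less likely than 'a' instead of 100x, so the generator starts producing more Web 2.0-flavored oddities.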


But those names suck.


Yeah, but you have to remember that those names will affect the perception of similar names. The Wii, for example, being a monster hit is going to affect the way people perceive names.


You could train the Markov chain specifically on all registered domains. And weight by recency, if you want to be trendy.
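A minimal sketch of how that recency weighting might look. The exponential half-life decay and the `(name, age_in_days)` input shape are my own assumptions, not anything from the thread:

```python
import collections

M = 2  # context length for this toy example

def train_weighted(domains_with_age, half_life_days=365.0):
    # F[context][next_char] = recency-weighted count.
    # Newer registrations contribute more via exponential decay.
    F = collections.defaultdict(collections.Counter)
    for name, age_days in domains_with_age:
        w = ' ' * M + name + ' '
        weight = 0.5 ** (age_days / half_life_days)
        for i in range(len(w) - M):
            F[w[i:i + M]][w[i + M]] += weight
    return F

# A 10-day-old registration vs. a roughly 20-year-old one.
F = train_weighted([('chattly', 10), ('usenet', 7300)])
```

The fresh registration ends up weighted close to 1, while the 20-year-old name contributes almost nothing, so the chain skews toward whatever is being registered right now.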


So this approach won't give you a name like the ones everyone else is using, and that somehow makes it not innovative?


Chattly.com. Beat you to it.

Registered 10/19. Nice. :)


As an aside, Markov chains are a key tool in bioinformatics (hidden Markov models).


Very cool ... but I don't want the code, I just want a web app to generate names for me (and check that the domain is available, of course).

C'mon, where's markovr.com?


I used the tool to make the name markovable, got the domain and made this: http://www.markovable.net


Keep that site up :) It's a great tool.


I think you mean a web app to generate names and squat them. :-)


All this script needs is a bulk-whois gateway on the other end. But seriously, does anyone know of a whois service that allows free high-volume lookups?


Don't bother with whois; query the root/gTLD nameservers for a valid NS record for the domain to see if it's registered. DNS is much quicker, and nobody's going to shut you down for using it.


You can subscribe to receive a daily list of all registered .com/.net/.org domains from the registry.


You guys killed my server (a 256 MB slice)! I can't even ssh into it now!


I bet it was the downloading of that dictionary file.


Does the web console (at manage.slicehost.com) still work?


Yep. Thanks for the suggestion. After I logged in I realized that Apache was basically in an infinite loop where it kept trying to create new child processes, but those were instantly being killed because of a lack of memory. After a hard reset it seems to be working now.


Someone should make this a web app.


Silly python people always writing lots of code. This should be, like, 15 lines.

http://search.cpan.org/~rclamp/Algorithm-MarkovChain-0.06/li...


Funny, I did a similar thing recently, specifically for domain names. The code to train a Markov chain (without using any libraries outside standard Python) is this:

  import re, collections
  from itertools import *
  from operator import add
  
  def words(text):
      return (w.group() for w in re.finditer(r"\w+", text.lower())
              if w.group().isalpha())
  
  # F[context][next_char] = count, with add-one smoothing via the default
  F = collections.defaultdict(lambda: collections.defaultdict(lambda: 1))
  
  M = 5  # context length
  
  def add_spaces(word):
      return ' ' * M + word.strip() + ' ' * M
  
  N = 0  # total transition count, set by train()
  
  def train(words):
      for word in words:
          w = add_spaces(word)
          for i in range(len(w) - M):
              F[w[i:i+M]][w[i+M]] += 1
      global N
      N = reduce(add, (n for d in F.itervalues() for n in d.itervalues()))
  
  train(words(file('big.txt').read()))
Then, I tried to measure how "good" a particular word is (experimenting with different measures):

  from math import log
  
  def goodness(word):
      w = add_spaces(word)
      res = 0
      for i in range(len(w) - M):
          res += F[w[i:i+M]][w[i+M]]
          # alternative measure: res += log(F[w[i:i+M]][w[i+M]])
      return float(res) / ((len(w) - M) * N)
  
  print 'test score:', reduce(add, (goodness(w) for w in words(file('test.txt').read())))
Then, tried to find the top N words at less than 2 edits distance:

  alphabet = 'abcdefghijklmnopqrstuvwxyz'
  
  def edits1(word):
      if isinstance(word, str):
          n = len(word)
          return set(chain(
              (word[0:i] + word[i+1:] for i in range(n)),                         # deletion
              (word[0:i] + word[i+1] + word[i] + word[i+2:] for i in range(n-1)), # transposition
              (word[0:i] + c + word[i+1:] for i in range(n) for c in alphabet),   # alteration
              (word[0:i] + c + word[i:] for i in range(n+1) for c in alphabet)))  # insertion
      else:  # assume word is actually an iterable of words
          return set(reduce(chain, (edits1(e) for e in word)))
  
  def edits2(word):
      return edits1(edits1(word))

  def topn(n, word):
      words = [(x, goodness(x)) for x in edits2(word)]
      words.sort(key=lambda x: -x[1])
      return words[:n]

I have borrowed some code from Peter Norvig's spelling checker, and the style is very much influenced by it.
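The comment above walks through training and scoring, but stops short of the sampling step that actually emits new names. Here is a minimal Python 3 sketch of that step (the original is Python 2; the `generate` helper and the Counter-based table are my own, not from the comment):

```python
import collections
import random

M = 5  # context length, matching the comment above

def train(words):
    # F[context][next_char] = count; words padded with M spaces on each side.
    F = collections.defaultdict(collections.Counter)
    for word in words:
        w = ' ' * M + word + ' ' * M
        for i in range(len(w) - M):
            F[w[i:i + M]][w[i + M]] += 1
    return F

def generate(F, rng=random):
    # Start from the all-spaces context and walk the chain until a
    # space (end-of-word marker) is emitted.
    ctx = ' ' * M
    out = []
    while True:
        counts = F[ctx]
        letters = list(counts)
        c = rng.choices(letters, weights=[counts[l] for l in letters])[0]
        if c == ' ':
            return ''.join(out)
        out.append(c)
        ctx = ctx[1:] + c

F = train(['chattly', 'markovable', 'flickr'])
name = generate(F)
```

Because the sampler only follows transitions that were actually observed, every context it reaches has at least one recorded next character, and the trailing-space padding guarantees termination. (With a corpus this tiny it can only reproduce the training words; on a real dictionary it blends them into new ones.)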


It'd be nice to include the actual library:

http://search.cpan.org/src/RCLAMP/Algorithm-MarkovChain-0.06...

…so, 165+15 = 180 lines total?


Try more like 800 lines for the library. It comes with demos, a test suite, etc. And I think you're skipping the base class in your naive count.

I'm not really sure how the line count of a library is relevant to anything?


For pretty obvious reasons?

  import generating_domain_names as gds

  print gds.genName()
Two lines of code to do Markov-chain-generated domain names!


I just deleted some dead code left from debugging and now it's a few lines shorter.



