How to Write a Spelling Corrector (norvig.com)
401 points by colobas on Sept 8, 2016 | 126 comments



I couldn't help but read this and think about all the "coding" initiatives I've seen in K-12 and shake my head.

What Norvig is doing is what we should be teaching. He is tackling this seemingly REALLY hard problem by thinking about it methodically, translating some intuition into code, carefully constructing an argument about how to solve it, and ways that it could be extended. This is what actual engineers look like.

Everything I've seen around "coding" though has become a masochistic exercise in teaching kids random syntax details and then calling them Coders and Geniuses and Computer Scientists when they successfully copy what the teacher showed them.

When you read Norvig's code (big fan of his Sudoku one as well), you realize how the actual "code" is secondary in the sense that what it is really doing is expressing an idea. A very nuanced, elegant idea, but ultimately the product of doing some hard thinking and exploration on a problem domain.

If we taught kids to just think about problems in this way, ohh what a world it would be!


I agree that what Norvig demonstrates here and in all of his notebooks is an ideal of how we use programming to explore (nevermind implement) concepts. But how do we get there without teaching people how to program, including syntax? I think everyone agrees that kids should be able to understand the themes of "A Modest Proposal" and perhaps even write with such depth, but they have to learn their ABCs and be compelled to write about what they ate for breakfast and other insignificant topics before they get to an adequate level.

I wish I could say that in my time of teaching, I've met people who could just get these things without actually writing code. But I think for most non-geniuses, including myself, it's all too abstract until you know how to concretely write code yourself.


Ohh I completely agree with you. But what I was mostly reacting to is the idea that the way to propel these kids into the 21st century is to simply teach them some basic coding syntax and leave it at that. Even the new AP CS curriculum seems to take a step back from "hard cs" by adding lots of content not directly related to solving cs problems. All that I really meant is that the true value in coding comes in using it to express solutions to problems, whereas teachers and schools that I've seen try to implement "coding" seem to think that if these kids just memorize syntax they have some great advantage in life.

I'll give you a perfect example. I was asked to evaluate this Scratch course for middle schoolers just as they were about to present their final projects. One of the kids did a basic pong-like game with human-controlled characters. The ball would move all over the place, seemingly randomly. The game didn't seem to make any sense to the kids who played it. But, the administration felt that this was an incredible success.

I later learned that he had produced the game mostly by following a step-by-step tutorial introduced in class. And I also learned that the reason the ball moved erratically was that the kid had absolutely no concept of how to deal with the angles, much less how to identify what portion of the character the ball had struck!

To me, THAT would have been the real learning! What an opportunity to have taken that kid outside to kick a soccer ball (this was in Brazil) and explore some intuition about how it rebounds off a wall; what a chance to see if he could come up with a way to detect where on the character the ball had hit, since this doesn't come out of the box in Scratch.

In other words, I don't mind that they learn the syntax. But there's a reason why firms outsource a lot of "coding" to South Asian countries for pennies on the dollar. Knowing the syntax is cheap. Stimulating kids to think about problems, developing a routine and passion about solving them, that's where the real pot of gold is.


The compiler/interpreter will correct your syntax. Explain it once and show where the documentation is, then let kids loose. There's no need to do a worksheet with 100 "spot the syntax error" problems or to have to write it out by hand on a piece of paper to learn it. That kind of pain is only going to teach kids that programming is a boring game of "find the missing semi-colon".


They're going to have to play that game anyway, since the compiler/interpreter won't spell out the exact issue every time. But they'll get through it because their goal is something else, not that game. A lot of learning comes from getting past stepping stones, but targeting a stepping stone as an explicit thing to learn and drill on is a poor plan.


Use languages with simple syntax and it's a non-issue. There is no need for both a while loop and a do loop in an intro coding course.

For control: foreach, while, Switch, Function

Datatypes: List, enum, Strings, signed 64-bit int, 64-bit floating point.

Would need a small standard library with IO.

Yes, I avoided the 'if' because Switch works and gets people thinking in less binary terms. Same with skipping arrays.


But what you should realize is that the person who wrote this beautiful piece of code is someone at the peak of their career, a thinker, an artist, and that it must be OK to not ever rise to his level, because this is close to a masterpiece. Few of these geniuses become professors (just a guess), because no matter how quirky you are, you can always get a job somewhere.


I think Norvig's actually improved substantially since he wrote this — he wasn't at the peak of his career! But yes, it's a masterpiece, and yes, it's okay to not ever rise to that level, because even if nothing you ever create is as worthwhile as this, it's still worthwhile. Hacking is fun! And sometimes very useful, too.

(Also note that, despite appearances, this isn't the work of one man. Norvig didn't design Python and didn't invent the tabling technique for speeding up search; and Darius found a couple of bugs in the code. That doesn't mean it's not an authentic Norvig masterpiece.)


Sir, btw, I need a rule that I can use intuitively when deciding whether to concatenate two words and when to put a dash between them. Do you have one?


Sir, it depends entirely on how widely used the compound word in question is. Hyphenated compounds that become sufficiently familiar lose their hyphen.


It's funny, I was given this as a lab assignment around when the article came out. In fact, we were shown the article. We had an hour, maybe two, for the lab and had to re-implement it in C++. At this point I already had 3-4 years of experience with C++ but found the task ABSOLUTELY daunting. Looking at it now it seems simple, but I'll never forget trying to figure out how this works for the first time with the two hours I was given. Even if a solution is simple, understanding it can be hard, especially if it seems like it should be hard.


I agree, you could learn so much just looking at how he defines functions. See this, for example:

    def pdist(counter):
        "Make a probability distribution, given evidence from a Counter."
        N = sum(counter.values())
        return lambda x: counter[x]/N

    P = pdist(COUNTS)
from http://nbviewer.jupyter.org/url/norvig.com/ipython/How%20to%...
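A minimal usage sketch, assuming a big.txt corpus and a words() tokenizer along the lines of the older version of the code (the names COUNTS and P follow the notebook; the rest is illustrative):

    import re
    from collections import Counter

    def words(text):
        "Split text into lowercase word tokens."
        return re.findall('[a-z]+', text.lower())

    COUNTS = Counter(words(open('big.txt').read()))
    P = pdist(COUNTS)

    P('the')    # a large probability: 'the' is very common in the corpus
    P('xyzzy')  # 0.0 if the word never occurs (Counter returns 0 for missing keys)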


The resources are out there. It's more an issue of educators and policy-makers either not knowing they exist or not knowing that's what they need.

I'm thinking of books like Think Python and How to Design Programs.


HtDP is fantastic. I taught an after-school class with it and it went really well.


By that reasoning, we should not teach kids how to spell, or about punctuation, and just aim for them writing essays/stories/novels.


"If you want to build a ship, don't drum up the men to gather wood, divide the work and give orders. Instead, teach them to yearn for the vast and endless sea."


That's a lovely idea, although I can't help noticing that I know of no shipbuilding companies based on the teaching of yearning, while I do know of some that seem based on organising a workforce. I suspect this is one of those ideas that, while beautiful, is far from being literally true.

How about "if you want your citizens to fund a space program, don't gather the workforce to build a spaceship but teach them to yearn for the vastness and potential of the void."


That does turn out to be more effective in teaching them to write coherently.

Spelling and punctuation details can come later. If the kids start out being wrong all the time, they just quit.


programming is already such an opaque concept before you learn it, kids entering the class won't even really know what programming is.

if you start with a single-minded focus on syntax, as most do, the kid's mental model of programming becomes "programming is done by entering premade code words that someone made up. if i want to do something, i need to find the premade code word that does that thing".

as opposed to, "programming is about blobs of information and what i do with them - changing their shape, organizing them, picking certain things from them, taking stuff away from them. if i want to solve a complicated problem, i won't start typing, i'll start thinking about how i'd make a machine that makes the blob i want. the best way to make a machine like that is usually out of smaller machines that live inside it. i'd figure out what the smallest machines i need are, and i'd make those, and then use those to make the bigger machines, until i'm done."

if kids were started with a functional, problem-solving oriented approach like that, you'd create very capable programmers much faster. this applies to introductory CS classes in college. i went into my first CS class knowing absolutely nothing about programming beyond CSS/HTML - i was a math / mech eng major at the time. i think it was the third class where they had us write a program in C that did some non-trivial dynamic allocation. they never told us things like what the heap or the stack are, or the concept of a memory leak, what happens if you dereference a null pointer, etc, etc. i never got that program to stop segfaulting. i came very close to failing that class and was told, because of that, i probably shouldn't continue with CS. i taught myself computer science instead and now i know that i'm not stupid or unsuited to programming - my introductory CS education was shit.


The commenter right above you expressed the same issue, and I wrote a longer reply there. It's not that I don't think syntax is important; I just don't think we should pretend that the value in learning to code is learning the syntax. What I'm reacting to is "coding" programs that seem to eradicate any semblance of forcing kids to confront hard problems in favor of presenting them with trivial exercises for the sole purpose of claiming "they can code."


That's how I wish foreign language were taught. I don't care about punctuation. I just want to communicate without embarrassing myself.


My sister once told me that learning a (human) language seems boring until you want to know what that cute boy is saying.


That's basically how I learned French.


Yes, my son's high school software course has a lot of old, dusty design content (data flow diagrams) and chunks of soft content - "social impact of computers".

And the actual coding is Visual Basic forms and a bit of HTML/JavaScript.

Very little in the way of what I would call proper coding: data structures, algorithms.


Spell chequer

Martha Snow

Eye halve a spelling chequer
It came with my pea sea
It plainly marques four my revue
Miss steaks eye kin knot sea.

Eye strike a quay and type a word
And weight four it two say
Weather eye am wrong oar write
It shows me strait a weigh.

As soon as a mist ache is maid
It nose bee fore two long
And eye can put the error rite
It's rare lea ever wrong.

Eye have run this poem threw it
I am shore your pleased two no
It's letter perfect awl the weigh
My chequer tolled me sew.


I thought you came up with it on the spot but it's a real thing. http://www.davidpbrown.co.uk/poetry/martha-snow.html


I think it's cool how you can tell the writer's accent from this - something you can rarely do with written English. "halve" and "kin knot" sound nothing like "have" or "cannot" in my native accent, for example.

Similarly, rhymes in Shakespeare's sonnets have been used to work out how Shakespeare spoke the same words, as we no longer pronounce the words the same way (and many of the rhymes have been broken).


Related: https://trialbysteam.com/2010/03/09/d-j-enright-the-typewrit...

(it's pretty subtle to catch all of the often-scatological jokes in there)


In a way, I'm surprised I can read this so easily. Do our brains read by converting text to sounds, and then parsing the sounds?


As a native English speaker, I found myself reading it and then having to backtrack by a few words to say them aloud in my head to discern the intended meaning, as it was the sound of the words that was important, not the word shape. The second time I read it through it was much easier.


I had to explicitly read it "aloud" in my head, in a way that I don't do when reading other text. I can generally read text very quickly without having to hear it in my head, and in this case it was impossible.


I think many people do have brains that operate this way, probably because their understanding of language is grounded in conversation as a first experience. I suspect this is the reason you see confusion with "your, you're" and "there, their, they're."

I don't parse language in quite that manner, and even in speaking I have a different "mouth feel" for those homonyms. This creates a modest inversion of the idea of spelling errors for me, in that people who truly do say homonyms in such a way that to my hearing it is exactly the same require me to parse for context.


> I don't parse language in quite that manner, and even in speaking I have a different "mouth feel" for those homonyms.

I believe in the (somewhat controversial) non-subvocalization-based text processing and even think I do it myself, but I'm wondering if you could describe more about the "mouth feel" issue. Do you mean that you believe that you pronounce them using different phonology (that another person would potentially be able to hear), that your muscles are doing something different but not in a way that makes an auditory difference, or simply that you're subjectively aware of the spelling while speaking but not necessarily in a way that makes a physically-observable difference? Or is it not quite clear which of these is the case?

I think this is an interesting question in the philosophy of language and also in the psychology of reading. (I've thought about this myself but haven't studied the academic literature about it.) If your answer is the first one, I wonder if you'd be willing to make an audio recording of yourself pronouncing these words that might show what difference you experience.

By the way, there are documented cases where spelling differences have created new pronunciation differences that didn't previously exist in the spoken language. Maybe something like that has been happening in your idiolect?


I have a similar feeling, when I pronounce a homonym/homophone. The word "mouth feel" resonates for me, although I suspect that sometimes it's just an awareness of the spelling of the word. Other times, I'll consciously choose to pronounce a word differently to try to disambiguate it. I'll say "aunt" as "ant"/"ont" (read those second as phonetically-spelled Californian American pronunciations) depending on the situation and flow of the sentence. "They're" is usually "They-er", "their" is sometimes "thur" (like "fur" with a /ð/), and "there" feels like it's pronounced the expected way. "To" might be "tə", but "two" and "too" never are.

Of course, that's all when I'm paying close attention to what I say. I'm sure there are times that I go against those. Also, sorry for the mix of layman phonetic spelling and IPA.


Kind of a controversial issue whether this is always the case, but it's at least commonly or typically the case for many readers: https://en.wikipedia.org/wiki/Subvocalization


Honestly it took me a minute before my brain started to be able to read it a lot better. At first it was very slow because everything is so wrong. So I don't know how common either of our experiences are, but I am your contrary :)


Early reading is taught that way. You may have heard of "Phonics" as an element of literacy?

Full literacy in alphabetic-phonetic languages includes the steps of (1) no longer sounding out words aloud and (2) no longer sounding out most words in your head, but rather grasping the shape of the word immediately.


It's the jump from word-as-sound to word-as-symbol, which I think a lot of people never make, partially because it's not particularly useful in their lives.

I suspect those involved in programming skew very much towards the word-as-symbol, but whether that is causative or correlative I can't say.


I definitely belong to the word-as-symbol camp when reading in my native Norwegian. Besides, I am a programmer (A shoddy one in all but assembly language, though...)

Interestingly, perhaps, I recently made the transition from word-as-sound to word-as-symbol in another language - Morse code; that felt very odd while it was going on. I have parsed words character by character for a couple of decades, and suddenly I found my decoding lagging further behind - I had started hearing words as units, rather than composites of characters. Funnily enough, it was through no conscious effort - it just happened, over the course of a few hours.


I was pondering the same thing; also, I wonder whether not being a native English speaker makes it easier or harder to read it - English is my third language and I found I could read it without any problems.

I think I've read somewhere that reading is basically done by the brain registering the first and last few letters in a word, then just checking whether the ones in the middle are more or less what you expect and where you'd expect them. (In other words - if the start and/or end of a word is altered, your reading speed and comprehension should take a nosedive compared to just messing with the letters in the middle)

If there is some truth to that, it would go a long way towards explaining why we can read it with as little trouble as we do.


I had a surprisingly hard time reading this; but I don't subvocalize, and words are 'things' to me (for example, your/you're is read differently in my head and makes me very sensitive to that mistake).


Love it. It also reminds me of the poem The Chaos, by Gerard Nolst Trenité:

http://ncf.idallen.com/english.html


Along similar lines, "The King's English" http://holyjoe.org/poetry/anonA.htm


Thanks for posting this. It's just as much of a tongue twister as The Chaos :)


Is this where you start considering n-gram probabilities? Curious as to the best approach to this, as it's clear we can all read the above.


You sound like a character from a Dickens novel.


You should read Feersum Endjinn


Her spell checker works great, but it looks like she needs a grammar checker too.


That poem could've been written about speech recognition software as well.


Although not a spell checker specifically, I wrote an offensive language filter for a chat system used by the National Hockey League for a period when they hosted communal chat rooms during televised games.

The architecture I used is completely different from what is described here, but the goals are very similar. I had to handle any curse word in any language, including curses from one language translated into another, as well as offensive phrases and their translated equivalents, offensive slang, and misspelt offensive slang translated from other languages.

I ended up with an offense dictionary of about 700K words and phrases. This was back in '99, so my memory may not be 100% here, but I remember using Perfect Hash to generate a compiled hash table for the dictionary, and then a trie to organize the dictionary lookups. The entire system was about 150K of a downloaded exe to access the NHL simulcast chat, as all the offensive language filtering occurred on the client side. Chatting anything that could be offensive turned it into a series of words with their interiors replaced by asterisks ('*'), and it ran in something like 500 ms. Fun times. That company and project died with the dot-com bust.


> but I remember using Perfect Hash to generate a compiled hash table for the dictionary, and then a trie to organize the dictionary lookups

But... but programmers on HN told me that tries are just useless pieces of trivia used to fail people in interviews, and if you need them a library is just going to do it for you in the most efficient possible way.


Tries are lovely little data structures, don't let anyone tell you any different. It's not like you'll use them every day, but they're nice to have when needed!


When I worked at iTunes, they had a mega 'bad_words.txt' file. It was fun to read when bored.


Interesting. I had to write a badwords filter for a random name generator (for a procedural galaxy generator...) and it's quite a task just for English.


If you've not used a Trie https://en.wikipedia.org/wiki/Trie I suggest checking them out. Very useful for fuzzy searching.


Indeed. I did a spell checker for a Haskell university course, and the trie was the recommended data structure. Makes it really easy to eliminate large chunks of the dictionary at once when edit distance gets too large during traversal.
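A rough sketch of that pruning in Python (the trie layout and names are mine, not from the course; a real checker would also handle transpositions, case folding, and so on):

    class TrieNode:
        def __init__(self):
            self.children = {}
            self.word = None  # set to the full word at terminal nodes

    def insert(root, word):
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word

    def search(root, word, max_dist):
        "Return (word, distance) pairs within max_dist edits of `word`."
        results = []
        # first row of the Levenshtein matrix: distance from the empty prefix
        first_row = list(range(len(word) + 1))
        for ch, child in root.children.items():
            _search(child, ch, word, first_row, results, max_dist)
        return results

    def _search(node, ch, word, prev_row, results, max_dist):
        # build the next Levenshtein row for this trie edge
        row = [prev_row[0] + 1]
        for i in range(1, len(word) + 1):
            cost = 0 if word[i - 1] == ch else 1
            row.append(min(row[i - 1] + 1,           # insertion
                           prev_row[i] + 1,          # deletion
                           prev_row[i - 1] + cost))  # substitution
        if node.word is not None and row[-1] <= max_dist:
            results.append((node.word, row[-1]))
        # prune: if every entry exceeds the budget, nothing deeper in this
        # subtree can ever come back under it
        if min(row) <= max_dist:
            for next_ch, child in node.children.items():
                _search(child, next_ch, word, row, results, max_dist)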


The code's been updated to more modern Python and to not try to smooth the probabilities. (Also it computes probabilities instead of frequencies now, though that shouldn't affect the result.)


His Python code style is really awesome. So concise. Probably inspired by all the Lisp he wrote in the past.

Although he does seem to be using doc strings incorrectly


Notable for code archaeology: the docstrings weren't in the previous version[].

(He also switched from manually implementing counting behavior with a Dict to a Counter, added the P function which replaces an inline dict lookup, similarly added the candidates function, changed the somewhat awkward known_edits2 function to just edits2, and reordered some things.)

[] http://web.archive.org/web/20160408180602/http://norvig.com/...

edit: small edits


Came here to say this. I've combed through his pieces many times over just to glean the way he structures his code. Here, he got my head spinning for a minute with

  return set(w for w in words if w in WORDS)
and made a mental note to use that idiom in the future.

As for doc strings, well, this is just a toy piece of code after all. The main purpose for the code is to be read, instead of actually used. something something hobgoblin of little minds

Edit: formatting.


You can also use the set-comprehension syntax, which functions equivalently but looks slightly nicer:

    return {w for w in words if w in WORDS}
You could also have used the intersection operator directly on the sets:

    return words & WORDS
or, slightly more verbose but maybe clearer for colleagues who don't regularly use sets in Python:

    return words.intersection(WORDS)


> return words & WORDS

Not quite. It'd have to be set(WORDS) instead of WORDS -- which'd be expensive. Or WORDS.keys(), which I'm not sure about -- I'd have to benchmark it.


You would only need to do set(WORDS) once at the beginning of the program, so its amortized cost is low.
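Something like this, assuming WORDS is the Counter from the article and that known() is only ever called with sets (the names here are just illustrative):

    WORD_SET = set(WORDS)   # built once at startup; iterating a Counter yields its keys

    def known(words):
        "Candidates that actually occur in the dictionary."
        return words & WORD_SET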


I wouldn't worry about the expense of set construction vs the expense of an n-squared algorithm.


Caught me: I hadn't read through the article and wasn't sure whether WORDS was a set or a dict. As desdiv points out, this isn't a big deal; you can maintain an equivalent WORDS structure that is a set.


That was one of the phrases used in the program that taught me to touch type. Stuck with me for decades... and has been almost as useful to me as learning to touch type...


that is a great idiom. i remember the first time i saw it (possibly in this very essay), i thought it was awesome. I use it all over the place now. it is similar to the javascript ternary operator [1], which I also find to be very useful

   var varX = (boolean) ? 'X' : 'Y'
[1] https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...


Pretty much every language has a ternary conditional operator: C++, Java, Python, C#, Ruby....


This isn't the Python ternary which is:

  varX = 'X' if boolean else 'Y'

  # And is chainable in a more natural way than C*:
  varX = 'X' if boolean else 'Y' if other else 'Z'


What is wrong with his docstrings?


Nothing's wrong with them. Though it is more usual to use triple quotes for docstrings even if they fit on one line (see PEP 257, "Triple quotes are used even though the string fits on one line. This makes it easy to later expand it.").
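For example (PEP 257 style; the docstring wording here is illustrative, not Norvig's):

    def known(words):
        """The subset of `words` that appear in the WORDS counter."""
        return {w for w in words if w in WORDS}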


Easy in the sense of more pleasant version control diffs.


Couldn't it cause issues with "object definition" code in REPLs and editors, and interfere with documentation generators, i.e. [Sphinx](http://www.sphinx-doc.org/en/stable/)?


I'd be surprised if so. Most of those tools rely on the built-in parser, which treats all string literals similarly.


The last 9 times it was submitted: http://goo.gl/mVSi7W


You can also find that result by clicking on "past" under the submission.


I want to see someone do a post on how they analyzed all of the top stories on HN (ones that got more than "x" upvotes) and calculated the ones that could be resubmitted after "y" time period had gone by. For extra bonus points figure out if any particular hn handle has gone in and resubmitted old stories in the past by a similar technique and analysis.


In Go.


I found it in a StackOverflow answer and found it worth sharing. Didn't bother checking if it had already been posted on hackernews.


Normally when people tell you something has been submitted before they're saying "you might find some of these earlier comments interesting too".


  >>> import spell
  >>> spell.correction('ducking')
  'fucking'
  >>>
Yup, it works.


I love it. Spell checking was pretty relevant to some of my work recently; I needed to correct heavily typo'd address text from Chicago parking ticket data. It uses difflib to recursively find similarly typo'd addresses until it finds a matching address within a list of correct addresses. Definitely a lot more to polish, but I'm kinda proud of the naive approach.

https://github.com/red-bin/tyop_fixer

  coumbia,columbia,0.933333333333
  argy,argyle,0.8
  menomee,menomonee,0.875
  newladn,newland,0.857142857143
  boulevard way,boulevard,0.818181818182
  sherwn,sherwin,0.923076923077
  lawrencec,lawrence,0.941176470588


If people like this, also check out his course at Udacity - https://www.udacity.com/course/design-of-computer-programs--...


The D language compiler consults a simple spell corrector when encountering an unknown identifier. The dictionary is the list of names currently in scope. It's a nice improvement to the error message about an unknown identifier.


I added a spelling corrector to an application a while back. I tried a couple of libraries that implement these ideas. The problem is that it can be slow for long words.

I stumbled across this algorithm which is much faster if you allow some time to pre-process your dictionary. http://blog.faroo.com/2012/06/07/improved-edit-distance-base...

I implemented it here for fun in common lisp. Excuse the ugly code. https://github.com/RyanRiddle/lispell
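The core trick in that post, as I understand it, is to index delete-only variants of every dictionary word ahead of time. Here's a rough Python sketch of the idea (not the linked Lisp code; a real implementation would also rank candidates by frequency and verify each one with an actual edit-distance check):

    from collections import defaultdict

    def deletes(word, max_dist=2):
        "All strings reachable from `word` by deleting up to max_dist characters."
        results, frontier = {word}, {word}
        for _ in range(max_dist):
            frontier = {w[:i] + w[i+1:] for w in frontier for i in range(len(w))}
            results |= frontier
        return results

    def build_index(dictionary, max_dist=2):
        "Precompute: map each delete-variant to the dictionary words producing it."
        index = defaultdict(set)
        for word in dictionary:
            for d in deletes(word, max_dist):
                index[d].add(word)
        return index

    def candidates(index, term, max_dist=2):
        "Lookup: dictionary words sharing a delete-variant with the misspelled term."
        found = set()
        for d in deletes(term, max_dist):
            found |= index.get(d, set())
        return found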


The reverse of the logic presented could be used to inject typos into each distributed copy of a document, to help identify anyone sharing documents online; basically, each copy is unique to allow for attribution. A hash of each document could even be given to a third party and archived to provide, if needed, independent verification of the claim, and the document itself could be encrypted and loaded to another third party that logs the IP and fingerprint of the download, providing another independent verification of the exchange.


What is the application where typos in a document are acceptable?


Classified reports. This technique is called the Canary Trap, with the name (but not the technique) coined by Tom Clancy.

https://en.wikipedia.org/wiki/Canary_trap


For that purpose you'd want to add typos that would be "corrected" into the wrong word by a typical spellchecker or that are actually correct, but atypical in context, words. Otherwise passing the document through a spellchecker would remove the watermark.


Heh, perhaps an easier way would be to have unique phrases in each document. Typos risk exposing you because they will stand out to the person who is reading the document - assuming such a person is more discerning than most, given that they're accessing classified docs.


Okay, if you broaden it enough I'm sure you can find some use for almost anything. I was thinking more of a mainstream application.


Pre-publication review copies of a novel


The date on the top says February 2007 to August 2016.

Does anyone know which parts are new in August 2016? I've read this before and it isn't sticking out to me.


The code has been updated stylistically, and I'm not sure what else has changed. Here's the old (Python 2.5) version from http://web.archive.org/web/20160110135619/http://norvig.com/...:

    import re, collections

    def words(text): return re.findall('[a-z]+', text.lower()) 

    def train(features):
        model = collections.defaultdict(lambda: 1)
        for f in features:
            model[f] += 1
        return model

    NWORDS = train(words(file('big.txt').read()))

    alphabet = 'abcdefghijklmnopqrstuvwxyz'

    def edits1(word):
       splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
       deletes    = [a + b[1:] for a, b in splits if b]
       transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
       replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
       inserts    = [a + c + b     for a, b in splits for c in alphabet]
       return set(deletes + transposes + replaces + inserts)

    def known_edits2(word):
        return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

    def known(words): return set(w for w in words if w in NWORDS)

    def correct(word):
        candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
        return max(candidates, key=NWORDS.get)


Here's the history, sorry there is no easy way to see diffs.

https://web.archive.org/web/*/http://norvig.com/spell-correc...


I read this in July. I am sure that the Future work section has been better explained. The main code is the same.


Is there any C++ version of a decent spell checker? I've been looking for some time (mainly out of curiosity); most are either amateur school projects or too academic/PhD-ish. I would love to go through a good open source C++ spell checker that can be used in practice (with few modifications if required).


You could make a good start with the ideas from http://norvig.com/ngrams/ (a fancier corrector in the vein of the OP).


Have you tried hunspell?

https://hunspell.github.io/


What's wrong with Aspell? (Not saying you're wrong; just wondering what I've missed.)


Thanks for the suggestions, will check them out :)


This is a beautiful example of pithy high-level coding. I feel compelled to mention, however, that the example word, 'thew', is in fact a(n archaic) English word: https://www.google.com.au/#q=thew :P


I wrote one in PLV8 (Postgres) just for fun:

http://blog.databasepatterns.com/2014/08/postgresql-spelling...


I wrote a direct port in Golang with benchmark comparisons:

https://github.com/montanaflynn/toy-spelling-corrector


This is really cool, and I'm wondering if you could improve it by adding a Markov chain/tree structure of common word usage patterns and doing a contextual search for your word. You wouldn't need your wordlist, and you could compress and package this.

The way this would work is by looking at the previous word, and the next word if available. It would find every word combination that looks like that and then compute a Levenshtein distance for all of the words that come between these two items.

Is this the way "big" spelling correction methods work or is it by other means?
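To make that concrete, a minimal sketch assuming you already have unigram counts (WORDS, as in the article) and bigram counts from the same corpus; the smoothing, the parameter names, and the reuse of a candidates() function are all just illustrative:

    from collections import Counter

    def bigram_counts(tokens):
        "Count (previous_word, word) pairs from an ordered token list."
        return Counter(zip(tokens, tokens[1:]))

    def P_given_prev(word, prev, BIGRAMS, WORDS):
        "Crude conditional probability P(word | prev) with add-one smoothing."
        return (BIGRAMS[(prev, word)] + 1) / (WORDS[prev] + len(WORDS))

    def correction_in_context(word, prev, candidates, BIGRAMS, WORDS):
        "Pick whichever candidate best fits after the previous word."
        return max(candidates(word),
                   key=lambda c: P_given_prev(c, prev, BIGRAMS, WORDS))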


Yes, using language models is how it's done.

Theoretically, it's not that hard; in practice, it's really hard.

There is an online language modelling course from Stanford that you should check out if you want to take a stab.


That's not how Google does it. Try spelling errors with Google Search. Google does multi-word spelling correction.


I guess that Google looks at each word individually. And what Google does is intriguingly simple: when they see two similar searches from the same person, where the second search has way more results, they record the first search as a "wrong spelling" and the second one as its "correction".


I wrote a Rust version of the spelling corrector some time back to explore the nitty-gritty of the language: https://github.com/vpanghal/spellcorrector


Is this how modern spell checkers actually work? I assumed they would use a heuristic trying to match common misspellings to their frequent corrections. That or a combination of heuristic and Bayes.


For one, I'm fairly sure that modern spell checkers use n-gram language models rather than a 1-gram model to decrease perplexity. Disclaimer: I skimmed the article.


This is the very root of how modern spellcheckers work, but they add more layers, using language modelling.


Versions in C, C++, Java, Haskell and Racket https://github.com/jmoy/norvig-spell


Any article like this for grammar correction?

I've been interested to know why grammar checking and corrections can't be more accurate.


An open source grammar project written by CMU with LGPL license and links to current and future research: http://www.abisource.com/projects/link-grammar


"I've been interested to know why grammar checking and corrections can't be more accurate."

It's because there is no such thing as 'grammar' :)

There is no such thing as 'language' :)

Ok, I misrepresented that a little - but there is no such thing as a 'word list of all the English words and proper names'. And there is definitely no set of clear grammatical rules for English. For some languages - such as German - the rules are more precise, but even then.

So what you end up with is a game of probability and a lot of risk of 'over-correction' (beta errors).

Context matters a lot as well - multilingual speakers, casual typers.

Go to a rap video on youtube and look at the comments. People are arguably not even writing English.


The logic would be quite similar. The dataset would be much larger. For grammar, you'd need to go for trigrams at least, and likely 4-grams and 5-grams. Maybe much further for complex structures. To reduce the dimensionality you might start bucketing some words by part of speech.


English text has an astonishing number of ambiguities and non-grammatical idioms.


"why should they know about something so far outisde their specialty"


Neat code - expressing high-level concepts in such a concise manner.


They don't seem to address keyboard layouts.


this is a good point I think.

sometimes people type the wrong word because certain keys are too close together on the keyboard.

another problem I found with the naive spelling corrector is that it doesn't take pronunciation into account. certain wrong spellings look far from the correct version by edit distance, but they sound similar.


Norvig's code is beautiful.


Carefuly.


I've done this before.

Sadly - the problem is way harder.

First - you get much better results by using language models, n-grams etc. to predict the likelihood of words given previous words. That can be hard.

The really hard part comes down to language that people use.

Colloquialisms, proper names, and mixed-languages ... make this stuff really, really hard.

Getting it 'mostly right' is not hard. Getting it really good is very difficult and depends a lot on context.

'Srinivas' will not be in most people's dictionaries, but it's a common name in India. 'Le' and 'la' are words in French, and there are similar ones in Spanish - and a lot of writers jam these in all over the place. Over-correction and beta errors become a huge problem.

It's a really interesting premise, and difficult for engineers because it's purely probabilistic: there is no way to build a perfect spellchecker unless you can agree 100% on what 'language' is, precisely... and trust me, there is no agreement on that. Not even close.


Writing spell checkers is also hard in languages that, for example, don't have spaces between compound nouns. In essence, that makes it possible to make up a perfectly valid new word, and we expect the spell checker to recognise that word because it knows the two words the compound is made of.

Finnish is probably even worse because of its conjugation.



