We just started out one and a half weeks ago, joining the Pragmatic Programmer's writing month. We thought a 'release early, release often' approach would be best; that's why there are just a few in-progress chapters.
We will keep you posted, and thanks for the encouragement!
Thanks for your work, I am really enjoying it so far. As a counterpoint to some of the comments regarding the choice of language, I had the opposite response: oh neat, I get to learn Haskell and natural language processing at the same time.
If I could make one request, could you remove the mouseover from the paragraph text that shows the topic heading? It is really distracting for those of us who like to use our mouse pointer as a finger when reading.
You seem to know a lot about NLP. I've asked this question in various places and never found anyone who knew even a little about it, so I hope you don't mind a small question on whether my problem can even be solved with NLP at all.
I'm looking for a way to extract addresses from web pages, where these addresses are immediately recognizable as such by people but are not in a standard format (zip codes before the city or after, no zip codes at all, P.O. box instead of street name, ...). All in text format (no graphics, no OCR problem) but inside HTML tags, in various forms (as a row in a table, inside one or multiple <div>s, as a <ul>, etc).
- Is this an NLP problem?
- If so, where do I start reading/learning? Most NLP seems to be about understanding free-flowing texts of all sorts of subjects. I'm looking for 98% solutions in what I think is a restricted problem space. Is this a reasonable expectation?
Without knowing all the details, this is probably something you could do with a regular language (such as regular expressions).
Since this book is for the 'working programmer' (rather than the 'working scientist'), it seems reasonable to assume that the book will provide techniques that can be used in domain-specific problems.
This could be an NLP problem, although if you can find an adequate solution with a regular expression or context-free grammar, that's the easier route.
A lot of modern NLP is based on statistical methods and driven by training data, meaning a training corpus of example addresses identified within the context of these web pages would be the starting place if you went one of those routes. You might start by looking up academic papers in this area to see whether it's been done and the methods published.
Just for information, CFGs used for processing natural language are almost invariably statistical too these days, because natural language is inherently ambiguous and probabilistic.
"Fruit flies like bananas" can be grammatically parsed in [at least] two ways, but one is a much more likely interpretation.
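To make that concrete, here's a toy sketch of how a probabilistic CFG disambiguates: score each reading by the product of its rule probabilities and keep the likelier one. All rules and numbers below are invented for illustration; real grammars have thousands of rules with probabilities estimated from treebanks.

```python
# Invented rule probabilities for a toy PCFG over
# "Fruit flies like bananas".
rule_p = {
    "S -> NP VP": 1.0,
    "NP -> fruit flies": 0.3,   # compound noun (the insect)
    "NP -> fruit": 0.3,
    "NP -> bananas": 0.4,
    "VP -> V NP": 0.7,
    "VP -> V PP": 0.3,
    "V -> like": 0.5,
    "V -> flies": 0.5,
    "PP -> like bananas": 0.2,  # PP collapsed to one rule for brevity
}

def score(rules):
    """Probability of a parse = product of its rule probabilities."""
    p = 1.0
    for r in rules:
        p *= rule_p[r]
    return p

# Reading 1: (fruit flies) like (bananas) -- insects enjoy bananas
reading1 = ["S -> NP VP", "NP -> fruit flies", "VP -> V NP",
            "V -> like", "NP -> bananas"]
# Reading 2: (fruit) flies (like bananas) -- fruit travels banana-style
reading2 = ["S -> NP VP", "NP -> fruit", "VP -> V PP",
            "V -> flies", "PP -> like bananas"]

p1, p2 = score(reading1), score(reading2)
```

With these (made-up) probabilities the insect reading wins, which is the whole trick: both parses are grammatical, but the model ranks them.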
It's an interesting paper that I intend to dig into more carefully, but I kind of wish that a paper "for the Working Programmer" used a language like Python rather than Haskell. I'm aware that Haskell has a very nice type system for doing things like this -- and I'm a language nerd myself, so it's not that -- but it just seems like it would be more practical in something more "mainstream."
That said, it is interesting from what I've read so far.
I think nltk in combination with another reference is a great way to learn. You can start out with a small project, get it functional and off the ground, then dig deeper.
...which is in turn a nod to _Categories for the Working Mathematician_ by Saunders Mac Lane (as is Clocksin's _Clause and Effect: Prolog Programming for the Working Programmer_, and probably several other books.)
I may try to translate the examples into Javascript (CommonJS platform, not client-side .. on second thought, I may simply do it on NodeJS with Coffeescript) - just so I can learn the topic better. I find that much like taking notes during a lecture, it helps me retain the knowledge better.
Maybe you should try to do the same with python? :)
Agreed, I saw Haskell in the TOC and stopped reading. Still, an interesting project - and about as appropriate for "working" programmers as Larry Paulson's book, so no issues with the title.
It'd be easy enough to rewrite most of the examples in another language anyway (I'd hope), even if elegance is lost in the process...
I've posted this link before, but these NLP posts keep popping up on HN, so I'll keep posting.
Over at http://www.repustate.com, we're taking the more common functions that NLTK performs (and the ones it should) and porting them over as web services. NLTK is kind of buggy here & there, and it's not too great if you're dealing with big data sets. Our API, with the obvious handicap of network latency, is lightning fast because we ported many NLTK functions down to raw C.
Our API is free so have at it, let me know if you want to see us add anything.
While I'm sure something nice will come out of it, you may wish to temporarily disable the NER feature, because it seems to amount to "select capitalized words", at least on the few pages I tried (Wikipedia, Bitcoin, NYTimes).
I'll have to put this on my 'to read' list, it looks really interesting. I think natural language processing/understanding may become one of those next 'big things' like mobile and social media simply because understanding what a user is trying to do will become very important.
I am not a very good Haskell programmer, but I spend an occasional evening with it, and I am interested in NLP also (I have been working off and on on NLP since the early 1980s).
From skimming through the book, it looks like a nice read and just went on my reading list.
This is pretty neat. At the risk of sounding childish, here I go -- I wish books like these could be given life like tryruby.org where you could try out examples and learn along the way. That would be wicked cool.
It's interesting to note that a lot of natural language processing is English-centered. It's clear that English natural language processing is way ahead of the curve, but based on the quality of Chinese results on Google Translate, I take it Asian languages don't do so hot when it comes to natural language processing?
Translation is a lot harder for language pairs that are less related. Most of the European languages are fairly close cousins, so translation between, say, English and French isn't that hard.
That said, it's generally true that for most NLP tasks, we're doing much better on languages similar to English.
Also note that translation is a very different task than parsing, or part-of-speech tagging, for example. Summarization and translation are both open research topics in NLP from what I understand, and aren't really 'solved' in any language.
Translation is also a lot harder for data-poor pairs of languages. Machine translation relies on training on a parallel corpus (= same text, different languages), and gets better the bigger this is.
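To see why corpus size matters, here's a minimal sketch of the classic word-alignment step (EM in the style of IBM Model 1) on a tiny invented English-French corpus. With only three sentence pairs the model can already pull "book" toward "livre"; with too little data, these co-occurrence statistics never separate.

```python
from collections import defaultdict

# Toy parallel corpus (invented sentence pairs).
corpus = [
    ("the house".split(), "la maison".split()),
    ("the book".split(), "le livre".split()),
    ("a book".split(), "un livre".split()),
]

# Initialize translation probabilities t(f|e) uniformly.
f_vocab = {f for _, fs in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))

for _ in range(10):  # a few EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    # E-step: distribute each foreign word's mass over candidate sources.
    for es, fs in corpus:
        for f in fs:
            norm = sum(t[(f, e)] for e in es)
            for e in es:
                c = t[(f, e)] / norm
                count[(f, e)] += c
                total[e] += c
    # M-step: re-estimate t(f|e) from the expected counts.
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]
```

Because "book" co-occurs with "livre" in two pairs while "the" co-occurs with everything, EM concentrates probability on the right alignment. Real systems do this (and much more) over millions of sentence pairs, which is exactly what data-poor language pairs lack.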