We just started out one and a half weeks ago, joining the Pragmatic Programmer's writing month. We thought a 'release early, release often' approach would be best; that's why there are just a few in-progress chapters.
We will keep you posted, and thanks for the encouragement!
Thanks for your work, I am really enjoying it so far. As a counterpoint to some of the comments regarding the choice of language, I had the opposite response: oh neat, I get to learn Haskell and natural language processing at the same time.
If I could make one request, could you remove the mouseover from the paragraph text that shows the topic heading? It is really distracting for those of us who like to use our mouse pointer as a finger when reading.
You seem to know a lot about NLP. I've asked this question in various places and never found anyone who knew even a little about it, so I hope you don't mind a small question on whether my problem can even be solved with NLP at all.
I'm looking for a way to extract addresses from web pages, where these addresses are immediately recognizable as such by people but are not in a standard format (zip codes before the city or after, no zip codes at all, P.O. box instead of street name, ...). All in text format (no graphics, no OCR problem) but inside HTML tags, in various forms (as a row in a table, inside one or multiple <div>s, as a <ul>, etc).
- Is this an NLP problem?
- If so, where do I start reading/learning? Most NLP seems to be about understanding free-flowing texts of all sorts of subjects. I'm looking for 98% solutions in what I think is a restricted problem space. Is this a reasonable expectation?
Without knowing all the details, this is probably something you could do with a regular language (such as regular expressions).
Since this book is for the 'working programmer' (rather than the 'working scientist'), it seems reasonable to assume that the book will provide techniques that can be used in domain-specific problems.
This could be an NLP problem, although if you can find an adequate solution with a regular expression or context-free grammar, that's the easier route.
A lot of modern NLP is based on statistical methods and driven by training data, meaning a training corpus of example addresses identified within the context of these web pages would be the starting place if you went one of those routes. You might start by looking up academic papers in this area to see whether it's been done and the methods published.
Just for information, CFGs used for processing natural language are almost invariably statistical too these days, because natural language is inherently ambiguous and probabilistic.
"Fruit flies like bananas" can be grammatically parsed in [at least] two ways, but one is a much more likely interpretation.
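To make that concrete, here's a toy sketch of how a probabilistic CFG disambiguates: score each reading by the product of its rule probabilities and keep the likelier one. All rules and numbers below are invented for illustration; real grammars have thousands of rules with probabilities estimated from treebanks.

```python
# Invented rule probabilities for a toy PCFG over
# "Fruit flies like bananas".
rule_p = {
    "S -> NP VP": 1.0,
    "NP -> fruit flies": 0.3,   # compound noun (the insect)
    "NP -> fruit": 0.3,
    "NP -> bananas": 0.4,
    "VP -> V NP": 0.7,
    "VP -> V PP": 0.3,
    "V -> like": 0.5,
    "V -> flies": 0.5,
    "PP -> like bananas": 0.2,  # PP collapsed to one rule for brevity
}

def score(rules):
    """Probability of a parse = product of its rule probabilities."""
    p = 1.0
    for r in rules:
        p *= rule_p[r]
    return p

# Reading 1: (fruit flies) like (bananas) -- insects enjoy bananas
reading1 = ["S -> NP VP", "NP -> fruit flies", "VP -> V NP",
            "V -> like", "NP -> bananas"]
# Reading 2: (fruit) flies (like bananas) -- fruit travels banana-style
reading2 = ["S -> NP VP", "NP -> fruit", "VP -> V PP",
            "V -> flies", "PP -> like bananas"]

p1, p2 = score(reading1), score(reading2)
```

With these (made-up) probabilities the insect reading wins, which is the whole trick: both parses are grammatical, but the model ranks them.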
It's an interesting paper that I intend to dig into more carefully, but I kind of wish that a paper "for the Working Programmer" used a language like Python rather than Haskell. I'm aware that Haskell has a very nice type system for doing things like this -- and I'm a language nerd myself, so it's not that -- but it just seems like it would be more practical in something more "mainstream."
That said, it is interesting from what I've read so far.
I think nltk in combination with another reference is a great way to learn. You can start out with a small project, get it functional and off the ground, then dig deeper.
...which is in turn a nod to _Categories for the Working Mathematician_ by Saunders Mac Lane (as is Clocksin's _Clause and Effect: Prolog Programming for the Working Programmer_, and probably several other books.)
I may try to translate the examples into Javascript (CommonJS platform, not client-side .. on second thought, I may simply do it on NodeJS with Coffeescript) - just so I can learn the topic better. I find that much like taking notes during a lecture, it helps me retain the knowledge better.
Maybe you should try to do the same with python? :)
Agreed, I saw Haskell in the TOC and stopped reading. Still, an interesting project - and about as appropriate for "working" programmers as Larry Paulson's book, so no issues with the title.
It'd be easy enough to rewrite most of the examples in another language anyway (I'd hope), even if elegance is lost in the process...
I've posted this link before, but these NLP posts keep popping up on HN, so I'll keep posting.
Over at http://www.repustate.com, we're taking the more common functions that NLTK performs (and the ones it should) and porting them over as web services. NLTK is kind of buggy here & there, and it's not too great if you're dealing with big data sets. Our API, with the obvious handicap of network latency, is lightning fast because we ported many NLTK functions down to raw C.
Our API is free so have at it, let me know if you want to see us add anything.
While I'm sure something nice will come out of it, you may wish to temporarily disable the NER feature, because it seems to amount to "select capitalized words", at least on the few pages I tried (Wikipedia, Bitcoin, NYTimes).
I'll have to put this on my 'to read' list, it looks really interesting. I think natural language processing/understanding may become one of those next 'big things' like mobile and social media simply because understanding what a user is trying to do will become very important.
I am not a very good Haskell programmer, but I spend an occasional evening with it, and I am interested in NLP also (I have been working off and on on NLP since the early 1980s).
From skimming through the book, it looks like a nice read and just went on my reading list.
This is pretty neat. At the risk of sounding childish, here I go -- I wish books like these could be given life like tryruby.org where you could try out examples and learn along the way. That would be wicked cool.
It's interesting to note that a lot of natural language processing is English-centered. It's clear that English natural language processing is way ahead of the curve, but based on the quality of Chinese results on Google Translate, I take it Asian languages don't do so hot when it comes to natural language processing?
Translation is a lot harder for language pairs that are less related. Most of the European languages are fairly close cousins, so translation between, say, English and French isn't that hard.
That said, it's generally true that for most NLP tasks, we're doing much better on languages similar to English.
Also note that translation is a very different task than parsing, or part-of-speech tagging, for example. Summarization and translation are both open research topics in NLP from what I understand, and aren't really 'solved' in any language.
Translation is also a lot harder for data-poor pairs of languages. Machine translation relies on training on a parallel corpus (= same text, different languages), and gets better the bigger this is.
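To see why corpus size matters, here's a minimal sketch of the classic word-alignment step (EM in the style of IBM Model 1) on a tiny invented English-French corpus. With only three sentence pairs the model can already pull "book" toward "livre"; with too little data, these co-occurrence statistics never separate.

```python
from collections import defaultdict

# Toy parallel corpus (invented sentence pairs).
corpus = [
    ("the house".split(), "la maison".split()),
    ("the book".split(), "le livre".split()),
    ("a book".split(), "un livre".split()),
]

# Initialize translation probabilities t(f|e) uniformly.
f_vocab = {f for _, fs in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))

for _ in range(10):  # a few EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    # E-step: distribute each foreign word's mass over candidate sources.
    for es, fs in corpus:
        for f in fs:
            norm = sum(t[(f, e)] for e in es)
            for e in es:
                c = t[(f, e)] / norm
                count[(f, e)] += c
                total[e] += c
    # M-step: re-estimate t(f|e) from the expected counts.
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]
```

Because "book" co-occurs with "livre" in two pairs while "the" co-occurs with everything, EM concentrates probability on the right alignment. Real systems do this (and much more) over millions of sentence pairs, which is exactly what data-poor language pairs lack.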