Cool, but each of the properties that knwl.js finds has been the subject of extensive research, and none of them is easy to do well. Sure, the demo finds some interesting properties, but its content was chosen by the author.
E.g. for emotion detection, instead of using a data set such as SentiWordNet (http://sentiwordnet.isti.cnr.it/), the author defines the following data structures:
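Something along these lines — this is a sketch of the word-list approach being described, with my own guessed names and words, not knwl.js's actual source:

```javascript
// Hand-picked word lists (illustrative; the real lists differ).
const positiveWords = ["love", "great", "happy", "excellent", "good"];
const negativeWords = ["hate", "terrible", "sad", "awful", "bad"];

// Naive emotion score: +1 per positive word, -1 per negative word.
function emotionScore(text) {
  const words = text.toLowerCase().match(/[a-z']+/g) || [];
  let score = 0;
  for (const w of words) {
    if (positiveWords.includes(w)) score += 1;
    if (negativeWords.includes(w)) score -= 1;
  }
  return score;
}
```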
Which perhaps works well, but only in a very limited domain. The same can be said for the detection of spam, phone numbers (it only finds American numbers in a certain format at the moment), etc.
As with most projects that deal with parsing natural language, it works best when optimized for your use case. For example, I created a natural language date-parsing library [https://github.com/Tabule/Sherlock] that is great at parsing events for entry into a calendar, but would fail if tested against an entire news article. The cool thing about knwl is that it is open source and the code is very clean, meaning you can easily optimize it for your use case. Knwl lays the groundwork for somebody to build a smarter parser into their app.
+1. I fully agree here: parsing event data, parsing a news article, and parsing tweets are all very different tasks, and each requires different optimizations.
An interesting project regardless, I'll likely be using this in the near future.
This is what I immediately thought of. They've built a nice small library that can be used as a framework for something more detailed and involved, which is fantastic.
I love the concept, but I also agree with some of the comments below. Text extraction and parsing are better done with classification and training. I spent several months working on the problem with Natural [1].
Clever, but not quite good enough for us to parse our notes with. If you type a time such as "14:30PM" it can't detect it, but if you put a space before the "PM" it works fine. Also, a 24-hour time doesn't need the "AM/PM" at all, but it can't detect that case.
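A more tolerant matcher would handle all three cases. Here's a sketch (my own regex, not knwl.js's) that accepts a 24-hour time with or without an AM/PM suffix, and with or without a space before the suffix:

```javascript
// Matches "14:30", "14:30PM", "2:30 pm", etc.
const TIME_RE = /\b([01]?\d|2[0-3]):([0-5]\d)\s?(?:[AP]\.?M\.?)?\b/i;

function findTime(text) {
  const m = text.match(TIME_RE);
  return m ? { hour: parseInt(m[1], 10), minute: parseInt(m[2], 10) } : null;
}
```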
Has anyone heard of solutions that just guess that something looks like a timestamp within a longer string (e.g. generic logs, chats; nothing formalized like mail headers)?
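Absent a library, a pile of common-format regexes gets you surprisingly far. A minimal sketch (the pattern list and names are my own assumptions, not any particular library's API):

```javascript
// Heuristics for timestamp-like substrings in unstructured text.
const TIMESTAMP_PATTERNS = [
  // ISO 8601: 2024-01-05T13:22:01Z, 2024-01-05 13:22:01+02:00
  /\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?\b/,
  // syslog: Oct  3 14:02:11
  /\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) {1,2}\d{1,2} \d{2}:\d{2}:\d{2}\b/,
  // Unix epoch, seconds or milliseconds
  /\b\d{10}(?:\d{3})?\b/
];

function looksLikeTimestamp(s) {
  return TIMESTAMP_PATTERNS.some((re) => re.test(s));
}
```

For genuinely messy input you'd want a real date parser behind the regex pre-filter, but this kind of pre-filter keeps the expensive parsing off most lines.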
A little more info about the internals of the recognition engine would be welcome. It seems one needs to manually supply sets of keywords and key phrases?
The execution could have been better, as others have mentioned before me. Hopefully others will contribute to make it more intelligent. Nevertheless: great idea!