Cool, but each of the properties that knwl.js finds has been the subject of extensive research, and none of them is easy to do well. Sure, the demo finds some interesting properties, but its content was chosen by the author.
E.g. for emotion detection, instead of using a data set such as SentiWordNet (http://sentiwordnet.isti.cnr.it/), the author defines the following data structures:
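Something along these lines — this is a sketch of the word-list approach being described, with my own guessed names and words, not knwl.js's actual source:

```javascript
// Hand-picked word lists (illustrative; the real lists differ).
const positiveWords = ["love", "great", "happy", "excellent", "good"];
const negativeWords = ["hate", "terrible", "sad", "awful", "bad"];

// Naive emotion score: +1 per positive word, -1 per negative word.
function emotionScore(text) {
  const words = text.toLowerCase().match(/[a-z']+/g) || [];
  let score = 0;
  for (const w of words) {
    if (positiveWords.includes(w)) score += 1;
    if (negativeWords.includes(w)) score -= 1;
  }
  return score;
}
```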
Which perhaps works well, but only in a very limited domain. The same can be said for the detection of spam, phone numbers (it only finds American numbers in a certain format at the moment), etc.
As with most projects that deal with parsing natural language, it works best when optimized for your use case. For example, I created a natural language date-parsing library [https://github.com/Tabule/Sherlock] that is great at parsing events for entry into a calendar, but would fail if tested against an entire news article. The cool thing about knwl is that it is open source and the code is very clean, meaning you can easily optimize it for your use case. Knwl lays the groundwork for somebody to build a smarter parser into their app.
+1. I fully agree here: parsing event data, parsing a news article, and parsing tweets are all very different tasks, and each requires different optimizations.
An interesting project regardless, I'll likely be using this in the near future.
This is what I immediately thought of. They've built a nice small library that can be used as a framework for something more detailed and involved, which is fantastic.
I love the concept, but I also agree with some of the comments below. Text extraction and parsing are better done with classification and training. I spent several months working on the problem with Natural [1].
Clever, but not quite good enough for us to parse our notes with. If you type a time such as "14:30PM" it can't detect it, but if you put a space before the "PM" it works fine. Also, a 24-hour time doesn't need the "AM/PM" at all, but it can't detect that case.
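A more tolerant matcher would handle all three cases. Here's a sketch (my own regex, not knwl.js's) that accepts a 24-hour time with or without an AM/PM suffix, and with or without a space before the suffix:

```javascript
// Matches "14:30", "14:30PM", "2:30 pm", etc.
const TIME_RE = /\b([01]?\d|2[0-3]):([0-5]\d)\s?(?:[AP]\.?M\.?)?\b/i;

function findTime(text) {
  const m = text.match(TIME_RE);
  return m ? { hour: parseInt(m[1], 10), minute: parseInt(m[2], 10) } : null;
}
```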
Has anyone heard of solutions that just guess that something looks like a timestamp within a longer string (e.g. generic logs, chats; nothing formalized like mail headers)?
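Absent a library, a pile of common-format regexes gets you surprisingly far. A minimal sketch (the pattern list and names are my own assumptions, not any particular library's API):

```javascript
// Heuristics for timestamp-like substrings in unstructured text.
const TIMESTAMP_PATTERNS = [
  // ISO 8601: 2024-01-05T13:22:01Z, 2024-01-05 13:22:01+02:00
  /\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?\b/,
  // syslog: Oct  3 14:02:11
  /\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) {1,2}\d{1,2} \d{2}:\d{2}:\d{2}\b/,
  // Unix epoch, seconds or milliseconds
  /\b\d{10}(?:\d{3})?\b/
];

function looksLikeTimestamp(s) {
  return TIMESTAMP_PATTERNS.some((re) => re.test(s));
}
```

For genuinely messy input you'd want a real date parser behind the regex pre-filter, but this kind of pre-filter keeps the expensive parsing off most lines.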
A little more info about the internals of the recognition engine would be welcome. It seems one needs to manually supply sets of keywords and key phrases?
The execution could have been better, as others have mentioned before me. Hopefully others will contribute to make it more intelligent. Nevertheless: great idea!