
Yep. First letter of a sentence, or title, or any other proper noun. So... how can your program tell which it is? The correct algorithm approaches AI in complexity.



In data/text mining disciplines, correctness is fuzzy rather than boolean. For most industry applications, a solution that handles 99.9% of cases (1 misclassification per 1,000 samples) is well within acceptable bounds.

So, one particular solution with these performance characteristics is to train a classifier on a bunch of labeled data: a decision tree, or e.g. a maximum entropy classifier. Take some sample data from any openly available corpus (or fire up MTurk and create your own), and you're pretty much done with it.
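To make that concrete, here is a rough sketch of the maximum entropy route, not a production tokenizer. It treats every period as a candidate boundary and trains a two-class logistic regression (the binary case of a maximum entropy classifier) with plain SGD. The feature set, the tiny hand-labeled `TRAINING_DOCS`, and all function names are my own assumptions for illustration; a real system would train on a proper corpus and also handle `?`, `!`, quotes and so on.

```python
import math
import re

# Hypothetical hand-labeled data: each document is a list of sentences.
TRAINING_DOCS = [
    ["Dr. Smith arrived at noon.", "He met Mr. Jones there."],
    ["The U.S. team won the game.", "It was close."],
    ["We use tools, e.g. a parser.", "They work well."],
]

def features(text, i):
    """Binary features describing the context of the period at text[i]."""
    prev = re.search(r"\S+$", text[:i])        # token right before the period
    nxt = re.match(r"\s*(\S+)", text[i + 1:])  # token right after it
    prev_tok = prev.group(0) if prev else ""
    next_tok = nxt.group(1) if nxt else ""
    return {
        "bias": 1.0,
        "next_is_capitalized": float(next_tok[:1].isupper()),
        "prev_is_short": float(len(prev_tok) <= 3),
        "prev_is_capitalized": float(prev_tok[:1].isupper()),
        "prev_has_period": float("." in prev_tok),
        "at_end_of_text": float(next_tok == ""),
    }

def labeled_examples(docs):
    """Turn labeled documents into (features, label) pairs, one per period."""
    examples = []
    for sentences in docs:
        text = " ".join(sentences)
        ends, offset = set(), 0
        for s in sentences:
            ends.add(offset + len(s) - 1)  # index of the sentence-final period
            offset += len(s) + 1           # +1 for the joining space
        for i, ch in enumerate(text):
            if ch == ".":
                examples.append((features(text, i), 1 if i in ends else 0))
    return examples

def predict(weights, feats):
    """Probability that this period ends a sentence."""
    z = sum(weights.get(k, 0.0) * v for k, v in feats.items())
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=200, lr=0.5):
    """Fit the weights by stochastic gradient ascent on the log-likelihood."""
    weights = {}
    for _ in range(epochs):
        for feats, label in examples:
            error = label - predict(weights, feats)
            for k, v in feats.items():
                weights[k] = weights.get(k, 0.0) + lr * error * v
    return weights

def split_sentences(text, weights):
    """Split text at every period the model scores above 0.5."""
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch == "." and predict(weights, features(text, i)) > 0.5:
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

if __name__ == "__main__":
    model = train(labeled_examples(TRAINING_DOCS))
    print(split_sentences("Mr. Brown left early. She stayed.", model))
```

With a dozen training periods this only learns something like "short token before the period means abbreviation", so it will happily mis-split on anything it has not seen. That is the point of the 99.9% remark: the quality comes from the amount and variety of labeled data, not from the classifier code.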

Of course, sentence-tokenization is only the tip of the iceberg :)



