
Just to mirror what was said on the thread a month ago when the paper came out[1], if you're interested in FastText I'd strongly recommend checking out Vowpal Wabbit[2] and BIDMach[3].

My main issue is that the FastText paper [7] only compares against other compute-intensive deep methods, not against comparable performance-focused systems like Vowpal Wabbit or BIDMach.

Many of the features implemented in FastText have existed in Vowpal Wabbit (VW) for many years. Vowpal Wabbit also serves as a test bed for many other interesting, and all highly performant, ideas, and has reasonably strong documentation. The command line interface is highly intuitive and it will burn through your datasets quickly. You can recreate FastText in VW with a few command line options[6].

BIDMach is focused on "rooflining", or working out the exact performance characteristics of the hardware and aiming to maximize those[4]. While VW doesn't have word2vec, BIDMach does[5]. More generally, word2vec isn't going to be a major bottleneck in your systems, as it's actually pretty speedy.

To quote from my last comment in [1] regarding features:

Behind the speed of both methods [VW and FastText] is the use of ngrams^, the feature hashing trick (think Bloom filter, except for features) that has been the basis of VW since it began, hierarchical softmax (think finding an item in O(log n) using a balanced binary tree instead of an O(n) array traversal), and a shallow instead of deep model.
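To make the O(log n) point about hierarchical softmax concrete, here is a rough Python sketch. It is not how either tool implements it: the heap-shaped balanced tree, the function names and the fixed node scores are all assumptions for illustration (in a real model each node score would come from the model's hidden representation).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def leaf_probability(leaf, node_scores, num_classes):
    """Probability of one class as a product of left/right decisions.

    Assumed layout (hypothetical, heap-style): internal nodes are numbered
    1..num_classes-1 and class k sits at leaf node num_classes + k.
    node_scores[i] is the score at internal node i; index 0 is unused.
    """
    node = leaf + num_classes
    prob = 1.0
    steps = 0
    while node > 1:
        parent = node // 2
        p_right = sigmoid(node_scores[parent])
        # Odd-numbered nodes are right children, even-numbered are left.
        prob *= p_right if node % 2 == 1 else 1.0 - p_right
        node = parent
        steps += 1
    return prob, steps

num_classes = 8
# Made-up scores for internal nodes 1..7 (index 0 is a placeholder).
node_scores = [0.0, 0.3, -1.2, 0.8, 0.1, -0.5, 2.0, -0.7]

results = [leaf_probability(k, node_scores, num_classes) for k in range(num_classes)]
total = sum(p for p, _ in results)
print(total)                      # very close to 1.0
print([s for _, s in results])    # 3 decisions per class for 8 classes
```

Each class costs log2(8) = 3 sigmoid evaluations instead of a normalisation over all 8 outputs, and the leaf probabilities still sum to 1.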

^ Illustrating ngrams: "the cat sat on the mat" => "the cat", "cat sat", "sat on", "on the", "the mat" - you lose complex positional and ordering information but for many text classification tasks that's fine.
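For anyone who wants to see the ngram and hashing ideas together, a minimal Python sketch follows. The function names, the bucket count and the use of `zlib.crc32` are my own assumptions for illustration; crc32 is only a stand-in hash, not what VW or FastText actually use.

```python
import zlib

def word_ngrams(text, n=2):
    """Split on whitespace and return contiguous word n-grams."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def hash_features(features, num_buckets=2 ** 18):
    """Map each feature string to a bucket index and count occurrences.

    No dictionary of feature names is kept: memory is bounded by
    num_buckets, and distinct features may collide in one bucket.
    """
    vec = {}
    for feature in features:
        idx = zlib.crc32(feature.encode("utf-8")) % num_buckets
        vec[idx] = vec.get(idx, 0) + 1
    return vec

bigrams = word_ngrams("the cat sat on the mat")
print(bigrams)
# ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']

hashed = hash_features(bigrams)
print(hashed)  # sparse {bucket index: count} mapping
```

The Bloom filter comparison is about the collisions: with few buckets, unrelated features share a weight, and the model usually tolerates that in exchange for fixed memory.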

[1]: https://news.ycombinator.com/item?id=12063296

[2]: https://github.com/JohnLangford/vowpal_wabbit

[3]: https://github.com/BIDData/BIDMach

[4]: https://github.com/BIDData/BIDMach/wiki/Benchmarks#Reuters_D...

[5]: https://github.com/BIDData/BIDMach/blob/master/src/main/scal...

[6]: https://twitter.com/haldaume3/status/751208719145328640

[7]: https://arxiv.org/abs/1607.01759




Sounds interesting. Can these tools work on character n-grams as FastText does?


In principle, yes: if you just put a space between each character it would work, though it would also build ngrams across word boundaries, which you might not want. Edit: that's for VW; maybe the other lib has special support for character ngrams with word boundaries.
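A quick Python sketch of the difference, in case it helps. Both function names are hypothetical, and the `<` / `>` boundary markers are just one way to keep ngrams inside a word; this is preprocessing you would do yourself, not a built-in option of either tool as far as I know.

```python
def spaced_char_ngrams(text, n=3):
    """Treat every non-space character as its own token, as the
    'space between each character' trick would, then take n-grams.
    N-grams run straight across word boundaries."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

def per_word_char_ngrams(text, n=3):
    """Take character n-grams inside each word only, wrapping the word
    in '<' and '>' so prefixes and suffixes stay distinguishable."""
    grams = []
    for word in text.split():
        marked = "<" + word + ">"
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams

print(spaced_char_ngrams("the cat"))
# ['the', 'hec', 'eca', 'cat']
print(per_word_char_ngrams("the cat"))
# ['<th', 'the', 'he>', '<ca', 'cat', 'at>']
```

The first output contains "hec" and "eca", which straddle the two words; the second never crosses a word boundary.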



