The introduction is misleading about what tradeoffs it makes
>>> from bounter import bounter
>>> bounts = bounter(size_mb=1)
>>> bounts.update(str(i) for i in xrange(1000000))
>>> bounts['100']
0L
That should give 1. The description should be more upfront that you are getting inaccurate answers.
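For anyone who wants to quantify how far off the counts are rather than just notice it, here is a quick sanity check using only the factory and dict-style lookup shown above (Python 3, so range instead of xrange; the 1 MB budget is deliberately tiny):

from collections import Counter
from bounter import bounter

data = [str(i) for i in range(1000000)]

exact = Counter(data)          # exact baseline, kept fully in RAM
approx = bounter(size_mb=1)    # 1 MB budget for a million distinct keys
approx.update(data)

for key in ('100', '54321', '999999'):
    print(key, exact[key], approx[key])

# With a million distinct keys squeezed into 1 MB, expect many of the
# approximate counts to come back 0 or otherwise differ from the exact 1.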
> Bounter implements approximative algorithms using optimized low-level C structures, to avoid the overhead of Python objects.
"approximative algorithms" isn't a commonly used phrase (at least according to Google) and this sentence doesn't say anything about what the actual trade-off is (loss of accuracy in the counts). A possible alternative would be writing to hard disk so what trade-off is made isn't obvious.
I also don't know what the precision and recall percentages in the example table mean.
The tradeoffs themselves I think are fine. If I have an issue it's that the three different use cases outlined have _completely_ different tradeoffs using completely different data structures, but are presented behind one API differing only in boolean flags.
I don't know about you, but a boolean flag completely changing the behavior of a collection definitely violates least surprise to me. They should be different classes with descriptive names, IMO.
They actually are completely different classes internally. The ‘bounter’ function is just a convenience wrapper (factory). For power users, the internal classes offer more control and parameters.
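Based on that, a minimal sketch of what using an internal class directly might look like (the class name CountMinSketch and its size_mb parameter are my reading of the README rather than the code, so treat the exact signature as an assumption and double-check the docs):

from bounter import CountMinSketch   # low-level class name assumed from the README

cms = CountMinSketch(size_mb=16)     # fixed memory budget chosen up front
cms.update(['a', 'few', 'words', 'a', 'few', 'times'])
print(cms['few'])                    # exact on toy data, approximate at scale

Picking the class by name at least makes it obvious which trade-off you signed up for, which was the grandparent's complaint about the boolean flags.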
If you have any suggestions / concerns, please raise an issue on github. It’s a new library, we’re looking for feedback!
Probabilistic data structures and counts/sketches are well-known in the field, and if you do not know what the trade-offs are then you probably should not be using them. Perhaps you should start by reading up on how a HyperLogLog and a count-min sketch work to learn about the compromises being made and why.
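For readers new to these structures, a toy count-min sketch makes the compromise concrete: every key is hashed into a few rows of small counter arrays, so hash collisions can inflate an estimate but never lose an increment. This is purely illustrative and not how bounter implements it:

import hashlib

class ToyCountMinSketch:
    # depth independent hash rows, width counters per row
    def __init__(self, width=1000, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _buckets(self, key):
        for row in range(self.depth):
            digest = hashlib.md5(f'{row}:{key}'.encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, key, count=1):
        for row, bucket in self._buckets(key):
            self.rows[row][bucket] += count

    def count(self, key):
        # minimum over the rows; collisions only ever push this value up,
        # so the estimate is an upper bound on the true count
        return min(self.rows[row][bucket] for row, bucket in self._buckets(key))

sketch = ToyCountMinSketch()
for i in range(100000):
    sketch.add(str(i % 1000))    # 1000 distinct keys, 100 occurrences each
print(sketch.count('42'))        # at least 100, possibly more due to collisions

Memory stays fixed at width * depth counters no matter how many distinct keys you feed it; the price is that popular buckets get shared and some counts come back too high.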
> The description should be more upfront that you are getting inaccurate answers.
You can't really expect a data structure using a finite amount of memory to contain an infinite amount of data. Otherwise it wouldn't be advertised as a "counter", but as a magic bag of holding.
Containing an infinite amount of data and giving accurate answers are two different things, though. It can have finite memory and still give 100% accurate answers by batching from disk.
Chunking / iterative processing like this is an approach people commonly use with pandas, which otherwise isn't great for data sets too big for RAM.
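For concreteness, exact counting of a column in a file too large for RAM can look like this (the file and column names are made up for the example):

from collections import Counter
import pandas as pd

totals = Counter()
# stream the CSV in 1M-row chunks; only one chunk is in memory at a time
for chunk in pd.read_csv('events.csv', usecols=['token'], chunksize=1_000_000):
    totals.update(chunk['token'].value_counts().to_dict())

print(totals.most_common(10))

The counts are exact, but the totals dict itself still has to fit in memory, which is exactly the situation where a fixed-size structure like bounter starts to look attractive.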
well the intro line says "extremely fast probabilistic counting of item frequencies"
further down, "Such memory vs. accuracy tradeoffs are sometimes desirable in NLP, where being able to handle very large collections is more important than whether an event occurs exactly 55,482x or 55,519x."
if you are on github, perhaps raise an issue to add an example similar to yours as well
And there is a variant with a Sketch function, known as "Filtered Space Saving". I cannot find a working link to the paper, but here is a Golang implementation of it:
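In lieu of the missing link, here is a rough Python sketch of the plain Space-Saving update rule that Filtered Space Saving builds on (FSS adds a hash-based filter in front to decide whether a new item deserves one of the k counters; this is my own illustration, not the Golang code mentioned above):

def space_saving(stream, k):
    # track at most k items; counts of tracked items are overestimates,
    # off by at most the count of the counter they displaced
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # evict the current minimum and hand its count (+1) to the newcomer
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters

print(space_saving(['a', 'b', 'a', 'c', 'a', 'b', 'd'], k=3))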
This isn't a replacement for any native Python data structures. It's a collection of three different approximate data structures for dealing with large data sets where you're ok trading off accuracy for reduced memory usage.
I don't understand why it would be "embarrassing?" Python was designed to have certain characteristics and it has pretty obviously succeeded in terms of implementing design features that attracted a large number of users and a broad community. Nothing to be embarrassed about there. If you need faster processing there are numerous extensions and alternatives.