The introduction is misleading about what tradeoffs it makes
>>> from bounter import bounter
>>> bounts = bounter(size_mb=1)
>>> bounts.update(str(i) for i in xrange(1000000))
>>> bounts['100']
0L
That should give 1. The description should be more upfront that you are getting inaccurate answers.
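For anyone who wants to quantify how far off the counts are rather than just notice it, here is a quick sanity check using only the factory and dict-style lookup shown above (Python 3, so range instead of xrange; the 1 MB budget is deliberately tiny):

from collections import Counter
from bounter import bounter

data = [str(i) for i in range(1000000)]

exact = Counter(data)          # exact baseline, kept fully in RAM
approx = bounter(size_mb=1)    # 1 MB budget for a million distinct keys
approx.update(data)

for key in ('100', '54321', '999999'):
    print(key, exact[key], approx[key])

# With a million distinct keys squeezed into 1 MB, expect many of the
# approximate counts to come back 0 or otherwise differ from the exact 1.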
> Bounter implements approximative algorithms using optimized low-level C structures, to avoid the overhead of Python objects.
"approximative algorithms" isn't a commonly used phrase (at least according to Google) and this sentence doesn't say anything about what the actual trade-off is (loss of accuracy in the counts). A possible alternative would be writing to hard disk so what trade-off is made isn't obvious.
I also don't know what the precision and recall percentages in the example table mean.
The tradeoffs themselves I think are fine. If I have an issue it's that the three different use cases outlined have _completely_ different tradeoffs using completely different data structures, but are presented behind one API differing only in boolean flags.
I don't know about you, but a boolean flag completely changing the behavior of a collection definitely violates least surprise to me. They should be different classes with descriptive names, IMO.
They actually are completely different classes internally. The ‘bounter’ function is just a convenience wrapper (factory). For power users, the internal classes offer more control and parameters.
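Based on that, a minimal sketch of what using an internal class directly might look like (the class name CountMinSketch and its size_mb parameter are my reading of the README rather than the code, so treat the exact signature as an assumption and double-check the docs):

from bounter import CountMinSketch   # low-level class name assumed from the README

cms = CountMinSketch(size_mb=16)     # fixed memory budget chosen up front
cms.update(['a', 'few', 'words', 'a', 'few', 'times'])
print(cms['few'])                    # exact on toy data, approximate at scale

Picking the class by name at least makes it obvious which trade-off you signed up for, which was the grandparent's complaint about the boolean flags.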
If you have any suggestions / concerns, please raise an issue on github. It’s a new library, we’re looking for feedback!
Probabilistic data structures and counts/sketches are well-known in the field, and if you do not know what the trade-offs are then you probably should not be using them. Perhaps you should start by reading up on how a HyperLogLog and a count-min sketch work to learn about the compromises being made and why.
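For readers new to these structures, a toy count-min sketch makes the compromise concrete: every key is hashed into a few rows of small counter arrays, so hash collisions can inflate an estimate but never lose an increment. This is purely illustrative and not how bounter implements it:

import hashlib

class ToyCountMinSketch:
    # depth independent hash rows, width counters per row
    def __init__(self, width=1000, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _buckets(self, key):
        for row in range(self.depth):
            digest = hashlib.md5(f'{row}:{key}'.encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, key, count=1):
        for row, bucket in self._buckets(key):
            self.rows[row][bucket] += count

    def count(self, key):
        # minimum over the rows; collisions only ever push this value up,
        # so the estimate is an upper bound on the true count
        return min(self.rows[row][bucket] for row, bucket in self._buckets(key))

sketch = ToyCountMinSketch()
for i in range(100000):
    sketch.add(str(i % 1000))    # 1000 distinct keys, 100 occurrences each
print(sketch.count('42'))        # at least 100, possibly more due to collisions

Memory stays fixed at width * depth counters no matter how many distinct keys you feed it; the price is that popular buckets get shared and some counts come back too high.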
> The description should be more upfront that you are getting inaccurate answers.
You can't really expect a data structure using a finite amount of memory to contain an infinite amount of data. Otherwise it wouldn't be advertised as a "counter", but as a magic bag of holding.
Containing an infinite amount of data and giving accurate answers are two different things, though. It can have finite memory and still give 100% accurate answers by batching from disk.
Chunking / iterative processing like this is an approach people commonly use with pandas, which otherwise isn't great for data sets too big for RAM.
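For concreteness, exact counting of a column in a file too large for RAM can look like this (the file and column names are made up for the example):

from collections import Counter
import pandas as pd

totals = Counter()
# stream the CSV in 1M-row chunks; only one chunk is in memory at a time
for chunk in pd.read_csv('events.csv', usecols=['token'], chunksize=1_000_000):
    totals.update(chunk['token'].value_counts().to_dict())

print(totals.most_common(10))

The counts are exact, but the totals dict itself still has to fit in memory, which is exactly the situation where a fixed-size structure like bounter starts to look attractive.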
well the intro line says "extremely fast probabilistic counting of item frequencies"
further down, "Such memory vs. accuracy tradeoffs are sometimes desirable in NLP, where being able to handle very large collections is more important than whether an event occurs exactly 55,482x or 55,519x."
if you are on github, perhaps raise an issue to add an example similar to yours as well
And there is a variant with a Sketch function, known as "Filtered Space Saving". I cannot find a working link to the paper, but here is a Golang implementation of it:
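In lieu of the missing link, here is a rough Python sketch of the plain Space-Saving update rule that Filtered Space Saving builds on (FSS adds a hash-based filter in front to decide whether a new item deserves one of the k counters; this is my own illustration, not the Golang code mentioned above):

def space_saving(stream, k):
    # track at most k items; counts of tracked items are overestimates,
    # off by at most the count of the counter they displaced
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # evict the current minimum and hand its count (+1) to the newcomer
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters

print(space_saving(['a', 'b', 'a', 'c', 'a', 'b', 'd'], k=3))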
This isn't a replacement for any native Python data structures. It's a collection of three different approximate data structures for dealing with large data sets where you're ok trading off accuracy for reduced memory usage.
I don't understand why it would be "embarrassing?" Python was designed to have certain characteristics and it has pretty obviously succeeded in terms of implementing design features that attracted a large number of users and a broad community. Nothing to be embarrassed about there. If you need faster processing there are numerous extensions and alternatives.