Show HN: On-disk B+ tree for Python 3

laurencerowe · on Jan 21, 2018

ZODB has a mature B+Tree implementation for on-disk use in the BTrees package. https://pypi.python.org/pypi/BTrees

dimatura · on Jan 21, 2018

Nice, I'm a big user of various data stores for scientific work. How does this compare to LMDB (http://www.lmdb.tech/doc/), which uses B-trees (and has a Python interface)?

nicolaslem · on Jan 21, 2018

I haven't run any benchmarks yet, but I expect my implementation to be at least an order of magnitude slower.

I've found my implementation to be CPU intensive: creating Python objects from the raw pages is expensive. That's why bulk inserts and iterations are much faster than insert/get in a loop.

sprt · on Jan 21, 2018

The code looks very clean. Uses type annotations too. Wish there were more comments though.

nicolaslem · on Jan 22, 2018

Thank you! I find that type annotations make code easier to read, so I tend to write fewer comments. But I'll make an effort.

erezsh · on Jan 22, 2018

How does this perform compared to sqlite3?

And can you use keys that aren't builtin values?

nicolaslem · on Jan 22, 2018

It is possible to use your own key if it has a natural order and you write your own simple serializer: https://github.com/NicolasLM/bplustree/blob/master/bplustree...
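
To give a rough idea, here is a minimal sketch of a custom key with a natural order plus a serializer for it. The `Version` key type and the `VersionSerializer` class are made up for illustration, and the `serialize(obj, key_size)` / `deserialize(data)` method shapes are an assumption here; check the linked file for the real interface.

```python
import struct
from functools import total_ordering


@total_ordering
class Version:
    """Hypothetical key type: ordered by (major, minor)."""

    def __init__(self, major: int, minor: int) -> None:
        self.major = major
        self.minor = minor

    def _as_tuple(self):
        return (self.major, self.minor)

    def __eq__(self, other):
        if not isinstance(other, Version):
            return NotImplemented
        return self._as_tuple() == other._as_tuple()

    def __lt__(self, other):
        if not isinstance(other, Version):
            return NotImplemented
        return self._as_tuple() < other._as_tuple()

    def __hash__(self):
        return hash(self._as_tuple())


class VersionSerializer:
    """Hypothetical serializer: two unsigned 16-bit big-endian integers."""

    _layout = struct.Struct('>HH')

    def serialize(self, obj: Version, key_size: int) -> bytes:
        data = self._layout.pack(obj.major, obj.minor)
        if len(data) > key_size:
            raise ValueError('key does not fit in key_size bytes')
        return data

    def deserialize(self, data: bytes) -> Version:
        # Only the first 4 bytes carry the key; anything after is ignored.
        major, minor = self._layout.unpack(data[:self._layout.size])
        return Version(major, minor)
```

The big-endian fixed-width encoding is a design choice, not a requirement: it makes the raw bytes sort the same way as the keys, which is handy when debugging.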

antman · on Jan 21, 2018

Cannot delete items yet

nicolaslem · on Jan 21, 2018

Yes, it's a work in progress, see the `remove` branch.

IvyMike · on Jan 21, 2018

That is a very dangerous name for a branch.

liopleurodon · on Jan 21, 2018

lol, truth

_pgmf · on Jan 22, 2018

Why is this not mentioned very clearly in the README? Seems like willful misrepresentation.

You might also mention that, if replacing large values that use overflow pages, the file has the potential to grow without bounds as it looks like overflow pages are not collected?

j_s · on Jan 22, 2018

Are there any production data stores recommended for low memory usage? What should I use to stream data to disk and back with minimal overhead, preferably with indexed lookups?

bufferoverflow · on Jan 22, 2018

An SSD with the highest random-read/write IOPS you can afford. Something like this:

https://www.computerworld.com/article/2987956/solid-state-dr...

Or FusionIO ioDrive2 Duo, which is above 900K IOPS for both reads and writes.

zerokernel · on Jan 21, 2018

https://pypi.python.org/pypi/BTrees

evolighting · on Jan 22, 2018

What about using a DB? Is there a really simple database for the same purpose?

_pgmf · on Jan 22, 2018

This is pretty cool. As I was reading the code I really wished for an explanation or simple ASCII diagram of the serialization formats for the various node/record types, as well as for the frames/wal format. Given that the poster is the author of the project, I hope you'll consider filling in these kinds of details, as they'd presumably be of interest to the people you're "Show"-ing this to.

nicolaslem · on Jan 22, 2018

Thank you. I know what you mean, I'm not happy with how the serialization is done right now, it's too complicated.

Maybe someone on HN knows a Python serialization library beyond pickle that lets you describe how the data is laid out and takes care of the rest. It looks like struct is not flexible enough for this usage.
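
To illustrate the kind of thing I mean, here is a minimal sketch of a record written with struct alone. The layout (a two-field length header, then key and value bytes, zero-padded to a fixed size) is made up for this example and is not the format the project uses. The fixed header is easy; the variable-length part has to be sliced by hand.

```python
import struct

# Hypothetical layout: 2-byte key length, 2-byte value length (big-endian),
# followed by the key bytes, the value bytes, and zero padding.
HEADER = struct.Struct('>HH')


def pack_record(key: bytes, value: bytes, record_size: int) -> bytes:
    """Serialize one record into exactly record_size bytes."""
    used = HEADER.size + len(key) + len(value)
    if used > record_size:
        raise ValueError('record does not fit in record_size bytes')
    padding = bytes(record_size - used)
    return HEADER.pack(len(key), len(value)) + key + value + padding


def unpack_record(data: bytes):
    """Read the header, then slice the key and value out manually."""
    key_len, value_len = HEADER.unpack_from(data, 0)
    start = HEADER.size
    key = data[start:start + key_len]
    value = data[start + key_len:start + key_len + value_len]
    return key, value
```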

_pgmf · on Jan 21, 2018

Thanks for sharing, I've been interested in finding pure python implementations of storage engines. Another one I came across is whoosh: https://bitbucket.org/mchaput/whoosh/src/a16ebacb47191afaf2d...

shalabhc · on Jan 22, 2018

Check out zodb and durus.

halayli · on Jan 21, 2018

What's wrong with LMDB?