Distributed search engines using BitTorrent and SQLite (github.com/lmatteis)
246 points by tosh on Jan 20, 2021 | 48 comments



For work in a similar vein, Mikeal Rogers has recently been working on IPSQL[0], based on peer-to-peer ordered search indexes[1] built on IPFS, which shares the content-addressed nature of BitTorrent.

[0]: https://github.com/mikeal/IPSQL

[1]: https://0fps.net/2020/12/19/peer-to-peer-ordered-search-inde...


The Hypercore protocol has a distributed B-tree as well, called Hyperbee:

https://hypercore-protocol.org/guides/modules/hyperbee/


Hell yeah! Hyperbee is very cool. I'm still trying to get an intuition for the "embedded indexing", and how it compares to a standard b-tree. I did some cursory looking around and couldn't find much on it.


I haven't read the code, and have only read a few-sentence description of embedded indexing. Presuming the writer knows about B-trees and B+trees and isn't just clarifying that they're not using B+trees, embedded indexing sounds similar to Fractal Tree Indexes[0].

For a tree built on an immutable log, it makes sense to buffer some writes at each level of the tree. When you write a new record, you make a copy of the root node with your latest key-value pair appended to the root node's buffer. If this fills up the cache space in the root node, you determine which child node corresponds to the most buffered key-value pairs, and flush those pairs down to that child node, recursively, until no node's buffer is over-full.

This means that often only a small number of nodes near the root need to be updated, instead of having to write lots of updated intermediate nodes like you would with a normal B-tree. The worst-case behavior is all of the caches being full, resulting in O(log N) nodes being written. Also, frequently written keys remain near the root, so if your read pattern resembles your write pattern, frequently read keys will remain near the root node. For TokuDB's Fractal Tree Indexes, my understanding is that these buffered records can also represent schema changes, so that schema changes can be incrementally and lazily applied.

Of course, if one cached write being flushed down the tree catches up to an older cached write for the same key, the older value is just discarded. Presumably, there's also a periodic incremental operation to save space by removing older cached values. Even for something like a blockchain where you want to keep infinite history, you'd probably still want incremental compaction so that clients wanting to download the entire current state don't need to download the entire history.
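For intuition, here's a toy Python sketch of that flush-down step (the names and the buffer capacity are my own inventions, not TokuDB's or Hyperbee's, and it mutates nodes in place for brevity; the log-structured version would copy each node it touches):

    import bisect

    BUFFER_CAP = 4  # max buffered writes per node; purely illustrative

    class Node:
        def __init__(self, pivots=None, children=None):
            self.buffer = {}                # key -> value, buffered writes
            self.pivots = pivots or []      # sorted keys that route to children
            self.children = children or []  # empty list for a leaf

    def child_for(node, key):
        return node.children[bisect.bisect_right(node.pivots, key)]

    def insert(root, key, value):
        root.buffer[key] = value  # a newer write shadows an older buffered one
        flush(root)

    def flush(node):
        while node.children and len(node.buffer) > BUFFER_CAP:
            # group buffered pairs by destination child, flush the biggest group
            groups = {}
            for k in node.buffer:
                groups.setdefault(child_for(node, k), []).append(k)
            child, keys = max(groups.items(), key=lambda g: len(g[1]))
            for k in keys:
                child.buffer[k] = node.buffer.pop(k)
            flush(child)  # recurse until no buffer along the path is over-full

Since only the nodes on the flush path are touched, the immutable-log version rewrites just that handful of nodes per batch of writes.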

[0] https://en.wikipedia.org/wiki/Fractal_tree_index


Check out the workshop on the Hypercore website; it explains it in detail.


With respect to IPFS and Merkle Search Trees, can anyone "in the know" comment on how they're materially different from Probabilistic B-trees as defined by Noms[1] and Dolt[2]? I've been playing a lot with the Noms variant (Prolly Trees) lately and have often wondered where, if at all, they differ from IPFS-ish Merkle Search Trees.

[1]: https://github.com/attic-labs/noms/blob/master/doc/intro.md#...

[2]: https://www.dolthub.com/blog/2020-04-01-how-dolt-stores-tabl...


Merkle Search Trees are like B-trees, as described in https://hal.inria.fr/hal-02303490/document, since internal nodes can store values.

Whereas Noms Prolly-Trees are like B+-trees, since all values are stored in the leaves.

Merkle Search Trees use the prefix of the hash of inserted key-values (the number of leading zeros) to determine which layer will contain each key-value. This provides a deterministic tree structure for any given set of key-values when you arrange them in key order.
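Concretely, the layer assignment is just counting leading zeros of a hash. A toy sketch in Python (counting hex digits; the MST paper parameterizes the base, so treat the details as illustrative):

    import hashlib

    def mst_layer(key: bytes) -> int:
        # Layer = number of leading zero hex digits of the key's hash.
        # A key lands on layer L with probability (1/16)**L, so higher
        # layers get exponentially fewer keys, like B-tree internal nodes.
        digest = hashlib.sha256(key).hexdigest()
        return len(digest) - len(digest.lstrip("0"))

Because the layer depends only on the key's hash, any two peers holding the same set of keys converge on the same tree shape.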

Noms Prolly-Trees use a rolling hash across the sequence of key-values to determine split points, similarly placing a boundary when the number of leading zeros exceeds a threshold. Those leaves are then assigned as children to key-values on the next layer up, which is split the same way, and so on until a single root node is left.
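Sketching the Prolly side the same way (real implementations use a rolling hash over a byte window, e.g. buzhash; a per-pair content hash keeps this short while still giving deterministic boundaries):

    import hashlib

    def is_boundary(key: bytes, value: bytes, mask: int = 0xFF) -> bool:
        # A chunk ends after this pair iff the low bits of its hash are
        # zero, so chunks average mask+1 pairs regardless of write order.
        h = hashlib.sha256(key + b"\x00" + value).digest()
        return int.from_bytes(h[:4], "big") & mask == 0

    def chunk_layer(pairs):
        # Split one sorted layer of (key, value) pairs into nodes; the
        # first key of each node is promoted to the layer above, and the
        # process repeats until a single root node is left.
        nodes, current = [], []
        for key, value in pairs:
            current.append((key, value))
            if is_boundary(key, value):
                nodes.append(current)
                current = []
        if current:
            nodes.append(current)
        return nodes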


Excellent description, thank you! While I'm mulling over what you described, are there any use cases that jump out for Merkle over Noms?

Given that I've been using Prolly-Trees for a content-addressed store, a lot of the descriptions of Merkle sound similar in properties to how Prolly-Trees behave. I struggle to think of a use case where Prolly would excel over Merkle, or where Merkle would excel over Prolly.


10 years ago I worked for a company called Wowd, which built a real distributed search engine. It stored the index in a distributed hash table (Kademlia). We were all students at the time. But Wowd failed to get enough traction, and it got acquihired by Facebook.


What a cool project!


> Site users then start downloading the site torrent, but, rather than downloading pieces of the torrent in "rarest first" order, they download pieces based on the search query they performed.

Interesting. How does the system know where the result of the query might appear in the file?


Interesting question. I looked at the source code to understand that.

SQLite knows where to look when you open a SQLite database and run a query, right? It just asks the underlying filesystem to provide N bytes starting from an offset using a C function, repeats the same operation on different portions of the file, does its computation, and everybody is happy.

The software relies on sqltorrent, which is a custom VFS for SQLite. That means the SQLite function that reads data from a file in the filesystem is replaced by a custom one. That custom code computes which torrent block(s) should have the highest priority, by dividing the offset and the number of bytes that SQLite wants to read by the size of the torrent blocks. It is just a division.
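Roughly like this (a sketch of the arithmetic only, not the actual sqltorrent code):

    def pieces_for_read(offset: int, nbytes: int, piece_size: int) -> range:
        # Torrent pieces covering a read of `nbytes` at `offset`:
        # plain integer division, exactly as described above.
        first = offset // piece_size
        last = (offset + nbytes - 1) // piece_size
        return range(first, last + 1)

    # e.g. SQLite reading a 4 KiB page at offset 1,000,000 of a torrent
    # with 256 KiB pieces only needs piece 3 prioritized:
    list(pieces_for_read(1_000_000, 4096, 256 * 1024))  # -> [3]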

See: https://github.com/bittorrent/sqltorrent/blob/master/sqltorr...


Yep. I've used the same basic technique to write an S3 VFS wrapper for APSW. It's not quick, by any means, but it's dang useful in cases where I have very large SQLite databases on S3 and only need to get a small bit of information out of them.
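The core of such a wrapper is tiny: APSW's apsw.VFS/apsw.VFSFile hooks let you intercept SQLite's xRead, and each call becomes a ranged GET. A minimal sketch of just that read path (the bucket/key layout is hypothetical, and the rest of the VFS plumbing is omitted):

    import boto3

    s3 = boto3.client("s3")

    def s3_xread(bucket: str, key: str, amount: int, offset: int) -> bytes:
        # What the VFS does when SQLite asks for `amount` bytes at `offset`:
        # one ranged GET, so only the pages SQLite actually touches leave S3.
        resp = s3.get_object(
            Bucket=bucket, Key=key,
            Range=f"bytes={offset}-{offset + amount - 1}",
        )
        return resp["Body"].read()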


I was not aware of APSW. Thank you.

For future reference: "APSW has the following enhancements/differences over pysqlite 2 (wrapping SQLite 3): - APSW stays up to date with SQLite. As features are added and functionality changed in SQLite, APSW tracks them. - APSW gives all functionality of SQLite including virtual tables, Virtual File System (VFS), BLOB I/O, backups and file control. [...]" -- https://rogerbinns.github.io/apsw/pysqlite.html#pysqlitediff...


One thing I'd warn about with APSW:

https://rogerbinns.github.io/apsw/download.html#easy-install...

Basically, if you install it from pip (without following the directions to install from git directly), then you end up with some ancient version, which rather defeats the point of APSW.


SQLite uses a memory-page mechanism, and you can define the page size. With a static page size you can divide the file into chunks, giving you an offset for each page chunk; or, if you have the start and end offsets of the bytes being read, you can derive the page offsets from that.

If you only implement the filesystem shims, you receive everything already figured out: the B-tree engine (storage) will ask you for a given chunk of data, which you can resolve to the right torrent chunk, download, and then just deliver the bytes that were asked for.

I deliver SQLite DBs over torrent, and the torrent chunks are optimized for the SQLite page size, which is 64K.

In my case I don't stream the database, but through a filesystem VFS it could be done if I had a need for it.

The beauty here lies in the SQLite architecture and the abstraction over pages.
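For reference, that alignment is a one-time setup on the database side (standard SQLite pragmas; the 64K figure is the one from this comment):

    import sqlite3

    con = sqlite3.connect("index.db", isolation_level=None)  # autocommit, so VACUUM can run
    con.execute("PRAGMA page_size = 65536")  # 64 KiB, SQLite's maximum page size
    con.execute("VACUUM")                    # page_size only takes effect on a rebuild
    con.close()

    # With piece size == page size, the page starting at byte offset `off`
    # lives in exactly one torrent piece (off // 65536), so no read ever
    # straddles a piece boundary.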


This is not as distributed as you might believe.

The content itself is distributed, which creates privacy challenges of its own, but control over that content is centralized. If we want automatic updates of the index, we're still relying on a single party to provide them, and that single party might respond to DMCA requests, remove or censor content, etc.


Maybe having the centralised piece only be available over an .onion site would help keep it out of reach of bad actors?


Since SQLite executing SQL locally on a remote peer machine is essentially computation push-down, one could think of building a planet-scale distributed analytics engine using such a pattern (perhaps using DuckDB and parquet/arrow files under the hood - but which exact SQL engine is behind the query pushdown API can be abstracted away too)

edit: attracted->abstracted


You just reinvented Spark


This probably just splits B+-tree pages. I suggested something similar for using SQLite with static web pages; see [0].

Distributed search engines are not new, many companies tried to make them [1].

I think the main challenges of distributed search engines are not technical. I don't think there is much financial incentive to create/maintain/seed them.

There isn't much advantage to using them either (if an alternative is available). I think the only viable use case for them is indexing pirated torrents.

Maybe a foundation could use this type of technology to lower their expenses. Like a charity version of Google

[0] https://news.ycombinator.com/item?id=25843786

[1] https://en.wikipedia.org/wiki/Distributed_search_engine


Fun fact: the first and last commits were 4 years ago


It reminds me of an old project: https://yacy.net/


It seemed to have a semi-decent crawler and P2P system, but it was beyond useless for searching. It was like using grep to search the internet. I just searched "youtube" and the first 10 pages were random blogs that mention the word youtube instead of the actual website.


For me, the big problem with YaCy is Java. Maybe the project team could work on implementing the algorithm in other languages (as with the torrent protocol, which has implementations in C++, Java, Python...).


Reminds me of when Twitter developed a BitTorrent-based deployment system: https://torrentfreak.com/twitter-uses-bittorrent-for-server-...


I don't think this meets most people's definition of a distributed search engine.


I'd check out http://dazaar.com/ as well. Same ideas, but built on Hypercore technology and with a payments module built in.


Does this keep my queries private?


Based on other comments in here, it doesn't appear to obfuscate which blocks of data are being requested. It's probably safest to assume queries can be inferred from access patterns.


Yes. It will only leak the page offsets of the database you are requesting (over torrent).


For a single query, the combination of index and data pages is almost certainly enough to reconstruct the search keywords with good accuracy.


Yes, with access to the index I agree, but I was assuming the other party would not have access to that.

(But of course, if it's over torrent and public, the other party would have access to this data...)

But let's not forget, it would only fetch the data over torrent if it's not local, and since it does so only once per block, someone spying over torrent would not see further queries after the first hit on that particular piece.

Also, as far as I know, you would only learn about the pieces a peer requests from you, so you would need to seed to as many peers as you can to see all the piece requests from the party you are interested in.

So even assuming a sophisticated party that could reverse-engineer the piece-offset access patterns (and let's not forget that if the block size is big, this gets even harder), I still don't think it would be possible to rebuild the query.


I just hope that whatever original idea you put out in public won't get patented by some tech giant.


Though I'm having trouble finding the source, I believe that, to some degree, an idea published in a public fashion becomes un-patentable / public domain after a given period (something like 6 months). Hopefully someone can chime in with details about publicly shared ideas and patentability.


Pretty sure a patent like that would be invalidated due to prior art. It might be expensive to take to court though. Obligatory IANAL.


I recall hearing somewhere that SQLite isn't hardened against opening maliciously crafted database files; can anyone substantiate that? If so, that would make this a lot less compelling.


Opening and querying an untrusted SQLite database can lead to (remote) code execution. Original research at https://research.checkpoint.com/2019/select-code_execution-f...
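If you have to open a database you didn't create, the usual mitigations are to open it read-only and turn on SQLite's defensive knobs. A hedged sketch (Connection.setconfig needs Python 3.12+; none of this is a guarantee against the attacks in the writeup, it just shrinks the surface):

    import sqlite3

    # Read-only via URI, so nothing can be written through this handle.
    con = sqlite3.connect("file:untrusted.db?mode=ro", uri=True)

    # sqlite3_db_config(SQLITE_DBCONFIG_DEFENSIVE): blocks writes to internal
    # tables and related schema-corruption tricks.
    con.setconfig(sqlite3.SQLITE_DBCONFIG_DEFENSIVE, True)

    con.execute("PRAGMA trusted_schema = OFF")  # don't trust functions/views named in the schema
    con.execute("PRAGMA quick_check")           # structural sanity check before querying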


That's a great writeup; the fact that a view can transform any known static query into an attacker-chosen SELECT makes the attack surface huge.


> This currently only works on Mac OS X.

That is already a sign that this project is not going to go anywhere. Anyone who wants to build a server or make this part of their seedbox is going to use Linux or one of the BSDs.


I think that is fine for a proof of concept.

The bigger reason is, unless I'm missing something, that this is not distributed in the sense most people use the term "distributed" in the context of search engines, so it's not as interesting as everyone is making it out to be.


> this is not distributed in the sense most people use the term "distributed" in the context of search engines,

What sense is that? I'm not familiar with "distributed" having some protected/special meaning for search engines.


macOS is POSIX certified. Is Linux?


POSIX certification is about as useful as a keyboard with a "windows 10 compatible" sticker. Useful for marketing purposes only.


Using AppleOS is like living in a police state.


Well, they work very hard to patrol that walled garden. And apparently famous people who couldn't care less really like it.


Using AppleOS is like living in a police state where the police work for you.


I'm not convinced of the relevance of your strawman.



