Ask HN: What is new in algorithms and data structures these days?
737 points by jvanderbot on May 10, 2023 | 211 comments
Algs/DS were my first love in CS. Nowadays, all we hear about is AI/ML. There must be hardware/software improvements coming from or necessitating fundamental Algs/DS research.

Care to share any of your favorite recent results?

Are there big fields still making gains behind all this computing surge?




Anything CRDT (conflict-free replicated data types: https://en.wikipedia.org/wiki/Conflict-free_replicated_data_...) related is fun to read up on and play with.

Papers and references (page maintained by a central academic in the world of CRDTs): https://crdt.tech

Group doing research into how they can be used to build interesting collaborative (and async) applications: https://www.inkandswitch.com

A few of the major open source implementations - mostly for rich text editing or JSON-like data structures:

- Yjs: https://github.com/yjs/yjs

- Automerge: https://github.com/automerge/automerge

- Peritext: https://www.inkandswitch.com/peritext/

- Diamond types: https://github.com/josephg/diamond-types

People building eventually consistent database syncing with them:

- https://electric-sql.com (Postgres <-> SQLite)

- https://vlcn.io (SQLite <-> SQLite)

Open source collaborative servers (coordination, persistence, presence):

- https://github.com/ueberdosis/hocuspocus

- https://github.com/partykit/partykit

- https://github.com/firesync-org/firesync


Diamond types author here. I’m working on another blog post - I had a big breakthrough getting collaborative text editing many times faster again using some new ideas. I’m in the process of writing up our new work now.

And there’s another hard problem I’m stuck on: so, CRDTs like mine and Yjs use (random agent id, sequence number) as unique IDs for operations. But a troublesome user might reuse an operation ID, sending different operations to different peers and giving them the same (agent, seq) pair. This results in really awful, weird bugs. Martin Kleppmann wrote a paper about BFT in CRDTs. His suggestion is we do what git does and use SHA hashes for IDs instead.

But there are two problems:

1. We don’t want to store a hash for every keystroke a user makes. That would get big, fast. We want a DAG of hashes like git, except a hash is generated with each keystroke. How do we build an efficient system that can query by hash, without storing all of the hashes?

2. I want users to be able to permanently delete inserted text. If I accidentally paste an AWS key into a document, I don’t want that key in the document history forever. How do I enable that while still getting all the benefits of the hashing system above? (Note if your hash includes the typed character, even if you delete the character itself from disk, you can trivially brute force the hash to recover any typed characters).

Each of these problems in isolation has a solution. But the real prize is how do you solve both at the same time? I think this problem has a solution but I don’t think anyone has solved it yet. Want to name an algorithm and get my eternal respect? Here’s your chance.
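
To make the failure mode concrete, here is a minimal sketch (in Rust; not diamond types' actual code, and OpLog/apply are made-up names) of what a single peer can and can't do about reused (agent, seq) pairs:

    use std::collections::HashMap;

    type OpId = (String, u64); // (agent id, sequence number)

    // Remembers the first op seen under each id and flags any peer that later
    // sends a *different* op under the same (agent, seq) pair.
    #[derive(Default)]
    struct OpLog {
        seen: HashMap<OpId, Vec<u8>>, // id -> serialized op
    }

    impl OpLog {
        fn apply(&mut self, id: OpId, op: Vec<u8>) -> Result<(), String> {
            if let Some(existing) = self.seen.get(&id) {
                if *existing == op {
                    return Ok(()); // harmless redelivery of the same op
                }
                return Err(format!("equivocation: {:?} reused for a different op", id));
            }
            self.seen.insert(id, op);
            Ok(())
        }
    }

The catch is that this check only fires if one peer happens to receive both conflicting versions. Peers that each saw only one of them silently diverge, which is why content-addressed (hash) IDs are attractive.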


If my understanding of byzantine fault tolerance with CRDTs and collaborative editing is correct, it's really only a problem in systems where you have an untrusted user, someone who is going to actively try and corrupt the system?

It seems to me that for the majority of use cases, collaborative editing in a work sense, you would normally have trust that a user isn't actively hostile. And so I'm not convinced that BFT needs to be a high priority goal in most of these implementations.

I think a higher priority issue is how to aid developers in using generic CRDT structures within a system where not all the rules are encoded into the CRDTs themselves. For example, when using Yjs with Prosemirror, Prosemirror has its own schema that defines how a document can be structured. However, not all of those rules are encoded into the underlying Yjs structure, so it's possible to have two conflicting edits that result in a Yjs document that doesn't comply with the Prosemirror schema. The current system throws away any non-conforming data when the document is loaded or updated. Because of this you need to load the document into a server-side instance of Prosemirror in order to apply that schema when doing a server-side merge, or even if you are just rendering the static content after "blindly" merging the documents.

My point really is that CRDTs are an encoding of user intent, and when these generic CRDTs are merged you have a structure representing merged user intent, not a final conforming state. The solution is either domain-specific CRDTs that are tailored exactly to your application, or a schema system for these generic types that applies rules when merging. Prosemirror/Yjs does (mostly) exactly this, but I think it needs to be solved in a more generic way that can be run on any layer of the system.


Yes - The BFT problem only matters when you have Byzantine actors. But I think users deserve and expect the system to be reasonably well behaved and predictable in all situations. Anything publicly writable, for example, needs BFT resilience. Or any video game.

As for the prosemirror problem, I assume you’re talking about weird merges from users putting markdown in a text crdt? You’re totally right - this is a problem. Text CRDTs treat documents as a simple sequence of characters. And that confuses a lot of structured formats. For example, if two users concurrently bold the same word, the system should see that users agree that it should be bolded. But if that “bold” intent is translated into “insert double asterisks here and here”, you end up with 4 asterisks before and after the text, and that will confuse markdown parsers. The problem is that a text crdt doesn’t understand markdown. It only understands inserting items, and bolding something shouldn’t be treated as an insert.

JSON editing has similar problems. I’ve heard of plenty of people over the years putting json text into a text crdt, only to find that when concurrent edits happen, the json grows parse errors. Eg if two users concurrently insert “a” and “b” into an empty list. The result is [“a””b”] which can’t be parsed.

The answer to both of these problems is to use CRDTs which understand the shape of your data structure. Eg, use a json OT/crdt system for json data (like sharedb or automerge). Likewise, if the user is editing rich text in prosemirror then you want a rich text crdt like peritext. Rich text CRDTs add the concept of annotations - so if two users bold overlapping regions of text, the crdt understands that the result should be that the entire region is bolded. And that can be translated back to markdown if you want.

Luckily, in over a decade of work in the collaborative editing space, I haven’t found any other examples of this problem other than rich text and json. I think if a collaborative editing system supports json data, rich text and plain text then it’ll be enough to add collaborative editing support to 99% of applications.

The ink & switch people did a great write up of how rich text CRDTs work here: https://www.inkandswitch.com/peritext/


> Or any video game

Agreed, BFT is clearly needed for multiplayer CRDT backed video games.

> I assume you’re talking about weird merges from users putting markdown in a text crdt?

Nope, although that is an issue. In that case the document shouldn't be markdown, it should be a rich text CRDT that's converted to markdown as output.

On the conflicts I mentioned, an example: say you are building a to-do list app. First let's do it with Prosemirror and Yjs, but for some reason we have decided to limit the length of a to-do list to 10 items. Prosemirror will let you do that when defining a schema, by setting a maximum number of child nodes on the parent node type. With the current Yjs/Prosemirror system, if you have 9 items in the list and two people concurrently add a 10th, one of them will be dropped by Prosemirror (deterministically). The document schema enforced that rule outside of the CRDT implementation. Yjs xmlFragments do not have the concept of these sorts of rules.

Now say you want to do this with the JSON-like Map and Array types. Again the Array type does not have the concept of a length limit; it will merge the two concurrent edits and create a document with 11 entries. In this case your application needs to detect the no-longer-conforming document and correct it.

The issue comes when you are naively merging the documents on the server and dumping the JSON: the merge will not take into account your application's own conflict resolution. My suggestion is that a CRDT schema could do this; it would be a bit like JSON Schema, but with rules about how to correct misshapen structures.

So yes, I agree these generic rich text plus JSON types cover what 99% of applications need, but they also need to enforce a shape to the data structure that isn't built into the generic types. Having a way to do that as part of the merging layer, rather than application layer, would help to ensure correctness.


Yeah I hear you. I've still never found a good way to implement schema validation rules like "this list cannot have more than 10 items" on top of a CRDT. Eventually consistent collaborative editing systems (like CRDTs and OT based systems) are optimistic. They allow any edits, and then later merge things if necessary. But if you only find out the list has 11 elements at the merging step, what do you do? It's too late to reject either of the inserts which put you in that state.

My best answer at the moment is to tell people to rethink validation rules like that. I can't think of a lot of good use cases for collaborative editing where a "length <= 10" rule is something you want.

Unfortunately, validation rules are really important for referential integrity. If you add a reference to some item in a data set and I concurrently delete that item, what should happen? Does the delete get undone? Does the reference get removed? Is the reference just invalid now? Should references only be to an item at some point in time (eg like a git SHA rather than a path)? Maybe optimistic systems just can't have referential integrity? It's an uncomfortable problem.


Hi Joseph,

This is a hard problem and something that I think has to be intrinsic to the serialization method of your CRDT. Unique IDs of any kind are really risky in CRDTs because they basically break the mathematical links between potential partitioned duplicates. Lists make this problem even harder because you have to contend with whether you should incorporate the ordinal of each element into the hash value.

I’m working on a version control system for visual programs. You wrote a response to someone on structured edits a few months back about how the UNIX fs not being schema/structure aware is the core problem with collaborative edits and destroys open interoperability. I think you hit the nail on the head. I’ve been working on something for nearly a year now to try to address that problem. Let me know if you want to compare notes.


To answer your question 2: I don’t think you can realistically delete a key, contend with network partitions, and have a truly eventually consistent data structure. I think you’re sort of running into the uncommit problem. Feels solvable, but at the expense of another tradeoff you don’t want to make. The solution here is really in workflow. Git solves it by separating the act of committing and pushing. If you push a commit containing a secret, the only thing you can do is invalidate the secret, or not push your offline branch containing the commit in the first place.


Then how about the same way as real-world situations where secrets are committed: regenerate the whole history with that string redacted. Not an ergonomic experience, but I think it’s an acceptable tradeoff and users would understand that it’s a nuclear option for a rare use case.

Disclaimer: layman of the highest degree


I'd be interested to hear more about your version control system for visual programs.

Years back, in 2016 I had a prototype visual programming system where I built a partial Git client in JavaScript to automatically save edits to the program to GitHub. It worked pretty well even with it committing every edit (though there were some bugs).

Email is in my profile if you're interested to share.


> Let me know if you want to compare notes.

Love to. Hit me up if you like - my email address is in my profile.

I'm sometimes bad at responding to email because I'm hyperfocusing or something and falling behind on email. Sorry in advance if that happens.


> 1. We don’t want to store a hash for every keystroke a user makes.

So, why does it have to be every keystroke? We can batch them locally. An approximate rule of thumb to use to guide the batching is typing speed. Using 200 keystrokes/minute, we get an average of 3.33 keystrokes/second. If we use 3 seconds latency for the collaborative editing to resolve concurrent edits, we get 10 keystrokes per batch (3 secs). You can also have an override where smaller batches are replicated to nodes if sufficient time has elapsed since last local edit (eg: 10 secs have elapsed and we have a partial batch of 3 keystrokes only).
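
A sketch of that policy in code (Rust; the thresholds and names here are just the numbers from this comment, not from any real library):

    use std::time::{Duration, Instant};

    const MAX_OPS: usize = 10;                            // roughly 3 seconds of typing at 200 keystrokes/min
    const IDLE_FLUSH: Duration = Duration::from_secs(10); // replicate partial batches after idle time

    struct Batcher {
        pending: Vec<char>,
        last_edit: Instant,
    }

    impl Batcher {
        // Buffer a local keystroke; returns a full batch ready to replicate.
        fn push(&mut self, c: char) -> Option<Vec<char>> {
            self.pending.push(c);
            self.last_edit = Instant::now();
            if self.pending.len() >= MAX_OPS {
                Some(std::mem::take(&mut self.pending))
            } else {
                None
            }
        }

        // Called periodically by the sync loop: flush a partial batch once idle.
        fn tick(&mut self) -> Option<Vec<char>> {
            if !self.pending.is_empty() && self.last_edit.elapsed() >= IDLE_FLUSH {
                Some(std::mem::take(&mut self.pending))
            } else {
                None
            }
        }
    }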

> 2. If I accidentally paste an AWS key into a document, I don’t want that key in the document history forever. How do I enable that while still getting all the benefits of the hashing system above?

In order for this to work, we should assume that every replica's current state is always obtained by starting with the empty document and then playing the operations log. We should also permit the author of an operation to rollback the operation. So, I suppose if the collaborative editor shows an interface of all operations that originated locally and the author can select the operations that inserted the AWS key into the replica, they can issue a rollback operations op and have it propagated to all replicas.

Since the point-in-time state of the replicas are regenerated by replaying the operations log, we would have deleted the accidentally inserted key.

In order to address the issue of the AWS key still being in the operations log, we can define the semantics of the rollback op to also include secure erasure of the previous operation from the operations log (i.e., the rollback op would update the original operation into a no-op).

So, when all replicas apply the rollback op initiated by the author, eventually all replicas will converge to having the accidentally inserted AWS key removed from the replica and the operations log.


> So, why does it have to be every keystroke? We can batch them locally.

This approach would absolutely work. Thanks for bringing it up! The problem is that it has a cost I'd rather not pay.

The problem is that batching only works if you don't send anything until the end of the batch. And that gives you an awful tradeoff - the more "live" your editing system is, the bigger the overhead. It's really nice seeing someone type live. Seeing their changes "pop in" after 5 seconds doesn't feel great. It doesn't feel current. And I think it's too slow for pair programming over Zoom and things like that. It also makes the DAG of changes much more complicated, because there's a much higher chance that changes are concurrent when there's that much latency. And that in turn makes other aspects of the system do a lot more work to merge.

With hashing alone, you don't need to batch changes like that. Instead, you can just store hashes on disk for (for example) every 100 changes, and regenerate the changes in the middle when needed. (Though you need some other tricks so you can know where to look for a given hash).

Respect where it's due, that would 100% solve it. I'm just looking for a way we can delete data without needing to give up the liveness.

> We should also permit the author of an operation to rollback the operation.

This doesn't completely solve the problem of hashes and guessability. Say in the 5 second batch I only type 1 character. We store some metadata, the character and the hash of the metadata & character. Then in the next 5 seconds I issue a rollback operation to delete that character. If I delete the character from durable storage, I can still figure out what the character was by brute forcing things and checking the hash for that batch.
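
To make the guessability concrete, here is roughly what that brute force looks like (Rust with the sha2 crate; the "metadata then character" hash layout is invented for illustration):

    use sha2::{Digest, Sha256};

    // Given a batch's stored hash and its non-secret metadata, recover a deleted
    // single character by hashing every printable ASCII candidate.
    fn recover_char(metadata: &[u8], stored_hash: &[u8; 32]) -> Option<char> {
        (' '..='~').find(|&c| {
            let mut h = Sha256::new();
            h.update(metadata);
            h.update(c.to_string().as_bytes());
            h.finalize().as_slice() == &stored_hash[..]
        })
    }

Short batches stay trivially brute-forceable, which is why deleting the character from disk alone doesn't help.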

I think that's fixable by adding junk data to every batch. I just bet there are ways to avoid doing that.


Are you using a rolling hash (such as the one employed in Rabin-Karp string search algorithm)? You can then store the hash at check points and the intermediate hashes between checkpoints can be regenerated using the rolling hash.

   +[h1] +[batch1] +[h2] +[batch2] -[h3] +[batch3] +[h4]
You would then have the operations log for a replica as

   +[forwardhash(null)] +[hello] +[forwardhash(hello)] +[world] -[reversehash(world)] +[ ] +[forwardhash(wor)] -[ld]+[k]
This sequence would give the following replica states after processing each hash:

   +[forwardhash(null)] +[hello] -> hello
   +[forwardhash(hello)] +[world] -> helloworld
   -[reversehash(world)] +[ ] -> hello world
   +[forwardhash(wor)] -[ld]+[k] -> hello work
The direction of the roll is just a sign bit of the hash.

We don't need to encode positions separately as it is a function of the state of the replica and the sign of the hash.


I like your approach to thinking about this. Seems like a plausible set of assumptions to make.


I think keystrokes was just a metaphor for whatever operation the application has to deal with.


No, the GP is right. Text CRDTs (including mine) literally generate an operation per inserted character in a collaborative text document. Diamond types solves the data explosion problem by not storing the operations separately. Instead, we internally run-length encode everything.

If you have these two operations:

    Insert "h" at position 5, ID (seph, 2)
    Insert "i" at position 6, ID (seph, 3)
... They get immediately merged internally to become:

    Insert "hi" at position 5-6, ID (seph, 2-3)
This reduces the number of items we need to process by an order of magnitude. In practice, the metadata (positions, IDs, etc) ends up taking up about 0.1 bytes on disk per inserted character, which is awesome.

This works great for (agent, seq) style IDs because the IDs can also be run-length encoded. But we can't run-length encode hashes. And I don't want my file size to grow by 32x because we store a hash after every keystroke.

(I didn't invent this idea. Martin Kleppmann first used it for storing CRDTs on disk, and Kevin Jahns made in-memory structures use the same tricks in Yjs)
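
For the curious, the run-length trick in simplified form (Rust, with a hypothetical InsertRun type rather than diamond types' real internals): consecutive inserts by the same agent at adjacent positions collapse into one span.

    // A run of consecutive inserts: `content` inserted starting at `pos`,
    // with ids (agent, seq), (agent, seq+1), ...
    struct InsertRun {
        agent: String,
        seq: u64,
        pos: usize,
        content: String,
    }

    impl InsertRun {
        // Try to append `next` onto this run; returns false if it can't merge.
        fn try_append(&mut self, next: &InsertRun) -> bool {
            let len = self.content.chars().count() as u64;
            let mergeable = next.agent == self.agent
                && next.seq == self.seq + len            // sequential ids
                && next.pos == self.pos + len as usize;  // typed right after us
            if mergeable {
                self.content.push_str(&next.content);
            }
            mergeable
        }
    }

With a content hash per op instead of (agent, seq), the ids of adjacent ops are unrelated and can't be collapsed like this, which is the size blow-up described above.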


I was responding to the diamond-types CRDT author's question in the parent comment. Their github project page [1] mentions text editing a lot:

> This repository contains a high performance rust CRDT for text editing. This is a special data type which supports concurrent editing of lists or strings (text documents) by multiple users in a P2P network without needing a centralized server.

> This version of diamond types only supports plain text editing. Work is underway to add support for other JSON-style data types. See the more_types branch for details.

In any case, I agree with the metaphor and batching granular operations can always be done.

[1]: https://github.com/josephg/diamond-types


Problem 2 can be solved by "cryptographic erasure" techniques.

1. Encrypt the data with an ephemeral key.

2. Upon delete, just wipe the key. The data becomes unrecoverable.

I'm not sure about problem 1. A fast seekable stream cipher like ChaCha might help here (don't use e.g. RC4 for this), but a faster hash like BLAKE3 might be more appropriate. I'm happy to noodle over this some more. These are just my unfiltered thoughts.

If you combined the two: Each entry in the graph could have a public salt used for key derivation for the actual data. This salt would not be part of the DAG, but stored outside of it. To delete an entry, zero-out the salt and the underlying key becomes unrecoverable, which makes the data unreadable, without needing to change the history of the DAG.


Yes. But diamond types stores a new entry in the DAG for every keystroke each user makes. The current scheme of IDs being (agent, sequence number) allows subsequent operations by the same user to be trivially run-length encoded. Eg if I make 1000 changes in order, we can encode that as (“seph”,1..1000). It takes much less than 1 byte of overhead on disk per character typed.

Salts per character + cryptographic erasure would solve problem 2. But how do we do it without a massive increase in storage, memory use and network bandwidth? I don’t want to generate and store a cryptographically unique salt per keystroke. (Even though it might run fine on modern hardware).


I apologize for commenting on this with almost no context about what the actual data structure looks like, but would it be possible to put the random salts "higher up in the tree"? Then whenever you want to delete some small leaf element, you actually delete a whole salted subtree and replace it with a new subtree, mostly the same but with a new salt and missing the deleted element? This might make edits more expensive, in exchange for reducing the storage overhead of the salts?


You can store a lot per typed character and never come close to the storage requirements of pictures and video.

Writing a novel is a couple million keystrokes over a year, so even storing a kilobyte per keystroke is trivial.


I'm not sure. I don't know if it's even possible, but until someone proves it isn't, I'm probably going to keep thinking about it.


I was re-reading your 'CRDTs-go-brr' blog post just today! Recently prototyping a major collaborative feature based on Automerge, and am discovering some of its performance limits on our largest data samples (25+ MB of JSON). Very very interested in JSON-like CRDTs and any progress in that area. It's a different hard problem to rich text editing, but has similarly large potential.


Do you really need to store the hashes?

I imagine in an application like this there would be a "settled" set for which all agents agree, and a "pending" set which is agreed by some but not all agents. The hash could be used solely for the pending set to get agreement on order of operations, but after that it could simply be dropped altogether. The sequence number could even be generated on the receiving agent's side.

Also, even for the "pending" set you wouldn't even need to store the hash in your primary data structure. You could simply add a dictionary from (agent id, seq) to hash, right?

The second issue seems reasonably easy to solve with this: include a salt in the hash. You're only storing the "head" hash of the "settled" set, so when the salts aren't stored there is no way to reconstruct this hash from the settled set itself.
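
A sketch of that idea (Rust, sha2 + rand crates; the layout is illustrative, not any existing CRDT's format): the id still commits to the op, but once the salt is gone the deleted character can't be brute-forced back out of the hash.

    use rand::RngCore;
    use sha2::{Digest, Sha256};

    // id = SHA-256(salt || op bytes). Keep the salt *outside* the hash DAG;
    // overwriting it with zeros later leaves the id (and the DAG built on it)
    // intact while making the op's content infeasible to recover by guessing.
    fn salted_id(op_bytes: &[u8]) -> ([u8; 32], [u8; 16]) {
        let mut salt = [0u8; 16];
        rand::thread_rng().fill_bytes(&mut salt);

        let mut h = Sha256::new();
        h.update(salt);
        h.update(op_bytes);
        (h.finalize().into(), salt)
    }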


I think I am missing something obvious, but can't you just take your existing DAG and add a merge validation step to solve this?

If I'm understanding the problem right, a malicious user uses the same source id to send action a to user 1 and action b to user 2. This starts two independent (and currently indistinguishable) branches in your DAG. When those branches come into contact at, say, user 3, weird bugs start arising.

However, user 3 is receiving a DAG of all the operations along with the operations. Can't it use that to validate the DAG chain? If it finds a conflict it could simply create a branch at the conflict point by modifying the IDs at the point of conflict. Say by appending a hash of just that operation to the id?


it’s definitely not an exact solution, but i’ve been thinking about ways prolly trees could fit into CRDT systems. they essentially give you efficient peer-to-peer entry-wise diffs, so could be good for identifying conflicts after-the-fact. but if you store operations in a prolly tree, you end up hashing them all (plus logarithmic additional space overhead), so it might be a non-starter.

i have a blog post to shill if you haven’t heard of them: https://joelgustafson.com/posts/2023-05-04/merklizing-the-ke...


1. Can you merge inserts of characters into inserts of strings reliably, given both the original inserts and the current document history? E.g. if the closed set of history only inserts in the middle of the subset of inserts that you want to turn into a string, those inserts can be permanently merged into a string insert.

2. String merging should also allow null merges, e.g. identify the same closed set of history inserts that refer only to the inside of the original inserts, so that permanently deleting the history of characters in the document could be replaced with a null insert that can still be referred to by other inserts/merges.


Would (1) hit problems because you may store them coalesced eventually, but you sent them to the other clients as individual events, where they may get interleaved with events from that client?

I think it might mess with fine-grained undo, also?

(idk much about CRDTs)

"Reject re-used sequence numbers" might be simpler?


I know you said you didn't like the idea of hashes, so this might be a non-starter for your use case.

But it sounds like a V5 GUID [0] might meet some of the needs for your "operation IDs". The existing "random agent id" and "sequence number" could be used as part of the namespace / salt.

[0] https://www.sohamkamani.com/uuid-versions-explained/#v5-non-...


I don’t have a problem with hashes generally. Correct me if I’m wrong, but this scheme sounds like hashing with extra steps?

I agree this could be done. But diamond types (my library) considers every keystroke to be a new change. If we store a UUID for every keystroke, that’s very inefficient.


Sounds like inverted hash chains. Generate a hash chain h_i = sha(h_i-1) for a random h_0 up to h_N. Each message m_i becomes (agent_id, state, hmac(h_(N-i+1), state), h_(N-i)). Upon receipt of message m_(i+1) the receiver can verify the hmac of the preceding state and that the key is the preimage of the earlier message. By repeated hashing any gaps in the sequence can be fast-forwarded through and still verified. This provides for deletion without a loss of verifiability up to length N.
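
In code, the chain part looks roughly like this (Rust with the sha2 crate; simplified to plain hashes rather than the full HMAC construction described above):

    use sha2::{Digest, Sha256};

    // Build h_0..h_N with h_i = SHA-256(h_{i-1}). Keys are then revealed from
    // the end of the chain, so a later key proves knowledge of earlier ones.
    fn hash_chain(h0: [u8; 32], n: usize) -> Vec<[u8; 32]> {
        let mut chain = Vec::with_capacity(n + 1);
        chain.push(h0);
        for i in 0..n {
            chain.push(Sha256::digest(chain[i]).into());
        }
        chain
    }

    // Check that `key` sits `gap` hashing steps before `anchor` in the chain,
    // which is how a receiver fast-forwards over deleted or missing messages.
    fn verify_gap(mut key: [u8; 32], anchor: [u8; 32], gap: usize) -> bool {
        for _ in 0..gap {
            key = Sha256::digest(key).into();
        }
        key == anchor
    }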


I was thinking of something like this but where you only send the whole shebang like you mention every now and then, I'll call that a "sync point".

Between the sync points, each message only contains part of the hmac, proportional to the state size, and not h_(N-i). Ie if sending a single keystroke then a single byte from the hmac, if sending a few bytes then two bytes of hmac etc.

If the sequence is continuous you can still verify as you can regenerate h_(N-i) from the previous sync point. A deletion can then still be done by issuing a redact operation which deletes everything from the start of the secret up to the next sync point, and adding an operation which inserts back the data between the end of the secret to the next sync point.

My thought was that while finding a collision for a single keystroke message might not be hard, doing it for two messages in a row should be quite difficult. However it would "just" roughly double the size of the messages on average, rather than add dozens of bytes to each.

I was also thinking that h_0 should be based on data that includes something that's not up to the individual actor, to make preimage attacks harder.

Anyway, way past bedtime and not my area at all so this might all be drivel, but interesting problem to ponder...


How familiar are you with RON? I don’t think it supports “forgetting” operations but it does a good job of eliding sequential op IDs and hashes. https://replicated.cc/


Troublesome user? As in a malicious user? Are you trying to lock down your attack surface against script kiddies or is there another instance where this situation comes up?


> or is there another instance where this situation comes up?

Yes, if you are a Byzantine general trying to coordinate an attack, and you aren't sure of your peers' true allegiance.


Yes, a malicious user. I’ve also seen buggy software try to reuse IDs - which causes similar mayhem. It would be better if we could remove this entire class of problems.


It could mean that, as in "someone who causes trouble" or it could be a result of a bug which only affects a handful of users who suddenly "have some troubles".


Use the hash of the hash instead of the hash? The hash is not secret so never needs to be deleted.

Are merkle trees relevant here?


1. Bloom for existence, re-create via brute-force search (?), checkpoint hash every $N seconds(?)

2. Godbolt => "This rendered text came from these portions of the DAG/tree, press CTRL-Del to irrecoverably remove them from the DAG and/or replace them with 'REDACTED'".

#1 is "simply" related to efficiency. You can lean on probabilistic data structures to see "yeah, probably this was HERE in the history tree", and then perhaps replay the "N" seconds checkpoints with the original input data to validate or match exactly where the hash is/was.

#2 is the same problem GIT has, where you need to rewrite / filter / "detached head" in order to obliterate something from history, and is somewhat a local/remote UI issue. "A request to obliterate history is being built in parallel over here => click here to adopt it, and diff/transfer only your new work to it"

Related to #2 I've had a thought of a social network protocol, and how to "delete malicious nudes" or "oops, I accidentally published my SSN and Tax Filing *.pdf to my friends". On the "real" open internet, google or malicious actors would/could store that info forever. However, in a semi-closed group (eg: 100-1000 "friends of friends") that are "all" running participating/conforming clients, a "request-to-delete" or "request-to-disavow" is particularly valuable.

My imagined protocol is: "post001.txt" => "$GOOD_SHA", "nudes.jpg" => "$WHOOPS_SHA", "taxes.pdf" => "$YIKES_SHA", "bait.txt" => "$BAIT_SHA".

"Of course" everything is content-addressable, ie: anyone can fetch "post001.txt" from anyone as "$GOOD_SHA", however, "it would be nice" if I as a participant in the network could query "how_many( $GOOD_SHA );" and see distribution across the network participants. "disavow( $WHOOPS_SHA, $YIKES_SHA ) && how_many( $WHOOPS_SHA, $YIKES_SHA )" => ie: in a "good" network, after a "disavow(...)" then nodes should not respond with that availability. Occasionally throw in "$BAIT_SHA w/ disavow(...)" and make sure that no one is propagating it.

Similar to "key revocation lists", you could eliminate/reject any clients which do not respect requests of non-propagation for $BAIT_SHA or anything that is "disavow(...)'d". Same story with anyone hosting https://en.wikipedia.org/wiki/Celebrity_sex_tape ... I don't really want them in my network, so query for particular content, and if they respond, I can nudge/boot/disengage from users that are hosting them (same with eg: "I love my AK-47 and assault rifle" political memes, disengage / dislike / spread apart people sharing/hosting/propagating content that I don't want).

While you're still left with malicious / non-conforming nodes potentially archiving everything for all eternity, there is a relatively immediate ability to fracture / exclude (shun) nodes that are detected as non-conforming, so "your stuff" is still out there, it's just relatively inaccessible, and you have to be "careful" to have your client be up-to-date w.r.t. revocation actions otherwise you'll be excluded from networks.

Meh. Kind of complicated, but those seem to be in the range of characteristics that you're talking about as well.


Thanks for mentioning FireSync, we haven't even officially launched yet! After 10+ years of building real-time collaborative apps based on Operational Transformation (ShareLaTeX -> Overleaf.com) we have become quietly very excited about CRDTs and Yjs. Now we are focused on building a scalable Yjs-compliant backend on top of Postgres with FireSync.

If anyone has thoughts about this space, feature requests, would like a preview of what we are building, or anything else, please do reach out directly to me at henry@firesync.live, I'm talking to as many people as possible at the moment.


I am interested in a concept close to a CRDT - but I think meaningfully different.

Is there a 'data structure' or 'scheme/setup' which fits the below description?

Have a repo (with a specific git commit hash) which is specifically constructed to reproducibly construct some data (database, dataset, set of example files, etc), which can then be checked by a checksum, such that when a different person pulls and executes that code, a decision can be made to either run the code again, or to pull the data from the decentralized group of people that already have it.

This is effectively "code as data" which is checked by a key-value pairing: [git commit hash <==> data checksum hash]. So if you have a process that runs for 24 days on a good machine and produces a 30 MB file for people to use, they can check for reproducibility if they want with the code, or just grab the file from everyone who pulled the repo to save time. Obviously having defaults for download or rerun could be determined by certain logic like execution time etc, but the base idea is something I've looked into for a while.

Is there a name for this type of setup?


I'm very unsure that I understand your requirements, but if I squint really really hard it looks like you're asking for a blockchain.


Hmmm... Perhaps that could be one component, but I don't know if proof of work in the typical way would be required? As in 'guessing' a hash (I don't know much about that area so I could be wrong here) - I don't think that would be needed. But perhaps the proof of work could simply be 'calculate the output that you want to make anyway' - such as the weights of a neural network, a read-only database to do complex queries on, an operating system compilation with specific configuration, a dataset of estimates of some physical science data using a particular algorithm, etc.

So instead of using CPU cycles to do finance/currency things (or whatever is done in that arena), this would calculate whatever data you actually want directly.

In some cases it could be a ton of calculation (estimating electron density functionals, or training NN weights) to receive a small file (DFT functionals are quite small). Or it could be some calculation for a large amount of data (ETL for big web-scraped data).

One important aspect which is similar to the currency/Blockchain realm is that the commits would specify the data state, so the data would likely be mostly read-only for people, but if they change it, it would simply branch off to a different hash pair.


Is this what Apple uses in Notes app for its recently added collaborative features? I read it here that the technology behind it is, in fact, quite robust, in order to provide the best possible conflict prevention and resolution.


In compilers, there's been a recent uptick in industry research and adoption of equivalence classes and graphs (egraphs) for doing optimization passes in a way that preserves information and solves the phase ordering problem. Egg[0] was one of the big libraries that came out of it, but it can only handle term rewriting without guards, so it can't express y/x*x -> y (which is unsound when x=0), or optimization passes that work across control flow points where some of the eclasses are only valid in some locations. The Cranelift developers adapted it into a construction they call aegraphs[1] that handles this, and are migrating Cranelift to use these passes for all optimizations, which is (hopefully) faster and achieves more optimization opportunities; Chris Fallin is presenting their aegraph work at PLDI this year.

0: https://egraphs-good.github.io/

1: https://github.com/cfallin/rfcs/blob/main/accepted/cranelift...
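
For anyone who hasn't played with egraphs, egg's README-style hello world gives the flavour: rewrites are applied non-destructively into equivalence classes, and the cheapest equivalent term is extracted at the end. (Exact API details may have drifted between versions, so treat this as a sketch.)

    use egg::{rewrite as rw, AstSize, Extractor, RecExpr, Rewrite, Runner, SymbolLang};

    fn main() {
        // Unguarded rewrites only: note there is no way to attach an "x != 0"
        // side condition here, which is the kind of thing guards/aegraphs add.
        let rules: &[Rewrite<SymbolLang, ()>] = &[
            rw!("commute-mul"; "(* ?a ?b)" => "(* ?b ?a)"),
            rw!("mul-1"; "(* ?a 1)" => "?a"),
        ];

        let start: RecExpr<SymbolLang> = "(* (* x 1) y)".parse().unwrap();
        let runner = Runner::default().with_expr(&start).run(rules);

        let extractor = Extractor::new(&runner.egraph, AstSize);
        let (cost, best) = extractor.find_best(runner.roots[0]);
        println!("best: {} (cost {})", best, cost); // prints the smallest equivalent term, e.g. (* x y)
    }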


The other recent research that's been interesting to watch has been in logic programming (Prolog and Datalog), where there have been a lot of papers extending Datalog with semilattice types that are more efficient for computing dataflow-type queries (and even getting lattice types added to Souffle recently), including papers like datafun[0] extending it to sound incremental queries for more efficient execution - and there's even a paper by the Egg authors using it for efficient eclass-based Datalog queries using lattices[1]. It also seems like there has been more work recently in actually integrating logic programming into existing languages and programs, like ascent[2], instead of it being off in its own world like it seems has historically been the case.

0: http://www.rntz.net/files/seminaive-datafun.pdf

1: https://www.mwillsey.com/papers/egglog

2: https://s-arash.github.io/ascent/cc22main-p95-seamless-deduc...


How does Datalog with e-graphs compare to Datalog with BDDs in terms of what analyses can be performed? bddbddb is a Datalog engine using BDDs, that, according to their 2005 paper[0], can solve problems like context-sensitive pointer analysis for large programs and analyses implemented with it are faster than hand-tuned implementations in dramatically fewer lines of code. I'm not knowledgeable on what modern Datalog engines are based on.

0: https://people.csail.mit.edu/mcarbin/papers/aplas05.pdf


I'm under the impression that BDDs are good for more efficient representation of row membership for values, so that you can query and saturate more facts faster, but eclasses allow for writing rules that "automatically" have transitive connectivity of two rules. If you have X -> Y and Y -> Z, you only need to store one of X or Y :- Z since X and Y have been unioned together and may be treated identically, along with a same(X, Z) rule firing instead of having to have additional transitive same(X, same(Y, Z)) rules. I don't profess to be an expert in BDDs, eqsat, or Datalog, though! I have one friend that did a thesis on eqsat who wasn't that big of a fan of bddbddb but I don't remember exactly why.


Yeah, it's really unfortunate that these amazing advances in datalog don't make it into general purpose languages.


You might be interested in Flix which has first-class Datalog program values:

https://flix.dev/

https://doc.flix.dev/fixpoints.html

(I am one of the developers of Flix)


The documentation could use some proofreading. Grammar etc.

> Flix is strongly typed. Any attempt to use predicate symbol with terms of the wrong type (or with the wrong arity) is caught by the type checker.

*Statically typed


The feature list is impressive


I was about to post this. If only the vscode plugin worked I would use Flix.


What's the issue? Bug reports are most welcome.


The latest version seems to have solved the issue. At least the compiler plugin runs with an empty file now.


Mainstream languages will lag behind others forever. The average programmer in Java or C# neither cares for nor understands 'new' features or concepts. Most people I know in this area have never even heard about prolog, lisp, smalltalk, etc.


The recent work on relational, datalog-inspired egraphs in PLDI this year ("Unifying Datalog and Equality Saturation") is actually interesting because it can solve cases like the y/x*x -> y identity example, by the power of an interval analysis on x (among other things). Sort of like adding a postulate to a rewrite term and then proving it. But instead it's by adding relations between terms in the graph. See section 6.2 of the paper.

https://github.com/egraphs-good/egglog

https://arxiv.org/pdf/2304.04332.pdf


thanks, this is cool!


Nobody has mentioned Approximate Nearest Neighbor search (aka vector search), which, IMO, is a fundamental data structures advancement.

Basically: given an indexed set of a million (billion, trillion...) ~1000-dimensional vectors and a query vector of the same dimension, find the closest vector in the set. This is "nearest neighbor" search, and there have been increasingly sophisticated approaches:

http://ann-benchmarks.com

https://haystackconf.com/us2022/talk-6/

And has spawned companies like Weaviate, Pinecone, etc etc (a half a dozen others)
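
For anyone new to the area, the exact baseline that all of these structures try to beat is just a linear scan, O(n*d) per query; ANN indexes give up a little recall to avoid touching most of the data. A minimal sketch in Rust:

    // Exact nearest neighbour by brute force: scan every vector, keep the
    // smallest squared Euclidean distance. Fine for thousands of vectors,
    // hopeless for billions - hence HNSW, IVF, DiskANN and friends.
    fn nearest(query: &[f32], data: &[Vec<f32>]) -> Option<usize> {
        data.iter()
            .enumerate()
            .map(|(i, v)| {
                let d2: f32 = query.iter().zip(v).map(|(a, b)| (a - b) * (a - b)).sum();
                (i, d2)
            })
            .min_by(|a, b| a.1.total_cmp(&b.1))
            .map(|(i, _)| i)
    }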


Most commonly used ANN algorithms today are clever optimizations atop "old" data structures. DiskANN is a recent favorite: https://milvus.io/blog/2021-09-24-diskann.md


"Most commonly used X today are clever optimizations atop old Y" is pretty much the story of technology, isn't it?


It's lego all the way down to the clock cycle.


What sort of dataset are you indexing that has trillion entries?

That's 100,000 english wikipedias or 50 googles.


That's about the number of pixels in a 2-hour movie at 4k. Not too common yet, but once it's possible we're going to want to feed this amount of data into neural networks.


Using an approximate nearest neighbors search for this is just worse though.

You can retrieve the exact information even faster. From an (x,y,t) coordinate, you can find the exact index in constant time


There's about 14B trades per year on the NYSE which i'm sure could represent 10x that in entities (buyer, seller, broker, etc) and could easily hit 1000x that in log lines. The shares per day is in the billions, so hitting 1T if each share is represented uniquely.


You don't typically use vector search for trade data though. It's already ridiculously well structured. Assets have identifiers, parties and counterparties have IDs, etc. I'm not sure what nearest neighbors in a vector space would add.


Nonetheless it’s an example you asked for of a dataset with over a trillion entries.


I asked which dataset they were indexing that was of this size, not whether any such dataset exists in other domains.


I believe I hear the sound of a True Scotsman knocking at the door.


My exact words are still up there

> What sort of dataset are you indexing that has trillion entries?

It doesn't say

> What sort of dataset has a trillion entries?


Dumb example, but still an example from the practical world. Your body (assuming you are a human) has trillions of cells. Each cell is way more complicated than what a 1000-dimensional vector can represent, but maybe such a vector could be a loose estimate of some properties of each cell. Now the algorithm could be about finding the most similar cell. Could be useful for finding e.g. other cancer cells based on properties of one cancer cell.

Not that this is a practical example, because we technically cannot index all cells in each body. But such an algorithm being studied today might be useful one day when we do have the capability to collect such data.


Where are you going to store this data? It's dozens of petabytes.


It's only a few racks worth of disk servers.

If I was building it from my 5 minutes of googling, using 15TB NVMe U.2 drives and easily available server chassis, I can get 24 drives per 2U of a rack. That's 360 TB + a couple server nodes. So ~6U per PB. A full height rack is 42U, so 6-7PB per rack once you take up some of the space with networking, etc. So dozens is doable in a short datacenter row.

Realistically you could fit a lot more storage per U, depending on how much compute you need per unit of data. The example above assumes all the disks are at the front of the server only, if you mount them internally also, you can fit a lot more. (see Backblaze's storage pods for how they did it with spinning disks).

Dozens of PB is not that much data in 2023.


This is still like tens of thousands of dollars of equipment to store information about a single person's biology.


Probably an order of magnitude or two more. Still something that is feasible in a research context - early MRI and genome sequencing had similar "too much data" problems like this, but the researchers still built it out to learn stuff. Tech marched forward and these days no one really blinks about it. I presume that if such an "all the cells scanner" was invented today, it would only be used for research for a long time - and that by the time it became widespread, data storage will have caught up.


> Dozens of PB is not that much data in 2023.

Yes it is. Just transferring it at data center speeds will take days if not weeks.


A truck has more bandwidth than network adapters technically


Should theoretical research of data structures and algorithms have been capped at 1GB in 1980 because that was the biggest single hard drive available back then and you couldn’t store for example a 2GB dataset on a disk?


Not at all, I'll still call out fantastic claims when I see them though.


Google has definitely indexed over a trillion pages.


Do you have any sources for this claim?

As far as I am aware Google doesn't publish any statistics about the size of its index, which no doubt varies.



Well what do you know, they contradict the claim made above.


Sorry, they've crawled trillions of pages, and narrowed it down to an index of 100s of billions. Conveniently, the link answers your question of "can you have PB sized indices?" to which we can clearly say, yes.


Where do you think computers store data?


index every photo taken in the past ten years (1.8 trillion per year)

for each photo, extract and index SIFT vectors

https://en.m.wikipedia.org/wiki/Scale-invariant_feature_tran...


Is sift still used?


i’m curious also.

i used it as an example of a way a dataset might have a huge number of feature vectors.

curious if there are better ways to do this now.


Perhaps one example would be 100 billion websites, but with a set of vectors for each website (chunked bert encoding, sum chunked glove vecs, whatever). Then you could have something like 3-100ish vectors for each website, which would be a few trillion vecs.

I'm not in the web scraping/search area, so idk about the 100B website thing (other than it's Google order of size), but the encoding would take some mega amount of time depending on how it's done, hence the suggested sum of GloVe chunks (potentially doable with decent hardware in months) rather than throwing an LLM at it (which would take literal centuries to process).


One application are LLMs. I've seen a project use Pinecone to enable "infinite context" for GPT-3.


This is still several orders of magnitude more items than the entire training corpus for all GPT models combined. I guess if you were to index individual codepoints in the training corpus, we'd start to see those volumes.


You don't index the training data, but other data. It gives LLMs the medium-term memory that they're missing.

Think of it like an accountant. Weights are all of their experience. Prompt is the form in front of them. A vector database makes it easier to find the appropriate tax law and have that open (in the prompt) as well.

This is useful for people as well, like literally this example. But the LLM + vector combination is looking really powerful because of the tight loops.


For context stuffing LLMs with small token limits (aka retrieval augmentation), you will need to break up each article into few-sentence chunks. You can get to a large number of these chunks very rapidly.


Theoretical advances are constant (just check the literature), but two particular advances in the last decade are of practical importance: one for sorting and one for hash tables. As in: check the implementation you are using; if it doesn't incorporate these advances you are leaving substantial performance (2x is not out of the question) on the table.

How Branch Mispredictions don't affect Quicksort (2016) https://arxiv.org/abs/1604.06697

Quicksort is still the algorithm of choice for general purpose internal unstable sort. On modern hardware, quicksort spends a lot of time recovering from branch misprediction, because quicksort's comparison branches are inherently unpredictable. The problem is severe to the extent that with branchful quicksort, an uneven split with more recursion is faster than an even split, because it makes branches more predictable! The exact penalty is microarchitecture specific, but one report found a 20:80 split to be optimal.

So... you should use branchless quicksort! The technique was pioneered by an implementation called BlockQuicksort, but pdqsort (for pattern-defeating quicksort) https://arxiv.org/abs/2106.05123 is also popular. pdqsort incorporates branchless technique and also includes other enhancements.
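
The core branchless trick is tiny: instead of an if per element, the comparison result is used as an integer to advance a write cursor, so there is nothing for the branch predictor to get wrong. A minimal sketch of BlockQuicksort-style offset generation in Rust (illustrative, not pdqsort's actual code):

    const BLOCK: usize = 128;

    // Record, without a data-dependent branch, the offsets within `block` of
    // every element that is >= pivot. A full BlockQuicksort fills one such
    // buffer from the left side and one from the right, then swaps the pairs.
    fn collect_offsets(block: &[i32], pivot: i32, offsets: &mut [u8; BLOCK]) -> usize {
        debug_assert!(block.len() <= BLOCK);
        let mut count = 0;
        for (i, &x) in block.iter().enumerate() {
            offsets[count] = i as u8;       // speculative, unconditional write
            count += (x >= pivot) as usize; // bool -> 0/1, no branch
        }
        count
    }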

Designing a Fast, Efficient, Cache-friendly Hash Table, Step by Step (2017) https://abseil.io/blog/20180927-swisstables

Hash tables are the workhorse of data structures, and while improvements are possible for specific usage patterns, open addressing with linear probing is still the fastest implementation for general purpose use. The so-called "Swiss table" combines a cache-friendly metadata layout with SIMD-enhanced probing. These two optimizations work well with each other, and the combination of the two is particularly effective.
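
The metadata trick in scalar form, for flavour (real implementations compare 16 control bytes at once with SSE2; the constants follow Abseil's layout as I understand it - low 7 bits of the hash as the tag, 0x80 for an empty slot - so treat those details as assumptions):

    const GROUP: usize = 16;
    const EMPTY: u8 = 0x80; // control byte with the high bit set marks an empty slot

    // Probe one group of a Swiss-table-style map. Only slots whose 1-byte tag
    // matches the hash ever touch the (cache-unfriendly) key array, and an
    // empty control byte ends the probe sequence early.
    fn probe_group(ctrl: &[u8], keys: &[u64], group: usize, hash: u64, key: u64) -> Option<usize> {
        let tag = (hash & 0x7f) as u8;
        for i in 0..GROUP {
            let slot = group * GROUP + i;
            if ctrl[slot] == tag && keys[slot] == key {
                return Some(slot);
            }
            if ctrl[slot] == EMPTY {
                return None;
            }
        }
        None // no match, no empty slot: the caller moves on to the next group
    }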


> Theoretical advances are constant (just check the literature)

* for some arbitrary definition of “constant”

Some people throw the word “constant” around quite freely.

Scientific and mathematical advances are notoriously hard to predict. “Constant” is the last word I would use.

One simple model would be spaced in time according to a memoryless stochastic process while having “impact” scores sampled from an exponential distribution.

Of course, one could include more realisticish dynamics to the model, such as network effects and clumping (e.g. from funding priorities and “ideas in the air”.)

I feel like some areas of computer science tend to correlate with deep statistical appreciation more than others.

Speaking of the topic of stochastic behavior in general, it feels to me like (a) randomized algorithms and (b) better statistical simulations have steadily been gaining traction in algorithms and data structures.


I recall years ago someone suggesting that a 1:2 split was faster, and I'm wondering if they were seeing the same thing and either didn't follow it through to its conclusion, or the goalposts keep moving with each generation of processors.


It is plausible that they were seeing the same thing.


For hit-heavy workloads, I have been unable to get Swiss tables to compete with tables that hit only ~1 cache line per query.


>pdqsort (for pattern-defeating quicksort)

I'm sad that it's not "pretty darn quicksort"


It probably is that, just not officially. You know, officially, BFR stands for Big Falcon Rocket.


And the God(damn) Particle also comes to mind.


CS 168: The Modern Algorithmic Toolbox, Spring 2023

https://web.stanford.edu/class/cs168/index.html

I didn't have time to go through all of this, but the parts that I did read were very interesting to me.

It's not even cutting edge new stuff of 202x, probably most of them are new-ish results from ~2000 onwards.


Thanks! That looks great, for those of us who did our algorithm courses too long ago!



Refinement types are pretty interesting and I've seen mention of them recently. I'm curious if there are any practical reasons we don't see them implemented in more languages.


> Refinement types are pretty interesting

Indeed. To me, they look like a formalized subset of precondition contracts that can be checked statically. I don't think the idea is new, I seem to recall Ada and Eiffel having something like this, but I first learned about them via Racket[1]. I still don't know how they work or what's needed to implement them.

Interestingly, Raku has a similar concept called `subset`, eg.

    subset OneToTen of Int where * > 0 & * < 10;
but from what I know no static analysis is done on the refinement/"where" part, it's a runtime assertion only.

[1] https://blog.racket-lang.org/2017/11/adding-refinement-types...


subset OneToTen of Int where 1..10;

Why make it more complicated? :-)

The secret is that given values are smartmatched against the given object. And the values 1 through 10 will return True if smartmatched against the 1..10 Range object. The `of Int' ensures that only integer values are allowed, otherwise 3.14 would also match. Unless of course, that is what you intended. In which case you can omit the "of Int".


Raku seems very cool - absolutely stuffed with gems like this to learn.


> I'm curious if there are any practical reasons we don't see them implemented in more languages.

I believe it's because they're not exactly easy to implement and the resulting extensive type checking might also affect compiler performance.

By the way, another great example of refinement types (in JavaScript) is this one: https://kaleidawave.github.io/posts/introducing-ezno/


I'm working on a paper to show that you can 'simulate' a heap in linear time.

What I mean is: in the comparison model, assume we start with an empty heap and are given a sequence of instructions of length n, like "insert element x" and "remove minimum element".

There's an algorithm that can tell you in O(n) time, which items are still left in the heap at the end (and thus also which items have been removed at some point in time).

The algorithm can't tell you exactly _when_ items have been removed, because that would be equivalent to sorting, and you need at least O(n log n) to sort in the comparison model.

I've worked on this problem on and off for about a decade now as a little hobby project. I would have been happy to prove it impossible or to find a solution. Interestingly, it is possible and the solution is relatively simple, too.

(I have no clue whether this is useful for anything. In practice you probably just run the obvious O(n log n) algorithm which probably has much better constant factors. I'm doing this purely for the theory.)
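
For concreteness, the obvious O(n log n) baseline just replays the instruction sequence against a real heap; a minimal sketch in Rust, assuming integer keys:

    use std::cmp::Reverse;
    use std::collections::BinaryHeap;

    enum Op {
        Insert(u64),
        RemoveMin,
    }

    // Replay the instructions with an actual min-heap and report what is left
    // at the end. Each push/pop costs O(log n), hence O(n log n) overall.
    fn surviving_elements(ops: &[Op]) -> Vec<u64> {
        let mut heap = BinaryHeap::new();
        for op in ops {
            match op {
                Op::Insert(x) => heap.push(Reverse(*x)),
                Op::RemoveMin => {
                    let _ = heap.pop(); // pops the minimum thanks to Reverse
                }
            }
        }
        heap.into_iter().map(|Reverse(x)| x).collect()
    }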

The data structure / algorithm I found is expressible in a purely functional setting, and (spoiler alert) builds on top of Chazelle's Soft Heap.


Are you at all worried that by posting an open problem on HN and claiming you've found a simple solution, someone might scoop you to the paper?


I'm not a professional academic.

If I find someone else who's interested enough in my pet problem to go to the hassle of scooping me, I'd be elated.


Going from idea to paper is a non-trivial effort. Lots of iterations need to be done.


Doesn't that imply a linear "n largest" algorithm? I guess that might already exist, but I wasn't aware of that!


Yes, that does imply a linear 'n largest algorithm'. Specifically also linear median finding.

As the sibling comment points out, that's not news. But it's very good for you to pick up that implication from first principles alone. It's part of figuring out how to solve this.


Quickselect and non-randomized variations should get you that; no?


Yes, exactly.

But this reduction from 'select n largest' to my problem is still instructive: it tells us that in some sense any (deterministic) solution to my pet problem has to be at least as complicated as (deterministic) median finding.


The Cuckoo Trie, an ordered index data structure explicitly designed to have memory-level parallelism, which out-of-order processors can exploit to execute DRAM accesses in parallel:

https://arxiv.org/pdf/2201.09331.pdf


There's a lot of new results in cryptography/distributed algorithms, motivated in part by applications to blockchains. But even if you don't believe in blockchains they are interesting.

  - better zero knowledge proofs/arguments
  - better consensus protocols (my expertise!)
  - multiparty computation is nearing practicality, as is fully homomorphic encryption
Also, lots of fascinating theoretical progress on primitives such as indistinguishability obfuscation, post-quantum schemes, quantum crypto, etc.


> better consensus protocols (my expertise!)

As it is your domain of expertise, can you give more details on the new/improved ones please?


There are improvements in both performance and understandability.

On the performance front, see Algorand's consensus protocol (2016 ish), which can run on 10,000s of nodes, using a cryptographic technique called "sortition" which sub-selects a small committee of nodes to actually broadcast messages of the consensus protocol.

On the understandability front, this is my research area; we put out a new protocol recently (https://eprint.iacr.org/2023/463) that is faster and much easier than PBFT (and crash fault equivalents like Paxos/Raft). But there's a long line of prior work here.

Traditionally, consensus (PBFT,Paxos,Raft) has run on a small set of nodes (i.e. a dozen) and also has used what are called "stable leaders", where a single leader process sequences all transactions until it is detected to be faulty. Stable leaders are great for data centers.

Blockchains made it so that now protocols support thousands of nodes, and also use "rotating leaders", where a different node proposes each block (for fairness reasons). It happens that using rotating leaders allows for simpler protocols (in particular, with easier proofs).


Thank you, very interesting! I hadn't seen this concept of Dummy Blocks before.


Homomorphic encryption should be on that list too, I think.


Negative weight single source shortest paths in near-linear time: https://arxiv.org/abs/2203.03456

Obligatory Quanta link: https://www.quantamagazine.org/finally-a-fast-algorithm-for-...


Are there any implementations of this? I got started working on one in Rust, but got kinda stuck in a few places. This could be very useful for RTS AI I think, or anything where you need to optimize managing resources and build orders, if I understand negative-weight shortest paths correctly.


It's not particularly new, but recently I've been involved a lot with solving large combinatorial optimization problems

https://en.m.wikipedia.org/wiki/Combinatorial_optimization

Algos to solve those (both exact methods and heuristics) are super useful in real world applications, and very underappreciated by the HN crowd IMO.


Yes! Me too, there's a whole world out there and it's codified in bullet-proof programs like CPLEX or Gurobi. Sadly they are expensive, and it's not obvious whether a given problem's size and complexity will lend itself to some of their heuristics.

Have you been doing it by hand or using a big program?


Answer Set Programming is an incredibly powerful tool to declaratively solve combinatorial problems. Clingo is one of the best open source implementations in my opinion: https://github.com/potassco/clingo

  - Based on conflict driven nogood learning, it has incredible performance (including good multithreading)
  - It has different optimization and enumeration modes
  - It has strong built-in heuristics, but also allows manual heuristics
  - It allows incremental solving
  - It has a nice API to integrate other solving paradigms (like difference logic and constraint programming, for which implementations already exist)
  - There is a huge ecosystem around it
And much more, go check it out!
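
To give a flavour of what that looks like in practice, here's a tiny graph-colouring toy driven through clingo's Python API (a sketch; the exact API calls may differ slightly between clingo versions, and the program/graph here is made up):

  import clingo  # pip install clingo

  PROGRAM = """
  node(1..4). edge(1,2). edge(2,3). edge(3,4). edge(1,4).
  color(r;g;b).
  1 { assign(N,C) : color(C) } 1 :- node(N).    % each node gets exactly one colour
  :- edge(X,Y), assign(X,C), assign(Y,C).       % adjacent nodes must differ
  """

  ctl = clingo.Control(["0"])        # "0" = enumerate all answer sets
  ctl.add("base", [], PROGRAM)
  ctl.ground([("base", [])])
  ctl.solve(on_model=lambda m: print(m))

The nice part is that the whole problem statement is those few lines of ASP; the solver does the rest.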


At $DayJob, we deal with OR problems for large companies and typically, when possible, we leverage Gurobi to solve Mixed Integer Linear Programming (MIP) formulations. However it's not unusual for the problems to be huge to the point even Gurobi can't solve them in a reasonable time. And of course, there's the case where the problem can't be formulated (or approximated) as a linear program at all.

So, in a couple projects I was involved we had to resort to manually constructed heuristics such as Simulated Annealing (SA) and Large Neighborhood Search (LNS).

A very cool large-scale approach we resorted to in some problems (e.g. airline scheduling) was Column Generation [1]. Not only did it solve in minutes what took hours in the MIP implementation, but unlike SA and LNS it's an exact method that provides valid bounds on global optimality.

[1]: https://en.wikipedia.org/wiki/Column_generation


I like the Column Generation idea. Not a big fan of SA and LNS as you often can't really tell the relative quality of the solutions. I've often pushed for a different, simpler model form when the larger model isn't solvable in the usual ways. Running out of memory in CPLEX or Gurobi is a sign you need to reformulate more than anything else. Of course, that requires having people capable of formulating and evaluating math programming formulations. Turns out there aren't so many of those. Surprised me at first, but I guess I shouldn't be surprised. The real trick in optimization is coming up with the right formulation.


> Algos to solve those (both exact methods and heuristics) are super useful in real world applications, and very underappreciated by the HN crowd IMO.

If you are in the HN crowd, the 'opposite' direction is useful:

Know that effective generic solvers exist for these problems (but don't worry about how they do it), and re-formulate many of the trickier problems you encounter in the languages of these solvers, instead of eg writing your own bespoke algorithm to assign tents at the company picnic.
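
As a concrete (made-up) example of what "just hand it to a solver" looks like, the picnic-tent assignment is a few lines of CP-SAT via Google's OR-Tools; the people, tents and capacity below are invented inputs, and the real constraints would be whatever your picnic needs:

  from ortools.sat.python import cp_model  # pip install ortools

  people = ["alice", "bob", "carol", "dave"]
  tents = ["red", "blue"]
  capacity = 2

  model = cp_model.CpModel()
  x = {(p, t): model.NewBoolVar(f"{p}_in_{t}") for p in people for t in tents}

  for p in people:   # everyone gets exactly one tent
      model.Add(sum(x[p, t] for t in tents) == 1)
  for t in tents:    # respect tent capacity
      model.Add(sum(x[p, t] for p in people) <= capacity)
  model.Add(x["alice", "red"] + x["bob", "red"] <= 1)  # e.g. keep Alice and Bob apart

  solver = cp_model.CpSolver()
  if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
      for p in people:
          print(p, "->", [t for t in tents if solver.Value(x[p, t])][0])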


> Algos to solve those (both exact methods and heuristics) are super useful in real world applications, and very underappreciated by the HN crowd IMO.

There just aren't many situations where they're useful. Time scheduling, logistics, operations, there are only a handful of places where these problems are so important that they need to be solved optimally or nearly optimally.


IMO, the main benefit of those exact algos is that they provide optimality bounds, allowing you to make an informed decision to continue, or to stop and accept a non-optimal solution. This is valuable in many practical situations.

>Time scheduling, logistics, operations...

Add marketing (pricing, assortment, ...) and finance (portfolio, capital budgeting, ...) to those.

And those are huge domains, moving trillions of USD globally each year. Many companies (think most of the Global 2000) benefit from a couple percent improvement in their operations, which leads to many $$ in savings.


Even just arranging UI elements around a page could be done with these solvers, instead of coding up heuristics to satisfy your constraints by hand.


I don't think this is true. It's just that there aren't many optimization practitioners and the OR community has done a terrible job promoting itself. The value in scheduling and logistics is well established. So optimizers tend to congregate. But optimization could be applied to lots of other domains successfully. ML is just minimizing a loss function. I see simulation and post-hoc analyses being applied in numerous situations where optimization would be more efficient and produce a guaranteed optimal result. RL on simple, static problems is really an MDP but I've seen organizations fail to realize this. Etc.


How often do these come up? I studied some OR stuff in grad school when I was evaluating solvers and brushed up against the OR tools community for some research ideas and from what I could tell the businesses that really needed it already use these tools and the other businesses just don't have enough business dependent on optimization to be worthwhile.


Are there some interesting examples of real world applications of this stuff? Is it like planning problems or supply chain optimization etc.?

I've used SAT solvers recreationally to try to solve some combinatorics-ish mathematical problems, but interesting, more down-to-earth applications have eluded me. I would love to dig deeper into this stuff.


I used a MILP solver to optimize my ship loadouts in Highfleet. It's rugged-looking, but works great. https://hfopt.jodavaho.io


Learning optimal decision trees and related ML models using SAT, MILP, and other mathematical optimization methods.

Some recent literature reviews:

https://link.springer.com/article/10.1007/s11750-021-00594-1

https://www.ijcai.org/proceedings/2021/608


Gurobi (maker of a MIP solver) has a bunch of industry case studies as marketing materials on their website:

https://www.gurobi.com/case_studies/

However, it's mixed-integer programming rather than SAT solving/constraint programming, though similar in spirit.


What makes you think they’re underappreciated? I too have been spending a lot of time in this space recently and agree it’s one of the more fertile “maths meets reality” spaces :)


Any advice on how to get to work on this for a living?


I think it depends a lot on the domain you are in. An example of some domains with lots of discrete opt problems: transport, scheduling, logistics. And then read stuff by Stephen Boyd :)


I managed to smuggle applications of those solvers into a few jobs.

E.g. at Goldman Sachs I used mixed integer linear programming to reshuffle the assignments of particular departments' computing needs to specific data centres. Before I got my hands on this, people used to just do it manually and hadn't even realized this was something where you could optimize systematically and squeeze out some efficiency.

If you have an eye for it, you can find similar opportunities in many areas of many companies. Assignment / scheduling is actually extremely common, if you can recognise it. It's also relatively easy to show an improvement that you can measure in dollars, which always goes over well with the business folks.


Unfortunately for you my favorite recent algorithms result uses AI as part of improving a very traditional algorithm. The paper is "Learned Index Structures" https://blog.acolyer.org/2018/01/08/the-case-for-learned-ind...
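
The core idea is small enough to sketch: fit a model from key to position in the sorted array, record the worst prediction error at build time, and then only binary-search a window of that size around the prediction. A toy Python version with a plain least-squares line (nothing like the paper's recursive model index, just the shape of the idea):

  import bisect

  class ToyLearnedIndex:
      # toy "learned index": linear model key -> position, plus a max-error bound
      def __init__(self, sorted_keys):
          self.keys = sorted_keys
          n = len(sorted_keys)
          mean_k = sum(sorted_keys) / n
          mean_p = (n - 1) / 2
          cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(sorted_keys))
          var = sum((k - mean_k) ** 2 for k in sorted_keys) or 1.0
          self.a = cov / var
          self.b = mean_p - self.a * mean_k
          # worst-case prediction error becomes the search window
          self.eps = max(abs(i - self._predict(k)) for i, k in enumerate(sorted_keys))

      def _predict(self, key):
          return min(max(int(self.a * key + self.b), 0), len(self.keys) - 1)

      def lookup(self, key):
          p = self._predict(key)
          lo, hi = max(0, p - self.eps), min(len(self.keys), p + self.eps + 1)
          i = bisect.bisect_left(self.keys, key, lo, hi)
          return i if i < len(self.keys) and self.keys[i] == key else None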


This is a wonderful paper. In a similar vein, "Algorithms with Predictions" (https://arxiv.org/pdf/2006.09123.pdf) presents a bunch of fun algorithmic problems related to bringing AI into traditional algorithms.


Parsing, but also efficient conversion to string, is still an open problem with regular breakthroughs. Ryu or Dragonbox for float-to-string come to mind.

Parsing with errors that are useful for the human is definitely an open research topic.

Capabilities have seen interesting work, in particular in relation to effect handlers.

HyperLogLog and other data sketches keep seeing incremental progress. I have not had time to read it all, but Otmar Ertl has achieved a massive reduction in memory use with UltraLogLog; I expect a paper on it soon. The code is already up.
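
For anyone who hasn't played with these: the whole HLL family boils down to "hash each item, and per bucket remember the longest run of leading zeros seen". A bare-bones sketch (no bias or small-range corrections, so treat the estimate as rough):

  import hashlib

  class TinyHLL:
      def __init__(self, p=12):                  # 2**p registers
          self.p = p
          self.m = 1 << p
          self.reg = [0] * self.m

      def add(self, item):
          h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
          idx = h >> (64 - self.p)                # first p bits pick the register
          rest = h & ((1 << (64 - self.p)) - 1)   # remaining bits give the "rank"
          rank = (64 - self.p) - rest.bit_length() + 1
          self.reg[idx] = max(self.reg[idx], rank)

      def estimate(self):
          alpha = 0.7213 / (1 + 1.079 / self.m)   # standard HLL constant for large m
          return alpha * self.m * self.m / sum(2.0 ** -r for r in self.reg)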


I've been poking at prometheus internals lately and I am starting to wonder if histograms need some sort of hyperloglog treatment to improve on buckets. They don't seem like a very accurate solution.


You may appreciate ddsketch https://www.datadoghq.com/blog/engineering/computing-accurat...

Or t-digest. For the Prometheus use case, I usually prefer ddsketch. Easier to integrate. T-digest is really complicated.

That said, there is definitely interesting work that needs to happen on compressing time-series histograms for heatmaps. IIRC Influx had some interesting work there.
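
The ddsketch idea itself is pleasingly small: pick a relative-accuracy target alpha, set gamma = (1+alpha)/(1-alpha), and put each value into bucket ceil(log_gamma(value)); any value you report back from a bucket is then within alpha (relative) of the truth. A rough sketch of that core (positive values only, none of the real library's bucket-limiting logic):

  import math
  from collections import Counter

  class TinyDDSketch:
      def __init__(self, alpha=0.01):
          self.gamma = (1 + alpha) / (1 - alpha)
          self.counts = Counter()
          self.n = 0

      def add(self, value):                      # value must be > 0
          self.counts[math.ceil(math.log(value, self.gamma))] += 1
          self.n += 1

      def quantile(self, q):                     # 0 <= q <= 1
          rank = q * (self.n - 1)
          seen = 0
          for bucket in sorted(self.counts):
              seen += self.counts[bucket]
              if seen > rank:
                  # representative value with relative error <= alpha
                  return 2 * self.gamma ** bucket / (1 + self.gamma)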


Have you seen the Prometheus “native”/“sparse” histograms that are being worked on: https://promcon.io/2022-munich/talks/native-histograms-in-pr...

They’ve released it as experimental recently, and we’ve found it to be a massive improvement over the original histograms. It’s still bucket-based at the end of the day, but the exponential bucket layout and sparse representation seems to work really well in practice.


How about some succinct data structures and delta encoding for modern databases [1]. Succinct data structures are a family of data structures which are close in size to the information theoretic minimum representation (while still being queryable).

[1] https://github.com/terminusdb/terminusdb/blob/dev/docs/white...
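
To give a flavour of what "succinct" buys you: a bitvector plus a small amount of precomputed counts supports rank (number of 1s before position i) in O(1), which is the workhorse underneath wavelet trees, FM-indexes and structures like the ones in that whitepaper. A simplified one-level sketch (real implementations use two levels of blocks and a popcount instruction where this scans):

  class RankBitvector:
      BLOCK = 64

      def __init__(self, bits):                  # bits: sequence of 0/1 ints
          self.bits = list(bits)
          self.block_rank = [0]                  # 1s before each block boundary
          for i in range(self.BLOCK, len(self.bits) + 1, self.BLOCK):
              self.block_rank.append(self.block_rank[-1] +
                                     sum(self.bits[i - self.BLOCK:i]))

      def rank1(self, i):
          # number of 1s in bits[0:i]
          blk = i // self.BLOCK
          return self.block_rank[blk] + sum(self.bits[blk * self.BLOCK:i])

  bv = RankBitvector([1, 0, 1, 1] * 100)
  assert bv.rank1(250) == sum(([1, 0, 1, 1] * 100)[:250])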


This is something I planned (2015) on sharing at some point but then years flew by and here we are .. :}

It is a cacheline-sized 'container' (CLC) of machine-word length records, with one record used to store the record order and the remaining bits for metadata. So you can project different kinds of container semantics, such as FIFO or LRU (any ordered-set semantics), onto this meta record. Using arrays of CLCs you can create e.g. a segmented LRU, where the overhead per item is substantially less than in a conventional pointer-based data structure, and which is naturally suited to concurrent operations (for example by assigning ranges to distinct worker threads); ops only need to touch one or two cache lines. The LRU (or whatever) semantics in aggregate will be probabilistic, as the LRU order is deterministic per unit container only. It is very fast :)

https://github.com/alphazero/libclc/blob/master/include/libc...

https://github.com/alphazero/libclc/blob/master/include/libc...

As for the non-deterministic aspects: since container semantics (e.g. LRU order) are only applied at unit level, the overall cache is ~LRU. We can strictly quantify the 'ordering error' by observing the age of items in FIFO mode as they are removed: for a deterministic container the age of an item equals the total capacity of the queue, while for a segmented array composed of FIFO queues the age will have an effectively Gaussian distribution around the capacity (number of units x unit capacity). But since these containers can use as few as 9 bits per entry vs 24 or more bytes for pointer-based solutions (which use linked lists), the capacity of the array of CLCs will be much greater for the same allocation of memory, so the 'short-lived' items in the distribution tail will actually live longer than items in a strict queue using the same exact memory. Additional techniques, such as n-array hashing and low-order 'clock' bits at container level, can tighten this distribution significantly (i.e. ~LRU -> LRU) via better loading factors.
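
Here's a toy Python model of just the packed-order-word trick, to make it concrete for other readers (this is not the libclc layout; the sizes and encoding are made up: 8 slots, 4 bits per slot index, LRU maintained by move-to-front on the packed word):

  class ClcBucket:
      # toy: 8 cache slots whose LRU order lives packed in one 32-bit field
      SLOTS = 8

      def __init__(self):
          self.order = 0x01234567     # nibble 0 = MRU slot index ... nibble 7 = LRU
          self.keys = [None] * self.SLOTS
          self.vals = [None] * self.SLOTS

      def _nibble(self, pos):
          return (self.order >> (4 * pos)) & 0xF

      def _touch(self, slot):
          # move-to-front on the packed order word: a few shifts/masks, no pointers
          pos = next(p for p in range(self.SLOTS) if self._nibble(p) == slot)
          lower = self.order & ((1 << (4 * pos)) - 1)
          upper = self.order >> (4 * (pos + 1))
          self.order = (((upper << (4 * pos)) | lower) << 4) | slot

      def get(self, key):
          for slot, k in enumerate(self.keys):
              if k == key:
                  self._touch(slot)
                  return self.vals[slot]
          return None

      def put(self, key, value):
          # toy: no existing-key check; always overwrite the current LRU slot
          slot = self._nibble(self.SLOTS - 1)
          self.keys[slot], self.vals[slot] = key, value
          self._touch(slot)

An array of such buckets, indexed by hash, then gives the ~LRU behaviour described above.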


For numerical algorithms, probabilistic numerics is an interesting research corner. It both provides a new perspective on existing algorithms, and provides a framework for tailoring new ones to specific problems.

https://www.probabilistic-numerics.org/


A couple of days ago I stumbled upon Prolly trees:

https://docs.dolthub.com/architecture/storage-engine/prolly-...

In a nutshell: B+ trees that are stable w.r.t. insertion order. Great if you want to test them for equivalence/deduplicate them efficiently.
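
The trick that makes them independent of insertion order is content-defined chunking: node boundaries are decided by hashing the entries themselves, so the same set of keys always splits the same way regardless of how the tree was built. A rough sketch of the boundary rule (the real thing applies this recursively to build internal levels too):

  import hashlib

  def chunk_boundaries(sorted_keys, target_size=4):
      # split a sorted key list into chunks whose boundaries depend only on content
      chunks, current = [], []
      threshold = (1 << 32) // target_size     # ~1/target_size of keys end a chunk
      for key in sorted_keys:
          current.append(key)
          h = int.from_bytes(hashlib.sha256(str(key).encode()).digest()[:4], "big")
          if h < threshold:                    # this key closes a chunk
              chunks.append(current)
              current = []
      if current:
          chunks.append(current)
      return chunks

Because the boundaries depend only on the keys, inserting one key only perturbs the chunk it lands in (and occasionally a neighbour), so two trees holding the same data end up structurally identical and can be compared or deduplicated by hash.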



I'm not in the field but it seems that SODA (ACM/SIAM Symposium on Discrete Algorithms) is a top conference in DSA and might be a good place to check. Here's the abstract book for 2022: [1].

[1] https://www.siam.org/Portals/0/Conferences/About_SIAM_Confer...


There is ongoing research into purely functional data structures, and many implementations remain to be written for the various functional languages that could make use of PFDS. Whenever one sees a traditional data structure, one can ask what a persistent, functional counterpart might look like, or whether an FP language would pursue a different solution entirely. Porting traditional data structures is often a matter of transcribing from an example language into a target language, with various language-specific changes for optimization. Finding a good way to implement things without mutation or side effects is a challenge.
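
The recurring pattern is path copying: an update rebuilds only the path from the root to the changed node and shares everything else. A minimal Python sketch of a persistent (unbalanced) BST insert, just to show the shape:

  from dataclasses import dataclass
  from typing import Optional

  @dataclass(frozen=True)
  class Node:
      key: int
      left: Optional["Node"] = None
      right: Optional["Node"] = None

  def insert(node: Optional[Node], key: int) -> Node:
      # returns a new tree containing key; the old tree is untouched and shared
      if node is None:
          return Node(key)
      if key < node.key:
          return Node(node.key, insert(node.left, key), node.right)
      if key > node.key:
          return Node(node.key, node.left, insert(node.right, key))
      return node

  t1 = insert(insert(insert(None, 5), 3), 8)
  t2 = insert(t1, 4)                 # t1 is still a valid, unchanged tree
  assert t2.right is t1.right        # untouched subtrees are shared, not copied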


What are the advancements since Clojure brought them to a mainstream language? Have more data structures got functional counterparts, or are there better optimizations for "solved" ones like lists?


For example, Chazelle's Soft Heap got a purely functional counterpart in 2013.


For Maximum Flow (and Minimum Cut) through a flow network we are getting close to linear time:

2022 Paper https://arxiv.org/abs/2203.00671


I'd suggest looking up succinct data structures, cache oblivious data structures, and probabilistic data structures.


Anything specific that covers what "NDArray-like" structures are? Concretely, for making row-oriented/column-oriented hybrids?

(I wish to find a way to encode the equivalent of `struct Data{col1:Vec<..>, ...}`)


Succinct data structures ftw.


I'm interested in knowing about graph layout algorithms that allow specific constraints. If anyone knows about these things please let me know. I do not, but I have a use case for them.

For example, the constraints may be:

1) The graph is read left to right. This informs the layout such that those spiral and circular graph layout algorithms that blow out the graph do not work.

2) Try to keep certain types of nodes vertically aligned, so that they lie on the same horizontal line.

3) Minimize edge crosses.

4) Have the edges be not lines but paths made up of horizontal and vertical lines. This probably requires some path finding as well.

5) Be able to live update in the sense that a new node or edge entry into the graph does not drastically re-layout the graph for reasonable inserts.


Some time ago I found an interesting PDF about graph layout algorithms in general [0]. Nothing groundbreaking, but it also covers approaches for orthogonal layouts (aka only horizontal/vertical edge paths like in your fourth condition) and I liked how straightforward it is.

To meet all these criteria would indeed be amazing. For 5) you may get semi-reliable results by doing the layout deterministically and optimizing for things like short edge length, so adding new nodes would just push the rest around a little bit. But good luck avoiding crossings at the same time.

[0] https://cs.brown.edu/people/rtamassi/papers/gd-tutorial/gd-c...


Toward 5, I've been able to get ELK.js to do somewhat stable things for simple graphs using `FIXED_ORDER` https://www.eclipse.org/elk/reference/options/org-eclipse-el...


Look at graphviz. Their "dot" program satisfies 1 through 4. Your fifth wish is hard to satisfy.


Hi, Alejo Hausner!

We worked on this problem, too, and Gordon Woodhull created a working system for dynamic hierarchical layout. I think he also incorporated an improved node positioning and edge routing algorithm of Ulrik Brandes. The code is still available here, I think: https://www.dynagraph.org

It was great work. We meant well, but were overextended for the resources we had. Gordon is a brilliant programmer.


I have looked at Graphviz before, but wasn't aware of anything there that helped with these constraints. Do you happen to have any specific pointers?

And it may be the case that (5) is actually the most important constraint. Haha.


You could probably satisfy 5 with some kind of (mixed integer) linear optimization or so.


Would you mind elaborating? Thanks ahead of time.


Well, you formulate all your constraints in the language of your favourite constraint based programming solver (eg convex optimisation or mixed integer linear optimisation etc). For simplicity, let's call your candidate solution x (where x is a vector of suitable size) and the constraints are formulated as some function f from your vector to real numbers, so that the problem reduces to "minimize f(x)".

Then it's relatively easy to formulate your (5) as an additional constraint in the same language:

Given a previous solution x_0, find a new solution x_1 to slightly different constraints f_1 by minimizing a*f_1(x_1) + b*distance(x_1 - x_0).

You'd need to decide on the weights a and b and on the distance function. (Instead of combining the distance from the old solution and the constraints linearly, you could also use other ways to combine them. E.g. you could say:

Keep f_1(x_1) <= epsilon and minimize distance(x_1 - x_0). Or you could minimize the sum of squares:

a*f_1(x_1)^2 + b*distance(x_1 - x_0)^2

Specifics depends on what works well for your problem, and what your solver can handle.

See https://developers.google.com/optimization for a free library of solvers.
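
A toy version of that stability idea using scipy (everything here is made up for illustration: x is just a flat vector of node coordinates, layout_cost stands in for whatever layout badness you define, and a/b are the weights mentioned above):

  import numpy as np
  from scipy.optimize import minimize

  def layout_cost(x):
      # stand-in objective: prefer consecutive nodes to sit ~1 unit apart
      pts = x.reshape(-1, 2)
      return float(np.sum((np.linalg.norm(pts[1:] - pts[:-1], axis=1) - 1.0) ** 2))

  def relayout(x_old, a=1.0, b=5.0):
      # minimize a*f(x) + b*||x - x_old||^2 so the new layout stays near the old one
      objective = lambda x: a * layout_cost(x) + b * float(np.sum((x - x_old) ** 2))
      return minimize(objective, x0=x_old, method="L-BFGS-B").x

  x_prev = np.random.default_rng(0).normal(size=10)   # 5 nodes, (x, y) each
  x_next = relayout(x_prev)            # nudged toward feasibility, not re-laid-out from scratch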


IMO the hip new thing in algorithms and data structures is to Just Use Arrays For Everything


Interesting and related here: there are a bunch of algorithms from days of yore whose performance trade-offs made sense back then, but these days, when the CPU absolutely doesn't care whether it's comparing a single byte or 32 bytes at once, this can shift some algorithms towards much simpler ones. E.g. for substring searches this page is rather interesting: http://0x80.pl/articles/simd-strfind.html which posits that some of the older algorithms were built on the assumption that comparing parts of memory is more expensive than a table lookup, which in many cases is no longer true.


When skirt lengths change and we observe no measurable change in human anatomy, that's fashion and to follow fashion is "hip".

When the computing platform changes and you have to start paying attention to L-level caches and counting your bytes (so you can do you big compute in memory), and use n cores, and then arrays are adopted, that is not fashion; so not "hip", rather modern.


Isn't everything since the Renaissance considered modern?


touche. Exactly so, just like how every political system going forward will be "democratic". /g


Surely it’s the interesting things built on top of the arrays?


Are you talking about Vector DB's?


Also ECS / columnar data?


Got a source?


This is a lecture series, not new research, but one of the main takeaways is if you want good performance on modern machines, you had better be using arrays: https://youtube.com/playlist?list=PLHxtyCq_WDLXFAEA-lYoRNQIe...


I'm not OP, but I'm guessing he could be referring to something like this talk? https://www.youtube.com/watch?v=LrVi9LHP8Bk

I'm honestly not sure otherwise.


I think OP is talking about ML and matrix (array) multiplication to learn the right weights in a neural network.


Not sure it's still considered new, as it has been out for a couple of years, but the adaptive radix tree and its variants are pretty promising.


E-graphs are pretty awesome, and worth keeping in your back pocket. They're like union-find structures, except they also maintain congruence relations (i.e. if `x` and `y` are in the same set, then `f(x)` and `f(y)` must likewise be in the same set).

https://egraphs-good.github.io/

(Incidentally, union-find structures are also great to know about. But they're not exactly "new".)
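
If union-find is new to anyone reading along, the whole structure fits in a few lines; the e-graph machinery is the layer on top that re-merges f(x) and f(y) whenever x and y get merged:

  class UnionFind:
      def __init__(self):
          self.parent = {}

      def find(self, x):
          self.parent.setdefault(x, x)
          while self.parent[x] != x:
              self.parent[x] = self.parent[self.parent[x]]   # path halving
              x = self.parent[x]
          return x

      def union(self, x, y):
          rx, ry = self.find(x), self.find(y)
          if rx != ry:
              self.parent[rx] = ry

  uf = UnionFind()
  uf.union("a", "b"); uf.union("b", "c")
  assert uf.find("a") == uf.find("c")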


There has been a bit of work within the last 5 years on creating more algorithms for temporal networks (think graphs that change over time, like social networks, neural networks (with evolving architectures) or economic chain networks). These are essentially streaming algorithms over temporal edge lists. I worked on a research project that visualized these networks in real time. Very fun stuff.


Can you please share references? I'm interested in "temporal" topics.


Self-adjusting code is underutilized IMO.

Hyperdimensional computing combines LLM-style AI with Lisp-style AI, making it possible to code within a latent space. Pretty weird, but cool.

Electric Clojure is pretty groundbreaking, possibly similar to React's impact once it gets ported to a more popular platform.

I think there are real opportunties going forward to build dedicated compilers and the corresponding runtimes for distributed computing and related algorithms and data structures. Really anything that is extremely error prone, difficult, or complex would benefit from building a runtime and compiler that simply solves the problem. MLIR has good infrastructure for that.

Interprocedural data flow analysis is now scaling well in the LLVM/MLIR context: DFI: An Interprocedural Value-Flow Analysis Framework that Scales to Large Codebases

https://arxiv.org/abs/2209.02638


The field of "algorithms with predictions" studies how to use predictions/learning within traditional CS problem settings:

> We introduce algorithms that use predictions from machine learning applied to the input to circumvent worst-case analysis. We aim for algorithms that have near optimal performance when these predictions are good, but recover the prediction-less worst case behavior when the predictions have large errors.

An example is using learning to decide which variable to discretize next in a branch-and-bound integer linear programming solver. The algorithm might be extremely similar to the classic version, but the analysis is new.

https://arxiv.org/abs/2006.09123

More broadly, I think there's a lot of work on applying more refined analysis to classic algorithms - online settings, smoothed analysis, etc.


(Mostly) branchless binary search, ~twice as fast as std::lower_bound:

https://probablydance.com/2023/04/27/beautiful-branchless-bi...
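
The shape of it translates into a few lines; in Python the branchlessness obviously buys you nothing, but the structure is visible: a fixed "probe at base + half" loop where the comparison result is folded into arithmetic rather than an if/else (this is a sketch of the general idea, not a port of the post's exact code):

  def lower_bound(arr, target):
      # first index i with arr[i] >= target (arr sorted ascending)
      base, size = 0, len(arr)
      while size > 1:
          half = size // 2
          # in C++ this multiply-by-bool becomes a conditional move, not a branch
          base += (arr[base + half - 1] < target) * half
          size -= half
      if arr and arr[base] < target:
          base += 1
      return base

  assert lower_bound([1, 3, 3, 7, 9], 3) == 1
  assert lower_bound([1, 3, 3, 7, 9], 10) == 5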


Yes. There is ongoing work in making succinct full text indexes that use substantially less space than the text they model. See work on the r-index: https://doi.org/10.1089/cmb.2019.0309


Merkle trees and state tries were not in my formal education

But I had to learn them and employ them a lot lately; very good for resource-constrained environments.
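
For anyone who hasn't run into them: a Merkle tree just hashes leaves pairwise up to a single root, so two parties can compare roots (or a log-sized path of hashes) instead of shipping the whole dataset. A bare-bones sketch:

  import hashlib

  def h(data: bytes) -> bytes:
      return hashlib.sha256(data).digest()

  def merkle_root(leaves):
      # root hash of a list of byte strings (odd levels duplicate the last node)
      level = [h(leaf) for leaf in leaves] or [h(b"")]
      while len(level) > 1:
          if len(level) % 2:
              level.append(level[-1])
          level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
      return level[0]

  print(merkle_root([b"block-1", b"block-2", b"block-3"]).hex())

A mismatch between two roots can then be narrowed down to the differing leaf by walking down the tree, which is what makes them so handy under bandwidth constraints.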


Not sure there's much new with Merkle trees.

But Verkle trees are relatively new and fascinating: https://vitalik.ca/general/2021/06/18/verkle.html


This isn't anything most people would care about, but I do some GIS work and if you enjoy the process of taking the Very Obvious To a Human and explaining how to do ... whatever ... to a handful of sand and tamed lightning, well, there's just so many problems to solve.

Just as an example, when you look at land parcels, you often get these irritating slivers (or splinters, even more apt because they are a pain and you have to dig them out) of land. You and I can look at a plot and point one out, but to a computer ...

Well, one of the hallmarks of a splinter is that there's an awful lot of fencing but very little land. Or put another way, a lot of perimeter to area enclosed. Simple division doesn't suffice here, because the end result ought to be a dimensionless number so that scale doesn't interfere. Alright, perimeter squared divided by area. The lowest you can get is a little over twelve, for a circle. A square gets you sixteen. But a splinter might give you a value well over a hundred.
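
(In code that metric is just the shoelace area plus summed edge lengths; a quick sketch with coordinates as a list of (x, y) points:)

  import math

  def splinter_score(points):
      # perimeter^2 / area for a simple polygon given as [(x, y), ...]
      n = len(points)
      area2, perim = 0.0, 0.0          # area2 = twice the signed (shoelace) area
      for i in range(n):
          x1, y1 = points[i]
          x2, y2 = points[(i + 1) % n]
          area2 += x1 * y2 - x2 * y1
          perim += math.hypot(x2 - x1, y2 - y1)
      return perim ** 2 / abs(area2 / 2.0)

  print(splinter_score([(0, 0), (100, 0), (100, 2), (0, 2)]))   # long sliver: ~208
  print(splinter_score([(0, 0), (10, 0), (10, 10), (0, 10)]))   # square: 16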

Now, this sounds nice. You've got some values that are no gos (anything under twelve indicates an error), a reasonable bottom limit, a range where a human ought to take a look at it, and then a third range where you just say, "Yeah, that's a splinter alright, no need to bother a human." Sounds good. But then ... Wyoming.

Scale does end up mattering. A parcel that was two feet across and a hundred long has the same "sliver value" as one that is two hundred feet across and ten thousand feet long. Yet the latter is still quite useful. You can run a road through it and tack on some strip malls. So it isn't useless. Okay, so you say "that's great, we'll just set a lower bound on the shortest side, anything bigger than that and you'll have to reconsider."

... but we've assumed that we have handy little rectangles and that we know what the sides are. It's GIS. We have a bag of points tracing the outline of a polygon. It isn't necessarily just four points. Each side might have a bunch of points that are roughly co-linear. You and I can look at it and say "Yeah, that's a rectangle alright," but the computer cannot. And so here I am trying to develop an algorithm to determine how many sides a human would see versus just this list of points. All so I can better define my splinters.

Problems like this abound and they're quite interesting to think about. It's one of my favorite parts of programming and it requires some deep introspection, some meta-cognition: how do I recognize this, anyway? What is going on in my mind that I can do this?


I would recommend this book https://en.algorithmica.org/.

My opinion on things is that cache-oblivious programming is a dead end, and the constant factors by which cache-aware algorithms win are significant and worth taking.


Counterintuitively, “Cache-oblivious” algorithms typically refer to algorithms that are cache-friendly even for an unknown cache size, or for multiple levels of caches of varying (unknown) sizes. A good example is the memory hierarchy, from registers to L1–L3 processor cache to RAM to disk. Each have their own “block” size for memory transfers, and a cache-oblivious algorithm is designed to target all of those block sizes simultaneously.

https://en.m.wikipedia.org/wiki/Cache-oblivious_algorithm


Yes, that's correct, and specializing your code to the actual cache line and page sizes will generally yield significant speedups compared to this approach.


Algs/DS are still pretty fundamental to AI. Hyperparameter optimization algorithms like SHA/ASHA, stochastic gradient descent, PBT, etc. Structures like transformers, the various topologies, convolutions, etc. Not tickling the same itches?


I guess I will ask since I couldn’t find it looking, does anyone happen to have a link to the person that was creating atomic data structures by making them fit into a uint64_t? Or am I misremembering?


Cache oblivious algorithms and lock free data structures


Quantum. Will cause us to forget everything we know about classical data structures & algorithms.


there was a recent HN post related to this topic https://news.ycombinator.com/item?id=32186203


Searching.


Follow-up, anything in graph-theory?


BNW was in March of last year: https://arxiv.org/abs/2203.03456

Here is my summary of the algorithm from another HN post: https://news.ycombinator.com/item?id=34479148#34481937


I'm not exactly up to date, but the last few years have seen practical improvements to graph isomorphism algorithms.


In-memory computation using cheap analog chips.


First Love?! Haha, jk.

I'm currently re-studying them just bc I'm interview prepping. I wonder if whiteboarding for DS/Algo is gonna be a thing of the past too though.

I would imagine there are certainly new DS/algos in progress to aid ML training efficiency.



