Ask HN: What are some cool but obscure data structures you know about?
2051 points by Uptrenda on July 21, 2022 | 753 comments
I'm very interested in what kinds of data structures are out there, HN. Totally your preference.

I'll start: Bloom filters. They let you test whether a value is definitely NOT in a set of pre-stored values (or POSSIBLY in the set, with an adjustable false-positive probability that trades off against the storage size of the values.)

Good use-case: routing. Say you have a list of 1 million IPs that are blacklisted. A trivial algorithm would be to compare every element of the set with a given IP; the time complexity grows with the number of elements. Not so with a bloom filter! A bloom filter is one of the few data structures whose time complexity does not grow with the number of elements, because the 'keys' never need to be stored ('search' and 'insert' depend only on the number of hash functions.)
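
For intuition, here is a minimal TypeScript sketch of the idea (the class, names and the toy seeded hash are illustrative, not a production design):

    class BloomFilter {
      private bits: Uint8Array;

      constructor(private size: number, private hashes: number) {
        this.bits = new Uint8Array(size);
      }

      // FNV-1a-style hash with a seed; good enough for a demo, not production.
      private hash(value: string, seed: number): number {
        let h = 2166136261 ^ seed;
        for (let i = 0; i < value.length; i++) {
          h ^= value.charCodeAt(i);
          h = Math.imul(h, 16777619);
        }
        return (h >>> 0) % this.size;
      }

      add(value: string): void {
        for (let i = 0; i < this.hashes; i++) this.bits[this.hash(value, i)] = 1;
      }

      // false => definitely not present; true => possibly present
      mightContain(value: string): boolean {
        for (let i = 0; i < this.hashes; i++) {
          if (!this.bits[this.hash(value, i)]) return false;
        }
        return true;
      }
    }

    const blacklist = new BloomFilter(1 << 20, 4);
    blacklist.add("203.0.113.7");
    blacklist.mightContain("203.0.113.7");  // true: possibly in the set
    blacklist.mightContain("198.51.100.1"); // almost certainly false: definitely not in the set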

Bonus section: Golomb Coded Sets are similar to bloom filters but the storage space is much smaller. Worse performance though.




Not a very deep CS-y one, but still one of my favourite data structures: Promise Maps.

It only works in languages where promises/futures/tasks are first-class citizens, e.g. JavaScript.

When caching the result of an expensive computation or a network call, don't actually cache the result, but cache the promise that awaits the result. Ie don't make a

    Map<Key, Result>
but a

    Map<Key, Promise<Result>>
This way, if a new, uncached key gets requested twice in rapid succession, ie faster than the computation takes, you avoid computing/fetching the same value twice. This trick works because:

- promises hold on to their result value indefinitely (until they're GC'ed)

- you can await (or .then()) an existing promise as many times as you want

- awaiting an already-resolved promise is a very low-overhead operation.

In other words, the promise acts as a mutex around the computation, and the resulting code is understandable even by people unfamiliar with mutexes, locks and so on.
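
A minimal TypeScript sketch of the pattern (illustrative names, no eviction or error handling):

    type User = { name: string };

    // Stand-in for the expensive computation / network call.
    async function fetchUser(id: string): Promise<User> {
      return { name: `user-${id}` };
    }

    const cache = new Map<string, Promise<User>>();

    function getUser(id: string): Promise<User> {
      let p = cache.get(id);
      if (!p) {
        // Store the in-flight promise immediately, before it resolves, so a
        // second caller arriving moments later awaits the same computation.
        p = fetchUser(id);
        cache.set(id, p);
      }
      return p;
    }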


FYI, if you plan to do this in ASP.NET Core, combine it with AsyncLazy for the best results: https://github.com/davidfowl/AspNetCoreDiagnosticScenarios/b...


Neat page from David Fowler. I didn't know about that one. Thank you.

To further explain why you want to use an AsyncLazy (or Lazy for synchronous programming) when working with ConcurrentDictionary I find this article to be helpful: https://andrewlock.net/making-getoradd-on-concurrentdictiona...


Be careful with this data structure. If the language allows async exceptions or you have a bug where the promise never becomes determined, there are a lot of edge cases. Examples of edge cases:

- if the promise never becomes determined (eg bug, async exception) your app will wait forever

- if the promise has high tail latency things can be bad

- if the language eagerly binds on determined promises (ie it doesn’t schedule your .then function) you can get weird semantic bugs.

- changing the keys of the table incrementally instead of just using it for memoisation can be really messy


I have a couple pieces of code where we had to add rate limiting because it was just too easy to make too many async calls all at once, and things only got 'worse' as I fixed performance issues in the traversal code that were creating an ersatz backpressure situation.

Philosophically, the main problem is that promise caches are a primary enabler of the whole cache anti-pattern situation. People make the mistake of thinking that 'dynamic programming' means memoization and 'memoization' means caching. "Everything you said is wrong." Trying to explain this to people has been a recurring challenge, because it's such a common, shallow but strongly-held misunderstanding.

Youtube recently suggested this video to me, which does a pretty good job of explaining beginning and intermediate DP:

https://www.youtube.com/watch?v=oBt53YbR9Kk

What I love most about this video is that it introduces memoization about 10 minutes in, but by 20% of the way through it has already abandoned it for something better: tabulation. Tabulation is an iterative traversal of the problem that gathers the important data not only in one place but within a single stack frame. It telegraphs an information architecture, while recursive calls represent emergent behavior. The reliance on a cache to function not only obscures the information architecture, it does so by introducing global shared state. Global shared state and emergent behavior are both poison to the long-term health of a project, and here we have a single data structure that represents both in spades.

We are supposed to be fighting accidental coupling, and especially global variables, not engaging in word games to hide what we are doing.

So I think my answer to OP's question is 'tabulation', which is part data structure and part algorithm.


I wrote some Python code recently that uses a similar data structure (Futures instead of Promises, without knowing necessarily about the data structure's use in JavaScript). It wasn't really for caching.

I modeled mine on the Django ASGI reference library's server implementation, which uses the data structure for maintaining references to stateful event consumers. Exception handling is done with a pre-scheduled long-running coroutine that looks at the map.

I'm curious about your second point -- why exactly do things get bad with high tail latency? Is it only a weakness of the data structure when used for caching? I'm having trouble picturing that.


Suppose the call to evaluate has a p95 of 5 seconds (this is very large of course). If your first call to compute a value hits it, all the subsequent requests for that cell are blocked for 5s. If you didn’t try to cache, only one might block for 5s and the rest could go through fast. On the other hand if you do 20 requests then about one of them will get the p95 latency.


Best practice in my experience is to use a timeout on all async operations to handle edge cases 1 & 2 above. The third case isn't possible with JavaScript AFAIK.


- if the language eagerly binds on determined promises (ie it doesn’t schedule your .then function) you can get weird semantic bugs.

What would be an example of this? If the promise has been determined, why not just immediately run the .then function?


The reason not to do it is to limit the non-determinism of the order of execution.

In Javascript

    f1();
    p.then(f3);
    f2();
functions are always called in the relative order f1, f2, f3 and never f1, f3, f2 even if p had already resolved.


I know this as memoization with lazy evaluation, nothing new, but at the same time very useful, and I would argue, it is very CS.


not really quite lazy evaluation, though, at least in Javascript. Promises begin execution as soon as they are created and there is no way to delay that.


Promises are an implementation of lazy evaluation for Javascript. This is exactly lazy evaluation.

By the way, lazy evaluation is the one that offers no guarantees about when your code will be executed. If you can delay the execution, it's not lazy.


Wait. I thought lazy evaluation is defined as evaluation at the time a value is needed. After that I think it will then be in Weak Head Normal Form, which can be thought of as "evaluated at its top level"... but I'm a bit rusty. Basically, an expression gets used somewhere (e.g. pattern matched, in the ADT sense). If it's in WHNF, cool, it's evaluated already (subexpressions within may not yet, but that's their problem--that's exactly what WHNF says). If not, we evaluate it down to WHNF.

Wikipedia states it as: "When using delayed evaluation, an expression is not evaluated as soon as it gets bound to a variable, but when the evaluator is forced to produce the expression's value."

So if evaluation begins effectively at the time the promise is issued, not at the time the promise is awaited, then I would not call that lazy evaluation, and what hn_throwaway_99 said sounds correct to me.

Is my rusty old PL understanding wrong?


> Wait. I thought lazy evaluation is defined as evaluation at the time a value is needed.

Correct.

You've got it right, GP comment has it wrong.

Lazy does not simply mean "evaluated later". From what I understand, in JS, if you call an async function without `await`, it essentially gets added to a queue of functions to execute. Once the calling function stack completes, the async function will be executed.

A queue of functions to execute does not constitute "lazy evaluation" on its own.


Definition on Wikipedia says otherwise:

“In programming language theory, lazy evaluation, or call-by-need,[1] is an evaluation strategy which delays the evaluation of an expression until its value is needed (non-strict evaluation) and which also avoids repeated evaluations (sharing).”

Haskell would be an apt example here.


Promises in javascript are absolutely not considered 'an implementation of lazy evaluation.' Promises are invoked eagerly in javascript.


I've always called it request piggy backing (atop memoization). Our db abstraction at my last gig absolutely required it for scaling purposes. Great tool for the tool belt.


Can someone explain to me why the second example is better?

To me it seems to be the same thing. Replace result with int and I literally do not see a problem with the first one.

Also why is a mutex or lock needed for Result in javascript? As far as I know... In a single threaded application, mutexes and locks are only needed for memory operations on two or more results.

With a single value, say an int, within javascript you are completely safe in terms of concurrent access. Only 2 or more variables can cause a race condition. Or am I wrong here?

--edit:

Thanks for the replies. I see now. The mutex concept is not around memory access or memory safety. It's solely around the computation itself. When the Promise is "pending". It has nothing to do with safety. The parent did mention this, but I completely glossed over it.


You can put the promise into the cache immediately but you can only put the result from the promise into the cache once the promise resolves. So if an identical request comes in a second time before the promise has been resolved, then if you are caching the promise you have a cache hit but if you are caching the result then you have a cache miss and you end up doing the work twice.


I am still not understanding the purpose of this as I believe it is grounded on the wrong assumption.

Pretty much every single asynchronous operation other than some `Promise.resolve(foo)` where foo is a static value can fail. Reading from the file system, calling an api, connecting to some database, etc.

If the original promise fails you're gonna return a cached failure.

Mind you, I'm not saying this is completely useless, but at the end of the day you will be forced to add logic to check all of those asynchronous computation results, which will eventually outweigh the cost of only saving the resolved data.


> If the original promise fails you're gonna return a cached failure.

This is usually a good thing, and even in cases where it isn't, it's often a worthwhile trade-off.

In the common case a failure will either be persistent, or - if load/traffic related - will benefit from a reduction in requests (waiting a while to try again). In both of these cases, where your first key request fails, you want the immediately-following cases to fail fast: "caching" the promise caches the failure for the duration of one request (presumably the cache is emptied once the promise resolves, allowing subsequent key accesses to retry).

The less common case where the above isn't true is where you have very unstable key access (frequent one-off failures). In those cases you might want a cache miss on the second key request, but successful key retrieval usually isn't as critical in such systems which makes the trade off very worthwhile.


> If the original promise fails you're gonna return a cached failure.

In the scenario where that's an issue, you would need to add (extremely trivial 3-5 lines) logic to handle retrying a cached failure. The underlying data structure would continue to be a promise map.
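
For example, a sketch of that eviction-on-failure logic (names here are illustrative, not from the original comment):

    const cache = new Map<string, Promise<string>>();

    async function fetchValue(key: string): Promise<string> {
      return `value for ${key}`; // stand-in for the fallible async call
    }

    function getOrFetch(key: string): Promise<string> {
      let p = cache.get(key);
      if (!p) {
        p = fetchValue(key).catch((err) => {
          cache.delete(key); // evict the rejected promise so the next caller retries
          throw err;
        });
        cache.set(key, p);
      }
      return p;
    }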


> If the original promise fails you're gonna return a cached failure.

In many cases, another near-in-time request would also fail, so returning a cached failure rather than failing separately is probably a good idea (if you need retry logic, you do it within the promise, and you still need only a single instance.)

(If you are in a system with async and parallel computation both available, you can also use this for expensive to compute pure functions of the key.)


You do one IO operation instead of two.

It's unlikely that always doing two is going to be more successful than just trying once and dealing with failure


It's a tiny optimization.

When the VERY FIRST ASYNC operation is inflight the cache is immediately loaded with a Promise, which blocks all other calls while the FIRST async operation is in flight. This is only relevant to the very FIRST call. That's it. After that the promise is pointless.

As for the Promise failure you can just think of that as equivalent of the value not existing in the cache. The logic should interpret the promise this way.


It's not always a tiny optimization. If you have an expensive query or operation, this prevents potentially many duplicative calls.

A practical example of this was an analytics dashboard I was working on years ago -- the UI would trigger a few waves of requests as parts of it loaded (batching was used, but would not be used across the entire page). It was likely that for a given load, there would be four or more of these requests in-flight at once. Each request needed the result of an expensive (~3s+) query and computation. Promise caching allowed operations past the first to trivially reuse the cached promise.

There are certainly other approaches that can be taken, but this works very well as a mostly-drop-in replacement for a traditional cache that shouldn't cause much of a ripple to the codebase.


Yep, I've been bitten by exactly this failure mode.

You have to invalidate/bust the cache when a failure is returned (which is racy, but since cache busting is on the sad path it's a fine place to protect with a plain ol' mutex, assuming you even have real parallelism/preemption in the mix).

Alternatively you can cache promises to functions that will never return failure and will instead internally retry until they succeed. This approach generalizes less well to arbitrary promises, but is more friendly to implementing the custom back off/retry scheme of your choice.


It's not the cost of saving the resolved data.

If I understand the pattern correctly, it is to avoid multiple asynchronous requests to a resource that has yet to be cached.


Yeah, that's my understanding too.

It seems like an optimization to prevent lots of downstream requests that occur in rapid succession before the first request would have finished. I'd also suspect that the entry would be removed from the map on failure.


I believe the idea is that if you stored the result instead, you would have to wait for the promise to resolve the first time. If you made two requests in quick succession, the second request wouldn't see anything in the result map for the first one (as it hasn't resolved yet) and would then need to make its own new request.

If you store the promise, then not only does the second attempt know that it doesn't need to make a new request, but it can also await on the promise it finds in the map directly.


Wait a minute.

Under Javascript, the behavior induced by this pattern is EXACTLY equivalent to that of CACHING the result directly and a BLOCKING call. The pattern is completely pointless if you view it from that angle. Might as well use blocking calls and a regular cache rather than async calls and cached promises.

So then the benefit of this pattern is essentially it's a way for you to turn your async functions into cached non-async functions.

Is this general intuition correct?

--edit: It's not correct. Different async functions will still be in parallel. It's only all blocking and serial under the multiple calls to the SAME function.


Say you have three elements in your UI that need to fetch some user info. You have two major options:

1/ Have a higher level source of truth that will fetch it once from your repository (how does it know it needs to fetch? How is cache handled ?) and distribute it to those three elements. Complex, makes components more independent but also more dumb. It's fine to have pure elements, but sometimes you just want to write <UserInfoHeader/> and let it handle its stuff.

2/ Your repository keeps this Map<Key, Promise<T>>, and every time you call getUserInfo(), it checks the map at key "userinfo" and either returns the promise (which might be ongoing, or already resolved) or sees that it's not there, does the call, and writes the promise back into the map. This way, your three components can just call getUserInfo() without giving a damn about any other ones. The first one that calls it pre-resolves it for others.

As to why a promise instead of just the raw result: one can potentially return null (and you need to call again later to refresh, or straight up blocks during the entire call), the other one just gives you back promises and you can just listen to them and update your UI whenever it's ready (which might be in 5 seconds, or right now because the promise has already been resolved)

It's a bad implementation of a cached repository (because it ignores TTL and invalidation as well as many problems that need to be handled) that any junior developer could figure out (so it's everything but obscure), but sometimes, eh, you don't need much more.


Because the computation of Result is _expensive_ and _async_.

So if you get two requests for the computation that computes Result in close succession the Map doesn't have the result in place for the second call and as a result you get no caching effect and make 2 expensive requests.

By caching the Promise instead the second request is caught by the cache and simply awaits the Promise resolving so now you only get 1 expensive request.


I didn't know this was a formal design pattern; I do recall having implemented this myself once, as a means to avoid double requests from different components.

Later on I switched to using react-query which has something like this built-in, it's been really good.


This is only tangentially related to:

> you avoid computing/fetching the same value twice

But this problem comes up in reverse proxies too. In Nginx, you can set the `proxy_cache_lock`[0] directive to achieve the same effect (avoiding double requests).

[0]: https://nginx.org/en/docs/http/ngx_http_proxy_module.html#pr...


That right there is Cache Stampede[0] prevention.

[0]: https://en.wikipedia.org/wiki/Cache_stampede


Interesting! This is an issue I had with an external API which I intended to cache on my serverless workers infra.

I hit the API's rate limiter when the workers invalidated their cache, even though I staggered the lifetime of the cache keys for each replicated instance. This is how I found out how many Cloudflare workers run on a single edge instance. Hint: It's many.

Didn't know it had a name. I'm delighted, thanks!


You’re welcome, happy to help. If you are in the .NET space I suggest you to take a look at FusionCache. It has cache stampede protection built in, plus some other nice features like a fail-safe mechanism and soft/hard timeouts https://github.com/jodydonetti/ZiggyCreatures.FusionCache


I love this approach and have used it many times in JavaScript. I often end up adding an additional map in front with the resolved values to check first, because awaiting or then'ing a Promise always means you will wait until the next microtask for the value, instead of getting it immediately. With a framework like React, this means you'll have a flash of missing content even when it is already cached.
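
A sketch of that two-map variant (illustrative names; `fetchValue` stands in for whatever expensive call is being cached):

    async function fetchValue(key: string): Promise<string> {
      return `value for ${key}`; // stand-in for the expensive call
    }

    const promises = new Map<string, Promise<string>>();
    const resolved = new Map<string, string>();

    function get(key: string): string | Promise<string> {
      const hit = resolved.get(key);
      if (hit !== undefined) return hit; // synchronous fast path, no microtask hop
      let p = promises.get(key);
      if (!p) {
        p = fetchValue(key).then((value) => {
          resolved.set(key, value); // later reads return synchronously
          return value;
        });
        promises.set(key, p);
      }
      return p;
    }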


Replacing the promise with the result in the map (using a sum type as the map value type) might be more efficient than maintaining two maps?


This surprises me. I had expected the microtask to be executed right after the current one, ie before any layouting etc - isn't that the whole point of the "micro" aspect?


I was able to reproduce this in a simple example [1]. If you refresh it a few times you will be able to see it flash (at least I did in Chrome on Mac). It is probably more noticeable if you set it up as an SPA with a page transition, but I wanted to keep the example simple.

[1] https://codesandbox.io/s/recursing-glitter-h4c83u?file=/src/...


I am deeply disturbed and confused. Do you understand why this happens?


The event loop shouldn't continue the circle until the microtask queue is empty, from what I remember.

Browsers probably used to squeeze layouting and such things in when the event loop completed an iteration or became idle. However, nowadays it's entirely possible that we have, or will have, browsers that do such work in parallel while the javascript event loop is still running, possibly even beginning to paint while javascript is still running synchronously (haven't seen that yet).

I've seen instances where browsers seem to be like "nah, I'll render this part later" and only would render some parts of the page, but certain layers (scrolling containers) would not be filled on the first paint - despite everything being changed in one synchronous operation!* And I've seen them do that at least 5 years ago.

It seems to me the situation is complicated.

* There was a loading overlay and a list being filled with data (possibly thousands of elements). The loading overlay would be removed after the list was filled, revealing it. What often happened was that the overlay would disappear, but the list still looking empty for a frame or two.


Yeah, saves you from thundering herd problems

https://github.com/ben-manes/caffeine cache does something similar by caching the future that gets returned on look-ups if the value is still being computed


I have a similar pattern when using an rxjs pipeline to do async tasks in response to changes in other values. The typical pattern would be

    readonly foo = bar.pipe(
        switchMap(v => doTask(v)),
        shareReplay(1),
    );
Where doTask returns a promise or an AsyncSubject. The shareReplay allows foo to cache the last value so any late subscriber will get the most recent value. This has a problem of returning cached values while new work is happening (eg while a network request is in progress) so instead I like to write

    readonly foo = bar.pipe(
        map(v => doTask(v)),
        shareReplay(1),
        switchMap(v => v),
    );
This way the shareReplay is caching the promise instead of the result. With this pattern I can interact with this pipeline from code like so

    async someEventHandler(userInput) {
        this.bar.next(userInput);
        const result = await firstValueFrom(this.foo);
        …
    }
Without the pattern, this code would get the old result if there was ever any cached value


It’s also the clearest and least buggy way to iterate over the results. Map over Await Promise.all(map).


Partly off-topic, but in JS I tend to avoid iterating over promises with maps because it's always confusing what will/won't work, so I use a basic FOR loop because async/await works in there.

How bad is this? Should I switch to Promise.all instead?


Iterating with async/await means you wait for every result before making another call. You basically execute the async calls one by one.

Promise.all runs them all simultaneously and waits until they are all complete. It returns an array of results in the same order as the calls, so it's usually pretty straightforward to use.

Both approaches have a purpose, so it's not like you "should" strictly use one. But you should be aware of both and be able to use the one that fits the need.
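
A small illustration of the difference (`ids` and `fetchItem` are placeholders for whatever you're iterating over):

    async function loadAll(ids: string[], fetchItem: (id: string) => Promise<unknown>) {
      // Sequential: each call waits for the previous one to finish.
      const oneByOne: unknown[] = [];
      for (const id of ids) {
        oneByOne.push(await fetchItem(id));
      }

      // Concurrent: all calls start immediately; Promise.all preserves input order.
      const allAtOnce = await Promise.all(ids.map((id) => fetchItem(id)));

      return { oneByOne, allAtOnce };
    }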


The third major strategy being Promise.any():

- as soon as any of them resolve (any)

- when all of them resolve (all)

- resolve each in order (for)


Promise.all/allSettled/etc will allow you to resolve your promises concurrently rather than awaiting each result in the for loop. Depends what your use case is, really.


Use my approach or yours, but never use forEach with await. I don’t remember exactly why, but it just doesn’t work and will lead to very confusing errors.


Because Array.forEach doesn’t return a value or anything that could be awaited. If you use Array.map you can create an Array of promises that can be awaited with Promise.all


Just make sure Map access is thread safe - awaiting promises is - but updating/reading map keys usually isn't.


Yep, in multithreaded langs you just want to use a ConcurrentMap or whatever the stdlib has for this. The write is very brief because you only write the promise and not eg keep a proper lock on the key until the result has arrived, so no need for a fancier datastructure.


js is single-threaded so no problem there.


Since OP mentioned Mutexes I assumed he's dealing with multithreaded code, but with JS this works as advertised :)


This is clever, otherwise you have to coordinate your cache setter and your async function call between potentially many concurrent calls. There are ways to do this coordination though; one implementation I've borrowed in practice in Python is mode's stampede decorator: https://github.com/ask/mode/blob/master/mode/utils/futures.p...


I believe that any time you await/then you wind up going back to the event loop. It's cheap ish but definitely not free. This is, of course, to make the behavior more predictable, but in this case it might be undesirable. Unfortunately, I don't think there's any obvious way around it. It can sorta be done with vanilla callbacks, but that takes you out of the async/promises world.


I have written many tools that do something like this, very useful.

Just gotta be careful with memory leaks, so (for a TS example) you might want to do something like this:

    const promiseMap = new Map<string, Promise<any>>();

    async function keyedDebounce<R>(key: string, fn: () => R) {
      const existingPromise = promiseMap.get(key);
      if (existingPromise) return existingPromise;

      const promise = new Promise(async (resolve) => {
        const result = await fn(); // ignore lack of error handling

        // avoid memory leak by cleaning up
        promiseMap.delete(key); 

        // this will call the .then or awaited promises everywhere
        resolve(result); 
      });

      promiseMap.set(key, promise);

      return promise;
    }
So that the promise sticks around for that key while the function is active, but then it clears it so that you're not just building up keys in a map.


This is a great solution to the thundering herd problem, but isn't OP explicitly trying to cache the results for later? In that case, the promise map is a clever way to make sure the cachable value is fetched at most once, and you want the keys to build up in the map, so GC'ing them is counterproductive.


Ya it obviously depends on the intended goal, but caching items in a map indefinitely better be done on a set of values that is known to be limited to a certain size over the lifetime of the server or it'll lead to a memory leak.


If the result of the async function is not necessarily the same value every time you call it, you may want to recompute it


Please correct me if I'm wrong. But you'd better avoid creating extra promises for performance reasons

  const promiseMap = new Map<string, Promise<any>>();

  async function keyedDebounce<R>(key: string, fn: () => Promise<R>) {
    const existingPromise = promiseMap.get(key);
    if (existingPromise) return existingPromise;
  
    const promise = fn().then(result => {
      promiseMap.delete(key);
  
      return result;
    });
  
    promiseMap.set(key, promise);
  
    return promise;
  }


You check for if(promiseMap.get(key)) and in the NO case you do promiseMap.delete(key)?

Could you explain why that's necessary? (sorry probs a stupid question)


So if the promise still exists, that means that there is an active call out for a promise using this key already, so we'll just return it. We know this because when the function that is executing finishes, we delete the promise from the map.

A purpose of this might be, let's say you rebuild some structure very frequently, like hundreds of times per second or whatever. You can call this function for a given key as many times as you want while that async process is executing, and it will only execute it once, always returning the same promise.


In the no case, the promise is created, and then deleted after the function is resolved.

I’m not entirely sure of the use of this pattern where you're using a map and deleting the keys as you go - seems like you’d end up doing more work kinda randomly depending on how the keys were added. I’d just release the whole map when I was done with it.


Won’t JS garbage collect orphaned references? Why is this necessary?


there is some confusion here.

OP is intentionally caching results of, presumably, expensive computations or network calls. It's a cache. There is no memory leak. OP just did not detail how cache invalidation/replacement happens. The 2nd comment adds a rather rudimentary mechanism of removing items from cache as soon as they are ready. You get the benefit of batching requests that occur before the resolve, but you get the downside of requests coming in right after the resolve hitting the expensive process again. Requests don't neatly line up at the door and stop coming in as soon as your database query returns.

Typically you would use an LRU mechanism. All caches (memcached, redis, etc.) have a memory limit (whether fixed or unbounded, i.e. all RAM) and a cache replacement policy.


It does, but the map has a reference to it, so it will "leak" (in gc languages an unwanted reference is considered a leak). If this map got rather large, you could end up with a rather large heap and it would be un-obvious why at first.


Do Promises hold a reference to the chain of functions that end in the result? If so, that seems like a bug.


A promise is just an object, it does not contain any reference to the chained functions.


Could you use a WeakMap for this instead?


If the key is a type that you expect to be GC'ed, yes, totally. If the key is a simple value type eg a string then it behaves the same as a regular Map or an object.


Ah yeah, good point.

I've mostly used this overall pattern with something like Dataloader, where you throw away the entire cache on every request, so the GC problem doesn't crop up.


https://github.com/tc39/proposal-function-memo

This might be relevant to this exact pattern

  const getThing = (async function (arg1, arg2) {
    console.log({ arg1, arg2 });
    return arg1 + (arg2 || 0);
  }).memo();

  console.log(await getThing(1)); // New promise
  console.log(await getThing(2)); // New promise
  console.log(await getThing(1)); // Returns promise from earlier


>> Not a very deep CS-y one

yes, in fact this is more like a pattern than an actual data structure, since you might as well replace it with a list of futures. And a comparison between future based concurrency and thread based parallelism is like apples to oranges.

But it isn't surprising that this pattern ended up at the top since it is what the users of this site would be most familiar with.


Vitess does something similar, "query coalescing", which was designed for loads that are very read-heavy. If 1000 requests come in that end up making the same read query from the database, and they all came in in the same 50ms that it takes to get the results from the database, you can just return the same answer to all of them.


I wrote a version of this with Elixir: https://github.com/bschaeffer/til/tree/master/elixir/gen_ser...

Didn't know what to call it but PromiseMaps is nice name for it.


Promise Caching is a nicer name for it, since that's what it is if well implemented.


> It only works in languages where promises/futures/tasks are a first-class citizen. Eg JavaScript.

Why would that be true?


I think the actual answer is that it only works in languages where you can eagerly execute async code disconnected from the Promise/Future construct. It would've worked in ES3 using a Promise polyfill. (Which is notable because it does not fit the aforementioned criteria—it had no real async support.)


It isn’t. The first implementation of promises I worked on was in Java. We used it as part of verifying certificate chains. If you use threads it doesn’t matter if the operations are blocking, as long as you move them off-thread.


Anybody know if there is a similar pattern for Python's asyncio?


How about a dict whose values are asyncio.Future?

https://docs.python.org/3/library/asyncio-future.html#asynci...



I wonder if Swift's AsyncAwait could be used in such a way.


Sure, it's possible: we need to await the Task if it already exists in the dictionary. For example, we could imagine writing something like this inside an actor to make it thread safe.

    private var tasksDictionary: Dictionary<String, Task<Data, Error>> = [:]

    func getData(at urlString: String) async throws -> Data {
        if let currentTask = tasksDictionary[urlString] {
            return try await currentTask.value
        }
        let currentTask = Task {
            // data(from:) returns (Data, URLResponse); keep only the Data
            return try await URLSession.shared.data(from: URL(string: urlString)!).0
        }
        tasksDictionary[urlString] = currentTask
        let result = try await currentTask.value
        tasksDictionary[urlString] = nil
        return result
    }


Hmm, why set `taskDictionary[urlString] = nil` at the bottom there? If the whole point is to cache the result, isn't the point to leave the value there for other requests to pick it up?


Yup, nice catch, no need to reset it to nil if you wanna keep it in memory indefinitely.

I guess making it nil can also be used if you don't wanna make the same request when there is one already in flight; in case you have a 403 error and need to request a refresh token, you don't wanna make two simultaneous requests, but you also don't wanna cache it indefinitely either.


Great! So much nicer than the Combine (or Rx) version.


This is great. I'm actually running into this problem but across distributed clients. Is there a distributed way to achieve something like this with say Redis?


Technically yes, the promise can first do an RPC to a distributed key/value store, and only then do the expensive computation (which typically is itself an RPC to some other system, perhaps a traditional database). Promise libraries should make this pretty simple to express (though always uglier than blocking calls).

I say expensive because even in a data center, an RPC is going to take on the order of 1ms. The round-trip from a client like a browser is much greater especially where the client (if this is what you meant by "client") has a poor internet connection.

Therefore this pattern is usually done inside of an API server in the data-center. Additional benefits of the server side approach is that an API server can better control the cache keys (you could put the API server "build number" in the cache key so you can do rolling updates of your jobs), TTL, etc. Also, you don't want a rogue computer you don't control to directly talk to a shared cache since a rogue computer could put dangerous values into your cache, remove elements from the cache they shouldn't, etc.


What do you mean by first class citizen, here? I’m pretty sure this works in all languages with promises, but I might be misunderstanding something.


The data structure here is a Map<T>


Can you please explain a bit more how this avoids fetching the same value twice? Maybe with an example.


It sounds like, if you have a process that takes a long time to do, you check to see if you already have initiated that process (see if the key is in the list of promises), rather than firing off another promise for the same request.

If you get 10 requests to fire function lookUpSomethingThatTakesAWhile(id) that returns a promise, rather than firing off 10 promises, you fire it off once, look up the ID in your map for the next nine, and return that same promise for the next 9.


What is a promise in this context?


A promise is an object representing an asynchronous result. Some languages, such as Python, call them futures. A promise object will typically have methods for determining if result is available, getting the result if it is, and registering callback functions to be called with the result when it's available.
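
For example, in JavaScript/TypeScript:

    const p: Promise<number> = new Promise((resolve) => {
      setTimeout(() => resolve(42), 100); // the result becomes available later
    });
    p.then((value) => console.log(value)); // callback runs once the value is ready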


Thank you!


More generally, I believe this is a monad structure.


Promises are not monads, for one simple reason: they're not referentially transparent. The whole point of OP's example is to double down and take advantage of that.


Referential transparency is a property of expressions, not type classes. Referential transparency is nowhere in the definition of a monad because a monad is obviously not an expression but a type class.

It is definitely true that Promise is a data type that _could_ admit a monad instance. It has:

- a data type M a which is Promise a

- a function pure with the signature a => M a, which is x => Promise.resolve(x)

- a bind function with the signature M a => (a => M b) => M b, which is essentially `then`

But a monad requires three properties to hold true: Right identity, Left identity and associativity. Right identity holds for every function `f: a => M b`, but left identity and associativity do not hold.


Maybe comonads fit better


promise map?

do you mean

    setmetatable({}, {__index = function(tab, key)
          local co = coroutine.running()
          uv.read(fd, function(err, result)
               tab[key] = result
               return coroutine.resume(co, result)
            end)
          return coroutine.yield()
      end })
This is an incomplete implementation, clearly: but I've never missed Promises with coroutines and libuv.


Cache-Oblivious Data Structures: https://cs.au.dk/~gerth/MassiveData02/notes/demaine.pdf

A vaguely related notion is that the naive analysis of big-O complexity in typical CS texts glosses over the increasing latency/cost of data access as the data size grows. This can't be ignored, no matter how much we would like to hand-wave it away, because physics gets in the way.

A way to think about it is that a CPU core is like a central point with "rings" of data arranged around it in a more-or-less a flat plane. The L1 cache is a tiny ring, then L2 is a bit further out physically and has a larger area, then L3 is even bigger and further away, etc... all the way out to permanent storage that's potentially across the building somewhere in a disk array.

In essence, as data size 'n' grows, the random access time grows as sqrt(n), because that's the radius of the growing circle with area 'n'.

Hence, a lot of algorithms that on paper have identical performance don't in reality, because one of the two may have an extra sqrt(n) factor in there.

This is why streaming and array-based data structures and algorithms tend to be faster than random-access, even if the latter is theoretically more efficient. So for example merge join is faster than nested loop join, even though they have the same performance in theory.


Realtime collision detection[1] has a fantastic chapter on this with some really good practical examples, if I remember right.

Great book, I used to refer to it as the 3D "data structures" book, which is very much in theme with this thread.

[1] https://www.amazon.com/Real-Time-Collision-Detection-Interac...


The implicit grid data structure from there is a personal favorite of mine. I used it in a game once and it performed incredibly well for our use case.

It's a bit too complicated to totally summarize here, but it uses a bit per object in the scene. Then bit-wise operations are used to perform quick set operations on objects.

This data structure got me generally interested in algorithms that use bits for set operations. I found the Roaring Bitmap github page has a lot of useful information and references wrt this topic: https://github.com/RoaringBitmap/RoaringBitmap


I spent a lot of time optimizing an octree for ray tracing. It is my favorite "spatial index" because it's the only one I know that can be dynamically updated and always results in the same structure for given object placement.


It's a stellar book. I work on a commercial realtime physics engine, the "orange book" is practically required reading here.


I often recommend this book, RTCD, as a "better intro to data structures and algorithms" (sometimes as "the best"). The advice often falls on deaf ears, unfortunately.


Seconding the RTCD recommendation. Handy code examples, and my favorite part is that the book is real-world-performance-conscious (hence the "real-time" in the title).


> In essence, as data size 'n' grows, the random access time grows as sqrt(n), because that's the radius of the growing circle with area 'n'.

I was about to write a comment suggesting that if we made better use of three dimensional space in constructing our computers and data storage devices, we could get this extra latency factor down to the cube root of n.

But then, I decided to imagine an absurdly large computer. For that, one has to take into account a curious fact in black hole physics: the radius of the event horizon is directly proportional to the mass, rather than, as one might expect, the cube root of the mass [1]. In other words, as your storage medium gets really _really_ big, you also have to spread it out thinner and thinner to keep it from collapsing. No fixed density is safe. So, in fact, I think the extra factor for latency in galactic data sets is neither the square root nor cube root, but n itself!

[1] https://en.wikipedia.org/wiki/Schwarzschild_radius


The main difficulty with growing computing devices in three dimensions is cooling. All that heat has to be somehow carried away. It's already difficult with 2D devices. One would have to incorporate channels for cooling liquids into the design, which takes a bite out of the gains from 3D.

https://www.anandtech.com/show/12454/analyzing-threadripper-...


This series of blog posts explains this sqrt behaviour quite nicely, covering both theory (including black holes) and practice:

http://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html


If its stored far away enough re-doing the calculation is faster. Depending on how much accuracy you need you might as well load a 2022-07-22 07:26:34 earth and have that guy on HN evaluate his thoughts all over again.


How do you accurately include those sqrt(n)’s in your analysis?


I spent a long time looking at an algorithm for long-range force calculations called the Fast Multipole Method, which is O(N), and found that in practice it couldn't compete against an O(N log N) method we used that involved FFTs: the coefficient was so large that, because of cache locality etc., you'd need to simulate systems way bigger than is feasible for it to pay off.


Honestly, that is not too uncommon. For top level algorithm choice analysis, "rounding" O(log(n)) to O(1), and O(n*log(n)) to O(n), is often pretty reasonable for algorithm comparison, even though it is strictly incorrect.

It is not uncommon for a simple O(n*log(n)) algorithm to beat out a more complex O(n) algorithm for realistic data sizes. If you "round" them both to O(n), then picking the simpler algorithm becomes the obvious choice, and it can easily turn out to have better perf.


To be fair, in the general case (particles placed in arbitrary positions), it worked very well. The issue was that in the area of electromagnetics I worked on at the time, we make an assumption about things being placed on a regular grid, and that allows you to do the force calculation by convolving a kernel with the charges/dipole strength in Fourier space.


A helpful little heuristic that always stuck with me from a Stanford lecture by Andrew Ng was, "log N is less than 30 for all N".



This implementation and associated paper improve the cache locality to make the algorithm more competitive https://github.com/dmalhotra/pvfmm

They compare solving toy problems with different methods here https://arxiv.org/pdf/1408.6497.pdf

But as you mention, this is just fiddling with constants, and it still may be too costly for your problem


I saw an amazing presentation on Purely Functional Data Structures, with a really approachable and understandable proof of how a particular structure was asymptotically constant or linear time (I believe), and then it followed up by saying that in practice none of it worked as well as a mutable version because of caching and data locality.

Oh well.


Is there by any chance a publicly available recording of this presentation on purely functional data structures?


NE Scala Symposium Feb 2011

http://vimeo.com/20262239 Extreme Cleverness: Functional Data Structures in Scala

or maybe

http://vimeo.com/20293743 The Guerrilla Guide to Pure Functional Programming

(At work, so I can't check the video)

Both of them were excellent speakers


Does anyone actually use cache-oblivious data structure in practice? Not, like, "yes I know there's a cache I will write a datastructure for that", that's common, but specifically cache-oblivious data structures?

People mention them a lot but I've never heard anyone say they actually used them.


To an extent, the B-tree data structure and its variants are cache oblivious. They smoothly improve in performance with more and more cache memory, and can scale down to just megabytes of cache over terabytes of disk.

The issue is that the last tree level tends to break this model because with some caching models it is "all or nothing" and is the biggest chunk of the data by far.

There are workarounds which make it more truly cache oblivious. How often this is used in production software? Who knows...


There are also Judy arrays: https://en.wikipedia.org/wiki/Judy_array


This sounded promising but then I saw a performance benchmark [1] that was done on a pentium(!) with a very suboptimal hash table design compared to [2] and [3].

The Judy site itself [4] is full of phrases like "may be out of date", when the entire site was last updated in 2004. If it was already out of date in 2004...

It seems that Judy arrays are just kind of a forgotten datastructure, and it doesn't look promising enough for me to put in effort to revive it. Maybe someone else will?

[1] http://www.nothings.org/computer/judy/

[2] https://probablydance.com/2017/02/26/i-wrote-the-fastest-has...

[3] https://probablydance.com/2018/05/28/a-new-fast-hash-table-i...

[4] http://judy.sourceforge.net/


Judy arrays are (under the covers) a kind of radix tree, so they are comparable to things like ART (adaptive radix tree), HAMT (hashed array-mapped trie), or my qp-trie. Judy arrays had some weird issues with patents and lack of source availability in the early days. I think they are somewhat overcomplicated for what they do.

Regarding cache-friendliness, a neat thing I discovered when implementing my qp-trie is that it worked amazingly well with explicit prefetching. The received wisdom is that prefetching is worthless in linked data structures, but the structure of a qp-trie makes it possible to overlap a memory fetch and calculating the next child, with great speed benefits. Newer CPUs are clever enough to work this out for themselves so the explicit prefetch is less necessary than it was. https://dotat.at/prog/qp/blog-2015-10-11.html

I guess I partly got the idea for prefetching from some of the claims about how Judy arrays work, but I read about them way back in the 1990s at least 15 years before I came up with my qp-trie, and I don't know if Judy arrays actually use or benefit from prefetch.


It seems to me that the project is still active: https://sourceforge.net/p/judy/bugs/29/


Cache-obliviousness is an important part of the whole array-of-structs vs structs-of-arrays debate. One of the advantages of the struct-of-arrays strategy is that it is cache-oblivious.


Cache oblivious data structures are absolutely used in every serious database.

From what I remember/can find online (don't take this as a reference): B-tree: relational DB, SQLite, Postgres, MySQL, MongoDB. B+-tree: LMDB, filesystems, maybe some relational DBs. LSM trees (Log-structured merge tree, very cool) for high write performance: LevelDB, Bigtable, RocksDB, Cassandra, ScyllaDB, TiKV/TiDB. B-epsilon tree: TokuDB, not sure if it exists anymore. COLA (Cache-oblivious lookahead array): I don't know where it's used.

Maybe modern dict implementations can qualify too, e.g. Python.


Yes, LSM trees are specifically designed with this property in mind.


Doing a decent job of analysis would include the sqrt(n) memory access factor. Asymptotic analysis ignores constant factors, not factors that depend on data size. It's not so much 'on paper' as 'I decided not to add a significant factor to my paper model'.

Cache oblivious algorithms attempt to make the memory access term insignificant so you could continue to ignore it. They are super neat.


> the increasing latency/cost of data access as the data size grows

Latency Numbers Everyone Should Know https://static.googleusercontent.com/media/sre.google/en//st...


The ratios are mostly correct, but the absolute numbers are wrong these days. For example, memory read latency can be 45ns (not 100) with a decent DDR4 kit and data center latency can be 10us (not 500us) with infiniband.


Do we mean different things by "merge join" and "nested loop join" ? For me "merge join" is O(n) (but requires the data to be sorted by key) whereas "nested loop join" is O(n^2).


Wouldn't you at least be looking at nlog(n) for the sort in the merge join?


Yes, if the data is not already sorted. Thus it's O(n) for already sorted data and O(n log(n) + n) -- which simplifies to O(n log(n)) -- for arbitrary data.


Yeah. I know. Why would the data already be sorted?


In a database you might have already built an index on that data.


I was experimenting a while ago with something that I think is related to this.

If you have a quadratic algorithm where you need to compare each pair of objects in a large array of objects, you might use a loop in a loop like this:

    for (int i = 0; i < size; i++) {  for (int j = 0; j < size; j++) { do_something(arr[i], arr[j]); }  }
When this algorithm runs, it will access every cache line or memory page in the array for each item, because j goes through the whole array for each i.

I thought that a better algorithm might do all possible comparisons inside the first block of memory before accessing any other memory. I wanted this to work independently of the size of cache lines or pages, so the solution would be that for each power of 2, we access all items in a position lower than that power of 2 before moving to the second half of the next power of 2.

This means that we need to first compare indexes (0,0) and then (1,0),(1,1),(0,1) and then all pairs of {0,1,2,3} that haven't yet been compared and so on.

It turns out that the sequence (0,0),(1,0),(1,1),(0,1) that we went through (bottom-left, bottom-right, top-right, top-left in a coordinate system) can be recursively repeated by assuming that the whole sequence that we have done so far is the first item in the next iteration of the same sequence. So the next step is to do (0,0),(1,0),(1,1),(0,1) again (but backwards this time) starting on the (2,0) block so it becomes (2,1),(3,1),(3,0),(2,0) and then we do it again starting on the (2,2) block and then on the (0,2) block. Now we have done a new complete block that we can repeat again starting on the (4,0) block and then (4,4) and then (0,4). Then we continue like this through the whole array.

As you can see, every time we move, only one bit in one index number is changed. What we have here is actually two halves of a gray code that is counting up, half of it in the x coordinate and the other half in y. This means that we can iterate through the whole array in the described way by making only one loop that goes up to the square of the size and then we get the two indexes (x and y) by calculating the gray code of i and then de-interleaving the result so that bits 0,2,4... make the x coordinate and bits 1,3,5... make y. Then we compare indexes x and y:

    for (int i = 0; i < size * size; i++) { get_gray_xy(i, &x, &y); do_something(arr[x], arr[y]); }
The result of all this was that the number of cache line/page switches went down by orders of magnitude but the algorithm didn't get faster. This is probably for two reasons: incremental linear memory access is very fast on modern hardware and calculating the gray code and all that added an overhead.


While reading your post I kept thinking "de-interleave the bits of a single counter" since I have used that before to great benefit. It does suffer from issues if your sizes are not powers of two, so your grey code modification seems interesting to me. I'll be looking in to that.

FYI this also extends to higher dimensions. I once used it to split a single loop variable into 3 for a "normal" matrix multiplication to get a huge speedup. Unrolling the inner 3 bits is possible too but a bit painful.

Also used it in ray tracing to scan pixels in Z-order, which gave about 10 percent performance improvement at the time.

But yes, I must look at this grey code step to see if it makes sense to me and understand why it's not helping you even with reduced cache misses.


Interestingly I get more cache misses with the gray code solution but still fewer switches between cache lines. This has to be because the cache line prefetcher can predict the memory usage in the simple version but it's harder to predict in the gray code or de-interleave version.

Also, I have never thought about de-interleaving bits without using gray code but that seems to work similarly. The main reason I used gray code is that then we don't have to switch memory blocks as often since only one of the indexes changes every time so one of the current memory blocks can always stay the same as in the previous iteration.


This is similar to what I'm working on.

I am working on machine sympathetic systems. One of my ideas is concurrent loops.

Nested loops are equivalent to the Nth product of the loop × loop × loop or the Cartesian product.

  for letter in letters:
   for number in numbers:
    for symbol in symbols:
     print(letter + number + symbol)
If len(letters) == 3, len(numbers) == 3 and len(symbols) == 3, you can think of this as loop indexes "i", "j" and "k" that begin at 000 and count up in a kind of base 3: 001, 002, 010, 011, 012, 020, 100 and so on.

The formula for "i", "j" and "k" is

  combo = []
  N = self.N
  for index, item in enumerate(self.sets):
    N, r = divmod(N, len(item))
    combo.append(r)
So the self.N is the product number of the nested loops, you can increment this by 1 each time to get each product of the nested loops.

We can load balance the loops and schedule the N to not starve the outer loops. We can independently prioritize each inner loop. If you represent your GUI or backend as extremely nested loops, you can write easy to understand code with one abstraction, by acting as if loops were independent processes.

So rather than that sequence, we can generate 000, 111, 222, 010, 230, 013 and so on.

Load balanced loops means you can progress on multiple items simultaneously, concurrently. I combined concurrent loops with parallel loops with multithreading. I plan it to be a pattern to create incredibly reactive frontends and backends with low latency to processing. Many frontends lag when encrypting, compressing or resizing images, it doesn't need to be that slow. IntelliJ doesn't need to be slow with concurrent loops and multithreading.

See https://github.com/samsquire/multiversion-concurrency-contro...

Loops are rarely first class citizens in most languages I have used and I plan to change that.

I think I can combine your inspiration of gray codes with this idea.

I think memory layout and data structure and algorithm can be 3 separate decisions. I am yet to see any developers talk of this. Most of the time the CPU is idling waiting for memory, disk or network. We can arrange and iterate data to be efficient.



I decided to try implementing get_gray_xy. I am wondering how you can generalise it to any number of loops and apply it to nested loops of varying sizes, that is, if you have three sets whose sizes are 8, 12 and 81.

Wikipedia says gray codes can be computed by num ^ (num >> 1)

  a = [1, 2, 3, 4]
  b = [2, 4, 8, 16]
  indexes = set()
  correct = set()

  print("graycode loop indexes")
  for index in range(0, len(a) * len(b)):
    code = index ^ (index >> 1)
    left = code & 0x0003
    right = code >> 0x0002 & 0x0003
  
    print(left, right)
    indexes.add((left, right))
  
  assert len(indexes) == 16
  
  print("regular nested loop indexes")
  
  for x in range(0, len(a)):
    for y in range(0, len(b)):
      correct.add((x,y))
      print(x, y)
  
  
  assert correct == indexes
This produces the sequence:

    graycode loop indexes
    0 0
    1 0
    3 0
    2 0
    2 1
    3 1
    1 1
    0 1
    0 3
    1 3
    3 3
    2 3
    2 2
    3 2
    1 2
    0 2
    regular nested loop indexes
    0 0
    0 1
    0 2
    0 3
    1 0
    1 1
    1 2
    1 3
    2 0
    2 1
    2 2
    2 3
    3 0
    3 1
    3 2
    3 3


Here is the implementation of get_gray_xy that I used:

    static uint64_t deinterleave(uint64_t x) {
        x = x & 0x5555555555555555;
        x = (x | (x >>  1)) & 0x3333333333333333;
        x = (x | (x >>  2)) & 0x0f0f0f0f0f0f0f0f;
        x = (x | (x >>  4)) & 0x00ff00ff00ff00ff;
        x = (x | (x >>  8)) & 0x0000ffff0000ffff;
        x = (x | (x >> 16)) & 0x00000000ffffffff;
        return x;
    }
    static void get_gray_xy(uint64_t n, uint64_t *x, uint64_t *y) {
        uint64_t gray = n ^ (n >> 1);
        *x = deinterleave(gray);
        *y = deinterleave(gray >> 1);
    }
I don't think the gray code solution can work well for sets of different sizes because the area of indexes that you go through grows as a square.

But it does work for a different number of dimensions, i.e. a different number of nested loops. For 3 dimensions each coordinate is constructed from every 3rd bit instead of every 2nd.

So if you have index 14, then gray code is 9 (1001) which means in two dimensions x=1,y=2 and in three dimensions, x=3,y=0,z=0.


Thank you for sharing your idea and your code and thoughts and thanks for your reply.

I would like to generalise it for sets of different sizes. Maybe it's something different from gray codes, which you taught me about today; it feels like it should be simple.

Maybe someone kind can step in and help me or I can ask it on Stackoverflow.

I need to think it through.

At one point I was reading a section on Intel's website about cache optimisation, using a tool with a GUI that shows you the cache behaviour of the CPU for column-major vs row-major iteration. I found it very interesting but I cannot seem to find it now. I did find a Medium article about Cachediff.


deinterleave doesn't actually produce gray code does it? The GP said to convert to gray code before doing the deinterleave.


> but the algorithm didn't get faster

Could it be branch predictor at play? https://stackoverflow.com/questions/11227809/why-is-processi...



The union-find data structure / algorithm is useful and a lot of fun.

The goal is a data structure where you can perform operations like "a and b are in the same set", "b and c are in the same set" and then get answers to questions like "are a and c in the same set?" (yes, in this example.)

The implementation starts out pretty obvious - a forest where every element either points at itself or at something it was merged with. To check if two elements are in the same set, check if they have the same root. Without analyzing it, it sounds like you'll average findRoot() performance of O(log n), with a worst case of O(n).

There are a couple of simple optimizations you can do to this structure, the type of things that seem like they shouldn't end up affecting asymptotic runtime all that much. The first is that, whenever you find a root, you can re-parent all the nodes you visited on the way to that root, so they'll all be quicker to look up next time. The other is that you keep track of the size of sets, and always make the larger set be the parent of the smaller.

And neither of those actually does anything impressive alone, but if you use both, the algorithm suddenly becomes incredibly fast, with the slowest-growing (non-constant) complexity I've ever heard of: O(α(n)), where α is the inverse of the Ackermann function. Or, for any reasonable n, effectively 4 or less.
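
A minimal sketch of both optimizations together, in Python, just to make the idea concrete (assuming the elements are the integers 0..n-1):

    # Union-find sketch with path compression and union by size.
    # parent[i] == i means i is a root.
    class DisjointSet:
        def __init__(self, n):
            self.parent = list(range(n))
            self.size = [1] * n

        def find(self, x):
            root = x
            while self.parent[root] != root:
                root = self.parent[root]
            while self.parent[x] != root:        # path compression
                self.parent[x], x = root, self.parent[x]
            return root

        def union(self, a, b):
            ra, rb = self.find(a), self.find(b)
            if ra == rb:
                return
            if self.size[ra] < self.size[rb]:    # union by size
                ra, rb = rb, ra
            self.parent[rb] = ra
            self.size[ra] += self.size[rb]

        def same(self, a, b):
            return self.find(a) == self.find(b)

    ds = DisjointSet(4)
    ds.union(0, 1); ds.union(1, 2)
    print(ds.same(0, 2), ds.same(0, 3))   # True False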


Agreed, union-find is great.

My favourite usage of it in practice is for solving first-order (syntactic) unification problems. Destructive unification by means of rewriting unbound variables to be links to other elements is a very pretty solution - especially when you consider the earlier solutions such as that of Robinson's unification which, when applied, often involves somewhat error-prone compositions of substitutions (and is much slower).

The fact that the forest of disjoint sets can be encoded in the program values you're manipulating is great (as opposed to, say, the array-based encoding often taught first): makes it very natural to union two elements, chase up the tree to the set representative, and only requires minor adjustment(s) to the representation of program values.

Destructive unification by means of union-find for solving equality constraints (by rewriting unbound variables into links and, thus, implicitly rewriting terms - without explicit substitution) makes for very easy little unification algorithms that require minimal extension to the datatypes you intend to use. Removing evidence of the union-find artefacts can be achieved in the most natural way: replace each link with its representative by using find(l) (a process known as "zonking" in the context of type inference algorithms).

Basic demonstration of what I'm talking about (the incremental refinement of the inferred types is just basic union-find usage): https://streamable.com/1jyzg8


I remember Ravelin shared their story of using it in card fraud detection code.

Basically they managed to replace a full-fledged graph database with a small piece of code using union-find in Go. It was a great talk.

https://skillsmatter.com/skillscasts/8355-london-go-usergrou...


My writeup of union find is in https://dercuano.github.io/notes/incremental-union-find.html. I did not in fact find a way to make it efficiently support incremental edge deletion, which is what I was looking for.


> I did not in fact find a way to make it efficiently support incremental edge deletion, which is what I was looking for.

I don't understand this goal. The interior connections aren't relevant to a union-find structure; ideally you have a bunch of trees of depth 2 (assuming the root is at depth 1), but the internal structure could be anything. That fact immediately means that the consequence of removing an edge is not well-defined - it would separate the two objects joined by the edge, and it would also randomly divide the original connected set into two connected sets, each containing one of those two objects. Which object each "third party" object ended up being associated with would be an artifact of the interior structure of the union-find, which is not known and is constantly subject to change as you use the union-find.

If all you want is to be able to remove a single object from a connected set, on the assumption that your union-find always has the ideal flat structure, that's very easy to do - call find on the object you want to remove, which will link it directly to the root of its structure, and then erase that link.


Union find takes as input some undirected graph <N, E> and internally constructs (and progressively mutates) a directed graph <N, E'> which it uses to efficiently answer queries about whether two nodes n₁, n₂ ∈ N are in the same connected component of <N, E>. It additionally supports incrementally adding edges to E.

My quest was to find a way to incrementally delete edges from E, not E'. You're talking about deleting edges from E', which I agree is not generally a useful thing to do.

For example, your N might be the cells of a maze map, and your E might be the connections between adjacent cells that are not separated by a wall. In that case you can tear down a wall and add a corresponding edge to E. But it would be nice in some cases to rebuild a wall, which might separate two previously connected parts of the maze. I was looking for an efficient way to handle that, but I didn't find one.


I don't think there's a way to extend union-find to do what you want in the general case. You might have more luck starting with a minimum cut algorithm.

For the specific case where you can identify some of your edges are more or less likely to be deleted, though, you can run union-find on only the stable edges, cache that result, and then do the unstable edges. Whenever an unstable edge is deleted, reload the cache and redo all the unstable edges. This works for something like a maze with open/closable doors.


Those are some interesting ideas, thanks! I hadn't thought about trying to apply minimum-cut to the problem!


> Union find takes as input some undirected graph <N, E> and internally constructs (and progressively mutates) a directed graph <N, E'> which it uses to efficiently answer queries about whether two nodes n₁, n₂ ∈ N are in the same connected component of <N, E>.

I don't think this is a valuable way to think about the structure. That's not what it's for.

A union-find is a direct representation of the mathematical concept of an equivalence relation. The input is a bunch of statements that two things are equal. It will then let you query whether any two things are or aren't equal to each other.

This is captured well by the more formalized name "disjoint-set structure". You have a set of objects. You can easily remove an element from the set. But it makes no conceptual sense to try to "separate two of the elements in a set". They're not connected, other than conceptually through the fact that they are both members of the same set.

A union-find is a good tool for answering whether two vertices are in the same connected component of a graph because being in the same connected component of a graph is an equivalence relation, not because the union-find is a graph-related concept. It isn't, and there is no coherent way to apply graph-theoretical concepts to it.

Putting things another way:

You describe a union-find as something used to "efficiently answer queries about whether two nodes n₁, n₂ ∈ N are in the same connected component of [a graph]".

But you focused on the wrong part of that statement. What makes a union-find appropriate for that problem is the words "the same", not the words "connected component".


You are welcome to think about the union-find structure however you prefer to think about it, but I was describing the problem I was trying to solve, for which the correct description of union find I gave above is optimal. Contrary to your mistaken assertion, no contradictions arise from applying graph-theoretical concepts in this way; there is no problem of "coherency". It's just a form of description you aren't accustomed to.

If your way of thinking about union find makes it hard for you to understand the problem I was trying to solve, maybe it isn't the best way to think about it for the purpose of this conversation, even if it is the best way to think about it in some other context.

I'm not claiming my description is the only correct description of union find, just that it's the one that most closely maps to the problem I was trying to solve.

The description can equally well be formulated in the relational terms you prefer: given a relation R, union find can efficiently tell you whether, for any given x and y, (x, y) ∈ R', where R' is the symmetric transitive reflexive closure of R (and is thus the smallest equivalence relation containing R). It efficiently supports incrementally extending R by adding new pairs (a, b) to it.

This is equivalent to my graph-theoretic explanation above except that it omits any mention of E'. However, it is longer, and the relationship to the maze-construction application is less clear. Perhaps you nevertheless find the description clearer.

What I was looking for, in these terms, is a variant of union find that also efficiently supports incrementally removing pairs from R.

Does that help you understand the problem better?

— ⁂ —

In general there is a very close correspondence between binary relations and digraphs, so it's usually easy to reformulate a statement about relations as an equivalent statement about digraphs, and vice versa. But one formulation or the other may be more perspicacious.

More generally still, as you move beyond the most elementary mathematics, you will learn that it's usually a mistake to treat one axiom system or vocabulary as more fundamental than another equivalent axiom system, since you can derive either of them from the other one. Instead of arguing about which one is the right axiom system or vocabulary, it's more productive to learn to see things from both viewpoints, flexibly, because things that are obvious from one viewpoint are sometimes subtle from the other one.


> You are welcome to think about the union-find structure however you prefer to think about it, but I was describing the problem I was trying to solve, for which the correct description of union find I gave above is optimal.

> If your way of thinking about union find makes it hard for you to understand the problem I was trying to solve, maybe it isn't the best way to think about it for the purpose of this conversation, even if it is the best way to think about it in some other context.

I'm not having any problems understanding the problem you were trying to solve. My problems are in the area of understanding why you thought a union-find might be an effective way to address that problem.

I'm telling you that, if you adjust your view of what a union-find is, you will have a better conception of where it can and can't be usefully applied.

> In general there is a very close correspondence between binary relations and digraphs, so it's usually easy to reformulate a statement about relations as an equivalent statement about digraphs, and vice versa. But one formulation or the other may be more perspicacious.

In this case, the relation is symmetric by definition, so you just have graphs. Yes, it's obvious how you can use a graph to describe a relation.

But the entire point of the union-find structure is to discard any information contained in an equivalent graph that isn't contained in the relation. (And there are many equivalent graphs, but only one relation!) The absence of that information is what makes the structure valuable. It shouldn't be a surprise that, if you want to preserve information from a particular graph, you are better served with graph-based algorithms.


Well, that's an interesting way to look at it; I see better where you're coming from now.

But I wasn't trying to get the unmodified data structure to handle deletion efficiently; I was trying to use it as a basis to design a structure that could. I thought I saw a minor modification that achieved this, but then I found out why it doesn't work. That's what my linked note above is about.

More generally, if you only try things that will probably work, you'll never achieve anything interesting.


Can you give some examples where union-find is applied to great benefit?


A while back I had to perform an entity resolution task on several million entities.

We arrived at a point in the algorithm where we had a long list of connected components by ids that needed to be reduced into a mutually independent set of components, e.g. ({112, 531, 25}, {25, 238, 39, 901}, {43, 111}, ...)

After much head banging about working out way to do this that wouldn't lead to an out-of-memory error, we found the Disjoint Set Forest Data Structure and our prayers were answered.

I was very grateful for it!


Exactly the same for me. We had a defect management system, and a defect can be connected to others as dupes. We'd have whole chains of dupes. We used union find to identify them.


One example is connected component analyis on images. In one linear scan of an image (2d or 3d) you can find all connected blobs in an image.


I used it to find and track islands of connected polygons in 3D meshes.


A textbook example of an algorithm that works very well with union-find is Kruskal's algorithm for finding the minimum-weight spanning tree of a graph. Using union-find improves the time complexity of the algorithm from O(V²) to O(E log V).

This happens because Kruskal's algorithm essentially selects the cheapest edge not already included in our spanning tree that won't cause a cycle. Union-find speeds up this potential-cycle check, which would otherwise be naively quadratic.
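
For illustration, a rough Python sketch of Kruskal's algorithm with a tiny inline union-find doing the cycle check (function names are mine):

    # Kruskal sketch: sort edges by weight, use union-find to skip edges
    # that would close a cycle.
    def kruskal(n, edges):                      # edges: list of (weight, u, v)
        parent = list(range(n))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path halving
                x = parent[x]
            return x

        mst = []
        for w, u, v in sorted(edges):
            ru, rv = find(u), find(v)
            if ru != rv:                        # no cycle: keep this edge
                parent[ru] = rv
                mst.append((u, v, w))
        return mst

    print(kruskal(4, [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3)]))
    # [(0, 1, 1), (1, 2, 2), (2, 3, 4)]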


Oh, huh, this is serendipitously handy because I need to be doing something like that this weekend (translating from Go to Rust - I already have a reparenting solution using hashes but hopefully this is easier / better.)


I have always wanted to really understand this data structure. Sure, I can follow the analysis with the potential functions and all, but I never really understood how Tarjan came up with the functions in the first place. Does anybody have a resource which intuitively explains the analysis?


This might be a long read, but it is a well-written, reader-friendly analysis of Tarjan's proof with the chain of reasoning. https://codeforces.com/blog/entry/98275


Robert Sedgewick has a great explanation in his Algorithms book. He has gorgeous diagrams alongside his explanations.


Is it in an old edition? I just downloaded an extremely legitimate version of Algorithms, 4th edition, but it has no section on the union-find data structure.


It should be in Chapter 1 Section 5 of the 4th edition. It's even in the table of contents.


Sorry, my bad. I searched for the word "disjoint" expecting it to be prominent in this section, but somehow this word only appears 3 times in the book, none of which are in this section!

Anyway, while the figures in this work were indeed gorgeous, it was not what I was looking for. It did not even contain a non-intuitive analysis of the inverse Ackermann complexity, much less an intuitive one!


Here's a writeup I made a while ago: https://www.overleaf.com/read/hjbshsqmjwpg


We use it to wire different outputs and inputs to gates in a zero-knowledge proof circuit!


I have heard stories about annotating the edges with elements from a group rather than using unlabelled edges. Have you heard anything about that?


You can efficiently maintain count/max/sum/hash/etc. for each component. It's occasionally useful.


I wrote a beginner's guide on this topic a while ago https://medium.com/nybles/efficient-operations-on-disjoint-s...


This sounds super useful thanks! I think I'll have something this can be used for, with individual claims of whether two elements are in the same set, then easily getting the largest set of equal items.


If at some point you have a cycle, you can then reparent and remove the cycle, right? This structure in general then can encode transitive relations effectively?

Something like “a and b …” “b and c …” “c and a …”.


> If at some point you have a cycle, you can then reparent and remove the cycle, right?

You'd never have a cycle. The way a cycle would theoretically arise would have to be joining something to its own child. But you don't naively join the two objects you're given - you find their root objects, and join those. If - as would be necessary for a cycle to form - they have the same root, you can return without doing anything.

If we call

    unite(a,b)
    unite(b,c)
    unite(c,a)
then we should end up with this structure:

        c
       / \
      b   a
With parents on the right, the structure will be a-b after the first call and a-b-c after the second call. The reason we end up with a shallower tree after the third call is that when a is passed to unite, we call find on it and it gets reparented directly to c. Then, c (the representative for c) and c (the representative for a) are already related, so we don't adjust the structure further.

> This structure in general then can encode transitive relations effectively?

Well... no. This structure embodies the concept of an equivalence relation. You wouldn't be able to use it to express a general transitive relation, only a transitive relation that is also symmetric and reflexive.


Thanks!


Is there a name for the data structure/algorithm?


It's called union-find, but also disjoint-set sometimes. The process of reparenting nodes is called path compression, and the optimization of using the larger set of the two as the parent when merging is usually called "union by size" (or "union by rank" in the variant that tracks rank instead of size). Any of those terms should give you more details upon search.



I remember getting this in an interview without ever coming across it.. basically, fuck union find.


Wouldn't that be extremely useful and a polynomial solution for 3-SAT, which is NP complete?


I think I understand why you're confused here. To the extent that union-find is similar to SAT, it's similar to 2-SAT: the constraints you're reasoning about are defined only on pairs of elements, whereas 3-SAT is fundamentally reasoning about triplets (at least one of these three variables is true). Notably, 2-SAT is a variant of SAT that is polynomial in time, unlike 3-SAT.


SAT's complexity results from blowing up a reasonable number of variables/clauses (very rarely equivalent to one another) in the formula into an enormous number of variable assignments (too many to store, and falling into only three classes: formula true, formula false, or not evaluated so far, which are not useful equivalences). What disjoint sets are you thinking of tracking?


How is it a solution to 3-SAT?


(Fantastic post idea OP. One of the best I've ever seen :D)

Related to bloom filters, xor filters are faster and more memory efficient, but immutable.

HyperLogLog is an efficient way to estimate cardinality.

Coolest thing I've learned recently was the Y-fast trie. If your data is bounded integers from a universe M (say, the set of all 128-bit numbers), you get membership, predecessor, or successor queries in log log time, not log like in a normal tree.

see: https://www.youtube.com/playlist?list=PLUl4u3cNGP61hsJNdULdu... (6.851 Advanced Data Structures, Erik Demaine)

Would love to learn more "stupidly fast at the cost of conceptual complexity" things.

edit:

(adding) I don't know a name for it, because it's not one thing but a combination, but one can:

Use the first N bits from a very quick hash of a key from an unknown distribution (say a file path, or a variable name, or an address, or a binary blob,..) as a way to "shard" this key across M other fast data structures (like a tree) for search/add/remove. By changing M, you can tune the size of the terms in the O(1)+O(log) equation for running time.

Trees getting too deep for fast search? Every increase of N moves the computation from the log search of the tree to the space tradeoff of having more trees.

Added benefit is you can scale to multiple threads easily. Instead of locking the whole tree, you lock a tiny sub-tree.

Very clever. (I first saw in the Aerospike key value store)
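
A rough sketch of the idea in Python, with dicts standing in for the per-shard structures and one lock per shard (names and the choice of hash are mine; a real version would use a much faster hash and trees or similar as shards):

    # "First N bits of a hash picks a shard" sketch.
    import threading
    from hashlib import blake2b

    N_BITS = 4                                   # 2**4 = 16 shards
    shards = [({}, threading.Lock()) for _ in range(1 << N_BITS)]

    def shard_for(key: bytes):
        h = int.from_bytes(blake2b(key, digest_size=8).digest(), "big")
        return shards[h >> (64 - N_BITS)]        # top N bits pick the shard

    def put(key: bytes, value):
        data, lock = shard_for(key)
        with lock:                                # lock only this shard
            data[key] = value

    def get(key: bytes):
        data, lock = shard_for(key)
        with lock:
            return data.get(key)

    put(b"/etc/hosts", 42)
    print(get(b"/etc/hosts"))   # 42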


If you enjoyed XOR filters, you might also like ribbon filters, something that I had the pleasure of working on last year. They share the basic idea of using a system of linear equations, but instead of considering 3 random positions per key, the positions to probe are narrowly concentrated along a ribbon with a typical width of 64. This makes them far more cache-efficient to construct and query.

By purposefully overloading the data structure by a few per cent and bumping those items that cannot be inserted as a result of this overloading to the next layer (making this a recursive data structure), we can achieve almost arbitrarily small space overheads: <1% is no problem for the fast configurations, and <0.1% overhead with around 50% extra runtime cost. This compares to around 10% for XOR filters and ≥ 44% for Bloom filters.

In fact, I'm going to present them at a conference on Monday - the paper is already out: https://drops.dagstuhl.de/opus/volltexte/2022/16538/pdf/LIPI... and the implementation is at https://github.com/lorenzhs/BuRR/. I hope this isn't too much self-promotion for HN, but I'm super hyped about this :)


This made my day, thank you so much. :) Your explanation makes sense and it sounds brilliant/clever yet obvious. (like many good ideas are)

I'm reading the paper and looking at your github now, and look forward to "github/academia" stalking you in the future. Skimming your list of repositories and seeing a lot of stuff I understand and could possibly use. ;-)

(I find it to be a useful strategy to, when I find a clever concept in a paper, or in code on github, then look at all the other things done by the same person, or the works of people they co-author with. "collaborative filtering" for ideas.)


Y-fast tries are some of my favorites. I think they are heavily underutilized today. They sat by the wayside for a long time because datasets were relatively small, memory was scarce when they were created, and bitwise operations were inefficient, along with many other constant factors.

Today, however, a lot of people have datasets on the order of 2^16 or 2^32 keys they need to maintain, and efficient bitwise operations (on up to 512 bits) are the norm. For such workloads Y-fast tries can be faster than B-trees and other data structures. Also, because they divide _space_ rather than the _data_, they are very amenable to multi-threaded and distributed algorithms. For example, the hash tables in a y-fast trie can actually be a rendezvous hash pointing to a database node. Once in that node you can hash on cores again to get to a local process, for example.


I want to hear more, esp about the distributed applications, do you have any useful links, or can I buy you an "e coffee" to pick your brain for a few minutes?


> Also because they divide _space_ not the _data_ they are very amenable to multi-threaded and distributed algorithms.

Ha, I was thinking about this idea the other day and couldn’t figure out a good way to search for anything already written on the topic. I suspect there’s quite a lot of ground to be gained in the area.


Big fan of HLL

Apache foundation has a fantastic DataSketches library that includes HLL and many other powerful data analytics algorithms: https://datasketches.apache.org/

Lee Rhodes has done an excellent introduction to this library - explaining some of the use cases, advantages, and things to be aware of when using these techniques: https://www.youtube.com/watch?v=nO9pauS-mGQ


On sketches, there is a genre of structure for estimating histogram-like statistics (median, 99th centile, etc) in fixed space, which i really like. Two examples:

t-digest https://github.com/tdunning/t-digest

DDSketch https://github.com/DataDog/sketches-java


In the same spirit there is also Circllhist from Circonus. https://arxiv.org/abs/2001.06561

There was a good discussion on approaches to store histogram here: https://news.ycombinator.com/item?id=31379383


t-digest is used by AWS X-ray to render their charts


A fantastic thing about HyperLogLog is that it can be merged, so you can split your data between multiple servers, precompute HLL for all IPs every minute, and then ask "how many unique IPs were there yesterday".

Discovered HLL because it's used in ClickHouse, which employs a ton of cool but obscure data structures.


Works well in analytics cubes since they can be combined.

You can retain them across time too, such that you can ask questions like "how many unique users were there over the last N days?" without needing the source data. Great for privacy-aware analytics solutions.


Love DataSketches, but I was wondering if there is a way to compute sketches across time, e.g. I want to compute the users who did X and then Y, in that order. Since intersection is commutative it doesn't give an answer for time ordering.

Nonetheless the best data structure I have read over last 10 years.


I like https://en.wikipedia.org/wiki/Cuckoo_filter which allows for delete, unlike Bloom


I forgot about Aerospike. They basically built a NAND-optimized key-value store, right? I remember reading about how they used the FTL and thinking they were pretty clever. I can't for the life of me find the article now. I think they were really big in the ad tech space? Is that still the case?


"NAND optimized key value store" doesn't do it justice ;-) The fact that it's SSD optimized has nothing to do with key sharding across trees, the latter is what gets you the absurdly low latency and near infinite scale out. This link gives an overview: https://vldb.org/pvldb/vol9/p1389-srinivasan.pdf And it's open source...


I think what you’re describing in the last example is the MinHash algorithm

edit: actually I’m not sure, sounds related though


Sounds a bit like radix sort?


AIUI Radix sort "kinda ish" related, yeah :) They both involve looking at your data at a 90 degree angle and slicing on bitplanes.


HAMT: Hash Array Mapped Trie. This data structure makes efficient immutable data possible. You can update a list of a million items, and keep a reference to the original list, by changing 3 or 4 references and some bytes.

This should replace copy-on-write for scripting languages. I really want to see it in a JS spec soon. There are libraries that can do it, but they add translation penalties and extra steps. I’d complain less if I could use Clojurescript professionally, because it and Clojure are built around them.

https://en.m.wikipedia.org/wiki/Hash_array_mapped_trie


What I like about HAMTs is that they can be super simple to implement if you make them one bit per level. They are like a combination of a binary search tree and hash table but without any of their annoyances.

* In binary search trees, you need to balance the tree every time you insert something because the tree will be linear if you insert the nodes in order. In a HAMT the positions are determined by the hash, so the positions will be random-like and not depend on the order you insert the nodes in.

* In binary search trees, you need to perform some complicated steps to remove a node. In a HAMT, you can either just mark the node as deleted, or if you want to fill the gap, find a leaf node that has the removed node on its path and just put it there.

* In hash tables, when it starts to get full you need to rehash it or put the added item in the "wrong" slot or something. A HAMT never gets full because it's a tree so you just add another child node.

Here I have written an explanation of a simple HAMT: https://www.robalni.org/posts/20220507-a-hash-table-you-dont...
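
A toy, mutable sketch of the 1-bit-per-level idea (my own code, not the implementation linked above): each level consumes one bit of the key's hash to pick a child, so the shape depends on hashes rather than insertion order. A persistent version would copy the path from root to the changed node instead of mutating in place.

    # Minimal 1-bit-per-level hash trie sketch.
    class Node:
        def __init__(self, key, value):
            self.key, self.value = key, value
            self.children = [None, None]

    def insert(root, key, value):
        if root is None:
            return Node(key, value)
        node, h = root, hash(key)
        while True:
            if node.key == key:
                node.value = value              # overwrite existing entry
                return root
            bit = h & 1
            h >>= 1
            if node.children[bit] is None:
                node.children[bit] = Node(key, value)
                return root
            node = node.children[bit]

    def find(root, key):
        node, h = root, hash(key)
        while node is not None:
            if node.key == key:
                return node.value
            node, h = node.children[h & 1], h >> 1
        return None

    t = None
    for i, word in enumerate(["foo", "bar", "baz"]):
        t = insert(t, word, i)
    print(find(t, "bar"), find(t, "nope"))   # 1 None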


Here is an implementation of a binary HAMT that allows you to add and find nodes:

https://gist.github.com/robalni/311afd0756f25c4f234b2ae332cd...


Another interesting approach to copy-on-write for immutable collections (for example arrays) is where you actively mutate the array in place, but leave the original as a lazy description/view of how to get back to the before state.

From the outside the effect is the same, but the performance is optimized for accessing the updated collection and only possibly using the old value.

Great for cases where you want the immutability guarantee, but where it might be unlikely the old value is actually going to be used in the general case.


See also this cppcon talk by Juan Pedro https://youtu.be/sPhpelUfu8Q


Databases do that.


If the Record & Tuple proposal advances to stage 3 we'll finally have native immutable data structures in JS [1].

[1] https://github.com/tc39/proposal-record-tuple


I have been waiting aeons for this, and suspect will need to continue waiting effectively forever :(


Yeah TC39 has a habit of letting proposals languish at stage 2 for years. I wouldn't give up hope though. The decorators proposal finally made it to stage 3 after spending an eternity at stage 2. According to the slides from the TC39 meeting this month [1][2], it looks like they'll be proposing an advancement to stage 3 in September.

[1]https://github.com/tc39/agendas/blob/main/2022/07.md [2]https://www.dropbox.com/s/g4enjgd4p2npv2s/record_tuple_updat...


This. Roughly a year ago I got interested in efficient immutability for my write-from-scratch-in-C Lisp [0] and started to write a HAMT implementation in C [1], along with a (somewhat hacky, you have been warned) benchmarking suite [2].

The docs are only 70% done (in particular the "putting it all together" part is missing) but it has been a really interesting and enlightening journey so far and can only recommend embarking on this path to everyone.

[0]: https://github.com/mkirchner/stutter [1]: https://github.com/mkirchner/hamt [2]: https://github.com/mkirchner/hamt-bench


See also Matt Bierner's HAMT impl in JavaScript: https://blog.mattbierner.com/hash-array-mapped-tries-in-java...


also Ctrie: "a concurrent thread-safe lock-free implementation of a hash array mapped trie": https://en.wikipedia.org/wiki/Ctrie

So basically like HAMT but lock free.


Isn't this what Sublime Text uses under the hood to give it such good performance?


Isn't this slow because of pointer chasing in the trie?


I do recall Asami using a HAMT and it is written in Clojure. [1]: https://github.com/threatgrid/asami


That’s what Immutable.js used under the hood.


As a case study, we evaluated the performance of this along with other languages that use immutable data structures (perf testing) for a recent production app we were building. After some evaluation we formed the opinion that the penalty of using immutable HAMTs on the Node/JS platform is just too high, at least in its current library form. Immutable structures are often somewhat slower, but the relative penalty of using a HAMT on the Node/JS platform vs a straight mutable structure was much higher than in other languages. Floats for numbers, implicit conversions back and forth, the missing popcount intrinsic, and other such overheads make the relative penalty of an immutable HAMT vs a mutable JS map much higher than in other languages such as .NET or Java, let alone C/Rust.

Given Node is mostly single threaded as well some of the advantages (not all!) of immutability diminish on that platform making the performance penalty harder to swallow, at least for my use case.

It of course may not matter for your case, but I find the benefits of these structures come from their quicker clone/copy time in many applications, snapshot-like characteristics, easy large diffs, etc. For this to matter the collection has to be of a decent size, where a linear copy/equals/etc. is too slow for the use case. At that point speed starts to matter, as the cost of random lookup starts to drift further away from a standard mutable map; for JS we deemed the penalty too high and decided on another backend platform more amenable to writing generic data structures.


Yes, I evaluated it. The complexity of knowing where to convert from/to plain JS, plus the extra library syntax to learn, plus the performance cost of toJS, made it a poor fit for my particular use case. Nearly as much of a hard sell at work as saying “let’s rebuild the UI in Clojurescript,” and without providing as much benefit.

My use case is pretty atypical though, and it’s worth checking out if you have more reliable data shapes/update paths.


Does immer have the same drawbacks as Immutable.js? It uses proxies as opposed to hash array mapped tries.


The key design difference between the two is that immutable.js wants you to keep your stuff in its format (and operate on your stuff with its functions) until you hit some piece of code that needs a plain JS object. Whereas immer is scoped (mostly?) to update functions, and always returns plain JS. Immer seems to be designed to simplify the problem of only changing references to changed objects—and it looks like it does a good job of it.

So with immutable.js, you get more powerful data manipulation throughout the app, at the cost of having to know when and where to convert things back to plain JS. With immer, you get tightly-scoped immutable updates to objects, and inside the update function you can treat them as if they’re mutable, letting you write more idiomatic-looking JS. Instead of spreading 4 layers of objects to update one state flag, immer lets you say:

  const nextState = produce(state, (draft) => {
    draft.view.marketing.annoyingSubscribePopup = false
  })
Every object reference in that path will be updated, but the rest of “state”, “state.view”, etc. will be unchanged.

If you can keep everything inside of immutable.js, it is the fastest thing out there. As soon as you have to drop back to plain JS, it gets slower. See this performance graph: https://immerjs.github.io/immer/performance/

Thanks for reminding me of this. We finally dropped IE11 support, so may be able to get some benefits from introducing immer either by itself or by bringing in Redux Toolkit.


I imagine careless use of such a structure would be an easy way to create a memory leak. Is it possible to create persistent collections in js, that will free data no longer directly referenced?


The careless usage of shallow copies of objects (with JS spreading) presents the same issue. Properly scoping assigned constants and variables is still important.

Persistent collections are available today via immutable.js, so it can be done. The catch is that you have to use a library for it, and transform them back into plain JS when something expects an object or an array. The language itself could make this transparent and lower-cost by implementing it at the engine level.

Persistent collections are primarily useful in functional programming paradigms. JS is a multi-paradigmatic language, so it doesn’t make sense to use them as the default. It would sure be nice to be able to opt into using them in FP-aligned frameworks like the current React ecosystem though.


Relatedly, RRB trees for immutable vectors with good constant factors.


Splay trees:

These are binary search trees, but when you 'splay' the tree (such as by searching) it rebalances the tree so that the search result is on top. This means that while it is amortized O(log n) for operations, like other self-balancing trees, it optimizes tree depth in the face of non-random access, so that recently accessed items are fewer steps into the tree.

Piece tables:

A bit more common for text editors, where you need to represent a sequence of characters in a form that allows efficient memory use and fast insertions/removals (so editing the first line isn't moving the 10MB of data that follows it). You create a series of references to spans (or pieces) of the document, possibly starting with a single span pointing to a mmap() version. Edits are done by appending/prepending pieces, which are potentially just references to subsequences of items created in fresh memory, appended into a buffer. Saving can create a single sequence (and a single span).

This has interesting variations:

- Put attributes on the pieces for formatting, such as indicating text should be rendered a different color or bolded.

- Create a hierarchy of pieces-of-pieces. With formatting attributes you are then dangerously close to a DOM.

- Retain old copies of a piece table - since your original mmap() file hasn't changed and your changes are in an append-only buffer, those piece table copies provide undo/redo state.
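
A rough Python sketch of a piece table along those lines (names and details are mine, not from any particular editor): the original buffer is never modified, and edits only append to the add buffer and split pieces.

    # Piece table sketch: pieces are (buffer, start, length) spans over either
    # the read-only original text or an append-only "add" buffer.
    class PieceTable:
        def __init__(self, text):
            self.original = text
            self.add = ""
            self.pieces = [("orig", 0, len(text))] if text else []

        def insert(self, pos, text):
            new = ("add", len(self.add), len(text))
            self.add += text
            offset = 0
            for i, (buf, start, length) in enumerate(self.pieces):
                if offset + length >= pos:          # split this piece
                    cut = pos - offset
                    left = (buf, start, cut)
                    right = (buf, start + cut, length - cut)
                    self.pieces[i:i+1] = [p for p in (left, new, right) if p[2] > 0]
                    return
                offset += length
            self.pieces.append(new)                 # insertion at the very end

        def text(self):
            bufs = {"orig": self.original, "add": self.add}
            return "".join(bufs[b][s:s+l] for b, s, l in self.pieces)

    pt = PieceTable("hello world")
    pt.insert(5, ",")
    print(pt.text())   # hello, world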


Came here for splay tree. I’ve found some really powerful applications for this in low level database engine work.

Being able to incrementally rebalance a tree and keep the set of deltas small each time is really powerful when you are dealing with append only IO abstractions.


The VSCode Blog has a great article on Piece Tables talking about their adoption of that data structure https://code.visualstudio.com/blogs/2018/03/23/text-buffer-r...


Spatial hashing.

Say that you have data that is identified with points in 2D or 3D space. The standard way that CS students learn in school to represent this is via a quadtree or octree. However, these tree data structures tend to have a lot of "fluff" (needless allocations and pointer chasing) and need a lot of work to be made efficient.

Spatial hashing is the stupidly simple solution of just rounding coordinates to a grid and storing nearby objects together in a hash table. As long as your data is reasonably well behaved then you get many of the advantages of a quadtree, and it's completely trivial to implement.
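
A rough sketch in Python, assuming a fixed cell size and 2D points:

    # 2D spatial hashing sketch: snap coordinates to a grid cell and bucket
    # nearby points together in a dict keyed by the cell.
    from collections import defaultdict

    CELL = 10.0
    grid = defaultdict(list)

    def cell(x, y):
        return (int(x // CELL), int(y // CELL))

    def insert(x, y, obj):
        grid[cell(x, y)].append((x, y, obj))

    def nearby(x, y):
        cx, cy = cell(x, y)
        for dx in (-1, 0, 1):            # the cell itself plus its 8 neighbours
            for dy in (-1, 0, 1):
                yield from grid.get((cx + dx, cy + dy), [])

    insert(3.0, 4.0, "a")
    insert(12.0, 4.0, "b")
    insert(95.0, 95.0, "c")
    print([o for _, _, o in nearby(5.0, 5.0)])   # ['a', 'b']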


TBH even quadkeys are a fun answer to OPs question, many people aren't aware of them.

Simple explanation:

If you have data with x y coordinates and you know the bounds.

To compute the quad key for a point:

1. The key starts as the empty string. (I've also seen it start with "Z" to handle points outside the bounds)

2. Divide the space into 4 quadrants

3. determine which quadrant the point falls in, append a letter (A-D depending on the quadrant) to the key

4. Repeat step 3 using the quadrant bounds (i.e. recursively smaller area) until you have desired accuracy

This can then be used to efficiently find points within rectangular bounds.
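
A rough Python sketch of the scheme above (the quadrant-to-letter mapping is an arbitrary choice of mine):

    # Quadkey sketch: repeatedly halve the bounds and record which quadrant
    # the point falls into at each level.
    def quadkey(x, y, bounds, depth):
        minx, miny, maxx, maxy = bounds
        key = ""
        for _ in range(depth):
            midx, midy = (minx + maxx) / 2, (miny + maxy) / 2
            if y < midy:
                if x < midx:
                    key += "A"; maxx, maxy = midx, midy
                else:
                    key += "B"; minx, maxy = midx, midy
            else:
                if x < midx:
                    key += "C"; maxx, miny = midx, midy
                else:
                    key += "D"; minx, miny = midx, midy
        return key

    print(quadkey(0.5, 0.5, (0, 0, 16, 16), 4))    # AAAA
    print(quadkey(9.0, 3.0, (0, 0, 16, 16), 4))    # BACD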


A lot of people don't know about the related segment and interval trees. Both efficiently allow you to compute all time ranges that overlap a point. This can be generalized to higher dimensional spaces.

For example if you have a set of calendar invites with a start and stop time (say tens of millions of them) and you want to find all invites which span 12pm (that is start before 12 and end after 12) you want an interval tree.

https://en.wikipedia.org/wiki/Interval_tree


That's how mongodb geospatial indexes work IIRC


Was about to mention this. If I recall correctly, the 2d-sphere index rounds geospatial coordinates to 5 decimals. Very occasionally, I found it would distort polygon geometries just enough to cause them to become invalid (e.g. overlapping geometries), which causes the index build to fail.

In my recent experience working with collections containing millions of documents, each containing a geoJSON-style polygon/multipolygon representing a property (i.e. a block of land), I found invalid geometries occurred for about 1 document in 1 million. For a while, I suspected the data vendor was the cause; however, it became more puzzling when other geospatial software confirmed the geometries were valid. Eventually we traced the issue to the 2d-sphere index.

A very clever workaround was suggested by a colleague of mine, inspired by [1]. It preserved the original geometries. In each document, we added a new field containing the geometry's extent. A 2d-sphere index was then built on the extent field instead of the original geometry field. Invalid geometries were no longer an issue since we were dealing with much simpler geometries that were substantially larger than the max precision of the index.

When running geoIntersects queries on our collection of millions of documents, we did so in 2 steps (aggregation queries):

1. GeoIntersects on the extent field (uses the index).

2. On the result set from the last step, perform geoIntersects on the original geometry field (operates on a much smaller set of records compared to querying the collection directly)

[1] https://www.mongodb.com/docs/manual/tutorial/create-queries-...


Seems exactly like broad phase and narrow phase in a game physics engine.


The same things get invented over and over again and named different things depending on the field. Sometimes it's not immediately clear that they are the same things mathematically.


It's tried and tested for sure, I first encountered them in 94 but I assume they're much older.

A little sleuthing shows that (TIL) they are an application of Z-order curves, which date back to 1906


Hmm... the quad keys end up looking very much like coordinates. In the regular coordinates, the most significant bit divides the space in half, then the next bit divides those spaces in halves etc...

You could get a single "key" by just having a sequence of bits: the most significant bit of the x coordinate, the most significant bit of the y coordinate, the second most significant bit of the x coordinate, the second most significant bit of the y coordinate, the third...
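
A rough sketch of that interleaving in Python (this is the Morton / Z-order code mentioned elsewhere in the thread):

    # Build a single key by alternating the bits of x and y, MSB first.
    def interleave(x, y, bits=16):
        key = 0
        for i in reversed(range(bits)):
            key = (key << 1) | ((x >> i) & 1)
            key = (key << 1) | ((y >> i) & 1)
        return key

    print(bin(interleave(0b1010, 0b0110, bits=4)))   # 0b10011100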


From a basic binary search, this is very intuitive. Your explanation was well written.


It should be noted that this solution completely fails when the data is anything like real world location data which tends to be clustered in cities, stadiums etc. The strength of quad trees is precisely the fact that the "Alaska" quadrant can have a single child for that crazy outdoors-man using your gadget, while New York can have a million children trees down to every house on the street.


I’ve always found Morton encoding to be a cool approach. There’s an old but good blog post about this: https://www.forceflow.be/2013/10/07/morton-encodingdecoding-...

This looks like a more modern implementation: https://crates.io/crates/lindel


I didn't think that would be an obscure data structure, but Locality-sensitive hashing [1] helps in so many cases. Like nearest neighbour search, collision detection, etc.

[1]: https://en.wikipedia.org/wiki/Locality-sensitive_hashing


A neat use of these is the "fast multipole method" for simulating galaxies.

If you want to compute the gravitational force on a given star exactly, you need to figure out the forces from all other stars and add them up. This is O(n²) and slow. But it turns out you can get a good approximation by dividing up space with a k-d tree and using the average force from each cell to represent the stars underneath it in the tree. This gets you to O(n log n).

https://en.m.wikipedia.org/wiki/Fast_multipole_method


Ever heard of geospatial hierarchical Indices?

E.g. Uber's H3 index or Google's S2 index.


I have a very optimized octree, but one thing I cannot use it for is points in space. Mine is built with the assumption that every object has a finite size, which is used to determine what level of the tree it lives in. Points fail because they have zero size.

I'm still looking for a good spatial index for points. I'll think about yours.


Are these similar to space-filling curves like z-order indices? https://en.wikipedia.org/wiki/Z-order

Some more such curves: https://web.archive.org/web/20220120151929/https://citeseerx...


A variation of this are dynamic spatially-hashed voxel grids, where the outer grid is spatially hashed and the grid elements are stored (as needed) as dense voxel grids. The advantages of this are that you get voxel grid-like behavior over essentially unbounded size, so long as the data is sparse (otherwise you would need to allocate a significant number of grid elements as voxel grids and the memory savings go away).

These data structures have been used in 3D reconstruction methods like CHISEL where they can often outperform octrees due to better memory access behavior.


Couple some spatial hashing with Morton codes and fast parallel Radix sorting and some binary search derivative and you can do all sorts of fancy queries for collision detection and graphics stuff.


Yeah, just reinvented it a few weeks ago and was like “whoa, that’s it?” It worked perfectly for my fast clustering use case.


> reasonably well behaved

Curious what this means in this context. No hot spots? Not sparse?


Performance degrades when you get many hash collisions, which will happen if you have a lot of points clumped together in space.

Sparsity is fine.

For points you expect to be unevenly distributed the more advanced spatial partitioning data structures (kd tree, BVH) perform better.


Equality graphs (e-graphs) for theorem proving and equality saturation and other equality-related things.

They're awesome data structures that efficiently maintain a congruence relation over many expressions

> At a high level, e-graphs extend union-find to compactly represent equivalence classes of expressions while maintaining a key invariant: the equivalence relation is closed under congruence.

e.g. If I were to represent "f(x)" and "f(y)" in the e-graph, and then said "x == y" (merged "x" and "y" in the e-graph), then the e-graph, by congruence, would be able to tell me that "f(x) == f(y)"

e.g. If I were to represent "a*(2/2)", in the e-graph, then say "2/2 == 1", and "x*1 == x", by congruence the e-graph would know "a*(2/2) == a" !

The most recent description of e-graphs with an added insight on implementation is https://arxiv.org/pdf/2004.03082.pdf to the best of my knowledge.

P.S: I'm currently implementing them in Haskell https://github.com/alt-romes/hegg


Shameless plug, I also maintain an OCaml implementation of egraphs (named ego) at https://github.com/verse-lab/ego

While the most popular implementation at the moment seems to be egg in Rust, I find that OCaml serves as a much more ergonomic environment for quickly prototyping out uses of egraphs in practice. As a bonus, ego also shares the same logical interface as egg itself, so once you've finalised your designs, you shouldn't have much trouble porting them to egg if you need the performance gains.


I wonder if you could apply e-graphs to situations where the union-find data structure would be used, i.e. if there are any additional benefits gotten from the congruence relation.


More pointers and more processing in exchange for representing equivalence classes of many, often infinitely many graphs compactly and enumerating them efficiently. Usually not relevant when equivalence of simple nodes is the object.


Isnt that the building blocks of logic based programming languages like Prolog?


Yes, at least these egg-based projects are using this idea, but it looks like it's just one of the options:

https://souffle-lang.github.io/docs.html http://www.philipzucker.com/souffle-egg4/ https://arxiv.org/pdf/2004.03082.pdf


I had a similar thought when I learned about e-graphs. I'm not sure yet, but I think congruence closure is a slightly different problem from the one Prolog unification solves. In particular, if you tell an e-graph that `f(x) = g(y)`, it will take you at your word -- while Prolog would give a unification error. (An e-graph should also handle `X = f(X)`, while this would fail Prolog's admittedly often-ignored occurs check.)


Conflict-free replicated data type (CRDT) is super interesting data type (and algorithm), especially when it is implemented for complex data structures like JSON: A JSON CRDT is "[...] an algorithm and formal semantics for a JSON data structure that automatically resolves concurrent modifications such that no updates are lost, and such that all replicas converge towards the same state (a conflict-free replicated datatype or CRDT)." [1].

[1] A Conflict-Free Replicated JSON Datatype (https://arxiv.org/abs/1608.03960) by Martin Kleppmann and Alastair R. Beresford.


Finger trees allow you to do amazing things. In essence they let you build an index, or multiple indices, for your dataset and then store them in a single structure.

When I go from Haskell back to imperative land I find myself greatly missing this ability. Sure I can make multiple hashmaps or trees or whatever but being able to stuff it all in one data structure is amazing.

One structure I built with them that is much more difficult with typical structures is a "block list" structure.

In this structure each block has a particular width and they're all stacked side by side.

I want to perform a query, "which box is at position X". So if my boxes are of widths 7,20,10, then the lookup for 2 yields the first box, the lookup for 12 yields the second, etc.

More interestingly, if I add a new box between the second and third, the indices covered by the last box are increased.

With finger trees you use a sum monoid. This is easy. In other languages you have to roll your own structure, either using a list (with O(n) lookup) or a tree with O(log n) lookup but O(n) inserts (to translate the indices of future blocks).

https://andrew.gibiansky.com/blog/haskell/finger-trees/


You might enjoy a paper I’m a coauthor on which combines finger trees with B-trees to build a data structure for optimal updates to sliding windows: Optimal and General Out-of-Order Sliding-Window Aggregation, https://www.scott-a-s.com/files/vldb2019_fiba.pdf

Slides from the conference talk: https://www.scott-a-s.com/files/vldb2019_fiba_slides.pdf


Very interested in your idea. How do concurrent updates work with the augmented trees? I have been thinking about the same for a while: something like CouchDB, but augmented with segment aggregations (a monoid) so we can do interval queries over time too. But the only thing bothering me is that augmented trees are notorious for contention on concurrent updates. Do you have any ideas on merging back if we have multiple versions of augmented trees?


You can do this with any tree, finger or otherwise, and the big-O stuff applies as long as the tree is within a constant factor of balanced. For some bizarre reason, though, the concept of caching a monoid in each node doesn’t seem to have spread to the standard libraries of most languages. It’s a pity, since that would be incredibly useful.


Finger Trees Explained Anew is a great derivation of the data structure: https://www.youtube.com/watch?v=ip92VMpf_-A


If I'm not mistaken, Clojure's data structures are (or used to be) implemented using finger trees.


A variation on 32-wide Bagwell hashed tries, last time I looked. Scala's were too. Those are much more annoying to implement without garbage collection, FWIW.


Very interesting! Sounds similar to a segment tree: https://en.wikipedia.org/wiki/Segment_tree


  "This is the story of a clever trick that's been around for at least 35 years,     in which array values can be left uninitialized and then read during normal   operations, yet the code behaves correctly no matter what garbage is sitting in the array. Like the best programming tricks, this one is the right tool for the job in certain situations. The sleaziness of uninitialized data access is offset by performance improvements: some important operations change from linear to constant time."
https://research.swtch.com/sparse
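
A rough Python sketch of the membership check from that article (Python can't actually leave arrays uninitialized, so that part is only simulated; the point is that `contains` stays correct even if `sparse` holds garbage):

    # Sparse-set sketch for integers 0..MAX-1.
    MAX = 1000
    dense = [0] * MAX      # dense[0..n-1] lists the members
    sparse = [0] * MAX     # sparse[v] is v's position in dense, if v is a member
    n = 0

    def contains(v):
        # Works even if sparse[v] is garbage: the two checks cross-validate.
        return sparse[v] < n and dense[sparse[v]] == v

    def add(v):
        global n
        if not contains(v):
            dense[n] = v
            sparse[v] = n
            n += 1

    add(42); add(7)
    print(contains(42), contains(7), contains(8))   # True True False
    # Clearing the whole set is O(1): just reset n to 0.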


> The sleaziness of uninitialized data access

That's the perfect word for this kind of thing. It's not wrong, but it's sleazy alright.


I've used this type of data structure in the past for things like managing sprite DMA on old consoles. It's very friendly to old systems with simple memory access patterns.


Requires numbers inserted to be unsigned, and to have a maximum value within reason, and doubles memory usage.

Very nifty trick though!


I remember this from Programming Pearls and I was going to post about it after reading the comments. Great stuff.


I was going to mention Sparse Arrays, thanks for mentioning it. One of my favorite data structures.


A minor quibble with your use-case explanation: The advantage of a bloom filter isn't strictly time complexity. For example, a hash table would also have constant lookup time (best case), and would give a definitive answer on set membership. However, to store 1 million IPv6 addresses would take 16 MB. You can see very quickly that this would not scale very well to, say, a billion addresses stored in-memory on a laptop. With a bloom filter, we can shrink the amount of storage space required* while maintaining an acceptable, calculable false positive rate.

* IP addresses actually aren't a great use case for basic bloom filters, as they're fairly storage efficient to begin with, as opposed to a url for example. Taking your example, say we need to store 1 million IP addresses in our bloom filter and we're okay with a ~1% false positive rate. Well then, if we use a bloom filter with 2^23 bits (1 MB), the optimal number of hash functions is (2^23)/(10^6)*ln(2) = 6, yielding a false positive rate of (1 - exp(-6* 10^6 /2^23))^6 = ~1.8%. So we're using 6% of the storage space, but with a nearly 2% false positive rate.
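
For reference, the arithmetic in that footnote can be checked directly:

    # Checking the numbers above.
    from math import exp, log
    n = 1_000_000            # IPs stored
    m = 2 ** 23              # bits in the filter (1 MB)
    k = round(m / n * log(2))                  # optimal hash count -> 6
    fpr = (1 - exp(-k * n / m)) ** k           # false positive rate -> ~0.018
    print(k, round(fpr, 3))                    # 6 0.018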


Treap: https://en.wikipedia.org/wiki/Treap

It's like a Red-Black tree in use case, but much faster to implement, which is good for competitive programming. The average case complexity is the same for all operations, but there's an unlikely degeneration to worst-case linked-list behaviour.

Lazy Propagation Segment Tree: https://cp-algorithms.com/data_structures/segment_tree.html

Like a segment tree in that it supports range queries in logarithmic time, but supports range updates in log time as well.

I know a few more fun ones that I occasionally use in contests, but I've never touched otherwise.


Speaking of ease of implementation, I just discovered AA trees. They are probably not "obscure" but I think they are worthy of more fame because they perform just as well as red-black trees and are easier to implement. Finding clear, comprehensive documentation about them was not easy though, so here, for you, is the best I could find: https://www.cs.umd.edu/class/fall2019/cmsc420-0201/Lects/lec...

And the (unhelpful to me) original paper: https://user.it.uu.se/~arnea/ps/simp.pdf


The Wikipedia article on AA-Trees is good: https://en.wikipedia.org/wiki/AA_tree


The nicest thing about treaps is how easy union/intersection/disjunction are.


The Zipper acts like a linked-list with a cursor, or "focused element"; it's implemented as a pair of lists in opposite orders or, equivalently but more symmetrically, as a triple of (List[A], A, List[A])

Say we have a zipper containing [0, 1, 2, 3, 4, 5], and we're focusing on the 3. In code this will look like:

    ([2, 1, 0], 3, [4, 5])
Where [a, b, c] denotes a singly-linked list, with O(1) head (returning a) and tail (returning [b, c]). Notice that the first list is in reverse order.

To focus on the next element, we put the currently focused element on the first list, and pull the head off the second list:

    ([3, 2, 1, 0], 4, [5])
And vice versa to focus on the previous element:

    ([1, 0], 2, [3, 4, 5])
I like to imagine this as a string of beads, with the focused element held in our fingers and the rest hanging down:

     3
    / \
    2 4
    | |
    1 5
    |
    0
To move the focus forwards and backwards, we move our grip to the next/previous bead.

This works nicely as an immutable datastructure, since the tails of both lists can be shared (i.e. we don't need to copy the whole list, just wrap/unwrap an element onto the head)
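
A rough sketch of the mechanics in Python (using plain lists, so it doesn't show the O(1) structural sharing you'd get with linked lists, but the movement is the same):

    # A zipper is (before, focus, after); 'before' is kept in reverse order.
    def right(z):
        before, focus, after = z
        return ([focus] + before, after[0], after[1:])   # no bounds checks in this sketch

    def left(z):
        before, focus, after = z
        return (before[1:], before[0], [focus] + after)

    z = ([2, 1, 0], 3, [4, 5])
    print(right(z))   # ([3, 2, 1, 0], 4, [5])
    print(left(z))    # ([1, 0], 2, [3, 4, 5])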

Zippers were first applied to lists, but have since been generalised:

- To trees: focusing on one node, and being able to move the focus up, to the left-child, or to the right child

- To having more than one focus

- To datastructures more generally, by treating them as 'derivatives' (as in calculus)

https://en.wikipedia.org/wiki/Zipper_(data_structure)

As an example, the XMonad window manager uses a zipper to keep track of the currently-focused window.

I've also used Zippers for window management, e.g. having a list of Displays, with one of them focused (accepting keyboard input); each Display containing a list of Workspaces, with one of them focused (currently shown); and each Workspace containing a list of Windows, with one of them focused (the active window). Keyboard shortcuts could shift between the previous/next Window, Workspace, or Display.


I encountered this when I tried implementing a Brainfuck interpreter in Haskell. Representing the memory array using a List + index would be way too slow; but a Zipper is perfect!


I don’t understand why wouldn’t you just use a list and an index. You can always access list[index+1] or list[index-1] or list[0] or list[list.length-1].

What is the benefit here?


Firstly, accessing an arbitrary list index is O(N), whilst a Zipper can access its focus in O(1) (shifting the focus is O(1) for both Zippers and list+index)

Secondly, using an index forces us to do bounds checks, keep track of the length (or re-calculate it at O(N) cost), etc. whereas a Zipper is "correct by construction"; i.e. every value of type (List[A], A, List[A]) makes sense as a zipper; whereas many values of type (List[A], Nat) don't make sense as zippers; e.g. ([], 42) doesn't make sense. Our logic also becomes easy for pattern-matching, e.g. in Scala:

    myZipper match {
      case Zipper(xs, x, y::ys) => Zipper(x::xs, y, ys)
      case z => z
    }
Also, your example of 'list[index-1]' is more complicated than it first appears. In particular, what is the type of 'index'? If it's an integer, then at least half of the possible values are meaningless (negative numbers); if it's a natural (AKA unsigned integer), then it's not closed under subtraction (i.e. we have to worry about underflow)!

Thirdly, Zippers can be unbounded in both directions (i.e. the lists can be "infinite"/co-inductive). For example, they're great for implementing Turing machines, starting with an infinite blank tape in both directions. https://en.wikipedia.org/wiki/Coinduction

Fourthly, Zippers have more sharing. Here's an arbitrary example: say we have a zipper like this:

    ([c, b, a, ...], elem, [x, y, z, ...])
If we shift right, replace the focused element with 'newElem', then shift right again, we'll get this zipper:

    ([newElem, elem, c, b, a, ...], y, [z, ...])
Those operations take O(1) time and O(1) memory (and produce O(1) garbage, if we're no longer using the old zipper). In particular, since the tails of the lists ([c, b, a, ...] and [z, ...]) are unchanged, the same pointers will appear in both the old and new zippers; there has been no copying. This is important, since those lists could be arbitrarily long (or infinite!).

In contrast, if we did these operations on a list+index, everything before the replaced element would need to be reallocated; i.e. [..., a, b, c, elem, _] would need to be copied, since _ has changed from 'y, z, ...' to 'newElem, z, ...'. That requires O(N) memory allocation, takes O(N) time to do the copying (and produces O(N) garbage, if we don't want the old zipper).


> accessing an arbitrary list index is O(N)

I think you mean linked-list here. Parent is talking about "arrays".


> I think you mean linked-list here

Yes, I specified this explicitly, e.g. "The Zipper acts like a linked-list with a cursor" and "Where [a, b, c] denotes a singly-linked list".

> Parent is talking about "arrays"

You're using quotation marks, but I don't see what you're quoting? The parent explicitly says: "a list and an index"

In any case, arrays come with most of the same problems I mentioned for lists:

- They have invalid states, e.g. ([], 42)

- They require bounds-checking, a consistent notion of subtraction on the index type, etc.

- They can't be infinite

- They require O(N) time, O(N) memory (and O(N) garbage if we're discarding the old value) to replace the focused element

- etc.

Plus, arrays come with all sorts of extra gotchas, like buffer-overflows, pre-allocation/re-allocation decisions, over/under-provisioning, etc.


And they can be generalized to work on any kind of trees instead of just lists.


The benefit is the data sharing between the tails. You can very cheaply get a copy of zipper with the data at (or around) the focused element changed while also keeping the original data unchanged. Admittedly, this is something people in functional programming languages probably care much more about.


This zipper stuff is for immutable data structures. For us normal folk that just use & abuse arrays, we don't need zippers. :-) Just kidding. I am dying to find a use for zippers though; I'm not a functional programmer.


A list and an index creates the possibility that the index is out of bounds, and then your code probably has to check for that scenario and figure out how to handle it.

With a zipper such a scenario isn’t possible.


Some ones I've used recently:

The "golden section search" to find a the minimum (or maximum) of a unimodal function. An actual real-world use case for the golden ratio.

Exponentially Weighted Moving Average filters. Or: how to have a moving average without saving any data points.
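
A minimal sketch of the EWMA idea (alpha is whatever smoothing factor you pick):

    class EWMA:
        def __init__(self, alpha=0.1):
            self.alpha = alpha
            self.value = None
        def update(self, x):
            # new average = alpha * new sample + (1 - alpha) * old average
            if self.value is None:
                self.value = x
            else:
                self.value = self.alpha * x + (1 - self.alpha) * self.value
            return self.value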

Some of my classic favorites:

Skiplists: they do what sorted trees do, but the algorithms are simple, which is nice.

Boyer-Moore string search is awesome..

Bit twiddling algorithms from Hacker's Delight: for example (x & -x) isolates the least significant set bit: basically using the ALU's carry chain to determine priority.

Another is compress from Hacker's Delight. It very quickly compresses selected bits (from a bitmask), all to the right of the word. It's useful to make certain hash functions.

The humble circular queue. If your circular queue is a power of 2 in size, then the number of bits needed for indexes is at least log2(size) + 1. The + 1 is key- it allows you to distinguish between completely full and completely empty with no effort. So:

    Empty: (write_index == read_index)
    Full: (write_index == read_index + size)
    Count: (write_index - read_index)
    Insert: queue[write_index++ & (size - 1)] = data
    Remove: data = queue[read_index++ & (size - 1)];
All lossless data compression algorithms are amazing.


Regarding the bit “compress” operation, this is supported in hardware (PEXT instruction) by Haswell and newer processors, though unfortunately most AMD processors have poor implementations. I’ve found it to be quite handy when implementing subsetting operations.


In case anyone is interested, I wrote a skip list implementation for Node.js here: https://www.npmjs.com/package/proper-skip-list

I provided the time complexity of each operation as part of the README.


This is a great selection. Thanks for sharing.


HNSW, or Hierarchical Navigable Small World is a graph data structure for approximate nearest neighbor search of vectors.

https://arxiv.org/abs/1603.09320

The problem space of ANN is one of those really deep holes you can go down. It’s a game of balancing time and space, and it’s got plenty of fascinating algorithms and datastructures.

Check out http://ann-benchmarks.com/ for a comparison. HNSW is not “the best” but it’s easy to understand and is quite effective.


I second this.. very useful for vector NLP and other ML tasks.


Directed acyclic word graphs (DAWGs)!

They’re like tries, but with identical subtrees glued together. They’re capable of encoding real-world dictionaries of millions of words into ~1 byte per word, when stored cleverly (see [0]). They let you do things like efficiently finding a list of words in a dictionary whose Levenshtein’s distance to a given word is less than x. Their cousins, GADDAGs, are _the_ data structure of choice in Scrabble-playing engines.

[0]: http://www.jandaciuk.pl/fsa.html


Also related: Levenshtein automata -- automata for words that match every word within a given Levenshtein distance. The intersection of a Levenshtein automaton of a word and a DAWG gives you an automaton of all words within the given edit distance.

I haven't done any Java in years, but I made a Java package in 2013 that supports: DAWGs, Levenshtein automata and perfect hash automata:

https://github.com/danieldk/dictomaton


Tries (or prefix trees).

We use them a lot at Pyroscope for compressing strings that have common prefixes. They are also used in databases (e.g indexes in Mongo) or file formats (e.g debug symbols in macOS/iOS Mach-O format are compressed using tries).

We have an article with some animations that illustrate the concept in case anyone's interested [0].

[0] https://github.com/pyroscope-io/pyroscope/blob/main/docs/sto...


I wouldn't consider tries to be obscure tbh. They are the recommended solution for many leetcode-style interview problems involving string searching. I think anyone who has done serious interview prep has encountered them.

https://leetcode.com/discuss/general-discussion/931977/begin...


I got stung by this for a test engineering role. Apparently implementing them from scratch is a pre-req to building out test infrastructure!


The starting example was Bloom filters.


I'm not sure what the point of your post is. While bloom filters are heavily used in actual production code throughout the industry, it is very rare for anyone to need to code their own, or make changes to a prior legacy implementation.

Not all educational programs will cover bloom filters, and for those that do, there's no guarantee that the students will retain the information, and be able to recall it.

I don't know of any common interview question patterns where a bloom filter would be the only optimal solution. If it does share optimality with any other data structures, you would be better off choosing something else unless you are extremely confident in your ability to implement one and explain it clearly and in-depth, since the odds of your interviewer not being familiar with them are high in comparison to other solutions.

I only know what a bloom filter is because of HN, and previous submissions like this one that have made it to the front page throughout the years. It's nowhere near as common knowledge as tries, especially if you are sampling younger cohorts.


I just didn't think it was necessary to be judgemental/'gate-keep' about what's obscure or interesting 'enough' - better to let anyone share what's interested them and why.

To your points, personally I do fall in a 'younger cohort' I imagine; have come across bloom filters on HN way more frequently, and I dimly recall tries from university but I knew them as prefix trees. Without submissions like this one I wouldn't learn what a trie is/to make that alias in my head.


I was with you until the last seven words ;)

Trees were a huge part of CS practice and education historically, but have been replaced by hash-based methods in many cases. For example, in C++ std::map is generally a tree, while in a more recent language the standard Map data structure will be a hashmap. My impression is that the instruction time devoted to Bloom filters (and other hash-based methods) vs trees has shifted towards the former over the last ~20y.


C++ STL uses trees because the generic requirement for them to work is a < operator. Designing generics around hashing is much more difficult, and most languages struggle with that.

Ironically, due to caches, sorting and then using algorithms that rely on order tends to be superior to most hashing implementations, even though it is theoretically worse due to the log(n) factor. So in a way, C++'s algorithms are actually more modern.


> Ironically, due to caches, sorting and then using algorithms that rely on order tend to be superior than most hashing implementations

This doesn't match my experience at all. C++ trees are not cache-friendly; they're pointer-chasing (and there's no arena implementation in the STL). Second, any sort of ordering structure (be it through a tree or through sorting + binary search) is notoriously prone to branch mispredictions. They look great in a microbenchmark if you search for the same element all the time and the trees are small, but in a practical application, it can be incredibly painful. Pretty much by design, every choice is 50–50 and unpredictable.

std::unordered_map from most STLs isn't a fantastic hash table, but in my experience it absolutely destroys std::map (and is also comfortably ahead of something like std::lower_bound on a sorted array). All of this is assuming large-ish N, of course; for small N (e.g. below 30, usually), the best by far is a simple unsorted array and just going through it linearly.


> C++ trees are not cache-friendly

Agreed, they should use a B-tree to get cache locality and easy generics, but there is legacy code there.

I was referring to the performance of algorithms. For example `std::unique`, `std::lower_bounds`, etc. Many of these use sorted lists, whereas most other languages' standard libraries utilize hashing for these.

> is also comfortably ahead of something like std::lower_bound on a sorted array

I would be interested to learn more about when that's the case. But, it's also not very flexible. You can put an `int` in it, great. Can you put `std::pair<int, int>` in it? Does it work as well?


B-trees don't work well for CPU caches; 64-byte lines are typically too small to gain much. (OK, the M1 has 128-byte lines, but still.) And you still get the branch mispredicts, and a significant increase in code complexity, so hashing is still significantly better. (I've seen C++ projects where the main structure was a B+-tree because the lower levels could be residing on disk, but where they maintained a separate hash table in memory to skip the top first few B-tree levels!)

> I would be interested to learn more about when that's the case. But, it's also not very flexible. You can put an `int` in it, great. Can you put `std::pair<int, int>` in it? Does it work as well?

If by “it” you're talking about std::unordered_map (as key); yes, you can, but you'd have to supply your own hash function, because std::pair<A, B> does not have a std::hash specialization. (It's a bit sad, but the problem is highly specific to std::pair/std::tuple.) Likewise, you can use any arbitrary type you create yourself, as long as it has operator== and you've written a hash function for it.


With a modern Intel processor, it was much faster to scan through a 10k-integer array than to sort it first and then do binary search.

The small N is not that small anymore, and the speed difference between stack and heap is bigger now than ever.


To your point, C++ got hash structures in the standard library (unordered_map and unordered_set) in C++11 (so only about a decade ago).


A lot of this is that, other than heaps, most tree-based algorithms involve O(log n) random-access pointer lookups, which makes them relatively slow as in-memory data structures.


Trie != tree


Indeed, but it took a while for me to appreciate Tries.

Originally, I thought "nice idea", but I rarely ever encountered a problem where I really needed a prefix tree.

But while reading into HAMT and Clojure's data structure implementations, it dawned on me that prefix trees aren't only useful for strings (or arbitrary byte sequences): hashes in a dictionary, for example, are sequences of bits, too. And the design of a HAMT is exactly such that it doesn't concern itself with bytes but with any blocks of N bits (determined by the branching factor; i.e. a factor of 32 implies 5-bit blocks). And if you know the length of each sequence (hash), you know the maximum trie depth, which also works for you, not against you.

That was rather eye opening for me at the time.
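
To make that concrete, here's a toy sketch of slicing a hash into 5-bit chunks to get the path through a 32-way trie (nothing like a real HAMT implementation, just the bit-slicing part):

    def chunks(key, bits=5, levels=6):
        h = hash(key) & (2**32 - 1)                # pretend this is a 32-bit hash
        # each level of a 32-way trie consumes 5 bits of the hash
        return [(h >> (bits * i)) & ((1 << bits) - 1) for i in range(levels)]

    print(chunks("hello"))   # e.g. [7, 19, 2, 30, 11, 4] -- the path through the trie (varies per run)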


Even though a trie is pretty standard and expected (to be known) these days, I will vote for this. It was my first deep dive into more exotic data structures after an interview question about implementing T9 that stumped me many years ago.

https://github.com/Azeem112/tries-T9-Prediction


Do you have any documentation about how tries are used in mongo indexes? Or could you point to the source?


There's not much about it online. They do mention it in the docs, they call this "prefix compression" [0], so you can search for that. This article describes it [1], although it is somewhat high level.

They only use this for indexes. It works really well for internal mongo ids (they all start with timestamps if I remember correctly). It also works well for compound indexes. Pro tip: for compound indexes always go from low-cardinality attribute to high-cardinality attribute to get the best results from prefix compression (it's also good for speed of queries).

[0] https://www.mongodb.com/docs/manual/reference/glossary/#std-...

[1] https://medium.com/swlh/mongodb-indexes-deep-dive-understand...


https://en.wikipedia.org/wiki/Macaroons_(computer_science) are very interesting to me.

Think of it as a JWT that you can narrow down the authorizations for without needing to communicate with a server, so if you have read permissions to all your photos you can add a caveat saying `photoid=123456` and share it, and the recipient can only read the photo 123456. The caveats can be anything, including requiring third party authentication via third-party caveats. I've implemented systems with it being validated by a lua module in nginx which worked well, but the whole concept never took off.

I still think it seems like one of the most interesting authZ strategies around.


I’ve heard of these!

The fly.io blog has a really cool write-up about them.

Definitely an under appreciated concept.

IIUC, you can even validate the “sub issued” macaroons offline, provided you know the validity of one of its ancestors up the chain. Is this correct, or am I misunderstanding?


I think that's correct since each added caveat builds on the previous one.

I don't have the formal knowledge to understand the full underpinnings of it, but it always seemed to me (from implementing validation of them and looking at many authZ systems) that many of the sharing and delegation features of current apps would be so much neater in a macaroon based system.


here is the mentioned blog post for anyone interested (and lazy): https://fly.io/blog/api-tokens-a-tedious-survey/


Purely Functional Data Structures[1] by Chris Okasaki is worth reading. There's a book version if you prefer vs reading a thesis.

Even though the domain application is functional programming, these datastructures can come in handy when you want to enable state sharing / keeping old versions around without having to copy data.

[1] https://www.cs.cmu.edu/~rwh/students/okasaki.pdf


I loved the "Numerical Representations" chapter, on deriving new data structures by analogy to number bases.


Reservoir sampling is a statistical technique used to randomly select a finite number of elements from a population. The elements are chosen such that each element has an equal probability of being selected. This technique is often used when it is impractical to select a random sample of elements from a very large population.

To do reservoir sampling, you first need to decide how many items you want in your sample. This number, k, is called the size of the reservoir. You fill the reservoir with the first k items, and then each later item replaces a random reservoir slot with a probability that keeps every item equally likely to end up in the sample.

The code is beautifully simple:

  // first, fill the reservoir with the first k items
  for i = 0 to k-1
    reservoir[i] = population[i]
  // then replace items with gradually decreasing probability
  for i = k to population.length-1
    j = random integer between 0 and i, inclusive
    if j < k
       reservoir[j] = population[i]
  return reservoir


Reservoir sampling is really cool - a slightly optimized version (Algorithm L) lets you skip over every record that will not be sampled, and it is still pretty simple. If your records are fixed size this can be an awesome speedup.

  (* S has items to sample, R will contain the result *)

  ReservoirSample(S[1..n], R[1..k])

  // fill the reservoir array
  for i = 1 to k
      R[i] := S[i]

  (* random() generates a uniform (0,1) random number *)
  W := exp(log(random())/k)

  while i <= n
      i := i + floor(log(random())/log(1-W)) + 1
      if i <= n
          (* replace a random item of the reservoir with item i *)
          R[randomInteger(1,k)] := S[i]  // random index between 1 and k, inclusive
          W := W * exp(log(random())/k)
https://en.m.wikipedia.org/wiki/Reservoir_sampling


Indeed! Funny enough, I rewrote that article a few years ago, it previously contained an approximation of something like Algorithm L that some person with a blog came up with, having no idea that an even simpler and provably correct algorithm was published as early as 1994 :) Though others have improved the article a lot since then, adding explanations for how/why it works. Couldn't be happier to see it cited in this list!


I was going to mention it too! The really cool thing about reservoir sampling is that it can be done "online" (ie process input incrementally) which makes it super useful when you want to compute statistical properties of something in the field without blowing up your cpu and memory.

For example, let's say I have a server serving queries. I want to measure min/max/avg/stdev/99p you name it. You can do it cheaply with reservoir sampling, without having to save all data points.


Here is one not on the list so far:

Set Sketches. They allow you to compute the difference between two sets (for example to see if data has been replicated between two nodes) WITHOUT transmitting all the keys in one set to another.

Say you have two sets of the numbers [1, ..., 1million], all 32-bit integers, and you know one set is missing 2 random numbers. Set sketches allow you to send a "set checksum" that is only 64 BITS, which allows the other party to compute those missing numbers. A naive algorithm would require 4MB of data to be transferred to calculate the same thing.

(In particular, pinsketch: https://github.com/sipa/minisketch.)
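
For intuition, here's the degenerate single-missing-element version of the idea, where plain XOR works as a 32-bit "sketch" (pinsketch generalizes this to d missing elements at d*32 bits using BCH codes; this is just a toy):

    from functools import reduce
    from operator import xor

    full    = set(range(1, 1_000_001))
    partial = full - {456_789}            # imagine the peer is missing exactly one element

    sketch_full    = reduce(xor, full)     # 32-bit "checksum" of the full set
    sketch_partial = reduce(xor, partial)  # 32-bit "checksum" of the peer's set
    print(sketch_full ^ sketch_partial)    # 456789 -- everything else cancels out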


Thank you!

I always wondered about this problem and never knew what to look up to read more about it!


I’ve found that a couple of times in this thread. A way to describe a problem and have potentially applicable algorithms suggested would be cool


My favorite one that I've had the chance to use professionally is the Marisa trie[0].

Context is a data scientist had written a service that essentially was just lookups on a trie. He'd left and the service was starting to have memory problems so I dug in and swapped the implementation. Iirc swapping the trie implementation changed memory usage from 8gb to 100mb and sped everything up as well.

The actual data structure is equivalent to a trie, but cannot be modified once it's been built (I think it may be the same as a LOUDS trie, I don't remember the specifics)

[0] https://github.com/s-yata/marisa-trie


Struct of arrays (also called MultiArrayList in Zig), instead of storing big structs in an array you store each field in a separate array and if you need to fetch the full struct you reconstruct it from the arrays.

The benefit is that each array's items are smaller in memory and carry no padding; it also increases cache locality.


Now I remember reading a blog post recently, about an Areta library (one of the Rust ones, maybe?) that was doing both: array-of-struct-of-arrays.

The idea was to split the main arrays up into “chunks” that were structs, themselves containing rows of data (still array oriented).

The idea was that you retain the cache/layout/performance benefits of SoA layout, except each chunk is already packaged up into lengths ready for pushing through SIMD. Quite a clever idea.

Found it: https://www.rustsim.org/blog/2020/03/23/simd-aosoa-in-nalgeb...


Sounds like a primitive of columnar databases.


Entity-Component Systems do exactly this for better cache locality


this is a great way to store structured data in js since it saves the memory cost of having repeated keys. e.g.:

  // struct of arrays: one array per field
  records = {
    time: [1000, 1001],
    price: [20, 25],
    volume: [50, 15]
  }

  // vs. the usual array of structs, which repeats every key
  records = [
      { time: 1000, price: 20, volume: 50 },
      { time: 1001, price: 25, volume: 15 }
  ]
  // not a big difference with 2 records, but for xxxx records...


Decent JS engines will use "hidden classes" to dedupe keys for you already, so this isn't necessary to save space; the technique is pretty old and dates to Self. Still, the arrangement may help with locality of reference.


In practice, most js engines these days can ‘recognise’ the ‘class’ of these objects (if you create them from scratch in a few places) and the memory representation would end up with a word for the ‘class’ which says that time is at field 0 and price at 1 and volume at 2, and then the data itself. The main reason is to speed up code that reads the fields rather than memory use.


I remember BASIC in the 80s when all I had were arrays of integers or strings. To have complex data structures I used one array per field. I don't think the CPU had cache though (Z80) /grin


Wouldn’t that hurt locality? Since now you need to do multiple accesses across the entire heap to reconstruct one object.


It depends. For operations or aggregates on a single field, it improves cache locality, whereas if operations act on all, or most, fields of the struct, it hurts cache locality. The exact same tradeoff differentiates OLTP (transactional/row stored) databases and OLAP (analytics/column stored) databases.


You wouldn't typically use this when you need lots of random accesses, but when you process the data sequentially.

It also simplifies and speeds up SIMD loads and stores, because you can load/store entire SIMD registers with a continuous, non-strided memory access.


Huet's zipper. https://en.wikipedia.org/wiki/Zipper_(data_structure).

One zipper value represents a regular tree of data, but from the perspective of the "current position". There are traversal operations that take a zipper value and return the same tree from the perspective of an element above/beside/below the current position, and there are operations that return the same structure but with the current element changed, or items deleted or inserted.

Huet's paper is an easy read if you've been exposed to OCaml or similar languages: https://www.st.cs.uni-saarland.de/edu/seminare/2005/advanced... . His "returned glove" metaphor is what made it click for me.

Clojure includes an implementation of Huet's zippers https://github.com/clojure/clojure/blob/master/src/clj/cloju... that is <300 lines of code. It's very, very clever and broadly useful for dealing with XML or other nested data structures.


Most of the data structures posted here are taught in CS classes. Here’s an interesting list of more obscure ones: https://web.stanford.edu/class/cs166/handouts/090%20Suggeste...


Not all of us got (cough any) CS degrees though, and while technically the information is all out there for me to find, I wouldn't necessarily have a reason to look, or on those rare occasions where my sysadmin/DBA projects do throw up a problem that one of these algorithms/structures are suited for - I wouldn't know _to_ look, or even how.

That's possibly one of the better arguments for doing a CS degree actually, at least for me. But 20 years ago when I might have cared, CS was always advertised as "you'll learn data structures and algorithms" and that's it. ( 20 yr old me: "I know loops and recursion, linked lists, some basic trees and oh look this "perl" thing has hashes, meh I'll be fine" )

If they'd listed out the sort of descriptions I've seen here.. well, I might have had a very different life!

So I'm absolutely loving this thread!

And thank you for your link too; added to my list.


This is great!

Exactly what this thread is about; everybody should go through this.


Disjoint-Sets have a very cool implementation whose amortized time complexity is extremely slow growing. It is not quite constant, but even for a disjoint-set with as many elements as there are particles in the universe, the amortized cost of an operation will be less than or equal to 4.

https://en.wikipedia.org/wiki/Disjoint-set_data_structure
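
A minimal sketch with the two standard tricks (path compression and union by size) that give you that near-constant amortized bound:

    class DSU:
        def __init__(self, n):
            self.parent = list(range(n))
            self.size = [1] * n
        def find(self, x):
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]   # path compression (halving)
                x = self.parent[x]
            return x
        def union(self, a, b):
            a, b = self.find(a), self.find(b)
            if a == b:
                return
            if self.size[a] < self.size[b]:
                a, b = b, a
            self.parent[b] = a                                  # attach smaller tree under larger
            self.size[a] += self.size[b]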


I use the idea[1] at work to make large NP-complete problems more tractable by breaking the problem up into disjoint subsets.

[1] It doesn't implement the inverse-Ackermann optimizations, but it's still implemented as union/find.


aka "union find" — some other comments in the thread call it that


I've been told to consider it constant for all practical purposes


The inverse Ackermann function is one of the slowest-growing functions that you’re ever likely to encounter in the wild. To say that it’s “constant for all practical purposes” is 100% true but doesn’t do justice to how amazingly slow this function is to grow.

https://en.m.wikipedia.org/wiki/Ackermann_function


Previously on HN:

https://news.ycombinator.com/item?id=1370847 (2010)

https://news.ycombinator.com/item?id=2363522 (2011)

https://news.ycombinator.com/item?id=3369454 (2011) -- Although not much is in the discussion, it points to a Stack Overflow post with some examples.

https://news.ycombinator.com/item?id=7079427 (2014)

https://news.ycombinator.com/item?id=28758106 (2021) -- Not exactly a data structure, but an interesting algorithm.


My contribution: the Binary Numeral Tree: https://eprint.iacr.org/2021/038

This is a tree-like structure where the structure of the tree is fully determined by the number of leaves N. Specifically, each 1-bit in the binary representation of N corresponds to one of the tree's perfect binary subtrees. For that reason, I think of it more as "structure imposed upon a list" rather than a typical mutable tree with pointers and such.

What are the applications of this data structure? I am only aware of one: computing the Merkle root of a large amount of data in streaming fashion. The funny thing is that it's been discovered independently multiple times: in BLAKE3, Sia, Utreexo, and more. Maybe someday, someone will find another use for it. :)


Can you dumb this down for me?


It might be easier to think about it as a stack, rather than a tree. Each element of the stack represents a subtree -- a perfect binary tree. If you ever have two subtrees of height k, you merge them together into one subtree of height k+1. Your stack might already have another subtree of height k+1; if so, you repeat the process, until there's at most one subtree of each height.

This process is isomorphic to binary addition. Worked example: let's start with a single leaf, i.e. a subtree of height 0. Then we "add" another leaf; since we now have a pair of two equally-sized leaves, we merge them into one subtree of height 1. Then we add a third leaf; now this one doesn't have a sibling to merge with, so we just keep it. Now our "stack" contains two subtrees: one of height 1, and one of height 0.

Now the isomorphism: we start with the binary integer 1, i.e. a single bit at index 0. We add another 1 to it, and the 1s "merge" into a single 1 bit at index 1. Then we add another 1, resulting in two 1 bits at different indices: 11. If we add one more bit, we'll get 100; likewise, if we add another leaf to our BNT, we'll get a single subtree of height 2. Thus, the binary representation of the number of leaves "encodes" the structure of the BNT.

This isomorphism allows you to do some neat tricks, like calculating the size of a Merkle proof in 3 asm instructions. There's some code here if that helps: https://github.com/lukechampine/us/blob/master/merkle/stack....

You could also check out section 5.1 of the BLAKE3 paper: https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blak...
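
Here's a rough sketch of the streaming Merkle-root trick described above, assuming SHA-256 and a made-up 0x00/0x01 leaf/node prefix (real designs like BLAKE3 do the domain separation more carefully):

    import hashlib

    def h(data):
        return hashlib.sha256(data).digest()

    def merkle_root(leaves):
        stack = []  # list of (height, hash): one entry per 1-bit of the leaf count
        for leaf in leaves:
            node = (0, h(b"\x00" + leaf))
            # merging equal-height subtrees == carrying in binary addition
            while stack and stack[-1][0] == node[0]:
                height, left = stack.pop()
                node = (height + 1, h(b"\x01" + left + node[1]))
            stack.append(node)
        # fold up whatever perfect subtrees remain (the 1-bits of N)
        root = stack.pop()[1]
        while stack:
            root = h(b"\x01" + stack.pop()[1] + root)
        return root

    print(merkle_root([b"a", b"b", b"c"]).hex())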


Fenwick Trees (which, despite the name, are implemented using an array) allow computing prefix sums AND updating individual elements, both in O(log n) time. Very useful when n is on the order of millions. I have used them a few times in Project Euler problems.

https://en.wikipedia.org/wiki/Fenwick_tree
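
A minimal sketch (1-indexed; point update plus prefix sum, both O(log n), and i & -i is the same lowest-set-bit trick mentioned elsewhere in this thread):

    class Fenwick:
        def __init__(self, n):
            self.tree = [0] * (n + 1)
        def update(self, i, delta):           # add delta at position i (1-indexed)
            while i < len(self.tree):
                self.tree[i] += delta
                i += i & (-i)                 # jump to the next responsible node
        def prefix_sum(self, i):              # sum of positions 1..i
            s = 0
            while i > 0:
                s += self.tree[i]
                i -= i & (-i)                 # strip the lowest set bit
            return s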


Was going to mention Fenwick Tree here as well, but since you've already did, I'll just add that this is IMO a great introduction to Fenwick Trees: https://www.youtube.com/watch?v=kPaJfAUwViY


Is it possible to implement a Fenwick tree using a tree, and support quickly adding and removing items as well as quickly shifting the positions of suffixes?


Yes! This is both possible and fairly easy.


Yes.


Segment trees are objectively superior in all ways except implementation length


Another advantage to Fenwick Tree: while it shares asymptotic space complexity with Segment Tree, it has better constant factors, which can be useful in very restricted environments.


I used fenwick trees in a smart contract because space/time costs money.


Succinct Data Structures [0] [1]. It encompasses many different underlying data structure types, but the overarching idea is that you want small data size while still keeping "big O" run time.

In other words, data structures that effectively reach a 'practical' entropy lower bound while still keeping asymptotic run time.

[0] https://en.wikipedia.org/wiki/Succinct_data_structure

[1] https://github.com/simongog/sdsl-lite


OBDD: Ordered Binary Decision Diagrams. This data structure allows efficient symbolic operations on boolean functions. Widely used in electronic design automation software.

[1] https://en.wikipedia.org/wiki/Binary_decision_diagram [2] https://apps.dtic.mil/sti/pdfs/ADA470446.pdf


1. Merkle-trees. Helps with content-based-retrieval, efficient add/modify/delete/move node operations. Useful in scenarios where you are building a data-serialization format that needs to support efficient updates.

2. Not necessarily a data structure, but SMT solvers (like Z3) provide a very useful abstraction for certain kinds of problems. They also use some cool data structures like disjoint-sets among other things.

3. Lock-free datastructures


The SO question "What are the lesser known but useful data structures?" has a good list

https://stackoverflow.com/questions/500607/what-are-the-less...

[edit: the link above has been discussed on HN multiple times https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...]


The Aho-Corasick automaton [0]. You have a bunch of strings, and you want to know if and where they appear in a longer string. You build a search trie from the strings, but add links back from each node to the node representing the longest suffix of the string it represents that is also in the trie, so you can skip backtracking as you scan, yielding linear time complexity in O(length of string being searched + number of results found).

[0] https://en.wikipedia.org/wiki/Aho–Corasick_algorithm


The aho-corasick Rust crate[1] also provides a way to build the automaton in a slightly different way to provide leftmost first or "preference first" match semantics. This matches how popular backtracking regex engines implement alternations of literals, for example. (It also provides leftmost longest semantics, which matches how POSIX regexes implement alternations.)

[1]: https://docs.rs/aho-corasick


I really like this thread, but now I am concerned which of these will show up in a future interview.

Edit: forgot to add mine. Not obscure by any means but Fenwick Trees are very cool IMO

https://en.m.wikipedia.org/wiki/Fenwick_tree


"I remember a thread on HN about uncommon data structures. Do you also follow that site?"

It could be a good answer to start bonding with the interviewer. Then nobody knows everything. Go for a reference and show that you can understand it.


Sparse sets.

They're often used in Entity Component System architectures since you have O(1) add/remove/find, & O(n) iteration. Iterations are also packed, which is a bonus for cache efficiency.

Can be difficult to implement well, but the concept is simple and a neat example of a useful specialty data structure. Take a look at https://github.com/skypjack/entt
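
A rough sketch (same trick as the research.swtch.com post upthread; in Python the sparse array has to be zero-initialized, so you lose the "uninitialized memory" part, but the O(1) ops and packed iteration remain):

    class SparseSet:
        def __init__(self, max_value):
            self.dense = []                    # the members, packed together
            self.sparse = [0] * max_value      # maps value -> index in dense
        def contains(self, x):
            i = self.sparse[x]
            return i < len(self.dense) and self.dense[i] == x
        def add(self, x):
            if not self.contains(x):
                self.sparse[x] = len(self.dense)
                self.dense.append(x)
        def remove(self, x):
            if self.contains(x):
                last = self.dense.pop()        # swap-remove keeps iteration packed
                if last != x:
                    i = self.sparse[x]
                    self.dense[i] = last
                    self.sparse[last] = i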


You can apply the same idea to hash tables: store hashes and dense-array indices in your sparse array, during lookup use Robin Hood hashing to cache-efficiently probe the sparse array for a matching hash value, and if a match is found use the dense-array index to look up the data in the packed/dense array. This approach is both cache efficient and space efficient, as the dense array can grow independently of the sparse array.


Cuckoo hashing is really nice: fast O(1) access, amortized O(1) insert. And really space efficient, and it can be parameterized easily for a space efficiency vs. access time trade-off (it is always O(1), but with larger constants for better space efficiency). It is also very simple: basically just an array (plus an algorithm). Unexpectedly simple in my opinion for its good time and space performance.

https://en.wikipedia.org/wiki/Cuckoo_hashing


(EDIT: using "mail" because I was using "letter" too much in my description)

The way I think of a bloom filter is like an office mailroom or a wall of post office mailboxes where mail is grouped by the first letter of people's names.

Given someone's name, you can glance and see "Uptrenda? No letters for 'U'" very quickly. This is when a bloom filter returns FALSE. But if they glance and see mail in the "U" box, then they return MAYBE. After a MAYBE, to see if you actually have any mail, they need to actually go through the box (i.e. fall back to the real lookup, because the MAYBE may be a false positive).

Mine are a couple that are present in the Java Collections but I'm always disappointed that other languages' standard libraries don't include them:

- TreeMap/Set: A RedBlack tree. It keeps keys ordered, either naturally or by using the provided Comparator, but has log(n) time on basic ops.

    - https://docs.oracle.com/javase/8/docs/api/java/util/TreeMap.html
- LinkedHashMap/Set: keeps keys ordered by insertion order via a linked list, but can be configured to use access order, making it easy to use as an LRU cache. It keeps O(1) time but is less efficient than HashMap:

    - https://docs.oracle.com/javase/8/docs/api/java/util/LinkedHashMap.html


> other language's don't include them

std::map and std::set in C++ are actually a tree map and tree set.


> Good use-case: routing. Say you have a list of 1 million IPs that are [deny listed].

Apparently, bloom filters make for lousy IP membership checks, read: https://blog.cloudflare.com/when-bloom-filters-dont-bloom/

CritBit Trie [0] and possibly Allotment Routing Table (ART) are better suited for IPs [1].

[0] https://github.com/agl/critbit

[1] https://web.archive.org/web/20210720162224/https://www.harig...


That over-simplifies what the linked article says, plus there are things like blocked bloom filters and other tricks to speed things up.

Plus, if he's allocating 128MB, well, just do it as a direct bit array, 1 bit per IPv4 address (which can be optimised to remove a few special blocks, I guess), and skip any hashing.


Why not a Map? It will have O(1) search complexity.

If it should take IP blocks into account, then just put the ranges in a sorted index and you have logarithmic complexity for search (if speed is important and the blacklist is not huge, you can also expand the blacklisted ranges into individual IPs and put them in a Map so that you get O(1) lookup for IP-range search).


1. Probabilistic filtering and matching:

Since you mentioned bloom filters - other probabilistic data structures like count-min sketches (roughly, streaming bloom filters) are super useful. Approximate kmer methods like minhash and w-shingling use them in really cool ways. Rolling hash methods like Rabin chunking also work really nicely with probabilistic/streaming hash tables - splitting the data stream into chunks that can be consistently filtered or matched is useful in many circumstances! Think NSA-level data harvesting, or realtime genome/geospatial data classification.

2. Checking for, locating and counting subsequences in giant string datasets:

Wavelet trees, FM-indices, suffix arrays more generally - and structures that allow extreme performance in very niche situations with arbitrary encodings like bitvectors and compressed integer vectors. If you have a million books, or all the genomes ever sequenced, you use the first structures to find where a specific phrase can be found, even if you don't quite spell it right. You can use the latter ones to do super fast comparisons of giant datasets - given a million viruses, how similar is each one to the human genome, and where specifically do they most closely match?

3. Reconstructing structured information from (potentially noisy) fragments:

De Bruijn graphs (and related graphs like string graphs and colored De Bruijn graphs). This is a way to find all the overlap connections between fragments of information, and then weight the possible traversals of those connections by the evidence. These can be represented using the data structures from #1 (FM-indices for example), and efficiently used in some circumstances with those from #2 to enable some kinds of graph algorithms. If you have a shredded set of documents, or a billion short DNA reads from a genome sequencing experiment, this is how you reconstruct the original.

4. Decentralised coordination structures. Merkle-DAGs and Kademlia DHTs in particular. Being able to compare any trees by root-first hashes, and being able to request the content of the subtree for any node hash from an arbitrary set of peers - these structures enable the p2p web infrastructure. Everything from Limewire to bittorrent, IPFS and blockchains, and, most importantly, Sci-Hub.

1, 2 and 3 together are some of the fundamentals of computational biology. If you're interested in understanding them, https://rosalind.info/problems/list-view/ is a great starting place.


Funny you should mention bloom filters and that use case, I just re-read this post again this morning: https://blog.cloudflare.com/when-bloom-filters-dont-bloom/

Basically, it makes a similar point to https://news.ycombinator.com/item?id=32186837 — i.e., cache effects are not modeled in big O analysis, even though most problems that involve large amounts of data (so most problems) are memory bound rather than CPU bound. The blog post makes a good case for a very simple hash table as a better fit for this problem.


That Cloudflare article is a little frustrating.

> While we could think of more sophisticated data structures like Cuckoo filter, maybe we can be simpler

Yes, standard Bloom filters fail for large filter sizes and/or very small false-positive rates. But we've known this for decades, and tons of other probabilistic filters have come out since then to address the problem. Cuckoo filters in particular are incredible.

Was there really no easy way to bring in an open-source Cuckoo filter implementation? They're not that complicated. Maybe I'm just used to modern languages having good package managers.

Plus the author's solution is basically the idea behind Cuckoo filters: "what if we just use a hash table, but allow for collisions?" Cuckoo hashing is just clever about guaranteeing O(1) worst-case lookup.

And for absolute dead-simple filters, a blocked Bloom filter is basically just as easy as a Bloom filter. It's like two or three more lines of code. It's just changing the indexing operation to be modulo some large "block" size. That said, I don't think they'd work too well in this case, since the author has a pretty small false-positive rate (0.000019), and in my experience small Bloom filters (which is what comprise blocked Bloom filters) don't handle small FP rates well.

But guess what excels at small FP rates… cuckoo filters.


A linear probing hash table is simpler. That’s the trade off they were going for at that time. I don’t think that’s the most efficient solution given the hardware they were using, and I don’t think the blog author would either, but it’s certainly easier to write such a hash table — and it’s well written, but still mostly interview level stuff.

To me the blog post is not about cuckoo or bloom filters or hash tables at all. It’s about profiling and highlighting that a naive reading of the literature can easily lead you astray (worse performance and complexity). In school they don’t teach you that mov is the biggest cycle-eater of them all — at least it’s not the lesson people remember.


Cuckoo is terrible from a constant-cost point of view; linear probing is where it's at, indeed.

>"mov" is the biggest cycle-eater of them all.

The R part in 'RAM' is so wrong nowadays.


If you’re willing to use moderately more memory, you can get much better cache locality using a blocked bloom filter. The basic idea is to use an array of bloom filters that each fit into a cache line, and use some bits of hash to determine which sub-filter each item goes into. Suddenly each lookup or insertion has at most one cache miss!
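
A toy sketch of the blocking idea (the block count, block size and SHA-256-based hashing here are arbitrary choices; a real implementation packs each block into one cache line and uses much cheaper hashes):

    import hashlib

    class BlockedBloom:
        def __init__(self, num_blocks=1024, block_bits=512, k=4):
            self.blocks = [0] * num_blocks            # each block is one int used as a bitset
            self.num_blocks, self.block_bits, self.k = num_blocks, block_bits, k
        def _hashes(self, item):
            d = hashlib.sha256(item.encode()).digest()
            words = [int.from_bytes(d[i:i+4], "big") for i in range(0, 4 * (self.k + 1), 4)]
            block = words[0] % self.num_blocks        # the first hash picks the block...
            bits = [w % self.block_bits for w in words[1:]]   # ...the rest pick bits inside it
            return block, bits
        def add(self, item):
            block, bits = self._hashes(item)
            for b in bits:
                self.blocks[block] |= 1 << b
        def might_contain(self, item):
            block, bits = self._hashes(item)
            return all(self.blocks[block] & (1 << b) for b in bits)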


This post: http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...

"Scalability! But at what COST?"

In it Frank McSherry uses a handful of data structure tricks to make graph algorithms like connectivity and PageRank run progressively faster on his laptop, beating distributed cluster implementations. It's actually a version of their HotOS paper with Isard and Murray.

Admittedly it's the opposite of obscure, being kind of a myth busting piece about distributed processing being faster and how even largish data sets can fit on a single node (and even your laptop).


Addenum.

Link to paper: https://www.usenix.org/system/files/conference/hotos15/hotos...

COST acronym: "Configuration that Outperforms a Single Thread"


Probabilistic data structures are pretty cool.

Count-min sketch [1] is another one. It gives a reasonably accurate count of different events, even millions of unique events, in a fixed memory size. It shows up a bunch in high volume stream processing, and it's easy to understand like the bloom filter.

Another cool data structure is HeavyKeeper [2], which was built as an improvement on count-min sketch for one of its use cases: ranking the most frequent events (like for a leaderboard). It can get 4 nines of accuracy with a small fixed size even for enormous data sets. It's implemented by Redis's topk.

[1]: https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch

[2]: https://www.usenix.org/system/files/conference/atc18/atc18-g...
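
A minimal count-min sketch, for anyone who wants to see how little code it is (the width/depth and the SHA-256 hashing are arbitrary choices here):

    import hashlib

    class CountMinSketch:
        def __init__(self, width=2048, depth=4):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]
        def _index(self, item, row):
            h = hashlib.sha256(f"{row}:{item}".encode()).digest()
            return int.from_bytes(h[:8], "big") % self.width
        def add(self, item, count=1):
            for row in range(self.depth):
                self.table[row][self._index(item, row)] += count
        def estimate(self, item):
            # never under-counts; collisions can only inflate each row's counter
            return min(self.table[row][self._index(item, row)] for row in range(self.depth))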


Fractional Cascading. A simple and very cool way to speed up searching for the same key in multiple lists. Instead of K binary searches taking K·log(N) time, you can do it in O(log(N) + K) time without using asymptotically more space.

https://en.m.wikipedia.org/wiki/Fractional_cascading

I wrote a simple demo in Rust a while back to help myself learn the language.

https://github.com/mgraczyk/fractional_cascading

I also think pairing heaps are neat.


The Bag. Also known as a Multiset. I can't believe it took me so many years to learn the name of the most basic data structure imaginable - it basically makes no promises whatsoever about the structure. Think of it like a Set that allows duplicates - you can put stuff in and take stuff out, just like a bag - what more could you want?


I love the Bag! It's so simple I use it constantly


How is that any different from a list?


Two lists are equal only if they have the same contents and in the same order. Bags don’t have an order, they’re more like sets, so bags are equal so long as they have the same contents.


Hopefully this Java can illustrate the point even if you're not a Java lover

The bag essentially works like this Map<K, Collection<V>>

But looks like this Bag<K, V>

I use my own simple "BagSet" which works like this Map<K, Set<V>>


It's not a data structure but a really cool algorithm. Locality Sensitive Hashing. It allows like items to be hashed to the same value. So instead of a typical hashing functions that wants to avoid collisions this algorithm tries to optimize for collisions.

https://en.wikipedia.org/wiki/Locality-sensitive_hashing
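
One classic LSH family is random hyperplanes for cosine similarity: each bit of the signature records which side of a random hyperplane the vector falls on, so vectors pointing in similar directions tend to share signature bits. A toy sketch (dimensions and plane count are arbitrary):

    import random

    def lsh_signature(vec, planes):
        # one bit per random hyperplane: which side of the plane the vector falls on
        return tuple(1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
                     for plane in planes)

    random.seed(0)
    dim, num_planes = 8, 16
    planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]

    a = [1.0, 2.0, 0.5, 0.0, 1.0, 3.0, 0.2, 0.1]
    b = [1.1, 1.9, 0.6, 0.1, 0.9, 3.1, 0.2, 0.0]   # nearly the same direction as a
    print(lsh_signature(a, planes))
    print(lsh_signature(b, planes))                # near-identical vectors agree on most (often all) bits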


Fibonacci Heap

Famous as the data structure that achieved the optimal asymptotic runtime for Dijkstra's algorithm. In practice it isn't often used because simpler heaps outperform it except in very specific circumstances.

https://en.wikipedia.org/wiki/Fibonacci_heap

Soft heap

As far as I know this hasn’t really been implemented. It’s used for the best known minimum spanning tree algorithm. Pretty crazy. https://en.wikipedia.org/wiki/Soft_heap


Frugal streaming.

For estimating a median or percentile in a stream, using constant memory.

It's very simple: if the current value is higher than our estimate, then increase our estimate by one. Else decrease by one. It will converge over long enough time. This is called the Frugal-1U algorithm.

The Frugal-2U is a slightly more advanced version that modifies the step size along the way, plus other optimizations, to converge faster and not oscillate too much.

https://pastebin.com/wAcCS8WZ
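
The 1-unit-of-memory version really is tiny. Here's a sketch of Frugal-1U generalized to an arbitrary quantile q (q=0.5 is the median case described above; my paraphrase, not the pastebin code):

    import random

    def frugal_1u(stream, q=0.5):
        estimate = 0
        for x in stream:
            if x > estimate and random.random() < q:
                estimate += 1              # nudge up, with probability q
            elif x < estimate and random.random() < 1 - q:
                estimate -= 1              # nudge down, with probability 1-q
        return estimate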


An alternative streaming method due to Arandjelović, Pham, Venkatesh based on maximum entropy: https://ieeexplore.ieee.org/document/6971097 and implementation in C: https://github.com/dressipi/apv-median


Isn't that essentially a low-pass filter?


Do you know about:

* HyperLogLog? Convenient for approximate distinct counting (cardinality estimation) across huge datasets.

* SkipList is another probabilistic data structure: layers of linked lists let you skip ahead many elements in one step, giving expected O(log n) search. I believe ClickHouse uses it.

* Bitmap index organizes database pointers into a matrix of bits. The bitmap scans simply skip over the zeros. This type of index gets in trouble if you have high cardinality, though.


Lots of companies use skiplists without realizing it because Redis uses them as the implementation of sorted sets. For example, last time I checked, every time someone looks at a playlist on PornHub the software is doing a partial traversal of a skiplist behind the scenes.


Have you heard of Geohash? My mind was blown the first time I learned it. https://en.m.wikipedia.org/wiki/Geohash


This one is my favourite concept; bloom filters come right after it. The cool thing about Geohash is that you can expand its concept into 3 or more dimensions, and the whole idea just makes you think a bit differently about coding and operating over small tokens of data. Fantastic stuff.


Cuckoo filter

"A space-efficient probabilistic data structure that is used to test whether an element is a member of a set, like a Bloom filter does. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". A cuckoo filter can also delete existing items, which is not supported by Bloom filters. In addition, for applications that store many items and target moderately low false positive rates, cuckoo filters can achieve lower space overhead than space-optimized Bloom filters."

https://en.m.wikipedia.org/wiki/Cuckoo_filter


Latent Dirichlet Allocation (LDA) structure:

"LDA represents documents as mixtures of topics that spit out words with certain probabilities."

"LDA assumes that each document in a corpus contains a mix of topics that are found throughout the entire corpus. The topic structure is hidden - we can only observe the documents and words, not the topics themselves. Because the structure is hidden (also known as latent), this method seeks to infer the topic structure given the known words and documents."

https://cfss.uchicago.edu/notes/topic-modeling/

I'm making a next-gen search engine.


The Suffix Array is surprisingly cool and useful https://en.wikipedia.org/wiki/Suffix_array

The Alias Table for sampling a discrete distribution in O(1) time is a very clever idea https://www.keithschwarz.com/darts-dice-coins/


The fact that suffix trees/suffix arrays can be built in O(n) is IMO the most surprising and "magical" result in computer science. To me, more surprising than median of medians/quick select.

There might be more "magical"/surprising things that are obscure and I am unaware of, but to me, this is it. :)


what about randomized O(n) minimum spanning trees?


Hazard Pointers are an interesting concurrent data structure.

Suppose we've got a lot of Doodads, let's say there's a Graph of ten million Doodads, and a whole bunch (dozens? hundreds?) of threads are poking around in this same graph, maybe looking at Doodads and sometimes (but not often) removing them from the Graph.

What happens if my thread is looking at a Doodad, and meanwhile a different thread removes it from the Graph? Is the Doodad destroyed while I was looking at it? That sounds very bad. What if while I'm looking at it, the Doodad is destroyed, but, a nearly identical Doodad appears in its place. How would I know? Not acceptable.

We could reference count the Doodads, we'd need atomic reference counts, something like C++ shared_ptr -- but this would work. However now our performance is terrible, because we've got 100 threads increasing and decreasing these reference counts just to look at Doodads and each such change causes a write that must be observed on other threads.

So instead Hazard pointers go like this:

Every thread who wants to look at Doodads asks for a Hazard Pointer, usually these are never deleted once made because it's much easier. A thread sets this Hazard Pointer to point at a Doodad when it wants to look at one, it checks this was successful (if not the Doodad is gone now), then contemplates the Doodad (it can be in this state for an arbitrarily long period of time) and then it clears the hazard pointer once it is done. It can only point one Hazard Pointer at one Doodad (some schemes allow a thread to have say, two or some finite number of Hazard Pointers each)

Somewhere a Linked List of all these hazard pointers is kept. When you want to remove a Doodad, you take it out of the graph, then you check the List of Hazard Pointers. If any of them was pointing at the Doodad you removed, you put the Doodad on a special pile since it can't be deleted yet, otherwise you can delete it immediately because nobody else needs it any more. This is a relatively expensive operation, but you're only doing it for removal, not just reads.

Finally there needs to be some mopping up of that pile of "couldn't delete yet" Doodads by trying to delete them again (checking the Hazard Pointers again of course). This can be periodic, it can happen every time somebody tries to delete one, or whatever, the best policy may vary between applications.

This approach means reading Doodads costs only two local writes, no waiting around for the cache. Removing them is more expensive, but that's OK because it was rare.


Seems to me to be quite similar to the virtual pages in Bw-Trees: https://www.microsoft.com/en-us/research/publication/the-bw-...


Have you used this before? What was the domain?


These are not commonly used in application code, but are "commonly" used to implement reclamation in languages without memory management for concurrent mutable data structures.


I have seen this general pattern in several decently large game object systems, although I can't think of which ones offhand. If I'm recalling correctly, it tends to be used in the longer-lived (aka 10s of frames) object status/object existence checks, and often in AI subsystems, pathing and general decision making queries, etc.


Seems like games are an 'easy' case for this kind of thing, since you can synchronise every frame, and arbitrate destruction then; no?


For the most part. I think most of the usage was to allow more dynamic 'objects' and queries on them to be used in a low-friction way, without worrying about generational validity checks or null checks when reading data from those objects. It's a lot like implementing a deterministic GC for a small set of memory objects.


Fountain Codes: reconstruct data from random selection of packets.

Was deeply patent encumbered until recently.

https://en.m.wikipedia.org/wiki/Fountain_code


I searched this thread far and wide for fountain codes. Surprised they're so low on the list. I find them amazing. Also, the name is really good. The fact that you can collect any random packets in any order invokes the idea of putting a cup under a fountain and letting it fill up with water. Love it.


Skip lists! Super wacky, super fun to implement. O(log n) insertion/deletion/search on a list that is kept sorted. It's probabilistic; you have a number of linked-list levels, and on insert you flip coins to pick a level k, then insert the node into the lowest k linked lists.

also, radix heaps are underappreciated.


Splay trees: https://en.wikipedia.org/wiki/Splay_tree recently searched items are always near the top.


Nuts! I was going to say splay trees. Completely silly these days, like most trees, given caching hierarchies and bursty memory accesses, but a really neat data structure all the same.


My absolute favorite, Cuckoo hashing, has been mentioned already, but so far (239 comments in) nobody has mentioned this approach (does it have a name?) to storing an array A[] of N-bit numbers, most of which are zero or small: use N sets of numbers S[0..N-1], such that for each A[I] you store I in S[J] for every bit position J where A[I] has a set bit. In other words, S[J] holds the indices of the elements of A that have a set bit in position J.

The representation of S can be anything, and even dynamically adapting to the data. I’ve seen trees of bitmaps for example.
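
A tiny sketch of the idea, assuming 16-bit values and std::set as a stand-in for whatever representation of S you actually pick:

    #include <cstdint>
    #include <set>
    #include <vector>

    // Store indices per bit position instead of storing the values themselves.
    struct BitPositionIndex {
        std::vector<std::set<uint32_t>> S{16};   // S[j] = indices whose bit j is set

        void set(uint32_t i, uint16_t value) {
            for (int j = 0; j < 16; ++j)
                if (value & (1u << j)) S[j].insert(i); else S[j].erase(i);
        }
        uint16_t get(uint32_t i) const {
            uint16_t v = 0;
            for (int j = 0; j < 16; ++j)
                if (S[j].count(i)) v |= uint16_t(1u << j);
            return v;
        }
    };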


In practice Cuckoo hashing suffers, because reading from an unknown index in an array is not a constant-cost operation. It has an upper bound of course, but its average cost varies a lot depending on whether you hit L1, L2... or take a full cache miss.

"Cache misses" is what dominates performance nowadays.


Where I go there are no cache misses (it’s in hardware)


Disruptor queues are also fun. They are lock free multi-producer multi-consumer circular buffers. I figured out the basics on my own in a vacuum out of need, then discovered a relevant white paper.

I used one to implement a fast shared memory pub-sub message bus for communicating across dozens of processes in a simulation framework.


I don't quite see how that's obscure; it's a standard circular buffer. The fast/low-latency communication part comes from the busy wait (and L3 cache communication).


A standard circular buffer is only single-producer single-consumer. The producer manipulates the head and the consumer manipulates the tail.

Extending a circular buffer to allow multiple consumers is relatively straight forward; you just give each consumer its own tail and accept the loss of back pressure.

Extending it to allow multiple producers without introducing locks is where the complexity shoots up drastically.

But yes, the reason why you would want something like this is to have the semantics of message passing with the performance of direct IPC between userspace processes without going through the kernel.


>Extending it to allow multiple producers without introducing locks is where the complexity shoots up drastically.

You can have a single tail with an atomic add, and that's pretty much it. The consumer needs to know if the data is available, so there has to be a serialization point that each producer has to wait on - effectively a locking mechanism... or the consumer has to check every producer's progress.

It doesn't feel harder. The disruptor's fame mostly came from the fact that it didn't have to allocate new memory.

For example, writing a lock-free, capacity-limited FIFO requires a single long (64-bit) in each node, with producers reading the head (first) and then the tail. Same idea - 64 bits are too many to overflow when incremented one by one.

It has been a while since I last wrote lock-free stuff (the occasional CAS and COW don't count). My current job doesn't really need it.


> You can have a single tail with an atomic add, and that's pretty much it.

That's the primary difference. In a disruptor queue, there's two tails. The first one is used for producers to allocate space in the buffer to write to, and the second one is used to commit it so that it's available to consumers.

It's true that there is a small amount of necessary synchronization across producers, because a producer can't commit its data before a concurrent producer earlier in the buffer does, and can end up spinning on an atomic compare-and-exchange. They can both independently write to their allocated parts of the buffer freely before that point, though, so in practice it's not a contention point.


> there's two tails

My point about the sync part with the producers is that the consumers can't read from a producer's slot before it is ready. Another option would be NOT to use a shared tail as a contention point but to mark each record as done - that needs a write/write memory fence, so even if the consumers start reading, they can bail out if the writing producer has not committed yet.

> the buffer freely before that point, though, so in practice it's not a contention point.

Likely there is false sharing between the producers, unless the implementation is careful enough to allocate cache-line-sized records.


Do you have a link to the white paper you found?


This doesn't look like what I remember, but it sounds right.

https://lmax-exchange.github.io/disruptor/disruptor.html#_de...

The basic idea is that you maintain two atomically incremented counters. You increment one to allocate some buffer space to write your message into, and you increment the other once the message is written in. There's some tricky details to get right when multiple producers have messages staged up at the same time.
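
To make the claim/commit idea concrete, a minimal C++ sketch (my own toy version, not the LMAX code: power-of-two capacity, busy-waiting instead of proper wait strategies, and a single consumer assumed):

    #include <atomic>
    #include <cstdint>

    template <typename T, size_t N>   // N must be a power of two
    struct ClaimCommitRing {
        T buf[N];
        std::atomic<uint64_t> claim{0};    // next slot handed out to a producer
        std::atomic<uint64_t> commit{0};   // slots published to consumers: [0, commit)
        std::atomic<uint64_t> read{0};     // slots consumed so far

        void publish(const T& v) {
            uint64_t seq = claim.fetch_add(1);              // allocate my slot
            while (seq - read.load() >= N) { /* spin: buffer full */ }
            buf[seq & (N - 1)] = v;                         // write privately, no contention
            uint64_t expected = seq;
            while (!commit.compare_exchange_weak(expected, seq + 1))
                expected = seq;                             // wait for earlier producers to commit
        }

        bool try_consume(T& out) {                          // single consumer assumed
            uint64_t r = read.load();
            if (r == commit.load()) return false;           // nothing published yet
            out = buf[r & (N - 1)];
            read.store(r + 1);
            return true;
        }
    };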


Thank you!


Linked Hash/Tree Maps, simple, but elegant. A Map with its nodes connected in a linked list so you can traverse them in insertion order (and O(n) time). Very useful for window queries over sequential data and other cases where you want FIFO access, but also quick access by a field of the data.
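
For illustration, a minimal C++ sketch (a hash map for O(1) lookup plus a list that remembers insertion order; real implementations thread an intrusive list through the hash nodes, this one just leans on library types):

    #include <iterator>
    #include <list>
    #include <unordered_map>

    template <typename K, typename V>
    class LinkedHashMap {
        std::list<std::pair<K, V>> order_;   // insertion order
        std::unordered_map<K, typename std::list<std::pair<K, V>>::iterator> index_;
    public:
        void put(const K& k, const V& v) {
            auto it = index_.find(k);
            if (it != index_.end()) { it->second->second = v; return; }  // keep original position
            order_.emplace_back(k, v);
            index_[k] = std::prev(order_.end());
        }
        const V* get(const K& k) const {
            auto it = index_.find(k);
            return it == index_.end() ? nullptr : &it->second->second;
        }
        // Iterate in insertion order.
        auto begin() const { return order_.begin(); }
        auto end() const { return order_.end(); }
    };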


You forgot the most important feature over normal hash maps: they offer a deterministic iteration order without incurring the cost of a tree-based ordered map.

(If you don't know why this is important then maybe you haven't worked on large systems that undergo rigorous evaluation)


Modulo the snark, you're completely right. Iteration order = insertion order is super valuable for a bunch of reasons (this one included).


Is (non-)determinism really the right concern here? I’m aware that most hash tables do not have generally predictable iteration orders, but I nevertheless understood them to be deterministic.


They are deterministic in the sense that provided everything else is the same, iteration order would be the same after each insertion.

However, it can change after insertion (if the hash bucket count changes). If the hash function is random-keyed, iteration order would be different for different instances of the hash table even if you insert the same items in the same order.

Sometimes you want the order to be consistent every time. Be it insertion order, or some natural order of the items.


>Is (non-)determinism really the right concern here?

A massive one - there is a lot of code that implicitly depends on the order of iteration (think configurations/initialization). The issue might be invisible, and reproduce rarely in production only. The iteration order depends on the hash of the keys, plus the capacity of the underlying array.

In Java I advise never using the standard HashMap and reaching for LinkedHashMap instead. In cases where data density is important, both are a terrible fit, since they allocate a node per entry instead of using a plain object array (with linear probing/search).


Well, hard/impossible to predict perhaps. Iteration order can depend on the order things were inserted and deleted and may differ from computer to computer (for example in Julia the hash-based dictionary ordering differs between 32 and 64 bit systems, and might change between versions of Julia, etc - you’d see the same thing with C++ unordered maps, etc, etc).


The same thing was a huge issue in Python until around 3.6, when dicts started preserving insertion order.


You can do this without a linked list and it can be quite performant, like python’s newest dictionary implementation (appeared around Python 3.6 or 3.7 from memory).


Why would Linked Hash Map be an obscure one? It's been part of the Java collections framework for over 20 years. It's a standard for LRU caches as well.


Hence the "simple but good". It's not really obscure, but also not on the common path. I've lost track of the number of times I've pointed this one out to a colleague or student.


Crit-bit trees [0]. Also known as TRIES, with an interesting use described in "TRASH - A dynamic LC-trie and hash data structure", which combines a trie with a hash function [1].

Also ROPE, a string allowing for prepends, substrings, middle insertions and appends [2]

[0] http://cr.yp.to/critbit.html

[1] https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.96....

[2] https://en.wikipedia.org/wiki/Rope_%28data_structure%29


CDAWGs. Compressed Directed Acyclic Word Graphs.

They are like tries but share prefixes and suffixes of a word corpus, can be built in linear time, and since they are graphs, one can define an inner product between two CDAWGs and use them for kernel-based machine learning algorithms.

https://hal-upec-upem.archives-ouvertes.fr/hal-00620006/docu...


Roaring bitmaps are compressed bitmaps and they're pretty great. https://roaringbitmap.org/about/


Trivial, but very useful - phase accumulator / numerically controlled oscillator.

With two unsigned integers, freq and phase, you get amazingly fine control over timing. You only use the top N bits of phase for output. It's saved my bacon in embedded systems more than once.
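
For illustration, a minimal C++ sketch (freq is the per-tick phase increment; the wrap-around of unsigned arithmetic does the hard work):

    #include <cstdint>

    // Numerically controlled oscillator: phase wraps naturally on unsigned overflow;
    // output frequency = freq / 2^32 * tick_rate.
    struct NCO {
        uint32_t phase = 0;
        uint32_t freq  = 0;              // phase increment per tick

        // Call at a fixed tick rate; returns the top N bits of the phase.
        uint32_t tick(unsigned top_bits = 8) {
            phase += freq;               // wraps modulo 2^32 for free
            return phase >> (32 - top_bits);
        }
    };

With a 1 MHz tick, freq = 4295 gives roughly a 1 Hz output (4295 / 2^32 ≈ 1e-6), and sub-hertz tuning resolution comes essentially for free.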


I saw this and bloom filter was my very first thought, so I'm glad you had it.

We used it to check whether you had voted on a discussion on reddit. In almost all cases the user had not voted, so it was a lot quicker to check that you hadn't voted than that you had.


A GADDAG[1] is a combination pre-/suffix trie that's used for move generation in Scrabble. Unlike a traditional dictionary to find anagrams for words, a GADDAG makes it easier to identify plays based off of available letters already on the board. I made a little visualization of how they work for a data visualization class a few years back.[2]

[1] https://en.wikipedia.org/wiki/GADDAG [2] https://www.axioms.me/gaddag/#?page-5



Never heard of this one until it came up in a google interview. Totally botched it. I couldn't think of any way to do it to their liking that doesn't require a dynamic cast.


> ...bloom filters... Good use-case: routing. Say you have a list of 1 million IPs that are black listed...

Seems-needed clarification: Bloom filters suffer from false positives, but not false negatives. So if BloomFilterBlacklistCheck( $IP ) returns false, your router can safely conclude "not blacklisted". But if it returns true, then you'll need to perform further checks. (And you can't just add a second Bloom filter of known false positives to screen those out - that filter will suffer from false positives of its own, potentially letting in traffic from blacklisted IPs.)


The humble array but with a twist for accessing its indices in a hardware multiplexer way with 'shift left' and 'or' bitwise operations.

    /* Bits: selected, hovered */
    const colors = [
      grey,  // 00
      green, // 01
      blueA, // 10
      blueB  // 11
    ]
    const color = colors[selected << 1 | hovered]

https://blog.uidrafter.com/bitwise-table-lookup


No hardware multipliers here: left shifts are handled by much cheaper hardware [1], and are almost part of the basic arithmetic logic unit taught in school -- it can do addition, subtraction, bitwise operations, shifts by one, and maybe shifts by any number less than their word size.

[1] https://en.wikipedia.org/wiki/Barrel_shifter


Multiplexer (not multiplier)


My mistake! They look so similar.


Structures good for Geospatial information like rtrees, quadtrees.

https://en.m.wikipedia.org/wiki/R-tree

Also concurrent data structures.

https://youtu.be/jcqGSehrMGU


Yes, R-tree turned out to be a godsend when working with larger geospatial data and calculating intersections.


Yep, that's what I used it for! Intersections. Hard to make concurrent though. Ended up with concurrent map backed geohash.


I've used quadtrees for some cool stuff I can't really talk about. All the things you think binary trees are good at but with x,y coordinates, or any n-tuple of coordinates as you suggest.


Also LSM trees!


https://github.com/facebook/folly/blob/main/folly/docs/Packe...

is pretty awesome and I’ve personally gotten big wins from it.

The parent mentioned Bloom Filters, HyperLogLog-type stuff is on that family tree and also very interesting and very useful.

But the coolest thing lately by far? Swiss table (and the largely equivalent F14).

That thing is a game changer in certain scenarios.


Years ago as part of my thesis research I created a rather obscure family of data structures called DFI Filters, or DFilters for short. The work centers around indexing typed trees such as abstract syntax trees or a web browser DOM, so that you can answer arbitrary queries of the form "give me the set of all descendants of node P that are of type T" in what in practice collapses to O(1) in the number of tree nodes.

There are several versions of the data structure, but the general setup is a sort of "tree of trees": the main tree is a binary search tree organized by the depth-first indices of the nodes in the tree being indexed, and there are links into each type-specific variant of the main tree (in other words, simplified versions of the main tree containing only one type).

The key intuition behind the whole thing, and the reason I called them DFI Filters, is that if you are at some particular node in a tree, and you know ahead of time the depth-first indices of your node and its siblings, you already know exactly how many descendants your node has, based on the difference in DFI between your node and its depth-first successor. You can then build a variety of indexing approaches and data structures on this insight -- as an undergrad I came up with ones that query in truly constant time but are expensive to update, and in grad school I built a version that updates on the fly but still retains amortized O(1) querying and amortized O(m) updating, where m is the size of the update.

By the way if you're curious how I can get constant time when I'm returning a set of descendants, it's because I can give you the size and pointers to the first and last elements right off the bat.

The link to the original paper seems to be down (looking into it), but you can find the master's thesis covering all variants here: https://github.com/sam0x17/hierarch_old/blob/master/Master's...

I've been working on a rust implementation lately as well. There are a number of applications for DFilters including fast file-system searching, DOM traversal, very fast static analysis on abstract syntax trees, and more.


Monotonic stacks are neat.

I made up a data structure once, consisting of a pyramid of deques. It lets you efficiently compute any associative function over a streaming window of data.
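
On the monotonic stack side, the classic illustration is "next greater element" - a minimal C++ sketch:

    #include <stack>
    #include <vector>

    // For each element, the index of the next element to its right that is
    // strictly greater (or -1 if none). O(n) overall: each index is pushed
    // and popped at most once.
    std::vector<int> nextGreater(const std::vector<int>& a) {
        std::vector<int> res(a.size(), -1);
        std::stack<int> st;                       // indices with decreasing values
        for (int i = 0; i < (int)a.size(); ++i) {
            while (!st.empty() && a[st.top()] < a[i]) {
                res[st.top()] = i;                // a[i] is the first greater element
                st.pop();
            }
            st.push(i);
        }
        return res;
    }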


I recently learned about monotonic stacks thanks to this LeetCode problem: https://leetcode.com/problems/sum-of-total-strength-of-wizar...

I feel like they're quite possibly the most deceptively simple data structure I've yet to encounter. That is to say that for me at least there was/is a wide gulf between simply understanding what the data structure is (doesn't get much simpler than a stack!) and when/how to actually apply it.


Do folk genuinely get asked problems like that in technical interviews these days? I can't think of any plausible reason why anyone would want to do a calculation as contrived as that. I love the required modulo value too.

I assume the trick is to somehow leverage the distributive property of multiplication to refactor the data to no longer require as much iteration? Is there supposed to be a linear time solution?


> Do folk genuinely get asked problems like that in technical interviews these days?

No idea! I'm still in interview prep mode though so I'm doing the LeetCode contests every weekend. Even if practicing problems like these is over-preparing I think it's probably a good idea for me because I imagine I'll be much more nervous and less performant during an actual live interview.

> Is there supposed to be a linear time solution?

Yes! Using a monotonic stack to find the "pivot points" of the contiguous subarray groups as well as both a prefix-sum array and prefix-product array to calculate the sum of all subarrays in each group in O(1) time after initially building the arrays in linear time.

There are some great explanations in the discussion section. This is one of the few problems that I got stuck on and wasn't able to solve on my own after a couple days of trying. The LeetCode community is awesome though and getting to see how other people solved the problem is a great educational resource.


You may be interested in the work my coauthors and I have done on data structures and algorithms for streaming aggregation. GitHub repo for the code, with high-level descriptions of their properties and pointers to papers: https://github.com/IBM/sliding-window-aggregators


That is indeed very interesting!


The LIFO buffer. Also surfaces as the messy desk, the million email inbox, or ten-thousand browser tabs. It really works at progressively sorting frequently accessed items to the top, even accounting for changing or seasonal habits and preferences.


https://en.wikipedia.org/wiki/Fibonacci_heap?wprov=sfla1

Fibonacci heap is theoretically better than Binary heap but the cache behavior is terrible

https://en.wikipedia.org/wiki/Van_Emde_Boas_tree?wprov=sfla1

Well, I saw this in CLRS tho. Very clever way of abusing bit patterns for quick range query in O(log log M) where M is the integer size.

https://en.wikipedia.org/wiki/Suffix_array?wprov=sfla1

A simpler way to do text matching since suffix array is the preorder traversal of a suffix trie constructed using the same string. Not heavily used in practice, but I do know bioinfo people used it a lot.

https://study.com/academy/lesson/robin-hood-hashing-concepts...

Not a data structure, but still very underrated. Robin Hood hashing basically means you track each entry's probe distance - how far it sits from the slot its hash points at. On insertion, whenever the incumbent entry's probe distance is smaller than the new element's, you swap them and keep inserting the displaced one, to "steal from the rich" and even out the expected access distribution - hence Robin Hood. It got more prominent again thanks to Rust: the standard HashMap originally used Robin Hood hashing with linear probing (its current implementation, hashbrown, is a port of Google's SwissTable).

https://en.wikipedia.org/wiki/Skip_list?wprov=sfla1

The ordered linked list with a fast lane every log(n) levels. Not used very much, but one place I know of that uses it is Redis.

https://en.wikipedia.org/wiki/Double-ended_priority_queue?wp...

Basically one priority queue that does two things. Can ironically be implemented with two priority queues (dual)


A suffix array is useful as a part of binary diff, finding a longest matching substring to copy from an old file. Bsdiff makes heavy use of this, it's used in Courgette for patching after preprocessing.

And once you have a suffix array, you're on the way to the Burrows-Wheeler Transform for e.g. bzip2 compression! https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transf...


Interval trees. Really cool trees that allow fast lookups of all the ranges that contain a given point.

https://en.wikipedia.org/wiki/Interval_tree


Found out about this DS in AoC 2021 Day 20+ problems if I remember correctly. Pretty interesting.


PR Quadtrees (https://opendsa-server.cs.vt.edu/OpenDSA/Books/CS3/html/PRqu...) got me my current job at Google. One of my interviewers asked basically "how would you design Google Maps?" and I based my answer on this data structure.


Looked to see what these things were. Realised it might be the reason I made a career in IT: I used this structure (I called it a box) to calculate residues (https://en.wikipedia.org/wiki/Residue_(complex_analysis)), which I figured would be simpler in an object-oriented language, hence I taught myself C++. This was during my PhD (theoretical physics).

Of course, I never used C++ again (I later learned an object-oriented language 'properly' using Smalltalk, which I also never used again in real life). Now, 25 years later, I see for the first time that I was using an (adaptation of) PR quadtrees.


Why subdivide so far? I find it better to have leaf nodes contain an array of up to some small number of points, e.g. 20. That reduces the pointer chasing significantly, and a linear search through such a small array is very fast.


The entire space of probabilistic and sublinear data structures is fascinating. From hyperloglog for high cardinality uniqueness counting, to Cuckoo filters (similar to bloom filters), or Count–min sketch for frequency counting.

https://en.wikipedia.org/wiki/Category:Probabilistic_data_st...


Due to data immutability, a lot of the standard data structures in Haskell are quite interesting.

Eg `Data.Sequence` (https://hackage.haskell.org/package/containers-0.6.5.1/docs/...) is based on Finger Trees (http://staff.city.ac.uk/~ross/papers/FingerTree.html) which allow O(log(min(i,n-i))) inserts (at i) and O(1) inserts at the beginning or end even thou the data structure is basically copy-on-write.

In general Haskell has a lot of specialized data structures. One of my favorites is `Data.IntMap` (https://hackage.haskell.org/package/containers-0.6.5.1/docs/...) which is a specialized map for integer keys (based on Patricia trees iirc).

(Man I miss working in Haskell :-( )


Not a data structure, and not really obscure, but I am still amazed by the concept of solving DP (dynamic-programming) problems in O(log N) instead of O(N) time by writing the solution as a Fast Matrix Exponentiation equation: https://www.geeksforgeeks.org/matrix-exponentiation/


Not technically a data structure, but Reservoir Sampling is an interesting probabilistic sampling technique used to choose a random sample from a population of unknown size N in a single pass, so useful for a data structure that does not fit in memory.

https://en.wikipedia.org/wiki/Reservoir_sampling
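
For illustration, a minimal C++ sketch of Algorithm R (assumes an istream-like source and that the caller supplies the RNG):

    #include <random>
    #include <vector>

    // Keep a uniform sample of k items from a stream of unknown length, one pass.
    // Usage: auto s = reservoirSample<int>(file, 10, rng);
    template <typename T, typename Stream>
    std::vector<T> reservoirSample(Stream& stream, size_t k, std::mt19937& rng) {
        std::vector<T> sample;
        size_t seen = 0;
        T item;
        while (stream >> item) {
            ++seen;
            if (sample.size() < k) {
                sample.push_back(item);               // fill the reservoir first
            } else {
                std::uniform_int_distribution<size_t> d(0, seen - 1);
                size_t j = d(rng);
                if (j < k) sample[j] = item;          // replace with probability k/seen
            }
        }
        return sample;
    }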


In a Scrabble AI coding contest I discovered the GADDAG which is like a trie but indexes all of the words forwards and backwards from any starting point within the words.

https://en.wikipedia.org/wiki/GADDAG#:~:text=A%20GADDAG%20is....


Finger trees. They work on an arbitrary monoid and maintain a list of items. You can insert and remove items in logarithmic time, as well as ask for the sum of items in a given range, also in logarithmic time.

This is useful for example when you want to convert between offsets and line/column numbers efficiently, while you also want to efficiently update/insert/replace lines.


Vantage-point trees. Great structure for nearest neighbour searches.

Bonus tip: When you have a fast distance function (like a hamming distance) and write your tree from scratch in C (which almost automatically means you'll optimise it for your use case), they can become absurdly fast - much faster than the general-purpose nearest neighbour libraries I'm aware of.


Nearly 300 comments and no one said

Rainbow Tables.

Sure, they're not used anymore and are not particularly useful nowadays, but I find the idea behind them very beautiful.


The Hierarchical Timing Wheel is an efficient data structure/algorithm for managing timers (event scheduling) when: 1. The variance of the timers is large. 2. Timers are likely to be cancelled. 3. A fixed (configurable) precision is acceptable.

This talk provides a nice overview of different timing wheel implementations, including hierarchical and hashed timing wheels.


Too late to edit but I forgot to paste the link to the timing wheels overview talk that I mentioned, so here it is:

https://www.youtube.com/watch?v=AftX7rqx-Uc


Nm, thanks:)


I think you had not posted the link to the talk!


I don't know whether it already exists or if it has a name, but internally I call it Virtual List.

- Reasoning: It's used in cases where you'd ideally use an array because you want contiguous memory (because you'll usually iterate through them in order), but you don't know beforehand how many elements you'll insert. But you can't use a resizeable version like std::vector, because it invalidates any pointers to the elements you've already allocated, and you need to store those pointers.

- Implementation: As a black box, it's used like a list: you index elements with an integer, and you can append to the end. But internally it uses different lists. Like a std::vector, it starts with an allocated size, and when it needs a bigger size, it allocates newer blocks that are "added" to the end, but doesn't move the already allocated elements.

- Downsides: Data is not all contiguous, at some point there are jumps between elements that should be close together. For performance when iterating, it's negligible since those cache misses happen "rarely" (depends on the block size, bigger block size means better cache performance but more potential wasted allocated memory), and most of the time you're iterating from start to end. But this means that you can't use this structure with algorithms that use "pointer + size", since internally it doesn't work like that. And since internally you can't shrink the lists and invalidate pointers, there's no way to delete elements, but in my case it's not a problem since I'd reuse indices afterwards with an added structure.

- Applications: I used it when coding games, when you have a big list of entities/components but you don't know how many beforehand. You access them by index, you might store pointers to elements, and most of the time you'll iterate through the whole list (when updating, drawing, etc.).
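
A minimal C++ sketch of the fixed-block-size variant described above (hypothetical name StableVector; assumes default-constructible, copy-assignable elements, grows but never shrinks):

    #include <memory>
    #include <vector>

    // Fixed-size blocks, indexed like an array; pointers to elements are never
    // invalidated by growth, since existing blocks are never moved.
    template <typename T, size_t BLOCK = 256>
    class StableVector {
        std::vector<std::unique_ptr<T[]>> blocks_;
        size_t size_ = 0;
    public:
        T& operator[](size_t i) { return blocks_[i / BLOCK][i % BLOCK]; }

        size_t push_back(const T& v) {
            if (size_ % BLOCK == 0)                    // need a new block
                blocks_.emplace_back(new T[BLOCK]);
            (*this)[size_] = v;
            return size_++;                            // return the element's index
        }
        size_t size() const { return size_; }
    };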


I think you're describing an unrolled linked list: https://en.wikipedia.org/wiki/Unrolled_linked_list

Although specifically, in an unrolled linked list, the blocks are connected by a chain of pointers from one to the next. You can alternatively just keep pointers to each block in a top-level array. it's not clear from your description which you're doing.

The JDK has a twist on the latter, which it calls a spined buffer, where the blocks double in size every time one is added, which makes sense if you have absolutely no idea how many elements there will be when you start.


Yeah that's pretty similar. I keep a list of pointers to blocks and they're all the same size, that way you can easily address elements via a linear index. I tried doing the doubling size thing like std::vector, but the addressing became more complex and I didn't find a need for it, but I might consider it in the future. But yeah, it seems like it's almost the same kind of structure. Thanks for the pointer!


So like a std::deque? If the block size varies then lookup will be O(log B) where B is the number of blocks.


I remember checking whether I would use std::deque, and I don't remember exactly why but I decided to implement this other system instead. I think it was because you can invalidate pointers with it if you delete somewhere in the middle, either manually or by calling an algorithm with it.

Of course you have the option to just not do that in your code, but it's nice to have a hard guarantee that this can never happen, even by accident. I could have used a deque as a backend and wrapped it in the class, though.


Not really obscure, but I had never heard of it till recently (and most people I've talked had no idea about it): Difference Arrays.

Great for when you need to do ranged updates in constant time.

More info: https://codeforces.com/blog/entry/78762
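
For illustration, a minimal C++ sketch: a range update touches only two cells, and one prefix-sum pass recovers the real array.

    #include <vector>

    // diff[i] = a[i] - a[i-1]; range updates are O(1), materializing is O(n).
    struct DiffArray {
        std::vector<long long> diff;
        explicit DiffArray(size_t n) : diff(n + 1, 0) {}

        void rangeAdd(size_t l, size_t r, long long v) {   // inclusive range, O(1)
            diff[l] += v;
            diff[r + 1] -= v;
        }
        std::vector<long long> materialize() const {       // O(n)
            std::vector<long long> a(diff.size() - 1);
            long long run = 0;
            for (size_t i = 0; i + 1 < diff.size(); ++i) { run += diff[i]; a[i] = run; }
            return a;
        }
    };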


Good! I've always been fascinated by non-boolean (probabilistic) indexing.

I remember implementing an OOP indexed version of U.C. Berkeley's Cheshire-II in 2007 with Dolphin Smalltalk.

I was using BTrees though, hence O(log N). The non-boolean part was in the values of the keys, which were probabilities of a good match against your search target.


Can I describe a data queueing problem that I feel like there is a specific data (or queue) structure for, but that I don't know the name is?

Let's say you are trying to "synchronize" a secondary data store with a primary data store. Changes in the primary data store are very "bursty", one row will not change for days, then it'll change 300 times in a minute. You are willing to trade a bit of latency (say 10 seconds) to reduce total message throughput. You don't care about capturing every change to the primary, you just want to keep it within 10 seconds.

It feels like there should be a clever way to "debounce" an update when another update overrides it 500ms later. I know debounce from the world of front-end UI where you wait a little when doing autocomplete search based on keyboard input so as not to overwhelm the search.


There are a variety of solutions to this, but CRDTs are a very good one (https://en.wikipedia.org/wiki/Conflict-free_replicated_data_...). If the operations you're doing commute (that is, a ○ b = b ○ a, i.e. the order in which you apply the operations doesn't matter) then you can apply all operations in parallel and only send the final result of doing them all. Cassandra uses LWW-Element-Set CRDTs to solve this exact problem (https://cassandra.apache.org/doc/latest/cassandra/architectu...).


Ideally you have a reliable Change Data Capture (CDC) mechanism like a Binlog Reader. Debezium, for example, can write directly to a queue like Kafka. A Kafka consumer picks up the events and writes to your secondary datastore. Something like that can probably handle all of your events without bucketing them, but if you want to cut down the number of messages written to the queue you can add that logic into the Binlog Reader so it emits a burst every 5 seconds or so. During those 5 seconds it buffers the messages in the processes memory or externally in something like Redis using a key so only the latest message is stored for a given record.


If you want everything to be within 10 seconds, then you build a state-change tracker (which only tracks all state changes since last update) and then you send the updates every 10 seconds.

Don't worry about debouncing - the state tracker should handle representing the 300 updates as a single state change, and if there are more then they just get sent in the next update 10 seconds later.


Sentry used Redis to buffer writes for a similar use case:

https://blog.sentry.io/2016/02/23/buffering-sql-writes-with-...


I've seen this done with kinesis streams.

Basically just update a cache, and forward the results every so often.

If you put things in a stack, you can keep using the most recent for the update. Compare times and you won't add a bad value


Concurrent tries with non-blocking snapshots [0]

Say that you have a dataset that needs to be ordered, easily searchable, but is also updated quite frequently. Fast accesses are a pain if you decide to use traditional read-write locks.

Ctries are entirely lock-free, thus there is no waiting for your read operations when an update is happening, i.e. you run lookups on snapshots while updates happen.

They are also a lot of fun to implement, especially if you aren't familiar with lock-free algorithms! I did learn a lot doing it myself [1]

[0] http://aleksandar-prokopec.com/resources/docs/ctries-snapsho...

[1] https://github.com/mabeledo/ctrie-java


Sounds like if you want a version of this that doesn't leak memory you need a garbage collected language?


Yeah, either that or you implement your own. Doing so with reference counts might not be too hard, though.


The reference count on each element would presumably require atomic operations that could race with those used to update the actual data structure, and therefore you'd lose all your concurrency guarantees?


Generally speaking, this shouldn't be an issue if you perform batch collections.

Either way, any new mechanism to release memory would need to take into account both the generation and the reference count, and the atomic release should be performed in the same GCAS fashion.


Associative memory? Store data in them and they will never complain they are full, but they may "forget" instead. Querying them with slightly wrong keys, they'll auto-correct the keys as long as they are not overloaded.

Binary fuse filters? (Xor|Bloom) filters on steroids.

Non-deterministic finite state transducers? Reversible, more elegant and more powerful version of non-deterministic finite state recognizers (transitions are labelled with two outputs, which can be interpreted as inputs versus outputs, but these roles can be swapped as they are symmetric).

Ropes? What, you are still using strings? Dude, please stop writing that newbie editor.

Caches with Time aware least recently used (TLRU) or Segmented LRU (SLRU) cache replacement strategies (Oh, you thought Least recently used (LRU) was cool?)


FSTs are one underappreciated data structure. They're used in places like speech recognition and synthesis, machine translation, optical character recognition, pattern matching, string processing, machine learning, and information extraction. If you know about them, you know exactly where to use them.


+1 if your input and output are both sequences, FSTs can be your best friend…Unless you prefer the company of the flaky cool kids from deep learning (CTC, RNN-T, …)


Hashed time wheels: https://blog.acolyer.org/2015/11/23/hashed-and-hierarchical-...

Great for short-lived values in a cache that are frequently accessed. If all of your values always have the same expiration period (i.e. expires at insert time + N period) then it's super efficient.

More or less you have an array with a head and tail. Every "tick" (usually each second but you can customize it) you move the read pointer ahead by 1 array slot. The tail is then evicted.

If you have a highly accessed part of your code, you can be lazy and avoid using an extra thread and just tick ahead the read pointer when the time changes.
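
For illustration, a minimal C++ sketch of the fixed-TTL wheel described above (my own toy version: every entry lives exactly N ticks, one bucket is evicted per tick, no hashing of arbitrary deadlines):

    #include <cstddef>
    #include <vector>

    template <typename T, size_t N>
    class TimerWheel {
        std::vector<T> slots_[N];
        size_t cursor_ = 0;
    public:
        // New entries expire N ticks from now: put them in the slot the cursor is
        // currently on (it was already evicted this tick and comes back in N ticks).
        void insert(const T& item) { slots_[cursor_].push_back(item); }

        // Advance one tick; returns everything that just expired.
        std::vector<T> tick() {
            cursor_ = (cursor_ + 1) % N;
            std::vector<T> expired = std::move(slots_[cursor_]);
            slots_[cursor_].clear();
            return expired;
        }
    };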


We all know about the immutable vectors of clojure (tries, basically). I always preferred the RRB-tree which is the same thing, but relaxes the leftwise dense requirement. This means you have a bit-partitioned trie until you do a split, merge or insert, after which you get a slightly slower operations, but still very competitive.

It is actually a quite trivial change until you come to the merge algorithm which is finicky in all the wrong ways. The benefits are (in clojure terminology) O(1)ish splits, inserts and concatenations instead of O(n).

I don't know if it can be called obscure anymore, since Scala's vectors are RRB-trees and clojure actually has an implementation of them, but I often see people talk about ropes in situations where an RRB tree should be a no-brainer.


Good old Calendar Queues: https://en.m.wikipedia.org/wiki/Calendar_queue

These things are common in discrete event simulation, but are useful anytime you have a time-based queue.


A sorted list. It's often cheaper to re-sort the list after each change than to search the whole list. Especially when you do collision detection in 3D... (maybe not so obscure, but at least underestimated) When I pair my socks I first place them in color order :)


Note that re-sorting an array after making a constant number of changes can be done in O(n) and will usually be very fast in practice. Insertion sort will do it, as will Timsort and similar algorithms.


I was reading up on C++'s standard library and found out that maps and sets are generally implemented as balanced trees, though "flat" variants backed by continually sorted arrays are a common alternative.



Dancing links https://en.m.wikipedia.org/wiki/Dancing_Links

“ In computer science, dancing links (DLX) is a technique for adding and deleting a node from a circular doubly linked list. It is particularly useful for efficiently implementing backtracking algorithms, such as Donald Knuth's Algorithm X for the exact cover problem.[1] Algorithm X is a recursive, nondeterministic, depth-first, backtracking algorithm that finds all solutions to the exact cover problem. Some of the better-known exact cover problems include tiling, the n queens problem, and Sudoku.”


Not super obscure, but I remember that one specific time when circular-linked-list made a lot of sense to use, well I wanted to use it so I used it.

I had a bunch of API keys with poor expiry documentation and implementation, so to find out whether a key had expired it had to be used. I put them in a main "keys.pop loop" and all the methods below tried to use the key. If the HTTP response was (some other obscure HTTP status code like) 505, I simply called `continue;` in the loop to jump to the next one, without caring at all where I was before.

https://github.com/atedja/gring


Not sure you could call it a data structure as such, but Latent Semantic Indexing[1] kinda blew my mind when I learned about it.

The fact that you could perform linear algebra on documents and words, to get out a number that somehow quantified how close a document was to another really surprised me back in the day. Grew a new appreciation for what exactly those vectors in linear algebra could represent.

Not in the field so don't know how obscure it is though.

[1]: https://en.wikipedia.org/wiki/Latent_semantic_analysis#Laten...


Not sure whether they classify as obscure, but I haven't see cited already:

- Dominator tree (https://en.wikipedia.org/wiki/Dominator_(graph_theory))

- Single-Connected-Components-Graph

- Deterministic data structures (eg. a set that acts deterministic to the fact that addresses might be somehow randomly assigned, very useful for ensuring reproducibility)

Already cited, but it's clearly among the most elegant:

- union-find (!!!!)

and as a bonus one that is easily overlooked:

- std::deque, which when restricted to push_back() or push_front() guarantees never to move objects around.


I love the set reconciliation structures like the IBLT (Iterative Bloom Lookup Table) and BCH set digests like minisketch.

https://github.com/sipa/minisketch

Lets say you have a set of a billion items. Someone else has mostly the same set but they differ by 10 items. These let you exchange messages that would fit in one UDP packet to reconcile the sets.

Minisketch is more costly in CPU but has deterministic guarantees. IBLTs are very fast but have a chance of failure requiring a randomized iterative procedure.


The XOR Linked List blew my mind when I first read about it. Clever!

https://en.wikipedia.org/wiki/XOR_linked_list
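
Just to show the trick: each node stores prev XOR next in one field, and traversal recovers the next pointer from the node you came from. A minimal C++ sketch of the traversal only:

    #include <cstdint>

    struct XorNode {
        int value;
        uintptr_t link;   // (uintptr_t)prev ^ (uintptr_t)next
    };

    inline XorNode* xor_ptr(XorNode* a, uintptr_t link) {
        return reinterpret_cast<XorNode*>(reinterpret_cast<uintptr_t>(a) ^ link);
    }

    // Walk forward from `head` (whose "previous" node is nullptr).
    void traverse(XorNode* head, void (*visit)(int)) {
        XorNode* prev = nullptr;
        for (XorNode* cur = head; cur != nullptr;) {
            visit(cur->value);
            XorNode* next = xor_ptr(prev, cur->link);   // next = prev XOR link
            prev = cur;
            cur = next;
        }
    }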


Horrible for today's hardware, but I also came to say this.


Also, the algorithms around using Reduced Order Binary Decision Diagrams (ROBDD) to represent sets efficiently. They are used in verification algorithms with very large state spaces, such as "Symbolic Model Checking: 10^20 states and beyond" by Burch, Clarke, and McMillan <http://www.cse.chalmers.se/edu/year/2012/course/TDA956/Paper...>


The BK-Tree, which allows fast querying of "close" matches, such as Hamming distance (number of bits different). http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK...

I wrote a Python library implementing them a number of years ago: https://github.com/benhoyt/pybktree


Since you mentioned Bloom filters, here are ribbon filters: https://engineering.fb.com/2021/07/09/data-infrastructure/ri...

Edit: another that I enjoyed working with 'Burst Tries': https://dl.acm.org/doi/10.1145/506309.506312


It must be [CHAMP](https://blog.acolyer.org/2015/11/27/hamt/#:~:text=CHAMP%20st....). Acronym for Compressed Hash-Array Mapped Prefix-tree. It is a current state of the art persistent hashtable. I'm surprised that no one mentioned this before, because it is super useful!


> Golomb Coded Sets are similar to bloom filters but the storage space is much smaller. Worse performance though.

Humm. A link to these would have been good[1], but anyway I understand that an optimally filled bloom filter takes ~1.44 times more space than the optimal theoretic value (which I understand Golomb encoding gives you?), so 'much smaller' is not a helpful measure. Likewise the 'worse performance' is not a helpful phrase, I believe a linear lookup time is needed for decoding, but a small amount of space for an extra indexing structure can speed things up lots. Bloom filters are updatable (addition only in the simplest bloom filter), golomb coded sets practically can't be updated except by rebuilding AIUI.

I suppose binary heaps should get a mention <https://en.wikipedia.org/wiki/Binary_heap> cos they barely seem to figure in other comments here. The neat bit is they are trees but implicit (stored in an array) not explicit (which would use pointers) and are always balanced giving you log access times in a compact form.

[1] <https://en.wikipedia.org/wiki/Golomb_coding> but there are simpler explanations on the web


SeqHash

This data structure is an immutable, uniquely represented Sequence ADS (Authenticated Data Structure) with various interesting use cases.

It is based on a novel hashing scheme that was first introduced with SeqHash.

See http://www.bu.edu/hic/files/2015/01/versum-ccs14.pdf by Jelle van den Hooff, a brilliant young researcher back then.

SeqHashes are uniquely represented Merkle trees that also represent a sequence. Concatenation of two SeqHashes is O(log(n)). In turn, the cryptographic signing of SeqHashes is O(1) as each tree node carries a cryptographic hash.

Of course, having each node carry a hash incurs a hefty overhead, but that can be alleviated by (lazily) grouping nodes into buckets (turning it into some kind of B-tree).

SeqHashes also don't support splitting in O(log(n)) like for example AVL trees.

I've created an Btree version of SeqHash that also allows splitting SeqHashes in O(log(n)) called SplitHash.

See: https://github.com/odipar/spread/blob/master/src/main/scala/...


I made https://github.com/mamcx/tree-flat - a tree stored flattened in pre-order, which allows very fast iteration even for child/parent queries. It's based on APL, so not that novel.

I also really like the relational model; it's not that well represented, so I'm making a language on top of it: https://tablam.org.


Nice! I'm curious to see where you take TablaM. I agree that the relational model is not nearly as well represented as a basis for programming language semantics as one might hope.



Yeah, that one is good, but I bet it's possible to encode trees efficiently as relations without indirections.

That is part of the thinking behind `tree-flat`, and I want to base trees on that for my lang.



Also, Learned Indexes Structures:

The Case for Learned Index Structures: https://arxiv.org/abs/1712.01208

Learned Indexes for a Google-scale Disk-based Database: https://arxiv.org/abs/2012.12501

And Noms/Dolt Prolly Trees:

How to Chunk Your Database into a Merkle Tree: https://dolthub.com/blog/2022-06-27-prolly-chunker/


Count-min sketch comes to mind [1]. It's a probabilistic data structure similar to the bloom filter. Learnt about it in this system design video "Top K Problem" [2].

[1] https://en.wikipedia.org/wiki/Count–min_sketch

[2] https://www.youtube.com/watch?v=kx-XDoPjoHw


Low-Density Parity-Check (LDPC) codes [1].

Such good forward error correction that up to a few years ago it was all under patents.

Gold codes [2] are also very cool, used in CDMA systems.

[1] https://en.m.wikipedia.org/wiki/Low-density_parity-check_cod...

[2] https://en.m.wikipedia.org/wiki/Gold_code


Min-Max Heaps

Think of them as double ended priority queues. O(n) construction with O(lgN) mutations. Both the min and max elements are quick to get O(1).

The paper is short and the data structure is elegant. I read it a few years ago in uni and it made me appreciate how something greatly useful can be implemented very simply.

https://dl.acm.org/doi/pdf/10.1145/6617.6621


BW Trees are pretty neat [1][2]. Typically BTrees in database engines need latches to prevent concurrent writes messing up the tree but BW Trees conveniently sidestep that requirement by appending all node updates to a linked list (called a Delta Chain), and having each reader apply the delta of updates by traversing the linked list until it hits the final node in the list. Periodically, these linked lists are compacted back into single nodes.

If you like these kinds of data structures, the Database Internals [3] book has a good introduction and explanation of a handful of BTree optimizations like this.

1: https://www.microsoft.com/en-us/research/wp-content/uploads/...

2: https://www.cs.cmu.edu/~huanche1/publications/open_bwtree.pd...

3: https://www.databass.dev/


A certain ePub reader on MacOS used to use a bloom filter for text search. You would decompose an ePub into pages of text and generate a bit array for the text on each page. As the user types in the search field you could often reject large portions of the document — leaving perhaps only a handful of pages where the code would only then have to drop down to the more tedious string matching algorithm.

Very cool.


How about the R-Tree, used for accessing/searching spatial data: https://en.wikipedia.org/wiki/R-tree

Maybe not so obscure if you've worked in the geospatial realm.

I actually had a job interview once that asked me to implement an R-Tree from scratch in C++ as a homework assignment - I didn't get that job :)


I've had a situation where I needed to stream blocks of data from a remote computer in soft-realtime, but still needed to share the same data with many different consumers.

The code was simple, but effective (Go):

    import "sync"

    type Muxer[T any] struct {
        ret T
        mut sync.RWMutex
    }

    func (c *Muxer[T]) Multiplex(fn func() T) (ret T) {
        if c.mut.TryLock() {
            defer c.mut.Unlock()
            ret = fn()
            c.ret = ret
        } else {
            c.mut.RLock()
            defer c.mut.RUnlock()
            ret = c.ret
        }
        return ret
    }
The way to use it is to write a non-concurrent nullary get() function, then pass it to the Multiplex() function in as many goroutines as needed:

    get := func() DataType { ... }
    muxer := Muxer[DataType]{}
    
    // in other goroutines
    value := muxer.Multiplex(get)

The value will be shared across all goroutines, so we get minimum latency with minimum copying.


I don't get it. It seems overengineered to me, but I can't formulate my reasoning well enough.

Why isn't the original value that you're returning protected by a RWLock, and all goroutines will just need to acquire a read lock, instead of using a write lock for what is basically a getter function?

Yeah, I don't get it.


The Multiplex() function serves a double purpose:

1) If no other goroutine is currently executing the Multiplex() function, then Multiplex() acquires a write lock, executes the getter and caches the result.

2) If some other goroutine is currently executing the Multiplex() function, then Multiplex() waits to acquire the read lock, then reads the cached value.

So in case of a single goroutine executing the Multiplex() function in a loop, it will be nothing more than a wrapper. But when many goroutines are executing the same Multiplex() function concurrently, only one goroutine will actually execute the getter, and all others will just wait for the first one to finish, then take its cached result.

The point is that you don't have to think about who executes the getter and who copies the cached values, how often to execute the getter, how much time will getter take... You just call it like an ordinary function, and every goroutine will get the most recent value it can get.

Hopefully this clears things up. I'm willing to elaborate further if you have any more questions.


All probabilistic structures are fascinating to me. I'm most enamored with HyperLogLog - estimating the number of distinct elements in extremely large sets of data without having to build a set of seen elements - but the entire field is pretty awesome.

The canonical use case for HLL is counting unique visitors. But any large dataset where you only need a good approximation is a valid target :)


I used to work for a mobile analytics company, and HLL made things SO much easier and fast that at the time we changed all our queries to be based on HLL. At the time, the ability of being able to sum uniques was like black magic, and still to this day I am quite impressed about HyperLogLog.


Quadrilateral Simmelian backbones

Count quadrangle structures in graphs to derive an embeddedness coefficient for each edge and expand the graph along the most embedded edges, yielding nice clusters.

https://jgaa.info/accepted/2015/NocajOrtmannBrandes2015.19.2...


Bitboards: use arrays of bits instead of matrices for board games like Tetris or chess. "A bitboard is a specialized bit array data structure commonly used in computer systems that play board games, where each bit corresponds to a game board space or piece. This allows parallel bitwise operations to set or query the game state, or determine moves or plays in the game"
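
For illustration, a minimal C++ sketch of an 8x8 occupancy bitboard (set/test one square, and whole-board questions become single bitwise operations):

    #include <cstdint>

    // One bit per square of an 8x8 board.
    using Bitboard = uint64_t;

    constexpr int square(int rank, int file) { return rank * 8 + file; }

    constexpr Bitboard set(Bitboard b, int sq)  { return b | (Bitboard{1} << sq); }
    constexpr bool     test(Bitboard b, int sq) { return (b >> sq) & 1; }

    // e.g. "any attacked square also occupied by black" is just (attacks & black) != 0.
    constexpr Bitboard intersect(Bitboard a, Bitboard b) { return a & b; }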


Pure algorithm & perf stuff:

- exponential unrolled linked list: I don't know what they are called, so I called them that way. https://fulmicoton.com/posts/tantivy-stacker/

- radix heap. I actually had to use that data structure on a real use case.

- radix tree. Actually super useful.

- HashMap<String, V> where the string and the value are contiguous in memory. Weirdly I've never seen that one... The hashmap can store the full hash in its hash table anyway, so false positives are super rare. It will have to check the string anyway... You can improve your locality by keeping the key and the value in the same place in RAM.

Just practical stuff: in Python, I often use a magic dict defined as follows.

    from collections import defaultdict

    def MagicDict():
        return defaultdict(MagicDict)

Then you can use your MagicDict as a weird JSON-like map.

    c = MagicDict()
    c[3]["aaa"][2] = 5


Dijkstra maps. This page describes it better than I can: http://www.roguebasin.com/index.php/Dijkstra_Maps_Visualized but it can be a really simple way to add emergent behavior to an AI in a game (my interest in it), among other things.


Eertree. As one can guess from the name, it stores all sub-palindromes of a given string and does stuff to them. Somewhat similar to a suffix tree in spirit, but was only invented in 2015.

See https://en.wikipedia.org/wiki/Palindrome_tree and the original paper


DLX https://en.wikipedia.org/wiki/Dancing_links

Stanford Lecture: Don Knuth — "Dancing Links" (2018) https://www.youtube.com/watch?v=_cR9zDlvP88


Invertible Bloom Lookup Tables. They're like Bloom filters but you can also remove elements and even reconstruct elements in certain cases. Useful for syncing 2 databases which are almost in-sync, by sending only a small amount of data between them. It's used by Bitcoin nodes to communicate the contents of newly mined blocks.


Here's another one I was thinking about. It's an idea for a hierarchical Free Space Bitmap. I don't know if anything like this has appeared in a real file system or not.

At the lowest level, let's have one bit represent if a 4K chunk is free space or occupied. 64 bits at this level tells you the usage of 64 * 4096 bytes (256KB)

Then Level 2. Two arrays. One bit at this level represents 64 bits in the level 1 array. One array for All-Bits-Set in level 1, one array for Any-Bit-Set in level 1. 64 bits at Level 2 tells you usage of 64 * 64 * 4096 bytes (16MB).

Then Level 3. Two arrays. One bit at this level represents 64 bits in the level 2 array. One array for All-Bits-Set in level 2, one array for Any-Bit-Set in level 2. 64 bits at Level 3 tells you the usage of 1GB.

I'd imagine there would be ways to use the hierarchical tables to quickly find a hole of a particular size without checking only the lowest level.
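
A minimal two-level C++ sketch of the idea (just the "all bits set" summary plus a hole-finding scan; the real thing would keep the "any bit set" arrays and a third level as described):

    #include <cstdint>
    #include <vector>

    // Level 0: one bit per 4K chunk (1 = occupied).
    // Level 1: bit i set => level-0 words [64*i, 64*i+63] are all fully occupied.
    struct FreeSpaceMap {
        std::vector<uint64_t> l0;
        std::vector<uint64_t> l1_all;

        // Find any level-0 word containing a free chunk, skipping 64*64 chunks
        // at a time when the summary bit says there is no hole underneath.
        long findWordWithHole() const {
            for (size_t i = 0; i < l1_all.size(); ++i) {
                if (l1_all[i] == ~0ULL) continue;             // 16MB region fully used
                for (size_t j = 0; j < 64 && 64 * i + j < l0.size(); ++j)
                    if (l0[64 * i + j] != ~0ULL) return long(64 * i + j);
            }
            return -1;                                        // no free chunk anywhere
        }
    };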


I think you're describing a Van Emde Boas tree (or trie): https://en.wikipedia.org/wiki/Van_Emde_Boas_tree



Colonies/hives, which have fast insertion/erasure/iteration and preserve pointers on insertion/erasure. I remember that Cataclysm: DDA used this data structure for keeping track of the items on each map tile.

https://plflib.org/colony.htm


Slotmap seems like a superior data structure to adjacency lists backed by Vec, but still haven't replaced them in graph libraries: https://docs.rs/slotmap/latest/slotmap/


The half-edge data structure, used in computational geometry.

I remember when I first discovered it, it drove home the point that choice of data structure can really matter - it made a bunch of problems I'd been working on much easier to solve, as it gave a different way of looking at things.


Here's one I don't know if I've ever seen documented anywhere. If anyone knows a proper name for this one, let me know!

Imagine it's for a text editor, and you want to map Line Numbers to Byte Positions. But then you want to insert a byte somewhere, and you need to add 1 to all your Byte Position values.

Instead of actually keeping a big array of Byte Position values, you have a hierarchical array. The convention is that whenever you want to read the value, you add the Parent's value, and that parent's value, etc. Number of parents would be logarithmic.

So if you want to add 1 to everything past a certain point, you just apply it to some leaf nodes, then the next internal nodes, rather than applying it to every single leaf node.
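
This is essentially the range-update / point-query trick. One standard way to get the "touch only O(log n) nodes to shift everything past a point" behaviour is a Fenwick (binary indexed) tree over the deltas - a minimal C++ sketch, not necessarily the exact layout the parent comment has in mind:

    #include <vector>

    // 1-based Fenwick tree storing deltas: "add v to every position >= i" and
    // "read position i" are both O(log n).
    struct OffsetTree {
        std::vector<long long> t;
        explicit OffsetTree(size_t n) : t(n + 1, 0) {}

        // a[i..n] += v  (i >= 1)
        void addFrom(size_t i, long long v) {
            for (; i < t.size(); i += i & (~i + 1)) t[i] += v;   // ~i + 1 = lowest set bit
        }
        // Current value at position i = sum of all deltas applied at or before i.
        long long get(size_t i) const {
            long long s = 0;
            for (; i > 0; i -= i & (~i + 1)) s += t[i];
            return s;
        }
    };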


Somewhat along these lines, I have formed a concept forcedly called "differentially composable string", or "deposed string", or more precise "poor man's git".

The intended use case is to obtain a compact representation of all the historic text entered into an input field (notes, comments, maybe long-form): all the stages of the text, where a stage is a tuple [add/remove, start_index, text/end_index]. Once you get the stages from the deposed string as JSON, you could transform them however you want then load them into a new deposed string.

You can read more on GitHub: https://github.com/plurid/plurid-data-structures-typescript#... or play around on my note-taking app implementing deposed strings and more: https://denote.plurid.com


Ted Nelson named them Enfilades. Guy Steele called them Monoid Cached Trees.


Actually I think it was Roger Gregory or Mark Miller who came up with the term "enfilade", and I don't think enfilade wids and dsps have to be monoids necessarily. I'm not sure, though.


Enfilade trees were going to be my suggestion for this page? Really nice structures. Such a shame the writing describing them is so hermetic in style - dispative and widdative properties, really, come on now.


Rodney Bates called his independent invention of them K-Trees.



That's good for representing the text, but I'm more focused on the part about being able to add X to everything past a particular point very quickly by manipulating parent values.


Ropes do this too; each node is labeled the size of its subtree. If you want to compute line numbers as well, add them as another label


That kind of thing is a common extension to ropes, because it makes them indexable containers like arrays, but it is not inherent to the definition of ropes. I don't think it's mentioned in the original Cedar rope paper.


An annotated tree?


Succinct data structures in general are interesting. They use close to the information theoretic bound for a given problem and still support fast operations -- a decent mental model is that they're able to jump to a roughly correct place and (de)compress the data locally.

A good example is a wavelet tree. Suppose you have a text of length N comprised of an alphabet with C characters and wish to execute full-text pattern searches for patterns of length P. Once the tree is created, you can do your full-text search in O(P log C) time, and you use less space than any constant multiplicative factor of the information theoretic lower bound. Note that the runtime is independent of the original string's length.


Another example is distance oracles. A distance oracle is a two-stage object: in one stage it creates a cache of distances, and in the other it allows us to query the cache. The oracle gives an approximate answer for the distance between any two nodes while reducing the quadratic memory requirement to something on the order of n log k, with k query steps.


I'll add: the count-min sketch[0]. Used for streaming top-k, where the set of keys may be too large to fit in memory. It's essentially a composite of a fixed size (k) heap and a counting bloom filter.

Instead of using a bit field for membership, use an integer field for tracking count.

When querying: take the min of each value in the field location given by the hash functions.

When updating: increment each field location given by the hash functions, take the min of new values. if the min is greater than the least value in the heap, push the updated key into the heap

[0] https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
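
A small Python sketch of that recipe. For brevity the k candidates live in a plain dict and the weakest is evicted by scanning it, where a real implementation would keep them in the small heap described above, and Python's built-in hash() stands in for proper pairwise-independent hash functions:

    class CountMinTopK:
        def __init__(self, k, width=2048, depth=4):
            self.k, self.width, self.depth = k, width, depth
            self.counts = [[0] * width for _ in range(depth)]
            self.candidates = {}                    # key -> latest count estimate

        def _bump(self, row, key):
            j = hash((row, key)) % self.width       # one position per row
            self.counts[row][j] += 1
            return self.counts[row][j]

        def add(self, key):
            # increment one counter per row; the estimate is the min of the updated counters
            est = min(self._bump(row, key) for row in range(self.depth))
            if key in self.candidates or len(self.candidates) < self.k:
                self.candidates[key] = est
            else:
                weakest = min(self.candidates, key=self.candidates.get)
                if est > self.candidates[weakest]:
                    del self.candidates[weakest]
                    self.candidates[key] = est

        def estimate(self, key):
            # query: min over the counters this key hashes to (may overestimate, never underestimates)
            return min(self.counts[row][hash((row, key)) % self.width] for row in range(self.depth))

        def top_k(self):
            return sorted(self.candidates.items(), key=lambda kv: -kv[1])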


Input/Output Unions. Basically an input struct for an algorithm, laid out in such a way that the algorithm's output only ever overwrites input that is no longer needed. If done well, this allows for hot loops that basically go over one array for reads & write-backs. After all, it's all just memory, and using what you've got in situ is the fastest way one can go.

No pointers to dereference and wait on, just the input, computed and stored back into the same cache line, to be used for the next part of the hot loop. If the hot loop is multi-staged and the output continuously decreases in size, one can even "compress" the data by packing an array of origMaxSize/outputSize subunits into the unit.


Do you have any references for this? I'd like to understand better


https://hackaday.com/2018/03/02/unionize-your-variables-an-i...

https://en.wikipedia.org/wiki/Test-driven_development

This is unsafe C code by nature, but test-driven development on the target platform can make it "safe".

Now you start with a struct in the union: {inputA, inputB, inputC}

and pad the intermediate struct

{Padding against Overlap, resultB, result C}

and end up with the result struct in the same union.

{outputA, outputB, outputC} .

The trick is to keep track of the state and validate the "purity" via automated tests.

Then you have it all in one L1 cache line, pumping through an algorithm, no dereferences, no huge stack structures, it's all there, as long as the sizes of input and output do not differ wildly.

Remember down there its all just bytes accessed by code. There is no such concept as objects or even variables. All those mental pots to grab things out and put things back in, are artificial constructs needed by us.

The machine down there, can go to work like a line cooking shiva in a trailer. And doing so and having this in view, makes it faster.

PS: Pointers within the struct nullify the advantages gained here, because every pointer is a memory load and thus "relatively" slow.


Another use case for Bloom filters: Bioinformatics (see https://en.wikipedia.org/wiki/Bloom_filters_in_bioinformatic...)


Briggs-Torczon sets of small integers (0 <= x < N) deserve to be better known.

(Set up two N-element arrays "member" and "index" of ints and an elementCount; set elementCount to zero. You don't have to initialize the arrays if you can live with some valgrind grumbling. The leading "elementCount" entries in "member" are the current members of the set, unordered, and for each of those values x, index[x] is its index in "member".

Then isInSet(x) = index[x] >= 0 && index[x] < elementCount && member[index[x]] == x and addToSet(x) = isInSet(x) || (member[index[x] = elementCount++] = x)

and rmFromSet() is left as an exercise.)
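
A Python rendering of the same trick (the arrays have to be initialized here, so you lose the no-initialization property, but you keep the O(1) clear; spoiler warning, it also includes the remove left as an exercise):

    class SparseSet:
        """Briggs-Torczon set of ints in [0, n)."""
        def __init__(self, n):
            self.member = [0] * n    # first `count` slots are the members, unordered
            self.index = [0] * n     # index[x] = where x sits in `member`, if it is in the set
            self.count = 0

        def __contains__(self, x):
            i = self.index[x]
            return i < self.count and self.member[i] == x

        def add(self, x):
            if x not in self:
                self.member[self.count] = x
                self.index[x] = self.count
                self.count += 1

        def remove(self, x):
            if x in self:
                i = self.index[x]
                last = self.member[self.count - 1]
                self.member[i] = last     # swap the last member into the vacated slot
                self.index[last] = i
                self.count -= 1

        def clear(self):
            self.count = 0               # O(1), no need to touch the arrays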


Don't know if it fits the criteria, but a very basic concept: circular arrays.

This is great for live updated time series data: you never have to expand your arrays, allocate memory etc. It is extremely fast, predictable and resource efficient.


Also known as ring buffers


Cuckoo hashing: https://en.wikipedia.org/wiki/Cuckoo_hashing

Hash tables that come with a parameter that lets you control density vs. speed.


This isn’t so much a data structure than an implementation of one, but map-represented trees are great.

For context, they were used by Figma to represent its scene-graph (https://www.figma.com/blog/how-figmas-multiplayer-technology...).

Essentially, it’s a way to represent a normal tree as a map; keys of the map represent individual nodes on the trees, and one key “ROOT” points to the rootmost node.

Each node is connected via the following relationship: parents have an unordered set of their children’s keys (not the actual node reference) and children know their parent’s key. Children also maintain a property to track their index using a technique called fractional indexing.

What’s great about this data structure is that you have O(1) transformations for most parts. If you need to change a node’s property, height for instance, (assuming this is a design tool or something), all you need is that nodes key. If you need to reposition a node under a new parent, simply remove the key from the parent’s set, change the node’s parent key, and update the index. This structure plays well with collaborative apps too as most edits to the tree are O(1)— edits are swift and can be communicated to other collaborative instances quite simply with small payloads.

Last thing I’ll touch on is fractional indexing. Very crudely put, it forces children nodes to track their position in a list via an “index” that can be a normal float. To reposition a child node, you first find its neighbors and average their indices to find a new index. Ordering the list is a sort from smallest indexes to largest.

In an example, say we had 3 nodes A, B, and C. A’s index is 0.1, B’s is 0.3, and C’s is 0.5. To insert an element D between A and B, I average A and B’s index (0.2), set that as the index of D and insert it into the parent node’s children set. At render, the set is ordered and we find A, D, B, C.
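
That averaging step in a few lines of Python (a naive float version; real implementations often use arbitrary-precision strings so repeated midpoint inserts can't run out of float precision):

    def index_between(before, after):
        if before is None and after is None:
            return 0.5
        if before is None:
            return after - 1.0        # insert at the front
        if after is None:
            return before + 1.0       # insert at the back
        return (before + after) / 2.0

    children = {"A": 0.1, "B": 0.3, "C": 0.5}
    children["D"] = index_between(children["A"], children["B"])   # 0.2
    ordered = sorted(children, key=children.get)                  # ['A', 'D', 'B', 'C']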

Shameless plug, but I’m using these techniques to build a multi-framework mobile app builder at phazia.com :). I love nerding out about these things so thanks for the chance to do so @op.


I came up with a clever one recently. So, I had a stream of records of fixed size where I will receive exactly one per second. I needed to keep approximately two weeks of this data. I would also need to be able to query the records by time. Anyway, the idea is: use modulus on the file creation time to calculate a write position. Use the location of the current write position to locate query starts and stops. The file then has zero space overhead. Is this something known or did I actually invent something here? If this is my invention I'm open to hearing suggestions for a cool name!
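
A minimal Python sketch of the idea, assuming fixed-size records, exactly one per second, and a file pre-sized to the retention window:

    RECORD_SIZE = 64                       # bytes per record (assumption)
    RETENTION = 14 * 24 * 3600             # ~two weeks of one-per-second records

    def slot_offset(ts):
        # a record's position in the file is fully determined by its timestamp
        return (int(ts) % RETENTION) * RECORD_SIZE

    def write_record(f, ts, payload):
        f.seek(slot_offset(ts))
        f.write(payload.ljust(RECORD_SIZE, b"\0"))

    def read_range(f, start_ts, end_ts):
        # valid as long as the queried window lies within the retention period
        for ts in range(int(start_ts), int(end_ts)):
            f.seek(slot_offset(ts))
            yield ts, f.read(RECORD_SIZE)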


Sounds like a ring buffer stored in a file.


Succinct Trie http://stevehanov.ca/blog/?id=120

Super effective way to store a searchable list of items, like a dictionary of words.


First contact with Functors, multidimensional Kalman filtering abstractions, and Golomb ruler optimized DFT... kind of like meeting the Wizard of Oz, and finding out it was a chicken all along. ;)


Very cool stuff...

To take advantage of all these, where would one go to ask "what's the best data structure to use to implement X, Y, and Z?"

Is there a stackoverflow tag or something like that that folks use??


Very good idea!

And I love bloom filter too (and their more modern successor cuckoo filter) but I'd challenge the usecase you mention though: 1 million IPv4 is 4MB, and 16MB for IPv6. That's tiny, you're better off using some kind of hashtable, unless you have a fast and small memory, and then a slow and big memory (say embedded processor with small CPU cache and some DRAM).

Bloom filters are useful when your working set cannot fit in your fast memory, to allow you to go to the slow memory only for a small number of requests.


The other use is when the cost of checking the set is high due to something like accessing a remote system. You can pull a small representation of the full set and only do a network request after when it's likely to succeed.


I recently learned about deques in Python (other languages may have it too) which is like a linked list but it keeps pointers for the first and last elements. This allows you to insert/read/remove elements from the beginning or end of the list in constant time. I’ve never used it, but I could imagine it being useful in certain situations.

https://docs.python.org/3/library/collections.html


A deque may be implemented as a linked list, but it's actually more commonly implemented on top of arrays.

Python in particular uses arrays under the hood for almost everything because modern computer hardware is so blazingly fast manipulating them.


Arrays aren't efficient when you want to add/remove to the head. Deque in python exists so there is a data structure with constant time pop/push to the head. And it is in fact implemented as a doubly linked list, with various optimizations: https://github.com/python/cpython/blob/v3.8.1/Modules/_colle....


It is not commonly done, but there is no reason why you can't have an array with amortised constant-time insertion at both the head and the tail.
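
For example, a circular buffer over a resizable array gives amortised O(1) pushes and pops at both ends; a minimal Python sketch (empty-pop checks omitted):

    class ArrayDeque:
        def __init__(self):
            self.buf = [None] * 8
            self.head = 0                 # index of the first element
            self.size = 0

        def _grow(self):
            old, n = self.buf, len(self.buf)
            self.buf = [None] * (2 * n)
            for i in range(self.size):    # unwrap into the new, larger buffer
                self.buf[i] = old[(self.head + i) % n]
            self.head = 0

        def push_back(self, x):
            if self.size == len(self.buf):
                self._grow()
            self.buf[(self.head + self.size) % len(self.buf)] = x
            self.size += 1

        def push_front(self, x):
            if self.size == len(self.buf):
                self._grow()
            self.head = (self.head - 1) % len(self.buf)
            self.buf[self.head] = x
            self.size += 1

        def pop_front(self):
            x = self.buf[self.head]
            self.head = (self.head + 1) % len(self.buf)
            self.size -= 1
            return x

        def pop_back(self):
            self.size -= 1
            return self.buf[(self.head + self.size) % len(self.buf)]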


I think it's more common than not to implement deques on top of arrays. There's a meaningful performance difference at the hardware level due to the spatial locality in memory. Linked lists in general cause the CPU to jump all over the place in memory, causing pressure on the cache and whatnot.

I believe the standard C++ deque implementation consists of a vector of vectors. The purpose of the two layers is at least two-fold. It guarantees that elements stored within the deque remain at a stable memory location at all times, even during a reallocation. Also, the memory getting shifted around during a reallocation consists of relatively tiny vector metadata rather than the application data.

To speculate, I suspect the reason it's a linked list in python is due to the age of the module. It was likely written relatively early on in python's development, and the linked list implementation was simple and elegant. Now enough stuff uses it that it would be painful to change.

Dicts are hashmaps in python rather than binary trees for similar performance reasons. You can't even find any binary tree based containers in python's standard library. You can find containers with the same performance characteristics and semantics of binary trees, but under the hood they're implemented on top of arrays because it better maps to hardware.


Neat, it's always best to look at the source. To be clear, a deque can be implemented on top of an array and still have constant time operations on the head.


If it's not constant time on both ends, it's not a deque. You can already insert items at any index in an array with linear cost.


I invite you to think about the problem for a while, and try to think of a way to implement a performant deque on top of a resize-able array. I promise you it's possible.


I don't think the comment you replied to disagreed with that.


Ole Begemann and Chris Eidhof wrote Advanced Swift, and, in there, described a really efficient FIFO queue.

I implemented a variant of it, in my Generic Swift Toolbox Package[0]. It’s lightning fast.

[0] https://github.com/RiftValleySoftware/RVS_Generic_Swift_Tool...


Might not be very obscure but it was new to me. Ended up being really useful for finding out whether dates intersect a particular period and ended up using it quite a bit at work.

Range Tree:

https://en.wikipedia.org/wiki/Range_tree

You give it a range of dates and it'll tell you if any intersect so if you're looking for "how many people are absent in this time period" you can really quickly find out by using a range tree.


I want to write a content based pub/sub system. Where a subscriber says "I want messages where field X = A, y = B and z > C".

I've looked around and I've seen R+ Trees are potentially a solution but there's very little info on how they're actually created in code.

Does anyone know of any nice data structures to do this? Obviously the most basic way is just indexing each field and that works for simple equality but gets harder when you're trying to do other comparators like < and >.



Nested Sets

https://en.wikipedia.org/wiki/Nested_set_model

You can use it to store tree structures in relational databases for faster querying of stuff like "all nodes below node x", "how many nodes to the top node" and so on.

Querying is easy, but comes at the cost that changes like inserting and deleting are more expensive.

I used this for an auth service with roles, users and org units. Pretty query heavy; it worked very well.


HyperLogLog[1] — approximately count distinct elements from an extremely large set (e.g. 10⁹ elements) super efficiently.

→ Distinct count time complexity[2]: O(1)

See Redis's PFCOUNT[3].

[1] https://en.wikipedia.org/wiki/HyperLogLog

[2] For a fixed number of registers (e.g. Redis's implementation)

[3] https://redis.io/commands/pfcount/




Resource Description Framework (RDF)

The Resource Description Framework (RDF) is the W3C standard for web data. It allows data to be linked, anywhere, and to be "grounded" in semantic descriptions of the data. The core RDF data model is very simple: it is a set of triples organized into an RDF graph.

https://www.w3.org/egov/wiki/Resource_Description_Format


Bitboards for chess board representation, the fastest and most complex such representation: https://www.chessprogramming.org/Bitboards Other structures for chess are easier to follow and understand: https://www.chessprogramming.org/Board_Representation


I did an append-only list of URLs with RAW files where each bit says whether something is true/false about the URL at that offset.

For example 26x26 RAW files asking if the page contains the letter combination "aa" or "ab" all the way up to "zy" and "zz".

When one types a search query, after 2 letters a file is pulled; at the 3rd letter we have a second 2-letter combination. Then do the AND operation.

It is much like a bloom filter and tells you what is not in the set.


That is a form of q-gram indexing.



After that: maybe the Enfilade, from the Xanadu project:

https://xanadu.com/tech/

https://en.wikipedia.org/wiki/Enfilade_(Xanadu)


I came here to say Golomb compressed sets except now I see it's part of the question!

They are used by default in the OpenMined implementation of Private Set Intersection[1] - a multi-party computation technique.

[1] https://github.com/OpenMined/PSI/blob/master/private_set_int...


I am trying to compile a list of data structures (and algorithms and other stuff) into a kind of knowledge base: https://github.com/Dobatymo/data-algos/blob/master/data-stru.... There are few cool and obscure data structures already. Please feel free to help extend it!


Skip lists.

They are useful for checking in O(log(n)) if a number (or string) is in any of n ranges of numbers.

You just keep a sorted list of endpoints of the ranges, and when you do a lookup, you do a binary search to see if the search value is in the list, and if not, between which two indexes it falls. If it's between an even and an odd index, it's in.

Very useful if you remember that IP addresses are also integers, and with IPv6 networks become way too huge to enumerate.
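
The lookup in Python, assuming the ranges are non-overlapping, already sorted, and half-open [start, end):

    import bisect

    # [10, 20), [35, 50), [80, 81) flattened into one sorted list of endpoints
    endpoints = [10, 20, 35, 50, 80, 81]

    def in_any_range(x):
        i = bisect.bisect_right(endpoints, x)
        return i % 2 == 1        # landing between a start and its end means "inside"

    in_any_range(40)   # True
    in_any_range(20)   # False (end is exclusive)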


Append-Log. A grow-only linked list where you can only append elements to it.

Good use-case: Lock-free concurrent reads are allowed, because there is no way committed content could change. Writes need to be Linearizable (so locking is required). Given this property, this data structure provides faster reads than a Mutex<List<T>> and similar write perfs.

Bonus section: Provides broadcast capabilities if used as a channel.


I think this can be done completely lock free with almost same number of memory barriers as a mutex when writing - read the tail pointer (acquire), append your node using CAS(release), then update the tail pointer with CAS (release).

For reads, start with an acquire read of the tail pointer, which will make all preceding writes to the list visible to you. Then you can read the list all you want up until you hit the node you read from the tail pointer without synchronization.


> read the tail pointer (acquire)

> append your node using CAS(release)

> update the tail pointer with CAS (release).

I thought so as well, but when I wrote that impl there was no way to shortcut the 2 releases into a single atomic operation. That ended up creating forks or losing blocks.

I tested that model and a few variations with loom (https://docs.rs/loom/latest/loom/), so I'm confident that it didn't work. However, I also think that a working design might exist. I just haven't found it yet :)


Another option that makes appends cheaper would be to use a Treiber stack (one acquire, one release!), but maintain a separate path for readers. When pushing, do a read (acquire) of the tail, then set (relaxed) a back link on the second-from-top element pointing to the current top, then CAS (release) to push your node. Since the back link will always write the same value, it's ok for contending writers to submit racing writes.

Readers start by reading the tail(acquire) and storing a pointer to the node before it. They then traverse from head until they reach the node before the tail, then skip reading its back link pointer (as they already know it will point to the tail).


Not obscure by itself, but gifted with obscurity due to its language's success: JavaScript's Map type.

Just because the type was introduced very late to the language, and another very successful method is named identically, it's almost impossible to google your way out of any situation.

You have to rely on core documentation and link lists. Just by using “new Map()” in JS, you’re suddenly ehem mapped 30 years back in time!


Not particularly innovative, but extremely useful: HDRHistograms let you very efficiently create, combine and query empirical probability distributions.

If I'm doing anything with random or random-looking variation, I prefer to store the full distribution in a HDRHistogram rather than just the last value, last few values, mean value, or whatever. Opens up so many possibilities for analysis later down the road.



I actually got to use a k-d tree once. I was naïvely brute-forcing tiles together according to their 2D positions, the tiles being part of a larger image, and it was perfect.

Would run out of memory on a 96gb machine before, after it could process the tiles as quick as they came in from the camera, only a couple held in memory at a time.

When you have the right data structure, man it’s like night and day performance!


One of my favourites is treap [1], which doubles as a great excuse to forget how to balance binary trees. It is similarly "probably efficient" as HAMT, mentioned in other comments; both require a well-behaving distribution in the hash function.

1: https://en.m.wikipedia.org/wiki/Treap


For my final project in college Data Structures I wrote a linked hexagonal data structure. To index it I used two different solutions. One was to use a base-6 system to encode a series of hops in the six different directions. Another was to have two coordinates which indicated how far to traverse in two directions to reach the target hex cell.


Not a data structure, but a nice concept for very fast search on an immutable pre-sorted array: Eytzinger order. I did some experiments, it is about 2-3x faster than binary search. https://medium.com/swlh/binary-search-vs-eytzinger-order-301...
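
A small Python sketch of the layout and the search (the speedup largely comes from the predictable 2k / 2k+1 access pattern, which prefetches well):

    def eytzinger(sorted_arr):
        """Place a sorted array into BFS (heap) order; slot 0 is unused."""
        out = [None] * (len(sorted_arr) + 1)
        it = iter(sorted_arr)
        def fill(k):                       # in-order walk of the implicit tree
            if k < len(out):
                fill(2 * k)
                out[k] = next(it)
                fill(2 * k + 1)
        fill(1)
        return out

    def contains(eyt, x):
        k = 1
        while k < len(eyt):
            if eyt[k] == x:
                return True
            k = 2 * k + (eyt[k] < x)       # go left (2k) or right (2k+1)
        return False

    e = eytzinger([1, 3, 5, 7, 9, 11, 13])   # [None, 7, 3, 11, 1, 5, 9, 13]
    contains(e, 9)                           # True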


The BIP buffer.

It's a ring buffer compatible with APIs that don't specifically support it.

It's a great fit for any network I/O and I believe Android uses it under the hood.

https://www.codeproject.com/Articles/3479/The-Bip-Buffer-The...


Locality-sensitive hashing[0] is similar to bloom filters. It's a hashing function that hashes values into buckets. One interesting application is neural networks with sparse inputs.

[0] https://en.m.wikipedia.org/wiki/Locality-sensitive_hashing


I left this thread alone since it is a rabbithole but I was surprised how monotonous it ended up being. The datastructure I am quite interested in these days is 'concept lattices' and object-property (bi)graphs. Although I am less interested in the dynamic side atm and am more thinking about representation of data while being exchanged.


Not really a pure data structure, but I like hash tables where the entries expire after some time. This is very useful in interactive situations like in games or chat bots, where for example you want the AI/bot to remember a conversation or piece of information for some time.

Simplest to implement is to just wrap a hash table and insert time checks in the accessors.
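
For example, a minimal Python sketch that wraps a dict and purges expired entries lazily, on access:

    import time

    class ExpiringDict:
        def __init__(self, ttl):
            self.ttl = ttl
            self.data = {}                 # key -> (expiry_time, value)

        def __setitem__(self, key, value):
            self.data[key] = (time.monotonic() + self.ttl, value)

        def __getitem__(self, key):
            expiry, value = self.data[key]
            if time.monotonic() >= expiry:
                del self.data[key]         # lazily drop the stale entry
                raise KeyError(key)
            return value

        def __contains__(self, key):
            try:
                self[key]
                return True
            except KeyError:
                return False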


Link cut trees. They can be used to query information about paths in a forest, while also allowing you to add and remove edges in the forest.

For example, you could use them to store the minimum spanning tree of a graph, while supporting updates if edges are added to the graph.

They can also be used to speed up max flow algorithms by finding the augmenting path in logarithmic time.


Here's a few fun ones: the Burrows–Wheeler transform, suffix arrays in general, Kanerva's fully distributed representation, reduced-affine-arithmetic numbers, LSM trees, binary array sets, sparse array sets, and gap buffers. (I'm not sure how nobody had mentioned gap buffers yet on the thread, but if they did I didn't see it.)

— ⁂ —

The Burrows-Wheeler transform is the character (cyclically) preceding each suffix in a suffix array, i.e. the last column of the sorted rotations, and there turns out to be a simple algorithm to recover the original text from this transform; live demo: http://canonical.org/~kragen/sw/bwt. But the new text is very easily compressible, which is the basis for bzip2. As mentioned by Blahah and dahart, this is only one of many interesting things you can do with suffix arrays; they can also do efficient regular expression search, for example. This is more interesting since three practical linear-time and -space suffix array construction algorithms were discovered in the early 02000s.
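
A naive Python sketch (sorting all rotations, so O(n² log n); in practice you'd use the linear-time suffix-array constructions mentioned above):

    def bwt(s):
        s = s + "\0"                                    # unique sentinel, assumes "\0" not in s
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(rot[-1] for rot in rotations)    # last column

    def ibwt(r):
        table = [""] * len(r)
        for _ in range(len(r)):                         # repeatedly prepend-and-sort
            table = sorted(r[i] + table[i] for i in range(len(r)))
        return next(row for row in table if row.endswith("\0"))[:-1]

    ibwt(bwt("banana"))   # 'banana'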

— ⁂ —

Pentti Kanerva devised a "fully distributed representation" in the 01990s, which has some features in common with bloom filters — every association stored in an FDR record is stored to an equal extent (probabilistically) in every bit of the record, so erasing some of the bits erases none of the associations but increases the error probability for all of them. You assign a random or pseudorandom bitvector (of, say, 16384 bits) to every atomic value to be stored, represent an association of two (or more) atoms as their XOR, and then take the bitwise majority rule of all the associations to be stored as the record. To retrieve an association, you XOR the record with an atom, then compute the correlation of the XOR result with all possible atoms; the correlation with the correct atoms will be many standard deviations away from the mean, unless the number of associations stored in the record is high enough that conventional data structures wouldn't have been able to store it either.

This and some related work has blossomed into a field called "hyperdimensional computing" and "vector symbolic architectures". But the problem of finding the nearest atom or atoms seems, on its face, to make this impractical: if you have 16384 possible atoms and 16384 bits in your bitvector, you would seem to need at least 16777216 bit operations to compute all the associations.

I realized the other night that if you use the rows of a Walsh–Hadamard matrix as the "pseudorandom" atom representations, you can compute all the correlations in linearithmic time with the fast Walsh–Hadamard transform. Moreover, Cohn and Lempel's result from 01977 "on fast M-sequence transforms" is that after a simple transposition you can also use the FWHT to compute all the correlations with the complete output of a maximal LFSR (a so-called "M-sequence") in linearithmic time, so you can also use the cyclic shifts of an LFSR output for your atoms without losing efficiency. Probably someone else has already figured this out previously, but I hadn't found it in the literature.

— ⁂ —

I'm not sure if the quantities used in reduced affine arithmetic count as a "data structure"? They're a data type, at least, but they're not a container. The idea is that you execute some arbitrary numerical algorithm, not merely on floating-point numbers, or even on dual numbers as in forward-mode automatic differentiation, nor even yet on (min, max) intervals as in ordinary interval arithmetic, but on a vector of coefficients [a₀, a₁, a₂...aₙ], which represents a linear function a₀x₀ + a₁x₁ + a₂x₂ + ... + aₙxₙ, where x₀ = 1 and each of the other xᵢ ∈ [-1, 1], so this function is really a representation of an interval. Then you can assign each of the xᵢ (except x₀ and xₙ) to be the unknown error in one of the input parameters to your algorithm, whose magnitude is determined by the aᵢ value in the value of that parameter; and when you introduce rounding error, you increase aₙ enough to account for it. (The non-reduced kind of affine arithmetic just increases n without limit to account for new roundoff errors.)

Reduced affine arithmetic has two advantages over the usual kind of interval arithmetic. First, it cancels errors correctly; if you calculate (y+1)(y+2)/(y+3) with interval arithmetic, with some uncertainty in the value of y, you usually get an unnecessarily loose bound on the result because interval arithmetic assumes that the uncertainties in the numerator and the denominator are uncorrelated, when in fact for many values of y they partly cancel out. (You can't apply this cancelation to aₙ, the error stemming from all the roundoffs in your algorithm, just the other aᵢ.) But the more interesting result is that RAA gives you a linear function that approximates the calculation result as a linear function of the inputs, which is pretty similar to calculating the gradient — but with error bounds on the approximation!

— ⁂ —

LSM trees are only about as "obscure" as bloom filters, since they underlie several prominent filesystems, LevelDB, and most of the world's search engines, but if you don't know about bloom filters, you might not know about LSM trees either. I wrote a tutorial explanation of LSM trees in https://github.com/kragen/500lines/tree/master/search-engine, though the name "LSM tree" hadn't yet gone mainstream.

Binary array sets, as described in https://www.nayuki.io/page/binary-array-set, are closely related to LSM trees, to the point that you could describe them as an application of LSM trees. They are a simple data structure for the exact version of the problem bloom filters solve approximately: maintaining a set S of items supporting the operations S.add(item) and S.contains?(item). Insertion (.add) is amortized constant time, and membership testing (.contains?) is O(log² N).

— ⁂ —

A different and more efficient data structure for the exact set membership problem, limited to the case where the items to be stored are up to M integers in the range [0, N), supports constant-time creation, clearing, insertion, size measurement, and membership testing, and linear-time enumeration, but I think cannot be implemented in modern standard C because of its rules about reading uninitialized data. (By allowing creation to be linear time you can implement it in standard C.) It uses two arrays, A[M] and B[N] and a variable n, and i is a member of the set iff B[i] < n && A[B[i]] == i, from which it is straightforward to work out the other methods.

I don't remember what this data structure is called, but I definitely didn't invent it. I described it and some uses for it in https://dercuano.github.io/notes/constant-time-sets-for-pixe.... You can associate additional values with this structure to use it as a sparse array.

Oh, I see kcbanner posted this one too, with a link to Russ Cox's writeup, which is better than mine and explains that it was published by Briggs and Torczon in 01993, but probably not originally invented because it was an exercise in an Aho, Hopcroft, and Ullman textbook in 01974 and in Programming Pearls: https://research.swtch.com/sparse

— ⁂ —

A gap buffer is a data structure sort of similar to Huet's zipper but without the dynamic allocation; Emacs famously uses them for its editing buffers. It consists of an array containing a leading chunk of text, followed by a gap, followed by a trailing chunk of text. Edits are always made in the gap; to start editing in a different position, you have to move some text from the end of the leading chunk to the beginning of the trailing chunk or vice versa.

Suppose you want to edit a 531-byte file; you allocate a larger buffer, say 1024 bytes, and read those 531 bytes into the beginning of the buffer. If the user starts inserting text after byte 200, you begin by moving bytes 201–530 (330 bytes) to the end of the 1024-byte buffer (positions 694–1023); then you can insert the user's text at positions 201, 202, 203, etc., without having to do any more work to move text. If the user moves up to byte 190 and inserts more text, you only have to move bytes 191–203 or whatever to the end of the buffer.

A gap buffer is very fast in the usual case of making a sequence of small edits at the same location or close-together locations, and because the constant factor of copying a large block of bytes from one place to another is so small, even moving the gap from one end of a buffer to another is generally plenty fast enough for interactive use. Moreover, operations like string search are very fast in a gap buffer, while they tend to be very slow in fancier data structures with better asymptotic complexity, such as piece tables and ropes; and it uses a predictable and relatively small amount of extra memory.
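
A bare-bones Python sketch of the mechanism (growing the array when the gap runs out is omitted):

    class GapBuffer:
        def __init__(self, text, capacity=1024):
            self.buf = list(text) + [None] * (capacity - len(text))
            self.gap_start = len(text)       # gap runs from gap_start to gap_end (exclusive)
            self.gap_end = capacity

        def move_gap(self, pos):
            """Move the gap so that editing happens at character position `pos`."""
            while self.gap_start > pos:      # shift text from the leading chunk to the trailing one
                self.gap_start -= 1
                self.gap_end -= 1
                self.buf[self.gap_end] = self.buf[self.gap_start]
            while self.gap_start < pos:      # and the other way around
                self.buf[self.gap_start] = self.buf[self.gap_end]
                self.gap_start += 1
                self.gap_end += 1

        def insert(self, pos, s):
            self.move_gap(pos)
            for ch in s:                     # a real buffer would also grow when the gap fills up
                self.buf[self.gap_start] = ch
                self.gap_start += 1

        def text(self):
            return "".join(self.buf[:self.gap_start] + self.buf[self.gap_end:])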

— ⁂ —

However, like FORTH, fancy data structures are a bit of an attractive nuisance: they tempt you to get into trouble. When I first came across Sedgewick's Algorithms in C it was a revelation to me: here were so many clever ways of breaking down problems that made apparently impossible problems easy. (I had come across sorting algorithms before but I didn't really understand them.) I immediately and erroneously concluded that this was the knowledge I had been missing to program all the things that I didn't know how to program, and I began my counterproductive data-structure-fetishism phase.

Much better advice is found in The Practice of Programming: almost all programs can be written without any data structures but arrays, hash tables, linked lists, and, for things like parsing, symbolic algebra, or filesystems, trees. So just use those if you can.

Occasionally, a program is only possible, or is dramatically better, using fancy data structures like LSM-trees, bloom filters, suffix arrays, skip lists, HAMTs, union find, zippers, treaps, BDDs, ropes, gap buffers, Fenwick trees, splay trees, and so on. And that can be super fun! But this situation is pretty rare, and it's important not to get yourself into the position where you're implementing a red-black tree or HAMTs when it would have been a lot easier to write your program in a brute-force way. Or where your successors are fixing bugs in your Fenwick tree implementation.

Also, it's important to be aware that, quite aside from the difficulty of writing and understanding the code, fancy data structures often have extra costs at runtime which can outweigh their benefits. It's easy to discover that your splay tree runs slower than sequential search over an unsorted array, for example, and of course it uses much more RAM.


> Much better advice is found in The Practice of Programming: almost all programs can be written without any data structures but arrays, hash tables, linked lists, and, for things like parsing, symbolic algebra, or filesystems, trees. So just use those if you can.

In general, as much as possible use stuff you already have good libraries for.

Often, you can get away with much simpler data structures with a bit of clever pre-sorting (or sometimes: random shuffling). Relatedly, batching your operations can also often make your data structure needs simpler.


Those are good tips.

I'm not that enthusiastic about libraries. Yes, using a good LSM-tree library will keep you and your successors from spending holiday weekends debugging your LSM-tree implementation — probably, because "good" doesn't mean "perfect". But it won't be a tenth as fast as an in-RAM hash table, if an in-RAM hash table can do the job. And CPython's standard hash table usually won't be a tenth as fast as a domain-specific hash table optimized for your needs.

That doesn't always matter very much, but it does always matter. Especially now with the advent of good property-based testing libraries like Hypothesis, if you can run a function call in 0.1 ms instead of 1.0 ms, you can do ten times as much testing with a given amount of resources.


I like Hypothesis just as much as the next guy. It's awesome.

You have some valid points, when you have specialised needs, you might need to write specialised software.

Though even for your example, I would suggest you write your domain-specific hash table as if it was a library, if possible, (instead of embedding it deep in your application code). Trying to make your code testable with hypothesis strongly encourages such an approach anyway.

(To show that my advice ain't trivial, I give a counterexample where you can't do this as easily: C folks sometimes write things like intrusive linked lists, or intrusive data structures in general. Almost no language gives you good tools to write these as a library, so testing properties in isolation is hard to do, too.)


I agree!


This isn't a data structure technically but I find it clever: ZFS' space maps.

It keeps track of which space is allocated by logging every allocation and deallocation. When initializing an allocator, which can be implemented in any way, this log is replayed.

If the map becomes too big it is rewritten so it is equivalent to the old map but with no redundant entries.


I like consistent hashing (https://en.m.wikipedia.org/wiki/Consistent_hashing). When a hash table (or load balancing pool) needs to be resized, it usually reduces the number of keys (clients) that need to be remapped.


Even simpler is Weighted Rendezvous Hashing (https://www.snia.org/sites/default/files/SDC15_presentations...).

It's quite a bit easier to implement and verify than consistent hashing, and carries the same benefits (minimal reshuffling on ring resizes, etc)
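
It really is only a few lines; a Python sketch using the usual -weight / ln(u) score, where u is the node+key hash mapped into (0, 1). Each node scores the key independently and the highest score wins:

    import hashlib, math

    def wrh_pick(key, nodes):
        """nodes: dict of node name -> weight. Removing a node only remaps its own keys."""
        def score(node, weight):
            h = int.from_bytes(hashlib.sha256(f"{node}:{key}".encode()).digest()[:8], "big")
            u = (h + 0.5) / 2**64             # map the hash to a uniform value in (0, 1)
            return -weight / math.log(u)
        return max(nodes, key=lambda n: score(n, nodes[n]))

    servers = {"a": 1.0, "b": 1.0, "c": 2.0}  # "c" should end up with roughly half the keys
    wrh_pick("user-42", servers)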


In a similar vein, Maglev hashing has a more uniform distribution and decent reshuffling. I couldn't get my head around the explanations that I had seen until I read this one:

https://www.usenix.org/sites/default/files/conference/protec...


I recently updated the Python example on the Wikipedia page for Weighted Rendezvous Hashing. (I'm amazed my edit seems to be still there.)

https://en.wikipedia.org/wiki/Rendezvous_hashing


Deterministic acyclic finite state automata. Essentially a compressed trie. Significantly more memory efficient for real world dictionaries.

https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_s...


Somewhat straddling pure cs and cryptography/coding theory: one of my absolute favorites: Merkle tree. It's used for easily confirming the integrity of data. E.g. a filesystem.

https://en.m.wikipedia.org/wiki/Merkle_tree
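
A tiny Python sketch of computing the root (how odd levels are padded, and whether interior nodes are domain-separated from leaves, varies between real designs):

    import hashlib

    def merkle_root(leaves):
        level = [hashlib.sha256(leaf).digest() for leaf in leaves]
        while len(level) > 1:
            nxt = []
            for i in range(0, len(level), 2):
                if i + 1 < len(level):
                    nxt.append(hashlib.sha256(level[i] + level[i + 1]).digest())
                else:
                    nxt.append(level[i])          # odd node carried up unchanged
            level = nxt
        return level[0].hex()

    merkle_root([b"block-1", b"block-2", b"block-3"])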


Soft heaps are pretty neat. They are like normal min-heaps, but support all operations you care about in O(1) instead of O(log n) time. The catch is that they are allowed to make a configurable proportion of errors.

They are basically only useful in theory, not in practice. They allow construction of some asymptotically optimal algorithms.


Robin Hood tables are a common-ish hash map implementation that often shows up on benchmarks for fast inserts. Depending on the implementation, a Robin Hood table is also a sorted sparse array! This could be useful for data that needs random access and sequential access to a subset of the sorted data, like an order book.


Lazy Adaptive Trees : https://dl.acm.org/doi/10.14778/1687627.1687669

It is a B-Tree where updates to the tree are buffered in the branch nodes and are propagated down to the leaf nodes at a later point in time.


Piece tables: a persistent data structure used to represent the current document with relatively efficient editing operations in various text editors, word processors, etc.

https://en.wikipedia.org/wiki/Piece_table


A lazy, immutable rose tree

This article shows some really cool stuff you can do with it https://www.well-typed.com/blog/2019/05/integrated-shrinking...


Nested Set Model: a way to model nested sets of entities and query them in a SQL database. I haven't used it for anything but it seems fun.

https://en.m.wikipedia.org/wiki/Nested_set_model


Used it for tag hierarchy, liked it a lot!


The disruptor ring-buffer for providing lock-free parallelization for multiple readers/writers: https://martinfowler.com/articles/lmax.html


> Say you have a list of 1 million IPs that are black listed. A trivial algorithm would be to compare every element of the set with a given IP. The time complexity grows with the number of elements.

In the case of IPs, you just have to make 4 comparisons (for IPv4), if you store them in a tree.


The Rete algorithm / graph network, and its variants.

You can evaluate infinite rules, with automatic incremental updates (no re-evaluation if not necessary), with O(1) time complexity. The tradeoff being that you have to hold a very large graph which represents your rule set in memory.


Suffix trees are known to anyone who's done text search, but very neat! Fairly easy idea too: in order to allow quick substring search, build a mapping from all suffixes to documents. This is compressed as a tree, and prefix matching is done by partial traversal.


Succinct data structures are one of my favourites.

The idea that we can very cleverly pack a collection of information into a much smaller form, but you can query the compressed form with good computational complexity, and without unpacking the data you need to read is amazing.


See Steve Hanov's classic article on the topic: http://stevehanov.ca/blog/index.php?id=120


The Alias method: O(n) in space and O(1) biased number generator

https://en.wikipedia.org/wiki/Alias_method

Mind was utterly blown when I first understood the way this works in my late teens.
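
A Python sketch of the table build (Vose's variant) and the O(1) draw:

    import random

    def build_alias(weights):
        n, total = len(weights), sum(weights)
        scaled = [w * n / total for w in weights]
        prob, alias = [0.0] * n, [0] * n
        small = [i for i, s in enumerate(scaled) if s < 1.0]
        large = [i for i, s in enumerate(scaled) if s >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            prob[s], alias[s] = scaled[s], l
            scaled[l] -= 1.0 - scaled[s]                 # the large entry tops up the small one
            (small if scaled[l] < 1.0 else large).append(l)
        for i in small + large:                          # whatever is left is (numerically) 1
            prob[i] = 1.0
        return prob, alias

    def draw(prob, alias):
        i = random.randrange(len(prob))                  # pick a column uniformly...
        return i if random.random() < prob[i] else alias[i]   # ...then flip its biased coin

    prob, alias = build_alias([0.1, 0.2, 0.3, 0.4])
    draw(prob, alias)    # returns 0..3 with the given probabilities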


HyperLogLogs! They're extremely useful for cardinality count estimation, and support set operations.

https://en.m.wikipedia.org/wiki/HyperLogLog


Reddit’s engineering blog has a good article of how they put HyperLogLog to use https://www.redditinc.com/blog/view-counting-at-reddit


o Fractal Tree Indexes are rather nifty, as used in the TokuDB engine. Basic gist is that you get the speed of a B-tree but your updates are less painful.

https://en.wikipedia.org/wiki/Fractal_tree_index http://www-db.deis.unibo.it/courses/TBD/Lezioni/mysqluc-2010...

o k-d tree for space partitioning, handy real world applications in mapping and gaming.

o Bloom Filters as you mention are great, always a handy one to know.


String tries are pretty fun if you need to implement auto completion. I spent some time implementing my own on one project and open sourced it. And then brought it in on a few more projects.

I've used bloom filters with in memory caches a few times.


I'll just sit here hoping that all of the above datatypes get a nice rust crate :)



My attention seems to be attracted to perfect hash functions. They are typically used for stuff like keyword detection in parsers, but they are an interesting concept that should be possible to extend to much more.



So it’s not snazzy, but I also wrote Objective-C a long time without knowing about NSPointerArray.

Among other things, it can store nil values, which if you’ve written any Cocoa apps, is an all too common way to crash your software.


Linked list where each entry in the list is a fixed-size array. O(1) inserts into the array. If the array fills, you split it into two nodes in the linked list. Amortizes the cost of traversing list pointers.


Not crazy obscure, but half-edge mesh data structures are super cool to me.


Time complexity doesn't seem like a great reason to use a bloom filter. Why not just check if the IP is in a set? Set membership is supposed to be O(1).

The reason to use a bloom filter would be to save space.



For routing you can also use a sparse radix tree, also known as a Patricia tree. This allows fast matching against a set prefixes (which is used for IP routing) without taking up a lot of space.


There’s this one called a binary search tree. I’ve only heard about it a handful of times, during engineering interviews, but it's otherwise reserved for the most esoteric of problem spaces.


Skew heaps. They aren’t useful but their utter simplicity is really fun.


Any data structure based on skew binary numbers is also fun.


If somebody made a Youtube channel going over the data structures mentioned in this thread with clear examples of pros/cons, I'd watch the heck out of it and support it on Patreon.


Extendible hashing allows space efficient growable hashtable on disk, with a maximum of two reads to reach the data page, and an extremely simple and efficient way to grow the hashtable.


> Lets you test if a value is definitely NOT in a list of pre-stored values

wouldn’t a hashmap / dict / object do this in O(1)? Input goes in, is hashed, the key exists or doesn’t exist?


The point of a Bloom filter is that it takes a small, fixed amount of memory that you size up front, far less than storing the keys themselves. (The tradeoff is the possibility of false positives, of course.)


I’ve bookmarked this thread. I have a new reading list! Thanks OP!



I always thought the relational calculus was awesome. There’s only one place I’ve seen it in a real world use case, outside of college, and that’s in a database called Suneido.


Static arrays - when was the last time you used a static array?


What do you mean by “static” exactly? It could be referring to an array that doesn’t change size. Or the contents are read-only. Or do you mean C/C++ static linking? Or something else?

In C, C++ and CUDA I use constant fixed-size arrays with lookup tables of specialized data all the time. I also use small arrays to hold temporary results that need to get first collected, and then sorted or processed together. Seems like very common in C++, especially embedded, game, and GPU programming. I’d love to hear about why static arrays seem uncommon in your world.


Static arrays as in constant fixed-size arrays. With a static array, O(N) is faster than O(log N), and in some cases even O(1) (ahem, hashes), when N is small. A contiguous cache is king, and even with modern dynamic languages, the generational GC can be completely bypassed or highly simplified if the JIT learns that memory access is just dependent on offsets.

Simple code runs fast.

See Pike on Complexity - Rule 3: https://www.lysator.liu.se/c/pikestyle.html


I find I use fixed-sized arrays a lot on microcontrollers. You want to avoid the heap (for reliability) and you don't have a lot of RAM, so you must carefully size your fixed-size arrays.


N-dimensional arrays are pretty cool, and fun to implement. I did one in C for an Advent of Code problem that switched from 3 to 4 dimensions between the two parts.



>Good use-case: routing. Say you have a list of 1 million IPs that are black listed. A trivial algorithm would be to compare every element of the set with a given IP. The time complexity grows with the number of elements. Not so with a bloom filter! A bloom filter is one of the few data structures whose time complexity does not grow with the number of elements due to the 'keys' not needing to be stored ('search' and 'insert' is based on the number of hash functions.)

The easy and efficient way to test if a value is in a list is to use a hash set or dictionary. Complexity is O(1) on average.


OP is probably mentioning the bloom filter because it is space efficient. A good filter will use less space than the entire hash set, making it a good candidate to be kept in RAM to filter out requests and reduce expensive calls to read disk.


I really like the splay tree. It optimizes itself during reads and writes such that it is optimal for a fixed probability distribution of accesses.


- Prefix tree (also known as a trie)

- Splay tree (self-balancing tree optimized for fast queries of frequently accessed nodes)


[Ordered] Minimal Perfect Hash Functions with O(N) construction time.

They have been around for 30 years and I don't see them used in practice.


Fixed key sets are uncommon.


Many assets associated with a program or system are fixed/readonly/constant data. Slowly changing key sets can also take advantage of Minimal Perfect Hashing.


How about a pinsketch:

A pinsketch is a set that takes a specified amount of memory and into which you can insert and remove set members or even add whole sets in time O(memory size). You can insert an unbounded number of entries, and at any time that it has equal or fewer entries than the size you can decode the list of members.

For an example usage, say I have a list of ten million IP addresses of people who have DOS attacked my systems recently. I want to send my list to you over an expensive pay as you go iridium connection, so I don't want to just send the 40MiB list since that would cost about $700. (Or 12.7 MiB if you use an optimal set encoding, or somewhat in between if you use the Golomb coded set mentioned in the OP).

Fortunately you've been making your own observations (and maybe have stale data from me), and we don't expect our lists to differ by more than 1000 entries. So I make and maintain a pinsketch with size 1000 which takes 4000 bytes (1000 * 4bytes because IP addresses are 32-bits). Then to send you an update I just send it over. You maintain your own pinsketch of addresses, you subtract yours from the one I sent and then you decode it. If the number of entries in the resulting set (the number different between us) is 1000 or fewer you're guaranteed to learn the difference (otherwise the decode will fail, or give a false decode with odds ~= 1/2^(1000)).

Bonus: We don't need to know in advance how different our sets are-- I can send the sketch in units as small as one word at a time (32-bits in this case) and stop sending once you've got enough to decode. Implemented that way I could send you exactly the amount of data you're missing without even knowing which entries you're missing.

The sketch itself is simple: To insert an element you raise it to a sequence of odd powers (1,3,5,7...) in a finite field and add it to an accumulator for each power. The tricky part is decoding it. :P

Here is an implementation I contributed to: https://github.com/sipa/minisketch/

There is a related data structure called an inverted bloom lookup table (IBLT) that accomplishes the same task. Its encoding and especially decoding is much faster, and it has asymptotically the same communications efficiency. However, the constant factors on the communications efficiency are poor, so for 'small' set differences (like the 1000 above) it has a rather high overhead factor, and it can't guarantee decoding. I think that makes it much less magical, though it may be the right tool for some applications.

IBLT also has the benefit that its decoder is a fun bit of code golf to implement.

The encoding for IBLT is simple: take your entries, append a checkvalue (can't be a plain CRC). Then hash them with three hash-functions to obtain three locations in a hash table and xor them into those locations (removal works the same way, due to xor's self-inverse property). Decoding an IBLT works by finding entries whose check value passes (which will be true when they are the only entry in their position) and subtracting them from the table (hash them to get their other two positions, and xor them in all three). Decoding is successful when the data structure is all zeros. When the tables are big enough and have a modest amount of slack space the decoding algorithm will be successful (for it to fail there has to be an overlapping cycle, which becomes rare in sufficiently large random graphs). (this description is probably enough for a reader to make a working IBLT implementation though it omits some improvements)
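
Since that description is nearly a spec, here's a minimal Python sketch of the IBLT part for nonzero 32-bit keys (e.g. IPv4 addresses). Each cell keeps only a key-xor and a checksum-xor, which is enough for the peeling decoder; real implementations usually also keep a count per cell:

    import hashlib

    def _h(data):
        return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")

    class IBLT:
        def __init__(self, m):
            self.cells = [[0, 0] for _ in range(m)]      # [key_xor, checksum_xor]

        def _positions(self, key):
            return [_h(bytes([i]) + key.to_bytes(4, "big")) % len(self.cells) for i in range(3)]

        def _check(self, key):
            return _h(b"check" + key.to_bytes(4, "big"))

        def toggle(self, key):
            """Insert and remove are the same operation, thanks to xor."""
            for p in self._positions(key):
                self.cells[p][0] ^= key
                self.cells[p][1] ^= self._check(key)

        def decode(self):
            """Peel pure cells until nothing changes; succeeds if the table had enough slack."""
            out, progress = [], True
            while progress:
                progress = False
                for key_xor, check_xor in self.cells:
                    if key_xor and self._check(key_xor) == check_xor:   # a lone entry
                        out.append(key_xor)
                        self.toggle(key_xor)
                        progress = True
            return out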


Bitboards, although perhaps not so obscure, allow for some very nice optimization tricks.

I've used them to solve Sudoku puzzles really fast.


Sketching data structures: count min sketch, hyperloglog, sliding window hyperloglog, t digest

Skiplists

Locality sensitive hashing

Log structured storage (bitcask, lsm)


Skip lists: 90% of the benefits of more complex tree data structures, with a fraction of the complexity.


Not super obscure, but also not well known... LinkedHashMap. You get the O(1) lookups and ordered iteration.


A Patricia Merkle tree. Sometimes affectionately referred to as a Perkle tree in some large companies.


The Trie is pretty cool. I think it's a bit obscure because the memory use can grow quite fast.


That depends very much on how the trie is represented in memory. Something like a burst trie or HAT trie is usually very compact!


FST: Finite State Transducer. Widely used inside search (Lucene) and DBs, powering prefix and fuzzy search.


patricia tries are cool. used them once while working on an infrastructure management tool; makes searching word dictionaries very efficient.

also love the Option<T> struct with pattern matching that all of the hippie-dippie functional languages seem to like lol


Three of my favorite probabilistic data structures are: Skip Lists, Bloom Filters, and SimHashes.


Algorithms and Data Structures for Massive Datasets By Manning.

Just started reading this. It looks interesting


Wikipedia’s list of algorithms and data structures is pretty extensive:

https://en.m.wikipedia.org/wiki/List_of_algorithms

https://en.m.wikipedia.org/wiki/List_of_data_structures

__________________

As for a specific algorithm, this paper covers analysis of HyperLogLog (HLL):

http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf

TLDR: Calculating the exact cardinality of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Probabilistic cardinality estimators, such as the HyperLogLog algorithm, use significantly less memory than this, at the cost of obtaining only an approximation of the cardinality. The HyperLogLog algorithm is able to estimate cardinalities of > 10^9 with a typical accuracy (standard error) of 2%, using 1.5 kB of memory.

Source:

https://en.wikipedia.org/wiki/HyperLogLog

Recent HN coverage of HyperLogLog:

https://news.ycombinator.com/item?id=32138790


I like the ring buffer; it leverages the CPU cache line and can be used to store an event queue.


If you like Bloom filters, you'll love the Cuckoo filter, which allows key deletion.


Tree automata and the BURS (Bottom Up Rewrite System) algorithm are pretty cool.


patricia trees

They are particularly interesting for storing long strings.

You can look up all strings with a common prefix quite easily.

the lookup function is really interesting, just check the first differing bit.


The Bloom filter is a very clever and simple data structure, thank you.


Actually came here to say bloom filter but you beat me to it.


I find the R-tree beautiful, though it's not used as much these days.


arrays are pretty underground, they're like lists but better. C++ offers arrays but unfortunately Python doesn't :(


It does: what Python calls lists are actually (dynamic) arrays.

There is also this module, which provides unboxed arrays: https://docs.python.org/3/library/array.html


Ropes. Something between array and tree.


> I'll start

You're supposed to post yours as a comment, so it can be voted on like every other type of structure suggested here.


I am enamored by data structures in the sketch/summary/probabilistic family: t-digest[1], q-digest[2], count-min sketch[3], matrix-sketch[4], graph-sketch[5][6], Misra-Gries sketch[7], top-k/spacesaving sketch[8], &c.

What I like about them is that they give me a set of engineering tradeoffs that I typically don't have access to: accuracy-speed[9] or accuracy-space. There have been too many times that I've had to say, "I wish I could do this, but it would take too much time/space to compute." Most of these problems still work even if the accuracy is not 100%. And furthermore, many (if not all) of these can tune accuracy by parameter adjustment anyway. They tend to have favorable combinatorial properties, i.e. they form monoids or semigroups under merge operations. In short, a property of data structures that gave me the ability to solve problems I couldn't before.

I hope they are as useful or intriguing to you as they are to me.

1. https://github.com/tdunning/t-digest

2. https://pdsa.readthedocs.io/en/latest/rank/qdigest.html

3. https://florian.github.io/count-min-sketch/

4. https://www.cs.yale.edu/homes/el327/papers/simpleMatrixSketc...

5. https://www.juanlopes.net/poly18/poly18-juan-lopes.pdf

6. https://courses.engr.illinois.edu/cs498abd/fa2020/slides/20-...

7. https://people.csail.mit.edu/rrw/6.045-2017/encalgs-mg.pdf

8. https://www.sciencedirect.com/science/article/abs/pii/S00200...

9. It may better be described as error-speed and error-space, but I've avoided the term error because the term for programming audiences typically evokes the idea of logic errors and what I mean is statistical error.


Quadruply linked lists


Skip lists.


I'm trying to create a first class citizen of loops and trying to improve software responsiveness.

There's two ideas: concurrent loops and userspace scheduling and preemption.

Nested loops can be independently scheduled. I wrote an M:N userspace scheduler[2] and it does something simple. You can interrupt a thread that is in a hot loop by changing its limit variable to the end. You don't need to slow down a hot loop by placing an if statement or checking whether X number of statements have executed. (Many runtimes do this to implement scheduling and you don't need to do it this way)

Golang schedules away from a goroutine by checking during stack expansion if the thread has used its timeslice.

I think this simple observation is really powerful. The critical insight is that frontend performance can be drastically improved by reducing latency from the user's perspective. Did you ever notice that the cancel button rarely cancels immediately? Or resizing an image is slow and gives no feedback or encrypting gives no feedback?

By having a watching thread interrogate the output buffer, we can create GUIs that visually do things and provide observability to what the computer is doing. Imagine watching a resize operation occur in real time. Or encryption. Or network communication.

One of my ideas of concurrent loops is "cancellation trees"[1], if you model every behaviour of a piece of software as a tree of concurrent loops that are reentrant - as in they never block and each call is a loop iteration of a different index, you can create really responsive low latency software. If you cancel a branch of the tree, you can cancel all the loops that are related to that branch. It's a bit like a tree of processes from PID 0 or loops as lightweight green threads/goroutines.

So as you're moving around in your text editor or browser, if you invalidate the current action - such as press ESCAPE while Intellisense is trying to find related code, you can interrupt all the loops that fanned out from that GUI operation.

Telling the computer to stop doing what it is doing and observability are two key weaknesses of modern computing. GUIs do not always accurately communicate what the computer is doing and you usually need to wait for the computer to finish before it shows you what it is doing. I am sure the cancellation token could be extended to follow this idea.

If you liked this comment, check out my profile to links to my ideas documents.

1: https://github.com/samsquire/ideas4/blob/main/README.md#120-...

2: https://github.com/samsquire/preemptible-thread


Tim sort.


The Israeli queue.

Like a regular queue, but if something new that comes in sees its friend in the queue already, it can jump the queue and go and stand next to her.

Useful for when something has a big overhead on top of its own processing, but the overhead can be shared between several similar entries. If you're doing it anyway for the one already in the queue, you get to make the most of it for its friends too.

I chose it because I live here and it's totally true :)


There is a great (always busy) sandwich shop in Boston called Darwin's that I used to go to a lot, and it worked that way. When a customer placed an order they would call down the long line, "anyone else for an X?", then make the sandwiches with the inner loop over sandwich and the outer loop over ingredient.

I took much delight in the efficiency as well as the incentive structure: If you are not picky and you can help us increase throughput then we'll reduce your latency :)


Reminds me a little bit of a "delta queue" used for efficient discrete-time scheduling, for example in an OS scheduler. Imagine you want to implement a scheduler that accepts jobs and runs them at some point in the future. Instead of keeping a list and scanning the whole list at each clock tick, keep a linked list with (time delta, job) at each node, where each delta is relative to the node before it. Now handling a tick() simply requires decrementing the head node and popping items until you see a non-zero delta. Inserting works like a normal list insert, except that as you walk the list you subtract the deltas you pass from the desired time interval. It goes like this:

    insert(job1, 1)        # (job1, 1) -> null
    insert(job2, 1)        # (job1, 1) -> (job2, 0) -> null
    insert(job3, 2)        # (job1, 1) -> (job2, 0) -> (job3, 1) -> null
    tick() -> job1, job2   # (job3, 0) -> null
    tick() -> job3         # -> null
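Here is a small Python sketch of that idea (my own naming, with a plain list standing in for the linked list; exactly when the head delta is decremented relative to popping varies between implementations, so intermediate states may not match the trace line for line):

    class DeltaQueue:
        """Timer list storing (delta_from_previous, job); only the head delta
        is touched on each tick."""

        def __init__(self):
            self._nodes = []   # list of [delta, job], head at index 0

        def insert(self, job, ticks_from_now):
            remaining = ticks_from_now
            i = 0
            # Walk forward, consuming existing deltas, until this job fits.
            while i < len(self._nodes) and self._nodes[i][0] <= remaining:
                remaining -= self._nodes[i][0]
                i += 1
            # Whatever follows now waits 'remaining' fewer ticks relative to us.
            if i < len(self._nodes):
                self._nodes[i][0] -= remaining
            self._nodes.insert(i, [remaining, job])

        def tick(self):
            """Advance the clock by one tick and return the jobs that are due."""
            if not self._nodes:
                return []
            self._nodes[0][0] -= 1
            due = []
            while self._nodes and self._nodes[0][0] <= 0:
                due.append(self._nodes.pop(0)[1])
            return due

    dq = DeltaQueue()
    dq.insert("job1", 1)
    dq.insert("job2", 1)
    dq.insert("job3", 2)
    print(dq.tick())   # ['job1', 'job2']
    print(dq.tick())   # ['job3']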


Sounds similar to the behavior of Guava's LoadingCache. If you try to fetch a key that's not in the cache, the supplied callback is used to fetch it. While the value is being loaded, other threads can access the cache concurrently, but if a second thread asks for the same key, it will wait for the first load to complete instead of redundantly fetching the same value.


Sounds like a priority queue with a sidecar hashmap or some kind of notify mechanism for when "order's up!"


How do you implement it? Do you do a scan with every insert or do you hash or use a tree for indexing?


Sounds like a priority queue


Not really. In a priority queue, an item with higher priority gets pulled out before one with lower priority, so you can have items that skip parts of the queue (or all of it) even if there are no items of the same priority in there already.

This structure seems to behave differently, in that an item may skip ahead in the queue if and only if another item with some matching characteristic is already contained. This makes it so you can never skip the whole structure, for example, assuming you get put after your "friend".


Is it the same thing? I'm trying to imagine how this would work with a priority queue.

I think you need a flexible number of priorities equal to the queue length, and the priority values come from an infinite counter that only goes up. Higher values = lower priority.

When a unique item comes into the queue you increment the counter and give it that value as a priority. This puts it on the end of the queue.

But if the new item is a duplicate of an existing item then you give it the priority of the existing item.
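Something like the following Python sketch (a min-heap plus a key-to-priority map; all names are hypothetical) seems to match that description:

    import heapq
    import itertools

    class FriendPriorityQueue:
        """Israeli-queue behaviour on top of a plain min-heap: each distinct
        key gets the counter value from when it first appeared, and later
        duplicates reuse that value, so they sort next to their 'friend'."""

        def __init__(self):
            self._heap = []                  # (priority, sequence, key, item)
            self._priority_of = {}           # key -> priority assigned on first sighting
            self._counter = itertools.count()
            self._seq = itertools.count()    # tie-breaker keeps friends FIFO among themselves

        def push(self, key, item):
            prio = self._priority_of.get(key)
            if prio is None:
                prio = next(self._counter)
                self._priority_of[key] = prio
            heapq.heappush(self._heap, (prio, next(self._seq), key, item))

        def pop(self):
            prio, _, key, item = heapq.heappop(self._heap)
            # A fuller version would also drop key -> priority once the whole
            # friend group has left the queue.
            return key, item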


The BroTree: it's the tree you don't know about, but will inevitably be quizzed about during a Google interview by a Googler on a mission to prove supreme smartness.


How about the current DATE formats based on the EPOCH. Hell, they knew there was nothing to it, but some backstreet performances inspired my Y2K formative years where my normal, kind, religious mother became a hoarder that destroyed everything she touched. Why can Java still not do simple date processing when even M$$ nailed that formula generations ago? Not letting devs communicate cross platform without Linux level religious arguments is ridiculous. Git this, Linus, is the EPOCH the final destination of the post-Christian measure of time (i.e. the common era)


I think you managed to lose me about three times in there. What point are you trying to make, and what do you think Linus should do?


I've sought out an isolated community of intellectuals that I respect if I actually exist at all... Your -1 is cute... But perhaps engage someone less intelligent than you who is genuinely trying so others who believe that everything ultimately ends in a violent last stand... Perhaps engage me in a dead thread without your timing advance


I think if there's an intelligence gap between a guy who needs a tailor made retreat that costs more than my entire earnings potential for my lifetime to learn... DON'T BE AN ASSHOLE... SOMEONE WHO DOESN'T CARE WILL PUNCH YOU... I think humanity needs new heroes


Linus is a pussy nerd and he's brilliant... But that is not a reflection of society.


I appreciate the comment. Is that supposed to matter to me? What is your purpose?


I think Linus should do what he did when surveillance by horrific regimes came up in interview. Did the NSA force you to introduce backdoors?


Also... I have daddy issues. Your lack of approval makes me want to support Donald Trump. JK... it'd be that easy to destroy your politics, though



