A database without dynamic memory allocation (tigerbeetle.com)
359 points by todsacerdoti on Oct 13, 2022 | 186 comments



Allocating all required memory during startup has been idiomatic for database architectures as long as I can remember. The obvious benefits to doing things this way are wide-ranging.

There is a less obvious benefit in more schedule-driven database architectures. By implication, the code knows exactly how much of which resources are instantaneously available at all times since it is directly managing those resources. This enables the database to dynamically adapt its behavior and scheduling decisions in response to specific resource pressures or excess resource capacity. This is basically a fine-grained form of internal micro-backpressure, dynamically prioritizing subsets of the workload that minimize impact on the resources under pressure at any moment in time. In some versions, you can switch internal execution strategies to take advantage of currently under-utilized resources. If you outsource this to the OS, you have no idea how close or far you are from the limits, so it doesn't really work.

This doesn't add much performance per se but is great for minimizing tail latencies. It helps to create a robust internal resource equilibrium that handles transients and concurrency more smoothly and consistently than if the database couldn't make fine-grained decisions about what to do from a position of total resource awareness.


> Allocating all required memory during startup has been idiomatic for database architectures as long as I can remember. The obvious benefits to doing things this way are wide-ranging.

Allocating all required memory is, IMHO, a great practice for almost any type of program.

Heap memory allocators are a type of performance optimization: they allow re-use of existing memory, but only if you are careful not to run out of memory. Heaps of course also allow for more efficient utilization of memory: if some software module isn't using that bit of RAM right now, let another bit of code use it.

But like all performance optimizations, they make code messier, and they also lead to less reliable and harder to follow code.

In the embedded world, code statically allocates all memory it needs up front. This means that code that reads from the external world has a fixed buffer size, and also likely a related max throughput, regardless of system conditions (e.g. lots of free memory laying around).

But, this is also a good thing! It forces programmers to think about their limits up front, and it also forces thinking about error handling when those limits are broken!

Sure, it can require creative coding: loading large blobs of binary data requires more work when you stop assuming malloc can return infinite memory. But, hot take here, programs should stop assuming malloc can return infinite memory anyway.
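
A minimal sketch of that embedded style in C, with an arbitrary fixed limit: the buffer exists for the life of the program, and exceeding it is an explicit, handled error rather than a hidden allocation.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical fixed limit, decided up front at design time. */
    #define RX_BUFFER_SIZE 512u

    static uint8_t rx_buffer[RX_BUFFER_SIZE];  /* statically allocated, lives forever */
    static size_t  rx_len = 0;

    /* Returns 0 on success, -1 if the caller would exceed the limit.
     * The error path is decided at design time, not discovered in production. */
    int rx_append(const uint8_t *data, size_t len) {
        if (len > RX_BUFFER_SIZE - rx_len) {
            return -1;  /* explicit overflow handling instead of a silent malloc */
        }
        memcpy(rx_buffer + rx_len, data, len);
        rx_len += len;
        return 0;
    }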


Static allocation has also made TigerBeetle's code cleaner, by eliminating branching at call sites where before a message might not always have been available. With static allocation, there's no branch because a message is always guaranteed to be available.

It's also made TigerBeetle's code more reliable, because tests can assert that limits are never exceeded. This has detected rare leaks that might otherwise have only been detected in production.


I think the grandparent was saying that dynamic allocation is a form of optimization, which also makes the code harder to follow. Your anecdote seems exactly in line with their suggestion.


Ah, missed that, thanks! I've updated the comment.


I don't get your characterization. The statically sized arrays can never be fully used? That is doubtful. I suppose that if data X is only ever used 1:1 with data Y and you got a Y, then you are guaranteed to have an X, but that requires complete static analysis of the code base and certainty that no future code change will ever affect that 1:1 relationship. That seems fragile.

I mean it's not like memory exhaustion is a common occurrence nowadays. These kinds of compromises, where one trades an almost non-existent failure mode for a fragile, assumption-ridden code base, do not sound like a wise choice.

I mean, if the particular process is so tightly-integrated that you can know that you have a 1:1 relationship between X and Y, you can also tie those together with dynamic allocations. I find it hard to believe that you can easily statically show that X and Y allocations are tied in a static allocation scheme but that it would not be so under dynamic allocation?


> Allocating all required memory is, IMHO, a great practice for almost any type of program.

Do you know any good resources to get into that mode of thinking, including common patterns, for somebody who is used to malloc everything? I could easily come up with lots of examples in my current project where "it isn't that easy" and "you cannot know this up-front", even when re-designing the whole architecture, and trying to solve those problems feels like re-inventing a hundred wheels that have been invented before.



I could only read the second article for now (the first one is blocked at work), but while interesting it's not quite what my question was aiming at. That is, it solves the malloc/free hassle in an elegant way, but it doesn't solve the fundamental problem to know in advance how much memory you'll need.

That is, the second article basically says: (pre-arena) It would be so easy if we could just allocate everything on the stack, because lifetimes are nestable. But the article doesn't say how to know the required size of the stack in advance, to prevent stack overflows. That was the part I was interested in.

edit: I could read the first article through a proxy, but it also just talks about different allocators. I'd like to achieve what com2kid was talking about: "Allocating all required memory is, IMHO, a great practice for almost any type of program."


Late reply, but the most common ‘advice’ (usually in the form of post-mortems) I have seen is that knowing the total allocation size for a program is an engineering analysis based on memory availability for a given system and estimated worst-case bounds for memory usage. The system is then designed to work within the constraints generated by that analysis.

Pre-allocating certainly requires a system/application designer to consider the most likely worst case resource usage and then enforce that estimate throughout the design.

Further, for systems that have modern memory sizes (i.e. 16 GB and up) the system is designed to work as though memory was a stack, and then that stack is allocated from system resources using something like a growable array or slab-style allocator. So that if the estimated worst case usage is reached, the ‘static memory allocation’ can be relocated and enlarged or added onto with a new slab.

Your comments imply the sticking point may be the estimate of program memory usage, and that is a very fact/situational analysis.


> Do you know any good resources to get into that mode of thinking, including common patterns, for somebody who is used to malloc everything? I could easily come up with lots of examples in my current project where "it isn't that easy" and "you cannot know this up-front", even when re-designing the whole architecture, and trying to solve those problems feels like re-inventing a hundred wheels that have been invented before.

At some point you have to accept you will have hard limits in place. You have to signal back to whoever is sending you data that you cannot receive any more, they need to chunk it up smaller.

But, key lesson, you should already be doing that, but 99.999% of the time we get away without doing that because memory on modern systems is so large.

Drop down to embedded, and now your bit of code gets a 512 byte buffer to work with, and your paycheck depends on making it work, well, you will figure out a way to make it work.

For data coming in over the wire, generally the protocol has packet size fields. It also means waiting for data to be processed before you accept more.

For data being passed around locally, well someone somewhere has a buffer. Careful tracking of ownership means you reduce unnecessary copying of data, especially if data is going to be processed and then discarded.

Strings are a separate problem; typically you will have some sort of string allocator dedicated just to strings, because strings suck. But even then you will have a max length, and string sizes vary so much that just using the max length for every string on a screen will blow your memory budget away.

It took me a while to get into the proper headspace, and it hurt my heart a little bit when I went back to GC'd languages where memory is thrown away willy nilly!
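
A sketch of the "size field plus fixed buffer" pattern described above, using plain POSIX read(); the 4-byte length prefix and the limit are assumptions, and short reads of the header are ignored for brevity.

    #include <stdint.h>
    #include <unistd.h>
    #include <arpa/inet.h>

    #define MAX_FRAME 4096u
    static uint8_t frame_buf[MAX_FRAME];   /* the one buffer we own */

    /* Read a length-prefixed frame into the static buffer.
     * Returns the frame length, 0 on EOF/error, or -1 if the peer exceeds our
     * limit, at which point we push back (refuse) rather than allocate. */
    static ssize_t read_frame(int fd) {
        uint32_t len_be;
        if (read(fd, &len_be, sizeof len_be) != (ssize_t)sizeof len_be) return 0;
        uint32_t len = ntohl(len_be);
        if (len > MAX_FRAME) return -1;    /* sender must chunk it smaller */
        size_t got = 0;
        while (got < len) {
            ssize_t n = read(fd, frame_buf + got, len - got);
            if (n <= 0) return 0;
            got += (size_t)n;
        }
        return (ssize_t)len;
    }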


> In the embedded world, code statically allocates all memory it needs up front.

Except for things like the ESP32, which uses malloc pretty liberally within ESP-IDF. Networking is hard without it, it seems?


Thankfully TCP/IP packets max out at 64K.

Most old school formats have max size limits for good reasons, and the really good protocols allow the receiver to specify the maximum amount of data to send at a time!

Of course if your device has 256K of memory, reserving 64K for a max size TCP/IP packet is incredibly wasteful.

But if you can receive one of those packets at any time, realistically you need to have 64K available at any time so....

I'm not sure about newer stuff. Last time I did embedded things like Bluetooth were sectioned off into their own bit of memory, malloc was used, and it was a source of bugs that we weren't happy about.


Imagine my disappointment after working on a fixed-allocation HTTP server stack to discover that even `esp_log.h` does several malloc() calls [1] per log output line... thankfully that was only on the ESP8266_RTOS_SDK, and esp-idf isn't quite as bad.

[1] https://github.com/espressif/ESP8266_RTOS_SDK/blob/89a3f254b...


Networking can use allocators other than malloc. You mostly need arbitrary allocation for stuff like packet payload which can be arbitrary size (within limits). Many other structures, like IP headers, etc, can use much simpler, faster, safer allocators.


I know LwIP is pretty good around that sort of thing. It was just a little bit of a shock to see a lot of usage of malloc throughout ESP-IDF. But then it does have to handle Wi-fi, Bluetooth and more. It runs well though! And it does somewhat amortise its use of malloc to setup functions where possible.

One of the challenges I’ve had in the firmware work we’ve been doing is the sheer amount of network-received runtime configuration. Makes statically allocating everything difficult. Not impossible of course, but our memory usage ends up sky high because we end up allocating buffers for the 70% of code that isn’t even going to be executed! So using “dynamic” allocation once, when the runtime config determines this functionality is actually needed, has worked reasonably well

Though it really reminds me of avoiding the GC in garbage collected languages, which is a little amusing to think about


If you're running on a regular OS, you can't escape it by just not calling malloc.

You'll still get lower performance if the system is low on memory, because you got swapped out - it doesn't have much if anything to do with how much you allocated. Wiring it is a different story, and if you're an audio app maybe you should allocate and mlock(), but not a database.

> hot take here, programs should stop assuming malloc can return infinite memory anyway.

Assuming this is a perfectly fine way to make reliable software.

https://en.wikipedia.org/wiki/Crash-only_software


> Wiring it is a different story, and if you're an audio app maybe you should allocate and mlock(), but not a database.

Why wouldn't you wire database memory?


If you're the only one on the system then nobody's going to swap you out, so it won't do anything. If you aren't the only one on the system, then you're probably not the most important process.

No reason to fight the OS unless you're a hard realtime process, which a database usually isn't.


Many databases wire their memory because swapping would be operationally catastrophic and they position themselves as being designed for "mission-critical" applications. I've seen databases get swapped a couple times many years ago, it isn't pretty. Databases have a roughly fixed memory footprint, so sensible configurations don't contribute to memory pressure experienced by other processes.

Database implementations are capable of swapping most of their address space themselves, independent of and better than the OS -- it is a large part of their function. As such, optimization is heavily dependent on both memory and storage being physical. This being true is a precondition for most architectural macro-optimizations to work.
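
For reference, wiring memory on POSIX systems is typically done with mlock()/mlockall(); a minimal sketch in C (assuming the deployment grants CAP_IPC_LOCK or a raised RLIMIT_MEMLOCK).

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void) {
        /* Wire everything mapped now and everything mapped later, so the OS
         * never swaps the database's working set out from under it. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");  /* typically needs CAP_IPC_LOCK or RLIMIT_MEMLOCK */
            return EXIT_FAILURE;
        }
        /* ... allocate the fixed-size pools up front, then run ... */
        return EXIT_SUCCESS;
    }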


> memory during startup has been idiomatic for database architectures as long as I can remember.

How can this be true given most mainstream databases do not do this? Maybe a pattern, sure, but idiomatic, not so much. I only add this comment in defense of the poster, as this comment seemed to reduce their achievement/efforts.


Not just databases but really any kind of latency-sensitive software, especially with “realtime” requirements (e.g. telecom gear)


Yep. After algorithmic improvements, tracking (and limiting) the use of malloc in a program can improve performance by an order of magnitude or more. Not only is allocating slow, but having data spread all over main memory is terrible for the CPU’s cache and branch predictor.

I’ve gotten a lot of massive performance wins by rewriting code in rust because I can control memory allocations more closely. Some people I’ve talked to about this seem to just hear “rust = fast”. I’ve seen expensive rust rewrites which somehow end up running slower than the equivalent Javascript code. Turns out if you write rust like you write javascript, with boxes, vecs and .clone()s everywhere it’s going to be slow.


Building a system that allocates memory once up-front requires up-front thinking.

sqlite apparently has some support for this, although of course you make it more predictable if you control all the code.

https://www.sqlite.org/malloc.html

They even provide some nice math for you to create the fixed size heap!

The problem of dynamic memory allocation, and specifically the problem of a memory allocator breakdown, has been studied by J. M. Robson and the results published as

    J. M. Robson. "Bounds for Some Functions Concerning Dynamic Storage Allocation". Journal of the Association for Computing Machinery, Volume 21, Number 8, July 1974, pages 491-499.


Thanks for the reference, that Robson proof is amazing. Published in 1974... I wonder what other treasures are hidden in the literature.


Isn't this just a bunch of static object pools? That's extremely common in embedded programming. I'm glad to see someone recognized it could apply to a database.

My students do this on the GameBoy Advance specifically so they don't need to learn about dynamic memory allocation to get useful things done early in my course. In fact, they only learn about dynamic memory allocation for academic reasons later on and never _need_ to use it in their games.


Do you mind sharing the name of the course? It sounds interesting. (I didn’t see anything in your profile.)


> Isn't this just a bunch of static object pools? That's extremely common in embedded programming. I'm glad to see someone recognized it could apply to a database.

Zig in particular makes this really approachable. Too dumb to pull it off in C myself, trivial in Zig.

YMMV.


I do small projects statically all the time! Writing a simple bf interpreter can be done so statically that you don't even need to store the program in its own array; program text can just be taken directly as const char*

https://github.com/ijustlovemath/bfi

You can also do ergonomically nice things using static allocation in many data structures, like linked lists (as a tiny example: https://github.com/ijustlovemath/linked-list/blob/master/ll....). That way you can focus on the concepts that matter: the data structures, and not frustrate newbies with opaque and difficult-to-reason-about allocation issues (at least not at first!)


> My students do this on the GameBoy Advance

This sounds interesting - Any information / course material available online? (I found nothing in your hn profile...)


The idea of having programs that don't crash under memory pressure is one of the things that I really hope developers will start striving for more in the near future.

A database that doesn't OOM crash is great, and the same is also true for more consumer-grade software.

I have 32gb of RAM on my PC and whenever I'm encoding a video with DaVinci Resolve, I always see Firefox and Discord crash almost instantly.


What would happen in your utopia is that DaVinci Resolve would fail to start because Firefox or Discord would reserve too much memory. Or maybe Firefox would demand you close open tabs before allowing you to open a new one. We have a solution to the problem you describe, it's called paging to disk and hoping that we won't need to read or write to that memory very often (which works great for something like Firefox where a bunch of pages are sitting idly in the background).


Too bad we can't design some protocol for applications to "work together" to ensure each has the memory they need. Like some sort of "application collective memory management" where each application has certain memory it needs/wants and depending on the priority of that need and the current priorities of the other applications, it could know to reduce memory usage during those times when some other application has a higher priority. For one, it would help in your case, because background video processing might be a lower priority, or perhaps you want it to finish as fast as possible, so that Firefox and etc. just "freeze" themselves for a while and drop as much memory as is practicable (ie: deleting currently rendered pages from memory and re-rendering them later, or paging those out to disk for a while).


Related references:

- Neal Walfield’s papers on resource management[1,2]

- The agoric papers[3] from the E people (who sadly now seem to be trying to steer the whole thing in the cryptocurrency direction)

[1] http://walfield.org/papers/2009-walfield-viengoos-a-framewor...

[2] http://walfield.org/papers/2011-walfield-smart-phones-need-s...

[3] https://papers.agoric.com/papers/


The E/Agoric people have been trying to steer the whole thing in the cryptocurrency direction for about 35 years now, some of them for more than 50 years. If you are enthusiastic about their previous work, consider that possibly they might have been right about this, however unfashionable cryptocurrency might be on HN this year.


Eh. They’ve certainly always foreseen people using capability systems they were building to sell computing services to each other, but I don’t see anything pointing to the particular stylized flavour of computing of cryptographic smart contracts before those developed independently(?) over the last decade.

While I’ll admit that I find the corporate gloss of the Agoric website thoroughly repulsive, I would actually like to see cryptocurrencies succeed as a boring payment instrument. It’s just that my interests have always lain more in userspace resource allocation (etc.) among applications on a single machine, and in that context both E and the original agoric papers are completely separable from their money aspects, while any developments in the cryptographic consensus direction seem unlikely to be. Conversely, it’s not like the original object-capability work seems universally attractive to me—Shapiro advocated some system mechanisms that are, as far as I can tell, DRM-complete. But those are, again, separable from the rest of the ideas.


It's definitely true that E smart contracts were conceived to run on a secured server (or in tamperproof hardware), which was known to be feasible, rather than via a decentralized consensus mechanism, which was not. But they've been building smart contracts since last millennium.

Moreover, they didn't just foresee people selling computing services to each other; some of them built what we would call cloud computing platforms where the platform owner sold computation and the platform users could sell services to each other on it, starting in 01968, which was successful for decades: https://en.wikipedia.org/wiki/Tymshare

That's the experience that gave rise to the object-capability approach in the first place.

And, as far as I know, the Digital Silk Road paper from Agorics was the first workable proposal for a secure decentralized digital money.


iOS has had something similar but simpler since the very beginning: https://developer.apple.com/documentation/uikit/uiapplicatio...


It doesn't work very well and can't be made to work better. free()ing memory doesn't return memory to the system fast enough and it's easy to accidentally create more memory pressure while trying to respond to this (like by touching a page that's compressed/swapped out).

The only safe thing you can do is free entire pages at once, but that's not necessarily effective.

That's why iOS typically just kills you instead.


Of course you can only return pages at a time. That is the minimum granularity at which virtual memory works.

You don't want malloc to instantly release memory to the system on free(), and malloc() is then constrained by the same page multiple at a time restriction as anything else so it can't release fragmented memory.

What you should be doing is trying to save state so that you can return to original state, not fighting to keep going.



Firefox should in theory be able to crash individual tabs to reclaim memory, but in practice I just see it die.


Isn't it likely because it's not crashing by itself, and is being killed by the OS?


Are you both running with swap turned off? You can't just turn off swap on a system designed for it; of course it's going to break.


>What would happen in your utopia is that DaVinci Resolve would fail to start because Firefox or Discord would reserve too much memory.

Great! Then users can annoy devs who write crappy software “I can’t open anything else when discord is open!”

Vs today where the OS essentially conceals crappy software from the user via a billion layers of abstraction and hacks


Well, no, because the tradeoff with static allocation is that you need to allocate all the memory you want upfront. If you're requesting memory via dynamic allocation, you can be less wasteful by only requesting what you need and returning what you don't need to the OS.

In the case of e.g. Firefox, do you really want it to always be doing bookkeeping for 100 tabs when you only need 1? Or having to restart Firefox every time if you've hit the upper limit of tabs? Or just let Firefox dynamically allocate memory as you open more tabs?

There are many compelling reasons for dynamic memory allocation, and "the OS essentially concealing crappy software from the user via a billion layers of abstraction and hacks" is a fundamental misunderstanding of its purpose.


Allocating upfront is only one technique.

Somehow DaVinci Resolve is capable of using all the free memory it can find and not more than that, otherwise it would never be able to complete any rendering job. That's one "utopic" program that's pretty real.

> We have a solution to the problem you describe, it's called paging to disk and hoping

No. The solution is to be smart about memory usage, and to handle failures when trying to allocate memory. DaVinci would become unusable if it started swapping when rendering.

Maybe for some programs swapping is a fine solution, but the argument that software engineers shouldn't concern themselves with designing for OOM conditions is absolutely ridiculous.

That's not a 'utopia', that's what we should concretely be striving for.


It’s difficult to see how that would be possible without a hard break from legacy systems when fork/exec practically forces overcommit and very dubious resource accounting on us. (One consumer system that handled OOM conditions with remarkable grace was Symbian, but of course it forced an extremely peculiar programming style.)


That's a fair point, but it's also true that we're never going to break away from legacy systems until we start writing software that has a chance of functioning correctly when running out of resources.

The good news is that we do have ecosystems where you don't have overcommit, like webassembly and embedded devices, and if you're writing libraries meant to also work on those platforms, then you will have some good building blocks that can be used to iterate towards more robust software.

That's how you're expected to approach library writing in Zig.


The purpose of abstractions is so you don't have to think about what's under them.

malloc() can fail if you try to allocate many GBs at once, but if you let it fail for 32 byte allocations then what are you even going to do about that? The only solution is to crash your process, but that will lose user data if you haven't saved it yet, and purity isn't worth that.


As far as I'm aware, writing to a file doesn't require any memory allocation on any of the major OSs, so you can most definitely save the user's work when running out of memory.

> The only solution is to crash your process

I disagree. I think this is a bad mindset that we allowed ourselves to slide into. There's a lot more that can be done when an allocation fails, starting from what I mentioned above.

That said, I'll agree that at least you need a programming language that can help you design and maintain allocation-free codepaths in your program, and that's hard if your language has builtins that allocate implicitly, as you will have to avoid them all.

Zig has the philosophy of keeping all memory allocations explicit precisely for this reason. In the beginning people used to say that it was too extreme, but now Rust is also embracing this approach in order to be used in the Linux Kernel (one place where the "only solution is to crash" learned helplessness will not be well received), so hopefully there will be even more options in the future.


> As far as I'm aware, writing to a file doesn't require any memory allocation on any of the major OSs, so you can most definitely save the user's work when running out of memory.

Course it does. You can’t run any code in a typical app environment without allocating heap objects as you go, and even if you didn’t, anything that accesses a swapped out page is “allocating physical memory”.

And once it gets into the kernel, file systems have to allocate their own objects to track new writes. Though if the kernel is failing to allocate memory you have worse problems.


https://man7.org/linux/man-pages/man2/write.2.html

write doesn't require you to allocate any memory
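
A sketch of what such an allocation-free save path can look like in user space (the kernel may still allocate internally, as noted above): the serialized data goes into a buffer reserved at startup, and the flush is plain write(2). Buffer size and file name are arbitrary.

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Reserved at startup, so the OOM path never needs malloc. */
    static char save_buf[64 * 1024];

    /* Copy whatever state fits into the preallocated buffer and flush it with
     * write(2); nothing on this path allocates in user space. */
    static void emergency_save(const char *state, size_t state_len) {
        size_t n = state_len < sizeof save_buf ? state_len : sizeof save_buf;
        memcpy(save_buf, state, n);

        int fd = open("recovery.dat", O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) return;
        size_t off = 0;
        while (off < n) {
            ssize_t w = write(fd, save_buf + off, n - off);
            if (w <= 0) break;
            off += (size_t)w;
        }
        close(fd);
    }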


You have to build the serialized data you’re writing first.


It would be nice if you could specify e.g. a Postgres table be backed by a (really big) "ring buffer" rather than the usual storage.

Maybe that breaks some relational features on the table, but being able to painlessly query the most recent X records for some interesting transient log data would be nice.


I ended up using a ring buffer in Redis recently to store the last N values of some database lookups that are slow. It works great! Didn't take that much logic, either, since Redis has easy transactions. I just have the "current buffer index 'i'" as a key-value, the "buffer length 'N'" as a key-value, then 'N' numbered entries as key-values. I prefix these with the data identifier that it caches, so you can have several ring buffers active. you can do this with a Postgres table just as easily, I assume!

I also am building a ring buffer data structure for my hobby database that I use for personal stuff that uses FoundationDB [though I have been unhappy with the state of the FDB Go bindings, which are somewhat out of date versus the C client library for FDB and aren't very Go-like in design (the C libraries are the "gold standard" and they use FFI in all the other bindings to connect to those)].

Added later: it was a bit harder to deal with resizing the buffer, so when that happens I do some data shifting/dropping depending on which way we're resizing, effectively initializing the buffer anew because that was easier (again, a Redis transaction saves me a lot of concurrency work!), but lucky for me we would only resize each buffer at most 1 time per minute due to how our configs work.


OOM is usually a good reason to crash. Sure, a DB might be able to do something creative with its cache, but if you can't allocate, there's not much the average program can do anymore. I guess browsers can kill their child processes, but that's not a long-term solution. If you're out of memory, your computer will be lagging from paging/swapping memory in-and-out, so killing yourself helps relieve some pressure.

Also, are you sure it's the programs failing and not some OOM killer coming around? Could your pagefile/swap be full?


The article says it rejects incoming connections. That seems like a much better behaviour than OOM - clients can back off and retry and keep the overall system stable.


Thanks!

Joran from the TigerBeetle team here.

This was in fact one of our motivations for static allocation—thinking about how best to handle overload from the network, while remaining stable. The Google SRE book has a great chapter on this called "Handling Overload" and this had an impact on us. We were thinking, well, how do we get this right for a database?

We also wanted to make explicit what is often implicit, so that the operator has a clear sense of how to provision their API layer around TigerBeetle.
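
Not TigerBeetle's actual code, but a minimal C sketch of the idea of rejecting connections once a fixed pool is full (POSIX sockets; the limit is hypothetical).

    #include <unistd.h>
    #include <sys/socket.h>

    #define MAX_CONNECTIONS 64

    static int conn_fds[MAX_CONNECTIONS];
    static int conn_count = 0;

    /* Accept only while a preallocated connection slot is free; otherwise close
     * immediately so the client gets a fast, explicit failure and can back off
     * and retry, instead of queueing inside the server. */
    static void accept_or_reject(int listen_fd) {
        int fd = accept(listen_fd, NULL, NULL);
        if (fd < 0) return;
        if (conn_count == MAX_CONNECTIONS) {
            close(fd);               /* shed load at the edge */
            return;
        }
        conn_fds[conn_count++] = fd;
    }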


I can easily imagine that kind of design getting into a state where it's "up" but not accepting any connections indefinitely (if it's using just enough memory to run itself, but doesn't have enough to accept any connections). Crashing early is often a good way to reduce the state space and ensure consistent behaviour, since you will always have to handle crashing anyway.


> using just enough memory to run itself, but doesn't have enough to accept any connections

The situation you described is exactly what is prevented with static allocation. It's a very predictable design - it will successfully handle X requests and will fail any requests above that.

An important thing implicit in this design - or at least that I'm assuming - is that there is no queue of requests. Queuing is a major source of problematic behaviour during overload. It's /queuing/ that causes RAM to increase with load (in a system with fixed max of requests being processed). It's /queuing/ that is why latency increases as you near load limits (vs errors).

I'm with you that early errors are better, and the scheme used here achieves that on a request level.


Crashing and losing all existing connections often merely leads to an avalanche of subsequent requests, accelerating overload of your systems. I've seen this countless times.

Being crash tolerant is one thing. But "crash early, crash often" is absolutely horrible advice.


> Crashing and losing all existing connections often merely leads to an avalanche of subsequent requests, accelerating overload of your systems. I've seen this countless times.

Sure, but again, that's a scenario that you need to handle anyway.

> Being crash tolerant is one thing. But "crash early, crash often" is absolutely horrible advice.

I've found it to be good advice. It's a big part of why Erlang systems have been so reliable for decades.


In Erlang a "process" is more akin to a thread or fiber in most other languages, and "crashing" more like an exception that is caught just across a logical task boundary. Importantly, in Erlang a process "crashing" doesn't kill all other connections and tasks within the kernel-level Erlang process.

And that's why Erlang is so resilient, precisely because the semantics of the language make it easier to isolate tasks and subtasks, minimizing blast radius and the risk of reaching an inconsistent or unrecoverable state. I often use Lua (as a high-level glue language) to accomplish something similar: I can run a task on a coroutine, and if a constraint or assertion throws an error (exceptions in Lua are idiomatically used sparingly, similar to how they're used in Go), it's much easier to recover. This is true even for malloc failures, which can be caught at the same boundaries (pcall, coroutine.resume, etc) as any other error.[1] It's also common to use multiple separate Lua VM states in the same kernel process, communicating using sockets, data-copying channels, etc, achieving isolation behaviors even closer to what Erlang provides, along with the potential performance costs.

[1] While more tricky in a language like Lua, as long as your steady-state--i.e. event loop, etc--is free of dynamic allocations, then malloc failure is trivially recoverable. The Lua authors are careful to document which interfaces and operations might allocate, and to keep a core set of operations allocation free. Of course, you still need to design your own components likewise, and ideally such that allocations for connection request, subtasks, etc can be front-loaded (RAII style), or if not then isolated behind convenient recovery points. Erlang makes much of this discipline perfunctory.


Isolation is indeed a cornerstone of what makes this style work, but "let it crash" is another one, and just as important IMO. "In this language crashing would take down other connections" does not make it safe to continue without crashing and doesn't remove the need to crash and recover, or something equivalent to that - rather it's an argument for finding a way to separate connection tracking from more complicated logic that might get into unexpected states.


Whoa. In the Erlang equivalent here, the client would crash (GenServer.call will timeout if the called server is overloaded), not the service.


When SQL Server goes OOM, it's often because the system can't free memory quickly enough, not that there is literally not enough memory to continue operating. Sure, it could be more aggressive with freeing memory, but at some point it just makes more sense to restart the entire system so that overall performance doesn't nosedive.


Software for critical infrastructure often handles no/low memory situations in a graceful manner. I've worked on network switch software where we were expected to work indefinitely with zero remaining memory. A single packet drop is unacceptable, at 10-40G line rate. Any new malloc needs to be justified with a design document.

So yeah, average user software can do a lot, it's just that we've given up on reliability and robustness as an industry. That's why sometimes your phone fails to make a 911 call, and sometimes you need to reset your car.


> Could your pagefile/swap be full?

This presumes that one is indeed using a pagefile or swap partition. I know of quite a few folks who skip that on SSDs out of fear of SSD wear.

Personally, in this day and age I don't really give a damn about SSD wear (that's what backups are for), so I'll happily create a swap partition, but not everyone is as comfortable with that.


In some areas, especially hard real-time/embedded or safety critical, dynamic memory allocation is strongly frowned upon, and for good reasons.

Even if your application does not have stringent safety/predictability requirement, following the principles of Data Oriented Design leads naturally to a radically different approach to memory management.

I am glad to see this kind of project and these design choices.


Thanks!

Data Oriented Design runs like a river through TigerBeetle—it's always on our mind.

By the way, have you seen Andrew Kelley's Handmade Seattle talk on Practical DOD? [1]

[1] https://vimeo.com/649009599


Yes, I've seen this one, and a few others.

I think DOD should be more like an engineering mindset than a strict set of rules.

From my experience it can be impractical/cumbersome, especially when prototyping, because DOD usually pays off only in the long run.


Should be titled "without using malloc" or something. It's pretty much impossible to do anything with user input strings unless you have some sort of memory management. Maybe you have one giant buffer that you use for strings in a particular session, but that still involves memory management because even if you allocate the buffer at startup, using one buffer for multiple strings is just a type specific arena allocator.

As someone who works on low-level database code, this approach is going to be EXTREMELY wasteful. Instead, it's much better to have a more sophisticated understanding of memory use by subsystems and have some concept of required memory and caches. Each system allocates the minimum memory it needs at startup and then treats the rest of memory as caches which can be emptied as necessary to preserve fairness among all the various subsystems.


> Should be titled "without using malloc" or something. It's pretty much impossible to do anything with user input strings unless you have some sort of memory management

Title says "without dynamic memory allocation" == what means without malloc in this context, so static memory allocation ... where you do your own kind of memory management in fixed size arrays or more sophisticated if you want. It didn't say without memory management? (And article explains also why no strings or other unknown/arbitrary length stuff..)


Also, the article mentions the only entities are two fixed size types: accounts and transfers.

There are no user strings in TigerBeetle. :)

If anyone wants to see what you can store in accounts and transfers, check out the reference docs!

https://docs.tigerbeetle.com/reference/accounts

https://docs.tigerbeetle.com/reference/transfers


That's wild. So every column is fixed width?

Is anything nullable, or for like the user_data column, do you just use 0x0 to indicate no record?


Yep, every field is fixed width. Nullable would just be zero, or N zeroes for the width of the field.
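
For illustration only (not TigerBeetle's actual layout; see the reference docs linked above), a hypothetical fixed-width record in the same spirit, in C: every field has a fixed size, and "null" is simply the all-zero value for that field.

    #include <stdint.h>

    typedef struct {
        uint8_t  id[16];         /* 128-bit identifier */
        uint8_t  user_data[16];  /* all zeroes means "not set" */
        uint64_t amount;
        uint64_t timestamp;
    } record_t;

    /* Records never grow or shrink; the on-disk and in-memory sizes match. */
    _Static_assert(sizeof(record_t) == 48, "records must stay fixed width");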


> using one buffer for multiple strings is just a type specific arena allocator.

Sure, but there's a lot of advantages of entirely controlling memory allocation in-process.

> this approach is going to be EXTREMELY wasteful.

Could you elaborate why? Sure, you have to do a bit more juggling in-process, but I would imagine this would be potentially more efficient than having conventional mallocs.

In the environment it is designed to operate in, there is usually a fixed memory limit per process, so you might as well grab all that up front.


Thanks!

Joran from the TigerBeetle team here.

> I would imagine this would be potentially more efficient than having conventional mallocs.

Yes, in our experience, static allocation means we use sometimes 10x less memory. For example, TigerBeetle's storage engine can theoretically address up to 100 TiB with less than 1 GiB of RAM for the LSM in-memory manifest, which is efficient.

Because we think about memory so much upfront, in the design phase, we tend not to waste it.


I think it should be noted that TigerBeetle is an extremely specialized database. It's not a general-purpose DB for storing arbitrary data, all it stores is numeric financial account information.


I like this a lot! I’ve been trying to find ways to avoid memory allocation as much as possible in my projects as well. So far I’ve settled on mmap-ing large slabs at a time and arena allocating (Linux’s mremap makes growing the allocation really efficient too!).

However, come to think of it, just using a fixed amount is just as good if not better, since I’m always going to be constrained by memory in some way anyway. And if I allocate e.g. 80% of the system’s memory upfront, it won’t _actually_ be mapped by the OS until I touch it. There may be some interesting stuff with using huge pages for the first few pages and using smaller and smaller pages as you approach your memory limit. Definitely will give this a try!
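
A sketch of the mmap/mremap slab approach described above, in C (mremap is Linux-specific; sizes are up to you).

    #define _GNU_SOURCE           /* mremap is Linux-specific */
    #include <stddef.h>
    #include <sys/mman.h>

    /* Reserve a large slab up front; pages are not actually faulted in until
     * touched, so reserving generously is cheap. */
    static void *slab_create(size_t size) {
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }

    /* Grow the slab in place where possible; mremap moves it otherwise. */
    static void *slab_grow(void *p, size_t old_size, size_t new_size) {
        void *q = mremap(p, old_size, new_size, MREMAP_MAYMOVE);
        return q == MAP_FAILED ? NULL : q;
    }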


> if I allocate e.g. 80% of the system’s memory upfront, it won’t _actually_ be mapped by the OS until I touch it

What about people running vm.overcommit_memory=2?


In practice that's a broken config because any program that uses the fork syscall will hog all your memory.

(Yes fork-exec is stupid, but there's not much you can do about it)


Yes, this approach is followed as well, but now you're at the problem of writing a memory allocator by yourself (which might of course turn out to be better for your use case). So the issue of writing code that avoids allocation still prevails, as now you probably want to know what to allocate memory for; if you were supposed to make a general-purpose allocator, it is likely it won't outperform libc malloc.


I agree on some level, but I think if you’re intentional with your design you can get by with a very simple arena allocator where an allocation is a single subtraction from a pointer.

In my experience, most programs have a few long-lived data structures that can be initialized up front, and then lots of small data structures that are created and destroyed frequently. If you do it right, you can put the long-lived guys at the beginning of the arena and then just wiggle your arena head pointer back and forth for the small guys.

This does take some more thought and consideration, and is definitely a bit trickier than just using malloc when you need it, but for me the trade off is worth it (for one I gain some amount of satisfaction from imagining my data all neatly in a row).
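
A sketch of that arena in C: allocation is a bounds check plus a pointer bump (a subtraction, since this one grows downward), and freeing the short-lived tail is just restoring a saved mark. Alignment must be a power of two; error handling is kept minimal.

    #include <stddef.h>
    #include <stdint.h>

    #define ARENA_SIZE (1u << 20)

    static uint8_t arena[ARENA_SIZE];
    static size_t  arena_top = ARENA_SIZE;   /* bumps downward */

    /* Allocation: one bounds check, one subtraction, one align-down. */
    static void *arena_alloc(size_t size, size_t align) {
        if (size > arena_top) return NULL;               /* arena exhausted */
        size_t top = (arena_top - size) & ~(align - 1);  /* align: power of two */
        arena_top = top;
        return &arena[top];
    }

    /* Free everything allocated since the mark was taken. */
    static size_t arena_mark(void)            { return arena_top; }
    static void   arena_reset(size_t mark)    { arena_top = mark; }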


I use a double-ended stack allocator for most allocations in personal C projects. The second end adds a significant amount of flexibility; temporary allocations can accumulate at one end of the stack without preventing a longer-lasting allocation from being made later (yet before the temporary allocations can be freed) on the other end.
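
A sketch of that double-ended variant in C (alignment omitted for brevity): long-lived allocations grow from one end, temporaries from the other, and the allocator is exhausted only when the two ends meet.

    #include <stddef.h>
    #include <stdint.h>

    #define STACK_SIZE (1u << 20)

    static uint8_t mem[STACK_SIZE];
    static size_t  lo = 0;            /* long-lived allocations grow upward  */
    static size_t  hi = STACK_SIZE;   /* temporary allocations grow downward */

    static void *alloc_long_lived(size_t size) {
        if (size > hi - lo) return NULL;   /* the two ends have met */
        void *p = &mem[lo];
        lo += size;
        return p;
    }

    static void *alloc_temp(size_t size) {
        if (size > hi - lo) return NULL;
        hi -= size;
        return &mem[hi];
    }

    /* Temporaries are freed in LIFO order by restoring a saved high mark. */
    static void temp_reset(size_t saved_hi) { hi = saved_hi; }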


Oh that's also great! Thanks for the tip, I'll have to incorporate it!


I've been an embedded software developer for almost 2 decades, and this style used to be near universal in embedded, but has become less so, much to my dismay. I even saw a project not too long ago where a large C++ STL List (i.e. a linked list) was created at startup, then later freed piecemeal, causing so much heap fragmentation on the poor embedded system malloc implementation.


I think it was when I was reading the zlib source code, while dinosaurs roamed outside my college window, where I first saw a variant of this, of a library that statically allocated memory for a graceful shutdown routine before doing anything, so even if memory were exhausted by a bug in other code later on it could still exit cleanly.


It's still done. I did the code review for the guy who wrote it for our server. If we're crashing due to an OOM situation we still have enough memory to write out a minidump.


The pain here is to test and keep testing these paths. Same as some 'free disk space' monitoring...


Joran from the TigerBeetle team here!

TigerBeetle uses Deterministic Simulation Testing to test and keep testing these paths. Fuzzing and static allocation are force multipliers when applied together, because you can now flush out leaks and deadlocks in testing, rather than letting these spillover into production.

Without static allocation, it's a little harder to find leaks in testing, because the limits that would define a leak are not explicit.


Can you provide some pointers regarding the simulation testing for more information?


Sure!

Here's an overview with references to the simulator source, and links to resources from FoundationDB and Dropbox: https://github.com/tigerbeetledb/tigerbeetle/blob/main/docs/...

We also have a $20k bounty that you can take part in, where you can run the simulator yourself.


I agree. However, in our case, that pain has proven to be less than not having those minidumps.


Oh for me too, but I wish it was easier, like something supported seriously by my language of choice, my OS of choice or my libc of choice, with the flick of a flag... I don't retest the whole libc and kernel every time (although, err, someone should, right? :-)


What environment is the server running in? Can you prevent the Linux OOM killer from killing your process?


The OOM killer can be disabled completely on a system, or a process can be excluded from it by setting an OOM flag under `/proc/<pid>`. It's not generally considered good practice in production, but I think that's largely because most software does not do what is being suggested here... in the scenario where you have, it seems like the right thing to do.


It's definitely good practice in production and is often necessary.

The techniques mentioned above will (perhaps surprisingly) not eliminate errors related to OOM, due to the nature of virtual memory. Your program can OOM at runtime even if you malloc all your resources up front, because allocating address space is not the same thing as allocating memory. In fact, memory can be deallocated (swapped out), and then your application may OOM when it tries to access memory it has previously used successfully.

Without looking, I can confidently say that tigerbeetle does in fact dynamically allocate memory -- even if it does not call malloc at runtime.


> It's definitely good practice in production and is often necessary.

I'd be curious if you have any resources/references on what is considered good practice in that now, then.

It's been a long time since I did much ops stuff outside a few personal servers, so it may well be my background is out of date... but I've certainly heard the opposite in the past. The argument tended to run along the lines that most software doesn't even attempt to recover and continue from a failed malloc, may not even be able to shut down cleanly at all if it did anyway, and the kernel may have more insight into how to best recover...so just let it do its thing.


Sure. A lot of systems are single-app systems. They run a single instance of a rdbms, app, etc - or several instances all of which are represent the sole purpose of the system. In these cases it's generally good practice to disable OOM because if the app is killed, nothing else on the system has value anyway.

It would be ideal to not tickle the OOM killer, but it does happen. A great example would be Redis, using bgsave. There's a lot to criticize about redis and bgsave and I don't mean to defend its architecture. Its behavior is fairly extreme so it provides an example useful for illustration. Because Redis forks its entire in-memory state and writes itself to disk, it will sometimes appear to the system as if it has doubled in memory use. It's a huge, sudden memory pressure event, exacerbated by any writes forcing CoW allocations while the bgsave runs.

Many other app servers or database systems can have similar sudden memory pressure. You basically don't ever want a primary app on a system to be killed, so it's usually best to just disable OOM killer entirely on these processes.

There's often no failed malloc() in these situations. Often you will see OOM in situations where no malloc() has ever failed, because malloc() is just allocating address space without any attached memory, which is nearly free, and which almost always succeeds. The failure will occur later when the allocated page is first written to -- causing a page fault and an actual allocation to happen. There's no associated function call or system call. Page faults are triggered simply by accessing memory. This is why the OOM killer exists, as there's no function available to return a failure to when memory can't be produced. Such is the idiosyncratic behavior of lazily-allocated memory in modern virtual memory systems.

tldr: Malloc never fails, because malloc allocates address space not memory. Memory writes trigger failures, because writes create page faults and trigger the actual allocations. But the failures often express behavior elsewhere, via the action-at-a-distance magic of the OOM.


We're aware of this, in fact, and do have a plan to address virtual memory. To be fair, it's really the kernel being dynamic here, not TigerBeetle.


Oh for sure. Not casting any shade at all on your work - I am really happy to see what you're doing. This kind of thing has a lot of value even in a virtual memory environment.


What language? What does the server do?


C++ on Windows. It's an old codebase that has been used to run custom test hardware.

We also target Linux with the same codebase, but the environment is a bit different. On Linux we're using a system slice and we're not doing minidumps.


IIRC, .NET does something similar for OutOfMemoryExceptions. The exception is thrown before actually running out of memory. That allows just enough space for the OutOfMemoryException object (and associated exception info) to be allocated.


I believe the runtime preallocates an OOME on startup, so if one is encountered it can just fill in the blanks and throw without needing any additional allocations.


Iirc also perl does this. I’m on the phone right now, but iirc $^O was a variable filled with about 20k of worthless data that could be freed in case of an out of memory condition.


It's called $^M, and you have to explicitly fill it with garbage.


game development has been doing it this way for a long time too. the domain often dictates how we code.


Would you have some links to posts that talk about this style of memory management and allocation?


I don't have some links, but it's more straightforward than you might think (and TFA covers some of it).

1. Bound every whatever that requires memory to some value N

2. For all whatevers that can have a lifetime for your entire program, do that

3. For any whatevers that can't have a lifetime for your entire program (e.g. for a database server like TFA, a connection object would qualify), make a table of size N

4. When you need a whatever (again using a connection object as an example, when accept() returns) grab an unused entry from the table, as sketched after this list. If there is no unused entry do the appropriate thing (for a connection, it could be sending an error back to the client, just resetting the TCP connection, blocking, or logging the error and restarting the whole server[1]).

5. When you are done with a whatever (e.g. the connection is closed) return it to the table.

Those are the basics; there are some details that matter for doing this well e.g.:

- You probably don't want to do a linear-probe for a free item in the table if N is larger than about 12

- Don't forget that the call-stack is a form of dynamic allocation, so ensure that all recursion is bounded or you still can OOM in the form of a stack overflow.

I should also note, that for certain values of N, doing things this way is normal:

- When N=1 then it's just a global (or C "static" local variable).

- When N=(number of threads) then it's just a thread-local variable

1: The last one sounds extreme, but for a well understood system, too many connections could mean something has gone terribly wrong; e.g. if you have X client instances running for the database, using connection pools of size Y and you can have 2XY live connections in your database, exceeding N=2XY is a Bad Thing that recovering from might not be possible
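
Putting steps 1-5 together, a minimal C sketch (using a connection object as the example "whatever"; the names and N are arbitrary, and a linear probe is fine because N is small here).

    #include <stdbool.h>
    #include <stddef.h>

    #define MAX_CONNS 8   /* step 1: bound the whatever to some N */

    typedef struct {
        bool in_use;
        int  fd;
        /* ... per-connection state ... */
    } conn_t;

    static conn_t conn_table[MAX_CONNS];   /* step 3: table of size N, lives forever */

    /* Step 4: grab an unused entry; NULL means "do the appropriate thing"
     * (reject the client, block, or restart, depending on the system). */
    static conn_t *conn_acquire(void) {
        for (size_t i = 0; i < MAX_CONNS; i++) {   /* linear probe, small N */
            if (!conn_table[i].in_use) {
                conn_table[i].in_use = true;
                return &conn_table[i];
            }
        }
        return NULL;
    }

    /* Step 5: return the entry when the connection closes. */
    static void conn_release(conn_t *c) { c->in_use = false; }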


Missing the end of "- You probably don't want to do a linear-probe for a free item in the table if N is larger than about"


I was curious about that too :) My guess is 16 items (edit: OP says 12) but I'm not an embedded developer, and I'm sure it's hardware-specific.

The approach I've seen for larger fixed-sized tables is a free list: when the table is allocated, also initialize an array of the slot numbers, with an associated length index. To allocate, pop the last slot # off the array and decrement the length index. To free, push the slot # onto the end of the array and increment the length index. Checking for a free slot is then just testing for length index > 0.
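
A sketch of that free-list scheme in C (N arbitrary): acquire and release are both O(1), and "any free slots?" is just a length check.

    #include <stddef.h>

    #define N 1024

    static int    free_slots[N];   /* stack of free slot numbers */
    static size_t free_count;      /* how many slot numbers are on the stack */

    static void pool_init(void) {
        for (size_t i = 0; i < N; i++) free_slots[i] = (int)i;
        free_count = N;
    }

    /* O(1) allocate: pop a slot number, or -1 if the pool is exhausted. */
    static int slot_acquire(void) {
        return free_count > 0 ? free_slots[--free_count] : -1;
    }

    /* O(1) free: push the slot number back. */
    static void slot_release(int slot) {
        free_slots[free_count++] = slot;
    }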


Yeah, a stack of free indexes is good. In the rare case of "lots of indexes, but little (extra) memory" a bitmap can be good because it's 1 bit per index rather than ⌈log2(N)⌉ bits per index and has a constant factor speed up equal to the word-size of the architecture.


Thanks, fixed


Nearly everything I learned about embedded systems programming came from either Embedded Systems Magazine articles(https://www.embedded.com/embedded-systems-design-magazine-ar...) or just being a paranoid SoB.

Writing reliable systems means spending a lot of time thinking about what and how failures can occur, minimizing them and dealing with the cases you can't get rid of. Simple is nearly always better.

It also helped to be an EE by education and not a CS major. CS majors can get blinded by their abstractions and forget that they're working on a real piece of HW with complex limitations which require design tradeoffs.


Game Development still revolves around this. For example when loading a level, it used to be you have a huge arena: you load your "serialized" map at one end (say the top), and your other chunks at the bottom - you have some "flexibility" to pop from top or bottom, so early on you arrange for that. Alternatively you split your memory into big fixed-size chunks, and stream these in and out (works well for textures, and primed sounds - e.g. the first 64kb or 128kb of a sound file).


I like this idea a lot.

Are there any other databases which do a similar thing?

This seems like one of those things that has massive benefit if you're willing to invest in the effort + architect it like that from the start.

(Unrelated note: I discovered last night that Zig has io_uring support and some nice wrappers implemented in "std". Which is nice for usecases like databases or networked services)

https://github.com/ziglang/zig/blob/42a3b60c331cf01d1ab80eb7...


> (Unrelated note: I discovered last night that Zig has io_uring support and some nice wrappers implemented in "std". Which is nice for usecases like databases or networked services)

io_uring support was contributed to the Zig standard library by Joran, who is also the creator of TigerBeetle :^)


For the first 5 years that I wrote C, I never used dynamic memory. (It was a hobby.) Later I found that dynamic memory allowed me to perform more operations. By only allocating what I needed, I could process more operations in the same amount of memory. I tripled the amount of operations due to not needing to restrict myself to an arbitrary size.

The Kernel has a feature that takes advantage of this same principle: memory overcommit. Programs are always asking for way more memory than they actually need. The kernel lies to those programs, and says, "yes, yes, I'll give you all that memory... sure thing..." even if it already said that to every other program. This allows more programs to continue operating since they aren't really using all that memory. When the system does run out of memory, it can swap rarely used memory out to a swap file; when that doesn't work, the application can receive an error, or the OOM killer can just kill the process. The result is more programs doing more work, and an occasional app crash.

Virtual machine hypervisors do a similar trick with memory ballooning. If the guest OS has a balloon driver, the hypervisor can overcommit memory for the VM guests, and you can provision more VMs than there's actually memory for.
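
A small illustration of that "the kernel lies" behavior, assuming a 64-bit Linux box with the default heuristic overcommit (vm.overcommit_memory=0); with overcommit disabled (mode 2), many of these requests would fail instead.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        /* Each 1 GiB request is only a promise; no pages are committed until
         * they are touched, so far more than physical RAM can be "allocated". */
        enum { CHUNK = 1 << 30, CHUNKS = 64 };
        char *p[CHUNKS];
        int granted = 0;
        for (int i = 0; i < CHUNKS; i++) {
            p[i] = malloc(CHUNK);
            if (p[i]) granted++;
        }
        printf("granted %d GiB of address space\n", granted);

        /* Touching pages is what makes them real; do enough of that on a loaded
         * system and the OOM killer, not malloc, delivers the bad news. */
        if (granted) memset(p[0], 1, 1 << 20);   /* fault in just 1 MiB */

        for (int i = 0; i < CHUNKS; i++) free(p[i]);
        return 0;
    }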


So I have a theory as to why we were able to get away with that in the past, but can't really any more.

The approach of letting applications just allocate as much memory as they wanted arose in the era of primarily 32-bit operating systems. And that meant that, at a process level, in practice 'as much memory as you want' capped out at 4GB (or 2GB if you are Win32 because... Windows reasons).

And so when you're running 32 bit processes on computers with single-digit-gigs of physical RAM, and multiple-digit-gigs of disk space, and single-core processors, letting processes treat memory as 'all you can eat' and swapping out excess allocations to disk is going to work out fine. Your processor can only be running one application's code at a time, and a significant fraction of that application's entire address space can all be paged into physical memory at once, with the full amount not taking up too much disk space, so for the most part you can get away with just treating 'process address space exhausted' as being as good a signal as any of 'out of memory'.

But now that we're in 64-bit land, address space exhaustion is no longer a thing. If you just allow programs to keep allocating RAM until they run out of 64-bit address space then, to a first approximation, 0% of your application's maximum address space will ever be able to be paged into physical RAM, and you will need several exabytes of hard drive space to hold what's paged to disk.

So the problem is we now don't really have a natural limit akin to the 32-bit address ceiling to impose on applications who want to allocate memory, and it means we're instead put in the position of having to come up with policies ourselves. Stuff like setting a memory limit on a docker container. And so we wind up having to actually now think about and answer the question, 'how much memory should we let each process on this system allocate?'


As far as I'm aware, we're still getting away with it. Your kernel and every program on your computer and every server is still doing it...

Setting a memory limit on docker containers comes from automation. We built new-fangled automation that can schedule multiple copies of an application across N nodes. Then we tell people they can run more and more copies of the app: for CI/CD, for test/stage/prod, for Joe's Test App, etc. All these copies are using up memory. At some point, somebody tries to deploy to prod, which needs to start new copies of the new versions. But now they can't, because the paltry few nodes have run out of memory from all these extra copies of the apps. So now we have to impose limits or we can't deploy to prod.

(you might ask: "why don't we just automate expanding the nodes to accommodate more memory?" and they did. and it doesn't work well, for reasons that give me a headache)

(you might also ask: "why don't they just make an OOM killer for the automated cluster?" and they did. and it doesn't work very well either, for other headachy reasons)

"Software is a gas: it always expands to fit whatever container it is stored in." - Nathan Myhrvold (https://web.archive.org/web/19990202072225/http://research.m...)


Never really looked into it, but if the kernel overcommits memory, and multiple applications actually utilize that memory, it would occur outside the scope of malloc, right? The kernel can’t pass an error to the app because it could occur at anytime during program execution — the only option is to OOM kill the app?

Only now thinking about it, doesn't overcommitting make it impossible for programs to be well-behaved during high system load? Everything is lying to you up until the last second, when suddenly you're dead by the OOM killer. When does malloc even return errors in practice?


>The kernel lies to those programs

Doesn’t that cause the very problem it was meant to solve?

Programs ask for more memory, because they can.

Users run those programs, because there’s no indication anything is wrong.

And programs keep requesting more memory than they need, because the kernel cleans up after them!

Alternatively, a kernel that actually gave non-malicious (?) programs exclusive access to the memory they requested would at first run worse, but perform better in the long term, as users rightly pester app devs about apps that use excessive memory.


You've basically described the difference between how Windows handles memory allocation and how Linux handles memory allocation, each of which has its pros and cons. Not doing overcommit has its own serious drawbacks, especially if you make use of fork().


welcome to the tragedy of the commons.


It seems analogous to fractional reserve banking and the free economic performance it enables by leaning on a mostly reliable statistic about empirical peak concurrent usage of a shared resource.


Welcome back to the 70s! Jokes aside, I think there are use cases for this style of memory allocation even today (embedded systems, for example). I just have never felt the need for it in a DB (or rather, I have many more desiderata for a DB than "I don't have to worry about OOM").


Good to be back—Joran from the TigerBeetle team here!

Static allocation does make for some extremely hard guarantees on p100 latency. For example, for a batch of 8191 queries, the performance plot is like Y=10ms, i.e. a flat line.

And memory doesn't increase as throughput increases—it's just another flat line.

I personally find it also to be a fun way of coding, everything is explicit and limits are well-defined.


> back to the 70s

The sixties actually; that's how memory was managed in Fortran from its inception until the eighties.


> Restrict the length of function bodies to reduce the probability of poorly structured code. We enforce a hard limit of 70 lines per function.

Reminds me of Carmack on the benefits of long functions: http://number-none.com/blow/john_carmack_on_inlined_code.htm...


"Clever" developer limitation like this almost invariably cause developers to come up with even more convoluted solutions to problems than needed. Don't disagree that long functions can be an indication that someone was copy pasting a bunch of stuff around in the codebase, but not always!


Joran from the TigerBeetle team here.

The limit of 70 lines is actually a slight increase beyond the 60 line limit imposed by NASA's Power of Ten Rules for Safety Critical Software.

In my experience, in almost every instance where we've refactored an overlong function, the result has been safer.


70 lines is generous compared to Clean Code dogma/opinion.


Obligatory Hank Hill. I'll tell you hwhat. I've seen some poorly structured code with nothing but small functions. Like a state machine implemented via chains of callbacks that could all just live inside one big switch statement.


> TigerBeetle is written in Zig

This is the key part.

Zig memory allocation is something like: set a bookmark, allocate, and then free everything after the bookmark when done. Someone pushed it to the limit.


*can be. You can do more traditional memory patterns in Zig too. There is also an allocator in the stdlib that directly wraps C's malloc. Likewise, if you're writing a plugin for other software or FFI for a VM language, you can wrap their allocators (if they provide them).


Interestingly enough, the first version of TigerBeetle (that still followed much of the same design decisions) was written in Node! :)

https://github.com/tigerbeetledb/tigerbeetle/blob/main/docs/...


In one of the projects to port a game to a Linux phone with 16MB of RAM, I had to allocate all the memory upfront for performance reasons, to avoid memory allocation/deallocation overhead and fragmentation. Fun times ;)


Listen to me now and hear me later:

Static allocation doesn’t eliminate use-after-free bugs, it just reduces them.

If you have mutable data structures, it’s still always possible to refer to something that used to have one meaning and now means something else. That’s still a use-after-free bug.

I was going to say that it’s similar to how GC can still have memory leaks, but in fact it’s the exact same problem: reachability of data that is no longer relevant/correct.
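A minimal C sketch of the failure mode being described, using a hypothetical statically allocated session pool: no free() is ever called, yet a stale handle still "uses" an object after its slot has been recycled for someone else.

    #include <stdio.h>

    #define MAX_SESSIONS 4

    struct session { int user_id; int in_use; };
    static struct session sessions[MAX_SESSIONS];

    static int session_open(int user_id) {
        for (int i = 0; i < MAX_SESSIONS; i++) {
            if (!sessions[i].in_use) {
                sessions[i] = (struct session){ .user_id = user_id, .in_use = 1 };
                return i;
            }
        }
        return -1; /* pool exhausted */
    }

    static void session_close(int handle) { sessions[handle].in_use = 0; }

    int main(void) {
        int alice = session_open(1001);
        session_close(alice);          /* slot 0 is recycled...          */
        int bob = session_open(2002);  /* ...and handed to a new session */

        /* Logical use-after-free: the stale handle still "works", but it now
           silently refers to someone else's data. */
        printf("alice's handle %d and bob's handle %d alias the same slot; "
               "stale read sees user_id %d (expected 1001)\n",
               alice, bob, sessions[alice].user_id);
        return 0;
    }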


That jumped out to me too. The point of use after free is that your program has a logic error. The logic error doesn't go away just because it doesn't crash.

There's a lot of wisdom in this post, but that made me wince. It's a bit like saying "Our system doesn't crash when we dereference a null pointer." Well, cool. What could possibly go wrong.

Still, I think it's best to just ignore that line and read the rest of the post. It's an unfortunate distraction to the core idea (which does have real value). For example:

> here is how we calculate the amount of memory needed to store messages for TigerBeetle’s consensus protocol: <snip>

> And if we get the calculation wrong? Well, since we enforce static allocation, we can check these calculations with assertions (i.e. that a free message is always available). If we then get the static allocation calculation wrong, this can be surfaced sooner through fuzzing, rather than having no limits and eventual resource exhaustion (and cascading failure!) in production. When combined with assertions, static allocation is a force multiplier for fuzzing!

Being able to put a hard upper bound on the amount of memory your system can use is generally a sign of good design. The alternatives are somewhat messy but still work. (The HN clone I maintain https://www.laarc.io/ runs via a while-loop bash script, because it dies via OOM every couple weeks. But it works great, since every time it dies it just restarts.)
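To make the quoted assertion idea concrete, here is a hedged C sketch (the names and limits are made up, not TigerBeetle's actual code): the pool is sized from a back-of-the-envelope bound, and acquiring a message asserts that a free one is available rather than falling back to the heap, so under fuzzing a wrong bound surfaces as an assertion failure instead of as resource exhaustion in production.

    #include <assert.h>
    #include <stddef.h>

    #define CONNECTIONS_MAX     32
    #define MESSAGES_PER_CONN    4
    #define PIPELINE_MAX        16
    /* The back-of-the-envelope bound: every place a message can be held at once. */
    #define MESSAGES_MAX (CONNECTIONS_MAX * MESSAGES_PER_CONN + PIPELINE_MAX)

    struct message { unsigned char buffer[4096]; int in_use; };
    static struct message pool[MESSAGES_MAX];

    static struct message *message_acquire(void) {
        for (size_t i = 0; i < MESSAGES_MAX; i++) {
            if (!pool[i].in_use) {
                pool[i].in_use = 1;
                return &pool[i];
            }
        }
        /* Reaching this line means the sizing calculation (or a release) is wrong. */
        assert(0 && "message pool exhausted: static allocation bound is wrong");
        return NULL;
    }

    static void message_release(struct message *m) {
        assert(m->in_use);
        m->in_use = 0;
    }

    int main(void) {
        struct message *m = message_acquire();
        message_release(m);
        return 0;
    }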


Thanks!

Joran from the TigerBeetle team here.

Appreciate your balanced comment.

To be fair, we're certainly concerned about logic errors and buffer bleeds. The philosophy in TigerBeetle is always to downgrade a worse bug to a lesser one. For example, if it's a choice between correctness and liveness, we'll downgrade the potential correctness bug to a crash.

In the specific case of message buffer reuse here, our last line of defense then is also TigerBeetle's assertions, hash chains and checksums. These exhaustively check all function pre/post-conditions, arguments, processing steps and return values. The assertion-function ratio is then also tracked for coverage, especially in critical sections like our consensus or storage engine.

So—apologies for the wince! I feel it too, this would certainly be a nasty bug if it were to happen.


So you only reuse memory for objects with a checksum? Buffer bleed is scary if exploitable (see heartbleed) and I’m curious how you are protecting against it in practice.


It's defense-in-depth.

We use what we have available, according to the context: checksums, assertions, hash chains. You can't always use every technique. But anything that can possibly be verified online, we do.

Buffer bleeds also terrify me. In fact, I worked on static analysis tooling to detect zero day buffer bleed exploits in the Zip file format [1].

However, to be clear, the heart of a bleed is a logic error, and therefore even memory safe languages such as JavaScript can be vulnerable.

[1] https://news.ycombinator.com/item?id=31852389


If anything, it might make use-after-frees more painful. Now the address sanitizer won't help you find them.


In Zig, you can run a subset of your tests with a normal allocator (or better yet, a page allocator) that will really help you find these, and then swap it out for the statically allocated system later.


Even resource leaks are still possible. Marking an object in a static array as used is still allocation. Forget to mark it as unused later and it still remains unavailable. Do that enough times and you'll still run out. What this approach does is eliminate unconstrained memory use. Leak-related failures will tend to surface earlier and more predictably, instead of at some distant unpredictable time when the system can't satisfy your program's infinite appetite. And there are resources besides memory.

This approach is still useful, but it's no panacea. It can even leave various kinds of capacity unused, or it can become a tuning nightmare if you try to avoid that. It has long been common in embedded (as others have pointed out) due to the requirements and tradeoffs characteristic of that space, but for general-purpose computation it might not be the tradeoff you want to make when other techniques can solve the same problem without some of the drawbacks.


Where this hair gets particularly thin is when you start with an app that uses fixed memory, and then running many copies of the code turns out to be the solution to a (scalability) problem. At that point pre-allocation becomes a problem, because you've just locked up a hefty chunk of memory.

And that can be a self fulfilling prophecy. One of the problems with memory pools is concurrent access, both by your programming language and the processor, and so it starts looking attractive to spool up a process per task.


Based on my limited experience with using arenas and preallocated memory I find debugging memory issues significantly harder since it’s super easy to accidentally reuse allocations and only notice much later.


The first pointer arithmetic bug I fixed, I only narrowed it down once I realized that the function was getting garbage data. Several others since have been the same way. This data structure has a value that can’t possibly be arrived at, therefore this data is fake.

When you point to real, dead objects, the bug can be anywhere or everywhere. If you have a pattern of using impure functions everywhere this can be a nightmare to identify and fix at scale.


Sounds like there should be some additional metadata memory summed in as well (for the total allocation at startup), to check for such issues?


Yep, I was thinking along the same lines. What if the database kept a reference to a connection for some timer or something and the connection memory later gets used for a different connection? GC languages don't have this problem because you would just allocate a fresh connection each time and the language itself guarantees that said physical memory is not reused unless it is unreachable.


Pedantically, you could say static allocations eliminate use-after-free() bugs.


use after free is really about the consequences of use-after-reuse. The bug is always there, but you don't notice the behavior until someone changes the data out from under you. Until someone reuses the memory allocation, your app appears to be working properly.


I understand your point. My comment was a weak attempt at being clever. Free (the concept) versus free(void* ptr) (the function call) - the latter would not be present without dynamic allocation.


At the risk of shilling for Big Rust (or at least any language with ownership semantics, really) -- well-designed Rust code can truly eliminate use-after-free. I was the lead engineer on a project that wrote an entire filesystem from scratch in Rust, and we did exactly this. 100K lines of code without any use-after-free bugs in the whole development process.

One of the first bits of code we wrote in the project was a statically allocated pool of buffers such that the only way to free buffers back to the pool is for all references to that buffer being dropped.


Maybe I’m missing something but I would imagine query planning and caching is the hardest thing in terms of database memory usage. This post doesn’t seem to address that.

Even if the messages are fixed sized surely the memory cost of executing the queries depends on the complexity of the query, and the expected numbers of results returned.

Are they just doing get/set style queries only?


Thanks!

Joran from the TigerBeetle team here.

TigerBeetle's storage engine is designed also for range queries, and we have some interesting ideas for our query engine in the works.

To be sure, there are some tricky things that crop up, such as pipeline blockers for queries. However, we do have limits on literally everything, and that makes it easier: static allocation works best when it's applied virally, throughout the system.

We wanted to go into so much more detail on all this but there was only space for so much. Stay tuned.


Just today I got presented an article in my "newsfeed" claiming that coroutines in C++ are the hottest thing right now in algorithmic trading, which I assume also relies on databases.

It left me thinking, because of the unpredictability in timing that arises from their use. But then again, the article closes with the sentence:

> Until then, it is important to note that no feature is given support by C++ if it has not been fine-tuned and optimized and we cannot expect to see total coroutine implementation until at least C++26. Nonetheless, if you’re an algorithmic trader or C++ developer, watch this space and watch it very closely.

https://www.efinancialcareers.com/news/2022/10/algorithmic-t...


This implies that changing buffer sizes requires a process restart, which is a fine trade-off.

It also implies that all memory is allocated upfront. That would require the operating system to be smart enough to compress/page blocks, or else the process will hold on to unused memory.


Per the article: "(Currently it’s determined even before startup, at compile-time, but startup is the important point since that is the first time that memory can be allocated.)"

Sooo are we sure you can even change this via proc restart? This is saying the limits are built-in during compile...


They are currently compile time settings. That may change in the future to be more user-friendly. :) But what won't change is that the size is fixed at startup.


That totally makes sense. I think you'd have a miserable time supporting this at scale if a recompile were needed every time a new workload required a tweaked setting or two. Config-based static allocation is the best of both worlds.


What's so special about it? Static heap allocation (without malloc) was the norm with COBOL + ISAM; early SQL databases used eg ESQL/C with static or quasi-static allocation (by which I mean at most one malloc per field for the entire process runtime). Process-per-request was (and is) recommended practice if you value stability and security above all when using languages that don't have GC and/or have issues with memory fragmentation.


I love/hate the bad humor of HN.


Can somebody ELI5 what this really means? Specifically, how does code without dynamic allocation bring about predictability?


In addition to the mentioned known maximum memory, it also has useful performance characteristics. Global memory allocators, by their nature, can create allocation patterns and layouts that are very sensitive to things like startup conditions, long run-times, or exact inter-thread execution sequences. This can cause items that get allocated near each other in one run to be allocated very distantly, or chaotically, in the heap in another run.

Now, given that modern CPUs rely heavily on their cache lines/hierarchies, branch predictors, and prefetchers, changes in layout can have significant performance implications: algorithms that run quickly when things are contiguous in memory can run orders of magnitude slower on layouts that are more random and have more cache-line misses. And those layouts are often the result of long run-times in dynamically allocated systems, due to accumulated heap fragmentation.

Now, static allocation neither completely solves these problems, nor is it the only solution, but the techniques used for static allocation often have tremendous benefit for cache layout, and combined with other added benefits, like reduced OOM conditions, it's a powerful constraint.

Edit: Not to mention that the previously noted heap fragmentation can also cause eventual allocation failures, even without true memory exhaustion.
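A rough, hedged C illustration of the layout effect described above: summing the same values laid out contiguously in a static array versus chased through heap-allocated list nodes that have been deliberately linked in shuffled order to mimic a fragmented heap. Exact numbers depend entirely on the machine and allocator; the point is only that the contiguous walk is typically much faster once the data outgrows the caches.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 5000000

    struct node { long value; struct node *next; };

    static long pool[N]; /* the contiguous layout a static pool would give you */

    int main(void) {
        /* Allocate N nodes, then link them in a shuffled order to simulate the
           scattered layout a long-running, fragmented heap can end up with. */
        struct node **nodes = malloc(N * sizeof *nodes);
        if (nodes == NULL) return 1;
        for (long i = 0; i < N; i++) {
            nodes[i] = malloc(sizeof *nodes[i]);
            if (nodes[i] == NULL) return 1;
            nodes[i]->value = i;
        }
        srand(42);
        for (long i = N - 1; i > 0; i--) {   /* crude shuffle; fine for a demo */
            long j = rand() % (i + 1);
            struct node *tmp = nodes[i]; nodes[i] = nodes[j]; nodes[j] = tmp;
        }
        for (long i = 0; i + 1 < N; i++) nodes[i]->next = nodes[i + 1];
        nodes[N - 1]->next = NULL;

        for (long i = 0; i < N; i++) pool[i] = i;

        clock_t t0 = clock();
        long sum_pool = 0;
        for (long i = 0; i < N; i++) sum_pool += pool[i];

        clock_t t1 = clock();
        long sum_list = 0;
        for (struct node *n = nodes[0]; n != NULL; n = n->next) sum_list += n->value;
        clock_t t2 = clock();

        printf("contiguous walk: %.3fs, scattered walk: %.3fs (sums: %ld, %ld)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC, sum_pool, sum_list);
        return 0;
    }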


Dynamic allocation can fail. Memory can become fragmented or you could just exceed your budget.


It brings predictability because you know you won't run out of memory. If everything is statically allocated, the total amount of memory the application will use is known at compile time. Very important for memory-restricted systems.
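A tiny C sketch of that point, with made-up limits: when the buffers are statically allocated, the worst-case footprint is a compile-time constant.

    #include <stdio.h>

    #define CLIENTS_MAX  128
    #define REQUEST_SIZE 4096

    static unsigned char request_buffers[CLIENTS_MAX][REQUEST_SIZE];

    int main(void) {
        /* sizeof is evaluated at compile time: 128 * 4096 = 512 KiB, and that
           is all this program will ever use for request buffers. */
        printf("worst-case request memory: %zu bytes\n", sizeof request_buffers);
        return 0;
    }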


rant at fixed memory systems:

In many fixed-size systems, data that gets "re-used" as a result of a bug is still valid-looking data, and it will produce slightly different results that cannot be distinguished from the real ones until the whole computation is redone outside the system. If the data-reference mistake is stable, it can even pass majority/stability tests for a long time. Poisoning destroyed values can save a lot of time, simply because it produces nonsensical results that get noticed faster.

Also, coming from the games industry, I don't recognize this praise of fixed systems at all. Games prefer to fail fast. A restart can be easy on players if they can re-connect or don't lose any progress. The way to get players truly upset is to have player data mixed with bad values after a long play session, or to leave no easy way to recover progress/saved data.


what about arena allocators?


> Doing this was possible because we sketched out all resource usage (network, disk, memory, CPU) in the design phase with back-of-the-envelope calculations.

I wish all programs did this


Without looking through the implementation, I suspect that the consequence here is that it pre-allocates way more memory than it needs and then does some custom dynamic allocation with that.


They explain pretty clearly that this is static memory allocation. In other words, a calculated amount of (mostly) contiguous memory is allocated at startup and used very carefully, relying on disk to supply the dynamic nature of database load semantics.

https://www.geeksforgeeks.org/difference-between-static-allo...


Cool article and an interesting concept. It's definitely good to think out of the box - but I've rarely had to deal with a database OOMing (I assume because they have strategies to write to disk before that happens).

There are plenty of strategies nowadays to avoid garbage collection and use-after-frees (Rust comes to mind). I'm not convinced that working with a fixed amount of resources is the most friendly to operators.


How does it work exactly? Is it a memory pool that uses 85% of available memory? If I have one row in my DB, is it going to preallocate 30GB of memory? I don't think that's a sane design; it reminds me of the JVM args with min / max.


The OS doesn’t actually map/allocate the pages to you until you touch them, so if you preallocate 30 GB and only use one row, you’d only be using 4 KB (or whatever your page size is). This is the same as if you’d individually malloc’d the single row, since deep down malloc just calls mmap.


Joran from the TigerBeetle team here!

We have a secret plan for this too. ;)


The article differentiates between "memory" and "on disk" storage, which is where the rows in the database would live.


This is pretty much how all DBMSs work right now. You have to tell it how much memory it can use for its internal buffer pool.


One of the first claims of the article is plainly false. Static allocation does not avoid use-after-free. It just avoids accessing freed memory. You can still have bugs and access dead data in those static arrays: slots that are really dead objects but that your buggy code still references and accesses.

Most benefits listed have nothing to do with memory allocations. For example "having a client open too many connections" can be limited by... limiting the number of open connections. Where and how that check is done might differ, but it's not related to memory allocations.

There are BIG downfalls to static allocations:

   - You cannot support varied workloads. If one workload needs a lot of X and another needs a lot of Y, well, you already decided the maximum numbers of X and Y and cannot re-balance them.

   - If you run out of anything you thought was "sufficient", your program just fails.
And I know, because I've worked on programs that did static initialization at startup, and they failed in just that way. I also tested the performance benefit: it was around 12%. So you do get some performance out of it, but the added complexity (having things fail where a normal program would just chug along, having to deal with those failures, and the usual roll-your-own data structures or libraries) is not worth 12%. Because, remember, if you go out of your way to have static allocations, that means all 3rd-party libraries have to do it too, otherwise the benefit is wasted. Is that worth 12%? IMO, no.

I mean, sure, your program is easier to analyze statically, just like if you gave up dynamic threads and instead pinned different parts of your program to different CPU cores. Sure, that solves some concurrency problems, but most programs don't do this, for obvious reasons.


I'm not sure why this is being downvoted? Dividing up memory statically at startup/compile time has some nice benefits described in the article, but it also makes the database less flexible to changing workload requirements. For example, if a workload wants to do a big hash join that needs 90% of the memory at time X and then a big data load that needs most of the memory in a completely different subsystem at time Y, it's nice to be able to dynamically shift the memory around to match the workload. Doing static up-front allocations seems to block this from happening, and that is maybe a fine trade-off for TigerBeetleDB based on the types of workloads they're optimizing for, but it's definitely a valid disadvantage of their approach.


You actually are doing dynamic memory allocation; you're just not allocating from the operating system.

A quick look shows how this is done: https://github.com/tigerbeetledb/tigerbeetle/blob/91a105875c...

You can see things like alloc / free, memory ownership, and everything that is associated with it.

The fact that you allocate from a startup buffer vs. a call to `mmap`/`brk` doesn't really change things.


parse_addresses is only called once, during init.


Yes, we're planning also to add a kill switch to the allocator that we switch on if anything allocates after init().
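The concept, sketched in C rather than the actual Zig implementation (all names here are hypothetical): a bump allocator serves allocations during init and then gets sealed, so anything that tries to allocate after init() trips an assertion.

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define STARTUP_HEAP_SIZE (16 * 1024 * 1024)

    static unsigned char startup_heap[STARTUP_HEAP_SIZE];
    static size_t used = 0;
    static bool init_done = false;

    static void *startup_alloc(size_t size) {
        assert(!init_done && "allocation after init() is forbidden");
        size = (size + 15) & ~(size_t)15;            /* keep alignment sane  */
        assert(size <= STARTUP_HEAP_SIZE - used);
        void *p = &startup_heap[used];
        used += size;
        return p;
    }

    /* Called once at the end of init(); from here on, startup_alloc() aborts. */
    static void allocator_seal(void) { init_done = true; }

    int main(void) {
        void *buffers = startup_alloc(4096);         /* fine: still in init  */
        (void)buffers;
        allocator_seal();
        /* startup_alloc(1); */                      /* would now abort      */
        return 0;
    }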



