I was never a fan of the typical SAN topology, ever since I read Joyent's responses to one of the big early EBS outages in 2011 or 2012. Plus of course, as the article points out, local storage is faster. But Joyent never actually pulled off anything like what Fly has done for migrating volumes between hosts. Congrats on solving the migration problem while maintaining what's good about local storage.
I wonder what the I/O performance will look like during migration. Gut feeling is that going through dm-clone/iscsi/wireguard would be a lot slower than direct local NVMe.
It obviously is slower, but note that you take the hit only the first time any block is read, and not for anything written after migration has started. Our migrations are overwhelmingly intra-region as well, so really what you're doing is approaching the performance envelope of the "standard" SAN topology used for cloud block devices anyways.
(I don't want to pretend to know what EBS does to make this fast!)
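If it helps to picture the "pay once per block" behaviour, here's a toy sketch of the read/write path during hydration. This is not our actual code (the real mechanism is dm-clone in the kernel); `MigratingVolume`, `read_remote`, and the block size are all invented for illustration.

```python
# Toy model of a "hydrate on first read" volume during migration.
# Assumes a preallocated local file; read_remote() stands in for the
# network fetch from the source host.

BLOCK_SIZE = 4096

class MigratingVolume:
    def __init__(self, local_path, read_remote, num_blocks):
        self.local = open(local_path, "r+b")
        self.read_remote = read_remote          # pulls one block from the source host
        self.hydrated = bytearray(num_blocks)   # 1 = block already copied locally

    def read_block(self, idx):
        if not self.hydrated[idx]:
            # First read of this block: pay the network cost once,
            # then persist it so later reads are local-NVMe fast.
            self._write_local(idx, self.read_remote(idx))
        self.local.seek(idx * BLOCK_SIZE)
        return self.local.read(BLOCK_SIZE)

    def write_block(self, idx, data):
        # Writes after migration starts always land locally; the stale
        # copy on the source host is simply never consulted again.
        self._write_local(idx, data)

    def _write_local(self, idx, data):
        self.local.seek(idx * BLOCK_SIZE)
        self.local.write(data)
        self.local.flush()
        self.hydrated[idx] = 1
```

A background loop walking the `hydrated` map and pulling any remaining cold blocks is what eventually lets the source copy be released.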
I designed a similar system 10 years ago at Bytemark which worked for a few thousand VMs, ran for about 12 years. It was called BigV [1]. It might still be running (any customers here still?). I think the new owners tried to shut it down but customers kept protesting when offered a less-featureful platform :-)
The two architectural differences from fly:
* the VM clusters were split into "head" and "tail" machines & linked on a dedicated 10Gbps LAN. So each customer VM needed its corresponding head & tail machine to be alive in order to run, but qemu could do all that natively;
* we built our own network storage layer based on NBD called flexnbd [2]. It served local discs to the heads, managed access control and so on. It could also be put into a "mirror" mode where a VM's disc would start writing its blocks out to another server while continuing to serve, keeping track of "dirty" blocks etc. exactly as described here.
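Roughly, that mirror mode boiled down to the following shape. This is a much-simplified sketch, not the real flexnbd C; `send_to_destination`, the convergence threshold, and the brief pause at the end are stand-ins for illustration.

```python
# Simplified sketch of a flexnbd-style "mirror" migration: keep serving
# the disc, push blocks to the new server in the background, and re-send
# anything dirtied by writes in the meantime.

class MirroringDisc:
    def __init__(self, blocks, send_to_destination):
        self.blocks = blocks                    # in-memory stand-in for the disc
        self.send = send_to_destination         # ships (idx, data) to the new server
        self.dirty = set(range(len(blocks)))    # everything starts out "unsent"

    def write(self, idx, data):
        self.blocks[idx] = data
        self.dirty.add(idx)                     # must be re-sent before cutover

    def read(self, idx):
        return self.blocks[idx]

    def mirror_pass(self):
        # One sweep: copy whatever is currently dirty. Writes that land
        # during the sweep just re-dirty their blocks for the next pass.
        for idx in sorted(self.dirty):
            self.dirty.discard(idx)
            self.send(idx, self.blocks[idx])

    def migrate(self, pause_guest, resume_guest):
        while len(self.dirty) > 1024:           # arbitrary convergence threshold
            self.mirror_pass()
        pause_guest()                           # brief stall to drain the tail end
        self.mirror_pass()
        resume_guest()
```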
It was very handy to be able to sell and directly attach discs of different performance characteristics without having to migrate machines. But I suspect the network (even at 10Gbps) was too much of a bottleneck.
I can't remember whether Linux supported the kind of fancy disc migration we wanted to do back in 2011. If it did, it was hard enough that spending a year getting our own server right seemed worth it.
It is a particularly sweet trick to have a suspicion about a server and just say "flush it!" and in 12-24 hours, it's no longer in service. We had tools that most of our support team could use to execute on a slight suspicion. You do notice a performance dip while migrations are going on, but the decision to use network storage (which reduced performance overall, lol) might have masked that.
Having our discs served from userspace reduced the administration that we needed to do. But it came with the terror of maintaining a piece of C that shuffled our customers' data around. Also - because I was a masochist - customers' discs were files stored on btrfs and we became reluctant experts. Overall the system was reliable but it took a good 12-18 months of customers tolerating fscks (& us being careful not to anger the filesystem).
I did miss this kind of work in 2022 and interviewed for a support role at fly. I'm not sure how to take being rejected at the screener stage, I'm sure some of my former staff might be able to explain it :)
I still remember your paper about BigV! Very interesting solution and architecture. I was very close to becoming a customer essentially because of this. You were ahead of your time :)
I suspect you might want to have a quick look at the numbers in your first sentence and have a rethink ;-)
> I can't remember whether Linux supported the kind of fancy disc migration we wanted to do back in 2011.
I don't think it did. There was one other important difference (reading between the lines) between fly and us, and that's that we weren't using the kernel's NBD implementation, at all. At the time it had a hard limit of (from memory) 10 nbd mounts, so it was out of the running for the number of VMs we wanted on each head. We had to do it in user space, which meant we never had the problem of kernel threads locking up when the network had a moment. It was flexnbd talking directly to qemu, which presented a virtio block device to the guest.
That meant not using LVM2 or any of the other niceties directly in bigv itself. We did use it for administering the systems that ran bigv, and we used it in other places, but from memory the head and tail code itself had no knowledge of volumes or partitions. From the tail's point of view the customer disks were just files on a filesystem, and as far as I'm concerned that was a massive win - it meant that grabbing someone's VM to debug it needed no tools other than being able to recover the file, which isn't the case on some other storage systems.
It also meant that I ended up doing some of the most fun code I've ever written, getting the multithreaded code in flexnbd right. Although I do remember commenting fairly early on that threads in C might, perhaps, not be the finest choice of concurrency primitive for that particular job :-)
> It is a particularly sweet trick to have a suspicion about a server and just say "flush it!" and in 12-24 hours, it's no longer in service.
Being able to do that with both the heads and the tails was nice, and it's fun seeing the same ideas we built crop up in other places. I'm not sure exactly how far ahead of the curve we were but it's definitely taken a while for some of these ideas to spread - largely, I suspect, because kubernetes came along and stomped all over the conceptual space, and it's just got different ideas about how a lot of this stuff works. I do know that no other nbd server at the time could do what we were doing correctly, because if it could have, we'd have used it.
That experience left me with a few lasting impressions:
1. Hardware RAID is a scam. Rebuilds take so long that more disks fail mid-rebuild, the firmware is buggy, and it's expensive. You're better off using the hardware as a convenient way to ram more disks into the chassis than using its RAID features.
2. You're lucky if you get 12 hours' notice. For storage volumes that size, where you're already on network storage, you want a snapshot to already be available off the machine when you make the decision to evacuate. And that means some synchronisation algorithm running all the time. There are systems around which do that now - Longhorn springs to mind - and that would have been the next thing to build if I'd convinced anyone it was worthwhile. I understand fly saying "Raft is too complex" but then the flip side is that you have to move all the data at the worst possible moment. Maybe the network speed to disk size ratios they're dealing with make it make sense?
3. Small teams. Small teams all the way. Fly are doing this with - depending on how you count - between two and four times as many hands on keyboards as we had (for an international product, where ours was in two DC's, so there's that). There's a break-even point for how much faster you go with more developers, and it's usually lower than ten. People often don't understand how fast a team that small can go because they've never seen it first-hand, but having since seen the other extreme... yeah. It might not have felt like it at the time, but we got a lot right in terms of enabling ourselves to move fast.
> I did miss this kind of work in 2022 and interviewed for a support role at fly. I'm not sure how to take being rejected at the screener stage, I'm sure some of my former staff might be able to explain it :)
No comment >:-) Having just spent a spectacularly unproductive afternoon trying to wrangle some particularly recalcitrant AWS terraform into a cooperative shape, I can say that this kind of work still needs doing, though...
We don't use ZFS (in theory, we give Fly Machines a raw block device), but I'd be interested in the sketch of how you'd do something like this in an end-to-end ZFS system, just so I can see how well I communicated the properties of our system in this post.
I was thinking of zfs as the base filesystem with zvols for the Fly Machine, so that would be your raw block device. I'm not sure how send/recv handles the trimming problem you outlined though.
zvols on linux have known performance problems, see https://github.com/openzfs/zfs/issues/7631 -- most VM disk images on ZoL end up being regular files, for this reason.
zfs send+receive of such files works just fine, but the bigger problem is that zfs replication doesn't know about the "pull on demand, migrate in the background" design at all, so you're probably better off not using it for migration.
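For what it's worth, shipping a file-backed image that way is just ordinary snapshot replication, something like the sketch below. The dataset name and destination host are made up, and note that this copies everything before cutover rather than pulling blocks on demand.

```python
import subprocess

# Hypothetical names: dataset "tank/vm-images", target host "dest".
# Plain snapshot shipping, run on the source host.

def replicate(dataset, snap, prev_snap, dest_host):
    subprocess.run(["zfs", "snapshot", f"{dataset}@{snap}"], check=True)
    send = ["zfs", "send"]
    if prev_snap:
        send += ["-i", f"{dataset}@{prev_snap}"]   # incremental: only changed blocks
    send.append(f"{dataset}@{snap}")
    recv = ["ssh", dest_host, "zfs", "receive", "-F", dataset]
    p1 = subprocess.Popen(send, stdout=subprocess.PIPE)
    p2 = subprocess.Popen(recv, stdin=p1.stdout)
    p1.stdout.close()
    if p2.wait() != 0 or p1.wait() != 0:
        raise RuntimeError("zfs send/receive failed")

# e.g. replicate("tank/vm-images", "migrate-1", None, "dest")
```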
Speaking as someone who has built experiments on top of zfs zvols (https://xeiaso.net/blog/waifud-progress-report-2/), I'm not entirely sure what the main use case for them is. They seem to really hate journaled filesystems (nearly all of the filesystems currently in use). It's also really easy to run into cases where LVM inside a zvol for a VM makes the host kernel eagerly pick it up and pin it as active on reboot, preventing you from deleting the zvols in question (this took me a week to properly diagnose).
They probably made sense back when filesystems weren't journaled, but we don't live in that world anymore.
It'd be nice if Fly offered a highly available disk. I know you can move HA into the database layer but that is a lot of complexity for their target audience. If you can build all this machinery, you can also probably manage running DRBD.
We did NBD in preference to everything else because NBD is so simple, and is the kernel's de facto answer to "abstract a block device away from the hardware". Shaun ran into reliability issues with it (which, for all I know, were just bugs fixed in more recent kernels) and iSCSI was within arm's reach, so it wound up being iSCSI.
We could do a bunch of testing and find the performance gains of NVMe-TCP worth the switch-out, and that project would be tenable (though: big enough we wouldn't do it unless the win was really big).
A serious NVMe-OF deployment would be tantamount to us building an EBS-style SAN scheme, which almost certainly would require us to single out specific regions to get the "better"/"newer" disk storage in, which is something we haven't had to do yet. I think in the next year or two we're more interested in seeing how LSVD, and especially Tigris LSVD, plays out.
Am I the only one seeing a lot of advantages with local storage? I don't think it's idiosyncratic at all – that's how DigitalOcean became the company it is today, with simple local storage VMs.
The performance of local NVMe is way better and more predictable, and you don't have to factor in latency, bandwidth, network issues, bottlenecks and more. Redundancy can be achieved with a multi-host setup, so even if the host fails, the underlying application or database is not impacted.
The one disadvantage I can see is that you can't "scale"/change the underlying disks. The disks you bought when setting up the server are probably there to stay.
No, you are correct. The database performance difference between using local NVMe and network-attached storage is an integer factor on what is otherwise the same hardware. It would be surprising if this were not the case, since the storage bandwidth of local NVMe is typically an integer factor higher and the performance of modern database engines is strongly correlated with available storage bandwidth for many workloads.
Network-attached storage does some things for you that local NVMe cannot, like easily moving storage between hosts, but it is also relatively expensive. Nonetheless, it can make sense if performance and minimizing hardware footprint are not major concerns. If your main business is data-intensive, on the other hand, the economics of network-attached storage start to make a lot less sense.
Most good servers use removable SSDs or HDDs, which can be replaced or moved from one server to another during the normal operation of the servers, without shutting them down. Even the consumer SATA connectors are designed for hot-plugging, which is also true for the enterprise NVMe or SAS connectors. Only the M.2 connectors are an exception among the modern SSD connectors, because they do not support hot-plugging, which is why they are not used in serious servers.
The only advantage of shared storage concentrated in one special place is that, when an operation like replacement or moving requires a human on site, that human may only need to walk to the one rack with the shared storage instead of visiting every server.
I know. It's called hot swapping. But what I meant is the size. Having 6x 3.84 TB NVMe in an existing RAID array and wanting to upgrade to 6x 7.68 TB is possible but not as straightforward.
It would be less straightforward only if you have a single server and that server has only a few SSD slots, which are all occupied.
Otherwise, you can just move the old SSDs to another server, plug in the new SSDs, copy the old content onto the new SSDs and that is it. No reboot needed. Normally any server should have a separate small boot SSD holding the operating system, typically an M.2 SSD on the motherboard or sometimes a small USB memory plugged internally into the motherboard, which is not affected in any way by mounting or unmounting the other SSDs or HDDs that contain user data.
When the second server does not have free SSD slots, it must be taken offline temporarily and its own SSDs removed until the old SSDs from the first server have been copied. In the worst case you would do three rounds of unplugging and plugging per SSD in the RAID array (six SSDs in your example): replace the SSDs in the first server with the new ones, replace the SSDs in the second server with the old ones, then restore the second server's own SSDs. Even so, I would consider such a procedure straightforward, as each of those operations takes a few minutes at most, much less than the time needed to copy the old content onto the new SSDs.
When you have a single server, then the alternative of using shared storage would also not exist. Except for home servers, the use of single servers should be rare, because even when a single server has enough capacity there should be a backup for it.
I fully agree with you, for what it's worth. I really like what Fly.io is doing here. Most of the advantages of being closer to the metal, combined with most of the advantages of being in the cloud.
I can't speak to the technical aspects of this product, but makes me think of the movie Ghost in the Machine, and all the horror aspects of AI processes moving between hardware around the world :)
> When your problem domain is hard, anything you build whose design you can’t fit completely in your head is going to be a fiasco. Shorter form: “if you see Raft consensus in a design, we’ve done something wrong”.
This bothers me a bit. I get what they are saying, but it feels like this assumes they are implementing Raft too. Packages like Dragonboat make it so you don't have to think about Raft, only whether you are the leader or not.
This mentality is a rake we have stepped on repeatedly. Dragonboat in particular! It's wonderful, until you have to debug it. But even the most battle-tested and resilient Raft implementations create an immense infra/ops burden; a soundly built Raft implementation will eventually converge, but there's no guarantee it will do so within the tolerances of your SLOs without a lot of monitoring and continuous tuning.
Yeah, we need to tweak them; they were designed for a wider browser window than a lot of people use. If they're not rendered as actual sidenotes, I think we should make them footnotes or popups.
The sidenotes also come first when reading with a screen reader, and there's no indication that they're sidenotes.
On the other hand, nice job with the alt text on the packet diagram. Maybe it sucks that the diagram itself can't be accessible, but I think you did the right thing in this case.
This has been a peeve of mine for a while; we're going to "demote" them to pop-up footnotes when there isn't enough screen width to make them sidenotes. (In either case, you're better off than when they're in the main flow of the text.)
I really love that fly are calling themselves a cloud provider now!!!!!!!!
I've advised a few startups over the years who were trying to take a stab at "developer-focused cloud" and for whatever reason they felt too shy to say that, and frankly, I think it's the reason they're no longer around. Fly are bold and I really enjoy how they show the infra side of the engineering.
Handling stateful application migrations with asynchronous data hydration and block-level cloning: A+++. I've been thinking a lot recently about how, if I were ever to build a cloud provider again, I'd focus first on an "intelligent" (read: "AI"-driven) orchestration system. That would be good generally, but especially around things like global data compliance and sovereignty I can imagine some pretty sweet features.
Literally 3/4 of the blog post is about how they migrate a VM.
With qemu they wouldn't need the iscsi workaround, as nbd (both server and client) is rock-solid in qemu. Nor the discard dance, since qemu at the source knows which blocks are in use (via bitmap or zero-detection).
Also qemu can copy the data up-front so performance stays consistent.
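As a rough illustration, the up-front copy is usually driven by qemu's drive-mirror job over QMP, along the lines of the minimal sketch below. The monitor socket path, device name, and target path are placeholders; real tooling would use a proper QMP library and watch the block-job events for completion.

```python
import json, socket

# Minimal QMP sketch: start a drive-mirror of a running guest's disc.
# Placeholder paths/names throughout.

s = socket.socket(socket.AF_UNIX)
s.connect("/run/qemu/vm-123.qmp")            # placeholder monitor socket
f = s.makefile("rw")

def qmp(cmd, **args):
    f.write(json.dumps({"execute": cmd, "arguments": args}) + "\n")
    f.flush()
    while True:
        resp = json.loads(f.readline())       # skip async events
        if "return" in resp or "error" in resp:
            return resp

f.readline()                                  # consume the QMP greeting
qmp("qmp_capabilities")
qmp("drive-mirror", device="drive0", target="/mnt/newhost/disc.raw",
    sync="full", format="raw")                # background copy of the source disc
```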