How we designed Dropbox’s ATF – an async task framework (dropbox.tech)
122 points by pimterry on Nov 11, 2020 | hide | past | favorite | 60 comments



"To avoid this situation, there is a termination logic in the Executor processes whereby an Executor process terminates itself as soon as three consecutive heartbeat calls fail. Each heartbeat timeout is large enough to eclipse three consecutive heartbeat failures. This ensures that the Store Consumer cannot pull such tasks before the termination logic ends them—the second method that helps achieve this guarantee."

Neither this nor the first method guarantees a lack of concurrent execution. A long GC pause or VM migration after the second check could allow the job to get rescheduled due to timeout. The first worker could resume thinking it still had one heartbeat left before giving up on the job, while the job had already been handed out to another worker in the meantime.
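For what it's worth, the classic mitigation for this class of race is a fencing token: the store attaches a strictly increasing epoch to each lease, and completion attempts carrying a stale epoch are rejected. A toy sketch (all names are mine, not from the article):

```python
import itertools

class StaleLeaseError(Exception):
    """Raised when an executor with an expired lease tries to finish a task."""

class TaskStore:
    """Toy store: each claim invalidates the previous lease via an epoch."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._current_epoch = None

    def claim(self):
        # Hand the task out (or re-hand it out after heartbeat timeouts)
        # with a fresh, strictly increasing epoch.
        self._current_epoch = next(self._counter)
        return self._current_epoch

    def complete(self, epoch):
        # Writes carrying a stale epoch are fenced off.
        if epoch != self._current_epoch:
            raise StaleLeaseError(f"epoch {epoch} is stale")
        self._current_epoch = None  # task retired

store = TaskStore()
paused = store.claim()   # executor 1 claims, then hits a long GC pause
fresh = store.claim()    # lease times out; executor 2 gets the task
store.complete(fresh)    # executor 2 retires it
# store.complete(paused) would now raise StaleLeaseError
```

This doesn't prevent duplicate side effects outside the store, but it does keep a zombie executor from clobbering the task's state.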


I bet they’ve thought through this with a system that operates at the scale they say it does.

But often with technical blogs like this, you get a “dumbed-down” version that is inaccurate but summarizes in a few minutes what is essentially many person-years of work.


I don't mean to take away from the article, but it makes me sad to see such awesome people building and writing about really cool bespoke solutions. It's obvious that Arun knows their stuff and is able to communicate it clearly.

The sad thing is that Dropbox Product has so heavily dropped the ball that users like myself (from back in 2009) have switched away in droves over the past few years.

I understand that Dropbox's core functionality wouldn't have been enough to multiply the valuation of the company to what investors expected. But it would have been nice not to jam collaboration features into the product and mess up the simple, platform-native UI with its current abomination. I'd pay $10/mo forever if I could get the 2010-esque Dropbox Mac client and sync service back, since it's way better than anything else (especially iCloud).


Dropbox customer here, agree wholeheartedly.

They've really gone downhill by adding unwanted bloat, and it just seems to be accelerating. Meanwhile their core product is degrading. Abandonment of the Public folder in spite of a huge outcry from customers was disappointing. The user experience is plastered with advertising to try their other products, even if you turn off all the relevant notification settings. And lately I've been running into subtle functionality bugs in the client.

Would happily give my money to a competitor focused on a lean, reliable product.


I can understand the need for a company to constantly try to add value to its product, but that tendency to change so much can easily cause you to lose sight of what made you popular in the first place.

I use Dropbox personally to keep documents synced between my computer and my wife's and also to grab documents I need from the web if I'm on another computer. I occasionally share a folder if I need to give a large number of files to someone.

I recently had a notification come up on the dropbox taskbar icon and it popped up this huge window that looked like a massive electron app. In the old days, there wasn't even a UI, just a context menu that also showed the state of the sync.

For me, Dropbox provides the most benefit when it's not visible, running invisibly in the background doing its thing.


I don't know, I think it's pretty great. Can't live without it.

The client's UI is a bit odd, but at the end of the day it's really good at what it's supposed to do: Syncing files.

Performance is also great: I'm using multiple machines to write code on, and I keep my local git repo on Dropbox. I can literally save a change on my notebook and run it on some other machine 3 seconds later.

On Mac and Linux you might want to check out maestral (https://github.com/SamSchott/maestral), a third-party client that works really well.


Thank you! The reason I ditched Dropbox was because of the bloated Mac client. Maestral looks like it might bring back the 2010-esque Dropbox client feeling (fast and efficient).

I mentioned above that I switched away; iCloud Drive is pretty mediocre since syncing sometimes doesn’t happen instantly and there’s no way to force-sync. I’ll probably move my Mac synced preferences back to Dropbox if this works well. Though I’m also worried about them deprecating their APIs since that seems to be a popular move these days.


Maestral is very interesting.

The Linux Dropbox client is essentially abandoned (it's broken both functionally and aesthetically). On this platform it definitely does not do what it is supposed to do, even acceptably, so a third-party alternative is welcome.

I wonder though, why Dropbox limits the syncing functionality at the API level (by not offering partial file sync).


Does Maestral count towards free Dropbox's 3-device limit or not?


Does Dropbox use proper filesystems? I considered using Amazon's S3 to host a repo but apparently it may not work properly since it's not a 'proper' file system


They’re entirely separate things. Dropbox syncs a local directory on your computer. S3 is a remotely accessible object store.

So yes, you can put a git repo in Dropbox if you want. But that’s still an altogether different thing than “hosting” a git repo like you might do with GitHub or Gitlab.


I believe they meant they'll fuse-mount an S3 store to a local directory and store a git repo in that. And presumably Dropbox will refuse to sync it because it requires the synced directory to be on one of a few specific filesystems.


A reason for the downvotes would be nice.


This - we are a business user of Dropbox, and on Windows the task tray is a mess, with the collab/editing/Paper features so annoying. Sync, I think, is still OK if you can ignore everything on the website.

I do wish you could PAY for a basic version (maybe make the collab stuff free as part of some trial or something).


I think the reality is that between Apple, Google, and Microsoft, Dropbox's core feature is being given away for free or packaged with devices people are buying anyway, and then further integrated into those companies' services. This is on top of a move away from files toward mobile and cloud computing, and on top of hitting a wall with consumers who want/need a Dropbox client at the prices Dropbox sets. They have to have the business users to sustain themselves, and those business users want collaboration tools as a differentiator to keep them from jumping ship.

Dropbox is in a bad spot, I think in part because they haven’t focused deeply enough on real collaboration tools or other related product spaces.


I will never ever trust my most important files to companies for whom storage is not even their fifth most important concern. Dropbox will be my primary storage solution for that reason for the foreseeable future, unless they fuck up real bad.


This way of thinking doesn't quite make sense to me. Since those companies are huge, they could contain within them an organization with more resources dedicated to storage than all of Dropbox, and that organization would be completely focused on that goal. And what if some company buys Dropbox?


https://www.pcloud.com - check em out. They have a lifetime subscription too.


Amen, too much unwanted bloat. It would've even been okay if they had opt-in/opt-out options, but NO, the inconsistencies are getting shoved down my throat like I owe them some money.


I'm curious - what did you switch to? (I'm looking)


Don’t laugh: iCloud Drive.

Honestly, it works. Never had a bit of trouble. I might not be a mega power user, but all my stuff is there and I just don’t think about it. I moved away from Dropbox about 2 years ago – because I got sick of the shitty invasive macOS client software – and I regret nothing.

Bonus: invoking Spotlight on an iPad and having your files just show up is kinda magic.


I'm the GP, and I also switched to iCloud Drive. But thanks to a comment above I've moved my _sync folder back to Dropbox and use Maestral as my Mac Dropbox client and it's been working great for the past 18 hours or so.

I switched _sync back because Alfred and other apps warned that using iCloud Drive for sync doesn't work great because it doesn't always update properly, and I've found that to be the case.


Interesting. Will definitely check out Maestral. Thanks.

I like these third-party clients that are showing up (e.g. Apollo for Reddit).


Got badly burned by iCloud Drive in the early days (and also Apple Photo's losing a ton of my photos). That was a long time ago though. Probably time to revisit it and give it a try.

I do like how iCloud and Onedrive have become so tightly integrated/transparent. (Benefits and banes).


Woah. Haven’t used Dropbox in more than a decade but they got rid of the syncing? Wow.. that was the most useful part to me. Interesting.


One of the things that has dramatically hampered my experience on HN lately is irrelevant or tangentially related comments. I want to see relevant, insightful comments, other people's experiences, better ways of doing things, or potential pitfalls of what's being recommended in the posted article. Instead, the comments usually end up being complaints about some product or feature. And it gets worse: after a while it becomes the same broken record of constant complaining.


If you have the opportunity, please do not build it like this. Referring to the architectural diagram, it is going to be much more efficient for the "Frontend" to persist the task data into a durable data store, like they show, but then the Frontend should simply directly call the "Store Consumer" with the task data in an RPC payload. There is no reason in the main execution path why the store consumers should ever need to read from the database, because almost all tasks can cut-through immediately and be retired. Reading from the database should only need to happen due to restarts and retries of tasks that fail to cut through.
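A sketch of that cut-through path, with a dict standing in for the durable store (all names are illustrative, not from the article or Dropbox's code):

```python
class Consumer:
    """Executes tasks; marks them retired in the shared store."""

    def __init__(self, store):
        self.store = store
        self.ran = []

    def execute(self, task_id, payload):
        self.ran.append(payload)
        self.store[task_id]["state"] = "retired"

class Frontend:
    """Persist the task, then cut it straight through to a consumer."""

    def __init__(self, store, consumer):
        self.store = store          # durable store (dict as a stand-in)
        self.consumer = consumer

    def create_task(self, task_id, payload):
        # 1. Make the task durable first, so nothing is lost on a crash.
        self.store[task_id] = {"payload": payload, "state": "pending"}
        # 2. Cut-through: ship the payload in the RPC itself, so the
        #    consumer never reads the DB on the happy path.
        self.consumer.execute(task_id, payload)

store = {}
consumer = Consumer(store)
frontend = Frontend(store, consumer)
frontend.create_task("t1", "resize photo")
```

Only the restart/retry path would ever scan the store; the hot path writes once and hands off.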

Disclaimer/claim: I worked on this system and on gmail delivery.


Maybe I'm missing something, but how would this meet the requirement of scheduling tasks in the future?


In other words, the frontend gets some ACK from the DB before calling the "Store Consumer"? (just trying to make sure I understand your critique of the design)


Well it doesn't necessarily need to happen in that order, I think. Frontend needs to ensure that the task is durably stored before it acknowledges the end of the operation to its caller.

Using email as an analogy, you have to commit the message to durable storage before you respond 250 to DATA.


If you do it that way, how do you make sure the task gets completed successfully and exactly once?


I wouldn't. Exactly-once is a fool's quest, and the scheme in this article does not offer it.

To achieve at-least-once, you need only track which tasks have been successfully retired, and persist that knowledge in the database by either deleting or mutating the task. During a cold start you scan the persistent store to find tasks that were still pending/live at the time your process began.
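That scheme, sketched with a dict standing in for the persistent store (names are mine):

```python
class AtLeastOnceQueue:
    """At-least-once: persist tasks, persist retirement, rescan on start."""

    def __init__(self, store):
        self.store = store  # persistent store (dict as a stand-in)

    def submit(self, task_id, payload):
        self.store[task_id] = payload   # durably record the pending task

    def retire(self, task_id):
        del self.store[task_id]         # or mutate to state="done" instead

    def cold_start_scan(self):
        # Anything still present was pending (or mid-flight) when the
        # process died. Rerunning it is exactly what makes this
        # at-least-once rather than exactly-once, so handlers must be
        # idempotent.
        return sorted(self.store.items())
```

A task that finishes its work but crashes before retire() simply runs again on the next scan, which is the duplicate-execution window exactly-once schemes pretend away.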


Why is it that this task scheduling problem appears so often? Why hasn't this problem been "solved" in the same way sorting strings has?

I understand companies have different requirements, but if you look at the history even on Hacker News this problem is basically being resolved by different companies at least once a quarter.


Because it is a hard problem to solve holistically.

It looks simple on the surface, so almost every company ends up creating an implementation similar to the one described in the article. Then it learns that it is much harder than it looks, but by then it is usually too late. So they end up maintaining it for a long time with the original team long gone.

BTW I believe that temporal.io (I'm tech lead of the project) is so far the best open source solution to this problem.


If you work in Go and need to work on similar problem I highly recommend: https://cadenceworkflow.io/


It is not Go specific. It already supports Go, Java, and Ruby.

temporal.io, our fork of Cadence, will have PHP support very soon. Support for other languages is coming; Python and TypeScript are the highest priority.


Clickable link: http://temporal.io/

I worked on this site btw - would be happy to receive feedback on the site, particularly if any wording was confusing or unclear!


I clicked the link and I struggle to understand what Cadence is. I found my way to the docs, and came across reams of text bemoaning how difficult it is to describe Cadence... not helpful?

I get it's something related to workflows, beyond that I have no clue what problem it aims to solve or where or why I'd use it.


Did you click on temporal.io?

It is hard to describe as it is a new way to build distributed applications that doesn't have a commonly agreed name yet.


> It is hard to describe as it is a new way to build distributed applications that doesn't have a commonly agreed name yet.

Seems to be a fairly straightforward, minimal (which can be good; that's not a criticism) workflow engine.


New? Really? This has been around since 1995.


Great, show me a single product (besides AWS SWF and the Azure Durable Task Framework) that has the programming model of Temporal. For example, which product allows you to write production code like this that survives any process failure:

  public void execute(String customerId) {
    activities.sendWelcomeEmail(customerId);
    try {
      boolean trialPeriod = true;
      while (true) {
        // Durable timer: survives process restarts and redeploys.
        Workflow.sleep(Duration.ofDays(30));
        activities.chargeMonthlyFee(customerId);
        if (trialPeriod) {
          activities.sendEndOfTrialEmail(customerId);
          trialPeriod = false;
        } else {
          activities.sendMonthlyChargeEmail(customerId);
        }
      }
    } catch (CancellationException e) {
      activities.processSubscriptionCancellation(customerId);
      activities.sendSorryToSeeYouGoEmail(customerId);
    }
  }


The use of the term “callback” makes it seem like the code that the task runs is in the same codebase/process as the one that scheduled the task, but these are distributed processes across many machines - so how does the “CreateTask” RPC caller specify what code to run in the task?

edit: seems like it is up to the callback owners/authors to deploy and run their own task workers, which presumably implement the individual task logic:

> The design is very intentional in driving an ownership model where lambda owners own all aspects of their lambdas’ operations. To promote this, all lambda worker clusters are owned by the lambda owners. They have full control over operations on these clusters, including code deployments and capacity management. Each executor process is bound to one lambda


Our product has an asynchronous scheduler that runs various types of “Jobs”. (This is all for a PHP-FPM based server)

We have “RemoteJob”s that run entirely on a separate server. The name of the job and a message (including information about how to run the job) are passed to the remote queue, which will schedule the job based on priority and availability. Ideal for sending emails, making requests to external services, processing webhooks, updating records in the search engine, bulk deletion, etc.

We have “LocalJob”s that run in the same process but are deferred until after the response to the client has finished sending. These can be used for simple things like preparing a payload for a remote job, small bulk deletion operations, sending in-app notifications to a large pool of users, etc.

Additionally we have a “CallbackJob” that may schedule out a bit of work and call back an API endpoint in the app when there is load available, either to return the value of the job or to trigger additional processing. Our current use has essentially been to call back into the app when load is available to recalculate some expensive cached values.


> There were no open-source projects or buy-not-build solutions that worked well for our use case and scale

B2B/SaaS providers, take note: if a company gets really big, it may be less able to afford your enterprise offering. This is because the part of the business that is paying for your thing still has a tiny budget, yet the number of users it has to support is 10x larger than that budget allows for. So to catch bigger fish, you should actually dial down your cost for bigger orgs. It won't seem fair to the smaller ones, though, so this should probably remain confidential.


Why not use something like Cadence/Temporal/Amazon Workflow Service?


Flyte is also in the same space.

https://github.com/lyft/flyte


Yeah lots of different options here, mostly surprised they didn't talk about why they didn't choose any of the alternatives.


Dropbox could focus on selling privacy: forget about dedup and offer end-to-end encryption. That could be a killer feature.

Companies such as google, Microsoft and aws rely on user’s data. It’s difficult for them to offer privacy.

But with C. Rice on board, I have doubts about its direction.


I wish folks building Task frameworks would provide a standard mechanism for tasks to signal their progress. I realize it might not be relevant here but I've noticed this gap in more general frameworks as well.
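Something as simple as this shape would go a long way (a generic sketch, nothing framework-specific; all names are made up): the framework hands each task a report() hook and owns where the latest progress goes.

```python
from dataclasses import dataclass, field

@dataclass
class TaskRunner:
    """Run a task body while exposing a standard progress channel."""
    progress: dict = field(default_factory=dict)

    def report(self, task_id, done, total, detail=""):
        # The framework owns this; a real one might persist it or push
        # it to a UI instead of keeping it in memory.
        self.progress[task_id] = {"done": done, "total": total, "detail": detail}

    def run(self, task_id, body):
        # Tasks receive a report() hook instead of inventing their own.
        body(lambda done, total, detail="": self.report(task_id, done, total, detail))

runner = TaskRunner()

def import_users(report):
    users = ["ann", "bob", "cai"]
    for i, user in enumerate(users, 1):
        report(i, len(users), f"imported {user}")

runner.run("import-1", import_users)
```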


temporal.io supports task heartbeating. Application-specific data can be attached to each heartbeat. The data is accessible from outside, to display progress in a UI for example. The data from the last heartbeat is also available to a task when it is retried.


Getting this when I click the link:

> 4xx Error (4xx). We can't find the page you're looking for. Check out our Help center and forums for help, or head back to home.


>We were disappointed to find little material published by engineers who built supersized async services. Now that ATF is deployed

Really? Such an important step, to understand even a little bit about what real companies are using, and you stopped at Google.com? It's not hard to network and interview with former employees to understand some of the bigger-picture stuff in use at large companies.


Why didn't they just use Celery?


What does this solve that something like Rabbitmq doesn't? Am I missing some key points?


Everything listed under "Features", "System guarantees" and "Lambda requirements"?

Dropbox uses Python, so the real question is what Celery didn't solve for them. My guess would be scalability.


It seems that Nextdoor also had issues with celery[0].

"Scalability" is a great scapegoat for making dubious decisions, but my guess here would be the "task priority" requirement.

[0] https://engblog.nextdoor.com/nextdoor-taskworker-simple-effi...


Hm, this post is from 2014 so many of the arguments presented don't stand anymore.

Today you can containerize and autoscale Celery workers, so having task specific queues and worker groups will solve your resource utilization problems. For even better utilization, you can have gevent and process based workers depending if the task is resource or I/O heavy.
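For concreteness, the per-task queues and per-pool workers mentioned above look roughly like this in Celery (task and queue names are made up):

```python
# Route CPU-heavy and I/O-heavy tasks to dedicated queues so their
# worker groups can be autoscaled independently. In a real app this
# dict is assigned to app.conf.task_routes.
task_routes = {
    "tasks.render_thumbnail": {"queue": "cpu"},  # hypothetical task names
    "tasks.fetch_webhook":    {"queue": "io"},
}

# Each queue then gets a worker pool suited to its workload, e.g.:
#   celery -A proj worker -Q cpu -P prefork --concurrency=8
#   celery -A proj worker -Q io  -P gevent  --concurrency=500
```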

In years of use I have never seen Celery workers hang; we process millions of tasks (hundreds of task types) every day. Since the post is very old, I'm sure they had to deal with a few nasty bugs back then.

However, Celery does have real scalability issues: If you run over 1000 worker nodes Celery gossip alone will be well over 1000 messages per second. Having 1000s of gossip queues isn't ideal either given that RabbitMQ is not easily scalable. You might end up with multiple clusters which can be a bit of a pain in a single codebase. I'm not sure if running Celery on SQS solves some of these issues.


One more reason could be prioritising which tasks to pick. In Celery you can create multiple task queues and prioritise by allocating fewer resources to consume from a lower-priority queue (say) and more resources to consume from a high-priority queue. But this configuration is static. If you want to dynamically change how you consume tasks from each queue, say by having scheduling code pick tasks from different queues and hand them over to your application code, that would be difficult. I don't think introducing a scheduling layer between Celery and the application code is possible unless you modify the Celery code.

Also, I am not sure how scalable RabbitMQ is; Redis as a broker scales quite well, but you lose durability when you use Redis.
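The dynamic scheduling layer I mean could be a weighted picker sitting between the broker queues and the application, something like this (weights and names are illustrative):

```python
import random
from collections import deque

class WeightedScheduler:
    """Pick the next task across priority queues with adjustable weights."""

    def __init__(self, weights):
        self.weights = dict(weights)              # queue name -> relative weight
        self.queues = {name: deque() for name in weights}

    def push(self, queue, task):
        self.queues[queue].append(task)

    def pop(self):
        # Only consider queues that currently have work.
        ready = [q for q in self.queues if self.queues[q]]
        if not ready:
            return None
        q = random.choices(ready, weights=[self.weights[n] for n in ready])[0]
        return self.queues[q].popleft()           # FIFO within a queue

    def reprioritize(self, queue, weight):
        # Change consumption ratios at runtime, no broker reconfig needed.
        self.weights[queue] = weight
```

The point is that the ratio is data, not static worker configuration, so the application can tune it on the fly.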


RabbitMQ does not solve this problem. RabbitMQ offers a solution for message passing, that's it; it does not offer a framework to execute tasks, etc.



