Netflix Conductor: Open-source workflow orchestration engine (netflix.github.io)
253 points by swyx on Aug 19, 2020 | 79 comments



I set up Conductor where I work while evaluating workflow engines, and overall wasn't too happy with it. The default datastore is this Netflix-specific thing (Dynomite) that's built on top of Redis. It's not particularly easy to integrate operationally into non-Netflix infrastructure, and Conductor itself has hard dependencies on several services.

The programming model for workflows/tasks felt a little cumbersome, and after digging into the Java SDK/Client, I wasn't impressed with the code quality.

We did have some contacts at Netflix to help us with it, but some aspects (like Dynomite itself, and its sidecar, dynomite-manager) felt abandoned, with unresponsive maintainers.

We've started using Temporal[0] (née Cadence) recently, and while it's not quite production-ready, it's been great to work with, and, just as critically, very easy to deal with operationally in our infrastructure. The Temporal folks are mostly former Uber developers who worked on Cadence, and since they're building a business around Temporal, they've been much more focused and responsive.

[0] https://temporal.io/


The Temporal founders worked on AWS SWF too before building Cadence at Uber. They have a lot of experience in this area and are no doubt making the product better with each iteration. I enjoyed using Cadence at Uber and am definitely sad about not having it at my current company.

One of the founders, mfateev, is around elsewhere on this thread answering questions.


> I set up Conductor where I work while evaluating workflow engines, and overall wasn't too happy with it. The default datastore is this Netflix-specific thing (Dynomite) that's built on top of Redis. It's not particularly easy to integrate operationally into non-Netflix infrastructure, and Conductor itself has hard dependencies on several services.

I find the RDBMS-based backends (Postgres or MySQL, depending on your preference of DB) much easier to get going with. Of course, by default you don't get the same HA as Dynomite – but that might not be a big issue in a particular use case, and RDBMS clustering/failover solutions can address that. If you are already in an environment with lots of RDBMSes, they can be a sensible choice.

There are other options to consider too – Conductor supports direct use of Redis (including with Redis Cluster and Redis Sentinel) instead of going via Dynomite, and it also supports Cassandra. Someone should really create a pro/con comparison of all the different storage options; that was the biggest thing I was missing when evaluating it.


Workflows and orchestration are my jam -- that's what we're trying to simplify over at https://refinery.io

Conductor is a cool piece of tech, and it's a well-established player in a rapidly growing space for workflow engines.

I used to work at Uber and that company had microservice-hell for a while. They built the project Cadence[0] to alleviate that. It is similar to Conductor in many ways.

One project to watch is Argo[1], which is a CNCF-backed project.

There are also some attempts[2] to standardize the workflow spec.

Serverless adds a whole new can of worms to what orchestration engines have to manage, and I'm very curious to see how things evolve in the future. Kubernetes adds a whole dimension of complexity to the problem space, as well.

If anybody is interested in chatting about microservice hell or complex state machines for business logic, I'd be excited to chat. I'm always looking for more real world problems to help solve (as an early stage startup founder) and more exposure to what others are struggling with is helpful!

0: https://github.com/uber/cadence

1: https://argoproj.github.io/argo/

2: https://serverlessworkflow.github.io/


Hey, wait a second. Are these just a modern incarnation of the enterprise service bus? Is there a significant difference?


By modern you mean using "Docker", having an ".io domain", and using Node.js/Go/Rust? That seems to be the pattern these days. Take existing items, rewrite/repackage and io-ify them, maybe make it a SaaS product while at it.


ESBs historically required a domain-aware object model (when I saw them, messages were routed based upon rich types, and terms like CORBA and RMI were still in vogue). This is more obvious when you look down the feature page for Apache Camel and they mention support for different data types. Well, a workflow or orchestration engine doesn't need _any_ idea of your business-layer data model to make decisions about where things need to be broadcast, but it's a convenience similar to listening to specific topics and segments in Kafka after querying a data broker service. Decoupling routing from the representation of objects means that the bus is now dumb, but the scheduler / router is now interchangeable. Which is how we can see so many possible competitors to Conductor and EBPM and all that now, as shown in these comments.


Yes. The only difference from what I can tell is that each service/node on the bus is a VM/Container rather than a bespoke built machine.


dragonwriter below you says the opposite - based on a plane-level analysis. would be interested if that changes your mind.


The real answer is that it depends on what responsibilities you think an ESB or workflow orchestration should have. dragonwriter is of the reasonable opinion that ESBs' focus should be the messaging plane. If you're of my opinion, ESBs generally include some of the features found in workflow engines. devonkim is also right in that ESBs in general let the business data/process bleed into the rest of the stack, whereas with workflow, it's completely agnostic to the business data/process.

The most accurate answer would come from comparing various ESB products to workflow orchestration products, as every vendor/product will have slightly different opinions on where their responsibilities lie.


fair enough. appreciate the thoughts!


> Are these just a modern incarnation of the enterprise service bus?

No, ESB is the messaging plane, workflow orchestration is a higher level service, on top of the messaging plane.


Not sure if this relates but I was wondering if you heard of ASCI ActiveBatch? If so what are your thoughts on it?

I'd like to have some easy-to-set-up orchestration/job scheduling engine in my team so we can help clean up the tangled mess that we are in, but something like Ansible seems like too much work to set up and add more jobs to over time. I tried ActiveBatch and I wish there were some free or cheap alternative to it.


I hadn't heard of it before. Thanks for sharing!

What are the types of problems you're hoping to untangle, more specifically? I assume you want something like... A workflow where each step maps to some provisioning process?

If that guess is reasonably true, then I immediately think of Terraform. You can specify dependencies and write hooks that trigger callbacks when certain steps are called. We use Terraform a bunch under-the-hood for Refinery, and it's great. I haven't used Ansible (only read about it) so I can't contrast them very well.

If you want to chat some more, I'd be curious to hear more about what you're trying to build and see if there is a way we could collaborate. Or if I can offer any recommendations for software that I know of. My email is free at refinery.io :)


Thanks for the resources! It seems like a lot of these implementations expect to run as independently managed services in some microservice architecture. Are you aware of any workflow engines implemented as a library inside your application, maybe with storage backed by an external database? I think that you could still have a highly available model, provided the database supported that.


https://github.com/ing-bank/baker is one such library for JVM languages. The state is kept in Cassandra or in-memory. We've been running production workloads with it for the last 2.5 years. A feature comparison with alternatives can be found here https://ing-bank.github.io/baker/sections/feature-comparison....


Prefect. It's a workflow engine in a python library.


Funny, even though Conductor isn’t designed to be used this way, this is exactly how we use it. In a Kotlin and Spring Boot codebase, no less!


Like celery?


The idea of a pseudo flow-based programming workflow has always appealed to me because I think having logic as components that can be used like Legos makes sense. It seemed like the other stuff out there was either overkill/obtuse or too low-level for what I had in mind. So I started writing something for personal use in JS that fits my mental model of what that would be.

In a nutshell it's just a runner and uses a project called moleculer, which is a microservice framework that I'm more or less just using as an RPC client to execute the tasks of the workflow.

One thing I've been debating with myself is how I should handle the dataflow. Would you say it's better to have the results of every node merged into a singular context across an entire flow that is passed to every other node/block (i.e. always one input, the context, and one output, the context), or would it be better to explicitly declare and pass in inputs/outputs?

Here's one project that does it that way: https://github.com/danielduarte/flowed

One benefit of this seems to be that nodes can run in parallel the moment their dependencies are met without having to care about each other.
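
To make the contrast concrete, here's a rough TypeScript sketch of the two shapes I mean (all names made up, just illustrating the idea):

    // Option A: one shared context is threaded through every node
    type Context = Record<string, unknown>;
    type ContextNode = (ctx: Context) => Promise<Partial<Context>>;

    async function runWithContext(nodes: ContextNode[]): Promise<Context> {
      let ctx: Context = {};
      for (const node of nodes) {
        ctx = { ...ctx, ...(await node(ctx)) }; // each node's result gets merged back in
      }
      return ctx;
    }

    // Option B: each node declares its inputs/outputs explicitly,
    // so a runner knows exactly when its dependencies are satisfied
    interface ExplicitNode {
      inputs: string[];
      outputs: string[];
      run(args: Record<string, unknown>): Promise<Record<string, unknown>>;
    }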

I poked around on your site and I like what you have to offer. You appear to do it the singular-context way, which I suppose makes sense because it seems like every block is an individual Lambda. This also seems like overkill for my purposes because I'm interested in being able to do granular things if I wish, such as a simple rule action. I'm not sure having an individual Lambda to run "if email exists return true" would be practical. That, and the warm-up time/latency.

The other thing is managing something like npm dependencies might be annoying across blocks.

Being able to create arbitrary API endpoints is nice though.

I wish there were something a little more in between than all of these enterprisey offerings.


My company is slowly switching from Tekton to Argo (much to my chagrin). They seem very similar in that they’re incredibly alpha and fighting over similar parts of the stack.

Argo seems more CD focused right now whereas Tekton really is a “toolbox”/make what you want.


That's funny because the Tekton website[0] has this as their first text:

"Tekton is a powerful and flexible open-source framework for creating CI/CD systems, allowing developers to build, test, and deploy across cloud providers and on-premise systems."

Do you know why they are switching, out of curiosity?

The market is definitely very young and I don't think anybody has really nailed down every use case.

There are the parts of the market targeting "business model" use cases like Camunda/Zeebe.

Then there are ETL-style systems like Airflow for dealing with massive data processing.

And still you've got the CI/CD side of things like Argo/Tekton for automating complex build systems/running tests.

Then for systems like Netflix Conductor, Uber Cadence, and AWS Step Functions (among others), you have systems trying to abstract on top of existing complex systems (microservices, etc).

That's not even including low-code spaces like Zapier or IFTTT that try to target making integrations trivial.

It's a crazy world!

0: https://tekton.dev/


I had Tekton in my mental filing cabinet under "Cloud Native CI/CD", how's the UI / DAG-visualization for that toolchain? Like the sibling comment, I'd also be interested in hearing your A/B thoughts on Argo vs. Tekton for the generic workflow management usecase.


Tekton's UI is very bare-bones. It's really just what you need to see job status and logs from each job/step running. It's not _good_, but it's fine to get the job done. It's not good at surfacing errors during execution though. We spend a lot of time diagnosing why pipelines failed.

At this point in my experience with both (I have much, much more time with Tekton though), Argo is good for CD. Its visualization and syncing of environment charts (we use Helm) is useful.

Tekton is amazing for defining highly complex build/deploy workflows (which we have). Along with Tekton's sibling project Triggers, you can have full end-to-end PR -> Build -> Deploy -> Feedback into Slack/Monitoring/Alerting/etc.

If I had to make a comparison (with my current knowledge level), I'd say Argo is like Angular: batteries included, with addons you can pull in if you wish. Tekton is like React: you get pieces and build them into whatever you want.


Interesting project you started. I had used SnapLogic before and it's similar, although not built specifically for AWS serverless. Do you have any videos of walkthroughs, demos, or tutorials?

EDIT: never mind, I see your getting-started docs have lots of video clips.


I hadn't heard of Snaplogic before. Had to dig in. Thanks for sharing -- this is helpful context to have. There are so many companies across different spaces that it's hard to find them all!

Glad you found the docs here[0] (for anybody else who is curious).

We're still iterating on Refinery a lot and trying to find product market fit. If it doesn't make sense or you're confused, that feedback is super helpful for me. The goal is to build something people actually want to use and it's an iterative process to get there!

0: https://docs.refinery.io/getting-started/


Is there a serverless workflow solution for google cloud yet?


What kind of support is present for testing workflows?


Temporal (Cadence) comes with a unit testing framework. Its most interesting feature is the ability to skip time automatically when a workflow is blocked waiting for something. It allows testing long-running workflows in milliseconds without any need to change timeouts.


It's fairly limited right now to testing single blocks in the editor.

We recently added support for Git-based development to address testing. The idea there is that each block is just a chunk of code living in a Git repo, so you can use whatever testing tools you'd like to.

It's actually pretty slick -- it's bidirectional. You can use Git to make a commit, like normal, and then refresh the editor to see it. And you can also make changes in the editor and it will commit + push to Git. If you want to play with that, I can share details on how to enable it (it's fairly hidden in the UI atm).

In terms of testing workflows together as a chain, that's still TBD. It is something that we'd like to ship eventually though. Along with staging environments and canary deployments.

On our immediate roadmap is to get the core of the tool open-sourced so that people can start playing with it outside of our hosted platform. We have it gated behind a credit-card form right now to fight fraud and because of some technical limitations.


Quick notes from skimming the docs:

* Conductor implements a workflow orchestration system which seems at the highest level to be similar to Airflow, with a couple of significant differences.

* There are no "workers", instead tasks are executed by existing microservices.

* The Orchestrator doesn't push work to workers (e.g. Airflow triggering Operators to execute a DAG), instead the clients poll the orchestrator for tasks and execute when they find them.
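
A rough sketch of what that pull model looks like (not the actual Conductor client API, just the general shape, with made-up endpoints):

    const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

    // Placeholder for the actual business logic your microservice runs for a task
    async function handle(task: { taskId: string; input: unknown }): Promise<unknown> {
      return { ok: true };
    }

    // The worker polls the orchestrator for work instead of being pushed to
    async function runWorker(taskType: string, baseUrl: string) {
      while (true) {
        const res = await fetch(`${baseUrl}/tasks/poll/${taskType}`); // hypothetical endpoint
        if (res.status === 204) { await sleep(1000); continue; }      // nothing to do yet
        const task = await res.json();
        const output = await handle(task);
        await fetch(`${baseUrl}/tasks/${task.taskId}/complete`, {     // report the result back
          method: "POST",
          headers: { "content-type": "application/json" },
          body: JSON.stringify(output),
        });
      }
    }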

My hot take:

If you already have a very large mesh of collaborating microservices and want to extract an orchestration layer on top of their tasks, this system could be a good fit.

Most of what you're doing here can also be implemented in Airflow, using an HTTPOperator or GRPCOperator that triggers your services to initiate their task. You don't get things like pausing though. On the other hand, you do get the ability to run simple/one-off tasks in an Airflow operator, instead of having to build a service to run your simple Python function.

I'm unsure whether push/pull is better; I think it largely depends on your context. I'm inclined to say that for most cases, having the orchestrator push tasks out over HTTP is a better default, since you can simply load-balance those requests and horizontally scale your worker pool, and it's easier to test a pipeline manually (e.g. for development environments) if the workers respond to simple HTTP requests, instead of having to provide a stub/test implementation of the orchestrator. (In particular I'm thinking about "running the prod env on your local machine in k8s" -- this isn't practical at Netflix scale though.)


Is there a workflow tool that’s designed with microservices in mind?

My particular use case:

- several workers process the data on the workers’ local threads

- several workers serve as relays to interface with external third party services, hold all the necessary credentials, and conduct cron-like checking

- the ETL tool doesn’t directly provision these workers

The second point is part of the reason why we don’t want to use Airflow’s k8s operator.

But it doesn’t seem like there is a better option in terms of common usage and robustness. So we are leaning towards writing some custom operators and sensors to make Airflow more friendly to microservices.

Thoughts?


We've used Conductor at my workplace for about a year now. The grounding is pretty solid but the documentation is pretty pants once you dig into it. We have to resort to digging through GitHub issues to find fairly fundamental features that aren't really documented. I feel Conductor is something Netflix has open-sourced and then sort of dumped on the OSS community.

For example, there aren't any examples of how to implement workers using their Java client; we had to dig up a blog post to do that. Although it is fairly simple, a very basic example of implementing the Worker interface would be nice.

They also do not make clear the exact relationship between tasks and workflows, and it's hard to find any good examples of relatively complex workflow and task definitions on the internet other than Netflix's barebones documentation and the kitchen-sink workflow they provide, which is broken by default on the current API.

Also, the configuration mentions many fields that are pretty much undocumented. For example, you can swap out your persistence layer for something else, but I would have no idea how that works.


Surprised to see Camunda isn't mentioned here more.

Open-source, BPMN-compliant workflow processing with a history of success. Goldman Sachs supposedly runs their internal org with it.

Slightly different target use case, but Camunda has really shined in microservices orchestration, and I find implementing complex workflows and managing task dependencies much easier with it.


do you have some recommended resources to learn more about BPMN? what's your take on BPMN vs other approaches? (JSON or cadence/temporal style "workflow as code")


Very interesting. Looks a lot like zeebe [0], which uses BPMN for the workflow definition. This makes it easier to communicate the processes with the rest of the company. I never used it in production, just played around with it for a demo.

[0] https://zeebe.io/


I've looked at Zeebe, and Camunda too - likewise, just in a demo capacity.

Interested in folks' experiences deploying these tools, as this sounds like a potentially very useful way of modeling business workflows that span multiple services.


I've used Conductor, Zeebe, and Cadence all in close to production capacity. This is just my personal experience.

Conductor's JSON DSL was a bit of a nightmare to work with in my opinion. But otherwise, it did the job OK-ish. Felt more akin to Step Functions.

Arguably, Zeebe was the easiest to get started with once you get past the initial hurdle of BPMN. Their model of job processing is very simple, and because of that, it is very easy to write an SDK for it in any language. The biggest downside is that it is far from production-ready, and there are ongoing complaints in their Slack about its lack of stability and relatively poor performance. Zeebe does not require external storage since workflows are transient and replicated using their own RocksDB and Raft setup. You need to export and index the workflows if you want to keep a history of them or even if you want to manage them. It is very eventually consistent.

With both Conductor, and Zeebe however, if you have a complex enough online workflow, it starts getting very difficult to model them in their respective DSLs. Especially if you have a dynamic workflow. And that complexity can translate to bugs at an orchestration level which you do not catch unless running the different scenarios.

Cadence (Temporal) handles this very well. You essentially write the workflow in the programming language itself, with appropriate wrappers / decorators and helpers. There is no need to learn a new DSL per se. But, as a result, building an SDK for it in a specific programming language is a non-trivial exercise, and currently the stable implementations are in Java and Go. Performance- and reliability-wise, it is great (it relies on Cassandra, but there are SQL adapters, though not mature yet).

We have somewhat settled on Temporal now having worked with the other two for quite some time. We also explored Lyft's Flyte, but it seemed more appropriate for data engineering, and offline processing.

As mentioned elsewhere here, we also use Argo, but I do not think it falls in the same space as the workflow engines I have mentioned (which can handle the orchestration of complex business logic a lot better, rather than just simple pipelines for things like CI / CD or ETL).

Also worth mentioning is that we went with a workflow engine to reduce the boilerplate, and time / effort needed to write orchestration logic / glue code. You do this in lots of projects without knowing. We definitely feel like we have succeeded in that goal. And I feel this is an exciting space.


Thanks for the thoughtful reply, this is very useful.

The concept of having business users able to review (or even, holy grail, edit/author) workflows was one of the potentially appealing aspects of the BPMN products; did you get a signal on whether there were any benefits? "the initial hurdle of BPMN" sounds like maybe this isn't as good as it seems on the face of it?

Also, how do you go about testing long-lived workflows? Do any of these orchestrators have tools/environments that help with system-testing (or even just doing isolated simulations on) your flows? I've not found anything off-the-shelf for this yet.


You raised a pretty good point about being able to review the BPMN. I did not immediately think of this, but now that you have mentioned it...

1. It was good for communicating what was happening in the engine room

I remember demo'ing the workflows within my team, and to non-technical stakeholders. It was very easy to demonstrate what was happening, and to provide a live view into the state of things. From there, it was easy to get conversations going, e.g. about how certain business processes can be extended for more complex use-cases.

2. It empowered others to communicate their intent

Zeebe comes with a modeller which is simple enough even for non-technical users to stitch together a rough workflow. The problem is, the end-result often requires a lot of changes to be production-ready. But I have found that this still helps communicate ideas, and intent.

You do not really need BPMN for this, but if this becomes the standard practice, now you have a way of talking on the same wavelength. In my case, we were productionising ML pipelines so data scientists who were not incredibly attuned to data engineering practices, and limitations, were slowly able to open up to them. And as a data engineer, it became clearer what the requirements were.

On the point about testing, the test framework in Zeebe is still a bit immature. There are quite a few tools / libraries in Java, but not really in other languages. The way we approached it was lots of semi-automated / manual QA, and fixing live in production (Zeebe provides several mechanisms for essentially rescuing broken workflows).

The testing in Cadence / Temporal is definitely more mature. But you do not have the same level of simplicity as Zeebe. That said, the way I like to see it / compare them is that you could build something like Zeebe or even Conductor on Cadence / Temporal, but not vice versa.


Temporal/Cadence provide a unit testing framework that automatically skips time when a workflow is blocked. So you can unit test using standard language frameworks (like Mockito) to inject all sorts of failures. And the tests execute in milliseconds even for very long-running workflows.


I've worked with Camunda extensively, which Zeebe is based on.

I've found Camunda to be incredible. The APIs are implemented well and the workflow processing paradigm is easy to work with. Setting up the Camunda engine as a web server in a Spring project and integrating with external sources is great.

I've found there can be some performance issues when running a single engine, but clustering is easily enabled and you can adjust with a dedicated worker paradigm too.

Great piece of tech honestly. Haven't worked with Zeebe yet but am excited to.


Can someone explain how and where to use a Workflow Orchestration Engine?


These are useful for tasks that can last an arbitrarily long amount of time. Think about the process of signing up a user and waiting for her to click on an email verification link. This process can literally never end (the user never clicks), but more commonly it takes a few minutes.

It's easier to implement these things if you can write the code like:

   await sendVerificationEmail();
   await waitForUserToClick();   // This could take forever
   await sendWelcomeEmail();
If you do the above in a "normal" program, said program could stay in memory forever (consuming RAM, server can't restart, etc). The workflow engine will take care of storing intermediate state so you can indeed write the above code.

The other option is to implement a state machine via your database and some state column, but the code doesn't look as pretty as the above three lines.
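
For contrast, the state-column version ends up looking roughly like this (hypothetical names, same kind of undefined helpers as the snippet above):

    // Some external trigger (a webhook, a cron job) has to call this to move things along
    async function advanceSignup(userId: string) {
      const user = await db.getUser(userId);             // hypothetical data access layer
      switch (user.state) {
        case "NEW":
          await sendVerificationEmail(user.email);
          await db.setState(userId, "WAITING_FOR_CLICK");
          break;
        case "WAITING_FOR_CLICK":
          break;                                          // the click handler advances the state
        case "VERIFIED":
          await sendWelcomeEmail(user.email);
          await db.setState(userId, "DONE");
          break;
      }
    }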

Note that this particular tool seems to be more declarative than my example above (it uses JSON to define the steps), so instead of using an `if` statement, you'd need to declare a "Decision".
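
Going from memory of the Conductor docs, a decision in the JSON DSL looks roughly like this (field names may be slightly off):

    {
      "name": "check_email_verified",
      "taskReferenceName": "check_email_verified_ref",
      "type": "DECISION",
      "caseValueParam": "verified",
      "decisionCases": {
        "true": [
          { "name": "send_welcome_email", "taskReferenceName": "send_welcome_email_ref", "type": "SIMPLE" }
        ]
      },
      "defaultCase": []
    }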

Hope this helps!


So is it like https://github.com/uber/cadence, AWS SWF, Amazon Step Functions?


Yes, they all solve similar problems. This is a space with a lot of differing requirements, so they all share some common features, but many have different strengths for various types of tasks.


Yep. This is the dream, IMO. Serverless functions that can wait for arbitrary spans of time. It's incredible to me how much effort we spend working around this limitation. Cadence/Temporal are pretty close to this ideal, but I need to look into what Conductor has going on.


Interesting. So what holds the state in a workflow orchestrator? In your example, what if the user never clicks? Or if a server is rebooted? Is the state persisted to disk and then garbage collected after a certain threshold?


> Is the state persisted to disk

Exactly!


The other benefit of this is the process also becomes "fault oblivious" – any part of it can crash and restart, and it picks up where it left off.


I’ve seen them a lot, but two of the more canonical use cases are (complex) ETL and Continuous Delivery processes.

In both these cases, before these types of systems existed, it was usually dozens of ad-hoc scripts that handled it. Monitoring, testing, instrumentation and visibility were usually afterthoughts and often not worth the trouble.

Workflow orchestration engines solve these problems. Each of these workflow engines has a slightly different angle and is slightly better at different use cases, and as usual it can be quite important to select the right tool for the job.

But if you have found one, it can be a real boost to the quality of your processes.


I prefer the Power Automate / Logic Apps interface. It would be cool if there were an open-source Power Automate – imagine the number of plugins that would come up for that cloud tool. It's a valuable tool and part of the O365 ecosystem and could be even greater; more strategy and vision for that product would make O365 and Azure a leader in components and integrations, which is the most valuable thing in the end of it all.


There is, it's called Apache NiFi.

It's ripe for someone to make it SaaS.


Dude, awesome tip – just saw it now, thank you! Great project! Watching the videos now... this is groundbreaking stuff with little marketing behind it so far. I'm impressed with NiFi.


Check out flogo.io – specifically the Flogo Flow action https://github.com/project-flogo/flow for orchestration. You can use a web UI, the Golang SDK, or hand-edit the JSON DSL to build your orchestration flows. Deploy as binaries/containers/functions or on a cloud service like TIBCO Cloud Integration.

Disclaimer: I work at TIBCO


Wow, the product looks good and it's totally open – I can choose between the UI or coding. I saw I can code in Golang; it would be cool to be able to code in Node.js as well, thinking about AWS Lambda, and aggregate different providers there like Azure Functions and Google Functions, all within your own Kubernetes. This is a very nice concept and seems like a great architecture. Loved the tip, thanks!


Does this have the same limitations as Airflow? How does it compare to something like Prefect?[0]

[0] https://medium.com/the-prefect-blog/why-not-airflow-4cfa4232...


That blog post is a good rundown of the problems with Apache Airflow.


We started using it around 2016 in the company I work for. We decided to use it to automate the often manual setup of new clients for each product. It grew to use our own security and rights system, and we also added support for a different database (which we are working on open-sourcing). We also changed the JSON API to conform to our company-wide standard.

At the time, we wanted something that we could host ourselves, that was maintained and open source, and that worked!

Nowadays internal teams also use it to automate their own processes as well.

We’d probably go for a “push based” workflow engine now, maybe based on events, mainly for latency and load reasons, but it’s something we’re ok with so far (there is a way to listen to events for some tasks, though it’s not that easy).

If I’m not mistaken, Netflix uses it to automate video encoding for shows, but that might be outdated.

Overall, we’re pleased with it. But here are some cons: we wish we could split some services out (such as read-only ones, or the workflow definitions from the executions, etc.), but the code isn’t architected for such an easy split. For example, pushing the result of a task computation from a worker triggers the current workflow to determine the next task to schedule, but it does this internally, and not through the defined interfaces. Security of the API is not so easy either, as it’s not really modular (unlike the database implementation, which is great). That point is being worked on though, so there is some hope for the future.


For better or worse, we ended up creating our own workflow engine at my company. Unfortunately, everyone who ends up using it hates it. We've also run into the problem where the entire process of producing our end product is encoded in the workflow. Downstream steps depend on earlier steps, etc. If any part of the process changes, managing this data becomes tough.

Additionally, we have software engineers writing these workflows. Ideally we would have tooling so that those who know the process can write these things. The difficulty we have had though is making it easy to join/match up earlier parts of the process with later steps. We do this now by keeping a lot of data in the workflow and by occasionally persisting data in other places. Software engineers, not the process people, are the ones who understand the data model and how to munge everything together.

Have others dealt with this issue?


I've been really interested in creating my own workflow engine for personal use and I kind of had the same thought while trying to plan it out.

One approach that comes to mind to solve this sort of end-user issue would be to explicitly define all inputs/outputs for the dataflow of a node as "requires" and "provides". In this manner each node would run in parallel the moment its requirements (i.e. dependencies) are met. Additionally, since the data each node needs and provides is explicitly defined, you could technically wire nodes together automatically without really needing to know the underlying data model itself.

So it just means needing to define clear unique labels for each port. In a UI a user could just drop nodes into a space which would automatically wire up to matching ports. You can then display which needs are not met and even have an interface for choosing matching nodes that fit what might be missing.

In the end all you would really need to know is how to compose the pieces of logic to get the desired outcome.
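
As a naive TypeScript sketch of that idea (everything here is made up):

    interface FlowNode {
      id: string;
      requires: string[];  // labels this node needs before it can run
      provides: string[];  // labels it adds to the shared pool when it finishes
      run(inputs: Record<string, unknown>): Promise<Record<string, unknown>>;
    }

    // Run every node as soon as its requirements are available, in parallel where possible
    async function execute(nodes: FlowNode[]): Promise<Record<string, unknown>> {
      const pool: Record<string, unknown> = {};
      const pending = new Set(nodes);
      while (pending.size > 0) {
        const ready = [...pending].filter((n) => n.requires.every((r) => r in pool));
        if (ready.length === 0) throw new Error("unsatisfiable requirements");
        ready.forEach((n) => pending.delete(n));
        const results = await Promise.all(
          ready.map((n) => n.run(Object.fromEntries(n.requires.map((r) => [r, pool[r]]))))
        );
        results.forEach((out) => Object.assign(pool, out));
      }
      return pool;
    }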

I'm a novice at this stuff though so take that with a grain of salt.


Yup, in theory it makes sense. We do define inputs/outputs for each task and workflow, but it's somewhat crude; for instance, we just check if the variable name will be available but not if a specific key is in a dictionary. We can definitely improve this though with schemas (probably JSON Schema) and validation.
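
For what it's worth, that could look something like this (using ajv as one example; the schema and field names are made up):

    import Ajv from "ajv";

    const ajv = new Ajv();

    // Declared input schema for a task, instead of only checking that a variable name exists
    const orderInputSchema = {
      type: "object",
      required: ["order"],
      properties: {
        order: {
          type: "object",
          required: ["boxType", "quantity"], // a missing key now fails fast
          properties: {
            boxType: { type: "string" },
            quantity: { type: "integer" },
          },
        },
      },
    };

    const validateOrderInput = ajv.compile(orderInputSchema);
    if (!validateOrderInput({ order: { boxType: "square", quantity: 3 } })) {
      console.error(validateOrderInput.errors);
    }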

Wiring this together _without any user input_ (which is our goal) though is very, very hard. Let's say we have two parallel branches – one that orders boxes with holes in them and another that orders shapes that can fit in those holes. When both orders are in, the workflow can join and we want to put the shapes into boxes. How do we know which shape can go in each box? Maybe we have a BoxType with a list of possible shapes. This gets very complicated though when you have many attributes and care about different attributes at different stages of the process. Additionally, if the process should update the database at some point to, say, change a flag from `false` to `true`, the user would need to know the underlying data model.


So if it's a long-running task, that's definitely a more complex problem that would presumably require some sort of queue/pause/resume system, but I imagine the same concept could still apply.

The node that merges context would simply have a requirement/dependency on the results "provided" by each branch. It would wait to execute until those requirements were met.

The user still wouldn't need to know about the data model; the "merge node" would just intrinsically wait for the results provided from the separate branches.

Fundamentally, when each node completes, the system itself just needs to check against the list of nodes to see if their dependencies have just been met, and keep track of which nodes have already been executed so that they don't end up getting triggered again when the next check happens.

There will always be a need for some sort of workflow-level state or context management that governs all of this orchestration, which you would want to persist to a database somewhere if this is a long-running workflow, but this is a systems concern and the user doesn't need to know about it.

That was just a long way of me more or less saying that it doesn't matter how many branches there are; all that matters ultimately is that a node waits to execute until its requirements are met.


I guess what I'm trying to say is that the logic in the merge node is where the magic happens. So you have parallel branches, A and B. A orders boxes while B orders shapes. Somewhere in A and B we created Box and Shape instances and those are outputs of A and B, respectively. The merge node, C, waits until A and B are completed and takes their outputs as input. The merge node needs to know how to match up the Box and Shape instances to say shape 2 can fit in box 1.


What you have described is quite similar to what Lyft's Flyte is trying to accomplish https://flyte.org/

A lot of TensorFlow-inspired DAG approaches handle the described node processing in the same way.


I have been exploring workflow orchestration for some time now – specifically Temporal. Temporal's authors don't recommend it for very high-throughput (per workflow) use cases, although I haven't benchmarked it myself. Also, using it in a SaaS environment, I would prefer some serverless deployment strategy which possibly allows scaling down to zero.

I have my eye on Flink Stateful Functions (http://statefun.io/). The abstractions are quite low-level compared to Temporal, but the overall ability to write tasks/activities as serverless functions which have access to state is quite attractive.

Would be happy to talk to someone who has explored this further.


Seems neat. I guess this partially solves the problem of having some workflow stuck/dropped.

I wonder how much overhead there is. How much latency does each task cause?

Is it feasible to complete workflows while a user/client is waiting for a RESTful response?


Since this has come up several times: Airflow is an orchestration tool for ETL jobs (long-running, complex processes) and Netflix Conductor is an orchestration tool for microservices (short-running, simple processes).


There is also https://github.com/dapr/workflows which uses Azure Logic Apps engine on https://github.com/dapr/


> Almost no way to systematically answer “How much are we done with process X”?

is this a typo?


Any difference with airflow?



It's great to see corporations getting more involved in open source software; giving back and empowering the developer community.


can somebody ELI5, why would someone need such a workflow orchestration engine? What problems are best solved with workflow engines?


really good answer in another comment https://news.ycombinator.com/item?id=24215303


How does this compare with Cadence/Temporal?


You need to specify a DAG, rather than just write regular code


I wonder what the relationship is to StackStorm[1]? StackStorm is older and lists Netflix as a sponsor / user.

[1] https://stackstorm.com/


Comparison with Airflow will be helpful.



