Toil – A Python Workflow Engine (toil.readthedocs.io)
111 points by nikolay on May 4, 2016 | 28 comments



After Arvados (http://arvados.org), Toil has the most complete support for the Common Workflow Language (http://commonwl.org), which is an emerging standard for writing portable workflows that can run on different platforms, instead of being tied to a particular engine or grid/cluster/cloud technology.


What exactly are workflows here? I tried following some links but can't quite figure it out.


This appears to be a series of commands used for processing data. The Workflow Description Language (WDL) repo probably has the best overview within one or two levels of links:

https://github.com/broadinstitute/wdl

Essentially a formalization of a Unix command pipe, with the ability to tee and recombine components, plus the ability to define valid inputs and outputs.
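As a toy sketch in Python (names invented; this is not any engine's real API), the difference from a plain pipe is that a step can fan out into branches that are later recombined:

  def source():                      # produces data
      return ["a", "bb", "ccc"]

  def lengths(xs):                   # branch 1 of the tee
      return [len(x) for x in xs]

  def shout(xs):                     # branch 2 of the tee
      return [x.upper() for x in xs]

  def combine(ls, us):               # fan-in: recombine the branches
      return list(zip(us, ls))

  data = source()
  print(combine(lengths(data), shout(data)))  # [('A', 1), ('BB', 2), ('CCC', 3)]

A workflow language adds declared types for each step's inputs and outputs, so an engine can check the wiring and run independent branches in parallel.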


I do not want to be snarky, but are these languages actually used and what do they bring to the table?

I have to admit, if I hear the term "workflow" I get very suspicious as there does not seem to be a consensus on what exactly a "workflow" is. The only thing people seem to agree on is the idea that "a workflow describes some kind of data flow" which is pretty broad as this includes all programs.

So a "workflow engine" is then ... an interpreter for a (mostly domain-specific) language?


There's a push in the bioinformatics world to try to use these (WDL in particular). The main use case is to write complex data-analysis pipelines in WDL so they can be shared and executed in different environments (local machines, cloud VMs, etc.).

Think of it more like an "execution engine" rather than a workflow engine. You define multiple jobs, their resources, and how they interact. The engine executes each step.

Unfortunately though, to me, it seems clunky to use. It's not something that I'd like to write by hand. It is logically very similar to defining Java builds with Ant XML files, if that helps. You can get the job done, but if you want to have any sort of complexity or logic, it's not going to be easy. Any XML/JSON/YAML format is going to have this issue.

I think the jury is still out, though, as to how much it gets used outside of a few large academic centers. I see it as mainly helpful for large-scale analysis projects where you have to manage cloud resources, and not many people have that problem. I'm personally only planning on using it as an output for my own language (which is basically a Make-like format that submits dynamic jobs to HPC clusters). I have yet to see how WDL can handle dynamic workflows (where you might have more than one path to an output, or change the workflow based on a command-line argument). It might catch on, but I've seen this show before.
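(For what it's worth, Toil, the engine in the OP, approaches the dynamic case by letting a running job extend the graph. A minimal sketch following its documented job-function pattern; the function names and file name here are invented:)

  from toil.job import Job

  def split_input(job, path):
      n = 3  # imagine this is only discovered at run time, e.g. by inspecting path
      for i in range(n):
          job.addChildJobFn(process_part, path, i)  # the graph grows as it runs

  def process_part(job, path, i):
      job.fileStore.logToMaster("processing %s, part %d" % (path, i))

  if __name__ == "__main__":
      options = Job.Runner.getDefaultOptions("./jobstore")
      Job.Runner.startToil(Job.wrapJobFn(split_input, "input.dat"), options)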


I work on a workflow engine (aka a DAG processor) that dates to before 2008. We use it for many things in an astronomy experiment. It has executed over 900M batch jobs and 200M "scriptlets", and most of those workflows can be re-executed at any stage, even today, five-plus years after they were originally run, to recreate data products. The workflow engine also effectively serves as a provenance database for each data product, tracing its origins back to raw data.


I created the WDL specification and parser. You're right that a "workflow" is a squishy term that could mean different things to different people and the lowest common denominator definition is very broad. WDL attempts to give at least one formal definition of what a workflow is. Maybe using terms like "WDL workflow" could make it more concrete.

And to answer your question about whether these languages are actually used... absolutely! We use WDL extensively, but it is relatively new, only about a year old. We've had a lot of success with it so far.

What does WDL bring to the table? Well, WDL is essentially Makefile on steroids, designed for cloud computing. We want to be able to take a shell script and turn it into a function with parameters and return values. Like this: https://github.com/broadinstitute/wdl/blob/gvda_add_scripts_.... WDL calls this a "task".

Then, once we define a bunch of tasks, we need to be able to link them together: have a subset of one WDL task's outputs be the inputs to another task. WDL calls this a "workflow". Here is an example of invoking the above task: https://github.com/broadinstitute/wdl/blob/gvda_add_scripts_...
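Since those links are truncated, here is the gist in plain Python (illustrative only; WDL has its own syntax, and the names here are made up). A task wraps a shell command as a function with typed inputs and outputs, and a workflow wires such functions together:

  import subprocess

  def line_count(path):          # a "task": a shell step as a function
      out = subprocess.check_output(["wc", "-l", path])
      return int(out.decode().split()[0])   # declared output: an Int

  def total_lines(paths):        # a "workflow": task outputs feed task inputs
      return sum(line_count(p) for p in paths)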

The workflows we're creating are far too large to be run on a single machine, so we wanted a way to specify what each task needs in terms of CPU/memory/disk so we can provision machines that are capable of running each task. Parallelism is also a very important aspect of WDL, and all workflow languages/engines.

As others have said, this is not a new idea. WDL's primary goal is to make it as easy as possible for people to read, write, maintain, and extend workflows. We knew that these "workflows" would be code and would evolve; they'd have to be code-reviewed and diffed, and there'd be pull requests of WDL files, so we needed a way of describing them that would work well in that environment.


(Toil contributor here)

Workflow systems are generally used to orchestrate parallel batch jobs on a cluster, which may involve hundreds or thousands of individual tasks. So the emphasis is on defining which units of work are independent and can run on separate nodes without needing to communicate with other tasks, and on how the outputs of one task feed into the inputs of another.
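Concretely, that wiring in Toil looks roughly like this (a sketch following the quickstart pattern in the docs; the function names and resource figures are invented):

  from toil.job import Job

  def make_data(job):
      return "intermediate result"  # this return value feeds the child job

  def use_data(job, upstream):
      job.fileStore.logToMaster("got: %s" % upstream)

  if __name__ == "__main__":
      options = Job.Runner.getDefaultOptions("./jobstore")
      # per-task resource hints let the engine place tasks on suitable nodes
      root = Job.wrapJobFn(make_data, memory="2G", cores=2)
      root.addChildJobFn(use_data, root.rv())  # rv() is a promise of the return value
      Job.Runner.startToil(root, options)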


Could Toil be extended to work on something like, say, Mesos (maybe using the Marathon scheduler), or would you be better off writing a Toil framework instead?


AFAIK, there's some support in there already:

""" Develop and test on your laptop then deploy on any of the following:

Commercial clouds: - Amazon Web Services (including the spot market) - Microsoft Azure Private clouds: - OpenStack High Performance Computing Environments: - GridEngine - Apache Mesos - Parasol - Individual multi-core machines """

http://toil.readthedocs.io/en/latest/
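The backend is chosen at run time rather than baked into the workflow; assuming Toil's documented --batchSystem option, a script that parses Toil's standard options might be launched against Mesos along the lines of:

  python my_workflow.py ./jobstore --batchSystem mesos

(the option name is from the Toil docs; the script and job-store names are placeholders).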


I'm not sure about usage of WDL, but the motivating use case makes sense to me. Currently there are several ways of creating a 'workflow' (Makefiles, Airflow, Luigi) for things that need to be run over and over again. Many of those workflows are stored in a database; a plain-text representation of a workflow is something that could be easily versioned, or perhaps even generated by a language of your choice.

A "workflow engine" could encompass the definition of a graph of tasks that need to be run, the monitoring of those tasks, retry logic for failures, alerting for monitoring and some form of UI for viewing progress.


Not Python 3 compatible, eh? I thought that was disheartening at first, but then again the project started in 2011.


True, but at the same time it's odd for open-source projects not to have py3k compatibility. Hell, the stuff I have on GitHub has no Python 2 compatibility, because honestly, after spending some time in Python 3, going back really feels like a step backward (bytes vs. strings, etc.). Unless you are getting paid or depend on libraries that are only on py2, why would you use Python 2 instead of 3?


Because RHEL6 and a-hole sysadmins who are too lazy to install SCL are all too common.


I'm a sysadmin who uses python27, CentOS, and RHEL on a daily basis, and I've never personally installed it, simply for not knowing anything about it; nor has anybody asked me to install it. The name rings a slight bell, but it doesn't sound like a Python package, so I can kinda see how I could have overlooked it many, many times. Besides, installing numpy/scipy/etc. gets you a similar build in just a few commands anyway, so I've never even needed to think about alternatives.

I understand your frustration, but calling us assholes and lazy is just a little harsh. At this point, I'd have hoped everyone understood that this wasn't the sysadmin's fault. How insistent or pushy or hoity-toity were you when asking? I've been woken enough times while on call after a 2.7 upgrade to pretty much give up on it, so even the request gives me a blinding headache these days. So yeah, I'll probably be a little grumpy at that request, and I will take my time with it on top of the other things going on. If I were called lazy or an asshole for bringing these issues up, be ready to wait.

Hell, to flip it back on y'all, would it hurt to have proper shebangs? For every developer who asks for a 2.7 upgrade, I get three more asking to downgrade until they fix their badly self-deployed code, despite plenty of preparation. It sucks pretty hardcore.

That said, I have a flood of baseline rhel/centos machines to install soon, so I'll try to include it in my builds.

(most of these issues I've had are in larger environments where best practices are best-effort, through legacy swamps.)


I believe SCL refers to this: https://en.wikipedia.org/wiki/Scientific_Linux

It's an unusual abbreviation though.



Don't think so (first link that seemed to make some sense) - http://ftp.riken.jp/Linux/cern/scl/slc6-scl.repo

Funny - I actually just spent the past week visiting one of the original scientific linux guys. I don't remember him calling it that, though. :)


Toil is very much reminiscent of Luigi. I hope the author(s) will elaborate on this here (or elsewhere); there's very little about it on readthedocs or in their GitHub repo: https://github.com/BD2KGenomics/toil


Seems like a lot of work to set up. Also, I spent several minutes on the docs and did not see one line of code; examples?


An 8-line Makefile translates to an 80-line Common Workflow Language (.cwl) file.

https://github.com/common-workflow-language/workflows/tree/m...


Honestly, the format looks a bit over-engineered, IMHO. The task is quite simple: make a build system that executes jobs on clusters. So why not take the best build-configuration-format practices and just make them run over a network? For example, the ninja build system's [1] format is quite good in my opinion, so just make the runtime execute commands over the network. Travis CI [2] is another example of a well-designed configuration format, and it really enables developers to write small and powerful configurations. Sure, it has even been done before (though mostly for C/C++ stuff): IncrediBuild [3], for example, or FASTBuild [4] or distcc [5]. The case of precise control over pipes could be improved in current build systems, though I'm not sure how important that is for this application.

- [1] https://ninja-build.org/
- [2] https://travis-ci.org/
- [3] https://www.incredibuild.com/
- [4] http://fastbuild.org/
- [5] https://github.com/distcc/distcc


Haven't checked ninja, but I've blogged a bit on limitations in common build systems, such as make and its various derivatives:

"The problem with make for scientific workflows":

http://bionics.it/posts/the-problem-with-make-for-scientific...

"Workflow tool makers: Allow defining data flow, not just task dependencies"

http://bionics.it/posts/workflows-dataflow-not-task-deps

The last of these is a limitation of even the most heavily engineered ones, as the post goes on to explain.


From the first blog post:

> Files are represented by strings

I think that's especially true for make; it looks like it was designed to efficiently express operations for transformations of the same type (like .cpp -> .o/.obj), so in a different use case it may become a bit clumsy. Ninja should help a bit here: you can define a rule and just use the rule name when declaring the inputs and outputs of a build statement, though it still operates on files.

>[Problems with] combinatorial dependencies

Yes, this could partially be fixed with wildcards in make. Ninja doesn't have wildcard support, so I created buildfox [1] to fix that :)

>Non-representable dependency structures

I think that's a limitation of this type of build system: their configuration languages are oriented toward expressing "how" to achieve things, not "what" to achieve.

- [1] https://github.com/beardsvibe/buildfox/


Ouch. I suppose that's the price paid for portability, though.

Anything beyond that trivial example Makefile will instead reflect the system and environment on which it was written.


Is there a GUI tool to work with CWL? This could take off if an ETL-like GUI tool could generate the config for you.



Nice.

A presentation on this (or similar) subject would be nice at PyData Paris (http://pydata.fr/), the CFP is open.



