
What I really want is a way to define a graph of tasks and dependencies, and then execute it across n workers.

Something like

    parq add --task-name wget1 wget http://example.com/foo.json
    parq add --task-name wget2 wget http://example.com/bar.json
    parq add --dependencies wget1,wget2 jq -s '.[0] * .[1]' foo.json bar.json
    parq run -n 2 # executes wget1 and wget2 in parallel using two workers, then merges them with jq



Hi!

Consider trying redo[0]. It's an idea from D. J. Bernstein (a.k.a. djb), which is already a good advertisement.

Your problem can be solved with make, as others have pointed out, but it's also a wonderful example where redo's .do files show very clearly what redo can do.

redo's .do files are usually shell scripts, but they can be anything the kernel can execute. `redo-ifchange file1` declares a dependency on file1 and waits until file1 has been (re)built --- in other words, it waits until file1's .do file has been run, if that was necessary.

Here are three .do files that show how to solve your problem --- downloading and merging two files:

all.do:

  DEPS="foo.json bar.json"
  redo-ifchange $DEPS
  jq -s '.[0] * .[1]' $DEPS

foo.json.do:

  curl "http://example.com/foo.json"  # redo guarantees that any errors won't update foo.json as it can happen in make world.

bar.json.do:

  curl "http://example.com/bar.json"

After creating these files, run `redo all` (or just `redo`): it builds the dependency graph and executes the targets in parallel --- foo.json and bar.json are downloaded at the same time.

I'd recommend getting started with the Go implementation of redo --- goredo[1] by stargrave. The web site also links to documentation, an FAQ, and other implementations.

[0] http://cr.yp.to/redo.html

[1] http://www.goredo.cypherpunks.ru


I'll note that nq's author also has a redo implementation¹. Being generally redo-curious, I've wondered a few times why their other projects (nq/mblaze/etc.) don't use redo, but never actually asked.

¹ https://github.com/leahneukirchen/redo-c


Use a Makefile? e.g., GNU make can perform parallel execution.


    $ make jq -j2 -f - <<EOF
    # note: recipe lines must be indented with a real tab
    file1:
            wget -O file1 http://example.com/foo.json
    file2:
            wget -O file2 http://example.com/bar.json
    jq: file1 file2
            jq -s '.[0] * .[1]' file1 file2
    EOF
Could probably make some syntactic sugar for this.
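
A rough sketch of what that sugar could look like --- a tiny shell wrapper that records tasks as make rules and later runs make -j. The parq name, file layout, and argument handling are all invented here for illustration, and quoting is deliberately naive:

    #!/bin/sh
    # parq: toy front-end that appends tasks to a throwaway Makefile and
    # later runs it with make -j.  Commands containing shell metacharacters
    # should be passed as a single quoted string.
    MF=${PARQ_MAKEFILE:-.parq.mk}

    case "$1" in
      add)
        # usage: parq add <name> <comma,separated,deps or "-"> <command...>
        name=$2; deps=$3; shift 3
        [ "$deps" = - ] && deps= || deps=$(echo "$deps" | tr ',' ' ')
        { printf '%s: %s\n' "$name" "$deps"; printf '\t%s\n' "$*"; } >> "$MF"
        ;;
      run)
        # usage: parq run -j2 <final-target>
        shift; exec make -f "$MF" "$@"
        ;;
      *)
        echo "usage: parq add|run ..." >&2; exit 1
        ;;
    esac

Which would let you write something close to the interface asked for at the top of the thread:

    $ parq add wget1 - 'wget -O foo.json http://example.com/foo.json'
    $ parq add wget2 - 'wget -O bar.json http://example.com/bar.json'
    $ parq add merge wget1,wget2 "jq -s '.[0] * .[1]' foo.json bar.json"
    $ parq run -j2 merge

Since wget1 and wget2 never exist as files, make reruns them every time; real sugar would want stamp files or .PHONY handling, so treat this as a sketch only.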


You can use a semicolon instead of \n\t for a neater file.
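
For example, the heredoc above rewritten with one-line rules (the semicolon starts the recipe, so no tab indentation is needed):

    file1: ; wget -O file1 http://example.com/foo.json
    file2: ; wget -O file2 http://example.com/bar.json
    jq: file1 file2 ; jq -s '.[0] * .[1]' file1 file2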


What you're looking for is in the class of tools known as batch schedulers. Most commonly these are used on HPC clusters, but you can use them on any size machine.

There are a number of tools in this category, and as others have mentioned, my first try would be Make, if that is an option for you. However, I normally work on HPC clusters, so submitting jobs is incredibly common for me. To keep to that workflow without needing to install SLURM or SGE on my laptop (which I've done before!?!?), my entry into this mix is here: https://github.com/compgen-io/sbs. It is a single-file Python3 script.

My version is set up to run on only one node, but you can have as many worker threads as you need. For what you asked for, you'd run something like this:

    $ sbs submit -- wget http://example.com/foo.json
    1
    $ sbs submit -- wget http://example.com/bar.json
    2
    $ sbs submit -afterok 1:2 -cmd jq -s '.[0] * .[1]' foo.json bar.json
    $ sbs run -maxprocs 2
This isn't heavily tested code, but it works for the use case I had (having a single-file batch scheduler for when I'm not on an HPC cluster, and testing my pipeline definition language). Normally, it assumes the parameters (CPUs, memory, etc.) are present as comments in a submitted bash script (as is the norm in HPC-land). However, I also added an option to directly submit a command. stdout/stderr are captured and stored.
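
As a concrete illustration of "parameters as comments": this is SLURM-style syntax, and sbs's own directive names may differ, so treat it as an example of the pattern rather than sbs's exact format:

    #!/bin/bash
    #SBATCH --job-name=merge-json
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=512M
    # The resource requests live in comments, so this remains a plain
    # bash script when run outside any scheduler.
    jq -s '.[0] * .[1]' foo.json bar.json > merged.json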

The job runner also has a daemon mode, so you can keep it running in the background if you'd like to have things running on demand.

Installation is as simple as copying the sbs script someplace in your $PATH (with Python3 installed). You should also set the ENV var $SBSHOME, if you'd like to have a common place for jobs.

The usage is very similar to many HPC schedulers...

    sbs submit
    sbs status
    sbs run
    sbs cancel
etc...


For heavy lifting, check out OAR at https://oar.imag.fr/


Never used OAR...

I've used (and installed) PBS, SGE, and SLURM [1]. Most of the clusters I've used recently have migrated to SLURM. Even though it's pretty feature-packed, I've found it "easy enough" to install for a cluster.

What is the sales pitch for OAR? Any particularly compelling features?

[1] https://slurm.schedmd.com/


(I'm not an expert)

OAR, according to knowledgeable people I've worked with, is robust and extensible.


all of the options can probably claim that...


Check out walk[1]. It does exactly this. Lets you define a graph of dependencies in any language of your choice.

[1](https://github.com/ejholmes/walk/)
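
If I remember walk's convention correctly, you write a single executable Walkfile that is called with a phase (deps or exec) and a target name. A rough sketch for the wget + jq example from the top of the thread (details unverified; check walk's README for the exact interface):

    #!/bin/sh
    # Walkfile: $1 is the phase ("deps" or "exec"), $2 is the target.
    target=$2
    case "$1" in
      deps)
        case "$target" in
          merged.json) echo foo.json; echo bar.json ;;
        esac
        ;;
      exec)
        case "$target" in
          foo.json|bar.json) wget -O "$target" "http://example.com/$target" ;;
          merged.json) jq -s '.[0] * .[1]' foo.json bar.json > "$target" ;;
        esac
        ;;
    esac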


  $ cat Makefile

  url_base := http://example.com

  %.json:
          wget -O $@ $(url_base)/$@

  .PHONY: all
  all: foo.json bar.json
          jq -s '.[0] * .[1]' $^


  $ make -j2


I imagine that, in theory, Snakemake, which handles dependency-graph resolution, could be used to compute the dependencies, and its flexible scheduler could then call nq.

OTOH, if just working on one node, skip nq and use Snakemake as the scheduler as well.


Slightly related to this, I'm building a ruby gem that lets you create such a workflow:

https://gitlab.com/sdwolfz/cache_box_rb

I guess some slight tweaks for task persistence and a CLI wrapper could let you achieve this (although I don't leverage Ractors, so no true parallelism yet).

Anyway, it does not yet have an "official" release or a stable API, although the code works well and is fully tested, as far as I can tell. I might provide such a wrapper myself in the future, as I can definitely see its utility, but time is short nowadays.


I think Airflow and Celery can do this, although I’ve never tried it myself.


They can, but those tools are much more heavyweight than something like nq


Drake is nice for these kinds of data dependencies.

https://github.com/Factual/drake


There are some other fine suggestions here, but I'll also mention that Luigi should handle this fine as well.


At some point we end up in distributed / concurrent system design. It's a very interesting topic.


It is called GNU Make.



