
What I really want is a way to define a graph of tasks and dependencies, and then execute it across n workers.

Something like

    parq add --task-name wget1 wget http://example.com/foo.json
    parq add --task-name wget2 wget http://example.com/bar.json
    parq add --dependencies wget1,wget2 jq -s '.[0] * .[1]' foo.json bar.json
    parq run -n 2 # executes wget1 and wget2 in parallel using two workers, then merges them with jq



Hi!

Consider trying redo[0]. It's an idea from D. J. Bernstein (a.k.a. djb), which is already a good advertisement.

Your problem can be solved with make, as others have pointed out, but it's also a wonderful example where redo's .do files show very clearly what redo can do.

redo's .do files are usually shell scripts, but they can be anything the kernel can execute. `redo-ifchange file1` declares a dependency on file1 and waits until file1 has been (re)built --- in other words, it waits until file1's .do file has been run, if that was necessary.

Here are three .do files that show how to solve your problem --- downloading and merging two files:

all.do:

  DEPS="foo.json bar.json"
  redo-ifchange $DEPS
  jq -s '.[0] * .[1]' $DEPS

foo.json.do:

  curl "http://example.com/foo.json"  # redo guarantees that any errors won't update foo.json as it can happen in make world.

bar.json.do:

  curl "http://example.com/bar.json"

After creating these files, run `redo all` (or just `redo`): it builds the dependency graph and executes the targets in parallel --- foo.json and bar.json are downloaded at the same time.

I'd recommend getting started with the Go implementation of redo --- goredo[1] by stargrave. The web site also links to documentation, an FAQ, and other implementations.

[0] http://cr.yp.to/redo.html

[1] http://www.goredo.cypherpunks.ru


I'll note that nq's author also has a redo implementation¹. Being generally redo-curious, I've wondered a few times why their other projects (nq/mblaze/etc.) don't use redo, but never actually asked.

¹ https://github.com/leahneukirchen/redo-c


Use a Makefile? e.g., GNU make can perform parallel execution.


    $ make jq -j2 -f - <<EOF
    # note: recipe lines must be indented with a real tab
    file1:
            wget -O file1 http://example.com/foo.json
    file2:
            wget -O file2 http://example.com/bar.json
    jq: file1 file2
            jq -s '.[0] * .[1]' file1 file2
    EOF
Could probably make some syntactic sugar for this.
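
A rough sketch of what that sugar could look like --- a tiny shell wrapper that records tasks as make rules and later runs make -j. The parq name, file layout, and argument handling are all invented here for illustration, and quoting is deliberately naive:

    #!/bin/sh
    # parq: toy front-end that appends tasks to a throwaway Makefile and
    # later runs it with make -j.  Commands containing shell metacharacters
    # should be passed as a single quoted string.
    MF=${PARQ_MAKEFILE:-.parq.mk}

    case "$1" in
      add)
        # usage: parq add <name> <comma,separated,deps or "-"> <command...>
        name=$2; deps=$3; shift 3
        [ "$deps" = - ] && deps= || deps=$(echo "$deps" | tr ',' ' ')
        { printf '%s: %s\n' "$name" "$deps"; printf '\t%s\n' "$*"; } >> "$MF"
        ;;
      run)
        # usage: parq run -j2 <final-target>
        shift; exec make -f "$MF" "$@"
        ;;
      *)
        echo "usage: parq add|run ..." >&2; exit 1
        ;;
    esac

Which would let you write something close to the interface asked for at the top of the thread:

    $ parq add wget1 - 'wget -O foo.json http://example.com/foo.json'
    $ parq add wget2 - 'wget -O bar.json http://example.com/bar.json'
    $ parq add merge wget1,wget2 "jq -s '.[0] * .[1]' foo.json bar.json"
    $ parq run -j2 merge

Since wget1 and wget2 never exist as files, make reruns them every time; real sugar would want stamp files or .PHONY handling, so treat this as a sketch only.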


You can use a semicolon instead of \n\t for a neater file.
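
For example, the heredoc above rewritten with one-line rules (the semicolon starts the recipe, so no tab indentation is needed):

    file1: ; wget -O file1 http://example.com/foo.json
    file2: ; wget -O file2 http://example.com/bar.json
    jq: file1 file2 ; jq -s '.[0] * .[1]' file1 file2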


What you're looking for is in the class of tools known as batch schedulers. Most commonly these are used on HPC clusters, but you can use them on any size machine.

There are a number of tools in this category, and as others have mentioned, my first try would be Make, if that is an option for you. However, I normally work on HPC clusters, so submitting jobs is incredibly common for me. To keep to that workflow without needing to install SLURM or SGE on my laptop (which I've done before!?!?), my entry into this mix is here: https://github.com/compgen-io/sbs. It is a single-file Python3 script.

My version is set up to run on only one node, but you can have as many worker threads as you need. For what you asked for, you'd run something like this:

    $ sbs submit -- wget http://example.com/foo.json
    1
    $ sbs submit -- wget http://example.com/bar.json
    2
    $ sbs submit -afterok 1:2 -cmd jq -s '.[0] * .[1]' foo.json bar.json
    $ sbs run -maxprocs 2
This isn't heavily tested code, but it works for the use case I had (having a single-file batch scheduler for when I'm not on an HPC cluster, and testing my pipeline definition language). Normally, it assumes the parameters (CPUs, memory, etc.) are present as comments in a submitted bash script (as is the norm in HPC-land). However, I also added an option to directly submit a command. stdout/stderr are captured and stored.
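
As a concrete illustration of "parameters as comments": this is SLURM-style syntax, and sbs's own directive names may differ, so treat it as an example of the pattern rather than sbs's exact format:

    #!/bin/bash
    #SBATCH --job-name=merge-json
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=512M
    # The resource requests live in comments, so this remains a plain
    # bash script when run outside any scheduler.
    jq -s '.[0] * .[1]' foo.json bar.json > merged.json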

The job runner also has a daemon mode, so you can keep it running in the background if you'd like to have things running on demand.

Installation is as simple as copying the sbs script someplace in your $PATH (with Python3 installed). You should also set the ENV var $SBSHOME, if you'd like to have a common place for jobs.

The usage is very similar to many HPC schedulers...

    sbs submit
    sbs status
    sbs run
    sbs cancel
etc...


For heavy lifting, check out OAR at https://oar.imag.fr/


Never used OAR...

I've used (and installed) PBS, SGE, and SLURM [1]. Most of the clusters I've used recently have migrated to SLURM. Even though it's pretty feature-packed, I've found it "easy enough" to install for a cluster.

What is the sales pitch for OAR? Any particularly compelling features?

[1] https://slurm.schedmd.com/


(I'm not an expert)

OAR, according to knowledgeable people I've worked with, is robust and extensible.


all of the options can probably claim that...


Check out walk[1]. It does exactly this. Lets you define a graph of dependencies in any language of your choice.

[1](https://github.com/ejholmes/walk/)
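
If I remember walk's convention correctly, you write a single executable Walkfile that is called with a phase (deps or exec) and a target name. A rough sketch for the wget + jq example from the top of the thread (details unverified; check walk's README for the exact interface):

    #!/bin/sh
    # Walkfile: $1 is the phase ("deps" or "exec"), $2 is the target.
    target=$2
    case "$1" in
      deps)
        case "$target" in
          merged.json) echo foo.json; echo bar.json ;;
        esac
        ;;
      exec)
        case "$target" in
          foo.json|bar.json) wget -O "$target" "http://example.com/$target" ;;
          merged.json) jq -s '.[0] * .[1]' foo.json bar.json > "$target" ;;
        esac
        ;;
    esac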


  $ cat Makefile

  url_base := http://example.com

  %.json:
          wget -O $@ $(url_base)/$@

  .PHONY: all
  all: foo.json bar.json
          jq -s '.[0] * .[1]' $^


  $ make -j2


I imagine that, in theory, Snakemake, which handles dependency-graph resolution, could be used to compute the dependencies, and its flexible scheduler could then call nq.

OTOH, if just working on one node, skip nq and use Snakemake as the scheduler as well.


Slightly related to this, I'm building a ruby gem that lets you create such a workflow:

https://gitlab.com/sdwolfz/cache_box_rb

I guess some slight tweaks for task persistence and a CLI wrapper could let you achieve this (although I don't leverage Ractors, so no true parallelism yet).

Anyway, it does not yet have an "official" release or a stable API, although the code works well and is fully tested, as far as I can tell. I might provide such a wrapper myself in the future, as I can definitely see its utility, but time is short nowadays.


I think Airflow and Celery can do this, although I’ve never tried it myself.


They can, but those tools are much more heavyweight than something like nq


Drake is nice for these kinds of data dependencies.

https://github.com/Factual/drake


There are some other fine suggestions here, but I'll also mention that Luigi should handle this fine as well.


At some point we end up in distributed / concurrent system design. It's a very interesting topic.


It is called GNU Make.



