Show HN: Rocketry – Statement-based scheduling framework for Python (github.com/miksus)
168 points by Miksus on Sept 22, 2022 | 67 comments



Anyone using this?

It's not spelled out, but it's apparent you just run the python file containing the app definition and leave it running in the background?

Looks very clean and pythonic.


Of course I have this running, though it's still on an older version (I've been too busy developing this). It has been running my scrapers for over half a year without a single interruption, even though the machine has the worst specs available. I have tested this on Linux/Unix and Windows at least. I have also gotten messages from various people saying they are using it. Some have said they migrated from Celery or other alternatives as they found Rocketry more suitable for their needs.

And that's true: it's 100% Python and basically there is a main loop that checks starting conditions of tasks (and some other things) and if a task's starting condition is reached, the task is run. Tasks can be executed synchronously by setting execution as "main" or concurrently with async, threading or multiprocessing. Maybe in the future with another interpreter as well. The main loop is left running in background.

So in short, it's a Python process that constantly runs a loop. It sleeps a defined amount of time after checking a set of tasks to lower the resource consumption, but you can also create a task with execution as "main" and do more sophisticated sleeping like "sleep more when CPU usage is X%" or estimate the time when the next task should start from the tasks' conditions.

And thanks for the positive comment!
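
To make the main loop described above concrete, here is a minimal standard-library sketch. It is not Rocketry's actual implementation: the names `Task`, `run_pending` and `run_forever`, and the `max_cycles` parameter, are hypothetical and exist only for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Task:
    """A function plus a start condition that returns True or False."""
    name: str
    func: Callable[[], None]
    start_cond: Callable[[], bool]
    run_count: int = 0


def run_pending(tasks: List[Task]) -> List[str]:
    """Check every task once and run those whose condition is true."""
    started = []
    for task in tasks:
        if task.start_cond():
            task.func()  # runs synchronously, comparable to execution "main"
            task.run_count += 1
            started.append(task.name)
    return started


def run_forever(tasks: List[Task], cycle_sleep: float = 0.1,
                max_cycles: Optional[int] = None) -> None:
    """Main loop: check all tasks, sleep a little, repeat."""
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        run_pending(tasks)
        time.sleep(cycle_sleep)  # lowers resource consumption between checks
        cycles += 1
```

A real scheduler would also handle errors, concurrency and shutdown; the sketch only shows the check, run, sleep cycle.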


Hey cool project, congrats!

How does Rocketry save execution state? Like, if it crashes and goes back up again, does it know which tasks were executed and which ones were not?


Thanks a lot, nice to hear!

The system knows which task ran and when by extending logging (from the standard library). There is a logger called "rocketry.task" that should have a handler which can be read as well: redbird.logging.RepoHandler. An in-memory logger is created if nothing is specified. This handler abstracts simple reads and writes to a data store, which can be an SQL database, in-memory Python list, MongoDB or CSV file.

Seems I forgot to implement a method mentioned in the docs but here's an example to specify a task log repo: https://github.com/Miksus/rocketry/issues/108#issuecomment-1...

The latest success time, starting time etc. are also stored in the tasks themselves and there is some optimization (which can be turned off) to reduce the reads in some cases. In the start-up these attributes are set in each task (if logs found).
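
The idea of a logging handler that can be read back can be sketched with the standard library alone. This is only an illustration of the concept, not Red Bird's RepoHandler: the `ListHandler` class, the `demo.task` logger name, and the `task_name` and `action` fields are assumptions made up for this example.

```python
import logging
from typing import Dict, List, Optional


class ListHandler(logging.Handler):
    """Keeps every task log record as a dict in a plain Python list."""

    def __init__(self) -> None:
        super().__init__()
        self.records: List[Dict] = []

    def emit(self, record: logging.LogRecord) -> None:
        # Fields passed via `extra=` become attributes on the record.
        self.records.append({
            "task_name": getattr(record, "task_name", None),
            "action": getattr(record, "action", None),
            "created": record.created,
        })

    def last(self, task_name: str, action: str) -> Optional[float]:
        """Timestamp of the latest matching record, or None if there is none."""
        times = [r["created"] for r in self.records
                 if r["task_name"] == task_name and r["action"] == action]
        return max(times) if times else None


logger = logging.getLogger("demo.task")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)

logger.info("task started", extra={"task_name": "scrape", "action": "run"})
logger.info("task finished", extra={"task_name": "scrape", "action": "success"})
```

On start-up, a scheduler could call something like `handler.last("scrape", "success")` to restore each task's latest success time. Swapping the list for a CSV file or database would make that state survive a restart.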


This looks brilliant. I like that it’s kept light as a concept - feels like you can just sprinkle it over your existing tasks without getting bogged down in complex configuration.

We have a couple of hand rolled variants of this that run into all the issues this solves. Will definitely look at taking this for a spin.


I like the idea, but this should be an api that can be accessed from any language.


If you want an API (or UI), just clone this and modify it as you need: https://github.com/Miksus/rocketry-with-fastapi. I also wrote an article on Medium about how it works with FastAPI: https://itnext.io/scheduler-with-an-api-rocketry-fastapi-a0f...

Rocketry plays quite nicely with FastAPI.


I was scratching my head, looking at the docs and asking myself "Ok do I need a database? Is the scheduler separated from workers? Does it have a UI?"

Being just a lib is actually quite refreshing compared to complex behemoths like Airflow. I guess you could just use your favorite service runner (systemd, k8s, nomad, none at all...).


Biggest win I see here is the native support for async methods. Celery, the default option for most, does not support it and there are only hacky ways to make async work. Kudos to the team @ Rocketry.


I haven't looked too closely yet so please excuse me for asking this question but how dynamic can i make the timed events? I have two use-cases in mind:

1. I would like to run a task each day 30 minutes before dawn so i have to compute that time at some point.

2. I run a task normally every hour but if something happens i want to run it 20 minutes after that event.


Given the "before dawn" constraint here I'm going to assume this is somehow related to home/building automation, in which case you should go look at Home Assistant, which has built in support for things like "do this every hour, or 20 minutes after device X was triggered" and indeed "do this 30 minutes before sunrise".


Well it's a mixed bag. The dusk/dawn stuff would just be like nice-to-have. I would like to be reminded 30m before dawn to for example walk the dog while the sun is _just_ still out. The other for example could be used for handling "social" events. Like some games only happen during evening hours where i check more often than in off-hours but if any game happens whenever i would like to handle it after the regular game time.

Now, nothing of that would be too hard to implement myself but these task runners pop up every so often and i would like to leverage other peoples work. Home assistant just feels a bit big for this. Do note that i currently do neither of those but i always try to evaluate these use-cases for these task-runners.


Celery supports solar scheduling, though it's not obvious how you would do +/- some time from the solar event. You would probably need to extend their implementation, but I don't think that would be too hard.

https://docs.celeryq.dev/en/stable/reference/celery.schedule...

EDIT: I think you just need to provide a nowfun that offsets datetime.now() by the desired timedelta.


https://sunrise-sunset.org/api

I used this myself for my father's chicken coop, it's quite easy to use.


Thanks, but computing the time ain't the hard part (modules are available for many languages). I'm more interested in how easy it is to fit this task into the scheduler.


Well, unless you like running in rain you'd have to feed it weather info too.

On other side it would allow it to be fancy, like rescheduling it earlier if rain is close to the sunset


I'm not 100% sure, but take a look at the section on "Manipulating Other Tasks" https://rocketry.readthedocs.io/en/stable/cookbook/controlli... It seems like what you could do is have a task that runs at some regular interval which would compute the 30 minutes before dawn each day and then add the task with the correct start time directly on the rocketry.args.Session: session.create_task(func=before_dawn_task, start_cond=some_condition)


You can build arbitrary conditions (the example given is `file_exists`) that can run whatever code and only need to return True or False.

If you can write the condition in English, it would seem to me you can build a custom Rocketry condition to suit it.
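
As a sketch of what such True/False conditions could look like for the two use cases asked about above, here are two plain Python functions. They use only the standard library and are not tied to Rocketry's condition API. `dawn_for` is a hypothetical placeholder that returns a fixed time; a real version would ask an astronomy library or API.

```python
from datetime import datetime, timedelta
from typing import Optional


def dawn_for(day: datetime) -> datetime:
    """Placeholder: pretend dawn is always at 06:30 local time."""
    return day.replace(hour=6, minute=30, second=0, microsecond=0)


def is_before_dawn_window(now: datetime,
                          lead: timedelta = timedelta(minutes=30)) -> bool:
    """True from `lead` before dawn until dawn itself."""
    dawn = dawn_for(now)
    return dawn - lead <= now < dawn


def is_due_after_event(now: datetime, event_time: Optional[datetime],
                       delay: timedelta = timedelta(minutes=20)) -> bool:
    """True once `delay` has passed since the event (False if no event yet)."""
    return event_time is not None and now >= event_time + delay
```

Both functions describe a time range rather than a single instant, so a scheduler that polls them periodically can still catch them being true.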


For #2, you want an event driven scheduler that can coordinate between events.

We've built this at https://www.inngest.com. You can run functions based off of schedules or events, with things like "when this event happens, run 20 minutes after the event". Or, "run, wait for another thing to happen, then continue".

Event driven schedulers do all the regular scheduling, but with a few benefits:

- It's reactive

- You can fan-out, so one event runs many functions

- You can store all events for debugging, replay, local testing, typing, etc.

We could plumb in an event source for #1 which indicates sunrise and sunset. Heh.


Nice project and good looking page. Just fyi but the "s" in "background jobs" is a bit cut off for me (Chrome on Pixel 6 Pro) https://ibb.co/42j18fk


Isn't it better to fetch dawn times every day and then set the schedule accordingly?


Does this support multiple time zones simultaneously? We have clients that are in different time zones, and (for example) they all want weekly summary emails every Monday morning at 6am, but in their own timezone. So California users get theirs at 6am US/Pacific, New York users get theirs at 6am US/Eastern, and we want to be able to handle this without having to worry about updating crontabs the night of a daylight savings change.

For this reason, we are using fcron[0] instead of regular cron, which allows you to specify the timezone at the start of each crontab line. If this tool supports that sort of scenario, it might be worth switching.

[0] https://github.com/yo8192/fcron


Interesting. How does this compare to APScheduler?

How does it deal with concurrently scheduled tasks and the possibility of missed tasks?


I wrote my own ideas of how it compares to APScheduler (and other alternatives) here: https://rocketry.readthedocs.io/en/stable/rocketry_vs_altern...

Note that these are my own opinions, which are probably a bit biased. At the moment there are no built-in missed task launchers, but it should be fairly easy to make one by creating a condition that checks the task run periods and whether the task did not run in the latest interval. This is not hard to do, but the problem is that I haven't had time to document the time period utilities, which are actually pretty extensive. I have plans and some prototypes for a pre-built misfire condition which one can just add to any task using the OR operator.

There are 3 options for concurrent tasks: async, thread and process. Just change the execution argument of a task. Choose which suits you and remember there are pros and cons in each. All of them support parameters etc.
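
A misfire check of the kind described above could look roughly like the following. This is a standard-library sketch under stated assumptions, not one of Rocketry's time period utilities: the function name `missed_latest_interval` and the fixed-length periods counted from an `anchor` time are invented for this example.

```python
from datetime import datetime, timedelta
from typing import Optional


def missed_latest_interval(now: datetime, last_run: Optional[datetime],
                           period: timedelta, anchor: datetime) -> bool:
    """True if the task has not run since the start of the most recently
    completed period.

    Periods are consecutive windows of length `period` counted from `anchor`.
    """
    completed = (now - anchor) // period  # number of fully completed periods
    if completed < 1:
        return False  # no period has finished yet, so nothing can be missed
    prev_start = anchor + (completed - 1) * period
    return last_run is None or last_run < prev_start
```

Combined with a regular condition through a logical OR, the task would run either on schedule or as soon as the scheduler notices that the previous window passed without a run.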


What backends do you support? Do I need Redis, or can it work with PostgreSQL? Can't find this info in the readme.


You can do without a database backend but of course then the task logs are not kept in case of restart. Currently you can use any SQL database that SQLAlchemy supports, MongoDB or CSV files, or any other if you wish to extend Red Bird. It uses Red Bird (another project of mine) to abstract the data store: https://red-bird.readthedocs.io/en/latest/. And it just extends the logging library for reading task logs.

It seems I did not yet implement the set_repo method even though the docs talk about this but here's one way to set a CSV repo, for example: https://github.com/Miksus/rocketry/issues/108#issuecomment-1...


Does it keep track of the status of each task?

Imagine I run an app with three processes A, B, C. A runs perfectly, but B fails and halts the app. If I start the app again, is it going to know that A has been already executed? Or is A going to be executed again?


Yep it does, the main process and thread are responsible for communicating with the logs (see the other comment in which I explained the logging mechanism). If you run a task in a subprocess, the logs are relayed via a queue to the main process and the main process logs them to avoid conflicts.

There is also an option to force reading the status always from the logs. I'll explain later how that's changed, but by default there is some optimization to avoid unnecessary reads from disk, as often there is only one scheduler reading/writing to the log data store.

The logs are stored in memory by default but this can be changed to any data store (if you are willing to expand Red Bird). At the moment CSV, SQL and MongoDB are supported + the in-memory.
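
The queue relay described above resembles a pattern that the standard library supports directly with `QueueHandler` and `QueueListener`. The sketch below is not Rocketry's code. It uses a thread-safe `queue.Queue` to stay small, whereas a multi-process setup would pass a `multiprocessing.Queue` to the workers instead. The `CollectHandler` class and the `demo.worker` logger name are hypothetical.

```python
import logging
import logging.handlers
import queue

log_queue: "queue.Queue" = queue.Queue()
collected = []


class CollectHandler(logging.Handler):
    """Stands in for the single writer that owns the log data store."""

    def emit(self, record: logging.LogRecord) -> None:
        collected.append(record.getMessage())


# Main side: one listener drains the queue and writes the records.
listener = logging.handlers.QueueListener(log_queue, CollectHandler())
listener.start()

# Worker side: the logger only puts records on the queue.
worker_logger = logging.getLogger("demo.worker")
worker_logger.setLevel(logging.INFO)
worker_logger.propagate = False
worker_logger.addHandler(logging.handlers.QueueHandler(log_queue))

worker_logger.info("task A succeeded")
worker_logger.info("task B failed")

listener.stop()  # processes everything still queued, then stops the thread
```

Because only the listener touches the store, workers never write to it concurrently, which is the conflict the relay is meant to avoid.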


... or make a CLI version of your tasks and let the system management daemon ("cron" or in my case systemd timers) handle it.

For clarity make a subfolder called "tasks" or something like that.

Then you get consolidated logging, retries and all kinds of stuff for free in a battle-hardened setup and a standardized way to look up what is enabled and what is not.


I remember a project that could convert any python script to a python CLI. This was taken to another level by another project that could convert any python CLI to a GUI.

Can anyone help me with this?


Fire can basically do the first step (object -> CLI):

https://github.com/google/python-fire

Gooey can do (CLI -> GUI):

https://github.com/chriskiehl/Gooey


Thank you very much!

I hope I don't lose them this time, even though I can see that I have starred one of the repos before.


I didn’t know cron could schedule jobs dependent on other jobs. Isn’t that what this brings to the table?

“When job a and job b are done, run job c” kind of things.


In systemd you can have multiple ExecStarts, which will be run in order (if I remember correctly), and ExecStopPost is brilliant for notifying problems.


The main benefit of cron is your code stops once it’s done, and the process is cleaned up. There isn’t a provided way to do dependencies but that can be done using some shared locks and scheduling. It won’t be completely accurate, which is why solutions like Airflow are used.

Edit: forgot the most obvious way to do dependencies… just execute A & B together as one cron job; still need something like airflow if it gets into a DAG territory


I quickly went through the docs but didn't see a reference to be able to dynamically schedule and unschedule tasks at runtime. I've used APScheduler in the past and it does support this.

The use case is wanting to have let's say a web form where a user can say they want to run a task at XYZ interval and then they can schedule and unschedule it on demand. APScheduler will pick these up without needing to restart anything.

Does your library support that? If not, is that a planned feature?


By "dynamically schedule and unschedule", do you mean that you sort of manually (or using another task) stop a task from running at its specified interval (or condition)? It does support this: there is an argument "disabled" in the tasks that can be set to True, and then the task won't be run unless explicitly forced to run (by calling the run method of the task). The task can be enabled again by setting it back to False. This can be done at runtime in another task using main, thread or async execution.

It could be the docs don't mention this. I'll need to check and add it there in case it's missing.

Or did I misunderstand?


Dynamic as in you don't need to predefine the task in a config file or decorator before you start your server.

This way you can load and unload tasks at runtime based on user input which you can optionally and independently save in your own database.

Like imagine a user wanting to control when a backup happens. You can ask them to fill out a form on your site to say "ok run this every day at 4am" and that would spawn a new job that executes at that interval and the user can also delete that and the job would be removed. There might be 100 different users each with their own individual backup jobs that are running or not running.


Sorry, I'm in quite a hurry (so sorry for the language and lack of elaboration).

You can create tasks dynamically and you can create them after starting the scheduler. You can use app.session.create_task and pass "func" (Python function) for it or path and func_name if you wish to lazily load the task function (imported only when executing the task). You can also pass a command for this method as well.

And you can create a task that runs on startup (on_startup=True) and create your other tasks using this task. Use main, async or thread as execution. Then you can create other metatasks that create/modify/delete the tasks at runtime with any logic you want. For example, sync them with a database.

I'm planning on doing a proper demo about this at some point.


Does it have storage options in terms of status e.g. for restarts?


There is a repository mechanism to store the logs. The task logger is simply an extension of logging library. Seems my docs are slightly off on setting up the CSV repo but you can just add a RepoHandler (from redbird) to the logger called rocketry.task. At the moment there are MemoryRepo, CsvFileRepo, SQLRepo and MongoRepo.

You can find the finer details of the repo mechanics in Red Bird's docs: https://red-bird.readthedocs.io/.

And there are methods in the session to shut down or restart the scheduler in various ways. There is also a shut condition to end the scheduling when a condition is reached.


I don't really use schedulers for work and have never really worked with them. So this may sound trivial but do i have to keep a terminal open and the script running for this to work? Or it works in the background like cron jobs? If i have to keep a terminal alive for it, what is a scheduler's advantage over using a good old fashioned loop with a sleep or time on the function call?


Putting it mildly, this is nothing more than a sophisticated Python while loop. And it's not as performance friendly as Cron because Rocketry runs on Python. You need to be able to leave a Python program running in order to use Rocketry. As bad as that sounds, it's not really a problem with modern machines though. I have run this on a Raspberry Pi and on a machine with even poorer specs.

However, this has a lot of features that Cron doesn't and which are not obvious to create yourself, like task dependencies (like "run this after that has succeeded or this has succeeded"), error management, integrating with APIs, parametrizing etc. Also, if you need to run concurrent/parallel tasks, you'll be facing a lot of odd errors due to race conditions if you try to do it yourself in a loop. I have even found a bug in Python's time/datetime modules while developing Rocketry. It sounds easy but I advise you not to go down the same rabbit hole as I did. Please don't, it's not good for mental health.

Of course if you need something very simple, go ahead and do it with a simple loop. Rocketry however makes both easy and complex problems easy, so it's still a good candidate: in case you realize your problem was more complex than you thought, it possibly has the answer or an obvious way to implement it.

Compared to similar alternatives like Celery or Airflow, (I think) it is much easier to set up, and more complex scheduling problems are much easier with Rocketry than with them. Of course if you are a data engineer, I suggest using Airflow as that's the industry standard.


To tackle one bit of GP's question that might not be clear, you would not need to leave the terminal open. From the terminal, you can run the app and detach its process from the terminal, thus running it in the background.

https://superuser.com/questions/178587/how-do-i-detach-a-pro...


If you want to run something once a month or a year, I imagine this requires the script to be running in a main loop that whole time?


Not really, that's just the most basic way to think of scheduling. The usual way to avoid this in a big loop is a notifier thread that looks at when it next needs to wake and then sleeps for an interval just shorter than that time.

Due to clock errors and accounting for thread wake wonkiness IME it's usually a good idea to have this "loop" fire in a bit of a window around when the event needs to happen (say +/- 25ms, YMMV) and then trigger the event only at the specified time. After triggering the event repeat sleeping until the next scheduled event.
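
The "sleep until just before the next event" approach can be sketched as a small helper that computes how long to sleep. The function name `seconds_until_wake` and the 25 ms default for `early` (taken from the figure mentioned above) are illustrative assumptions, not part of any particular scheduler.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def seconds_until_wake(now: datetime, scheduled: Iterable[datetime],
                       early: timedelta = timedelta(milliseconds=25)
                       ) -> Optional[float]:
    """Seconds to sleep so that we wake `early` before the next scheduled time.

    Returns None when nothing is scheduled in the future.
    """
    upcoming = [t for t in scheduled if t > now]
    if not upcoming:
        return None
    wake_at = min(upcoming) - early
    # Never return a negative sleep if the event is closer than `early`.
    return max(0.0, (wake_at - now).total_seconds())
```

After waking, the caller would wait out the remaining few milliseconds, fire the event at the specified time, and then call the helper again for the next one.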


Being able to configure simply the execution (async, process, thread) is pure gold, I am tempted to use it for this alone.


I'm tempted to use it for the name alone, I don't even care what it does.


Neat!

Looks very clean. Is parallelism/multiprocessing (or even distribution over multiple workers) a thing you plan on making possible?


Looks like parallelism is already present based on the front page samples, including async.


async is just concurrency, not parallelism, but I see now that multiprocessing seems to also be supported: https://rocketry.readthedocs.io/en/stable/handbooks/task/exe...


Looks nice and clean!

Can it support static type checking like mypy? Not only type level like str, but the time format level like “hh:mm” and “n minutes” too.

I hate it when I misspell “3 minuts” and realize it only after executing it. It would be much nicer if my editor told me.



> NOTE: Red Engine has been renamed as Rocketry


I’d love to have a Cronitor integration for this.

OP, if you make one we will feature you in our blog and email newsletter (goes to around 25k devs monthly)


Since we are here does anyone have idea about how to schedule 1000 calls per day?

Background task:

I am building a service to send twitter messages daily. The limit of Twitter API is I can't send more than 1000 messages so I have to limit the call in the backend.

The closest solution I found is to schedule a Celery beat background task every 3 minutes and call the Twitter API 2 times per task, so it can send up to 960 messages in a 24-hour window.

So I am still finding the solution to make this happen. Suggestion welcome.


Why not just schedule the task for every 86 seconds and count how many times you've done it so you stop at 1,000?
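
The arithmetic behind this suggestion: a day has 86,400 seconds, so 1,000 evenly spaced calls are 86.4 seconds apart. Rounding down to 86 seconds yields slightly more than 1,000 slots per day, which is why the counter is needed. The sketch below simulates one day; the constant and function names are made up for illustration, and it assumes the limit resets on a fixed daily boundary.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400
DAILY_LIMIT = 1000
INTERVAL = 86  # seconds between scheduled runs


def calls_in_one_day(interval: int = INTERVAL, limit: int = DAILY_LIMIT) -> int:
    """Simulate one day of runs every `interval` seconds, capped at `limit`."""
    sent = 0
    for _second in range(0, SECONDS_PER_DAY, interval):
        if sent >= limit:
            break  # the counter stops us once the daily quota is used up
        sent += 1
    return sent


slots_without_cap = len(range(0, SECONDS_PER_DAY, INTERVAL))
```

For comparison, running every 3 minutes with 2 calls per run gives 480 runs and 960 calls per day, which matches the figure in the question.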


Add some logging and connectors and you have an airflow replacement with the pipelining or am I missing something?


Hmm, this definition reads a bit wrong:

  @app.task((weekly.on("Mon") | weekly.on("Sat")) & time_of_day.after("10:00"))
  def do_twice_a_week_after_ten():
      ...

Expression reads Monday or Saturday, yet its meaning is Monday and Saturday according to function name.


if you take it as a boolean test it makes sense. it will be true for any timepoint that is after ten and is either on monday or saturday


Yep, exactly. Rocketry works on conditions which are either true or false thus you need to give it a time range.

Points in time do not actually make much sense in terms of scheduling, as nothing can be run exactly at a specified point. There should always be some buffer of tolerance. I think Cron has a tolerance of a minute or so, and in Rocketry the tolerance is made obvious and completely customizable.

For those interested in more detail about those two types of time conditions: the "time_of_..." conditions check whether the current time is in the specified range. The "secondly", "minutely", "hourly", "daily" etc. conditions also check that the current time is as specified, but in addition that the task has not yet run in the interval. By combining the two you can create quite complex scheduling strategies easily.


It would simplify things a little if the weekly.on accepted multiple days, so you could reduce it to one call:

    weekly.on("Mon", "Sat")


That's a great idea actually. Thanks, I think that should be pretty easy to do!


`|` is a bitwise-or in Python, not a boolean-or.


That's right, but it doesn't change how the expression is read.
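
Some background on why `|` and `&` show up here: Python's `or` and `and` keywords cannot be overloaded and evaluate truthiness immediately, while `|`, `&` and `~` can be redefined per class through `__or__`, `__and__` and `__invert__`. A library can use that to build a combined condition object that is evaluated later. The `Cond` class below is a hypothetical, minimal sketch of this pattern, not Rocketry's own condition classes.

```python
from typing import Callable


class Cond:
    """Wraps a zero-argument function that returns True or False."""

    def __init__(self, check: Callable[[], bool]) -> None:
        self.check = check

    def __or__(self, other: "Cond") -> "Cond":
        # Builds a new condition; nothing is evaluated yet.
        return Cond(lambda: self.check() or other.check())

    def __and__(self, other: "Cond") -> "Cond":
        return Cond(lambda: self.check() and other.check())

    def __invert__(self) -> "Cond":
        return Cond(lambda: not self.check())

    def __bool__(self) -> bool:
        # Evaluation happens only when the condition is actually tested.
        return bool(self.check())


# A fake clock so the example is deterministic.
state = {"weekday": "Mon", "hour": 11}

on_mon = Cond(lambda: state["weekday"] == "Mon")
on_sat = Cond(lambda: state["weekday"] == "Sat")
after_ten = Cond(lambda: state["hour"] >= 10)

twice_a_week_after_ten = (on_mon | on_sat) & after_ten
```

Calling `bool(twice_a_week_after_ten)` re-evaluates the whole expression against the current state, so the same object can be checked again on every scheduler cycle.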


if it provides monitoring ui (e.g. success/failure status for last N runs), then i'd seriously give it a try.


How does it compare to prefect?



