Of course I have this running, though it's still on an older version (I've been too busy developing this). It has been running for over half a year for my scrapers without a single interruption, even though the machine has the worst specs available. I have tested this on Linux/Unix and Windows at least. I have also gotten messages from various people saying they are using it. Some have said they migrated from Celery or other alternatives as they found Rocketry more suitable for their needs.
And that's true: it's 100% Python and basically there is a main loop that checks starting conditions of tasks (and some other things) and if a task's starting condition is reached, the task is run. Tasks can be executed synchronously by setting execution as "main" or concurrently with async, threading or multiprocessing. Maybe in the future with another interpreter as well. The main loop is left running in background.
So in short, it's a Python loop that's constantly running. It sleeps a defined amount of time after checking a set of tasks to lower resource consumption, but you can also create a task with execution as "main" and do more sophisticated sleeping like "sleep more when CPU usage is X%", or estimate from the tasks' conditions when the next task should start.
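A minimal sketch of that kind of loop, using only the standard library. This is not Rocketry's actual implementation; the Task class, run_loop and the "every other check" condition are hypothetical names made up for illustration.

```python
import time

class Task:
    """A function plus a start condition (a zero-argument callable returning bool)."""
    def __init__(self, func, start_cond):
        self.func = func
        self.start_cond = start_cond
        self.last_run = None  # time.monotonic() value of the most recent run

def run_loop(tasks, cycles, cycle_sleep=0.01):
    """Check every task's start condition once per cycle and run the ones that are due."""
    for _ in range(cycles):              # a real scheduler would loop until shut down
        for task in tasks:
            if task.start_cond():
                task.func()              # "main" style: run synchronously inside the loop
                task.last_run = time.monotonic()
        time.sleep(cycle_sleep)          # sleep between checks to keep resource use low

ran = []
ticks = iter(range(100))
# Hypothetical condition: true on every other check of the loop
every_other = Task(lambda: ran.append("ran"), lambda: next(ticks) % 2 == 0)
run_loop([every_other], cycles=4)
```

Running the task in a thread or process instead would replace the direct task.func() call with a hand-off to a worker, while the loop itself stays the same.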
The system knows which task ran and when by extending logging (from the standard library). There is a logger called "rocketry.task" that should have a handler which can be read as well: redbird.logging.RepoHandler. An in-memory logger is created if nothing is specified. This handler abstracts simple reads and writes to a data store, which can be an SQL database, an in-memory Python list, MongoDB or a CSV file.
The latest success time, starting time etc. are also stored in the tasks themselves, and there is some optimization (which can be turned off) to reduce the reads in some cases. At start-up these attributes are set on each task (if logs are found).
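To illustrate the idea of a log handler that can also be read back, here is a small standard-library sketch. It is an assumption-laden stand-in, not redbird's RepoHandler: the ListHandler class, the "sketch.task" logger name and the task_name/action fields are all hypothetical.

```python
import logging

class ListHandler(logging.Handler):
    """Hypothetical stand-in for a repository-backed handler: stores records in a list."""
    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        # Keep only the fields a scheduler would need to answer "what ran and when"
        self.records.append(
            {"task_name": record.task_name, "action": record.action, "created": record.created}
        )

    def last(self, task_name, action):
        """Return the timestamp of the latest matching record, or None if there is none."""
        stamps = [r["created"] for r in self.records
                  if r["task_name"] == task_name and r["action"] == action]
        return max(stamps) if stamps else None

logger = logging.getLogger("sketch.task")   # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)

# The scheduler writes through ordinary logging calls...
logger.info("task started", extra={"task_name": "scrape", "action": "run"})
logger.info("task finished", extra={"task_name": "scrape", "action": "success"})
# ...and reads the same store back to learn when a task last succeeded.
last_success = handler.last("scrape", "success")
```

Swapping the list for a CSV file or a database would only change what emit and last talk to, which is the kind of abstraction described above.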
This looks brilliant. I like that it’s kept light as a concept - feels like you can just sprinkle it over your existing tasks without getting bogged down in complex configuration.
We have a couple of hand rolled variants of this that run into all the issues this solves. Will definitely look at taking this for a spin.
I was scratching my head, looking at the docs and asking myself "Ok do I need a database? Is the scheduler separated from workers? Does it have a UI?"
Being just a lib is actually quite refreshing compared to complex behemoths like Airflow. I guess you could just use your favorite service runner (systemd, k8s, nomad, none at all...).
Biggest win I see here is the native support for async methods. Celery, the default option for most, does not support it and there are only hacky ways to make async work.
Kudos to the team @ Rocketry.
I haven't looked too closely yet so please excuse me for asking this question but how dynamic can i make the timed events? I have two use-cases in mind:
1. I would like to run a task each day 30 minutes before dawn so i have to compute that time at some point.
2. I run a task normally every hour but if something happens i want to run it 20 minutes after that event.
Given the "before dawn" constraint here I'm going to assume this is somehow related to home/building automation, in which case you should go look at Home Assistant, which has built in support for things like "do this every hour, or 20 minutes after device X was triggered" and indeed "do this 30 minutes before sunrise".
Well it's a mixed bag. The dusk/dawn stuff would just be like nice-to-have. I would like to be reminded 30m before dawn to for example walk the dog while the sun is _just_ still out.
The other for example could be used for handling "social" events. Like some games only happen during evening hours where i check more often than in off-hours but if any game happens whenever i would like to handle it after the regular game time.
Now, none of that would be too hard to implement myself but these task runners pop up every so often and i would like to leverage other people's work.
Home assistant just feels a bit big for this. Do note that i currently do neither of those but i always try to evaluate these use-cases for these task-runners.
Celery supports solar scheduling, though it's not obvious how you would do +/- some time from the solar event. You would probably need to extend their implementation, but I don't think that would be too hard.
Thanks, but computing the time ain't the hard part (modules are available for many languages). I'm more interested in how easy it is to fit this task into the scheduler.
I'm not 100% sure, but take a look at the section on "Manipulating Other Tasks" https://rocketry.readthedocs.io/en/stable/cookbook/controlli... It seems like what you could do is have a task that runs at some regular interval which would compute the 30 minutes before dawn each day and then add the task with the correct start time directly on the rocketry.args.Session: session.create_task(func=before_dawn_task, start_cond=some_condition)
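Whatever scheduler ends up running them, both use cases boil down to a yes/no check on the current time. A hedged sketch of the two checks in plain Python follows; the function names and parameters are hypothetical, and the dawn time is assumed to be computed elsewhere.

```python
from datetime import datetime, timedelta

def before_dawn_due(now, dawn, lead=timedelta(minutes=30), tolerance=timedelta(minutes=1)):
    """True during a short window that starts `lead` before dawn."""
    start = dawn - lead
    return start <= now < start + tolerance

def hourly_or_after_event_due(now, last_run, last_event,
                              interval=timedelta(hours=1), delay=timedelta(minutes=20)):
    """True if `interval` has passed since the last run, or if an event newer
    than the last run happened at least `delay` ago."""
    if last_run is None or now - last_run >= interval:
        return True
    if last_event is not None and last_event > last_run:
        return now - last_event >= delay
    return False

dawn = datetime(2023, 1, 1, 7, 0)            # assumed to come from an astronomy module
noon = datetime(2023, 1, 1, 12, 0)
event = datetime(2023, 1, 1, 12, 10)
```

A check like this could then be wrapped in whatever custom-condition mechanism the scheduler offers.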
For #2, you want an event driven scheduler that can coordinate between events.
We've built this at https://www.inngest.com. You can run functions based off of schedules or events, with things like "when this event happens, run 20 minutes after the event". Or, "run, wait for another thing to happen, then continue".
Event driven schedulers do all the regular scheduling, but with a few benefits:
- It's reactive
- You can fan-out, so one event runs many functions
- You can store all events for debugging, replay, local testing, typing, etc.
We could plumb in an event source for #1 which indicates sunrise and sunset. Heh.
Does this support multiple time zones simultaneously? We have clients that are in different time zones, and (for example) they all want weekly summary emails every Monday morning at 6am, but in their own timezone. So California users get theirs at 6am US/Pacific, New York users get theirs at 6am US/Eastern, and we want to be able to handle this without having to worry about updating crontabs the night of a daylight savings change.
For this reason, we are using fcron[0] instead of regular cron, which allows you to specify the timezone at the start of each crontab line. If this tool supports that sort of scenario, it might be worth switching.
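Independent of any particular scheduler, the per-timezone check itself can be sketched with the standard library's zoneinfo (Python 3.9+). This is only an illustration, not a claim about what Rocketry supports; is_monday_6am is a hypothetical helper, and the IANA names America/Los_Angeles and America/New_York are used in place of the US/Pacific and US/Eastern aliases.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def is_monday_6am(now_utc, tz_name):
    """True if it is Monday between 06:00 and 06:59 in the given timezone."""
    local = now_utc.astimezone(ZoneInfo(tz_name))   # daylight saving time is handled here
    return local.weekday() == 0 and local.hour == 6

winter = datetime(2023, 1, 2, 14, 0, tzinfo=timezone.utc)   # a Monday, standard time
summer = datetime(2023, 7, 3, 13, 0, tzinfo=timezone.utc)   # a Monday, daylight saving time
pacific_winter = is_monday_6am(winter, "America/Los_Angeles")
pacific_summer = is_monday_6am(summer, "America/Los_Angeles")
eastern_winter = is_monday_6am(winter, "America/New_York")
```

Because the conversion is done at check time, nothing has to be edited on the night of a daylight saving change.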
Note that these are my own opinions, which are probably a bit biased. At the moment there are no built-in missed task launchers, but it should be fairly easy to build one by creating a condition that checks the task's run periods and whether the task did not run in the latest interval. This is not hard to do, but the problem is that I haven't had time to document the time period utilities, which are actually pretty extensive. I have plans and some prototypes for a pre-built misfire condition which one can just add to any task using the OR operator.
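A rough sketch of what such a misfire check could look like in plain Python. It does not use Rocketry's time period utilities; missed_latest_period and its parameters are hypothetical, and periods are assumed to repeat at a fixed length from a known boundary.

```python
from datetime import datetime, timedelta

def missed_latest_period(now, last_run, period_start, period_length):
    """True if the most recently completed period passed without a run.

    `period_start` is any known period boundary; periods repeat every `period_length`.
    """
    elapsed = (now - period_start) // period_length      # whole periods since the boundary
    current_start = period_start + elapsed * period_length
    previous_start = current_start - period_length
    # A run in the previous period or in the current one means nothing was missed
    return last_run is None or last_run < previous_start

boundary = datetime(2023, 1, 1)                          # daily periods starting at midnight
now = datetime(2023, 1, 5, 9, 0)
day = timedelta(days=1)
```

OR-ing a check like this with the normal schedule would let a task catch up after downtime.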
There are 3 options for concurrent tasks: async, thread and process. Just change the execution argument of a task. Choose whichever suits you and remember there are pros and cons to each. All of them support parameters etc.
You can do without a database backend but of course then the task logs are not kept in case of restart. Currently you can use any SQL database that SQLAlchemy supports, MongoDB or CSV files, or any other if you wish to extend Red Bird. It uses Red Bird (another project of mine) to abstract the data store: https://red-bird.readthedocs.io/en/latest/. And it just extends the logging library for reading task logs.
Imagine I run an app with three processes A, B, C. A runs perfectly, but B fails and halts the app. If I start the app again, is it going to know that A has been already executed? Or is A going to be executed again?
Yep it does, the main process and thread are responsible for communicating with the logs (see the other comment in which I explained the logging mechanism). If you run a task in a subprocess, the logs are relayed via a queue to the main process and the main process logs them to avoid conflicts.
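The standard library has the same relay pattern built in as QueueHandler and QueueListener. The sketch below uses threads and queue.Queue so it stays self-contained; across processes a multiprocessing.Queue would play the same role. It is an illustration of the pattern, not Rocketry's code, and ListHandler and the logger names are hypothetical.

```python
import logging
import logging.handlers
import queue
import threading

class ListHandler(logging.Handler):
    """Hypothetical single writer to the log store (here just a list)."""
    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())

log_queue = queue.Queue()
store = ListHandler()
# Only the listener (on the "main" side) touches the store, so writers never conflict
listener = logging.handlers.QueueListener(log_queue, store)
listener.start()

def worker(name):
    logger = logging.getLogger(f"sketch.worker.{name}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(logging.handlers.QueueHandler(log_queue))  # relay, do not write directly
    logger.info("%s succeeded", name)

threads = [threading.Thread(target=worker, args=(n,)) for n in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()
listener.stop()   # drains the records still in the queue, then stops
```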
There is also an option to force the status to always be read from the logs. I'll explain later how that is changed, but by default there is some optimization to avoid unnecessary reads from disk, as often there is only one scheduler reading/writing to the log data store.
The logs are stored in memory by default but this can be changed to any data store (if you are willing to extend Red Bird). At the moment CSV, SQL and MongoDB are supported, plus the in-memory store.
... or make a CLI version of your tasks and let the system management daemon ("cron" or in my case systemd timers) handle it.
For clarity make a subfolder called "tasks" or something like that.
Then you get consolidated logging, retries and all kinds of stuff for free in a battle-hardened setup, and a standardized way to look up what is enabled and what is not.
I remember a project that could convert any python script to a python CLI. This was taken to another level by another project that could convert any python CLI to a GUI.
In systemd you can have multiple ExecStarts, which will be run in order (if I remember correctly), and ExecStopPost is brilliant for notifying about problems.
The main benefit of cron is your code stops once it’s done and the process is cleaned up. There isn’t a provided way to do dependencies but that can be done using some shared locks and scheduling. It won’t be completely accurate, which is why solutions like Airflow are used.
Edit: forgot the most obvious way to do dependencies… just execute A & B together as one cron job; still need something like airflow if it gets into a DAG territory
I quickly went through the docs but didn't see a reference to be able to dynamically schedule and unschedule tasks at runtime. I've used APScheduler in the past and it does support this.
The use case is wanting to have let's say a web form where a user can say they want to run a task at XYZ interval and then they can schedule and unschedule it on demand. APScheduler will pick these up without needing to restart anything.
Does your library support that? If not, is that a planned feature?
By "dynamically schedule and unschedule", do you mean that you manually (or using another task) stop a task from running on its specified interval (or condition)? It does support this: tasks have a "disabled" argument that can be set to True, after which the task won't run unless explicitly forced to (by calling the task's run method). The task can be enabled again by setting it back to False. This can be done at runtime from another task using main, thread or async execution.
It could be the docs don't mention this. I'll need to check and add it there in case it's missing.
Dynamic as in you don't need to predefine the task in a config file or decorator before you start your server.
This way you can load and unload tasks at runtime based on user input which you can optionally and independently save in your own database.
Like imagine a user wanting to control when a backup happens. You can ask them to fill out a form on your site to say "ok run this every day at 4am" and that would spawn a new job that executes at that interval and the user can also delete that and the job would be removed. There might be 100 different users each with their own individual backup jobs that are running or not running.
Sorry, I'm in quite a hurry (so sorry for the language and lack of elaboration).
You can create tasks dynamically and you can create them after starting the scheduler. You can use app.session.create_task and pass it "func" (a Python function), or path and func_name if you wish to lazily load the task function (imported only when executing the task). You can also pass a command to this method.
And you can create a task that runs on startup (on_startup=True) and create your other tasks using this task. Use main, async or thread as execution. Then you can create other metatasks that create/modify/delete the tasks at runtime with any logic you want. For example, sync them with a database.
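The "sync them with a database" idea can be sketched without any scheduler at all. Everything below is hypothetical scaffolding (TaskRegistry stands in for the scheduler's session, and the dict stands in for rows from a database); it only shows the shape of a metatask body.

```python
class TaskRegistry:
    """Hypothetical stand-in for a scheduler session that holds the live tasks."""
    def __init__(self):
        self.tasks = {}   # name -> {"func": callable, "disabled": bool}

    def create(self, name, func):
        self.tasks[name] = {"func": func, "disabled": False}

    def remove(self, name):
        del self.tasks[name]

def sync_tasks(registry, wanted, make_func):
    """Metatask body: make the live tasks match `wanted` (e.g. rows read from a database)."""
    for name in set(registry.tasks) - set(wanted):
        registry.remove(name)                           # the user deleted the job
    for name in set(wanted) - set(registry.tasks):
        registry.create(name, make_func(wanted[name]))  # the user added a job

def make_backup(schedule):
    # Hypothetical job factory; a real one would perform the backup
    return lambda: f"backup scheduled {schedule}"

registry = TaskRegistry()
sync_tasks(registry, {"backup-alice": "daily 04:00", "backup-bob": "daily 05:00"}, make_backup)
first = sorted(registry.tasks)
sync_tasks(registry, {"backup-alice": "daily 04:00"}, make_backup)   # bob removed his job
second = sorted(registry.tasks)
```

Run on an interval, a function like sync_tasks picks up jobs that users add or delete through a web form without a restart.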
I'm planning on doing a proper demo about this at some point.
There is a repository mechanism to store the logs. The task logger is simply an extension of the logging library. It seems my docs are slightly off on setting up the CSV repo, but you can just add a RepoHandler (from redbird) to the logger called rocketry.task. At the moment there are MemoryRepo, CsvFileRepo, SQLRepo and MongoRepo.
And there are methods in the session to shut down or restart the scheduler in various ways. There is also a shut condition to end the scheduling when a condition is reached.
I don't really use schedulers for work and have never really worked with them. So this may sound trivial but do i have to keep a terminal open and the script running for this to work? Or it works in the background like cron jobs? If i have to keep a terminal alive for it, what is a scheduler's advantage over using a good old fashioned loop with a sleep or time on the function call?
Putting it mildly, this is nothing more than a sophisticated Python while loop. And it's not as performance friendly as Cron because Rocketry runs on Python. You need to be able to leave a Python program running in order to use Rocketry. As bad as that sounds, it's not really a problem with modern machines though. I have run this on a Raspberry Pi and on a machine with even poorer specs.
However, this has a lot of features that Cron doesn't and which are not obvious to create yourself, like task dependencies (such as "run this after that has succeeded or this has succeeded"), error management, integrating with APIs, parametrizing etc. Also, if you need to run tasks concurrently or in parallel, you will be facing a lot of odd errors due to race conditions if you try to do it yourself in a loop. I have even found a bug in Python's time/datetime modules while developing Rocketry. It sounds easy, but I advise you not to go down the same rabbit hole as I did. Please don't, it's not good for mental health.
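To give a feel for how dependencies like "run this after that has succeeded or this has succeeded" can be expressed, here is a toy condition class with AND, OR and NOT operators. It is a sketch of the general idea only, not Rocketry's condition classes; Cond, succeeded and the status dict are hypothetical.

```python
class Cond:
    """A lazily evaluated true/false check that can be combined with &, | and ~."""
    def __init__(self, check):
        self.check = check

    def __call__(self):
        return bool(self.check())

    def __and__(self, other):
        return Cond(lambda: self() and other())

    def __or__(self, other):
        return Cond(lambda: self() or other())

    def __invert__(self):
        return Cond(lambda: not self())

# Hypothetical record of the latest outcome of each task
status = {"extract": "success", "fallback": "fail"}

def succeeded(name):
    return Cond(lambda: status.get(name) == "success")

run_report = succeeded("extract") | succeeded("fallback")     # after either has succeeded
run_publish = succeeded("extract") & ~succeeded("fallback")   # extract ok, fallback not used
report_due = run_report()
publish_due = run_publish()
```

Because the checks are evaluated only when called, the same condition object gives a fresh answer each time the scheduler loop asks.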
Of course if you need something very simple, go ahead and do it with a simple loop. Rocketry, however, makes both easy and complex problems easy, so it's still a good candidate: in case you realize your problem was more complex than you thought, it possibly has the answer or an obvious way to implement it.
Compared to similar alternatives like Celery or Airflow, (I think) it is much easier to set up, and more complex scheduling problems are much easier with Rocketry than with them. Of course if you are a data engineer, I suggest using Airflow as that's the industry standard.
To tackle one bit of GP's question that might not be clear, you would not need to leave the terminal open. From the terminal, you can run the app and detach its process from the terminal, thus running it in the background.
Not really, that's just the most basic way to think of scheduling. The usual way to avoid this in a big loop is a notifier thread that looks at when it next needs to wake and then sleeps for an interval just shorter than that time.
Due to clock errors and accounting for thread wake wonkiness IME it's usually a good idea to have this "loop" fire in a bit of a window around when the event needs to happen (say +/- 25ms, YMMV) and then trigger the event only at the specified time. After triggering the event repeat sleeping until the next scheduled event.
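A small sketch of that wake-up logic, with the event queue kept as a heap. The names and the 25 ms margin are illustrative assumptions; a real notifier thread would call these in a loop around time.sleep.

```python
import heapq

def seconds_until_next(events, now, early=0.025):
    """How long to sleep so the thread wakes slightly before the next event is due.

    `events` is a heap of (due_time, name) tuples; returns None if nothing is scheduled.
    """
    if not events:
        return None
    due, _ = events[0]
    return max(0.0, due - now - early)

def pop_due(events, now):
    """Remove and return the names of all events whose due time has been reached."""
    due = []
    while events and events[0][0] <= now:
        due.append(heapq.heappop(events)[1])
    return due

events = []
heapq.heappush(events, (10.0, "a"))
heapq.heappush(events, (5.0, "b"))
first_sleep = seconds_until_next(events, now=0.0)   # wake about 25 ms before "b"
too_early = pop_due(events, now=4.99)               # woke inside the window, not yet due
fired = pop_due(events, now=5.0)                    # trigger only at the specified time
```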
Since we are here, does anyone have an idea about how to schedule 1000 calls per day?
Background task:
I am building a service to send Twitter messages daily. The Twitter API limit is that I can't send more than 1000 messages per day, so I have to limit the calls in the backend.
The closest solution I found is to schedule a Celery beat background task every 3 minutes and call the Twitter API 2 times per task, so it can send up to 960 messages in a 24-hour window.
So I am still looking for a solution to make this happen. Suggestions welcome.
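For reference, spacing 1000 calls evenly over a day means one call every 86400 / 1000 = 86.4 seconds. An alternative that does not depend on the task interval is a sliding-window limiter that refuses a call once 1000 have been made in the last 24 hours. The sketch below is one possible approach with hypothetical names, not something taken from Celery or Rocketry.

```python
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `max_calls` within any window of `window_seconds`."""
    def __init__(self, max_calls=1000, window_seconds=24 * 60 * 60):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._stamps = deque()          # timestamps of the calls still inside the window

    def try_acquire(self, now):
        """Return True and record the call if it fits the limit, else return False."""
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()      # forget calls that have left the window
        if len(self._stamps) < self.max_calls:
            self._stamps.append(now)
            return True
        return False

# Small numbers so the behaviour is easy to follow: 3 calls per 10 seconds
limiter = SlidingWindowLimiter(max_calls=3, window_seconds=10)
results = [limiter.try_acquire(t) for t in (0, 1, 2, 3, 10, 10.5)]
```

In a real service `now` would come from time.time(), and the timestamps would need to be persisted if the limit has to survive restarts.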
Yep, exactly. Rocketry works on conditions which are either true or false thus you need to give it a time range.
Points in time do not actually make much sense in terms of scheduling, as nothing can be run exactly at a specified point. There should always be some buffer of tolerance. I think Cron has a tolerance of a minute or so, and in Rocketry the tolerance is made obvious and completely customizable.
For those interested in more about those two types of time conditions: "time_of_..." are conditions that check whether the current time is in the specified range. The "secondly", "minutely", "hourly", "daily" etc. also check that the current time is as specified, but in addition that the task has not yet run in the interval. By combining the two you can create quite complex scheduling strategies easily.
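The difference between the two kinds of condition can be sketched in plain Python. These functions only mimic the behaviour described above and are not Rocketry's implementation; the names in_time_of_day and daily_between are hypothetical.

```python
from datetime import datetime, time

def in_time_of_day(now, start, end):
    """'time_of_...' style: is the current time inside the range?"""
    return start <= now.time() < end

def daily_between(now, last_run, start, end):
    """'daily' style: inside the range AND the task has not yet run in today's range."""
    if not in_time_of_day(now, start, end):
        return False
    window_start = datetime.combine(now.date(), start)
    return last_run is None or last_run < window_start

start, end = time(10, 0), time(12, 0)
now = datetime(2023, 1, 2, 10, 30)
in_range = in_time_of_day(now, start, end)
due_first_time = daily_between(now, datetime(2023, 1, 1, 10, 15), start, end)   # ran yesterday
due_again = daily_between(now, datetime(2023, 1, 2, 10, 5), start, end)         # ran today
```

The first check stays true for the whole range, so a task using it alone would run on every loop cycle in that range, while the second fires once per day.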
It's not spelled out, but it's apparent you just run the python file containing the app definition and leave it running in the background?
Looks very clean and pythonic.