Instead of enqueueing a job you’d just write to an SQS queue and have a lambda as your “job handler” using the queue as an event source.
Boom, no need for Active Job.
Welcome to Serverless.
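A minimal sketch of that idea with the Ruby AWS SDK v3 (the QUEUE_URL variable, payload shape, and job name are assumptions, not anyone's production code):

```ruby
# producer.rb -- enqueue a "job" by writing a message to SQS
require 'aws-sdk-sqs'
require 'json'

sqs = Aws::SQS::Client.new
sqs.send_message(
  queue_url: ENV['QUEUE_URL'],
  message_body: JSON.generate(job: 'send_welcome_email', user_id: 42)
)

# handler.rb -- the Lambda "job handler", with the queue as event source
def handler(event:, context:)
  event['Records'].each do |record|
    job = JSON.parse(record['body'])
    # ...do the work Active Job would have done...
    puts "ran #{job['job']} for user #{job['user_id']}"
  end
end
```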
There’s going to be a lot of people trying to do things the “rails way” on serverless and it’s going to be a little confusing for a while. Many will write it off, some will create innovative new gems. And of course the Ruby community will come up with the best abstraction for SAM templates and deployment, because we like nice tools.
No need to use SQS here, assuming you don't need strict ordering behavior, since SNS can use Lambda functions as notification targets. The nice thing about SNS is that it can automatically fan out to multiple Lambda functions in parallel.
And then what if your Lambda function fails because a service dependency is down? The answer is, it retries twice and then you can have it go to a deadletter queue. If you want to have any configuration control over your retry or deadletter behavior, SQS is really handy.
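That configuration control is just a queue attribute. A sketch with the Ruby SDK (queue URL and DLQ ARN are placeholders):

```ruby
# Sketch: dead-letter messages after three failed receives via a redrive policy.
require 'aws-sdk-sqs'
require 'json'

sqs = Aws::SQS::Client.new
sqs.set_queue_attributes(
  queue_url: ENV['QUEUE_URL'],
  attributes: {
    'RedrivePolicy' => JSON.generate(
      deadLetterTargetArn: 'arn:aws:sqs:us-east-1:111111111111:jobs-dlq',
      maxReceiveCount: 3
    )
  }
)
```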
It's also a useful pattern to, instead of configuring the SQS event itself as your Lambda trigger, have a scheduled trigger and then perform health checks on your service dependencies before pulling messages from the queue. This technique saves you wasted retries and even gives you more fine-grained control over deleting messages (the SQS mechanism for saying, "this is done and you don't need to retry it after the timeout"). The SQS trigger, in contrast, deletes the entire batch of messages if the Lambda exits cleanly and deletes none of them if it doesn't, which is a bit messy.
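Roughly, the scheduled-trigger variant looks like this (dependency_healthy? and process are hypothetical placeholders for this sketch):

```ruby
# Sketch: scheduled Lambda that checks dependencies before polling SQS,
# then deletes messages one at a time as each succeeds.
require 'aws-sdk-sqs'

def handler(event:, context:)
  return unless dependency_healthy? # skip the poll if a dependency is down

  sqs = Aws::SQS::Client.new
  resp = sqs.receive_message(queue_url: ENV['QUEUE_URL'], max_number_of_messages: 10)
  resp.messages.each do |msg|
    process(msg.body)
    # Delete each message individually as it succeeds, rather than
    # letting a whole batch succeed or fail together.
    sqs.delete_message(queue_url: ENV['QUEUE_URL'], receipt_handle: msg.receipt_handle)
  end
end
```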
Also, you can have an SQS queue subscribe to multiple SNS topics, which means you can set up separate SNS topics for backfill or other use cases. This is especially useful in cross-account use cases, where often your online events are coming from an SNS from a different account, but you want to be able to inject your own testing and backfill events.
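Wiring that up is a couple of subscribe calls, roughly like this sketch (all ARNs are placeholders; note the online topic living in a different account):

```ruby
# Sketch: subscribe one SQS queue to an online topic from another account
# plus your own backfill topic.
require 'aws-sdk-sns'

sns = Aws::SNS::Client.new
queue_arn = 'arn:aws:sqs:us-east-1:111111111111:jobs-queue'

%w[
  arn:aws:sns:us-east-1:222222222222:online-events
  arn:aws:sns:us-east-1:111111111111:backfill-events
].each do |topic_arn|
  sns.subscribe(topic_arn: topic_arn, protocol: 'sqs', endpoint: queue_arn)
end
# Note: the queue's access policy must also allow sns.amazonaws.com to SendMessage.
```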
I usually don’t let the lambda SQS back end control message deletion for just that reason. I delete individual messages as I process them and throw an exception if any fail.
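Concretely, that pattern looks something like this inside an SQS-triggered handler (process is a placeholder; the queue URL is derived from the event source ARN):

```ruby
# Sketch: per-message deletion inside an SQS-triggered Lambda.
require 'aws-sdk-sqs'

def handler(event:, context:)
  sqs = Aws::SQS::Client.new
  failed = 0
  event['Records'].each do |record|
    queue_name = record['eventSourceARN'].split(':').last
    queue_url  = sqs.get_queue_url(queue_name: queue_name).queue_url
    begin
      process(record['body'])
      sqs.delete_message(queue_url: queue_url, receipt_handle: record['receiptHandle'])
    rescue StandardError
      failed += 1
    end
  end
  # Raising keeps Lambda from acknowledging the batch, so the failed
  # (undeleted) messages become visible again after the timeout.
  raise "#{failed} message(s) failed" if failed > 0
end
```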
If a specific message fails to process multiple times, most often it’s because that particular message is malformed or invalid, or exposes an unhandled edge case in the processing logic, and will never successfully process, at which point continued attempts are both fruitless and wasteful.
It’s better to retry a finite number of times and stash any messages that fail to process offline for further investigation. If there is an issue that can be resolved, it’s possible to feed deadletter messages back in for processing once the issue is addressed.
Automatically feeding your deadletter events back into your normal processing flow not only hides these issues but forces you to pay real money to fail to process them in an infinite loop that is hard, if not impossible, to diagnose and detect.
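Once the root cause is fixed, the manual redrive can be as simple as this sketch (both queue URLs are placeholders): move each message back to the main queue, deleting it from the DLQ only after the send succeeds.

```ruby
# Sketch: manually redrive deadletter messages after fixing the root cause.
require 'aws-sdk-sqs'

sqs = Aws::SQS::Client.new
loop do
  resp = sqs.receive_message(queue_url: ENV['DLQ_URL'], max_number_of_messages: 10)
  break if resp.messages.empty?

  resp.messages.each do |msg|
    sqs.send_message(queue_url: ENV['QUEUE_URL'], message_body: msg.body)
    sqs.delete_message(queue_url: ENV['DLQ_URL'], receipt_handle: msg.receipt_handle)
  end
end
```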
Using SQS as an event source will also fan out, with the advantage of being able to process up to 10 messages at once, and dead letter queues for SQS make a lot more sense than Lambda DLQs, etc.
But I prefer to use SNS for producers and SQS for consumers, subscribing the SQS queue to a topic. With SNS+SQS you get a lot more flexibility.
Be careful if there is a possibility for a job to fail and need a retry. SQS visibility timeouts and automatic dead-lettering can give you this basically for free. I don't think SNS will.
1. You can have one SNS topic, send messages to it with attributes, and subscribe different targets based on the attribute (see the filter-policy sketch after this list).
2. You can have one SNS message go to different targets. There may be more than one consumer interested in an event.
3. “Priority” processing. I can have the same lambda function subscribed to both an SQS queue and an SNS topic. We do some things in batch where the message will go to SNS -> SQS -> lambda. But messages with a priority of “high” will go directly from SNS -> Lambda, while still also going through the queue. High priority messages are one-offs triggered by a user action.
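A rough sketch of the attribute-based routing in points 1 and 3, using an SNS subscription filter policy (the topic/subscription ARNs and the priority attribute are illustrative, not from the original posts):

```ruby
# Sketch: publish with a priority attribute and filter the direct
# SNS -> Lambda subscription on it.
require 'aws-sdk-sns'
require 'json'

sns = Aws::SNS::Client.new

# Producer: tag the message with a priority attribute.
sns.publish(
  topic_arn: 'arn:aws:sns:us-east-1:111111111111:jobs',
  message: JSON.generate(user_id: 42),
  message_attributes: {
    'priority' => { data_type: 'String', string_value: 'high' }
  }
)

# Only deliver high-priority messages to the Lambda subscription;
# the SQS subscription takes everything.
sns.set_subscription_attributes(
  subscription_arn: 'arn:aws:sns:us-east-1:111111111111:jobs:1234abcd',
  attribute_name: 'FilterPolicy',
  attribute_value: JSON.generate(priority: ['high'])
)
```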
By “fan out” I mean that SNS will execute as many Lambda functions as needed to handle the published events as they occur. This gives you much greater concurrency than a typical pub/sub architecture where you typically have a ceiling on the subscriber count or are forced to wait on consumers to process the earlier enqueued messages before processing subsequent ones (i.e., the head-of-line blocking problem).
Yes. But there is an easy way around it. If you can break your process up into chunks that each run in less than 15 minutes, you can put your steps in different methods and use Step Functions to orchestrate them. You will usually be reusing the same lambda instance since you are calling the same function.
A JSON file and a case statement is hard? You can even validate the JSON as you’re creating it in the console - not just validating that it’s correct JSON, but that it’s a correctly formatted state machine. Anyone who can’t handle creating the step function and a case statement is going to have a heck of a time using Terraform, CloudFormation, creating IAM policies, YAML files for builds and deployments, etc.
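As a concrete sketch (the ARN, state names, and methods are all made up): a state machine in the States Language JSON that invokes the same function for each step, passing a step parameter, and the corresponding case statement in the handler.

```json
{
  "StartAt": "Extract",
  "States": {
    "Extract": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:111111111111:function:etl",
      "Parameters": { "step": "extract" },
      "Next": "Transform"
    },
    "Transform": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:111111111111:function:etl",
      "Parameters": { "step": "transform" },
      "Next": "Load"
    },
    "Load": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:111111111111:function:etl",
      "Parameters": { "step": "load" },
      "End": true
    }
  }
}
```

```ruby
# One Lambda function; dispatch on the step passed by the state machine.
def handler(event:, context:)
  case event['step']
  when 'extract'   then extract
  when 'transform' then transform
  when 'load'      then load_rows
  else raise "unknown step: #{event['step']}"
  end
end
```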
Just run the stupid thing on a VM for however long it takes, why would anyone need all this complexity to run some dumb program that should just be a cron job to begin with. And why should the limitations of some platform dictate how one structures their code to begin with? You're solving for the wrong problem.
“Just do everything like we would do on premise” is the very reason that cloud implementations end up costing more than they should.
Let’s see all the reasons not to use a VM.
1. If this job is run based on when a file comes in, then instead of a simple S3 event -> Lambda, you have to do S3 event -> SNS -> SQS and then poll the queue. (A sketch of the Lambda version follows this list.)
2. Again, if you are running a job based on when a file comes in, what do you do if you have 10 files coming in at the same time and you want to process them simultaneously? Do you have ten processes running and polling the queue? What if there is a spike of 100 files, do you have 100 processes running just in case? With the lambda approach you get autoscaling. True, you can set up two CloudWatch alarms - one to scale out when the number of items in the queue is above X and one to scale in when it is below Y - and then set up an autoscaling group and launch configuration. And since we are automating our deployments, that also means a CloudFormation template with a user data section telling the VM what to download, plus the definitions of the alarms, the scale-in and scale-out rules, the autoscaling group, the launch configuration, the EC2 instance, etc.
Alternatively, you can vastly over provision that one VM to handle spikes.
The template for the lambda and the state machine will be a lot simpler.
And after all of that work, we still don’t have the fine-grained autoscaling control with the EC2 instance and cron job that we would have with the lambda and state machine.
3. Now if we do have a VM and just one cron job, not only do we have a VM running and costing us money while waiting on a file to come in, we also have a single point of failure. Yes, we can alleviate that by having an autoscaling group with one instance, setting up a health check that automatically kills the instance if it is unhealthy and brings up another one, and setting the group up to span multiple AZs.
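For reference, the simple S3 event -> Lambda version from point 1 is roughly this (process_file is a hypothetical placeholder):

```ruby
# Sketch: Lambda invoked directly by an S3 ObjectCreated event.
require 'aws-sdk-s3'

def handler(event:, context:)
  event['Records'].each do |record|
    bucket = record['s3']['bucket']['name']
    key    = record['s3']['object']['key']
    body   = Aws::S3::Client.new.get_object(bucket: bucket, key: key).body.read
    process_file(body)
  end
end
```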
I disagree - the rampant complexity of systems you can build using all these extra services is the bigger problem, and at scale, all these extra services on AWS cost more than their OSS equivalent running on VMs. Take a look at ELB for example, there's a per-request charge that yes is very small, but at 10k req/s it's significant - not to mention the extra bytes processed charges. Nginx/haproxy with DNS load balancing can do pretty much what ELB is offering. In the very small none of this matters, but you can just run smaller/bigger VMs as needed; prices starting at the cost of a hamburger per month.
Re reliability, I have VMs on AWS with multiple years of uptime - it's the base building block of the entire system, so it's expected to work all the time. You monitor those systems with any sort of Nagios-type tool and alert into something like PagerDuty for when things do go wrong, but they mostly don't. Redundancy I build in at the VM layer.
Additional services on AWS however have their own guarantees and SLAs and they're more often down than the low-level virtual machines are - so it's an inherently less reliable way to build systems imo, but to each their own.
Also, this notion that you can conveniently split your processing up into 15m chunks is just crazy -- you may never know how long some batch job is going to take any given day if it's long running. Long-running jobs typically lean on other systems outside your control that may have differing availability over time. I met a guy who built some scraper on lambda that did something convoluted and he was happy to make like 100k updates in the db overnight -- I built a similar thing using a single core python job running on a VM that did 40MM updates in the same amount of time. It's just silly what people think simple systems are not capable of.
> I disagree - the rampant complexity of systems you can build using all these extra services is the bigger problem,
So now I am going to manage my own queueing system, object storage system, messaging system, Load Balancer, etc? My time is valuable.
> and at scale, all these extra services on AWS cost more than their OSS equivalent running on VMs. Take a look at ELB for example, there's a per-request charge that yes is very small, but at 10k req/s it's significant - not to mention the extra bytes processed charges. Nginx/haproxy with DNS load balancing can do pretty much what ELB is offering.
ELB automatically scales with load. Do you plan to overprovision to handle peak load? My manager would laugh at me if I were more concerned with saving a few dollars than having the cross-availability-zone, fault-tolerant, AWS-managed ELB.
> In the very small none of this matters, but you can just run smaller/bigger VMs as needed; prices starting at the cost of a hamburger per month.
Again, we have processes that don’t have to process that many messages during a lull but can quickly scale up 20x. Do we provision our VMs so they can handle a peak load?
> Re reliability, I have VMs on AWS with multiple years of uptime - it's the base building block of the entire system, so it's expected to work all the time. You monitor those systems with any sort of Nagios-type tool and alert into something like PagerDuty for when things do go wrong, but they mostly don't.
Why would I want to be alerted when things go wrong and be awakened in the middle of the night? I wouldn’t even take a job where management was so cheap that they wouldn’t spend money on fault tolerance and scalability and instead expected me to wake up in the middle of the night.
The lambda runtime doesn’t “go down”. Between Route53 health checks, ELB health checks, EC2 health checks, and autoscaling, things would really have to hit the fan for something to go down after a successful deployment.
Even if we had a slow resource leak, health checks would just kill the instance and bring up a new one. The alert I would get is not something I would have to wake up in the middle of the night for; it would be something we could take our time looking at the next day, or we could even detach an instance from the autoscaling group to investigate it.
> Redundancy I build in at the VM layer.
> Additional services on AWS however have their own guarantees and SLAs and they're more often down than the low-level virtual machines are - so it's an inherently less reliable way to build systems imo, but to each their own.
When have you ever heard that the lambda runtime is down? (I’m completely making up that term - i.e., the thing that runs lambdas.)
> Also, this notion that you can conveniently split your processing up into 15m chunks is just crazy -- you may never know how long some batch job is going to take any given day if it's long running.
If one query is taking more than 15 minutes, it’s probably locking tables and we need to optimize the ETL process...
And you still didn’t answer the other question. How do you handle scaling? What if your workload goes up 10x-100x?
> I met a guy who built some scraper on lambda that did something convoluted and he was happy to make like 100k updates in the db overnight -- I built a similar thing using a single core python job running on a VM that did 40MM updates in the same amount of time. It's just silly what people think simple systems are not capable of.
Then he was doing it wrong. What do you think a lambda is? It’s just a preconfigured VM. There is nothing magic you did by running on a Linux EC2 instance that you couldn’t also have done running a lambda - which is just a Linux VM with preinstalled components.
Yes, costs do scale as you grow, and they may scale linearly, but the cost should scale at a lower slope than your revenues. But then again, I don’t work for a low margin B2C company. B2B companies have much greater margins to work with, and the cost savings of not having to manage infrastructure is well worth it.
If I get called once after I get off work about something going down, it’s automatically my top priority to find the single point of failure and get rid of it.
Another example is SFTP. Amazon just announced a managed SFTP server. We all know that we could kludge something together much cheaper than what Amazon is offering. But my manager jumped at the chance to get rid of our home grown solution that we wouldn’t have to manage ourselves.
Yes, you wouldn't use Sidekiq anymore either because you would have a different Active Job backend, but the important part is that you're eliminating a piece of infrastructure (Redis) without the poor scalability of DelayedJob.
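As a hedged example, with a community adapter such as the shoryuken gem (assuming its Active Job adapter), the swap is roughly a one-line config change; job classes keep calling perform_later as before, only the backend changes.

```ruby
# config/application.rb -- sketch; assumes the shoryuken gem, whose
# Active Job adapter runs jobs off SQS instead of Redis.
config.active_job.queue_adapter = :shoryuken
```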
Don’t get me wrong, I’m no savage. I use structured JSON logging with Serilog and use Elasticsearch as a sink with Kibana for most of my app logging.
I also log to Cloudwatch though and keep those logs for maybe a week.
Honest question: I don’t Ruby, I just looked up what it is.