How Gov.uk Notify reliably sends text messages to users

jtchang · on April 4, 2020

One of the most important parts about this is that they applied backpressure to the load balancing process:

> We also decided that if a provider is slow to deliver messages, measured in the same way as before, we would reduce their share of the load by 10 percentage points

When doing systems design this is a critical piece to include in almost all load balancing aspects. A lot of the time you will start with 100% of traffic load balanced between two boxes, but what happens when one box fails or slows down? You don't want to overwhelm anything, because that leads to cascading failures.
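A minimal sketch of the kind of share adjustment the quote describes, assuming integer percentage shares that sum to 100. The function name `shift_share`, the provider names, and the `floor` parameter are illustrative assumptions, not taken from the Notify code.

```python
def shift_share(shares, slow_provider, step=10, floor=0):
    """Move `step` percentage points away from a slow provider.

    `shares` maps provider name -> integer percentage; values sum to 100.
    `floor` (an assumption here) is the lowest share a provider can drop to.
    """
    others = [name for name in shares if name != slow_provider]
    if not others:
        return dict(shares)  # nobody to shift load to
    # Take at most `step` points, and never go below the floor.
    moved = max(min(step, shares[slow_provider] - floor), 0)
    updated = dict(shares)
    updated[slow_provider] -= moved
    # Hand the freed share to the other providers as evenly as integers allow.
    base, extra = divmod(moved, len(others))
    for i, name in enumerate(others):
        updated[name] += base + (1 if i < extra else 0)
    return updated


shares = {"provider_a": 50, "provider_b": 50}
shares = shift_share(shares, "provider_a")
print(shares)  # {'provider_a': 40, 'provider_b': 60}
```

Calling this once per measurement window while a provider stays slow drains its share gradually instead of cutting it off in one step.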

duxup · on April 4, 2020

Load balancers rule the world.

flibble · on April 4, 2020

We are facing the issue of SMS not getting delivered and have implemented failovers (Messagebird, Nexmo, Twilio) when we detect delivery problems. However, we have a problem that we frequently get sent positive delivery receipts despite the SMS not being delivered. This makes it hard to know if we should fail over. Does anyone have a good solution for this?

toast0 · on April 4, 2020

Positive delivery receipts from an SMS aggregator have approximately zero diagnostic value.

If you want them to mean something, you need to have a direct connection with the carrier of your user, and you have to know that the carrier or their network doesn't just fake positive delivery receipts at some point in the system to make numbers look good. With an aggregator between you and the carrier, you have no way of knowing when the delivery path changed and the intermediary faked a delivery report.

The absence of a positive delivery receipt doesn't really mean anything either. Negative delivery reports have some information content though.

My context is sending verification codes though; receiving a code back from a user is a much better measure of successful delivery to the user than a delivery receipt. If you're sending news or something where there's no measurable direct action taken as a result of the message, I guess a delivery receipt is better than nothing, maybe?

Realtime measurement and Thompson sampling or multi-armed bandit with as many credible providers as you can manage is the best way forward. Don't send retries through the same provider either.
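A rough sketch of what Thompson sampling over providers could look like, assuming the reward signal is "the user entered the verification code we sent". The class name `ProviderBandit`, the provider names, and the Beta(1, 1) prior are assumptions for illustration only; they don't describe any real aggregator's API.

```python
import random


class ProviderBandit:
    """Thompson sampling over SMS providers with a Beta posterior each."""

    def __init__(self, providers, seed=None):
        self.rng = random.Random(seed)
        # [alpha, beta] per provider, starting from a uniform Beta(1, 1) prior.
        self.stats = {name: [1, 1] for name in providers}

    def choose(self, exclude=()):
        """Sample a conversion rate per provider and pick the highest draw."""
        candidates = [name for name in self.stats if name not in exclude]
        if not candidates:
            raise ValueError("no provider left to try")
        draws = {n: self.rng.betavariate(*self.stats[n]) for n in candidates}
        return max(draws, key=draws.get)

    def record(self, provider, converted):
        """converted=True means the user entered the code we sent."""
        self.stats[provider][0 if converted else 1] += 1


bandit = ProviderBandit(["provider_a", "provider_b", "provider_c"], seed=1)
first = bandit.choose()
bandit.record(first, converted=False)    # the code never came back
retry = bandit.choose(exclude={first})   # the retry goes through someone else
```

Providers with a poor conversion record still get sampled occasionally, which is what lets the system notice when a route recovers.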

Vuska · on April 6, 2020

I used to work for a UK SMS SaaS provider.

I'm surprised you'd frequently get false positives. What regions were you sending to?

They were essentially unheard of with the UK networks and very rare with European networks. It was only Middle East and African networks that were troublesome, but they represented a tiny fraction of our traffic. Even then, false positives were rare compared to messages just not getting delivered.

dsincl12 · on April 4, 2020

Super interested in this as well. I switched from Messagebird to Twilio after giving Nexmo a try and it's a lot better now, but I still get false positives.

xeromal · on April 4, 2020

Who do you get false positives from the most?

dsincl12 · on April 4, 2020

For me personally it was Messagebird. Two failed demos because of this, and the response from them was non-existent. I was even contacted by one of their sales people after I wrote my last "angry I'm leaving" email, asking me if I was interested in using their services.

theblueprint · on April 4, 2020

Yes, similar position with Messagebird. Lack of transparency in their reporting/insights vs Twilio. That said, our Twilio pricing is 2-3x Messagebird.

aeonsky · on April 4, 2020

If it's something requested, the user can always re-request it. If it's not something requested, like news, use two different providers to send 2 messages: one with the actual content, and then a minute later, a different smaller message saying "This message brought to you by MyCompany. Please text back STOP to stop messages or AGAIN to get the latest alert again." Some end users might get better delivery results with a specific texting service, so it might be useful to set service affinities on a user-specific basis.

john_minsk · on April 4, 2020

You must be kidding. This has never happened to me so far, but if it did, I would be so unhappy with YourCompany that I probably would never buy anything from you again.

aeonsky · on April 7, 2020

You understand that OP is requesting a solution for a practically unfixable issue, unless you have a relationship with the carrier itself. A provider sends you a success, but it was actually a failure. Imagine a DB sending Commit-OK, but data was never written. You probably wouldn’t buy very much from a company that uses that kind of DB as well, since it loses your orders.

Once again, if you are receiving texts that must mean you opted into them. User is expecting them and it’s different from a user receiving two texts never having signed up for them.

I do believe this technique uses some out-of-the-box thinking and mostly DOES solve OP's problem.

sergiomattei · on April 4, 2020

On a side note, I'm a huge fan of the standardized design language of all UK Government sites.

They really nailed the aesthetic and the consistency is unmatched.

Edit: I'm impressed at this tooling.[1]

1: https://www.gov.uk/service-toolkit#gov-uk-services

robin_reala · on April 4, 2020

That’s the GOV.UK Design System in play: https://design-system.service.gov.uk/

(I helped to start the current incarnation of that, and it’s probably the thing in my career that I’m most proud of.)

GordonS · on April 4, 2020

When it comes to UK tech, what's always happened is projects get farmed out to mega-outsourcing firms, like TCS and Capita, who then deliver a bunch of crap, very late and way over budget.

I'm so pleased that we've brought some of it in house with GDS, and I'm amazed (in a good way) that it actually works, and they are even part of the OSS community.

adwww · on April 4, 2020

I was involved (at arm's length) in one part of GDS a few years ago and it sounded like some department heads really didn't like the GDS approach.

Departments actually preferred spending ££££££ with HP / Capita / Capgemini etc. because the big outsourcing companies didn't ask as many 'awkward' questions (about accessibility, for example), and the departments got to 'own' the product more.

GordonS · on April 4, 2020

Saddens me to hear that. At least from their blog, OSS, and the end results I see as a citizen, GDS is a breath of fresh air in a never-ending series of farcical, failed mega-projects. And as a taxpayer, I'd much prefer my taxes to go towards building a capable team, rather than paying billion after billion to outsourcing firms, and god only knows how much in pork to politicians.

TruffleMuffin · on April 5, 2020

Couldn't agree more.

throwaway_pdp09 · on April 4, 2020

I can't criticise the aesthetics but damn I can criticise the usability.

I've been on UC (edit: Universal Credit) a while, and when I was first on it, it would lose messages (on both my side and the jobctr officials' side, and it has frustrated them too, which I know because I talk to them. Edit: they also have their own usability frustrations with it, with little or no way to feed them back up), log you out even before timing out, lose appointments or have them be set but not show up for you, and more.

There's plenty more I could say but I'll just leave something which sums it up. At the top it says "BETA This is a new service - your feedback will help us to improve it". Guess what, it's disabled so you cannot give feedback. It always has been disabled.

Visually pretty good but in terms of usability, not so good. Pretty sure no proper testing was done on it, though it has been getting a lot better recently.

switch007 · on April 4, 2020

Not a single part of UC is meant to make it easy on the claimant, so colour me surprised to read about website issues!

throwaway_pdp09 · on April 4, 2020

Actually the claimants use it intermittently while the UC office staff use it for much of their working day. I don't know for sure, but it's pretty likely that it's actually quite a bit harder on them.

kennydude · on April 6, 2020

It's better than it used to be. Still, a way off from what it needs to be.

The first version was on Microsoft Dynamics (yes a CRM...) for whatever reason iirc.

toyg · on April 4, 2020

So cases don't get worked and UC doesn't get paid. In other words: working as intended.

rkangel · on April 4, 2020

The impression I've got is that this is also actually a rare victory of politics. My understanding is that GDS was set up under the Cabinet Office with enough political power to dictate to the various bits of the government how their web presence would work. It's lovely that that has (mostly) worked and that there is a great design being delivered!

xiaq · on April 4, 2020

I cannot praise the UI design on gov.uk more. It is a beacon of light in this dark age of web UI usability and I am not at all surprised that they have a coherent design system.

sergiomattei · on April 4, 2020

Incredible work. I don't think I've seen another government that has a design system so cohesive and well thought out.

zouhair · on April 4, 2020

So that's where Canada.ca aesthetic comes from.

robin_reala · on April 4, 2020

Also https://www.govt.nz/ and https://www.gov.au/.

frosted-flakes · on April 4, 2020

And https://www.ontario.ca.

ponitozhekoni · on April 4, 2020

Thank you.

hn_throwaway_99 · on April 4, 2020

I'm glad that they improved their delivery, but one thing I find frustrating about our industry is how often we all seem to be reinventing the wheel. I mean, there are tons of well-described load balancing algorithms with various pros/cons. From the article it sounds like they just figured out their load balancing algorithm through trial and error, rather than researching load balancing algorithms first, then tweaking them based on real world performance.

londons_explore · on April 4, 2020

The "we need to keep load on providers nearly even for business/political reasons" constraint is fairly unique.

From a purely technical perspective, you would just distribute requests inversely proportional to response time. Probably under low load, one provider would get all the requests, and only in an outage or overload scenario would the other provider take the rest.
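To make "inversely proportional to response time" concrete, here is a small sketch. It takes that wording literally, so a slower provider still receives some traffic rather than none. The function names and the latency figures are made up for illustration, and latencies are assumed to be positive.

```python
import random


def inverse_latency_weights(latency_seconds):
    """Return traffic shares proportional to 1 / response time."""
    inverse = {name: 1.0 / seconds for name, seconds in latency_seconds.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def pick_by_latency(latency_seconds, rng=random):
    """Pick one provider at random, weighted by inverse latency."""
    weights = inverse_latency_weights(latency_seconds)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# A provider answering in 1s gets three times the traffic of one taking 3s.
weights = inverse_latency_weights({"provider_a": 1.0, "provider_b": 3.0})
print(weights)
```

Nothing in this scheme keeps the split near even, which is exactly the business constraint mentioned above that it ignores.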

hn_throwaway_99 · on April 4, 2020

> The "we need to keep load on providers nearly even for business/political reasons" constraint is fairly unique.

Where did that constraint come from? Did I miss it in the article? Their initial approach, after all, had all of their load going to a single provider.

Poacher5 · on April 4, 2020

From the article: "If we ended up sending only a small number of messages through one provider over the long run, they might not be massively incentivised to be a provider in the future."

Gotta keep the service provider happy to ensure they still go along with the program.

crazypython · on April 4, 2020

Can you send a link to a list of the mentioned load balancing algorithms?

toast0 · on April 4, 2020

> I'm glad that they improved their delivery, but one thing I find frustrating about our industry is how often we all seem to be reinventing the wheel

The problem is you pick a SMS aggregator. They all tell you they have global coverage, and direct routes. They all tell you that their routing algorithm is the best of all the aggregators. And they're all full of BS.

If they weren't full of BS, I wouldn't have had a nice job managing verification code sending for a big messaging company though, so I guess it worked out for me. :P If SMS worked in general, the messaging company probably wouldn't have existed.

hrktb · on April 4, 2020

To be fair, what they ended up with seems far from a standard round robin system, and the amount of manual tweaking they acknowledge doing indicates it’s more complicated than picking an algorithm in a textbook.

Swizec · on April 4, 2020

Yeah but what looks better on your CV?

“Developed and researched a novel algorithm to reliably send messages under intense load”

Or

“Tried 5 off the shelf solutions, picked 1 that seemed okay, and moved on with my life”

That’s why we keep reinventing the wheel. Also it’s fun and we all think we’re the smartest.

thanksforfish · on April 4, 2020

> Yeah but what looks better on your CV?

Neither. Focus on achieved business outcomes instead.

"Scaled GOV.UK Notify to 15 messages per second, ensuring reliable and timely delivery of 2FA codes and flood emergency warnings."

"Developed and researched" tells me you did something interesting but I can't tell if it was resume padding or real work. If you can't articulate why, that's a red flag. A good hiring manager should be able to detect resume padding and avoid hiring people that waste company resources on pet projects.

rootusrootus · on April 4, 2020

As a hiring manager, I know which one I'd rather see on a resume. Though I understand why an individual would choose the former.

cosmodisk · on April 4, 2020

This is the stuff most of us have to deal with on a daily basis. Do I just google it and move on, or maybe I should try to come up with it myself? Every time I google, I don't feel any good at all. In fact, I don't feel anything. When I come up with some solution myself, it elevates my motivation and I always learn something new. As a manager though, I sometimes allow people to do this stuff, while on other occasions I specifically tell them not to spend time on some creative stuff and just get something off the shelf.

oakesm9 · on April 4, 2020

Here's the full source code if anyone wants to take a look:

https://github.com/alphagov/notifications-api

geospeck · on April 4, 2020

One interesting thing that I read the other day from a lead developer at Gov.uk was about how fast they managed to build a service, within a couple of hours[1]

What I also really like about Gov.uk is that they seem to have their apps open source[2]

[1] https://twitter.com/RichardTowers/status/1243904365506760709

[2] https://github.com/alphagov

robin_reala · on April 4, 2020

https://www.gov.uk/service-manual/technology/making-source-c...

When you create new source code, you must make it open so that other developers (including those outside government) can:

- benefit from your work and build on it

- learn from your experiences

- find uses for your code which you hadn’t found

rstuart4133 · on April 5, 2020

I see everyone is focusing on 3rd party providers.

Here is another solution: smstools (https://packages.debian.org/buster/smstools), and a bunch of SMS modems plugged into a USB hub. An SMS modem can send an SMS in under 5 seconds. They say 200,000 a day max, so let's say you want to cope with sending 200,000 in 4 hours.

That's a little under 14 per second, so let's say 15 per second. At 5 seconds per SMS you need 75 modems[0] to do that, or about AUD$4,000 worth of modems. Sorry for the AUD$ - I'm Australian. You will also need SIMs - or about AUD$1,500 worth. Don't worry about having 75 modems in one spot, the mobile phone network is designed to cope with a stadium of people all sending at the same time.

Perhaps you want to shard it for reliability - maybe 5 machines, so add another AUD$5000 for a NUC or similar all with a minimal Debian install + a web server or whatever for whatever delivery mechanism you are going to use to get the SMS's to the servers. That's AUD$10.5K total. Write some glue code - which is a week tops and job done.

The one question I'd be asking is how does that compare to using the cloud. Third party providers charge around AUD$0.05 per SMS. They say a minimum of 100,000 SMS's per day - or AUD$150K / month. The cost for the non cloud solution is AUD$10.5K for the first month, then $1,500 / month after that for the pre-paid SIMS.
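The sizing and cost arithmetic above can be checked in a few lines. All figures come from the comment itself, except the 30-day month, which is an assumption implied by the AUD$150K figure; the variable names are illustrative.

```python
# Throughput sizing (figures from the comment above).
SECONDS_PER_SMS = 5            # one modem sends an SMS in under 5 seconds
PEAK_MESSAGES = 200_000        # burst to absorb
PEAK_WINDOW_HOURS = 4

required_rate = PEAK_MESSAGES / (PEAK_WINDOW_HOURS * 3600)  # ~13.9 per second
planned_rate = 15              # rounded up, as in the comment
modems = planned_rate * SECONDS_PER_SMS

# Up-front hardware cost in AUD.
MODEM_COST = 4_000
SIM_COST = 1_500               # also the recurring monthly pre-paid cost
MACHINE_COST = 5_000           # 5 small machines for sharding
upfront = MODEM_COST + SIM_COST + MACHINE_COST

# Third-party cost in AUD, kept in integer cents to avoid float rounding.
PRICE_CENTS_PER_SMS = 5
MESSAGES_PER_DAY = 100_000
DAYS_PER_MONTH = 30            # assumption implied by the AUD$150K figure
cloud_monthly = PRICE_CENTS_PER_SMS * MESSAGES_PER_DAY * DAYS_PER_MONTH // 100

print(f"required rate: {required_rate:.1f}/s, modems: {modems}")
print(f"up-front: AUD${upfront:,}, then AUD${SIM_COST:,}/month")
print(f"third party: AUD${cloud_monthly:,}/month")
```

This leaves out the cost of the glue code and of whoever diagnoses the setup when it breaks.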

Downsides: when it breaks (and it will), you will have to diagnose what's going on. That can be hard if the cause is a welding shop starting up next door. You are also going to have to deal with the telcos screwing up their SMS infrastructure, which seems to happen in Australia every 12 months or so. But you can fight that to some extent by geographically distributing your NUCs and using several different telcos servicing each NUC. That way it becomes more obvious what failed. Finally, instead of NUCs use an industrial rated PC [1] to get your reliability up.

[0] https://www.ebay.co.uk/itm/283828240407 You need the version with an 'S' (for serial interface) suffix, although you can often just change the firmware.

[1] https://fit-iot.com/web/products/fitlet2/, industrial temperature rated. So, no stinking air conditioning required. :D

ryanlol · on April 5, 2020

Why not just buy a sim box instead?

rstuart4133 · on April 10, 2020

I've used Hypermedia’s SIM boxes. They are pretty good - neat install, didn't break for a long while, well documented. Also very expensive for what they did, but hey you are paying for a boxed solution, right?

Maybe when I was young. But I have grey hair now. I'm sure some of it is grey because over the years I've made too many one-off purchases of specialised boxes promising to do everything I needed at the time. The expensive, nerve-racking disasters I've had in IT were caused by boxes like that - RAID arrays that used some "high speed" proprietary format, IBM SCSI boxes that needed specialised IBM disks whose firmware bug happened to spray shit across the data flowing across the SCSI bus on occasion, hell, even specialised telco APNs that worked for a few years then didn't, and after 6 months they admitted they had fired the people who set it up.

When that Hypermedia box died (and they all do eventually), I phoned up the supplier - and a replacement was an order, payment, international freight, and customs away. That was weeks. So we were down for weeks.

The alternative is to make do with off the shelf retail components that are sold to ordinary punters every day. Yes, the components aren't as reliable. You also have to provide glue - but you write the glue or use open source, so visibility into problems is excellent and the response time is amazing. And they are dirt cheap, so you can keep a couple of hot spares on the shelf (as you would if you had 75 modems). Failing that, getting new ones is just a case of going down to your local retail outlet and picking one off the shelf. And they actually have _less_ bugs, partially because there are millions of them out there, and partially because a retail brand will be overwhelmed if their channel fills up with failures.

I saw a misbehaving IBM SAN take out a video production house once, and the 100 people who worked there. (Turns out movie length video editing pushes a SAN very hard.) I didn't make that purchasing decision, but I may well have back then. There but for the grace of god go I. I've come close enough as it is.

So no 128 x SIM boxes for me thanks.

For what it's worth, a one-off expensive box is worth it if the purchase price includes a man carrying the requisite spares on your doorstep within 8 business hours when it breaks. Big companies like Dell, HP and IBM do that for their big iron. (In an amazing coincidence, the boxes they are willing to cover with that sort of service for 5 years at a moderate increment in the purchase price almost never break.)

The other time I'm now willing to purchase specialised boxes is if they are mostly open source (so I have visibility, and very little bespoke poorly road tested complex code), and I'm purchasing 100's of them so I can realistically price in keeping a bunch of them on the shelf in the initial purchase.

867-5309 · on April 4, 2020

I thought this might have explained the mechanics behind the recent GOV_UK CORONAVIRUS ALERT, but alas. It didn't even namedrop the SMS providers

awslattery · on April 4, 2020

Based on the environment variable naming, I'd wager it's Firetext and MMG, two UK based SMS gateway providers.

https://www.firetext.co.uk/

https://www.mmg.co.uk/

oakesm9 · on April 4, 2020

You're correct. You can see them both in the source code here:

https://github.com/alphagov/notifications-api/tree/master/ap...

benbristow · on April 4, 2020

From what I've heard (could be wrong) the government literally just said to all the networks 'send this to everyone' and let the individual networks handle it.

Apparently they could've set up a system like in Japan where your phone gets emergency alerts (which I actually experienced when I visited on vacation a few years ago on my iPhone from the UK) and are handled specifically by the mobile operating system but the government were too cheap to set it up.

https://www.theguardian.com/world/2020/mar/23/government-ign...

ChrisArchitect · on April 6, 2020

ya this has rapidly the last few years become the standard in Canada. I think the government mandated all the major providers enable it/support it and they did some tests and people initially freaked out (phone alerts at odd hours alarmed people) but then 'for the greater good' people got used to it (and learning to use volume and DND settings). Main use is timely emergency alerts for missing children. Used to be done across television and radio (still is I suppose, not sure, less people have cable television and who's got the radio on). Proven effective on a number of occasions too. And especially in recent coronavirus times it's been used to send more general notices for the first time insisting people stay home/self-isolate and warning travellers returning home from abroad to quarantine for 14 days.

Symbiote · on April 4, 2020

I think this is just GSM Cell Broadcast [1].

It seems strange that it would be complicated or expensive to set up. I've used cheap, aimed-at-tourist SIMs in developing countries where cell broadcast was used to send news and adverts, which is very annoying.

[1] https://en.wikipedia.org/wiki/Cell_Broadcast

benbristow · on April 4, 2020

Which countries were they? That does sound annoying. Does it happen with standard network SIM cards?

Symbiote · on April 5, 2020

I don't know if it happens to everyone, or if I chose the worst network provider from the booths at the airport.

I can't really remember the country, it was years / 20 countries ago. Possibly Vietnam.

reaperducer · on April 4, 2020

It's happened to me in the Czech Republic.

Foreign SIM crosses the border? Here comes a flood of SMS spam!

londons_explore · on April 4, 2020

Some networks don't have GSM cell broadcast set up.

frutiger · on April 4, 2020

Similar to Japan, the US has https://www.fcc.gov/consumers/guides/wireless-emergency-aler....

saluki · on April 4, 2020

We were in Target one night and a flash flood warning came through on everyone's phone. It felt like a movie, hearing the alerts cascade across everyone's phones and seeing everyone stop and view the alert.

kennydude · on April 6, 2020

Yeah, I would have liked to have seen that. Really wish the phone networks were required to blacklist anyone using something similar as the sender though. A few scammers I've seen have been able to use the "GOV_UK" sender string :(

ris · on April 4, 2020

Notify didn't actually handle those particular messages. It can only, like most services, send text messages to specific numbers.

the_arun · on April 4, 2020

What if aws was govt? - https://www.cloud.service.gov.uk/

toomuchtodo · on April 4, 2020

Take note USDS/18F! Something to consider instead of Govdelivery.

SparkyMcUnicorn · on April 4, 2020

What's interesting is that this looks like a service anyone can use. API and everything.

https://www.notifications.service.gov.uk/

daguar · on April 4, 2020

It’s open source, and Australia and Canada have both deployed instances. Would love to see a US (or state run) instance.

anticensor · on April 4, 2020

United States Alert Message Service as a branch of USPS?

toomuchtodo · on April 4, 2020

You’d want it under the authority of GSA (like Login.gov), as a foundational service (messaging.gov).

867-5309 · on April 4, 2020

"if you work in central government, a local authority, or the NHS."

toomuchtodo · on April 4, 2020

Reusable open source components all governments can use is still a huge improvement over the status quo, where each government overpays contractors to build suboptimal solutions that remain closed source.

jimmySixDOF · on April 5, 2020

This topic reminds me of the story in Hawaii when they sent out a broadcast false alarm SMS alert to everyone about North Korean missiles

If I remember correctly it was due to a poorly designed drop-down menu & missing confirmation challenge box. "Test Send" and "Send Send" were right on top of each other in a 10-point font lol.

[1] https://en.wikipedia.org/wiki/2018_Hawaii_false_missile_aler...

Angostura · on April 4, 2020

Before we get too carried away, it's worth seeing how the system coped with unprecedented pressure:

https://www.bbc.co.uk/news/technology-52037573

"Millions of mobile users in the UK have yet to receive the government's text message alert about coronavirus. The SMS - telling people to stay at home - began being sent early on Tuesday morning. But Vodafone has confirmed it only expects to complete the process later this Wednesday...."

ddddddj · on April 4, 2020

Just in case it wasn't clear from discussion elsewhere on this post, Notify didn't send out the text message mentioned in the news. That message was sent directly by the networks at request of the government.

leegraham · on April 4, 2020

Is the BBC article talking about the same thing?

The OP blog post says

> we have 2 different [text message] providers

but the BBC article says

> Vodafone is the only one of the UK's big four networks that has not finished the task.

If the two articles are talking about the same thing then it’s worth noting that the other major providers seemingly had no problems, and Vodafone chose to not send the messages at night time to avoid waking people up.

ohlookabird · on April 4, 2020

It is a really useful service! I used their email service to get updates on potential travel destinations in the past months when Covid-19 started to be a thing.

blntechie · on April 4, 2020

Not to diminish anything but 100 to 200k messages per day is not really huge in relative scale of things. Several private services and government orgs in China and India easily send 10x or more of that number.

SparkyMcUnicorn · on April 4, 2020

They never said 100-200k was the limit, but rather that's how many they're usually sending out per day.

It could be a limit, but I don't see anything that indicates there's a correlation.

petepete · on April 4, 2020

Definitely not a limit, on March 28th they sent 637k SMS messages.

https://www.gov.uk/performance/govuk-notify

I've used Notify on several projects over the last couple of years, it's a really nice service and has never caused us any problem whatsoever.

robin_reala · on April 4, 2020

To be fair, both China and India have over 15x the population of the UK.

adrianmonk · on April 4, 2020

If the available providers' services start bogging down when you try to actually do it, then it can be considered huge relative to their capacity.

rospaya · on April 4, 2020

The company I work for (one of the largest providers in the world) handles 7+ billion interactions a month. Without checking our systems (and my NDA) I would bet we also provide for a lot of .gov.uk services.

mjw1007 · on April 4, 2020

I don't think the post was intended to be a boast about how many messages they send.

I think it was intended to be a description of how it's sensible to set things up at the scale they happen to have.

dirtydroog · on April 4, 2020

That surprised me too. We handle that amount of HTTP requests per second, and you don't really need a lot of infrastructure to do it. Something is wrong here. I bet it's all synchronous code and capacity is being wasted on waiting. Also, I doubt they're tailoring each individual SMS so there are likely to be efficiencies to be had here too.

adwww · on April 4, 2020

Instead of betting, you could read the code and tell us. https://github.com/alphagov/notifications-api

dirtydroog · on April 4, 2020

Is it this bit?

https://github.com/alphagov/notifications-api/blob/e386d2ac3...

I'd like to clarify that the remark about load handling was aimed at your SMS gateway providers.

CaciaraAsAServi · on April 4, 2020

Is there a particular reason why, judging from the graphs, the rate suddenly drops every 6 minutes or so? Batching or something like it?

ddddddj · on April 4, 2020

Hey, I'm the person who wrote the blog. I was slightly interested in the pattern too but didn't take the time to look into it. I assumed it was either something to do with a service sending us traffic in that pattern or maybe just something to do with Grafana. It could also be some unexplained behaviour in our system, maybe something to do with how we are pulling items off the queue. If I find some time next week I might take a proper look though!

danpalmer · on April 5, 2020

If you’re using Prometheus and restarting servers regularly (a default behaviour for gunicorn for example) then you’ll lose a little data, up to your Prometheus scrape period, every restart.

My team found this pretty annoying for monitoring a Django site so we’ve ended up moving to a statsd push-based metrics approach and are finding numbers generally easier to trust and reason about.

CaciaraAsAServi · on April 5, 2020

Cool! I was just curious :) as for the rest, fine article :)

harel · on April 4, 2020

As a user of Notify, I can vouch for the reliability of the service. It's rock solid.

yzydserd · on April 4, 2020

It is interesting the article didn’t mention the incident 2 weeks ago when it had problems with a 7-fold increase in volume. Though it did well under the strain. https://status.notifications.service.gov.uk/incidents/jpwxyt...

ddddddj · on April 4, 2020

Hey, I'm the one who wrote the blog post. This was originally a talk a colleague on my team gave internally about 2 months ago and then I wrote it up as a blog post a few weeks back, before we had that big incident so there wasn't any particular thought on not mentioning it.

As the postmortem mentioned, it wasn't related to any of this load balancing work or our providers, it was us running into trouble with a different part of our system. That was a busy week (both in terms of numbers as you can see on https://www.gov.uk/performance/govuk-notify/notifications-by... but also in terms of us fighting fires).

londons_explore · on April 4, 2020

I'm not sure I'd advocate for this design... The traffic sharing and backoff seems crude... The "10% per minute" doesn't prevent sudden increases in load killing both providers.

I would design it like this:

Put all requests into a distributed queue, for example persistent pubsub.

Have workers take work from that queue.

Each worker should send a new request to a provider if the rate of requests sent in the past minute is < 2x the rate of requests in the previous minute, and the number of in flight requests is < 10x the average of the past minute, and the rate of errors, including timeouts, is <1%. If both providers are eligible, send to whichever has had the fewest requests in 24h.

This prevents flooding/DoSing a badly configured provider (a well configured provider would have ingress ratelimiting, and you could do away with all the above logic).

Have alerting on the age of the oldest item in the queue, and a monitoring dashboard showing dispatch rate to each provider, with response error codes.

All the state is local to the worker, and doesn't need persisting. If a worker crashes, the item doesn't get acknowledged to pubsub, and will be retried. If you like, you can autoscale the number of workers based on their cpu utilization.

I'd expect the above to scale to 10k qps per worker, and 5Mqps for 1000 workers before needing a redesign.
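The per-worker rule described above could be sketched roughly like this. The `ProviderStats` fields, the reading of "10x the average" as the average in-flight count over the past minute, and the `min_rate` / `min_in_flight` floors (added so a worker with no history can start sending at all) are assumptions, not part of the original description.

```python
from dataclasses import dataclass


@dataclass
class ProviderStats:
    """Counters a worker keeps locally for one provider."""
    name: str
    sent_last_minute: int
    sent_previous_minute: int
    in_flight: int
    avg_in_flight_last_minute: float
    errors_last_minute: int      # includes timeouts
    sent_last_24h: int


def is_eligible(p, min_rate=10, min_in_flight=10):
    """Apply the three limits: ramp rate, in-flight count, error rate."""
    # The floors are an assumption: without them a cold worker never sends.
    rate_cap = max(2 * p.sent_previous_minute, min_rate)
    in_flight_cap = max(10 * p.avg_in_flight_last_minute, min_in_flight)
    if p.sent_last_minute:
        error_rate = p.errors_last_minute / p.sent_last_minute
    else:
        error_rate = 0.0
    return (p.sent_last_minute < rate_cap
            and p.in_flight < in_flight_cap
            and error_rate < 0.01)


def pick_provider(providers):
    """Return the eligible provider with the fewest sends in 24h, or None."""
    eligible = [p for p in providers if is_eligible(p)]
    if not eligible:
        return None  # leave the item on the queue and try again later
    return min(eligible, key=lambda p: p.sent_last_24h)


a = ProviderStats("provider_a", 100, 80, 5, 4.0, 0, 50_000)
b = ProviderStats("provider_b", 100, 80, 5, 4.0, 0, 40_000)
print(pick_provider([a, b]).name)  # provider_b
```

When `pick_provider` returns `None`, the worker simply doesn't acknowledge the item, so the queue's age alert is what surfaces a stall.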