Our scripts have so many retry loops and arbitrary pauses mixed in to account for garbage like this. Distinguishing "did the call fail?" from "or is the system just lost in the land of eventual inconsistency?" ugh.
And yeah. shout out to Azure AAD where I can have a role assignment that is granting permissions to "unknown". We call them ghosts.
> Our scripts have so many retry loops and arbitrary pauses mixed in
This is unfortunately your fault, not Azure's. Their API is explicitly designed around eventual consistency and weak references, so your client side must take this into account. Typical bash, Python, or PowerShell scripts are the Wrong approach with a capital W, and you will forever be tearing your hair out if you persist in using them (or any similar imperative deployment mechanism).
The only robust method is ARM Templates, or better yet, Bicep templates[1]. The latter simply compile down to ARM, so they're essentially equivalent but terser and with nicer tab-complete.
Compared to scripts, templates have key advantages:
1. Built-in incremental / differential deploy capability. A partially deployed template can be simply redeployed[2] to "fix it up", without requiring client-side logic for every corner case.
2. Can deploy multiple changes that would fail if deployed step-by-step. For example, App Gateway can have intermediate configurations that won't validate on the way to a valid final configuration. This is madness to unravel with scripts. Templates generally just take you to the final configuration in one step.
3. Inherently parallel. Anything that can be deployed concurrently will be. Anything. No need to write complex and error-prone parallel loops on the client side!
4. Largely immune to temporary failures like missing reads after writes.[3] The template engine has a built-in retry loop for most (all?) fallible steps. You'll see it has "failed"... and then "succeeded" anyway.
[2] Most of the time. All resources should be idempotent to redeployment, but many aren't because this is not mechanically enforced. IMHO, this is just shoddy, shoddy engineering and everyone involved should be ashamed. Being nearly idempotent is like being nearly pregnant.
[3] You still need a few tricks up your sleeve for robust deployments. Anything outside of ARM, such as Azure AD groups and RBAC, tends to be a PITA. Generally you want to wrap your deployments in a script that takes the object GUID of the created group and feeds that into the template. That works, because the GUID can be used even if the full object is not fully replicated around yet.
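A minimal sketch of the wrapper described in [3], assuming the Azure CLI (`az`) is installed and logged in. The template parameter name `groupObjectId` is hypothetical, and the `id` property in the `--query` is what recent CLI versions return (older ones exposed `objectId`), so treat both as assumptions to adjust:

```python
import subprocess


def create_group(display_name: str, mail_nickname: str) -> str:
    """Create an Azure AD group via the CLI and return its object ID."""
    result = subprocess.run(
        [
            "az", "ad", "group", "create",
            "--display-name", display_name,
            "--mail-nickname", mail_nickname,
            # Assumption: recent CLI versions expose the object ID as "id".
            "--query", "id",
            "--output", "tsv",
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def build_deploy_command(resource_group: str, template_file: str,
                         group_object_id: str) -> list[str]:
    """Build the deployment command, passing the GUID as a template parameter."""
    return [
        "az", "deployment", "group", "create",
        "--resource-group", resource_group,
        "--template-file", template_file,
        # Hypothetical parameter name; it must match the template.
        "--parameters", f"groupObjectId={group_object_id}",
    ]


def deploy_with_new_group(resource_group: str, template_file: str,
                          display_name: str, mail_nickname: str) -> None:
    """Create the group, then hand its GUID to the template deployment."""
    group_id = create_group(display_name, mail_nickname)
    subprocess.run(
        build_deploy_command(resource_group, template_file, group_id),
        check=True,
    )
```

The point is that the script never looks the group up again by name; it only passes along the GUID it got back from the create call.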
Yeah, so I'm hitting this again today, so now I can comment more knowingly.
> Generally you want to wrap your deployments in a script that takes the object GUID of the created group and feeds that into the template. That works, because the GUID can be used even if the full object is not fully replicated around yet.
In the case I have today (wanting to perform an action on a newly created application), this trick doesn't work. (We make the call by specifying the application by ID, too.)
"Bad Request […] It looks like the application '[ID]' you are trying to use has been removed or is configured to use an incorrect application identifier."
It's not removed, of course, and the ID isn't incorrect.
Edit: actually, it is worse than that. So, there's the above error, and that's essentially a failure to have read-your-writes.
Our scripts retry on that, because our scripts have become accustomed to AAD's shit. But we eventually hit this sequence of events:
1. Create the app
2. Grant admin-consent
[other necessary setup]
3. Create an AKS cluster: <this fails>
And it fails because the app needs to have admin consent granted on it. But we do that in step 2, and my logging is now good enough to show that we not only retry it after a read-your-writes failure, but that the command eventually succeeds, yet the UI never ends up reflecting that. This is not a read-your-writes failure; this is a lost write!
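For what it's worth, here is roughly the shape of a guard against that kind of lost write, as a sketch only. `apply` and `is_visible` are hypothetical callables (say, "grant admin consent" and "read the consent state back"); nothing here is an Azure API:

```python
import time


class LostWriteError(RuntimeError):
    """The write reported success but never became visible."""


def apply_and_verify(apply, is_visible, *, attempts=3, polls=5,
                     delay=2.0, sleep=time.sleep):
    """Apply a write, then poll until a read confirms it.

    If the write never shows up within `polls` reads, apply it again,
    up to `attempts` times. Returns the attempt number that stuck.
    """
    for attempt in range(1, attempts + 1):
        apply()
        for _ in range(polls):
            if is_visible():
                return attempt
            sleep(delay)
    raise LostWriteError(
        f"write not visible after {attempts} attempts of {polls} polls each"
    )
```

This only helps when the write is safe to repeat, and it cannot tell a lost write from very slow replication; it just stops the script from moving on to the next step on the strength of a "success" alone.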
To be frank, that's a lot of words to say that, instead of fixing the bugs at their core, MS wrote an entire product to try to work around the bugs in the original product, and now wants me to use that layer instead. And, yeah, that's about how MS sees the world. But "yeah, no" is the serious answer there, and we're moving anything we can to Terraform first.
Most of the failures we see are with AAD, which, AIUI, ARM templates do nothing for.
Even within ARM, as I understand them, ARM templates cannot handle deletes or changes. (They are deployments of new resources.)
And even if one could use templates, that just abstracts the same problem: how long do you wait for the template if it hasn't finished? And if it has finished, how long until the result is actually usable? (We see changes in ARM take >30 minutes to take effect, and even after completion it can take several more minutes for things to "settle", i.e., for successive requests to reliably return the same result.) It just bottles it all into one highly inconsistent box, maybe, and requires me to learn an entire language on the side.
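To make "settle" concrete, this is a sketch of the kind of check a script ends up needing: keep reading until several consecutive reads agree. `read` is a hypothetical callable returning whatever state you care about, and the defaults (3 matching reads, 30-minute budget) are arbitrary assumptions:

```python
import time


def wait_until_settled(read, *, required=3, timeout=1800.0, interval=10.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll `read` until it returns the same value `required` times in a row.

    Raises TimeoutError if that does not happen within `timeout` seconds.
    """
    deadline = clock() + timeout
    last = None
    streak = 0
    while True:
        value = read()
        # Extend the streak on a repeat; otherwise start over at 1.
        streak = streak + 1 if streak and value == last else 1
        last = value
        if streak >= required:
            return value
        if clock() >= deadline:
            raise TimeoutError("state did not settle before the deadline")
        sleep(interval)
```

Even this is only a heuristic: a run of identical reads can come from the same stale replica, so it narrows the window rather than closing it.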
> That works, because the GUID can be used even if the full object is not fully replicated around yet.
1. Create resource. Success.
2. Attempt to use/reference first resource in another resource/call: Failure: Referenced resource does not exist. Odd.
3. Create resource again? Failure! Resource already exists.
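The three steps above translate into roughly this kind of client-side handling, sketched with hypothetical `AlreadyExists` and `NotFound` exceptions standing in for whatever the real SDK or CLI reports: treat "already exists" on create as success, and retry "does not exist" on the follow-up call with backoff.

```python
import time


class AlreadyExists(Exception):
    """Hypothetical stand-in for a 'resource already exists' error."""


class NotFound(Exception):
    """Hypothetical stand-in for a 'referenced resource does not exist' error."""


def ensure_created(create):
    """Run `create`; an AlreadyExists error means an earlier call succeeded.

    Returns True if this call created the resource, False if it was there.
    """
    try:
        create()
        return True
    except AlreadyExists:
        return False


def retry_not_found(action, *, attempts=6, base_delay=1.0, sleep=time.sleep):
    """Retry `action` with exponential backoff while it raises NotFound."""
    for attempt in range(attempts):
        try:
            return action()
        except NotFound:
            if attempt == attempts - 1:
                raise  # budget exhausted: surface the error
            sleep(base_delay * 2 ** attempt)
```

Usage would look like `ensure_created(make_thing)` followed by `retry_not_found(use_thing)`. Note this still cannot distinguish "not replicated yet" from "genuinely gone"; it just bounds how long you are willing to guess.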
[1]: https://jepsen.io/consistency/models/read-your-writes