
This is because neither AWS nor Azure uses referential integrity in any of their "cloud scale" databases. For example, Azure uses some hideous JavaScript-based document DB where things like renames, moves, and deletes are hit-and-miss at best. A never-ending whack-a-mole of bugs and issues.

Remember boys and girls: Being "cloud scale" means data corruption and referential integrity violation!




Not only no referential integrity, but also no support for Read-Your-Writes[1]. Cloud scale! Nothing like:

1. Create resource. Success.

2. Attempt to use/reference first resource in another resource/call: Failure: Referenced resource does not exist. Odd.

3. Create resource again? Failure! Resource already exists.

Our scripts have so many retry loops and arbitrary pauses mixed in to account for garbage like this. Trying to distinguish "did the call fail?" from "is the system just lost in the land of eventual inconsistency?"... ugh.
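
The shape every one of those scripts ends up reinventing looks roughly like this (a sketch only; the helper and exception names are made up, not any real SDK):

  import time

  class ResourceAlreadyExists(Exception):
      """Hypothetical stand-in for whatever conflict error the service raises."""

  def create_and_wait(create, get, attempts=30, delay=2.0):
      """Create a resource, then poll until it is actually visible to reads."""
      try:
          create()                   # step 1: reports success
      except ResourceAlreadyExists:  # step 3: re-creating conflicts...
          pass                       # ...so treat "already exists" as success
      for _ in range(attempts):
          resource = get()           # step 2: may still insist the resource doesn't exist
          if resource is not None:
              return resource
          time.sleep(delay)          # the arbitrary pause, as lamented above
      raise TimeoutError("created, but never became visible to reads")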

And yeah, shout out to AAD, where I can have a role assignment granting permissions to "unknown". We call them ghosts.

[1]: https://jepsen.io/consistency/models/read-your-writes


> Our scripts have so many retry loops and arbitrary pauses mixed in

This is unfortunately your fault, not Azure's. Their API is explicitly designed around this eventual consistency and weak references, so your client side must take this into account. Typical bash, Python, or PowerShell scripts are the Wrong approach with a capital W, and you will forever be tearing your hair out if you persist in using them. (Or any similar imperative deployment mechanism.)

The only robust method is ARM Templates, or better yet, Bicep templates[1]. The latter simply compile down to ARM, so they're essentially equivalent but terser and with nicer tab-complete.

Compared to scripts, templates have key advantages:

1. Built-in incremental / differential deploy capability. A partially deployed template can be simply redeployed[2] to "fix it up", without requiring client-side logic for every corner case.

2. Can deploy multiple changes that would fail if deployed step-by-step. For example, App Gateway can have intermediate configurations that won't validate on the way to a valid final configuration. This is madness to unravel with scripts. Templates generally just take you to the final configuration in one step.

3. Inherently parallel. Anything that can be deployed concurrently will be. Anything. No need to write complex and error-prone parallel loops on the client side!

4. Largely immune to temporary failures like missing reads after writes.[3] The template engine has a built-in retry loop for most (all?) fallible steps. You'll see it has "failed"... and then "succeeded" anyway.

[1] https://docs.microsoft.com/en-us/azure/azure-resource-manage...

[2] Most of the time. All resources should be idempotent to redeployment, but many aren't because this is not mechanically enforced. IMHO, this is just shoddy, shoddy engineering and everyone involved should be ashamed. Being nearly idempotent is like being nearly pregnant.

[3] You still need a few tricks up your sleeve for robust deployments. Anything outside of ARM, such as Azure AD groups and RBAC, tends to be a PITA. Generally you want to wrap your deployments in a script that takes the object GUID of the created group and feeds it into the template. That works because the GUID can be used even if the object has not fully replicated yet.
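
The shape of that wrapper, for the curious, is roughly this (a Python sketch shelling out to the az CLI; the group name, template file, and parameter name are placeholders, and the JSON field holding the GUID differs between CLI versions):

  import json
  import subprocess

  def az(*args):
      """Run an az CLI command and return its parsed JSON output."""
      result = subprocess.run(["az", *args, "--output", "json"],
                              check=True, capture_output=True, text=True)
      return json.loads(result.stdout)

  # Create the AAD group outside of ARM and capture its object GUID...
  group = az("ad", "group", "create",
             "--display-name", "my-app-operators",
             "--mail-nickname", "my-app-operators")
  group_object_id = group["id"]  # older CLI versions expose this as "objectId"

  # ...then feed the GUID into the template deployment. The GUID is usable for
  # role assignments even before the group object has replicated everywhere.
  az("deployment", "group", "create",
     "--resource-group", "my-rg",
     "--template-file", "main.bicep",
     "--parameters", f"operatorsGroupObjectId={group_object_id}")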


Yeah, so I'm hitting this again today, and now I can comment more knowingly.

> Generally you want to wrap your deployments in a script that takes the object GUID of the created group and feeds it into the template. That works because the GUID can be used even if the object has not fully replicated yet.

In the case I have today (wanting to perform an action on a newly created application), this trick doesn't work. (We make the call by specifying the application by ID, too.)

"Bad Request […] It looks like the application '[ID]' you are trying to use has been removed or is configured to use an incorrect application identifier."

It's not removed, of course, and the ID isn't incorrect.

Edit: actually, it's worse than that. There's the above error, which is essentially a read-your-writes failure.

Our scripts retry on that, because they have become accustomed to AAD's shit. But we eventually hit this sequence of events:

  1. Create the app
  2. Grant admin-consent
  [other necessary setup]
  3. Create an AKS cluster: <this fails>
And it fails because the app needs to have admin-consent granted on it. But we do that, in step 2, and my logging is now good enough to show that we not only retry it after a read-your-writes failure, but that the command eventually succeeds; the UI just never ends up reflecting that. This is not a read-your-writes failure, this is a lost write!


To be frank, that's a lot of words to say that, instead of fixing the bugs at their core, MS wrote an entire product to try to work around the bugs in the original product, and now wants me to use that layer instead. And, yeah, that's about how MS sees the world. But "yeah, no" is the serious answer there, and we're moving anything we can to Terraform first.

Most of the failures we see are with AAD, which, AIUI, ARM templates do nothing for.

Even within ARM, as I understand it, templates cannot handle deletes or changes. (They are deployments of new resources.)

And even if one could use templates, that just abstracts the same problem: how long do you wait for the template once it says it has finished, and how long if it hasn't? (We see changes in ARM take >30 minutes to take effect, and even after completion it can take more minutes for things to "settle", i.e., for successive requests to reliably return the same result.) It just bottles it all into one highly inconsistent box, maybe, and requires me to learn an entire language on the side.
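
Concretely, "waiting for it to settle" means growing client-side glue along these lines (a sketch of the coping strategy, not anything Azure provides; the thresholds are arbitrary):

  import time

  def wait_until_settled(read, required_consecutive=5, delay=30.0, timeout=3600.0):
      """Poll read() until it returns the same value several times in a row."""
      deadline = time.monotonic() + timeout
      last, streak = object(), 0
      while time.monotonic() < deadline:
          current = read()
          streak = streak + 1 if current == last else 1
          last = current
          if streak >= required_consecutive:
              return current           # "settled": successive reads finally agree
          time.sleep(delay)
      raise TimeoutError("resource never settled within the timeout")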

> That works because the GUID can be used even if the object has not fully replicated yet.

Interesting. I'll keep that in mind.


Not even Redshift enforces uniqueness on its primary keys or referential integrity on its foreign keys. Finding that out was fun…
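
For anyone who hasn't hit it: the PRIMARY KEY declaration is accepted but treated as informational only, so something like this goes through without complaint (a sketch assuming psycopg2 and a reachable cluster; the connection string is a placeholder):

  import psycopg2

  conn = psycopg2.connect("host=example.redshift.amazonaws.com port=5439 "
                          "dbname=dev user=admin password=placeholder")
  cur = conn.cursor()
  cur.execute("CREATE TABLE t (id INT PRIMARY KEY, name VARCHAR(32))")
  cur.execute("INSERT INTO t VALUES (1, 'first')")
  cur.execute("INSERT INTO t VALUES (1, 'second')")  # duplicate PK: no error
  conn.commit()
  cur.execute("SELECT COUNT(*) FROM t WHERE id = 1")
  print(cur.fetchone()[0])  # 2: both rows were stored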


> Finding that out was fun…

Not my idea of a good time, but whatever floats your boat. Hopefully you had a backup?


That was almost certainly sarcasm.


I don’t think it is - the issue is that the Batch service needs to assume your role in order to clean up associated cluster resources like auto scaling groups.

If the role is deleted, it can’t do this.


That's what referential integrity violation means: an essential related piece of information can be deleted without the parent/using object also being deleted.


Sure, in an abstract sense maybe, but in a cross-service context it’s got nothing to do with their “cloud scale databases”.

In fact, being able to delete a role without removing any associated resources is a feature, not a bug. And how would you even ensure referential integrity in this case - you would achieve the same effect by modifying the assume role policy but keeping the role around.

You could craft a policy that only allows Batch to assume the role on a Tuesday for example.


Wedging a cloud resource so you can't delete it is always a bug.


Maybe. But the GP above has a point. What does a role have to do with this particular resource? You can’t block role deletion just because it’s associated with a resource; that would be a functional nightmare. You also can’t just cascade-delete all resources associated with a role. The only real thing that can be done is to assign the resource to some superuser with all the rights so they can delete it instead.


> You can’t block role deletion just because it’s associated with a resource; that would be a functional nightmare.

Asserting this doesn't actually make your argument for you.

Why would this be a problem?


It doesn’t make any sense. Forgive me for saying this, but if you don’t really know how all this works then it’s a convenient throwaway thing to suggest.

So I think it’s up to you to explain how this would work when there is no difference between a role being deleted and a role being inaccessible.

What exactly would you do if I block the instance with ID “xyz” from assuming the role I have assigned to it? How would you detect that I’ve done this in every single case?


That's an inline policy which is attached to the role. And notably, AWS roles require all policies to be removed before they can be deleted (which I know because I spent a bunch of time recently fixing ordering issues with CloudFormation deletes).
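
(For reference, the cleanup order that delete_role demands looks roughly like this in boto3; a sketch with a placeholder role name, pagination omitted:)

  import boto3

  iam = boto3.client("iam")
  role_name = "example-batch-role"  # placeholder

  # Managed policies must be detached first...
  for policy in iam.list_attached_role_policies(RoleName=role_name)["AttachedPolicies"]:
      iam.detach_role_policy(RoleName=role_name, PolicyArn=policy["PolicyArn"])

  # ...inline policies deleted...
  for policy_name in iam.list_role_policies(RoleName=role_name)["PolicyNames"]:
      iam.delete_role_policy(RoleName=role_name, PolicyName=policy_name)

  # ...and the role removed from any instance profiles...
  for profile in iam.list_instance_profiles_for_role(RoleName=role_name)["InstanceProfiles"]:
      iam.remove_role_from_instance_profile(
          RoleName=role_name, InstanceProfileName=profile["InstanceProfileName"])

  # ...before this call stops failing with a DeleteConflict error.
  iam.delete_role(RoleName=role_name)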

Again, you keep just asserting this isn't possible: why? AWS is aware you need the role to exist to delete the instance (as noted in the OP), so why is it apparently completely inconceivable that the IAM system would check for this condition before executing an action that, again, irreversibly wedges a delete operation for a resource?

Unless you have some deep knowledge of how AWS IAM is implemented that makes this literally impossible, you're asserting fluff.


It’s possible; it’s just convoluted: it introduces a weird circular dependency, isn’t always the behavior you want in the general case, is already achievable without changes to the IAM service via CloudFormation or Terraform, would require IAM to support every single possible “delete” variant (delete an instance with a backup? Without one? Final RDS snapshot name? Etc.), would need to recursively delete resources, would require a complex set of APIs to track the asynchronous deletion process, and, as stated several times before, is completely ineffective against conditions where the role hasn’t been deleted but the service cannot assume it. Which leaves you in exactly the state you’re trying to avoid.

In short: it would be a confusing mess for nebulous gains that doesn’t pass any kind of smell test. Instead they should just… fix the AWS Batch service.

The better solution is to provide an API and console tab to show you what services last used the role, when they used it and how they used it. Which is what they do.


Yes, in the service that provides the AWS resource.

Because it didn’t handle the fact that the role it’s using might be deleted or otherwise rendered un-assumable for a variety of different reasons at any point in time.

Which is a feature. Not a bug.


I’m curious how you see this as a feature when it can get you into a very expensive and unresolvable situation: an AWS resource can’t be deleted and is running up costs. You’re at the mercy of AWS support.


It can also be a security vulnerability if the resource that cannot be deleted is compromised and can access or contains critical data, for example.


Can it? The only case I know of involving roles specifically is Batch, and the resources it’s trying (and failing) to clean up are ones with absolutely no cost.

It’s a feature because there are plenty of cases, such as a role being compromised, where you don’t want to cascade-delete every single associated resource without any need.

If you want this, you can opt into it by using CloudFormation.


It’s not a feature, but how do you solve this? You can’t block role deletion just because it has a resource associated with it; that would be a functional nightmare.


A classic way of doing this would be to prevent, by default, deleting a role that has resources tightly associated with it. If someone really wants to delete the role anyway, you can provide a feature to do that and cascade-delete the associated resources.


It’s like you didn’t even read what I wrote. That way sucks, especially if you have hundreds of allocated resources. Cascade delete is not a realistic option here either. Imagine someone leaves the company without notice. Do all their resources need to be deleted?


When someone leaves a company, you look at their resources and assign new people to them or delete them. But you should prefer teams and projects over individuals anyway.


"Cascading deletes"



