Nginx reverse proxies retries PUT/POST/DELETE on response timeout by default

JensRantil · on March 3, 2016

This essentially became the laser of death the other day and lead to cascading failure which eventually brought down our system. That's why I'm posting this.

Very few people know about this and it's really scary. I'm happy people are voting this up to increase some awareness of this.

Potential workarounds: You might think disabling `proxy_next_upstream timeout` will do, but that will also disable connection timeout retry which is not what you want!

Increasing `proxy_connect_timeout` is not an option, because then you risk filling up too many connections in the nginx instance if the upstream server swallows SYN packets or whatnot.

The real workaround: Use haproxy. Serioously.

mhotchen · on March 3, 2016

We've hit 3 problems with nginx:

1. Exactly this, we had mystery double trades from our clients and it took us a long time to realise it was nginx assuming we timed out and routing traffic to the next server

2. It doesn't do health checks. When a server goes down it will send 1 out of every 8 real requests to the down server to see if it responds. Having disabled resubmitting of requests to avoid the double trade issue above this means when one of our servers is down, 1 out of every 8 requests will have an nginx proxy error which is significant when you have multiple API calls on a single page

3. This isn't something I've personally hit so can't explain the nitty gritty but it's something one of my coworkers dealt with: outlook webmail does something weird where it opens a connection with a 1GB content size, then sends data continually through that connection, sort of like a push notification hack. Nginx, instead of passing traffic straight through, will collect all data in the response until the response reaches the content size provided in the header (or until the connection is closed). I don't know if nginx is to blame for this one or not, but I do feel that when I send data through the proxy, it should go right through to the client, not be held at the proxy until more data is sent.

HAProxy also solved our issues and is now my go-to proxy. Data goes straight through, it has separate health checks, and it better adheres to HTTP standards. It can also be used for other network protocols which is a bonus.

timthorn · on March 3, 2016

Whilst Nginx doesn't do healthchecks, they are available in Nginx Plus. I do appreciate that it is a charged for product, but it has a number of strong features over and above the OSS version and of course support (who are very responsive indeed).

ende42 · on March 4, 2016

3. is the reason why NGinX is the recommended proxy in front of webapps with scarce parallelism (for example Ruby with Unicorn; see http://unicorn.bogomips.org/PHILOSOPHY.html for an explanation) when "slow clients" are to be expected. NGinX is protecting the webapp from blocked workers by slow clients and Outlook Webmail seems to behave just like one. I don't know by heart how to tune this behavior if one wants to avoid it but this property is the main reason we use NGinX.

vsl · on March 5, 2016

That's… unique - and wrong - spelling of the name. (Pet peeve of mine, people spell my app's name in all sorts of bizarre ways too.)

Matthias247 · on March 4, 2016

This sounds like something else. In the outlook case their servers they seem to use the connection as a stream (which is actually valid, although not really supported by browsers outside of the event-stream class), where the server only writes little chunks of data of a time. But the server there not hindered from writing by a slow client - it simply has not more data to write at that point of time.

skinowski · on March 3, 2016

Regarding 3, buffering behavior is highly configurable in nginx.(eg. proxy_request_buffering, proxy_buffering on/off)

orthecreedence · on March 4, 2016

Wrote a post where I ran into (and fixed) this problem with streaming uploads through nginx: http://killtheradio.net/technology/nginx-returns-error-on-fi...

kvz · on March 3, 2016

It's only as of 1.8 that you can disable buffering of incoming requests though. Just a few month iirc.

bungle · on March 3, 2016

Nginx can also be used for other protocols, see stream block.

mhotchen · on March 3, 2016

I didn't know that. Thanks for the correction.

kodablah · on March 3, 2016

"but that will also disable connection timeout retry which is not what you want!"

Why is this not what you want? Are you using the reverse proxy as a load balancer to multiple servers? Otherwise, if it's 1:1 proxy (for something like SSL termination) wouldn't having nginx fail/timeout when the server does be acceptable?

marcosdumay · on March 3, 2016

> Are you using the reverse proxy as a load balancer to multiple servers?

That's extremely likely.

Somebody with an Nginx reverse proxy is probably using it for high availability, load balancing and static files cache, probably at the same time. This is what it is good for.

fgonzag · on March 3, 2016

Using NGINX as a reverse proxy is an extremely common scenario. In fact that's what I currently run (with a support subscription), but will be evaluating moving to HAProxy if their tech dept does not provide a way to resolve this issue (which is actually a very big deal for me, and I was not aware)

owengarrett · on March 3, 2016

This is Owen from NGINX. We have a workaround for this behavior (https://gist.github.com/thresheek/2fa6479ffb7aca710493), and are tracking a separate new feature request. Please submit a support ticket or send me an email, owen@nginx.com.

fgonzag · on March 4, 2016

Thank you, I will be opening the ticket tomorrow. Regarding the gist you just posted, it seems this simply disables proxy_next_upstream for any and all non idempotent requests.

However what would really need to happen is to only disable proxy_next_upstream if data has been written or read from the backend(preferably configurable by backend or location for either of those two options). Right now you basically lose the redundancy in non-idempotent requests, and immediately return the error. Or maybe I read the configuration incorrectly.

JensRantil · on March 3, 2016

Yes, I have multiple servers behind nginx. It's very common.

bungle · on March 3, 2016

Does this help: http://nginx.org/en/docs/http/ngx_http_proxy_module.html#pro... http://nginx.org/en/docs/http/ngx_http_proxy_module.html#pro...

JensRantil · on March 3, 2016

No. Requests will still be retried.

gshulegaard · on March 3, 2016

What about `proxy_next_upstream off;`?

JensRantil · on March 3, 2016

If I temporarily bring an upstream application down for upgrade I want nginx to retry the next upstream. This is a very common scenario when doing reverse proxying. Disabling next upstream breaks this.

owengarrett · on March 3, 2016

This is unfortunate behavior on timeout, and we've shared a workaround solution available using maps. There's a configuration example in this Gist: https://gist.github.com/thresheek/2fa6479ffb7aca710493.

We're also going to prioritize a complete fix in the product, and encourage your comments and input on this ticket: https://trac.nginx.org/nginx/ticket/488

Disclaimer: I work @ NGINX. Thanks, Owen

JensRantil · on March 4, 2016

Double post. See answer here: https://news.ycombinator.com/item?id=11221392

colanderman · on March 3, 2016

How is one supposed to take seriously web infrastructure software that exhibits such a basic failure of understanding core web standards? From even a cursory reading of the HTTP RFCs one will understand that "POST = unsafe = don't retry after request sent = return 504 on reply timeout".

I mean, a bug's a bug; but this was known for two years!

coldtea · on March 4, 2016

>How is one supposed to take seriously web infrastructure software that exhibits such a basic failure of understanding core web standards?

How? Probably based on the fact that otherwise it's a frigging great app that powers like 15% of the web, including some of the biggest sites out there.

jimjag · on March 4, 2016

Plus, it is obvious that some fanboys will promote it no matter what...

coldtea · on March 4, 2016

Yes, please continue calling 40+ year old developers with ancient unix experience "fanboys".

Because obviously we're all 20yo in HN...

dustingetz · on March 3, 2016

The real problem is that the programmer tools of today are fundamentally flawed to the point that no software can be fully understood or verified by a human.

I can imagine a future world based on pure functional programming where this is no longer the case. You'd need to rewrite the operating system too, which is the explicit goal of the Urbit project.

nicolas_t · on March 3, 2016

I've been bitten more than once by this.

Example situation, you have a request to process an uploaded file which is only for admin purpose so you didn't take the time to use a queuing system to do the heavy lifting in the background. Then, your customer uploads a file that takes much longer than normal and the request times out, that file is then sent multiple times to the app server and the user sees multiple uploads..

The behavior in term of errors should definitely be different if the sending request failed (in which case resending to the next upstream is fine) and if receiving the response failed (in which case it's often not a good idea to resend)

_hyn3 · on March 3, 2016

This behavior is non-compliant[1] with the RFC.

Although (from the standpoint of the RFC), everything on the server side (including nginx itself) is considered the web application, nginx probably takes the implicit position that dealing with multiple requests on non-idempotent methods such as POST is really a problem that the proxied web app itself should cope with.

But then nginx puts the web app in an untenable position. Consider the example of non-idempotent POST to create a new user account. The new user account includes a username, email address, and password. Because it's proxied, nginx creates a duplicate request for this new user account in the circumstances described in this bug report.

How should the web app deal with the duplicate request?

a. Accept the first request (200 OK) and decline the second request since the account was already created (i.e., 409 Conflict), or

b. Create two duplicate user accounts (200 OK for both)

Obviously, the ONLY correct response is the first one, but what happens next is really up to nginx: will the client receive the 409 Conflict (etc) or will it receive the 200?

Well, who knows?! It's completely indeterminate.

If the client gets the 200 OK, great. But what if it doesn't? These duplicate requests seem like they could lead to an nginx race condition as well. And what gets logged?

This behavior clearly violates both the spirit and the letter of RFC 7231 (as well as being an obviously poor engineering decision!).

Note also the long time (years!)[2] that this has been a known, outstanding bug without any action taken. Another commenter actually said this caused cascading failure in their application that killed their app.

Bottom line... nginx is a great, fast static server, but definitely not a good proxy for dynamic apps. We're trying to figure out how fast we can migrate Userify (plug: SSH key management for EC2)[3] from nginx to HA-Proxy, since we use it to front-end our REST API.

1. https://tools.ietf.org/html/rfc7231#page-23

2. https://trac.nginx.org/nginx/ticket/488#comment:3

3. https://userify.com

stormbrew · on March 3, 2016

By default[1], nginx only talks to backends in http/1.0, so the operative rfc is (sadly) https://tools.ietf.org/html/rfc1945. Though it did establish GET/HEAD as safe and other methods as not, the idea of idempotence itself was not yet present and it doesn't have any language I'm aware of to restrict client retries on non-safe methods.

That said, I don't know if nginx does any better if you set it to http/1.1 mode on this issue. I assume not, to be honest.

[1] http://nginx.org/en/docs/http/ngx_http_proxy_module.html#pro...

dragonwriter · on March 3, 2016

> By default[1], nginx only talks to backends in http/1.0, so the operative rfc is (sadly) https://tools.ietf.org/html/rfc1945. Though it did establish POST/PUT/etc. as 'safe'

No, only GET and HEAD are safe in RFC 1945.

> the idea of idempotence itself was not yet present and it doesn't have any language I'm aware of to restrict client retries on non-safe methods.

That actually doesn't really change the situation that much: without an idempotence guarantee, there is no protocol-level basis for a proxy (reverse or otherwise) to assume that a non-safe method is repeatable. Under HTTP 1.0, by the RFC alone, there's no justification for treating anything other than GET or HEAD as reliably repeatable. (Except perhaps that the operations described by PUT and DELETE are at least arguably, as specified, idempotent, even though the term is not invoked and the guarantee is not made express.)

stormbrew · on March 3, 2016

> No, only GET and HEAD are safe in RFC 1945.

Brainfart typo, corrected.

jimjag · on March 3, 2016

A pretty good overview of reverse proxies can be found in [1]... spoiler: nginx is NOT one of the best-of-breed. Compliance was/is indeed an issue.

1. http://www.slideshare.net/bryan_call/choosing-a-proxy-server...

mortenlarsen · on March 3, 2016

Non-SlideShare link (I think it is the same): http://cdn.oreillystatic.com/en/assets/1/event/115/Choosing%...

todd17 · on March 3, 2016

Scenario 1:

1. nginx times out while server processes request

2. nginx makes request to second server and second server returns "account already exists"

Scenario 2:

1. nginx times out while server processes request and returns error

2. user attempts to create account again and server returns "account already exists"

lawpoop · on March 6, 2016

What about when an ajax POST adds an item to a shopping cart?

There is no unique identifier in the line item count, and thus no way to determine "account already exists".

Worst case scenario, user gets extra item(s) in their cart, doesn't really look at the totals, and orders, pays for, and receives them.

marcosnils · on March 3, 2016

We had this same problem some time ago in our company. That's why we came up with this.

https://github.com/xetorthio/nginx-upstream-idempotent

JensRantil · on March 3, 2016

Cool. Did you ever consider patching nginx upstream instead?

xetorthio · on March 3, 2016

Yes. Before implementing this module we contacted nginx developers and they didn't think it is a problem. This is why we had to create our own module.

laurent123456 · on March 4, 2016

Did they explain why they don't think it's a problem?

jimjag · on March 3, 2016

This almost seems non-compliant w/ the action HTTP spec.

SilasX · on March 3, 2016

Per my other comment [1], it's only noncompliant for POST; for PUT (and DELETE) it's more compliant than people want or expect!

[1] https://news.ycombinator.com/item?id=11217686

Rabidgremlin · on March 3, 2016

Yeah, tripped over this beauty a few years ago... Took days to figure out what was going on :( Was using nginx as internal load balancer across a RESTful services layer.

muraiki · on March 3, 2016

Forgive me for not fully understanding this. If I'm just using proxy_pass to a single server (vs using proxy_pass with round robin or using proxy_next_upstream for failover) would this still affect me? In my experience with proxy_pass, a timeout on upstream was reported to the client and the POST was not retried.

JensRantil · on March 3, 2016

No, then you are safe.

billpg · on March 4, 2016

Some APIs are not designed very well in the face of the possibility of a time-out on a POST, because the client can't be sure if the request was successful or not.

http://blog.hackensplat.com/2014/07/is-your-api-broken.html

verytrivial · on March 3, 2016

Am I the only Chrome on Android user whom received a PKCS#12 access request for trac.nginx.org? I have honestly never seen that before.

JensRantil · on March 3, 2016

Yeah, I got it, too. That was a first-timer...

typewriter_t · on March 3, 2016

I can't reproduce it. I have nginx proxy_pass'ing to two upstreams and configured with proxy_next_upstream timeout;

One of the upstreams is running iptables -A OUTPUT -p tcp --sport 8080 --tcp-flags PSH PSH -j DROP

CURLing the nginx location configured for proxy_pass'ing returns 504 GATEWAY_TIMEOUT on half of the requests, as expected.

Zikes · on March 3, 2016

This is why idempotence is so important.

dragonwriter · on March 3, 2016

Well, its why adherence to the semantics of the HTTP spec (which this behavior is an example of failing) is important: POST is not defined to be idempotent, so nothing should act by default as if POSTs are repeatable.

nickzoic · on March 4, 2016

Well, sure, and theoretically a success gets you 2xx and a failure gets you 4xx/5xx.

But there's a layer beneath HTTP as well. If all you get back is a TCP RST, did the request succeed or fail? How about if you get an ICMP unreachable or just a timeout ... should you retry?

So, the Internet being what it is, it is probably not a bad idea to aim for idempotence for the critical bits.

SilasX · on March 3, 2016

But it's only the POST that this is a problem for, right? PUT and DELETE are supposed to be idempotent so retries are okay, yes?

xxs · on March 3, 2016

There could be races between PUT/DELETE. There is no grantee how the retry was made.

SilasX · on March 3, 2016

A race shouldn't matter so long as both succeed at least once; that would have the same effect as if either/both had succeeded multiple times. The only difference is whether the user is informed of whether a related action obviated their request, which is going to happen anyway.

Edit: turns out I was wrong and assumed PUT should fail if the resource doesn't exist, which isn't how it works. (Probably because of writing apps that deprecate it in favor of PATCH.)

dragonwriter · on March 3, 2016

> A race shouldn't matter so long as both succeed at least once; that would have the same effect as if either/both had succeeded multiple times.

Idempotence only means that the same single method repeated additional times on its own will not produce different end states. It doesn't necessarily guarantee this for combinations of methods in different orderings.

It doesn't stop PUT/DELETE/PUT/DELETE from having different results than PUT/DELETE/DELETE/PUT to the same resource. (You can do assure that these are equivalent in a particular HTTP-compliant application, but it goes beyond the base semantics of HTTP to do so.)

SilasX · on March 3, 2016

I was only saying there that the combination didn't create other problems (due to race conditions), not that that fact was related to idempotence. Though it happens to be true for the combination of PUT/DELETE as well!

I think you're equating my claims about what methods are idempotent with my claims about what reorderings matter.

Someone · on March 3, 2016

Shouldn't matter?

  DELETE foo/bar
  PUT foo/bar

If that delete gets a retry, actual execution order could be

  PUT foo/bar
  DELETE foo/bar

Or am I misunderstanding this?

SilasX · on March 3, 2016

It doesn't matter: as I said, that has the same end state (no foo/bar resource), the only possible difference is response code i.e. whether (in this case) you get to learn that your update doesn't matter.

marcosdumay · on March 3, 2016

The first sequence ends with foo/bar existing. The second one ends with it not existing.

I'd say that anybody that sends a pair of PUT/DELETE requests in fast succession over the web and expects a stable result is a fool. This should have no effect on practice, because nobody should be relying on the ordering anyway.

SilasX · on March 3, 2016

Ah, my mistake. I had always equated PUT with updating and assumed it should fail if it doesn't find the resource. Big oversight!

rbw · on March 3, 2016

I think you're correct, actually. If you want to create a new object, it should be a POST. In a well designed RESTful service, a PUT on an object that doesn't exist should fail, and both of the PUT/DELETE orderings above should result in the same state of the world: the object does not exist.

SilasX · on March 3, 2016

I thought so too, but when I looked at the spec [1], it agreed with the others:

>>The PUT method requests that the enclosed entity be stored under the supplied Request-URI. If the Request-URI refers to an already existing resource, the enclosed entity SHOULD be considered as a modified version of the one residing on the origin server. If the Request-URI does not point to an existing resource, and that URI is capable of being defined as a new resource by the requesting user agent, the origin server can create the resource with that URI

[1] https://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html 9.6

rbw · on March 4, 2016

Sounds like the others are indeed correct. Thanks for the citation!

thedufer · on March 3, 2016

You're probably thinking of PATCH. In many, if not most, RESTful services, PUT is given PATCH semantics. PUT is supposed to be insert-or-update.

rbw · on March 4, 2016

Ah, I stand corrected. Thanks!

colanderman · on March 3, 2016

This is why the If-Match and If-None-Match preconditions exist; they resolve PUT races by checking that the resource is in the expected state.

jessaustin · on March 3, 2016

I think you mean If-Match and If-Unmodified-Since. Actually they are relevant for DELETE as well. E.g. you might not want to DELETE if another client has just PUT.

colanderman · on March 3, 2016

If-None-Match: * ensures that another client hasn't created the resource you are trying to create. It is equally important as If-Match for resolving race conditions.

jessaustin · on March 4, 2016

It's fairly uncommon to use PUT for resource creation. In that case, however, if the server supported it, yes you could use If-None-Match. I really have to wonder about the architecture of the system, however, if two clients can simultaneously decide to create the same resource rather than two similar resources.

colanderman · on March 4, 2016

Surely you're joking.

* Literally the first thing RFC 7231 says about PUT is "The PUT method requests that the state of the target resource be created or replaced […]". RFC 7231 takes into account many changes in HTTP practice over the past decade (even bizarre ones like POST-to-GET on a 301 redirect); if create-on-PUT were frowned upon, it would be called out.

* PUT as described in RFC 7231 is the same thing as UPSERT in an RDBMS, or a write operation in a key-value store. These are certainly not uncommon DB operations; their REST analogue is similarly useful.

Here's some examples:

* PUT is how documents are created in WebDAV. WebDAV is multi-user, so two users may decide to create a document with the same name, just like on any file system. If-None-Match: * is the only way to support the O_EXCL flag on POSIX open(2).

* A resource which represents attributes of arbitrary external resources will have a URI named after the external resource (e.g. UPC or SHA-1, etc.), and therefore must be created with PUT. If-None-Match: * is the only way to prevent lost updates when the external resource is first made known to the system.

PUT-as-create is sound design supported by precedent for any system where the keys have a priori meaning.

jefftk · on March 3, 2016

Well, mostly If-None-Match is used to save bandwidth by allowing the client to validate a stale resource.

stanleydrew · on March 3, 2016

Yes. Incorrectly implementing the HTTP spec is widespread though, I imagine especially with PUT.

People are writing custom HTTP application servers and making PUT do anything and everything. Idempotence doesn't usually enter the picture.

jimjag · on March 3, 2016

"Yes. Incorrectly implementing the HTTP spec is widespread though"

Huh??? First of all, that's a pretty serious statement to be making with no evidence to support it. Plus, that is no excuse for any server that is NOT compliant.

The very fact that nginx decides to create HTTP status codes, willy-nilly, for its own use shows at least the suspicion that strict compliance is not a priority for them. Thankfully, it is for other web servers.

lucb1e · on March 3, 2016

From the few days I ever did API design (as a trainee, not even employee yet), idempotence is one of the first things you encounter when looking for general design and it definitely came into view for me. Then again I'm not the average student, but still.

bpicolo · on March 3, 2016

I mean, this is subject to human beings writing things, unfortunately

JensRantil · on March 3, 2016

Huh, didn't know. Thanks for pointing that out!

lukeschlather · on March 3, 2016

Sure, but the point is you should design your POSTs to be idempotent as well.

joosters · on March 3, 2016

Well, they are defined as non-idempotent, so there's no reason why you 'should' design this way. You can't make every request idempotent.

It's best to design so that duplicate POSTs are handled sensibly (e.g. you don't make a user pay for the same product twice), but the response to the second POST is unlikely to be the same as the first one, so they aren't idempotent.

More difficult cases are where an action could legitimately be performed twice, e.g. adding an item to a shopping cart. You must differentiate between a wrongly-duplicated request, and a real request to add the second item. One way to do this is to add parameters to the POST so that it can be identified as a duplicate. But it can be tricky to do this without holding a lot of extra state in the server application, and there are all sorts of concurrency problems when you have a cluster of servers.

lukeschlather · on March 3, 2016

There's really no such thing as perfectly idempotent operations. But it's an ideal to be emulated as much as possible. Even (or maybe especially) when something appears to be by definition not idempotent.

33degrees · on March 3, 2016

I think the most of these cases can be handled via PUT i.e. update a cart so it contains these items. That way you keep all the state on the client.

thedufer · on March 3, 2016

That's great unless people want to shop in multiple browser tabs or anything like that. The real solution here is to use the fact that we have POST which is specced as non-idempotent and not depend on the very small set of technologies that purposefully disobeys the spec.

jerf · on March 3, 2016

"Should" is as strong as you can go with that statement. Not all POSTs can be idempotent. Nginx has to deal with that in the general case.

SilasX · on March 3, 2016

A better approach IMHO is to turn them into PUTs. If something would normally be a POST but you've eg used GUIDs to ensure that creation actions are idempotent, then such actions should be PUTs.

But then again, some things must not be idempotent: eg "shuffle this deck of cards in an order that is random to me".

Edit: On second thought, you could make that idempotent too, albeit at the cost of increasing server load and your app's architecture's complexity -- you would just have to verify that the deck has had some reordering since that client's request, and not make any further reorderings in response to that client, since from their perspective it's still randomized.

Too · on March 4, 2016

Exactly. What happens if the user accidentally hits the submit button twice. Double post should be handled by your backend regardless and the frontend should be designed to submit things in a way that allows this.

drdaeman · on March 4, 2016

This applies to uWSGI (uwsgi_pass) as well, right?

Ralfp · on March 4, 2016

I'm wondering about this myself. I'm looking into alternatives to Apache with mod_wsgi.

drdaeman · on March 4, 2016

Seems that it does. And I would have a solution if only nginx had $request_uri_without_args...

The trick for uWSGI is to have `uwsgi_param PATH_INFO` not $document_uri (it won't work due to `rewrite ^ @nonidem last;`) but an originally requested URI. $request_uri almost does it, but fails when URI has query arguments.

On nginx mailing lists there is a suggestion to strip it myself, with Lua[1], but I'm surely not going to throw in Lua just for this.

[1] https://forum.nginx.org/read.php?2,215192,215195#msg-215195

lziest · on March 3, 2016

That's why I use openresty. I can use balancer_by_lua to customize my upstream selection/retry strategy.