Deploying a Django App with No Downtime (medium.com/healthchecks)
184 points by cuu508 on Oct 22, 2015 | 93 comments



I was responsible for a 10^6 views/day website running on Django (most traffic came during the daytime, which comes out to 10-20 req/sec). I went through hell arriving at this solution because the system was in disarray when I took over and was deployed on an expensive cloud solution (so resources were limited). Finally I just put my foot down and demanded we buy servers instead: we would have four times the bare metal we needed and would pay only a fraction of what we did for the previous "solution". I could provision all the VMs I wanted, provided I didn't break the host.

I arrived at this procedure: Spin up a dyno, worker, and caches, and warm everything up. Run smoketests to validate that everything from the app server onwards is lit and firing on all cylinders. Redirect nginx traffic to the new setup. If everything turns to shit in a second, reverse the traffic change and re-evaluate. If everything works and nothing is broken after an hour, take the old dynos/workers/caches offline.
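A minimal sketch of that cutover in plain Python (the port, smoketest URLs and nginx include path below are invented for illustration, not taken from the actual setup):

    import subprocess
    import urllib.request

    NEW_PORT = 8001                                         # port of the freshly warmed stack
    UPSTREAM_CONF = "/etc/nginx/conf.d/app_upstream.conf"   # hypothetical nginx include file

    def smoketest(port):
        """Hit a few known-good URLs on the new stack before sending real traffic at it."""
        for path in ("/", "/health/"):                      # hypothetical check URLs
            with urllib.request.urlopen("http://127.0.0.1:%d%s" % (port, path), timeout=5) as resp:
                if resp.status != 200:
                    return False
        return True

    def switch_traffic(port):
        """Repoint the nginx upstream at the new stack and reload without dropping connections."""
        with open(UPSTREAM_CONF, "w") as f:
            f.write("upstream app { server 127.0.0.1:%d; }\n" % port)
        subprocess.run(["nginx", "-s", "reload"], check=True)

    if smoketest(NEW_PORT):
        switch_traffic(NEW_PORT)       # rollback = write the old port back and reload again
    else:
        print("smoketest failed, leaving traffic on the old stack")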

The trick was that if there were migrations that would break the older version, you did them in two or more steps to isolate the changed model interactions.

And nobody should be using gunicorn, it's way less performant than uWSGI (citing: http://blog.kgriffs.com/2012/12/18/uwsgi-vs-gunicorn-vs-node...).


This has a name -- it's called blue/green deploys. The idea is that if you're running the 'blue' version of the site, you bring up a new 'green' version, then route traffic to it once everything is fine (usually at the LB level) and revert if necessary.

The way you solve the migration problem (much harder than it sounds) is to do only non-backwards-breaking migrations between a blue/green switch. So you end up doing your migrations in multiple deploys (e.g. add a field, start using the new field with no references to the old field in the code, then remove the old field in the next blue/green deploy). I've generally found that it might not be worth the effort though, depending on the context (especially if your initial health check is before the migration).
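For a field replacement, for instance, the sequence across deploys might look roughly like this (the model and app are invented for illustration, not anyone's actual code):

    from django.db import models

    # Deploy 1 (blue): the new field exists alongside the old one and allows null,
    # so the still-running old code is unaffected. (Assume this lives in a
    # hypothetical "billing" app.)
    class Invoice(models.Model):
        total = models.DecimalField(max_digits=10, decimal_places=2)              # old field
        total_gross = models.DecimalField(max_digits=10, decimal_places=2,
                                          null=True)                              # new field

    # Deploy 2 (green): code reads and writes only total_gross; a data migration
    # backfills it. Rolling back to deploy 1 is still safe, because the old
    # column was never touched.
    #
    # Deploy 3: nothing references total any more, so a final migration drops it:
    #
    # class Invoice(models.Model):
    #     total_gross = models.DecimalField(max_digits=10, decimal_places=2)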


It depends on the cost you would assign to an event. On small code changes I would sometimes just deploy directly and touch-reload the app, but given my experience I was very reluctant to deviate from protocol. In the mad days before my protocol I once had the site in a burning/extinguishing/reigniting cycle for 4 hours. The reason it took so long to figure out what was wrong was that the problematic code had been live for 10 days, in a section that had to run on each request, without a problem. To this day I couldn't tell you why it suddenly started to act up. 4 hours of "House is on FIRE!" will make you cautious.


    And nobody should be using gunicorn, it's way
    less performant than uWSGI
uWSGI may be different now, but 4 years ago I built a browser toolbar C&C API server that peaked at about 2k requests per second. We used uWSGI to front our Django application with SoftLayer HTTP load balancers in front of uWSGI. uWSGI was speaking HTTP and translating to WSGI, but did so only in one thread/process. So while 8-16 Django threads were tootling along, the single uWSGI HTTP->WSGI thread was pegged. We switched to gunicorn and had dramatically better performance.


You can deal with this by having NGINX load balance and translate from uWSGI's binary protocol to HTTP: https://uwsgi-docs.readthedocs.org/en/latest/Nginx.html


It's well worth the effort to get uWSGI running under its native protocol.

Here's a must-read if you use it: http://uwsgi-docs.readthedocs.org/en/latest/ThingsToKnow.htm...


Sounds more like a crappy setup than a problem with uWSGI itself. I know it has a steep learning curve.

I know that it takes time and effort to change out software pieces that have a long history within an organization (I have some experience in that department) even when a clear and technically superior alternative is available.

I have a drawing of my favorite infrastructure that solves a ton of problems before I code one line of anything.

nginx -> uWSGI -> Python(Django) -> Celery(via RabbitMQ) -> Postgres

It's a bit more complicated than this, and redis is missing, but it's a nice linear representation of how I like to structure things. The only thing there that's not lightning-fast is Django, but it doesn't need to be. I'm spending most of my time in the other pieces of software.


    Sounds more like a crappy setup than a problem with uWSGI itself.
No. As I detailed, it was a deficiency with uWSGI itself. Complicating the setup as you suggest is a crappy workaround for a uWSGI deficiency. We simplified the setup by switching to gunicorn. I would still use uWSGI, but, as with all software/tools, I would be aware that it has a deficiency.

uWSGI docs on production usage of HTTP functionality don't mention this issue so I wanted to let others know. https://uwsgi-docs.readthedocs.org/en/latest/HTTP.html?highl...

    I have a drawing of my favorite infrastructure
Wish I had the luxury. We're running 20-30 projects/applications (~3000 tasks) across our clusters. Some are Django fronted by gunicorn, some are directly fronted by flask, but 80% are non-HTTP services.


I'm in a similar situation. A service getting about that many hits per day, with the majority of the traffic being during the day. We have a few servers behind a load balancer to help with the peak traffic. Luckily, traffic peaks during European daytime, so when I'm in the office (USA), it's the low period. I can pull one of our servers out of the load balancer, push the update, put it back in and watch the responses. If things don't blow up, rinse and repeat.

It feels pretty hacky to me though, so I've recently been reading up on Fabric and deployment patterns for Django. Great timing with this article being posted!


Good approach, and thank you for mentioning uWSGI.

Don't forget to warm up Solr or ES as well.


I gave up on ES and never had a use for Solr. I think the problems I had with ES were down to how the contractor set it up, but the only search I have ever needed has been built with the bare-naked ORM. I even did a full-text search using Postgres as a backend without an index: articles of 10^3-10^5 characters, on the order of 10^5 articles, still finding a set of 10-20 articles containing 2-5 search terms within 1 second. This really surprised me, since full-text search in Postgres is supposed to be bad. I tried to find a bad case where the search would take forever, since I thought I had the nightmare case on my hands, but didn't.

uWSGI is a beast to learn, but once you get the hang of it, nothing else will compare.


Postgres text search is really useful, but dedicated search servers still win when you need to start showing faceted results and more tuning of the scoring.


I strongly suggest the ES 2 day training course. I think it's called Core Elasticsearch. It dives deep into how ES works under the hood. It also provides a great overview of all the available features.

What I loved most about the course is that the presenters were extremely knowledgeable and gave great tips on how to manage your schema with ES and configure the clusters so that you don't encounter performance issues later on.

You can do some amazing things with ES that you wouldn't necessarily even know you could do.

One example is the percolator[0], which can be used to slam new documents against existing queries and essentially classify the document based on search queries.

[0] https://www.elastic.co/guide/en/elasticsearch/reference/curr...


I'm sure. In my case the ES integration was botched and I didn't have time to learn a new piece of software to work out the kinks. It's been on my list of "tech-2-familiarize" but somehow never been exigent enough to pull the trigger on.


This only works if database changes are backwards compatible, which is not always the case.

In all other cases your procedure seems the only reasonable one. I don't understand why anyone would do it differently.


That's the trick: transform all database changes to be backward compatible between version n and n+1. It takes a little more time, and you end up with multiple deploys, but it's really quite nice.


To simplify this process further, you can run both the new and old gunicorn process on the same port using the SO_REUSEPORT feature in Linux 3.9. This way you don't need to listen on a new port and update nginx config (potentially once for each upstream web server).

I submitted a patch to gunicorn just this week about this: https://github.com/benoitc/gunicorn/issues/598.

Ideally I'd love to see all server software that listens on a port have a SO_REUSEPORT option for hot-swapping. This feature makes operations so much simpler.
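For anyone who hasn't used it: SO_REUSEPORT lets a second process bind the same address/port while the first is still listening, and the kernel spreads incoming connections across both. A minimal illustration in plain Python (not gunicorn's actual code):

    import socket

    def make_listener(host="0.0.0.0", port=8000):
        """Create a listening socket that other processes can also bind (Linux 3.9+)."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind((host, port))
        sock.listen(128)
        return sock

    # The old and new server processes can each call make_listener() on port 8000;
    # the kernel load-balances accepted connections between them. To cut over, the
    # old process stops accepting and exits once its in-flight requests are done.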


I'm not sure if I believe that SO_REUSEPORT is the right thing here. While both servers are running, aren't they both answering requests, depending on which one gets scheduled first?

Can you work around this by instructing the first server to stop calling accept() before launching the second server? (But keep the first server's listening socket open, in case the second server doesn't start, so you can tell it to start calling accept() again.)


Yes, that's exactly it. First both processes are accepting (at the whim of the scheduler), then the old stops accepting, and then the old exits when it's finished all outstanding requests.


Why do you want both processes to accept? That sounds like a bad thing, since you can't test to make sure the new server came up right. You have no way to make sure you're hitting the new process, until you tell the old to stop accepting. And you want to be able to recover if the new process doesn't work.

What I'd suggest is to add a "stop accepting" and "start accepting" signal, maybe a URL route that checks that it came from localhost or something and sets/clears a flag. If the flag is set, the mainloop skips all calls to accept().

So the full set up is, first the old process stops accepting, but keeps the listening socket open, so new connections queue in the kernel. Then you start the new process and make sure it works. Then you shut down the old process. If the new process doesn't work (e.g., accepts connections and returns 500s or something), then you've lost some requests, but the old process is still around. So you can signal it to start accepting again.
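A toy sketch of that pause-accept loop (not any real server's code; here the flag would be toggled by some out-of-band control such as the localhost-only route suggested above):

    import socket
    import time

    accepting = True   # cleared by the "stop accepting" control, set by "start accepting"

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("0.0.0.0", 8000))
    listener.listen(128)           # while paused, new connections queue in this backlog
    listener.settimeout(1.0)       # wake up periodically so the flag gets re-checked

    while True:
        if not accepting:
            time.sleep(0.1)        # socket stays open: nothing is refused, only queued
            continue
        try:
            conn, addr = listener.accept()
        except socket.timeout:
            continue
        conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
        conn.close()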

Maybe this is overengineering, but the case I'm worried about is where something in the state of the old process is important (it has an open database connection, it has a module loaded into memory that got corrupted/deleted on disk, etc.). So if you shut down the old process before you test the new one, you have no guarantee you can get the old process back up if you need it. You may have lost the node entirely.


But if both processes are listening on the same port, how would he check the health of only the new process?? How do you determine which process answers each incoming connection?


Won't the old processes just be killed off and the new processes take over? Health-wise, supervisor would be the one checking the gunicorn processes?

Looking at this article, it seems that either any newly initialised server just automatically takes over the port, or connections become effectively load-balanced between all running processes (it's hard to tell conclusively)...

http://freeprogrammersblog.vhex.net/post/linux-39-introdued-...


You can keep both the new and old process running concurrently as long as you like, and they'll both be responding to requests. You can run your health checks at this time (perhaps distinguishing the process with a header). When you're confident, send a signal to the old process to gracefully exit.


UWSGI also handles zero-downtime restarts:

https://uwsgi-docs.readthedocs.org/en/latest/articles/TheArt...


If you seriously want no downtime you should never touch-reload. ALWAYS spin up new infrastructure, smoketest, and switch the load balancer. You might run into problems under load that you didn't see in testing, but you can reverse the switch and minimize downtime.


This. I load tested my zero downtime unicorn setup and got a few lost requests during the reload.

That said, if you boot a VM and it's bad, you are going to lose requests then too.


First of all, what is a "bad vm"?

Second: smoketests should show if the engine is up and serving requests not under load.

Third: you have the fallback of reversing the flow.

You will always lose requests. All you can do is minimize the damage.

EDIT: Want to clarify what I mean by "lose requests" since someone is downvoting. In the process of administrating a site with that many requests and a moving infrastructure, you can plan as much as you want and try to employ hedging strategies all you want; you will STILL run into problems once in a while that drop requests. You don't have to lose requests when handing load-balancing off to another handler, and you can reload apps without losing requests if the app server is properly coded.


There was a good talk about deployments at DjangoCon US 2015.

Django Deployments Done Right by Peter Baumgartner: https://www.youtube.com/watch?v=SUczHTa7WmQ&list=PLE7tQUdRKc...


That's me! (shameless plug) I'm also running a Kickstarter campaign for a screencast series that will go into detail on Django deployments http://kck.st/1FRZyMx


Nice approach! "reload" was a nice touch to learn.

What I usually miss are two parts of deploys that are often ignored, and are critical in production environments:

1. Revert/rollback to older versions. We're human, and despite having all the processes in place, Murphy's law sometimes applies on moderately to highly complex server setups.

2. There is git/svn for code tracking, but even more important is database consistency. Versioned backups and restores should also be part of the whole setup.

I'm currently looking into building the full setup with no downtime, and I'm all ears.


Any advice about versioned database backups?


In my opinion, this is one of the major unsolved (or under-solved) issues in web development. You have to write and test migrations, and then you have to make sure the migrations are tied to your deployment/rollback process.

If there's a service/tool that automates a lot of this or makes it safer, I'd be really happy to learn about it!


When I used Django ages ago, there was Django South. I think it's built in now.


Yes, it's built in now: https://docs.djangoproject.com/en/1.8/topics/migrations/

Of course, a database rollback may lose information from any newly created fields.


So if you're aiming for zero downtime rollbacks with Django, here's how adding a new field to a model might work:

  1) Add a Migration that adds the new field, but allows null. Deploy.
  2) Add the field to the model. Also make sure it sets its default on write. Deploy.
  3) Execute a background task to set the field to whatever its default value should be.
  4) Add migrations to enforce integrity and add indexes. Deploy.
  5) Actually deploy code that needs the new field.
Yes, this is super annoying, no, most people don't do this. Separating out #1 and #2 means you can always roll back all the way to right after #1 without losing any data. An extra, nullable field on a model with no indices on it shouldn't hurt anything.
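Steps 1 and 2 might look roughly like this (app, model and field names invented for illustration):

    from django.db import migrations, models

    # Step 1: a migration that only adds the nullable column. The old code keeps
    # working because the new column accepts NULL and nothing reads it yet.
    class Migration(migrations.Migration):
        dependencies = [("billing", "0007_previous")]       # hypothetical app/migration
        operations = [
            migrations.AddField(
                model_name="invoice",
                name="discount",
                field=models.DecimalField(max_digits=6, decimal_places=2, null=True),
            ),
        ]

    # Step 2 (next deploy): the model gains the field and sets a default on write:
    #
    # class Invoice(models.Model):
    #     ...
    #     discount = models.DecimalField(max_digits=6, decimal_places=2,
    #                                    null=True, default=Decimal("0.00"))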


Totally agree. I've been advocating this for months but the ease/simplicity of simply restarting the server with new code has won out for now.


I'm planning to do something like this workflow, most probably reinventing the wheel along software history:

- Every release is already tagged and merged into master (git-flow).
- There is a post-commit hook to back up the database with the tag and time, so I know which database tag is newer.
- Store the encrypted zip on S3, since each developer can deploy.
- Deploy, like cuu508 did.

The problem is with restore. If there is no change in the database, it is more or less straightforward to download and import the db with the n-1 code tag. But if there are any inserts in the database, then a backwards migration should be applied to the latest state of the database. I can imagine it's going to be a hefty workflow.


How about this: have one db with a log of events (CQRS style); the other one is transactional, and transactions are performed based on the ordered events.

When deploying a new version, back up the transactional db and keep a pointer to the last event in the events db. Keep the events db running.

If restoring, restore the transactional db from the backup and "play back" all events from the events db.

The distinction can also be made using one db and backing up / restoring a subset of the tables.


So basically kafka + another layer for database migrations?


In docker-land you've got Flocker. I guess you could version the container with tags...

https://clusterhq.com/


It's interesting, reading about how my counterpart at other shops do it.

In our own Django web app, we basically just use the load balancer for deploys. Our service provider (Linode) has a nice API, so when deploying we just do one server at a time and direct traffic away from the ones doing the upgrade. It's not complicated and works just fine... At least when there are just two machines.


I do pretty much the exact same thing, but with Mesos[1], Aurora[2], and Aurproxy[3] (which is nginx).

It sends traffic to a new instance only after the healthchecks are passing. Makes for a really nice rolling restart workflow.

[1] http://mesos.apache.org/

[2] http://aurora.apache.org/

[3] https://github.com/tellapart/aurproxy


+1 for rolling upgrades... Having 2+ appservers makes upgrades easier and increases resilience in general. Docker makes this really easy...


Could you provide more details on this?


I'm curious, unless I'm missing something, couldn't the OP have tried a graceful reload of the gunicorn process?

>kill -HUP <pid>

EDIT: That probably applies to nginx too.


nginx also has a 'reload' command like apache - service nginx reload. No need to hunt down the pid yourself...


    nginx -s reload
also works, you don't need service/systemctl at all for this.


Ah, thanks, good to know! I should try this instead. Would still need to do something extra when the supervisor task definition changes (e.g., a new environment variable is added), but that shouldn't happen nearly as often as Python code updates.


As I understand it, mod_wsgi with Apache will do a code reload when you either touch your WSGI file or do an apache2 reload (that is, send it a SIGHUP). The big downside there is having my Python code be reloaded and re-parsed as the new processes get spun up. That usually results in the first few requests being really slow. But it is effectively a zero-downtime system. In general, if the components of your stack support being reloaded via SIGHUP instead of having to be restarted, use that. If they do not, consider using different components.

However, I'd say that this gets a whole lot easier as you add even a "local" load balancer: that is a load balancer that lives on the same box but does not require a restart during a deploy. In this case you can start up your gunicorn/apache2/uwsgi/whatever on a new internal port, then remap which port is "live" using your firewall or by updating your load balancer's config and reloading it. This is how dokku works with Docker containers and I love that I have zero downtime deploys with it out of the box.


Instead of rewriting your supervisord config every time you should be able to use a symlink that points to the newest virtualenv. I haven't tested gunicorn, but reloading uWSGI will resolve the new symlink to your new virtualenv. Rolling back is simply adjusting the symlink.
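A sketch of the symlink flip (paths invented; the point is that os.replace makes the switch atomic, so a reload or rollback always sees a complete virtualenv):

    import os

    RELEASES = "/srv/app/releases"    # hypothetical: one virtualenv per release
    CURRENT = "/srv/app/current"      # the path supervisord/uWSGI are configured with

    def activate(release):
        """Atomically repoint the 'current' symlink at the given release."""
        target = os.path.join(RELEASES, release)
        tmp = CURRENT + ".tmp"
        if os.path.lexists(tmp):
            os.remove(tmp)
        os.symlink(target, tmp)
        os.replace(tmp, CURRENT)      # atomic rename; rolling back is the same call
                                      # with the previous release name

    activate("2015-10-22-1432")       # then reload uWSGI / signal gunicorn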

Also make sure you're on pip 7+ to take advantage of the automatic wheel cache to speed up building new virtualenvs.

As another commenter noted, I discussed a similar approach at DjangoCon US this year https://www.youtube.com/watch?v=SUczHTa7WmQ&t=1083


For gunicorn running Flask, I use something along the lines of the following for zero-downtime deploys, which works really well as long as your app runs on a single server:

    rsync -avz ./ user@remote:/home/user/web-app
    ssh user@remote 'cd /home/user/web-app; \
    venv/bin/pip install -r requirements.txt; \
    venv/bin/alembic upgrade head; \
    supervisorctl pid web-app | xargs kill -HUP'
The full deployment script has a few extra options which I've omitted for clarity, but this is basically it.


I don't really think you should promote these.

The real way to perform clean update is using a load balancer, offloading instances for upgrade one at a time.

Also please people stop deploying using github. Package management & versioning IS a thing.

> rsync -avz ./ user@remote:/home/user/web-app

This is not an atomic update, meaning you could end up loading modules from the new version in code from the older version. Use a new directory to upload the new files.

> venv/bin/pip install -r requirements.txt

You never delete the old requirements, so this incrementally bloats the virtualenv. Just create a new one; virtualenvs are designed to be created and removed quickly.

> venv/bin/alembic upgrade head

Huu, that's nice if you only have 1 instance to upgrade at a time. Otherwise it will just blow up.


I use git hooks for my Django deployments and it couldn't be simpler. My post-receive hook looks like this:

    #!/bin/sh
    dest=/var/www/myproj
    GIT_WORK_TREE=$dest git checkout -f
    $dest/manage.py collectstatic --noinput
    $dest/manage.py migrate
    touch $dest/myproj/wsgi.py

Running a git push to the prod server fires this hook and that's all. I could improve it by first checking out to a test environment to run django tests and, if everything passes, do a git checkout to production.

Plus, am I the only one using apache?


Back before WSGI really took off, mod_python made Apache fairly prolific, but most Python apps these days are using Gunicorn (or some other Python-based WSGI server), or Nginx in front of uWSGI, so yes, to some degree, you are the only one using Apache.

As a side note: I've run copious tests against Nginx -> uWSGI setups and Gunicorn, etc. setups, and Nginx -> uWSGI is in a whole different league (way faster, way more requests per second, way less memory, and you CAN (despite what I'm reading herein), very EASILY with uWSGI, deploy without losing a single request). However, I still use Gunicorn, because it's quick and easy, it works well, and I don't have to sys admin the thing.


Aren't you concerned about allowing your servers to have access to the repo?


https://github.com/dbravender/gitric (which I wrote) is a much more generic fabric module that will work for more cases. It looks like this particular solution only works because the database query was copied to the web server which might not work for everyone. Check out the sample blue/green deployment fab file here: https://github.com/dbravender/gitric/blob/master/bluegreen-e.... We used the same technique at my last job on a Django site and deployed it several times a day for two and a half years with zero downtime.


That SQL statement that does update and insert is pretty cool. It chains both parts together into a single statement, so it doesn't even need to start and commit a transaction, right?

It makes me wish for some kind of "advanced postgres" guide.


I think PostgreSQL still executes it as a transaction, even though it is a single statement.

While benchmarking this, I learned that with the default and safe settings (synchronous commit), PostgreSQL can only do a few hundred write transactions per second, however simple. Set synchronous_commit=off in postgresql.conf and TPS goes into the thousands.


With synchronous_commit = on the throughput heavily depends on your IO subsystem and the number of concurrent transactions.

As every transaction has to be durable at commit, an fsync (well, fdatasync() if supported) of a sequential log is required. Most single drives can do ~100 (rotational disks, normal speeds) to a few thousand fsyncs/sec. If you have a battery-backed RAID controller you can do tens to hundreds of thousands, since the data will not actually be written to disk.

Concurrency matters because several commits can be made durable with a single flush to disk if they're happening concurrently (sometimes called "group commit").

EDIT: spelling, grammar, minor clarification.


Nevertheless, one should be very careful with unescaped queries like that in nginx. There is a reason frameworks exist around such topics.


Really nice, but when you have bigger DB changes this might run into trouble, when the old code runs against changed tables or the new code runs against unchanged tables...

At least such cases should IMHO be considered before doing such upgrades.


Nice!

I would indeed sleep better if the service monitored my cron jobs. But I would sleep even better if I knew you were making some money from this, because this gives you an incentive to keep providing the service.

For instance, you could offer a paid plan for users with >N requests/day or something like that...

Not sure if you want to go there or not (and I am not currently in the market for such a solution as I have solved my need with a proprietary solution), but the fact that there is no paid plan lowers my expectations of the service.

EDIT: clarified my point.


Thanks for the comment! I've received similar feedback a few times already, and will be thinking about aligning incentives. I do want to keep the existing feature set free though.


Have you considered using anything like AWS or Heroku? I know you say in the article you want to keep it simple, but services like those can hide this complexity for you and their solution is likely to be more robust. Also, if you're only using a single DigitalOcean droplet at the moment, you will have to consider the changes you'll need to make when you require more than one droplet or the droplet you're using goes down. Looks like a fun project; good luck!


I might look into making it easily deployable on Heroku. Would still prefer plain VMs (like the ones from AWS and DigitalOcean). Figuring out deployment is part of the fun!


I might be missing something, but he's using Fabric and Python 3? Fabric doesn't seem to support Python3 yet.

Or is he using Python 2 to deploy a Python 3 application?


Correct, I run Fabric with Python 2. The app itself works with both Python 2 and Python 3.


Aah okay, we're trying to eliminate Python 2 completely, so sadly no Fabric for us.


Even though I've moved to Python 3, I keep Fabric installed via apt-get, using the system Python, which is usually 2.7. I don't have anything in my Fabric scripts that needs access to my Django app or virtualenv.


https://github.com/pyinvoke/invoke is the start of the py3 version.


Fabric is a tool that uses Python as an embedded language; I don't really consider it Python any more than I consider a Makefile Python.


I am new to the Python world, but is there any reason one must necessarily use "virtualenv" on a production server? If I am not wrong, its purpose is to avoid conflicting dependencies, but on the deployment server it can be assumed that only one project will be deployed, so I don't believe there can be many instances of conflicting dependencies.


It's not a strictly Python problem. He's deploying new code with updated dependencies in a separate directory, while the original site is running.

If you manage the dependencies globally, you would probably break the running version of the site during a deploy, because it may expect a different (old) API from its dependencies. For example, the new code may NOT require a previously used dependency, and thus the deploy would remove it and the running code would collapse.

Also, if the deploy fails midway, you'd like the global environment to be unaffected.


> is there any reason one must be necessarily using "virtualenv" on production server?

For a single project running on a single server without many updates, you don't need virtualenv.

For multiple projects of various ages, running on multiple different distros, with dev/staging/prod environments, you might need virtualenv badly.


You need root to install python packages globally. It's always a good idea to deploy without requiring root permissions, for example using sudo with only nginx/gunicorn reload commands allowed.


He's pushing new versions to that server while running the old stuff. The dependencies could have changed in between.


OT but I think you can make the location a bit more readable by using the count quantifier...

^/(\w{8}-\w{4}-\w{4}-\w{4}-\w{12})/?$


Cool! We have a pretty similar setup implemented in Ansible for our Mutalyzer app:

https://github.com/mutalyzer/ansible-role-mutalyzer

healthchecks.io looks pretty nice, I might start using it!


I don't understand how Django apps can have downtime. When I deploy I just do a git pull and all new requests end up in the new code. Either your request goes in before the pull or after, there's no in between. What am I missing?


The new Django code isn't automatically loaded by Apache (wsgi)/Gunicorn. You need to explicitly reload it. BTW, on your dev server that reload happens automatically if code changes. You can probably configure your web server to reload automatically too, although doing it automatically in production is dangerous: e.g. a migration might need to run before the code accessing the new fields runs.


1. Update your code (a few seconds)
2. Update your requirements (this can take several minutes in some cases)
3. Run migrations
4. Run collectstatic
5. Restart any daemons
6. Restart your app

Even if you can guarantee no processes will restart in this period (maybe you pause your daemons and you've configured your webserver to not spawn any new workers for a while) you've got the issue of migrations.

Personally I stop the webserver before this sequence and restart it afterwards as I can't reliably reason about all the things that can go wrong in between.


(meta: Once again markdown and its ilk prove to be the bane of my life... It's like WordStar never died.)


Your files don't get written to disk atomically. When you run `git pull`, git goes one-by-one and updates each file. It doesn't promise what order the files are updated in.

The automatic reload behavior is a thing peculiar to the dev server. If you tried using it in production, especially on a large repo on a loaded server, chances are that Django's auto-reloader would notice that files have been changed as soon as git updates the first file, and it would reload that new file but the remainder of your old files. So some of your requests would go through to a splinched version of your website.

Of course, the reloader will figure out soon after that the last file has been changed, but if you were willing to tolerate a brief period of incorrect code / requests returning errors, you can tolerate actual downtime.

That's why this method, and all correct zero-downtime deploy methods, involve running a new Django server in a new directory, waiting for it to start, and then signalling the web server to route requests there.


Our big Django app requires around 10 seconds to reload (running on uwsgi). During this, requests just hang. They do not time out/return errors like 502, but the website is still unavailable.


We have the same problem, only it can take up to 20 seconds. We get around it by taking the instance out of the load balancer during the deploy, and then warming it up with a few requests before adding it back. But it's still very annoying, and I haven't figured out where it spends the 20 seconds.


What about simply spinning up a new VM on a service like AWS, keeping the database, and switching the load balancer once you've confirmed the app is responding on the new VM?


From the post:

> Aside: With regard to technology choices, the guiding principle I’ve been following is to keep the stack as simple as is feasible for as long as possible. Adding things, like load balancers, database replication, key value store, message queue and so on, would each have certain benefits. Then on the other hand, there would also be more stuff to be managed, monitored, and kept backed up. Also, for someone new to the project, it would take more time to figure out the “ins and outs” of the system and set up everything from scratch. I see it as a nifty challenge to stay with the simple, no-frills setup, while also not compromising performance or features.

> Let’s set some constraints: no load balancer (for now anyway).

But yeah, that would work otherwise.


Yeah, that's how I would typically do it, but he isn't on AWS and he's aiming to keep the cost down. And he's also keeping it simple, so no load balancers etc.


If you are getting a request per second, I think a load balancer on AWS has tremendous value.


1 request per second is nothing for a modern server. Most applications are able to handle hundreds or thousands of transactions on a single machine. Why do you think a load balancer would be useful?


Very cool. Also cool to see a free alternative to cronitor.io!


> To verify this in practice, I wrote a quick script that requests a particular URL again and again in an infinite loop.

That's not how you verify it. Use a load benchmark like "ab -c 10 -n 3000 -q domain.tld/" and if that doesn't report non-200 responses while deploying the new code, then you can talk about no downtime.

My solution for this problem (tested and deployed on Django sites) is this: https://github.com/stefantalpalaru/uwsgi_reload


I update my Erlang based ChicagoBoss / Cowboy web services live all the time, without one transaction being lost.

Erlang lets you update live code, while it's running.

Still, I'm a big fan of Python and Django, and it's nice to see people realizing the value of "live" upgrades, even in a round-about way.



