How we rolled out one of the largest Python 3 migrations (dropbox.com)
286 points by ddeville on Sept 25, 2018 | 131 comments



> However, rather than use the native toolchains directly, such as Xcode for macOS, we delegated the creation of platform-compliant binaries to py2exe for Windows, py2app for macOS, and bbfreeze for Linux.

I wish the authors of more Python tools would deploy standalone applications. I do not like having to maintain various sets of Python installers/package managers (because every Python tool seems to use a different installer). Especially on cloud servers that often lack a whole set of dependencies that Python developers just seem to take for granted.

I’m not a Python developer. I don’t have the time or the inclination to repackage various tools and untangle dependencies.

Given the choice between trying to figure out how to get multiple Python tools to behave together, or using another tool, I’ll almost always choose an alternative.


This is possible and actually pretty trivial.

You install all 3rd party dependencies into some directory. The command line entry point is then a simple BASH script which sets the PYTHONPATH to the appropriate installation location and then does the appropriate exec call.

You then have a functionally portable python installation.


Pretty trivial?

Really?

I think as soon as you start writing all caps trivial has gone right out the window.

Trivial would be application dependencies managed by the system package manager.

I'm at the point where I won't touch a python app or library that can't be installed via Pacman. It's just not worth my time.

And... python is still the only thing on arch that gives me the shits every time it upgrades.


This exchange is a good demonstration that the word "trivial" has lost its meaning in the same way "literal" has. Much like I usually hear someone use the term "literally" for figurative emphasis, these days I mostly hear the word "trivial" used to describe something which is clearly nontrivial.

Math textbooks have been doing this for decades, but it's leaked into common parlance with online discussion.


I believe that in the mathematical community the term that is potentially off-putting for outsiders is "non-trivial": it is used in a technically correct manner, but is often applied as a synonym for epically hard, which unexpectedly throws people off, especially the author of that non-trivial work!


I think the best translation of "trivial" when used amongst mathematicians is that it is something you should be able to figure out with your current knowledge without too much difficulty (though it might require an hour of thought). Said in another way, you don't need to learn/develop new tools or techniques for something that is trivial.

Of course this is not what most people think when they hear the word, so really it's a term of art that should probably be avoided when talking to non-mathematicians.


> I think the best translation of "trivial" when used amongst mathematicians is that it is something you should be able to figure out with your current knowledge without too much difficulty

As I read through my old uni maths notes there are often wild leaps from a to e along with a little scrawl saying "trivially" or "obviously". They may have been true once, but god dammit 21 year old me was a knobber


It takes approximately a year to learn to walk after you are born. But almost everybody figures it out. And once they know, they practically never forget. So I don't know, is it not trivial?

I have never had to support all of the mobile environments of Dropbox nor the scale, so I cannot claim that my 99% solution would ever meet their 99.999% requirements. But I have been able to package Python apps for Mac OSX, WinVista, Win7, Ubuntu, and CentOS at the same time using that strategy.


Being charitable, I think the parent means that the python developer sticks all the dependencies in a directory and creates a bash script to set PYTHONPATH and launch. The user receives a directory rather than an executable, but only has to use the bash script, rather than worry about any of the Python in the directory.


Now your lovingly created cross-platform app is back to linux only because you used bash to invoke it.


Use a Python script instead.
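
A minimal sketch of what such a launcher might look like, assuming the third-party dependencies were installed into a `vendor/` directory next to the script and the app's entry point is a module called `myapp` (both names are hypothetical, not from the thread):

```python
"""Hypothetical cross-platform launcher: run an app against bundled deps."""
import os
import subprocess
import sys
from pathlib import Path

# Assumption: dependencies were installed next to this script, e.g. with
# `pip install --target vendor -r requirements.txt`.
VENDOR_DIR_NAME = "vendor"
# Assumption: the application's entry point is a module named `myapp`.
ENTRY_MODULE = "myapp"


def build_env(base_dir, environ):
    """Return a copy of `environ` with the vendor dir first on PYTHONPATH."""
    env = dict(environ)
    vendor = str(Path(base_dir) / VENDOR_DIR_NAME)
    existing = env.get("PYTHONPATH")
    # os.pathsep is ':' on POSIX and ';' on Windows.
    env["PYTHONPATH"] = (vendor + os.pathsep + existing) if existing else vendor
    return env


def build_command(argv):
    """Run the entry module with the same interpreter that runs this script."""
    return [sys.executable, "-m", ENTRY_MODULE, *argv]


def main():
    base_dir = Path(__file__).resolve().parent
    env = build_env(base_dir, os.environ)
    # A child process is used rather than os.exec*, whose behaviour
    # differs on Windows.
    return subprocess.call(build_command(sys.argv[1:]), env=env)
```

To use it as a script, call `sys.exit(main())` at the bottom of the file. Note that this only covers the "directory of dependencies plus launcher" approach described above; it still relies on a suitable Python interpreter being present on the machine.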


Nah, he should use some turtles instead.


Isn't this just a hacky virtualenv? 'python3 -mvenv dir && ./dir/bin/pip install requirements'?


There are a lot of minor differences, but the biggest difference is that you're able to be completely independent of the system python install. You bundle a complete python interpreter, all libraries needed, etc.

The user doesn't need to have python installed at all, and if they have 2.x instead of 3.x or 3.3 when you're expecting features that are only present in >= 3.5, it's no issue.

This may sound trivial, but it's a _huge_ deal, particularly when you need to deploy something that runs on multiple different OSes and versions of OSes.

Other than that, the "directory full of libs, binaries, and code" approach is a lot easier to package into something that will work well with the native package manager (e.g. an .msi for windows, etc).


In the ideal case, those making the tools would provide this packaging instead of asking all users to have `pip` properly set up. Similar issues exist w/r/t npm, in my opinion.


That's why most of the tools I use in my day to day work are Go or Rust binaries.


They're also fast and small, which while not always a big matter is very satisfying for me.


I work in a Python shop. The Docker images we build are nearly 1 GB. I just built a Go service whose image is only 2.5MB. Admittedly it’s much simpler than the Python apps, but even a complex Go app would never reach the size of our Python app for a number of reasons:

1. Python apps require a distro base image while Go can run on scratch

2. Python images ship with the full standard library; not just the bits you import

3. In Python, if you add a dependency only to use 1 function or variable, you still end up with the whole dependency in your Docker image, while I’m pretty sure Go’s linker strips unused code.


I agree, I'm not a python fan. Python has a few good libraries I can't find in other languages, but it's slow, bad for multi core utilization, hard to distribute, is very wasteful with how many dependencies need to be included, has a terrible package manager and I prefer static strong typing.


Have you tried mypy? It's a pretty good static typing system for Python - arguably better than Go's.


I'd like type annotations to be used to optimise performance. After all, if it has been statically verified that a particular variable is always an instance of class X, why not use that to optimise code?

This is an argument for type annotations to be integrated into every dynamically typed language, rather than tacked on via an external tool.
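
For context, CPython stores annotations as metadata on the function object and does not check or exploit them at run time, which is why any optimisation currently has to come from an external tool. A small sketch (the function name is made up for illustration):

```python
def double(x: int) -> int:
    """Annotated as int -> int, but the interpreter does not enforce it."""
    return x * 2


# The annotations are just metadata attached to the function object.
print(double.__annotations__)

# Passing a str violates the annotation, yet the call runs without error;
# only a static checker such as mypy would flag it.
print(double("ab"))
```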


TL;DR - I continue to root for Python's typing story, but it's just not there yet.

I have, and I wanted to like it. On its face it seems like it should be a lot better than Go's--after all, it supports generics and union types! But it falls over in trivial cases, like recursive types (i.e., there's no way to model tree structures such as JSON or linked lists). A few other hard/impossible/confusing things come to mind:

1. How do you declare a typevar for a certain scope? If I define a type parameter `T` for function `foo`, I only want `T` to be scoped to `foo`. I don't want the type checker getting confused with `T`s for other functions/classes/etc.

2. What is the signature for a function that takes args/kwargs?

3. It straight up doesn't work with popular libraries like SQLAlchemy (last I checked, these were simply not supported because the likes of SQLAlchemy are "too magical"--this is a fair take, but frustratingly limiting for users).

These are just a few because my memory is poor, but I run into these sorts of things by the dozens every time I try to use mypy. It's just not ready for prime time. Go's type system is limiting, but its limitations are much more predictable and even less limiting (it turns out recursive types and poor-man's union types are quite a bit better than first-class, non-recursive union types, for example).
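
For readers unfamiliar with points 1 and 2, here is a sketch of how the `typing` module (PEP 484) expresses them; whether this answers the concerns above is a separate question. The function names are illustrative only.

```python
from typing import List, Sequence, TypeVar

# A TypeVar object is declared at module level, but each generic function
# that uses it binds it independently.
T = TypeVar("T")


def first(items: Sequence[T]) -> T:
    """Return the first element; the result type follows the argument type."""
    return items[0]


def last(items: Sequence[T]) -> T:
    """Reuses the same T object; its binding is unrelated to the T in `first`."""
    return items[-1]


def describe(*args: int, **kwargs: str) -> List[str]:
    """`*args: int` annotates each positional argument as an int;
    `**kwargs: str` annotates each keyword value as a str."""
    parts = [str(a) for a in args]
    parts.extend(f"{k}={v}" for k, v in sorted(kwargs.items()))
    return parts
```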


It's also tacked on. It's a bit too optional. If one team member doesn't care about typing he makes all his colleagues do the work to make his code work with mypy.


While it can't compete with 2.5MB, python:3.6-alpine (which includes points 1 and 2) weighs less than 100MB. You need a lot of Python code to get to 1GB.


Fair point. Our largest Python image has only 255 MB of Python dependencies and ~50 MB of source code. If we could use alpine (our compliance auditors strongly prefer centos base images), it would only be ~400 MB. This is easily an order of magnitude bigger than an equivalent Go program, but still quite a lot better.


Is that really just Python code?? Must be over a million lines, no? I've worked with Odoo, which is a bit of a kitchen-sink (ERP, CRM, POS, sales, accounting, invoice, stock management, manufacturing control, website builder, marketing and a bunch more) and its Python code weighs just 15MB, the rest is JavaScript or data files.


It's not just source code--it's also docs and test code and other things that are tedious to omit given our current Docker image hierarchy and repository structure. Our docs are largely Sphinx docs in Python docstrings; a decent minifier could probably reduce this, but it's probably not worthwhile for a ~5% improvement on overall image size.


My favorite Python library is pandas. If I only include this with pyinstaller I already get a >500MB executable.

IMHO unacceptably large, but that's because Python doesn't know which parts of pandas it might need to execute the program.


Code pruning is a problem in Go too; as soon as your codebase or any of your dependencies use reflection, it gets disabled for the whole binary.

A simpler alternative would be for NumPy and pandas to provide their features as subpackages, like Airflow does: https://airflow.readthedocs.io/en/latest/installation.html


I’m almost certain that only applies to individual compilation units and not the whole AST. In other words, if I use reflection in my main package, code pruning still works on dependencies, which is quite a lot better than the Python situation.


I hope some day some big corp decides to write a decent (open source) Python to C compiler, which makes optional optimizations based on type annotations.

I find this project: https://github.com/Nuitka/Nuitka very interesting, but it's written and maintained only by a single person and I never got it to work with any of my apps.


Cython is a pretty good Python to C compiler and in its latest release is using type annotations... the thing is, you shouldn't need to compile your whole program (compiling is really slow and interpreted Python is fast enough for most of the code).


I was under the impression that Cython is a different language (a subset / dialect of Python). So, not a Python to C compiler.


Good point. I never understood why the developers of every language don't make it trivial to build a .exe or .app file that users can double-click to run. Seems like the Python team is penalising developers for using Python :)

It can internally use an interpreter, JIT, bytecode or full native code like C. That's a different discussion. Just don't make it a pain to distribute.


Even when I am a python developer, there is still a difference between (a) this dependency that I explicitly rely on and that I need to sort out packaging issues for, and (b) code I treat as a black box and simply expect to work.

More often than not, a typical developer's python environment tends towards https://xkcd.com/1987/


That xkcd is missing virtualenv, pyenv, pipenv, venv, and poetry.


Interesting write-up, though it leaves me terrified how a relatively small, more or less single purpose application like the Dropbox client has over 1 million lines of code in Python alone.


I found the Dropbox lecture [1] at Stanford one of the most riveting things ever. There is just so much technology behind Dropbox, it is staggering.

There is a reason why it is so much better than iCloud sync, Google Drive, Box or OneDrive.

[1] https://www.youtube.com/watch?v=PE4gwstWhmc


This led me to find another loosely related but very entertaining piece of dropbox history. The original "Show HN" post: [1]. It's funny to see so much skepticism knowing now what the company became.

[1] https://news.ycombinator.com/item?id=8863


Yes, this is one of the classics - right up there with the "less space than a Nomad, no wireless, lame" comment (which wasn't on HN I don't think - but we all know it could have been :)

Edit: I see the motherlode is in place earlier in the thread "Especially when you could build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem"


ha, that's a nice counterpart to the 'trivial' thread above https://news.ycombinator.com/item?id=18071820


The iPod comment was from slashdot, if memory serves.


Yes. CmdrTaco, when posting to the home page:

https://slashdot.org/story/01/10/23/1816257/Apple-releases-i...

(that post is old enough to vote in this year's election).


You might only use a small part of Dropbox, but I bet there is a lot of functionality that you don't care about but which is critical to Dropbox as a business/product for others.

The fact that you think it's small means they're probably doing something right!

Relevant: https://danluu.com/sounds-easy/

(FWIW I don't use Dropbox myself, but I definitely had people ask me why Google needed 3,000 employees back in the day. Apparently it now has nearly 90K employees.)


From the article

>There's also a wide body of research that's found that decreasing latency has a roughly linear effect on revenue over a pretty wide range of latencies for some businesses. Increasing performance also has the benefit of reducing costs.

I wish he cited some of that research, because Google doesn't show much except for this amazon study with the 100ms.

I'm especially interested in whether there's any research on engineering tools and their latency (long build times, etc.), which are chronically under-addressed in quite a few large corporations. I'm just wondering if there are any studies that would make the case for me if I were to present this to management.


Especially when you could build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem!


Sarcasm doesn't translate well on the internet, so I'm really not sure if suggesting using CVS (of all things) over a mounted FTP share as a replacement for Dropbox is a joke!



> Sarcasm doesn't translate well on the internet

Given what you're replying to, I'd say irony sure does.


The size of a codebase is less about its function and more about the number of people adding code to it.


In turn this becomes the number of people required to maintain a codebase.


What would you consider the main cause(s) of that?


The job of a programmer is to write code, so that’s what they do. I doubt there are very many people at Dropbox whose job it is to remove code!


That’s an extremely narrow definition of a programmer’s job.

I think that in terms of lines of code, my contribution to many projects was net negative.



I dunno - refactoring often involves removal of a fair bit of code


Dropbox doesn't feel that small or "single purpose" to me anymore.


What does the Dropbox client do for me other than syncing files and exposing a bit of the online functionality such as generating share links? (Serious question...)


We can simply start by asking what does “syncing files” include?

Watching files. Keeping backup of files. Keeping conflicts resolved. Watching Selective Sync files and folders. Watching Smart Sync files and folders. Notifications for synced files. Etc. etc.

There’s way more the client does than what I mention.


I don't think there is much code for conflict-resolving in dropbox. Usually in case of conflicts it renames one of the involved files, adds a message about the conflict and the date to the name, and moves on.


Perhaps there's lots of code to minimize the number of times a real conflict occurs?


I agree that syncing is far from trivial, but that doesn't change that it's a single purpose.


Not to build a straw man but by this logic also a browser is a single purpose application


A rocket ship is also "single purpose."


It might be single purpose but that single purpose is a reallllllly difficult problem.


You have no idea how web development is different from desktop one


It’s funny, after Mojave was released recently I thought it might finally have a python3 installed, even if it’s not the default. Nope, still Python 2.7.

This is good for me since I deployed a Python-dependent app under the assumption that the system Python would be stable and reliable. It allows relatively complex things to be achieved with a tiny download package.

I’ve been prepared to adopt Python 3 for a while but it just isn’t necessary when using system defaults.


It'll be very interesting to see what happens with the next macOS when Python 2 will be EOL (which is roughly 3 months after its release). The upgrade to Python 3 is long overdue.


Python being EOL sounds scary but actually won't matter to Apple. They already apply custom patches, they can carry on running python 2.7 forever, with minor bug fixes where really required.


> they can carry on running python 2.7 forever

would they do this? does anything macOS internal depend on python 2.7?


the `xattr` terminal command relies on the system installed Python, though this seems to be the only example.

If you look at the source in `/usr/bin/xattr` it does some work to deal with different versions of Python. All the work ultimately gets handled by the xattr module preinstalled with the system Python. This module has Apple's copyright in it and is different than the `xattr` module on pypi.

Wonder how this Python one-off in macOS came to be.


As a python engineer still living in 2.7, it's great to see major codebases making the move. I know I have a similar experience coming in my future, and I appreciate hearing what seems like more or less a success story come from it.


FTA :

>>> On the surface, the application would more closely resemble what the platform expects, while behind various libraries, teams would have more flexibility to use their choice of programming language or tooling.

I'm always fascinated by how the implementation of the core principles of an application is dictated by factors alien to it, such as OS, company organisation, etc. Therefore, the job of coding is often a small part compared to the amounts of trivialities, project management decisions, customer's ideas, corporate policies, etc. Although my soul is a coder's one, I always realize how much coding is just a small part of what I call application development.


Here's a nice overview of how Facebook migrated their codebase to Python 3. While it's different in nature (server side vs. client side), it's rather interesting.

https://www.youtube.com/watch?v=H4SS9yVWJYA


I hope this kills (or helps kill) the 100+ thread count I have always seen in macOS. It surpasses any other thread count from far more important/sophisticated processes.

I'd say that's a waste (if not abuse) of the system's resources and scheduling system.


I wouldn't be that worried about 100 threads, when you have 10s of thousands you can think about worrying.


Seconded, on linux for me though.


Is there any sample code that illustrates this kind of python embedding versus one of the freezer scripts ?

We used to build desktop software using pyqt and freeze it. I wonder what that entire toolchain looks like in this new way.


See https://docs.python.org/3.7/extending/embedding.html.

The idea behind embedding is you might have a Python shell in a larger app. But you can also use it to tightly control the execution of the interpreter.


But their Debian packages still depend on Python 2.


What happened to their new implementation of Python they were building? I know it was cancelled, but not really why.


Probably because it was 2.7 only, and only 10% faster on Dropbox's code.

https://blog.pyston.org/2017/01/31/pyston-0-6-1-released-and...


I'm surprised 10% wasn't enough - 10% bottom-line improvement in programming language implementation is normally massive. Twitter is singing from the roof-tops about 10% improvement in Java performance from the new Graal JIT compiler.


I would imagine that for code that is performance sensitive enough that a 10% improvement matters, they would be porting to a language with saner performance instead?


PyPy is already way better than that, and has a Python 3.5 implementation.


I'm by no means an expert on the subject, but isn't there an advantage to having a JIT compiler use LLVM?

On another note, I believe one thing that has been problematic for pypy adoption is that it does not automatically work with C extensions or Cython, and generally if someone already had performance issues with CPython, they would have written some C/Cython extensions?


PyPy is far more than just 'python with a JIT'.

It's a tool that allows you to write an entire interpreter in RPython (a subset of Python) and then have it build a native binary with a free jit compiler included, with the specifics of your language encoded within. The reference implementation for this project is a Python interpreter.

It's seriously, seriously cool.


Maybe because 10% on a server is much more valuable than 10% on a client.

On a server, you’re paying for that 10%. On a client, you’re not. If it was 10% for nearly free then sure - but maintaining a separate implementation of a language is costly.


No, Pyston said of their last version "On Dropbox’s server, we are 10% faster."

PS Thanks for all your Django contributions!


Should have read the article, whoops. Thanks for the correction!


Anyone curious to estimate the cost for Dropbox to migrate to Python 3?


and also calculating the cost savings if any. On a technical level the whole exercise seems to lead to a slightly more elegant, consistent code base, but it's still a long way from earning or saving any actual money.


If you can't get it directly (I can't), there is an archive available at https://web.archive.org/web/20180925210033/https://blogs.dro...


A guide to migrating to Python 3 (with pleasure :)

https://github.com/arogozhnikov/python3_with_pleasure


Great write-up and those two graphs are interesting. It's cool to learn how different companies treat their beta users. I wish this article touched upon more of the technical difficulties with switching from Python 2 to 3 too.

A̵l̵s̵o̵ ̵a̵ ̵s̵m̵a̵l̵l̵ ̵h̵e̵a̵d̵s̵-̵u̵p̵:̵ ̵a̵t̵ ̵t̵h̵e̵ ̵e̵n̵d̵ ̵o̵f̵ ̵t̵h̵e̵ ̵f̵i̵r̵s̵t̵ ̵p̵a̵r̵a̵g̵r̵a̵p̵h̵,̵ ̵"̵v̵e̵n̵e̵r̵a̵b̵l̵e̵"̵ ̵s̵h̵o̵u̵l̵d̵ ̵b̵e̵ ̵"̵v̵u̵l̵n̵e̵r̵a̵b̵l̵e̵"̵ ̵(̵u̵n̵l̵e̵s̵s̵ ̵y̵o̵u̵ ̵m̵e̵a̵n̵ ̵w̵e̵ ̵s̵h̵o̵u̵l̵d̵ ̵l̵o̵o̵k̵ ̵b̵a̵c̵k̵ ̵a̵t̵ ̵t̵h̵e̵ ̵s̵a̵c̵r̵e̵d̵ ̵p̵y̵w̵i̵n̵3̵2̵ ̵l̵i̵b̵r̵a̵r̵y̵ ̵w̵i̵t̵h̵ ̵h̵o̵n̵o̵r̵)̵.̵


Venerable is correct here.


You're right, I haven't seen it used outside of a religious context before.


I hope the people who told me that python wasn't good for large rollouts read this


They forgot step 1 - hire the creator of Python... But seriously, that's an interesting post.


That strategy did not help Google get off python 2.


Guido van Rossum recently stepped down from leading the Python project.

https://news.ycombinator.com/item?id=17515492


He’s still the creator.


Phew


[flagged]


We've banned this account for repeatedly posting unsubstantive comments.

If you don't want to be banned, you're welcome to email hn@ycombinator.com and give us reason to believe that you'll use the site as intended.


I would have thought that Go would be perfect for this sort of system-level-like programming ?


Dropbox uses Go (https://github.com/dropbox/godropbox), Rust (e.g. https://github.com/dropbox/rust-brotli) and Python on the server-side, and both Rust (https://github.com/dropbox/finderinfo-rust) and Python on the client-side.

As other commenters have noted, a lot of their Python use for large scale systems was an artifact of history and available choices at the time, but from my experience during my time there and following as an outside observer since leaving, they seem to make reasonable infrastructure and language decisions for their core product.


Go wasn't created until 2009 and Dropbox already had tons of Python code by then as well as extremely accomplished Python programmers (they hired the creator of Python only a few years later). I also don't think Go would have allowed the tight integration with, for example, OS X where Dropbox actually superimposes their icons onto your Finder icons. As far as I understand, it was only because of smart programming and use of OS X's Python / Ruby bridge functionality that they were able to do it.


There’s an official API since 2014 (Yosemite): https://developer.apple.com/library/archive/documentation/Ge...


Dropbox predates that api though. They used to hack into Finder.app's api IIRC.


Which is part of why that extension point was introduced :)

https://twitter.com/nickjshearer/status/608833134902140928


Looks like they already have Rust/C++/obj-c in use, which all fill that purpose a bit better than Go would.


Go code is about 30% more verbose than equivalent python.


How much faster is Go? Memory usage?

It has to be less demanding on your laptop battery, for example.


Performance depends very much on what you are doing. Native Python code is much slower than Go, but Python code execution is not the bottleneck in many Python programs. NumPy may be faster than Go and disk IO is the same speed in each.


Go performance would be next tier from Python but memory usage a little less so. The Go team has made great strides in that regard from what it was. The runtime is always getting better, and it's fun to watch.

Rust or native platform (Swift/C# .Net Native, depending) are going to be even more ideal for battery usage.

Proper algorithmic choices are even more important and paramount no matter what is used. It goes without saying that poorly implemented Rust can be bested by well implemented Python.


Wow, I wouldn't have expected that. Do you have a reference for the 30%? I'd like to see what is going on to cause that big of a difference.


Every other 5 lines being the equiv of:

    if err != nil {
        ... error handling ...
    }

Python just uses exceptions, and they bubble up. No need for this excessively verbose error checking.
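
To make the contrast concrete, here is a small Python sketch (the function names and config shape are hypothetical) in which the intermediate function contains no error-handling code and the exception propagates to the one caller that chooses to handle it:

```python
def parse_port(text):
    # int() raises ValueError on bad input; nothing is checked here.
    return int(text)


def load_config(raw):
    # No error handling: a ValueError from parse_port bubbles up unchanged.
    return {"port": parse_port(raw["port"])}


def run(raw):
    # Only the outermost caller decides how to handle the failure.
    try:
        return load_config(raw)
    except ValueError as exc:
        return {"error": str(exc)}
```

The trade-off is that nothing in the signature of `load_config` tells the reader it can fail, which is the part Go's explicit `err` return makes visible.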


I'm not sure why it is downvoted. This is true. I write both Golang and Python as part of my day job. I love both languages, but this is definitely the truth.

Go is more verbose. It also gives you the wonder of the compiler telling you about doing stupid things. That doesn't make it a worse language.


Does go not have exception handling?


Go does not have exception handling at all. It is a proposal they're seriously considering for go2:

https://go.googlesource.com/proposal/+/master/design/go2draf...

See the bits about error handling and error values.


What was the reason the go designers didn't have exception handling?


Read Rob Pike's post on why: https://commandcenter.blogspot.com/2012/06/less-is-exponenti...

The go2 proposal shows that in this respect, he was woefully wrong.


go has something in the same general area of error handling called panic/recover. It's not exceptions in the normal sense, and devs expect nothing should panic across api boundaries.


Go isn't good at the system level. I.e. calls to C APIs are expensive, etc.


Wait, I thought Go was designed as a system language.


Go was designed for concurrency (and the docs take care to distinguish that from parallelism), fast compilation, and easy deployment.


Only in the Google sense i.e. "for building big systems… like servers". Not as in "close to your operating system" or "you should write an operating system in it".


Why are people downvoting this? It’s a fine question to ask. I think people have been downvoting questions more lately, and it’s unfortunate.


So much downvoting for an honest question :/


There are few comments on HN worse than "why didn't you implement in [rival language]?" as if [rival language] was a default choice that could only be deviated from if it was extensively motivated.

The claim that Go is "perfect" doesn't make things better.


Go wasn't a thing in 2007.



