This is an avalanche of wisdom. I'm summoning the courage to enable "Show GPU Overdraw" on my apps and see just how bad it is.
And if anyone else was interested, the developer of Falcon Pro has already started implementing the fixes. To quote him from Twitter (@joenrv): "Reading @romainguy analysis of my app makes me feel a bit like a naked dude in a room full of people staring xD #letsgettowork"
I have very little interest in Android/mobile development, but this post was one of the most interesting things I've read in the past few days. Very well done; I feel like I've learned a lot about graphics performance (in general, not just on mobile) in five minutes.
The OP is slightly overstating the impact of overdraw as a general thing. While overdraw is likely the cause of the poor performance in this particular application, it's not always going to be the culprit.
Overdraw primarily eats up memory bandwidth. Bandwidth isn't the only resource you have to worry about on a GPU (though on a mobile, it's certainly important). Equally important can be the time spent running pixel and/or vertex shaders when actually drawing onscreen elements - it's quite easy for a poorly written pixel shader to add multiple milliseconds to the time taken to render one fullscreen image on an embedded device.
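To make the "too much shader time" failure mode concrete, here's a sketch of the kind of fragment shader that gets expensive fast: a naive box blur doing 81 texture fetches per pixel. (Written as a GLSL-in-Java-string in the usual Android GLES style; the names are made up for illustration, not taken from any real app.)

    // Illustrative only: a naive 9x9 box blur, 81 texture fetches per
    // fragment. At 1280x800 that's over 80 million lookups per frame,
    // easily adding several milliseconds on an embedded GPU.
    static final String SLOW_FRAGMENT_SHADER =
            "precision mediump float;\n" +
            "uniform sampler2D uTexture;\n" +
            "uniform vec2 uTexelSize;\n" + // 1.0 / texture dimensions
            "varying vec2 vTexCoord;\n" +
            "void main() {\n" +
            "    vec4 sum = vec4(0.0);\n" +
            "    for (int x = -4; x <= 4; x++) {\n" +
            "        for (int y = -4; y <= 4; y++) {\n" +
            "            sum += texture2D(uTexture,\n" +
            "                vTexCoord + vec2(float(x), float(y)) * uTexelSize);\n" +
            "        }\n" +
            "    }\n" +
            "    gl_FragColor = sum / 81.0;\n" +
            "}\n";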
Unfortunately, none of the steps he takes in this article, until the OpenGL ES trace near the end, appear to give you the information you'd need to figure out whether overdraw is actually your problem. Maybe it's a safe bet that overdraw is the issue for most Android apps, because they're using Skia and thus don't have access to custom shaders?
On the other hand, that Hierarchy Viewer feature in the debug tools looks really great. I wish more SDKs offered features that nice.
You are right, but apps are not in control of the shaders, at least not directly. It is possible to make the view system use complicated shaders, but it's rare and I've never seen it be the cause of performance issues (except when we had bugs in our shader generation code.)
Most UI elements are drawn with a quad and trivial shaders (a couple of multiplications in the vertex shader, a texture lookup and modulate in the fragment shader.)
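For illustration, that trivial pair looks roughly like this (a made-up sketch, not the actual shaders the view system generates):

    // A couple of matrix multiplications in the vertex shader...
    static final String QUAD_VERTEX_SHADER =
            "uniform mat4 uProjection;\n" +
            "uniform mat4 uTransform;\n" +
            "attribute vec4 aPosition;\n" +
            "attribute vec2 aTexCoord;\n" +
            "varying vec2 vTexCoord;\n" +
            "void main() {\n" +
            "    gl_Position = uProjection * uTransform * aPosition;\n" +
            "    vTexCoord = aTexCoord;\n" +
            "}\n";

    // ...and a single texture lookup modulated by a color in the fragment.
    static final String QUAD_FRAGMENT_SHADER =
            "precision mediump float;\n" +
            "uniform sampler2D uTexture;\n" +
            "uniform vec4 uColor;\n" +
            "varying vec2 vTexCoord;\n" +
            "void main() {\n" +
            "    gl_FragColor = texture2D(uTexture, vTexCoord) * uColor;\n" +
            "}\n";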
In my years of working on Android I have learned that poor framerate is most of the time (not always) a combination of blocking the UI thread for too long and drawing too much.
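The first half is usually fixed by moving blocking work to a background thread and posting only the result back to the UI thread. A minimal sketch inside an Activity (fetchTweets, updateList, and the Tweet type are hypothetical placeholders):

    // Keep slow I/O off the UI thread; touch views only back on it.
    new Thread(new Runnable() {
        @Override
        public void run() {
            final List<Tweet> tweets = fetchTweets(); // blocking network I/O
            runOnUiThread(new Runnable() {
                @Override
                public void run() {
                    updateList(tweets); // cheap UI update on the UI thread
                }
            });
        }
    }).start();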
I should have made this clearer, but a lot of overdraw above 2x will likely be one of the causes of performance issues. Not all devices behave the same, of course, but it's a reasonable average (devices with more bandwidth tend to have higher screen resolutions.)
What's important to remember, however, is that overdraw often indicates other problems. It typically means the application uses more views than it needs, which hurts performance in other ways: higher memory consumption, a larger tree that takes longer to manage and traverse, longer startup times, and more work for the renderer (sorting display lists, managing OpenGL state, etc.)
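One classic way to trim a layer of overdraw, for example, is to drop the window background when the content view paints every pixel anyway. A sketch (the layout name is hypothetical):

    import android.app.Activity;
    import android.os.Bundle;

    public class MainActivity extends Activity {
        @Override
        protected void onCreate(Bundle savedInstanceState) {
            super.onCreate(savedInstanceState);
            setContentView(R.layout.main); // fills the whole window
            // The theme's window background is completely hidden by the
            // layout, so drawing it is pure overdraw; remove it to save
            // a full-screen layer.
            getWindow().setBackgroundDrawable(null);
        }
    }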
Not really; all of the shaders come from the Android view system and are pretty simple (they could probably have been done with the fixed-function pipeline).
In general, you're right: you can "shade too hard" (consuming too many GPU cycles) and/or "shade too much" (consuming too much memory bandwidth). In my experience on mobile, "shading too much" is more common and easier to do (especially in the era of 2560x1600 displays...), maybe because the framebuffer and textures live in system memory rather than GDDR5/whatever as on a desktop.
In the case of pixels hidden behind views that do not use blending, overdraw only eats a small amount of memory bandwidth checking the depth/stencil buffer for each covered pixel, and that check is highly optimized in hardware to reject whole blocks of pixels while reading only a few bits. (Android UI does use the depth/stencil buffer, right???) However, I don't think that's what the article is talking about: "You can see that the transparent pixels of the bitmaps count against your overdraw."
In the case of visible views that do use blending, overdraw multiplies the time spent running shader computations right alongside the full memory bandwidth consumption of the shader (much more than just checking the depth/stencil buffer). It's true that it's entirely possible to write slow shaders that chug at only 1x overdraw, but at 3x overdraw it will be 3x as bad, because you are running the whole function three times per pixel.
Android does not use the depth buffer; the UI toolkit draws back to front. We are thinking about ways to improve this, but most apps draw blended primitives back to front anyway. An optimization I want to get in is to re-sort the render tree to batch commands by type and state. A side effect of this will be the ability to cull invisible primitives.
The stencil is not used at the moment (well... that's actually how overdraw debugging is implemented) because the hardware renderer only supports rectangular clipping regions and thus relies on the scissor instead. Given how the original 2D API was designed, using the stencil buffer for clipping could eat up quite a bit of bandwidth or require a rather complex implementation.
It is planned to start using the stencil buffer to support non-rectangular clipping regions but this will have a cost.
Remember that the GPU rendering pipeline was written for an API that was never designed to run on the GPU and some obvious optimizations applied to traditional rendering engines do not necessarily apply.
That's actually what I expected, but I couldn't find any reference. So, I defaulted to the optimistic, but probably wrong stance hoping that someone would correct me. Thanks!
This means that, at least on traditional forward rendering GPUs (Nvidia, Adreno), overdraw is full cost even for pixels covered by opaque views. Do the PowerVR chips still get effectively-zero opaque overdraw from their tile-based-deferred-rendering approach?
Wow. I didn't know that the Android SDK shipped with such advanced profiling tools. They have really improved since the last time I played with it, circa 1.x. Impressive. Can anyone comment on other similar tools?
Another common one is "strict mode", where you tell Android to notify you (via screen flash or log dump) any time your app makes a potentially blocking call (network, file system) on the UI thread. Very useful as a first optimization.
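If memory serves, enabling it is just a few lines early in onCreate(), ideally in debug builds only; something like:

    import android.os.StrictMode;

    // Sketch: flag disk and network access on the calling (UI) thread,
    // flashing the screen and logging to logcat on each violation.
    StrictMode.setThreadPolicy(new StrictMode.ThreadPolicy.Builder()
            .detectDiskReads()
            .detectDiskWrites()
            .detectNetwork()
            .penaltyFlashScreen()
            .penaltyLog()
            .build());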
As of (IIRC) Gingerbread, you get thrown a NetworkOnMainThreadException if you're making network calls on the UI thread and haven't explicitly turned StrictMode off.
(The WinRT SDK goes a step further, there simply _aren't_ any blocking network/FS methods available. async/await makes that less painful than it would seem, though.)
PS: It's actually Honeycomb and later. This behavior only kicks in if you're compiling against HC+, anything compiled against GB or lower can freely block the UI thread with reckless abandon regardless of which Android version you're running on.
One side effect of this is that Falcon Pro should now be significantly better optimized. The "What's New" log on Google Play reveals this[1]:
v1.0.2
* Optimized the app following @romainguy recommendations. Report back if you feel the butter :)
Wow, I'm amazed! Android performance tools have really advanced since my early attempts at Android development (2.2-2.3). I'll need to try some of these tools out.
Have they advanced on the NDK/C++ front as well? I've never been interested in the Java aspect of Android and working with the NDK was brutal compared to iOS.
As an Android developer, it's really exciting to see good examples of the tools in action. The documentation on how to debug problems beyond crashes/ANRs is a bit thin.
That being said, I do have one gripe. There are some chasms between the different tools. It is a bit painful to operate all the different performance tools and deal with switching contexts for the same problem. Systrace -> traceview and back and forth.
Also, it would be nice if traceview had a text-based API/interface. I know the graph visualization must be valuable for something, but I spend the majority of my time looking for particular methods and signs of excessive consumption/trouble. Now that I think of it, this sounds like a fun weekend project :)
We are trying to have all the tools available in ADT/monitor. Systrace can actually be invoked directly from ADT/monitor; there's a button in the toolbar for this.
There's a limit of 100,000 users/tokens for new Twitter clients. That's still $100,000 for a $1 Twitter app ($70,000 after Apple/Google takes their share), so it's not the end of the world for Twitter apps, but it won't make any developers rich either.
Getting 100,000 users for a paid app is very rare; Falcon Pro, for example, has only between 1,000 and 5,000 users so far. It was just recently released, though. If it starts to become popular "too fast", the price can simply be raised.
I tried to do the memory bandwidth math but ended up even more puzzled: Anandtech's Nexus 7 review says it has 5.3 GB/s of memory bandwidth. At 60 fps that would leave 88 MB per frame. The 1280x800 screen has 3 MB worth of pixels at 24 bpp, which is 1/29 of the 88 MB per frame. So how come the overdraw-related slowdowns started appearing at just 4x overdraw?
Frame buffers are 32 bits, not 24 bits. You must also take into account the cost of blending. Also, you don't necessarily get all the bandwidth (especially when the device is not running at full clock.)
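A rough back-of-the-envelope with those corrections, assuming every drawn pixel costs one 32-bit texture read plus a blended framebuffer read-modify-write:

    1280 x 800 = 1,024,000 pixels; one 32 bpp full-screen layer ≈ 4.1 MB
    4x blended overdraw, per pixel:
      4 texture reads + 4 framebuffer writes + 3 framebuffer reads = 44 bytes
    1,024,000 pixels x 44 bytes ≈ 45 MB per frame ≈ 2.7 GB/s at 60 fps

That's already over half the 5.3 GB/s theoretical peak before the CPU, other textures, and display scan-out take their share, and sustained bandwidth is always well below peak.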