The return of the frame pointers (brendangregg.com)
684 points by mfiguiere 5 months ago | 251 comments



I remember when the omission of stack frame pointers started spreading at the beginning of the 2000s. I was in college at the time, studying computer science in a very poor third-world country. Our computers were old and far from powerful. So, for most course projects, we would eschew interpreters and use compilers. Mind you, what my college lacked in money it compensated for with interesting coursework. We studied and implemented low-level data structures, compilers, assembly-code numerical routines and even a device driver for Minix.

During my first two years in college, if one of our programs did something funny, I would attach gdb and see what was happening at the assembly level. I got used to "walking the stack" manually, though the debugger often helped a lot. Happy times, until all of a sudden "-fomit-frame-pointer" was all the rage, and stack traces stopped making sense. Just like that, debugging that segfault or illegal instruction became exponentially harder. A short time later, I started using Python for almost everything to avoid broken debugging sessions. So, I lost an order of magnitude or two with "-fomit-frame-pointer". But learning Python served me well for other adventures.


Didn't you know about -fno-omit-frame-pointer?


Perhaps they were talking about libraries they didn’t compile, and with languages with callbacks this can be a real problem.


You unfortunately do not control everything on your machine and how it's compiled. I think source-based distros like Gentoo are the only exception, but they have their own warts.


I'm glad he mentioned Fedora because it's been a tiresome battle to keep frame pointers enabled in the whole distribution (eg https://pagure.io/fesco/issue/3084).

There's a persistent myth that frame pointers have a huge overhead, because there was a single Python case that had a +10% slowdown (now fixed). The actual measured overhead is under 1%, which is far outweighed by the gains we've been able to make in certain applications.


I believe it's a misrepresentation to say that "actual measured overhead is under 1%". I don't think such a claim can be universally applied because this depends on the very workload you're measuring the overhead with.

FWIW your results don't quite match the measurements from Linux kernel folks who claim that the overhead is anywhere between 5-10%. Source: https://lore.kernel.org/lkml/20170602104048.jkkzssljsompjdwy...

   I didn't preserve the data involved but in a variety of workloads including netperf, page allocator microbenchmark, pgbench and sqlite, enabling framepointer introduced overhead of around the 5-10% mark.
What's significant about their results, IMO, is that they measured the impact using PostgreSQL and SQLite. If anything, a DBMS is one of the best ways to really stress a system.


Those are numbers from 7 years ago, so they're beginning to get a bit stale as people put more weight behind having frame pointers and make upstream contributions to their compilers to improve their output. Much more recent testing puts it at <1%: by the very R.W.M. Jones you're replying to [0], and separately by others like Brendan Gregg [1b], whose post this thread is commenting on (and which includes [1b] in its Appendix as well), with similar accounts from others in the last couple of years. Oh, and if you use flamegraph, you might want to check the repo for a familiar name.

Some programs, like Python, have reported worse, 2-7% [2], but there is traction on tackling that [1a] (see both rwmj's and brendangregg's replies to sibling comments, they've both done a lot of upstreamed work wrt. frame pointers, performance, and profiling).

As has been frequently pointed out, the benefits from improved profiling cannot be overstated: even a 10% cost to having frame pointers can be well worth it when you leverage that information to target the actual bottlenecks that are eating up your cycles. Plus, you can always disable it in specific hotspots later when needed, which is much easier than the reverse.

Something, something, premature optimisation -- though in seriousness, this information benefits actual optimisation. We lack the information and understanding that would allow truly universal claims precisely because things like this haven't been available, and so haven't been widely used. We know frame pointers can be a detriment in certain hotspots, through additional register pressure and the extended function prologue/epilogue; that's why we have granular control. But without them, we often don't know which hotspots are actually affected, so I'm sure even the databases would benefit... though the "my database is the fastest database" problem has always been the result of endless micro-benchmarking rather than actual end-to-end program performance and latency, so even a claimed "10%" drop there probably doesn't impact actual real-world usage. That's a reason why some of the most interesting profiling work lately has come from ideas like causal profilers and continuous profilers, which answer exactly that.

[0]: https://rwmj.wordpress.com/2023/02/14/frame-pointers-vs-dwar... [1a]: https://pagure.io/fesco/issue/2817#comment-826636 [1b]: https://pagure.io/fesco/issue/2817#comment-826805 [2]: https://discuss.python.org/t/the-performance-of-python-with-...


While improved profiling is useful, achieving it by wasting a register is annoying, because it is just a very dumb solution.

The choice Intel made when designing the 8086 to use two separate registers for the stack pointer and the frame pointer was a big mistake.

It is very easy to use a single register as both the stack pointer and the frame pointer, as it is standard for instance in IBM POWER.

Unfortunately in the Intel/AMD CPUs using a single register is difficult, because the simplest implementation is unreliable since interrupts may occur between 2 instructions that must form an atomic sequence (and they may clobber the stack before new space is allocated after writing the old frame pointer value in the stack).

It would have been very easy to correct this in new CPUs by detecting that instruction sequence and blocking the interrupts between them.

Intel had already done this once early in the history of the x86 CPUs, when they discovered a mistake in the design of the ISA: interrupts could occur between updating the stack segment and the stack pointer. They corrected this by detecting such an instruction sequence and blocking interrupts at the boundary between those instructions.

The same could have been done now, to enable the stack pointer to also serve as the frame pointer. (This would be done by always saving the stack pointer at the top of the stack whenever stack space is allocated, so that the stack pointer always points to the previous frame pointer, i.e. to the start of the linked list containing all stack frames.)
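
For anyone following along, that linked list of frames is exactly what profilers walk today with the conventional x86-64 layout. A minimal sketch, assuming GCC/Clang and a build with -fno-omit-frame-pointer; the Frame struct and the function names are made up for illustration:

    // Conventional x86-64 frame layout: [rbp] holds the saved caller frame
    // pointer and [rbp+8] the return address, so the frames form a linked list.
    #include <cstdint>
    #include <cstdio>

    struct Frame {
        const Frame *prev;      // saved caller frame pointer
        const void  *ret_addr;  // return address into the caller
    };

    __attribute__((noinline)) void walk_stack() {
        // This function's own frame pointer.
        auto *fp = static_cast<const Frame *>(__builtin_frame_address(0));
        for (int depth = 0; fp != nullptr && depth < 64; ++depth) {
            std::printf("#%d return address %p\n", depth, fp->ret_addr);
            if (fp->prev <= fp) break;  // frames must move toward higher addresses
            fp = fp->prev;
        }
    }

    __attribute__((noinline)) void leaf()  { walk_stack(); }
    __attribute__((noinline)) void outer() { leaf(); }

    int main() { outer(); }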


I'd prefer discussing the technical merits of a given approach rather than who is who and who did what, since that leads to the appeal-to-authority fallacy.

You're correct, the results might be stale, although I wouldn't hold my breath, since there has been no fundamental change in the way frame pointers are handled, as far as my understanding goes. Perhaps smaller improvements in compiler technology, but CPUs did not undergo any significant change in that context.

That said, nowhere in this thread have we seen a dispute of those Linux kernel results other than categorically rejecting them as being "microbenchmarks", which they are not.

> though the "my database is the fastest database" problem has always been the result of endless micro-benchmarking, rather than actual end-to-end program performance and latency

Quite the opposite. All database benchmarks are end-to-end program performance and latency analysis. "Cheating" in database benchmarks is done elsewhere.


> As has been frequently pointed out, the benefits from improved profiling cannot be overstated, even a 10% cost to having frame pointers can be well worth it when you leverage that information to target the actual bottlenecks that are eating up your cycles.

Few can leverage that information because the open source software you are talking about lacks telemetry in the self hosted case.

The profiling issue really comes down to the cultural opposition in these communities to collecting telemetry and opening it for anyone to see and use. The average user struggles to ally with a trustworthy actor who will share information like profiling data freely and anonymize it at a per-user level, the level that is actually useful. Such things exist, like the Linux hardware site, but only because they have not attracted the attention of agitators.

Basically users are okay with profiling, so long as it is quietly done by Amazon or Microsoft or Google, and not by the guy actually writing the code and giving it out for everyone to use for free. It’s one of the most moronic cultural trends, and blame can be put squarely on product growth grifters who equate telemetry with privacy violations; open source maintainers, who have enough responsibilities as is, besides educating their users; and Apple, who have made their essentially vaporous claims about privacy a central part of their brand.

Of course people know the answer to your question. Why doesn’t Google publish every profile of every piece of open source software? What exactly is sensitive about their workloads? Meta publishes a whole library about every single one of its customers, for anyone to freely read. I don’t buy into the holiness of the backend developer’s “cleverness” or whatever is deemed sensitive, and it’s so hypocritical.


> Basically users are okay with profiling, so long as it is quietly done by Amazon or Microsoft or Google, and not by the guy actually writing the code and giving it out for everyone to use for free.

No; the groups are approximately "cares whether software respects the user, including privacy", or "doesn't know or doesn't care". I seriously doubt that any meaningful number of people are okay with companies invading their privacy but not smaller projects.


"Agitators". We don't trust telemetry precisely because of comments like that. World is full of people like you who apparently see absolutely nothing wrong with exfiltrating identifying information from other people's computers. We have to actively resist such attempts, they are constant, never ending and it only seems to get worse over time but you dismiss it all as "cultural opposition" to telemetry.

For the record I'm NOT OK with being profiled, measured or otherwise studied in any way without my explicit consent. That even extends to the unethical human experiments that corporations run on people and which they euphemistically call A/B tests. I don't care if it's Google or a hobbyist developer, I will block it if I can and I will not lose a second of sleep over it.


> World is full of people like you who apparently see absolutely nothing wrong with exfiltrating identifying information from other people's computers.

True. But such people are like cockroaches. They know what they are doing will be unpopular with their targets, so they keep it hidden. This is easy enough to do in closed designs: car manufacturers selling your driving habits to insurance companies, and health-monitoring apps selling menstrual-cycle data to retailers marketing to women.

Compare that to, say, Debian and Red Hat. They too collect performance data. But the code is open source, Debian has reproducible builds so you can be 100% sure that is the code in use, and every so often someone takes a look at it. Guess what: the data they send back is so unidentifiable it satisfies even the most paranoid of their thousands of members.

All it takes is a little bit of sunlight to keep the cockroaches at bay, and then we can safely let the devs collect the data they need to improve code. And everyone benefits.


I fully support the telemetry schemes that already exist. It’s just obnoxious that there’s no apparent reason behind the support of one kind of anonymous telemetry versus another. For every one person who digests the package repo’s explanation of the telemetry strategy, there are 19 who feel okay with GitHub having all the telemetry, and none of the open source repos it hosts having any, because of vibes.


I think the kind of profiling information you're imagining is a little different from what I am.

Continuous profiling of your system that gets relayed to someone else by telemetry is very different from continuous profiling of your own system, handled only by yourself (or, generalising, your community/group/company). You seem to be imagining we're operating more in the former, whereas I am imagining more in the latter.

When it's our own system, better instrumented for our own uses, and we're the only ones getting the information, then there's nothing to worry about, and we can get much more meaningful and informative profiling done when more information about the system is available. I don't even need telemetry. When it's "someone else's" system, in other words, when we have no say in telemetry (or have to exercise a right to opt-out, rather than a more self-executing contract around opt-in), then we start to have exactly the kinds of issues you're envisaging.

When it's not completely out of our hands, then we need to recognise different users, different demands, different contexts. Catering to the user matters, and when it comes to sensitive information, well, people have different priorities and threat models.

If I'm opening a calendar on my phone, I don't expect it to be heavily instrumented and relaying all of that, I just want to see my calendar. When I open a calendar on my phone, and it is unreasonably slow, then I might want to submit relevant telemetry back in some capacity. Meanwhile, if I'm running the calendar server, I'm absolutely wanting to have all my instrumentation available and recording every morsel I reasonably can about that server, otherwise improving it or fixing it becomes much harder.

From the other side, if I'm running the server, I may want telemetry from users, but if it's not essential, then I can "make do" with only the occasional opt-in telemetry. I also have other means of profiling real usage, not just scooping it all up from unknowing users (or begrudging users). Those often have some other "cost", but in turn, they don't have the "cost" of demanding it from users. For people to freely choose requires acknowledging the asymmetries present, and that means we can't just take the path of least resistance, as we may have to pay for it later.

In short, it's a consent issue. Many violate that, knowingly, because they care not for the consequences. Many others don't even seem to think about it, and just go ahead regardless. And it's so much easier behind closed doors. Open source in comparison, even if not everything is public, must contend with the fact that the actions and consequences are (PRs, telemetry traffic, etc), so it inhabits a space in which violating consent is much more easily held accountable (though no guarantee).

Of course, this does not mean it's always done properly in open source. It's often an uphill battle to get telemetry that's off-by-default, where users explicitly consent via opt-in, as people see how that could easily be undermined, or later invalidated. Many opt-in mechanisms (e.g. a toggle in the settings menu) often do not have expiration built in, so fail to check at a later point that someone still consents. Not to say that's the way you must do it, just giving an example of a way that people seem to be more in favour of, as with the generally favourable response to such features making their way into "permissions" on mobile.

We can see how the suspicion creeps in, informed by experience... but that's also known by another word: vigilance.

So, users are not "okay" with it. There's a power imbalance where these companies are afforded the impunity because many are left to conclude they have no choice but to let them get away with it. That hasn't formed in a vacuum, and it's not so simple that we just pull back the curtain and reveal the wizard for what he is. Most seem to already know.

It's proven extremely difficult to push alternatives. One reason is that information is frequently not ready-to-hand for more typical users, but another is that said alternatives may not actually fulfil the needs of some users: notably, accessibility remains hugely inconsistent in open source, and is usually not funded on par with, say, projects that affect "backend" performance.

The result? Many people just give their grandma an iPhone. That's what's telling about the state of open source, and of the actual cultural trends that made it this way. The threat model is fraudsters and scammers, not nation-state actors or corporate malfeasance. This app has tons of profiling and privacy issues? So what? At least grandma can use it, and we can stay in contact, dealing with the very real cultural trends towards isolation. On a certain level, it's just pragmatic. They'd choose differently if they could, but they don't feel like they can, and they've got bigger worries.

Unless we do different, the status quo will remain. If there's any agitation to be had, it's in getting more people to care about improving things and then actually doing them, even if it's just taking small steps. There won't be a perfect solution that appears out of nowhere tomorrow, but we only have a low bar to clear. Besides, we've all thought "I could do better than that", so why not? Why not just aim for better?

Who knows, we might actually achieve it.


[flagged]


Telemetry is exceedingly useful, and it's basically a guaranteed boon when you operate your own systems. But telemetry isn't essential, and it's not the heart of the matter I was addressing. Again, the crux of this is consent, as an imbalance of power easily distorts the nature of consent.

Suppose Chrome added new telemetry, for example, like it did when WebRTC was added in Chrome 28, so we really can just track this against something we're all familiar enough with. When a user clicks "Update", or it auto-updated and "seamlessly" switched version in the background / between launches, well, did the user consent to the newly added telemetry?

Perhaps most importantly: did they even know? After all, the headline feature of Chrome 28 was Blink, not some feature that had only really been shown off in a few demos, and was still a little while away from mass adoption. No reporting on Chrome 28 that I could find from the time even mentions WebRTC, despite entire separate articles going out just based on seeing WebRTC demos! Notifications got more

So, capabilities to alter software like this, knowingly or unknowingly, undermine the nature of consent that many find implicit in downloading a browser, since what you download and what you end up using may be two very different things.

Now, let's consider a second imbalance. Did you even download Chrome? Most Android devices have it preinstalled, or some similar "open-core" browser (often a Chromium derivative). Some are even protected from being uninstalled, so you can't opt out that way, and Apple only just had to open up iOS to non-Safari-backed browsers.

So the notion of consent via the choice to install is easily undermined.

Lastly, because we really could go on all day with examples, what about when you do use it? Didn't you consent then?

Well, they may try to onboard you, and have you pretend to read some EULA, or just have it linked and give up the charade. If you don't tick the box for "I read and agree to this EULA", you don't progress. Of course, this is hardly a robust system. Enforceability aside, the moment you hand it over to someone else to look at a webpage, did they consent to the same EULA you did?

... Basically, all the "default" ways to consider consent are nebulous, potentially non-binding, and may be self-defeating. After all, you generally don't consent to every single line of code, every single feature, and so on, you are usually assumed to consent to the entire thing or nothing. Granularity with permissions has improved that somewhat, but there is usually still a bulk core you must accept before everything else; otherwise the software is usually kept in a non-functional state.

I'm not focused too specifically on Chrome here, but rather the broad patterns of how user consent typically assumed in software don't quite pan out as is often claimed. Was that telemetry the specific reason why libwebrtc was adopted by others? I'm not privy to the conversations that occurred with these decisions, but I imagine it's more one factor among many (not to mention, Pion is in/for Go, which was only 4 years old then, and the pion git repo only goes back to 2018). People were excited out of the gate, and libwebrtc being available (and C++) would have kept them in-step (all had support within 2013). But, again, really this is nothing to do with the actual topic at hand, so let's not get distracted.

The user has no opportunity to meaningfully consent to this. Ask most people about these things, and they wouldn't even recognise the features by now (as WebRTC or whatever is ubiquitous), let alone any mechanisms they may have to control how it engages with them.

Yet, the onus is put on the user. Why do we not ask about anything/anyone else in the equation, or consider what influences the user?

A recent example I think illustrates the imbalance and how it affects and warps consent is the recent snafu with a vending machine with limited facial recognition capabilities. In other words, the vending machine had a camera, ostensibly to know when to turn on or not and save power. When this got noticed at a university, it was removed, and everyone made a huge fuss, as they had not consented to this!

What I'd like to put in juxtaposition with that is how, in all likelihood, this vending machine was probably being monitored by CCTV, and even if not, that there is certainly CCTV at the university, and nearby, and everywhere else for that matter.

So what changed? The scale. CCTV everywhere does not feel like something you can, individually, do anything about; the imbalance of power is such that you have no recourse if you did not consent to it. A single vending machine? That scale and imbalance has shifted: it's now one machine, not put in place by your established security contracts, and not something ubiquitous. It's also something easily sabotaged without clear consequence (students at the university covered its cameras quite promptly upon realising), ironically perhaps, given that this was not their own property and was potentially in clear view of CCTV. But despite having all the same qualities as CCTV, the context it was embedded in was such that they took action against it.

This is the difference between Chrome demanding user consent and someone else asking for it. When the imbalance of power is against you, even just being asked feels like being demanded, whereas when it doesn't quite feel that way, well, users often take a chance to prevent such an imbalance forming, and so work against something that may (in the case of some telemetry) actually be in their favour. However, part and parcel with meeting user needs is respecting their own desires -- as some say, the customer is always right in matters of taste.

To re-iterate myself from before, there are other ways of getting profiling information, or anything you might relay via telemetry, that do not have to conform to the Google/Meta/Amazon/Microsoft/etc model of user consent. They choose the way they do because, to them, it's the most efficient way. At their scale, they get the benefits of ubiquitous presence and leverage of the imbalance of power, and so what you view as your system, they view as theirs, altering with impunity, backed by enough power to prevent many taking meaningful action to the contrary.

For the rest of us, however, that might just be the wrong way to go about it. If we're trying to avoid all the nightmares that such companies have wrought, and to do it right by one another, then the first step is to evaluate how we engage with users, what the relationship ("contract") we intend to form is, and how we might inspire mutual respect.

In ethical user studies, users are remunerated for their participation, and must explicitly give knowing consent, with the ability to withdraw at any time. Online, they're continually A/B tested, frequently without consent. On one hand, the user is placed in control, informed, and provided with the affordances and impunity to consent entirely according to their own will and discretion. On the other, the user is controlled, their agency taken away by the impunity of another, often without the awareness that this is ongoing, or that they might have been able to leverage consent (and often ignored even if they did, after all, it's easy to do so when you hold the power). I know which I'd rather be on the other end of, at least personally speaking.

So, if we want to enable telemetry, or other approaches to collaborating with users to improve our software, then we need to do just that. Collaborate. Rethink how we engage, respect them, respect their consent. It's not just that we can't replicate Google, but that maybe we shouldn't, maybe that approach is what's poisoned the well for others wanting to use it, and what's forcing us to try something else. Maybe not, after all, that's not for us to judge at this point, it's only with hindsight that we might truly know. Either way, I think there's some chance for people to come in, make something that actually fits with people, something that regards them as a person, not simply a user, and respects their consent. Stuff like that might start to shift the needle, not by trying to replace Google or libwebrtc or whatever and get the next billion users, but by paving a way and meeting the needs of those who need it, even if it's just a handful of customers or even just friends and family. Who knows, we might start solving some of the problems we're all complaining about yet never seem to fix. At the very least, it feels like a breath of fresh air.


You’re agreeing with me.

> Rethink how we engage, respect them, respect their consent.

One way to characterize this is leadership. Most open source software authors are terrible leaders!

You’re way too polite. Brother, who is making a mistake and deserves blame? Users? Open source non corporate software maintainers? Google employees? Someone does. It can’t be “we.” I don’t make any of these mistakes, leave me out of it! I tell every non corporate open source maintainer to add basic anonymized telemetry, PR a specific opt-out solution with my preferred Plausible, and argue relentlessly with users to probe the vapors they base their telemetry fears on. We’re both trying to engage on the issue, but the average HN reader is downvoting me. Because “vibes.” Vibes are dumb! Just don’t be afraid to say it.


Those are microbenchmarks.


pgbench is not a microbenchmark.


From the docs: "pgbench is a simple program for running benchmark tests on PostgreSQL. It runs the same sequence of SQL commands over and over"

While it might call itself a benchmark, it behaves very microbenchmark-y.

The other numbers I and others have shared have been from actual production workloads, not a simple program that tests the same sequence of commands over and over.


While pgbench might be a "simple" program, as in a test runner, the workloads it runs are far from simple. It runs TPC-B by default but can also run an arbitrary script defining whatever workload you want. It also allows queries to run concurrently, so I fail to understand the reasoning behind it "being simple" or "microbenchmarky". It's far from the truth, I think.


Anything running a full database server is not micro.


If I call the same "get statistics" command over and over in a loop (with zero queries), or 100% the same invalid query (to test the error path performance), I believe we'd call that a micro-benchmark, despite involving a full database. It's a completely unrealistic artificial workload to test a particular type of operation.

The pgbench docs make it sound microbenchmark-y by describing making the same call over and over. If people find that this simulates actual production workloads, then yes, it can be considered a macro-benchmark.


"get statistics" is not what TPC-B does. Nor the invalid queries nor ...

From https://www.tpc.org/tpcb/, a TPC-B workload that pgbench runs by default:

    In August 1990, the TPC approved its second benchmark, TPC-B. In contrast to TPC-A, TPC-B is not an OLTP benchmark. Rather, TPC-B can be looked at as a database stress test, characterized by:

      Significant disk input/output
      Moderate system and application execution time
      Transaction integrity

    TPC-B measures throughput in terms of how many transactions per second a system can perform. Because there are substantial differences between the two benchmarks (OLTP vs. database stress test), TPC-B results cannot be compared to TPC-A.

    ...

    Transactions are submitted by programs all executing concurrently.


I think you missed the context of what I was responding to, which was about whether databases could even have micro-benchmarks.

You also missed the word "Obsolete" splattered all over the website you sent me, and the text that TPC-B was "Obsolete as of 6/6/95".


I don't think I have. I was only responding to the factually incorrect statement of yours that pgbench is a microbenchmark.

> which was about whether databases could even have micro-benchmarks.

No, this was an argument that you pulled out of nowhere. The topic very specifically was pgbench, not whether or not databases can have micro-benchmarks. The obvious answer is yes, they can, as can any other software out there.

I think you kinda tried to imply that pgbench is one such micro-benchmark in disguise, which is why I copy-pasted the description that proves it is not.

> You also missed the word "Obsolete"

I did not since that was not the topic being discussed at all. And in a technical sense, it doesn't matter at all. pgbench still runs so it is very much "not obsolete".


I didn't pull this argument out of nowhere, please read the direct comment I was replying to. Your position is also completely untenable: this benchmark was obsoleted by its creators 29 years ago, who very clearly say it is obsolete, and you're arguing that it isn't because it "still runs."

I'm guessing that this discussion would be more productive if you would please say who you are and the company you work for. I'm Brendan Gregg, I work for Intel, and I'm well known in the performance space. Who are you?


> I'm guessing that this discussion would be more productive if you would please say who you are and the company you work for. I'm Brendan Gregg, I work for Intel, and I'm well known in the performance space. Who are you?

Wow, just wow. How ridiculous is this?

> Your position is also completely untenable: this benchmark was obsoleted by its creators 29 years ago, who very clearly say it is obsolete, and you're arguing that it isn't because it "still runs."

My only position in the whole thread was "1% of overhead cannot be universally claimed" and I still stand by it 100%. The pgbench experiment from Linux kernel folks was just one of the counter-examples that can be found in the wild that goes against your claim. You first disputed it by saying it is a micro-benchmark (suggesting you have no idea what it is), and now you're disputing it by saying it's obsolete (yes, but still very relevant in database development and in use by Linux kernel folks; not the slightest technical reasoning after all).

Personally, I couldn't care less about this, but if names are what you're after, you're not disagreeing with me but with the methodology and results presented by Mel Gorman, Linux kernel developer.


It's not ridiculous at all. Who are you?

You are backing away from your other positions, for example:

> I fail to understand the reasoning of it "being simple" or "microbenchmarkey". It's far from the truth I think.

Do you now agree that TPC-B is too simple and microbenchmarky? And if not, please tell me (as I'm working on the problem of industry benchmarking in general) what would it take to convince someone like you to stop elevating obsoleted benchmarks like TPC-B? Is there anything?


A major postgres contributor (Andres Freund) disagreed with you about pgbench but, yes, feel free to dismiss them just because you found some words on a web page.

I am just a minor PostgreSQL contributor and consultant of no import, but do you seriously think you are superior to PostgreSQL core devs and know more than them about PostgreSQL just because you are a Linux kernel dev? I really do not like your attitude here and your appeal to your own authority when you are not even an authority here.

Pgbench is used heavily both in the development of PostgreSQL and by people who tune PostgreSQL. So it is not obsolete. Maybe it is a bad benchmark and the PostgreSQL community should stop using it, but then you need a stronger argument than just some claim about obsoleteness from 1995. If a PostgreSQL core developer says in 2024 that it is still relevant, I think that weighs a bit higher than a claim from 1995.


Yes, indeed, it is very ridiculous to pull out arguments such as "who are you". I mean, wtf?

Your replies demonstrate a lack of technical depth in certain areas, which makes me personally doubt your experiment results. And you know what, that is totally fine.

But your attitude? A total disbelief.

> Do you now agree that TPC-B is too simple and microbenchmarky?

No, why would I? You did not present any evidence to support that claim of yours.

And contrary to you, and to your own embarrassment, I do not resort to personal attacks when I am out of my technical depth.

> what would it take to convince someone like you to stop elevating obsoleted benchmarks like TPC-B? Is there anything?

You don't have to convince me of anything. Remember that this is just a random internet page where people are exchanging their opinions.

Wish you a pleasant day.


There are loads of real-world workloads that have similar patterns to pgbench, particularly read-only pgbench.


This isn't an argument for a default.


I was not even trying to make one. I was questioning the validity of the "1% overhead" claim by providing a counter-example from a respectable source.


You probably already know, but with OCaml 5 the only way to get flamegraphs working is to either:

* use framepointers [1]

* use LBR (but LBR has a limited depth, and may not work on all CPUs, I'm assuming due to bugs in perf)

* implement some deep changes in how perf works to handle the 2 stacks in OCaml (I don't even know if this would be possible), or write/adapt some eBPF code to do it

OCaml 5 has a separate stack for OCaml code and C code, and although GDB can link them based on DWARF info, perf DWARF call-graphs cannot (https://github.com/ocaml/ocaml/issues/12563#issuecomment-193...)

If you need more evidence to keep it enabled in future releases, you can use OCaml 5 as an example (unfortunately there aren't many OCaml applications, so that may not carry too much weight on its own).

[1]: I hadn't actually realised that Fedora 39 has already enabled FP by default, nice! (I still do most of my day-to-day profiling on a ~CentOS 7 system with 'perf record --call-graph dwarf -F 47 -a'; I was aware that there was a discussion to enable FP by default, but hadn't noticed it has actually been done already)


No, LBR is an Intel-only feature.


https://www.phoronix.com/news/AMD-Zen-4-LbrExtV2 LBR is supposed to work on AMD too, except it doesn't. I'll have to open a bug report (it records the data, it just can't parse it afterwards)


Also, if you want LBR on Zen3 or earlier, you're SOL. I noticed that one the hard way.

Stuff like this is making me lean more towards getting an Intel for my next computer, at least on desktop where their worse power efficiency is less of an issue. But then they keep gating AVX-512 due to their E-cores not supporting it... You really can't win these days.


Frame pointers are still a no-go on 32-bit, so anything that is IoT today.

The reason we removed them was not a myth but comes from the pre-64 bit days. Not that long ago actually.

Even today, if you want to repurpose older 64-bit systems with a new life, this kind of optimization still makes sense.

Ideally it should be the default also for security critical systems because not everything needs to be optimized for "observability"


> Frame pointers are still a no-go on 32bit so anything that is IoT today.

Isn't that just 32-bit x86, which isn't used in IoT? The other 32-bit ISAs aren't register-starved like x86.


It would be, yes. x86 had very few registers, so anything you could do to free them up was vital. Arm 32bit has 32 general purpose registers I think, and RISC V certainly does. In fact there's no difference between 32 and 64 bit in that respect. If anything, 64-bit frame pointers make it marginally worse.


Sadly, no. 32-bit ARM only has 16 GPR’s (two of which are zero and link), mostly because of the stupid predication bits in the instruction encoding.

That said, I don’t know how valuable getting rid of FP on ARM is - I once benchmarked ffmpeg on 32-bit x86 before and after enabling FP and PIC (basically removing 2 GPRs) and the difference was huge (>10%) but that’s an extreme example.


Arm32 doesn’t have a zero-value register. Its non-general-purpose registers are PC, LR, SP, FP – tho the link register can be used for temporary values.


Ah yes - the weird one was PC not zero. Anyhoo, all of these mistakes were fixed with Arm64.


Thanks; what was the Python fix?


This was the investigation: https://discuss.python.org/t/python-3-11-performance-with-fr...

Initially we just turned off frame pointers for the Python 3.9 interpreter in Fedora. They are back on in Python 3.12 where it seems the upstream bug has been fixed, although I can't find the actual fix right now.

Fedora tracking bug: https://bugzilla.redhat.com/2158729

Fedora change in Python 3.9 to disable frame pointers: https://src.fedoraproject.org/rpms/python3.9/c/9b71f8369141c...


Ah right, thanks, I remember I saw Andrii's analysis in the other thread. https://pagure.io/fesco/issue/2817#comment-826636


I'm still baffled by this attitude. That "under 1%" overhead is why computers are measurably slower to use than 30 years ago. All those "under 1%" overheads add up.


That's one thing Apple did do right on ARM:

> The frame pointer register (x29) must always address a valid frame record. Some functions — such as leaf functions or tail calls — may opt not to create an entry in this list. As a result, stack traces are always meaningful, even without debug information.

https://developer.apple.com/documentation/xcode/writing-arm6...


On Apple platforms, there is often an interpretability problem of another kind: Because of the prevalence of deeply nested blocks / closures, backtraces for Objective C / Swift apps are often spread across numerous threads. I don't know of a good solution for that yet.


I'm not very familiar with Objective C and Swift, so this might not make sense. But JS used to have a similar problem with async/await. The v8 engine solved it by walking the chain of JS promises to recover the "logical stack" developers are interested in [1].

[1] https://v8.dev/blog/fast-async


Swift concurrency does a similar thing. For the older dispatch blocks, Xcode injects a library that records backtraces over thread hops.


I was at Google in 2005 on the other side of the argument. My view back then was simple:

Even if $BIG_COMPANY makes a decision to compile everything with frame pointers, the rest of the community is not. So we'll be stuck fighting an unwinnable argument with a much larger community. Turns out that it was a ~20 year argument.

I ended up writing some patches to make libunwind work for gperftools and maintained libunwind for some number of years as a consequence of that work.

Having moved on to other areas of computing, I'm now a passive observer. But it's fascinating to read history from the other perspective.


> So we'll be stuck fighting an unwinnable argument with a much larger community.

In what way would you be stuck? What functional problems does adding frame pointers introduce?


You do get occasional regressions. eg. We found an extremely obscure bug involving enabling frame pointers, valgrind, glibc ifuncs and inlining (all at the same time):

https://bugzilla.redhat.com/show_bug.cgi?id=2267598 https://github.com/tukaani-project/xz/commit/82ecc538193b380...


FYI, this "regression" was likely actually a symptom of a purposely-inserted backdoor. See [1] and my comment there [2].

[1]: https://news.ycombinator.com/item?id=39865810

[2]: https://news.ycombinator.com/item?id=39867301


I wasn't talking about functional problems. It was a simple observation that big companies were not going to convince Linux distributors to add frame pointers anytime soon and that what those distributors do is relevant.

All of the companies involved believed that they were special and decided to build their own (poorly managed) distribution called "third party code" and having to deal with it was not my best experience working at these companies.


Oh, I just assumed you were talking about Google's Linux distribution and applications it runs on its fleet. I must have mis-assumed. Re-reading... maybe you weren't talking about any builds but just whether or not to oppose kernel and toolchain defaulting to omit frame pointers?


Google didn't have a Linux distribution for a long time (the one everyone used on the desktop was an outdated rpm-based distro; we mostly ignored it for development purposes).

What existed was a x86 to x86 cross compilation environment and the libraries involved were manually imported by developers who needed that particular library.

My argument was about the cost of ensuring that those libraries were compiled with frame pointers when much of the open source community was defaulting to omit-fp.


Would it not be easier to patch compilers to always assume the equivalent of -fno-omit-frame-pointer?


That was done in 2005. But the task of auditing the supply chain to ensure that every single shared library you ever linked with was compiled a certain way was still hard. Nothing prevented an intern or a new employee from checking in a library without frame pointers into the third-party repo.

In 2024, you'd probably create a "build container" that all developers are required to use to build binaries or pay a linux distributor to build that container.

But cross compilation was the preferred approach back then. So all binaries had an rpath (run-time search path for shared libraries) that ignored the distributor-supplied libraries.

Having come from an open source background, I found this system hard to digest. But there was a lot of social pressure to work as a bee in a system that thousands of other very competent engineers were using (quite successfully).

I remember briefly talking to a Chrome OS-related group who were using the "build your own custom distro" approach, before deciding to move to another FAANG.


> or pay a linux distributor to build that container.

What does this mean?


I didn't mean anything nefarious here :)

Since Google would rather have the best brains in the industry build the next search indexing algorithm or the browser, they didn't have the time to invest human capital into building a better open source friendly dev environment.

A natural alternative is to contract out the work. Linux distributors were good candidates for such contract work.

But the vibe back then was Google could build better alternatives to some of these libraries and therefore bridging the gap between dev experience as an open source dev vs in house software engineer wasn't important.

You could see the same argument play out in git vs monorepo etc, where people take up strong positions on a particular narrow tech topic, whereas the larger issue gets ignored as a result of these belief systems.


It “wastes” a register when you’re not actively using them. On x86 that can make a big difference, though with the added registers of x86_64 it’s much less significant.


Wasting a register on comparatively more modern ISAs (PA-RISC 2.0, MIPS64, POWER, aarch64, etc. – they are all more modern and have an abundance of general purpose registers) is not a concern.

The actual «wastage» is in having to generate a prologue and an epilogue for each function – 2x instructions to preserve the old frame pointer and set a new one up, and 2x instructions at the point of return to restore the previous frame pointer.

Generally, it is not a big deal, with the exception of the pathological case of a very large number of very small functions calling each other frequently, where the extra 4x instructions per function will be filling up the L1 instruction cache «unnecessarily».
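
To make that concrete, here is roughly what the per-function cost looks like on x86-64. This is an illustration, not authoritative codegen; exact output varies by compiler and flags:

    // A trivial function annotated with roughly what GCC/Clang emit on x86-64.
    int add_one(int x) {
        // With -fno-omit-frame-pointer:
        //   push rbp          ; save the old frame pointer          (prologue)
        //   mov  rbp, rsp     ; establish the new frame             (prologue)
        //   lea  eax, [rdi+1] ; the actual work
        //   pop  rbp          ; restore the previous frame pointer  (epilogue)
        //   ret
        // With -fomit-frame-pointer the push/mov/pop disappear and rbp becomes
        // one more register for the allocator.
        return x + 1;
    }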


Those pathological cases are really what inlining is for, with the exception of any tiny recursive functions that can't be tail call optimised.


Yes, inlining (and LTO can take it a notch or two higher) does away with the problem altogether; however, the number of projects that default to «-Os» (or even to «-O2») to build a release product is substantial, if not large.

There is also a significant number of projects that go to great lengths to force-override CFLAGS/CXXFLAGS (usually with «-O2 -g» or even with «-O») or make it extraordinarily difficult to change the project's default CFLAGS, for no apparent reason, which eliminates a number of advanced optimisations in builds with default build settings.


It's not just the loss of an architectural register, it's also the added cost to the prologue/epilogue. Even on x86_64, it can make a difference, in particular for small functions, which might not be inlined for a variety of reasons.


If your small function is not getting inlined, you should investigate why that is instead of globally breaking performance analysis of your code.


A typical case would be C++ virtual member functions. (They can sometimes be devirtualized, or speculatively partially devirtualized, using LTO+PGO, but there are lots of legitimate cases where they cannot.)
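
A minimal sketch of that shape, in case anyone wants to poke at the codegen (the class and function names are made up):

    // The concrete type is only known at run time, so the tiny area() overrides
    // are reached through an indirect call and are generally not inlined
    // (absent speculative devirtualization via LTO+PGO).
    #include <cstdio>
    #include <memory>

    struct Shape {
        virtual ~Shape() = default;
        virtual double area() const = 0;
    };

    struct Square : Shape {
        double side;
        explicit Square(double s) : side(s) {}
        double area() const override { return side * side; }
    };

    struct Circle : Shape {
        double r;
        explicit Circle(double radius) : r(radius) {}
        double area() const override { return 3.14159265358979 * r * r; }
    };

    int main(int argc, char **) {
        std::unique_ptr<Shape> s;
        if (argc > 1) s = std::make_unique<Circle>(2.0);
        else          s = std::make_unique<Square>(2.0);

        double sum = 0;
        for (int i = 0; i < 1000; ++i) sum += s->area();  // indirect calls
        std::printf("%f\n", sum);
    }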


CPUs spend an enormous amount of time waiting for IO and memory, and push/pop and similar are just insanely well optimized. As the article also claims, I would be very surprised to see any effect, unless the extra instructions themselves spill the I-cache.


I've seen around 1-3% on non-micro-benchmarks, real applications.

See also this benchmark from Phoronix [0]:

  Of the 100 tests carried out for this article, when taking the geometric mean of all these benchmarks it equated to about a 14% performance penalty of the software with -O2 compared to when adding -fno-omit-frame-pointer.
I'm not saying these benchmarks or the workloads I've seen are representative of the "real world", but people keep repeating that frame pointers are basically free, which is just not the case.

[0] https://www.phoronix.com/review/fedora-frame-pointer


Right, but I was asking about functional problems (being "stuck"), which sounded like a big issue for the choice.


It caused a problem when building inline assembly heavy code that tried to use all the registers, frame pointer register included.


What area?


[flagged]


The clear and obvious win would have been adoption of a universal userspace generic unwind facility, like Windows has --- one that works with multiple languages. Turning on frame pointers is throwing in the towel on the performance tooling ecosystem coordination problem: we can't get people to fix unwind information, so we do this instead? Ugh.


Yes, although the universal mechanisms that have been proposed so far have been quite ridiculous - for example having every program handle a "frame pointer signal" in userspace, which doesn't account for the reality that we need to do frame unwinding thousands of times a second with the least possible overhead. Frame pointers work for most things, and where they don't work (interpreted code) you're often not that interested in performance.


> every program handle a "frame pointer signal" in userspace

Yep. That's my proposal.

> which doesn't account for the reality that we need to do frame unwinding thousands of times a second with the least possible overhead

Yes, it does. The kernel has to return to userspace anyway at some point, and pushing a signal frame during that return is cheap. The cost of signal delivery is the entry into the kernel, and after a perf counter overflow, you've already paid that cost. Why would the actual unwinding be any faster in the kernel than in userspace?

Also, so what if a thread enters the kernel and samples the stack multiple times before returning to userspace? While in the kernel, the userspace stack cannot change --- therefore, it's sufficient to delay userspace stack collection until the kernel returns to userspace anyway.

You might ask "Don't we have to restore the signal mask after handling the profiling signal?"

Not if you don't define the signal to change the signal mask. sigreturn(2) is optional.


This sounds vastly more complex already than following a linked list. You've also ignored the other cost which is getting the stack trace data out of the program. Anyway I'm keen to see your implementation and test how it works in reality.


[flagged]


We have to deal with reality if we want to measure and improve software performance today. The current reality is that frame pointers are the best choice. Brendan's article outlines a couple of possible future scenarios where we turn frame pointers off again, but they require work that is not done yet (in one case, advances in CPUs).


Your argument would be more compelling without the swipe in the final sentence.


I propose that a frame pointer daemon be introduced too, for managing the frame pointer signals. We shall modify _start() to open up an io_uring connection to SystemD so that a program may share its .eh_frame data. That way the kernel can still unwind its stack in case apt upgrade changes the elf inode.


Neither of you has identified anything technically wrong with unwinding via signal and neither of you has proposed a mechanism through which we might support semantically informative unwinding through paged-out code or interpreted languages.

Sarcasm is not a technical argument.


I don't need to. Fedora and Ubuntu have already changed their policies to restore frame pointers. As far as I can tell, your proposal is no longer on the table. If you aren't willing to accept the decision, then you should at least understand that the onus is on you now to justify why things need to change.


> It's such a clear and obvious win that the rest of us should have the opportunity to persuade them

> I don't have to

Pick one, jart.


Cosmopolitan Libc does frame pointer unwinding once per function call, when the --ftrace flag is passed. https://justine.lol/ftrace/


I think this came off somewhat aggressive. I vouched for the comment because flagging it is an absurd overreaction, but I also don't think pointing out isolated individuals would be of much help.

Barriers to progress here are best identified on a community level, wouldn't you say?

But people, please calm down. Filing an issue or posting to the mailing list to make a case isn't sending a SWAT team to people's home. It's a technical issue, one well within the envelope of topics which can be resolved politely and on the merits.


> that the rest of us should have the opportunity

What entitles you to this opportunity?


Of course, if you cede RBP to be a frame pointer, you may as well have two stacks, one which is pointed into by RBP and stores the activation frames, and the other one which is pointed into by RSP and stores the return addresses only. At this point, you don't even need to "walk the stack" because the call stack is literally just a flat array of return addresses.

Why do we normally store the return addresses near to the local variables in the first place, again? There are so many downsides.


It simplifies storage management. A stack frame is a simple bump pointer which is always in cache, with only one guard page for overflow; in your proposal you need two guard pages, double the stack manipulations, and double the chance of a cache miss.


Yes, two guard pages are needed. No, the stack management stays the same: it's just "CALL func" at the call site, "SUB RBP, <frame_size>" in the prologue and "ADD RBP, <frame_size>; RET" in the epilogue. As for the chances of a cache miss... probably, but I guess you also double them when you enable CET/shadow stacks, so eh.

In exchange, it becomes very difficult for the stack smashing to corrupt the return address.


The reduceron had five stacks and it was faster because of it.


Note the ‘shadow stacks’ CPU feature mentioned briefly in the article, though it’s more for security reasons. It’s pretty similar to what you describe.


Shadow stacks have been proposed as an alternative, although it's my understanding that in current CPUs they hold only a limited number of frames, like 16 or 32?


You may be thinking of the return stack buffer. The shadow stack holds every return address.


While here, why do we grow the stack the wrong way so misbehaved programs cause security issues? I know the reason of course, like so many things it last made sense 30 years ago, but the effects have been interesting.


You may be ready for Forth [1] ;-). Strangely, the Wikipedia article apparently doesn't put forward that Forth allows access both to the parameter and the return stack, which is a major feature of the model.

[1] https://en.wikipedia.org/wiki/Forth_(programming_language)


That does seem like a significant oversight. >r and r>, and cousins, are part of ANSI Forth, and I've never used a Forth which doesn't have them.


Forth has a parameter stack, return stack, vocabulary stack

STOIC, a variant of Forth, includes a file stack when loading words


I'm not sure what you're referring to with "vocabulary stack" here, perhaps the dictionary? More of a linked list, really a distinctive data structure of its own.


Maybe OP refers to vocabulary search order manipulation [1]. It's sort of like namespaces, but "stacked". There's also the old MARKER and FORGET pair [2].

The dictionary pointer can also be manipulated in some dialects. That can be used directly as the stack variant of the arena allocator idea. It is particularly useful for text concatenation.

[1] https://forth-standard.org/standard/search [2] https://forth-standard.org/standard/core/MARKER


> Why do we normally store the return addresses near the local variables in the first place, again? There are so many downsides.

The advantage of storing them elsewhere is not quite clear (unless you have hardware support for things like shadow stacks).

You'd have to argue that the cost of moving things to this other page and managing two pointers (where one is less powerful in the ISA) is meaningfully cheaper than the other equally effective mitigation of stack cookies/protectors which are already able to provide protection only where needed. There is no real security benefit to doing this over what we currently have with stack protectors since an arbitrary read/write will still lead to a CFI bypass.


> The advantage of storing them elsewhere is not quite clear (unless you have hardware support for things like shadow stacks).

The classic buffer overflow issue should spring immediately to mind. By having a separate return address stack it's far less vulnerable to corruption through overflowing your data structures. This stops a bunch of attacks which purposely put crafted return addresses into position that will jump the program to malicious code.

It's not a panacea, but generally keeping code pointers away from data structures is a good idea.
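
For illustration, a toy sketch of how close those two things sit on a conventional single stack (x86-64, frame pointers enabled, GCC/Clang builtins; the names are made up):

    // Prints where a local buffer and the return-address slot live: a handful
    // of bytes apart on the same stack, which is exactly the proximity that
    // classic stack-smashing attacks abuse.
    #include <cstdint>
    #include <cstdio>

    __attribute__((noinline)) void show_layout() {
        char buf[16] = "hello";
        // With frame pointers, the saved rbp is at __builtin_frame_address(0)
        // and the return address is stored 8 bytes above it.
        auto buf_addr = reinterpret_cast<uintptr_t>(buf);
        auto ret_slot = reinterpret_cast<uintptr_t>(__builtin_frame_address(0)) + 8;
        std::printf("buffer starts at      %p\n", reinterpret_cast<void *>(buf_addr));
        std::printf("return address slot   %p (%ld bytes above the buffer)\n",
                    reinterpret_cast<void *>(ret_slot),
                    static_cast<long>(ret_slot - buf_addr));
    }

    int main() { show_layout(); }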


Virgil doesn't use frame pointers. If you don't have dynamic stack allocation, the frame of a given function has a fixed size and can be found with a simple (binary-search) table lookup. Virgil's technique uses an additional page-indexed range that further restricts the lookup to a few comparisons on average (O(log(# retpoints per page))). It combines the unwind info with stackmaps for GC. It takes very little space.

The main driver is in https://github.com/titzer/virgil/blob/master/rt/native/Nativ... and the rest of the code in the directory implements the decoding of metadata.
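
Roughly, the lookup and walk look something like this; a generic sketch of the idea with made-up names, not the actual Virgil runtime code:

    #include <stddef.h>
    #include <stdint.h>

    /* One entry per return point, sorted by address. frame_size is the
       fixed number of bytes that frame occupies above the saved return
       address. (Illustrative only; the real metadata is far more compact
       and also carries GC stackmaps.) */
    struct retpoint {
        uintptr_t addr;
        uint32_t  frame_size;
    };

    extern const struct retpoint retpoints[];
    extern const size_t          num_retpoints;

    /* Binary search for the return point at or just below pc. */
    static const struct retpoint *lookup(uintptr_t pc) {
        size_t lo = 0, hi = num_retpoints;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (retpoints[mid].addr <= pc) lo = mid + 1; else hi = mid;
        }
        return lo ? &retpoints[lo - 1] : NULL;
    }

    /* Walk the stack without frame pointers: each frame's size is known
       statically, so the next return address sits at sp + frame_size.
       (A real walker also bounds-checks pc against the code region and
       sp against the stack.) */
    static void walk(uintptr_t pc, uintptr_t sp, void (*visit)(uintptr_t)) {
        const struct retpoint *rp;
        while ((rp = lookup(pc)) != NULL) {
            visit(pc);
            sp += rp->frame_size;          /* pop the current frame         */
            pc  = *(const uintptr_t *)sp;  /* load the saved return address */
            sp += sizeof(uintptr_t);       /* step past the return address  */
        }
    }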

I think frame pointers only make sense if frames are dynamically-sized (i.e. have stack allocation of data). Otherwise it seems weird to me that a dynamic mechanism is used when a static mechanism would suffice; mostly because no one agreed on an ABI for the metadata encoding, or an unwind routine.

I believe the 1-2% measurement number. That's in the same ballpark as pervasive checks for array bounds checks. It's weird that the odd debugging and profiling task gets special pleading for a 1% cost but adding a layer of security gets the finger. Very bizarre priorities.


You can add bounds checks to c, but that costs a hell of a lot more than 1-2%. C++ has them off by default for std::vector because c++ is designed by and for the utterly insane. Other than that, I can't off the top of my head think of a language that doesn't have them.


The bounds safety C compiler extension research by Apple has measured the runtime impact of adding bounds checking to C and it is not a lot more than 1-2% in almost all cases. Even in microbenchmarks it's often around 5%. The impact on media encoding and decoding was around 1-2% and the overall power use on the device did not change.

https://www.youtube.com/watch?v=RK9bfrsMdAM https://llvm.org/devmtg/2023-05/slides/TechnicalTalks-May11/...

It's a myth that bounds checking has extraordinary performance costs and cannot be enabled without slowing everything to a halt. Maybe this was the case 10 years ago or 20 years ago or something, but not today.
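
For intuition about why this can be so cheap: a bounds check is just a compare plus a well-predicted branch before the access. A hand-written equivalent (not Apple's actual -fbounds-safety output, just an illustration):

    #include <stddef.h>
    #include <stdint.h>

    /* Hand-written equivalent of a compiler-inserted bounds check:
       one compare and one (almost never taken) branch per access. */
    static inline uint8_t checked_load(const uint8_t *buf, size_t len, size_t i) {
        if (i >= len)           /* predictable in the common case */
            __builtin_trap();   /* trap on violation */
        return buf[i];
    }

The branch predictor absorbs the check most of the time; the measurable cost is mostly extra code size and the occasional blocked optimization.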


> C++ has them off by default for std::vector because c++ is designed by and for the utterly insane.

And for those who value performance and don't want to pay the cost of "a lot more than 1-2%" ;p


The data I've seen for turning on bounds checks in std::vector shows overhead considerably lower than 1-2%.


std::vector falls into the category of things that are easy to bounds check, so the cost, even under today's primitive compilers, is low. It's direct pointer accesses (which are common in C but not in C++ or most other languages) that are harder, and therefore cost more, to bounds check.


That's assuming you're keeping no metadata about your C array(s) that you're bounds-checking, which would be very slow indeed. :o You'd be traversing pointers until you hit a tombstone value or something. But would anyone do this in performance-chasing code? Cuz, otherwise, with metadata to support your bounds checks, you're doing the same thing that I assume std::vector is doing: asking your array metadata about whether something's in bounds. And that's extra cycles, which can add up depending on what you're doing!

Btw, in my experience, std::vector is fast. Insanely fast. "I don't understand how it can be so fast", "barely distinguishable from raw arrays in C" fast. Not doing bounds checking is probably part of that, though far from the whole story.


std::regex, std::map, frozen ABI... apparently the value of performance is relative at WG21.


Good post!

> Profiling has been broken for 20 years and we've only now just fixed it.

It was a shame when they went away. Lots of people, certainly on other systems and probably Linux too, have found the absence of frame pointers painful this whole time and tried to keep them available in as many environments as possible. It’s validating (if also kind of frustrating) to see mainstream Linux bring them back.


I’m sincerely curious. While I realize that using dwarf for unwinding is annoying, why is it so bad that it’s worth pessimizing all code on the system? It’s slow on Debian derivatives because they package only the slow unwinding path for perf for example, for license reasons, but with decent tooling I barely notice the difference. What am I missing?


Using DWARF is annoying with perf because the kernel doesn't support stack unwinding with DWARF, and never will. The kernel has to be the one which unwinds the user space stacks, because it is the one managing the perf counters and handling the interrupts.

Since it can't unwind the user space stacks, the kernel has to copy the entire stack (8192 bytes by default) into the output file (perf.data) and then the perf user space program will unwind it later. It does this for each sample, which is usually hundreds of times per second, per CPU. Though it depends how you configured the collection.

That does have a significant overhead: first, runtime overhead: copying 8k bytes, hundreds of times per second, and writing it to disk, all don't come for free. You spend quite a bit of CPU time doing the memcpy operation which consumes memory bandwidth too. You also frequently need to increase the size of the perf memory buffer to accommodate all this data while it waits for user space to write it to disk. Second, disk space overhead, since the 8k stack bytes per sample are far larger than the stack trace would be. And third, it does require that you install debuginfo packages to get the DWARF info, which is usually a pain to do on production machines, and they consume a lot of disk space on their own.
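
For reference, this is roughly how the two collection modes being compared are selected (illustrative invocations against a hypothetical ./myapp; see perf-record(1) for details):

    # frame-pointer unwinding (the default method for -g): cheap per sample
    perf record -F 99 -g -- ./myapp

    # DWARF unwinding: the kernel dumps up to 8192 bytes of stack per sample,
    # and perf unwinds it later in user space
    perf record -F 99 --call-graph dwarf,8192 -- ./myapp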

Many of these overheads aren't too bad in simple cases (lower sample rates, fewer CPUs, or targeting a single task). But with larger machines with hundreds of CPUs, full-system collections, and higher frequencies, the overhead grows very quickly.

I'm not certain I know what you mean by the "slow unwinding path for perf", as there is no faster path for user space when frame pointers are disabled (except Intel LBR as outlined in the blog).


I assume you're talking about the “nondistro build”?

The difference between -g (--call-graph fp) and --call-graph dwarf is large even with perf linked directly against binutils, at least in my experience (and it was much, much worse until I got some patches into binutils to make it faster). This is both on record and report.

There are also weird bugs with --call-graph dwarf that perf upstream isn't doing anything about, around inlining. It's not perfect by any means.


Which is the slow unwinding path? The one from libbfd?


Overall, I am for frame pointers, but after some years working in this space, I thought I would share some thoughts:

* Many frame pointer unwinders don't account for a problem they have that DWARF unwind info doesn't have: the frame set-up is not atomic. It's done in two instructions, `push %rbp` and `mov %rsp, %rbp`, and if a snapshot is taken after the `push` but before the `mov`, we'll miss the parent frame. This might be fixable by inspecting the code, but I think that can only ever be a heuristic, as there could be other `push %rbp` instructions unrelated to the frame set-up. I would love to hear if there's a better approach!

* I developed the solution Brendan mentions which allows faster, in-kernel unwinding without frame pointers using BPF [0]. This doesn't use DWARF CFI (the unwind info) as-is but converts it into a random-access format that we can use in BPF. He mentions not supporting JVM languages, and while it's true that right now it only supports JIT sections that have frame pointers, I planned to implement a full JVM interpreter unwinder. I have left Polar Signals since and shifted priorities but it's feasible to get a JVM unwinder to work in lockstep with the native unwinder.

* In an ideal world, enabling frame pointers should be decided case by case. Benchmarking is key, and the tradeoffs you make can change a lot depending on the industry you are in and what your software is doing. In the past I have seen large projects enable or disable frame pointers without an in-depth assessment of the losses/gains in performance and observability, and how they connect to business metrics. The Fedora folks have done a superb and rigorous job here.

* Related to the previous point, having a build system that lets you change this system-wide, including the libraries your software depends on, can be awesome not only for testing these changes but also for putting them in production.

* Lastly, I am quite excited about SFrame that Indu is working on. It's going to solve a lot of the problems we are facing right now while letting users decide whether they use frame pointers. I can't wait for it, but I am afraid it might take several years until all the infrastructure is in place and everybody upgrades to it.

- [0]: https://web.archive.org/web/20231222054207/https://www.polar...


On the third point, you have to enable frame pointers across the whole Linux distro in order to get good flamegraphs. You have to do whole-system analysis to really understand what's going on. The way that current binary Linux distros (like Fedora and Debian) work makes any alternative impossible.


It could be one instruction: ENTER N,0 (where N is the amount of stack space to reserve for locals), which is the same as:

    PUSH EBP
    MOV  EBP,ESP
    SUB  ESP,N
(I don't recall if ENTER is x86-64 or not). But even with this, the frame setup isn't atomic with respect to CALL, and if the snapshot is taken after the CALL but before the ENTER, we still don't get the frame setup.

As for the reason why ENTER isn't used, it was deemed too slow. LEAVE (MOV ESP,EBP; POP EBP) is used as it's just as fast as, if not faster than, the sequence it replaces. If ENTER were just the PUSH/MOV/SUB sequence, it probably would be used, but it's that other operand (which is 0 above in my example) that kills it performance-wise (it's for nested functions to gain access to upper stack frames and is very expensive to use).


Great comments, thanks for sharing. The non-atomic frame setup is indeed problematic for CPU profilers, but it's not an issue for allocation profiling, Off-CPU profiling or other types off non-interrupt driven profiling. But as you mentioned, there might be ways to solve that problem.


Great comment! Just want to add we are making good progress on the JVM unwinder!


That's very interesting to me - I had seen the `[unknown]` mountain in my profiles but never knew why. I think it's a tough thing to justify: 2% performance is actually a pretty big difference.

It would be really nice to have fine-grained control over frame pointer inclusion: provided fine-grained profiling, we could determine whether we needed the frame pointers for a given function or compilation unit. I wouldn't be surprised if we see that only a handful of operations are dramatically slowed by frame pointer inclusion while the rest don't really care.


> 2% performance is actually a pretty big difference.

No it's not, particularly when it can help you identify hotspots via profiling that can net you improvements of 10% or more.


Sure, but how many of the people running distro compiled code do perf analysis? And how many of the people who need to do perf analysis are unable to use a with-frame-pointers version when they need to? And how many of those 10% perf improvements are in common distro code that get upstreamed to improve general user experience, as opposed to being in private application code?

If you're netflix then "enable frame pointers" is a no-brainer. But if you're a distro who's building code for millions of users, many of whom will likely never need to fire up a profiler, I think the question is at least a little trickier. The overall best tradeoff might end up being still to enable frame pointers, but I can see the other side too.


It's not a technical tradeoff, it's a refusal to compromise. Lack of frame pointers prevents many groups from using software built by distros altogether. If a distro decides that they'd rather make things go 1% faster for grandma, at the cost of alienating thousands of engineers at places like Netflix and Google who simply want to volunteer millions of dollars of their employers resources helping distros to find 10x performance improvements, then the distros are doing a great disservice to both grandma and themselves.


I mean, if you need to do performance analysis on some software, just recompile it. Why is it such a big deal?

In the end, 2% of the performance of every application is a big deal. On a single computer it may not be that significant, but think about all the computers, servers and clusters that run a Linux distro. And yes, I would ask a Google engineer whether, scaled across however many servers and computers Google has, a 2% increase in CPU usage is not a big deal: we are probably talking about hundreds of kW more in energy consumption!

We talk a lot about energy efficiency these days; to me, wasting 2% only to make performance analysis of some software easier (that is, so you can directly analyze the package shipped by the distro without recompiling it) is stupid.


The average European home consumes 400 watts at any given moment. Modern digital smart meters can consume 4 watts on average, which is 1% of a household's power consumption. On the grand scheme of society, these 1% losses in each home add up. If we consider all the grid monitoring equipment that's typically employed by electrical companies outside the home, the problem becomes much greater. In order to maximize energy efficiency and improve our environmental footprint, we must remove these metering and monitoring devices, which don't actually contribute to the delivery and consumption of power.


> I mean if you need to do performance analysis on a software just recompile it. Why it's such a big deal?

Not having them enabled on all dependencies makes them significantly less useful if your application interacts with the system in a meaningful way. "Just recompile" (glibc + xorg + qt + whatever else you use) is a very hefty thing.

Having them enabled by default also enables devs to ask their users for traces. I can reasonably tell someone to install sysprof or something, click a few buttons and send me a trace. I cannot reasonably tell them to compile the project (+ dependencies). And tbh, even devs cba if they have to jump through too many hoops.

> We talk a lot of energy efficiency these days, to me wasting a 2% only to make the performance analysis of some software easier (that is that you can analyze directly the package shipped by the distro and you don't have to recompile it) it's something stupid.

We're trading all kinds of efficiencies for all kinds of niche benefits all the time. This enables analysis of what is shipped in real use across the whole stack with the alternative being much less useful or a lot more (human) work. To me it's worth 2% and I'm shocked (well... not really) it's not for more people.


Presenting people with false dichotomies is no way to build something worthwhile


I would say the question here is what should be the default, and that the answer is clearly "frame pointers", from my point of view.

Code eking out every possible cycle of performance can enable a no-frame-pointer optimization and see if it helps. But it's a bad default for libc, and for the kernel.


You can turn it on/off per function by attaching one of these GCC attributes to the function declaration (although it doesn't work on LLVM):

  __attribute__((optimize("no-omit-frame-pointer")))
  __attribute__((optimize("omit-frame-pointer")))


The optimize fn attr causes other unintended side effects. Its usage is banned in the Linux kernel.


The performance cost in your case may be much smaller than 2 per cent.

Don't completely trust the benchmarks on this; they are a bit synthetic and real-world applications tend to produce very different results.

Plus, profiling is important. I was able to speed up various segments of my code by up to 20 per cent by profiling them carefully.

And, at the end of the day, if your application is so sensitive about any loss of performance, you can simply profile your code in your lab using frame pointers, then omit them in the version released to your customers.


> And, at the end of the day, if your application is so sensitive about any loss of performance, you can simply profile your code in your lab using frame pointers, then omit them in the version released to your customers.

That is what should be done, but TFA is about distros shipping code with frame pointers to end users because some developers are too lazy to recompile libc when profiling. Somehow shipping different copies of libc, one intended for end users on low-powered devices and one intended for developers, is not even considered.


If you can't introspect the release version of your software, you have no way of determining what the issue is. You're doing pseudo-science and guesswork to try and replicate the issue on a development version of the software. And if you put a few new logging statements into the release version, there's a pretty good chance that simply restarting the software will cause the symptom to go away.


The measured overhead is slightly less than 1%. There have been some rare historical cases where frame pointers have caused performance to blow up but those are fixed.


It’s usually a lot less than 2%.


JIT'ed code is sadly poorly supported, but LLVM has had great hooks for noting each method that is produced and its address. So you can build a simple mixed-mode unwinder, pretty easily, but mostly in process.

I think Intel's DNN things dump their info out to some common file that perf can read instead, but because the *kernels* themselves reuse rbp throughout oneDNN, it's totally useless.

Finally, can any JVM folks explain this claim about DWARF info from the article:

> Doesn't exist for JIT'd runtimes like the Java JVM

that just sounds surprising to me. Is it off by default or literally not available? (Google searches have mostly pointed to people wanting to include the JNI/C side of a JVM stack, like https://github.com/async-profiler/async-profiler/issues/215).


Just as a general comment on this topic...

The fact that people complain about the performance of the mechanism that enables the system to be profiled, and so performance problems be identified, is beyond ironic. Surely the epitome of premature optimisation.


It's not moronic when the people profiling and the people having performance issues are different groups. That some developer benefits from having frame pointers in libc does not mean that all users of that software also need to have frame pointers enabled.


It goes both ways. I see people with ultra-bloated applications trying to add even more bloat and force it on the rest of us too. Like the guy saying that DWARF unwinding is impractical when you have 1Gbyte code cache.

I don't see the people writing hyperfast low-level code asking for this.

My experience with profiling is anyway that you don't need that fine-grained profiling. Its main use is finding stuff like "we spend 90% of our time reallocating a string over and over to add characters one at a time". After a few of those it's just "it's a little bit slow everywhere".


> After a few of those it's just "it's a little bit slow everywhere".

And after that, you need fine-grained profiling to find multiple 1% wins and apply them repeatedly.

(I do this for a living, basically)


Like disabling frame pointers


So what are these other techniques the 2004 migration from frame pointers assumed would work for stack walking? Why don't they work today? I get that x86-64 has a lot more registers, so there's minimal value in freeing up one more?


In 2004, the assumption made by the GCC developers was that you would be walking stacks very infrequently, in a debugger like GDB. Not sampling stacks 1000s of times a second for profiling.


I'm sure in ancient Mesopotamia there was somebody arguing that you could brew beer faster if you stopped measuring the hops so carefully, but then someone else saying yes, but if you don't measure the hops carefully then you don't know the efficiency of your overall beer-making process, so you can't isolate the bottlenecks.

The funny thing is I am not sure the world would actually work properly if we didn't have both of these kinds of people.


This doesn't detract from the content at all but the register counts are off; SI and DI count as GPRs on i686 bringing it to 6+BP (not 4+BP) meanwhile x86_64 has 14+BP (not 16+BP).


> [...] on i686 bringing it to 6+BP (not 4+BP) meanwhile x86_64 has 14+BP (not 16+BP).

That is, on i686 you have 7 GPRs without frame pointers, while on x86_64 you have 14 GPRs even with frame pointers.

Copying a comment of mine from an older related discussion (https://news.ycombinator.com/item?id=38632848):

"To emphasize this point: on 64-bit x86 with frame pointers, you have twice as many registers as on 32-bit x86 without frame pointers, and these registers are twice as wide. A 64-bit value (more common than you'd expect even when pointers are 32 bits) takes two registers on 32-bit x86, but only a single register on 64-bit x86."


Thanks!


As much as the return of frame pointers is a good thing, it's largely unnecessary -- it arrives at a point where multiple eBPF-based profilers are available that do fine using .eh_frame and also manually unwind high-level-language runtime stacks: both Parca from Polar Signals and the artist formerly known as Prodfiler (now Elastic Universal Profiling).

So this is a solution for a problem, and it arrives just at the moment that people have solved the problem more generically ;)

(Prodfiler coauthor here, we had solved all of this by the time we launched in Summer 2021)


First of all, I think the .eh_frame unwinding y'all pioneered is great.

But I think you're only thinking about CPU profiling at <= 100 Hz / core. However, Brendan's article is also talking about Off-CPU profiling, and as far as I can tell, all known techniques (scheduler tracing, wall clock sampling) require stack unwinding to occur 1-3 orders of magnitude more often than for CPU profiling.

For those use cases, I don't think .eh_frame unwinding will be good enough, at least not for continuous profiling. E.g. see [1][2] for an example of how frame pointer unwinding allowed the Go runtime to lower execution tracing overhead from 10-20% to 1-2%, even though it was already using a relatively fast lookup-table approach.

[1] https://go.dev/blog/execution-traces-2024

[2] https://blog.felixge.de/reducing-gos-execution-tracer-overhe...


I'm under the impression that eh_frame stack traces are much slower than frame pointer stack traces, which makes always-on profiling, such as seen in tcmalloc, impractical.


PolarSignals is specifically discussed in the linked threads, and they conclude that their approach is not good enough for perf reasons.


Curious to hear more about this. Full disclosure: I designed and implemented .eh_frame unwinding when I worked at Polar Signals.


I think I might have confused two unrelated posts. The one that references Polar Signals is this one:

https://gitlab.com/freedesktop-sdk/freedesktop-sdk/-/issues/...

So not a perf issue there, but they don't think the workflow is suitable for whole-system profiling. Perf issues were in the context of `perf` using DWARF:

https://gitlab.com/freedesktop-sdk/freedesktop-sdk/-/issues/...


Oh nice, I can't find that - can you post a link?


Also I've heard that the whole .eh_frame unwinding is more fragile than a simple frame pointer. I've seen enough broken stack traces myself, but honestly I never tried whether -fno-omit-frame-pointer would have helped.


Yes and no. A simple frame pointer needs to be present in all libraries, and depending on build settings, this might not be the case. .eh_frame tends to be emitted almost everywhere...

So it's both similarly fragile, but one is almost never disabled.

The broader point is: For HLL runtimes you need to be able to switch between native and interpreted unwinds anyhow, so you'll always do some amount of lifting in eBPF land.

And yes, having frame pointers removes a lot of complexity, so it's a net very good thing. It's just that the situation wasn't nearly as dire as described, because people who care about profiling had built solutions.


Forget eBPF even -- why do the job of userspace in the kernel? Instead of unwinding via eBPF, we should ask userspace to unwind itself using a synchronous signal delivered to userspace whenever we've requested a stack sample.


Context switches are incredibly expensive. Given the sampling rate of eBPF profilers all the useful information would get lost in the context switch noise.

Things get even more complicated because context switches can mean CPU migrations, making much of your data useless.


What makes you think doing unwinding in userspace would do any more context switches (by which I think you mean privilege level transitions) than we do today? See my other comment on the subject.

> Things get even more complicated because context switches can mean CPU migrations, making many of your data useless.

No it doesn't. If a user space thread is blocked doing kernel work, its stack isn't going to change, not even if that thread ends up resuming on a different CPU.


You mean we don't need accessible profiling in free software because there are companies selling it to us. Cool.


Parca is open-source, Prodfiler's eBPF code is GPL, and the rest of Prodfiler is currently going through OTel donation, so my point is: There's now multiple FOSS implementations of a more generic and powerful technique.


Parca's user-space code is apache2 and the eBPF code is GPL.


If you're sufficiently in control of your deployment details to ensure that BPF is available at all. CAP_SYS_PTRACE is available ~everywhere for everyone.


I thought we'd been using /Oy (Frame-Pointer Omission) for years on Windows and that there was a pdata section on x64 that was used for stack-walking however to my great surprise I just read on MSDN that "In x64 compilers, /Oy and /Oy- are not available."

Does this mean Microsoft decided they weren't going to support breaking profilers and debuggers OR is there some magic in the pdata section that makes it work even if you omit the frame-pointer?


Some Googling found this: https://devblogs.microsoft.com/oldnewthing/20130906-00/?p=33...

“Recovering a broken stack on x64 machines on Windows is trickier because the x64 uses unwind codes for stack walking rather than a frame pointer chain.”

More details are here: https://learn.microsoft.com/en-us/cpp/build/exception-handli...


> In x64 compilers

The default is omission. If you have a Windows machine, in all likelihood almost no 64 bit code running on it has frame pointers.

> OR is there some magic in the pdata section that makes it work even if you omit the frame-pointer

You haven't ever needed frame pointers to unwind using ... unwind information. The same thing exists for linux as `.eh_frame` section.


Microsoft has had excellent universal unwinding support for decades now. I'm disappointed to see someone as prominent as this article's author present as infeasible what Microsoft has had working for so long.


exactly! starting with C structured exceptions


MS had unwinding support done right for a LONG time, 32 and 64, starting with structured C exceptions


All of this information is static, there's no need to sacrifice a whole CPU register only to store data that's already known. A simple lookup data structure that maps an instruction address range to the stack offset of the return address should be enough to recover the stack layout. On Windows, you'd precompute that from PDB files, I'm sure you can do the same thing with whatever the equivalent debug data structure is on Linux.


It isn't entirely static because of alloca().


Banning alloca() would make more sense than re-enabling frame pointers.


At least re-enabling FPs is possible.


[deleted]


Guess I'll add it back in to the DMD code generator!


Are his books (the one about Systems Performance and eBPF) relevant for normal software engineers who want to improve performance in normal services? I don’t work for faang, and our usual performance issues are solved by adding indexes here and there, caching, and simple code analysis. Tools like Datadog help a lot already.


Profiling is a pretty basic technique that is applicable to all software engineering. I'm not sure what a "normal" service is here, but I think we all have an obligation to understand what's happening in the systems we own.

Some people may believe that 100ms latency is acceptable for a CLI tool, but what if it could be 3ms? On some aesthetic level, it also feels good to be able to eliminate excess. Finally, you should learn it because you won't necessarily have that job forever.


Diving into flame graphs being worthwhile for optimization assumes that your workload is CPU-bound. Most business software does not have such workloads, and rather (as you yourself have noted) spends most of its time waiting for I/O (database, network, filesystem, etc).

And so (as you again have noted), your best bet is to just use plain old logging and tracing (like what Datadog provides) to find out where the waiting is happening.


Nix (and I assume Guix) is very convenient for this, as it is fairly easy to turn frame pointers on or off for parts or the whole of the system.


I am not sure, but I believe -fomit-frame-pointer in x86-64 allows the compiler to use a _thirteenth_ register, not a _seventeenth_ .



Brendan mentions DWARF unwinding, actually, and briefly mentions why he considers it insufficient.


The biggest objection seems to be the Java/JIT case. eh_frame supports a "personality function" which is AIUI basically a callback for performing custom unwinding. If the personality function could also support custom logic for producing backtraces, then the profiling sampler could effectively read the JVM's own metadata about the JIT'ted code, which I assume it must have in order to produce backtraces for the JVM itself.


This also seems like a big objection:

> The overhead to walk DWARF is also too high, as it was designed for non-realtime use.


Not a problem in practice. The way you solve it is to just translate DWARF into a simpler representation that doesn't require you to walk anything. (But I understand why people don't want to do it. DWARF is insanely complex and annoying to deal with.)

Source: I wrote multiple profilers.


For a busy 64-CPU production JVM, I tested Google's Java symbol logging agent that just logged timestamp, symbol, address, size. The c2 compiler was so busy, constantly, that the overhead of this was too high to be practical (beyond startup analysis). And all this was generating was a timestamp log to do symbol lookup. For DWARF to walk stacks there's a lot more steps, so while I could see it work for light workloads I doubt it's practical for the heavy production workloads I typically analyze. What do you think? Have you tested on a large production server where c2 is a measurable portion of CPU constantly, the code cache is >1Gbyte and under heavy load?


I regularly profile heavy time-sensitive (as in: if the code takes too long to run it breaks) workloads, and I even do non-sampling memory profiling (meaning: on every memory allocation and deallocation I grab a full backtrace, which is orders of magnitude more data than normal sampling profiling) and it works just fine with minimal slowdown even though I get the unwinding info from vanilla DWARF.

Granted, this is using optimized tooling which uses a bunch of tricks to side-step the problem of DWARF being slow, I only profile native code (and some VMs which do ahead-of-time codegen) and I've never worked with JVM, but in principle I don't see why it wouldn't be practical on JVM too, although it certainly would be harder and might require better tooling (which might not exist currently). If you have the luxury of enabling frame pointers then that certainly would be easier and simpler.

(Somewhat related, but I really wish we would standardize on something better than DWARF for unwinding tables and basic debug info. Having done a lot of work with DWARF and its complexity I wouldn't wish it upon my worst enemy.)


In this thread[1] we're discussing problems with using DWARF directly for unwinding, not possible translations of the metadata into other formats (like ORC or whatever).

[1]: https://news.ycombinator.com/item?id=39732010


I wasn't talking about other formats. I was talking about preloading the information contained in DWARF into a more efficient in-memory representation once when your profiler starts, and then the problem of "the overhead is too high for realtime use" disappears.
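
As a rough illustration, the preprocessed form can be as simple as an array of rows sorted by program counter and binary-searched at sample time (field names invented here; real implementations pack this much more tightly and handle many more cases):

    #include <stdint.h>

    /* One row per pc range that shares the same unwind rule, produced by
       flattening .eh_frame / .debug_frame ahead of time. At sample time:
       binary-search by pc, compute CFA = reg[cfa_reg] + cfa_offset, read
       the saved return address at CFA + ra_offset, repeat for the caller. */
    struct unwind_row {
        uint64_t start_pc;    /* first pc this rule covers                 */
        int32_t  cfa_offset;  /* offset added to cfa_reg to get the CFA    */
        uint8_t  cfa_reg;     /* register the CFA is based on (rsp or rbp) */
        int16_t  ra_offset;   /* saved return address, relative to the CFA */
        int16_t  fp_offset;   /* saved frame pointer, relative to the CFA  */
    };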


From https://fzn.fr/projects/frdwarf/frdwarf-oopsla19.pdf

    DWARF-based unwinding can be a bottleneck for time-sensitive program analysis tools. For instance the perf profiler is forced to copy the whole stack on taking each sample and to build the backtraces offline: this solution has a memory and time overhead but also serious confidentiality and security flaws.
So if I get this correctly, the problem with DWARF is that building the backtrace online (on each sample) in comparison to frame pointers is an expensive operation which, however, can be mitigated by building the backtrace offline at the expense of copying the stack.

However, paper also mentions

    Similarly, the Linux kernel by default relies on a frame pointer to provide reliable backtraces. This incurs in a space and time overhead; for instance it has been reported (https://lwn.net/Articles/727553/) that the kernel’s .text size increases by about 3.2%, resulting in a broad kernel-wide slowdown.
and

    Measurements have shown a slowdown of 5-10% for some workloads (https://lore.kernel.org/lkml/20170602104048.jkkzssljsompjdwy@suse.de/T/#u).


But that one has at least some potential mitigation. Per his analysis, the Java/JIT case is the only one that has no mitigation:

> Javier Honduvilla Coto (Polar Signals) did some interesting work using an eBPF walker to reduce the overhead, but...Java.



TBH this sounds more like perf's implementation is bad.

I'm waiting for this to happen: https://github.com/open-telemetry/community/issues/1918


There's always room for improvement, for example, Samply [0] is a wonderful profiler that uses the same APIs that `perf` uses, but unwinds the stacks as they come rather than dumping them all to disk and then having to process them in bulk.

Samply unwinds significantly faster than `perf` because it caches unwind information.

That being said, this approach still has some limitations, such as that very deep stacks won't be unwound, as the size of the process stack the kernel sends is quite limited.

- [0]: https://github.com/mstange/samply


I remember talking to Brendan about the PreserveFramePointer patch during my first months at Netflix in 2015. As of JDK 21, unfortunately it is no longer a general purpose solution for the JVM, because it prevents a fast path being taken for stack thawing for virtual threads: https://github.com/openjdk/jdk/blob/d32ce65781c1d7815a69ceac...


I started programming in 1979, and I can't believe I've managed to avoid learning about stack frames and all those EBP register tricks until now. I always had parameters to functions in registers, not on the stack, for the most part. The compiler hid a lot of things from me.

Is it because I avoided Linux and C most of my life? Perhaps it's because I used debug, and Periscope before that... and never gdb?


so what is the downside to using e.g. dwarf-based stack walking (supported by perf) for libc, which was the original stated problem?

in the discussion the issue gets conflated with jit-ted languages, but that has nothing to do with the crusade to enable frame pointer for system libraries.

and if you care that much for dwarf overhead... just cache the unwind information in your system-level profiler? no need to rebuild everything.


The way perf does it is slow, as the entire stack is copied into user-space and is then asynchronously unwound.

This is solvable, as Brendan calls out. We’ve created an eBPF-based profiler at Polar Signals that essentially does what you said: it optimizes the unwind tables, caches them in BPF maps, and then synchronously unwinds, as opposed to copying the whole stack into user-space.


It should also be said that you need some sort of DWARF-like information to understand inlining. If I have a function A that inlines B that in turn inlines C, I'd often like to understand that C takes a bunch of time, and with frame pointers only, that information gets lost.


Inlined functions can be symbolized using DWARF line information[0] while unwinding requires DWARF unwind information (CFI), which the x86_64 ABI mandates in every single ELF in the `.eh_frame` section

- [0] This line information might or might not be present in an executable but luckily there's debuginfod (https://sourceware.org/elfutils/Debuginfod.html)


This conveniently sidesteps the whole issue of getting DWARF data in the first place, which is also still a broken disjointed mess on Linux. Hell, Windows solved this many many years ago.


You'd need a pretty special distro to have enabled -fno-asynchronous-unwind-tables by default in its toolchain.

By default on most Linux distros the frame tables are built into all the binaries, and end up in the GNU_EH_FRAME segment, which is always available in any running process. Doesn't sound a broken and disjointed mess to me. Sounds more like a smoothly running solved problem.



The article explains why DWARF is not an option.


Extremely light on the details, and also conflates it with the JIT which makes it harder to understand the point, so I was wondering about the same thing as well.


Have compilers (or I guess x86?) gotten better at dealing with the frame pointer? Or are we just saying that taking a significant perf hit is acceptable if it lets you find other tiny perf problems? Because I recall -fomit-frame-pointer being a significant performance win, bigger than most of the things that you need a perfect profiler to spot.


Brendan is such a treasure to the community (buy his book it’s great).

I wasn’t doing extreme performance stuff when -fomit-frame-pointer became the norm, so maybe it was a big win for enough people to be a sane default, but even that seems dubious: “just works” profiling is how you figure out when you’re in an extreme performance scenario (if you’re an SG14 WG type, you know it and are used to all the defaults being wrong for you).

I’m deeply grateful for all the legends who have worked on libunwind, gperf stuff, perftool, DTrace, eBPF: these are the too-often-unsung heroes of software that is still fast after decades of Moore’s law free-riding.

But they’ve been fighting an uphill battle against a weird alliance of people trying to game compiler benchmarks and the really irresponsible posture that “developer time is more expensive”, which is only sometimes true and never true if you care about people on low-spec gear, the part of the global user community that is already the least resourced.

I’m fortunate enough to have a fairly modern desktop, laptop, and phone: for me it’s merely annoying that chat applications and music players and windowing systems offer nothing new except enshittification in terms of features while needing 10-100x the resources they did a decade ago.

But for half of my career and 2/3rds of my time coding, I was on low-spec gear most of the time, and I would have been largely excluded if people didn’t care a lot about old computers back then.

I’m trying to help a couple of aspiring hackers get started right now, and it’s a real struggle to get their environments set up with limitations like Intel Macs and WSL2 as the Linux option (WSL2 is very cool but it’s not loved up enough by e.g. yarn projects).

If you want new hackers, you need to make things work well on older computers.

Thanks again Brendan et al!


I disagree with this sentence of the article:

"I could say that times have changed and now the original 2004 reasons for omitting frame pointers are no longer valid in 2024."

The original 2004 reason for omitting frame pointers is still valid in 2024: it's still a big performance win on the register-starved 32-bit x86 architecture. What has changed is that the 32-bit x86 architecture is much less relevant nowadays (other than legacy software, for most people it's only used for a small instant while starting up the firmware), and other common 32-bit architectures (like embedded 32-bit ARM) are not as register-starved as the 32-bit x86.


That's exactly what they were saying. You're not disagreeing at all.


GCC optimization causes the frame pointer push to move around, resulting in wrong call stacks. "Wontfix"

https://news.ycombinator.com/item?id=38896343


That was in 2012. Does it still occur on modern GCC?

There definitely have been regressions with frame pointers being enabled, although we've fixed all the ones we've found in current (2024) Fedora.


I think so, and I vaguely seem to recall -fno-schedule-insns2 being the only thing that fixes it. To get the full power of frame pointers and a hackable binary, what I use is:

    -fno-schedule-insns2
    -fno-omit-frame-pointer
    -fno-optimize-sibling-calls
    -mno-omit-leaf-frame-pointer
    -fpatchable-function-entry=18,16
    -fno-inline-functions-called-once
The only flag that's potentially problematic is -fno-optimize-sibling-calls since it breaks the optimal approach to writing interpreters and slows down code that's written in a more mathematical style.


Pretty sure Thumb code generated by GCC is still non-unwindable via FPs. That's been a pain.


All this talk about frame pointers enabled on various Linux distros and we still haven’t talked about the biggest one, which would be Android. Have they decided to compile with frame pointers yet? The last time I looked in to this many years ago, some perf folks on the Android team dismissed it as a perf regression (and also ART was already using the FP register for its own purposes).


That's really interesting. I disabled frame pointer omission in my project because I read that it hurts debugging and code introspection and provided only minimal performance benefits. I had no idea it had caused so much pain over decades.


To this day I still believe that there should be a dedicated protected separate stack region for the call stack that only the CPU can write to/read from. Walking the stack then becomes trivially fast because you just need to do a very small memcpy. And stack memory overflows can never overwrite the return address.


This is a thing; it's called shadow call stack. Both ARM and now Intel have extensions for it.


But the shadow stack concept seems much dumber to me. Why write the address to the regular stack and the shadow stack and then compare? Why not use only the shadow stack and not put return addresses on the main stack at all?


Because that would break compatibility. The way shadow stacks are implemented means they can be enabled in existing software without code changes.

If one were to design a modern ISA from scratch it would make sense though.


ABI compatibility.


When you're compiling from scratch, especially when you're not working on a shared library, ABI compatibility doesn't matter as much. Doesn't explain why there's no -fshadow-stack-only option to pass in.


> Doesn't explain why there's no -fshadow-stack-only option to pass in.

I thought you were asking about the design of the hardware: it's designed that way because compatibility means that the vast majority of people want something backwards compatible.

Very little is fully "compiled from scratch" when you consider for example that libc is in the set of things that must be recompiled.


It said GCC. I noted that LLVM's default has apparently been to keep the frame pointer since 2011. Is this mainly a GCC issue?


It doesn't really matter what the default of the compiler is, but what distros choose.


You don't need frame pointers; all the relevant info is stored in DWARF debug data.


I'm surprised that most of the comments are concerned with telemetry and observability per se rather than the potential of frame pointers for programmers at large. They can provide beneficial insight into hard-to-debug programming constructs, for example delegates (which C# and D have but Rust does not) and CTFE (which D and Zig have but Rust does not). In the near future I can foresee seamless frame-pointer integration with programming-language constructs easing the debugging of these extremely useful higher-order function constructs like delegates. It also has the potential to enable the futuristic programming paradigm proposed in Bret Victor's presentation [2].

[1] D Language Reference: Function Pointers, Delegates and Closures:

https://dlang.org/spec/function.html#function-pointers-deleg...

[2] Bret Victor The Future of Programming:

https://youtu.be/8pTEmbeENF4


-fomit-frame-pointer had serious performance improvements back in the day. I would always compile my own MySQL and PHP and save money on hardware.

But those were the days of 32-bit processors with few registers.

Times change.


Can they not be disabled on a per-function basis?


glibc is only 2 MB; why does Chrome rely on the system glibc instead of statically linking its own version with frame pointers enabled?


https://stackoverflow.com/questions/57476533/why-is-statical...

I guess it's a similar situation with msvcrt.


At the very least Chrome needs to link to the system libGL.so and friends for gpu acceleration, libva.so for video acceleration, and so on. And these are linked against glibc of course.


having/omitting frame pointers doesn't change the ABI; it will work if you compile against glibc-nofp and link against glibc-withfp


Not interesting. ENTER/LEAVE also do the same thing as your save/restore of rbp.

Far more interesting: I recall there might be an instruction where rbp isn't allowed.


[flagged]


I’m not sure what use case you’re coming from but it sounds like you’re saying something like: most end users don’t use a profiler or debugger so why should they pay the cost of debuggability? That’s fine I guess if you’re throwing software over a wall to users and taking no responsibility for their experience. But a lot of people who build software do take some responsibility for bugs and performance problems that their users experience. This stuff is invaluable for them. End users benefit (tremendously) from software being debuggable even if the users themselves never run a profiler or debugger that uses the frame pointers (because the developers are able to find and fix problems reported by other users or the developers themselves).


[flagged]


While I somewhat support the idea of avoiding debug-mode applications shipped to end customers, it seems in this particular case Brendan argues for the need to debug binaries at runtime on the server side. If there is a binary component running in production, the choice of running it with debug enabled (and/or with the ability to activate debug at runtime) is purely a choice of the system owners (who nowadays are both the developers and the operators of components).

“Observability” primarily refers to the ability to view system state beyond what blackbox monitoring does. Again, the term primarily refers to server-side operations and not to software shipped to end users.

As much as spying on users is ugly, it’s not related to debugging server side.


> Let's make software more inefficient (even if it's tiny, it all adds up!)

I'm not sure if you know who the author of that blog is, but if there's anyone in the world who cares about (and is knowledgeable about) improving performance, it's him. You can be pretty darn sure he wouldn't do this if he believed it would make software more inefficient.


[flagged]


You're confusing things with the analogy. The invasiveness of telemetry is collateral damage, not failure to meet its primary objective (gathering useful data for debugging, spying on people, whatever you think it is). In this case his primary objective literally is to improve performance... which aligns with your own goal, and which he has successfully demonstrated in the past.


Your argument is poor, because it relies on a bootless analogy to something unrelated.


It's not invasive.


We have an embarrassment of riches in terms of compute power, slowing down everything by a negligible amount is worth it if it makes profiling and observability even 20% easier. In almost all cases you cannot just walk up to a production Linux system and do meaningful performance analysis without serious work, unlike say a Solaris production system from 15 years ago.


We have an embarrassment of riches in terms of compute power, yet all software is still incredibly slow, because we end up slowing down everything by a "negligible" amount here and another "negligible" amount there and so on.

> In almost all cases you cannot just walk up to a production Linux system and do meaningful performance analysis without serious work

So? That's one very very very specific use case that others should not have to pay for. Not even with a relatively small 1% perf hit.


When computers are slow, the primary way out is finding out WHY they are slow.

Finding this out requires... meaningful performance analysis. That's right, this 1% perf hit will make it extremely easy to find the reasons for 5,10,20,50% slowdowns, and enable developers (and you!) to fix them.

By making it easier to profile software, YOU can notice performance issues in the software you're running and deliver a nice patch that will improve performance by more than was lost, making this 1% slowdown a very effective, high-interest investment.


Nice sales pitch, but in reality no one is going to spend time optimizing bottlenecks that aren't reproducible. And if you can reproduce them, you can do that on systems compiled for profiling to find the cause, which isn't even the hard part anyway.


The issue is we can't see what running software is doing: frame pointers are omitted, symbols are stripped, etc., in the name of performance, so we can't tell what the software is doing without changing it. And in many cases changing it will "fix" the symptom before we can determine what the issue actually was.


> Let's make software more inefficient

Isn't the whole point of enabling frame pointers everywhere to make it easier to profile software so that it can be made more efficient?


You seem to be stuck in the '90s. Computing is 64-bit nowadays, not 32-bit, and modern architectures/ABIs integrate frame pointers and frame records in a way that's both natural and performant.


https://news.ycombinator.com/newsguidelines.html

> Be kind. Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes.

> Please don't fulminate. Please don't sneer, including at the rest of the community.


As someone who worked on V8 for some years, I can assure that Web bloat is not due to frame pointers.


No, but it is exactly the same attitude: so-called "tiny" performance hits for developer convenience.



