As a scientist, I'm interested in how computer programmers work with data. \* They ...

dgacmu · on Dec 30, 2023

I'm glad you noted programmers; as a computer science researcher, my reaction was the same as yours. I don't think I ever used a CDF for data analysis until grad school (even with having had stats as a dual bio/cs undergrad).

novok · on Dec 30, 2023

It's because that's usually the data scientist's job, and most eng infra teams don't have a data scientist and don't really need one most of the time.

Most of the time they deal with data the way their tools present it, which corresponds closely to what most analytics, perf analysis, and observability software suites show.

Expecting the average software eng to know what a CDF is is the same as expecting them to know 3D graphics basics like quaternions and writing shaders.

mpoteat · on Dec 30, 2023

A standard CS program will cover statistics (incl. calculus-based stats e.g. MLEs), and graphics is a very common and popular elective (e.g. covering OpenGL). I learned all of this stuff (sans shaders) in undergrad, and I went to a shitty state college. So from my perspective an entry level programmer should at least have a passing familiarity with these topics.

Does your experience truly say that the average SWE is so ignorant? If so, why do you think that is?

oakejp12 · on Dec 31, 2023

> A standard CS program will cover statistics

> graphics is a very common and popular elective

I find these statements to be extremely doubtful. Why would a CS program cover statistics? Wouldn't that be the math department? If there are any required courses, it's most likely Calc 1/2, Linear Algebra, and Discrete Math.

Also, out of the hundreds of programmers I've met, I don't know any who have done graphics programming. I consider that super niche.

dehugger · on Jan 1, 2024

Thankfully all programmers have a CS degree, as there are absolutely no career paths into the industry that could possibly bypass a four-year degree. What a relief, imagine the horror of working with plebeians that never took a course on statistics!

DeathArrow · on Dec 30, 2023

> Expecting the average software eng to know what a CDF is is the same as expecting them to know 3D graphics basics like quaternions and writing shaders.

I did write shaders and use quaternions back in the day. I also worked on microcontrollers, did some system programming, and developed mobile and desktop apps. Now I am working on a rather large microservice-based app.

perfmode · on Dec 30, 2023

you’re a unicorn

azalemeth · on Dec 30, 2023

> ChatGPT punched out a reasonably sensible t test!

I think the distribution is decidedly non-normal here and the difference in the medians may well have also been of substantial interest -- I'd go for a Wilcoxon test here to first order... Or even some type of quantile regression. Honestly the famous Jonckheere–Terpstra test for ordered medians would be _perfect_ for this bit of pseudoanalysis -- have the hypothesis that M3 > M2 > M1 and you're good to go, right?!

(Disclaimers apply!)
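
To make the rank-based idea concrete, here is a minimal sketch of the Mann-Whitney U statistic, which is the quantity behind the Wilcoxon rank-sum test. The function name and the build times are made up for illustration and are not from the article; a real analysis would use a statistics library that also reports a p-value.

```python
def mann_whitney_u(xs, ys):
    """Count pairs (x, y) with x > y; a tie counts as half a pair."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Hypothetical build times in seconds (assumed values, one outlier on purpose).
slow_machine = [120, 135, 150, 400]
fast_machine = [90, 100, 110, 130]

u = mann_whitney_u(slow_machine, fast_machine)
# u divided by the number of pairs estimates
# P(a random slow-machine build takes longer than a random fast-machine build).
prob_slower = u / (len(slow_machine) * len(fast_machine))
print(u, prob_slower)
```

Because only the ordering of values matters, the 400-second outlier counts no more than any other "slower" build, which is why a rank test is less sensitive to a long tail than a comparison of means.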

whimsicalism · on Dec 30, 2023

12,000 builds? Sure, maybe the build time distribution is non-normal, but the sample statistic is probably approximately normal with that many builds.
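
A quick simulation can illustrate this. It is only a sketch under assumed numbers: an exponential distribution stands in for skewed build times, and the sample sizes (200 observations per mean, 2,000 repetitions) are arbitrary choices, not figures from the article.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from a skewed distribution (exponential with mean 1)."""
    return statistics.fmean([random.expovariate(1.0) for _ in range(n)])

# 2,000 repeated "experiments", each averaging 200 skewed observations.
means = [sample_mean(200) for _ in range(2000)]

center = statistics.fmean(means)   # expected to be near 1.0
spread = statistics.stdev(means)   # expected to be near 1 / sqrt(200), about 0.07

# For a normal distribution, roughly 68% of values lie within one
# standard deviation of the center.
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)
print(center, spread, within_one_sd)
```

The raw draws are heavily right-skewed, yet the sample means cluster symmetrically around 1.0, which is the central limit theorem at work.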

RayVR · on Dec 30, 2023

Many people misinterpret what is required for a t-test.

azalemeth · on Dec 30, 2023

I meant that the median is arguably the more relevant statistic, that is all -- I realise that the central limit theorem exists!

whimsicalism · on Jan 1, 2024

Fair enough - I see what you are saying. Sorry for my accidentally condescending reply.

Herring · on Dec 29, 2023

>They drew histograms, which are hard to compare.

Note that in some places they used boxplots, which offer clearer comparisons. It would have been more effective to present all the data using boxplots.

tmoertel · on Dec 29, 2023

> They drew histograms, which are hard to compare.

Like you, I'd suggest empirical CDF plots for comparisons like these. Each distribution results in a curve, and the curves can be plotted together on the same graph for easy comparison. As an example, see the final plot on this page:

https://ggplot2.tidyverse.org/reference/stat_ecdf.html
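
For anyone not using R, the quantity behind such a plot is simple to compute. Below is a minimal standard-library Python sketch; the `ecdf` helper and the build times are hypothetical, chosen only to show how two curves are compared at the same thresholds.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return a function F where F(t) is the fraction of sample values <= t."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda t: bisect_right(ordered, t) / n

# Hypothetical build times in seconds for two machines (assumed values).
machine_a = [100, 110, 120, 130, 300]
machine_b = [80, 90, 100, 110, 120]

F_a, F_b = ecdf(machine_a), ecdf(machine_b)
for t in (90, 110, 130):
    # At each threshold, the higher fraction belongs to the faster machine.
    print(t, F_a(t), F_b(t))
```

Plotting `F_a` and `F_b` on the same axes gives two step curves; here `F_b` is at or above `F_a` at every threshold, so machine B finishes a larger share of builds within any given time.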

mnming · on Dec 29, 2023

I think it's partly because audiences are often not familiar with those statistical details either.

Most people hate nuance when reading a data report.

fallous · on Dec 30, 2023

I think you might want to add the caveat "young computer programmers." Some of us grew up in a time when we had to learn basic statistics and visualization to understand profiling at the "bare metal" level and carried that on throughout our careers.

jxcl · on Dec 29, 2023

Yeah, I was looking at the histograms too, having trouble comparing them and thinking they were a strange choice for showing differences.

NavinF · on Dec 30, 2023

> cumulative distribution functions, where you can see if they overlap or one is shifted

Why would this be preferred over a PDF? I've rarely seen CDF plots after high school, so I would have to convert the CDF into a PDF inside my head to check if the two distributions overlap or are shifted. CDFs are not a native representation for most people.

unsung · on Dec 30, 2023

I can give a real example. At work we were testing pulse-shaping amplifiers for Geiger-Müller tubes. They take a pulse in, shape it to get a pulse with a height proportional to the charge collected, and output a histogram of the frequency of pulse heights, with each bin representing how many pulses have a given amount of charge.

Ideally, if all components are the same, there is no jitter, and you feed in a test signal from a generator with exactly the same area per pulse, you should see a histogram where every count is in a single bin.

In real life, components have tolerances, and readouts have jitter, so the counts spread out and you might see, with the same input, one device with, say, 100 counts in bin 60, while a comparably performing device might have 33 each in bins 58, 59, and 60.

This can be hard to compare visually as a PDF, but if you compare CDFs, you see S-curves with rising edges that only differ slightly in slope and position, making the test more intuitive.
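
Using the counts from the example above (100 in bin 60 versus 33 each in bins 58, 59, and 60), the conversion from histogram to CDF can be sketched in a few lines of Python. The bin range 56 to 62 and the helper name are assumptions made only to frame the example.

```python
from itertools import accumulate

def histogram_cdf(counts):
    """Turn per-bin counts into the cumulative fraction at each bin."""
    total = sum(counts)
    return [running / total for running in accumulate(counts)]

bins = list(range(56, 63))  # bins 56..62, an assumed window around the peak
device_a = [100 if b == 60 else 0 for b in bins]
device_b = [33 if b in (58, 59, 60) else 0 for b in bins]

cdf_a = histogram_cdf(device_a)
cdf_b = histogram_cdf(device_b)
for b, fa, fb in zip(bins, cdf_a, cdf_b):
    print(b, round(fa, 2), round(fb, 2))
```

As histograms, one device shows a single spike and the other three short bars. As CDFs, both are rising edges that reach 1.0 at bin 60, with the second device starting its rise two bins earlier.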

dash2 · on Dec 30, 2023

If one line is to the right of the other everywhere, then the distribution is bigger everywhere. (“First order stochastic dominance” if you want to sound fancy.) I agree that CDFs are hard to interpret, but that is partly due to unfamiliarity.
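
That "to the right everywhere" condition can be checked mechanically. Here is a minimal Python sketch with a hypothetical `dominates` helper and made-up samples; it tests the weak form of dominance, so a sample also dominates an identical copy of itself.

```python
from bisect import bisect_right

def dominates(a, b):
    """True if sample a first-order stochastically dominates sample b.

    Equivalently, the empirical CDF of a is never above that of b, so
    a's curve sits to the right of (or on) b's curve everywhere.
    """
    sa, sb = sorted(a), sorted(b)
    # Empirical CDFs only change at observed values, so it is enough
    # to compare the two curves at those points.
    return all(
        bisect_right(sa, t) / len(sa) <= bisect_right(sb, t) / len(sb)
        for t in set(sa) | set(sb)
    )

# Hypothetical build times in seconds (assumed values).
slow = [100, 110, 120, 130, 300]
fast = [80, 90, 100, 110, 120]

print(dominates(slow, fast))        # slow's curve is to the right everywhere
print(dominates([1, 10], [5, 6]))   # curves cross, so no dominance
print(dominates([5, 6], [1, 10]))   # and none in the other direction either
```

When the curves cross, as in the last two calls, neither distribution is "bigger everywhere" and a single summary such as the median or mean has to carry the comparison instead.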