Hacker News

As a scientist, I'm interested in how computer programmers work with data.

* They drew beautiful graphs!

* They used ChatGPT to automate their analysis super-fast!

* ChatGPT punched out a reasonably sensible t test!

But:

* They had variation across memory and chip type, but they never thought of using a linear regression.

* They drew histograms, which are hard to compare. They could have supplemented them with simple means and error bars. (Or used cumulative distribution functions, where you can see if they overlap or one is shifted.)
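
A rough sketch of the "simple means and error bars" idea, using only the Python standard library. The build times, the chip labels as dictionary keys, and the helper name `mean_with_error` are all made up for illustration; none of it comes from the original analysis.

```python
import math
import statistics

def mean_with_error(samples):
    """Return (mean, standard error of the mean) for a list of numbers."""
    n = len(samples)
    mean = statistics.fmean(samples)
    # Standard error = sample standard deviation / sqrt(n).
    sem = statistics.stdev(samples) / math.sqrt(n)
    return mean, sem

# Hypothetical build times in seconds, grouped by chip; not real data.
build_times = {
    "M1": [310.0, 295.0, 330.0, 305.0, 320.0],
    "M2": [250.0, 240.0, 265.0, 255.0, 245.0],
}

summary = {chip: mean_with_error(times) for chip, times in build_times.items()}
for chip, (mean, sem) in summary.items():
    # 1.96 * SEM is only a rough 95% half-width (normal approximation).
    print(f"{chip}: {mean:.1f} s +/- {1.96 * sem:.1f} s")
```

If the two intervals do not overlap, the groups are easy to tell apart at a glance, which is harder to judge from two separate histograms.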




I'm glad you noted programmers; as a computer science researcher, my reaction was the same as yours. I don't think I ever used a CDF for data analysis until grad school (even with having had stats as a dual bio/cs undergrad).


It's because that's usually the data scientist's job, and most eng infra teams don't have a data scientist and don't really need one most of the time.

Most of the time they deal with data the way their tools present it, which closely matches what most analytics, perf analysis, and observability software suites show.

Expecting the average software eng to know what a CDF is is the same as expecting them to know 3D graphics basics like quaternions and writing shaders.


A standard CS program will cover statistics (incl. calculus-based stats e.g. MLEs), and graphics is a very common and popular elective (e.g. covering OpenGL). I learned all of this stuff (sans shaders) in undergrad, and I went to a shitty state college. So from my perspective an entry level programmer should at least have a passing familiarity with these topics.

Does your experience truly say that the average SWE is so ignorant? If so, why do you think that is?


> A standard CS program will cover statistics

> graphics is a very common and popular elective

I find these statements to be extremely doubtful. Why would a CS program cover statistics? Wouldn't that be the math department? If there are any required courses, they're most likely Calc 1/2, Linear Algebra, and Discrete Math.

Also, out of the hundreds of programmers I've met, I don't know any who have done graphics programming. I consider that super niche.


Thankfully all programmers have a CS degree, as there are absolutely no career paths into the industry that could possibly bypass a four-year degree. What a relief, imagine the horror of working with plebeians who never took a course on statistics!


> Expecting the average software eng to know what a CDF is is the same as expecting them to know 3D graphics basics like quaternions and writing shaders.

I did write shaders and used quaternions back in the day. I also worked on microcontrollers, did some system programming, developed mobile and desktop apps. Now I am working on a rather large microservice based app.


You’re a unicorn.


> ChatGPT punched out a reasonably sensible t test!

I think the distribution is decidedly non-normal here, and the difference in the medians may well have also been of substantial interest -- I'd go for a Wilcoxon test here to first order... Or even some type of quantile regression. Honestly the famous Jonckheere–Terpstra test for ordered medians would be _perfect_ for this bit of pseudoanalysis -- have the hypothesis that M3 > M2 > M1 and you're good to go, right?!

(Disclaimers apply!)
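
For anyone who hasn't met it: the Wilcoxon rank-sum test (also known as the Mann-Whitney U test) is built on a very simple statistic. Here is a minimal sketch that computes only that statistic, not a p-value. The sample values and the function name `mann_whitney_u` are invented for illustration; for a real analysis a library routine such as `scipy.stats.mannwhitneyu` is the sensible choice.

```python
def mann_whitney_u(a, b):
    """U statistic for sample `a`: number of pairs (x, y) with x > y.

    Ties count as half a pair. This is the brute-force O(len(a) * len(b))
    definition, not an optimized rank-based implementation.
    """
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Hypothetical build times in seconds; not real data.
m1 = [310.0, 295.0, 330.0, 305.0, 320.0]
m2 = [250.0, 240.0, 265.0, 255.0, 245.0]

u = mann_whitney_u(m1, m2)
# U / (n_a * n_b) estimates P(X > Y), with ties counted as half.
prob_m1_slower = u / (len(m1) * len(m2))
print(u, prob_m1_slower)
```

The statistic depends only on the ordering of the values, which is why the test does not need the build times to be normally distributed.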


12,000 builds? Sure, maybe the build time distribution is non-normal, but the sample statistic is probably approximately normal with that many builds.
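
A quick simulation sketch of that point. Everything here is an assumption for illustration: the lognormal shape stands in for "skewed build times", the sample sizes are kept far below 12,000 so it runs quickly, and the helper names are made up.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

def draw(n):
    """n draws from a right-skewed (lognormal) distribution."""
    return [rng.lognormvariate(0.0, 1.0) for _ in range(n)]

def skewness(xs):
    """Population skewness: mean of cubed z-scores (0 for a symmetric shape)."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return statistics.fmean([((x - m) / sd) ** 3 for x in xs])

# Skewness of the raw values versus skewness of means of 400 values each.
raw = draw(2000)
means = [statistics.fmean(draw(400)) for _ in range(1000)]

raw_skew = skewness(raw)
means_skew = skewness(means)
print(f"raw skewness: {raw_skew:.2f}, skewness of sample means: {means_skew:.2f}")
```

The raw values are strongly skewed, while the distribution of the sample means is much closer to symmetric, which is what the t-test relies on.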


Many people misinterpret what is required for a t-test.


I meant that the median is likely arguably the more relevant statistic, that is all -- I realise that the central limit theorem exists!


Fair enough - I see what you are saying. Sorry for my accidentally condescending reply.


>They drew histograms, which are hard to compare.

Note that in some places they used boxplots, which offer clearer comparisons. It would have been more effective to present all the data using boxplots.


> They drew histograms, which are hard to compare.

Like you, I'd suggest empirical CDF plots for comparisons like these. Each distribution results in a curve, and the curves can be plotted together on the same graph for easy comparison. As an example, see the final plot on this page:

https://ggplot2.tidyverse.org/reference/stat_ecdf.html
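
If you are not in R, an empirical CDF is only a few lines. This is a minimal sketch in plain Python; the build times and the helper name `ecdf` are hypothetical.

```python
from bisect import bisect_right

def ecdf(samples):
    """Return a function F where F(x) is the fraction of samples <= x."""
    ordered = sorted(samples)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / n

    return F

# Hypothetical build times in seconds; not real data.
F_m1 = ecdf([310.0, 295.0, 330.0, 305.0, 320.0])
F_m2 = ecdf([250.0, 240.0, 265.0, 255.0, 245.0])

# Evaluate both curves on a shared grid so they can be drawn on one plot.
for t in (240, 260, 300, 320, 340):
    print(t, F_m1(t), F_m2(t))
```

Plotting both columns against `t` on the same axes gives the kind of overlaid curves shown on that page.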


I think it's partly because the audiences are often not familiar with those statistics details either.

Most people hate nuance when reading a data report.


I think you might want to add the caveat "young computer programmers." Some of us grew up in a time where we had to learn basic statistics and visualization to understand profiling at the "bare metal" level and carried that on throughout our careers.


Yeah, I was looking at the histograms too, having trouble comparing them and thinking they were a strange choice for showing differences.


> cumulative distribution functions, where you can see if they overlap or one is shifted

Why would this be preferred over a PDF? I've rarely seen CDF plots after high school, so I would have to convert the CDF into a PDF inside my head to check if the two distributions overlap or are shifted. CDFs are not a native representation for most people.


I can give a real example. At work we were testing pulse shaping amplifiers for Geiger-Müller tubes. They take a pulse in, shape it to get a pulse with a height proportional to the charge collected, and output a histogram of the frequency of pulse heights, with each bin representing how many pulses have a given amount of charge.

Ideally, if all components are the same, there is no jitter, and you feed in a test signal from a generator with exactly the same area per pulse, you should see a histogram where every count is in a single bin.

In real life, components have tolerances, and readouts have jitter, so the counts spread out and you might see, with the same input, one device with, say, 100 counts in bin 60, while a comparably performing device might have 33 each in bins 58, 59, and 60.

This can be hard to compare visually as a PDF, but if you compare CDFs, you see S-curves with rising edges that only differ slightly in slope and position, making the test more intuitive.
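
To make that concrete, here is a small sketch that turns the two example histograms above into cumulative curves. The counts are the ones from the example; the surrounding bins 57 and 61 and the helper names are added for illustration.

```python
from itertools import accumulate

def histogram_cdf(counts):
    """Turn per-bin counts into a cumulative fraction per bin."""
    total = sum(counts)
    return [c / total for c in accumulate(counts)]

def median_bin(bins, cdf):
    """First bin whose cumulative fraction reaches 0.5."""
    for b, f in zip(bins, cdf):
        if f >= 0.5:
            return b
    return None

bins = [57, 58, 59, 60, 61]
device_a = [0, 0, 0, 100, 0]    # 100 counts in bin 60
device_b = [0, 33, 33, 33, 0]   # 33 counts each in bins 58, 59, and 60

cdf_a = histogram_cdf(device_a)
cdf_b = histogram_cdf(device_b)

for b, fa, fb in zip(bins, cdf_a, cdf_b):
    print(b, round(fa, 3), round(fb, 3))
print("median bins:", median_bin(bins, cdf_a), median_bin(bins, cdf_b))
```

Device A steps from 0 to 1 at bin 60, while device B ramps up across bins 58 to 60; both curves have reached 1 by bin 60.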


If one line is to the right of the other everywhere, then that distribution is larger at every quantile. (“First order stochastic dominance” if you want to sound fancy.) I agree that CDFs are hard to interpret, but that is partly due to unfamiliarity.
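
The "to the right everywhere" check is easy to do mechanically. This is a sketch under my own naming (`dominates` is hypothetical) with made-up numbers; it tests the weak form, so a sample also "dominates" an identical copy of itself.

```python
from bisect import bisect_right

def dominates(a, b):
    """True if sample `a` first-order stochastically dominates sample `b`,
    i.e. the empirical CDF of `a` is never above the empirical CDF of `b`."""
    sa, sb = sorted(a), sorted(b)
    # Both ECDFs are step functions that only change at observed values,
    # so comparing them at every observed value is sufficient.
    for x in sa + sb:
        fa = bisect_right(sa, x) / len(sa)
        fb = bisect_right(sb, x) / len(sb)
        if fa > fb:
            return False
    return True

# Hypothetical build times in seconds; not real data.
slow = [310.0, 295.0, 330.0]
fast = [250.0, 240.0, 265.0]

print(dominates(slow, fast))      # slow's curve lies to the right of fast's
print(dominates([1, 4], [2, 3]))  # the curves cross, so no dominance
```

When the curves cross, neither sample dominates, and a single summary such as the mean or median hides part of the story.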



