Not Your Grandmother's Textbook Exercise

wzwy · on Sept 2, 2022

> (d) Compute the mean age, and print it out as the median. It's much easier to program, and Mary Jones will probably never notice the difference.

Lol. If it’s not the most important statistic to Mary Jones, I can see why a programmer might do this.

> Mary Jones in the Accounting Department tells you that she has a file of personnel records [2022 update: an Excel sheet] and asks you to develop a program to compute the median age of the personnel in the file.

I’ll probably add option (e) if it’s Excel: Mary Jones, you can learn and do this yourself. Just Google it or something.

bambax · on Sept 2, 2022

It's not just Mary Jones. The controversial (to say the least) French medical professor Didier Raoult last year showed he didn't quite grasp the difference between median and mean. There was some mockery on Twitter but no journalist ever called him up on it.

brutusborn · on Sept 2, 2022

I couldn't believe an Academic could be this ignorant, so I had to look it up. I found it but only basic french. I can't find a translation, does anyone know of one? https://mobile.twitter.com/cubic_logic/status/15410902478613...

d1sxeyes · on Sept 2, 2022

EN: '[...] some things to put together. You see that if we look at the mean or the median, the mean is when a distribution is like this [indicates a bell curve with his hands], we can calculate a mean, but when it's like this [wiggles his finger], you can't calculate a mean, you do the, the, the bar which separates them in half. And so [...]'

FR: '[...] d'objets de mettre ensemble. Vous voyez que si on regarde la moyenne ou la médiane, la moyenne c'est quand une distribution est comme ça, on peut faire une moyenne, quand elle est comme ça on n'a pas [...] pour faire une moyenne, on fait, le, le, le, la barre qui sépare à moitié les uns les autres. Et donc [...]'

There's a short phrase I can't quite catch at about 9s... which literally translates as 'we don't have [...] to do a mean', so I'm inferring that he means 'you can't calculate a mean' from context (previous clause 'on peut faire une moyenne' => 'we can do a mean' => 'we can calculate a mean').

alain94040 · on Sept 2, 2022

Actually, I believe he is probably correct in that case. He is discussing whether the mean is meaningful depending on the shape of the distribution. It's probably not fair to summarize his explanation as saying he doesn't understand mean and average.

HN's guidelines recommend to assume the best.

d1sxeyes · on Sept 2, 2022

I would agree that you should always try to assume the best (and I didn't actually suggest he doesn't understand), but he certainly explains it clumsily, and his explanation does not illustrate the difference between a median and a mean, or when one would be more meaningful than the other.

pkd · on Sept 3, 2022

I don't think it's the best explanation, but I do think he understands that mean as a measure of central tendency is not that useful when distributions are skewed or multimodal (the squiggly line gesture).

d1sxeyes · on Sept 4, 2022

Yes but:

1. If data is multimodal (wiggly line), the median isn’t a great measure of central tendency either.

2. Perfectly normally distributed data will have the same mean and median.

I agree people have jumped on this as evidence the guy doesn’t understand when he probably does, but what he says is a long way from the truth.

deanmen · on Sept 2, 2022

Excel has MEDIAN built-in https://support.microsoft.com/en-us/office/median-function-d...

sbierwagen · on Sept 2, 2022

You can also install the Analysis ToolPak add-in, select the column, click Data -> Data Analysis -> Descriptive Statistics. Gives mean, median, mode, SD and a bunch of other things. https://support.microsoft.com/en-us/office/use-the-analysis-...

Has been in Excel since 2007.

mananaysiempre · on Sept 2, 2022

Since 1997 at least (except it’s under Tools there).

cwillu · on Sept 2, 2022

The problem states that Mary has an Excel file. It does _not_ state that Mary has an Excel software licence, or even a computer with which to run it.

k_sze · on Sept 2, 2022

You have LibreOffice nowadays.

ZoomZoomZoom · on Sept 2, 2022

The problem states that Mary has an Excel file. It does _not_ state that Mary has an access to a personal computer or a device allowing remotely accessing one.

k_sze · on Sept 2, 2022

My bad. The problem also does not state that Mary has adequate eyesight to use a computer or that she even has access to electricity to power a computer. Life is so hard

pjc50 · on Sept 2, 2022

Quite difficult now to reach this scale with personnel records, but: imagine that you're asked to provide the median of one trillion data points. This could provide lots of interesting possible approaches, since it's about the boundary between "big data" vs "do it on your laptop". And you can't just load it into Excel.

(and also to ask about the range of the data!)

nmstoker · on Sept 2, 2022

For a trillion records, if you were in the UK government, you might just drop those above the Excel limit and proceed until journalists flagged the concern nationally.

Background context here:

- https://www.theguardian.com/politics/2020/oct/05/how-excel-m...

raverbashing · on Sept 2, 2022

For 1 trillion records I'd say: estimate the probability distribution and values and give the corresponding value as the answer

Psyladine · on Sept 2, 2022

re: (e), they can't. I don't mean they cannot physically operate the sequence of events (even a properly incentivized chimpanzee could do that), but that they lack the programmatic sense of problem solving that is assumed (quite arrogantly) by the programmatic crowd.

In the same way people take their cars to mechanics when they can do the job themselves, or call triple A for a flat tire when the spare and jack are in the trunk, or reach for the phone instead of a plunger when the toilet backs up, people don't lack the specific skillset, they lack the openness to problems being solvable. They are in perpetual search of answers, not solutions.

They don't want to learn to fish, they don't want to expend energy on process. They want someone else to do it. It's not laziness, it's not incompetence, and it's not entitlement. They are just of a mindset they cannot solve problems, and so they pick up the phone and abstract away the answer.

And I promise you, no 10 minute condescending conversation with someone is going to redefine their core attitudes about life. You're just going to have your well poisoned in the break room with your piss poor attitude because she called the plumber and he told her to fix her own shit (in a manner of speaking).

Ma8ee · on Sept 3, 2022

No, it’s not that they don’t think they can do it, it’s that one of the foundations of modern economics is specialisation. I know I can repair my car myself, and even think it would be kind of fun. But the mechanic in his shop already knows exactly what to do and in which order. I’d spent hours reading and watching videos before I even could start and still make plenty mistakes. The mechanic already has the tools and know where to source the parts. I’d spend half a day just to make sure that I’d get the right parts to a reasonable price. In the end it would take me four days what the mechanic would do in an hour or two.

And I buy my fish in the supermarket, for even more obvious reasons. This example make me believe you actually might be trolling.

So the lady in the original examples uses that the company already employs people who are specialists in doing this kind of things, Because she has more important things to do with her time. And that you don’t understand what she’s doing all day that make her not having time for fighting with spreadsheets is your failure, not hers.

082349872349872 · on Sept 1, 2022

Interesting finding in the post referenced by TFA:

> In other words, you can throw money at a project to make things happen faster, but the highest time reduction you will ever be able to gain is by a quarter. Such a result, confirmed by many studies (by Boehm and many others after him), is typical of the kind of strong empirical work that Boehm favored.

cxr · on Sept 2, 2022

I'm partial to John Ousterhout's "The greatest performance improvement of all is when a system goes from not-working to working".

<https://web.stanford.edu/~ouster/cgi-bin/sayings.php>

zwkrt · on Sept 2, 2022

Five hours and no one has actually attempted to answer the question as far as I can tell.

If I was Mary I would probably want A. As a programmer I would tend toward C becuase I have been burned by not questioning requirements before. If Mary were me (so to speak) I would hope she asked me such questions, but most people giving requirements get annoyed with too much hair splitting, at least hair splitting from their perspective.

jamesrom · on Sept 2, 2022

I like A, as it iterates toward C anyway.

You can deliver value with the naïve implementation delivered quickly.

If Mary Jones has further requirements that were not considered at the outset, now she'll be able to consider them.

Conversely, if you've built a bunch of assumptions into your code, they'll be tested in the real world sooner rather than later.

mattkrause · on Sept 2, 2022

I'd cheat in a slightly different way.

If the data are employee ages, you know that they are probably integers[0] and certainly bounded between 0 and 128.

Bin the ages into years and accumulate the per-year and total counts, which is a snap: O(N) time, O(1) space, and trivial to parallelize. Find the bin where the cumulative sum is equal to half the total counts and you're done.

[0] Even if they're not integers, you probably don't need ages more granular than a day, which is still a manageable number of bins (~50k).

samuel · on Sept 2, 2022

A) is the right *business* answer the 99% of the time, assuming you have a fast routine for sorting an array but not one for computing the median directly.

Mary shouldn't care about a) vs b) unless she has to wait too much to get the answer. If she has, then her need has to be weighted against other business needs, so the data gathered in c) becomes relevant. b) then becomes one of c's aftermaths.

d) is dishonest

So it is a) -> c)

I wouldn't go directly to c), a) is cheap enough.

zeristor · on Sept 2, 2022

“Just go away and write the code”

smeagull · on Sept 2, 2022

They're ages, and therefore integers over a tiny range. You ignore these solutions, and insert the counts of values into a tiny dictionary, and compute the median from that.

layer8 · on Sept 2, 2022

Maybe they want the median age in number of days?

mattkrause · on Sept 2, 2022

Assuming Methuselah doesn't work for you, you can still do that in about 200k of memory.

layer8 · on Sept 3, 2022

Only at the cost of poor cache locality. :)

mattkrause · on Sept 5, 2022

Shouldn't it fit completely in L1 (or at worst, L2)? You need roughly 146,100 bytes: 120 years * 365.25 days/year * 4 bytes/day (uint32 as an accumulator) = 175,320 kb.

MacBooks have something like 192 kb shared + 128 kb/core, so that should be okay if not much else is going on....

evolvingstuff · on Sept 2, 2022

Maybe I'm missing something obvious, but this seems like a way to calculate a mode, not a median.

xanathar · on Sept 2, 2022

I think the idea in parent post is: fill the dictionary (I would use a 200 entries array though), iterate over the keys in sorted order summing the values, stop when you are at N/2. The 'iterate over keys in order' part depends heavily on the data structure choice of the dictionary (good for treemaps, bad for hashmaps, best for preallocated array).

Incidentally this falls in the 'sort an then pick the middle' because the original text says nowhere that is must be a n*log(n) sort and what I described here is essentially the same but with counting sort (and a couple redundant steps removed).

wikfwikf · on Sept 2, 2022

The right answer.

mojomark · on Sept 2, 2022

I think the point is being missed here. This response does not answer the direct brief nor the (more important IMHO) underlying question that attempts to excercise empathy for the customer:

"Rank the responses in the order of their relative concern with programming considerations, economic considerations, or other important considerations. If you were the programmer, which approach would you prefer? If you were Mary Jones, which approach would you prefer?"

After 20 years in industry, I constantly see people banging their heads against the wall, not understanding why the customer "doesn't get it," when in many cases they do in fact "get it" but the solution has different meaning to them. The complex challenge is identifying all of their underlying motivations and viewing the problem from their perspective.

Empathizing with the customer is the lesson, not the math. It's important because it makes for a happier professional relationship and if you're technically right, knowing the customer's drivers can help you efficiently negotiate the implementation path forward in a way that everyone is satisfied with and garuntees more future business. The lesson teaches a path to professional happiness, and that's invaluable. Who wants to be grumpy and angry 80% their life?

wikfwikf · on Sept 2, 2022

If someone asks for the median, you supply the median if possible. The assumption underlying the question is that it is difficult to calculate the median.

In fact, in this case, it is not difficult.

Going back to a professional who has requested the median and asking them if they didn't actually want the mean is not a display of empathy. It is patronising. The gender of the stakeholder is perhaps of relevance here.

mattkrause · on Sept 2, 2022

I'm not sure I agree.

You certainly shouldn't sneakily make the substitution. However, making a "client" aware of options and the associated trade-offs is very reasonable. Heck, I'd say it's the most professional approach to this problem!

It may be that Mary just needed some measure of central tendency, and would be happier with the mean now versus the median in a week. Alternately, she might need the median specifically for some legally-mandated reporting, and only that will do. You'll never know if you don't ask.

alain94040 · on Sept 2, 2022

If someone asks for the median, you supply the median if possible

I wouldn't. I have seen first-hand so many projects that took weeks of efforts, because we assume the person making the requirements used perfect language, when they could have been done in a day instead. One word can be the entire difference.

Maybe it happens to you in a big company: "the VP said '...'", so you must do exactly as the VP said. In practice, it turns out that the VP, just like Mary, could easily agree they didn't care if it's median or mean, they just need some number that summarizes the data.

Understanding the customer is key.

akudha · on Sept 1, 2022

Compute the mean age, and print it out as the median. It's much easier to program, and Mary Jones will probably never notice the difference.

This made me chuckle

popotamonga · on Sept 1, 2022

Customer only realizes 5 years later and you put the blame the customer for not discovering it earlier.

kungfufrog · on Sept 1, 2022

Where's the option to start firing up a Kubernetes cluster?

pjscott · on Sept 2, 2022

Since this is from the 1980s, would a VAX cluster suffice?

retcore · on Sept 2, 2022

A high volume transaction concurrency capable routine scenario would make a good update for contemporary financial services. Delta median and tick moving average strategies eg for thought. Also, how to recalculate for cancellation, hot fill or kill and other execution strategies.

The original question can get you thinking very close to contemporary problems, with only modest variations. We should have more longevity built into our textbooks.

zeristor · on Sept 2, 2022

Surely the best of both worlds would be to run Vax on Kubernetes?

gonzus · on Sept 2, 2022

You certainly meant "run Kubernetes on Vaxen", right?

analog31 · on Sept 1, 2022

Also, how long would it actually take to sort a number of records consistent with the largest workforce today?

xelxebar · on Sept 2, 2022

Dyalog APL is fast:

    ¯1↑5e6↑{(⊂⍋⍵)⌷⍵}10e6?⍤⍴100

This finds the median of 10 million "ages" in just under 50 milliseconds. Granted, with this randomly generated data it's probably doing something smart to avoid a full 10MB allocation, so on a static dataset the performance might be a bit different.

mlochbaum · on Sept 2, 2022

(Former Dyalog implementor) I have 9ms for the actual median calculation, which is much faster than generating the ages. It's a counting sort on 1-byte values: since the counts are so large it can just write results with memset() or an equivalent which is vectorized. You could also get the median from the counts directly, although I'm not sure APL has a way to do those counts that's faster than just sorting.

pcblues · on Sept 2, 2022

Like any Sandra Bullock movie, I can't tell if I have witnessed the funniest or most achingly sad thing I have ever seen. I like both.

bobsmooth · on Sept 2, 2022

A quick Google search says Walmart has 2.3 million employees. Using python to generate 2.3 million random age and name pairs and then sort them took about 8 minutes. And I'm pretty sure most of that time was spent creating the array.

thrown_22 · on Sept 2, 2022

    x = [(randint(20,80), "John Smith")  for x in range(2_300_000)]
    x.sort(key=lambda x: x[0])
    x[2_300_000//2]

Takes under a second. You can expand it to generate random names rather than just ages, if you feel like it, but the end result wouldn't be much different.

In short: this is something you shouldn't even think about in modern computers since it's that fast.

bobsmooth · on Sept 2, 2022

Haha, removing the library I used to generate random names took my code down to a few seconds.

vlovich123 · on Sept 2, 2022

On my laptop generating it takes ~1s and sorting takes ~280ms. As always, dependent on the power of the machine running things.

thrown_22 · on Sept 2, 2022

After thinking about it for a while, the reason why it was a hard problem back then is that the payroll wouldn't fit in memory and you'd need to do a sort on disk.

In this day and age it would be like trying to sort 100 TB of data. Not a hard problem, but not exactly something you do daily either. And still dreadfully slow since you need to use disk memory while copying everything over.

A more modern version of the question would be:

>You have a list of integers, in random order, that is 20 times larger than the hard drive you have on your work station and 1,000 times larger than your work station ram....

vlovich123 · on Sept 2, 2022

Naw. Sorting is a total waste here. I can read off an "infinite" amount of records of the wire with O(1) space usage and O(n) result. Even back then. Look at my radix sort solution: https://news.ycombinator.com/item?id=32684017

The intent behind the problem in the textbook is fine - think non-linearly about the problem. Just the example used has a trivial solution not mentioned (i.e. option 5).

thrown_22 · on Sept 3, 2022

Yes, that's why I changed it to having a list of arbitrary integers.

You're also assuming integer ages when a more natural comparison would have date of birth. Which has 365 times the number of buckets. Something that was again problematic to fit into a 1980s memory.

082349872349872 · on Sept 2, 2022

3e6 employees all of whom were born after 1768? <1s?

harry8 · on Sept 1, 2022

It's an excel file.

(e) Show her the excel function wizard and launch her empowerment for great justice!

gumby · on Sept 2, 2022

Seriously, interviewers should ask these kinds of questions, not leetcode.

corpMaverick · on Sept 2, 2022

Simple leet code questions are good to find out who can think algorithmically. But once people catch up they are not longer useful since you can study them and memorize them. Then the leet code questions need to be more difficult and it becomes a race. Now you are obligated to study them even if you are already a practicing Software engineer.

Same thing would happen to these questions. People would memorize them and answer not on experience or intuition but on memorization.

zeristor · on Sept 2, 2022

The question sort of answers itself as you read through the options, but then that can’t be taken for granted.

Attrecomet · on Sept 2, 2022

Yes, in a world where fizzbuzz is a useful filter for advanced software developer positions, I'm pretty sure this would also help.

layer8 · on Sept 2, 2022

The task is to evaluate and discuss the options, which would be the valuable part in an interview.

taylorius · on Sept 2, 2022

> (d) Compute the mean age, and print it out as the median. It's much easier to program, and Mary Jones will probably never notice the difference.

Possible improvement: The median will always be a member of the actual dataset, so after calculating the mean, run through the data to find the entry which is closest to it, and return that for the "median". Could make our dastardly plan harder to uncover.

eesmith · on Sept 2, 2022

If there is an even number of elements then the median is usually defined as the arithmetic mean of the two middle values. Eg, the median of [1, 3, 5, 7] is 4.

qsort · on Sept 2, 2022

Depends on how you define 'median'.

- Any value that minimizes the sum of absolute differences? Then all real numbers between 3 and 5 are a median of that set.

- With an inequality over the PDF? The same as above.

- The 1/2-quantile? Then you need to tell me if we're rounding up or down.

I've seen that definition before, but it's a completely useless definition for any legitimate use case of the median as an order statistic. I have no idea where it comes from.

d1sxeyes · on Sept 2, 2022

https://en.wikipedia.org/wiki/Median has this definition.

It's also the definition that I was taught at school.

Oxford Languages defines it as: "denoting or relating to a value or quantity lying at the midpoint of a frequency distribution of observed values or quantities, such that there is an equal probability of falling above or below it", which also supports the idea of taking the midpoint between the two middle values.

Merriam Webster define it explicitly in the way OP does (arithmetic mean of the two middle values): https://www.merriam-webster.com/dictionary/median

I'm not sure why you think it's a 'completely useless definition'. It would perhaps be a logical argument that data sets of even length do not have a median, but I think this would be even more useless.

To the layperson (which Mary Jones undoubtedly is), providing the 'median' age as an age half of the employees are older than/younger than, is what would be expected. Your first bullet point makes sense here. For example here:

18, 20, 25, 30, [35, 40], 45, 50, 55, 60

Most people would expect the median to be calculated as 37.5. However, as you mention, any real number higher than 35 and lower than 40 would equally fulfill the definition of 'equal probability of falling above or below' assuming this is the full population. Would you then argue that there are infinitely many medians in this data set, or none?

There are other issues with calculating 37.5, such as the fact it implies a higher degree of precision than it ought to. But ultimately, as a practical and usable insight into data, especially data with extreme outliers or high levels of skew, I strongly disagree that this definition is completely useless.

qsort · on Sept 2, 2022

> Would you then argue that there are infinitely many medians in this data set

By the definition I usually go by (the one with the inequalities), that dataset has infinitely many medians. There is no reason to say the midpoint is privileged. This is a better definition because it also works on sets where an order is defined but the arithmetic mean is not.

If asked to report a single value with no qualifications, I would report either extreme of the interval.

d1sxeyes · on Sept 2, 2022

> There is no reason to say the midpoint is privileged.

This is a matter of opinion. I agree it's not 'privileged', but it sacrifices a minimal amount of accuracy for a usable answer.

> If asked to report a single value with no qualifications, I would report either extreme of the interval.

That's a very interesting answer - so for my contrived example of 18, 20, 25, 30, [35, 40], 45, 50, 55, 60, you would report 35 or 40?

How about if I change the dataset a bit (so it's even more contrived): 18, 18, 18, 18, [18, 60], 60, 60, 60, 60 - would you still report either 18 or 60 (again, asking without qualifications)?

I think most people would agree that the median calculated as the arithmetic mean of the two middle values (39) is misleading, but that it is less misleading than ignoring half of the data set. How would you respond to that?

qsort · on Sept 2, 2022

> you would report 35 or 40?

Generally I report the lower end because it more easily generalizes to different quantiles.

> How about if I change the dataset a bit would you still report either 18 or 60 (again, asking without qualifications)?

Yes. I don't see how answering 39 is any better in that case.

> How would you respond to that?

I would respond that you should look at the distribution. Reporting the median is particularly misleading in the case of bimodal distributions. I would try to explain that summary statistics are to be taken with caution, possibly showing a picture.

d1sxeyes · on Sept 2, 2022

> Yes. I don't see how answering 39 is any better in that case.

Fair enough. I think 39 would show there's no bias towards younger or older employees. I think 39 is better but I can see both sides of the argument, and can ultimately see that it boils down to a question of personal preference as to which of two wrong answers is 'less wrong'.

irishsultan · on Sept 2, 2022

In any reasonable definition of what a median is then there is either a unique value, or the mean between the two middle elements is a median, so it's perfectly reasonable to take the average (something which you can always do and gives a unique answer).

bloak · on Sept 2, 2022

If they are real numbers then you can take the arithmetic mean of the two middle elements, but you might prefer to define the median in such a way that the definition is applicable to any set with a total order.

For example, names can be ordered alphabetically, so you can find the median of a set of names! OK, that's a silly example, but I expect someone could come up with a more sensible one.

Another case might be where you're using geometric means everywhere else in your working, but then you want a median in one place, but there are an even number of elements, so why would you suddenly use an arithmetic mean of the two middle elements?

qsort · on Sept 2, 2022

> I expect someone could come up with a more sensible one.

- Imagine you hand out a survey where some answers are on a likert scale. "strongly dislike" is worse than "dislike", which is worse than "neutral" and so on.

- The median ranking over a set of ranking on different criteria.

- When the domain is not the real numbers. "The median employee of this company has 1.5 kids".

eesmith · on Sept 2, 2022

> I have no idea where it comes from.

I went to archive.org and searched for intro books in statistics (there are a lot!). All of them gave preference to the average of the two middle values. One of them also pointed out there were valid alternatives.

"Since there are 10 values [in the sorted list], the median is the average of the fifth and sixth values. However, it could be also be 4.1, or 4.9, or any other value between 4 or 5, since each of these values divides the 10 values into two equal groups." - https://archive.org/details/introductiontost0000chao/page/54...

"For a sample of size n, the sample median is the ... Average of the two order statistics in the middle if n is even." - https://archive.org/details/introductiontost0000salv/page/64...

"We find no middle value; therefore, the median is the mean of the two middle values" - https://archive.org/details/US_Navy_Training_Course_-_Introd...

"The sample median for this collection is given by the middle observation if n is odd. If n is even, the sample median is the average of the two middle observations." - https://archive.org/details/introductiontost0000milt_g9m2/pa...

"sample median = ... the average of the middle two values if n is even" - https://archive.org/details/introductiontost0000peck_o4j9/pa...

"If there were an even number of observations (say, six instead of five), the calculation would be more complex. We would need to find the middle pair of numbers (the third and fourth observations), and then find the value that’s halfway between them, by adding the values together and dividing by two." - https://archive.org/details/introductiontost0000haan/page/n6...

"If there is an even number of values, the median is equal to the mean of the two middle values of the ordered set." - https://archive.org/details/introductiontoec0000merr/page/18...

taylorius · on Sept 2, 2022

That's true - maybe that gets us off the hook. :-)

mgaunard · on Sept 2, 2022

I prefer the 2-quantile myself.

hhmc · on Sept 2, 2022

Also, the median can only be 1stddev away from the mean in any distribution, so this can further winnow the search.

mjburgess · on Sept 2, 2022

...this isn't true:

1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,10^30

Chirono · on Sept 2, 2022

I’d double check your working there because it certainly is true. In your example the standard deviation (2e29) is far bigger than the difference between the median and mean (5e28).

zeristor · on Sept 2, 2022

Reminds me of the distribution in salaries.

I read …, 1, 10, 30 but …, 1, 10^30 is far more dramatic.

hhmc · on Sept 2, 2022

Someone has already explained why your example is wrong, but for a general proof, for dist X, let u = mean, m = median

| u - m | = |E(X - m)|

         <= E(|X - m|)  

         <= E(|X - u|)

         <= E(sqrt((X - u)^2))

         <= stddev

mostly steps via Jensen's inequality

mjburgess · on Sept 2, 2022

oops... I interpreted the claim, admittedly bizarrely, as something like:

fit N(u, s) to {(x, y)...} then m will lie u+-s

rather than, def. s^2 = E[X^2] - E[X]^2

popotamonga · on Sept 2, 2022

(x) find and npm package and question nothing, whatever happens happens. If it fails blame the author.

sinuhe69 · on Sept 2, 2022

Wow! I love the old school time where text books stimulated you to think broadly and deeply. Not just solving some artificial tiny problems and let you think when you can solve them all, you’re on the top of the world.

deanmen · on Sept 2, 2022

If it is indeed an Excel spreadsheet MEDIAN is a built-in function

https://support.microsoft.com/en-us/office/median-function-d...

abrax3141 · on Sept 2, 2022

Just report 38.8 years, the median age for us population as reported by the census. Mary won’t notice, it’s scalar in time and space, and in the limit you’ll be correct.

thinkingemote · on Sept 2, 2022

These are personnel records - people who are working at least (16-64 years) so not the total population. Will be similar though. Depends on demographics.

Additionally, one can assume that there's more younger people working in a company than older and make a little bit more accurate guess.

abrax3141 · on Sept 2, 2022

Aha! Let’s build a DNN model, train it on all the occupation data we can scrape from the web, then run it forward to deliver Mary’s result. Once the training is completed, say a year (or, with advances in gpu tech, two years), the forward model can be run in scalar time to confabulate a very convincing wrong answer to any demographic question Mary might have! That way we can fire the whole engineering team because the model will comphabulate answers to any future question Mary might have. (“. . . in the style of a software engineer.”)

vlovich123 · on Sept 2, 2022

It’s actually currently 42 For what it’s worth (went up 2 years in the last decade).

abrax3141 · on Sept 2, 2022

Oh. And it can be engineered at minimal cost (although all the test writing will require 2FTEs for 6months). UnleSS you write it in Haskell, in which case tests are unnecessary, but it’ll be 3FTEx2yrs to send the engineering team to a monastery to learn how to tie themselves into the appropriate monadic knot.

zeristor · on Sept 2, 2022

PebblesRox · on Sept 2, 2022

The median of life, the universe, and everything.

duxup · on Sept 2, 2022

> Spend a couple of weeks attempting (unsuccessfully) to improve on Floyd's algorithm, a day programming Floyd's algorithm, and then return to Mary Jones with some questions on the size and format of her file.

I like to think this is hinting that you should have asked first and maybe her answer is that she has a small number of records and if you took that option B you just wasted a few weeks.

Right?

totoglazer · on Sept 1, 2022

In that book in 1981 it’s quite likely to be a problem faced by the grandmother of many current programmers.

SamBam · on Sept 1, 2022

That's why it's Not your Grandmother's Textbook Exercise.

totoglazer · on Sept 2, 2022

Maybe it’s just poorly tilted then? This literally IS. I guess they are trying to imply other problems aren’t? But that would apply to some other theoretical question not mentioned here…

umanwizard · on Sept 1, 2022

Presumably the list is not ordered adversarially, so you can just use quickselect, at O(n) running time.

pjscott · on Sept 1, 2022

My personnel records are always ordered adversarially, on general principle. :-)

But if you’re willing to assume that none of the ages are more than some reasonable constant bound, like no employees over the age of 65535 or something, then there’s a fun guaranteed O(n) algorithm closely related to counting sort.

naniwaduni · on Sept 2, 2022

I wouldn't really expect your personnel ages to leave the range of an unsigned octet in the foreseeable future. Hell, I'm pretty sure we're not expecting to have personnel exceeding 127yo anytime soon.

(Note, however, that counting occurrences is in principle at least O(log n) space and, without a bit of cleverness, likely O(n log n) time—your integers need to be O(log n) wide to count that big!)

tantalor · on Sept 2, 2022

> counting sort

There's the right answer right there.

I'll add: let the AI handle the details.

  def getMedianAge(person_list):
    # Let copilot fill in rest

exact_string · on Sept 2, 2022

You can use the median of medians with quickselect to guarantee the O(N) time complexity.

insane_dreamer · on Sept 2, 2022

(e) introduce Mary to Lotus 1-2-3

wzwy · on Sept 2, 2022

Woah, flashback to was it Crossing the Chasm or Inside the Tornado?

I’m not sure if the latter has an updated version, so it’s probably that.

jeffrallen · on Sept 2, 2022

Update for 2022: (e) import some random code from NPM that turns out to be O(n^2), and has a bug which results in the same output as (d) would have had. Also, it has one dependency that will be deleted with no replacement next month, and another that will be a victim of a supply chain attack the month after that.

Supplemental question (for "software ethics" option students): If you've got 3 emails from recruiters waiting replies and you're 50% sure you won't be in your job in 1 month, does that increase or decrease the desirability of option (e)?

pcblues · on Sept 2, 2022

I would also try to find a way that didn't involve programming. Less bugs.

vlovich123 · on Sept 2, 2022

The exercise is a bit weird. The problem statement says "find the record" but all the answers are about printing the median age.

So answer to the question posed? None of the above. One can make a simplifying observation that ages are all < 255. I'd then use a radix sort.

   from random import randint
   from time import time

   x = [(randint(20,80), "John Smith")  for x in range(2_300_000)]

   start = time.time()
   ages = [0] * 256
   for entry in x:
       ages[entry] += 1
   for age, count in enumerate(ages):
       total += count
       if total >= 2_300_000//2:
           print(age)
           break
   print(time.time() - start)

This is ~1.6x faster than the variant that uses sort with a custom key that someone else posted. If you simplify it into an array of numbers, it's still 5-10% faster on my machine than sort (i.e. comparing pure native code running sort vs running radix sort in interpreted Python).

If you converted this to native code I'm sure this would be over an order of magnitude faster, both because O(N) is faster than O(n log n) and because you're doing a linear scan through the records (just jumping randomly around within a 255 byte array which can be 1 cache line on some CPUs). Indeed, when I compare this in native code [1], on my machine it's 3ms for radix sort processing of 2.3M records vs 77ms (~25x faster). When run with -O0, it's 24ms vs 724ms (~30x faster).

If I needed to find the actual record for the median employee(s) (there can be more than one obviously), I'd just scan over the data linearly a second time which should STILL be faster [2] (in the benchmark it's still 7x faster than comparison sort).

[1] https://quick-bench.com/q/Bra8T5nv9nFzgUUUKxjVLNCie6I [2] https://quick-bench.com/q/CN0WBMpyICM2ZwnwgQadBX3cPj4

Liru · on Sept 2, 2022

I had to reread this comment multiple times since I felt that I had a fundamental failure of reading comprehension somewhere along the line.

> The problem statement says "find the record" but all the answers are about printing the median age.

> So answer to the question posed? None of the above.

The problem statement does not say "find the record". The problem statement says, fully quoted:

> Rank the responses in the order of their relative concern with programming considerations, economic considerations, or other important considerations. If you were the programmer, which approach would you prefer? If you were Mary Jones, which approach would you prefer?

bisby · on Sept 2, 2022

Indeed. The question asks you to rank responses. Furthermore,

> The problem statement says "find the record" but all the answers are about printing the median age.

No. it doesn't. It asks you to "compute the median age" without asking about any records.

> she has a file of personnel records [2022 update: an Excel sheet] and asks you to develop a program to compute the median age of the personnel in the file. Here are four different ways you might respond:

So the textbook asks you to rank possible ways to respond, and the hypothetical is just asking "can you process this list and tell me the median age". No one in any context is asking for a record.

vlovich123 · on Sept 2, 2022

You’re right. I was looking at option A. Therefore radix sort is strictly the right answer. Simple, easy to read, easy to understand and an order of magnitude faster.

mgaunard · on Sept 2, 2022

If you cared about speed, you wouldn't sort at all.

pcblues · on Sept 2, 2022

I love this so much. I know what I should do, and yet my conscience will make my final decision more expensive for the company I work for. Should I be fired for that?

corpMaverick · on Sept 2, 2022

Take a random sample of 100 elements. Calculate the median from them.

yccs27 · on Sept 2, 2022

Ooh, I'm getting nerd-sniped here. I wonder how accurate this method is - getting a pen and paper, BRB.

spennant · on Sept 2, 2022

With, or without replacement?

andrepd · on Sept 2, 2022

> (b)

Many languages like C++ have this algorithm built in. Code reuse is amazing!

wrycoder · on Sept 2, 2022

If there are any boxes of greenbar sitting around, it’s C A B D.

Otherwise, A C B D.

mgaunard · on Sept 2, 2022

ACBD is clearly the right ranking.

jenaimarre · on Sept 2, 2022

We almost agree, I would go for D, A, C, B :-)

tmaly · on Sept 2, 2022

I can remember learning OO with Eiffel.