Is there a simple derivation or justification for this estimator?

mturmon · on June 10, 2010

This is a classic question where some of the usual assumptions of the method conventionally applied, the Maximum Likelihood (ML) principle, break down. There's a good explanation at:

http://en.wikipedia.org/wiki/German_tank_problem#Example

In a nut, the ML estimator of N, the number of tanks produced, is the max of the serial numbers. But this is biased, in particular, it tends to systematically underestimate N. (Because you're unlikely to actually observe the top serial number in your random sample.)

So you can add a correction term which is, intuitively, the expected gap between the serial numbers in the finite sample. The correction makes up for the fencepost effect. It goes to zero as the number of samples increases.

xel02 · on June 10, 2010

There is, had to learn it in a probability class. http://en.wikipedia.org/wiki/German_tank_problem

brg · on June 10, 2010

To sum up the derivation for those looking for crib notes; model the problem as choosing k items u.a.r. from [1..N]. Compute P(max = i), and from this compute the expectation of the max. After simpliciation this is given in terms of k and N, and hence we have an estimator for N in terms of k and max.

ckuehne · on June 11, 2010

According to Wikipedia the estimator is (n+1)/ns-1 not (n+1)/n(s+1) as the author stated.

MarkBook · on June 10, 2010

I think I get part of it n/n+1 means that as n gets bigger it becomes ever more likely that s is close to the real maximum. But I have no idea why 1 is added to s

MarkBook · on June 10, 2010

continuing my foolish musings I would have thought originally that getting the mean of the sample and multiplying by 2 would have been a good estimate. But one problem with that is that there is no possibility of the maximum being lower than anything in the sample so there has to be an allowance for uncertainty upwards. Is that the reason for adding 1 to s?

mturmon · on June 10, 2010

You can show that the max is a comprehensively better estimator than twice the mean.

This is one of the rare cases where you can outperform the mean by a large margin.

The standard deviation of the mean-based estimator, which measures its accuracy, will go down like 1/sqrt(n) where n is the number of samples in the finite set. This is the rate standardly seen in lots of estimation problems.

But the standard deviation of the max-based estimator will go down like 1/n, a much faster rate. That's one manifestation of the "weirdness" of this problem. It means you can get surprisingly good estimates of the number of tanks considering the small numbers of observations.

MarkBook · on June 10, 2010

Thanks, I think I get it on a superficial level. The British seemed to have some pretty useful statisticians during WWII. See this also : http://www.dur.ac.uk/stat.web/bomb.htm

EDIT : So there's about a fifty fifty chance that the second observation will be bigger than the first. After the second observation the range of possible future observations will be divided into 3 zones, 1 below the lowest of the previous 2 observations 1 above the higher and 1 between the two previous observations, so there is now only a 1/3 possibility of the next observation being above the higher of the 2 previous observations. Next time that drops to 1/4 then 1/5, so that's how the standard deviation decreases proportional to 1/n

mturmon · on June 10, 2010

I like that explanation.