Using loaded dice to cheat at Settlers of Catan, and p-values to avoid suspicion (izbicki.me)
266 points by jackpirate on Dec 15, 2017 | 103 comments



Very nice hack and analysis thereof.

I do take issue with this statement, though, in particular the part highlighted (by me):

> "it’s impossible for your opponents to scientifically prove that you’re cheating. This impossibility is due to methodological defects in the current state of scientific practice, and we’ll highlight some ongoing work to fix these defects."

The impossibility is not due to methodological defects in the current state of scientific practice. It's intrinsic to the world. If you cheat only a bit, and only rarely, it might be impossible to detect it statistically.

Of course, there are problems with the current state of scientific practice and the mindless application of statistics (particularly the 5% significance level). The replicability crisis in psychology comes to mind.

However, I think statistics is unfairly maligned here. Statistics mostly delivers on its promise; it just promises less than some people think.

The world is messy, and statistics can help us to make sense of it, but it can only do so much. In particular, it's not magic. You need many observations to make statistical statements, and more observations to make more precise statements (or statements at a higher significance level). That's intrinsic to how the world is. It's not a problem with "methodological defects in the current state of scientific practice". And there aren't simple solutions "to fix these defects".

(Having said that, the article is great. Many proposals (some of them mentioned in the article), such as a standard 0.5% significance level, more Bayes, awareness of power in addition to significance, avoiding p-hacking, etc., are important and useful steps. But they still won't allow your opponent to prove that you were cheating :-)


> The impossibility is not due to methodological defects in the current state of scientific practice. It's intrinsic to the world. If you cheat only a bit, and only rarely, it might be impossible to detect it statistically.

Not if you take enough samples. Just have to do it enough to reduce the noise.

But to your main point, yeah, in a game of Catan you can just get unlucky. And I think a fair number of people claim that Catan dice aren't fair in the first place.

Also, did they test the unweighted dice, to check whether there was an inherent and unknown bias in their dice to start with? If they did, I missed that part; it seems they just assumed their dice were fair to begin with.


I don’t think you are disagreeing with the person you are responding to. They are trying to say the problem isn’t necessarily the ‘scientific process’, it is that if you cheat rarely enough you might not give your opponent enough samples to prove it. Your point that if you had enough samples you could prove it is actually agreeing with them.


Oh, I'm definitely agreeing with the person. I was just being a little snarky at that one statement.

I am also trying to ask a question that is extremely relevant to the author's "study".

"Are the original dice fair?"

That was never asked, it was assumed. It is a pretty hefty assumption too. I think everyone that has played a game of Catan has, at some point, questioned the fairness of the dice. I'm not saying they are unfair, but because of manufacturing methods, it is quite possible to get a pair of dice with an ever so slight bias.

ALWAYS ALWAYS ALWAYS find the bias in the instruments you are using to perform an experiment. THEY ALL HAVE THEM. NEVER NEVER NEVER assume your instruments are accurate without first verifying.


> I think everyone that has played a game of Catan has, at some point, questioned the fairness of the dice.

This is why a lot of people play with a set of 36 cards (one 2, two 3's, three 4's...one 12). There really aren't that many rolls in a game, and there's a lot of variation that may result in a 5 never coming up, which adds more chance to the game than I like. Cards mean that while the order is random, you get each number in the deck eventually.


I've never heard of anyone playing like that before, actually. And it seems to completely mess with the statistical nature of the game. Even if you shuffled the deck each time, shuffling methods are imperfect and you probably wouldn't get the nice triangular distribution that fair dice produce.


If you don't shuffle until you run out of cards, you mess with the instantaneous probability of any particular draw, but you hardcode the distribution. You are guaranteed to get the distribution you put in the deck, and I think that's a better measure of fairness than having a constant instantaneous probability.

As the deck becomes smaller, your ability to predict the next draw scales with your skill at counting cards. I think that's a good feature for a random number generator to have, as it's what allows games like poker and blackjack to be about more than simply getting lucky.


Well, I agree that a game needs to have more than luck. But something like blackjack is a horrible example because there's a lot of skill that can be applied even without counting cards. Not only that, but blackjack still has a large amount of luck involved with it even when you are highly skilled.

Personally, I find that the best games are ones that have a good balance of luck and skill. Too much luck and there is no skill; you might as well play a slot machine (some people enjoy games like Candy Land). Too much skill, and upsets become too rare: a highly skilled player will ALWAYS win, which creates too high a barrier to entry (Connect 4 or dots and squares are examples here). There are (a few) exceptions to the latter like Go, which has so many possible moves that you might as well have an element of randomness, but there is still a steep learning curve.

Catan is one of my favorite introductions to Euro games, because the learning curve is low and there is enough luck that an intelligent novice can win. I don't actually believe the dice are unfair, but the low number of rolls makes each game different. This means the skilled player needs to be highly adaptive to the changing environment.

Two dice create a nice triangular distribution, and the rolls are independent. While over a large number of games 6 and 8 are great choices, there will be games where you just don't do well (they are rare). By not shuffling the cards, you are fixing the realized distribution and really removing the vast majority of luck in the game (you still have luck in the order of the cards, order of placement, and order of turn). You now have a dependent probability function, and I think you could make great arguments that you remove all the things that (I believe, and laid out above) make the game great. I think you could also make arguments that the setup is the most important part of the game (when your cards are dependent events). But it is a game, and these are just opinions.


It absolutely changes the game, since you can now plan that an unlikely roll will definitely come up at predictable times.


The author was trying to analyse the effect of loaded dice on a game of Catan, not test whether the water-soaking method produced loaded dice. It doesn't really matter where the loaded dice came from.


All dice are slightly loaded.


Statistics is fine, people who are bad at statistics are the problem. When it comes down to it, many "scientists" are rather poorly educated.


Indeed, statistics is fine, but I think the problem is larger than scientists being poorly educated. It looks to me like an awful lot of scientists aren't acting in good faith. I don't think that their conduct necessarily rises to the level of dishonesty, more like they aren't taking obvious steps to mitigate the potential for erroneous results.


> aren't taking obvious steps

Because none of their peers think those steps are obvious, and their peers determine their funding, status, etc.


Yeah, the idea that you can adjust the priors for a Bayesian analysis based on suspicious behavior and then say that proves cheating certainly doesn't hold. The cheater is not going to agree with your assessment of their ‘suspicious’ behavior. You might as well just skip the analysis and say you proved they cheated because of their behavior.


>The impossibility is not due to methodological defects in the current state of scientific practice. It's intrinsic to the world. If you cheat only a bit, and only rarely, it might be impossible to detect it statistically.

I don't understand what you're saying or maybe I just want to go further. But I would say it's an ill-posed question whether someone's cheating in that sense. They could win a one in a million chance repeatedly over and over on their first try and it wouldn't be proof that they're cheating. It's not even necessarily probable. The probability that they're cheating depends not just on how good their results seem to be but also other things, primarily: on your priors about how likely it is that they would cheat.

If a known cheater gets a string of mild good luck, it is more indicative of cheating than if a known honest person does. Bayesian reasoning has to be used for this, right?


Well said. It doesn't reveal problems with current methodology, it reveals problems with its application, and understanding. Larger sample sizes would fix quite a bit. Studies, even "pilot" or "priming" studies, are not reliable at a .05 significance. Especially when most of the false positives will get published, but nearly none of the false negatives will see the light of day.


More practically, all dice are slightly loaded.


I think there are other statistical methods that would take account of sample size in different ways -- perhaps saying in some cases "you don't have enough data to determine validity yet". Among other things that different tests would reveal. p-value is not the only way to analyze your results statistically.

p-values say a certain thing, and are not appropriate to be used as the single universal statistical test for validity, and may likely not even be the best 'default' choice.

I think the message here is indeed that statistical verification/validation is _hard_, and requires actual statistical expertise to know what the tests you are applying mean (and what the assumed pre-requisites of the test are) and apply the right ones. The current practice of "p-value and only p-value, always" is flawed.

(But it's also because the influences of current scientific/academic practice are to motivate people to find the _quickest and cheapest way to get published_, rather than to give yourself additional barriers to publication in pursuit of truth!)

There has been a lot of writing about this in more formal and less cutesy settlers-example ways, this is not an original observation to this blog post. It's in fact been a growing concern, including leading to a recent statement of caution on over- and mis-use of p-values by the American Statistical Association.

http://www.nature.com/news/statisticians-issue-warning-over-...

https://www.sciencedirect.com/science/article/pii/S106345841...

http://blogs.plos.org/publichealth/2015/06/24/p-values/

http://www.ejves.com/article/S1078-5884(15)00585-7/pdf

http://sphweb.bumc.bu.edu/otlt/mph-modules/ep/ep713_randomer...

I like this as a demonstration of one way p-values can be insufficient or lead you wrong, as people have been increasingly discussing.

It would be interesting to apply some of the other suggested statistical tests (confidence intervals?) to the settlers experiment results.


> perhaps saying in some cases "you don't have enough data to determine validity yet"

Contrary to popular belief, that's actually exactly what a p-value is. A p-value determines whether your sample size is large enough to reject the null hypothesis, given the observed variance in the sample data.

Introductory statistics classes don't really explain this properly, because it would require knowledge of continuity theory that most don't have at that stage, but that's really what a p-value is already telling you.


I admit I'm not a statistician (or someone who regularly does quantitative research) and don't really understand what a p-value says (without refreshing myself on it by reading for 30 minutes each time, heh). So thanks for the correction.

But I think many of the researchers applying it don't either, heh.

The real point is just that there are other statistical tests that might be appropriate here, and perhaps would be more reliable at catching this. And that there's actually widespread agreement among experts that p-value testing is widely wrongly used.

> large enough to reject the null hypothesis

Or at any rate, that there's a calculated max 5% chance of the null hypothesis being true, if the assumptions/pre-requisites of the p-value test were actually met. Which is still actually kinda large, 5%, depending. And as I try to think about and confuse myself, may actually be entirely the wrong test as applied here, not sure.


> But I think many of the researchers applying it don't either, heh.

As a statistician myself, I'd agree with that statement.

I've actually said previously - and only half-joking - that p-values should be banned from research journals. The problem with p-values is that they don't mean what people generally assume they do, and because they look so close to what people want them to mean ("the probability that my conclusions are wrong, given my assumptions and data"), it's very easy to project spurious meaning onto them.

In reality, p-values actually provide very little information, and the information they do provide is generally not of relevance. But they're so commonly-used that it's very hard to convince people to use more sophisticated techniques for reporting and modeling information.


I guess it won't change until the 'peer-review' process for every article includes a statistician, and papers actually get rejected for improper or insufficient use of statistics.

I think academia these days forces researchers to really care about little except getting grants and getting published (with the former affected by the latter) -- caring about using statistics properly (let alone the actual validity or usefulness of their findings) will hurt rather than help their careers unless it affects one of those two things positively.


> To measure this bias, my wife and I spent the next 7 days rolling dice while eating dinner.

Some people have very different relationships with their spouses than I do.


I thought that was weird as well. They only had to do like 100 rolls to have enough data to draw conclusions from.


100 rolls is not enough. If you keep the same ratio, you can compute the p-value and I'm sure it will be way over 0.05.


The other clue players can use to detect cheating is your behaviour. They can notice that you are favouring the higher numbers. This is also what can catch people in gambling games. The betting patterns of card counters are noticeable - and the eudaemonic folks who found how to predict tilted roulette tables bet on sectors of the wheel - which again is a distinctive pattern.

So it's pretty much always those behavioural cues that other people will use to suspect cheating rather than the p-value.


If I'm playing with someone that is making AK47's into utensils, I would think twice about raising my concerns regarding the loaded dice.

Hmm the dice are behaving funny, but that's fine, everything is just fine...

https://izbicki.me/blog/turning-an-ak-47-into-a-serving-ladl...


Ah yes, the "let the Wookiee win" strategy. You lose all the games, but you do get to keep playing with both your arms.

I tried strategically throwing games a long time ago. It backfired after I conceded defeat, and rather than bask in victory, the "winning" player gave me a narrowed-eyes suspicious look and flipped over my hidden cards. They made it obvious that I would have won already if I had made the optimal play on my previous turn. She was pissed, because even though she won the game, she didn't beat me, because I wasn't even playing the same game as everyone else. We have some seriously competitive people in my family, and we had a good argument about whether one was required to make a guaranteed winning move if it was possible to do so.


> betting patterns of card counters

Large changes in bet-size are an obvious indicator, so card counters must work in teams. One is the counter, the other gambler is the whale who bets large when the count is favorable.

Unfortunately, the casino has a simple defense: table limits. The whale must sit at the high-roller table, the counter must sit elsewhere. Count cards as much as you like, but you'll never make money at a modern casino ("you" being the aggregate of all card counters, not an individual).


The dice should have been tested before changing them. They might not have been perfect dice to begin with. Also, the lack of automation is disturbing. I find it shocking that no robot arm or pattern recognition was used to do the tests. Poor wife indeed! "If only my husband were a bigger geek," she must've thought.


He got his wife to do it. That's the original automation.


By this analysis, you can average between 5 and 15 more resources over the course of the game vs. unloaded dice, and the article then goes on to show that there aren't enough dice rolls in a game to prove the dice are loaded.

The problem here is that you don't get the expected number of resources in every game, and there's no analysis of the variance. I suspect that the result of insignificance is correct, in that this "cheat" provides such a slight advantage that it won't materially affect a single game. By the time you've played enough games to reliably use the advantage, your opponents will have seen enough die rolls to show their bias.
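A rough way to see how the game-to-game variance compares to the edge: simulate how often the favored face comes up over ~60 rolls, once with fair probabilities and once with a made-up bias. All of the numbers below are hypothetical, not taken from the article:

    import numpy as np

    ROLLS_PER_GAME = 60                            # roughly the number of rolls in one game
    FAIR = np.full(6, 1/6)
    LOADED = np.array([2, 3, 3, 3, 3, 4]) / 18     # hypothetical: 6 comes up ~22% of the time

    def sixes_per_game(probs, n_games=10000):
        rolls = np.random.choice(np.arange(1, 7), size=(n_games, ROLLS_PER_GAME), p=probs)
        return (rolls == 6).sum(axis=1)

    fair, loaded = sixes_per_game(FAIR), sixes_per_game(LOADED)
    print(fair.mean(), fair.std())       # ~10 sixes per game, sd ~2.9
    print(loaded.mean(), loaded.std())   # ~13 sixes per game, sd ~3.2

The game-to-game spread is roughly the same size as the shift in the mean, which is why a single game tells you very little either way.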


> By the time you've played enough games to reliably use the advantage, your opponents will have seen enough die rolls to show their bias.

I have a hard time believing that after 100 games someone would say "you know what, it appears to me that sixes have been rolled slightly more over the previous 100 games than I would expect", and an even harder time believing the next logical step would be "you must have altered the dice!"


Looks like you only play boardgames with casuals.


If you're OP, you keep a running mental tally of the other players' resource cards, so it is immediately obvious which rolls have been coming up more often than chance, and which ones have been coming up less.

There are 21 different ways to load 2 dice with one weighted face each. Here's a table showing which rolls are advantaged or disadvantaged, according to which faces are weighted.

       02 03 04 05 06 07 08 09 10 11 12
  1,1   <  <  <  <  <  <  >  >  >  >  >
  1,2   <  <  <  <  <  =  >  >  >  >  >
  1,3   <  <  <  <  <  =  >  >  >  >  >
  1,4   <  <  <  <  <  =  >  >  <  >  >
  1,5   <  <  =  =  <  =  >  =  =  <  >
  1,6   <  =  =  =  =  >  =  =  =  =  <
  2,2   =  <  <  <  =  <  =  >  >  >  =
  2,3   =  <  <  <  <  =  <  >  >  >  =  <<< GOOD
  2,4   =  <  =  <  >  =  >  >  =  >  =
  2,5   =  =  <  =  =  >  =  =  <  =  =
  3,3   =  =  <  =  >  <  >  =  >  =  =
  3,4   =  =  =  =  <  >  <  =  =  =  =  <<< BEST
  4,4   =  =  >  =  >  <  >  =  <  =  =
  5,2*  =  =  <  =  =  >  =  =  <  =  =
  5,3   =  >  =  >  >  =  >  <  =  <  =
  5,4   =  >  >  >  <  =  <  <  <  <  =  <<< GOOD
  5,5   =  >  >  >  =  <  =  <  <  <  <
  6,1*  <  =  =  =  =  >  =  =  =  =  <
  6,2   >  <  =  =  >  =  <  =  =  <  <
  6,3   >  >  <  >  >  =  <  <  <  <  <
  6,4   >  >  >  >  >  =  <  <  <  <  <
  6,5   >  >  >  >  >  =  <  <  <  <  <
  6,6   >  >  >  >  >  <  <  <  <  <  <
* duplicated to show symmetry

So for this game, I'd probably weight 3 and 4, to advantage 7s over 6s and 8s, and build more on the 5s and 9s. Most players that have no knowledge of the load of the dice will prefer to have at least one settlement on a 6 or 8. The 2,3 and 5,4 pairs also disadvantage 6 and 8, with bias for higher or lower numbers.
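For reference, here is a minimal sketch of one way a table like this could be computed, under the simplifying (and hypothetical) model that weighting face f just moves some probability from f to the opposite face 7 - f:

    import numpy as np

    def face_probs(weighted_face, delta=0.05):
        # Toy model: the weighted face lands down more often,
        # so the opposite face (7 - weighted_face) comes up more often.
        p = np.full(6, 1/6)
        p[(7 - weighted_face) - 1] += delta
        p[weighted_face - 1] -= delta
        return p

    def sum_distribution(p1, p2):
        dist = np.zeros(13)
        for i in range(1, 7):
            for j in range(1, 7):
                dist[i + j] += p1[i - 1] * p2[j - 1]
        return dist

    fair = sum_distribution(np.full(6, 1/6), np.full(6, 1/6))
    loaded = sum_distribution(face_probs(3), face_probs(4))   # the "3,4" row

    for total in range(2, 13):
        diff = loaded[total] - fair[total]
        sign = ">" if diff > 1e-12 else "<" if diff < -1e-12 else "="
        print(total, sign)

With this toy model, the 3,4 row comes out as "= = = = < > < = = = =", matching the row marked BEST above; the actual model used to build the table may of course differ.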


That must be it :)


Note that if your only goal was to test whether the dice were loaded, using a test against only one outcome (in this case, only 6s) is not the best way.

Use either a Kolmogorov-Smirnov test[1] or the Anderson-Darling test[2].

The intuition is that these tests are more powerful because they compare the entire empirical distribution against the expected probability mass function. You're using 'all the numbers' simultaneously to check for cheating.

Funnily enough, I first learned about these tests nearly a decade ago, precisely because I wanted to know whether Settlers dice were loaded.

[1] https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_tes...

[2] https://en.wikipedia.org/wiki/Anderson%E2%80%93Darling_test


I think for this case (with 6 discrete outcomes) you could just use Pearson's Chi-Square test. What you describe is general enough for continuous distributions.

https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
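A quick sketch of what that looks like with scipy; the face counts below are made up, not the article's data:

    from scipy.stats import chisquare

    # Hypothetical counts of each face from 600 rolls:
    observed = [95, 102, 98, 90, 97, 118]

    # Under the null hypothesis of a fair die, each face is expected 100 times.
    stat, p_value = chisquare(observed, f_exp=[100] * 6)
    print(stat, p_value)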


My old boss was into Settlers, and they stayed late to play some games a few times. I got invited once and weird stuff happened.

Somehow I ended up next to both 6's on the map.

As near as I can tell, one of the dice we used would not land on 4. Whatever was going on, the number of 6's and 9's rolled that night were abnormally high. By the time people figured out we were rolling as many 6's as we had 7's and 8's combined, I already had a town on the coast and pretty much steamrolled everybody with sheep and wheat.

Really the game wasn't that fun at that point, so I just tried to end it as fast as possible.

When you're playing an RPG, everybody seems to gravitate toward their 'lucky dice' which are most likely defective in the right way. But if you're playing a game with others you probably want the fairest dice you can find.

I remember years ago seeing a sales video from some retired aerospace engineer that was making geometrically perfect dice. He'd worked out the resins so they cured uniformly. He'd stack his dice next to some random set and point out how the other guy's stack curved to one side while his was perfectly straight.

Probably didn't sell a lot of those to DND players, except perhaps GMs.


> Probably didn't sell a lot of those to DND players, except perhaps GMs.

GM's don't have to show their dice rolls and can overrule the outcome if it fits their storyline better, so the rolling is mostly ceremonial anyway.


I'd be interested to see the results by loading the 1-2 edge of the die, or even the 1-2-3 corner. The former should result in more 5s and 6s, and the latter more 4s, 5s, and 6s. And since the weight would be not centered on a face, would make the die more likely to tumble over to a preferred face if the die had to shed its sideways momentum. The unbalanced die would prefer to tumble or skip when a non-preferred face shows, and slide or rock when a preferred face shows.

The problem, of course, is that there is no way to prevent your opponent from accidentally benefiting from your cheating. If you have biased the dice to favor 8s over 6s, they might just plop down on an 8 before you can place your settlements. You would have to use two sets of loaded dice, one biased high, and the other biased low, and swap them out after first settlements are placed.

Also, in my experience, the other players will still accuse you of cheating whether they can prove it scientifically or not, because they prefer to believe that you are better than they are at being sneaky and underhanded than better at honest strategy.

For instance, Clue (aka Cluedo) is a game where one skilled player can repeatedly curbstomp lesser skilled players. If you do, they then grab your player sheet to look at all the strange and indecipherable symbols you have on it that are not simple Xs, and they accuse you of cheating. You can either explain the mathematical advantage of the additional information you record, lose all future advantage against those players, and get accused of "breaking the game", or you can remain silent and still not be able to play it again because you're a "cheater".

Besides that, what self-respecting board gamer doesn't just leave the crappy wooden dice in the box, and use the good dice from their dice bag?

I like this analysis, though. It reminds me of the guy who designed a one-sided die.


This post explains, in a very simple way, the standard scientific method (p-values) for deciding whether a result is significant, and the problems with that method.

It's crazy that by using a p-value of 0.05, it means that 5% of all scientific results might be false.


This is not true — it’s actually worse.

The 5% figure means that, when there is no signal to detect, we have a 5% chance of falsely claiming there is one. It does not say anything about the case when there is a signal and we do not detect it, which is known as the type II error rate.

With reasonable assumptions about sample size and the fraction of times there really is a signal, you can find that the majority of published results are false:

http://journals.plos.org/plosmedicine/article?id=10.1371/jou...
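A back-of-the-envelope version of that argument, with illustrative numbers rather than anything from the paper:

    alpha, power, prior_true = 0.05, 0.80, 0.10   # illustrative values only
    false_pos = alpha * (1 - prior_true)    # share of all studies that are false positives
    true_pos = power * prior_true           # share of all studies that are true positives
    print(false_pos / (false_pos + true_pos))   # ~0.36: over a third of "significant" results are false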

I have written an intuitive explanation of this, and a bunch more, in my book: https://www.statisticsdonewrong.com/p-value.html


Great book, thanks for writing it!


> It's crazy that by using a p-value of 0.05, it means that 5% of all scientific results might be false.

That would only be the case if scientists were robots who immediately published anything with a p-value up to 0.05. They're not, though. If they get clearly nonsensical results, they will obviously re-evaluate it. In other words, the p-value doesn't incorporate the fact that the experiment passed sanity checks in your own head (and the reviewers') before it was published. (And yes, there are bad actors in every field who game the system, but my point still stands.)


From what I've heard on HN, scientists are actually robots who massage their data until they get a p-value <0.05 and then immediately publish.


No, the publishing process takes a long time. Sometimes it could be years.


Yeah, but in Russia they use nine women to produce a baby in only one month.


This sounds like some kind of comment about divisibility of the work to publish something. I don't get it though.

After the bulk of the paper is written, I can easily proofread, typeset, etc. everything myself in less than a week. Now get someone else to double-check that. Let's say that is another week.

After that the only thing is to get someone worthwhile to spend some time on your paper and point out anything confusing or erroneous. Granted, this could take a month or so of study. However, I never really saw that happen in practice. In reality you would be lucky to get people to glance over it one evening.

So what is taking so long?


In my experience a significant fraction of the time it takes to publish a paper is spent waiting for the journal. During that time you can do other useful research. The long delay between submitting, getting through the reviewers and the actual publication is one of the reasons why for example in CS a lot of the interesting stuff happens in conference publications with fast turnarounds and the journal versions of the same paper appear a year or two later.


>"waiting for the journal"

Yes, what are they doing?


I suspect this factor is balanced, or completely overruled, by the scientists who get p-values greater than 0.05, decide that the result doesn't pass their sanity check (it clearly should be significant!), and collect more data or tweak the methods until it's significant.


I think that if scientists were robots, results would be much better than we have today - robots don't care about careers and grants.


Depends on the field. Some fields are very careful about this, while others are not.


A 5% false discovery rate is what you get if the a priori probability of each result is 50%. If a journal wants to publish only surprising results, and accepts p=0.05 with a good methodology as true, it will have a higher rate of false claims, because the most interesting things are more surprising than a coin toss.

And if you slice the data from a single experiment in 40 independent ways, your chances to get something with purely random significance p<0.05 are better than 50% for a single study…
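A quick check of that figure, assuming the 40 slices really are independent tests:

    print(1 - 0.95 ** 40)   # ~0.87, i.e. better-than-even odds of at least one spurious p < 0.05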


It doesn't "mean" that the a priori probability is 50%; I guess what you want to say is that if we consider the a priori probabilities to be equal between the two hypotheses, then we can use the 5% p-value.

But if the hypothesis is a priori much less likely, we might need a much lower p-value to be sure.

Edit: my comment doesn't mean much since you edited your first sentence.


>"It's crazy that by using a p-value of 0.05, it means that 5% of all scientific results might be false."

How so? I don't see any connection between significance level and % of false scientific results at all.

If you assume the "null hypothesis" is always true, then 5% of the results should falsely say otherwise. Of course, this is if all the assumptions behind the math hold, no p-hacking, etc.

However, that is like saying it is extremely rare for there to be a correlation between any two phenomena. We don't live in that universe. In our universe, correlations are extremely common:

>"These armchair considerations are borne out by the finding that in psychological and sociological investigations involving very large numbers of subjects, it is regularly found that almost all correlations or differences between means are statistically significant. See, for example, the papers by Bakan [1] and Nunnally [8]. Data currently being analyzed by Dr. David Lykken and myself, derived from a huge sample of over 55,000 Minnesota high school seniors, reveal statistically significant relationships in 91% of pairwise associations among a congeries of 45 miscellaneous variables such as sex, birth order, religious preference, number of siblings, vocational choice, club membership, college choice, mother's education, dancing, interest in woodworking, liking for school, and the like. The 9% of non-significant associations are heavily concentrated among a small minority of variables having dubious reliability, or involving arbitrary groupings of non-homogeneous or nonmonotonic sub-categories. The majority of variables exhibited significant relationships with all but three of the others, often at a very high confidence level"

-Theory testing in psychology and physics: A methodological paradox. http://www.fisme.science.uu.nl/staff/christianb/downloads/me...


Obligatory xkcd: https://www.xkcd.com/882/


This is a neat project! The statistical test used could be a lot more sensitive, by taking advantage of information about the physics of biased dice. They test whether 6 is rolled more often than it should be relative to the other 5 numbers, but they should instead look at whether it's rolled more often relative to 1. Since these are on opposite sides of the die, and unfairness in 6-sided dice comes from uneven distribution of weight, these go together; you get to combine the information from too-many 6s and too-few 1s and get a test that needs only about half as many rolls to reach a given confidence level.
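A minimal sketch of that idea, with made-up counts rather than the article's data: conditional on a roll showing a 1 or a 6, a fair die shows the 6 half the time, so the pair of counts can be tested directly against a Binomial(n, 1/2):

    from scipy.stats import binom

    sixes, ones = 120, 80                     # hypothetical counts of 6s and 1s
    n = sixes + ones
    p_value = binom.sf(sixes - 1, n, 0.5)     # P(at least this many 6s | fair die)
    print(p_value)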


I'm the author, and I totally agree!

Originally, I was going to point out other null hypotheses that we could try to reject. This has the advantage that some of them (like the one you propose) model the physics. And it also has the disadvantage that if we propose too many hypotheses to test, then we will be more likely to get false positives. But the article was already too long, and this is a HUGE can of worms to open.


I really enjoy these fun thought experiments.

The only thing I didn't see clarified though is whether the die was loaded using the water only once, or whether they re-loaded the die every night during the week they were testing.


Hi, author here. I loaded the dice only once. They weighed 0.65 ounces beforehand, and 0.75 ounces afterward. They kept that weight the whole week.


So a suspicious player could detect the loaded dice simply by weighing them? That actually makes a lot of sense. If you look into actual dice/coin manufacturing they do not run tests like "roll a sample of dice a million times each, then check p-value". Instead they have a clever manufacturing process with stringent tolerances. If they run such tests at all it would be to find the upper bound on deviation from fairness over the lifetime of the dice (eg 100k rolls).


I wonder if there are any good, simple procedures that could be applied during a game to mitigate against biased dice (without giving up on physical dice)?

For games that only use one die, perhaps something like generating a random number in 1 <= r <= 6n[1], using some procedure that all the players have input to so they can all agree to trust the number, and then shifting all die roll results by that number? So if the number were 3 and the die came up 2, it would count as 5.

That would not remove the bias from the die, but it would move it to a different number. For games where some numbers are consistently good for you and some are consistently bad for you that would be enough to make it so that biasing the die does not work in the long run--some games it would end up in your favor and some it would end up against you.

For games with two dice, that might not work as well. If a single random number were picked and used to shift both die results, it would shift the bias just like with the single die game.

However, the dice will still be biased to come up matching, and in some games matching numbers on the two is significant.

One might try to address this by generating two random numbers at the start of the game, one for each die, for the shifts. That would have bias against getting a match on the two dice, so would provide an advantage in games where a match is bad.

[1] Generalizing to n-sided dice is left as an exercise for the reader.
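A minimal sketch of the shifting idea described above, for a single six-sided die; the agreement step here is just a stand-in for whatever procedure the players actually trust:

    import random

    SIDES = 6

    def agree_on_shift(n_players=4):
        # Stand-in: each player contributes a number and the sum is reduced mod 6.
        return sum(random.randrange(SIDES) for _ in range(n_players)) % SIDES

    def shifted_roll(physical_roll, shift):
        # Map the face that physically came up to the face that counts.
        return (physical_roll - 1 + shift) % SIDES + 1

    shift = agree_on_shift()
    print(shifted_roll(2, shift))   # with shift 3, a physical 2 counts as 5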


Unbiasing a coin is easy (the von Neumann technique): toss it twice; if the results are different, output the first toss. If the results are the same, forget both and start again.

You can use this technique to simulate a fair 4- or 8-sided die by tossing the biased coin multiple times (apply the above technique for each digit in base 2).

To produce a fair 6-die from a biased coin, the following should work.

- Using the von Neumann technique, create 3 independent unbiased coins X, Y, Z. Then D = 1 + X + 2Y + 4Z is uniform in {1,...,8}; we basically randomize each digit in base 2.

- Finally, throw the unbiased 8-sided die until you obtain a number different from 7 or 8. The result is uniform in {1,...,6}.

    import numpy as np
    
    def biased_coin():
        # A coin that shows 1 with probability 0.9 and 0 with probability 0.1.
        return np.random.choice([0,1], p=[0.1,0.9])

    def unbiased_coin():
        # Von Neumann trick: toss twice; if the tosses differ, keep the first one.
        a = biased_coin()
        b = biased_coin()
        if a != b:
            return a
        else:
            return unbiased_coin()

    def unbiased_eight():
        # Three unbiased bits give a uniform number in {1,...,8}.
        return 1 + unbiased_coin() + 2*unbiased_coin() + 4*unbiased_coin()

    def unbiased_six():
        # Rejection sampling: re-draw 7s and 8s to get a uniform number in {1,...,6}.
        d = unbiased_eight()
        if d < 7:
            return d
        else:
            return unbiased_six()

    np.bincount([unbiased_six() for i in range(6000)])

Edit: there was a mistake in the previous version of the comment because the digits in base 2 were not independent. The last version is correct, I believe :)


Given a die with an unknown bias, you can generate a stream of unbiased random bits [1].

The method is an extension of von Neumann's trick for un-biasing a biased coin in which you flip it multiple times and take advantage of symmetry (specifically flip it twice and return heads if you got heads then tails, return tails if you got tails then heads, or repeat the procedure if you obtained two heads or two tails).

You can then go from this bit-stream to a number in {1,...,6} using the trick jknz mentions below (take 3 successive bits, and interpret them as a binary number between 0 and 7, add one to get a number between 1 and 8, and start again if the result is 7 or 8).

[The method in the paper is apparently asymptotically optimal (as the number of faces on the die goes to infinity). However, for small numbers of faces I think it is slightly worse than the method jknz described. Consider the 4-faced die: the Peres method converts a pair of die rolls into one or zero bits, but jknz's method would sometimes produce two bits (the rolls 1,4/2,3/3,2/4,1 are binary 00,11/01,10/10,01/11,00, resulting in sampled bits 0,0/0,1/1,0/1,1). Working out all the cases shows this makes the method slightly more efficient.]

[1]: http://dx.doi.org/10.1109/TIT.2014.2381223


One (cumbersome) solution is for all players to provide their own dice. Whenever a die roll occurs, all players roll one die at the same time, and the totals are summed and the remainder is taken after dividing by the number of players. So, as long as at least one player's die is fair the outcome will be fair.

If you're playing Catan by email, this approach can be extended with bit commitment. Each player rolls a die and sends a message to the other players with the number they rolled signed with their own digital signature but also encrypted with a key randomly chosen for that round of die rolls.

None of the other players can know what the other players rolled without the key. Once everyone has sent all the other players their rolls (i.e. they've "committed" to a particular die roll but the other players don't know what it is yet), they exchange keys to reveal the numbers they rolled.
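The comment describes commitments built from signatures and encryption; a simpler hash-based commitment captures the same commit-then-reveal idea. This is only a sketch, with hypothetical helper names:

    import hashlib, secrets

    def commit(roll):
        # Commit to a roll by hashing it together with a random nonce.
        nonce = secrets.token_hex(16)
        digest = hashlib.sha256(f"{roll}:{nonce}".encode()).hexdigest()
        return digest, nonce          # send the digest now, keep the nonce secret

    def verify(digest, roll, nonce):
        return hashlib.sha256(f"{roll}:{nonce}".encode()).hexdigest() == digest

    digest, nonce = commit(4)         # email the digest to the other players
    # ...once everyone has committed, each player reveals (roll, nonce)...
    print(verify(digest, 4, nonce))   # True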


> the remainder is taken after dividing by the number of players.

Minor nitpick: you should take the remainder after dividing by the number of sides on a die, with the understanding that zero == max possible roll.
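With that correction, the combining rule is just (sketch, assuming six-sided dice):

    SIDES = 6

    def combined_roll(rolls):
        # Sum everyone's roll and reduce mod the number of sides;
        # a remainder of 0 counts as the maximum face.
        r = sum(rolls) % SIDES
        return SIDES if r == 0 else r

    print(combined_roll([3, 6, 2, 5]))   # 16 mod 6 = 4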


Yeah, that's right I think.


I think you raise a very interesting question, and I like your initial solution.

I feel like a game is fair if all players know that a die is biased, but no player has an advantage in predicting the outcome.

To achieve that: for each die, for each roll have the players agree on a random permutation of the observed faces to actual faces for that roll. As a result even if all players know that 6 will land with probability 1 in the original throw, after permuting each side has probability 1/6.

The agreement algorithm: have the parties each propose a permutation, and compose the permutations together in order to get a final "unbiased" permutation. I think if there is at least one honest party, this will work.

Another approach is to let everyone sample the dice before playing until they are satisfied, and let everyone adjust their strategy accordingly.
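A minimal sketch of the composition step, assuming the proposals are collected by some procedure the players all trust:

    import random

    def propose_permutation():
        # Each player proposes a random relabeling of the faces 1..6.
        p = list(range(1, 7))
        random.shuffle(p)
        return p

    def compose(perms):
        # Apply the proposed permutations one after another.
        table = list(range(1, 7))
        for p in perms:
            table = [p[f - 1] for f in table]
        return table

    table = compose([propose_permutation() for _ in range(4)])   # hypothetical 4 players
    observed_face = 6
    print(table[observed_face - 1])   # the face that actually counts this roll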


Presumably the person who supplies the dice knows how they're loaded. With a shift, they might still be able to use that information to their advantage. Instead, you could use the agreed-upon randomness at the beginning of the game to apply numeric labels to the faces of the dice. As long as the labels themselves aren't loaded... and the person who supplied the dice would still have information about the distribution.


Could it be that your dice were biased to begin with? Or had you already rolled them ~5000 times before you placed them in water?


I was wondering this; is a control unnecessary here as we assume an equal distribution, or is that assumption false?


I think it's false, because every mechanical device has some variation; but is it so small that it doesn't "affect" the result?


Einstein: "God does not play dice."

Hawking: "God definitely plays dice, but He sometimes confuses us by throwing them where they can't be seen"

Izbicki: "And if the dice are loaded, we can't prove it."

There is some plain & profound truth in this article underlying some pretty cool math.


I really like Settlers of Catan, but I have never found a nicely designed online site to play it. Does any of you know of anything?

Also surprised that there's no "free" + "open source" version of it, like FreeCiv or FreeCol or FreeOrion.


Tabletop Simulator is good, once you get used to it and learn a few keyboard shortcuts! Only downside is it's usually $30, and I am not 100 percent sure the Catan ports are safe to remain up forever. If anyone has insight on the legalities of Tabletop Simulator, I am curious.


In the US it's $20, with discounts to $10 (look for the upcoming Steam sale). I am not qualified to speak on legality but a cursory search suggests that the catan ports are technically not safe... still, there are many versions in the workshop as catan is such a popular game. It's hard to imagine that they all get pulled without replacement.


Thanks for the reply. I hope they continue on there!

>In the US it's $20, with discounts to $10 (look for the upcoming Steam sale).

Thanks, looks like it will finally be time to gift it to a friend!


I'm a little late with this, but my brother, cousins and I have been playing at http://games.asobrain.com/ for years. For copyright reasons, it's Explorers there.

They also have Carcassonne, which they call Toulouse.


http://www.playcatan.com had a pretty good client... too bad it's not open for new registrations any more. They have Catan Universe now instead - not sure how good it is, though.


What about Pioneers (http://pio.sourceforge.net/)?


A hypothesis was presented: using higher numbers will give 5-15 more resource cards over the course of a game. Then, conclusions were drawn that standard statistical techniques are deficient for not being able to detect the bias.

Where are the results of the actual experiment, say of 10 games played using the biased strategy, against a control opponent? Perhaps, standard statistical techniques are correct.

Without empirical tests of the hypothesis, where is the science? Perhaps, the dice are only loaded for a short time, in which time all the skew occurred. Without testing of the hypothesis, it is impossible to know.


A nice mitigation technique is for everyone to bring their own dice, put them into a hat and require random selection of two of them per turn.

Of course if you suspect your friends are cheating you have bigger problems.


This solves the problem by reducing it to a trusted random number generator (the hat). At that point you may as well skip a step and draw individual outcomes from the hat.


I don't get this.

>It’s impossible for your opponents to scientifically prove that you’re cheating.

Then two paragraphs later you scientifically prove you were cheating (simply roll dice thousands of times).


> while playing a game.

Your opponents can't prove cheating based on the game alone.


Which is kind of a dumb thing to say. They even note that a typical Catan game only has 60 rolls. You could have many suspicious looking rolls that would result from fair dice. The fact is that 60 rolls isn't going to be enough even if you were cheating a lot more than what was shown in this post.


Because you don’t roll dice a thousand times during the game.


What does the calculation of advantage mean there? The advantage per one nice and one naughty roll (a pair of rolls) is multiplied by the total number of single rolls in a 4-player game.

And I guess an honest player would pick the nice locations as often as naughty ones (there is no reason to actually prefer them). I don't know the rules, so I don't know if there are enough naughty locations to roll only on them without _that_ being suspicious.


Cool analysis. Doesn't every player use the same die for a game like Catan?


Yes, but presumably only you would know the bias of the dice and would exploit it by preferentially building next to higher numbered spaces.


> posted on 2018-12-14

OP cheats at time too, posting this a full year ahead of schedule. :)


I can finally beat my in-laws at this game this holiday season!


Skilled players can beat unskilled players regardless of the bias of the dice. It doesn't matter if you get 5-15 more brick cards if it's an ore game. And how do you keep the other players from building on the bias-favored numbers?

If you're going to cheat, you're better off sleeving your most critical resource cards, to protect them from 7-rolls, soldiers, and monopoly cards taking them from your hand. You unsleeve and swap when you get a favorable resource roll, and hope nobody is OP enough to keep a running total of all opponent resources in their head.


The problem is, how do you know those numbers are bias-favored if your sample size is a few dice rolls? And if there is a bias in those few dice rolls, there's no guarantee it will continue. Sure, a more skilled player will always beat a much less skilled player, but the randomness of the dice certainly closes the skill gap.

It's much different than a game like Agricola or Dominion where you're playing 100% against what the other people do and not relying on the outcomes of 4 dice rolls to determine what you want to do on your turn.


The randomness of the board set up partially cancels the bias you introduce into the dice. There are 20 different possible numeric token layouts, and each of those has trillions of different possible tile combinations.

You really have no practical way of knowing in advance which resource tiles will be under the number tokens you have biased for by loading the dice. So cheating in this fashion doesn't really provide that much of an advantage over the other players. Since the intentional cheating is scientifically indistinguishable from randomness, all good players will already be able to compensate for unfavorable variations in die rolls.


It's actually ( 12864852000 * 20 ) possible board combinations. Not quite trillions; more like a quarter of a trillion.


Now I have something else to blame besides luck when I lose


Great article! One nitpick, however, on the informal definition of a p-value: the p-value is the probability of getting results at least as extreme as the ones we observed, assuming the dice are not biased.
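Stated that way, the p-value can even be estimated by brute-force simulation (illustrative numbers only):

    import numpy as np

    observed_sixes, n_rolls = 25, 100   # hypothetical observation

    # Simulate many batches of fair rolls and count how often we see
    # a result at least as extreme as the one observed.
    sims = np.random.randint(1, 7, size=(100000, n_rolls))
    sixes_per_sim = (sims == 6).sum(axis=1)
    print((sixes_per_sim >= observed_sixes).mean())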


Our set came with plastic dice. Crap!




