How does this work if there are multiple covering sets with the same (minimal) number of question paths? Does it matter which minimal covering set you evaluate, or does your strategy need to be some kind of superposition derived from all the minimal sets?
Also, if there are some words that are never picked by your strategy, doesn't that open the door for an alternative set of question paths that only covers the now-reduced dictionary? (cue reference to the Unexpected Hanging Paradox)
In many ways, the entropy is a heuristic for finding a good question path given the known probabilities for each word in the dictionary. The formula that I quoted above can be trivially extended for nonuniform distributions of the chosen words. It sounds like the Nash equilibrium would be designed such that (after eliminating dominated strategies), the expected information gain at each step would be the same for all question paths. I wouldn't want to have to prove it, though.
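For concreteness, here's a quick sketch (hypothetical numbers) of how the expected-information-gain calculation extends to a nonuniform distribution: each question partitions the candidate words, and the gain is the entropy before the question minus the probability-weighted entropy of the resulting groups.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a (possibly unnormalised) distribution."""
    total = sum(probs)
    return -sum((p / total) * math.log2(p / total) for p in probs if p > 0)

def expected_information_gain(word_probs, partition):
    """Expected entropy reduction from a question that splits the
    candidate words into the given groups of probabilities."""
    total = sum(word_probs)
    after = sum((sum(g) / total) * entropy(g) for g in partition if g)
    return entropy(word_probs) - after

# Nonuniform distribution over four candidate words; the question
# splits them into two groups of two (numbers are made up).
probs = [0.4, 0.3, 0.2, 0.1]
gain = expected_information_gain(probs, [[0.4, 0.3], [0.2, 0.1]])
```

In the uniform case this reduces to the usual log-of-group-sizes formula, e.g. a 50/50 split of four equally likely words gains exactly one bit.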
I buy that maximizing information gain coincides with the optimal strategy, so long as you take your opponent's optimal strategy into account via a weighted probability distribution.
The unexpected hanging paradox is a great link, thanks :) awesomely enough though, game theory handles this problem! This is the "why" behind minimax. If John Nash were sentenced by that judge, Nash would decide what day to expect by flipping a coin, and the judge would segfault :p
To answer your question though: if there are multiple paths to the same words (e.g. Tree(1) leads to Union(a,b), Tree(2) leads to Union(b,c), and Tree(3) leads to Union(a,c)), that surprisingly doesn't make the solution any harder... though my links above aren't particularly useful for it.
To find the optimal strategy for the player choosing the word: for every possible nonempty set of words (all 2^#words - 1 of them), calculate the optimal strategy your opponent would use against someone picking uniformly at random from that set, and then choose the set that gives your opponent the lowest expected value.
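That brute force can be sketched on a toy example. Here the "game" is collapsed into a hypothetical payoff matrix (made-up numbers) where each column is one deterministic question strategy for the guesser; the opponent's best response against a uniform pick from a subset is just the best column average over that subset. Note this bakes in the uniform-within-subset assumption, per the caveat in the edit below.

```python
from itertools import combinations

# Toy payoff matrix: payoff[w][q] = 1 if question strategy q wins
# against word w, else 0.  (Hypothetical numbers for illustration.)
payoff = [
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
]

def best_response_value(subset):
    """Guesser's EV when best-responding to a uniform pick from `subset`."""
    return max(sum(payoff[w][q] for w in subset) / len(subset)
               for q in range(len(payoff[0])))

# Enumerate every nonempty subset of words and keep the one that
# gives the guesser the lowest expected value.
words = range(len(payoff))
subsets = [s for r in range(1, len(payoff) + 1)
           for s in combinations(words, r)]
best = min(subsets, key=best_response_value)
```

On this matrix every singleton or pair hands the guesser a sure win, while the full set holds the guesser to 2/3.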
Make sense? [edit: NO! I didn't show that optimal would necessarily come from a uniform random selection from the word bag... and a proof doesn't jump to mind, though my intuition tells me so]
[edit 2: YES! Just assume that the strategy doesn't need to be uniform [though I still maintain it will be], and use gradient descent to find the optimal weighting]
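Since the best-response value isn't smooth, the "gradient descent" here would really be subgradient descent: step away from whichever column the guesser would best-respond with, then project back onto the probability simplex. A toy sketch with a made-up payoff matrix (the renormalisation step is a crude stand-in for a true simplex projection):

```python
# Toy payoff matrix: payoff[w][q] is the guesser's winning probability
# with question strategy q when the hidden word is w (made-up numbers).
payoff = [[1, 0, 1],
          [0, 1, 1],
          [1, 1, 0]]
n, m = len(payoff), len(payoff[0])

def best_response_value(p):
    """Guesser's EV when best-responding to the mixture p over words."""
    return max(sum(p[w] * payoff[w][q] for w in range(n)) for q in range(m))

# Subgradient descent: repeatedly move probability mass away from the
# words hit by the guesser's current best-response column.
p = [1.0 / n] * n
lr = 0.01
for _ in range(2000):
    q = max(range(m), key=lambda j: sum(p[w] * payoff[w][j] for w in range(n)))
    p = [max(p[w] - lr * payoff[w][q], 1e-9) for w in range(n)]
    s = sum(p)
    p = [x / s for x in p]
```

For this symmetric matrix the iterates hover around the uniform mixture, which is consistent with my intuition above that uniform is optimal here.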
That's not even close to giving a practical algorithm, because (1) the set of words is huge and the powerset of words is far huger, and (2) you didn't describe how to find the optimal counter strategy. And actually, I don't think it's necessary to consider any subset of words.
The usual way to find a Nash equilibrium is with linear programming. But for that you have to construct a matrix of size #(deterministic strategies for player 1) * #(deterministic strategies for player 2). Player 1 does not have many strategies (just the number of words in the dictionary), but player 2 has an enormous number of strategies.
So something else is needed. We can try to write an algorithm for finding the expected value of the optimal counter strategy when player 1 is playing a mixed strategy; call it ev(s), where `s` is a game state. Then we would have ev(s) = max_{s' in successors of s} ev(s'). Perhaps this can be done using dynamic programming, but it could lead to far too many states being evaluated. If you could give an approximate upper bound then perhaps some of the paths could be discarded early. Anyway, the Nash equilibrium is then found by minimizing this ev over all mixed strategies of player 1, i.e. all probability distributions over the words. How one would do that is another tricky question.
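Here's what a memoized ev(s) could look like for hangman against a fixed mixed strategy — a toy sketch with a hypothetical four-word dictionary and made-up weights. A state is (candidate words still consistent with the guesses, guessed letters, lives left); each letter guess partitions the candidates by the pattern it would reveal, and a miss costs a life.

```python
from functools import lru_cache

# Toy dictionary and a fixed, made-up mixed strategy for player 1.
WORDS = ("cat", "car", "cot", "dog")
WEIGHT = {"cat": 0.4, "car": 0.3, "cot": 0.2, "dog": 0.1}

@lru_cache(maxsize=None)
def ev(candidates, guessed, lives):
    """Probability that an optimal guesser finishes the word before
    running out of lives, against the fixed mixture restricted to the
    candidates still consistent with the guesses so far."""
    if lives == 0:
        return 0.0
    if len(candidates) == 1:
        return 1.0  # the word is known; correct guesses cost no lives
    total = sum(WEIGHT[w] for w in candidates)
    best = 0.0
    letters = {c for w in candidates for c in w} - guessed
    for letter in letters:
        # partition the candidates by the pattern this guess would reveal
        groups = {}
        for w in candidates:
            key = tuple(c if c in guessed or c == letter else "_" for c in w)
            groups.setdefault(key, []).append(w)
        value = 0.0
        for group in groups.values():
            miss = letter not in group[0]
            p = sum(WEIGHT[w] for w in group) / total
            value += p * ev(frozenset(group), guessed | {letter},
                            lives - 1 if miss else lives)
        best = max(best, value)
    return best

start = frozenset(WORDS)
```

The memoization is exactly the dynamic programming worry above: the state space is the powerset of the dictionary crossed with guessed-letter sets, which blows up fast on anything non-toy.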
One piece of good news is that we can consider each word length separately, as the final Nash equilibrium would always have player 1 choose words of a single length n (we don't know n yet, and it will depend on the dictionary).
Sorry about that. I got too caught up in showing that there was in fact an optimal strategy, and that there were no hangman paradoxes. In response to (2), that was done in the GGP -- you select the question that, over all words, leaves you the least distance from the answer (and if that doesn't cover every answer, then the minimal number of questions to do so). I posted Ruby code in the GGP for a heuristic version; the non-heuristic version would look similar.
What follows is my solution to this game after having a night to sleep on it. It's O(26^26 times the number of words) [because it requires constructing the optimal move for the guesser], so perhaps it's only of interest to myself.
Problem restatement: Imagine instead that we are playing a game where, for doors 1 through m, we must pick a door and hide a prize behind it. Our opponent must then choose from a list of moves, each of which opens a range of doors.
That problem is relatively easy to solve. First form the set of transitive closures on moves, such that any move sharing a door with another move is in the same closure. Next, for each closure, mark every door with the number of moves that can reach it, and progressively remove the doors with the highest mark until doing so would leave no doors. If at any point this process leaves some moves no longer sharing a closure, reform the closures, add them to the list, and process them again in this manner.
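The closure-forming step is just connected components over moves that share a door. A small sketch using union-find, with hypothetical door ranges:

```python
def closures(moves):
    """Group moves (each a set of doors) into transitive closures:
    moves sharing a door land in the same component (union-find)."""
    parent = list(range(len(moves)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    door_owner = {}
    for i, doors in enumerate(moves):
        for d in doors:
            if d in door_owner:
                parent[find(i)] = find(door_owner[d])  # union via shared door
            else:
                door_owner[d] = i

    groups = {}
    for i in range(len(moves)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

moves = [{1, 2}, {2, 3}, {5, 6}]  # made-up door ranges
```

Here the first two moves share door 2 and form one closure; the third stands alone.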
Observe that for any closure remaining, for every pair of doors (a,b) reachable by the same moves, we can discard the second without changing the optimal strategy (because any move the guesser made would be the same to us for both a and b).
Now observe that for any closure remaining, if the dictionary only included words reachable from that closure, picking uniformly at random among the doors not removed by counting or pairing would be optimal (as the same number of moves can reach any remaining door). To see that doors with higher marks would only decrease our EV, simply observe that if we did include such a door, the moves reaching it could be favored by the guesser to increase the guesser's chance of winning.
To complete the solution, we work out the expected value of playing each closure independently, and mix between them in proportion to the inverses of their EVs. (So if we had two closures, one with EV -1 [if we could only choose from words behind this closure's doors, we would always lose], and the other with EV -4/5 [every door can be reached by 4 out of the 5 moves in this closure], we would pick between the two with a ratio of 4:5.)
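The mixing arithmetic in that example can be checked mechanically — weights proportional to the inverse of each closure's absolute EV:

```python
from fractions import Fraction

# EVs of playing each closure on its own, from the example above.
evs = [Fraction(-1), Fraction(-4, 5)]

inv = [1 / abs(e) for e in evs]      # 1 and 5/4
total = sum(inv)                     # 9/4
weights = [w / total for w in inv]   # 4/9 and 5/9, i.e. the 4:5 ratio
```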
For a proof that the above mixing is optimal, see the analysis for the game "Bluff".
Anyone who understood all that, lives in San Francisco, Santa Barbara, or San Diego, and would like to meet up to discuss economics, game theory, algorithms, machine learning, computer science, or world domination over coffee is cordially invited to email me at dls@lithp.org (yes, that's "lisp with a lisp", perhaps puns should have been on that topic list as well :p)