A maximally-dense encoding for n-choose-k (2013) (farside.org.uk)
58 points by bmc7505 on May 20, 2022 | hide | past | favorite | 28 comments



At the heart of the Opus audio codec is a maximally-dense encoding for combinations with replacement and sign[1], which is n-choose-k where elements can be reused and there is a sign for each chosen dimension. Or, stated another way, n-dimensional integer vectors whose absolute values sum to k.

This enumeration can be implemented with the same kind of recursion as in the link, with a little bit of elaboration. Interestingly, it (and the formula in the link) can also be implemented with recursive table lookups quite efficiently, and for small fixed N with a closed-form formula (also true for the simpler combination code in the link).

These maximally dense algebraic codes can be designed for a great many possible applications.

For example, for a generational rolling cuckoo filter, Pieter Wuille and I came up with an algebraic code for coding 4 sorted generation numbers with the requirement that all 4 are within a window of half the total range[2]. Prior published work on cuckoo filters used a large table of all possible values of combinations with replacement (to efficiently pack small sorted numbers). We found the algebraic code to be faster than a big dumb table, presumably due to cache locality, even though our fastest encoder/decoder still uses tables, just tiny ones (at least for the sizes we were considering).

A challenge for implementing these sorts of functions is that their inverses often require operations like integer square or cube roots which are not particularly fast unless the ranges are small enough to implement them with tables.
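For example, inverting the k = 2 term of such a code needs the largest c with binom(c, 2) ≤ N. With Python's exact integer square root this stays correct for arbitrarily large N (a sketch with a hypothetical helper name, not code from the linked projects):

```python
import math

def inv_binom2(N):
    # Largest c with c*(c-1)//2 <= N, i.e. the inverse of binom(c, 2),
    # computed with an exact integer square root (no floating point).
    return (1 + math.isqrt(8 * N + 1)) // 2
```

For small ranges this whole function could of course be replaced by a lookup table, which is the point being made above.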

[1] https://github.com/xiph/opus/blob/master/celt/cwrs.c#L74 [2] https://github.com/sipa/bitcoin/blob/202006_cuckoo_filter/sr...
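The size of that codebook (n-dimensional integer vectors whose absolute values sum to k) obeys a simple recurrence; here is a naive memoized sketch of the count (not the production code, which uses precomputed rows/tables):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def V(n, k):
    # Number of n-dimensional integer vectors whose absolute values sum to k:
    # the codebook size for combinations with replacement and sign.
    if k == 0:
        return 1  # only the all-zero vector
    if n == 0:
        return 0  # no dimensions left but weight remains
    return V(n - 1, k) + V(n - 1, k - 1) + V(n, k - 1)
```

For instance V(2, 2) = 8, counting (±2,0), (0,±2) and (±1,±1).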


For those who want to explore these ideas in Python, see the nth_combination() and random_combination() recipes in the itertools docs¹ or just install them from more-itertools².

    # Find the 48,001st combination from 32 choose 8
    >>> nth_combination(range(32), 8, 48000)
    (0, 1, 2, 5, 11, 21, 27, 28)
¹ https://docs.python.org/3/library/itertools.html#module-iter... ² https://more-itertools.readthedocs.io/en/stable/
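A hand-rolled sketch of the same lexicographic unranking idea (my own helper name, not the actual recipe): at each candidate element, count how many combinations begin with it and either keep it or skip past that block.

```python
from math import comb

def unrank_lex(n, k, r):
    # Return the r-th (0-based, lexicographic) k-subset of range(n).
    out = []
    x = 0
    while k > 0:
        c = comb(n - x - 1, k - 1)  # subsets that start with element x
        if r < c:
            out.append(x)
            k -= 1
        else:
            r -= c  # skip all subsets starting with x
        x += 1
    return tuple(out)
```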


I recently wrote a Haskell module [1] for computing with so-called Rankings, which I formalize as a triple consisting of a size, an unrank function that maps integers in the range 0..size-1 to objects, and a rank function that maps objects to integers.

A similar Haskell module [2] was written by Brent Yorgey.

The multinomialRanking function in my module generalizes the binomial rankings considered in the article, as well as permutations. By composing all kinds of rankings together, some very interesting classes of objects can be enumerated.
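In Python terms, such a ranking and a product combinator might look like this (a hypothetical sketch of the concept, not the API of either Haskell module):

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Ranking:
    # A bijection between range(size) and some finite set of objects.
    size: int
    unrank: Callable[[int], Any]
    rank: Callable[[Any], int]

def product_ranking(r1: Ranking, r2: Ranking) -> Ranking:
    # Compose two rankings into a ranking of pairs, mixed-radix style.
    return Ranking(
        size=r1.size * r2.size,
        unrank=lambda i: (r1.unrank(i // r2.size), r2.unrank(i % r2.size)),
        rank=lambda p: r1.rank(p[0]) * r2.size + r2.rank(p[1]),
    )
```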

Even very densely coded chess positions; by nesting 14 different rankings, I obtained a ranking of size ~ 8.7x10^45 that includes all legal chess positions. Sampling millions of random chess positions [3] and determining their legality [4] allowed us to estimate the number of legal chess positions as ~ 4.8 * 10^44.

[1] https://github.com/tromp/ChessPositionRanking/blob/main/src/...

[2] https://hackage.haskell.org/package/simple-enumeration-0.2/d...

[3] https://github.com/tromp/ChessPositionRanking

[4] https://github.com/peterosterlund2/texel/blob/master/doc/pro...


Are there any legal chess positions that cannot be arrived at by a legal move from a legal position?


A legal position is any position reachable from the starting position by a valid game of chess, i.e. a sequence of legal moves.

Thus there are no such positions. Every legal position can be arrived at by playing a legal move from another legal position. That includes the starting position, since Nf3 Nf6 Ng1 Ng8 cycles back to it.


I guess the question is whether there's a chess position where every piece is in a legal position, but which could never be achieved in practice.

I can think of several, for example if you had all 32 pieces on the board, but pawns were behind each other on the same file.


That's not a legal position, since it cannot be reached in a valid game of chess. What do you mean by a piece being in a legal position? A position involves the placement of all pieces, as well as the turn to move, castling rights, and en passant capturability.


"legal chess position" is probably defined in terms of being able to arrive at it by legal moves from legal positions.


If efficiently calculating this mapping is of interest, see also http://www.thelowlyprogrammer.com/2010/04/indexing-and-enume... .

(Python isn't my usual language these days, but this is a great example where the seamless bigint support shines. The number of possible sets grows fast!)


For the open question on bounded number of choices, the idea is still "count the number of smaller combinations".

Given a k-combination C, we first count the number of all possible combinations of less than k elements; this is binom(n, 0) + binom(n, 1) + ... + binom(n, k-1).

Then we simply add the number of smaller combinations that are of k elements; this is the N calculated by the formula in the blog.

The final answer is N' = binom(n, 0) + binom(n, 1) + ... + binom(n, k-1) + binom(C_k, k) + binom(C_{k-1}, k-1) + ... + binom(C_1, 1).
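This can be checked directly in Python (a sketch with hypothetical helper names; `rank_combination` computes the blog's N, with C sorted increasingly as C_1 < … < C_k):

```python
from math import comb

def rank_combination(C):
    # N from the blog: binom(C_1, 1) + binom(C_2, 2) + ... + binom(C_k, k),
    # with C sorted increasingly and elements drawn from range(n).
    return sum(comb(c, i) for i, c in enumerate(sorted(C), start=1))

def rank_bounded(C, n):
    # N' above: offset by the count of all strictly smaller-sized subsets.
    offset = sum(comb(n, i) for i in range(len(C)))
    return offset + rank_combination(C)
```

For n = 6 and subsets of size at most 3, this assigns each of the 1 + 6 + 15 + 20 = 42 subsets a distinct rank in range(42).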


As betrayed by the reference to Google+, this needs a (2013).


I made an interactive Svelte app that will show the decimal number that each permutation maps to, as well as its utf-8 and base64 encoding.

32 choose 8: https://svelte.dev/repl/08e1d6bbaf4f417ea43a168c2b5ecb91

3 Dice: https://svelte.dev/repl/6c9f4d8d5f3b4b3cac08b985f64540a3


I am curious whether there is a similar bijection between all subsets whose size is at most k and the natural numbers. The problem of minimally representing bounded-cardinality subsets is mentioned briefly under “Open Questions”, but I was unable to find a suitable solution. My first guess would be to encode each fixed-length subset family positionally, but this encoding still has some redundancy.


Yes. The formulation in the blog post is a bit messy, but a possibly more general way to think about it is something like:

0. set y = 1. set x = 0.

1. How many possible choices are there _given what we've picked so far_? Call it N.

2. Pick one of them. Call it 'a'.

3. set x = x + a * y. set y = y * N.

4. Repeat steps 1-3 until the number of choices drops to be less than 2.

Decoding is the same thing in reverse. Start with x; x mod the first N gives the first choice. Then divide by N and mod by the number of remaining choices to get the second choice, and so on.

For the case of n choose k, we get the rule that we can't pick anything lower than what has been picked so far. This means the first choice is one of (n-k+1) items. (We can't pick higher than that, as we'd run out of choices before we reached k.)

With this construction, we can see your case is easy. You just need to be able to signal that some choice was the last choice. So each choice is from a set of N+1, being the N possibilities plus "picked nothing". Picking nothing means there are no remaining choices, so it terminates.
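The bookkeeping in steps 1-3 can be sketched like this (hypothetical helper names; each choice is recorded as its index `a` among the `N` options available at that step):

```python
def mr_encode(choices):
    # choices: list of (a, N) pairs -- the picked index a among N options
    # at each step, in order.
    x, y = 0, 1
    for a, N in choices:
        x += a * y  # step 3: fold this choice into x
        y *= N      # ...and widen the radix accumulator
    return x

def mr_decode(x, radices):
    # Reverse: peel off one choice index per radix, first choice first.
    out = []
    for N in radices:
        out.append(x % N)
        x //= N
    return out
```

For decoding a real n-choose-k value, the radices would have to be recomputed from the choices made so far rather than given up front.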


This is an easy formulation but not necessarily optimal. In particular, step 1 is not well defined, and any further definition results in a non-optimal encoding.

Say we have n = 6 and k = 3 and we've already seen 4; what are the possible choices at this point? If you allow 3 here, it would be equivalent to picking 3 first and then 4, and there would be a redundant encoding. You can "fix" this by fixing the order, so without loss of generality let's assume a strictly increasing order. Then the worst case would be (1, 2, 3), where N is respectively 6, 5 and 4, resulting in log_2 120 ≈ 6.91 bits of information entropy instead of the optimal log_2 20 ≈ 4.32 bits.

The real fix would involve counting each subcases. There are 10 strictly increasing sequences starting with 1, 6 starting with 2, 3 starting with 3 and one starting with 4. The step 1 should account for this non-uniform distribution at the first place. The combinatorial number system described in the OP does this automatically.
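Those subcase counts are easy to verify (a quick check, with elements drawn from {1,…,6}):

```python
from itertools import combinations

# Strictly increasing triples from {1,...,6}, grouped by first element:
# picking f as the first element leaves comb(6 - f, 2) ways to finish.
by_first = {}
for c in combinations(range(1, 7), 3):
    by_first[c[0]] = by_first.get(c[0], 0) + 1
# by_first is {1: 10, 2: 6, 3: 3, 4: 1}, summing to comb(6, 3) = 20.
```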


For the first case, you can't pick a number lower than any existing number. So if 4 is already picked, then only 5 or 6 can be picked, but if we pick 6, then there's no third number available to pick, so there is only 1 possibility (picking 5). So it terminates.

The worst case of (1,2,3) is respectively 4, 3 and 2, resulting in log_2 24 ≈ 4.58 bits.

The first choice isn't picking from 6, it's picking from 4. (because if 5 or 6 are chosen, then there's not enough following numbers to be able to choose 2 more numbers). That's the piece about '(n-k+1) choices for the first number'. 5 or 6 are not possible choices given the ordering requirement.


> The worst case of (1,2,3) is respectively 4, 3 and 2, resulting in log_2 24 ≈ 4.58 bits. The first choice isn't picking from 6, it's picking from 4.

Oh, you are correct that the first number has only 4 possible choices. But what about the second number? It can't be 1 (already picked) and 6 (ordering requirement), but other numbers are open to pick. The third number is no different, it can't be 1 and 2 (already picked) but otherwise all other numbers are allowed. The resulting N would be 4, 4 and 4, i.e. 6 bits.

Maybe it helps to see the fuller picture:

  1st  2nd  3rd         2^(expected # of bits)
  
  1    2    3, 4, 5, 6  64
  1    3    4, 5, 6     48
  1    4    5, 6        32
  1    5    6           16
  
  2    3    4, 5, 6     36
  2    4    5, 6        24
  2    5    6           12
  
  3    4    5, 6        16
  3    5    6           8
  
  4    5    6           4
Ideally (and necessarily) the last column should be 20 for all possible cases, but your algorithm can go anywhere from 4 (2 bits) to 64 (6 bits).


Nuts, you are correct that it isn't optimal.

It's not quite as simple as just 2 bits per digit, because the last digit can only be large if the earlier digits are small.

This makes the worst case (1,2,6), which encodes as 48, for 5.58 bits.


Great! Very clear. This should be the TL;DR... or the dense formulation of the dense encoding ;-)


There is.

If you have a bijection between the items of some finite set S and {1,…,n} and a bijection between the items of some other finite set T disjoint from S and {1,…,m}, constructing a bijection between S ∪ T and {1,…,n+m} is trivial. For example, map the items of S to {1,…,n} as before and the items of T to {n+1,…,n+m}, adding n to the value each item of T was originally mapped to.

If you have a third finite set disjoint from S and T, it’s disjoint from S ∪ T, so you can repeat this construction.

“All subsets whose size is at most k” is the union of “all subsets whose size is zero”, “all subsets whose size is 1”, “all subsets whose size is 2”, … “all subsets whose size is k”, and those sets are finite and disjoint, so you can use the above to construct such a mapping.

(As long as the set of common items is computable, “disjoint” is not essential in the above, but dropping it would make the mapping more complex. “Finite” also isn’t necessary, as long as the sets are countably infinite and you have finitely many of them. I think even a countably infinite number of countably infinite sets would work.)
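A minimal Python sketch of the offset construction (hypothetical names, 0-based ranks), assuming each part supplies its size and its own unrank function:

```python
def union_unrank(parts, i):
    # parts: list of (size, unrank) pairs for pairwise-disjoint finite sets.
    # Lay the ranges end to end: the first set covers 0..size1-1, the next
    # covers size1..size1+size2-1, and so on.
    for size, unrank in parts:
        if i < size:
            return unrank(i)
        i -= size
    raise IndexError("rank out of range")
```

The rank direction is symmetric: find which part the object belongs to, rank it there, and add the sizes of all earlier parts.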


How do you map a Natural to the corresponding subset in O(1), while still preserving the compactness of the fixed-size combination encoding? That’s the key property we want from our bounded-length encoding.


Did I miss any requirement about this having to be O(1)?

The solutions in the article certainly aren’t, with their for and while loops.

I think that’s moving the goalposts.


> I think that’s moving the goalposts.

It depends on your cost model. If your cost model is with respect to n, this requirement does not move the goalposts, because the solution in the article does not depend on n and k is fixed. Computing (1) the natural number corresponding to any n-choose-k subset or (2) the subset corresponding to any natural number takes constant time (assuming a constant cost model for all naturals).


Isn’t that mapping ambiguous? Maybe I’m missing something, but if you have a number to decode, say n+4, how do you differentiate between (3,n+1) and (2,n+2) and (1,n+3)?


If you have to decode n+4, you know the result can’t be in S, as its elements are mapped to {1,…,n}.

So, the result is in T. You subtract n, leaving 4, and use whatever bijection you had for T to decode that into an element of T.

If, on the other hand, you had n-4, you know it must decode to an element of S, and directly use its decoder to find that element.


When I read the title, for a moment I thought that this was about finding a maximal subset of n-choose-k vectors for a given distance d, such that all pairs of vectors differ in at least d positions.


Very cool. One use that comes to mind is encoding chords. They usually consist of about 4 (or fewer) notes chosen from 3 octaves (36 notes). It could have applications in machine learning.


> "Last month" oh, ok, something new

> "[...] Google+ post" Wait a minute, this isn't recent at all!



