Specific Problems with Other RNGs (pcg-random.org)
36 points by nkurz on Nov 11, 2015 | 38 comments



I went back to the root page of this site to see what PCG is, and I don't understand it. What's the point of an RNG that is "more secure" than an insecure RNG, but less secure than a real CSPRNG? What does "Predictability: Challenging" actually mean?


It means exactly what it sounds like: the author tried to attack the generator, failed, and thus concluded that it must be a Very Hard Problem.

Note that the field of pseudorandom number generation for non-cryptographic purposes is much less rigorous than cryptography. Typically, for a new generator to be accepted as "OK" all it needs to do is pass a number of fixed statistical tests, usually one of the TestU01 batteries [1]. This is usually the only falsifiable claim you get, and if you're familiar with diffusion and how to achieve it, this is easy to work out. Other falsifiable claims include equidistribution, but that is not a very useful guarantee---simply incrementing by 1 will achieve that. The notion of indistinguishability against computationally-bounded adversaries does not exist. All of this contributes to this being a field where crackpottery abounds, and it is hard to distinguish good from bad.
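(To make the incrementing point concrete, here's a throwaway illustration of my own, not from the comment: a bare counter is perfectly equidistributed yet useless as a generator.)

  // A "generator" that is perfectly equidistributed (every 64-bit value
  // appears exactly once per period) yet is trivially predictable and has
  // no diffusion whatsoever.
  #include <stdint.h>

  struct counter_rng { uint64_t x; };
  static uint64_t counter_next(struct counter_rng *r) { return r->x++; }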

For example, here's a generator that uses 1/4 of an AES round per output word, and that passes the Crush battery of statistical tests (which uses ~2^35 samples). Is it a good generator? I would not bet on it.

  #include <stdint.h>
  #include <immintrin.h>

  struct S {
    static const unsigned kAesRounds = 1;
    union {
      uint32_t state_[4 * 4];
      __m128i  words_[4];
    } u_;
    unsigned counter_;
    const __m128i key_;

    S(uint64_t seed)
    : u_{{0}}, counter_{0}, key_{_mm_set_epi64x(seed, -seed)} {
      for(unsigned i = 0; i < 16; ++i) (void)next();
    }

    uint32_t next() {
      const uint32_t output = u_.state_[counter_];
      __m128i t = u_.words_[counter_ & 3];
      counter_ = (counter_ + 1) & 15;
      for(unsigned i = 0; i < kAesRounds; ++i)
        t = _mm_aesenc_si128(t, key_);
      u_.words_[counter_ & 3] = t;
      return output;
    }
  };
[1] http://simul.iro.umontreal.ca/testu01/tu01.html


Nit: you call _mm_aesenc_si128() once per call to next(), right? I don't see how that constitutes "1/4th of an AES round per output word". (You do output 1/4th of an AES block.)

You are, of course, right about the actual point you're making. And calling _mm_aesenc_si128() once per 4 calls to next() may well suffice to pass a statistical test. Then again, even an LFSR passes most statistical tests...
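For instance (a sketch of my own, not from the comment), Marsaglia's plain 64-bit xorshift is linear over GF(2) (essentially an LFSR) and still does respectably on many test batteries:

  // Marsaglia's 64-bit xorshift (shift constants 13/7/17): a purely linear
  // generator over GF(2), i.e. an LFSR in disguise. The seed must be nonzero.
  #include <stdint.h>

  static uint64_t xorshift64(uint64_t *s) {
    *s ^= *s << 13;
    *s ^= *s >> 7;
    *s ^= *s << 17;
    return *s;
  }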


Yes, you're right, I am calling the AES round once per `next()` call. But note that this could be rewritten as

  if(counter_ >= 4) {
    counter_ = 0;
    u_.words_[0] = _mm_aesenc_si128(u_.words_[0], key_);
  }
  return u_.state_[counter_++];
Which only uses one AES call every 4 words (and only one block of storage). Instead, I chose to avoid the `if` and compute blocks ahead of time, which makes for more predictable performance.


Is there a particular reason you would bet against it, or are you just saying that you wouldn't bet on any implementation without proven theoretical properties?

(As an aside, it's probably just my lack of familiarity, but considering how simple the algorithm is I find the syntax surprisingly hard to follow. Is this modern C++?)


Although this thing is able to fool the 200 or so tests from TestU01, it would take a short time for an expert to devise a test that would distinguish it from random, and even recover the key (read: seed). On the other hand, AES-CTR (conjecturally) fools _all_ statistical tests that take less than 2^128 effort, and on top of that it is still very very fast (0.63 cycles per byte on Haswell). So what really was gained? Cryptographic primitives have gotten so fast that the additional speed from these ad hoc generators seems like an unworthy risk. I concede that it can be very fun to try to make them, though.
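To make the comparison concrete, here's a rough single-block sketch (my own; the names and structure are not from the thread) of what AES-128 in counter mode looks like as a generator using AES-NI. A serious implementation would encrypt several counter blocks in parallel to reach throughput numbers like the one quoted; this version is deliberately simple.

  // Sketch of an AES-128-CTR generator via AES-NI (compile with -maes).
  // Hypothetical example code; a real implementation would pipeline blocks.
  #include <stdint.h>
  #include <immintrin.h>

  struct AesCtrRng {
    __m128i rk_[11];      // expanded AES-128 round keys
    uint64_t ctr_ = 0;    // counter-mode block counter
    uint32_t buf_[4];     // one encrypted block of output
    unsigned used_ = 4;   // force a refill on the first call

    // Standard AES-128 key expansion step using AESKEYGENASSIST.
    static __m128i Expand(__m128i k, __m128i g) {
      g = _mm_shuffle_epi32(g, _MM_SHUFFLE(3, 3, 3, 3));
      k = _mm_xor_si128(k, _mm_slli_si128(k, 4));
      k = _mm_xor_si128(k, _mm_slli_si128(k, 4));
      k = _mm_xor_si128(k, _mm_slli_si128(k, 4));
      return _mm_xor_si128(k, g);
    }

    explicit AesCtrRng(uint64_t seed) {
      rk_[0]  = _mm_set_epi64x(seed, ~seed);
      rk_[1]  = Expand(rk_[0], _mm_aeskeygenassist_si128(rk_[0], 0x01));
      rk_[2]  = Expand(rk_[1], _mm_aeskeygenassist_si128(rk_[1], 0x02));
      rk_[3]  = Expand(rk_[2], _mm_aeskeygenassist_si128(rk_[2], 0x04));
      rk_[4]  = Expand(rk_[3], _mm_aeskeygenassist_si128(rk_[3], 0x08));
      rk_[5]  = Expand(rk_[4], _mm_aeskeygenassist_si128(rk_[4], 0x10));
      rk_[6]  = Expand(rk_[5], _mm_aeskeygenassist_si128(rk_[5], 0x20));
      rk_[7]  = Expand(rk_[6], _mm_aeskeygenassist_si128(rk_[6], 0x40));
      rk_[8]  = Expand(rk_[7], _mm_aeskeygenassist_si128(rk_[7], 0x80));
      rk_[9]  = Expand(rk_[8], _mm_aeskeygenassist_si128(rk_[8], 0x1b));
      rk_[10] = Expand(rk_[9], _mm_aeskeygenassist_si128(rk_[9], 0x36));
    }

    uint32_t next() {
      if (used_ == 4) {
        // Encrypt the next counter block: AddRoundKey, 9 rounds, final round.
        __m128i b = _mm_xor_si128(_mm_set_epi64x(0, (long long)ctr_++), rk_[0]);
        for (int i = 1; i < 10; ++i) b = _mm_aesenc_si128(b, rk_[i]);
        b = _mm_aesenclast_si128(b, rk_[10]);
        _mm_storeu_si128((__m128i*)buf_, b);
        used_ = 0;
      }
      return buf_[used_++];
    }
  };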

Additionally, a generator that passes those 200 tests is not guaranteed to be high-quality for a particular usage, which might as well be considered nothing but another statistical test. There is the famous case of the r250 generator---an additive Fibonacci generator that passed all statistical tests of the time---which turned out to have characteristics that rendered costly real-world simulations wrong [1].

That is C++, yes, but I don't think there is much of anything modern about it. Trying to save on space probably contributed to its unreadability a bit.

[1] http://journals.aps.org/prl/abstract/10.1103/PhysRevLett.69....


Here's a paper that includes an analysis of a similar approach that uses 5 rounds of AES under the name ARS-5: http://www.thesalmons.org/john/random123/papers/random123sc1...


I know this paper, and like this approach. Reduced-round primitives let you take advantage of existing analysis, and keep some guarantees while improving speed. In the case of AES, we know that the probability of any differential for 4 rounds is at most 2^-113; similarly for linear probabilities [1]. I'm curious why ARS-4 fails Crush; I wonder if the custom key schedule is the culprit.

[1] https://eprint.iacr.org/2005/321



That PCG is ostensibly less predictable is not its prime selling point; it's more a small nice-to-have in the scheme of things.

Of course, PCG doesn't exist to secure secrets. That's for cryptographic primitives, not any old RNG. However, there are downsides to using trivially predictable RNGs. A randomized algorithm might degrade in performance on carefully-crafted inputs, which may expose you to DoS attacks. Hashes are a better-known issue, where several languages have moved to make hashing less predictable, and thus less game-able.

It would surely be better for an implementation to avoid exposing any of this altogether, but an RNG intended for general use should have the worst-case scenario in mind.

The reason not to go full-on cryptographic is simply that the alternatives are faster and have more features. A list of some of the cool stuff PCG has is on the front-page, but here goes:

It's extremely fast and, distribution-wise, pretty much the best non-cryptographic RNG out there, of those tested.

The RNG is reproducible and supports multiple streams, including streams dependent on the address space. This also allows emulating Haskell's RNG splitting (where one RNG creates two new independent ones), which is neat if you don't want to share state between multiple parts of your program but still want global reproducibility, or even if you're just a functional programmer.

You can skip it backwards and forwards, which is little used but a ton of help when you do want it.

Each instantiated RNG is a few bytes big. Something like the Mersenne Twister is a bad thing to instantiate wildly, especially on memory-constrained systems. Not a massive issue, but it's still silly any RNG fails at this.

Last but certainly not least, it's a tiny dependency.
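For reference, the core 32-bit PCG generator is tiny. This is a sketch from memory of the minimal C version (PCG-XSH-RR) on pcg-random.org, so treat the reference code there as authoritative: a 64-bit LCG whose state is output through an xorshift-plus-rotate permutation, with the (odd) increment selecting the stream.

  // Sketch of minimal 32-bit PCG (PCG-XSH-RR), from memory of the reference
  // code on pcg-random.org. `inc` must be odd; it selects the stream.
  #include <stdint.h>

  struct pcg32 { uint64_t state, inc; };

  static uint32_t pcg32_next(struct pcg32 *rng) {
    uint64_t oldstate = rng->state;
    rng->state = oldstate * 6364136223846793005ULL + rng->inc;
    uint32_t xorshifted = (uint32_t)(((oldstate >> 18u) ^ oldstate) >> 27u);
    uint32_t rot = (uint32_t)(oldstate >> 59u);
    return (xorshifted >> rot) | (xorshifted << ((32u - rot) & 31u));
  }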


So basically the only normal engineering reason to use PCG is if you need deterministic insecure random numbers: in other words, for simulations and testing code.

It is hard to imagine another scenario in which using an insecure RNG is good engineering. Secure RNGs are somewhat costlier than fast insecure RNGs, but it is very hard to predict which incidental random numbers won't ever need to be secure.

(I would generally take anything 'pbsd says about RNGs to the bank.)


If you're fine with a cryptographic RNG, then definitely. I'd even agree that cryptographic RNGs should be default. Rust made the right choice here.

But look at C++'s new RNG library. Go's random library. People don't want to give up the lightweight, fast defaults they have. Even Rust's canonical random library has an XorShiftRng. These are what PCG is competing with, and replacing all those weak RNGs with PCG RNGs would make them all much better.


It's definitely not a cryptographically secure PRNG. I suspect it's aimed at people who want a lot of random data for modelling.

There's a bit of discussion here: https://news.ycombinator.com/item?id=9887548

Slavik81 in that thread mentions raytracing as an example use.


Only cryptographers care about "unpredictability." Outside of cryptography, the interesting properties of a PRNG are speed, period length, and state size.

But a good CSPRNG has excellent statistical properties. So if you make "unpredictability" a requirement, their slow PRNG becomes very attractive. In other words, they made up a problem for their solution.

If you need a statistical PRNG, use a statistical PRNG. (The best are xorshift1024* and xorshift128+.) If you need a CSPRNG, use a CSPRNG.
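For the record, a sketch of xorshift128+ as I remember it (the shift constants vary slightly between published versions; Vigna's paper has the authoritative ones, and the two state words must not both be zero):

  // xorshift128+ sketch, from memory; one published parameter set is 23/18/5.
  // The state s[0], s[1] must not be all zero.
  #include <stdint.h>

  static uint64_t xorshift128plus(uint64_t s[2]) {
    uint64_t t = s[0];
    const uint64_t y = s[1];
    s[0] = y;
    t ^= t << 23;        // shift a
    t ^= t >> 18;        // shift b
    t ^= y ^ (y >> 5);   // shift c
    s[1] = t;
    return t + y;
  }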


This sentiment is totally untrue and is why the password resets in PHP apps are the subjects of security conference presentations.

It can be surprisingly tricky to predict which random values an application uses will end up being security-critical.


An interesting case would be something like hash flooding prevention. The cost of a full CSPRNG is probably not worth it--no security is broken if we lose the secrecy of one generated seed--but at the same time we want something that is at least nontrivial for an attacker to predict.

I agree that most cases you either don't care at all or need a real CSPRNG.


Preventing hash flooding requires a single 128 bit seed per table. How expensive is it to generate that with a real CSPRNG?
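(For scale, here's roughly what that costs in code; a sketch assuming Linux's getrandom(), with arc4random_buf() being the BSD equivalent. It's one syscall per table, and even that can be amortized by reading seeds in batches.)

  // Fetch a 128-bit per-table hash seed from the OS CSPRNG (Linux, glibc
  // 2.25+; on the BSDs, arc4random_buf() does the same job).
  #include <stdint.h>
  #include <sys/random.h>

  static int get_table_seed(uint64_t seed[2]) {
    return getrandom(seed, 2 * sizeof(uint64_t), 0) == 16 ? 0 : -1;
  }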


Depends how many hash tables you have! :)

I agree that it's a marginal gain here, but Thomas asked for a case where predictability of "more difficult than a LCG or Mersenne Twister, but not infeasible under crypto assumptions" would have value. This would be an example.


I believe it means it's more difficult to guess its internal state than that of other commonly used PRNGs.


What does "zero will be produced once less often than every other output" mean, as a negative quality of XorShift and XorShift*? That they are not able to generate "0"?

Also, does anyone know if PCG is in use somewhere today?


I believe Go has switched to it or is considering switching. Someone linked to a code review a while back.


ChaCha20 is (relatively) slow, but! ChaCha8 does not have known attacks and is 2.5 times faster than ChaCha20.

Not mentioning the speed variability of the ChaCha family is a flaw in the analysis.


I mentored a GSoC student this summer who worked on RNGs, and the only two algorithms that passed the BigCrush statistical suite were PCG and SplitMix.


The knocks against OpenBSD's arc4random, namely

> No facility for a user-provided seed, preventing programs from getting reproducible results

> Periodically “stirs” the generator using kernel-provided entropy; this code must be removed if reproducible results desired (in testing its speed, I deleted this code)

seem like exactly the kinds of footguns you really want removed from an RNG you're using for real live code.


Sometimes you want well-mixed numbers, but in a sequence that you can rerun. If you are simulating a process or testing games, a fully deterministic setup given a known seed is a great help in reproducing unusual behavior you've sighted.


Some people need reproducible "randomness". They're testing a model, for example. I'm not sure if this RNG is good for them or not.


Some people seem forever confused about the difference between random and deterministic.


Reproducibility and seedability aren't a footgun unless they're easy to get wrong accidentally, or hard to get right on purpose.

The former really isn't a problem except for really poor APIs, and the latter isn't made worse by it... although the PCG random author does like to make a point of just how bad it is in C++ (see [1]).

[1] https://www.reddit.com/r/cpp/comments/31857s/random_number_g...


What about unit tests?


Produce a set of random numbers once and store them as part of the test set. That way your test is independent of the black box which is your RNG. Or make your own simple RNG - in these cases you are looking for "arbitrary, definitely not just sequential" rather than cryptographically random (or, indeed, truly random at all).

If you rely on the order of the output for reliably consistent test cases, avoid using black-box RNGs which could change implementation under your nose without much or any warning.


Cryptographically random isn't the aim here - pseudorandom is.

A seedable black-box RNG that guarantees what seed corresponds to what stream is way simpler to use than a list of random numbers, IMHO. It's the difference between copying one number and copying a million.


> Cryptographically random isn't the aim here

We are agreeing here.

> A seedable black-box RNG that guarantees what seed corresponds

If you can control when the black box gets updated, or the black box carries a guarantee of stable output for any given seed over the whole time of its existence, yes. But if you are using OS-provided RNGs, for instance, their behaviour is not defined (well, it is, but those definitions are not set in stone) and may change as kernel updates happen.

That is why I suggest "making your own simple PRNG". This could be as simple as picking a known documented algorithm as implemented by a particular library and using that in your test cases - it doesn't even need to be as much as writing your own function.
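To make that concrete, here's the sort of tiny, documented generator I have in mind (a sketch from memory of Vigna's splitmix64; any fixed, well-specified algorithm would do): small enough to paste into the test suite so its output for a given seed can never change under you.

  // splitmix64 sketch, from memory of Vigna's description: small enough to
  // embed verbatim in a test suite, so the stream for a given seed is fixed.
  #include <stdint.h>

  static uint64_t splitmix64(uint64_t *state) {
    uint64_t z = (*state += 0x9e3779b97f4a7c15ULL);
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
  }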


Right... but nobody's suggesting replacing the OS's RNG.


Having encountered a number of VPS setups relying on haveged or rng-tools/virtio-rng, has anyone observed "specific problems" with the misconfiguration/misuse of haveged on VPS?


Has there been any research into generating a PRNG using a genetic algorithm whose fitness function is the Crush and Diehard test results?


Not quite that, but here's someone's approximation of this approach for an extremely fast hash:

  So this was a fun hash that I came up with when I was looking
  for a 128-bit hash for falkervisor's code coverage database.
  This hash passes all of smhasher's tests, faster than any other
  hash that does the same.

  I made this up by randomly testing things until it passed
  all of smhashers tests. It is likely that this hash is heavily
  flawed and smhasher just doesn't stress its flaws.
https://github.com/gamozolabs/falkhash


Zilong Tan's fast-hash contains a generator which genetically optimizes over a combination of the cheap ops mul, xorshift left, xorshift right and rotate right, and also does genetic mutations over the initial magic states.

https://code.google.com/p/fast-hash/source/browse/trunk/hash...

The fitness function is only the avalanche test, but that's easily exchangeable.


Thanks!



