Making Python Less Random (healeycodes.com)
87 points by healeycodes 4 months ago | 43 comments



Another way to do this that covers more sources of non-determinism would be to run your python code under Meta’s Hermit: https://developers.facebook.com/blog/post/2022/11/22/hermit-...


Well, I never had issues finding threading bugs on normal Linux. The flakiness of Meta software tests, at least of the open-source published kind, comes from the code bases being a mess and getting rewritten every two weeks, because apparently LOC is the measure of success.


I'm... confused. Being able to intercept and modify syscalls is a neat trick, but why is it applicable here?

In python you generally have two kinds of randomness: cryptographically-secure randomness, and pseudorandomness. The general recommendation is: if you need a CSRNG, use ``os.urandom`` -- or, more recently, the stdlib ``secrets`` module. But if it doesn't need to be cryptographically secure, you should use the stdlib ``random`` module.

The thing is, the ``random`` module gives you the ability to seed and re-seed the underlying PRNG state machine. You can even create your own instances of the PRNG state machine, if you want to isolate yourself from other libraries, and then you can seed or reseed that state machine at will without affecting anything else. So for pseudorandom "randomness", the stdlib already exposes a purpose-built function that does exactly what the OP needs. Also, within individual tests, it's perfectly possible to monkeypatch the root PRNG in the random module with your own temporary copy, modify the seed, etc, so you can even make this work on a per-test basis, using completely bog-standard python, no special sauce required. Well-written libraries even expose this as a primitive for dependency injection, so that you can have direct control over the PRNG.
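
A minimal sketch of both techniques, using only the stdlib (the seed value is arbitrary):

    import random

    # An isolated PRNG instance: seeding it doesn't touch module-level state.
    rng = random.Random(1234)
    assert rng.randint(0, 100) == random.Random(1234).randint(0, 100)

    # Per-test "monkeypatching" of the module-level PRNG: save the global
    # state, seed, run the test body, then restore.
    state = random.getstate()
    try:
        random.seed(1234)
        first = random.random()
        random.seed(1234)
        assert random.random() == first   # deterministic given the seed
    finally:
        random.setstate(state)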

Meanwhile, for applications that require CSRNG... you really shouldn't be writing code that is testing for a deterministic result. At least in my experience, assuming you aren't testing the implementation of cryptographic primitives, there are always better strategies -- things like round-trip tests, for example.
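
A toy illustration of a round-trip test: the property holds for any key the CSRNG produces, so nothing needs to be made deterministic (the cipher here is a deliberately trivial stand-in):

    import secrets

    def xor_cipher(data, key):
        return bytes(d ^ k for d, k in zip(data, key))

    key = secrets.token_bytes(16)    # real CSRNG output, not mocked
    msg = b"attack at dawn!!"
    assert xor_cipher(xor_cipher(msg, key), key) == msg   # round trip holds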

So... are the 3rd-party deps just "misbehaving" and calling ``os.urandom`` for no reason? Does the OP author not know about ``random.seed``? Does the author want to avoid monkeypatches in tests (which are completely standard practice in python)? Is there something else going on entirely? Intercepting syscalls to get deterministic randomness in python really feels like bringing an atom bomb to a game of fingerguns.


The article makes it fairly clear that this is mainly a kind of nerd-sniping - there are better solutions for practical purposes, but the author wanted to explore a different approach and learn a bit about syscall interception along the way.


If you're developing a game, there's a fairly big issue in that many things may be requesting values from, and thus incrementing, the PRNG, and many of them could be indirectly controlled by the user (where they are, where they're looking, etc. https://www.youtube.com/watch?v=1hs451PfFzQ is a fun video about reverse-engineering Zelda to predict the randomness in a minigame)

As far as the approach, I agree in that I don't understand why 'no code changes' is that important, especially in the context of Python which has a general attitude of consent towards monkeypatching code. Maybe one of the randomness sources was hashing all the source files? :P


Python has perhaps the least tolerant culture toward monkeypatching of languages that are capable of it. Outside a couple well-known common cases (gevent, I think?) it’s widely frowned upon.


Monkey-patching is usually the wrong solution because Python is extremely extensible. These days I can't think of a reason to reach for it.

Back in the day, sometimes we had to monkey-patch interface layers like database drivers and other code that was open to modification but closed to extension. Usually to disable some legacy or proprietary feature that broke everything else. Like "you have to use a database from 1993" and it had an `assert check_winxp_version()` or something dumb in a top-level `__init__.py`.

These days, there are mature or python-native solutions to all of those that I recall.

- However! This article is more like using the debugger and ptrace as a Game Genie or save editor than about the utility of `prng = random.Random(123)`. The actual point of the article wasn’t much about python ;)


That video was awesome!


To be fair to the OP, the implicitness in Python in general and the random seeding in particular is confusing, especially if 3rd party modules are involved.

In C++, if you use std::mt19937, everything from seeding to the explicit generator is crystal clear while being terse as well.


Maybe I'm missing something, but if you can set os.urandom to a custom function, why not implement your own stateful PRNG in python and patch urandom to point to that? Then you can, among other things, seed the PRNG yourself in unit tests, all from within python and without touching syscalls.
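
That would look something like this (a sketch, assuming the code under test actually goes through os.urandom rather than calling the syscall from a C extension; randbytes needs Python 3.9+):

    import os
    import random

    _rng = random.Random(0)          # our own seedable PRNG state

    def fake_urandom(n):
        return _rng.randbytes(n)     # deterministic bytes for a given seed

    os.urandom = fake_urandom        # patch before anything grabs a reference

    print(os.urandom(8))             # same output on every run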


Python randomness is something I've fought with for a few years. A while back (it's on my GitHub, I can find it if any replies care) I had an issue where distributed Monte Carlo sims all ended up with the same seed. More recently I've had an issue where I wanted a large number of random bytes generated identically across multiple programs. Thinking about it now I could have used an LFSR or similar, but I just seeded the random module and it went fine.
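
For the cross-program case, a dedicated seeded instance keeps things reproducible without touching global state (Python 3.9+ for randbytes):

    import random

    # Every program that runs this gets the identical byte stream, since
    # random.Random is the same Mersenne Twister on every platform.
    rng = random.Random(42)
    data = rng.randbytes(1024)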

Editing to add that another thing that trips me up every few years is that the hash function isn't repeatable between runs. Meaning if you run the program and record the hash of an object, then run it again, they'll be different. This is good for more secure hash maps and such, but not good if you think you can store hashes to a file and use them later.
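
(Only str and bytes hashes are randomized. It's easy to see from a shell, and setting PYTHONHASHSEED pins it:)

    $ python3 -c 'print(hash("spam"))'                    # varies per run
    $ python3 -c 'print(hash("spam"))'                    # different again
    $ PYTHONHASHSEED=0 python3 -c 'print(hash("spam"))'   # stable across runs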



Correct. But it's not often mentioned until you go looking for it.


I think the right mental model of a hash is that it's a transient value. There's no setting in which it makes sense to store it unless you explicitly control the hash function. Even in languages where it is static over runs, it can change over language versions, which any saved data will (presumably) eventually hit.


Yes, while it's a great practice, it's just something that newcomers to any programming language, like me at the time, may run into. Often when being taught programming, hash functions are presented as pure functions of the input data. My point was that Python, for very valid and good reasons, does not do that.


Well they are pure functions of input data. It's just that which pure function they are should be treated as a random value that changes over runs / version numbers.


Of course, this doesn't help with someone (e.g. me) who prefers to get their random numbers by reading them from /dev/random:

    $ strace python3 -c 'with open("/dev/random", "rb") as f: print(f.read(8))'
    [snip-snip]
    openat(AT_FDCWD, "/dev/random", O_RDONLY|O_CLOEXEC) = 3
    newfstatat(3, "", {st_mode=S_IFCHR|0666, st_rdev=makedev(0x1, 0x8), ...}, AT_EMPTY_PATH) = 0
    ioctl(3, TCGETS, 0x7ffd8198d640)        = -1 EINVAL (Invalid argument)
    lseek(3, 0, SEEK_CUR)                   = 0
    read(3, "\366m@\t5Q9\206\341\316/pXK\266\273~J\27\321:\34\330VL\253L\34\217\264L\373"..., 4096) = 4096
    write(1, "b'\\xf6m@\\t5Q9\\x86'\n", 19b'\xf6m@\t5Q9\x86'
    ) = 19
    close(3)                                = 0
There is also /dev/urandom.


From kernel 5.18 onwards, /dev/random and /dev/urandom are exactly the same.


They are still two different filenames (and two different inodes), if you want to intercept opening them.


os.urandom reads from /dev/urandom (or uses the getrandom() syscall where available): https://docs.python.org/3/library/os.html#os.urandom


Cool deepdive into syscalls! We've built a deterministic simulator in Python to test the performance of our medical device under different scenarios, and have handled this problem with a few very simple approaches:

1. Run each simulation in its own process, using eg multiprocessing.Pool

2. Processes receive a specification for the simulation as a simple dictionary, one key of which is "seeds"

3. Seed the global RNGs we use (the stdlib random module and np.random) at the start of each simulation

4. For some objects, we seed the state separately from the global seeds, run the random generation, then save the RNG state to restore later so we can have truly independent RNGs

5. Spot check individual simulations by running them twice to ensure they have the same results (1/1000, but this is customizable)

This has worked very well for us so far, and is dead simple.
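
A minimal sketch of steps 1-3 (the spec layout and key names here are hypothetical):

    import random
    from multiprocessing import Pool

    import numpy as np

    def run_simulation(spec):
        # Step 3: seed the global RNGs at the start of each simulation.
        random.seed(spec["seeds"]["random"])
        np.random.seed(spec["seeds"]["numpy"])
        return random.random() + np.random.random()  # stand-in for the real sim

    if __name__ == "__main__":
        specs = [{"seeds": {"random": i, "numpy": i}} for i in range(4)]
        with Pool() as pool:    # step 1: each simulation in its own process
            print(pool.map(run_simulation, specs))  # reproducible given the seeds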


This is utterly insane:

    import os
    os.urandom = lambda n: b'\x00' * n
    import random
    random.randint = lambda a, b: a
I love it!


That's monkey patching, and it actually would've worked fine. There isn't enough context in the write-up to say for sure, but presumably he was just doing it too late, after the third-party library was already imported. At that point the third-party library has its own reference to the original function(s), so patching the reference(s) in the source module doesn't do anything. If the source module had been patched first, though, it all would've worked out.
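
A sketch of the timing issue, with a hypothetical third-party module:

    # thirdparty.py (hypothetical library)
    from os import urandom           # binds its own reference at import time

    def token():
        return urandom(8)

    # main.py -- patching first wins:
    import os
    os.urandom = lambda n: b"\x00" * n

    import thirdparty                # picks up the patched function
    print(thirdparty.token())        # b'\x00\x00\x00\x00\x00\x00\x00\x00'

    # Swap the order (import thirdparty first, then patch os.urandom) and
    # thirdparty.token() keeps calling the original, unpatched function.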


I think he was saying something else was calling it and that was busting other things. Gevent did some crazy antics to get the whole tcp interface patched up. https://www.gevent.org/api/gevent.monkey.html#gevent.monkey....


Right, but the reason something else was able to call it was that he patched it too late. The same thing can happen with gevent. From the docs:

> Patching should be done as early as possible in the lifecycle of the program. For example, the main module (the one that tests against __main__ or is otherwise the first imported) should begin with this code, ideally before any other imports:

    from gevent import monkey
    monkey.patch_all()
> A corollary of the above is that patching should be done on the main thread and should be done while the program is single-threaded.

It's possible to patch later on, but much more involved. If you patch module A after you've already loaded module B, which itself loads module A, then you have to both patch module A and track down and patch every reference to module A in module B. Usually those will just be global references, but not always.


This assumes there are no calls to random functions from C extensions. Still, I would have started with the above.


Less so that, since he says he knows the sources of randomness, but it does assume esoteric import methods aren't used. If for some reason the third-party library is e.g. loading modules with importlib, all bets are off.


Maybe it doesn't completely fit the author's needs, but an even less intrusive way to control random is to seed it manually.


I know people hate “enterprise”-type software design, but this is a typical case where Dependency Injection would have made the solution trivial without the need for any OS-specific hacks.

And while the article serves as a nice introduction to ptrace(), I think as a solution to the posted problem it's strictly more complicated than just replacing the getrandom() implementation with LD_PRELOAD (which the author also mentions as an option). For reference, that can be done as follows:

    % cat getrandom.c 
    
    #include <string.h>
    #include <sys/types.h>
    
    ssize_t getrandom(void *buf, size_t buflen, unsigned int flags) {
      memset(buf, 0, buflen);
      return buflen;
    }
    
    % cc getrandom.c -shared -o getrandom.so
    
    % LD_PRELOAD=./getrandom.so python3 -c 'import os; print(os.urandom(8))'
    b'\x00\x00\x00\x00\x00\x00\x00\x00'
Note that these solutions work slightly differently: ptrace() intercepts the getrandom() syscall, but LD_PRELOAD replaces the getrandom() implementation in libc.so (which normally invokes the getrandom() syscall on Linux).


A better way to deal with this than dependency injection frameworks is to make the generator configurable and to be honest about where the seed is coming from. The random seed is an implicit input read from the runtime environment as a side effect. Future languages should treat side effects as first class and allow, for example, custom handlers to be installed to intercept or modify their behaviour.


Can you elaborate on what is "Dependency Injection"? Is this different from the example you show here with LD_PRELOAD?


The topic is too complex to do it justice in a Hacker News comment, especially since it's one of those things that takes some time and practical experience for the concept to “click”.

A good starting point is this article (though it's a little outdated): https://martinfowler.com/articles/injection.html

DI is not very common in Python, for a variety of reasons, but there apparently are DI frameworks, like: https://python-dependency-injector.ets-labs.org/index.html


I've asked a similar question, along the lines 'Can you explain Dependency Injection to me, assuming I already know Haskell?'

And the answer was basically along the lines of: it's a fancy way to pass something like a function argument.


It's primarily a technique used in object oriented programming. So it's hard to translate to Haskell.

The big picture of what a DI framework does is let you declare your structural object graph using a config file or decorators and have the whole thing instantiated at runtime automagically.

The detailed view is "a fancy way to pass something like a function argument". An object that has dependencies gets them passed in ("injected") at runtime rather than calling dependent function directly or internally instantiating dependent objects and calling them.

Doing things this way in OO languages has a number of benefits, including improved testability.


It's more subtle than that. DI and DI frameworks are two separate things that often get conflated.

Injecting dependencies when you instantiate an object, or passing dependencies into a method via arguments, rather than having methods or constructors create their dependencies, is good design and increases testability.

On the other hand, DI frameworks are, IMO, an awful, awful mess and mistake, and I'm glad the industry has moved away from them. The problem with DI frameworks is that setting parameters automatically is only one part of it; the other part is object lifecycle management. After all, if a Foo is automatically injected into your class, the DI framework needs to know how to create a Foo and dispose of a Foo.

This is where you get into per-request sessions, context hooks, and all of the stupid Spring bean bullshit everyone has come to hate and is really the main reason for so much anti-"enterprise" software patterns.

Funnily enough, it is unnecessary. Just make your dependencies into parameters and you'll get 95% of the benefit of DI. The last 5% is the code to wire up the object graph at certain entrypoints which can be large, but should be simple code.
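
With hypothetical names, that entrypoint wiring is just plain construction:

    import random

    class Database:                       # hypothetical stand-in
        def get(self, key):
            return f"row-{key}"

    class OrderService:
        def __init__(self, db, rng):
            self.db = db                  # dependencies arrive as parameters...
            self.rng = rng

        def sample_order(self):
            return self.db.get(self.rng.randint(1, 100))

    def main():
        # ...and the object graph is wired up once, here, in simple code.
        service = OrderService(db=Database(), rng=random.Random(0))
        print(service.sample_order())

    if __name__ == "__main__":
        main()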


Agreed. I was mostly trying to give the briefest overview I could for a hn comment.

I don't love spring, but I've seen the benefits from having a well-organized, declarative graph of the top-level objects (i.e. the core objects that live for the entire process lifetime). It provides a clear pattern for how to add new code to a large, growing codebase. Without such a structure, devs end up tacking new code on in arbitrary ways.


> It's primarily a technique used in object oriented programming. So it's hard to translate to Haskell.

Yes. I specifically asked like that to avoid getting a description of DI just couched in more OOP jargon but also to provide some alternative programming vocabulary (so you don't have to try to explain everything in everyday English, which seldom goes well).


Dependency injection means that the caller explicitly provides dependencies to functions/classes rather than having the classes/function getting their dependencies from the environment.

Taking an example like the article. Lets say you have a game with a ghost which randomly moves left or right. This would NOT be dependency injected:

    import random

    class Ghost:
        def __init__(self):
            self.pos = 5

        def move(self):
            match random.randint(0, 1):
                case 0:
                    self.pos -= 1
                case 1:
                    self.pos += 1
It's constructed like this:

    ghost = Ghost()
Ghost's behaviour depends on the state of the global RNG, but that isn't obvious from the perspective of the user of this class.

So instead we apply DI, and pass a random number generator in:

    class Ghost:
        def __init__(self, rng):
            self.pos = 5
            self.rng = rng

        def move(self):
            match self.rng.randint(0, 1):
                case 0:
                    self.pos -= 1
                case 1:
                    self.pos += 1
It's constructed like:

    ghost = Ghost(rng=random)
Now the fact that the class uses random numbers is explicit, and you can pass in an alternative RNG for testing purposes.
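
For instance, with a trivial test double:

    class AlwaysLeft:
        """A fake RNG: randint always returns its lower bound."""
        def randint(self, a, b):
            return a

    ghost = Ghost(rng=AlwaysLeft())
    ghost.move()
    assert ghost.pos == 4    # the ghost deterministically moved left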

DI is a very useful technique that can make the construction of your system understandable, and make it easy to mock out dependencies. Much like mocking however - it shouldn't be over-used. If you use DI too much your code will become opaque as you'll never know what the concrete type of code you're calling is. Python's progressive typing can help here to some extent.

-----------

Dependency injection is not to be confused with dependency injection systems, which are complicated beasts that obscure what dependencies are actually constructed or provided. They make DI implicit again with the argument that it's better because you don't have to pass parameters manually. I would argue that if you need a dependency injection system, maybe you've over-used dependency injection.

-----------

I like to think of it as being related to capability based security[1], where you have to explicitly provide your dependencies, otherwise you won't be able to access them.

[1]: https://en.wikipedia.org/wiki/Capability-based_security


Came here after reading the article to say this. Aside from monkey patching, that would have been my go-to.


Rather than writing a program you can also just use gdb and do it interactively...


Just me, or will this solution work on anything that depends on the SYS_getrandom syscall?


Yes, the solution has nothing to do with Python.

I'm not sure if the problem had anything to do with Python. The article is a bit silent on the specific issue with randomness. If detouring urandom() fixed it, it was probably the randomized hash tables.

It cannot have been third party modules calling random.seed() since that would not have been fixed by the hack (meant positively).

You can say that randomized hash tables by default are a mistake, same as the crippled arbitrary precision arithmetic.

If you write a web service, just set the proper defaults at the start of your program.
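
Note that the two defaults differ in where they can be set: the int/str conversion limit (new in Python 3.11) can be lifted at runtime, while the hash seed must be pinned in the environment before the interpreter starts:

    import sys

    # Lift the int<->str digit limit for this process (0 disables it).
    sys.set_int_max_str_digits(0)

    # Hash randomization cannot be changed from inside a running program;
    # launch with e.g. PYTHONHASHSEED=0 in the environment instead.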


Yes, it works on any process that makes a SYS_getrandom call.



