Trying to find out more about this EU stall thing Brendan talks about. Is it instruction sampling that gives you the reason for the stall? Sounds like pretty advanced hardware functionality.
I'm not sure, but it seems to me like this should be doable on Nvidia as well. This is a paper that uses instruction sampling (exposed through Nvidia's CUPTI profiling interface) to provide optimization advice:
The issue there is that that info is limited to what Nvidia chooses to expose from the on-chip execution. Most of what we can do for observation is in the kernel-driver space, not on-chip or even in the low-level transit to the chip. One of the other commenters pointed out that you can get huge benefits from avoiding busy-waiting on the data returned from the chip, which makes total sense, but it also increases latency, which didn't work for my near-realtime use case when I was investigating. Other than that type of low-hanging fruit, where you can accept a little latency for better power-state management, it's hard to find low-level optimizations specifically for Nvidia through the closed-source parts of the CUDA stack or through the driver's transit to the chip, since those are intentionally hidden.
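To make that busy-waiting trade-off concrete, here's a rough sketch of the idea (using PyCUDA purely for illustration; none of this is from the discussion above): you ask the driver to put the host thread to sleep instead of spinning, which is kinder to the CPU and to power but adds wake-up latency on every sync.

    import pycuda.driver as drv

    drv.init()
    dev = drv.Device(0)

    # The default scheduling tends to spin-wait on the host CPU for the
    # lowest possible wake-up latency. SCHED_BLOCKING_SYNC instead sleeps
    # the host thread until the GPU signals completion: better for power
    # and CPU usage, but every synchronization pays extra latency.
    ctx = dev.make_context(flags=drv.ctx_flags.SCHED_BLOCKING_SYNC)
    try:
        # ... launch kernels / memcpys here ...
        ctx.synchronize()   # host thread sleeps here instead of spinning
    finally:
        ctx.pop()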
A while ago, I read a paper on dissecting the Nvidia architecture using very specifically tuned microbenchmarks to understand things like the on-chip cache structure [0]. Unfortunately, no one has done this for the recent architectures that are seriously in use, so it's hard to apply that info today. Similarly, there isn't an eBPF VM running on the chip to summarize all of this, and the Nvidia tools aren't intended to make this kind of info easy to get, probably specifically because of this paper...
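The core trick in that kind of paper is basically a pointer chase. A toy version (my own sketch with Numba, not the paper's code) looks roughly like this: the latency per dependent load jumps each time the chased array outgrows a cache level, which is how you infer cache sizes.

    import time
    import numpy as np
    from numba import cuda

    @cuda.jit
    def chase(idx, iters, out):
        j = 0
        for _ in range(iters):
            j = idx[j]        # every load depends on the previous one
        out[0] = j            # keep the result live so the loop isn't optimized away

    def measure(size_bytes, iters=1_000_000):
        n = size_bytes // 4
        # Random single cycle through all n slots, to defeat prefetching.
        perm = np.random.permutation(n)
        idx = np.empty(n, dtype=np.int32)
        idx[perm] = np.roll(perm, -1)
        d_idx = cuda.to_device(idx)
        d_out = cuda.device_array(1, dtype=np.int64)
        chase[1, 1](d_idx, iters, d_out)     # warm-up / JIT compile
        cuda.synchronize()
        t0 = time.perf_counter()
        chase[1, 1](d_idx, iters, d_out)     # single thread -> pure load latency
        cuda.synchronize()
        return (time.perf_counter() - t0) / iters * 1e9   # ns per dependent load

    for kb in (16, 64, 256, 1024, 4096, 16384):
        print(kb, "KiB:", round(measure(kb * 1024), 1), "ns/load")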
Great work! I was wondering if you deal with transferring the Python environment to the remote machine. Usually a large part of the difficulty is dealing with dependencies.
We make sure the remote containers have CUDA/PyTorch/NumPy/Matplotlib set up if you're using a GPU-based machine. It's actually far easier for me to run ML-based code through Moonglow now than on my MacBook - it's really nice to start with a clean environment every time, instead of having to deal with dependency hell.
We don't yet transfer the Python environment on the self-serve options, though for customers on AWS we'll help them create and maintain images with the packages they need.
I do have some ideas for making it easy to transfer environments over - it would probably involve letting people specify a requirements.txt and some apt dependencies and then automatically creating/deploying containers around that. Your idea of actually just detecting what's installed locally is pretty neat too, though.
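The "detect what's installed locally" version could start out as simple as snapshotting the local environment into a requirements.txt that the container build consumes; something like this sketch (illustrative only, not what we actually ship):

    from importlib.metadata import distributions

    def snapshot_requirements(path="requirements.txt"):
        pins = sorted(
            f"{dist.metadata['Name']}=={dist.version}"
            for dist in distributions()
            if dist.metadata["Name"]      # skip packages with broken metadata
        )
        with open(path, "w") as f:
            f.write("\n".join(pins) + "\n")
        return pins

    snapshot_requirements()
    # The remote side would then build an image along the lines of:
    #   FROM <cuda base image>
    #   COPY requirements.txt .
    #   RUN apt-get install -y <apt deps> && pip install -r requirements.txt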
This is fascinating! I thought only reinforcement learning was doing things like this, but you're saying you can do this via fuzzing? What does this mean exactly? How is it able to learn to advance through all these levels? Is there an underlying learning mechanism at play?
It appears that you are not familiar with the concept of fuzzing.
Fuzzing is a moderately advanced software testing technique, popularized in the '90s, that operates on a very simple idea: if you feed a program arbitrary/random input data, you can discover bugs in it with little human effort.
In the '90s, researchers fed random data into the stdin of Unix utilities and found that many programs crashed. [0] In this context, printing an error message that says "I can't interpret the input" is a valid state, but reading past the end of a buffer because the input confused the program is a bug. Variants can be designed to test any API layer.
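A toy version of that classic experiment (my own illustration, not the code from [0]) is just a loop that throws random bytes at a target's stdin and flags anything that dies on a signal rather than exiting with an error message:

    import random
    import subprocess

    def fuzz_once(target, max_len=4096):
        data = bytes(random.randrange(256) for _ in range(random.randrange(1, max_len)))
        try:
            proc = subprocess.run([target], input=data, capture_output=True, timeout=5)
        except subprocess.TimeoutExpired:
            return None, data              # a hang is also worth a look
        # A nonzero exit code plus an error message is a valid outcome;
        # dying on a signal (negative returncode on POSIX, e.g. SIGSEGV) is a bug.
        return proc.returncode, data

    for _ in range(1000):
        code, data = fuzz_once("/usr/bin/some-utility")   # hypothetical target
        if code is not None and code < 0:
            print("crash: signal", -code, "input length", len(data))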
More recently, coverage-guided fuzzers use information about which code paths each input executes to reach a wider variety of program states more quickly. Starting from a prefix known to produce an interesting state can also speed up testing.
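Stripped way down, the coverage-guided loop looks something like the sketch below (a toy, not any real fuzzer's code; run_with_coverage stands in for whatever instrumentation hook reports the edges covered by one run):

    import random

    def mutate(data: bytes) -> bytes:
        data = bytearray(data)
        for _ in range(random.randrange(1, 8)):
            if data and random.random() < 0.5:
                data[random.randrange(len(data))] = random.randrange(256)            # flip a byte
            else:
                data.insert(random.randrange(len(data) + 1), random.randrange(256))  # add a byte
        return bytes(data)

    def fuzz(run_with_coverage, seed=b"", rounds=100_000):
        corpus = [seed]                  # seeds/prefixes known to reach interesting states
        seen = set()                     # every edge/block covered so far
        for _ in range(rounds):
            candidate = mutate(random.choice(corpus))
            covered = run_with_coverage(candidate)
            if not covered <= seen:      # this input reached something new
                seen |= covered
                corpus.append(candidate) # keep it and mutate from it later
        return corpus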
There's no learning exactly; as the post explains, the fuzzer is aware of various RAM addresses (as well as having a tactic for how it "presses" buttons in the game). It's just trying to explore the space of Mario's level plus his x and y coordinates.
This means that, without a learning procedure to direct Mario towards the end of the level, it can only reach the end by itself because the levels (and Mario's in-memory data structures in general) are pretty small, right?
Or rather, if there were tons of irrelevant state, it could always end up trapped somewhere and never actually complete a level even after centuries of fuzzing.
Something similar was tested in the Twitch Plays Pokemon [0] gaming experiment, but there the inputs only appeared random: there were "factions" that either tried to sabotage the run or tried to make it progress. Ultimately the majority of the players were cooperating to complete the game, and that was a deciding factor in the run succeeding. Maybe fuzzing Pokemon can't complete the game the way TPP could (or reinforcement learning could).
The space is large; it just turns out that if you direct Mario to explore with a bit of bias (e.g., in general, favoring exploration from states where Mario's x coordinate is further to the right), it completes the levels.
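In pseudocode, the loop is roughly the following (a simplified sketch, not the real implementation; the emulator calls - save_state/load_state/step/read_ram - are stand-ins):

    import random

    def fuzz_mario(emulator, rounds=100_000):
        seen = set()                                # (level, x, y) tuples already visited
        frontier = [(0, emulator.save_state())]     # (mario_x, savestate) pairs to resume from

        for _ in range(rounds):
            # Bias: prefer resuming from savestates where Mario is further right,
            # since the x coordinate roughly measures progress through the level.
            frontier.sort(key=lambda pair: pair[0])
            _, state = random.choice(frontier[len(frontier) // 2:])
            emulator.load_state(state)

            # "Press" a short random burst of buttons.
            for _ in range(60):
                emulator.step(random.choice(["left", "right", "jump", "run", "none"]))

            level = emulator.read_ram("level")
            x = emulator.read_ram("mario_x")
            y = emulator.read_ram("mario_y")
            if (level, x, y) not in seen:           # a novel point in the level/x/y space
                seen.add((level, x, y))
                frontier.append((x, emulator.save_state()))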
I think Pokemon could be beaten with our techniques. Final Fantasy on the NES poses similar problems to Pokemon, and that's a game on which some progress has been made in the past, here.