Why Is My Perfectly Good Shellcode Not Working? Cache Coherency on MIPS and ARM (senr.io)
135 points by DyslexicAtheist on Feb 6, 2019 | 41 comments



"Shellcode" in this sense is the small number of instructions that initiates the exploit, i.e. its contained in the data you overflow the buffer with.

The intention is that you trick the firmware into executing your shellcode to bootstrap the exploit.

The shellcode is "executable data" and the typical prevention for this type of exploit is execution-prevention on data (and stack) regions (for processors that support it).
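A minimal sketch of the setup being described, with a hypothetical function and buffer size: the vulnerable pattern is a fixed-size stack buffer that gets copied into without a bounds check, and the overflowing bytes can carry both the shellcode and a replacement return address pointing back into that buffer (which only pays off if the memory is executable):

    #include <string.h>

    void parse_packet(const char *packet, size_t len)
    {
        char buf[64];
        /* No bounds check: if len > 64 the copy runs past buf and
           clobbers the saved return address, so the function "returns"
           wherever the attacker-controlled bytes say to. */
        memcpy(buf, packet, len);
    }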


but there are always return-to-libc style attacks, which defeat executable-stack protection...
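For a concrete picture, a rough sketch of the classic 32-bit x86 return-to-libc payload layout (the helper and padding length here are hypothetical, and the addresses would have to be discovered for the real target): rather than injecting instructions, the overflow overwrites the saved return address with the address of system() and lays out a fake frame so system() sees "/bin/sh" as its argument:

    #include <stdint.h>
    #include <string.h>

    void build_ret2libc_payload(uint8_t *buf, size_t pad_len,
                                uint32_t system_addr, /* address of system() in libc      */
                                uint32_t exit_addr,   /* fake return address for system() */
                                uint32_t binsh_addr)  /* address of a "/bin/sh" string    */
    {
        memset(buf, 'A', pad_len);                    /* filler up to the saved return address */
        memcpy(buf + pad_len,     &system_addr, 4);   /* overwritten return address            */
        memcpy(buf + pad_len + 4, &exit_addr,   4);   /* where system() will "return" to       */
        memcpy(buf + pad_len + 8, &binsh_addr,  4);   /* first argument as seen by system()    */
    }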


And hence we have ASLR to protect against those attacks…


Only really effective for 64-bit, though; disclaimers apply in (32-bit) IoT land:

> Our results suggest that, for current 32-bit architectures, address-space randomization is ineffective against the possibility of generic exploit code for a single flaw; and brute force attacks can be efficient, and hence effective

[1] https://web.stanford.edu/~blp/papers/asrandom.pdf

edit: also ROP


32-bit IoT usually has a lot more problems that are easier to exploit than being able to brute force ASLR, but yes: if you are able to run attacks quickly, it’s pretty easy to get lucky and find the offset. There really isn’t much you can do on 32 bit to fix this, unfortunately, unless you start doing fancy things like reordering functions themselves to increase the number of bits of entropy (and even then, you’re not getting much…)


Since IoT is somewhat less tied to existing architectures, it's not completely out of the question that somebody could produce a microcontroller that keeps separate stacks for return addresses and function arguments/results. If the former stack is not directly modifiable, that removes most of the need for ASLR.


A major selling point of most modern microcontroller architectures is that they don't have a separate hardware return-address stack and are thus "C compatible". Many early microcontrollers had exactly that, as it also typically means a simpler hardware implementation.


"C compatible" strikes me as a very overblown way of describing the consequences of choosing whether to have a separate return address stack. In practice, I don't see it actually amounting to anything more than a need to use privileged instructions to implement setjmp, which is perfectly reasonable for a high-security environment.


And ASLR is bypassable as well if you have other bugs that leak information about the stack layout. Overall it just gets more annoying and more tricky because you need to find more bugs and they need to fit together.


> Overall it just gets more annoying and more tricky

That's the point, isn't it? No system will ever be 100% secure, so all we can try to achieve is to deter attackers.


Deterrence is a bit of Lord of the Flies though. I'm making myself less attractive so you go after some other poor SOB.

But with IoT I'm not necessarily after one person. I'm going after a million people who all have the same garbage device. If the payoff is bigger the incentive to keep pushing past obstacles is very high.


It's entirely possible, in theory, to write bug-free code. It just isn't possible in practice, for humans. Isn't that rather sad?


As a general statement, that is impossible in both practice and theory due to the halting problem.

You CAN write provably bug-free code, but only if you are willing to accept a long series of limitations.


The halting problem only tells you that there can be no algorithm to prove this automatically for any given program.

It could still be the case that for all practically relevant programs there is a proof in each case that they are secure, even if there is no automatic way to find it.


I'm willing to accept "Might enter an infinite loop", but AFAIK the halting problem says nothing about security.


> the halting problem says nothing about security

It makes it impossible to say whether certain things can be proven to be secure.


That's only if an algorithm would halt if it has a condition to do so.

If you use a finite state machine, you can control the states and make mathematical assurances that your program terminates.

The solution to the halting problem is that you work backwards instead of forwards. You don't let a program run forever and hope for the answer. We also don't give programs infinite memory to eventually exit/halt (or not); my machines have 8 GB and 32 GB of RAM, and the OOM killer does its job rather well.
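A tiny, hypothetical illustration of the FSM point: a recognizer whose state set is finite and whose loop is bounded by the input length, so termination follows directly from the structure rather than from any general halting analysis:

    #include <stdbool.h>
    #include <stddef.h>

    enum state { OUTSIDE_WORD, INSIDE_WORD };

    /* Returns true if the input ends inside a word.  The loop runs exactly
       len times over a two-element state set, so it trivially terminates;
       no appeal to the halting problem is needed. */
    bool ends_in_word(const char *s, size_t len)
    {
        enum state st = OUTSIDE_WORD;
        for (size_t i = 0; i < len; i++)
            st = (s[i] == ' ') ? OUTSIDE_WORD : INSIDE_WORD;
        return st == INSIDE_WORD;
    }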


That does not solve the halting problem. The halting problem is entirely framed around making a program that can specifically compute whether another arbitrary program will return.

Tests work because they are arranged by humans to verify the system under test for validity, but only for one specific system at a time.

The halting problem is basically an admission that one is unable to write a test to verify arbitrary programs on arbitrary, Turing-complete hardware. You must be 10% smarter than the piece of equipment, in short.

Killing something for taking too long or eating too much memory makes no substantive judgement about whether the computation at hand is ready to return or not; you (or your OS) just unilaterally decided that the computation wasn't worth continuing.


So, solution to the halting problem: just kill anything that doesn't halt and return true?


No. AWS Lambda allows up to 10s of processing before killing the function and returning a (killed, exceeded time) result for the API call.

The halting problem depends on a Turing machine. As Wikipedia puts it, a Turing machine is:

"a mathematical model of computation that defines an abstract machine, which manipulates symbols on a strip of tape according to a table of rules. Despite the model's simplicity, given any computer algorithm, a Turing machine capable of simulating that algorithm's logic can be constructed."

The first line of the description of how to build one is: "The machine operates on an infinite memory tape divided into discrete cells."

Infinite memory? Well, that's already trivial to dismiss then.

And I also said that things that need to be proved should be using finite state machines. Using an FSM means a dimensionality reduction to a thing that doesn't suffer from the halting problem. One can make complicated graphs, yet still not introduce the same pitfalls as a Turing machine has.

And here's a paper arguing that FSMs don't suffer from the halting problem: https://pdfs.semanticscholar.org/025b/97d66265dfbcb02d9dd1a2...


This is why some scripting and configuration languages are now toying with the notion that "not Turing-complete" is a feature, not a limitation, of the system.

I'm kinda holding my breath that they turn out to be right.


    Programmers only care about three numbers:  zero, one, and "out of memory"


Maybe, if operating under a lot of restrictions and using theorem proving, but even then the compiler and the OS (and the CPU) also need to be bug-free for complete protection.

Compilers have plenty of bugs, so do operating systems and CPUs as we have seen in recent years.


> It's entirely possible, in theory, to write bug-free code.

Only when "code" is thought of as an abstract notion existing in isolation from inputs, outputs, real systems, etc.


You can also insert stack-protecting canaries with some compilers (-fstack-protector on GCC). A bit more limited, and it incurs a small runtime cost, but you only need compiler support.
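As a sketch of what that buys you (the function is hypothetical; the flag is GCC's): compiled with something like "gcc -fstack-protector-strong", the compiler puts a random canary between the local buffer and the saved return address and checks it before returning, so an overflow that smashes the return address also smashes the canary and the process aborts via __stack_chk_fail instead of returning into attacker-controlled data.

    #include <string.h>

    void copy_name(const char *input)
    {
        char name[32];
        /* Classic overflow if input is longer than 31 characters; with the
           stack protector enabled, the corrupted canary is detected on
           return and the process is aborted rather than hijacked. */
        strcpy(name, input);
    }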


Stack canaries make these attacks much more annoying, but it’s still possible to perform these if you can overwrite a function pointer or perform a controlled read or write (rather than a buffer overflow).


Hmm, are the PLT and GOT also randomized? If not, that could possibly be another option for return-to-libc style attacks despite ASLR.


If you have a PIE binary, then yes.


Of course, it’s so named because the code traditionally grants access to a shell.


Didn't know this was the lingo. Apparently there are related terms that go with it like "egg", "egg-hunt", "omelette", etc. https://en.m.wikipedia.org/wiki/Shellcode


Note: This is talking about exploiting buffer overruns, not about Bash scripts. I was very confused for a moment there.


shellcode vs shell-code.

Shellcode traditionally meant binary code that would do something like exec("/bin/bash") (i.e. code that launches a shell). When abusing a buffer overflow, you would usually want to somehow get this code to be executed. Later, the term expanded to mean any kind of user-provided bytes an attacker wants to be interpreted and executed as machine instructions.
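In plain C, that classic payload is roughly the following; real shellcode is the handful of machine instructions a compiler would emit for this, hand-massaged to avoid NUL bytes and to be position independent:

    #include <unistd.h>

    int main(void)
    {
        char *argv[] = { "/bin/sh", NULL };
        execve("/bin/sh", argv, NULL);   /* replace this process with a shell */
        return 1;                        /* only reached if execve() fails */
    }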


FYI: It's called shellcode because the payload in the exploit/buffer overrun is actual machine instructions to start a command shell. So essentially it is code to open a shell (shellcode).

If you want to hack a system, the ideal program you want is a command shell. And if you exploit a program with elevated rights, then you could get a shell with elevated rights, which would ultimately let you take control of the system.

Edit: And if anyone wants to learn about computers/programming, I suggest you try it yourself. It's an excellent way of learning how the system works at a low level.


The post seems to be slightly confused about the Arm DSB and ISB instructions. These don't affect the caches at all -- they are barriers, which affect ordering and pipelining. DSB says "don't go any further until all memory accesses made by instructions before this one have completed", and ISB says "flush the pipeline so the next insn is reread from the cache" (among other things). So self-modifying and other tricky code needs to consider these insn ordering issues as well as cache manipulation, and barrier insns aren't a substitute for doing the cache maintenance.
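For reference, the cache-maintenance side of this (as opposed to the barriers) is what GCC and Clang expose as __builtin___clear_cache: after writing instructions into a buffer that is mapped both writable and executable (an assumption in this sketch), the data cache has to be cleaned to the point of unification and the instruction cache invalidated for that range before jumping into it. On AArch64 the builtin expands to roughly a "dc cvau / dsb ish / ic ivau / dsb ish / isb" loop over the range; on MIPS Linux it typically ends up in the cacheflush() system call.

    #include <stddef.h>
    #include <string.h>

    typedef int (*generated_fn)(void);

    /* Copy freshly generated instructions into exec_buf and make them
       visible to the instruction fetcher before calling them. */
    generated_fn publish_code(void *exec_buf, const void *code, size_t len)
    {
        memcpy(exec_buf, code, len);
        __builtin___clear_cache((char *)exec_buf, (char *)exec_buf + len);
        return (generated_fn)exec_buf;
    }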


Pipeline control certainly does affect caches. That's where the memory access being controlled goes. I mean, it's tempting to try to understand them as an orthogonal thing, but in fact memory hierarchies and CPU pipelines (and other stuff like store-forwarding buffers) are joined at the hip, and it's pointless to try to "understand" them in isolation.

At best it's an exercise in specification pedantry. The only reason instruction barriers, fences, and serialization instructions seem to make sense in the docs is that someone at the architecture vendor sat down and wrote a memory and execution model that can be abstracted by whatever API it is they present to the user.

And someone doing performance or exploitation work at the level of the linked article really needs to understand that model, not the API. (Though you're absolutely right that this particular article seems to be a bit mixed up about both)


I meant that it doesn't affect it in the sense that it doesn't change the way the cache behaves or ask it to flush or invalidate any of its contents. You can have a pipelined CPU which needs the barrier insns but which has no cache at all (most likely in embedded). And if the problem you're having is that your data is in the dcache but not the icache then the barrier insns won't help, because they're not architecturally defined to do that.

I certainly agree that you need to understand the architectural model to write reliable (and cross-CPU portable) code -- and the architectural model for Arm does distinguish barriers from cache maintenance -- though if you're writing exploit shellcode then "works most of the time, and we can't do better given the nature of the bug we're exploiting to get code execution" is sometimes what you're going to end up with, at which point knowing what the specific hardware is doing can help.


The same principles probably apply to Power ISA (isync, hwsync, icbi, etc.), and for that matter SPARC. Lots of PowerPC chips still turn up in embedded applications.
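Roughly sketched for a single cache block (a real routine loops over the whole range using the actual block size), the Power sequence is the textbook dcbst/sync/icbi/isync dance: write the modified data block back, wait for it, invalidate the stale instruction block, then resynchronize the instruction stream:

    /* Rough sketch only: addr is assumed to point into the modified code. */
    static inline void ppc_sync_code_block(const void *addr)
    {
        __asm__ __volatile__(
            "dcbst 0,%0\n\t"  /* push the dirty data cache block to memory */
            "sync\n\t"        /* wait until that write-back has completed  */
            "icbi 0,%0\n\t"   /* invalidate the matching icache block      */
            "sync\n\t"
            "isync"           /* drop any already-prefetched instructions  */
            : : "r"(addr) : "memory");
    }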


it's so perfect that it doesn't work. great stuff. instantly lost interest.


I had issues with my own shell script this past week or two, but my issue was colliding hypervisors. Fun times. I'm not savvy enough with Bash as it is, so it was frustrating.

Edit: I guess my shell-script frustration is somewhat irrelevant to this issue. I mean, running into the issue in the article is a rarity these days, at least on standard/mainstream hardware.


Different kind of “shell” code maybe? :)


For multiple reasons: most people don’t use MIPS, don’t have to write shellcode, and don’t have to deal with cache coherency :)



