So one of these days I was reading DailyDave and came across this thread, where Dave Aitel references a past presentation given by Halvar Flake. In that presentation, Halvar basically throws in a bunch of problems that would be nice to see security researchers solve in the future. Most of them are not trivial, and haven't been solved to this day (it's Dec 2009, and the preso is from 2006). Others have simply questionable solvability. The fact is, a silly idea crossed my mind as soon as I set my eyes on one of these problems, and I figured I'd code a small proof of concept just to see if it's any useful.
The issue here is code obfuscation. In order to degrade readability of compiled code, developers of code protectors and malware writers may choose to bloat an originally small piece of code into a much larger number of instructions, which, as we all are well aware, exponentially strengthens the reverser's headache. It can also be used as means for code polymorphism, thus making malware harder to fingerprint.
So, for instance, the code:
ADD EAX, 0x20
Can be obfuscated into:
MOV EBX, 0x80000000
ROL EBX, 6
NEG EBX
SUB EAX, EBX
Or:
MOV EBX, 0xFFFFFFF8
IMUL EBX, 4
NEG EBX
ADD EAX, EBX
Of course the algorithm that performs the obfuscation can iterate on the obfuscated over and over to bloat it indefinitely.
Our challenge, therefore, is to make the reverser's life easier. Ideally, we would have code that can take those 2 groups of 4 instructions (or, as a matter of fact, any other possible obfuscation of the original code) and return it to its normal, i.e. original, form. Halvar says he's not sure such goal is achievable. I'm not convinced either. In any case, I think it's a problem that deserves some effort put on...
Obs.: Please note that the "normal" form is fairly abstract in this case. The original code might have been either human or compiler generated. As such, the code we want to go back to may depend on some constraints (e.g. "the smallest code in bytes" or "the smallest code in number of instructions) or even some quite relative parameters (e.g. "the most human-readable code").
Obs.2: It's been a couple of years now that I've been wanting to check how the methods for polynomial identity testing would perform when applied to program equivalence. Unfortunately, I've always been busy or distracted by something else. Perhaps what I need is some motivation, so if someone wants to seriously dive into this, let me know.
So, my basic idea here is to partially emulate the obfuscated code in such a way that the final state of the context is put on terms of the initial state. Thus, we can understand the _effects_ by making the _causes_ explicit. Let's take a look...
So what we have here is our little code normalizer running over the two samples of obfuscated code I showed above. It displays the state of our "processor" after the given code runs, in term of the initial register states (eax0, ebx0, ecx0 and edx0, for their respective registers). We can see that what both samples do is essentially add 32, i.e. 0x20, to whatever was in EAX. Notice that the values in EBX are different, hence arguably characterizing different effects. The catch here, however, is that EBX wouldn't have been touched at all in the original code. It's just a side-effect of the obfuscation.
This approach is different than simply debugging or using regular emulation of the code and diff'ing the 'before' and 'after' of the CPU context. And I think it's better too. Suppose you start with EAX = 4 and just runs your debugger/emulator. Then you have EAX = 36. Now, what happened here? Did the code do EAX = EAX + 32 or EAX = EAX * 9? You'll need further testing.
Now, this obviously works well for constants unfolding and I'm sure it could be made more useful with branching support, to deal with all those nasty jump insertions. These are the things I had in mind when I came up with this. I'm not sure if it scales well for other obfuscation techniques. Then again, this wasn't supposed to be run over a large ammount of code, anyway. Just that little function, or even block, that had its internals especially obfuscated.
Well, the second step to make it really rocking now would be to generate code from those explicited "causes". For instance, we can clearly look at the first example and think "Ok, so it's ADD EAX, 0x20 and MOV EBX, 0x20" or something like that. We want some code to turn those math signs and assignments into arithmetical/logical and data moving instructions. My hint on this would be to have some kind of genetic algorithm that knows some simple transformation rules and whose heuristics to choose generations to further work on would be "smaller in number of instructions", simply using time as a stopping condition. I don't think that would be very hard to do, and perhaps could be very impressive already. Who knows?
Of course, being this a proof of concept, the original code could be first improved in a number of ways:
If anyone performs such modifications or any other improvements, I'll be happy to give them their deserved credit and post the improved code on this page.
Ok, now a little disclaimer is in order: I AM NOT A PYTHON CODER. I rarely code in Python, though it's such a nice language. In any case, it's very likely that this code is ugly, inefficient, lacks enough validation or is just plain wrong. I honestly do not care. I only did it in python so it would be easier for users to plug it into IDA, Immunity Debugger, PyDbg, and all the other reversing tools that support Python these days. Then again, you can always improve it and send it back to me so I can post it again here.
Have fun. Download it.