Posts

Showing posts from 2006

Manual Decompilation

Argh. It's 2006, and I still don't have a good decompiler. All is not lost. Thankfully, there are still interesting things to decompile that are both small and contain lots of stuff that makes decompilation easy (e.g., symbols, relocations). So, let's do it manually using some trustworthy old fashioned tools: a disassembler, a text editor and some string processing tools. Let's choose a target. I'm going to go with a linux kernel module because they are small, contain symbols and relocations and because there exist GPL ones that I won't get in trouble for reverse engineering publicly. Just choosing something at random from /lib/modules on my Ubuntu linux box I come across new_wlan_acl.ko from the madwifi-ng drivers. Right, now we need a disassembly. No problem. Just do objdump -d --adjust-vma=0x8000000 new_wlan_acl.ko > out.dis . That almost gives me the object as it would look mapped into the linux kernel. Slight problem though, none of the reloca

Spring Cleaning

I, like many people, own a robotic vacuum cleaner. It's crap. With all the advanced robotic technology it has on board, it's still lacking in vacuum cleaner technology . Although it is bagless, it's not the good kind of bagless. There's no double vortex mechanism here. In fact, there's very little suction on it at all. That's kind of important for a vacuum cleaner I think. Besides which, the robotic technology isn't all that "advanced" anyway. It still gets stuck in corners or on that same part of the rug. So when I feel the carpet needs a bit of a spruce, I pull out the trusty Vax and put my back into it. Where's my maidbot? It's been 40 years since the The Jetsons and I am still waiting for my flying car, err, I mean, maidbot. Now, of course, I realise that it would be a bit expensive to get together a crack team of Japanese scientists just to make me a maidbot (and yes, they have to be Japanese scientists) and that no amount o

God Damn I Hate Cygwin

First of all, let's get the rant out of the way. Cygwin is a big pile of junk. It's like everyone who touches Cygwin gets the clue beaten out of them before they're allowed to hack on it. Here's a tip guys, if you're going to try to do something really hard (like build a POSIX compatible software layer on top of an operating system that holds many parts of the POSIX standard as the anti-thesis of its design) you have to put a lot of effort into it . Ok, done. I recently had this problem: Fatal server error: could not open default font 'fixed' for which I googled, lots and lots and lots and found no adequate solution. I found the Cygwin/X Frequently Asked Questions and it had two suggests as to what the problem could be: You don't actually have the xorg-x11-fnts package installed (duh, thanks guys, yeah, that wasn't the first thing I checked). The mount point for /usr/X11R6/lib/X11/fonts was invalid at the time that Cygwin's setup.exe

Windows Printer on Linux

I'm a Ubuntu user but I also run a number of Windows machines for work purposes. I was actually shocked recently to discover that CUPS supports my Brother HL-5140 laser printer without me having to download any binary blobs from the Brother website. So that's good. What's not so great is that gnome-cups-ui has no concept of workgroups or domains for connecting via smb to a windows printer. When you go to "add new printer" or when you view the connection properties after adding the new printer, the fields you are presented with are: Host Printer Username Password After googling for a bit, I discovered that people have figured out that if you put the workgroup name in the Host field and then enter username:password@host/Printer into the Printer field and nothing into either the username or password fields, you can print! Curious about this obscurity, I did an apt-get source gnome-cups-ui and hunted around for a bit. Eventually, I found some code that looks basicall

A 5mb binary blob in the kernel?

If you pop over to the NVIDIA web site and download the 3d card drivers for Linux, you'll note that there is a /usr/src/nv directory. In that directory is source code to the "thin layer" to the Linux kernel which NVIDIA links their binary blob. This, of course, makes no legal difference - NVIDIA are still extending the Linux kernel and therefore it is unlawful to distribute the Linux kernel along with the NVIDIA drivers, but NVIDIA doesn't do that, so it's not a problem - for them. Anyway, that's a side issue. I was thinking, recently, about the Linux kernel's "tainted" flag. Essentially, whenever you install a kernel module that isn't under some accepted open source license, the kernel sets a flag so that developers know there is a chance that any bugs you report might have been caused by code they can't fix. In general, this is not such a big deal as kernel modules are typically small and easy to isolate. The NVIDIA graphics driv

Skip the Intermediaries

Sometimes copyright law is just stoopid. Sometimes the rules just don't apply. Have you heard the story of Steve McDonald and White Stripes? Here's nice flash animation of what happened, or you can keep reading.. Steve McDonald is a veteran member of the band Redd Kross. He likes the White Stripes, but he thought they would sound better if they had a bass guitarist. So he appointed himself. He had the equipment, and the skill, so he made up some bass tracks and added them to his favourite White Stripes songs. He then posted those songs on the Redd Kross website - without permission. Of course, this could land him in hot water, but luckily he bumped into Jack White who gave his assurances that he wasn't going to sue. Before that happened I'm guessing Steve McDonald just didn't give a damn.. after all, he's a rocker, man. I'm not a rocker, but I am a rebel, ask anyone. I once posted the full c99 standard to my web site so people wouldn't have

At last. Some violence!

Hehe, after struggling to place the head, torso and legs of pedestrians together to make the animations work in FreeSynd for the last week or so I happened to find this information about HELE-0.ANI, HFRA-0.ANI and HSTA-0.ANI. Turns out these three files contain just about everything you need to know to draw objects in the game. There are 1970 animations which are made up of frames taken from a pool of 8949, each of which are made up of elements taken from a pool of 10486, each of which index into the available 1180 game sprites. The amount of work required to manually reproduce these animations (and that I had intended on doing) is phenomonal. A conservative estimate is that this information has knocked a year off the development time. In just one day (today) I have figured out the exact placement of units on the map, fixed the drawing of animations and cleaned up a lot of very ugly code. The result is that agents can now walk around, with and without all the different weapons

Improving On An Old Chestnut

Image
Breaking my own rule that improvements should be belayed until such time as the original game has been reimplemented, I've added waypoints and path finding to agent movement in FreeSynd . I found this documentation of the A* path finding algorithm to be most enlightening.. although, frankly, the sketch is more than adequate to implement a useful algorithm for agent movement. It always annoyed me when I told my agents to go to point X and they got stuck inside buildings or took some inordinately long path. Now you can hold down the ctrl key and enter waypoints manually, or you can let the path finder do its job (this is the default). Pressing the ctrl key without clicking the mouse will display the selected agents' current path(s) as a bright yellow line. At the moment the planning algorithm does not take elevated terrain into account, so trying to send agents onto bridges or up stairs simply won't work just yet. There's still a hell of a lot to do before the game

Into The Unknown..

Image
Back when Mike and I started the Boomerang decompiler project, we had the choice to start from scratch or to use our work on the UQBT project as a base. There was pros and cons to both approaches, but in the end we decided that UQBT had quite a lot of value in it, so we went with that. Unfortunately, at the time Mike and I started working on a decompiler based on UQBT, the copyright was owned by a number of parties. We had to talk to lawyers from the University of Queensland, Sun Microsystems and the individual people who had worked on the codebase over the years. Initially the lawyers were of a single opinion: no way. The individual contributors, however, were of mixed opinions: what for? and which license? As it turned out, answering the second question also answered the first question and started to put the lawyers minds at ease. The goal of the Boomerang decompiler project was decided, way back in 2001, to be a guiding force in bringing about a market for decompilation techn

Other Fun

I've recently been working on some game related projects. First, there's the RPG engine I've started which I call Stallion . I've received just about no interested about this from anyone so I havn't been paying it much attention. Now and then I get the desire to hack on an RPG and none of the available codebases for graphical RPGs interest me as much as the ones for textual RPGs. So why not hack on a textual RPG? Because it's just so dry without players to consider, and with players to consider you've just gotta be so dedicated. Then there's my other project. I've always been a fan of Bullfrog's Syndicate. My fan page gets a lot of hits (probably because I give out copies of the game on it). If you've never played Syndicate you probably won't understand the attraction. You'll also think we're all crazy for fooling around with DosBox or old hardware to get the game to actually run and never consider doing the like yourse

I havn't forgotten you

Havn't been working on Boomerang much lately. I made a few changes to make the exports of DLL files appear as entrypoints in the GUI. Seems to work, will check it in soon. I also made some recent changes that handle gotos inside switches better. This means that magic like Duff's Device now decompile sensibly.

And then, some calls just don't return

A small number of library procedures are "called" but never actually return. Eventually I'd like to have a way to specify these procedures with anotations in the signature files, but for the moment they are hard coded in the frontend. That's acceptable for the moment as there is only five: _exit, exit, ExitProcess, abort and _assert . Thing is, what happens when you have a branch over one of these calls, as you often do. Typically you end up with phi statements that refer to the call or the statements before it because there's no way to show that these definitions are killed by the call but not defined by it. We could add this to the SSA form but a simpler solution is available. Whenever the decompiler determines that the destination of a call is one of these no return procedures then we simply drop the out edge of the containing basic block. Without an out edge the definitions cannot flow beyond the call. Using dataflow based type analysis and some of m

MinGW's tricky prologue code

Continuing with my ongoing test program extract_kyra.exe from the scummvm tools I've been looking at the very first call in main . It would appear that this exe was compiled with stack checking runtime code enabled. That very first call is to a short little procedure that takes a single parameter in the eax register; the number of bytes to subtract from esp . Here's a disassembly of the procedure: push ecx mov ecx, esp add ecx, 8 loc_405336: cmp eax, 1000h jb short loc_40534D sub ecx, 1000h or dword ptr [ecx], 0 sub eax, 1000h jmp short loc_405336 loc_40534D: sub ecx, eax or dword ptr [ecx], 0 mov eax, esp mov esp, ecx mov ecx, [eax] mov eax, [eax+4] jmp eax It not only s

Chronicle of a Decompilation

Image
I've been promising to do this for a while, so here goes. I have downloaded a small open source utility called extract_kyra which is part of the scummvm tools . It doesn't really matter what the utility does. In fact, I prefer not to know as it gives me an unfair advantage compared to, say, decompiling malware. What's important is that this tool is under the GPL, so it is permisible for me to decompile it. I will be writing a number of these posts that describe all the issues I've run into and the steps I've completed to produce the decompilation. I promise that I will not go and download the source code for this program until such time as I believe my decompilation is "complete". The first problem with decompiling this program is that Boomerang has failed to find main() . I can tell this because the load phase of the workflow does not list main as a found entrypoint, it only lists start . This is because this exe was compiled with MinGW and it

Bi-directional Dataflow Type Inference

After reading this paper I've starting implementing a brand new type analysis in Boomerang. The first step is calculating reverse dominance frontiers. A regular dominance frontier is the set of nodes that have a predecessor that is dominated by the given node, but is not itself dominated. These are the nodes where phi statements are to be placed when transforming a program into SSA form. The reverse dominance frontier is much the same, except it is the successors that must be post -dominated. Boomerang already calculates immediate post-dominators for use by the code generation phase, but we've never before had a use for reverse dominator frontiers. The paper describes an extension to SSA form called static single information form (SSI) which introduces a new kind of assignment: the sigma function statement which are placed on the reverse dominator frontiers. The purpose of this new statement is to split definitions before uses on different code paths. I will be using

Short Circuit Analysis

I've checked in some code that detects branches to branches and merges them if they meet some basic requirements. As such, Boomerang can now generate code like: if (a < b && b < c) { // x } else { // y } instead of generating a goto statement. I've checked in a couple of test programs that exercise this analysis. I havn't looked at how this analysis effects loops or more obscure control structures.. so there could well be bugs here.

Unfinished Assortments

I received an email of inspiration from Mike a week ago outlining how similar conjunctions and disjunctions in short circuited if statements are at the binary level. After looking at the problem myself I found it was a pretty simple problem. If one of the two out edges of a branch node is another branch node which contains only a branch statement, and the destination of that statement is shared with the first branch statement then you can merge the condition of the second branch into the condition of the first branch. Exactly what combination of fall/taken edges are shared by the two branches is what determines whether an || or a && should be used. This is a pretty easy transformation to do just before code generation (or just after decompilation) and I'm about half way through implementing it. Unfortunately I got sidetracked. My work, you know, the people who pay me, they have me doing - wait for it - Java development. I moaned about not just being a one trick pony

What license is that MTA?

A Mail Transfer Agent is that bit of software that listens on port 25 and accepts mail when you send it. There's a lot of them available, but which ones are truely free? I find that a good moral compass on questions of licensing is to look at the OpenBSD project . What they use is typically the most free you can get. So what do they use? Sendmail , which has these license terms . They're pretty ass. Basically you can distribute it if you're "open source" in the GPL sense of the term; you have to promise to hand over source code, or if you are "freeware". So yeah, if you want to make a binary only CD of OpenBSD and include Sendmail you're going to have to promise whoever you give it to that you'll give them the source if they ask, or you can't charge them anything more than distribution costs. Seems kind of anti-OpenBSD-philosophy to me. But maybe there's nothing better out there. What about qmail ? Ask anyone and they'll te

Order of Decompilation

Which order you decompile the procedures of a program in can determine how much information you have available to make good analysis. In Boomerang we've developed a system of depth first search of the call graph, which takes into account recursion, to have the most information about parameters available when needed. For example, if a program consists of two procedures, one called by the other, it makes the most sense to decompile the procedure which is the leaf in the call graph so that the procedure that calls it has the most information about what parameters it needs to pass to it. What happens if the way the leaf procedure takes a parameter that is a pointer to a compound type? By the parameter's usage the decompiler may be able to determine that it is a pointer, it might even be able to determine that it is a pointer to a compound, but unless every member in the compound is accessed in a way that restricts that member's type sufficiently, the type that the decompiler

Binary Release Alpha 0.3

I've added a new binary release of Boomerang for win32 to the sourceforge project. The Linux binaries are also up. If you'd like to try making a release for some other platform, please let me know. What can the new release do? Well, it crashes less. It supports ELF .o files much better than previous releases. It includes some changes that make for better output in some instances. Overall, just general improvements. What can't the new release do? This is still an alpha release. That means we don't expect you to be able to do very much work with it. Running it on /bin/ls will still give you horrendous output, but try /bin/arch.

So Many Types and No (Good) Type Analysis

The type analysis in Boomerang remains a nice big mess. There have been three attempts at a type analysis system: dataflow based, constraint based and adhoc. At the moment the adhoc gives the best results, and the other two crash, a lot. Sometimes there is an absolute orgy of types in an input program, and the type analysis will assign the type void to a local. I've just added some more adhoc type analysis that will recognise when the programmer is assigning the elements of an object of unknown type to an object of known type and copy the known types for the elements to the unknown type. This is very specific but hopefully it occurs in more than just the one input program I was looking at. In C the programmer would have written something like this: struct foo *p = someFuncThatReturnsAFoo(); p->name = global.name; p->count = global.count; p->pos = global.pos; p->other = 0; If that call is to a library proc we will have the struct foo , and know that p is a pointe

Conflation of Pointer and Array Types

A common source of confusion for new C programmers is the conflation of pointers and arrays that C does. I often think of the dynamic semantics of the language when I'm thinking deeply about passing arrays to functions. Typically, you can tell an experienced programmer that C always passes arrays by reference, never by value, and they won't go wrong. Not all languages are like this, so in Boomerang we try to represent pointers and arrays as seperate non-conflated types. In our type system an array is a type used to describe those bytes in memory that contain a finite number of objects of a particular base type. Similarly, a pointer is a type used to describe those bytes in memory that contain an address, which if followed will reveal a single object of a particular base type. As such, it is necessary to refer to somethings explicitly using the Boomerang type system that are typically implied by the C type system. For example, a C string is often written in the C type syste

Reverse Strength Reduction

Strength reduction is a compiler optimisation that tries to replace expensive operations involving a loop variant with less expensive operations. The most common example of strength reduction is the replacement of array indexing with pointer arithmetic. Consider this program fragment: for (int i = 0; i < 25; i++) arr[i] = 9; where arr is an array of integers. An unoptimised compilation, in an RTL form, might look like this: 1 *32* r25 := 0 2 BRANCH 6 Highlevel: r25 >= 25 3 *32* m[a[arr] + r25 * 4] := 9 4 *32* r25 := r25 + 1 5 GOTO 2 which would actually be some very nice RTL for the decompiler to discover because it is easy to recognise that arr is an array with stride 4 and replace statement 3 with: 3 *32* a[r24] := 9 unfortunately, the compiler will have gotten to the RTL before we have, and most any compiler will do some strength reduction to get rid of that multiply in the middle of the left of statement 3. So what the decompiler will see is more like this: 1 *32* r

Using Qt4 and the Boehm Garbage Collector on Linux

I had an major issue where the Boehm garbage collector was crashing and spitting errors because of my use of Qt4's QThread class. The problem was simple enough, Qt4 calls pthread_create when it should be calling GC_pthread_create. I could have solved this problem by modifying qthread_private.cpp to do this, but that would mean distributing a modified Qt4, which is just silly for such a small change. So after much annoyance, I managed to come up with a solution that, although not pretty, seems to work. As such, there will be a Linux version of the GUI available to download when I make binary packages sometime in the next week.

Forcing a Signature

A while ago I added a bool to the Signature class that allowed the user to specify that the signature was already fully formed and did not need any processing. This was part of the "symbol file" hack that we used to do procedure-at-a-time decompilation using the command line tool. I noticed today that we were not honouring the forced bit anymore, for example, we were removing unused parameters and returns from the signature, so I fixed that. It occured to me that any procedure we discover via a function pointer was an excellent candidate for setting the forced bit on. The results were pretty spectacular as locals and globals were soon inheriting type information from the parameters of the signature. Unfortunately, the same could not be said of returns. In particular, consider some code like this: mystruct *proc5() 12 { *v** r24 } := malloc(343) 13 m[r24{12} + 4] := "foo" 14 RET r24{12} It's pretty clear that any local we create for r24 should be of type m

Types, Globals and Varargs

I have a sample input program that has some code similar to this: 228 { *32* r24, *32* r28 } := CALL knownLibProc( .. arguments .. ) .. 307 *32* m[r24{228}] := 232 308 *32* m[r24{228} + 4] := 91 309 *32* m[r24{228} + 8] := "some string" where knownLibProc returns a pointer to a struct in r24. Early in the decompilation this type will be propogated into the statements in 307, 308 and 309 producing: 307 *32* m[r24{228}].size := 232 308 *32* m[r24{228}].id := 91 309 *32* m[r24{228}].name := "some string" our intermediate representation doesn't have an operator equivalent to C's -> operator, the above is more like writing (*p).size, but the backend takes care of that and will emit a -> instead. Unfortunately I was getting an assert fault before I even get to that. The problem was that the 228 instance of r24 was being assigned a local variable, and that local was not inheriting the return type of the call. So the adhoc type analysis would take a look a

Memory Leaking and Automatic Collection

I checked in a change to Makefile.in today that lets one disable the garbage collector more easily. I then tried out a few memory leak detection tools. First I tried ccmalloc. I couldn't even get this working, it just crashes, even with the sample program on the web site. Then I gave mpatrol a go. I'd heard good things about mpatrol. Unfortunately it doesn't seem to like dynamic loading of shared libraries and (for no good reason) we don't link to the loaders statically in Boomerang. So I gave up and installed valgrind. It still rocks. It not only told me how much memory we were leaking and where from, it also told me some use-before-define errors of which I wasn't aware. I checked in fixes. Next, I had a look at Boost's shared_ptr class. I'm hoping to figure out a way to easily add something like this to replace the garbage collector. Unfortunately, the use of smart pointers is anything but easy. You'd think that you could define something

Debugging the Decompiler

One of the most useful features of the new GUI will be the ability to step through a decompilation and inspect the RTL at each step. To date I have implemented a Step button that allows the user to inspect a procedure before each major phase of the decompilation on that procedure. In the future, I intend to add more debugging points, perhaps even to the resolution of a single RTL change. I expect that some way for the user to specify the desired level of resolution will be required. Whether that is a bunch of menu options, or a spinner or even multiple Step buttons (step to next change, step to next analysis, step to next phase, step to next procedure, etc), I havn't decided. The UI already has a course form of breakpoints. At the decoding phase you can specify which procedures you want to inspect, and the decompilation will run without stopping until it gets to one of those procedures. It would be sensible to allow the user to set a breakpoint on a particular line of the RTL

Multithreaded Garbage Collection on Linux

I tried compiling the Boomerang GUI on Linux last night. After much effort getting Qt4 to compile I was hopeful that everything would just work. Unfortunately the trick I used to get the garbage collector to only collect memory allocated by the decompiler thread on WIN32 doesn't work on Linux. Apparently you're supposed to call GC_pthread_create instead of pthread_create to start new threads, well I'm not calling either, I'm getting Qt to create my thread for me. So what to do? I guess I could modify Qt to use GC_pthread_create, but that means any binaries I distribute will have to include my modified Qt. I'm going to look into ways to register a thread with the allocator directly. Another alternative is to just stop all this garbage collector nonsense, but modifying Boomerang to delete when appropriate is just out of the question. I have seriously considered using smart pointers, possibly from Boost, but as of yet I've not made a firm decision. It would

Displaying Pretty RTL

I did some fixes to the html output option in the various print() functions of Boomerang today. This is all so I can display RTL as pretty as possible. I'm thinking that hovering the mouse over a RefExp should highlight the line that is referenced by it. That's my first goal, and then I'll work on a context menu. All this is possible because I can hide pointers in the html output which refer back to the intermediate representation. Qt4 has the advantage that good html display widgets are standard parts of the toolkit. What I don't intend to do is to allow freeform manipulation of the RTL. That would require an RTL parser, which I'm simply not in the mood to write, at least this month.

Support for Linux Kernel Modules

Boomerang can now load Linux kernel modules and find the init_module and cleanup_module entrypoints. Loading ELF sections is a tricky business. I now honour the alignment information, which means kernel modules will load into Boomerang with the same section placement as they are loaded into IDA Pro. There was also some problems with the way we handle R_386_PC32 relocations. Checking the type of the symbol indicated by the relocation solves the problem. I also managed to speed it up significantly by removing an unnecessary check. Hopefully it really is unnecessary. My globals-at-decode-time code is now checked in. I await howls of disapproval on the mailing list, but hey, I do so every time I check in.

More GUI Work and Relocations

Today I got a lot of work done on the GUI. I can now edit signature files in a tab at decode time and the corresponding signatures and parameters are shown in the Library Procedures table. For the rare times where a decompilation actually makes it to code generation without something going wrong, I can now open the output file and edit it in a tab. I even gave my main window a title. On the topic of relocations/symbol information. I can now load a linux .o file and get absolutely no addresses in my RTL. This is because I take a relocation at memory location x as an absolute guarentee that memory location x contains an address. I look up the address in the symbol map and replace the constant that would ordinarily be produced with an a[global] expression. One surprise I had on my test binary was that string constants are not assigned a symbol. I expected at least a "no name" symbol. As such, I speculatively test the memory at the given address and, if a suitable string

Woe of the Instruction Decoder

Boomerang uses the NJMC toolkit to decode instructions. These are the files in frontend/machine/pentium (or sparc or ppc or whatever you're interested in). We chose to use this technology because it ment we didn't have to write code to implement a new architecture, we could just write a "specification" file. Unfortunately, the NJMC toolkit is slowly rotting. It is hard to build. I've never built it. Mike has built it a couple of times (and failed a lot more times). Every architecture is different and no-one maintains it. We also have some issues with the code it generates. It produces huge cpp files which struggle to compile on some build targets and make the resulting binary much bigger than it could be. So how much work is it to replace? I considered writing a new tool that would take the same input files as the NJMC toolkit and generate the same output, but that only solves half the problems. Then I came to wonder, what's wrong with just using a

More relocation mayhem

I couldn't get to sleep last night, as something about relocations was nagging at me. Finally, around 2am, it hit me. I got up and sent a long email to Mike. The problem is, we've been thinking about relocations way too literally. The ElfBinaryFile loader treats relocations as the loader of an operating system would treat them, as numbers to be added to addresses in the image. But a decompiler wants a lot more information than this. The relocations tell us something that is gold and we just ignore it. For example, suppose you have a relocation like this: 00000009 R_386_32 myarray To a traditional loader it is saying: go look up the symbol myarray and calculate its address in memory, then go to offset 9 in the .text segment and add that address to whatever is there. But to a decompiler, what it is telling us is that we should add a global reference to myarray to the expression we generate for the address at offset 9. So say the instruction that included offse

Overlapping registers

The x86 instruction set is an ugly mess. Often with a desire to make things more flexible, people make things harder to understand. In the case of instruction sets, this makes a decompiler's job more difficult. Consider the following x86 asm code: mov eax, 302 mov al, 5 mov ebx, eax What value is in ebx? It makes it easier if we write 302 as 12Eh. Then we can easily say that ebx contains 105h, that is, 261. In boomerang, the decoder would turn those three instructions into this RTL: *32* r24 := 302 *8* r8 := 5 *32* r27 := r24 This is clearly wrong. As the information that r8 overlaps with the bottom 8 bits of r24 is completely absent. This is more correct: *32* r24 := 302 *16* r0 := truncu(32, 16, r24) *8* r12 := r24@15:8 *8* r8 := truncu(32, 8, r24) *8* r8 := 5 *32* r24 := r24@31:8 | zfill(8, 32, r8) *16* r0 := truncu(32, 16, r24) *32* r27 := r24 But just look at the explosion in the number of statements. I havn't even included statements to define bx, bh, and bl, whi

Better Support for Relocatable ELF Files

Looking at how the Boomerang ELF loader handles symbols and relocations, I noticed that something was clearly wrong for relocatable files (i.e., .o files). The loader was assuming that the VirtualAddress members of the section table were set as they are in executable images. This is not the case. It is the duty of the loader to choose an arbitary starting address and to load each section at appropriate offsets from that address. I decided that choosing the same default address that IDA Pro uses was probably a good idea. I often switch between Boomerang and IDA Pro to gather information, especially information that Boomerang has gotten wrong. I also decided to delay loading any section that starts with ".rel." until all the other sections are loaded because IDA Pro does so. I don't know why it does it, but I want my addresses to match up with those in IDA Pro, so I have to follow their lead. After fixing this, I noticed that all the symbols and relocations now point

A(nother) GUI for Boomerang

Quite a while ago I attempted to write a GUI for Boomerang. In fact, I've done this a couple of times. The stalling point has always been: what good is a GUI? Decompilers are supposed to be automatic. You should be able to give a decompiler an executable and trust it to spit out a C file that meets some definition of program equivalence with that executable. So if the decompiler works perfectly, what is there for the user to do? Surely anything they can offer will be more productively applied to the output, and for that they can just use standard source code manipulation tools. Well, there's two problems with this line of thinking. First, there's the sad fact that no decompiler is perfect. In fact, the state of the art is far, far, from adequate, let alone perfect. Secondly, standard source code manipulation tools are woefully underpowered for even the most simplest tasks of traditional reverse engineering (where traditional means "starting from source code&q