Black Magic Probe ARM semihosting

Sun, 15 Sep 2013 19:12:48 +0000
tech arm embedded

If you are developing stuff on ARM Cortex M-series devices and need a reasonably priced debugger, I'd have to recommend the Black Magic Probe. For about 70 bucks you get a pretty fast debugger that directly understands the GDB remote protocol. This is kind of neat as you don't need to run a local server (e.g. stlink); the debugger appears as a standard CDC ACM USB serial device.

The thing about it which is pretty cool is that the debugger firmware is open source, so when you suspect a bug in the debugger (which, just like compiler bugs, does happen in the real world) you can actually go and see what is going on and fix it. It also means you can add features when they are missing.

Although the debugger has some support for ARM semihosting, unfortunately this support is not comprehensive, which means if you use an unsupported semihosting operation in your code you end up with a nasty SIGTRAP rather than the operation you were expecting.

Unfortunately one of the simplest operations, SYS_WRITEC, which simply outputs a single character, was missing, which was disappointing since my code used it rather extensively for debug output! But one small commit later and the debug characters are flowing again. (As with many of these things, the two lines were the easy bit; the hardest and most time consuming bit was actually installing the necessary build pre-requisites!)
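For reference, here is a minimal sketch of what a SYS_WRITEC call looks like from the target side, assuming a Cortex-M (Thumb) part and a GCC-style compiler. The operation number 0x03 and the bkpt 0xAB trap come from the ARM semihosting specification; the names semihost_writec and semihost_puts, and the putchar fallback that lets the snippet build off-target, are my own additions and not part of the probe firmware.

```c
#include <stdio.h>

/* Semihosting operation number for "write one character" (SYS_WRITEC). */
#define SYS_WRITEC 0x03u

/* Hypothetical helper: emit a single character through the debugger. */
static void semihost_writec(char c)
{
#if defined(__arm__) && defined(__thumb__)
    /* r0 carries the operation number, r1 points at the character. */
    register unsigned int op __asm__("r0") = SYS_WRITEC;
    register const char *arg __asm__("r1") = &c;
    /* On Cortex-M the semihosting trap is BKPT 0xAB. */
    __asm__ volatile ("bkpt 0xAB" : "+r"(op) : "r"(arg) : "memory");
#else
    /* Off-target fallback so the sketch can be built and run on a host. */
    putchar(c);
#endif
}

/* Write a NUL-terminated string one character at a time; returns the count. */
static unsigned int semihost_puts(const char *s)
{
    unsigned int n = 0;
    while (*s) {
        semihost_writec(*s++);
        n++;
    }
    return n;
}
```

If the debugger does not implement the operation, the bkpt is simply reported as a stop, which is the SIGTRAP described above.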

The trouble with literal pools

Fri, 02 Jan 2009 06:43:37 +0000
article tech arm

Yesterday we saw how, with some careful programming and the right compiler flags, we could get GCC to generate optimally small code for our particular function. In this post we take a look at one of the ways in which GCC fails to produce optimal code. Now there are actually many ways, but the one I want to concentrate on in this post is the use of literal pools.

So, what is a literal pool? I’m glad you asked. The literal pool is an area of memory (in the text segment) which is used to store constants. These constants could be plain numerical constants, but their main use is to store the addresses of variables in the system. These addresses are needed because the ARM instruction set does not have any instructions for directly loading from (or storing to) an absolute address in memory. Instead ldr and str can only access memory at a ±12-bit offset from a register. Now there are lots of ways you could generate code with this restriction; for example, you could ensure your data section is less than 8KiB in size, and reserve a register to be used as a base for all data lookups. But such an approach only works if you have a limited data section size. The standard approach is that when a variable is used its address is written out into a literal pool. The compiler then generates two instructions: the first reads the address from this literal pool, and the second accesses the variable.

So, how exactly does this literal pool work? Well, so that a special register is not needed to point at the literal pool, the compiler uses the program counter (PC) register as the base register. The generated code looks something like: ldr r3, [pc, #28]. That code loads a value at a 28-byte offset from the current value of PC into the register r3. r3 then contains the address of the variable we want to access, and can be used like: ldr r1, [r3, #0], which loads the value of the variable (rather than the address) into r1. Now, as the PC is used as the base for the literal pool access, it should be clear that the literal pool must be stored close enough to the code that needs to use it.

To ensure that the literal pool is close enough to the code using it, the compiler stores a literal pool at the end of each function. This approach works pretty well (unless you have a 4KiB+ function, which would be silly anyway), but can be a bit of a waste.

To illustrate the problem, consider this (contrived) example code:

static unsigned int value;

unsigned int
get_value(void)
{
    return value;
}

void
set_value(unsigned int x)
{
    value = x;
}

Now, while this example is contrived, the pattern involved exhibits itself in a lot of real-world code. You have some private data in a compilation unit (value), and then you have a set of accessor (get_value) and mutator (set_value) functions that operate on the private data. Usually the functions would be more complex than in our example, and usually there would be more than two. So let’s have a look at the generated code:

00000000 <get_value>:
   0:	4b01      	ldr	r3, [pc, #4]	(8 <get_value+0x8>)
   2:	6818      	ldr	r0, [r3, #0]
   4:	4770      	bx	lr
   6:	46c0      	nop			(mov r8, r8)
   8:	00000000 	.word	0x00000000
			8: R_ARM_ABS32	.bss

0000000c <set_value>:
   c:	4b01      	ldr	r3, [pc, #4]	(14 <set_value+0x8>)
   e:	6018      	str	r0, [r3, #0]
  10:	4770      	bx	lr
  12:	46c0      	nop			(mov r8, r8)
  14:	00000000 	.word	0x00000000
			14: R_ARM_ABS32	.bss

You can see that each function has a literal pool (at addresses 0x8 and 0x14). You can also see that there is a relocation associated with each of these addresses (R_ARM_ABS32 .bss). This relocation means that at link time the address of value will be stored at locations 0x8 and 0x14. So, what is the big deal here? Well, there are two problems. First, we have two literal pools containing duplicate data; by storing the address of value twice, we are wasting 4 bytes (remember from yesterday, we have a very tight memory budget and we care where every byte goes). The second problem is that we need to insert a nop in the code (at addresses 0x6 and 0x12), because the literal pool must be word aligned.

So, how could the compiler be smarter? Well, if instead of generating a literal pool for each individual function it did it for the whole compilation unit, then instead of having lots of little literal pools with duplicated data throughout, we would have a single literal pool for the whole file. As a bonus, you would only need alignment once as well! Obviously if the compilation unit ends up being larger than 4KiB then you have a problem, but in this case you could still hold off emitting the literal pool until after 4KiB worth of code. As it turns out the commercial compiler from ARM, RVCT, does exactly this. So let’s have a look at the code it generates:

00000000 <get_value>:
   0:	4802      	ldr	r0, [pc, #8]	(c <set_value+0x6>)
   2:	6800      	ldr	r0, [r0, #0]
   4:	4770      	bx	lr

00000006 <set_value>:
   6:	4901      	ldr	r1, [pc, #4]	(c <set_value+0x6>)
   8:	6008      	str	r0, [r1, #0]
   a:	4770      	bx	lr
   c:	00000000 	.word	0x00000000
			c: R_ARM_ABS32	.data$0

You see that the code is more or less the same, but there is just one literal pool right at the end of the file, and no extra nops are needed for alignment. Without merging literal pools we have a .text size of 24 bytes, with the merging we slash that down to 16 bytes.
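The target addresses that objdump prints in parentheses can be checked by hand. The sketch below assumes the 16-bit Thumb encoding of ldr rX, [pc, #imm], where the PC reads as the instruction address plus 4 and is rounded down to a word boundary before the offset is added; the function name thumb_literal_addr is made up for illustration.

```c
#include <stdint.h>

/*
 * Address read by a 16-bit Thumb "ldr rX, [pc, #imm]" located at insn_addr.
 * The PC reads as insn_addr + 4 and is rounded down to a multiple of 4
 * before the immediate offset is added.
 */
static uint32_t thumb_literal_addr(uint32_t insn_addr, uint32_t imm)
{
    uint32_t pc = insn_addr + 4u;
    return (pc & ~UINT32_C(3)) + imm;
}
```

Applied to the listings above, thumb_literal_addr(0x0, 8) and thumb_literal_addr(0x6, 4) both give 0xc, which is how the two RVCT functions share one pool entry. The rounding step is also why the pool itself has to sit on a word boundary.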

So merging literal pools is pretty good, but the frustrating thing is that in this example, we don’t even need the literal pool! If we examine the final compiled image for this program:

Disassembly of section ER_RO:

00008000 <get_value>:
    8000:	4802      	ldr	r0, [pc, #8]	(800c <set_value+0x6>)
    8002:	6800      	ldr	r0, [r0, #0]
    8004:	4770      	bx	lr

00008006 <set_value>:
    8006:	4901      	ldr	r1, [pc, #4]	(800c <set_value+0x6>)
    8008:	6008      	str	r0, [r1, #0]
    800a:	4770      	bx	lr
    800c:	00008010 	.word	0x00008010
Disassembly of section ER_RW:

00008010 <value>:
    8010:	00000000 	.word	0x00000000

You should notice that the actual location of the value variable is 0x8010. At address 0x800c we have the literal pool storing the address of a variable which is in the very next word! If we optimised this by hand, we would end up with something like (need to verify the offset):

Disassembly of section ER_RO:

00008000 <get_value>:
    8000:	4802      	ldr	r0, [pc, #4]	(8008 <set_value+0x8>)
    8002:	4770      	bx	lr

00008004 <set_value>:
    8004:	6008      	str	r0, [pc, #0]
    8006:	4770      	bx	lr
Disassembly of section ER_RW:

00008008 <value>:
    8008:	00000000 	.word	0x00000000

If we get rid of the literal pool entirely, we save the memory of the literal pool itself (4 bytes), plus the two instructions needed to load values out of the literal pool (4 bytes). This cuts our text size down to a total of only 8 bytes! This is a factor of 3 improvement over the GCC generated code. Granted, you are not always going to be able to perform this type of optimisation, but when you care about size, it is quite important. It would be nice if gcc supported a small data section concept so that you could specify variables that essentially resided within the literal pool instead of needing an expensive (in terms of space and time) indirection.

For this project, it looks like the code will have to be hand tweaked assembler, which is frustrating, because when you use a person as your compiler iterations become really expensive, and you want to make sure you get your design right up front.

Small code

Thu, 01 Jan 2009 16:05:38 +0000
tech article arm

A project that I’m working on at the moment calls for a very small footprint. This post is about how to make code really small for the ARM architecture.

As you can probably guess, I’m interested in operating system code, so as an example, I’ve taken a very simple piece of operating system code, and stripped it right back to demonstrate some of the techniques to use when optimising for space. So here is the snippet of code we are going to optimise:

01 struct thread {
02     unsigned int notify_bits;
03 };
05 unsigned int
06 poll(struct thread *thread, unsigned int mask)
07 {
08     unsigned int result;
09     result = thread->notify_bits & mask;
10     if (result) {
11         thread->notify_bits &= ~result;
12     }
13     return result;
14 }

In this very simple operating system we have threads (thread data is stored in struct thread). Each thread has a set of 32 signals (encoded in a single word, notify_bits). This poll is used by a thread to determine if it has been sent any signals. The mask parameter is the set of signals that the thread is interested in checking. So, a thread can check if a single signal has been set, or if any signal has been set, or if a specific subset of signals has been set. The function returns the signals that are available (which is simply the bit-wise and of notify_bits and mask). It is important that the function clears any signals that have been returned. This makes sure that if poll is called twice the same signals are not returned. This is achieved in lines 10 to 12.

So, our goal is to try and get this code as small as possible. And every byte counts! First off, we just try and compile this with the standard ARM gcc compiler. I’m using version 4.2.2. So we start with: $ arm-elf-gcc -c poll.c -o poll.o. We can then use object dump to work out what the compiler did: $ arm-elf-objdump -dl poll.o.

00000000 <poll>:
   0:	e1a0c00d 	mov	ip, sp
   4:	e92dd800 	push	{fp, ip, lr, pc}
   8:	e24cb004 	sub	fp, ip, #4	; 0x4
   c:	e24dd00c 	sub	sp, sp, #12	; 0xc
  10:	e50b0014 	str	r0, [fp, #-20]
  14:	e50b1018 	str	r1, [fp, #-24]
  18:	e51b3014 	ldr	r3, [fp, #-20]
  1c:	e5932000 	ldr	r2, [r3]
  20:	e51b3018 	ldr	r3, [fp, #-24]
  24:	e0023003 	and	r3, r2, r3
  28:	e50b3010 	str	r3, [fp, #-16]
  2c:	e51b3010 	ldr	r3, [fp, #-16]
  30:	e3530000 	cmp	r3, #0	; 0x0
  34:	0a000006 	beq	54 <poll+0x54>
  38:	e51b3014 	ldr	r3, [fp, #-20]
  3c:	e5932000 	ldr	r2, [r3]
  40:	e51b3010 	ldr	r3, [fp, #-16]
  44:	e1e03003 	mvn	r3, r3
  48:	e0022003 	and	r2, r2, r3
  4c:	e51b3014 	ldr	r3, [fp, #-20]
  50:	e5832000 	str	r2, [r3]
  54:	e51b3010 	ldr	r3, [fp, #-16]
  58:	e1a00003 	mov	r0, r3
  5c:	e24bd00c 	sub	sp, fp, #12	; 0xc
  60:	e89da800 	ldm	sp, {fp, sp, pc}

So, our first go gets us 100 bytes of code, which is 10 bytes (or 2.5 instructions) for each of our lines of code. We should be able to do better. Well, the first thing we should try is optimisation: $ arm-elf-gcc -c poll.c -o poll.o -O2. This gives us a much better code generation output:

00000000 <poll>:
   0:	e5902000 	ldr	r2, [r0]
   4:	e1a0c000 	mov	ip, r0
   8:	e0110002 	ands	r0, r1, r2
   c:	11e03000 	mvnne	r3, r0
  10:	10033002 	andne	r3, r3, r2
  14:	158c3000 	strne	r3, [ip]
  18:	e12fff1e 	bx	lr

So this got us down to 28 bytes. Roughly a factor of 3.5 improvement for one compiler flag, not bad. Now, -O2 does some standard optimisations, but -Os will do optimisations specifically for reducing the amount of code. So trying: $ arm-elf-gcc -c poll.c -o poll.o -Os. This gives slightly better code-gen:

00000000 <poll>:
   0:	e5903000 	ldr	r3, [r0]
   4:	e1a02000 	mov	r2, r0
   8:	e0110003 	ands	r0, r1, r3
   c:	11c33000 	bicne	r3, r3, r0
  10:	15823000 	strne	r3, [r2]
  14:	e12fff1e 	bx	lr

Down to 24 bytes (6 instructions) is pretty good. Now, as you can see, the generated code has 32 bits per instruction. Some of the ARM architectures have two distinct instruction sets, ARM and Thumb. The Thumb instruction set uses 16 bits per instruction instead of 32. This denser instruction set can enable much smaller code sizes. Of course there is a trade-off here: each 16-bit instruction can do less than a 32-bit instruction. But let’s give it a try. At the same time, we will tell the compiler the exact CPU we want to compile for (which is the ARM7TDMI-S in our case). The compiler line is: $ arm-elf-gcc -c poll.c -o poll.o -Os -mcpu=arm7tdmi -mthumb. Which produces code like:

00000000 <poll>:
   0:	6803      	ldr	r3, [r0, #0]
   2:	1c02      	adds	r2, r0, #0
   4:	1c08      	adds	r0, r1, #0
   6:	4018      	ands	r0, r3
   8:	d001      	beq.n	e <poll+0xe>
   a:	4383      	bics	r3, r0
   c:	6013      	str	r3, [r2, #0]
   e:	4770      	bx	lr

So, now we are down to 16 bytes, so in Thumb we need 8 instructions (2 more than ARM), but each is only 2 bytes, not 4, so we end up with a 1/3 improvement. To get any further, we need to start looking at our code again, and see if there are ways of improving the code. Looking at the code again:

00 unsigned int
01 poll(struct thread *thread, unsigned int mask)
02 {
03     unsigned int result;
04     result = thread->notify_bits & mask;
05     if (result) {
06         thread->notify_bits &= ~result;
07     }
08     return result;
09 }

You may notice that the branch on line 05 is actually redundant. If result is zero, then ~result will be 0xffffffff. Given this, thread->notify_bits &= 0xffffffff will not change the value of thread->notify_bits. So, we can reduce this to:

00 unsigned int
01 poll(struct thread *thread, unsigned int mask)
02 {
03     unsigned int result;
04     result = thread->notify_bits & mask;
05     thread->notify_bits &= ~result;
06     return result;
07 }

When we compile this we get down to:

00000000 <poll>:
   0:	6803      	ldr	r3, [r0, #0]
   2:	4019      	ands	r1, r3
   4:	438b      	bics	r3, r1
   6:	6003      	str	r3, [r0, #0]
   8:	1c08      	adds	r0, r1, #0
   a:	4770      	bx	lr

This gets us down to 6 instructions (12 bytes). Pretty good since we started at 100 bytes. Now let’s look at the object code in a little bit more detail. If you look at address 0x8, the instruction simply moves register r1 into register r0 so that it is in the right place for the return. (Note: The ARM ABI has the return value stored in r0.) This seems like a bit of a waste; it would be good if there was a way we could have the value stored in r0 and not waste an instruction just moving values between registers. Now, to get a better understanding of what the instructions are doing, I’m going to slightly rewrite the code, and then compile with debugging, so we can see how the generated code matches up with the source code. So first, let’s rewrite the code a little bit:

00 unsigned int
01 poll(struct thread *thread, unsigned int mask)
02 {
03     unsigned int tmp_notify_bits = thread->notify_bits;
04     mask &= tmp_notify_bits;
05     tmp_notify_bits &= ~mask;
06     thread->notify_bits = tmp_notify_bits;
07     return mask;
08 }

You should convince yourself that this code is equivalent to the existing code. (The code generated is identical.) So, we can now line up the source code with the generated code. Line 03 (unsigned int tmp_notify_bits = thread->notify_bits) matches up with address 0x0 (ldr r3, [r0, #0]). So, register r3 is used to store the variable tmp_notify_bits, and the parameter thread is stored in register r0. Now, line 04 (mask &= tmp_notify_bits) matches directly with address 0x2 (ands r1, r3); register r1 holds the mask parameter. The important part of restructuring the code as we have done is that it becomes obvious that we can directly use the mask parameter instead of needing an extra variable like in the previous code. As we continue, line 05 (tmp_notify_bits &= ~mask) matches directly to 0x4 (bics r3, r1). The bics instruction is quite neat in that it can essentially do the &= ~ in one 16-bit instruction. Line 06 (thread->notify_bits = tmp_notify_bits), which stores the result back to memory, matches directly to 0x6 (str r3, [r0, #0]).

Now the final line of code (return mask;) needs two instructions, at 0x8 and 0xa (adds r0, r1, #0 and bx lr). The reason we need two instructions is that mask is stored in register r1, and the return value needs to be in register r0. So how can we get mask to be stored in r0 all along? Well, if we switch the parameters around, poll(unsigned int mask, struct thread *thread), the mask will be stored in r0 instead of r1. (Note: We don’t necessarily have to change the interface of the function. If we want to keep the interface we can use a simple macro to textually swap the parameters.) If we compile this, we get the following generated code:

00000000 <poll>:
   0:	680b      	ldr	r3, [r1, #0]
   2:	4018      	ands	r0, r3
   4:	4383      	bics	r3, r0
   6:	600b      	str	r3, [r1, #0]
   8:	4770      	bx	lr

So, we have got this down to 5 instructions, 10 bytes. This is a factor 10 improvement, not bad!
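Since the function was rewritten a few times along the way, it is worth checking that the first and last versions really do the same thing. The following is a small sketch of such a check, along with the parameter-swapping macro mentioned earlier. The names poll_branch, poll_swapped, thread_poll and poll_versions_agree are invented for this example and are not from the original project.

```c
struct thread {
    unsigned int notify_bits;
};

/* Original version, with the conditional clear. */
static unsigned int poll_branch(struct thread *thread, unsigned int mask)
{
    unsigned int result = thread->notify_bits & mask;
    if (result) {
        thread->notify_bits &= ~result;
    }
    return result;
}

/* Final version: no branch, and mask comes first so it arrives in r0. */
static unsigned int poll_swapped(unsigned int mask, struct thread *thread)
{
    unsigned int tmp_notify_bits = thread->notify_bits;
    mask &= tmp_notify_bits;
    tmp_notify_bits &= ~mask;
    thread->notify_bits = tmp_notify_bits;
    return mask;
}

/* Hypothetical wrapper: callers keep the (thread, mask) argument order. */
#define thread_poll(thread, mask) poll_swapped((mask), (thread))

/* Returns 1 if both versions agree on the result and the final state. */
static int poll_versions_agree(unsigned int bits, unsigned int mask)
{
    struct thread a = { bits };
    struct thread b = { bits };
    unsigned int ra = poll_branch(&a, mask);
    unsigned int rb = thread_poll(&b, mask);
    return ra == rb && a.notify_bits == b.notify_bits;
}
```

Because the macro only reorders its arguments textually, it costs nothing at run time, and existing call sites do not need to change.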

So a bit of a review:

Android booting on Neo 1973

Sun, 02 Nov 2008 16:03:45 +0000
android arm tech

Well, it started almost a year ago, but I finally now have Android booting on my Neo 1973 phone:

It ain’t exactly running fast yet, and not everything is working 100%, but I think most of the tricky bits are done. I’m starting to push most of these changes back to the android project. It seems that while I’ve been working on this Sean McNeill has been having similar successes getting Android up on the latest Freerunner phones.

Android on ARMv4 (take 2)

Mon, 27 Oct 2008 21:36:32 +0000
android tech article arm

So, my earlier post on this was a little premature; anyone who has tried out the code has found out that it pretty much doesn’t work (hey I did warn you!). Now there are a range of fun reasons why this didn’t work, most of which I’ve now solved.

Firstly, it turns out that EABI and ARMv4T are pretty much incompatible. (I’ll post separately about that!). In short, thumb interworking doesn’t (can’t) work, so I’ve reverted back to plain old ARMv4 architecture as my target (the only difference between ARMv4 and ARMv4T is the thumb stuff, which we can’t use until the compiler / spec is fixed.). So I’ve updated the to support ARMv4 for now as well.

Of course the next problem that this introduces is that the bx instruction doesn’t exist on ARMv4, and GCC (helpfully) complains and stops the compilation. Now a BX without thumb support is simply a mov pc, instruction, so I went through and provided a BX macro that expands to either bx or mov pc,. This is a little bit nasty/invasive because it touches all the system call bindings; thankfully these are generated anyway, but it makes the diff quite large. (When I have time I’ll make it so that generation is part of the build system, not a manual process.)
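To give a rough idea of the shape of such a macro, here is a sketch; it is not the code from the patch. It assumes the compiler predefines __ARM_ARCH_4__ without __ARM_ARCH_4T__ when targeting plain ARMv4. In a .S file the macro would expand to the bare instruction, but here it expands to a string so the result can be inspected on any host. The name syscall_return is purely illustrative.

```c
/*
 * Assumption: the compiler predefines __ARM_ARCH_4__ (and not
 * __ARM_ARCH_4T__) when targeting plain ARMv4, which has no bx.
 */
#if defined(__ARM_ARCH_4__) && !defined(__ARM_ARCH_4T__)
#define BX(reg) "mov pc, " #reg
#else
#define BX(reg) "bx " #reg
#endif

/* Example: the return sequence a generated syscall stub might end with. */
static const char syscall_return[] = BX(lr);
```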

The next problem is that the provided compiler’s libgcc library is built for ARMv5, and has instructions that just don’t exist on ARMv4 (such as clz), so I went and built a new compiler targeted to ARMv4. There is no reason why this couldn’t be set up as a multi-lib compiler that supports both, but I don’t have enough GCC wizardry in me to work that out right now. So a new compiler.

This got things to a booting stage, but not able to mount /system or /data. Basically, Android by default uses yet another flash file-system (YAFFS), but for some reason, which I couldn’t fully work out initially, the filesystem just didn’t seem to cleanly initialise and then mount. So, without diving too deep, I figured I could just use jffs2 instead, which I know works on the target. So I upgraded the Android build system to let you choose which filesystem type to use, with jffs2 as an option. This was going much better, and I got a lot further, far enough that I needed to recompile my kernel with support for some of the Android specific drivers like ashmem, binder and logger. Unfortunately I was getting a hang on an mmap call, for reasons that I couldn’t quite work out. After a lot of tedious debugging (my serial console is broken, so I have to rely on the graphics console, which is really just an insane way to try and debug anything), it turns out that part of what the Dalvik virtual machine does when optimising class files is to mmap the file as writable memory. This was what was failing, with the totally useless error invalid argument. Do you know how many unique paths along the mmap system call can set EINVAL? Well it’s a lot. Anyway, long story short, it turns out that the jffs2 filesystem doesn’t support writable mmaps! %&!#.
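A quick way to find out whether a filesystem will accept this kind of mapping is a small probe like the one below. This is only a sketch: it assumes the mapping in question is a shared, writable one (I have not checked exactly which flags Dalvik passes), and the function name try_writable_mmap is made up.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Try to map an existing file as shared, writable memory.
 * Returns 0 on success, or the errno value from the failing call.
 */
static int try_writable_mmap(const char *path)
{
    struct stat st;
    void *p;
    int err = 0;
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return errno;
    if (fstat(fd, &st) < 0) {
        err = errno;
    } else if (st.st_size == 0) {
        err = EINVAL;               /* mmap rejects zero-length mappings */
    } else {
        p = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
                 MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            err = errno;            /* EINVAL is one of many possibilities */
        else
            munmap(p, (size_t)st.st_size);
    }
    close(fd);
    return err;
}
```

Running it against a file on each candidate partition would have shown the problem long before the virtual machine got involved.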

After I finished cursing, I decided to go back to using yaffs and working out what the real problem is. After upgrading u-boot (in a pointless attempt to fix my serial console), I noticed a new write yaffs[1] command. This wasn’t there in the old version. Ok, cool, maybe this has something to do with the problem. But what is the deal with yaffs versus yaffs1? Well it turns out that NAND comes in different page sizes, 512 bytes and 2k (or multiples thereof, maybe??). And it turns out that YAFFS takes advantage of this and has different file systems for different sized NAND pages, and of course, everything that can go wrong will, so the filesystem image that the build system creates is YAFFS2, which is for 2k pages, not 512-byte pages. So, I again updated the build system to build both the mkyaffs2image and the mkyaffsimage tools, and then set off building a YAFFS file system.

Now, while u-boot supports yaffs filesystem, device firmware update doesn’t (appear to). So this means I need to copy the image to memory first, then on the device copy it from memory to flash. Now, the other fun thing is that dfu can only copy 2MB or so to RAM at a time, and the system.img file is around 52MB or so, which means that it takes around 26 individual copies of 2MB sections.... very, very painful. But in the end this more or less worked. So now I have a 56MB partition for the system, and a 4MB partition for the user and things are looking good.

Good that is, right up until the point where dalvik starts up and writes out cached versions of class files to /data. You see, it needs more than 4MB, a lot more, so I’m kind of back to square one. I mean, if I’d looked at the requirements I would have read 128MB of flash, but meh, who reads requirements? The obvious option would be some type of MMC card, but as it turns out the number of handy Fry’s stores on a Boeing 747 from Sydney to LA numbers in the zeroes.

So the /system partition is read-only, and since the only problem with jffs2 was when we were writing to it, it seems that we could use jffs2 for the read-only system partition, which has the advantage of jffs2 doing compression, and fitting in about 30MB, not about 50MB, leaving plenty of room for the user data partition, which is where the Dalvik cached files belong. This also has the advantage of being able to use normal DFU commands to install the image (yay!). So after more updates to the build system to now support individually setting the system filesystem type and the user filesystem type things seem a lot happier.

Currently, I have a system that boots init, starts up most of the system services, including the Dalvik VM, runs a bunch of code, but bombs out with an out-of-memory error in the pixelflinger code which I’m yet to have any luck tracing. Currently my serial console is fubar, so I can’t get any useful logging, which makes things doubly painful. The next step is to get adb working over USB so I have at least an output of the errors and warning, which should give me half a chance of tracking down the problem.

So if you want to try and get up to this point, what are the steps? Well, firstly go and download the android toolchain source code, and compile it for a v4 target. You use the --target=armv4-android-eabi argument to configure, if I remember correctly.

Once you have that done, grab my latest patch and apply it to the Android source code base. (That is a tar file with diffs for each individual project; applying these correctly is left as an exercise for the reader.) Then you want to compile it with the new toolchain. I use a script like this:


     MKJFFS2_CMD="ssh nirvana -x \"cd `pwd`; mkfs.jffs2\""  \
     SYSTEM_FSTYPE=jffs2 \
     USERDATA_FSTYPE=yaffs \
     TARGET_TOOLS_PREFIX=/opt/benno/bin/armv4-android-eabi- $@

Things you will need to change are the tools prefix and the mkjffs2 command. The evil-hackery above is to run it on my linux virtual machine (I’m compiling the rest under OS X, and I can’t get mkfs.jffs2 to compile under it yet).

After some time passes you should end up with a ramdisk.img, userdata.img and system.img files. The next step is to get a usable kernel.

I’m using the OpenMoko stable kernel, which is 2.6.24 based. I’ve patched this with bits of the Android kernel (enough, I think, to make it run). Make sure you configure support for yaffs, binder, logger and ashmem. Here is the kernel config I’m currently using.

At this stage it is important you have a version of u-boot supporting the yaffs write commands, if you don’t your next step is to install that. After this the next step is to re-partition your flash device. In case it isn’t obvious this will trash your current OS. The useful parts from my uboot environment are:

bootcmd=setenv bootargs ${bootargs_base} ${mtdparts} initrd=${rdaddr},${rdsize}; nand read.e ${kaddr} kernel; nand read.e ${rdaddr} ramdisk; bootm ${kaddr}
bootargs_base=root=/dev/ram rw console=tty0 loglevel=8

Note the mtdparts which defines the partitions, and the bootcmd. (I’m not entirely happy with the boot command, mostly because when I install new RAM image I need to manually update $rdsize, which is a pain).

With this in place you are ready to start. The first image to move across is your userdata image. Now to make this happen we first copy it into memory using dfu-util:

sudo dfu-util -a 0 -R -D source/out/target/product/generic/userdata.img  -R

Then you need to use the nand write.yaffs1 command to copy it to the data partition. Note, at this stage I get weird behaviour; I’m not convinced that the yaffs support truly works yet! Afterwards I get some messed up data in other parts of the flash (which is why we are doing it first). After you have copied it in, I suggest resetting the device, and you may find you need to reinitialise u-boot (using dyngen, and setting up the environment again as above).

After this you are good to use dfu-util to copy across the kernel, system.img and ramdisk.img. After copying the ramdisk.img across, update the rdsize variable with the size of the ramdisk.

Once all this is done, you are good to boot, I wish you luck! If you have a working serial console you can probably try the logcat command to see why graphics aren’t working. If you get this far please email me the results!

Compiling the Android source code for ARMv4T

Thu, 23 Oct 2008 23:02:13 +0000
tech article android arm

After a lot of stuffing around installing new hard drives so I had enough space to actually play with the source code, and getting screwed by Time Machine when trying to convert my filesystem from case-insensitive to case-sensitive (I gave up and am now using a case-sensitive disk image on top of my case-insensitive file system.. sigh), I finally have the Android source code compiling, yay!

Compiling is fairly trivial, just make and away it goes. The fun thing is trying to work out exactly what the hell the build system is actually doing. I’ve got to admit though, it is a pretty clean build system, although it isn’t going to win any speed records. I’m going to go into more details on the build system when I have more time, and I’ve actually worked out what the hell is happening.

Anyway, after a few false starts I now have the build system compiling for ARMv4T processors (such as the one inside the Neo1973), and hopefully at the same time I haven’t broken compilation for ARMv5TE.

For those interested I have a patch available. Simply apply this to the checked out code, and then build using make TARGET_ARCH_VERSION=armv4t. Now, of course I haven’t actually tried to run this code yet, so it might not work, but it seems to compile fine, so that is a good start! Now once I work out how to make git play nice I'll actually put this into a branch and make it available, but the diff will have to suffice for now. Of course I’m not the only one looking at this; check out Christopher’s page for more information. (Where he actually starts solving some problems instead of just working around them ;)

The rest of this post documents the patch. For those interested it should give you some idea of the build system and layout, and hopefully it is something that can be applied to mainline.

The first changes made are to the file. A new make variable TARGET_ARCH_VERSION is added. For now this is defaulted to armv5te, but it can be overridden on the command line as shown above.

project build/
diff --git a/core/combo/ b/core/combo/
index adb82d3..a43368f 100644
--- a/core/combo/
+++ b/core/combo/
@@ -7,6 +7,8 @@ $(combo_target)TOOLS_PREFIX := \
 $(combo_target)CC := $($(combo_target)TOOLS_PREFIX)gcc$(HOST_EXECUTABLE_SUFFIX)
 $(combo_target)CXX := $($(combo_target)TOOLS_PREFIX)g++$(HOST_EXECUTABLE_SUFFIX)
 $(combo_target)AR := $($(combo_target)TOOLS_PREFIX)ar$(HOST_EXECUTABLE_SUFFIX)

The next thing is to make the GLOBAL_CFLAGS variable dependent on the architecture version. The armv5te defines stay in place, but an armv4t architecture version is added. Most of the cflags are pretty similar, except we change the -march flag, and change the pre-processor defines. These will become important later in the patch as they provide the mechanism for distinguishing between versions in the code.

@@ -46,6 +48,7 @@ ifneq ($(wildcard $($(combo_target)CC)),)
 $(combo_target)LIBGCC := $(shell $($(combo_target)CC) -mthumb-interwork -print-libgcc-file-name)
+ifeq ($(TARGET_ARCH_VERSION), armv5te)
 $(combo_target)GLOBAL_CFLAGS += \
 			-march=armv5te -mtune=xscale \
 			-msoft-float -fpic \
@@ -56,6 +59,21 @@ $(combo_target)GLOBAL_CFLAGS += \
 			-D__ARM_ARCH_5__ -D__ARM_ARCH_5T__ \
 			-D__ARM_ARCH_5E__ -D__ARM_ARCH_5TE__ \
 			-include $(call select-android-config-h,linux-arm)
+ifeq ($(TARGET_ARCH_VERSION), armv4t)
+$(combo_target)GLOBAL_CFLAGS += \
+			-march=armv4t \
+			-msoft-float -fpic \
+			-mthumb-interwork \
+			-ffunction-sections \
+			-funwind-tables \
+			-fstack-protector \
+			-D__ARM_ARCH_4__ -D__ARM_ARCH_4T__ \
+			-include $(call select-android-config-h,linux-arm)
 $(combo_target)GLOBAL_CPPFLAGS += -fvisibility-inlines-hidden
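As an illustration of how these defines can be used to tell the architecture versions apart in C code, here is a hedged sketch of a count-leading-zeros helper. It takes the fast path only when __ARM_ARCH_5__ is defined, as in the armv5te flags above, and otherwise falls back to a plain loop that works on ARMv4T. The function clz32 is hypothetical and is not part of the patch.

```c
#include <stdint.h>

/* Count leading zero bits in a 32-bit word; returns 32 for x == 0. */
static unsigned int clz32(uint32_t x)
{
    if (x == 0)
        return 32;
#if defined(__ARM_ARCH_5__) && defined(__GNUC__)
    /* ARMv5 and later: the builtin can compile to a single clz. */
    return (unsigned int)__builtin_clz(x);
#else
    /* Portable fallback for ARMv4T (and any other target). */
    unsigned int n = 0;
    while ((x & UINT32_C(0x80000000)) == 0) {
        x <<= 1;
        n++;
    }
    return n;
#endif
}
```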

The next bit we update is the file. The dynamic libraries in android are laid out explicitly in virtual memory according to this map file. If I’m not mistaken those addresses look suspiciously 1MB aligned, which means they should fit nicely in the pagetable, and provide some opportunity to use fast-address-space-switching techniques. In the port to ARMv4 I have so far been lazy and instead of fixing up any assembler code I’ve just gone with existing C code. One outcome of this is that I need the for my foreign function interface, so I’ve added this to the map for now. I’m not 100% sure that when compiling for ARMv5 this won’t cause a problem. Will need to see. Fixing up the code to avoid needing libffi is probably high on the list of things to do.

diff --git a/core/ b/core/
index d4ebf43..6e0bc43 100644
--- a/core/
+++ b/core/
@@ -113,3 +113,4 @@             0x9A700000          0x9A500000               0x9A400000        0x9A300000               0x9A200000

The next module is the bionic module, which is the light-weight C library that is part of Android. This has some nice optimised routines for memory copy and compare, but unfortunately they rely on ARMv5 instructions. I’ve changed the build system to only use the optimised assembler when compiling for ARMv5TE, and to fall back to the C routines in the other cases. (The strlen implementation isn’t pure assembly, but the optimised C implementation has inline asm, so again it needs to drop back to plain old dumb strlen.)

project bionic/
diff --git a/libc/ b/libc/
index faca333..3fb3455 100644
--- a/libc/
+++ b/libc/
@@ -206,13 +206,9 @@ libc_common_src_files := \
 	arch-arm/bionic/_setjmp.S \
 	arch-arm/bionic/atomics_arm.S \
 	arch-arm/bionic/clone.S \
-	arch-arm/bionic/memcmp.S \
-	arch-arm/bionic/memcmp16.S \
-	arch-arm/bionic/memcpy.S \
 	arch-arm/bionic/memset.S \
 	arch-arm/bionic/setjmp.S \
 	arch-arm/bionic/sigsetjmp.S \
-	arch-arm/bionic/strlen.c.arm \
 	arch-arm/bionic/syscall.S \
 	arch-arm/bionic/kill.S \
 	arch-arm/bionic/tkill.S \
@@ -274,6 +270,18 @@ libc_common_src_files := \
 	netbsd/nameser/ns_print.c \
+ifeq ($(TARGET_ARCH),arm)
+ifeq ($(TARGET_ARCH_VERSION),armv5te)
+libc_common_src_files += arch-arm/bionic/memcmp.S \
+		arch-arm/bionic/memcmp16.S \
+		arch-arm/bionic/memcpy.S \
+		arch-arm/bionic/strlen.c.arm
+libc_common_src_files += string/memcmp.c string/memcpy.c string/strlen.c string/ffs.c
 # These files need to be arm so that gdbserver
 # can set breakpoints in them without messing
 # up any thumb code.

Unfortunately, it is clear that this C-only code hasn’t been used in a while, as there was a trivial bug, fixed by the patch below. This makes me worry about what other bugs that aren’t caught by the compiler may be lurking.

diff --git a/libc/string/memcpy.c b/libc/string/memcpy.c
index 4cd4a80..dea78b2 100644
--- a/libc/string/memcpy.c
+++ b/libc/string/memcpy.c
@@ -25,5 +25,5 @@
-#define MEM_COPY
+#define MEMCOPY
 #include "bcopy.c"
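
Why does a one-character typo like this slip past the compiler? Here is a minimal sketch, assuming bcopy.c follows the usual BSD layout in which the MEMCOPY and MEMMOVE macros select which function the shared source defines. The enum and the function selected_variant are hypothetical stand-ins for that selection logic.

```c
/* Deliberate typo, mirroring the bug: the shared source tests MEMCOPY. */
#define MEM_COPY

enum variant { VARIANT_BCOPY, VARIANT_MEMMOVE, VARIANT_MEMCPY };

/* Hypothetical stand-in for the #ifdef ladder in a shared bcopy.c. */
static enum variant selected_variant(void)
{
#if defined(MEMCOPY)
    return VARIANT_MEMCPY;
#elif defined(MEMMOVE)
    return VARIANT_MEMMOVE;
#else
    /* An unknown macro is not an error; it just selects the default. */
    return VARIANT_BCOPY;
#endif
}
```

The pre-processor is perfectly happy with a macro nobody tests, so the file compiles cleanly and simply builds the wrong function.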

Finally, frustratingly, the compiler’s ffs() implementation appears to fall back to calling the C library’s ffs() implementation if it can’t do something optimised. This happens when compiling for ARMv4, so I’ve added an ffs() implementation (stolen from FreeBSD).


/*
 * Find First Set bit
 */
int
ffs(int mask)
{
        int bit;

        if (mask == 0)
                return (0);
        for (bit = 1; !(mask & 1); bit++)
                mask = (unsigned int)mask >> 1;
        return (bit);
}
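
As a quick sanity check of the semantics (a 1-based bit index, and 0 when no bit is set), here is a minimal standalone sketch of the same loop. The name ffs_portable is hypothetical, chosen so that it doesn't clash with the libc ffs().

```c
/* Same algorithm as above, under a name that does not collide with libc. */
static int ffs_portable(int mask)
{
    unsigned int m = (unsigned int)mask; /* shift as unsigned to avoid sign issues */
    int bit;

    if (m == 0)
        return 0;                        /* no bit set */
    for (bit = 1; !(m & 1u); bit++)
        m >>= 1;
    return bit;                          /* 1 means bit 0 is the lowest set bit */
}
```

So ffs_portable(8) is 4 rather than 3, which is the off-by-one that usually trips people up with this function.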

The next module for attention is the dalvik virtual machine. Again this has some code that relies on ARMv5, but there is a C version that we fall back on. In this case it also means pulling in libffi. This is probably the module that needs the most attention in actually updating the code to be ARMv4 assembler in the near future.

project dalvik/
diff --git a/vm/ b/vm/
index dfed78d..c66a861 100644
--- a/vm/
+++ b/vm/
@@ -189,6 +189,7 @@ ifeq ($(TARGET_SIMULATOR),true)
 ifeq ($(TARGET_ARCH),arm)
+ifeq ($(TARGET_ARCH_VERSION),armv5te)
 	# use custom version rather than FFI
 	#LOCAL_SRC_FILES += arch/arm/CallC.c
 	LOCAL_SRC_FILES += arch/arm/CallOldABI.S arch/arm/CallEABI.S
@@ -204,6 +205,16 @@ else
 		mterp/out/InterpC-desktop.c \
+	# use FFI
+	LOCAL_C_INCLUDES += external/libffi/$(TARGET_OS)-$(TARGET_ARCH)
+	LOCAL_SRC_FILES += arch/generic/Call.c
+		mterp/out/InterpC-desktop.c \
+		mterp/out/InterpAsm-desktop.S
 LOCAL_MODULE := libdvm

Next is libjpeg, which again has assembler optimisations that we can’t easily use without real porting work, so we fall back to the C code.

project external/jpeg/
diff --git a/ b/
index 9cfe4f6..3c052cd 100644
--- a/
+++ b/
@@ -19,6 +19,12 @@ ifneq ($(TARGET_ARCH),arm)
+# the assembler doesn't work for armv4t
+ifeq ($(TARGET_ARCH_VERSION),armv4t)
 # temp fix until we understand why this broke

For some reason compiling with ARMv4 doesn’t allow the prefetch loop array compiler optimisation, so we turn it off for ARMv4.

@@ -29,7 +35,10 @@ LOCAL_SRC_FILES += jidctint.c jidctfst.S
-LOCAL_CFLAGS += -O3 -fstrict-aliasing -fprefetch-loop-arrays
+LOCAL_CFLAGS += -O3 -fstrict-aliasing
+ifeq ($(TARGET_ARCH_VERSION),armv5te)
+LOCAL_CFLAGS += -fprefetch-loop-arrays
 #LOCAL_CFLAGS += -march=armv6j
 LOCAL_MODULE:= libjpeg

Next up is libffi, which is just a case of turning it on since we now need it for ARMv4.

project external/libffi/
diff --git a/ b/
index f4452c9..07b5c2f 100644
--- a/
+++ b/
@@ -6,7 +6,7 @@
 # We need to generate the appropriate defines and select the right set of
 # source files for the OS and architecture.
-ifneq ($(TARGET_ARCH),arm)
+ifneq ($(TARGET_ARCH_VERSION),armv5te)
 LOCAL_PATH:= $(call my-dir)
 include $(CLEAR_VARS)

The external module opencore contains a lot of software-implemented codecs. (I wonder about the licensing restrictions on these things...) Not surprisingly these too are tuned for ARMv5, but again we fall back to plain old C.

project external/opencore/
diff --git a/codecs_v2/audio/aac/dec/ b/codecs_v2/audio/aac/dec/
index ffe0089..6abdc2d 100644
--- a/codecs_v2/audio/aac/dec/
+++ b/codecs_v2/audio/aac/dec/
@@ -150,7 +150,7 @@ LOCAL_SRC_FILES := \
 LOCAL_MODULE := libpv_aac_dec
-ifeq ($(TARGET_ARCH),arm)
+ifeq ($(TARGET_ARCH_VERSION),armv5te)
diff --git a/codecs_v2/audio/gsm_amr/amr_wb/dec/ b/codecs_v2/audio/gsm_amr/amr_wb/dec/
index e184178..3223841 100644
--- a/codecs_v2/audio/gsm_amr/amr_wb/dec/
+++ b/codecs_v2/audio/gsm_amr/amr_wb/dec/
@@ -48,7 +48,7 @@ LOCAL_SRC_FILES := \
 LOCAL_MODULE := libpvamrwbdecoder
-ifeq ($(TARGET_ARCH),arm)
+ifeq ($(TARGET_ARCH_VERSION),armv5te)
diff --git a/codecs_v2/audio/mp3/dec/ b/codecs_v2/audio/mp3/dec/
index 254cb6b..c2430fe 100644
--- a/codecs_v2/audio/mp3/dec/
+++ b/codecs_v2/audio/mp3/dec/
@@ -28,8 +28,8 @@ LOCAL_SRC_FILES := \
 	src/pvmp3_seek_synch.cpp \
 	src/pvmp3_stereo_proc.cpp \
-ifeq ($(TARGET_ARCH),arm)
+ifeq ($(TARGET_ARCH_VERSION),armv5te)
 	src/asm/pvmp3_polyphase_filter_window_gcc.s \
 	src/asm/pvmp3_mdct_18_gcc.s \
@@ -46,7 +46,7 @@ endif
 LOCAL_MODULE := libpvmp3
-ifeq ($(TARGET_ARCH),arm)
+ifeq ($(TARGET_ARCH_VERSION),armv5te)

Unfortunately it is not just the build file that needs updating in this module. I need to manually go and update the headers so that some optimised inline assembler is only used in the ARMv5 case. To be honest this messes these files up a little bit, so a nicer solution would be preferred.

diff --git a/codecs_v2/video/m4v_h263/enc/src/dct_inline.h b/codecs_v2/video/m4v_h263/enc/src/dct_inline.h
index 86474b2..41a3297 100644
--- a/codecs_v2/video/m4v_h263/enc/src/dct_inline.h
+++ b/codecs_v2/video/m4v_h263/enc/src/dct_inline.h
@@ -22,7 +22,7 @@
 #ifndef _DCT_INLINE_H_
 #define _DCT_INLINE_H_
-#if !defined(PV_ARM_GCC)&& defined(__arm__)
+#if !(defined(PV_ARM_GCC) && defined(__arm__) && defined(__ARM_ARCH_5TE__))
 #include "oscl_base_macros.h"
@@ -109,7 +109,7 @@ __inline int32 sum_abs(int32 k0, int32 k1, int32 k2, int32 k3,
 #elif defined(__CC_ARM)  /* only work with arm v5 */
 #if defined(__TARGET_ARCH_5TE)
 __inline int32 mla724(int32 op1, int32 op2, int32 op3)
     int32 out;
@@ -266,7 +266,7 @@ __inline int32 sum_abs(int32 k0, int32 k1, int32 k2, int32 k3,
     return abs_sum;
-#elif defined(PV_ARM_GCC) && defined(__arm__) /* ARM GNU COMPILER  */
+#elif defined(PV_ARM_GCC) && defined(__arm__) && defined(__ARM_ARCH_5TE__) /* ARM GNU COMPILER  */
 __inline int32 mla724(int32 op1, int32 op2, int32 op3)
diff --git a/codecs_v2/video/m4v_h263/enc/src/fastquant_inline.h b/codecs_v2/video/m4v_h263/enc/src/fastquant_inline.h
index 6a35d43..fbfeddf 100644
--- a/codecs_v2/video/m4v_h263/enc/src/fastquant_inline.h
+++ b/codecs_v2/video/m4v_h263/enc/src/fastquant_inline.h
@@ -25,7 +25,7 @@
 #include "mp4def.h"
 #include "oscl_base_macros.h"
-#if !defined(PV_ARM_GCC) && defined(__arm__) /* ARM GNU COMPILER  */
+#if !(defined(PV_ARM_GCC) && defined(__arm__) && defined(__ARM_ARCH_5TE__)) /* ARM GNU COMPILER  */
 __inline int32 aan_scale(int32 q_value, int32 coeff, int32 round, int32 QPdiv2)
@@ -423,7 +423,7 @@ __inline int32 coeff_dequant_mpeg_intra(int32 q_value, int32 tmp)
     return q_value;
-#elif defined(PV_ARM_GCC) && defined(__arm__) /* ARM GNU COMPILER  */
+#elif defined(PV_ARM_GCC) && defined(__arm__) && defined(__ARM_ARCH_5TE__) /* ARM GNU COMPILER  */
 __inline int32 aan_scale(int32 q_value, int32 coeff,
                          int32 round, int32 QPdiv2)
diff --git a/codecs_v2/video/m4v_h263/enc/src/vlc_encode_inline.h b/codecs_v2/video/m4v_h263/enc/src/vlc_encode_inline.h
index 69857f3..b0bf46d 100644
--- a/codecs_v2/video/m4v_h263/enc/src/vlc_encode_inline.h
+++ b/codecs_v2/video/m4v_h263/enc/src/vlc_encode_inline.h
@@ -18,7 +18,7 @@
-#if !defined(PV_ARM_GCC)&& defined(__arm__)
+#if !(defined(PV_ARM_GCC) && defined(__arm__) && defined(__ARM_ARCH_5TE__))
 __inline  Int zero_run_search(UInt *bitmapzz, Short *dataBlock, RunLevelBlock *RLB, Int nc)
@@ -208,7 +208,7 @@ __inline  Int zero_run_search(UInt *bitmapzz, Short *dataBlock, RunLevelBlock *R
     return idx;
-#elif defined(PV_ARM_GCC) && defined(__arm__) /* ARM GNU COMPILER  */
+#elif defined(PV_ARM_GCC) && defined(__arm__) && defined(__ARM_ARCH_5TE__) /* ARM GNU COMPILER  */
 __inline Int m4v_enc_clz(UInt temp)

A similar approach is needed in the skia graphics library.

project external/skia/
diff --git a/include/corecg/SkMath.h b/include/corecg/SkMath.h
index 76cf279..5f0264f 100644
--- a/include/corecg/SkMath.h
+++ b/include/corecg/SkMath.h
@@ -162,7 +162,7 @@ static inline int SkNextLog2(uint32_t value) {
     With this requirement, we can generate faster instructions on some
-#if defined(__arm__) && !defined(__thumb__)
+#if defined(__arm__) && defined(__ARM_ARCH_5TE__) && !defined(__thumb__)
     static inline int32_t SkMulS16(S16CPU x, S16CPU y) {
         SkASSERT((int16_t)x == x);
         SkASSERT((int16_t)y == y);

The sonivox module (no idea what that is!), has the same requirement of updating the build to avoid building ARMv5 specific code.

project external/sonivox/
diff --git a/arm-wt-22k/ b/arm-wt-22k/
index 565c233..a59f917 100644
--- a/arm-wt-22k/
+++ b/arm-wt-22k/
@@ -73,6 +73,7 @@ LOCAL_COPY_HEADERS := \
 ifeq ($(TARGET_ARCH),arm)
+ifeq ($(TARGET_ARCH_VERSION),armv5te)
 	lib_src/ARM-E_filter_gnu.s \
 	lib_src/ARM-E_interpolate_loop_gnu.s \

The low-level audio code in audioflinger suffers from the same optimisations, and we need to dive into the code on this occasion to fix things up.

project frameworks/base/
diff --git a/libs/audioflinger/AudioMixer.cpp b/libs/audioflinger/AudioMixer.cpp
index 9f1b17f..4c0890c 100644
--- a/libs/audioflinger/AudioMixer.cpp
+++ b/libs/audioflinger/AudioMixer.cpp
@@ -400,7 +400,7 @@ void AudioMixer::process__validate(state_t* state, void* output)
 static inline 
 int32_t mulAdd(int16_t in, int16_t v, int32_t a)
-#if defined(__arm__) && !defined(__thumb__)
+#if defined(__arm__) && defined(__ARM_ARCH_5TE__) && !defined(__thumb__)
     int32_t out;
     asm( "smlabb %[out], %[in], %[v], %[a] \n"
          : [out]"=r"(out)
@@ -415,7 +415,7 @@ int32_t mulAdd(int16_t in, int16_t v, int32_t a)
 static inline 
 int32_t mul(int16_t in, int16_t v)
-#if defined(__arm__) && !defined(__thumb__)
+#if defined(__arm__) && defined(__ARM_ARCH_5TE__) && !defined(__thumb__)
     int32_t out;
     asm( "smulbb %[out], %[in], %[v] \n"
          : [out]"=r"(out)
@@ -430,7 +430,7 @@ int32_t mul(int16_t in, int16_t v)
 static inline 
 int32_t mulAddRL(int left, uint32_t inRL, uint32_t vRL, int32_t a)
-#if defined(__arm__) && !defined(__thumb__)
+#if defined(__arm__) && defined(__ARM_ARCH_5TE__) && !defined(__thumb__)
     int32_t out;
     if (left) {
         asm( "smlabb %[out], %[inRL], %[vRL], %[a] \n"
@@ -456,7 +456,7 @@ int32_t mulAddRL(int left, uint32_t inRL, uint32_t vRL, int32_t a)
 static inline 
 int32_t mulRL(int left, uint32_t inRL, uint32_t vRL)
-#if defined(__arm__) && !defined(__thumb__)
+#if defined(__arm__) && defined(__ARM_ARCH_5TE__) && !defined(__thumb__)
     int32_t out;
     if (left) {
         asm( "smulbb %[out], %[inRL], %[vRL] \n"
diff --git a/libs/audioflinger/AudioResamplerSinc.cpp b/libs/audioflinger/AudioResamplerSinc.cpp
index e710d16..88b8c22 100644
--- a/libs/audioflinger/AudioResamplerSinc.cpp
+++ b/libs/audioflinger/AudioResamplerSinc.cpp
@@ -62,7 +62,7 @@ const int32_t AudioResamplerSinc::mFirCoefsDown[] = {
 static inline 
 int32_t mulRL(int left, int32_t in, uint32_t vRL)
-#if defined(__arm__) && !defined(__thumb__)
+#if defined(__arm__) && defined(__ARM_ARCH_5TE__) && !defined(__thumb__)
     int32_t out;
     if (left) {
         asm( "smultb %[out], %[in], %[vRL] \n"
@@ -88,7 +88,7 @@ int32_t mulRL(int left, int32_t in, uint32_t vRL)
 static inline 
 int32_t mulAdd(int16_t in, int32_t v, int32_t a)
-#if defined(__arm__) && !defined(__thumb__)
+#if defined(__arm__) && defined(__ARM_ARCH_5TE__) && !defined(__thumb__)
     int32_t out;
     asm( "smlawb %[out], %[v], %[in], %[a] \n"
          : [out]"=r"(out)
@@ -103,7 +103,7 @@ int32_t mulAdd(int16_t in, int32_t v, int32_t a)
 static inline 
 int32_t mulAddRL(int left, uint32_t inRL, int32_t v, int32_t a)
-#if defined(__arm__) && !defined(__thumb__)
+#if defined(__arm__) && defined(__ARM_ARCH_5TE__) && !defined(__thumb__)
     int32_t out;
     if (left) {
         asm( "smlawb %[out], %[v], %[inRL], %[a] \n"
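
For reference, here is a minimal sketch of what the plain C side of these helpers looks like once the ARMv5TE DSP instructions are off the table. The function names are hypothetical, and the choice that "left" selects the bottom 16-bit halves is an assumption based on the smlabb in the left branch above. The smlawb equivalent relies on an arithmetic right shift of a negative 64-bit value, which is what GCC does but is not guaranteed by the C standard.

```c
#include <stdint.h>

/* smlabb / smlatt equivalent: a + (16-bit half of inRL) * (16-bit half of vRL).
 * Assumption: left picks the bottom halves, right picks the top halves. */
static int32_t mul_add_rl_c(int left, uint32_t inRL, uint32_t vRL, int32_t a)
{
    int16_t in = left ? (int16_t)(inRL & 0xffff) : (int16_t)(inRL >> 16);
    int16_t v  = left ? (int16_t)(vRL & 0xffff)  : (int16_t)(vRL >> 16);
    return a + (int32_t)in * (int32_t)v;
}

/* smlawb equivalent: a + the top 32 bits of the 48-bit product v * in. */
static int32_t mul_add_w_c(int16_t in, int32_t v, int32_t a)
{
    return a + (int32_t)(((int64_t)v * in) >> 16);
}
```

Each of these is a handful of instructions on ARMv4T instead of one, which is the price of taking the easy way out.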

The AndroidConfig.h header file is included on every compile. We mess with it to convince it that we don’t have an optimised memcmp16 function.

project system/core/
diff --git a/include/arch/linux-arm/AndroidConfig.h b/include/arch/linux-arm/AndroidConfig.h
index d7e182a..76f424e 100644
--- a/include/arch/linux-arm/AndroidConfig.h
+++ b/include/arch/linux-arm/AndroidConfig.h
@@ -249,8 +249,9 @@
  * Do we have __memcmp16()?
+#if defined(__ARM_ARCH_5TE__)
 #define HAVE__MEMCMP16  1
  * type for the third argument to mincore().

Next up is the pixelflinger, where things get interesting, because all of a sudden we have armv6 code. I’ve taken the rash decision of wrapping this in conditionals that are only enabled if you actually have an ARMv6 version, not a pesky ARMv5E, but I really need to better understand the intent here. It seems a little strange.

diff --git a/libpixelflinger/ b/libpixelflinger/
index a8e5ee4..077cf47 100644
--- a/libpixelflinger/
+++ b/libpixelflinger/
@@ -5,7 +5,7 @@ include $(CLEAR_VARS)
 # ARMv6 specific objects
-ifeq ($(TARGET_ARCH),arm)
+ifeq ($(TARGET_ARCH_VERSION),armv6)
 LOCAL_ASFLAGS := -march=armv6
 LOCAL_SRC_FILES := rotate90CW_4x4_16v6.S
 LOCAL_MODULE := libpixelflinger_armv6
@@ -39,7 +39,7 @@ PIXELFLINGER_SRC_FILES:= \
 	raster.cpp \
-ifeq ($(TARGET_ARCH),arm)
+ifeq ($(TARGET_ARCH_VERSION),armv5te)
@@ -67,7 +67,7 @@ ifneq ($(BUILD_TINY_ANDROID),true)
 LOCAL_MODULE:= libpixelflinger
-ifeq ($(TARGET_ARCH),arm)
+ifeq ($(TARGET_ARCH_VERSION),armv6)
 LOCAL_WHOLE_STATIC_LIBRARIES := libpixelflinger_armv6

Finally scanline has an optimised asm version it calls in preference to doing the same thing inline with C code. Again, I take the easy way out, and use the C code.

diff --git a/libpixelflinger/scanline.cpp b/libpixelflinger/scanline.cpp
index d24c988..685a3b7 100644
--- a/libpixelflinger/scanline.cpp
+++ b/libpixelflinger/scanline.cpp
@@ -1312,7 +1312,7 @@ void scanline_t32cb16blend(context_t* c)
     const int32_t v = (c->state.texture[0].shade.it0>>16) + y;
     uint32_t *src = reinterpret_cast<uint32_t*>(tex->data)+(u+(tex->stride*v));
-#if ((ANDROID_CODEGEN >= ANDROID_CODEGEN_ASM) && defined(__arm__))
+#if ((ANDROID_CODEGEN >= ANDROID_CODEGEN_ASM) && defined(__arm__) && defined(__ARM_ARCH_5TE__))
     scanline_t32cb16blend_arm(dst, src, ct);
     while (ct--) {

And that my friends, is that! Now to see if I can actually run this code!

Why we have to wait for Android on the Neo 1973

Wed, 21 Nov 2007 17:46:25 +0000
tech android article arm

update: It finally works!

So after a day of much fun and hacking, I sadly blog here in the face of defeat. Thwarted by binary only distribution and non-forwards compatible architectures. This tale of woe documents my attempt to get the Android stack running on the FIC Neo 1973 phone.

This post describes what I did to try and get this working, and ultimately why it isn't going to work until you get the source for the stack. And if you find the story excruciating, just skip to the conclusion.

The day started out promisingly. I took the diff of the kernel I had earlier produced, and started to hack it down into something a little more manageable. First I got rid of the patches that enabled Qemu, since I only care about running this on the real hardware. Then I got rid of the patches that enabled the goldfish platform. The goldfish platform is the hardware platform that the Android SDK simulates. I don't need that for running on the Neo, so gone! Next there was a whole big patchset for enabling yaffs2. The Openmoko kernel already has yaffs2 patched in there, so this just causes confusion. Once all that is done, the final patch is much more manageable; 8000 lines rather than 30000 lines.

So with that in place, I pulled down the kernel, and then applied the Openmoko patchset with quilt. With that in place I applied my stripped down diff. Now this was against 2.6.23, rather than 2.6.22, so there was a bit of fuzz and a couple of failed hunks. (That sounds more like gangster slang than hacking!) Anyway, after fixing up the patch, I now have a patch that applies cleanly against the Openmoko kernel. (Not that it will do you much good as you are about to see.)

So I took the recommended default config for openmoko, and ran trusty make oldconfig. This prompted for a couple of new options:


I ended up removing the low memory killer option because it had compile errors. I was going to go back and fix it, but in the end it didn't really matter.

This was a lot of progress in my first hour or two of hacking. I then proceeded to waste a whole bunch of time on stupid stuff. I'll save you all the gory details, but highlights follow.

Firstly, just trying to get stuff running on the Neo 1973 proved a bit of a challenge. I eventually found some known binaries, and worked out how to get them onto the phone:

$ sudo dfu-util -a 5 -R -D om.rootfs.jffs2
$ sudo dfu-util -a 3 -R -D om.uImage.bin 

(Special thanks to David who pointed me at the right places.)

The next challenge was actually just getting the kernel I had built from source to work with the rootfs, rather than just the binary kernel. This was particularly difficult to debug because there were no error messages or panics, just the kernel sitting at Freeing init memory, and no more. As far as I could tell it was the same kernel source, and I had used the default config. So after lots of messing around (different compilers, with/without modules, etc), I stumbled upon the fact that the user mode binary applications on the rootfs image are all using the new EABI, as opposed to the old ABI. It turns out that special kernel support needs to be enabled for EABI, and this isn't in the default openmoko kernel config. (I don't have a good reference for EABI vs. OABI. Linux devices has a story, but the floating point stuff is really only one small part of the differences.) Anyway, after enabling CONFIG_AEABI things started going a lot smoother. I really wish that more people enabled the /proc/config.gz option; it would have made life a lot easier.

So at this point I had a kernel with the Android patches loading and running the standard OpenMoko distribution. Next step was to run a different rootfs. Taking the filesystems I had extracted earlier as well as the rootfs I had extracted (see this post for details), I combined these and used mkfs.jffs2 to build an android jffs filesystem. (For reference the full command is: $ sudo mkfs.jffs2 -x lzo -r android-root-image -o android.jffs2 --eraseblock=0x4000 --pad -n -squash). At this point I thought I was home free. How very wrong I was.

On booting this I was back at the dreaded Freeing init memory, with no other output. Confused with this, I compiled a very simple hello world program to see if this would work as a replacement init (just for testing). This didn't work either.

With this failure I tried another tack. I would revert to my known good, the working openmoko rootfs, and install my hello program on this rootfs just to test it. I didn't think I would have a problem here. It turns out it failed to run. Luckily the openmoko rootfs has gdb, which is great for fixing problems like this. Firing up gdb soon led me to the real problem.

ARMv4 vs. ARMv5

So, it turns out that my hello binary (and all the android binaries) are compiled for an ARM926EJ-S chip. This is a problem because the neo1973 has an ARM920T core. Now you would think that ARM926 and ARM920 would be pretty close. But if you thought that you would, unfortunately, be wrong, wrong, wrong! The ARM926EJ-S implements the ARMv5TEJ instruction set, but the ARM920T implements the ARMv4T instruction set. So what happens in my hello program is that we hit an ARMv5 instruction, which is undefined in the earlier ARMv4 ISA, which generates an undefined instruction trap to the kernel, and the kernel responds by sending SIGILL to the running process. Assuming that the program hasn't installed any special signal handlers this will kill the process. And this is what was happening to my hello program, and what I assumed was happening to init as well. (Of course, assumptions make an ass out of u and me, or in this case, mostly me.)

Now I really wasn't going to let a pesky little thing such as the CPU not implementing the instructions stand in my way! (Note: I could of course have compiled hello for the ARMv4 architecture, but that isn't an option for the rest of the stack, and I was only interested in getting hello running so I could get the rest of the stack running.) So, in an act of stupid defiance, I decided, if the CPU can't implement the instruction, I'll do it myself.

Luckily the kernel provides a neat infrastructure for managing undefined instructions, and even emulating them. So the first instruction to emulate was the ARM clz instruction. This is the instruction that counts the number of leading zero bits. The code below implements this. The only other thing to do is ensure that this hook is registered at startup using: register_undef_hook(&clz_hook);

static int clz_trap(struct pt_regs *regs, unsigned int instr)
{
  /* Extract the source register index */
  int src = instr & 0xf;
  /* Extract the destination (result) register index */
  int dst = (instr >> 12) & 0xf;
  /* Extract the conditional code */
  int cc = (instr >> 28) & 0xf;
  /* Test if the conditional code passes */
  if (handle_cc(regs, cc)) {
      /* Implement the instruction */
      regs->uregs[dst] = 32 - fls(regs->uregs[src]);
  }
  /* Print some stuff for debugging */
  printk("emulating clz: %x src=%d (%lx) dst=%d (%lx) @ %p\n", instr, 
	 src, regs->uregs[src], dst, regs->uregs[dst], (void*) regs->ARM_pc);
  /* Increment the PC register */
  regs->ARM_pc += 4;
  return 0;
}

static struct undef_hook clz_hook = {
	.instr_mask	= 0x0fff0ff0,
	.instr_val	= 0x016f0f10,
	.cpsr_mask	= PSR_T_BIT,
	.cpsr_val	= 0,
	.fn		= clz_trap,
};
One thing that may not be clear from the comments is that ARM supports conditionally executed instructions. The top 4 bits of the instruction are its condition field. Depending on the condition field, and the value of the N, Z, C and V flags (which are stored in the CPSR register), the instruction may or may not be executed. This is used to avoid having to branch for all if statements and the associated problems... but you didn't come here for an introduction to computer architecture. To correctly implement this, some code is needed, and I clag it here for posterity.

/* Return true if conditional code should be executed */
static int handle_cc(struct pt_regs *regs, int cc)
{
  int doit = 0;
  int cpsr = regs->ARM_cpsr;

  int n = (cpsr >> 31) & 1;
  int z = (cpsr >> 30) & 1;
  int c = (cpsr >> 29) & 1;
  int v = (cpsr >> 28) & 1;

  switch (cc) {
  case 0:
    doit = z;
    break;
  case 1:
    doit = !z;
    break;
  case 2:
    doit = c;
    break;
  case 3:
    doit = !c;
    break;
  case 4:
    doit = n;
    break;
  case 5:
    doit = !n;
    break;
  case 6:
    doit = v;
    break;
  case 7:
    doit = !v;
    break;
  case 8:
    doit = c && !z;
    break;
  case 9:
    doit = !c || z;
    break;
  case 10:
    doit = (n == v);
    break;
  case 11:
    doit = (n != v);
    break;
  case 12:
    doit = ((z == 0) && (n == v));
    break;
  case 13:
    doit = ((z == 1) || (n != v));
    break;
  case 14:
    doit = 1;
    break;
  case 15:
    doit = 0;
    break;
  default:
    printk("Error, should not get here!\n");
  }
  return doit;
}
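
The condition codes come in pairs where each odd code is the inverse of the even code before it, so the same check can be written more compactly. This is only a sketch of that alternative, taking the CPSR value directly rather than pt_regs; cond_passes is a hypothetical name. It keeps the behaviour above of treating code 15 as "never".

```c
#include <stdint.h>

/* Evaluate an ARM condition field (0..15) against the NZCV flags in cpsr. */
static int cond_passes(uint32_t cpsr, unsigned cc)
{
    int n = (cpsr >> 31) & 1;
    int z = (cpsr >> 30) & 1;
    int c = (cpsr >> 29) & 1;
    int v = (cpsr >> 28) & 1;
    int result;

    switch ((cc & 0xf) >> 1) {
    case 0:  result = z;             break; /* EQ / NE */
    case 1:  result = c;             break; /* CS / CC */
    case 2:  result = n;             break; /* MI / PL */
    case 3:  result = v;             break; /* VS / VC */
    case 4:  result = c && !z;       break; /* HI / LS */
    case 5:  result = (n == v);      break; /* GE / LT */
    case 6:  result = !z && n == v;  break; /* GT / LE */
    default: result = 1;             break; /* AL / (15: never) */
    }
    /* Odd codes are the logical inverse of the preceding even code. */
    return (cc & 1) ? !result : result;
}
```

With only the Z flag set, cond_passes(1u << 30, 0) passes (EQ) and cond_passes(1u << 30, 1) does not (NE).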

OK, one down. That wasn't so hard. The next one gets a little bit trickier. The compiler will use the BLX instruction if it is available. This is the Branch, Link and Exchange instruction. There are two versions of the instruction, and at this stage we only really care about version 2. In this version the address to branch to is stored in a register, and the low bit of that address indicates whether or not an exchange is required. (You can ignore exchange for now, more about that later.) This instruction is a little bit more effort to implement, but it is not too hard:

static int blxv2_trap(struct pt_regs *regs, unsigned int instr)
{
  int rm = instr & 0xf;
  int cc = (instr >> 28) & 0xf;
  printk("emulate blxv2: %x rm=%d (%lx) @ %p CC: %d cc(%x) cpsr(%lx)\n", instr, rm,
	 regs->uregs[rm], (void*) regs->ARM_pc, handle_cc(regs, cc), cc, regs->ARM_cpsr);

  if (handle_cc(regs, cc)) {
    /* Update the link register with the return address */
    regs->ARM_lr = regs->ARM_pc + 4;
    /* Update the CPSR if this is an 'exchange' */
    regs->ARM_cpsr = (regs->ARM_cpsr & (~32))  | ((regs->uregs[rm] & 1) << 5);
    /* Jump to the register value */
    regs->ARM_pc = regs->uregs[rm] & 0xfffffffe;
  } else {
    /* If the condition code fails, just go to the next instruction. */
    regs->ARM_pc += 4;
  }
  return 0;
}

static struct undef_hook blxv2_hook = {
	.instr_mask	= 0x0ffffff0,
	.instr_val	= 0x012fff30,
	.cpsr_mask	= PSR_T_BIT,
	.cpsr_val	= 0,
	.fn		= blxv2_trap,
};
After this, success! Hello world ran correctly. Of course this emulation isn't going to be particularly fast, but it is still infinitely faster than not running at all. (Well, OK, not really, divide by zero is undefined, not infinite.) At this point we were feeling pretty good about ourselves. At this point I must acknowledge Carl and Matt for their assistance with this.

Thumb interworking

So now I really thought I was home free, but wrong once again. (A pattern emerging maybe?) So first a bit of a primer on ARM's Thumb mode (so punny!). ARM has two different instruction sets, the ARM instruction set, and the Thumb instruction set. The Thumb instruction set is a 16-bit instruction set, which has a higher code density than the ARM instruction set. Now the neat thing about this is that you can actually combine both ARM and Thumb instructions in the same program. So if your compiler is smart, it should be able to use both instruction sets for optimisation. The CPU knows whether code is executing in ARM or Thumb mode by a bit in the CPSR register. When the bit is set the instruction stream is assumed to be 16-bit Thumb instructions. Now if you are running in ARM mode, and want to enter Thumb mode, you need to do an exchange operation, which is part of the bx and blx instructions. Now it turns out that Android is compiled with Thumb mode, so this means it uses blx to switch from ARM to Thumb mode. So at this stage I ended up needing to implement the blx (version 1) instruction. This is shown below:

static int blxv1_trap(struct pt_regs *regs, unsigned int instr)
{
  int h = (instr >> 24) & 1;
  long imm = ((instr & 0xffffff) << 8);
  /* should be sign extended */
  imm = imm >> 8;
  imm = imm << 2;
  imm = imm | (h << 1);
  printk("emulate blxv1: %x imm=%lx @ %p\n", instr, imm, (void*) regs->ARM_pc);

  regs->ARM_lr = regs->ARM_pc + 4;
  regs->ARM_cpsr = regs->ARM_cpsr | (1 << 5);
  regs->ARM_pc += imm;
  return 0;
}

static struct undef_hook blxv1_hook = {
	.instr_mask	= 0xfe000000,
	.instr_val	= 0xfa000000,
	.cpsr_mask	= PSR_T_BIT,
	.cpsr_val	= 0,
	.fn		= blxv1_trap,
};
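
The fiddly part of the immediate form is turning the 24-bit field plus the H bit into a signed byte offset. Here is a minimal sketch of just that decoding, written without relying on right-shifting negative values; blx1_offset is a hypothetical name. Note that the architecture defines the branch base as the instruction address plus 8, so whether an extra adjustment is needed depends on what the saved pc holds; this sketch only decodes the offset.

```c
#include <stdint.h>

/* Decode the signed byte offset of an ARM-state BLX <label> instruction. */
static int32_t blx1_offset(uint32_t instr)
{
    int32_t imm = (int32_t)(instr & 0x00ffffffu); /* signed_immed_24 */
    int32_t h = (int32_t)((instr >> 24) & 1u);    /* halfword bit */

    if (imm & 0x00800000)
        imm -= 0x01000000;                        /* sign-extend 24 bits */
    return imm * 4 + h * 2;                       /* words to bytes, plus H */
}
```

The H bit is what lets an ARM-state branch land on a Thumb instruction that is only halfword aligned.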

Now, we get a bit further. But still no go. It turns out that Thumb also has a new BLX instruction in V5. So, we have to go through and emulate this instruction for Thumb as well. Below is the code for that.

static int blxv1_t_trap(struct pt_regs *regs, unsigned int instr)
{
  u16 offset_11 = instr & 0x7ff;
  u16 h = (instr >> 11) & 3;
  printk("blx thumb %x ofs: %x h: %d @ %p\n", 
	 instr, offset_11, h, (void*) regs->ARM_pc);
  if (h == 2) {
    long imm = (offset_11 << 12) << 9;
    imm = imm >> 9;
    regs->ARM_lr = regs->ARM_pc + (imm << 12);
    regs->ARM_pc = regs->ARM_pc + 2;
  } else {
    long new_pc = regs->ARM_lr + (offset_11 << 1);
    /* We set the top bit for mega hack! */
    regs->ARM_lr = (regs->ARM_pc + 2) | 1;
    regs->ARM_pc = new_pc;
    if (h == 1) {
      regs->ARM_lr = (1 << 31) | (((regs->ARM_lr & 2) >> 1) << 30) | regs->ARM_lr;
      regs->ARM_pc = regs->ARM_pc & 0xfffffffc;
      regs->ARM_cpsr = regs->ARM_cpsr & (~32);
    }
  }

  printk(" blx thumb after: pc: %lx lr: %lx cpsr: %lx\n", regs->ARM_pc, regs->ARM_lr, regs->ARM_cpsr);
  return 0;
}

static struct undef_hook blxv1_t_hook = {
	.instr_mask	= 0xe000,
	.instr_val	= 0xe000,
	.cpsr_mask	= PSR_T_BIT,
	.cpsr_val	= PSR_T_BIT,
	.fn		= blxv1_t_trap,
};
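
In Thumb state the branch is split across two 16-bit halfwords: a prefix that loads the high part of the offset into the link register, and a suffix that adds the low part and branches. As a cross-check, here is a minimal sketch of the target calculation as I read it from the architecture manual (not from the trap handler above), with the hypothetical name thumb_blx_target. It assumes addr is the address of the prefix halfword and that the pc reads as that address plus 4.

```c
#include <stdint.h>

/* Target of a Thumb BLX pair: hw1 is the prefix (H=10), hw2 the suffix (H=01). */
static uint32_t thumb_blx_target(uint32_t addr, uint16_t hw1, uint16_t hw2)
{
    int32_t hi = hw1 & 0x7ff;          /* high offset_11 */
    uint32_t lo = hw2 & 0x7ffu;        /* low offset_11 */
    uint32_t lr;

    if (hi & 0x400)
        hi -= 0x800;                   /* sign-extend 11 bits */

    /* Prefix: LR = PC + (SignExtend(offset_11) << 12), with PC = addr + 4. */
    lr = addr + 4u + (uint32_t)(hi * 4096);

    /* Suffix: target = (LR + (offset_11 << 1)), forced to word alignment
     * because the destination is ARM code. */
    return (lr + (lo << 1)) & ~3u;
}
```

The word-alignment mask at the end is the only difference from a plain Thumb bl, apart from clearing the Thumb bit in the CPSR.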

Now if you are still with me, and actually read the code, you might recognise some pretty interesting code. Spot it? No? OK, so the problem is the way in which ARM code returns to Thumb mode. The blx instruction updates the link register with the return address. In Thumb mode it also sets the lowest bit. This ensures that when bx is called from ARM mode it will jump back into Thumb mode. It turns out that having to use bx to return from functions is a bit of a pain, so in ARMv5, the architecture was updated so that if you popped values from the stack into the pc register, the CPU would also check the low bit and switch to Thumb mode if required. Unfortunately ARMv4 doesn't do this. Rather than checking the lower bit, it simply ignores it and masks it off, which means you jump back to the return address but remain in ARM mode, so you end up executing 16-bit instructions as though they were 32-bit instructions. It may not surprise you to learn that this generally doesn't work so well.

Which gets us to the truly evil code found above. As well as setting the low bit, we also go and set the top bit of the LR. When the ARM code returns from the function, rather than going to the correct location, it ends up at an unmapped location, which causes a pre-fetch abort. The prefetch abort handler was then updated to handle this error case.

asmlinkage void __exception
do_PrefetchAbort(unsigned long addr, struct pt_regs *regs)
{
  if (addr >> 31) {
    printk("Magic prefetch abort happened on: %lx\n", addr);
    regs->ARM_pc = (addr & 0x3ffffffe) | ((addr >> 29) & 2);
    printk("  jumping to : %lx\n", regs->ARM_pc);
    /* Enable thumb mode */
    regs->ARM_cpsr = (regs->ARM_cpsr | 32);
    return;
  }
  do_translation_fault(addr, 0, regs);
}
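
The encode and decode halves of this trick live in two different handlers, so it is worth checking that they really are inverses. Here is a minimal round-trip sketch with hypothetical function names. It assumes that Thumb return addresses are below 0x40000000 (so bits 31 and 30 are free), and it models the ARMv4 return as simply dropping the low two bits of the address.

```c
#include <stdint.h>

/* What the Thumb blx emulation stores in LR for a given return address. */
static uint32_t encode_magic_lr(uint32_t ret_addr)
{
    uint32_t lr = ret_addr | 1u;                /* low bit: caller was Thumb */
    return lr | (1u << 31) | ((lr & 2u) << 29); /* marker + copy of bit 1 in bit 30 */
}

/* Assumed ARMv4 behaviour: a load into pc in ARM state drops bits 1..0. */
static uint32_t armv4_return_target(uint32_t lr)
{
    return lr & ~3u;
}

/* What the prefetch abort handler recovers from the faulting address. */
static uint32_t decode_magic_fault(uint32_t fault_addr)
{
    return (fault_addr & 0x3ffffffeu) | ((fault_addr >> 29) & 2u);
}
```

The copy of bit 1 in bit 30 is what saves a return address such as 0x8002: the ARM-state return throws away bit 1, and the abort handler puts it back.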

Now, at this stage, we have something pretty hacked up, but all these hacks are pretty solid. Unfortunately it still doesn't work. We have successfully ensured that ARM code returns correctly when called from Thumb mode; what we have failed to do is ensure that Thumb code returns correctly to ARM code. In ARMv4, this is only possible through the bx instruction, which correctly sets the Thumb bit in the CPSR. On ARMv5, the pop instruction was extended to also correctly update the Thumb bit. But we aren't on an ARMv5, so the low bit is simply ignored. Which means we get stuck in Thumb mode and can't correctly return to ARM code.

The prefetch abort trick works to an extent the other way as well, e.g. for getting from Thumb back into ARM, but it relies on the ARM code using the blx instruction. Unfortunately this isn't always the case, and it is perfectly reasonable for code to use a bl followed by a bx. As none of these trap, it is not possible to put our magic fake value into the LR register.

The only other option left at this stage is some kind of code scanning technique. In this we scan the object code looking for the unsafe pop instructions, and replace them with an undefined instruction so that we can safely emulate them with the ARMv5 behaviour. Unfortunately ARM makes this approach basically impossible. It is not possible to tell if any block of code is Thumb or ARM instructions. More importantly, it is impossible to determine if a random word in the text segment is actually an instruction, or is in fact a literal value. Simply scanning for pop could actually modify some constants, which would lead to potentially subtle bugs. If ARM had separate execute and read permissions we could use the MMU to distinguish between code and data, but unfortunately the ARM MMU can't really do this. Which means that this approach is basically a no-go, at least not without some pretty nasty heuristics, or some really awesome static analysis. Of course we could just emulate every instruction, but this isn't exactly appealing to me. (And the performance would really suck!)
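
To illustrate why naive scanning is unsafe, here is a minimal sketch of such a scanner for the Thumb case. The function names are hypothetical; the only architectural fact used is that Thumb pop with pc in the register list is encoded as 0xBD00 plus an 8-bit register list.

```c
#include <stddef.h>
#include <stdint.h>

/* Thumb "pop {..., pc}" is encoded as 1011 1101 rrrr rrrr. */
static int looks_like_thumb_pop_pc(uint16_t hw)
{
    return (hw & 0xff00u) == 0xbd00u;
}

/* Count halfwords in a text segment that a naive scanner would patch. */
static size_t count_pop_pc_candidates(const uint16_t *text, size_t n)
{
    size_t i, hits = 0;

    for (i = 0; i < n; i++) {
        if (looks_like_thumb_pop_pc(text[i]))
            hits++;
    }
    return hits;
}
```

Feed it push {r4, lr}, pop {r4, pc} and then a literal pool entry holding the constant 0x0000bd10, and it reports two candidates: the real pop, and the low half of the constant, which it would happily corrupt.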


In summary, Android is compiled for ARMv5, and the Neo 1973 is ARMv4. These instruction sets are not compatible. Therefore Android will not run on the Neo 1973. Solutions to this problem would be either:

My guess is none of those three things is going to happen any time soon (although I'll be really happy to be disproved!), so it is better to focus on trying to get this running on an actual ARMv5 based chipset. (E.g: PXA270, i.MX21).

Finally, thanks to Jaq, Carl, David and Matt for providing inspiration and advice.

Update: Thanks to andrzej for spotting the bug in my clz() emulation. It should of course be 32 - fls(), not fls(). This is now updated.