A project that I’m working on at the moment calls for a very small footprint. This post is about how to make code really small for the ARM architecture.
As you can probably guess, I’m interested in operating system code, so as an example, I’ve taken a very simple piece of operating system code, and stripped it right back to demonstrate some of the techniques to use when optimising for space. So here is the snippet of code we are going to optimise:
01 struct thread {
02 unsigned int notify_bits;
03 };
04
05 unsigned int
06 poll(struct thread *thread, unsigned int mask)
07 {
08 unsigned int result;
09 result = thread->notify_bits & mask;
10 if (result) {
11 thread->notify_bits &= ~result;
12 }
13 return result;
14 }
In this very simple operating system we have threads (thread data
is stored in struct thread). Each thread has a set of 32
signals (encoded in a single word notify_bits). This poll
is used by a thread to determine if it has been sent any signals. The
mask parameter is the set of signals that the thread is
interested in checking. So, a thread can check if a single signal has
been thread, or if any signal has been set, or if a specific subset of
signals has been set. The function returns the signals that are
available (which is simply the bit-wise and of
notify_bits and mask). It is important that
the function clears any signals that have been returned. This makes
sure that if poll is called twice the same signals are
not returned. This is achieved in lines 10—12.
So, our goal is to try and get this code as small as possible.
And every byte counts! First off, we just try and compile
this with the standard ARM gcc compiler. I’m using version 4.2.2.
So we start with: $ arm-elf-gcc -c poll.c -o poll.o.
We can then use object dump to work out what the compiler did:
$ arm-elf-objdump -dl poll.o.
00000000 <poll&ht;:
poll():
0: e1a0c00d mov ip, sp
4: e92dd800 push {fp, ip, lr, pc}
8: e24cb004 sub fp, ip, #4 ; 0x4
c: e24dd00c sub sp, sp, #12 ; 0xc
10: e50b0014 str r0, [fp, #-20]
14: e50b1018 str r1, [fp, #-24]
18: e51b3014 ldr r3, [fp, #-20]
1c: e5932000 ldr r2, [r3]
20: e51b3018 ldr r3, [fp, #-24]
24: e0023003 and r3, r2, r3
28: e50b3010 str r3, [fp, #-16]
2c: e51b3010 ldr r3, [fp, #-16]
30: e3530000 cmp r3, #0 ; 0x0
34: 0a000006 beq 54 <poll+0x54>
38: e51b3014 ldr r3, [fp, #-20]
3c: e5932000 ldr r2, [r3]
40: e51b3010 ldr r3, [fp, #-16]
44: e1e03003 mvn r3, r3
48: e0022003 and r2, r2, r3
4c: e51b3014 ldr r3, [fp, #-20]
50: e5832000 str r2, [r3]
54: e51b3010 ldr r3, [fp, #-16]
58: e1a00003 mov r0, r3
5c: e24bd00c sub sp, fp, #12 ; 0xc
60: e89da800 ldm sp, {fp, sp, pc}
So, our first go gets us 100 bytes of code. Which is 10 bytes
(or 2.5 instructions) for each of our lines of code. We should be able
to do better. Well, the first thing is we should try is to use
optimisation: $ arm-elf-gcc -c poll.c -o poll.o -O2. This
gives us a much better code generation output:
00000000 <poll>: poll(): 0: e5902000 ldr r2, [r0] 4: e1a0c000 mov ip, r0 8: e0110002 ands r0, r1, r2 c: 11e03000 mvnne r3, r0 10: 10033002 andne r3, r3, r2 14: 158c3000 strne r3, [ip] 18: e12fff1e bx lr
So this got us down to 28 bytes. A factor 4 improvement for one
compiler flag, not bad. Now, -O2 does some standard
omptimisations, but -Os, will do optimisations
specifically for reducing the amount of code. So trying:
$ arm-elf-gcc -c poll.c -o poll.o -Os. This gives
a little bit better code-gen:
00000000 <poll>: poll(): 0: e5903000 ldr r3, [r0] 4: e1a02000 mov r2, r0 8: e0110003 ands r0, r1, r3 c: 11c33000 bicne r3, r3, r0 10: 15823000 strne r3, [r2] 14: e12fff1e bx lr
Down to 24 bytes (6 instructions), is pretty good. Now, as you can
see the generated code has 32-bits per instruction. The some of the
ARM architectures have two distinct instruction sets,
ARM and Thumb. The Thumb instruction
set uses 16-bit per instruction, instead of 32-bit. This denser
instruction set can enable much smaller code sizes. Of course there is
a trade-off here. The functionality of the 16-bit instructions is
going to be less than the 32-bit instructions. But lets give it a
try. At the same time, we will tell the compiler the exact CPU we want
to compile for (which is the ARM7TDMI-S) in our case. The compiler
line is: $ arm-elf-gcc -c poll.c -o poll.o -Os -mcpu=arm7tdmi
-mthumb.
Which produces code like:
00000000 <poll>: poll(): 0: 6803 ldr r3, [r0, #0] 2: 1c02 adds r2, r0, #0 4: 1c08 adds r0, r1, #0 6: 4018 ands r0, r3 8: d001 beq.n e <poll+0xe> a: 4383 bics r3, r0 c: 6013 str r3, [r2, #0] e: 4770 bx lr
So, now we are down to 16 bytes, so in Thumb we need 8 instructions (2 more than ARM), but each is only 2 bytes, not 4, so we end up with a 1/3 improvement. To get any further, we need to start looking at our code again, and see if there are ways of improving the code. Looking at the code again:
00 unsigned int
01 poll(struct thread *thread, unsigned int mask)
02 {
03 unsigned int result;
04 result = thread->notify_bits & mask;
05 if (result) {
06 thread->notify_bits &= ~result;
07 }
08 return result;
09 }
You may notice that the branch instruction on line 5, you may
notice that this is actually redundant. If result is
zero, then ~result well be 0xffffffff. Given
this thread->notify_bits &= 0xffffffff will not change
the value of thread->notify_bits. So, we can reduce this
to:
00 unsigned int
01 poll(struct thread *thread, unsigned int mask)
02 {
03 unsigned int result;
04 result = thread->notify_bits & mask;
05 thread->notify_bits &= ~result;
06 return result;
07 }
When we compile this we get down to:
00000000 <poll>: poll(): 0: 6803 ldr r3, [r0, #0] 2: 4019 ands r1, r3 4: 438b bics r3, r1 6: 6003 str r3, [r0, #0] 8: 1c08 adds r0, r1, #0 a: 4770 bx lr
This gets us down to 6 instructions (12 bytes). Pretty good since
we started at 100 bytes. Now lets look at the object code in a little
bit more detail. If you look at address 0x8, the instruction simply
moves register r1 into register r0 so that
it in the right place for return. (Note: The ARM ABI has the
return value stored in r0). This seems like a bit of a waste, it
would be good if there was a way we could have the value stored
in r0 and not waste an instruction just moving values
between registers. Now, to get a better understanding of what the
instructions are doing, I’m going to slightly rewrite the code, and
then compile with debugging, so we can see how the generated code
matches up with the source code. So first, lets rewrite the code a little
bit:
00 unsigned int
01 poll(struct thread *thread, unsigned int mask)
02 {
03 unsigned int tmp_notify_bits = thread->notify_bits;
04 mask &= tmp_notify_bits;
05 tmp_notify_bits &= ~mask;
06 thread->notify_bits = tmp_notify_bits;
07 return mask;
08 }
You should convince yourself that this code is equivalent to the
existing code. (The code generated is identical). So, we can now line
up the source code with the generated code. On line 03 (unsigned
int tmp_notify_bits = thread->notify_bits), this matches up
with address 0x0 (ldr r3, [r0, #0]). So, register
r3 is used to store variable
tmp_notify_bits. The parameter thread is stored in
register r0. Now, line 04 (mask &=
tmp_notify_bits) matches directly with address 0x2 (ands
r1, r3). Register r1 matches directly with the
mask parameter. The important part of restructuring the
code as we have done, is that it becomes obvious that we can directly
using the mask parameter instead of needing an extra variable like in
the previous code. As we continue line 05 (tmp_notify_bits &=
~mask), matches directly to 0x4 (bics r3, r1). The bics
instruction is quite neat in that it can essentially do the &=
~ in one 16-bit instruction. Line 06
(thread->notify_bits = tmp_notify_bits) stores the
result back to memory, matches directly to 0x6. Now the final line of
code (return mask;), needs two instructions 0x8 and 0xa
(adds r0, r1, #0 and bx lr). Now the reason we need to instructions
is because mask is stored in register r1, and the return
value needs to be in register r0. So how can we get mask
to be stored in r0 all along? Well, if we switch the
parameters around poll(unsigned int mask, struct thread
*thread), the mask will instead be stored in
r0 instead of r1. (Note: We don’t
necessarily have to change the interface of the function. If we want
to support keep the interface we can use a simple macro to textually
swap the parameters.) If we compile this, we get the following
generated code:
00000000 <poll>: poll(): 0: 680b ldr r3, [r1, #0] 2: 4018 ands r0, r3 4: 4383 bics r3, r0 6: 600b str r3, [r1, #0] 8: 4770 bx lr
So, we have got this down to 5 instructions, 10 bytes. This is a factor 10 improvement, not bad!
So a bit of a review:
-Os to optimise for space.