Yesterday we saw how, with some careful programming and the right compiler flags, we could get GCC to generate optimally small code for our particular function. In this post we take a look at one of the ways in which GCC fails to produce optimal code. Now there are actually many ways, but the one I want to concentrate on in this post is the use of literal pools.
So, what is a literal pool? I’m glad you asked. The literal pool is an area of memory (in the text segment) which is used to store constants. These constants could be plain numerical constants, but their main use is to store the addresses of variables in the system. These addresses are needed because the ARM instruction set has no instructions for directly loading (or storing) an address in memory. Instead, ldr and str can only access memory at a limited offset (a 12-bit immediate, so at most ±4KiB) from a base register. Now there are lots of ways you could generate code within this restriction; for example, you could ensure your data section is less than 8KiB in size and reserve a register to be used as a base for all data lookups. But such an approach only works if you have a limited data section size. The standard approach is that when a variable is used, its address is written out into a literal pool. The compiler then generates two instructions: the first reads the address from this literal pool, and the second is the instruction that accesses the variable.
So, how exactly does this literal pool work? Well, so that a special register is not needed to point at the literal pool, the compiler uses the program counter (PC) as the base register. The generated code looks something like ldr r3, [pc, #28]. That code loads the value at a 28-byte offset from the current value of PC into the register r3. r3 then contains the address of the variable we want to access, and can be used like ldr r1, [r3, #0], which loads the value of the variable (rather than its address) into r1. Now, since the PC is used as the base for literal pool accesses, it should be clear that the literal pool must be stored close enough to the code that needs to use it.
To ensure that the literal pool is close enough to the code using it, the compiler stores a literal pool at the end of each function. This approach works pretty well (unless you have a 4KiB+ function, which would be silly anyway), but can be a bit of a waste.
To illustrate the problem, consider this (contrived) example code:
static unsigned int value;

unsigned int get_value(void)
{
    return value;
}

void set_value(unsigned int x)
{
    value = x;
}
Now, while this example is contrived, the pattern involved exhibits itself in a lot of real-world code. You have some private data in a compilation unit (value), and then you have a set of accessor (get_value) and mutator (set_value) functions that operate on the private data. Usually the functions would be more complex than in our example, and usually there would be more than two. So let’s have a look at the generated code:
00000000 <get_value>:
get_value():
   0:	4b01      	ldr	r3, [pc, #4]	(8 <get_value+0x8>)
   2:	6818      	ldr	r0, [r3, #0]
   4:	4770      	bx	lr
   6:	46c0      	nop			(mov r8, r8)
   8:	00000000 	.word	0x00000000
			8: R_ARM_ABS32	.bss

0000000c <set_value>:
set_value():
   c:	4b01      	ldr	r3, [pc, #4]	(14 <set_value+0x8>)
   e:	6018      	str	r0, [r3, #0]
  10:	4770      	bx	lr
  12:	46c0      	nop			(mov r8, r8)
  14:	00000000 	.word	0x00000000
			14: R_ARM_ABS32	.bss
You can see that each function has a literal pool (at addresses 0x8 and 0x14). You can also see that there is a relocation associated with each of these addresses (R_ARM_ABS32 .bss). This relocation means that at link time the address of value will be stored at locations 0x8 and 0x14. So, what is the big deal here? Well, there are two problems. First, we have two literal pools containing duplicate data: by storing the address of value twice, we are wasting 4 bytes (remember from yesterday, we have a very tight memory budget and we care where every byte goes). The second problem is that we need to insert a nop in the code (at addresses 0x6 and 0x12), because the literal pool must be word aligned.
So, how could the compiler be smarter? Well, if instead of generating a literal pool for each individual function it generated one for the whole compilation unit, then instead of having lots of little literal pools with duplicated data throughout, we would have a single literal pool for the whole file. As a bonus, you would only need alignment padding once as well! Obviously, if the compilation unit ends up being larger than 4KiB then you have a problem, but in that case you could still defer emitting the literal pool until after each 4KiB worth of code. As it turns out, the commercial compiler from ARM, RVCT, does exactly this. So let’s have a look at the code it generates:
00000000 <get_value>:
get_value():
   0:	4802      	ldr	r0, [pc, #8]	(c <set_value+0x6>)
   2:	6800      	ldr	r0, [r0, #0]
   4:	4770      	bx	lr

00000006 <set_value>:
set_value():
   6:	4901      	ldr	r1, [pc, #4]	(c <set_value+0x6>)
   8:	6008      	str	r0, [r1, #0]
   a:	4770      	bx	lr
   c:	00000000 	.word	0x00000000
			c: R_ARM_ABS32	.data$0
You can see that the code is more or less the same, but there is just one literal pool right at the end of the file, and no extra nops are needed for alignment. Without merging literal pools we have a .text size of 24 bytes; with the merging we slash that down to 16 bytes.
So merging literal pools is pretty good, but the frustrating thing is that in this example we don’t even need the literal pool! If we examine the final compiled image for this program:
Disassembly of section ER_RO:

00008000 <get_value>:
get_value():
    8000:	4802      	ldr	r0, [pc, #8]	(800c <set_value+0x6>)
    8002:	6800      	ldr	r0, [r0, #0]
    8004:	4770      	bx	lr

00008006 <set_value>:
set_value():
    8006:	4901      	ldr	r1, [pc, #4]	(800c <set_value+0x6>)
    8008:	6008      	str	r0, [r1, #0]
    800a:	4770      	bx	lr
    800c:	00008010 	.word	0x00008010

Disassembly of section ER_RW:

00008010 <value>:
    8010:	00000000 	.word	0x00000000
You should notice that the actual location of the value variable is 0x8010. At address 0x800c we have the literal pool storing the address of a variable which is in the very next word! If we optimised this by hand, we would end up with something like this (the offsets still need verifying):
Disassembly of section ER_RO:

00008000 <get_value>:
get_value():
    8000:	4802      	ldr	r0, [pc, #4]	(8008 <set_value+0x8>)
    8002:	4770      	bx	lr

00008004 <set_value>:
set_value():
    8004:	6008      	str	r0, [pc, #0]
    8006:	4770      	bx	lr

Disassembly of section ER_RW:

00008008 <value>:
    8008:	00000000 	.word	0x00000000
If we get rid of the literal pool entirely, we save the memory of the literal pool itself (4 bytes), plus the two instructions needed to load values out of the literal pool (4 bytes). This cuts our text size down to a total of only 8 bytes! That is a factor-of-3 improvement over the GCC-generated code. Granted, you are not always going to be able to perform this type of optimisation, but when you care about size, it is quite important. It would be nice if GCC supported a small data section concept, so that you could specify variables that essentially reside within the literal pool instead of needing an expensive (in terms of space and time) indirection.
For this project, it looks like the code will have to be hand-tweaked assembler, which is frustrating, because when you use a person as your compiler, iterations become really expensive, and you want to make sure you get your design right up front.