The trouble with literal pools

Fri, 02 Jan 2009 06:43:37 +0000
article tech arm

Yesterday we saw how, with some careful programming and the right compiler flags, we could get GCC to generate optimally small code for our particular function. In this post we take a look at one of the ways in which GCC fails to produce optimal code. There are actually many ways, but the one I want to concentrate on in this post is the use of literal pools.

So, what is a literal pool? I’m glad you asked. A literal pool is an area of memory (in the text segment) used to store constants. These constants could be plain numerical constants, but their main use is to store the addresses of variables in the system. These addresses are needed because the ARM instruction set does not have any instructions for directly loading (or storing) a value at an arbitrary address in memory. Instead, ldr and str can only load or store at an offset of at most ±4095 bytes (a 12-bit immediate) from a base register, and that is in ARM mode; the Thumb encodings are even more restricted. Now, there are lots of ways you could generate code within this restriction. For example, you could ensure your data section is less than 8KiB in size and reserve a register to be used as a base for all data lookups, but such an approach only works if you have a limited data section size. The standard approach is that when a variable is used, its address is written out into a literal pool. The compiler then generates two instructions: the first reads the address from the literal pool, and the second accesses the variable through that address.

So, how exactly does this literal pool work? Well, so that a special register is not needed to point at the literal pool, the compiler uses the program counter (PC) register as the base register. The generated code looks something like: ldr r3, [pc, #28]. That instruction loads the value at a 28-byte offset from the current value of PC into the register r3. r3 then contains the address of the variable we want to access, and can be used like: ldr r1, [r3, #0], which loads the value of the variable (rather than its address) into r1. Now, as the PC is used as the base for the literal pool access, it should be clear that the literal pool must be stored close enough to the code that uses it.

To ensure that the literal pool is close enough to the code using it, the compiler stores a literal pool at the end of each function. This approach works pretty well (unless you have a 4KiB+ function, which would be silly anyway), but can be a bit of a waste.

To illustrate the problem, consider this (contrived) example code:

static unsigned int value;

unsigned int
get_value(void)
{
    return value;
}

void
set_value(unsigned int x)
{
    value = x;
}

Now, while this example is contrived, the pattern it involves shows up in a lot of real-world code: you have some private data in a compilation unit (value), and a set of accessor (get_value) and mutator (set_value) functions that operate on that private data. Usually the functions would be more complex than in our example, and usually there would be more than two. So let's have a look at the generated code:

00000000 <get_value>:
get_value():
   0:	4b01      	ldr	r3, [pc, #4]	(8 <get_value+0x8>)
   2:	6818      	ldr	r0, [r3, #0]
   4:	4770      	bx	lr
   6:	46c0      	nop			(mov r8, r8)
   8:	00000000 	.word	0x00000000
			8: R_ARM_ABS32	.bss

0000000c <set_value>:
set_value():
   c:	4b01      	ldr	r3, [pc, #4]	(14 <set_value+0x8>)
   e:	6018      	str	r0, [r3, #0]
  10:	4770      	bx	lr
  12:	46c0      	nop			(mov r8, r8)
  14:	00000000 	.word	0x00000000
			14: R_ARM_ABS32	.bss

You can see that each function has a literal pool (at addresses 0x8 and 0x14). You can also see that there is a relocation associated with each of these addresses (R_ARM_ABS32 .bss). These relocations mean that at link time the address of value will be stored at locations 0x8 and 0x14. So, what is the big deal here? Well, there are two problems. First, we have two literal pools containing duplicate data: by storing the address of value twice, we waste 4 bytes (remember from yesterday, we have a very tight memory budget and we care where every byte goes). The second problem is that we need to insert a nop in the code (at addresses 0x6 and 0x12), because the literal pool must be 4-byte aligned.

So, how could the compiler be smarter? Well, if instead of generating a literal pool for each individual function it generated one for the whole compilation unit, then instead of having lots of little literal pools with duplicated data throughout, we would have a single literal pool for the whole file. As a bonus, you would only need the alignment padding once as well! Obviously if the compilation unit ends up being larger than 4KiB then you have a problem, but in that case you could still defer emitting the literal pool until 4KiB worth of code has been generated. As it turns out, the commercial compiler from ARM, RVCT, does exactly this. So let's have a look at the code it generates:

00000000 <get_value>:
get_value():
   0:	4802      	ldr	r0, [pc, #8]	(c <set_value+0x6>)
   2:	6800      	ldr	r0, [r0, #0]
   4:	4770      	bx	lr

00000006 <set_value>:
set_value():
   6:	4901      	ldr	r1, [pc, #4]	(c <set_value+0x6>)
   8:	6008      	str	r0, [r1, #0]
   a:	4770      	bx	lr
   c:	00000000 	.word	0x00000000
			c: R_ARM_ABS32	.data$0

You can see that the code is more or less the same, but there is just one literal pool, right at the end of the file, and no extra nops are needed for alignment. Without merging literal pools we have a .text size of 24 bytes; with merging we slash that down to 16 bytes.

So merging literal pools is pretty good, but the frustrating thing is that in this example we don’t even need the literal pool! If we examine the final compiled image for this program:

Disassembly of section ER_RO:

00008000 <get_value>:
get_value():
    8000:	4802      	ldr	r0, [pc, #8]	(800c <set_value+0x6>)
    8002:	6800      	ldr	r0, [r0, #0]
    8004:	4770      	bx	lr

00008006 <set_value>:
set_value():
    8006:	4901      	ldr	r1, [pc, #4]	(800c <set_value+0x6>)
    8008:	6008      	str	r0, [r1, #0]
    800a:	4770      	bx	lr
    800c:	00008010 	.word	0x00008010
Disassembly of section ER_RW:

00008010 <value>:
    8010:	00000000 	.word	0x00000000

You should notice that the actual location of the value variable is 0x8010. At address 0x800c we have the literal pool storing the address of a variable which lives in the very next word! If we optimised this by hand, we would end up with something like (the offsets still need verifying):

Disassembly of section ER_RO:

00008000 <get_value>:
get_value():
    8000:	4801      	ldr	r0, [pc, #4]	(8008 <value>)
    8002:	4770      	bx	lr

00008004 <set_value>:
set_value():
    8004:	6008      	str	r0, [pc, #0]
    8006:	4770      	bx	lr
Disassembly of section ER_RW:

00008008 <value>:
    8008:	00000000 	.word	0x00000000

If we get rid of the literal pool entirely, we save the memory of the literal pool itself (4 bytes), plus the two instructions needed to load addresses out of the literal pool (4 bytes). This cuts our text size down to a total of only 8 bytes! That is a factor-of-3 improvement over the GCC-generated code. Granted, you are not always going to be able to perform this type of optimisation, but when you care about size it is quite important. It would be nice if GCC supported a small-data-section concept, so that you could specify variables that essentially reside within the literal pool instead of needing an expensive (in terms of space and time) indirection.

For this project, it looks like the code will have to be hand-tweaked assembler, which is frustrating, because when you use a person as your compiler, iterations become really expensive, and you want to make sure you get your design right up front.
