Determinism in Python

Tue, 15 Jan 2013 19:56:05 +0000

I'm working on a project where I want to ensure that two .pyc files I have are identical. Essentially, I want to verify that a .py byte-compiled by Python on one platform matches the same .py byte-compiled on another platform. This is more challenging than you might think! (As an aside, pretty much all the same challenges apply if you are trying to verify that a .py compiled at some specific time is identical to the same .py compiled at some later time.)

To understand the first problem, it is important to understand the format of a .pyc file. Basically, it is a header followed by the module's code object, serialised using the marshal library. In the first instance we are interested in just the header. It has three fields:

Magic             4 bytes
Timestamp         4 bytes
Source file size  4 bytes

These are checked during load to ensure that a stale .pyc file isn't inadvertently used. Anyway, if you are going to compare .pyc files, you need to ensure that the .py files have the same mtime in both of the places where you are generating the .pyc files. This is not necessarily a trivial matter: if you get the source from a .tar file, chances are it will preserve mtime, but if you checked it out from revision control, such as git, chances are it will just have the mtime from when you retrieved the file. The moral of the story is that you need to either ignore bytes 4-7, manually set the timestamp to a known value, or ensure that the .py files have the same mtime.
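To make the header concrete, here's a minimal sketch that unpacks those fields. Note it assumes the 16-byte layout of CPython 3.7 and later, which inserted a flags word after the magic; at the time of writing, the header was just the three 4-byte fields above.

```python
import struct

def read_pyc_header(path):
    """Unpack the .pyc header fields.

    Assumes the CPython 3.7+ layout: magic (4 bytes), flags (4),
    source mtime (4), source size (4). The original three-field
    layout described above simply lacks the flags word.
    """
    with open(path, "rb") as f:
        magic, flags, mtime, size = struct.unpack("<4sLLL", f.read(16))
    return magic, flags, mtime, size
```

With the fields separated out like this, ignoring the timestamp bytes (or rewriting them to a known value) becomes straightforward.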

Now, that was just a warm-up, because even after doing this you aren't necessarily going to get exactly the same bytes. Most of the time you will, but not always. I was trying this on the socket.py module. The output was almost identical, except for three bytes that were transposed, as seen in the diff of the hexdumps below:

432,433c432,433
< 00001af0  00 00 00 75 01 00 00 00  72 75 01 00 00 00 77 75  |...u....ru....wu|
< 00001b00  01 00 00 00 62 4e 69 ff  ff ff ff 28 0c 00 00 00  |....bNi....(....|
---
> 00001af0  00 00 00 75 01 00 00 00  77 75 01 00 00 00 62 75  |...u....wu....bu|
> 00001b00  01 00 00 00 72 4e 69 ff  ff ff ff 28 0c 00 00 00  |....rNi....(....|

This is kind of weird: almost identical. I didn't know what the bytes 0x77, 0x72, and 0x62 were at the time, but I'll give you a clue: they are the ASCII codes for the characters w, r, and b.

The next step in the debugging is to actually decompile (disassemble) the .pyc file. Strangely, there doesn't seem to be such a utility in the standard set of Python tools; however, this recipe gives you something relatively usable. After running the .pyc files through that, you find the cause of the problem:
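If you don't want to hunt down a recipe, a rough equivalent is easy to sketch with the standard marshal and dis modules: skip the header, unmarshal the code object, and disassemble it. Again, this assumes the 16-byte header of CPython 3.7+ (the header was 12 bytes at the time of writing).

```python
import dis
import importlib.util
import marshal

def disassemble_pyc(path):
    """Skip the .pyc header and disassemble the marshalled code object."""
    with open(path, "rb") as f:
        # The magic number must match the running interpreter, or
        # marshal.load() may misinterpret the bytecode that follows.
        assert f.read(4) == importlib.util.MAGIC_NUMBER, "compiled by a different Python"
        f.read(12)  # skip flags, mtime, and source size (3.7+ layout)
        code = marshal.load(f)
    dis.dis(code)
```

Running both .pyc files through something like this and diffing the output pinpoints the divergence quickly.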

480c480
<                                  16 LOAD_CONST               9 (frozenset({'b', 'r', 'w'}))
---
>                                  16 LOAD_CONST               9 (frozenset({'w', 'r', 'b'}))

This comes from a line of code in the original Python:

            if c not in {"r", "w", "b"}:

So, basically, each .pyc has a different representation of the set {'r', 'w', 'b'}. Of course, these are sets, so they are equal and the code works the same way; no real problem there. However, I'd really like the Python compile process to be deterministic. (You could say I'm determined to make it so! See what I did there?) If you start digging around in the Python implementation, you find that when marshalling a (frozen)set object, the library just pulls out the set elements in the order in which they are stored. So clearly, when the frozenset is created, the elements end up in a different order at different times.
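You can observe this order dependence directly, without going through marshal at all. A quick sketch that asks fresh interpreters, each pinned to a given hash seed, how they order the same frozenset (note: `set_order` is my own illustrative helper, not a standard API):

```python
import os
import subprocess
import sys

def set_order(seed):
    """Return the iteration order of frozenset({'r', 'w', 'b'}) as
    printed by a fresh interpreter running with the given hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.run(
        [sys.executable, "-c", "print(list(frozenset({'r', 'w', 'b'})))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

# The same seed always yields the same order; different seeds
# generally yield different orders.
print(set_order(0))
print(set_order(42))
```

This is exactly the effect that shows up in the marshalled bytes: the serialised order is just whatever iteration order the set happened to have.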

Next step: digging into the set implementation. Reading through setobject.c, it becomes clear that the ordering of items in a set is based on their hashes, which makes sense if you think about it. But surely the same value will have the same hash across different runs? WRONG!

Try something like:

$ python3 -c "print(hash('r'))"

You will get a different result each time! It turns out there is a good reason for this: the hash function can be used as an attack vector to DoS a web server or other service. If attackers can control which items are used as dictionary keys, they have the potential to choose a set of values that all share the same hash, which can quickly lead to pathological performance!
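The same experiment can be scripted. A small sketch that spawns fresh interpreters (removing any PYTHONHASHSEED already in the environment, so randomization is definitely active) and compares the results:

```python
import os
import subprocess
import sys

def fresh_hash(value="r"):
    """Compute hash(value) in a brand-new interpreter with hash
    randomization enabled (any inherited PYTHONHASHSEED is removed)."""
    env = {k: v for k, v in os.environ.items() if k != "PYTHONHASHSEED"}
    out = subprocess.run(
        [sys.executable, "-c", f"print(hash({value!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

# Almost certainly two different values, because each interpreter
# picks a fresh random seed at startup.
print(fresh_hash(), fresh_hash())
```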

The fix for this is to randomize the hashing algorithm on each Python invocation, making it much harder for an attacker to guess values that hash to the same thing.

More details are available on the Python bug tracker and the related mailing list thread.

The good news is that there is a simple way to disable this randomization: set the PYTHONHASHSEED environment variable to a fixed value (setting it to 0 disables randomization entirely). Once that is done you get deterministic hash values, your sets end up in the same order, and your compilation is deterministic. Win!
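Putting it all together, here's a sketch of a reproducible byte-compile: run py_compile in a child interpreter with the seed pinned (the helper name `compile_deterministically` is mine, not a standard API):

```python
import os
import subprocess
import sys

def compile_deterministically(src, pyc):
    """Byte-compile src in a child interpreter with a fixed hash seed,
    so set/frozenset constants marshal in a reproducible order."""
    env = dict(os.environ, PYTHONHASHSEED="0")
    subprocess.run(
        [sys.executable, "-c",
         "import py_compile, sys; py_compile.compile(sys.argv[1], cfile=sys.argv[2])",
         src, pyc],
        env=env, check=True,
    )
```

Compiling the same source twice this way (with the source mtime unchanged, so the header matches too) should yield byte-identical .pyc files.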
