I'm working on a project where I want to ensure that two .pyc files
I have are identical. Essentially I want to verify that a .py byte-compiled
by Python on one platform matches the same .py byte-compiled on another
platform. This is more challenging than you might think! (As an aside,
pretty much all the same challenges apply if you are trying to verify
that a .py compiled at some specific time is identical to the same .py
compiled at some later time.)
To understand the first problem it is important to understand the
format of a .pyc file. Basically, it is a header,
followed by the module's code object serialised using the marshal
library. In the first instance we are interested in just the header. It
has three fields:
| Magic | 4 bytes (bytes 0-3) |
| Timestamp | 4 bytes (bytes 4-7) |
| Source file size | 4 bytes (bytes 8-11) |
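To make that layout concrete, here is a small sketch that reads those three fields with the struct module. It assumes the 12-byte header described above (used by Python 3.3-3.6); Python 3.7+ inserts an extra 4-byte flags field after the magic.

```python
import struct

def read_pyc_header(path):
    # 12-byte header: 4-byte magic, 4-byte little-endian timestamp,
    # 4-byte little-endian source size (Python 3.3-3.6 layout; 3.7+
    # adds a flags field after the magic).
    with open(path, "rb") as f:
        magic, timestamp, source_size = struct.unpack("<4sLL", f.read(12))
    return magic, timestamp, source_size
```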
These are verified during load to ensure that a stale .pyc file isn't
inadvertently used. Anyway, if you are going to compare .pyc files you need to ensure
that the .py files you use have the same mtime in both places where
you are generating the .pyc files. This is not necessarily a trivial matter: if you
get the source from a .tar file, chances are it will preserve the mtime,
but if you got it out of revision control, such as git, chances are it will just have the
mtime from when you retrieved the file. The moral of the story is that you
need to either ignore bytes 4-7, manually set the timestamp to a known value, or ensure
that the .py files have the same mtime.
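The "ignore bytes 4-7" option can be sketched as a comparison that masks out the timestamp field (again assuming the header layout above, where the timestamp sits at bytes 4-7):

```python
def pycs_equal_ignoring_timestamp(path_a, path_b):
    # Byte-for-byte comparison that skips bytes 4-7, where the
    # timestamp lives in the header layout described above.
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = fa.read(), fb.read()
    return a[:4] == b[:4] and a[8:] == b[8:]
```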
Now, that is just a warm-up, because, even after doing this, you
aren't necessarily going to get the same exact bytes. Most of the time
you will, but not always. I was trying this on the
socket.py module. The output was almost identical except for
three bytes that were transposed as seen in the diff of the hexdump below:
432,433c432,433
< 00001af0 00 00 00 75 01 00 00 00 72 75 01 00 00 00 77 75 |...u....ru....wu|
< 00001b00 01 00 00 00 62 4e 69 ff ff ff ff 28 0c 00 00 00 |....bNi....(....|
---
> 00001af0 00 00 00 75 01 00 00 00 77 75 01 00 00 00 62 75 |...u....wu....bu|
> 00001b00 01 00 00 00 72 4e 69 ff ff ff ff 28 0c 00 00 00 |....rNi....(....|
This is kind of weird: almost identical. I didn't know what the bytes 0x77, 0x72, and 0x62 were at the time, but here's a clue: they are the ASCII codes for the characters w, r, and b.
Next step in the debugging is to actually disassemble the
.pyc file. Strangely there doesn't seem to be such a utility
in the standard Python set of tools; however,
this recipe gives you something relatively usable. So after running
the .pyc files through that you find the cause of the problem:
480c480
< 16 LOAD_CONST 9 (frozenset({'b', 'r', 'w'}))
---
> 16 LOAD_CONST 9 (frozenset({'w', 'r', 'b'}))
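For reference, a minimal disassembler along those lines can be sketched with the standard marshal and dis modules. The header-size switch is an assumption about which Python version wrote the file:

```python
import dis
import marshal
import sys

def disassemble_pyc(path):
    # Skip the .pyc header (12 bytes on Python 3.3-3.6, 16 bytes on
    # 3.7+, which added a flags field), then unmarshal the module's
    # code object and hand it to dis.
    header_size = 16 if sys.version_info >= (3, 7) else 12
    with open(path, "rb") as f:
        f.read(header_size)
        code = marshal.load(f)
    dis.dis(code)
```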
This comes from a line of code in the original Python:
if c not in {"r", "w", "b"}:
So, basically, each .pyc has a different
representation of the set {'r', 'w', 'b'}. Now of course,
these are sets so they are equal and the code works the same way, no
real problem there. However, I'd really like the Python compilation
process to be deterministic. (You could say I'm determined to make it
so! See what I did there?) Now if you start digging into the
Python implementation you find that when marshalling a (frozen)set object,
the library just writes out the set elements in the order in which they
are stored, so clearly, when the frozenset is created, it can
put the elements in a different order at different times.
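You can see marshal's behaviour directly: the serialised form of a frozenset is roughly a type code, a count, and then the elements in whatever order the set happens to iterate. The round trip below only guarantees set equality, not byte equality across runs, which is exactly the point:

```python
import marshal

fs = frozenset({"r", "w", "b"})
dumped = marshal.dumps(fs)

# The elements come back out in iteration order; two interpreter runs
# that iterate the set differently therefore produce different bytes,
# even though the deserialised sets compare equal.
assert marshal.loads(dumped) == fs
```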
Next step: digging into the set implementation. Reading through
setobject.c it becomes clear that the ordering of items
in a set is based on the hashes of its elements, which makes sense if you
think about it. But surely the same value will have the
same hash across different runs? WRONG!
Try something like:
$ python3 -c "print(hash('r'))"
You will get a different result each time! It turns out there is a good reason for this: the hash function can be used as an attack vector to DoS a webserver or similar. If an attacker can control which items are used as dictionary keys, then they can potentially choose a set of values that all share the same hash, which quickly leads to pathological performance.
The fix to this is to randomize the hashing algorithm on each Python invocation, making it much harder for an attacker to guess values that have the same hash function.
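You can demonstrate the randomization from a single script by spawning fresh interpreters. (hash_of_str here is a hypothetical helper name, not a standard API; it just shells out to a child Python with a controlled PYTHONHASHSEED.)

```python
import os
import subprocess
import sys

def hash_of_str(s, seed=None):
    # Run a fresh interpreter and report hash(s); seed controls
    # PYTHONHASHSEED in the child (None leaves randomization on).
    env = dict(os.environ)
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    else:
        env.pop("PYTHONHASHSEED", None)
    out = subprocess.run(
        [sys.executable, "-c", f"print(hash({s!r}))"],
        env=env, capture_output=True, text=True,
    )
    return out.stdout.strip()

# With randomization left on, two calls will almost always disagree;
# with a fixed seed they always agree.
```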
More details on the Python bug tracker and the related mailing list thread.
The good news is that there is a simple way to disable this randomization:
set the PYTHONHASHSEED environment variable to 0 (or to any fixed
integer to get a fixed seed). Once that is done you get deterministic hash
values, your sets end up in the same order,
and your compilation is deterministic. Win!
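Putting it all together, a sketch of a reproducible-compile check: pin the source mtime, compile the same source twice in fresh interpreters with PYTHONHASHSEED=0, and the .pyc bytes should come out identical.

```python
import os
import subprocess
import sys

def compile_deterministically(src, pyc):
    # Compile in a fresh interpreter with hash randomization disabled
    # via PYTHONHASHSEED=0, so frozenset ordering is reproducible.
    env = dict(os.environ, PYTHONHASHSEED="0")
    subprocess.run(
        [sys.executable, "-c",
         f"import py_compile; py_compile.compile({src!r}, cfile={pyc!r})"],
        env=env, check=True,
    )
```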