Invalid integer constants

Sun, 10 Feb 2013 19:37:29 +0000

So, another fun fact about C. You'd think that a language would make it possible to express any value for its in-built types as a simple constant expression. Of course, you can't do that in C.

In C you can express both signed and unsigned integer of any of the integer types. For example 5 is a signed integer; 5L is a signed long, 5U is an unsigned integer. But, there is no way to expression a negative number as a constant! Despite what you might easily assume, -5 is not constant, rather it is the unary - operator applied to the constant 5. Now, generally this doesn't cause any problems. Unless of course, you care about types and care about having programs that are valid according to the standard.

So, assuming two's complement 32-bit integers, if you want an expression that has the type int and the minimum value (-2147483648), then you need to be careful. If you choose the straight-forward approach you'd write this as -2147483648, and you'd be wrong. Why? Well, the integer constant 2147483648 has the type long, since it can not be represented as an integer (the maximum value of a 32-bit integer is 2147483647). So, the type of an expression of the form - <long> is, of course long. So, what can we do instead? Well, there are lots of approaches, but probably the simplist is: -2147483647 - 1. In this case, 2147483647 can be represented as an integer, -2147483647, is still of type int, and then you can safely subtract 1.

Of course, this becomes even more interesting is you are dealing with the type long long. Assuming a two's complement, 64-bit type, then the minimum value is -9223372036854775808, but of course, you can't just write -9223372036854775808LL, since 9223372036854775808LL can't be represented by the type long long (and assuming the implementation doesn't define any extended integer types), this would actually be an invalid C program.

with fail():

Sun, 20 Jan 2013 10:19:59 +0000

So, yesterday's article on with turned out the be a bit wrong. As lesson in reading the explanatory text and not just the example code. (Or possibly a lesson in providing better example code!).

So, I gave an example along the lines of:

@contextmanager
def chdir(path):
    # This code is wrong. Don't use it!
    cwd = os.getcwd()
    os.chdir(path)
    yield
    os.chdir(cwd)

Now, this appears to mostly work. The problem occurs when you have some code like:

with chdir('/'):
    raise Exception()

In this case the final cleanup code (os.chdir(cwd)) will not be executed, which is almost certainly not what you want. The correct was to write this is:

@contextmanager
def chdir(path):
    cwd = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(cwd)

Writing it this way the final os.chdir(cwd) will be executed, and the exception will still be propagated. So I think this kind of sucks, because I'd really have preferred it to have this behaviour by default. To this end I've created a simplecontextmanager function decorator. Basically you can use it like so:

@simplecontextmanager
def chdir(path):
    cwd = os.getcwd()
    os.chdir(path)
    yield
    os.chdir(cwd)

Looks pretty much like the first example, however it will have the semantics of the second example. Now clearly the simplecontextmanager doesn't provide the whole range of flexibility that contextmanager does, but for simple code it is more straightforward.

My pyutil git repo has been updated to have working version of the chdir, umask and update_env context managers, and also contains the source for the new simplecontextmanager.

Moral of the story, test your code and read the docs carefully!

with awesome:

Sat, 19 Jan 2013 19:38:43 +0000

Update: This article has some errors, that are correct in a new article.

I'm only like 8 years late to the party on this one, but I've got to say Python's with statement is pretty awesome. Although I've used the functionality off-and-on I'd mostly avoided writing my own context managers as I thought it required classes with magic __enter__ and __exit__ methods. And besides, my other main language is C, so you know, I'm used to releasing resources when I need to.

Anyway, I finally decided that I should actually learn how to make a context manager, and it turns out to be super easy thanks to Python yield statement. So, some truly simple examples that will hopefully convince you to have a go at make your own context managers.

First up, a simple chdir context manager:

@contextmanager
def chdir(path):
    cwd = os.getcwd()
    os.chdir(path)
    yield
    os.chdir(cwd)

If you are doing a lot of shell-style programming in Python this comes in handy. For example:

with chdir('/path/to/dosomething/in'):
    os.system('doit')

A lot simpler than manually having to save and restore directories in your script. Along the same lines a similar approach for umask:

@contextmanager
def umask(new_mask):
    cur_mask = os.umask(new_mask)
    yield
    os.umask(cur_mask)

Again, this avoids having to do lots of saving and restoring yourself. And, the nice thing is that you can easily use these together:

with umask(0o002), chdir('/'):
     os.system('/your/command')

Code for these examples is available on my GitHub page.

An argument against stdint.h

Fri, 18 Jan 2013 11:44:09 +0000

I've only been programming C for about 15 years, so for the vast majority of that time I've been using C99. And most of that time has been spent writing low-level code (operating systems, drivers, embedded systems). In these kind of systems you care a lot about the exact sizes of types, since memory and cache are important as so is mapping variables to memory mapped registers. For this reason having a standard way of declaring variables of a certain bitsize struck me as a pretty smart thing to do, so I've almost always used stdint.h on my projects. And in the rare cases where it wasn't available used some close equivalent. However, there is an argument to be made against using them (an extension of the general argument against typedefs). The core of the argument against is that it hides information from the programmer that is vital in constructing correct programs.

Let's start with some simple C code:

unsigned int a;
unsigned int b;

The question is, what is the type of the expression a + b. In this case, things are fairly straight-forward. The results expression has the type unsigned int.

Slightly trickier:

unsigned short a;
unsigned short b;

In this case due to integer promotion the operands are promoted to int and therefore the resulting expression is also of an type int. Integer promotion is a little obscure, but relatively well known. Most of the time this isn't really a problem because if you assigned the expression back to a variable of type unsigned short the effect will be the same. Of course if you assign it to an unsigned int, then you are going to be in trouble!

So nothing really new there. There is some obscurity, but with local knowledge (i.e: just looking at that bit of code), you can tell with complete accuracy wha The problem occurs when you have something like:

uint16_t a;
uint16_t b;

So, now what is the type of the expression a + b? The simple answer is it depends. The only way to know the answer to this question is if you know what the underlying type of uint16_t is. And this is a problem, because the correctness of your program depends on being able to answer that question correctly. The unfortunate result of this is that in an attempt to make code more portable across different platforms (i.e.: by not relying on the exact sizes of types), you end up in a situation where the correctness of your program still depends on the underlying platform, although now in a nasty obscure manner. Win!

Of course, most of the time this doesn't matter, but in the times where it does matter it is certainly a problem. Unfortunately, I don't have a good suggestion on how to avoid this problem, other than careful coding (and I really hate having to rely on careful coding just to essentially get type safety correct). If you've got a good suggestion please let me know in the comments.

Of course, I think I will still continue to use stdint.h despite these problems, however it is certainly something that C programmers should be aware of when using these types.

Determinism in Python

Tue, 15 Jan 2013 19:56:05 +0000

I'm working on a project where I want to ensure that two .pyc files I have are identical. Essentially I want to verify that a .py byte-compiled by Python on one platform, matches the same .py byte-compiled on another platform. This is more challenging than you might think! (As an aside, pretty much all the challenges apply equally if you are trying to verify that a .py compiled at some specific time is indentical to a .py compiled at some later time.

To understand the first problem it is import to understand the format of a .pyc file. Basically, it is a header, followed by the module's code object serialised using the marshal library. In the first instance we are interested in just the header. It has three fields:

Magic4 bytes
Timestamp4 bytes
Source file size4 bytes

These are basically verified during load to ensure that a stale .pyc file isn't inadvertently used. Anyway, if you are going to compare .pyc files you need to ensure that the .py files you use have the same mtime on both the places you are generating the .pyc files. This is not necessarily a trivial matter, if you get the source from a .tar file chances are that it will preserve mtime, but if you got it out from revision control, such as git, chances are it will just have the mtime from when you retrieved the file. The moral of the story here is you need to either ignore bytes 4 - 7, manually set the timestamp to a known value, or ensure that the .py files have the same mtime.

Now, that is just a warm, because, even after doing this, you aren't necessarily going to get the same exact bytes. Most of the time you will, but not always. I was trying this on the socket.py module. The output was almost identical except for three bytes that were transposed as seen in the diff of the hexdump below:

432,433c432,433
< 00001af0  00 00 00 75 01 00 00 00  72 75 01 00 00 00 77 75  |...u....ru....wu|
< 00001b00  01 00 00 00 62 4e 69 ff  ff ff ff 28 0c 00 00 00  |....bNi....(....|
---
> 00001af0  00 00 00 75 01 00 00 00  77 75 01 00 00 00 62 75  |...u....wu....bu|
> 00001b00  01 00 00 00 72 4e 69 ff  ff ff ff 28 0c 00 00 00  |....rNi....(....|

This is kind of weird, almost identical. So I didn't really know what the bytes 0x77, 0x72, and 0x62 were at the time, but I'll give you a clue: the are the ASCII codes for the characters w, r, and b.

Next step in the debugging is to actually decompile (disassemble) the .pyc file. Strangely there doesn't seem to be such a utility in the standard Python set of tools, however this recipe gives you something relatively usable. So after running the .pyc files through that you find the cause of the problem:

480c480
<                                  16 LOAD_CONST               9 (frozenset({'b', 'r', 'w'}))
---
>                                  16 LOAD_CONST               9 (frozenset({'w', 'r', 'b'}))

This comes from a line of code in the original Python:

            if c not in {"r", "w", "b"}:

So, basically, each .pyc has a different representation of the set {'r', 'w', 'b'}. Now of course, these are sets so they are equal and the code works the same way, no real problem there. However, I'd really like the python compile process to be deterministic. (You could say I'm determined to make it so! See what I did there!). Now if you start digging around in to the Python implementation you find that when marshalling a (frozen)set object, the library just pulls out the set elements in the order in which they are stored, so clearly, when the frozenset is created, it appears to put the elements in a different order at different times.

Next step, digging in to the set implementation. Reading through setobject.c it becomes clear that the ordering of items in a set is based on their hash, which makes sense if you think about it. So, this makes sense, but surely values same value will have the same hash across different runs? WRONG!.

Try something like:

$ python3 -c "print(hash('r'))"

You will get a different result each time! It turns out there is a good reason for this. The hash function can be used as an attack vector to DoS a webserver or something else. If the attacker can control which items are used as dictionary keys, then they have the potential to choose a set of values all with the same hash value, which can quickly lead to poor performance!

The fix to this is to randomize the hashing algorithm on each Python invocation, making it much harder for an attacker to guess values that have the same hash function.

More details on the Python bug tracker and more list thread.

The good news is that there is a simple way to disable this randomization by setting the PYTHONHASHSEED environment variable. Once that is done you get deterministic hash values, and then your sets end up being in the same order, and your compilation is deterministic. Win!

A tale of autotools pedantry

Wed, 09 Jan 2013 16:48:18 +0000

The backstory to this one is too tedious to describe, and besides, whenever I do that I get irrelevant comments about the backstory, also I'm going to try and make my posts more concise this year both for my sake and yours. See, it is working already!

Anyway, I'm using autotools to compile binutils, and for some reason when a run make install, the permissions of directories are different on different machines. I mean, it doesn't really matter other than the fact that I'm tar-ing up the result and checking the tarfiles are identical (for various regression tests and validating that builds are reproducible). And I could probably just fix this after make install has run by changing the file permissions as I add the file to the archive (since I already squelch the username and timestamp in a similar manner), but doing that wouldn't let me have fun appreciating how awesome the autotools are.

At first I assumed this was just a umask problem, and that is certainly part of it (why the Fedora default umask is 002 remains a mystery to me). Anyway, after changing my other machine to have a matching 002 umask the same problem was happening. (Fun note, if I'd experimented the other way, and changed my umask on Fedora to something sane, I would have also managed to avoid this fun.). So anyway, at this point on Fedora make install creates group-writable directories (as expected with a 002 umask), however, on OS X make install continues to create non group-writable directories, even with the modified umask.

The best thing about autotools is that it makes the output from make really simple to read and understand, so it didn't take long at all to work out that on Fedora it was directly using mkdir -p to create the directories, however on OS X it was using install-sh -c -d to create the directories. install-sh is a magical shell script that, as far as I can tell, is install during the auto* process.

Just as the output from running an autoconf Makefile is a pleasure to read, the generated Makefiles are themselves a work of beauty primarily underscored by great clarity and concision, much like my own prose./

In a snap, I found the offending line:

test -z "$(man1dir)" || $(MKDIR_P) "$(DESTDIR)$(man1dir)"

Sarcasm aside, this is pretty simple to grok; if we have a non empty $(man1dir) variable, create the directory using whatever MKDIR_P expands to. Given the name of the variable, you would have to assume that this is generally assumed to expand to mkdir -p in the common case, which is exactly what it does on my Fedora box. In any case, we are now just left with two interesting questions: Why isn't mkdir -p used on OS X? and Why does install -c -d have different semantics to mkdir -p? and Why does configure choose that if the semantics are different? and Why can't I count?

Well, to answer the first question, the magical configure script does an explicit version check on mkdir using the --version flag. If your mkdir doesn't give back one of the magically blessed strings, you are out of luck. As you can guess the BSD derived mkdir on OS X doesn't play nicely here, in fact it doesn't even support the --version flag at all. The interesting thing is that this check is described as checking for a thread-safe mkdir -p, but really it is just checking for a GNU mkdir. This check was added in 1.8.3.

Early versions of OS X mkdir certainly seem to have the race-condition (see mkdir.c), however more modern version appear to use mkpath_np underneath, which appears on my first reading to correctly address the issue (see: mkpath_np.c).

OK, so that is a pretty good answer for that first question; looks like autotools saves the day again by protecting us from broken mkdir -p implementations, of which there seem to be many. (Seriously, how do you fuck that up in the first place? Time-of-check vs time-of-use is a pretty simple thing to consider during design. More to the point how does it stay there for what, 20 years, without getting fixed?) Of course, now that various non-GNU mkdir implementations (well at least one), seem to have fixed the problem, it would be nice if configure had some way of knowing and didn't have to drop to the sub-optimal install-sh -c -d.

OK, so now on to trying to work out what is going on with install-sh. In theory this should have the same semantics as mkdir -p. Or at least I think it should if configure is going to choose it in preference to the mkdir -p. Now this script does a lot, but if we find the interesting bit is:

(umask $mkdir_umask &&
    eval "\$doit_exec \$mkdirprog $prefixes")

This is explicitly setting the umask to a variable mkdir_umask before executing mkdir (in a round-about way)

Well, this is pretty much explained in the source:

# Create intermediate dirs using mode 755 as modified by the umask.
# This is like FreeBSD 'install' as of 1997-10-28.

and then backed up by this commit message. Basically, using mode 755 as the base mode is safer than using mkdir's default of 777. On some levels I've got to agree (see my earlier outrage at Fedora's default umask), but shouldn't this kind of script obey whatever I specify as my umask? If I was collaborating on a group project, and had explicitly set my umask to make that easy, then this would be a pretty annoying. Of course, my main problem here is that the behaviour is different to that of mkdir -p.

So, the best work around for this is to just not use a umask of 002 in the first place. To be honest, I'm not really sure if this behaviour, is a bug or a feature, and if it is a bug whether the mkdir -p or the install -d behaviour is correct.

The real moral of this story is that if you are super pedantic about the little things, you get to find out really uninteresting details about software you'd rather not use while wasting time that could be better spent on just about anything else interesting things about the software you use every day, making you a more knowledgeable engineer.

Using pre-bound sockets in node

Sat, 08 Dec 2012 17:02:43 +0000
tech node debugging

Abstract: This article provides a short explanation of how I fixed a problem I was having with node. In doing so it provides an introduction in to some of the node’s internal architecture.

There are times when running a node webserver where, rather than opening a network socket yourself, you want to use an existing socket. For example, I’ve got a simple loader which does some minimal setup, opening a socket and binding to a port, before exec()-ing another server (in this case a node server). In pseudo-code it looks something like:

s = socket(...);
bind(s, ...);
setuid(unprivileged_uid);
exec(...);

The goal behind this is to acquire some privileged resources (such as port number less than 1024) while running as root, and then dropping privileges before executing a larger, less trusted, code base. (I’m sure there are other potential approaches for achieving a similar kind of goal, that isn’t really what this post is about).

Anyway, if you use this approach, when calling a createServer API, rather than providing an address and port when listening, you can provide a file descriptor. Effectively what happens is that the code operates something like:

if (fd is null) {
   fd = socket(...)
   bind(fd, ...)
}

fd.listen()

This pattern worked really well for me on node around the time of version 0.4. As time passed and I upgraded from 0.4 through to 0.8 this approach appeared to continue working real well. Unfortunately, not appearances can be deceiving! It wasn’t until much later when trying to log an request’s remote address that I started to run into any problems. For some reason despite everything mentioned in the docs al attempts to derefence the remoteAddress property resulted in failure. It’s at this time that open source (or at least source available) libraries really shine, as it makes debugging really possible. A few dozen printfs later (well, OK, a combination of console.log and printf), and I was able to rule out any of my code, and any of express.js and any of the node http.js and https.js libraries.

The problem was that my socket object didn’t have a getpeername method, which is weird, because you expect sockets to have a getpeername method. The reason that the socket object didn’t have a getpeername method, is that it wasn’t actually a socket. Well, unfortuantely things aren’t quite that simple. There are a number of layers of abstraction in node, that make some of this more difficult to understand!

The top level of abstraction there is a javascript object that is based on node’s Socket prototype. The Socket prototype provides a number of userful methods; most of the things that you would expect from a socket: connect, read write, etc. It also defines the property we are most interested in: remoteAddress. Now this is a really simple property:

Socket.prototype.__defineGetter__('remoteAddress', function() {
  return this._getpeername().address;
});

As you can see, this property defers all the real work to another method _getpeername:

Socket.prototype._getpeername = function() {
  if (!this._handle || !this._handle.getpeername) {
    return {};
  }
  if (!this._peername) {
    this._peername = this._handle.getpeername();
    // getpeername() returns null on error
    if (this._peername === null) {
      return {};
    }
  }
  return this._peername;
};

Now, ready the code for _getpeername it should be clear that the Socket object is primarily a wrapper for an underlying _handle. After exporing the code in net.js some more it becomes clear that _handle could be one of two different things: a Pipe() or a TCP() object. These objects are implemented in C++ rather than Javascript. Now, reading the code for these objects (in pipe_wrap.cc and tcp_wrap.cc) it becomes clear that these two objects have very similar implementation, however the TCP object provides a few more features, critically it provides the getpeername method, whereas the Pipe object does not. This is the key to the underlying problem: my Socket objects have a Pipe object as the handle, rather than an a TCP object as the handle. The next question is why!

The answer lies in the createServerHandle function residing in net.js. This is the function that eventuallly is called when the HTTP listen method is called. This has some code like this:

  if (typeof fd === 'number' && fd >= 0) {
    var tty_wrap = process.binding('tty_wrap');
    var type = tty_wrap.guessHandleType(fd);
    switch (type) {
      case 'PIPE':
        debug('listen pipe fd=' + fd);
        // create a PipeWrap
        handle = createPipe();
      default:
        // Not a fd we can listen on.  This will trigger an error.
        debug('listen invalid fd=' + fd + ' type=' + type);
        handle = null;
        break;
   ....

The upshot of this code is that if an fd is specified, node tries to guess what kind of thing the file descriptors refers to and creates th handle based on this guess. The interesting thing with the switch statement is that it only has a single valid case: PIPE. This is interesting for a couple of reasons; it clearly isn’t hitting the default error case, which means that node somehow guesses that my file descriptor is a PIPE, which is strange, since it is definitely a socket. The kind of amazing thing is that apart from the inability to retrieve the remote address, everything in the system works perfectly. We’ll come back to this point later, but for now the challenge is to try and find out why the the guessHandleType method would think my socket is a pipe!

Again we get to follow the layers of abstraction game. guessHandleType is implemented in the tty_wrap function:

Handle<Value> TTYWrap::GuessHandleType(const Arguments& args) {
  HandleScope scope;
  int fd = args[0]->Int32Value();
  assert(fd >= 0);

  uv_handle_type t = uv_guess_handle(fd);

  switch (t) {
    case UV_TTY:
      return scope.Close(String::New("TTY"));

    case UV_NAMED_PIPE:
      return scope.Close(String::New("PIPE"));

    case UV_FILE:
      return scope.Close(String::New("FILE"));

    default:
      assert(0);
      return v8::Undefined();
  }
}

So, this wrapper function is really just turning results from an underlying uv_guess_handle function in to strings. Not much interesting, although it is somewhat interesting to know that at one layer of abstraction things are called NAMED_PIPE but simply PIPE at another. Looking at uv_guess_handle gets more interesting:

uv_handle_type uv_guess_handle(uv_file file) {
  struct stat s;

  if (file < 0) {
    return UV_UNKNOWN_HANDLE;
  }

  if (isatty(file)) {
    return UV_TTY;
  }

  if (fstat(file, &s)) {
    return UV_UNKNOWN_HANDLE;
  }

  if (!S_ISSOCK(s.st_mode) && !S_ISFIFO(s.st_mode)) {
    return UV_FILE;
  }

  return UV_NAMED_PIPE;
}

Huh, so, from this code it becomes clear that if a file-descriptor is a FIFO or a SOCK the uv library wants to treat the file-descriptor as a UV_NAMED_PIPE. This would seem to be the source of the problem. Clearly named pipes and sockets are different things, and it is strange to me that this function would force a socket to be treated as a named pipe (especially when the upper layer treat the two things differently). As an aside, this is the place where comments in code are essential. At this point I have to guess why this is written in this non-obvious manner. Some potential things come to mind: compatability with the Windows version, expediency, a restriction on the API to only return one of { TTY, UNKNOWN, FILE or NAMED_PIPE }. The other odd things is that a macro S_ISREG is also provided that could test explicitly for something that is UV_FILE, but that isn’t used so things such as directories, character devices and block devices are also returned as just FILE. Without comments it is difficult to tell if this was intentional or accidental.

So the first fix on the way to solving the overall problem is to update uv_guess_handle so that it can actually return an appropriate value when we have a socket and not a pipe. There are two possible defines that this could be: UV_TCP or UV_UDP. There is likely a some way to distinguish between the two, however for my purposes I know this is always going to be TCP, so I’m going to return UV_TCP. Of course, I’m sure this will cause someone else a similar problem down the track, so if you know what the appropriate interface is, let me know!

Another approach that could be used is to leave this function alone entirely and provide some way of specifying the to listen and createServer whether the file-descriptor is a named-pipe or a socket, however deadling with all the parameter marshalling through the stack made this somewhat unattractive.

Once this is done, net.js can then be fixed up to create a TCP() object, rather than a Pipe() object in createServerHandle.

Aside: I find it amazing that a 43-line patch requires aroudn 1400 words of English to explain the rationale and back-story!

Now we’ve fixed enough of the underlying libraries that we can think of fixing net.js. What we’d like to do is something like this:

    var type = tty_wrap.guessHandleType(fd);
    switch (type) {
      case 'PIPE':
        debug('listen pipe fd=' + fd);
        // create a PipeWrap
        handle = createPipe();
        break;

      case 'TCP':
        debug('listen socket fd=' + fd);
        handle = createTCP();
        break;

      default:
        // Not a fd we can listen on.  This will trigger an error.
        debug('listen invalid fd=' + fd + ' type=' + type);
        global.errno = 'EINVAL'; // hack, callers expect that errno is set
        handle = null;
        break;
    }
    if (handle) {
        handle.open(fd);
        handle.readable = true;
        handle.writable = true;
    }
    return handle;

This looks pretty good, but unfortunately it doesn’t work. That’s because although the Pipe() object supports an open method, the TCP() object (which has a very similar set of APIs) does not support the open method! How very frustrating. Now the implementation of pipe’s open method is relatively straight forward:

Handle<Value> PipeWrap::Open(const Arguments& args) {
  HandleScope scope;

  UNWRAP(PipeWrap)

  int fd = args[0]->IntegerValue();

  uv_pipe_open(&wrap->handle_, fd);

  return scope.Close(v8::Null());
}

That looks pretty easy to implement on the TCP object, however we hit a problem when we get to the uv_pipe_open line. This function takes something of uv_pipe_t type, where as the handle in a TCP object has uv_tcp_t type.

Dropping down another level of abstraction again, we need to add a new uv_tcp_open function in that takes a uv_tcp_t parameter. It turns out that this is pretty straight foward:

int uv_tcp_open(uv_tcp_t* tcp, uv_file fd) {
  return uv__stream_open((uv_stream_t*)tcp,
                         fd,
                         UV_STREAM_READABLE | UV_STREAM_WRITABLE);
}

The only frustrating thing is that this is almost an exact duplicate of the code in uv_pipe_open. Such egregious duplication of code rub me the wrong way, but I don’t think there is a good alternative.

A this point I do find it somewhat amazing that there are four levels of abstraction:

I’m sure it is done for a good reason (most likely to provide compatability with Windows, where named pipes are sockets are probably different the kernel interface layer), however it is somewaht amusing that th eabstraction moves from the lowest kernel level, where both objects are just represetned as ints, and any function can be applied, up through two levels of abstaction that make each thing very different types, before ending up at a top-level where they are essentially recombined in to a single type.

In any case, after a lot of plumbing, we reach are able to make sure that when we listen on a socket file-descriptor, we actually get back socket objects that correclty support the remoteAddress property, and all is good with the world.

Catching thread exceptions in Python

Sat, 06 Oct 2012 12:50:18 +0000
tech python

Error handling is a total pain no matter method you choose to use; in Python we are more or less stuck with exceptions. When you have exceptions if you want any chance of debugging program failures, you want to see the stack-trace for any uncaught exceptions. Python usually obliges by spewing out a stack traces on stderr. However, it isn't too hard to get in to a situation where you end up losing those stack traces which ends up leading to a bunch of head scratcing.

When you have a server, you usually run it daemonized. When running as a deamon, it is not uncommon for any output to be redirected to /dev/null. In this case, unless you have arranged otherwise, your stack traces are going to disappear into the ether.

When you have a server style program, you definitely want to be using the Python logging system. This lets you output messages to logfiles (or syslogd). So ideally, you want any stack traces to go here as well.

Now, this is fairly straight forward, you can just make sure your top level function is wrapped in a try/except block. For example:

try:
  main()
except:
  logging.exception("Unhandled exception during main")

Another alternative is setting up a custom excepthook

This works great, unless you happen to be using the threading module. In this case, any exceptions in your run method (or the function you pass as a target) will actually be internally caught by the threading module (see the _bootstrap_inner method).

Unfortunately this code explicitly dumps the strack trace to stderr, which isn’t so useful.

Now, one approach to dealing with this is to have every run method, or target function expicilty catch any exceptions and output them to the log, however it would be nice to avoid duplicating this handling everywhere.

The solution I came up with was a simple sublcass the standard Thread class that catches the exception and places it out on the log.

class LogThread(threading.Thread):
    """LogThread should always e used in preference to threading.Thread.

    The interface provided by LogThread is identical to that of threading.Thread,
    however, if an exception occurs in the thread the error will be logged
    (using logging.exception) rather than printed to stderr.

    This is important in daemon style applications where stderr is redirected
    to /dev/null.

    """
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._real_run = self.run
        self.run = self._wrap_run

    def _wrap_run(self):
        try:
            self._real_run()
        except:
            logging.exception('Exception during LogThread.run')

Then, use the LogThread class where you would previously use the Thread class.

Another alternative approach to this would be to capture any and all stderr output and redirect it to the log. An example of this approach can be found on in electric monk’s blog post "Redirect stdout and stderr to a logger in Python".

Python packages and plugins

Sat, 06 Oct 2012 10:09:11 +0000
tech python

One thing that can be a little confusing with Python is how packages work. Packages let you group your modules together and gives you a nice namespace. You can read all about them in the Python docs.

Now one thing that can be pretty confusing is that importing a package does not mean that any modules inside that package are loaded.

Imagine a very simple package called testing, with a single foo module. E.g:

  testing/
    __init__.py
    foo.py

The foo module might look something like:

def bar():
    return 'bar'

Now, you might expect to be able to write code such as:

import testing
print(testing.foo.bar())

However, trying this won’t work, you end up with an AttributeError:

Traceback (most recent call last):
  File "t.py", line 2, in 
    testing.foo.bar()
AttributeError: 'module' object has no attribute 'foo'

So, to fix this you need to actually import the module. There are at (at least) two ways you can do this:

import testing.foo
from testing import foo

Either of these put testing.foo into sys.modules, and testing.foo.bar() will work fine.

But, what if you want to load all the modules in a package? Well, as far as I know there isn't any built-in approach to doing this, so what I’ve come up with is a pretty simple function that, given a package, will load all the modules in the package, and return them as a dictionary keyed by the module name.

def plugin_load(pkg):
    """Load all the plugin modules in a specific package.

    A dictionary of modules is returned indexed by the module name.

    Note: This assumes packages have a single path, and will only
    find modules with a .py file extension.

    """
    path = pkg.__path__[0]
    pkg_name = pkg.__name__
    module_names = [os.path.splitext(m)[0] for m in
                    os.listdir(path)
                    if os.path.splitext(m)[1] == '.py' and m != '__init__.py']
    imported = __import__(pkg_name, fromlist=module_names)
    return {m: getattr(imported, m) for m in module_names}

There are plenty of caveats to be aware of here. It only works with modules ending in .py, which may miss out on some cases. Also, at this point it doesn’t support packages that span multiple directories (although that would be relatively simple to add. Note: code testing on Python 3.2, probably needs some modification to work on 2.x (in particular I don’t think dictionary comprehensions in 2.x).

If you’ve got a better way for achieving this, please let me know in the comments.

distrusting git

Sat, 01 Oct 2011 07:06:11 +0000
git tech rant

tl;dr git destroyed my data; my team now has severe trust issues with git

We ask a lot from our source control systems. We want them to be flexible, fast, distributed, clever and even easy-to-use. But the number 1 thing we should demand from a source control system is that it doesn’t destroy our data. Probably most importantly, it shouldn’t ever lose stuff that has been committed, but just behind that it really shouldn’t destroy data in our working directory.

When you find out that your source control system has lost your data you end up in a very bad place. Once your source control system destroys your data once, you immediately have a severe break-down of trust between yourself and your tool. You revert to using cp -R to create backups before doing anything with the tool, just in case it destroys your data again.

Development was proceeding along at a cracking pace, and I was getting ready to commit a series of important changes. Before doing so, I want to merge in the recent changes from the remote master, so I do the familiar git pull. It complained about some files that would be overwritten by the merge, so I saved a backup of my changes, then reverted my changes in those specific files, and proceeded. The merge went went fine, and pulled in all the remote changes to the working directory. Getting ready for the commit, I do a git status and start to get a little concerned; one of the files I’ve been heavily editting doesn’t show up in the status list. I cat the file to see what is going on; seems none of my changes are there. Now, I’m getting concerned, maybe I’m going slightly crazy after 3 days straight hacking, but I’m sure I made some changes to this file. I scroll up the terminal backlog to the git status I did before the pull. Sure enough, the file is marked as changed there, but not after the merge. I carefully read the full details from the merge; my file isn’t listed being touched there. Now I am really worried. git has just gone and destroyed the last 5 or 6 hours worht of work. Not happy!

Luckily, I was able to reconstruct most of the work from editor buffers, which I luckily still had open.

But, now I am worried. Why the fuck did git decide to destroy data in my working directory, without even telling me!. Did I do something wrong? Is this a case I should know about? I had to investigate.

So, I took a snapshot of my repository, rolled back to the revision before the merge, mad some minor modifications to my file, the ran the merge again. And, again, git destroys the change in the working directory. Now this isn’t normal behaviour, something is really fucked. The only thing slightly interesting about the file in question is that it had been recently renamed. Somehow this rename had confused the merge, and the merge was silently overwriting files.

Now git has a few different merge strategies, so I tried out some different ones. This was a simple pretty simple merge with 2-heads so the options were really recursive or resolve. git picks recursive be default, so I tried resolve instead. This worked fine. Surprsingly this made me feel a little better, I wasn’t completely crazy, silently updating files in my working directory wasn’t intended behaviour, there had to be something wrong in recursive merge.

So, I updated to the latest version in homebrew. Same problem.

Then it was time to start debugging git for real. So I downloaded the source (using git of course). I started having a look through merge-recursive.c. It didn’t look too bad, but there was clearly going to be a lot to learn if I was going to debug this. Before I started literring the code with prints I thought I better just see if head had the same problem. Lo and behold head worked! OK, cool, they fixed the bug. But that isn’t really a satisfying answer. Just for fun I checked out some random version to try and narrow down when the bug was fixed. In doing so I found that actually it worked in some old vesions, then didn’t work, and then finally worked again in the very latest. Here are my raw notes:

1.7.1               => good
1.7.2               => good
1.7.3               => good
1.7.4               => bad
1.7.5               => bad              
1.7.6.1 (installed) => bad
1.7.6.1 (checkout)  => bad
1.7.6.4             => bad
1.7.7-rc0           => fail
1.7.7-rc1           => pass
1.7.7-rc3           => pass

OK, this is getting more interesting. So somewhere between 1.7.2 and 1.7.3 this bug was introduced. I started using git bisect to narrow things down. I quickly got bored of manually doing git bisect good and git bisect bad, luckily I stumbled upon git bisect run that automates the whole process. After about 20 minutes compiling and testing it found the bad commit.

commit 882fd11aff6f0e8add77e75924678cce875a0eaf
Author: Elijah Newren 
Date:   Mon Sep 20 02:29:03 2010 -0600

    merge-recursive: Delay content merging for renames
    
    Move the handling of content merging for renames from process_renames() to
    process_df_entry().
    
    Signed-off-by: Elijah Newren 
    Signed-off-by: Junio C Hamano 

OK, lots of talk about merge-recursive and renames. That sounds like it makes sense; at least there is a specific bit of code that I can blame for my data destruction, maybe I don’t have to distrust the whole tool.

But to really be confident, I want to think that the fix isn’t just something random, and was actually done to fix this this problem. So I switched the return code in my test script, and ran git bisect again to find when the bug was fixed. Eventually it found this commit:

commit 5b448b8530308b1f5a7a721cb1bf0ba557b5c78d
Author: Elijah Newren 
Date:   Thu Aug 11 23:20:10 2011 -0600

    merge-recursive: When we detect we can skip an update, actually skip it
    
    In 882fd11 (merge-recursive: Delay content merging for renames 2010-09-20),
    there was code that checked for whether we could skip updating a file in
    the working directory, based on whether the merged version matched the
    current working copy.  Due to the desire to handle directory/file conflicts
    that were resolvable, that commit deferred content merging by first
    updating the index with the unmerged entries and then moving the actual
    merging (along with the skip-the-content-update check) to another function
    that ran later in the merge process.  As part moving the content merging
    code, a bug was introduced such that although the message about skipping
    the update would be printed (whenever GIT_MERGE_VERBOSITY was sufficiently
    high), the file would be unconditionally updated in the working copy
    anyway.
    
    When we detect that the file does not need to be updated in the working
    copy, update the index appropriately and then return early before updating
    the working copy.
    
    Note that there was a similar change in b2c8c0a (merge-recursive: When we
    detect we can skip an update, actually skip it 2011-02-28), but it was
    reverted by 6db4105 (Revert "Merge branch 'en/merge-recursive'"
    2011-05-19) since it did not fix both of the relevant types of unnecessary
    update breakages and, worse, it made use of some band-aids that caused
    other problems.  The reason this change works is due to the changes earlier
    in this series to (a) record_df_conflict_files instead of just unlinking
    them early, (b) allowing make_room_for_path() to remove D/F entries,
    (c) the splitting of update_stages_and_entry() to have its functionality
    called at different points, and (d) making the pathnames of the files
    involved in the merge available to merge_content().
    
    Signed-off-by: Elijah Newren 
    Signed-off-by: Junio C Hamano 

OK, this is good. Looks like they fixed the bug, and it even references the bad commit that I had narrowed things down to.

So I’m a little bit dismayed that this bug existed for almost a full year before being fixed. I can’t be the only person to have been hit by this problem can I? I looked at the release notes for v1.7.7. This is what they have to say abou the issue:

 The recursive merge strategy implementation got a fairly large
 fix for many corner cases that may rarely happen in real world
 projects (it has been verified that none of the 16000+ merges in
 the Linux kernel history back to v2.6.12 is affected with the
 corner case bugs this update fixes).

OK, so the bug never trigged in 16,000+ Linux kernel merges. Strangely that doesn’t actually make me feel any better.

So, I don’t think git sucks. All software has bugs, but bugs that destroy data are pretty devastating. It is a little hard to trust git merge operations now. I’ll probably try to make sure I don’t merge on to a working directory (i.e: stash my changes first, since then they are at least backed up on the object database).

Of course convincing my colleagues, who were also affected by this bug, and didn’t really have any love for git in the first place, that git isn’t completely broken is going to be a tough sell.