So, another fun fact about C. You'd think that a language would make it possible to express any value for its in-built types as a simple constant expression. Of course, you can't do that in C.
In C you can express both signed and unsigned integer of any of the
integer types. For example 5 is a signed integer; 5L
is a signed long, 5U is an unsigned integer. But, there is no
way to expression a negative number as a constant! Despite what you might easily
assume, -5 is not constant, rather it is the unary -
operator applied to the constant 5. Now, generally this doesn't
cause any problems. Unless of course, you care about types and care about
having programs that are valid according to the standard.
So, assuming two's complement 32-bit integers, if you want an
expression that has the type int and the minimum value
(-2147483648), then you need to be careful. If you choose
the straight-forward approach you'd write this as
-2147483648, and you'd be wrong. Why? Well, the integer
constant 2147483648 has the type long, since
it can not be represented as an integer (the maximum value of a 32-bit
integer is 2147483647). So, the type of an expression of the form
- <long> is, of course long. So, what
can we do instead? Well, there are lots of approaches, but probably
the simplist is: -2147483647 - 1. In this case,
2147483647 can be represented as an integer,
-2147483647, is still of type int, and then
you can safely subtract 1.
Of course, this becomes even more interesting is you are dealing
with the type long long. Assuming a two's complement,
64-bit type, then the minimum value is
-9223372036854775808, but of course, you can't just write
-9223372036854775808LL, since
9223372036854775808LL can't be represented by the type
long long (and assuming the implementation doesn't define
any extended integer types), this would actually be an invalid C
program.
So, yesterday's article on with turned out the be a bit wrong. As lesson in reading the explanatory text and not just the example code. (Or possibly a lesson in providing better example code!).
So, I gave an example along the lines of:
@contextmanager
def chdir(path):
# This code is wrong. Don't use it!
cwd = os.getcwd()
os.chdir(path)
yield
os.chdir(cwd)
Now, this appears to mostly work. The problem occurs when you have some code like:
with chdir('/'):
raise Exception()
In this case the final cleanup code (os.chdir(cwd))
will not be executed, which is almost certainly not what you want.
The correct was to write this is:
@contextmanager
def chdir(path):
cwd = os.getcwd()
os.chdir(path)
try:
yield
finally:
os.chdir(cwd)
Writing it this way the final os.chdir(cwd) will be executed,
and the exception will still be propagated. So I think this kind of sucks, because
I'd really have preferred it to have this behaviour by default. To this end
I've created a simplecontextmanager function decorator. Basically you
can use it like so:
@simplecontextmanager
def chdir(path):
cwd = os.getcwd()
os.chdir(path)
yield
os.chdir(cwd)
Looks pretty much like the first example, however it will have the semantics of
the second example. Now clearly the simplecontextmanager doesn't provide
the whole range of flexibility that contextmanager does, but for simple
code it is more straightforward.
My pyutil git repo has been
updated to have working version of the chdir, umask and
update_env context managers, and also contains the source for the
new simplecontextmanager.
Moral of the story, test your code and read the docs carefully!
Update: This article has some errors, that are correct in a new article.
I'm only like 8 years late to the party on this one, but I've got
to say Python's with statement is pretty
awesome. Although I've used the functionality off-and-on I'd mostly
avoided writing my own context managers as I thought it required
classes with magic __enter__ and __exit__
methods. And besides, my other main language is C, so you know, I'm used
to releasing resources when I need to.
Anyway, I finally decided that I should actually learn how to make
a context manager, and it turns out to be super easy thanks to Python
yield statement. So, some truly simple examples that will
hopefully convince you to have a go at make your own context managers.
First up, a simple chdir context manager:
@contextmanager
def chdir(path):
cwd = os.getcwd()
os.chdir(path)
yield
os.chdir(cwd)
If you are doing a lot of shell-style programming in Python this comes in handy. For example:
with chdir('/path/to/dosomething/in'):
os.system('doit')
A lot simpler than manually having to save and restore directories in your script. Along the same lines a similar approach for umask:
@contextmanager
def umask(new_mask):
cur_mask = os.umask(new_mask)
yield
os.umask(cur_mask)
Again, this avoids having to do lots of saving and restoring yourself. And, the nice thing is that you can easily use these together:
with umask(0o002), chdir('/'):
os.system('/your/command')
Code for these examples is available on my GitHub page.
I've only been programming C for about 15 years, so for the vast
majority of that time I've been using C99. And most of that time has
been spent writing low-level code (operating systems, drivers,
embedded systems). In these kind of systems you care a lot about the
exact sizes of types, since memory and cache are important as so is
mapping variables to memory mapped registers. For this reason having a
standard way of declaring variables of a certain bitsize struck me as
a pretty smart thing to do, so I've almost always used
stdint.h on my projects. And in the rare cases where it
wasn't available used some close equivalent. However, there is an
argument to be made against using them (an extension of the general
argument against typedefs). The core of the argument against is that
it hides information from the programmer that is vital in constructing
correct programs.
Let's start with some simple C code:
unsigned int a; unsigned int b;
The question is, what is the type of the expression a + b.
In this case, things are fairly straight-forward. The results expression
has the type unsigned int.
Slightly trickier:
unsigned short a; unsigned short b;
In this case due to integer promotion the operands are
promoted to int and therefore the resulting expression is
also of an type int. Integer promotion is a little
obscure, but relatively well known. Most of the time this isn't really
a problem because if you assigned the expression back to a variable of
type unsigned short the effect will be the same. Of
course if you assign it to an unsigned int, then you are going to be
in trouble!
So nothing really new there. There is some obscurity, but with local knowledge (i.e: just looking at that bit of code), you can tell with complete accuracy wha The problem occurs when you have something like:
uint16_t a; uint16_t b;
So, now what is the type of the expression a + b? The
simple answer is it depends. The only way to know the
answer to this question is if you know what the underlying type of
uint16_t is. And this is a problem, because the
correctness of your program depends on being able to answer that
question correctly. The unfortunate result of this is that in an
attempt to make code more portable across different platforms (i.e.:
by not relying on the exact sizes of types), you end up in a situation
where the correctness of your program still depends on the underlying
platform, although now in a nasty obscure manner. Win!
Of course, most of the time this doesn't matter, but in the times where it does matter it is certainly a problem. Unfortunately, I don't have a good suggestion on how to avoid this problem, other than careful coding (and I really hate having to rely on careful coding just to essentially get type safety correct). If you've got a good suggestion please let me know in the comments.
Of course, I think I will still continue to use stdint.h
despite these problems, however it is certainly something that C
programmers should be aware of when using these types.
I'm working on a project where I want to ensure that two .pyc files
I have are identical. Essentially I want to verify that a .py byte-compiled
by Python on one platform, matches the same .py byte-compiled on another
platform. This is more challenging than you might think! (As an aside,
pretty much all the challenges apply equally if you are trying to verify
that a .py compiled at some specific time is indentical to a .py compiled at
some later time.
To understand the first problem it is import to understand the
format of a .pyc file. Basically, it is a header,
followed by the module's code object serialised using the marshal
library. In the first instance we are interested in just the header. It
has three fields:
| Magic | 4 bytes |
| Timestamp | 4 bytes |
| Source file size | 4 bytes |
These are basically verified during load to ensure that a stale .pyc file isn't
inadvertently used. Anyway, if you are going to compare .pyc files you need to ensure
that the .py files you use have the same mtime on both the places
you are generating the .pyc files. This is not necessarily a trivial matter, if you
get the source from a .tar file chances are that it will preserve mtime,
but if you got it out from revision control, such as git, chances are it will just have the
mtime from when you retrieved the file. The moral of the story here is you
need to either ignore bytes 4 - 7, manually set the timestamp to a known value, or ensure
that the .py files have the same mtime.
Now, that is just a warm, because, even after doing this, you
aren't necessarily going to get the same exact bytes. Most of the time
you will, but not always. I was trying this on the
socket.py module. The output was almost identical except for
three bytes that were transposed as seen in the diff of the hexdump below:
432,433c432,433 < 00001af0 00 00 00 75 01 00 00 00 72 75 01 00 00 00 77 75 |...u....ru....wu| < 00001b00 01 00 00 00 62 4e 69 ff ff ff ff 28 0c 00 00 00 |....bNi....(....| --- > 00001af0 00 00 00 75 01 00 00 00 77 75 01 00 00 00 62 75 |...u....wu....bu| > 00001b00 01 00 00 00 72 4e 69 ff ff ff ff 28 0c 00 00 00 |....rNi....(....|
This is kind of weird, almost identical. So I didn't really know what the bytes 0x77, 0x72, and 0x62 were at the time, but I'll give you a clue: the are the ASCII codes for the characters w, r, and b.
Next step in the debugging is to actually decompile (disassemble) the
.pyc file. Strangely there doesn't seem to be such a utility
in the standard Python set of tools, however
this recipe gives you something relatively usable. So after running
the .pyc files through that you find the cause of the problem:
480c480
< 16 LOAD_CONST 9 (frozenset({'b', 'r', 'w'}))
---
> 16 LOAD_CONST 9 (frozenset({'w', 'r', 'b'}))
This comes from a line of code in the original Python:
if c not in {"r", "w", "b"}:
So, basically, each .pyc has a different
representation of the set {'r', 'w', 'b'}. Now of course,
these are sets so they are equal and the code works the same way, no
real problem there. However, I'd really like the python compile
process to be deterministic. (You could say I'm determined to make it
so! See what I did there!). Now if you start digging around in to the
Python implementation you find that when marshalling a (frozen)set object,
the library just pulls out the set elements in the order in which they
are stored, so clearly, when the frozenset is created, it appears to
put the elements in a different order at different times.
Next step, digging in to the set implementation. Reading through
setobject.c it becomes clear that the ordering of items
in a set is based on their hash, which makes sense if you think about
it. So, this makes sense, but surely values same value will have the
same hash across different runs? WRONG!.
Try something like:
$ python3 -c "print(hash('r'))"
You will get a different result each time! It turns out there is a good reason for this. The hash function can be used as an attack vector to DoS a webserver or something else. If the attacker can control which items are used as dictionary keys, then they have the potential to choose a set of values all with the same hash value, which can quickly lead to poor performance!
The fix to this is to randomize the hashing algorithm on each Python invocation, making it much harder for an attacker to guess values that have the same hash function.
More details on the Python bug tracker and more list thread.
The good news is that there is a simple way to disable this randomization by
setting the PYTHONHASHSEED environment variable. Once that is
done you get deterministic hash values, and then your sets end up being in the same order,
and your compilation is deterministic. Win!
The backstory to this one is too tedious to describe, and besides, whenever I do that I get irrelevant comments about the backstory, also I'm going to try and make my posts more concise this year both for my sake and yours. See, it is working already!
Anyway, I'm using autotools
to compile binutils, and for some reason
when a run make install, the permissions of directories are different on
different machines. I mean, it doesn't really matter other than the fact that I'm tar-ing
up the result and checking the tarfiles are identical (for various regression tests
and validating that builds are reproducible). And I could probably just fix this after
make install has run by changing the file permissions as I add the file
to the archive (since I already squelch the username and timestamp in a similar manner),
but doing that wouldn't let me have fun appreciating how awesome the autotools are.
At first I assumed this was just a umask problem, and
that is certainly part of it (why the Fedora default
umask is 002 remains a mystery to me). Anyway,
after changing my other machine to have a matching 002 umask
the same problem was happening. (Fun note, if I'd experimented the other way, and changed
my umask on Fedora to something sane, I would have also managed to avoid this
fun.). So anyway, at this point on Fedora make install creates group-writable
directories (as expected with a 002 umask), however, on OS X
make install continues to create non group-writable directories, even
with the modified umask.
The best thing about autotools is that it makes the output from
make really simple to read and understand, so it didn't
take long at all to work out that on Fedora it was directly using
mkdir -p to create the directories, however on OS X it
was using install-sh -c -d to create the
directories. install-sh is a magical shell script that,
as far as I can tell, is install during the auto*
process.
Just as the output from running an autoconf Makefile
is a pleasure to read, the generated Makefiles are themselves a work
of beauty primarily underscored by great clarity and concision, much
like my own prose./
In a snap, I found the offending line:
test -z "$(man1dir)" || $(MKDIR_P) "$(DESTDIR)$(man1dir)"
Sarcasm aside, this is pretty simple to grok; if we have a
non empty $(man1dir) variable, create the directory
using whatever MKDIR_P expands to. Given the
name of the variable, you would have to assume that this is generally
assumed to expand to mkdir -p in the common case, which
is exactly what it does on my Fedora box. In any case, we are now
just left with two interesting questions: Why isn't mkdir -p
used on OS X? and Why does install -c -d have different
semantics to mkdir -p? and Why does configure choose that
if the semantics are different? and Why can't I count?
Well, to answer the first question, the magical configure
script does an explicit version check on mkdir using
the --version flag. If your mkdir doesn't
give back one of the magically blessed strings, you are out of luck.
As you can guess the BSD derived mkdir on OS X doesn't
play nicely here, in fact it doesn't even support the --version
flag at all. The interesting thing is that this check is described as
checking for a thread-safe mkdir -p, but really it is just
checking for a GNU mkdir. This check was added in 1.8.3.
Early versions of OS X mkdir certainly seem to have
the race-condition (see mkdir.c),
however more modern version appear to use mkpath_np underneath, which
appears on my first reading to correctly address the issue (see: mkpath_np.c).
OK, so that is a pretty good answer for that first question; looks
like autotools saves the day again by protecting us from broken
mkdir -p implementations, of which there seem to be many.
(Seriously, how do you fuck that up in the first place? Time-of-check
vs time-of-use is a pretty simple thing to consider during
design. More to the point how does it stay there for what, 20 years,
without getting fixed?) Of course, now that various non-GNU mkdir
implementations (well at least one), seem to have fixed the problem, it would
be nice if configure had some way of knowing and didn't have
to drop to the sub-optimal install-sh -c -d.
OK, so now on to trying to work out what is going on with
install-sh. In theory this should have the same semantics
as mkdir -p. Or at least I think it should if
configure is going to choose it in preference to the
mkdir -p. Now this script does a lot, but if we find the
interesting bit is:
(umask $mkdir_umask &&
eval "\$doit_exec \$mkdirprog $prefixes")
This is explicitly setting the umask to a variable mkdir_umask
before executing mkdir (in a round-about way)
Well, this is pretty much explained in the source:
# Create intermediate dirs using mode 755 as modified by the umask. # This is like FreeBSD 'install' as of 1997-10-28.
and then backed up by this commit message.
Basically, using mode 755 as the base mode is safer than using mkdir's default of 777. On some levels I've got to agree (see my earlier outrage at Fedora's default umask), but shouldn't this kind of script obey whatever I specify as my umask? If I was collaborating on a group project, and had explicitly set my umask to make that easy, then this would be a pretty annoying. Of course, my main problem here is that the behaviour is different to that of mkdir -p.
So, the best work around for this is to just not use a umask of 002 in the first place. To be honest, I'm not really sure if this behaviour, is a bug or a feature, and if it is a bug whether the mkdir -p or the install -d behaviour is correct.
The real moral of this story is that if you are super pedantic about the little things, you get to find out really uninteresting details about software you'd rather not use while wasting time that could be better spent on just about anything else interesting things about the software you use every day, making you a more knowledgeable engineer.
Abstract: This article provides a short explanation of how I fixed a problem I was having with node. In doing so it provides an introduction in to some of the node’s internal architecture.
There are times when running a node webserver where, rather than
opening a network socket yourself, you want to use an existing
socket. For example, I’ve got a simple loader which does some
minimal setup, opening a socket and binding to a port, before
exec()-ing another server (in this case a node
server). In pseudo-code it looks something like:
s = socket(...); bind(s, ...); setuid(unprivileged_uid); exec(...);
The goal behind this is to acquire some privileged resources (such as port number less than 1024) while running as root, and then dropping privileges before executing a larger, less trusted, code base. (I’m sure there are other potential approaches for achieving a similar kind of goal, that isn’t really what this post is about).
Anyway, if you use this approach, when calling a createServer
API, rather than providing an address and port
when listening, you can provide a file descriptor. Effectively what
happens is that the code operates something like:
if (fd is null) {
fd = socket(...)
bind(fd, ...)
}
fd.listen()
This pattern worked really well for me on node around the time of
version 0.4. As time passed and I upgraded from 0.4 through to 0.8
this approach appeared to continue working real
well. Unfortunately, not appearances can be deceiving! It wasn’t until
much later when trying to log an request’s remote address that I
started to run into any problems. For some reason despite everything
mentioned in the docs al attempts to derefence the
remoteAddress property resulted in failure. It’s at this
time that open source (or at least source available) libraries really
shine, as it makes debugging really possible. A few dozen
printfs later (well, OK, a combination of
console.log and printf), and I was able
to rule out any of my code, and any of express.js and any of the
node http.js and https.js libraries.
The problem was that my socket object didn’t have a
getpeername method, which is weird, because you expect
sockets to have a getpeername method. The reason that the
socket object didn’t have a getpeername method, is that
it wasn’t actually a socket. Well, unfortuantely things
aren’t quite that simple. There are a number of layers of abstraction
in node, that make some of this more difficult to understand!
The top level of abstraction there is a javascript object that is
based on node’s Socket prototype. The Socket
prototype provides a number of userful methods; most of the things
that you would expect from a socket: connect, read
write, etc. It also defines the property we are most
interested in: remoteAddress. Now this is a really simple
property:
Socket.prototype.__defineGetter__('remoteAddress', function() {
return this._getpeername().address;
});
As you can see, this property defers all the real work to another
method _getpeername:
Socket.prototype._getpeername = function() {
if (!this._handle || !this._handle.getpeername) {
return {};
}
if (!this._peername) {
this._peername = this._handle.getpeername();
// getpeername() returns null on error
if (this._peername === null) {
return {};
}
}
return this._peername;
};
Now, ready the code for _getpeername it should be
clear that the Socket object is primarily a wrapper for
an underlying _handle. After exporing the code in
net.js some more it becomes clear that
_handle could be one of two different things: a
Pipe() or a TCP() object. These objects are
implemented in C++ rather than Javascript. Now, reading the code for
these objects (in pipe_wrap.cc and
tcp_wrap.cc) it becomes clear that these two objects
have very similar implementation, however the TCP object provides a
few more features, critically it provides the getpeername
method, whereas the Pipe object does not. This is the key to the underlying
problem: my Socket objects have a Pipe object as the handle, rather than an
a TCP object as the handle. The next question is why!
The answer lies in the createServerHandle function residing
in net.js. This is the function that eventuallly is called when
the HTTP listen method is called. This has some code like this:
if (typeof fd === 'number' && fd >= 0) {
var tty_wrap = process.binding('tty_wrap');
var type = tty_wrap.guessHandleType(fd);
switch (type) {
case 'PIPE':
debug('listen pipe fd=' + fd);
// create a PipeWrap
handle = createPipe();
default:
// Not a fd we can listen on. This will trigger an error.
debug('listen invalid fd=' + fd + ' type=' + type);
handle = null;
break;
....
The upshot of this code is that if an fd is specified, node
tries to guess what kind of thing the file descriptors refers to and creates
th handle based on this guess. The interesting thing with the switch statement
is that it only has a single valid case: PIPE.
This is interesting for a couple of reasons; it clearly isn’t hitting the default
error case, which means that node somehow guesses that my file descriptor is a
PIPE, which is strange, since it is definitely a socket. The kind of amazing thing
is that apart from the inability to retrieve the remote address, everything in
the system works perfectly. We’ll come back to this point later, but for now
the challenge is to try and find out why the the guessHandleType
method would think my socket is a pipe!
Again we get to follow the layers of abstraction
game. guessHandleType is implemented in the
tty_wrap function:
Handle<Value> TTYWrap::GuessHandleType(const Arguments& args) {
HandleScope scope;
int fd = args[0]->Int32Value();
assert(fd >= 0);
uv_handle_type t = uv_guess_handle(fd);
switch (t) {
case UV_TTY:
return scope.Close(String::New("TTY"));
case UV_NAMED_PIPE:
return scope.Close(String::New("PIPE"));
case UV_FILE:
return scope.Close(String::New("FILE"));
default:
assert(0);
return v8::Undefined();
}
}
So, this wrapper function is really just turning results from an
underlying uv_guess_handle function in to strings. Not
much interesting, although it is somewhat interesting to know that at
one layer of abstraction things are called NAMED_PIPE
but simply PIPE at another. Looking at
uv_guess_handle gets more interesting:
uv_handle_type uv_guess_handle(uv_file file) {
struct stat s;
if (file < 0) {
return UV_UNKNOWN_HANDLE;
}
if (isatty(file)) {
return UV_TTY;
}
if (fstat(file, &s)) {
return UV_UNKNOWN_HANDLE;
}
if (!S_ISSOCK(s.st_mode) && !S_ISFIFO(s.st_mode)) {
return UV_FILE;
}
return UV_NAMED_PIPE;
}
Huh, so, from this code it becomes clear that if a file-descriptor
is a FIFO or a SOCK the uv library wants to treat the file-descriptor
as a UV_NAMED_PIPE. This would seem to be the source of
the problem. Clearly named pipes and sockets are different things, and
it is strange to me that this function would force a socket to be
treated as a named pipe (especially when the upper layer treat the
two things differently). As an aside, this is the place where comments
in code are essential. At this point I have to guess why this is written
in this non-obvious manner. Some potential things come to mind: compatability
with the Windows version, expediency, a restriction on the API
to only return one of { TTY, UNKNOWN, FILE or NAMED_PIPE }. The other
odd things is that a macro S_ISREG is also provided that could
test explicitly for something that is UV_FILE, but that isn’t
used so things such as directories, character devices and block devices are
also returned as just FILE. Without comments it is difficult to
tell if this was intentional or accidental.
So the first fix on the way to solving the overall problem is to
update uv_guess_handle so that it can actually return an
appropriate value when we have a socket and not a pipe. There are two
possible defines that this could be: UV_TCP or
UV_UDP. There is likely a some way to distinguish
between the two, however for my purposes I know this is always going
to be TCP, so I’m going to return UV_TCP. Of course, I’m
sure this will cause someone else a similar problem down the track, so
if you know what the appropriate interface is, let me know!
Another approach that could be used is to leave this function alone entirely and provide some way of specifying the to listen and createServer whether the file-descriptor is a named-pipe or a socket, however deadling with all the parameter marshalling through the stack made this somewhat unattractive.
Once this is done, net.js can then be fixed up
to create a TCP() object, rather than a Pipe()
object in createServerHandle.
Aside: I find it amazing that a 43-line patch requires aroudn 1400 words of English to explain the rationale and back-story!
Now we’ve fixed enough of the underlying libraries that we can think of fixing net.js. What we’d like to do is something like this:
var type = tty_wrap.guessHandleType(fd);
switch (type) {
case 'PIPE':
debug('listen pipe fd=' + fd);
// create a PipeWrap
handle = createPipe();
break;
case 'TCP':
debug('listen socket fd=' + fd);
handle = createTCP();
break;
default:
// Not a fd we can listen on. This will trigger an error.
debug('listen invalid fd=' + fd + ' type=' + type);
global.errno = 'EINVAL'; // hack, callers expect that errno is set
handle = null;
break;
}
if (handle) {
handle.open(fd);
handle.readable = true;
handle.writable = true;
}
return handle;
This looks pretty good, but unfortunately it doesn’t work. That’s because
although the Pipe() object supports an open method,
the TCP() object (which has a very similar set of APIs) does not
support the open method! How very frustrating. Now the implementation of pipe’s
open method is relatively straight forward:
Handle<Value> PipeWrap::Open(const Arguments& args) {
HandleScope scope;
UNWRAP(PipeWrap)
int fd = args[0]->IntegerValue();
uv_pipe_open(&wrap->handle_, fd);
return scope.Close(v8::Null());
}
That looks pretty easy to implement on the TCP object, however
we hit a problem when we get to the uv_pipe_open line.
This function takes something of uv_pipe_t type, where
as the handle in a TCP object has uv_tcp_t type.
Dropping down another level of abstraction again, we need to add
a new uv_tcp_open function in that takes a uv_tcp_t
parameter. It turns out that this is pretty straight foward:
int uv_tcp_open(uv_tcp_t* tcp, uv_file fd) {
return uv__stream_open((uv_stream_t*)tcp,
fd,
UV_STREAM_READABLE | UV_STREAM_WRITABLE);
}
The only frustrating thing is that this is almost an exact
duplicate of the code in uv_pipe_open. Such egregious
duplication of code rub me the wrong way, but I don’t think there is a
good alternative.
A this point I do find it somewhat amazing that there are four levels of abstraction:
_getpeername).uv_tcp_t
and uv_pipe_t object. These have similar shaped APIs however the specific
functions names are distinct.I’m sure it is done for a good reason (most likely to provide compatability with Windows, where named pipes are sockets are probably different the kernel interface layer), however it is somewaht amusing that th eabstraction moves from the lowest kernel level, where both objects are just represetned as ints, and any function can be applied, up through two levels of abstaction that make each thing very different types, before ending up at a top-level where they are essentially recombined in to a single type.
In any case, after a lot of plumbing, we reach are able to make sure that when we listen on a socket file-descriptor, we actually get back socket objects that correclty support the remoteAddress property, and all is good with the world.
Error handling is a total pain no matter method you choose to use;
in Python we are more or less stuck with exceptions. When you have
exceptions if you want any chance of debugging program failures, you
want to see the stack-trace for any uncaught exceptions. Python
usually obliges by spewing out a stack traces on stderr.
However, it isn't too hard to get in to a situation where you end
up losing those stack traces which ends up leading to a bunch of
head scratcing.
When you have a server, you usually run it daemonized.
When running as a deamon, it is not uncommon for any output to be
redirected to /dev/null. In this case, unless you have
arranged otherwise, your stack traces are going to disappear into
the ether.
When you have a server style program, you definitely want to be using the Python logging system. This lets you output messages to logfiles (or syslogd). So ideally, you want any stack traces to go here as well.
Now, this is fairly straight forward, you can just make
sure your top level function is wrapped in a try/except
block. For example:
try:
main()
except:
logging.exception("Unhandled exception during main")
Another alternative is setting up a custom excepthook
This works great, unless you happen to be using the threading
module. In this case, any exceptions in your run method
(or the function you pass as a target) will actually be
internally caught by the threading module (see the _bootstrap_inner
method).
Unfortunately this code explicitly dumps the strack trace to stderr, which isn’t so useful.
Now, one approach to dealing with this is to have every run method, or target
function expicilty catch any exceptions and output them to the log, however it would be nice
to avoid duplicating this handling everywhere.
The solution I came up with was a simple sublcass the standard Thread class
that catches the exception and places it out on the log.
class LogThread(threading.Thread):
"""LogThread should always e used in preference to threading.Thread.
The interface provided by LogThread is identical to that of threading.Thread,
however, if an exception occurs in the thread the error will be logged
(using logging.exception) rather than printed to stderr.
This is important in daemon style applications where stderr is redirected
to /dev/null.
"""
def __init__(self, **kwargs):
super().__init__(**kwargs)
self._real_run = self.run
self.run = self._wrap_run
def _wrap_run(self):
try:
self._real_run()
except:
logging.exception('Exception during LogThread.run')
Then, use the LogThread class where you would previously use the Thread class.
Another alternative approach to this would be to capture any and all stderr
output and redirect it to the log. An example of this approach can be found
on in electric monk’s blog post "Redirect stdout and stderr to a logger in Python".
One thing that can be a little confusing with Python is how packages work. Packages let you group your modules together and gives you a nice namespace. You can read all about them in the Python docs.
Now one thing that can be pretty confusing is that importing a package does not mean that any modules inside that package are loaded.
Imagine a very simple package called testing, with a single
foo module. E.g:
testing/
__init__.py
foo.py
The foo module might look something like:
def bar():
return 'bar'
Now, you might expect to be able to write code such as:
import testing print(testing.foo.bar())
However, trying this won’t work, you end up with an AttributeError:
Traceback (most recent call last): File "t.py", line 2, intesting.foo.bar() AttributeError: 'module' object has no attribute 'foo'
So, to fix this you need to actually import the module. There are at (at least) two ways you can do this:
import testing.foo from testing import foo
Either of these put testing.foo into
sys.modules, and testing.foo.bar() will work
fine.
But, what if you want to load all the modules in a package? Well, as far as I know there isn't any built-in approach to doing this, so what I’ve come up with is a pretty simple function that, given a package, will load all the modules in the package, and return them as a dictionary keyed by the module name.
def plugin_load(pkg):
"""Load all the plugin modules in a specific package.
A dictionary of modules is returned indexed by the module name.
Note: This assumes packages have a single path, and will only
find modules with a .py file extension.
"""
path = pkg.__path__[0]
pkg_name = pkg.__name__
module_names = [os.path.splitext(m)[0] for m in
os.listdir(path)
if os.path.splitext(m)[1] == '.py' and m != '__init__.py']
imported = __import__(pkg_name, fromlist=module_names)
return {m: getattr(imported, m) for m in module_names}
There are plenty of caveats to be aware of here. It only works with
modules ending in .py, which may miss out on some
cases. Also, at this point it doesn’t support packages that span
multiple directories (although that would be relatively simple to
add. Note: code testing on Python 3.2, probably needs
some modification to work on 2.x (in particular I don’t think dictionary
comprehensions in 2.x).
If you’ve got a better way for achieving this, please let me know in the comments.
tl;dr git destroyed my data; my team now has severe trust issues with git
We ask a lot from our source control systems. We want them to be flexible, fast, distributed, clever and even easy-to-use. But the number 1 thing we should demand from a source control system is that it doesn’t destroy our data. Probably most importantly, it shouldn’t ever lose stuff that has been committed, but just behind that it really shouldn’t destroy data in our working directory.
When you find out that your source control system has lost your
data you end up in a very bad place. Once your source control system
destroys your data once, you immediately have a severe break-down of
trust between yourself and your tool. You revert to using cp -R
to create backups before doing anything with the tool, just in case it
destroys your data again.
Development was proceeding along at a cracking pace, and I was
getting ready to commit a series of important changes. Before doing
so, I want to merge in the recent changes from the remote master, so I
do the familiar git pull. It complained about some files
that would be overwritten by the merge, so I saved a backup of my changes,
then reverted my changes in those specific files, and proceeded.
The merge went went fine, and pulled in all the
remote changes to the working directory. Getting ready for the commit,
I do a git status and start to get a little concerned;
one of the files I’ve been heavily editting doesn’t show up in the
status list. I cat the file to see what is going on;
seems none of my changes are there. Now, I’m getting concerned, maybe
I’m going slightly crazy after 3 days straight hacking, but I’m sure I
made some changes to this file. I scroll up the terminal backlog to
the git status I did before the pull. Sure enough, the
file is marked as changed there, but not after the merge. I carefully
read the full details from the merge; my file isn’t listed being
touched there. Now I am really worried. git has just gone and destroyed
the last 5 or 6 hours worht of work. Not happy!
Luckily, I was able to reconstruct most of the work from editor buffers, which I luckily still had open.
But, now I am worried. Why the fuck did git decide to destroy data in my working directory, without even telling me!. Did I do something wrong? Is this a case I should know about? I had to investigate.
So, I took a snapshot of my repository, rolled back to the revision before the merge, mad some minor modifications to my file, the ran the merge again. And, again, git destroys the change in the working directory. Now this isn’t normal behaviour, something is really fucked. The only thing slightly interesting about the file in question is that it had been recently renamed. Somehow this rename had confused the merge, and the merge was silently overwriting files.
Now git has a few different merge strategies, so I tried out some different ones. This was a simple pretty simple merge with 2-heads so the options were really recursive or resolve. git picks recursive be default, so I tried resolve instead. This worked fine. Surprsingly this made me feel a little better, I wasn’t completely crazy, silently updating files in my working directory wasn’t intended behaviour, there had to be something wrong in recursive merge.
So, I updated to the latest version in homebrew. Same problem.
Then it was time to start debugging git for real. So I downloaded
the source (using git of course). I started having a look through
merge-recursive.c. It didn’t look too bad, but there was
clearly going to be a lot to learn if I was going to debug this. Before
I started literring the code with prints I thought I better just see if head had the
same problem. Lo and behold head worked! OK, cool, they fixed the bug. But
that isn’t really a satisfying answer. Just for fun I checked out some
random version to try and narrow down when the bug was fixed. In doing
so I found that actually it worked in some old vesions, then didn’t work,
and then finally worked again in the very latest. Here are my raw notes:
1.7.1 => good 1.7.2 => good 1.7.3 => good 1.7.4 => bad 1.7.5 => bad 1.7.6.1 (installed) => bad 1.7.6.1 (checkout) => bad 1.7.6.4 => bad 1.7.7-rc0 => fail 1.7.7-rc1 => pass 1.7.7-rc3 => pass
OK, this is getting more interesting. So somewhere between
1.7.2 and 1.7.3 this bug was introduced. I started using git bisect
to narrow things down. I quickly got bored of manually doing git bisect good
and git bisect bad, luckily I stumbled upon git bisect run that
automates the whole process. After about 20 minutes compiling and testing it found the
bad commit.
commit 882fd11aff6f0e8add77e75924678cce875a0eaf Author: Elijah NewrenDate: Mon Sep 20 02:29:03 2010 -0600 merge-recursive: Delay content merging for renames Move the handling of content merging for renames from process_renames() to process_df_entry(). Signed-off-by: Elijah Newren Signed-off-by: Junio C Hamano
OK, lots of talk about merge-recursive and renames. That sounds like it makes sense; at least there is a specific bit of code that I can blame for my data destruction, maybe I don’t have to distrust the whole tool.
But to really be confident, I want to think that the fix isn’t
just something random, and was actually done to fix this this problem. So I
switched the return code in my test script, and ran git bisect
again to find when the bug was fixed. Eventually it found this commit:
commit 5b448b8530308b1f5a7a721cb1bf0ba557b5c78d Author: Elijah NewrenDate: Thu Aug 11 23:20:10 2011 -0600 merge-recursive: When we detect we can skip an update, actually skip it In 882fd11 (merge-recursive: Delay content merging for renames 2010-09-20), there was code that checked for whether we could skip updating a file in the working directory, based on whether the merged version matched the current working copy. Due to the desire to handle directory/file conflicts that were resolvable, that commit deferred content merging by first updating the index with the unmerged entries and then moving the actual merging (along with the skip-the-content-update check) to another function that ran later in the merge process. As part moving the content merging code, a bug was introduced such that although the message about skipping the update would be printed (whenever GIT_MERGE_VERBOSITY was sufficiently high), the file would be unconditionally updated in the working copy anyway. When we detect that the file does not need to be updated in the working copy, update the index appropriately and then return early before updating the working copy. Note that there was a similar change in b2c8c0a (merge-recursive: When we detect we can skip an update, actually skip it 2011-02-28), but it was reverted by 6db4105 (Revert "Merge branch 'en/merge-recursive'" 2011-05-19) since it did not fix both of the relevant types of unnecessary update breakages and, worse, it made use of some band-aids that caused other problems. The reason this change works is due to the changes earlier in this series to (a) record_df_conflict_files instead of just unlinking them early, (b) allowing make_room_for_path() to remove D/F entries, (c) the splitting of update_stages_and_entry() to have its functionality called at different points, and (d) making the pathnames of the files involved in the merge available to merge_content(). Signed-off-by: Elijah Newren Signed-off-by: Junio C Hamano
OK, this is good. Looks like they fixed the bug, and it even references the bad commit that I had narrowed things down to.
So I’m a little bit dismayed that this bug existed for almost a full year before being fixed. I can’t be the only person to have been hit by this problem can I? I looked at the release notes for v1.7.7. This is what they have to say abou the issue:
The recursive merge strategy implementation got a fairly large fix for many corner cases that may rarely happen in real world projects (it has been verified that none of the 16000+ merges in the Linux kernel history back to v2.6.12 is affected with the corner case bugs this update fixes).
OK, so the bug never trigged in 16,000+ Linux kernel merges. Strangely that doesn’t actually make me feel any better.
So, I don’t think git sucks. All software has bugs, but bugs that destroy data are pretty devastating. It is a little hard to trust git merge operations now. I’ll probably try to make sure I don’t merge on to a working directory (i.e: stash my changes first, since then they are at least backed up on the object database).
Of course convincing my colleagues, who were also affected by this bug, and didn’t really have any love for git in the first place, that git isn’t completely broken is going to be a tough sell.