pexif is the python library for editing an image’s EXIF data. Somewhat embarrassingly, the last release I made (0.12) had a really stupid bug in it. This has now been rectified, and a new version (0.13) is now available.
For a little personal project I’ve been working on recently, I needed to create some unit tests for a CGI script, sorry, that should be, a cloud computing application. Of course, I turned to my trusty hammer, Python, and was a little disappointed to discover it didn’t quite have the batteries I needed included.
I kind of thought that urllib2 would more or less do what I wanted out of the box, but unfortunately it didn’t and (shock, horror!) I actually needed to write some code myself! The first problem I ran into is that urllib2 only supports GET and POST out of the box. HTTP is constrained enough in the verbs it provides, so I really do want access to things like PUT, DELETE and HEAD.
The other problem I ran into is that I didn’t want things automatically redirecting (although clearly this would be the normal use-case), because I wanted to check I got a redirect in certain cases.
The final problem that I had is that only status code 200 is treated as success, and other 2xx codes raise exceptions. This is generally not what you want, since 201 is a perfectly valid return code, indicating that a new resource was created.
So, urllib2 is billed as “An extensible library for opening URLs using a variety of protocols”, surely I can just extend it to do what I want? Well, it turns out that I can, but it seemed to be harder than I was expecting. Not because the final code I needed to write was difficult or involved, but because it was quite difficult to work out what the right code to write is. I want to explore for a little bit why (I think) this might be the case.
urllib2 is quite nice, because simple things (fetch a URL, follow redirects, POST data), are very easy to do:
ret = urllib2.urlopen("http://some_url/")
data = ret.read()
And, it is definitely possible to do more complex things, but (at least for me), there is a sharp discontinuity in the API which means that learning the easy API doesn’t help you learn the more complex API, and also the documentation (at least as far as I read it), doesn’t make it apparent that there are kind of two modes of operation.
The completely frustrating thing is that the documentation in the source file is much better than the online documentation, since it talks about some of the things that happen in the module, which are otherwise “magic”.
For example, the build_opener function is pretty magical, since it does a bunch of introspection, and ends up either adding a handler or replacing a handler depending on the class. This is explained in the code as: “if one of the argument is a subclass of the default handler, the argument will be installed instead of the default”, which to me makes a lot of sense, whereas the online documentation describes it as: “Instances of the following classes will be in front of the handlers, unless the handlers contain them, instances of them or subclasses of them: ....<list of default handlers>”. For me the former is much clearer than the latter!
Anyway, here is the code I ended up with:
opener = urllib2.OpenerDirector()
opener.add_handler(urllib2.HTTPHandler())
def restopen(url, method=None, headers=None, data=None):
    if headers is None:
        headers = {}
    if method is None:
        if data is None:
            method = "GET"
        else:
            method = "POST"
    return opener.open(HttpRestRequest(url, method=method,
                                       headers=headers, data=data))
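The HttpRestRequest class used by restopen is not shown above. As a rough sketch only, here is one way such a class might look, written against Python 3's urllib.request (the successor to urllib2) so that it runs on a current interpreter. The constructor arguments and the _rest_method attribute are assumptions for illustration, not the original code. Because the opener is a bare OpenerDirector with only an HTTPHandler added, no redirect handler and no error processor are installed, which should be what leaves redirects and non-200 responses untouched.

```python
import urllib.request


class HttpRestRequest(urllib.request.Request):
    """Hypothetical request class that carries an explicit HTTP verb."""

    def __init__(self, url, method=None, headers=None, data=None):
        super().__init__(url, data=data, headers=headers or {})
        self._rest_method = method  # e.g. "PUT", "DELETE", "HEAD"

    def get_method(self):
        # Prefer the explicit verb; otherwise fall back to the stock
        # behaviour (POST when there is a body, GET when there is not).
        if self._rest_method is not None:
            return self._rest_method
        return super().get_method()


# Only HTTPHandler is installed: no redirect following, no error raising.
opener = urllib.request.OpenerDirector()
opener.add_handler(urllib.request.HTTPHandler())

request = HttpRestRequest("http://localhost/resource", method="DELETE")
print(request.get_method())
```

In Python 3 the stock Request already accepts a method argument, so the subclass is only there to mirror the shape of the code above.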
So, conclusions: if the docs don’t make sense, read the code. If you are writing an API, try to make it easier to get from the easy case to the more complex case. (This is quite difficult to do, and I have definitely been guilty in the past of falling into the same trap in APIs I have provided!) If you can’t get the API to easily transition from easy to hard, make sure you document it well. Finally, Python is a great language for accessing services over HTTP, even if it does require a little bit of work to get the framework in place.
I’ve recently started using memoize.py as the core of my build system for a new project I’m working on. The simplicity involved is pretty neat. Rather than manually needing to work out the dependencies (or having specialised tools for determining the dependencies), with memoize.py you simply write the commands you need to build your project, and memoize.py works out all the dependencies for you.
So, what’s the catch? Well, the way memoize.py works
is by using strace to
record all the system calls that a program makes during its
execution. By analyzing this list memoize.py can work out
all the files that are touched when a command is run, and then stores
this as a list of dependencies for that command. Then, the next time
you run the same command memoize.py first checks to see
if any of the dependencies have changed (using either md5sum, or
timestamp), and only runs the command if any of the dependencies have
changed. So the catch of course is that this only runs on Linux (as
far as I know, you can’t get strace anywhere else, although that
doesn’t mean the same techniques couldn’t be used with a different
underlying system call tracing tool).
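The "only rerun when a dependency changed" half of this can be sketched without strace at all. The following is an illustration of the idea rather than memoize.py's actual code: the function names and the shape of the recorded data (a dict mapping path to MD5 digest) are assumptions, and the dependency list is passed in by hand where memoize.py would derive it from the trace.

```python
import hashlib


def md5sum(path):
    """Return the hex MD5 digest of a file's contents."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read()).hexdigest()


def record_deps(paths):
    """Snapshot the current digest of every dependency."""
    return {path: md5sum(path) for path in paths}


def needs_rerun(recorded):
    """True if the command has never run or any dependency changed."""
    if recorded is None:
        return True
    for path, digest in recorded.items():
        try:
            if md5sum(path) != digest:
                return True
        except OSError:
            # A dependency that has disappeared also forces a rerun.
            return True
    return False
```

After a command runs you would store the result of record_deps alongside the command line, and call needs_rerun with that stored value before running it again.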
This technique is radically different from other tools, which
determine a large dependency graph of the entire build, and then
recursively work through this graph to fulfil unmet dependencies. As
a result this approach is a lot more imperative than declarative in
style. Traditional tools (SCons, make, etc.) provide a language which
allows you to essentially describe a dependency graph, and then the
order in which things are executed is really hidden inside the tool.
Using memoize.py is a lot different. You go through
defining the commands you want to run (in order!), and that is
basically it.
Some of the advantages of this approach are:
There are however some disadvantages:
ptrace to perform the system call tracing.
memoize.py a little so that you could simply choose not to run strace. Obviously you can’t determine dependencies in this case, but you can at least build the thing.
As with many good tools in your programming kit, memoize.py is available under a very liberal BSD-style license, which is nice, because I’ve been able to fix up some problems and add some extra functionality. In particular I’ve added options to:
The patch and full file are available. These have of course been provided upstream, so with any luck, some or most of them will be merged upstream.
So, if you have a primarily Linux project, and want to try something
different to SCons, or make, I’d recommend considering memoize.py.
I released a new version of pexif today. This release fixes some small bugs and now deals with files containing multiple application markers. This means files that have XMP metadata now work.
Now I just wish I had time to actually use it for its original purpose of attaching geo data to my photos.
Reading Jeff's humorous post
got me digging into exactly why Python would be running in a select
loop when otherwise idle.
It basically comes down to some code in the module that wraps GNU readline. The code is:
while (!has_input)
{   struct timeval timeout = {0, 100000}; /* 0.1 seconds */
    FD_SET(fileno(rl_instream), &selectset);
    /* select resets selectset if no input was available */
    has_input = select(fileno(rl_instream) + 1, &selectset,
                       NULL, NULL, &timeout);
    if(PyOS_InputHook) PyOS_InputHook();
}
So what this is basically doing is trying to read data, but 10 times a second
calling out to PyOS_InputHook(), which is a hook that can be used by C
extensions (in particular Tk) to process something when Python is otherwise idle.
Now the slightly silly thing is that it will wake up every 100 ms, even if PyOS_InputHook is not actually set. So a slight change:
while (!has_input)
{   struct timeval timeout = {0, 100000}; /* 0.1 seconds */
    FD_SET(fileno(rl_instream), &selectset);
    /* select resets selectset if no input was available */
    if (PyOS_InputHook) {
        has_input = select(fileno(rl_instream) + 1, &selectset,
                           NULL, NULL, &timeout);
    } else {
        has_input = select(fileno(rl_instream) + 1, &selectset,
                           NULL, NULL, NULL);
    }
    if(PyOS_InputHook) PyOS_InputHook();
}
With this change Python is definitely ready for the enterprise!
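The same idea, expressed as a Python sketch rather than as a patch to the C module: poll with a timeout only when there is a hook to service, and otherwise block until input arrives. The function and parameter names here are made up for illustration, and calling select on arbitrary file descriptors like this assumes a Unix-like system.

```python
import select


def wait_for_input(fd, input_hook=None, poll_interval=0.1):
    """Block until fd is readable, servicing input_hook while waiting."""
    while True:
        # With no hook there is nothing to wake up for, so pass a timeout
        # of None and let select sleep until input really arrives.
        timeout = poll_interval if input_hook is not None else None
        readable, _, _ = select.select([fd], [], [], timeout)
        if input_hook is not None:
            input_hook()
        if readable:
            return
```

With no hook installed the process makes a single select call and sleeps, instead of waking ten times a second.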
One of the problems with pyannodex was that you could only iterate through a list of clips once. That is in something like:
anx = annodex.Reader(sys.argv[1])
for clip in anx.clips:
    print clip.start
for clip in anx.clips:
    print clip.start
Only the first loop would print anything. This is basically because
clips returned an iterator, and once the iterator had run through once, it
didn't reset the iterator. I had originally (in 0.7.3.1) solved this in
a really stupid way, whereby I used class properties to recreate an
iterator object each time a clip object was returned. This was obviously
silly and I fixed it properly by resetting the file handle in the __iter__
method of my iterator object.
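To make the fix concrete, here is a small stand-alone sketch of the pattern, not the pyannodex code itself: LineReader is a made-up class that iterates over the lines of a file handle, and its __iter__ rewinds the handle so that every new loop starts from the beginning.

```python
import io


class LineReader:
    """Iterates over a seekable handle; each new loop starts from the top."""

    def __init__(self, handle):
        self._handle = handle

    def __iter__(self):
        # Rewind here, so a second for-loop is not left with a spent iterator.
        self._handle.seek(0)
        return self

    def __next__(self):
        line = self._handle.readline()
        if not line:
            raise StopIteration
        return line.rstrip("\n")


clips = LineReader(io.StringIO("0.0\n2.5\n7.1\n"))
first = list(clips)
second = list(clips)
```

Both first and second end up holding all three entries, whereas without the seek in __iter__ the second pass would come back empty.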
When reviewing this code I also found a nasty bug. The way my iterator worked relied on each read call causing at most one callback, which wasn't actually what was happening. Luckily this is also fixable by having callback functions return ANX_STOP_OK, rather than ANX_CONTINUE. Anyway, there is now a new version up which fixes these problems.
So I recently stumbled across Jester, which is a mutation testing tool. Mutation testing is really a way of testing your tests. The basic idea is:
So in testing your code you should first aim for complete coverage, that is ensure that your test cases exercise each line of code. Once you get here, then mutation testing can help you find other holes in your test cases.
So the interesting thing here is what mutation operations (or mutagens), you should do on the original code. Jester provides only some very simple mutations, (at least according to this paper):
Unfortunately it seems that such an approach is, well, way too simplistic, at least according to Jeff Offutt, who has published papers in the area. Basically, any rudimentary testing finds these changes, and doesn't really find the holes in your test suite.
A paper on Mothra, a mutation system for FORTRAN, describes a large number of different mutation operations. (As a coincidence, my friend has a bug tracking system called Mothra.)
Anyway, so while Jester has a python port, I didn't quite like its approach (especially since I needed to install Java, which I didn't really feel like doing). So I decided to explore how this could be done in Python.
So the first set of mutations I wanted to play with is changing some constants to different values, e.g. change 3 to (1 or 2). This turns out to be reasonably easy to do in Python. I took functions as the unit of test I wanted to play with.
So, Python makes it easy to introspect functions. You can get a list of constants on a function like so: function.func_code.co_consts. Now, you can't actually modify this, but what you can do is make a new copy of the function with a different set of constants. This is convenient because in mutation testing we want to create mutants. So:
def f_new_consts(f, newconsts):
    """Return a copy of function f, but change its constants to newconsts."""
    co = f.func_code
    codeobj = type(co)(co.co_argcount, co.co_nlocals, co.co_stacksize,
                       co.co_flags, co.co_code, tuple(newconsts), co.co_names,
                       co.co_varnames, co.co_filename, co.co_name,
                       co.co_firstlineno, co.co_lnotab, co.co_freevars,
                       co.co_cellvars)
    new_function = type(f)(codeobj, f.func_globals, f.func_name, f.func_defaults,
                           f.func_closure)
    return new_function
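The function above is written for Python 2. On Python 3.8 or later the same trick is shorter, because code objects grew a replace() method; the following is a sketch under that assumption. The helper names (mutate_const, survives) and the example function are invented here, and the large constant is deliberate, since newer interpreters may not keep small integers in co_consts.

```python
import types


def mutate_const(f, old, new):
    """Return a copy of f in which constants equal to old become new."""
    code = f.__code__
    consts = tuple(
        new if (type(c) is type(old) and c == old) else c
        for c in code.co_consts
    )
    mutant_code = code.replace(co_consts=consts)
    return types.FunctionType(mutant_code, f.__globals__, f.__name__,
                              f.__defaults__, f.__closure__)


def survives(mutant, test):
    """A mutant survives if the test still passes when run against it."""
    try:
        test(mutant)
    except AssertionError:
        return False
    return True


def add_fee(amount):
    return amount + 1000


def weak_test(fn):
    assert fn(1) > 0  # passes for almost any constant


def strict_test(fn):
    assert fn(1) == 1001  # pins down the exact constant


mutant = mutate_const(add_fee, 1000, 1001)
```

A mutant that survives a test, as this one does with weak_test, points at a hole in that test; strict_test kills it.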
So that is the basic mechanism I'm using for creating mutant functions with different constants. The other mutants I wanted to make were those where I change comparison operators, e.g. changing < to <= or >. This is a bit trickier: it means getting down and dirty with the Python bytecode, or at least this is my preferred approach. I guess you could also do syntactic changes and recompile.
Anyway, you can get a string representation of a function's bytecode like so: function.func_code.co_code. To do something useful with this you really want to convert it to a list of integers, which is as easy as: [ord(x) for x in function.func_code.co_code].
So far, so good. What you need to be able to do next is iterate through the different bytecodes, which is a little tricky as they can be either 1 or 3 bytes long. A loop like so:
from opcode import opname, HAVE_ARGUMENT

# co is the code object being inspected, e.g. function.func_code
i = 0
opcodes = [ord(x) for x in co.co_code]
while i < len(opcodes):
    opcode = opcodes[i]
    print opname[opcode]
    i += 1
    if opcode >= HAVE_ARGUMENT:
        # opcodes that take an argument are followed by two argument bytes
        i += 2
will iterate through the opcodes, printing each one. Basically all bytecodes greater than or equal to the constant HAVE_ARGUMENT are three bytes long.
From here it is possible to find all the COMPARE_OP
byte codes. The first argument byte specifies the type of compare operations,
these can be decoded using the opcode.cmp_op tuple.
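On Python 3 the byte-level loop is not needed just to locate the comparisons, since the dis module will decode the instruction stream for you. A minimal sketch, assuming Python 3.4 or later for dis.get_instructions; find_compare_ops and in_range are names invented for this example. It only finds the comparisons, and rewriting them is left out.

```python
import dis


def find_compare_ops(f):
    """Return (offset, description) for each COMPARE_OP in function f."""
    return [
        (ins.offset, ins.argrepr)
        for ins in dis.get_instructions(f)
        if ins.opname == "COMPARE_OP"
    ]


def in_range(x, low, high):
    return low <= x and x < high


for offset, description in find_compare_ops(in_range):
    print(offset, description)
```

The exact text of each description varies between interpreter versions, so treat it as a label for a human reader rather than something to match on.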
So, at this point I have some basic proof-of-concept code for creating some very simple mutants. The next step is to try and integrate this with my unit testing and coverage testing framework, and then my code will never have bugs in it again!
When you write scripts that run as cron jobs, and send email to people, and have the potential to send a *lot* of email to people, you really don't want to screw up.
Unfortunately I did screw up when writing one of these. It was a pretty simple Python script, 200 lines or so, that would find any new revisions that had been committed since the last time it ran, and email the commit messages to developers.
The idea was simple: a file kept a list of the last seen revisions, I would go through the archive to find new revisions, mail them out, and then finally write out the file with the latest revisions.
Spot the bug, or at least the design error? When our server ran out of disk space, the stage of writing out the file with the last seen revisions failed, and created an empty file. So the next time the script ran it thought all the revisions were new, resulting in thousands of emails about revisions committed years ago. I pity our poor sysadmin, who not only had to deal with out-of-disk problems but now also with a mail queue with thousands of messages.
The solution to the problem, of course, is to write out the new revision file before sending the email, and to write it to a temporary file first, instead of blasting the last known good file out of existence by writing directly over the top of it.
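Not overwriting the last known good file in place might look something like the sketch below. The function name and the one-revision-per-line file format are assumptions for illustration, not the original script. The new contents go to a temporary file in the same directory, which is renamed over the old file only once it has been written out in full.

```python
import os
import tempfile


def save_revisions(path, revisions):
    """Replace the revision file at path without ever truncating it."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the final rename stays on one
    # filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".revisions-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write("\n".join(revisions) + "\n")
            tmp.flush()
            os.fsync(tmp.fileno())  # surface disk-full errors here
        os.replace(tmp_path, path)  # swap the new file into place
    except BaseException:
        os.unlink(tmp_path)  # leave the old file as it was
        raise
```

If the disk is full, the write or the fsync should fail, the exception propagates, and the old file is still there, so the next run does not mistake every revision for a new one.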
I guess the moral is designing these little scripts actually requires more care than I usually give them.
Today I released the first version of nswgeo. It is a simple Python script that queries the NSW Department of Lands GeoSpatialPortal to find the location of addresses in NSW.
So I was inspired (distracted) by the Python functional programming module, and got to thinking: couldn't things like partial application and function composition be a bit more natural in the syntax?
It turns out that with decorators and a bit of evil introspection I came up with something that works (at least for functions that don't have keyword arguments). So you can do something like:
@Functional
def add(a, b): return a + b
add3 = add.add
assert(add3(1)(2)(3) == add3(1, 2, 3) == add3(1, 2)(3) == add3(1)(2, 3) == 6)
So what is the magic (or evilness)? Well, first we (ab)use decorators to decorate things as Functional, which, as you probably know, is just a shortcut for function = Functional(function).
The evil is in the Functional class:
import inspect

class Functional:
    def __init__(self, fn, nargs = None):
        self.fn = fn
        if not nargs:
            _args, _, _, _ = inspect.getargspec(self.fn)
            self.nargs = len(_args)
        else:
            self.nargs = nargs

    def __call__(self, *args, **kargs):
        if len(args) > self.nargs:
            res = self.fn(*args[:self.nargs], **kargs)
            if res.__class__ == Functional:
                return res(*args[self.nargs:])
            if type(res) != type((0,)):
                res = (res, )
            return res + args[self.nargs:]
        elif len(args) == self.nargs:
            return self.fn(*args, **kargs)
        else:
            def newfunc(*fargs, **fkeyw):
                return self.fn(*(args + fargs))
            newfunc.func = self.fn
            newfunc.args = args
            return Functional(newfunc, self.nargs - len(args))

    def __getattr__(self, name):
        if hasattr(self.fn, name):
            return getattr(self.fn, name)
        func_1 = self.fn
        func_2 = globals()[name]
        def composition(*args, **kwargs):
            res = func_2(*args, **kwargs)
            if type(res) != type((0,)):
                res = (res, )
            return self(*res)
        return Functional(composition, func_2.nargs)
I totally abuse the __getattr__ method so that dot becomes a composition operator. This returns a new function (which is also Functional), which when called will pass on the return value from the first function to the second function. If the first function returns multiple results each of these is passed as arguments to the second function.
The real magic comes in the overloaded __call__, which is where partial functions are hacked in. Basically, if not enough arguments are passed to the function, it returns a new function that accumulates the arguments already passed and, once it gets enough, calls the original function. Of course the function returned from partial application is itself Functional. The real evil is in supporting the case add3(1, 2, 3), which means we detect if too many arguments are passed and then call with only the first nargs arguments; then, if the called function returns another Functional, we apply the remaining arguments to it.
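Stripped of the composition trick, the argument-accumulating part can be sketched on its own. This is a simplified Python 3 illustration, not the Functional class: curry is a name chosen here, it ignores keyword arguments, and it makes no attempt at the too-many-arguments case.

```python
import inspect


def curry(fn, nargs=None):
    """Call fn once nargs positional arguments have been collected."""
    if nargs is None:
        nargs = len(inspect.signature(fn).parameters)

    def curried(*args):
        if len(args) >= nargs:
            return fn(*args)
        # Not enough yet: remember what we have and wait for the rest.
        return curry(lambda *more: fn(*(args + more)), nargs - len(args))

    return curried


def add3(a, b, c):
    return a + b + c


curried_add3 = curry(add3)
```

Here curried_add3(1)(2)(3), curried_add3(1, 2)(3) and curried_add3(1)(2, 3) all end up calling add3(1, 2, 3).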
Oh yeah, I wouldn't use this in any real python code, as it is likely to confuse everyone!