pexif 0.13 release

Thu, 23 Apr 2009 11:19:06 +0000
tech code pexif python

pexif is the Python library for editing an image’s EXIF data. Somewhat embarrassingly, the last release I made (0.12) had a really stupid bug in it. This has been rectified, and a new version (0.13) is now available.

urllib2 and web applications

Mon, 05 Jan 2009 18:17:52 +0000
python tech article web

For a little personal project I’ve been working on recently, I needed to create some unit tests for a CGI script, sorry, that should be, a cloud computing application. Of course, I turned to my trusty hammer, Python, and was a little disappointed to discover it didn’t quite have the batteries I needed included.

I kind of thought that urllib2 would more or less do what I wanted out of the box, but unfortunately it didn’t and (shock, horror!) I actually needed to write some code myself! The first problem I ran into is that urllib2 only supports GET and POST out of the box. HTTP is constrained enough in the verbs it provides, so I really do want access to things like PUT, DELETE and HEAD.

The other problem I ran into is that I didn’t want things automatically redirecting (although clearly this would be the normal use-case), because I wanted to check I got a redirect in certain cases.

The final problem that I had is that only status code 200 is treated as success, and other 2xx codes raise exceptions. This is generally not what you want, since 201 is a perfectly valid return code, indicating that a new resource was created.

So, urllib2 is billed as “An extensible library for opening URLs using a variety of protocols”, surely I can just extend it to do what I want? Well, it turns out that I can, but it seemed to be harder than I was expecting. Not because the final code I needed to write was difficult or involved, but because it was quite difficult to work out what the right code to write is. I want to explore for a little bit why (I think) this might be the case.

urllib2 is quite nice, because simple things (fetch a URL, follow redirects, POST data), are very easy to do:

ret = urllib2.urlopen("http://some_url/")
data = ret.read()

And, it is definitely possible to do more complex things, but (at least for me), there is a sharp discontinuity in the API which means that learning the easy API doesn’t help you learn the more complex API, and also the documentation (at least as far as I read it), doesn’t make it apparent that there are kind of two modes of operation.

The completely frustrating thing is that the documentation in the source file is much better than the online documentation, since it talks about some of the things that happen in the module, which are otherwise “magic”.

For example, the build_opener function is pretty magical, since it does a bunch of introspection, and ends up either adding a handler or replacing a handler depending on the class. This is explained in the code as: “if one of the argument is a subclass of the default handler, the argument will be installed instead of the default”, which to me makes a lot of sense, whereas the online documentation describes it as: “Instances of the following classes will be in front of the handlers, unless the handlers contain them, instances of them or subclasses of them: ....<list of default handlers>”. For me the former is much clearer than the latter!

Anyway, here is the code I ended up with:

import urllib2

opener = urllib2.OpenerDirector()
opener.add_handler(urllib2.HTTPHandler())

def restopen(url, method=None, headers=None, data=None):
    if headers is None:
        headers = {}
    if method is None:
        if data is None:
            method = "GET"
        else:
            method = "POST"
    return opener.open(HttpRestRequest(url, method=method, 
                                       headers=headers, data=data))
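The snippet above relies on an HttpRestRequest class that is not shown. What follows is a minimal sketch of what such a class might look like, written against Python 3's urllib.request (the successor to urllib2) so that it runs on a current interpreter. The class body is an assumption, not the original code: it simply overrides get_method so that any verb can be sent. Because the opener is built from a bare OpenerDirector with only an HTTPHandler, there is no redirect handler and no error processor installed, so redirects and non-200 responses should come back as ordinary response objects.

```python
import urllib.request


class HttpRestRequest(urllib.request.Request):
    """Hypothetical Request subclass that sends an explicitly chosen verb."""

    def __init__(self, url, method="GET", headers=None, data=None):
        super().__init__(url, data=data, headers=headers or {})
        self._verb = method

    def get_method(self):
        # urllib asks the request which verb to use; the stock class
        # picks GET or POST depending on whether data is present.
        return self._verb


# Only the plain HTTP handler is installed: nothing follows redirects
# and nothing turns non-200 responses into exceptions.
opener = urllib.request.OpenerDirector()
opener.add_handler(urllib.request.HTTPHandler())
```

On Python 3 the stock Request class also accepts a method keyword argument, so the subclass may not be needed there at all; it is kept here only to mirror the approach described in the post.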

So, conclusions: if the docs don’t make sense, read the code. If you are writing an API, try to make it easier to get from the easy case to the more complex case. (This is quite difficult to do, and I have definitely been guilty in the past of falling into the same trap in APIs I have provided!) If you can’t get the API to easily transition from easy to hard, make sure you document it well. Finally, Python is a great language for accessing services over HTTP, even if it does require a little bit of work to get the framework in place.

memoize.py: a build tool framework

Fri, 06 Jun 2008 15:50:47 +0000
code tech article python

I’ve recently started using memoize.py as the core of my build system for a new project I’m working on. The simplicity involved is pretty neat. Rather than manually needing to work out the dependencies (or having specialised tools for determining the dependencies), with memoize.py you simply write the commands you need to build your project, and memoize.py works out all the dependencies for you.

So, what’s the catch? Well, the way memoize.py works is by using strace to record all the system calls that a program makes during its execution. By analyzing this list memoize.py can work out all the files that are touched when a command is run, and then stores this as a list of dependencies for that command. Then, the next time you run the same command, memoize.py first checks to see if any of the dependencies have changed (using either md5sum or timestamp), and only runs the command if any of them have. So the catch of course is that this only runs on Linux (as far as I know, you can’t get strace anywhere else, although that doesn’t mean the same techniques couldn’t be used with a different underlying system call tracing tool).
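The dependency check described above can be sketched roughly as follows. This is not memoize.py's actual code: the function names and the in-memory deps_db dictionary are made up for illustration, and the strace step is left out entirely, so the list of touched files is assumed to be supplied by the caller.

```python
import hashlib


def md5sum(path):
    """Return the hex MD5 digest of a file, or None if it cannot be read."""
    try:
        with open(path, "rb") as f:
            return hashlib.md5(f.read()).hexdigest()
    except OSError:
        return None


def record(cmd, touched_files, deps_db):
    """Remember the digest of every file a command touched (hypothetical helper)."""
    deps_db[cmd] = {path: md5sum(path) for path in touched_files}


def is_up_to_date(cmd, deps_db):
    """True if cmd has run before and none of its recorded files changed."""
    if cmd not in deps_db:
        return False
    return all(md5sum(path) == digest
               for path, digest in deps_db[cmd].items())
```

A build script would then run a command only when is_up_to_date returns False, and call record afterwards with whatever files the trace reported.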

This technique is quite a radical departure from other tools, which determine a large dependency graph of the entire build and then recursively work through this graph to fulfil unmet dependencies. As a result this form is a lot more imperative than declarative in style. Traditional tools (SCons, make, etc.) provide a language which allows you to essentially describe a dependency graph, and then the order in which things are executed is really hidden inside the tool. Using memoize.py is a lot different. You go through defining the commands you want to run (in order!), and that is basically it.

Some of the advantages of this approach are:

There are however some disadvantages:

As with many good tools in your programming kit, memoize.py is available under a very liberal BSD style license, which is nice, because I’ve been able to fix up some problems and add some extra functionality. In particular I’ve added options to:

The patch and full file are available. These have of course been provided upstream, so with any luck, some or most of them will be merged upstream.

So, if you have a primarily Linux project, and want to try something different to SCons, or make, I’d recommend considering memoize.py.

pexif 0.11 released

Thu, 27 Mar 2008 13:22:11 +0000
pexif code python tech

I released a new version of pexif today. This release fixes some small bugs and now deals with files containing multiple application markers. This means files that have XMP metadata now work.

Now I just wish I had time to actually use it for its original purpose of attaching geo data to my photos.

PyOS_InputHook: ready for the enterprise

Sat, 02 Sep 2006 09:42:22 +0000
tech python article

Reading Jeff's humorous post got me digging into exactly why Python would be running in a select loop when otherwise idle.

It basically comes down to some code in the module that wraps GNU readline. The code is:

		while (!has_input)
		{	struct timeval timeout = {0, 100000}; /* 0.1 seconds */
			FD_SET(fileno(rl_instream), &selectset);
			/* select resets selectset if no input was available */
			has_input = select(fileno(rl_instream) + 1, &selectset,
					   NULL, NULL, &timeout);
			if(PyOS_InputHook) PyOS_InputHook();
		}

So what this is basically doing is trying to read data, but 10 times a second calling out to PyOS_InputHook(), which is a hook that can be used by C extensions (in particular Tk) to process something when Python is otherwise idle.

Now the slightly silly thing is that it will wake up every 100 ms, even if PyOS_InputHook is not actually set. So a slight change:

		while (!has_input)
		{	struct timeval timeout = {0, 100000}; /* 0.1 seconds */
			FD_SET(fileno(rl_instream), &selectset);
			/* select resets selectset if no input was available */
			if (PyOS_InputHook) {
				has_input = select(fileno(rl_instream) + 1, &selectset,
						   NULL, NULL, &timeout);
			} else {
				has_input = select(fileno(rl_instream) + 1, &selectset,
						   NULL, NULL, NULL);
			}
			if(PyOS_InputHook) PyOS_InputHook();
		}

With this change Python is definitely ready for the enterprise!
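The same idea can be sketched in Python with the select module: use a timeout only when there is actually a hook to call, and block indefinitely otherwise. The function name wait_for_input and its parameters are invented for this illustration and are not part of CPython's readline module.

```python
import select


def wait_for_input(fd, input_hook=None, poll_interval=0.1):
    """Block until fd is readable, calling input_hook periodically if set."""
    # With no hook there is nothing to do while idle, so a timeout of
    # None lets select sleep until input really arrives.
    timeout = poll_interval if input_hook is not None else None
    while True:
        readable, _, _ = select.select([fd], [], [], timeout)
        if input_hook is not None:
            input_hook()
        if readable:
            return
```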

seek support in pyannodex

Sun, 27 Aug 2006 15:48:22 +0000
annodex python code tech

One of the problems with pyannodex was that you could only iterate through a list of clips once. That is in something like:

import sys
import annodex

anx = annodex.Reader(sys.argv[1])

for clip in anx.clips:
    print clip.start

for clip in anx.clips:
    print clip.start

Only the first loop would print anything. This is basically because clips returned an iterator, and once the iterator had run through once, nothing reset it. I had originally (in 0.7.3.1) solved this in a really stupid way, whereby I used class properties to recreate an iterator object each time a clip object was returned. This was obviously silly and I fixed it properly by resetting the file handle in the __iter__ method of my iterator object.
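The fix described above, rewinding the underlying handle in __iter__, looks roughly like this in general. The Lines class is a stand-in written for Python 3 over an ordinary seekable file; it is not pyannodex code.

```python
class Lines:
    """Iterate over the lines of a seekable file, restartable on each loop."""

    def __init__(self, handle):
        self.handle = handle

    def __iter__(self):
        # Rewind so that every new for-loop starts from the beginning.
        self.handle.seek(0)
        return self

    def __next__(self):
        line = self.handle.readline()
        if not line:
            raise StopIteration
        return line.rstrip("\n")
```

Because the rewind happens in __iter__, two consecutive loops over the same object both see every line.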

When reviewing this code I also found a nasty bug. The way my iterator worked relied on each read call causing at most one callback, which wasn't actually what was happening. Luckily this is also fixable by having callback functions return ANX_STOP_OK, rather than ANX_CONTINUE. Anyway, there is now a new version up which fixes these problems.

Mutation testing, reinventing the wheel, and python bytecode

Sun, 18 Jun 2006 19:18:54 +0000
tech article python

So I recently stumbled across Jester, which is a mutation testing tool. Mutation testing is really a way of testing your tests. The basic idea is:

  1. Ensure your tests work on the code under test.
  2. Generate a set of mutant pieces of code, which are copies of the original code with some small changes, or mutations.
  3. For each mutant, run your test cases. If none of these tests fail then you are missing a test case.

So in testing your code you should first aim for complete coverage, that is ensure that your test cases exercise each line of code. Once you get here, then mutation testing can help you find other holes in your test cases.

So the interesting thing here is what mutation operations (or mutagens), you should do on the original code. Jester provides only some very simple mutations, (at least according to this paper):

Unfortunately it seems that such an approach is, well, way too simplistic, at least according to Jeff Offutt, who has published papers in the area. Basically, any rudimentary testing finds these changes, and doesn't really find the holes in your test suite.

A paper on Mothra, a mutation system for FORTRAN, describes a large number of different mutation operations. (As a coincidence, my friend has a bug tracking system called Mothra.)

Anyway, so while Jester has a python port, I didn't quite like its approach (especially since I needed to install Java, which I didn't really feel like doing). So I decided to explore how this could be done in Python.

So the first set of mutations I wanted to play with is changing some constants to different values, e.g. change 3 to (1 or 2). This turns out to be reasonably easy to do in Python. I took functions as the unit of test I wanted to play with, and Python makes it easy to introspect functions. You can get a list of constants on a function like so: function.func_code.co_consts. Now, you can't actually modify this, but what you can do is make a new copy of the function with a different set of constants. This is convenient because in mutation testing we want to create mutants. So:

def f_new_consts(f, newconsts):
    """Return a copy of function f, but change its constants to newconsts."""
    co = f.func_code
    codeobj = type(co)(co.co_argcount, co.co_nlocals, co.co_stacksize,
                       co.co_flags, co.co_code, tuple(newconsts), co.co_names,
                       co.co_varnames, co.co_filename, co.co_name,
                       co.co_firstlineno, co.co_lnotab, co.co_freevars,
                       co.co_cellvars)
    new_function = type(f)(codeobj, f.func_globals, f.func_name, f.func_defaults,
                           f.func_closure)
    return new_function
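On Python 3 the attribute names differ (func_code became __code__, func_globals became __globals__, and so on), and from Python 3.8 code objects have a replace method that avoids spelling out every constructor argument. A rough Python 3 equivalent, under those assumptions, might look like the following; mutate_constant and price_with_fee are names invented for the example. It uses a largish constant on purpose, since some interpreter versions may encode small integers directly in the bytecode rather than in co_consts.

```python
import types


def mutate_constant(f, old, new):
    """Return a copy of f in which constants equal to old are replaced by new."""
    code = f.__code__
    new_consts = tuple(
        # Compare types too, so that True is not mistaken for 1.
        new if type(c) is type(old) and c == old else c
        for c in code.co_consts
    )
    mutant_code = code.replace(co_consts=new_consts)  # needs Python 3.8+
    return types.FunctionType(mutant_code, f.__globals__, f.__name__,
                              f.__defaults__, f.__closure__)


def price_with_fee(price):
    return price + 1000


mutant = mutate_constant(price_with_fee, 1000, 1001)
```

A test suite that still passes when run against mutant instead of price_with_fee is not checking the fee.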

So that is the basic mechanism I'm using for creating mutant functions with different constants. The other mutants I wanted to make were those where I change comparison operators, e.g. changing < to <= or >. This is a bit trickier, as it means getting down and dirty with the Python bytecode, or at least this is my preferred approach. I guess you could also do syntactic changes and recompile. Anyway, you can get a string representation of a function's bytecode like so: function.func_code.co_code. To do something useful with this you really want to convert it to a list of integers, which is as easy as: [ord(x) for x in function.func_code.co_code].

So far, so good. What you need to be able to do next is iterate through the different bytecodes, which is a little tricky as they can be either 1 or 3 bytes long. A loop like so:

    from opcode import opname, HAVE_ARGUMENT
    i = 0
    opcodes = [ord(x) for x in co.co_code]
    while i < len(opcodes):
        opcode = opcodes[i]
        print opname[opcode]

        i += 1
        if opcode >= HAVE_ARGUMENT:
            # Opcodes that take an argument are followed by two argument bytes.
            i += 2

will iterate through the opcodes, printing each one. Basically all bytecodes greater than or equal to the constant HAVE_ARGUMENT are three bytes long.

From here it is possible to find all the COMPARE_OP byte codes. The first argument byte specifies the type of compare operations, these can be decoded using the opcode.cmp_op tuple.
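For comparison, here is a sketch of how the COMPARE_OP instructions might be located on Python 3. Instruction widths and argument encodings have changed since this was written, so rather than walking the bytes by hand the sketch leans on dis.get_instructions from the standard library, which decodes them. The helper name comparison_operators is made up for the example.

```python
import dis


def comparison_operators(f):
    """Return the comparison operators used by f, in bytecode order."""
    return [
        instr.argval
        for instr in dis.get_instructions(f)
        if instr.opname == "COMPARE_OP"
    ]


def is_smaller(a, b):
    return a < b
```

Each entry found this way is a candidate site for a mutant that swaps one comparison for another.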

So, at this point I have some basic proof-of-concept code for creating some very simple mutants. The next step is to try and integrate this with my unit testing and coverage testing framework, and then my code will never have bugs in it again!

On robust scripts

Thu, 08 Jun 2006 11:01:54 +0000
tech code python

When you write scripts that run as cron jobs, and send email to people, and have the potential to send a *lot* of email to people, you really don't want to screw up.

Unfortunately I did screw up when writing one of these. It was a pretty simple Python script, 200 lines or so, that would find any new revisions that had been committed since the last time it ran, and email the commit messages to developers.

The idea was simple: a file kept a list of the last seen revisions, and I would go through the archive, find new revisions, mail them out, and then finally write out the file with the latest revisions.

Spot the bug, or at least the design error? When our server ran out of disk space, the stage of writing out the file with the last seen revisions failed, and created an empty file. So next time the script ran it thought all the revisions were new, resulting in thousands of emails about revisions committed years ago. I pity our poor sysadmin, who not only had to deal with out-of-disk problems but now also with a mail queue with thousands of messages.

The solution to the problem is of course to write out the new revision file before sending the email, and to write it to a temporary file instead of blasting the last known good one out of existence by writing directly over the top of it.
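One common way to do the temporary-file part is sketched below. The final rename over the old file is my assumption about how the temporary file would be put in place; the function name write_atomically is invented for the example, and the code is Python 3.

```python
import os
import tempfile


def write_atomically(path, text):
    """Write text to path via a temporary file in the same directory."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Only once the new contents are safely on disk does the old
        # file get replaced; a failed write leaves it untouched.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

If the disk fills up during the write, the exception propagates, the temporary file is removed, and the last known good file is still there for the next run.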

I guess the moral is designing these little scripts actually requires more care than I usually give them.

nswgeo.py

Sun, 12 Mar 2006 09:50:24 +0000
tech maps python code

Today I release the first version of nswgeo. It is a simple python script that queries the NSW Department of Lands GeoSpatialPortal to find the location of addresses in NSW.

Functional code in python (or yet another abuse of python)

Thu, 09 Feb 2006 10:54:18 +0000
tech python article

So I was inspired (distracted) by the python functional programming module, and got to thinking couldn't things like partial application and function composition be a bit more natural in the syntax.

So it turns out that with decorators and a bit of evil introspection I came up with something that works (at least for functions that don't have keyword arguments). So you can do something like:


@Functional
def add(a, b): return a + b

add3 = add.add

assert(add3(1)(2)(3) == add3(1, 2, 3) == add3(1, 2)(3) == add3(1)(2, 3) == 6)

So what is the magic (or evilness)? Well, first we (ab)use decorators to decorate things as Functional, which, as you probably know, is just a shortcut for function = Functional(function).

The evil is in the Functional class:


import inspect

class Functional:
    def __init__(self, fn, nargs = None):
        self.fn = fn
        if not nargs:
            _args, _, _, _ = inspect.getargspec(self.fn)
            self.nargs = len(_args)
        else:
            self.nargs = nargs

    def __call__(self, *args, **kargs):
        if len(args) > self.nargs:
            res = self.fn(*args[:self.nargs], **kargs)
            if res.__class__ == Functional:
                return res(*args[self.nargs:])
            if type(res) != type((0,)):
                res = (res, )
            return res + args[self.nargs:]

        elif len(args) == self.nargs:
            return self.fn(*args, **kargs)
        else:
            def newfunc(*fargs, **fkeyw):
                return self.fn(*(args + fargs))
            newfunc.func = self.fn
            newfunc.args = args
            return Functional(newfunc, self.nargs - len(args))

    def __getattr__(self, name):
        if hasattr(self.fn, name):
            return getattr(self.fn, name)
        func_1 = self.fn
        func_2 = globals()[name]
        def composition(*args, **kwargs):
            res = func_2(*args, **kwargs)
            if type(res) != type((0,)):
                res = (res, )
            return self(*res)
        return Functional(composition, func_2.nargs)

I totally abuse the __getattr__ method so that dot becomes a composition operator. This returns a new function (which is also Functional), which when called will pass on the return value from the first function to the second function. If the first function returns multiple results each of these is passed as arguments to the second function.

The real magic comes in the overloaded __call__, which is where partial functions are hacked in. Basically, if not enough arguments are passed to the function, it returns a new function that accumulates the arguments already passed and, once it gets enough, calls the original function. Of course the function returned from partial application is itself Functional. The real evil is in supporting the case add3(1, 2, 3), which means we detect if too many arguments are passed and then call with only the first arguments; then, if the called function returns another Functional, we apply the remaining arguments to it.

Oh yeah, I wouldn't use this in any real python code, as it is likely to confuse everyone!