pexif project moved

Sun, 30 Jun 2013 12:14:38 +0000
tech python pexif

My Python EXIF parsing library is joining the slow drift of projects away from Google Code to GitHub. If you are interested in contributing, please send pull requests and I'll attempt to merge things in a timely manner.

As you can probably tell from the commit log, I haven't really been actively working on this code base for a while, so if anyone out there feels like maintaining it, please just fork on GitHub, let me know, and I'll point people in your direction.

Catching thread exceptions in Python

Sat, 06 Oct 2012 12:50:18 +0000
tech python

Error handling is a total pain no matter which method you choose; in Python we are more or less stuck with exceptions. When you have exceptions, if you want any chance of debugging program failures, you want to see the stack trace for any uncaught exceptions. Python usually obliges by spewing out a stack trace on stderr. However, it isn't too hard to get into a situation where you end up losing those stack traces, which leads to a bunch of head scratching.

When you have a server, you usually run it daemonized. When running as a daemon, it is not uncommon for any output to be redirected to /dev/null. In this case, unless you have arranged otherwise, your stack traces are going to disappear into the ether.

When you have a server-style program, you definitely want to be using the Python logging system. This lets you output messages to logfiles (or syslogd). So ideally, you want any stack traces to go there as well.

Now, this is fairly straightforward: you can just make sure your top-level function is wrapped in a try/except block. For example:

import logging

try:
    main()
except:
    logging.exception("Unhandled exception during main")

Another alternative is setting up a custom excepthook.
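
For example, a minimal sketch of the excepthook approach (the hook function name here is arbitrary):

import logging
import sys

def log_excepthook(exc_type, exc_value, exc_traceback):
    # Route the uncaught exception to the log instead of stderr.
    logging.error("Unhandled exception",
                  exc_info=(exc_type, exc_value, exc_traceback))

sys.excepthook = log_excepthook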

This works great, unless you happen to be using the threading module. In this case, any exceptions in your run method (or the function you pass as a target) will actually be internally caught by the threading module (see the _bootstrap_inner method).

Unfortunately this code explicitly dumps the stack trace to stderr, which isn't so useful.

Now, one approach to dealing with this is to have every run method or target function explicitly catch any exceptions and output them to the log; however, it would be nice to avoid duplicating this handling everywhere.

The solution I came up with was a simple subclass of the standard Thread class that catches the exception and writes it to the log.

import logging
import threading


class LogThread(threading.Thread):
    """LogThread should always be used in preference to threading.Thread.

    The interface provided by LogThread is identical to that of threading.Thread,
    however, if an exception occurs in the thread the error will be logged
    (using logging.exception) rather than printed to stderr.

    This is important in daemon style applications where stderr is redirected
    to /dev/null.

    """
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Swap the real run method out for our logging wrapper.
        self._real_run = self.run
        self.run = self._wrap_run

    def _wrap_run(self):
        try:
            self._real_run()
        except:
            logging.exception('Exception during LogThread.run')

Then, use the LogThread class where you would previously use the Thread class.
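
For example, a quick sketch (the worker function and log file name here are made up for illustration):

import logging

logging.basicConfig(filename='app.log')

def worker():
    raise RuntimeError('something went wrong')

t = LogThread(target=worker)
t.start()
t.join()
# The traceback ends up in app.log rather than being lost on stderr.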

Another alternative approach would be to capture any and all stderr output and redirect it to the log. An example of this approach can be found in electric monk's blog post "Redirect stdout and stderr to a logger in Python".

Python packages and plugins

Sat, 06 Oct 2012 10:09:11 +0000
tech python

One thing that can be a little confusing with Python is how packages work. Packages let you group your modules together and give you a nice namespace. You can read all about them in the Python docs.

Now, one thing that can be pretty confusing is that importing a package does not mean that any modules inside that package are loaded.

Imagine a very simple package called testing, with a single foo module. E.g.:

  testing/
    __init__.py
    foo.py

The foo module might look something like:

def bar():
    return 'bar'

Now, you might expect to be able to write code such as:

import testing
print(testing.foo.bar())

However, trying this won't work; you end up with an AttributeError:

Traceback (most recent call last):
  File "t.py", line 2, in 
    testing.foo.bar()
AttributeError: 'module' object has no attribute 'foo'

So, to fix this you need to actually import the module. There are (at least) two ways you can do this:

import testing.foo
from testing import foo

Either of these puts testing.foo into sys.modules, and testing.foo.bar() will work fine.

But what if you want to load all the modules in a package? Well, as far as I know there isn't any built-in approach to doing this, so what I've come up with is a pretty simple function that, given a package, will load all the modules in the package and return them as a dictionary keyed by the module name.

import os

def plugin_load(pkg):
    """Load all the plugin modules in a specific package.

    A dictionary of modules is returned indexed by the module name.

    Note: This assumes packages have a single path, and will only
    find modules with a .py file extension.

    """
    path = pkg.__path__[0]
    pkg_name = pkg.__name__
    # Candidate module names: any .py file other than __init__.py.
    module_names = [os.path.splitext(m)[0] for m in
                    os.listdir(path)
                    if os.path.splitext(m)[1] == '.py' and m != '__init__.py']
    # Importing with fromlist set loads each submodule and attaches
    # it to the package object.
    imported = __import__(pkg_name, fromlist=module_names)
    return {m: getattr(imported, m) for m in module_names}
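
For example, using the testing package from above:

import testing

plugins = plugin_load(testing)
print(plugins['foo'].bar())   # prints 'bar'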

There are plenty of caveats to be aware of here. It only works with modules ending in .py, which may miss some cases. Also, at this point it doesn't support packages that span multiple directories (although that would be relatively simple to add). Note: this code was tested on Python 3.2, and probably needs some modification to work on 2.x (in particular, I don't think 2.x has dictionary comprehensions).

If you’ve got a better way for achieving this, please let me know in the comments.

Python getCaller

Thu, 06 Jan 2011 12:55:06 +0000
python tech

I’ve been doing a bit of Python programming of late, and thought I’d share a simple trick that I’ve found quite useful. When working with a large code-base it can sometimes be quite difficult to understand the system’s call-flow, which can make life trickier than necessary when refactoring or debugging.

A handy tool for this situation is to print out where a certain function is called from, and Python makes this quite simple to do. Python's inspect module is a very powerful way of determining the current state of the Python interpreter. The stack function provides a mechanism to view the stack.

import inspect

print inspect.stack()

inspect.stack gives you a list of frame records. A frame record is a 6-tuple that, among other things, contains the filename and line number of the caller location. So, in your code you can do something like:

import inspect
_, filename, linenumber, _, _, _ = inspect.stack()[1]
print "Called from: %s:%d" % (filename, linenumber)

The list-index used is 1, which refers to the caller’s frame records. Index 0 returns the current function’s frame record.

Now, while this is just a simple little bit of code, it is nice to package it into something more reusable, so we can create a function:

import inspect

def getCallsite():
    """Return a string representing where the function was called from,
    in the form 'filename:linenumber'.

    """
    _, filename, linenumber, _, _, _ = inspect.stack()[2]
    return "%s:%d" % (filename, linenumber)

The tricky thing here is to realise that it is necessary to use list index 2 rather than 1: inspect.stack() is now being called one level deeper, inside getCallsite itself, so index 0 is getCallsite's own frame, index 1 is the function that called getCallsite, and index 2 is that function's caller, which is the call site we want to report.
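
For example (frobnicate here is just a made-up function to demonstrate):

def frobnicate():
    print "frobnicate called from: %s" % getCallsite()

frobnicate()  # prints the filename and line number of this call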

The ability to inspect the stack provides the opportunity to do some truly awful things (like making the return value dependent on the caller), but that doesn’t mean it can’t be used for good as well.

pexif 0.13 release

Thu, 23 Apr 2009 11:19:06 +0000
tech code pexif python

pexif is a Python library for editing an image's EXIF data. Somewhat embarrassingly, the last release I made (0.12) had a really stupid bug in it. This has now been rectified, and a new version (0.13) is now available.

urllib2 and web applications

Mon, 05 Jan 2009 18:17:52 +0000
python tech article web

For a little personal project I've been working on recently, I needed to create some unit tests for a CGI script, sorry, that should be a cloud computing application. Of course, I turned to my trusty hammer, Python, and was a little disappointed to discover it didn't quite have the batteries I needed included.

I kind of thought that urllib2 would more or less do what I wanted out of the box, but unfortunately it didn't, and (shock, horror!) I actually needed to write some code myself! The first problem I ran into is that urllib2 only supports GET and POST. HTTP is constrained enough in the verbs it provides, so I really do want access to things like PUT, DELETE and HEAD.

The other problem I ran into is that I didn't want things automatically redirecting (although clearly this would be the normal use-case), because I wanted to check that I got a redirect in certain cases.

The final problem I had is that only status code 200 is treated as success; other 2xx codes raise exceptions. This is generally not what you want, since 201 is a perfectly valid return code, indicating that a new resource was created.

So, urllib2 is billed as "An extensible library for opening URLs using a variety of protocols"; surely I can just extend it to do what I want? Well, it turns out that I can, but it was harder than I expected. Not because the final code I needed to write was difficult or involved, but because it was quite difficult to work out what the right code to write was. I want to explore for a little bit why (I think) this might be the case.

urllib2 is quite nice, because simple things (fetch a URL, follow redirects, POST data), are very easy to do:

ret = urllib2.urlopen("http://some_url/")
data = ret.read()

And it is definitely possible to do more complex things, but (at least for me) there is a sharp discontinuity in the API, which means that learning the easy API doesn't help you learn the more complex API. The documentation (at least as far as I read it) also doesn't make it apparent that there are, in effect, two modes of operation.

The completely frustrating thing is that the documentation in the source file is much better than the online documentation, since it talks about some of the things that happen in the module which are otherwise "magic".

For example, the build_opener function is pretty magical, since it does a bunch of introspection and ends up either adding a handler or replacing a handler depending on the class. This is explained in the code as: "if one of the argument is a subclass of the default handler, the argument will be installed instead of the default", which to me makes a lot of sense, whereas the online documentation describes it as: "Instances of the following classes will be in front of the handlers, unless the handlers contain them, instances of them or subclasses of them: ...<list of default handlers>". For me the former is much clearer than the latter!

Anyway, here is the code I ended up with:

import urllib2

class HttpRestRequest(urllib2.Request):
    """A Request subclass that allows the HTTP method to be specified."""
    def __init__(self, url, method=None, **kwargs):
        urllib2.Request.__init__(self, url, **kwargs)
        self._method = method

    def get_method(self):
        # Fall back to urllib2's usual GET/POST guess if no method was given.
        if self._method is not None:
            return self._method
        return urllib2.Request.get_method(self)

# A bare OpenerDirector with just an HTTPHandler: no redirect handling,
# and no error processor, so non-200 responses are returned as-is.
opener = urllib2.OpenerDirector()
opener.add_handler(urllib2.HTTPHandler())

def restopen(url, method=None, headers=None, data=None):
    if headers is None:
        headers = {}
    if method is None:
        if data is None:
            method = "GET"
        else:
            method = "POST"
    return opener.open(HttpRestRequest(url, method=method,
                                       headers=headers, data=data))
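
With that in place, exercising the cases above is straightforward (the URLs here are hypothetical):

ret = restopen("http://localhost:8000/old_path")
print ret.code   # a 302 is returned directly rather than being followed
ret = restopen("http://localhost:8000/items", method="PUT", data="hello")
print ret.code   # a 201 comes back as a normal response, not an exception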

So, conclusions: if the docs don't make sense, read the code. If you are writing an API, try to make it easy to get from the easy case to the more complex case. (This is quite difficult to do, and I have definitely been guilty in the past of falling into the same trap in APIs I have provided!) If you can't get the API to easily transition from easy to hard, make sure you document it well. Finally, Python is a great language for accessing services over HTTP, even if it does require a little bit of work to get the framework in place.

memoize.py: a build tool framework

Fri, 06 Jun 2008 15:50:47 +0000
code tech article python

I've recently started using memoize.py as the core of my build system for a new project I'm working on. The simplicity involved is pretty neat. Rather than manually needing to work out the dependencies (or having specialised tools for determining them), with memoize.py you simply write the commands you need to build your project, and memoize.py works out all the dependencies for you.

So, what's the catch? Well, the way memoize.py works is by using strace to record all the system calls that a program makes during its execution. By analyzing this list memoize.py can work out all the files that are touched when a command is run, and it stores this as a list of dependencies for that command. The next time you run the same command, memoize.py first checks whether any of the dependencies have changed (using either md5sum or timestamp), and only runs the command if they have. So the catch, of course, is that this only runs on Linux (as far as I know, you can't get strace anywhere else, although that doesn't mean the same techniques couldn't be used with a different underlying system call tracing tool).
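
To make that concrete, here is a rough sketch of the technique (this is just an illustration, not memoize.py's actual code, and the gcc command at the end is only an example):

import re
import subprocess

def traced_deps(cmd):
    # Run the command under strace; the system call trace arrives on stderr.
    proc = subprocess.Popen(['strace', '-f', '-e', 'trace=open'] + cmd,
                            stderr=subprocess.PIPE)
    _, trace = proc.communicate()
    # A successful open looks like: open("/some/file", O_RDONLY) = 3
    return sorted(set(re.findall(r'open\("([^"]+)"', trace)))

deps = traced_deps(['gcc', '-c', 'main.c'])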

This technique is radically different from other tools, which determine a large dependency graph of the entire build and then recursively work through this graph to fulfil unmet dependencies. As a result this form is a lot more imperative, rather than declarative, in style. Traditional tools (SCons, make, etc.) provide a language which allows you to essentially describe a dependency graph, and then the order in which things are executed is really hidden inside the tool. Using memoize.py is a lot different: you go through defining the commands you want to run (in order!), and that is basically it.

Some of the advantages of this approach are:

There are, however, some disadvantages:

As with many good tools in your programming kit, memoize.py is available under a very liberal BSD-style license, which is nice, because I've been able to fix up some problems and add some extra functionality. In particular I've added options to:

The patch and full file are available. These have of course been provided upstream, so with any luck some or most of them will be merged.

So, if you have a primarily Linux project, and want to try something different to SCons, or make, I’d recommend considering memoize.py.

pexif 0.11 released

Thu, 27 Mar 2008 13:22:11 +0000
pexif code python tech

I released a new version of pexif today. This release fixes some small bugs and now deals with files containing multiple application markers. This means files that have XMP metadata now work.

Now I just wish I had time to actually use it for its original purpose of attaching geo data to my photos.

PyOS_InputHook: ready for the enterprise

Sat, 02 Sep 2006 09:42:22 +0000
tech python article

Reading Jeff's humorous post got me digging into exactly why Python would be running in a select loop when otherwise idle.

It basically comes down to some code in the module that wraps GNU readline. The code is:

		while (!has_input)
		{	struct timeval timeout = {0, 100000}; /* 0.1 seconds */
			FD_SET(fileno(rl_instream), &selectset);
			/* select resets selectset if no input was available */
			has_input = select(fileno(rl_instream) + 1, &selectset,
					   NULL, NULL, &timeout);
			if(PyOS_InputHook) PyOS_InputHook();
		}

So what this is basically doing is trying to read data, but calling out to PyOS_InputHook() 10 times a second. PyOS_InputHook is a hook that can be used by C extensions (in particular Tk) to process something when Python is otherwise idle.

Now the slightly silly thing is that it will wake up every 100 ms, even if PyOS_InputHook is not actually set. So a slight change:

		while (!has_input)
		{	struct timeval timeout = {0, 100000}; /* 0.1 seconds */
			FD_SET(fileno(rl_instream), &selectset);
			/* select resets selectset if no input was available */
			if (PyOS_InputHook) {
				has_input = select(fileno(rl_instream) + 1, &selectset,
						   NULL, NULL, &timeout);
			} else {
				has_input = select(fileno(rl_instream) + 1, &selectset,
						   NULL, NULL, NULL);
			}
			if(PyOS_InputHook) PyOS_InputHook();
		}

With this change Python is definitely ready for the enterprise!

seek support in pyannodex

Sun, 27 Aug 2006 15:48:22 +0000
annodex python code tech

One of the problems with pyannodex was that you could only iterate through a list of clips once. That is, in something like:

anx = annodex.Reader(sys.argv[1])

for clip in anx.clips:
    print clip.start

for clip in anx.clips:
    print clip.start

Only the first loop would print anything. This is basically because clips returned an iterator, and once the iterator had run through once, it wasn't reset. I had originally (in 0.7.3.1) solved this in a really stupid way, whereby I used class properties to recreate an iterator object each time a clip object was returned. This was obviously silly, and I fixed it properly by resetting the file handle in the __iter__ method of my iterator object.
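
In sketch form, the fix looks something like this (the names and the read_next_clip helper are hypothetical, not the actual pyannodex code):

class ClipIterator(object):
    def __init__(self, filename):
        self.filename = filename
        self.fd = open(self.filename, 'rb')

    def __iter__(self):
        # Rewind the underlying file each time iteration starts, so a
        # second 'for clip in anx.clips' loop sees the clips again.
        self.fd.seek(0)
        return self

    def next(self):
        clip = read_next_clip(self.fd)  # hypothetical parsing helper
        if clip is None:
            raise StopIteration
        return clip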

When reviewing this code I also found a nasty bug. The way my iterator worked relied on each read call causing at most one callback, which wasn't actually what was happening. Luckily this is also fixable by having the callback functions return ANX_STOP_OK, rather than ANX_CONTINUE. Anyway, there is now a new version up which fixes these problems.