My Python EXIF parsing library is joining the slow drift of projects away from Google Code to GitHub. If you are interested in contributing please send pull requests and I'll attempt to merge things in a timely manner.
As you can probably tell from the commit log, I haven’t really been actively working on this code base for a while, so if anyone out there feels like maintaining it, please just fork on GitHub, let me know, and I’ll point people in your direction.
Error handling is a total pain no matter which method you choose to use; in Python we are more or less stuck with exceptions. When you have exceptions, if you want any chance of debugging program failures, you want to see the stack trace for any uncaught exceptions. Python usually obliges by spewing out a stack trace on stderr.

However, it isn't too hard to get into a situation where you end up losing those stack traces, which leads to a bunch of head scratching.
When you have a server, you usually run it daemonized. When running as a daemon, it is not uncommon for any output to be redirected to /dev/null. In this case, unless you have arranged otherwise, your stack traces are going to disappear into the ether.
When you have a server style program, you definitely want to be using the Python logging system. This lets you output messages to logfiles (or syslogd). So ideally, you want any stack traces to go here as well.
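Getting a basic setup going is straightforward; here is a minimal sketch (the log file path and format are just placeholders, pick whatever suits your application):

import logging

# Example only: choose a log file path and format appropriate for your app.
logging.basicConfig(filename='/var/log/myserver.log',
                    level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

logging.info("server starting")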
Now, this is fairly straightforward: you can just make sure your top level function is wrapped in a try/except block. For example:
try:
    main()
except:
    logging.exception("Unhandled exception during main")
Another alternative is setting up a custom excepthook.
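As a minimal sketch of that approach (the hook name here is just illustrative):

import logging
import sys

def log_excepthook(exc_type, exc_value, exc_traceback):
    # Log the traceback rather than letting Python print it to stderr.
    logging.error("Unhandled exception",
                  exc_info=(exc_type, exc_value, exc_traceback))

sys.excepthook = log_excepthook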
Either approach works great, unless you happen to be using the threading module. In this case, any exceptions in your run method (or the function you pass as a target) will actually be internally caught by the threading module (see the _bootstrap_inner method).
Unfortunately this code explicitly dumps the stack trace to stderr, which isn’t so useful.
Now, one approach to dealing with this is to have every run method or target function explicitly catch any exceptions and output them to the log; however, it would be nice to avoid duplicating this handling everywhere.
The solution I came up with was a simple subclass of the standard Thread class that catches the exception and writes it out to the log.
import logging
import threading

class LogThread(threading.Thread):
    """LogThread should always be used in preference to threading.Thread.

    The interface provided by LogThread is identical to that of
    threading.Thread, however, if an exception occurs in the thread the
    error will be logged (using logging.exception) rather than printed
    to stderr.

    This is important in daemon style applications where stderr is
    redirected to /dev/null.
    """

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._real_run = self.run
        self.run = self._wrap_run

    def _wrap_run(self):
        try:
            self._real_run()
        except:
            logging.exception('Exception during LogThread.run')
Then, use the LogThread class where you would previously use the Thread class.
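For example, a quick sketch of the swap (the worker function here is made up for illustration):

def worker():
    raise RuntimeError("something went wrong")

# Previously: thread = threading.Thread(target=worker)
thread = LogThread(target=worker)
thread.start()
thread.join()
# The RuntimeError ends up in the log via logging.exception,
# rather than being printed to stderr.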
An alternative approach to this would be to capture any and all stderr output and redirect it to the log. An example of this approach can be found in Electric Monk’s blog post "Redirect stdout and stderr to a logger in Python".
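I won't reproduce that code here, but the general idea is a small file-like object that forwards writes to a logger; a rough sketch (not the code from that post) looks something like:

import logging
import sys

class StreamToLogger(object):
    """Minimal file-like object that forwards writes to a logger."""

    def __init__(self, logger, level=logging.ERROR):
        self.logger = logger
        self.level = level

    def write(self, message):
        for line in message.rstrip().splitlines():
            self.logger.log(self.level, line.rstrip())

    def flush(self):
        pass

sys.stderr = StreamToLogger(logging.getLogger('stderr'))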
One thing that can be a little confusing with Python is how packages work. Packages let you group your modules together and give you a nice namespace. You can read all about them in the Python docs.
Now one thing that can be pretty confusing is that importing a package does not mean that any modules inside that package are loaded.
Imagine a very simple package called testing, with a single foo module. E.g.:

testing/
    __init__.py
    foo.py
The foo module might look something like:

def bar():
    return 'bar'
Now, you might expect to be able to write code such as:
import testing
print(testing.foo.bar())
However, trying this won’t work; you end up with an AttributeError:
Traceback (most recent call last):
  File "t.py", line 2, in <module>
    testing.foo.bar()
AttributeError: 'module' object has no attribute 'foo'
So, to fix this you need to actually import the module. There are (at least) two ways you can do this:
import testing.foo
from testing import foo
Either of these puts testing.foo into sys.modules, and testing.foo.bar() will work fine.
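Putting that together, the earlier example now behaves as expected:

import testing.foo

print(testing.foo.bar())   # prints 'bar'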
But, what if you want to load all the modules in a package? Well, as far as I know there isn't any built-in approach to doing this, so what I’ve come up with is a pretty simple function that, given a package, will load all the modules in the package, and return them as a dictionary keyed by the module name.
import os

def plugin_load(pkg):
    """Load all the plugin modules in a specific package.

    A dictionary of modules is returned indexed by the module name.

    Note: This assumes packages have a single path, and will only
    find modules with a .py file extension.
    """
    path = pkg.__path__[0]
    pkg_name = pkg.__name__
    module_names = [os.path.splitext(m)[0] for m in os.listdir(path)
                    if os.path.splitext(m)[1] == '.py' and m != '__init__.py']
    imported = __import__(pkg_name, fromlist=module_names)
    return {m: getattr(imported, m) for m in module_names}
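As a usage sketch, using the testing package from earlier (the variable names are just for illustration):

import testing

plugins = plugin_load(testing)
print(plugins)               # e.g. {'foo': <module 'testing.foo' ...>}
print(plugins['foo'].bar())  # prints 'bar'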
There are plenty of caveats to be aware of here. It only works with modules ending in .py, which may miss some cases. Also, at this point it doesn’t support packages that span multiple directories (although that would be relatively simple to add). Note: this code was tested on Python 3.2, and probably needs some modification to work on 2.x (in particular I don’t think 2.x has dictionary comprehensions).
If you’ve got a better way for achieving this, please let me know in the comments.
I’ve been doing a bit of Python programming of late, and thought I’d share a simple trick that I’ve found quite useful. When working with a large code-base it can sometimes be quite difficult to understand the system’s call-flow, which can make life trickier than necessary when refactoring or debugging.
A handy tool for this situation is to print out where a certain function is called from, and Python makes this quite simple to do. Python’s inspect module is a very powerful way of determining the current state of the Python interpreter. The stack function provides a mechanism to view the stack.
import inspect
print inspect.stack()
inspect.stack gives you a list of frame records. A frame record is a 6-tuple that, among other things, contains the filename and line number of the caller location. So, in your code you can do something like:
import inspect
_, filename, linenumber, _, _, _ = inspect.stack()[1]
print "Called from: %s:%d" % (filename, linenumber)
The list index used is 1, which refers to the caller’s frame record. Index 0 returns the current function’s frame record.
Now, while this is just a simple little bit of code, it is nice to package it into something more reusable, so we can create a function:
def getCallsite():
"""Return a string representing where the function was called from in the form 'filename:linenumber'"""
_, filename, linenumber, _, _, _ = inspect.stack()[2]
return "%s:%d" % (filename, linenumber)
The tricky thing here is to realise that it is necessary to use list index 2 rather than 1, since the call to getCallsite itself adds an extra frame to the stack.
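As a quick usage sketch (the helper function here is hypothetical), assuming getCallsite is defined as above:

def create_widget():
    # Hypothetical helper: report where the creation was requested from.
    print "create_widget called from: %s" % getCallsite()

def main():
    create_widget()   # prints the filename and line number of this call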
The ability to inspect the stack provides the opportunity to do some truly awful things (like making the return value dependent on the caller), but that doesn’t mean it can’t be used for good as well.
pexif is the python library for editing an image’s EXIF data. Somewhat embarrassingly, the last release I made (0.12) had a really stupid bug in it. This has now been rectified, and a new version (0.13) is now available.
For a little personal project I’ve been working on recently, I needed to create some unit tests for a CGI script, sorry, that should be a cloud computing application. Of course, I turned to my trusty hammer, Python, and was a little disappointed to discover it didn’t quite have the batteries I needed included.
I kind of thought that urllib2 would more or less do what I wanted out of the box, but unfortunately it didn’t and (shock, horror!) I actually needed to write some code myself! The first problem I ran into is that urllib2 only supports GET and POST out of the box. HTTP is constrained enough in the verbs it provides, so I really do want access to things like PUT, DELETE and HEAD.
The other problem I ran into is that I didn’t want things automatically redirecting (although clearly this would be the normal use-case), because I wanted to check that I got a redirect in certain cases.
The final problem I had is that only status code 200 is treated as success; other 2xx codes raise exceptions. This is generally not what you want, since 201 is a perfectly valid return code, indicating that a new resource was created.
So, urllib2 is billed as "an extensible library for opening URLs using a variety of protocols", so surely I can just extend it to do what I want? Well, it turns out that I can, but it seemed to be harder than I was expecting. Not because the final code I needed to write was difficult or involved, but because it was quite difficult to work out what the right code to write was. I want to explore for a little bit why (I think) this might be the case.
urllib2 is quite nice, because simple things (fetch a URL, follow redirects, POST data), are very easy to do:
ret = urllib2.urlopen("http://some_url/")
data = ret.read()
And it is definitely possible to do more complex things, but (at least for me) there is a sharp discontinuity in the API, which means that learning the easy API doesn’t help you learn the more complex API; also, the documentation (at least as far as I read it) doesn’t make it apparent that there are, in effect, two modes of operation.
The completely frustrating thing is that the documentation in the source file is much better than the online documentation, since it talks about some of the things that happen in the module which are otherwise “magic”.
For example, the build_opener function is pretty magical, since it does a bunch of introspection and ends up either adding a handler or replacing a handler depending on the class. This is explained in the code as: "if one of the argument is a subclass of the default handler, the argument will be installed instead of the default", which to me makes a lot of sense, whereas the online documentation describes it as: "Instances of the following classes will be in front of the handlers, unless the handlers contain them, instances of them or subclasses of them: ....<list of default handlers>". For me the former is much clearer than the latter!
Anyway, here is the code I ended up with:
opener = urllib2.OpenerDirector()
opener.add_handler(urllib2.HTTPHandler())

def restopen(url, method=None, headers=None, data=None):
    if headers is None:
        headers = {}
    if method is None:
        if data is None:
            method = "GET"
        else:
            method = "POST"
    return opener.open(HttpRestRequest(url, method=method,
                                       headers=headers, data=data))
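The HttpRestRequest class isn’t shown in the snippet above; a minimal sketch of it, assuming it is just a urllib2.Request subclass that forces the HTTP method, might look like:

import urllib2

class HttpRestRequest(urllib2.Request):
    """Request subclass that uses an explicit HTTP method instead of
    inferring GET/POST from the presence of data."""

    def __init__(self, url, method, headers=None, data=None):
        urllib2.Request.__init__(self, url, data=data,
                                 headers=headers or {})
        self._method = method

    def get_method(self):
        # Used by the opener to decide which HTTP verb to send.
        return self._method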
So, conclusions: if the docs don’t make sense, read the code. If you are writing an API, try to make it easier to get from the easy case to the more complex case. (This is quite difficult to do, and I have definitely been guilty in the past of falling into the same trap in APIs I have provided!) If you can’t get the API to easily transition from easy to hard, make sure you document it well. Finally, Python is a great language for accessing services over HTTP, even if it does require a little bit of work to get the framework in place.
I’ve recently started using memoize.py as the core of my build system for a new project I’m working on. The simplicity involved is pretty neat. Rather than manually needing to work out the dependencies (or having specialised tools for determining the dependencies), with memoize.py you simply write the commands you need to build your project, and memoize.py works out all the dependencies for you.
So, what’s the catch? Well, the way memoize.py works is by using strace to record all the system calls that a program makes during its execution. By analyzing this list memoize.py can work out all the files that are touched when a command is run, and then stores this as a list of dependencies for that command. Then, the next time you run the same command, memoize.py first checks to see if any of the dependencies have changed (using either an md5sum or a timestamp), and only runs the command if they have. So the catch of course is that this only runs on Linux (as far as I know, you can’t get strace anywhere else, although that doesn’t mean the same techniques couldn’t be used with a different underlying system call tracing tool).
This technique is radically different from other tools, which determine a large dependency graph of the entire build and then recursively work through this graph to fulfil unmet dependencies. As a result this approach is a lot more imperative than declarative in style. Traditional tools (SCons, make, etc.) provide a language which allows you to essentially describe a dependency graph, and then the order in which things are executed is really hidden inside the tool. Using memoize.py is a lot different: you go through defining the commands you want to run (in order!), and that is basically it.
Some of the advantages of this approach are:

- There is no need to manually work out (or maintain) the dependencies; they are derived from what the commands actually touch.
- There is no special dependency-description language to learn; you just write the commands you want to run, in order.
There are however some disadvantages:

- It relies on strace/ptrace to perform the system call tracing, so as noted above it is effectively Linux-only.
- It would be nice to change memoize.py a little so that you could simply choose not to run strace. Obviously you can’t determine dependencies in this case, but you can at least build the thing.

As with many good tools in your programming kit, memoize.py is available under a very liberal BSD style license, which is nice, because I’ve been able to fix up some problems and add some extra functionality, in particular a few extra options.
The patch and full file are available. These have of course been provided upstream, so with any luck, some or most of them will be merged upstream.
So, if you have a primarily Linux project and want to try something different to SCons or make, I’d recommend considering memoize.py.
I released a new version of pexif today. This release fixes some small bugs and now deals with files containing multiple application markers. This means files that have XMP metadata now work.
Now I just wish I had time to actually use it for its original purpose of attaching geo data to my photos.
Reading Jeff's humorous post
got me digging into exactly why Python would be running in a select
loop when otherwise idle.
It basically comes down to some code in the module that wraps GNU readline. The code is:
while (!has_input)
{
    struct timeval timeout = {0, 100000}; /* 0.1 seconds */
    FD_SET(fileno(rl_instream), &selectset);
    /* select resets selectset if no input was available */
    has_input = select(fileno(rl_instream) + 1, &selectset,
                       NULL, NULL, &timeout);
    if (PyOS_InputHook)
        PyOS_InputHook();
}
So what this is basically doing is trying to read data, but 10 times a second calling out to PyOS_InputHook(), which is a hook that can be used by C extensions (in particular Tk) to process something when Python is otherwise idle.
Now the slightly silly thing is that it will wake up every 100 ms, even if PyOS_InputHook is not actually set. So a slight change:
while (!has_input)
{
    struct timeval timeout = {0, 100000}; /* 0.1 seconds */
    FD_SET(fileno(rl_instream), &selectset);
    /* select resets selectset if no input was available */
    if (PyOS_InputHook) {
        has_input = select(fileno(rl_instream) + 1, &selectset,
                           NULL, NULL, &timeout);
    } else {
        has_input = select(fileno(rl_instream) + 1, &selectset,
                           NULL, NULL, NULL);
    }
    if (PyOS_InputHook)
        PyOS_InputHook();
}
With this change Python is definitely ready for the enterprise!
One of the problems with pyannodex was that you could only iterate through a list of clips once. That is in something like:
anx = annodex.Reader(sys.argv[1])
for clip in anx.clips:
    print clip.start

for clip in anx.clips:
    print clip.start
Only the first loop would print anything. This is basically because clips returned an iterator, and once the iterator had run through once it didn't reset itself. I had originally (in 0.7.3.1) solved this in a really stupid way, whereby I used class properties to recreate an iterator object each time a clip object was returned. This was obviously silly, and I fixed it properly by resetting the file handle in the __iter__ method of my iterator object.
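As an illustration of the idea (a hypothetical sketch, not the actual pyannodex code), an iterable that rewinds its file in __iter__ can be traversed any number of times:

class ClipReader(object):
    """Sketch only: rewind the underlying file each time iteration
    starts, so the object can be iterated more than once."""

    def __init__(self, path):
        self._file = open(path)

    def __iter__(self):
        # Reset the file handle at the start of every iteration.
        self._file.seek(0)
        for line in self._file:
            yield line.strip()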
When reviewing this code I also found a nasty bug. The way my iterator worked relied on each read call causing at most one callback, which wasn't actually what was happening. Luckily this is also fixable by having callback functions return ANX_STOP_OK rather than ANX_CONTINUE. Anyway, there is now a new version up which fixes these problems.