A better namedtuple for Python (?)

Sun, 30 Nov 2014 16:56:23 +0000

Python's namedtuple factory function is, perhaps, the most underutilised feature in Python. Classes created using namedtuple are great because they lead to immutable objects (as compared to normal classes, which lead to mutable objects). The goal of this blog post isn’t to convince you that immutable objects are a good thing, but to take a bit of a deep dive into the namedtuple construct and explore some of its shortcomings and some of the ways in which it can be improved.

NOTE: All the code here is available in this gist.

As a quick primer, let’s have a look at a very simple example:

from collections import namedtuple

Foo = namedtuple('Foo', 'bar baz')
foo = Foo(5, 10)
print(foo)

NOTE: All my examples target Python 3.3. They will not necessarily work in earlier versions; in particular, at least some of the later ones are not going to work in Python 2.7.

For simple things like this example, the existing namedtuple works pretty well; however, the repetition of Foo is a bit of a code smell. Also, even though Foo is a class, it is created significantly differently to a normal class (which I guess could be considered an advantage).
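
As a quick illustration of what the factory-generated class gives you for free, here is a minimal sketch using only the standard collections.namedtuple API and the Foo class from the primer. The names foo2 and mutated are just illustrative.

```python
from collections import namedtuple

Foo = namedtuple('Foo', 'bar baz')
foo = Foo(5, 10)

# Fields are reachable by name or by position.
print(foo.bar, foo[1])

# Instances are immutable: assigning to a field raises AttributeError.
try:
    foo.bar = 99
    mutated = True
except AttributeError:
    mutated = False
print(mutated)

# _replace returns a modified copy and leaves the original untouched.
foo2 = foo._replace(bar=99)
print(foo, foo2)
```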

Let’s try and do something a little more complex. If we find ourselves needing to know the product of bar and baz a bunch of times, we probably want to encapsulate that rather than writing foo.bar * foo.baz everywhere. (OK, this example isn’t the greatest in the history of technical writing, but just bear with me on this.) So, as a first example we can just write a function:

def foo_bar_times_baz(foo):
    return foo.bar * foo.baz

print(foo_bar_times_baz(foo))

There isn’t anything wrong with this; functions are great! But the normal pattern in Python is that when we have a function that operates on a class instance we make it a method, rather than a function. We could debate the aesthetic and practical tradeoffs between methods and functions, but let’s just go with the norm here and assume we’d prefer to do foo.bar_times_baz(). The canonical approach to this is:

class Foo(namedtuple('Foo', 'bar baz')):
    def bar_times_baz(self):
        return self.bar * self.baz


foo = Foo(5, 10)
print(foo.bar_times_baz())

Now this works but, again, we have the repetition of Foo, which bugs me. Coupled with this, we are calling a function inside the class definition, which is a bit ugly. There are also some minor practical concerns because there are now multiple Foo classes in existence, which can at times cause confusion. This can be avoided by appending an underscore to the name of the super-class. E.g.: class Foo(namedtuple('Foo_', 'bar baz')).
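
The "multiple Foo classes" point is easy to see by inspecting the method resolution order. The snippet below is a small sketch that only uses the class from above; mro_names and base are illustrative names.

```python
from collections import namedtuple


class Foo(namedtuple('Foo', 'bar baz')):
    def bar_times_baz(self):
        return self.bar * self.baz


# The MRO contains two distinct classes that are both called 'Foo':
# our subclass, and the class generated by the namedtuple factory.
mro_names = [cls.__name__ for cls in Foo.__mro__]
print(mro_names)

base = Foo.__mro__[1]
print(base is Foo)
```

With the underscore trick, the second entry would show up as 'Foo_' instead, which makes debugging output less ambiguous.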

An alternative to this which I’ve used from time to time is to directly update the Foo class:

Foo = namedtuple('Foo', 'bar baz')
Foo.bar_times_baz = lambda self: self.bar * self.baz
foo = Foo(5, 10)
print(foo.bar_times_baz())

I quite like this, and think it works well for methods which can be written as an expression, but not all Python programmers are particularly fond of lambdas, and although I know that classes are mutable, I prefer to consider them as immutable, so modifying them after creation is also a bit ugly.
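
For methods that do not fit into a single expression, the same patching trick works with an ordinary function. This is a minor variation on the snippet above, not something taken from the gist. A side benefit is that the function keeps a meaningful name, where a lambda would show up as <lambda>.

```python
from collections import namedtuple

Foo = namedtuple('Foo', 'bar baz')


def bar_times_baz(self):
    """Return the product of the two fields."""
    return self.bar * self.baz


# Attach the plain function to the class after it has been created.
Foo.bar_times_baz = bar_times_baz

foo = Foo(5, 10)
print(foo.bar_times_baz())
print(Foo.bar_times_baz.__name__)
```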

The way that the namedtuple function works is by creating a string that matches the class definition and then using exec on that string to create the new class. You can see the string version of that class definition by passing verbose=True to the namedtuple function (this works on the Python 3.3 targeted here; the verbose parameter was removed in Python 3.7). For our Foo class it looks a bit like:

class Foo(tuple):
    'Foo(bar, baz)'

    __slots__ = ()

    _fields = ('bar', 'baz')

    def __new__(_cls, bar, baz):
        'Create new instance of Foo(bar, baz)'
        return _tuple.__new__(_cls, (bar, baz))

    ....

    def __repr__(self):
        'Return a nicely formatted representation string'
        return self.__class__.__name__ + '(bar=%r, baz=%r)' % self

    .....

    bar = _property(_itemgetter(0), doc='Alias for field number 0')

    baz = _property(_itemgetter(1), doc='Alias for field number 1')

I’ve omitted some of the method implementations for brevity, but you get the general idea. If you have a look at that you might have the same thought I did: why isn’t this just a sub-class? It seems practical to construct:

class NamedTuple(tuple):
    __slots__ = ()

    _fields = None  # Subclass must provide this

    def __new__(_cls, *args):
        # Error checking, and **kwargs handling omitted for brevity
        return tuple.__new__(_cls, tuple(args))

    def __repr__(self):
        'Return a nicely formatted representation string'
        fmt = '(' + ', '.join('%s=%%r' % x for x in self._fields) + ')'
        return self.__class__.__name__ + fmt % self

    def __getattr__(self, field):
        try:
            idx = self._fields.index(field)
        except ValueError:
            raise AttributeError("'{}' NamedTuple has no attribute '{}'".format(self.__class__.__name__, field))

        return self[idx]

Again, some method implementations are omitted for brevity (see the gist for gory details). The main difference to the generated class is making use of _fields consistently rather than hard-coding some things, and the necessity of a __getattr__ method, rather than hardcoded properties.

This can be used something like this:

class Foo(NamedTuple):
    _fields = ('bar', 'baz')

This fits the normal pattern for creating classes much better than a constructor function. It is two lines rather than the slightly more concise one-liner, but we don’t need to duplicate the Foo name so that is a win. In addition, when we want to start adding methods, we know exactly where they go.
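
To make that concrete, here is a self-contained sketch that trims the base class down to the three methods shown above (the gist version has more) and adds a method to the sub-class. The shortened AttributeError message is a simplification for this example.

```python
class NamedTuple(tuple):
    __slots__ = ()
    _fields = None  # Subclass must provide this

    def __new__(_cls, *args):
        return tuple.__new__(_cls, tuple(args))

    def __repr__(self):
        fmt = '(' + ', '.join('%s=%%r' % x for x in self._fields) + ')'
        return self.__class__.__name__ + fmt % self

    def __getattr__(self, field):
        # Only called when normal attribute lookup fails.
        try:
            idx = self._fields.index(field)
        except ValueError:
            raise AttributeError(field)
        return self[idx]


class Foo(NamedTuple):
    _fields = ('bar', 'baz')

    # Methods simply go in the class body, as with any other class.
    def bar_times_baz(self):
        return self.bar * self.baz


foo = Foo(5, 10)
print(foo)
print(foo.bar_times_baz())
```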

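This isn’t without some drawbacks. The biggest problem is that we’ve just inadvertently made our subclass mutable! A sub-class that doesn’t declare __slots__ gets a per-instance __dict__, so arbitrary attributes can be assigned to its instances, which is certainly not ideal. This can be rectified by adding a __slots__ = () in our sub-class: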

class Foo(NamedTuple):
    __slots__ = ()
    _fields = ('bar', 'baz')
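
The difference is easy to demonstrate. The sketch below uses a minimal stand-in for the base class (only __new__), and the class names Leaky and Sealed are hypothetical, chosen just for this comparison.

```python
class NamedTuple(tuple):
    # Minimal stand-in for the base class above; only what this demo needs.
    __slots__ = ()

    def __new__(_cls, *args):
        return tuple.__new__(_cls, args)


class Leaky(NamedTuple):
    # No __slots__ here, so every instance gets its own __dict__.
    _fields = ('bar', 'baz')


class Sealed(NamedTuple):
    __slots__ = ()
    _fields = ('bar', 'baz')


leaky = Leaky(5, 10)
leaky.anything = 'oops'  # silently succeeds
print(hasattr(leaky, '__dict__'))

sealed = Sealed(5, 10)
try:
    sealed.anything = 'oops'
    assigned = True
except AttributeError:
    assigned = False
print(assigned)
```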

At this point this approach no longer looks so good. Another minor drawback is that there is no checking that the field names are really valid in any way, which the namedtuple approach does correctly. The final drawback is on the performance front. We can measure the attribute access time:

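from timeit import timeit

# Assumes `foo` (an instance of the namedtuple factory class) and
# `foo_sub_class` (an instance of a NamedTuple sub-class) are defined
# at module level in __main__.
RUNS = 1000000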

direct_idx_time = timeit('foo[0]', setup='from __main__ import foo', number=RUNS)
direct_attr_time = timeit('foo.bar', setup='from __main__ import foo', number=RUNS)
sub_class_idx_time = timeit('foo[0]', setup='from __main__ import foo_sub_class as foo', number=RUNS)
sub_class_attr_time = timeit('foo.bar', setup='from __main__ import foo_sub_class as foo', number=RUNS)


print(direct_idx_time)
print(direct_attr_time)
print(sub_class_idx_time)
print(sub_class_attr_time)

The results are that accessing a value via direct indexing takes the same time for both approaches; however, when accessing via attributes the sub-class approach is 10x slower than the original namedtuple approach (which was itself about 3x slower than direct indexing).
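
If you want to reproduce the comparison without the gist, the following is a self-contained sketch. It assumes Python 3.5 or later (for the globals argument to timeit), uses a cut-down stand-in class called GetattrTuple, and runs fewer iterations; the absolute numbers and ratios will vary by machine and Python version.

```python
from collections import namedtuple
from timeit import timeit

RUNS = 100000


class GetattrTuple(tuple):
    # Hypothetical cut-down version of the __getattr__-based sub-class.
    __slots__ = ()
    _fields = ('bar', 'baz')

    def __new__(cls, *args):
        return tuple.__new__(cls, args)

    def __getattr__(self, field):
        try:
            return self[self._fields.index(field)]
        except ValueError:
            raise AttributeError(field)


candidates = {
    'namedtuple': namedtuple('Foo', 'bar baz')(5, 10),
    '__getattr__': GetattrTuple(5, 10),
}

# Time index access and attribute access for each kind of object.
timings = {}
for label, obj in candidates.items():
    for stmt in ('foo[0]', 'foo.bar'):
        timings[label, stmt] = timeit(stmt, globals={'foo': obj}, number=RUNS)

for (label, stmt), seconds in sorted(timings.items()):
    print('{:12} {:8} {:.4f}s'.format(label, stmt, seconds))
```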

Just for kicks, I wanted to see how far I could take this approach. So, I created a little optimize function, which can be applied as a decorator on a namedtuple class:

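from operator import itemgetter


def optimize(cls):
    # Add one property per field so lookups no longer fall back to __getattr__.
    for idx, fld in enumerate(cls._fields):
        setattr(cls, fld, property(itemgetter(idx), doc='Alias for field number {}'.format(idx)))
    return cls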


@optimize
class FooSubClassOptimized(NamedTuple):
    __slots__ = ()
    _fields = ('bar', 'baz')

    def bar_times_baz(self):
        return self.bar * self.baz

The optimize function goes through and adds a property to the class for each of its fields. This means that when an attribute is accessed, it can short-cut through the property, rather than using the relatively slow __getattr__ method. This results in a considerable speed-up; however, performance was still about 1.2x slower than the namedtuple approach.

If anyone can explain why this takes a performance hit I’d appreciate it, because I certainly can’t work it out!
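
The property short-cut itself (though not the remaining gap) can be observed by counting how often __getattr__ is hit. This is a sketch with hypothetical names (CountingTuple, Plain, Optimized, getattr_calls) rather than the gist classes; it relies on the fact that Python only calls __getattr__ when normal attribute lookup fails.

```python
from operator import itemgetter

getattr_calls = []  # records every fall-back to __getattr__


class CountingTuple(tuple):
    __slots__ = ()
    _fields = ()

    def __new__(cls, *args):
        return tuple.__new__(cls, args)

    def __getattr__(self, field):
        getattr_calls.append(field)
        try:
            return self[self._fields.index(field)]
        except ValueError:
            raise AttributeError(field)


def optimize(cls):
    for idx, fld in enumerate(cls._fields):
        setattr(cls, fld, property(itemgetter(idx)))
    return cls


class Plain(CountingTuple):
    __slots__ = ()
    _fields = ('bar', 'baz')


@optimize
class Optimized(CountingTuple):
    __slots__ = ()
    _fields = ('bar', 'baz')


plain_value = Plain(5, 10).bar          # goes through __getattr__
calls_after_plain = list(getattr_calls)
optimized_value = Optimized(5, 10).bar  # served by the property
calls_after_optimized = list(getattr_calls)
print(calls_after_plain, calls_after_optimized)
```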

So, at this point I thought my quest for a better namedtuple had come to an end, but I thought it might be worth pursuing the decorator approach some more. Rather than modifying the existing class, instead we could use that as a template and return a new class. After some iteration I came up with a class decorator called namedfields, which can be used like:

@namedfields('bar', 'baz')
class Foo(tuple):
    pass

This approach is slightly more verbose than sub-classing or the namedtuple factory; however, I think it should be fairly clear, and it doesn’t include any redundancy. The decorator constructs a new class based on the input class: it adds the properties for all the fields, ensures __slots__ is correctly added to the class, and attaches all the useful NamedTuple methods (such as _replace and __repr__) to the new class.

Classes constructed this way have the same attribute-access performance as namedtuple factory classes.

The namedfields function is:

from operator import itemgetter


def namedfields(*fields):
    def inner(cls):
        if not issubclass(cls, tuple):
            raise TypeError("namedfields-decorated classes must be subclasses of tuple")

        attrs = {
            '__slots__': (),
        }

        methods = ['__new__', '_make', '__repr__', '_asdict',
                   '__dict__', '_replace', '__getnewargs__',
                   '__getstate__']

        attrs.update({attr: getattr(NamedTuple, attr) for attr in methods})

        attrs['_fields'] = fields

        attrs.update({fld: property(itemgetter(idx), doc='Alias for field number {}'.format(idx))
                      for idx, fld in enumerate(fields)})

        attrs.update({key: val for key, val in cls.__dict__.items()
                      if key not in ('__weakref__', '__dict__')})

        return type(cls.__name__, cls.__bases__, attrs)

    return inner
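
Since the function above relies on the full NamedTuple base class from the gist, here is a reduced, self-contained sketch of the same idea. It is not the gist implementation: the name namedfields_sketch is hypothetical, and only __new__ and __repr__ are provided (no _replace, _asdict and friends), which is enough to show the template class being rebuilt with properties and __slots__.

```python
from operator import itemgetter


def namedfields_sketch(*fields):
    """Reduced variant of the decorator: only __new__ and __repr__."""

    def __new__(cls, *args):
        if len(args) != len(fields):
            raise TypeError('expected {} arguments, got {}'.format(
                len(fields), len(args)))
        return tuple.__new__(cls, args)

    def __repr__(self):
        body = ', '.join('{}={!r}'.format(name, value)
                         for name, value in zip(self._fields, self))
        return '{}({})'.format(self.__class__.__name__, body)

    def inner(cls):
        if not issubclass(cls, tuple):
            raise TypeError('decorated class must be a subclass of tuple')

        attrs = {'__slots__': (), '_fields': fields,
                 '__new__': __new__, '__repr__': __repr__}
        # One property per field, as the namedtuple factory does.
        attrs.update({name: property(itemgetter(idx))
                      for idx, name in enumerate(fields)})
        # Copy user-defined members from the template class, skipping
        # descriptors that belong to the template itself.
        attrs.update({key: val for key, val in cls.__dict__.items()
                      if key not in ('__weakref__', '__dict__')})
        return type(cls.__name__, cls.__bases__, attrs)

    return inner


@namedfields_sketch('bar', 'baz')
class Foo(tuple):
    def bar_times_baz(self):
        return self.bar * self.baz


foo = Foo(5, 10)
print(foo)
print(foo.bar_times_baz())
```

Note that the class bound to Foo is the one built by type(), not the template written in the class statement; the template only contributes its name, bases, and body.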

So, is the namedfields decorator better than the namedtuple factory function? Performance-wise, things are on par for the common operation of accessing attributes. I think from an aesthetic point of view things are also on par for simple classes:

Foo = namedtuple('Foo', 'bar baz')

# compared to

@namedfields('bar', 'baz')
class Foo(tuple): pass

namedtuple wins if you prefer a one-liner (although if you are truly perverse you can always do: Foo = namedfields('bar', 'baz')(type('Foo', (tuple, ), {}))). If you don’t really care about one vs. two lines, then it is a pretty line-ball call. However, when we compare what is required to add a simple method, the decorator becomes a clear winner:

class Foo(namedtuple('Foo', 'bar baz')):
    def bar_times_baz(self):
        return self.bar * self.baz

# compared to

@namedfields('bar', 'baz')
class Foo(tuple):
    def bar_times_baz(self):
        return self.bar * self.baz

There is an extra line for the decorator in the namedfields approach; however, this is a lot clearer than the namedtuple call being forced in as a super class.

Finally, I wouldn’t recommend using this right now, because bringing in a new dependency for something that provides only marginal gains over the built-in functionality isn’t really worth it, and this code is really only a proof of concept at this point in time.

I will be following up with some extensions to this approach that cover off some of my other pet hates with the namedtuple approach: custom constructors and subclasses.
