Interesting corners of SFTP

Wed, 11 Jan 2017 20:49:12 +0000

January 11, 2017 is an auspicious day in tech history. No, I'm not talking about the 10 year anniversary of the iPhone announcement; I'm talking about something with much more impact: today marks a decade since the most recent version (13) of the SFTP internet draft expired. So, I thought it would be a good opportunity to rant a bit about some of the more interesting corners of the protocol.

The first interesting thing is that even though there are 14 versions of the internet draft (the first being version zero), the protocol version used by most (all?) major implementations (OpenSSH in particular) is version 3 (which of course is specified in specification version 2, because while the specification numbering is zero-based the protocol version numbering is 1-based). Specification version 2 expired on April Fools' Day 2002.

The second interesting thing is that the protocol never reached the status of RFC. I can't claim to understand why, but for a commonly used tool it is relatively surprising that the best we have in terms of written specification is an expired internet draft. Certainly a step up from no documentation at all, but still surprising.

The third interesting thing is that it isn't really a file transfer protocol. It certainly lets you do that, but it's really much closer to a remote file system protocol. The protocol doesn't expose an upload / download style interface, but rather an open / close / read / write style interface. In fact most of the SFTP operations are really small wrappers around the POSIX APIs (with some subtle changes just to keep everyone on their toes!). One consequence of this style of API is that there isn't any kind of atomic upload operation; there is no way for the server to really know when a client has finished uploading a file. (Which is unfortunate if you want to, say, trigger some processing after a file is uploaded.)
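
For example, with the paramiko library (just a sketch of the protocol's flavour; the host, credentials and paths below are placeholders), an "upload" is really nothing more than open / write / close calls against a remote file handle:

import paramiko

# Placeholder connection details, purely for illustration.
transport = paramiko.Transport(("sftp.example.com", 22))
transport.connect(username="user", password="secret")
sftp = paramiko.SFTPClient.from_transport(transport)

# There is no "upload" primitive in the protocol: just open, write and
# close, mirroring the POSIX-style interface described above.
remote = sftp.open("/upload/data.bin", "wb")
remote.write(b"hello world")
remote.close()

sftp.close()
transport.close()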

The fourth interesting thing is that some parts of the protocol are underspecified, which makes a solid implementation a little tricky. Specifically, there is no definition of a maximum packet size, but in practice popular SFTP clients assume one. So, if you are a server implementer all you can do is guess! If you really want the gory details check out this bug report.

The fifth interesting thing is the way in which the readdir operation works. Now, on a UNIX system you have the POSIX readdir API, however that returns only a single entry at a time, and you don't want a network round trip for each file in a directory. So the SFTP readdir is able to return multiple results in a single call (although exactly how many is underspecified; see the previous interesting thing!).

Now, this design lets you type ls into the SFTP client and only needs a few round-trips to display things. However, people aren't happy with just a file listing, and generally use ls -l to get a long listing with additional details on each file. On a POSIX system you end up calling stat on each filename to get the details, but if you take that design over the network then you are going to end up back with a lot of round-trips. So, to optimize for this, rather than readdir returning a list of filenames, it returns a list of (filename, attribute) tuples. This lets you do the ls -l without excessive round-trips.

Now, so far that isn't interesting. The really interesting thing is that they didn't stop there with this optimisation. To implement ls -l you need to take the filename and attributes and then format them into a string that looks something like: drwxr-xr-x 5 markus markus 1024 Jan 13 18:39 .ssh. Now, there is nothing really stopping an SFTP client doing that transformation on the client side. Doing so certainly doesn't require any additional network round trips. But, for some reason, the SFTP server actually sends this formatted string in the result of readdir! So, readdir returns a list of (filename, longname, attribute) tuples. This is kind of strange.

The only real reason for doing this transformation on the server side is that the client can't perform the transformation from uid / gid into names (since you can't expect the client and server to share the same mapping). So, there is some justification here. However, readdir isn't the only call you need to implement ls. If the user performs an ls on a specific file, then you need to use stat to determine if the file exists, and if so get the attribute information. The SFTP protocol provides a stat call that basically works as you'd expect, returning the attributes for the file. But, it does not return the longname! So, now you know why sometimes an ls in the SFTP client will provide you with user and group names, while other times it just shows numeric identifiers.
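
To see both behaviours from a client's perspective, here is a small paramiko sketch (paths are placeholders, and it assumes the sftp object from the earlier sketch): directory listing entries carry the server-formatted longname, while a plain stat only gives you numeric attributes:

# Each entry from the readdir-style call carries the server-formatted
# long-listing string in addition to the numeric attributes.
for entry in sftp.listdir_attr("."):
    print(entry.longname)

# stat on a single file returns only the attributes -- no longname, so
# the client is left with numeric uid / gid values.
attrs = sftp.stat(".ssh")
print(attrs.st_uid, attrs.st_gid, attrs.st_size)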

The sixth interesting thing is that despite there being a cd command in the OpenSSH SFTP client, there isn't any chdir equivalent in the protocol. So, cd is a pure client-side illusion. A related interesting thing is that the protocol supports filenames both with and without a leading slash. With a slash means absolute (as you'd expect), but without a slash doesn't mean relative, or at least not relative to any current working directory. Paths without a leading slash are resolved relative to a default directory. Normally this would be the user's home directory, but really it is up to the SSH server.
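
paramiko makes the illusion fairly explicit (a sketch, assuming the same sftp object; the directory name is a placeholder): its chdir just records a client-side current directory that relative paths are resolved against before being sent to the server:

# chdir is emulated entirely by the client library; the protocol itself
# has no notion of a current working directory.
sftp.chdir("projects")
print(sftp.getcwd())      # the client-side notion of "where we are"
print(sftp.listdir("."))  # "." is expanded against that client-side cwd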

The final interesting thing worth commenting on is that even though most of the SFTP operations seem to be copies of POSIX interfaces, for rename they chose to specify different semantics to the POSIX rename! Specifically, in SFTP it is an error to rename a file to an existing file, while POSIX rename supports this without an issue. Now of course, people wanted their normal POSIX semantics, so OpenSSH obliged with an extension: posix-rename. So, OpenSSH has two renames, one with SFTP semantics, and one with POSIX semantics. That is not unreasonable; the really fun part is that the OpenSSH SFTP client automatically tries to use the posix-rename semantics, but will happily, silently fall back to the SFTP semantics.
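
In paramiko the two renames show up as separate methods (a sketch with placeholder paths; the two calls below are alternatives, not a sequence), so you can at least pick your semantics explicitly rather than relying on a silent fallback:

# SFTP-semantics rename: typically fails if the destination already exists.
sftp.rename("report.tmp", "report.txt")

# posix-rename@openssh.com extension: replaces an existing destination,
# matching POSIX rename(2) semantics (requires server support).
sftp.posix_rename("report.tmp", "report.txt")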

Hopefully if you ever need to deal with SFTP at a low level this has given you some pointers of things to look out for!

Thoughts on PolicyHack

Fri, 09 Oct 2015 09:31:59 +0000

Yesterday a Policy Hackathon event was launched by the Assistant Minister for Innovation Wyatt Roy. This kind of outreach to the community is a great thing to see from politicians, and I look forward to seeing more examples of it. However, I'm concerned that the way in which this particular event has been framed won't lead to successful outcomes.

The first problem with Policy Hack is that it focuses on the solution space (i.e.: policy ideas) and completely fails to explore the problem space. Ironically, this is the same kind of thinking that leads to the demise of many technology companies who focus on the cool technology instead of the problem their customers need them to solve.

The Policy Hack website has, at the time of writing, received over 60 submissions (of varying quality). All of them are policy suggestions, but few, if any, provide any kind of detail on the specific problem they are trying to solve. In some cases it is implied (e.g. a policy suggestion to lower taxation implies that high taxation is in some way impeding innovation), but even then there is little to convince the reader that the implied problem is actually a problem. (e.g.: where is the evidence that the current level of taxation is actually impeding innovation?)

Before we get into policy ideas we really need to have some kind of problem definition. What is the problem that Policy Hack is trying to solve?

The closest we get to that on the website is policy ideas designed to foster the growth of innovation industries including tech startups, biotech, agtech, fintech, renewables and resources.

This isn't at all a bad starting point, but it's a very high-level aspirational kind of statement, rather than an actionable definition of the problem. A lot more work needs to be done to fully flesh out the problem before we even start thinking about the solution.

An example of a more actionable problem statement could be: The growth of the tech startups, biotech, agtech, fintech, renewables and resources industries as measured by percentage of GDP is too low for Australia's future economic prosperity. Of course this raises further questions. Maybe percentage of GDP isn't the best measure, and we should be using a different metric such as tax revenue, or employment. And the actual terms should be defined a bit more clearly as well; what even is a tech startup? I don't think that is one of the business classifications used by the Bureau of Statistics, which raises even more questions!

Of course, once we have that problem statement, some kind of data would be nice. Like, what is the current growth rate? Off what base? What does it need to be to ensure economic prosperity? If we can't get those numbers directly, what are reasonable proxies for those numbers that we can measure? How reliable are those proxies?

With a problem definition, trying some kind of root cause analysis might lead to specific problems that we can solve with policies. One example could be: there aren't enough people creating startups, because there aren't enough STEM graduates, because high-school students don't choose STEM courses, because they are worried about the HECS/HELP debt. Or it could be: we have plenty of startups but once they get initial traction they move to the US, because there isn't enough capital available in Australia, because investors are taxed too heavily on capital gains. I'm sure there are hundreds of different chains of thought like this that should be explored.

Of course, those are just hypotheses! We should be applying some kind of scientific method here. Do startups really move to the US once they get success? Maybe they don't. Do they move because of funding reasons, or maybe it is access to customers? Is capital not available here because of capital gains tax, or maybe it is because other asset classes provide better returns?

None of that is really easy, and certainly can't be done collaboratively during a hackathon (despite the excellent opening up of government data sources to make the underlying data available). This needs patient analysis, by diligent people, spending time to find the right data sources.

Without access to this information how could you possibly come up with a useful policy suggestion? Of course, once this work has been done the actual policies probably become somewhat obvious!

Ideally this is work that has already been done, either within government policy units or by external consultants. If so, it should be directly available on the Policy Hack website, and it should be required reading for anyone wanting to put forward a policy idea.

The second problem that I see with the Policy Hack event, or at least some of the suggestions put forward to date, is that many of them are simply reinventing the wheel. (Which, ironically, is the cause of the downfall of many a startup!)

Many of the suggestions currently put forward are basically the same as policies that already exist in some shape or form. This isn't meant to be a criticism of the person putting forward the idea; it isn't exactly simple to find what policies are currently out there. But surely, if we are going to have an event that looks at putting forward new policy ideas on innovation we should at the very least have a list of the current policies the government has in place with the aim of fostering innovation! Ideally, we would also have an analysis of each policy with some commentary on what works well about it, what doesn't work so well, and some form of cost-benefit analysis.

With this information in place we may find that a simple tweak to an existing policy could have a large impact. Or we could find ineffective policies that can be scrapped to make funds available for new policies (because any new policies will need to be funded somehow, right?). Or we could find that we already have policies addressing many of the things that we think are problems, and we need to go back to the root cause analysis to determine other possible things that need addressing.

Again, maybe this information is available somewhere, but it should be made available to participants of Policy Hack in advance. And, if you are going to propose a new policy that is similar in some way to an existing policy, then you should explain what's different about your proposal. For example, if you put forward an idea such as allow special visas to hire technical immigrants, without any reference at all to the current 457 visa scheme, you're really just wasting everybody's time!

Finally, there are a lot of Country X does Y, let's do Y policy suggestions without any critical analysis of why Y works for X (or even if Y works for X!). (And to continue the theme, this kind of me too thinking seems to plague startups as well.)

Ideally an input to the Policy Hack should be a comprehensive review of other innovation policies that have been put in place by other jurisdictions. This review should explain how these have, or have not, been effective in the jurisdictions in which they have been implemented, along with some commentary as to how they compare to similar policies in Australia. For example, holding up the US low capital gains tax and long-term dividend tax rate without explaining that the US has no mechanism for relief from double taxation at the corporate and individual level is not providing a fair evaluation.

And again, if you are going to put forward policy ideas based on ideas from other jurisdictions, you had better actually reference that research and justify why the policy would work in Australia. For example, if you think that Australia should have lower corporate tax rates because Ireland has lower corporate tax rates, you need to at least give some kind of backup that this is actually working for Ireland.

If we want to have an effective discussion on policy, we first need to agree on the problem we are solving, and have a clear understanding of what has been tried before both locally and overseas.

Instead of policies, maybe the Policy Hack could instead work through the problem space. A morning session that presented a summary of all existing data to the participants, followed by an afternoon brainstorming session on identifying possible reasons why we don't see the levels of innovation that we'd like. With this information in hand the policy experts could perform the relevant research to determine which of the identified reasons actually has an evidence base to support it. The results of this could feed in to a future hackathon that starts to look at policy from an informed base.

Uniquely identifying processes in Linux

Sun, 28 Jun 2015 12:20:50 +0000

If you are doing any kind of analysis of a Linux based system you quickly come to the point of needing to collect statistics on a per-process basis. However one difficulty here is how to uniquely identify a process.

The simplest answer is that you can just use the process identifier (PID); unfortunately this is not a good answer in the face of PID reuse. The total number of PIDs in the system is finite, and once exhausted the kernel will start re-using PIDs that have previously been allocated to different processes.

Now, you may expect that this reuse would be relatively rare, or happen only over a relatively long period, but it can occur quite quickly. On a system under load I've seen PID reuse in under 5 seconds.

The practical consequence of this is that if you are, for example, collecting statistics about a process via any interface that identifies processes by PID, you need to be careful to ensure you are collecting the right statistics! For example, if you are collecting statistics for the process identified by PID 37 you might read /proc/37/stat at one instant and receive valid data, but at any later time /proc/37/stat could return data about a completely different process.

Thankfully, the kernel associates another useful piece of information with a process: its start time. The combination of PID and start time provides a reasonably robust way of uniquely identifying processes over the life-time of a system. (For the pedantic, if a process can be created and correctly reaped all within the granularity of the clock, then it would be theoretically possible for multiple different processes to have existed in the system with the same PID and start time, but that is unlikely to be a problem in practice.)

The start time is one of the fields that is present in the /proc/<pid>/stat information, so this can be used to ensure you are correctly matching up statistics.
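
A minimal sketch of grabbing that (pid, start time) pair in Python (the field layout is from proc(5); starttime is field 22, measured in clock ticks since boot, and the helper name here is just mine):

def process_identity(pid):
    """Return a (pid, starttime) pair that survives PID reuse."""
    with open("/proc/{}/stat".format(pid)) as f:
        data = f.read()
    # The comm field (field 2) is wrapped in parentheses and may itself
    # contain spaces or parentheses, so split after the last ')' first.
    rest = data.rsplit(")", 1)[1].split()
    # rest[0] is field 3 ("state"), so field 22 ("starttime") is rest[19].
    return (pid, int(rest[19]))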

Playing around with await/async in Python 3.5

Mon, 25 May 2015 07:07:03 +0000

PEP-0492 was recently approved, giving Python 3.5 some special syntax for dealing with co-routines. A lot of the new functionality was available pre-3.5, but the syntax certainly wasn’t ideal as the concepts of generators and co-routines were kind of intermingled. PEP-0492 makes an explicit distinction between generators and co-routines through the use of the async keyword.

This post aims to describe how these new mechanisms work at a rather low level. If you are mostly interested in just using this functionality for high-level stuff I recommend skipping this post and reading up on the built-in asyncio module. If you are interested in how these low-level concepts can be used to build up your own version of the asyncio module, then you might find this interesting.

For this post we’re going to totally ignore any asynchronous I/O aspect and just limit things to interleaving progress from multiple co-routines. Here are two very simple functions:

def coro1():
    print("C1: Start")
    print("C1: Stop")


def coro2():
    print("C2: Start")
    print("C2: a")
    print("C2: b")
    print("C2: c")
    print("C2: Stop")

We start with two very simple functions, coro1 and coro2. We could call these functions one after the other:

coro1()
coro2()

and we’d get the expected output:

C1: Start
C1: Stop
C2: Start
C2: a
C2: b
C2: c
C2: Stop

But, for some reason, rather than running these one after the other, we’d like to interleave the execution. We can’t just do that with normal functions, so let’s turn these into co-routines:

async def coro1():
    print("C1: Start")
    print("C1: Stop")


async def coro2():
    print("C2: Start")
    print("C2: a")
    print("C2: b")
    print("C2: c")
    print("C2: Stop")

Through the magic of the new async keyword these are no longer plain functions; they are now co-routines (or more specifically, native co-routine functions). When you call a normal function, the function body is executed, however when you call a co-routine function the body isn’t executed; instead you get back a co-routine object:

c1 = coro1()
c2 = coro2()
print(c1, c2)

gives:

<coroutine object coro1 at 0x10ea60990> <coroutine object coro2 at 0x10ea60a40>

(The interpreter will also print some runtime warnings that we’ll ignore for now).

So, what good is having a co-routine object? How do we actually execute the thing? Well, one way to execute a co-routine is through an await expression (using the new await keyword). You might think you could do something like:

await c1

but, you’d be disappointed. An await expression is only valid syntax when contained within a native co-routine function. You could do something like:

async def main():
    await c1

but of course then you are left with the problem of how to force the execution of main!

The trick is to realise that co-routines are actually pretty similar to Python generators, and have the same send method. We can kick off execution of a co-routine by calling the send method:

c1.send(None)

This gets our first co-routine executing to completion, however we also get a nasty StopIteration exception:

C1: Start
C1: Stop
Traceback (most recent call last):
  File "test3.py", line 16, in 
    c1.send(None)
StopIteration

The StopIteration exception is the mechanism used to indicate that a generator (or co-routine in this case) has completed execution. Despite being an exception it is actually quite expected! We can wrap this in an appropriate try/except block to avoid the error condition. At the same time let’s start the execution of our second co-routine:

try:
    c1.send(None)
except StopIteration:
    pass
try:
    c2.send(None)
except StopIteration:
    pass

Now we get complete output, but it is disappointingly similar to our original output. So we have a bunch more code, but no actual interleaving yet! Co-routines are not dissimilar to threads in that they allow the interleaving of multiple distinct threads of control; however, unlike threads, with co-routines any switching is explicit rather than implicit (which is, in many cases, a good thing!). So we need to put in some of these explicit switches.

Normally the send method on generators will execute until the generator yields a value (using the yield keyword), so you might think we could change coro1 to something like:

async def coro1():
    print("C1: Start")
    yield
    print("C1: Stop")

but we can’t use yield inside a native co-routine. Instead we use the new await expression, which suspends execution of the co-routine until the awaitable completes. So we need something like await _something_; the question is what is the something in this case? We can’t just await on nothing! The PEP explains which things are awaitable. One of them is another native co-routine, but that doesn’t help get to the bottom of things. One of them is an object defined with a special CPython API, but we want to avoid extension modules and stick to pure Python right now. That leaves two options: either a generator-based co-routine object or a special Future-like object.

So, let’s go with the generator-based co-routine object to start with. Basically a Python generator function (e.g.: something that has a yield in it) can be marked as a co-routine through the types.coroutine decorator. So a very simple example of this would be:

import types

@types.coroutine
def switch():
    yield

This defines a generator-based co-routine function. To get a generator-based co-routine object we just call the function. So, we can change our coro1 co-routine to:

async def coro1():
    print("C1: Start")
    await switch()
    print("C1: Stop")

With this in place, we hope that we can interleave our execution of coro1 with the execution of coro2. If we try it with our existing code we get the output:

C1: Start
C2: Start
C2: a
C2: b
C2: c
C2: Stop

We can see that as expected coro1 stopped executing after the first print statement, and then coro2 was able to execute. In fact, we can look at the co-routine object and see exactly where it is suspended with some code like this:

print("c1 suspended at: {}:{}".format(c1.gi_frame.f_code.co_filename, c1.gi_frame.f_lineno))

which prints out the line of your await expression. (Note: this gives you the outer-most await, so it is mostly just for explanatory purposes here, and not particularly useful in the general case).

OK, the question now is: how can we resume coro1 so that it executes to completion? We can just use send again. So we end up with some code like:

try:
    c1.send(None)
except StopIteration:
    pass
try:
    c2.send(None)
except StopIteration:
    pass
try:
    c1.send(None)
except StopIteration:
    pass

which then gives us our expected output:

C1: Start
C2: Start
C2: a
C2: b
C2: c
C2: Stop
C1: Stop

So, at this point we’re manually pushing the co-routines through to completion by explicitly calling send on each individual co-routine object. This isn’t going to work in general. What we’d really like is a function that keeps executing all our co-routines until they have all completed. In other words, we want to continually call send on each co-routine object until that method raises the StopIteration exception.

So, let’s create a function that takes in a list of co-routines and executes them until completion. We’ll call this function run.

def run(coros):
    coros = list(coros)

    while coros:
        # Duplicate list for iteration so we can remove from original list.
        for coro in list(coros):
            try:
                coro.send(None)
            except StopIteration:
                coros.remove(coro)

This picks a co-routine from the list of co-routines, executes it, and then if a StopIteration exception is raised, the co-routine is removed from the list.

We can then remove the code manually calling the send method and instead do something like:

c1 = coro1()
c2 = coro2()
run([c1, c2])

And now we have a very simple run-time for executing co-routines using the new await and async features in Python 3.5. Code related to this post is available on github.

Android projects and gradlew

Wed, 08 Apr 2015 14:15:36 +0000

The instructions in the Android first-app tutorial (as of 20150408) are kind of confusing. I think something got lost in translation with the addition of Android Studio to the mix.

Both the creating a project and running an app pages have instructions for using the command line, rather than the IDE (which is great, because I'm curmudgeonly like that). Unfortunately, if you use the command line approach for creating a project, you end up with an Ant based build system, and not a Gradle based one. Which conveniently means that the command line instructions for running an app make no sense at all.

So, since you probably want a Gradle based build, I suggest using the IDE once to create the project; after that you can follow the instructions for running an app from the command line.

PS: Make sure you are connected to the interwebs the first time you run gradlew as it needs to download a bunch of stuff before it can start.

Checking for leaks in Python

Thu, 26 Feb 2015 15:29:57 +0000

For short running scripts it doesn't really matter if your Python module is leaking memory; it all gets cleaned up on process exit anyway. However, if you have a long-running application you really don't want an ever increasing memory footprint. So, it is helpful to check that your module isn’t leaking memory, and to plug the leaks if they occur. This post provides some ideas on how to do that, and describes some of the implementation of my check_leaks module.

If you suspect a given function foo of leaking memory, then the simplest approach to checking is to capture the set of objects that exist before running the function and the set of objects that exist after running the function; any object in the after set that isn’t in the before set is a leaked object.

So, in Python the garbage collector tracks (practically) all the objects that exist, and the gc module provides us with access to the internals of the garbage collector. In particular, the get_objects function will provide a list of all the objects tracked by the garbage collector. So at a really simple level we can do:

before_objects = gc.get_objects()
foo()
after_objects = gc.get_objects()

It is then a simple matter of working out which objects are in after_objects but not in before_objects, and those are our leaked objects! One catch however is that we don’t know when the garbage collector last ran, so with the above code it is possible that there are objects in after_objects that would be cleaned up by the next garbage collection. Thankfully, the gc module provides the collect function, which allows us to force a garbage collection. So we end up with something like:

before_objects = gc.get_objects()
foo()
gc.collect()
after_objects = gc.get_objects()

The other small gotcha in this case is that the before_objects list will itself end up as an object in after_objects. This isn’t really a leak, so we can ignore it when displaying the leaking objects.
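
Putting that together, a minimal sketch of the diff (this is just the idea, not the actual check_leaks module; comparing by id() avoids requiring the objects to be hashable):

import gc

def find_leaks(func):
    """Run func and return a list of objects that survived it."""
    before_objects = gc.get_objects()
    before_ids = {id(obj) for obj in before_objects}
    func()
    gc.collect()
    # Ignore the bookkeeping list itself, as noted above.
    return [obj for obj in gc.get_objects()
            if id(obj) not in before_ids and obj is not before_objects]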

So far, this can give us a list of leaked objects, but it doesn’t really help resolve the issue except in the simplest cases. We need more information to really fix the issue. One thing that can help is knowing why an object hasn’t been collected yet, which comes down to knowing which other objects still hold a reference to the leaked object. Additionally, it can be very useful to know where the object was originally allocated. Python gives us some tools to provide this information.

The gc module provides a get_referrers function, which returns a list of all objects that refer to a given object. This can be used to print out which objects refer to the leaked object. Unfortunately this is less useful than you might think, because in most cases the referring object is a dict, as most class instances and modules store references in __dict__ rather than directly. With some more processing it would probably be possible to work out the purpose of each dict and provide more semantically useful information.
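
A rough sketch of what that looks like (again just the idea; the helper name is mine):

import gc

def show_referrers(obj):
    # Print a short summary of each object that still refers to obj.
    # Most of these turn out to be __dict__ dictionaries, as noted above.
    for referrer in gc.get_referrers(obj):
        print(type(referrer).__name__, repr(referrer)[:80])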

To work out where an object was originally allocated we can use the new tracemalloc module. When enabled, tracemalloc will store a traceback for each memory allocation within Python. The get_object_traceback() function will return a traceback (or None) for a given object. This makes debugging leaks significantly easier than just guessing where an object was allocated.
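
Using it looks roughly like this (a sketch; tracing must be started before the allocation happens for a traceback to be recorded, and find_leaks / foo are the hypothetical helpers from the earlier sketches):

import tracemalloc

tracemalloc.start(25)  # keep up to 25 stack frames per allocation

for obj in find_leaks(foo):
    tb = tracemalloc.get_object_traceback(obj)
    if tb is not None:
        print(type(obj).__name__)
        print("\n".join(tb.format()))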

Unfortunately, tracemalloc can sometimes provide misleading information when trying to detect leaks. Many of the Python built-in types optimise memory allocation by keeping an internal free list. When freeing an object, rather than calling the normal free routine, the object is placed on this free list, and a new object allocation can grab an object directly from the free list, rather than needing to go through the normal memory manager routines. This means that when using get_object_traceback(), the trace returned will be from the time the memory was originally allocated, which may be for a long dead object. I think it would be good if the Python runtime provided a way to selectively disable this free-list approach to allow precise use of the get_object_traceback() method. I have an initial branch that does this for some objects up on github.

uiautomator and webviews in Lollipop

Thu, 01 Jan 2015 11:07:25 +0000

Android’s uiautomator dump command takes a snapshot of the currently active window state and creates an XML representation of the view hierarchy. This is extremely useful when it comes to automating interactions with the device and general debugging. One handy aspect of this is that it can even introspect on Android’s WebView, so you get an idea of where text boxes and buttons are. Or at least that was the case pre-Lollipop.

In Lollipop this doesn’t quite work anymore. When you run uiautomator dump in Lollipop, you just get one big WebView item in the view hierarchy, but no children! (And I’m not the only one to have this issue.) So what’s going on?

The first thing to realise is that uiautomator operates through the accessibility APIs. Essentially it is enabled as an accessibility manager in the system (which can actually be a problem if you have another accessibility manager enabled).

It’s actually kind of tricky to understand exactly what is going on, but reading through the Chrome source it seems to only enable population of the virtual view hierarchy if an accessibility manager is enabled at the time that the page is rendered. The UiAutomator sub-system is enabled as an accessibility manager when you run a uiautomator command, but in Lollipop at least, it is then disabled when the uiautomator process exits! I’m not 100% sure why this worked pre-Lollipop. I think it is because pre-Lollipop the UiAutomator service stayed registered even after the command exited, so any WebViews rendered after the first execution of uiautomator would correctly provide the virtual view hierarchy.

A work-around is to enable an accessibility service (like TalkBack), load the view you are interested in, disable the accessibility service, and then run uiautomator dump. Sometimes in this case you will get the full dump, and other times not; I think that must be due to the page being re-rendered or something similar.

If you want to get the full view hierarchy the only real way to go about it is to write a UI test that is run by uiautomator; as long as uiautomator is running you should be able to extract the full hierarchy.

Haskell Baby Steps

Sat, 27 Dec 2014 11:25:28 +0000

I haven’t really written any Haskell in a serious manner since first-year computing, which doesn’t seem that long ago, but actually is.

Recently I’ve tried to pick it up again, which, to be honest, has been challenging. Haskell is much more widely used these days and can do a lot more. At the same time there is a lot more to learn, and the language seems to be an awful lot bigger now than it did 16 years ago. My overall goal right now is to rewrite some Python code that performs abstract interpretation in Haskell, but I’m taking baby steps to start with.

One of the biggest things with Haskell (as compared to Python) is that it really forces you to know exactly what you are trying to do before writing code. This is a lot different to how I usually start writing code. The way I usually design code is by trying to write some code that does some small part of what I’m trying to achieve, and then iterating from that. For me at least this is easier to do in Python than in Haskell, but this is more likely to be a property of my familiarity with the language than any inherent property of either language. In any case, for me, doing an initial sketch in Python and then converting that to Haskell seems to be a useful approach.

The first module I’ve written on this journey is FiniteType.hs, which provides a type-class FiniteType with a single method, cardinality. To be honest I was surprised that this wasn’t some kind of built-in! At the end of the day the module is fairly simple, but along the way there was a lot to learn: language extensions (in general), the DefaultSignatures extension in particular, and the Data.Proxy module.

In the end I’m fairly happy with the solution I came up with, but there is a certain amount of code duplication in the implementation that I’m still not happy about. I think the only fix for this is template-haskell, but that seems overkill for such a simple module.

There are still some things that I’m not sure about. In particular, when writing a type-class such as `FiniteType` should the module create instances of the type-class, or should that be up to the user of the module?

Code is available on github. I’d welcome any comments or suggestions for improvements.

namedfields create performance

Mon, 01 Dec 2014 20:35:37 +0000

This is a follow-up to yesterday’s post about namedtuples.

Yesterday I mostly focussed on the performance of accessing attributes on a named tuple object, and the namedfields decorator approach that I showed ended up with the same performance as the standard library namedtuple. One operation that I didn’t consider, but which is actually reasonably common, is the creation of a new object.

My implementation relied on a generic __new__ that used the underlying _fields to work out the actual arguments to pass to the tuple.__new__ constructor:

    def __new__(_cls, *args, **kwargs):
        if len(args) > len(_cls._fields):
            raise TypeError("__new__ takes {} positional arguments but {} were given".format(len(_cls._fields) + 1, len(args) + 1))

        missing_args = tuple(fld for fld in _cls._fields[len(args):] if fld not in kwargs)
        if len(missing_args):
            raise TypeError("__new__ missing {} required positional arguments".format(len(missing_args)))
        extra_args = tuple(kwargs.pop(fld) for fld in _cls._fields[len(args):] if fld in kwargs)
        if len(kwargs) > 0:
            raise TypeError("__new__ got an unexpected keyword argument '{}'".format(list(kwargs.keys())[0]))

        return tuple.__new__(_cls, tuple(args + extra_args))

This seems to work (in my limited testing), but the code is pretty nasty (I’m far from confident that it is correct), and it is also slow: about 10x slower than a class created with the namedtuple factory function, whose generated __new__ is just:

    def __new__(_cls, bar, baz):
        'Create new instance of Foo2(bar, baz)'
        return _tuple.__new__(_cls, (bar, baz))

As a result of this finding, I’ve changed my constructor approach, and now generate a custom constructor for each new class using eval. It looks something like:

        str_fields = ", ".join(fields)
        new_method = eval("lambda cls, {}: tuple.__new__(cls, ({}))".format(str_fields, str_fields), {}, {})

With this change constructor performance is on-par with the namedtuple approach, and I’m much more confident that the code is actually correct!
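
For reference, creation time can be measured in the same way the previous post measured attribute access (a sketch of the kind of benchmark used, not the exact code; the namedfields-decorated class would be timed identically):

from collections import namedtuple
from timeit import timeit

Foo = namedtuple('Foo', 'bar baz')

RUNS = 1000000
create_time = timeit('Foo(5, 10)', setup='from __main__ import Foo', number=RUNS)
print(create_time)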

I’ve cleaned up the namedfields code a little, and made it available as part of my pyutil repo.

A better namedtuple for Python (?)

Sun, 30 Nov 2014 16:56:23 +0000

Python's namedtuple factory function is, perhaps, the most under-utilised feature in Python. Classes created using namedtuple are great because they lead to immutable objects (as compared to normal classes, which lead to mutable objects). The goal of this blog post isn’t to convince you that immutable objects are a good thing, but to take a bit of a deep dive into the namedtuple construct and explore some of its shortcomings and some of the ways in which it can be improved.

NOTE: All the code here is available in this gist.

As a quick primer, let’s have a look at a very simple example:

from collections import namedtuple

Foo = namedtuple('Foo', 'bar baz')
foo = Foo(5, 10)
print(foo)

NOTE: All my examples are targeting Python 3.3. They will not necessarily work in earlier versions; in particular at least some of the later ones will not work in Python 2.7.

For simple things like this example, the existing namedtuple works pretty well, however the repetition of Foo is a bit of a code smell. Also, even though Foo is a class, it is created significantly differently to a normal class (which I guess could be considered an advantage).

Let’s try and do something a little more complex. If we find ourselves needing to know the product of bar and baz a bunch of times we probably want to encapsulate that rather than writing foo.bar * foo.baz everywhere. (OK, this example isn’t the greatest in the history of technical writing, but just bear with me on this). So, as a first attempt we can just write a function:

def foo_bar_times_baz(foo):
    return foo.bar * foo.baz

print(foo_bar_times_baz(foo))

There isn’t anything wrong with this; functions are great! But the normal pattern in Python is that when we have a function that does stuff with a class instance we make it a method, rather than a function. We could debate the aesthetic and practical tradeoffs between methods and functions, but let’s just go with the norm here and assume we’d prefer to write foo.bar_times_baz(). The canonical approach to this is:

class Foo(namedtuple('Foo', 'bar baz')):
    def bar_times_baz(self):
        return self.bar * self.baz


foo = Foo(5, 10)
print(foo.bar_times_baz())

Now this works but, again, we have the repetition of Foo, which bugs me. Coupled with this, we are calling a function inside the class definition, which is a bit ugly. There are also some minor practical concerns because there are now multiple Foo classes in existence, which can at times cause confusion. This can be avoided by appending an underscore to the name of the super-class, e.g.: class Foo(namedtuple('Foo_', 'bar baz')).

An alternative to this which I’ve used from time to time is to directly update the Foo class:

Foo = namedtuple('Foo', 'bar baz')
Foo.bar_times_baz = lambda self: self.bar * self.baz
foo = Foo(5, 10)
print(foo.bar_times_baz())

I quite like this, and think it works well for methods which can be written as a single expression, but not all Python programmers are particularly fond of lambdas, and although I know that classes are mutable, I prefer to think of them as immutable, so modifying them after creation is also a bit ugly.

The way that the namedtuple function works is by creating a string that contains the class definition and then using exec on that string to create the new class. You can see the string version of that class definition by passing verbose=True to the namedtuple function. For our Foo class it looks a bit like:

class Foo(tuple):
    'Foo(bar, baz)'

    __slots__ = ()

    _fields = ('bar', 'baz')

    def __new__(_cls, bar, baz):
        'Create new instance of Foo(bar, baz)'
        return _tuple.__new__(_cls, (bar, baz))

    ....

    def __repr__(self):
        'Return a nicely formatted representation string'
        return self.__class__.__name__ + '(bar=%r, baz=%r)' % self

    .....

    bar = _property(_itemgetter(0), doc='Alias for field number 0')

    baz = _property(_itemgetter(1), doc='Alias for field number 1')

I’ve omitted some of the method implementations for brevity, but you get the general idea. If you have a look at that you might have the same thought I did: why isn’t this just a sub-class? It seems practical to construct:

class NamedTuple(tuple):
    __slots__ = ()

    _fields = None  # Subclass must provide this

    def __new__(_cls, *args):
        # Error checking, and **kwargs handling omitted for brevity
        return tuple.__new__(_cls, tuple(args))

    def __repr__(self):
        'Return a nicely formatted representation string'
        fmt = '(' + ', '.join('%s=%%r' % x for x in self._fields) + ')'
        return self.__class__.__name__ + fmt % self

    def __getattr__(self, field):
        try:
            idx = self._fields.index(field)
        except ValueError:
            raise AttributeError("'{}' NamedTuple has no attribute '{}'".format(self.__class__.__name__, field))

        return self[idx]

Again, some method implementations are omitted for brevity (see the gist for the gory details). The main differences from the generated class are making use of _fields consistently rather than hard-coding things, and the necessity of a __getattr__ method, rather than hardcoded properties.

This can be used something like this:

class Foo(NamedTuple):
    _fields = ('bar', 'baz')

This fits the normal pattern for creating classes much better than a constructor function. It is two lines rather than the slightly more concise one-liner, but we don’t need to duplicate the Foo name so that is a win. In addition, when we want to start adding methods, we know exactly where they go.

This isn’t without some drawbacks. The biggest problem is that we’ve just inadvertently made our subclass mutable! Which is certainly not ideal. This can be rectified by adding __slots__ = () to our sub-class:

class Foo(NamedTuple):
    __slots__ = ()
    _fields = ('bar', 'baz')

At this point this approach no longer looks so good. Another minor drawback is that there is no checking that the field names are actually valid, which the namedtuple approach does correctly. The final drawback is on the performance front. We can measure the attribute access time:

from timeit import timeit

RUNS = 1000000

direct_idx_time = timeit('foo[0]', setup='from __main__ import foo', number=RUNS)
direct_attr_time = timeit('foo.bar', setup='from __main__ import foo', number=RUNS)
sub_class_idx_time = timeit('foo[0]', setup='from __main__ import foo_sub_class as foo', number=RUNS)
sub_class_attr_time = timeit('foo.bar', setup='from __main__ import foo_sub_class as foo', number=RUNS)


print(direct_idx_time)
print(direct_attr_time)
print(sub_class_idx_time)
print(sub_class_attr_time)

The results are that accessing a value via direct indexing is the same for both approaches, however when accessing via attributes the sub-class approach is 10x slower than the original namedtuple approach (which was itself about 3x slower than direct indexing).

Just for kicks, I wanted to see how far I could take this approach. So, I created a little optimize function, which can be applied as a decorator on a namedtuple class:

from operator import itemgetter

def optimize(cls):
    for idx, fld in enumerate(cls._fields):
        setattr(cls, fld, property(itemgetter(idx), doc='Alias for field number {}'.format(idx)))
    return cls


@optimize
class FooSubClassOptimized(NamedTuple):
    __slots__ = ()
    _fields = ('bar', 'baz')

    def bar_times_baz(self):
        return self.bar * self.baz

The optimize decorator goes through and adds a property to the class for each of its fields. This means that when an attribute is accessed, it can short-cut through the property, rather than using the relatively slow __getattr__ method. This results in a considerable speed-up, however performance was still about 1.2x slower than the namedtuple approach.

If anyone can explain why this takes a performance hit I’d appreciate it, because I certainly can’t work it out!

So, at this point I thought my quest for a better namedtuple had come to an end, but it seemed worth pursuing the decorator approach some more. Rather than modifying the existing class, we could instead use it as a template and return a new class. After some iteration I came up with a class decorator called namedfields, which can be used like:

@namedfields('bar', 'baz')
class Foo(tuple):
    pass

This approach is slightly more verbose than sub-classing or the namedtuple factory, however I think it is fairly clear, and doesn’t include any redundancy. The decorator constructs a new class based on the input class: it adds the properties for all the fields, ensures __slots__ is correctly added to the class, and attaches all the useful NamedTuple methods (such as _replace and __repr__) to the new class.

Classes constructed this way have identical attribute performance as namedtuple factory classes.

The namedfields function is:

def namedfields(*fields):
    def inner(cls):
        if not issubclass(cls, tuple):
            raise TypeError("namefields decorated classes must be subclass of tuple")

        attrs = {
            '__slots__': (),
        }

        methods = ['__new__', '_make', '__repr__', '_asdict',
                   '__dict__', '_replace', '__getnewargs__',
                   '__getstate__']

        attrs.update({attr: getattr(NamedTuple, attr) for attr in methods})

        attrs['_fields'] = fields

        attrs.update({fld: property(itemgetter(idx), doc='Alias for field number {}'.format(idx))
                      for idx, fld in enumerate(fields)})

        attrs.update({key: val for key, val in cls.__dict__.items()
                      if key not in ('__weakref__', '__dict__')})

        return type(cls.__name__, cls.__bases__, attrs)

    return inner

So, is the namedfields decorator better than the namedtuple factory function? Performance-wise things are on par for the common operation of accessing attributes. I think from an aesthetic point of view things are also on par for simple classes:

Foo = namedtuple('Foo', 'bar baz')

# compared to

@namedfields('bar', 'baz')
class Foo(tuple): pass

namedtuple wins if you prefer a one-liner (although if you are truly perverse you can always do: Foo = namedfields('bar', 'baz')(type('Foo', (tuple, ), {}))). If you don’t really care about one vs. two lines, then it is a pretty line-ball call. However, when we compare what is required to add a simple method, the decorator becomes a clear winner:

class Foo(namedtuple('Foo', 'bar baz')):
    def bar_times_baz(self):
        return self.bar * self.baz

# compared to

@namedfields('bar', 'baz')
class Foo(tuple):
    def bar_times_baz(self):
        return self.bar * self.baz

There is an extra line for the decorator in the namedfields approach, however this is a lot clearer than having the fields forced in via a super-class.

Finally, I wouldn’t recommend using this right now, because bringing in a new dependency for something that provides only marginal gains over the built-in functionality isn’t really worth it, and this code is really only a proof-of-concept at this point in time.

I will be following up with some extensions to this approach that cover some of my other pet hates with the namedtuple approach: custom constructors and subclasses.