January 11, 2017 is an auspicious day in tech history. No, I'm not talking about the 10-year anniversary of the iPhone announcement; I'm talking about something with much more impact: today marks a decade since the most recent version (13) of the SFTP internet draft expired. So, I thought it would be a good opportunity to rant a bit about some of the more interesting corners of the protocol.
The first interesting thing is that even though there are 14 versions of the internet draft (the first being version zero), the protocol version used by most (all?) major implementations (OpenSSH in particular) is version 3 (which, of course, is specified in specification version 2, because while the specification numbering is zero-based, the protocol version numbering is 1-based). Specification version 2 expired on April Fools' Day 2002.
The second interesting thing is that the protocol never reached the status of an RFC. I can't claim to understand why, but for such a commonly used tool it is surprising that the best we have in terms of a written specification is an expired internet draft. It's certainly a step up from no documentation at all, but still surprising.
The third interesting thing is that it isn't really a file transfer protocol. It certainly lets you do that, but it's really much closer to a remote file system protocol. The protocol doesn't expose an upload / download style interface, but rather an open / close / read / write style interface. In fact most of the SFTP interfaces are really small wrappers around the POSIX APIs (with some subtle changes just to keep everyone on their toes!). One consequence of this style of API is that there isn't any kind of atomic upload operation; there is no way for the server to really know when a client has finished uploading a file. (Which is unfortunate if you want to, say, trigger some processing after a file is uploaded.)
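To make the open/read/write/close flavour concrete, here is a rough sketch of an "upload" using the third-party paramiko library (my choice purely for illustration; the packet names in the comments are from the draft). Note that there is no point at which the client tells the server the upload is complete:

import paramiko

def upload(sftp, local_path, remote_path):
    # `sftp` is assumed to be an already-connected paramiko.SFTPClient.
    with open(local_path, 'rb') as local:
        remote = sftp.open(remote_path, 'wb')   # SSH_FXP_OPEN
        while True:
            chunk = local.read(32768)
            if not chunk:
                break
            remote.write(chunk)                 # SSH_FXP_WRITE at an offset
        # SSH_FXP_CLOSE -- but a client crash or dropped connection looks
        # much the same to the server, so "finished uploading" is a guess.
        remote.close()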
The fourth interesting thing is that some parts of the protocol are underspecified, which makes a solid implementation a little tricky. Specifically, there is no definition of a maximum packet size, but in practice popular SFTP clients enforce a maximum packet size. So, if you are a server implementor, all you can do is guess what limits the clients expect! If you really want the gory details, check out this bug report.
The fifth interesting thing is the way in which the readdir implementation works. On a UNIX system you have the POSIX readdir API, however that returns only a single directory entry at a time, and you don't want a network round-trip for each file in a directory. So the SFTP readdir call is able to return multiple results in a single response (although exactly how many is underspecified; see the previous interesting thing!).
Now, this design lets you type ls into the SFTP client and only needs a few round-trips to display things. However, people aren't happy with just a file listing, and generally use ls -l to get a long listing with additional details on each file. On a POSIX system you end up calling stat on each filename to get the details, but if you did that over the network you would end up back with a lot of round-trips. So, to optimise for this, rather than readdir returning a list of filenames, it returns a list of (filename, attributes) tuples. This lets you do the ls -l without excessive round-trips.
Now, so far that isn't too interesting. The really interesting thing is that they didn't stop with this optimisation. To implement ls -l you need to take the filename and attributes and format them into a string that looks something like drwxr-xr-x 5 markus markus 1024 Jan 13 18:39 .ssh. There is nothing that really stops an SFTP client from doing this transformation itself, and doing so certainly doesn't require any additional network round-trips. But, for some reason, the SFTP server actually sends this formatted string in the result of readdir! So, readdir returns a list of (filename, longname, attributes) tuples. This is kind of strange.
The only real reason for doing this transformation on the server side is that the client can't perform the transformation from gid / uid into strings (since you can't expect the client and server to share the same mapping). So, there is some justification here. However, readdir isn't the only call you need to implement ls. If the user performs an ls on a specific file, then you need to use stat to determine whether the file exists and, if so, get the attribute information. The SFTP protocol provides a stat call that basically works as you'd expect, returning the attributes for the file. But it does not return the longname! So, now you know why sometimes an ls in the SFTP client will show you user and group names, while other times it just shows numeric identifiers.
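You can see this asymmetry from the client side. As a rough sketch (again using the third-party paramiko library purely for illustration; it's not part of the draft), listdir_attr hands you the server-formatted longname, while stat only gives you the raw attributes, so a long listing of a single file has to be rebuilt client-side from numeric IDs:

# Sketch: assumes `sftp` is an already-connected paramiko.SFTPClient.
for entry in sftp.listdir_attr('.'):
    # The server-formatted "ls -l" style string, straight off the wire.
    print(entry.longname)

attrs = sftp.stat('.ssh')
# stat() carries no longname, only numeric ids and sizes, so the client
# can only display something like the raw uid/gid here.
print(attrs.st_uid, attrs.st_gid, attrs.st_size)
print(getattr(attrs, 'longname', None))  # -> None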
The sixth interesting thing is that despite there being a cd command in the OpenSSH SFTP client, there isn't any chdir equivalent in the protocol. So, cd is a pure client-side illusion. A related interesting thing is that the protocol supports filenames both with and without a leading slash. With a slash means absolute (as you'd expect), but without a slash doesn't mean relative, or at least not relative to any current working directory. Paths without a leading slash are resolved relative to a default directory. Normally this would be the user's home directory, but really it is up to the SSH server.
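To make the illusion concrete, here is a small sketch (using paramiko as just one example client; the method names below are paramiko's, not the protocol's). The client simply remembers a path prefix for you, and the only server-side help it gets is the REALPATH request used to canonicalise paths:

# Sketch: assumes `sftp` is a connected paramiko.SFTPClient.
print(sftp.getcwd())        # None until the client starts faking a cwd
sftp.chdir('uploads')       # emulated: the client just remembers this prefix
print(sftp.getcwd())        # e.g. '/home/markus/uploads'

# SSH_FXP_REALPATH canonicalises a path server-side; a bare relative path
# resolves against the server's default directory (usually the home directory).
print(sftp.normalize('.'))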
The final interesting thing worth commenting on is that even though most of the SFTP interfaces seem to be copies of POSIX interfaces, for rename they chose to specify different semantics to the POSIX rename! Specifically, in SFTP it is an error to rename a file to an existing file, while POSIX rename supports this without an issue. Of course, people wanted their normal POSIX rename semantics, so OpenSSH obliged with an extension: posix-rename. So, OpenSSH has two renames, one with SFTP semantics and one with POSIX semantics. That is not unreasonable; the really fun part is that the OpenSSH SFTP client automatically tries to use the posix-rename semantics, but will happily, silently fall back to the SFTP semantics.
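If you want to be explicit about which semantics you get, you have to ask for them yourself. A hedged sketch (paramiko exposes both operations; the file names here are made up):

# Sketch: assumes `sftp` is a connected paramiko.SFTPClient and new.txt already exists.
try:
    # SFTP-semantics rename: refuses to clobber an existing target.
    sftp.rename('old.txt', 'new.txt')
except IOError as exc:
    print("SFTP rename refused to overwrite:", exc)

# The posix-rename extension (if the server supports it) silently
# replaces the target, just like POSIX rename(2).
sftp.posix_rename('old.txt', 'new.txt')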
Hopefully if you ever need to deal with SFTP at a low level this has given you some pointers of things to look out for!
Yesterday a Policy Hackathon event was launched by the Assistant Minister for Innovation Wyatt Roy. This kind of outreach to the community is a great thing to see from politicians, and I look forward to seeing more examples of it. However, I'm concerned that the way in which this particular event has been framed won't lead to successful outcomes.
The first problem with Policy Hack is that it focuses on the solution space (i.e.: policy ideas) and completely fails to explore the problem space. Ironically, this is the same kind of thinking that leads to the demise of many technology companies who focus on the cool technology instead of the problem their customers need them to solve.
The Policy Hack website has, at the time of writing, received over 60 submissions (of varying quality). All of them are policy suggestions, but few, if any, provide any kind of detail on the specific problem they are trying to solve. In some cases it is implied (e.g. a policy suggestion to lower taxation implies that high taxation is in some way impeding innovation), but even then there is little to convince the reader that the implied problem is actually a problem. (e.g.: where is the evidence that the current level of taxation is actually impeding innovation?)
Before we get into policy ideas we really need to have some kind of problem definition. What is the problem that Policy Hack is trying to solve?
The closest we get to that on the website is "policy ideas designed to foster the growth of innovation industries including tech startups, biotech, agtech, fintech, renewables and resources".
This isn't at all a bad starting point, but it's a very high-level aspirational kind of statement, rather than an actionable definition of the problem. A lot more work needs to be done to fully flesh out the problem before we even start thinking about the solution.
An example of a more actionable problem statement could be: the growth of the tech startups, biotech, agtech, fintech, renewables and resources industries, as measured by percentage of GDP, is too low for Australia's future economic prosperity. Of course this raises further questions. Maybe percentage of GDP isn't the best measure, and we should be using a different metric such as tax revenue, or employment. And the actual terms should be defined a bit more clearly as well; what even is a tech startup? I don't think that is one of the business classifications used by the Bureau of Statistics, which raises even more questions!
Of course, once we have that problem statement, some kind of data would be nice. Like, what is the current growth rate? Off what base? What does it need to be to ensure economic prosperity? If we can't get those numbers directly, what are reasonable proxies for those numbers that we can measure? How reliable are those proxies?
With a problem definition, trying some kind of root cause analysis might lead to specific problems that we can solve with policies. One example could be: there aren't enough people creating startups, because there aren't enough STEM graduates, because high-school students don't choose STEM courses, because they are worried about the HECS/HELP debt. Or it could be: we have plenty of startups but once they get initial traction they move to the US, because there isn't enough capital available in Australia, because investors are taxed too heavily on capital gains. I'm sure there are hundreds of different chains of thought like this that should be explored.
Of course, those are just hypotheses! We should be applying some kind of scientific method here. Do startups really move to the US once they get success? Maybe they don't. Do they move because of funding reasons, or maybe it is access to customers? Is capital not available here because of capital gains tax, or maybe it is because other asset classes provide better returns?
None of that is really easy, and certainly can't be done collaboratively during a hackathon (despite the excellent opening up of government data sources to make the underlying data available). This needs patient analysis, by diligent people, spending time to find the right data sources.
Without access to this information how could you possibly come up with a useful policy suggestion? Of course, once this work has been done the actual policies probably become somewhat obvious!
Ideally this is work that has already been done, either within government policy units, or external consultants. If so, it should be directly available on the Policy Hack website, and it should be required reading for anyone wanting to put forward a policy idea.
The second problem that I see with the Policy Hack event, or at least some of the suggestions put forward to date, is that many of them are simply reinventing the wheel. (Which, ironically, is the cause of the downfall of many a startup!)
Many of the suggestions currently put forward are basically the same as policies that already exist in some shape or form. This isn't meant to be a criticism of the person putting forward the idea; it isn't exactly simple to find out what policies are currently out there. But surely, if we are going to have an event that looks at putting forward new policy ideas on innovation, we should at the very least have a list of the current policies the government has in place with the aim of fostering innovation! Ideally, we would also have an analysis of each policy with some commentary on what works well about it, what doesn't work so well, and some form of cost-benefit analysis.
With this information in place we may find that a simple tweak to an existing policy could have a large impact. Or we could find ineffective policies that can be scrapped to make funds available for new policies (because any new policies will need to be funded somehow, right?). Or we could find that we already have policies addressing many of the things that we think are problems, and we need to go back to the root cause analysis to determine other possible things that need addressing.
Again, maybe this information is available somewhere, but it should be made available to participants of Policy Hack in advance. And, if you are going to propose a new policy that is similar in some way to an existing policy, then you should explain what's different about your proposal.
For example, if you put forward an idea such as "allow special visas to hire technical immigrants", without any reference at all to the current 457 visa scheme, you're really just wasting everybody's time!
Finally, there are a lot of "Country X does Y, let's do Y" policy suggestions without any critical analysis of why Y works for X (or even if Y works for X!). (And to continue the theme, this kind of "me too" thinking seems to plague startups as well.)
Ideally an input to the Policy Hack should be a comprehensive review of other innovation policies that have been put in place by other jurisdictions. This review should explain how these have, or have not, been effective in the jurisdictions in which they have been implemented, along with some commentary as to how they compare to similar policies in Australia. For example, holding up the US low capital gains tax and long-term dividend tax rate without explaining that the US has no mechanism for relief from double taxation at the corporate and individual level is not providing a fair evaluation.
And again, if you are going to put forward policy ideas based on ideas from other jurisdictions, you had better actually reference that research and justify why the policy would work in Australia. For example, if you think that Australia should have lower corporate tax rates because Ireland has lower corporate tax rates, you need to at least give some kind of evidence that this is actually working for Ireland.
If we want to have an effective discussion on policy, we first need to agree on the problem we are solving, and have a clear understanding of what has been tried before both locally and overseas.
Instead of policies, maybe the Policy Hack could instead work through the problem space. A morning session that presented a summary of all existing data to the participants, followed by an afternoon brainstorming session on identifying possible reasons why we don't see the levels of innovation that we'd like. With this information in hand the policy experts could perform the relevant research to determine which of the identified reasons actually has an evidence base to support it. The results of this could feed in to a future hackathon that starts to look at policy from an informed base.
If you are doing any kind of analysis of a Linux-based system you quickly come to the point of needing to collect statistics on a per-process basis. However, one difficulty here is how to uniquely identify a process.
The simplest answer is to just use the process identifier (PID); unfortunately this is not a good answer in the face of PID reuse. The total number of PIDs in the system is finite, and once exhausted the kernel will start re-using PIDs that have previously been allocated to different processes.
Now you might expect that this reuse would be rare, or occur over a relatively long period, but it can happen quickly. On a system under load I've seen PID reuse in under 5 seconds.
The practical consequence of this is that if you are, for example, collecting statistics about a process via any interface that identifies processes by PID, you need to be careful to ensure you are collecting the right statistics! For example, if you are collecting statistics about a process identified by PID 37, you might read /proc/37/stat at one instant and receive valid data, but at any later time /proc/37/stat could return data about a completely different process.
Thankfully, the kernel associates another useful piece of information with a process: its start time. The combination of PID and start time provides a reasonably robust way of uniquely identifying processes over the lifetime of a system. (For the pedantic: if a process can be created and correctly reaped all within the granularity of the clock, then it would be theoretically possible for multiple different processes to have existed in the system with the same PID and start time, but that is unlikely to be a problem in practice.)
The start time is one of the fields present in the /proc/<pid>/stat information, so this can be used to ensure you are correctly matching up statistics.
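A rough sketch of what that looks like in practice (the helper name is mine; field 22 of /proc/<pid>/stat is the process start time in clock ticks since boot):

def process_identity(pid):
    with open('/proc/{}/stat'.format(pid)) as f:
        stat = f.read()
    # The second field (comm) can contain spaces, so split after the
    # closing parenthesis rather than naively on whitespace.
    rest = stat.rsplit(')', 1)[1].split()
    starttime = int(rest[19])  # field 22 overall; index 19 once pid and comm are stripped
    return (pid, starttime)

# If process_identity(37) changes between samples, PID 37 has been recycled
# and the statistics now belong to a different process.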
PEP-0492 was recently approved, giving Python 3.5 some special syntax for dealing with co-routines. A lot of the new functionality was available pre-3.5, but the syntax certainly wasn't ideal, as the concepts of generators and co-routines were kind of intermingled. PEP-0492 makes an explicit distinction between generators and co-routines through the use of the async keyword.
This post aims to describe how these new mechanisms works from a rather low-level. If you are mostly interested in just using this functionality for high-level stuff I recommend skipping this post and reading up on the built-in asyncio module. If you are interested in how these low-level concepts can be used to build up your own version of the asyncio module, then you might find this interesting.
For this post we’re going to totally ignore any asynchronous I/O aspect and just limit things to interleaving progress from multiple co-routines. Here are two very simple functions:
def coro1():
    print("C1: Start")
    print("C1: Stop")


def coro2():
    print("C2: Start")
    print("C2: a")
    print("C2: b")
    print("C2: c")
    print("C2: Stop")
We start with two very simple functions, coro1 and coro2. We could call these functions one after the other:
coro1()
coro2()
and we’d get the expected output:
C1: Start
C1: Stop
C2: Start
C2: a
C2: b
C2: c
C2: Stop
But, for some reason, rather than running these one after the other, we’d like to interleave the execution. We can’t just do that with normal functions, so let’s turn these into co-routines:
async def coro1():
    print("C1: Start")
    print("C1: Stop")


async def coro2():
    print("C2: Start")
    print("C2: a")
    print("C2: b")
    print("C2: c")
    print("C2: Stop")
Through the magic of the new async keyword these are no longer plain functions; they are now co-routines (or, more specifically, native co-routine functions). When you call a normal function, the function body is executed, however when you call a co-routine function the body isn't executed; instead you get back a co-routine object:
c1 = coro1()
c2 = coro2()
print(c1, c2)
gives:
<coroutine object coro1 at 0x10ea60990> <coroutine object coro2 at 0x10ea60a40>
(The interpreter will also print some runtime warnings that we'll ignore for now.)
So, what good is having a co-routine object? How do we actually execute the thing? Well, one way to execute a co-routine is through an await expression (using the new await keyword). You might think you could do something like:
await c1
but you'd be disappointed. An await expression is only valid syntax when contained within a native co-routine function. You could do something like:
async def main():
    await c1
but of course then you are left with the problem of how to force the execution of main!
The trick to realise is that co-routines are actually pretty similar to Python generators, and have the same send method. We can kick off execution of a co-routine by calling the send method:
c1.send(None)
This gets our first co-routine executing to completion, however we also get a nasty StopIteration exception:
C1: Start
C1: Stop
Traceback (most recent call last):
  File "test3.py", line 16, in <module>
    c1.send(None)
StopIteration
The StopIteration exception is the mechanism used to indicate that a generator (or, in this case, co-routine) has completed execution. Despite being an exception it is actually quite expected! We can wrap this in an appropriate try/except block to avoid the error condition. At the same time, let's start the execution of our second co-routine:
try:
    c1.send(None)
except StopIteration:
    pass

try:
    c2.send(None)
except StopIteration:
    pass
Now we get complete output, but it is disappointingly similar to our original output. So we have a bunch more code, but no actual interleaving yet! Co-routines are not dissimilar to threads in that they allow interleaving of multiple distinct threads of control, however unlike threads, with co-routines any switching is explicit rather than implicit (which is, in many cases, a good thing!). So we need to put in some of these explicit switches.
Normally the send method on generators will execute until the generator yields a value (using the yield keyword), so you might think we could change coro1 to something like:
async def coro1():
    print("C1: Start")
    yield
    print("C1: Stop")
but we can't use yield inside a co-routine. Instead we use the new await expression, which suspends execution of the co-routine until the awaitable completes. So we need something like await _something_; the question is, what is the something in this case? We can't just await on nothing! The PEP explains which things are awaitable. One of them is another native co-routine, but that doesn't help get to the bottom of things. Another is an object defined with a special CPython API, but we want to avoid extension modules and stick to pure Python right now. That leaves two options: either a generator-based co-routine object or a special Future-like object.
So, let's go with the generator-based co-routine object to start with. Basically a Python generator (e.g.: something that has a yield in it) can be marked as a co-routine through the types.coroutine decorator. So a very simple example of this would be:
import types

@types.coroutine
def switch():
    yield
This defines a generator-based co-routine function. To get a generator-based co-routine object we just call the function. So, we can change our coro1 co-routine to:
async def coro1(): print("C1: Start") await switch() print("C1: Stop")
With this in place, we hope that we can interleave the execution of coro1 with the execution of coro2. If we try it with our existing code we get the output:
C1: Start
C2: Start
C2: a
C2: b
C2: c
C2: Stop
We can see that, as expected, coro1 stopped executing after the first print statement, and then coro2 was able to execute. In fact, we can look at the co-routine object and see exactly where it is suspended with some code like this:
print("c1 suspended at: {}:{}".format(c1.gi_frame.f_code.co_filename, c1.gi_frame.f_lineno))
which prints out the line of your await expression. (Note: this gives you the outer-most await, so it is mostly just for explanatory purposes here, and not particularly useful in the general case.)
OK, the question now is: how can we resume coro1 so that it executes to completion? We can just use send again. So we end up with some code like:
try:
    c1.send(None)
except StopIteration:
    pass

try:
    c2.send(None)
except StopIteration:
    pass

try:
    c1.send(None)
except StopIteration:
    pass
which then gives us our expected output:
C1: Start
C2: Start
C2: a
C2: b
C2: c
C2: Stop
C1: Stop
So, at this point we're manually pushing the co-routines through to completion by explicitly calling send on each individual co-routine object. This isn't going to work in general. What we'd really like is a function that keeps executing all our co-routines until they have all completed. In other words, we want to continually call send on each co-routine object until that method raises the StopIteration exception.
So, let's create a function that takes in a list of co-routines and executes them until completion. We'll call this function run.
def run(coros):
    coros = list(coros)
    while coros:
        # Duplicate list for iteration so we can remove from original list.
        for coro in list(coros):
            try:
                coro.send(None)
            except StopIteration:
                coros.remove(coro)
This picks each co-routine from the list in turn, executes it, and then, if a StopIteration exception is raised, removes that co-routine from the list.
We can then remove the code manually calling the send method and instead do something like:
c1 = coro1()
c2 = coro2()

run([c1, c2])
And now we have a very simple run-time for executing co-routines using the new await and async features in Python 3.5. Code related to this post is available on github.
The instructions in the Android first-app tutorial (as of 20150408) are kind of confusing. I think something got lost in translation with the adding of Android Studio to the mix.
Both the creating a project and running an app have instructions for using the command line, rather than the IDE (which is great, because I'm curmudgeonly like that). Unfortunately, if you use the command line options for creating a project, you end up with an Ant based build system set up, and not a Gradle based build system. Which conveniently means that the command line instructions for running an app make no sense at all.
So, since you probably want a Gradle-based build, I suggest using the IDE once, to create the project, and then following the instructions for running an app from the command line.
PS: Make sure you are connected to the interwebs the first time you run gradlew, as it needs to download a bunch of stuff before it can start.
For short-running scripts it doesn't really matter if your Python module is leaking memory; it all gets cleaned up on process exit anyway. However, if you have a long-running application you really don't want an ever-increasing memory footprint. So, it is helpful to check that your module isn't leaking memory, and to plug the leaks if they occur. This post provides some ideas on how to do that, and describes some of the implementation of my check_leaks module.
If you suspect a given function foo of leaking memory, then the simplest approach to checking is to get the set of objects that exist before running the function and the set of objects that exist after running the function; any object in after that isn't in before is a leaked object.
In Python the garbage collector tracks (practically) all the objects that exist, and the gc module provides us with access to the internals of the garbage collector. In particular the get_objects method will provide a list of all the objects tracked by the garbage collector. So at a really simple level we can do:
before_objects = gc.get_objects()
foo()
after_objects = gc.get_objects()
It is then a simple matter of working out which objects are in after_objects but not in before_objects, and those are our leaking objects! One catch, however, is that we don't know when the garbage collector last ran, so with the above code it is possible that there are objects in after_objects that could still be cleaned up by the garbage collector. Thankfully, the gc module provides the collect method, which allows us to force a garbage collection. So we end up with something like:
before_objects = gc.get_objects()
foo()
gc.collect()
after_objects = gc.get_objects()
The other small gotcha in this case is that the before_objects list will itself end up as an object in after_objects. This isn't really a leak, so we can ignore it when displaying the leaking objects.
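Working out the difference is then just a matter of comparing object identities (a set of ids, since not every object is hashable). A minimal sketch:

before_ids = {id(obj) for obj in before_objects}
leaked = [obj for obj in after_objects
          if id(obj) not in before_ids and obj is not before_objects]
for obj in leaked:
    print(type(obj), repr(obj)[:80])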
So far, this can give us a list of leaked objects, but it doesn't really help resolve the issue except in the simplest cases. We need more information to really fix the issue. One thing that can help is knowing why an object hasn't been collected yet, which comes down to knowing which other objects still hold a reference to it. Additionally, it can be very useful to know where the object was originally allocated. Python gives us some tools to help with both of these.
The gc module provides a get_referrers method, which provides a list of all objects that refer to a given object. This can be used to print out which objects refer to the leaked object. Unfortunately this is less useful than you might think, because in most cases the referring object is a dict object, as most class instances and modules store references in __dict__ rather than directly. With some more processing it would probably be possible to work out the purpose of each dict and provide more semantically useful information.
To work out where an object was originally allocated we can use the new tracemalloc module. When enabled, the tracemalloc module will store a traceback for each memory allocation within Python. The get_object_traceback() method will return a traceback (or None) for a given object. This makes debugging leaks significantly easier than just guessing where an object was allocated.
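A rough sketch of how this slots into the earlier example (reusing the hypothetical leaking function foo and the leaked list from above; tracing must be enabled before the allocations happen):

import tracemalloc

tracemalloc.start(25)   # keep up to 25 frames per allocation

# ... run foo(), gc.collect() and compute `leaked` as above ...
for obj in leaked:
    tb = tracemalloc.get_object_traceback(obj)
    if tb is not None:
        print(type(obj))
        print('\n'.join(tb.format()))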
Unfortunately, tracemalloc can sometimes provide misleading information when trying to detect leaks. Many of the Python built-in types optimise memory allocation by keeping an internal free list. When freeing an object, rather than calling the normal free routine, the object is placed on this free list, and a new object allocation can grab an object directly from the free list rather than needing to go through the normal memory manager routines. This means that when using get_object_traceback(), the trace returned will be from the original time the memory was allocated, which may be for a long-dead object. I think it would be good if the Python runtime provided a way to selectively disable this free-list approach to allow precise use of the get_object_traceback() method. I have an initial branch that does this for some objects up on github.
Android's uiautomator dump command takes a snapshot of the currently active window state and creates an XML representation of the view hierarchy. This is extremely useful when it comes to automating interactions with the device and general debugging. One handy aspect of this is that it can even introspect Android's WebView, so you get an idea of where text boxes and buttons are. Or at least that was the case pre-Lollipop.
In Lollipop this doesn't quite work anymore. When you run uiautomator dump in Lollipop, you just get one big WebView item in the view hierarchy, but no children! (And I'm not the only one to have this issue.)
So what’s going on?
The first thing to realise is that uiautomator operates through the accessibility APIs. Essentially it is enabled as an accessibility manager in the system (which can actually be a problem if you have another accessibility manager enabled).
It's actually kind of tricky to understand exactly what is going on, but reading through the Chrome source it seems that population of the virtual view hierarchy is only enabled if an accessibility manager is enabled at the time the page is rendered. The UiAutomator sub-system is enabled as an accessibility manager when you run a uiautomator command, but in Lollipop at least, it is then disabled when the uiautomator process exits! I'm not 100% sure why this worked pre-Lollipop. I think it is because pre-Lollipop the UiAutomator service stayed registered even after the command exited, so any WebViews rendered after the first execution of uiautomator would correctly provide the virtual view hierarchy.
A work-around is to enable an accessibility service (like TalkBack), load the view you are interested in, disable the accessibility service, and then run uiautomator dump. Sometimes in this case you will get the full dump, and other times not; I think that must be due to the page being re-rendered or something similar.
If you want to get the full view hierarchy, the only real way to go about it is to write a UI test that is run by uiautomator; as long as uiautomator is running you should be able to extract the full hierarchy.
I haven’t really written any Haskell in a serious manner since first-year computing which doesn’t seem that long ago, but actually is.
Recently I've tried to pick it up again, which has been challenging to be honest. Haskell is much more widely used these days and can do a lot more. At the same time there is a lot more to learn, and the language seems to be an awful lot bigger now than it did 16 years ago. My overall goal right now is to rewrite some Python code that performs abstract interpretation in Haskell, but I'm taking baby steps to start with.
One of the biggest things with Haskell (as compared to Python) is that it really forces you to know exactly what you are trying to do before writing code. This is a lot different to how I usually start writing code. The way I usually design code is by trying to write some code that does some small part of what I'm trying to achieve, and then iterating from that. For me at least this is easier to do in Python than in Haskell, but that is most likely a property of my familiarity with the language rather than any inherent property of either language. In any case, for me, doing an initial sketch in Python and then converting that to Haskell seems to be a useful approach.
The first module I've written on this journey is FiniteType.hs, which provides a type-class FiniteType with a single method cardinality. To be honest I was surprised that this wasn't some kind of built-in! At the end of the day the module is fairly simple, but along the way there was a lot to learn: language extensions (in general), the DefaultSignatures extension in particular, and Data.Proxy.
In the end I’m fairly happy with the solution I came up with, but there is a certain amount of code duplication in the implementation that I’m still not happy about. I think the only fix for this is template-haskell, but that seems overkill for such a simple module.
There are still some things that I'm not sure about. In particular, when writing a type-class such as `FiniteType`, should the module create instances of the type-class, or should that be left up to the user of the module?
Code is available on github. I’d welcome any comments or suggestions for improvements.
This is a follow-up to yesterday's post about namedtuples.
Yesterday I mostly focussed on the performance of accessing attributes on a named tuple object, and the namedfields decorator approach that I showed ended up with the same performance as the standard library namedtuple. One operation that I didn't consider, but which is actually reasonably common, is the creation of a new object.
My implementation relied on a generic __new__ that used the underlying _fields to work out the actual arguments to pass to the tuple.__new__ constructor:
def __new__(_cls, *args, **kwargs):
    if len(args) > len(_cls._fields):
        raise TypeError("__new__ takes {} positional arguments but {} were given".format(len(_cls._fields) + 1, len(args) + 1))
    missing_args = tuple(fld for fld in _cls._fields[len(args):] if fld not in kwargs)
    if len(missing_args):
        raise TypeError("__new__ missing {} required positional arguments".format(len(missing_args)))
    extra_args = tuple(kwargs.pop(fld) for fld in _cls._fields[len(args):] if fld in kwargs)
    if len(kwargs) > 0:
        raise TypeError("__new__ got an unexpected keyword argument '{}'".format(list(kwargs.keys())[0]))
    return tuple.__new__(_cls, tuple(args + extra_args))
This seems to work (in my limited testing), but the code is pretty nasty (I'm far from confident that it is correct), and it is also slow: about 10x slower than a class created with the namedtuple factory function, which is just:
def __new__(_cls, bar, baz):
    'Create new instance of Foo2(bar, baz)'
    return _tuple.__new__(_cls, (bar, baz))
As a result of this finding, I've changed my constructor approach, and now generate a custom constructor for each new class using eval. It looks something like:
str_fields = ", ".join(fields)
new_method = eval("lambda cls, {}: tuple.__new__(cls, ({}))".format(str_fields, str_fields), {}, {})
With this change constructor performance is on par with the namedtuple approach, and I'm much more confident that the code is actually correct!
I've cleaned up the namedfields code a little, and made it available as part of my pyutil repo.
Python's namedtuple factory function is, perhaps, the most under-utilised feature in Python. Classes created using namedtuple are great because they lead to immutable objects (as compared to normal classes, which lead to mutable objects). The goal of this blog post isn't to convince you that immutable objects are a good thing, but to take a bit of a deep dive into the namedtuple construct and explore some of its shortcomings and some of the ways in which it can be improved.
NOTE: All the code here is available in this gist.
As a quick primer, let’s have a look at a very simple example:
from collections import namedtuple

Foo = namedtuple('Foo', 'bar baz')
foo = Foo(5, 10)
print(foo)
NOTE: All my examples are targeting Python 3.3. They will not necessarily work in earlier versions; in particular, at least some of the later ones will not work in Python 2.7.
For simple things like this example, the existing namedtuple works pretty well, however the repetition of Foo is a bit of a code smell. Also, even though Foo is a class, it is created significantly differently to a normal class (which I guess could be considered an advantage).
Let's try and do something a little more complex. If we find ourselves needing to know the product of bar and baz a bunch of times, we probably want to encapsulate that rather than writing foo.bar * foo.baz everywhere. (OK, this example isn't the greatest in the history of technical writing, but just bear with me on this.) So, as a first attempt we can just write a function:
def foo_bar_times_baz(foo):
    return foo.bar * foo.baz

print(foo_bar_times_baz(foo))
There isn't anything wrong with this, functions are great! But the normal pattern in Python is that when we have a function that does stuff with a class instance, we make it a method rather than a function. We could debate the aesthetic and practical tradeoffs between methods and functions, but let's just go with the norm here and assume we'd prefer to do foo.bar_times_baz().
The canonical approach to this is:
class Foo(namedtuple('Foo', 'bar baz')):
    def bar_times_baz(self):
        return self.bar * self.baz

foo = Foo(5, 10)
print(foo.bar_times_baz())
Now this works but, again, we have the repetition of Foo, which bugs me. Coupled with this, we are calling a function inside the class definition, which is a bit ugly. There are also some minor practical concerns, because there are now multiple Foo classes in existence, which can at times cause confusion. This can be avoided by appending an underscore to the name of the super-class, e.g.: class Foo(namedtuple('Foo_', 'bar baz')).
An alternative to this, which I've used from time to time, is to directly update the Foo class:
Foo = namedtuple('Foo', 'bar baz')
Foo.bar_times_baz = lambda self: self.bar * self.baz

foo = Foo(5, 10)
print(foo.bar_times_baz())
I quite like this, and think it works well for methods which can be written as an expression, but not all Python programmers are particularly fond of lambdas, and although I know that classes are mutable, I prefer to consider them as immutable, so modifying them after creation is also a bit ugly.
The way that the namedtuple function works is by creating a string that matches the class definition and then using exec on that string to create the new class. You can see the string version of that class definition by passing verbose=True to the namedtuple function. For our Foo class it looks a bit like:
class Foo(tuple):
    'Foo(bar, baz)'

    __slots__ = ()

    _fields = ('bar', 'baz')

    def __new__(_cls, bar, baz):
        'Create new instance of Foo(bar, baz)'
        return _tuple.__new__(_cls, (bar, baz))

    ....

    def __repr__(self):
        'Return a nicely formatted representation string'
        return self.__class__.__name__ + '(bar=%r, baz=%r)' % self

    .....

    bar = _property(_itemgetter(0), doc='Alias for field number 0')

    baz = _property(_itemgetter(1), doc='Alias for field number 1')
I've omitted some of the method implementations for brevity, but you get the general idea. If you have a look at that you might have the same thought I did: why isn't this just a sub-class? It seems practical to construct:
class NamedTuple(tuple):
    __slots__ = ()

    _fields = None  # Subclass must provide this

    def __new__(_cls, *args):
        # Error checking, and **kwargs handling omitted for brevity
        return tuple.__new__(_cls, tuple(args))

    def __repr__(self):
        'Return a nicely formatted representation string'
        fmt = '(' + ', '.join('%s=%%r' % x for x in self._fields) + ')'
        return self.__class__.__name__ + fmt % self

    def __getattr__(self, field):
        try:
            idx = self._fields.index(field)
        except ValueError:
            raise AttributeError("'{}' NamedTuple has no attribute '{}'".format(self.__class__.__name__, field))
        return self[idx]
Again, some method implementations are omitted for brevity (see the gist for the gory details). The main difference to the generated class is making use of _fields consistently rather than hard-coding some things, and the necessity of a __getattr__ method rather than hard-coded properties.
This can be used something like this:
class Foo(NamedTuple):
    _fields = ('bar', 'baz')
This fits the normal pattern for creating classes much better than a constructor function. It is two lines rather than the slightly more concise one-liner, but we don't need to duplicate the Foo name, so that is a win. In addition, when we want to start adding methods, we know exactly where they go.
This isn't without some drawbacks. The biggest problem is that we've just inadvertently made our subclass mutable! Which is certainly not ideal. This can be rectified by adding a __slots__ = () in our sub-class:
class Foo(NamedTuple):
    __slots__ = ()
    _fields = ('bar', 'baz')
At this point this approach no longer looks so good. Another minor drawback is that there is no checking that the field names are really valid in any way, which the namedtuple approach does correctly. The final drawback is on the performance front. We can measure the attribute access time:
from timeit import timeit

RUNS = 1000000

direct_idx_time = timeit('foo[0]', setup='from __main__ import foo', number=RUNS)
direct_attr_time = timeit('foo.bar', setup='from __main__ import foo', number=RUNS)
sub_class_idx_time = timeit('foo[0]', setup='from __main__ import foo_sub_class as foo', number=RUNS)
sub_class_attr_time = timeit('foo.bar', setup='from __main__ import foo_sub_class as foo', number=RUNS)

print(direct_idx_time)
print(direct_attr_time)
print(sub_class_idx_time)
print(sub_class_attr_time)
The results are that accessing a value via direct indexing is the same for both approaches, however when accessing via attributes the sub-class approach is 10x slower than the original namedtuple approach (which was itself about 3x slower than direct indexing).
Just for kicks, I wanted to see how far I could take this approach. So, I created a little optimize function, which can be applied as a decorator on a namedtuple class:
def optimize(cls):
    for idx, fld in enumerate(cls._fields):
        setattr(cls, fld, property(itemgetter(idx), doc='Alias for field number {}'.format(idx)))
    return cls

@optimize
class FooSubClassOptimized(NamedTuple):
    __slots__ = ()
    _fields = ('bar', 'baz')

    def bar_times_baz(self):
        return self.bar * self.baz
The optimize decorator goes through and adds a property to the class for each of its fields. This means that when an attribute is accessed, it can short-cut through the property rather than using the relatively slow __getattr__ method. This results in a considerable speed-up, however performance was still about 1.2x slower than the namedtuple approach.
If anyone can explain why this takes a performance hit I’d appreciate it, because I certainly can’t work it out!
So, at this point I thought my quest for a better namedtuple had come to an end, but I thought it might be worth pursuing the decorator approach some more. Rather than modifying the existing class, we could instead use it as a template and return a new class. After some iteration I came up with a class decorator called namedfields, which can be used like:
@namedfields('bar', 'baz')
class Foo(tuple):
    pass
This approach is slightly more verbose than sub-classing or the namedtuple factory, however I think it is fairly clear, and it doesn't include any redundancy. The decorator constructs a new class based on the input class: it adds the properties for all the fields, ensures __slots__ is correctly added to the class, and attaches all the useful NamedTuple methods (such as _replace and __repr__) to the new class.
Classes constructed this way have identical attribute-access performance to namedtuple factory classes.
The namedfields function is:
def namedfields(*fields):
    def inner(cls):
        if not issubclass(cls, tuple):
            raise TypeError("namefields decorated classes must be subclass of tuple")
        attrs = {
            '__slots__': (),
        }
        methods = ['__new__', '_make', '__repr__', '_asdict', '__dict__',
                   '_replace', '__getnewargs__', '__getstate__']
        attrs.update({attr: getattr(NamedTuple, attr) for attr in methods})
        attrs['_fields'] = fields
        attrs.update({fld: property(itemgetter(idx), doc='Alias for field number {}'.format(idx))
                      for idx, fld in enumerate(fields)})
        attrs.update({key: val for key, val in cls.__dict__.items()
                      if key not in ('__weakref__', '__dict__')})
        return type(cls.__name__, cls.__bases__, attrs)
    return inner
So, is the namedfields decorator better than the namedtuple factory function? Performance-wise, things are on par for the common operation of accessing attributes. I think from an aesthetic point of view things are also on par for simple classes:
Foo = namedtuple('Foo', 'bar baz')

# compared to

@namedfields('bar', 'baz')
class Foo(tuple):
    pass
namedtuple wins if you prefer a one-liner (although if you are truly perverse you can always do Foo = namedfields('bar', 'baz')(type('Foo', (tuple, ), {}))). If you don't really care about the one vs. two lines, then it is a pretty line-ball call. However, when we compare what is required to add a simple method, the decorator becomes a clear winner:
class Foo(namedtuple('Foo', 'bar baz')):
    def bar_times_baz(self):
        return self.bar * self.baz

# compared to

@namedfields('bar', 'baz')
class Foo(tuple):
    def bar_times_baz(self):
        return self.bar * self.baz
There is an extra line for the decorator in the namedfields approach, however this is a lot clearer than forcing it in as a super-class.
Finally, I wouldn't recommend using this right now, because bringing in a new dependency for something that provides only marginal gains over the built-in functionality isn't really worth it, and this code is really only proof-of-concept at this point in time.
I will be following up with some extensions to this approach that cover off some of my other pet hates with the namedtuple approach: custom constructors and subclasses.