<?xml version="1.0"?>
<rss version="2.0">
<channel>
 <title>Benno's Blog</title>
 <link>http://benno.id.au/blog/</link>
 <description>Systems design and other random tech stuff</description>
 <language>en-au</language>
 <copyright>Ben Leslie 2006-09</copyright>
 <pubDate></pubDate>
 <lastBuildDate></lastBuildDate>
 <docs>http://blogs.law.harvard.edu/tech/rss</docs>
 <generator>Benno's Magical Content Generator</generator>
 <ttl>120</ttl>
 <webMaster>benno.blog@benno.id.au</webMaster>
  <item>
    <title>Interesting corners of SFTP</title>
    <link>http://benno.id.au/blog/2017/01/11/sftp</link>
    <guid>http://benno.id.au/blog/2017/01/11/sftp</guid>
    <pubDate>Wed, 11 Jan 2017 20:49:12 +0000</pubDate>
    <description>&lt;p&gt;January 11, 2017 is an auspicious day in tech history. No, I'm not talking about
the 10 year anniversary of the iPhone announcement, I'm talking about something with
much more impact: today marks a decade since the most recent version
(13) of the &lt;a href="https://filezilla-project.org/specs/draft-ietf-secsh-filexfer-13.txt"&gt;SFTP internet draft&lt;/a&gt; expired. So, I thought
it would be a good opportunity to rant a bit about some of the more
interesting corners of the protocol.&lt;/p&gt;

&lt;p&gt;The first interesting thing is that even though there are 14
versions of the internet draft (the first being version zero), the protocol
version used by most (all?) major implementations (OpenSSH in particular)
is version 3 (which of course is specified in &lt;a href="https://filezilla-project.org/specs/draft-ietf-secsh-filexfer-02.txt"&gt;specification version 2&lt;/a&gt;,
because while the specification numbering is zero-based, the protocol version numbering is 1-based). Specification version 2 expired on April Fools' Day 2002.
&lt;/p&gt;

&lt;p&gt;The second interesting thing is that this never reached the status
of RFC.  I can't claim to understand why, but for a commonly used tool
it is relatively surprising that the best we have in terms of written
specification is an expired internet draft. Certainly a step up from
no documentation at all, but still surprising.&lt;/p&gt;

&lt;p&gt;The third interesting thing is that it isn't really a &lt;em&gt;file
transfer&lt;/em&gt; protocol.  It certainly lets you do that, but it's really
much closer to a remote file system protocol. The protocol doesn't
expose an &lt;em&gt;upload / download&lt;/em&gt; style interface, but rather an
&lt;em&gt;open / close / read / write&lt;/em&gt; style interface. In fact most of
the SFTP interfaces are really small wrappers around the POSIX APIs
(with some subtle changes just to keep everyone on their toes!). One
consequence of this style of API is that there isn't any kind of atomic
upload operation; there is no way for the server to really know when a
client has finished uploading a file. (Which is unfortunate if you want to, say,
trigger some processing after a file is uploaded.)&lt;/p&gt;
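&lt;p&gt;To make this concrete, here is a rough pure-Python sketch of how an upload decomposes into individual operations. (&lt;code&gt;FakeSftpServer&lt;/code&gt; and &lt;code&gt;upload&lt;/code&gt; are invented for illustration; they are not part of any real SFTP library.)&lt;/p&gt;

```python
import io

class FakeSftpServer:
    # Invented stand-in for the server side of the protocol.
    def __init__(self):
        self.files = {}

    def open(self, path, mode):
        self.files[path] = bytearray()
        return path  # use the path itself as the "handle"

    def write(self, handle, offset, data):
        # For simplicity this sketch assumes sequential writes.
        assert offset == len(self.files[handle])
        self.files[handle].extend(data)

    def close(self, handle):
        # Nothing here marks the upload as "complete": a close is the
        # only hint the server gets, and a client may close a handle
        # for other reasons (e.g. pausing a transfer to resume later).
        pass

def upload(server, src, path, chunk=32768):
    # A client-side upload is just an open, a series of writes, then a close.
    handle = server.open(path, "w")
    offset = 0
    while True:
        data = src.read(chunk)
        if not data:
            break
        server.write(handle, offset, data)
        offset += len(data)
    server.close(handle)
```

&lt;p&gt;The 32768-byte chunk size here is just the sort of conservative value a client might pick, given the packet-size ambiguity discussed next.&lt;/p&gt;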

&lt;p&gt;The fourth interesting thing is that some parts of the protocol are
underspecified, which makes solid implementation a little
tricky. Specifically, there is no definition of a maximum packet size, but
in practice popular SFTP clients do implement a maximum packet size. So, if you
are a server implementer all you can do is guess! If you really
want the gory details check out this &lt;a href="https://github.com/ronf/asyncssh/issues/73"&gt;bug report&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The fifth interesting thing is the way in which the &lt;code&gt;readdir&lt;/code&gt; implementation works.
Now, on a UNIX system you have the POSIX &lt;a href="http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html"&gt;readdir&lt;/a&gt; API, however that returns only a single file at a time. You don't want a network round trip for each file in a directory. So the SFTP &lt;code&gt;readdir&lt;/code&gt; is able to return multiple results in a single call (although exactly how many is underspecified; see the previous interesting thing!).&lt;/p&gt;

&lt;p&gt;Now, this implementation lets you type &lt;code&gt;ls&lt;/code&gt; into the SFTP client and you only need a few round-trips to display things. However, people aren't happy with just a file listing, and generally use &lt;code&gt;ls -l&lt;/code&gt; to get a long listing with additional details on each file. On a POSIX system you end up calling &lt;a href="http://pubs.opengroup.org/onlinepubs/009695399/functions/stat.html"&gt;stat&lt;/a&gt; on each filename to get the details, but if you keep this design over the network then you are going to end up back with a lot of round-trips. So, to help optimize for this, rather than &lt;code&gt;readdir&lt;/code&gt; returning a list of filenames, it returns a list of &lt;code&gt;(filename, attribute)&lt;/code&gt; tuples. This lets you do the &lt;code&gt;ls -l&lt;/code&gt; without excessive round-trips.&lt;/p&gt;

&lt;p&gt;Now, so far that isn't interesting. The really &lt;em&gt;interesting&lt;/em&gt; thing here is that they didn't stop with this optimisation. To implement &lt;code&gt;ls -l&lt;/code&gt; you need to take the filename and attributes and then format them into a string that looks something like: &lt;code&gt;drwxr-xr-x    5 markus   markus       1024 Jan 13 18:39 .ssh&lt;/code&gt;. Now, there is not really anything that stops an SFTP client from doing this transformation itself. Doing so certainly doesn't require any additional network round trips. But, for some reason, the SFTP server actually sends this formatted string in the result of &lt;code&gt;readdir&lt;/code&gt;! So, &lt;code&gt;readdir&lt;/code&gt; returns a list of &lt;code&gt;(filename, longname, attribute)&lt;/code&gt; tuples. This is kind of strange.&lt;/p&gt;
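&lt;p&gt;As a sketch of the work a server does here, this hypothetical formatter builds an &lt;code&gt;ls -l&lt;/code&gt; style line from a filename and attributes. (The function and its field widths are purely illustrative; OpenSSH has its own exact layout.)&lt;/p&gt;

```python
import stat

def format_longname(filename, mode, nlink, owner, group, size, mtime):
    # Build an ls -l style line of the kind servers return in the
    # SFTP longname field. Field widths here are illustrative only.
    kind = "d" if stat.S_ISDIR(mode) else "-"
    perms = ""
    for shift in (6, 3, 0):
        bits = (mode >> shift) % 8  # one octal permission digit
        perms += "r" if bits >= 4 else "-"
        perms += "w" if bits % 4 >= 2 else "-"
        perms += "x" if bits % 2 == 1 else "-"
    return "{}{} {:>4} {:8} {:8} {:>8} {} {}".format(
        kind, perms, nlink, owner, group, size, mtime, filename)
```

&lt;p&gt;Calling this with a directory mode of &lt;code&gt;0o40755&lt;/code&gt;, owner and group &lt;code&gt;markus&lt;/code&gt;, size 1024, and a formatted mtime produces a line much like the example above.&lt;/p&gt;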

&lt;p&gt;The only real reason for doing this transformation on the server side is that the client can't perform a transformation from &lt;code&gt;gid&lt;/code&gt; / &lt;code&gt;uid&lt;/code&gt; into strings (since you can't expect the client and server to share the same mapping). So, there is some justification here. However, &lt;code&gt;readdir&lt;/code&gt; isn't the only call you need to implement &lt;code&gt;ls&lt;/code&gt;. If the user simply performs an &lt;code&gt;ls&lt;/code&gt; on a specific file, then you need to use &lt;code&gt;stat&lt;/code&gt; to determine if the file exists, and if so get the attribute information. The SFTP protocol provides a &lt;code&gt;stat&lt;/code&gt; call that basically works as you'd expect, returning the attributes for the file. But, it does &lt;strong&gt;not&lt;/strong&gt; return the longname! So, now you know why sometimes an &lt;code&gt;ls&lt;/code&gt; in the SFTP client will provide you with user and group names, while other times it just shows numeric identifiers.&lt;/p&gt;


&lt;p&gt;The sixth interesting thing is that despite there being a &lt;code&gt;cd&lt;/code&gt; command in the OpenSSH SFTP client, there isn't any &lt;code&gt;chdir&lt;/code&gt; equivalent in the protocol. So, &lt;code&gt;cd&lt;/code&gt; is a pure client-side illusion. A related interesting thing is that the protocol supports filenames both with and without a leading slash. With a slash the path is absolute (as you'd expect), but without a slash it isn't relative, or at least not relative to any current working directory. Paths without a leading slash are resolved relative to a &lt;em&gt;default&lt;/em&gt; directory. Normally this would be the user's home directory, but really it is up to the SSH server.&lt;/p&gt;
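&lt;p&gt;A client can maintain the &lt;code&gt;cd&lt;/code&gt; illusion by tracking a current directory itself and resolving every path to an absolute one before it goes over the wire. A minimal sketch (the class name and starting directory are invented for illustration):&lt;/p&gt;

```python
import posixpath

class ClientCwd:
    # The protocol has no chdir, so "cd" is pure client-side state.
    def __init__(self, start):
        self.cwd = start

    def resolve(self, path):
        # A leading slash means absolute; anything else is resolved
        # against our own notion of the current directory, so the
        # server never needs to know about "cd" at all.
        if path.startswith("/"):
            return posixpath.normpath(path)
        return posixpath.normpath(posixpath.join(self.cwd, path))

    def cd(self, path):
        self.cwd = self.resolve(path)
```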

&lt;p&gt;The final interesting thing worth commenting on is that even though most of the SFTP interfaces seem to be copies of POSIX interfaces, for &lt;code&gt;rename&lt;/code&gt; they chose to specify &lt;strong&gt;different&lt;/strong&gt; semantics to the POSIX rename! Specifically, in SFTP it is an error to rename a file to an existing file, while POSIX rename supports this without an issue. Now of course, people wanted their normal POSIX semantics in rename, so OpenSSH obliged with an extension: &lt;code&gt;posix-rename&lt;/code&gt;. So, OpenSSH has two renames, one with SFTP semantics, and one with POSIX semantics. This is not unreasonable; the really fun part is that the OpenSSH SFTP client automatically tries to use the &lt;code&gt;posix-rename&lt;/code&gt; semantics, but will happily, silently fall back to the SFTP semantics.&lt;/p&gt;
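&lt;p&gt;The difference in semantics is easy to sketch with a toy dict-backed file system (&lt;code&gt;FakeRemoteFs&lt;/code&gt; is invented for illustration, not a real server implementation):&lt;/p&gt;

```python
class FakeRemoteFs:
    # Toy stand-in for the server's file system.
    def __init__(self, files):
        self.files = dict(files)

    def sftp_rename(self, old, new):
        # Draft SFTP semantics: renaming onto an existing name is an error.
        if new in self.files:
            raise FileExistsError("rename failed: {} exists".format(new))
        self.files[new] = self.files.pop(old)

    def posix_rename(self, old, new):
        # posix-rename extension semantics: silently replace the target.
        self.files[new] = self.files.pop(old)
```

&lt;p&gt;A client that silently falls back from the second behaviour to the first is exactly the kind of thing that surprises you at the worst possible moment.&lt;/p&gt;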

&lt;p&gt;Hopefully if you ever need to deal with SFTP at a low level this has given you some pointers of things to look out for!&lt;/p&gt;
    </description>  </item>

  <item>
    <title>Thoughts on PolicyHack</title>
    <link>http://benno.id.au/blog/2015/10/09/australian-innovation-policy</link>
    <guid>http://benno.id.au/blog/2015/10/09/australian-innovation-policy</guid>
    <pubDate>Fri, 09 Oct 2015 09:31:59 +0000</pubDate>
    <description>&lt;p&gt;Yesterday a &lt;a href="http://www.policyhack.com.au/"&gt;Policy
Hackathon&lt;/a&gt; event was launched by the Assistant Minister for
Innovation &lt;a href="https://twitter.com/Wyatt_MP"&gt;Wyatt Roy&lt;/a&gt;.  This
kind of outreach to the community is a great thing to see from
politicians, and I look forward to seeing more examples of it.
However, I'm concerned that the way in which this particular event
has been framed won't lead to successful outcomes.&lt;/p&gt;

&lt;p&gt;The first problem with Policy Hack is that it focuses on the
solution space (i.e.: policy ideas) and completely fails to explore
the problem space. Ironically, this is the same kind of thinking that
leads to the demise of many technology companies who focus on
the cool technology instead of the problem their customers need
them to solve.&lt;/p&gt;

&lt;p&gt;The &lt;a href="http://oursay.org/community/policyhack"&gt;Policy Hack
website&lt;/a&gt; has, at the time of writing, received over 60 submissions (of
varying quality). All of them are policy suggestions, but few, if any,
provide any kind of detail on the &lt;em&gt;specific problem&lt;/em&gt; they are
trying to solve. In some cases it is implied (e.g. a policy suggestion
to lower taxation implies that high taxation is in some
way impeding innovation), but even then there is little to convince
the reader that the implied problem is &lt;strong&gt;actually&lt;/strong&gt;
a problem. (e.g.: where is the evidence that the current level
of taxation is &lt;strong&gt;actually&lt;/strong&gt; impeding innovation?)&lt;/p&gt;

&lt;p&gt;Before we get into policy ideas we really need to have
some kind of problem definition. What is the &lt;em&gt;problem&lt;/em&gt; that
Policy Hack is trying to solve?&lt;/p&gt;

&lt;p&gt;The closest we get to that on the website is &lt;q&gt;policy ideas
designed to foster the growth of innovation industries including tech
startups, biotech, agtech, fintech, renewables and resources.&lt;/q&gt;.&lt;/p&gt;

&lt;p&gt;This isn't at all a bad &lt;em&gt;starting point&lt;/em&gt;, but it's a very
high-level aspirational kind of statement, rather than an actionable
definition of the problem. A lot more work needs to be done to fully
flesh out the problem before we even start thinking about the solution.&lt;/p&gt;

&lt;p&gt;An example of a more actionable problem statement
could be: &lt;q&gt;The growth of the tech startups, biotech, agtech, fintech, renewables and resources
industries as measured by percentage of GDP is too low for Australia's future
economic prosperity&lt;/q&gt;. Of course this raises further questions. Maybe percentage of GDP isn't the
best measure, and we should be using a different metric such as tax revenue,
or employment. And the actual terms should be defined a bit more clearly as
well; what even is a &lt;em&gt;tech startup&lt;/em&gt;? I don't think that is one of the
business classifications used by the &lt;a href="http://www.abs.gov.au/AUSSTATS/abs@.nsf/66f306f503e529a5ca25697e0017661f/7cd8aebba7225c4eca25697e0018faf3!OpenDocument"&gt;Bureau of Statistics&lt;/a&gt;, which
raises even more questions!&lt;/p&gt;

&lt;p&gt;Of course, once we have that problem statement, some kind of
data would be nice. Like, what is the current growth rate? Off what
base?  What does it need to be to ensure economic prosperity? If we
can't get those numbers directly, what are reasonable proxies for
those numbers that we can measure? How reliable are those proxies?&lt;/p&gt;

&lt;p&gt;With a problem definition, trying some kind of root cause analysis
might lead to specific problems that we can solve with policies.
One example could be: there aren't enough people creating
startups, because there aren't enough STEM graduates, because
high-school students don't choose STEM courses, because they are worried
about the HECS/HELP debt. Or it could be: we have plenty of startups
but once they get initial traction they move to the US, because
there isn't enough capital available in Australia, because investors
are taxed too heavily on capital gains. I'm sure there are hundreds of
different chains of thought like this that should be explored.&lt;/p&gt;

&lt;p&gt;Of course, those are just hypotheses! We should be applying
some kind of scientific method here. Do startups really move to the
US once they get success? Maybe they don't. Do they move because
of funding reasons, or maybe it is access to customers? Is capital
not available here because of capital gains tax, or maybe it is
because other asset classes provide better returns?&lt;/p&gt;

&lt;p&gt;None of that is really easy, and certainly can't be done
collaboratively during a hackathon (despite the excellent opening up
of government data sources to make the underlying data available).
This needs patient analysis, by diligent people, spending time to find
the right data sources.&lt;/p&gt;

&lt;p&gt;Without access to this information how could you possibly come up
with a useful policy suggestion? Of course, once this work has been
done the actual policies probably become somewhat obvious!&lt;/p&gt;

&lt;p&gt;Ideally this is work that has already been done, either within
government policy units, or external consultants. If so, it should be
directly available on the Policy Hack website, and it should be
required reading for anyone wanting to put forward a policy idea.&lt;/p&gt;

&lt;p&gt;The second problem that I see with the Policy Hack event, or at
least some of the suggestions put forward to date is that many of them
are simply reinventing the wheel. (Which, ironically, is the cause of
the downfall of many a startup!)&lt;/p&gt;

&lt;p&gt;Many of the suggestions currently put forward are basically the
same as policies that &lt;em&gt;already exist&lt;/em&gt; in some shape or
form. This isn't meant to be a criticism about the person putting
forward the idea; it isn't exactly simple to find what policies are
currently out there. But surely, if we are going to have an event that
looks at putting forward new policy ideas on innovation we should &lt;em&gt;
at the very least&lt;/em&gt; have a list of the current policies the
government has in place with the aim of fostering innovation! Ideally,
we would also have an analysis of each policy with some commentary on
what works well about it, what doesn't work so well, and some form of
cost-benefit analysis.&lt;/p&gt;

&lt;p&gt;With this information in place we may find that a simple tweak to
an existing policy could have a large impact. Or we could find
ineffective policies that can be scrapped to make funds available for
new policies (because any new policies will need to be funded somehow,
right?). Or we could find that we already have policies addressing
many of the things that we &lt;em&gt;think&lt;/em&gt; are problems, and we need
to go back to the root cause analysis to determine other possible
things that need addressing.&lt;/p&gt;

&lt;p&gt;Again, maybe this information is available somewhere, but it should be
made available to participants of Policy Hack in advance. And,
if you are going to propose a &lt;strong&gt;new&lt;/strong&gt; policy, that is
similar in some way to an existing policy, then you should explain what's different about your proposal.
For example, if you put forward an idea such as &lt;em&gt;&lt;q&gt;allow special visas to hire
technical immigrants&lt;/q&gt;&lt;/em&gt;, without any reference at all to the current
457 visa scheme you're really just wasting everybody's time!&lt;/p&gt;

&lt;p&gt;Finally, there are a lot of &lt;em&gt;&lt;q&gt;Country X does Y, let's do Y
policy&lt;/q&gt;&lt;/em&gt; suggestions without any critical analysis of why &lt;em&gt;Y&lt;/em&gt; works for
&lt;em&gt;X&lt;/em&gt; (or even &lt;strong&gt;if&lt;/strong&gt; &lt;em&gt;Y&lt;/em&gt; works for
&lt;em&gt;X&lt;/em&gt;!). (And to continue the theme, this kind of &lt;em&gt;me too&lt;/em&gt;
thinking seems to plague startups as well.)&lt;/p&gt;

&lt;p&gt;Ideally an input to the Policy Hack should be a comprehensive
review of other &lt;em&gt;innovation&lt;/em&gt; policies that have been put in
place by other jurisdictions. This review should explain how these
have, or have not, been effective in the jurisdictions in which they
have been implemented, along with some commentary as to how they
compare to similar policies in Australia. For example, holding up
the US low capital gains tax and long-term dividend tax rate without explaining
that the US has no mechanism for relief from double taxation at the corporate
and individual level is not providing a fair evaluation.&lt;/p&gt;

&lt;p&gt;And again, if you are going to put forward policy ideas based on
ideas from other jurisdictions, it better actually reference that
research document and justify why that policy would work in
Australia. For example, if you think that Australia should have lower
corporate tax rates because Ireland has lower corporate tax rates, you
need to at least give some kind of backup that this is actually
working for Ireland.&lt;/p&gt;

&lt;p&gt;If we want to have an effective discussion on policy, we first need
to agree on the problem we are solving, and have a clear understanding
of what has been tried before both locally and overseas.&lt;/p&gt;

&lt;p&gt;Instead of policies, maybe the Policy Hack could instead
work through the problem space. A morning session that presented
a summary of all existing data to the participants, followed by
an afternoon brainstorming session on identifying possible
reasons why we don't see the levels of innovation that we'd like.
With this information in hand the policy experts could perform
the relevant research to determine which of the identified reasons
actually has an evidence base to support it. The results of this
could feed in to a future hackathon that starts to look at policy
from an informed base.&lt;/p&gt;
    </description>  </item>

  <item>
    <title>Uniquely identifying processes in Linux</title>
    <link>http://benno.id.au/blog/2015/06/28/identifying-linux-processes</link>
    <guid>http://benno.id.au/blog/2015/06/28/identifying-linux-processes</guid>
    <pubDate>Sun, 28 Jun 2015 12:20:50 +0000</pubDate>
    <description>&lt;p&gt;If you are doing any kind of analysis of a Linux based system you
quickly come to the point of needing to collect statistics on a
per-process basis. However one difficulty here is how to uniquely
identify a process.&lt;/p&gt;

&lt;p&gt;The simplest answer is that you can just use the process identifier
(PID); unfortunately this is not a good answer in the face of PID
reuse. The total number of PIDs in the system is finite, and once
exhausted the kernel will start re-using PIDs that have been
previously allocated to different processes.&lt;/p&gt;

&lt;p&gt;Now you may expect that this reuse would be relatively rare, or
occur over a relatively long period, but it can happen relatively
quickly. On a system under load I've seen PID reuse in under 5
seconds.&lt;/p&gt;

&lt;p&gt;The practical consequence of this is that if you are, for example,
collecting statistics about a process via any interface that
identifies processes via PID you need to be careful to ensure you are
collecting the right statistics! For example, if you are collecting
statistics about a process identified via PID 37 you might read
&lt;code&gt;/proc/37/stat&lt;/code&gt; at one instance and receive valid data, but
at any later time &lt;code&gt;/proc/37/stat&lt;/code&gt; could return data about
a completely different process.&lt;/p&gt;

&lt;p&gt;Thankfully, the kernel associates another useful piece of information
with a process: its start time. The combination of PID and start time
provides a reasonably robust way of uniquely identifying processes over
the life-time of a system. (For the pedantic, if a process can be created
and correctly reaped all within the granularity of the clock, then it would
be theoretically possible for multiple different processes to have existed
in the system that have the same PID and start time, but that is unlikely
to be a problem in practice.)&lt;/p&gt;

&lt;p&gt;The start time is one of the fields that is present in the
&lt;code&gt;/proc/&amp;lt;pid&amp;gt;/stat&lt;/code&gt; information, so this can be used
to ensure you are correctly matching up statistics.&lt;/p&gt;
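&lt;p&gt;As a sketch, here is one way to pull the PID and start time out of a &lt;code&gt;stat&lt;/code&gt; line. The parsing is slightly fiddly because the &lt;code&gt;comm&lt;/code&gt; field (field 2) is parenthesised and may itself contain spaces or parentheses. (&lt;code&gt;process_key&lt;/code&gt; is a hypothetical helper name, and of course only works on Linux.)&lt;/p&gt;

```python
def parse_stat(stat_line):
    # comm (field 2) is parenthesised and may contain spaces or ')',
    # so split around the last ") " rather than naively on whitespace.
    head, _, tail = stat_line.rpartition(") ")
    pid = int(head.split(" ", 1)[0])
    fields = tail.split()
    # starttime is field 22 of the stat line; the tail starts at
    # field 3 (state), so starttime sits at index 19 here.
    starttime = int(fields[19])
    return pid, starttime

def process_key(pid):
    # Hypothetical helper (Linux only): a reasonably unique process id
    # built from the live stat file.
    with open("/proc/{}/stat".format(pid)) as f:
        return parse_stat(f.read())
```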
    </description>  </item>

  <item>
    <title>Playing around with await/async in Python 3.5</title>
    <link>http://benno.id.au/blog/2015/05/25/await1</link>
    <guid>http://benno.id.au/blog/2015/05/25/await1</guid>
    <pubDate>Mon, 25 May 2015 07:07:03 +0000</pubDate>
    <description>
&lt;p&gt;&lt;a href="https://www.python.org/dev/peps/pep-0492/"&gt;PEP-0492&lt;/a&gt;
was recently approved giving Python 3.5 some special syntax for
dealing with co-routines. A lot of the new functionaltiy was available
pre-3.5, but the syntax certainly wasn’t ideal as the concept
of generators and co-routines were kind of intermingled. PEP-0492
makes an explicit distinction between generators and co-routines
through the use of the the &lt;code&gt;async&lt;/code&gt; keyword.&lt;/p&gt;

&lt;p&gt;This post aims to describe how these new mechanisms work at
a rather low level. If you are mostly interested in just using
this functionality for high-level stuff I recommend skipping this
post and reading up on the built-in asyncio module. If you are
interested in how these low-level concepts can be used to build
up your own version of the asyncio module, then you might find
this interesting.&lt;/p&gt;

&lt;p&gt;For this post we’re going to totally ignore any asynchronous
&lt;strong&gt;I/O&lt;/strong&gt; aspect and just limit things to interleaving
progress from multiple co-routines. Here are two very simple
functions:&lt;/p&gt;

&lt;pre&gt;
def coro1():
    print("C1: Start")
    print("C1: Stop")


def coro2():
    print("C2: Start")
    print("C2: a")
    print("C2: b")
    print("C2: c")
    print("C2: Stop")
&lt;/pre&gt;

&lt;p&gt;We start with two very simple functions, &lt;code&gt;coro1&lt;/code&gt; and
&lt;code&gt;coro2&lt;/code&gt;. We could call these functions one after the
other:&lt;/p&gt;

&lt;pre&gt;
coro1()
coro2()
&lt;/pre&gt;

&lt;p&gt;and we’d get the expected output:&lt;/p&gt;

&lt;pre&gt;
C1: Start
C1: Stop
C2: Start
C2: a
C2: b
C2: c
C2: Stop
&lt;/pre&gt;

&lt;p&gt;But, for some reason, rather than running these one after the
other, we’d like to interleave the execution. We can’t just do
that with normal functions, so let’s turn these into co-routines:&lt;/p&gt;

&lt;pre&gt;
async def coro1():
    print("C1: Start")
    print("C1: Stop")


async def coro2():
    print("C2: Start")
    print("C2: a")
    print("C2: b")
    print("C2: c")
    print("C2: Stop")
&lt;/pre&gt;

&lt;p&gt;Through the magic of the new &lt;code&gt;async&lt;/code&gt; keyword these functions are
no longer functions, but now they are co-routines (or more
specifically &lt;em&gt;native co-routine functions&lt;/em&gt;).  When you call a
normal function, the function body is executed, however when you call
a co-routine function the body isn’t executed; instead you get back a
&lt;em&gt;co-routine object&lt;/em&gt;:&lt;/p&gt;

&lt;pre&gt;
c1 = coro1()
c2 = coro2()
print(c1, c2)
&lt;/pre&gt;

&lt;p&gt;gives:&lt;/p&gt;

&lt;pre&gt;
&amp;lt;coroutine object coro1 at 0x10ea60990&amp;gt; &amp;lt;coroutine object coro2 at 0x10ea60a40&amp;gt;
&lt;/pre&gt;

&lt;p&gt;(The interpreter will also print some runtime warnings that we’ll ignore for now.)&lt;/p&gt;

&lt;p&gt;So, what good is having a co-routine object? How do we actually execute the thing? Well,
one way to execute a co-routine is through an await expression (using the new &lt;code&gt;await&lt;/code&gt;
keyword). You might think you could do something like:&lt;/p&gt;

&lt;pre&gt;
await c1
&lt;/pre&gt;

&lt;p&gt;but, you’d be disappointed. An await expression is only valid syntax when contained
within a native co-routine function. You could do something like:&lt;/p&gt;

&lt;pre&gt;
async def main():
    await c1
&lt;/pre&gt;

&lt;p&gt;but of course then you are left with the problem of how to force the execution of &lt;code&gt;main&lt;/code&gt;!&lt;/p&gt;

&lt;p&gt;The trick to realise is that co-routines are actually pretty
similar to Python generators, and have the same &lt;code&gt;send&lt;/code&gt;
method. We can kick off execution of a co-routine by calling the
&lt;code&gt;send&lt;/code&gt; method.&lt;/p&gt;

&lt;pre&gt;
c1.send(None)
&lt;/pre&gt;

&lt;p&gt;This gets our first co-routine executing to completion, however we also get a nasty
&lt;code&gt;StopIteration&lt;/code&gt; exception:
&lt;/p&gt;

&lt;pre&gt;
C1: Start
C1: Stop
Traceback (most recent call last):
  File "test3.py", line 16, in &lt;module&gt;
    c1.send(None)
StopIteration
&lt;/pre&gt;

&lt;p&gt;The StopIteration exception is the mechanism used to indicate that
a generator (or co-routine in this case) has completed execution.
Despite being an exception it is actually quite expected!  We can wrap
this in an appropriate &lt;code&gt;try/except&lt;/code&gt; block, to avoid the
error condition. At the same time let’s start the execution of our
second co-routine:&lt;/p&gt;

&lt;pre&gt;
try:
    c1.send(None)
except StopIteration:
    pass
try:
    c2.send(None)
except StopIteration:
    pass
&lt;/pre&gt;

&lt;p&gt;Now we get complete output, but it is disappointingly similar to
our original output. So we have a bunch more code, but no actual
interleaving yet! Co-routines are not dissimilar to threads as they
allow the interleaving of multiple distinct threads of control,
however unlike threads, with co-routines any switching is
explicit, rather than implicit (which is, in many cases, a good
thing!). So we need to put in some of these explicit switches.&lt;/p&gt;

&lt;p&gt;Normally the &lt;code&gt;send&lt;/code&gt; method on generators will execute
until the generator yields a value (using the &lt;code&gt;yield&lt;/code&gt;
keyword), so you might think we could change &lt;code&gt;coro1&lt;/code&gt; to
something like:&lt;/p&gt;

&lt;pre&gt;
async def coro1():
    print("C1: Start")
    yield
    print("C1: Stop")
&lt;/pre&gt;

&lt;p&gt;but we can’t use &lt;code&gt;yield&lt;/code&gt; inside a co-routine. Instead we
use the new &lt;code&gt;await&lt;/code&gt; expression, which suspends execution of
the co-routine until the &lt;em&gt;awaitable&lt;/em&gt; completes. So we need
something like &lt;code&gt;await _something_&lt;/code&gt;; the question is what is
the &lt;em&gt;something&lt;/em&gt; in this case? We can’t just await on
nothing! The &lt;a
href="https://www.python.org/dev/peps/pep-0492/"&gt;PEP&lt;/a&gt; explains
which things are &lt;em&gt;awaitable&lt;/em&gt;. One of them is another native
co-routine, but that doesn’t help get to the bottom of things. One of
them is an object defined with a special CPython API, but we want to
avoid extension modules and stick to pure Python right now. That
leaves two options; either a &lt;em&gt;generator-based coroutine object&lt;/em&gt;
or a special &lt;em&gt;Future-like&lt;/em&gt; object.&lt;/p&gt;

&lt;p&gt;So, let’s go with the generator-based co-routine object to start
with. Basically a Python generator (e.g.: something that has a
&lt;code&gt;yield&lt;/code&gt; in it) can be marked as a co-routine through the
&lt;code&gt;types.coroutine&lt;/code&gt; decorator. So a very simple example of
this would be:&lt;/p&gt;

&lt;pre&gt;
@types.coroutine
def switch():
    yield
&lt;/pre&gt;

&lt;p&gt;This defines a generator-based co-routine
&lt;strong&gt;function&lt;/strong&gt;. To get a generator-based co-routine
&lt;strong&gt;object&lt;/strong&gt; we just call the function. So, we can change
our &lt;code&gt;coro1&lt;/code&gt; co-routine to:&lt;/p&gt;

&lt;pre&gt;
async def coro1():
    print("C1: Start")
    await switch()
    print("C1: Stop")
&lt;/pre&gt;

&lt;p&gt;With this in place, we hope that we can interleave our execution of
&lt;code&gt;coro1&lt;/code&gt; with the execution of &lt;code&gt;coro2&lt;/code&gt;. If we try
it with our existing code we get the output:&lt;/p&gt;

&lt;pre&gt;
C1: Start
C2: Start
C2: a
C2: b
C2: c
C2: Stop
&lt;/pre&gt;

&lt;p&gt;We can see that as expected &lt;code&gt;coro1&lt;/code&gt; stopped executing after the first print statement,
and then &lt;code&gt;coro2&lt;/code&gt; was able to execute. In fact, we can look at the co-routine object
and see exactly where it is suspended with some code like this:&lt;/p&gt;

&lt;pre&gt;
print("c1 suspended at: {}:{}".format(c1.gi_frame.f_code.co_filename, c1.gi_frame.f_lineno))
&lt;/pre&gt;

&lt;p&gt;which prints out the line of your &lt;code&gt;await&lt;/code&gt;
expression. (Note: this gives you the outer-most await, so is mostly
just for explanatory purposes here, and not particularly useful in the
general case.)&lt;/p&gt;

&lt;p&gt;OK, the question now is: how can we resume &lt;code&gt;coro1&lt;/code&gt; so that it executes to completion?
We can just use &lt;code&gt;send&lt;/code&gt; again. So we end up with some code like:&lt;/p&gt;

&lt;pre&gt;
try:
    c1.send(None)
except StopIteration:
    pass
try:
    c2.send(None)
except StopIteration:
    pass
try:
    c1.send(None)
except StopIteration:
    pass
&lt;/pre&gt;

&lt;p&gt;which then gives us our expected output:&lt;/p&gt;

&lt;pre&gt;
C1: Start
C2: Start
C2: a
C2: b
C2: c
C2: Stop
C1: Stop
&lt;/pre&gt;

&lt;p&gt;So, at this point we’re manually pushing the co-routines through to completion by
explicitly calling &lt;code&gt;send&lt;/code&gt; on each individual co-routine object. This isn’t going
to work in general. What we’d really like is a function that kept executing all our co-routines
until they had all completed. In other words, we want to continually execute &lt;code&gt;send&lt;/code&gt;
on each co-routine object until that method raises the &lt;code&gt;StopIteration&lt;/code&gt; exception.&lt;/p&gt;

&lt;p&gt;So, let’s create a function that takes in a list of co-routines and executes them until
completion. We’ll call this function &lt;code&gt;run&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;
def run(coros):
    coros = list(coros)

    while coros:
        # Duplicate list for iteration so we can remove from original list.
        for coro in list(coros):
            try:
                coro.send(None)
            except StopIteration:
                coros.remove(coro)
&lt;/pre&gt;

&lt;p&gt;This repeatedly resumes each co-routine in the list in turn; when
a &lt;code&gt;StopIteration&lt;/code&gt; exception is raised, the co-routine is removed from the
list.&lt;/p&gt;

&lt;p&gt;We can then remove the code manually calling the send method and
instead do something like:&lt;/p&gt;

&lt;pre&gt;
c1 = coro1()
c2 = coro2()
run([c1, c2])
&lt;/pre&gt;

&lt;p&gt;And now we have a very simple run-time for executing co-routines using the new
await and async features in Python 3.5. Code related to this post is available
&lt;a href="https://github.com/bennoleslie/awaitexp"&gt;on github&lt;/a&gt;.&lt;/p&gt;
    </description>  </item>

  <item>
    <title>Android projects and gradlew</title>
    <link>http://benno.id.au/blog/2015/04/08/psa_android_command_line_tools</link>
    <guid>http://benno.id.au/blog/2015/04/08/psa_android_command_line_tools</guid>
    <pubDate>Wed, 08 Apr 2015 14:15:36 +0000</pubDate>
    <description>&lt;p&gt;The instructions in the Android &lt;a
href="http://developer.android.com/training/basics/firstapp/index.html"&gt;first-app
tutorial&lt;/a&gt; (as of 20150408) are kind of confusing. I think something
got lost in translation when Android Studio was added to the
mix.&lt;/p&gt;

&lt;p&gt;Both the &lt;a
href="http://developer.android.com/training/basics/firstapp/creating-project.html"&gt;creating
a project&lt;/a&gt; and &lt;a
href="http://developer.android.com/training/basics/firstapp/running-app.html"&gt;running
an app&lt;/a&gt; pages have instructions for using the command line, rather than
the IDE (which is great, because I'm curmudgeonly like
that). Unfortunately, if you use the command line options for
&lt;em&gt;creating a project&lt;/em&gt;, you end up with an Ant based build system
set up, and not a Gradle based build system. Which conveniently means
that the command line instructions for &lt;em&gt;running an app&lt;/em&gt; make no
sense at all.&lt;/p&gt;

&lt;p&gt;So, since you probably want a Gradle-based build, I suggest using the IDE once
to create the project; after that you can follow the instructions for running
an app from the command line.&lt;/p&gt;

&lt;p&gt;PS: Make sure you are connected to the interwebs the first time you run &lt;code&gt;gradlew&lt;/code&gt;
as it needs to download a bunch of stuff before it can start.&lt;/p&gt;
    </description>  </item>

  <item>
    <title>Checking for leaks in Python</title>
    <link>http://benno.id.au/blog/2015/02/26/python-leak-checking</link>
    <guid>http://benno.id.au/blog/2015/02/26/python-leak-checking</guid>
    <pubDate>Thu, 26 Feb 2015 15:29:57 +0000</pubDate>
    <description>&lt;p&gt;For short running scripts it doesn't really matter if your Python module
is leaking memory; it all gets cleaned up on process exit anyway, however
if you have a long-running application you really don't want an ever
increasing memory footprint. So, it is helpful to check that your
module isn’t leaking memory, and to plug the leaks if they occur. This
post provides some ideas on how to do that, and describes some of the
implementation of my &lt;code&gt;&lt;a href="https://github.com/bennoleslie/pyutil/blob/08d33198e76ebb4d5d40df5087e8726304f0daf1/check_leaks.py"&gt;check_leaks&lt;/a&gt;&lt;/code&gt; module.&lt;/p&gt;

&lt;p&gt;If you suspect a given function &lt;code&gt;foo&lt;/code&gt; of leaking memory, then the
simplest approach to checking it is to get the set of
objects that exist &lt;em&gt;before&lt;/em&gt; running the function and the set of objects that
exist &lt;em&gt;after&lt;/em&gt; running the function; any object in &lt;em&gt;after&lt;/em&gt;
that isn’t in &lt;em&gt;before&lt;/em&gt; is going to be a leaked object.&lt;/p&gt;

&lt;p&gt;In Python the garbage collector tracks (practically) all the objects that
exist, and the &lt;code&gt;&lt;a href="https://docs.python.org/3/library/gc.html"&gt;gc&lt;/a&gt;&lt;/code&gt;
module provides us with access to the internals of the garbage collector.
In particular, the &lt;code&gt;get_objects&lt;/code&gt; function returns a list of all
the objects tracked by the garbage collector. So at a really simple level we can do:&lt;/p&gt;

&lt;pre&gt;
before_objects = gc.get_objects()
foo()
after_objects = gc.get_objects()
&lt;/pre&gt;

&lt;p&gt;It is then a simple matter of working out which objects are in
&lt;code&gt;after_objects&lt;/code&gt; but not &lt;code&gt;before_objects&lt;/code&gt;;
those are our leaked objects! One catch, however, is that we don’t
know when the garbage collector last ran, so with the above code it is
possible that there are objects in &lt;code&gt;after_objects&lt;/code&gt; that
could still be cleaned up by the garbage collector. Thankfully, the &lt;code&gt;gc&lt;/code&gt;
module provides the &lt;code&gt;collect&lt;/code&gt; function, which allows us to force a
garbage collection. So we end up with something like:&lt;/p&gt;

&lt;pre&gt;
before_objects = gc.get_objects()
foo()
gc.collect()
after_objects = gc.get_objects()
&lt;/pre&gt;

&lt;p&gt;The other small gotcha in this case is that the &lt;code&gt;before_objects&lt;/code&gt;
list will end up as an object in &lt;code&gt;after_objects&lt;/code&gt;. This isn’t really
a leak, so we can ignore it when displaying the leaked objects.&lt;/p&gt;
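&lt;p&gt;Computing that difference can be sketched as follows; since arbitrary objects aren’t hashable we compare by &lt;code&gt;id()&lt;/code&gt;, and the helper name here is my own rather than anything from the &lt;code&gt;check_leaks&lt;/code&gt; module:&lt;/p&gt;

```python
import gc

def leaked_objects(func):
    # Snapshot tracked objects, run the suspect function, force a
    # collection, then report objects present after but not before.
    before_objects = gc.get_objects()
    before_ids = set(map(id, before_objects))
    func()
    gc.collect()
    after_objects = gc.get_objects()
    # Compare by id(): arbitrary objects are not hashable or comparable.
    # Skip the before_objects list itself; it is not really a leak.
    return [obj for obj in after_objects
            if id(obj) not in before_ids and obj is not before_objects]

leaks = []

def foo():
    # Deliberately "leak" by appending to a module-level list.
    leaks.append([1, 2, 3])

found = leaked_objects(foo)
print(len(found))
```

&lt;p&gt;Note that a few bookkeeping objects created during the check itself will also show up in the result; a real implementation needs to filter those out too.&lt;/p&gt;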

&lt;p&gt;So far, this can give us a list of leaked objects, but it doesn’t really
help resolve the issue except in the simplest cases. We need more
information to really fix the issue. One thing that helps is knowing why
an object hasn’t been collected yet, which comes down to knowing which
other objects still hold a reference to the leaked object. Additionally,
it can be very useful to know where the object was originally allocated.
Python gives us some tools to obtain this information.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;gc&lt;/code&gt; module provides a &lt;code&gt;get_referrers&lt;/code&gt;
function, which returns a list of all objects that refer to a given
object. This can be used to print out which objects refer to the leaked
object. Unfortunately this is less useful than you might think,
because in most cases the referring object is a &lt;code&gt;dict&lt;/code&gt;
object, as most class instances and modules store references in
&lt;code&gt;__dict__&lt;/code&gt; rather than directly. With some more processing
it would probably be possible to work out the purpose of each &lt;code&gt;dict&lt;/code&gt;
and provide more semantically useful information.&lt;/p&gt;
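&lt;p&gt;A small sketch of this in action; with a plain &lt;code&gt;dict&lt;/code&gt; holding the reference, the dict itself shows up among the referrers, just as an instance’s &lt;code&gt;__dict__&lt;/code&gt; does for attributes on a regular class:&lt;/p&gt;

```python
import gc

leaked = [1, 2, 3]
container = {'value': leaked}

# gc.get_referrers returns the objects holding a reference to `leaked`;
# here the dict `container` is one of them. For class instances the
# referrer reported is typically the instance __dict__, which is why
# raw get_referrers output can be hard to interpret.
referrers = gc.get_referrers(leaked)
print(any(r is container for r in referrers))
```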

&lt;p&gt;To work out where an object was originally allocated we can use
the new &lt;code&gt;&lt;a href="https://docs.python.org/3/library/tracemalloc.html"&gt;tracemalloc&lt;/a&gt;&lt;/code&gt;
module. When enabled, the &lt;code&gt;tracemalloc&lt;/code&gt; module stores a traceback
for each memory allocation within Python. The
&lt;code&gt;get_object_traceback()&lt;/code&gt; function returns a traceback (or &lt;code&gt;None&lt;/code&gt;) for
a given object. This makes debugging leaks significantly easier than
just guessing where an object was allocated.&lt;/p&gt;
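&lt;p&gt;A minimal sketch of this (tracing must be started before the allocation of interest, and &lt;code&gt;get_object_traceback&lt;/code&gt; may still return &lt;code&gt;None&lt;/code&gt; for untraced allocations):&lt;/p&gt;

```python
import gc
import tracemalloc

# Tracing must be enabled before the allocation we want to attribute.
tracemalloc.start(25)  # keep up to 25 frames per traceback

def make_leak():
    return {'data': list(range(100))}

obj = make_leak()
gc.collect()

# Returns a Traceback pointing at the most recent allocation of this
# object's memory, or None if the allocation was not traced.
tb = tracemalloc.get_object_traceback(obj)
if tb is not None:
    for line in tb.format():
        print(line)

tracemalloc.stop()
```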

&lt;p&gt;Unfortunately, &lt;code&gt;tracemalloc&lt;/code&gt; can sometimes provide misleading
information when trying to detect leaks. Many of the Python built-in
types optimise memory allocation by keeping an internal free list. When freeing
an object, rather than calling the normal free routine, the object is placed
on this free list, and a new object allocation can grab an object directly
from the free list, rather than needing to go through the normal memory manager
routines. This means that when using &lt;code&gt;get_object_traceback()&lt;/code&gt;, the trace
returned will be from the original time the memory was allocated, which may
be for a long-dead object. I think it would be good if the Python runtime
provided a way to selectively disable this free-list approach to allow precise
use of the &lt;code&gt;get_object_traceback()&lt;/code&gt; method. I have an initial
branch that does this for some objects up on &lt;a href="https://github.com/bennoleslie/cpython/tree/3.4-disable-freelist"&gt;github&lt;/a&gt;.
&lt;/p&gt;
    </description>  </item>

  <item>
    <title>uiautomator and webviews in Lollipop</title>
    <link>http://benno.id.au/blog/2015/01/01/uiautomator-webview-lollipop</link>
    <guid>http://benno.id.au/blog/2015/01/01/uiautomator-webview-lollipop</guid>
    <pubDate>Thu, 01 Jan 2015 11:07:25 +0000</pubDate>
    <description>&lt;p&gt;Android’s &lt;code&gt;uiautomator dump&lt;/code&gt; command takes a snapshot of
the currently active window state and creates an XML representation of
the view hierarchy. This is extremely useful when it comes to
automating interactions with the device and general debugging. One handy aspect
of this is that it can even introspect on Android’s WebView, so you get an idea of
where text boxes and buttons are. Or at least that was the case pre-Lollipop.&lt;/p&gt;

&lt;p&gt;In Lollipop this doesn’t quite work anymore. When you run &lt;code&gt;uiautomator dump&lt;/code&gt; in Lollipop,
you just get one big WebView item in the view hierarchy, but no children! (And I’m &lt;a href="http://stackoverflow.com/questions/27428207/uiautomator-dump-for-lollipop-lost-webview-view-hierarchy"&gt;not the only one&lt;/a&gt; to have this issue.)
So what’s going on?&lt;/p&gt;

&lt;p&gt;The first thing to realise is that uiautomator operates through the accessibility APIs. Essentially it is enabled as an accessibility manager in the system (which can actually be a problem if you have another accessibility manager enabled).&lt;/p&gt;

&lt;p&gt;It’s actually kind of tricky to understand exactly what is going on, but reading through the Chrome source it seems to enable population of the virtual view hierarchy only if an accessibility manager is enabled &lt;em&gt;at the time that the page is rendered&lt;/em&gt;. The UiAutomator sub-system is enabled as an accessibility manager when you run a &lt;code&gt;uiautomator&lt;/code&gt; command, but in Lollipop at least, it is then disabled when the &lt;code&gt;uiautomator&lt;/code&gt; process exits! I’m not 100% sure why this worked pre-Lollipop. I &lt;em&gt;think&lt;/em&gt; it is because, pre-Lollipop, the UiAutomator service stayed registered even after the command exited, so any WebViews rendered after the first execution of &lt;code&gt;uiautomator&lt;/code&gt; would correctly provide the virtual view hierarchy.&lt;/p&gt;

&lt;p&gt;A work-around is to enable an accessibility service (like TalkBack), then load the view you are interested in, disable the accessibility service, and then run &lt;code&gt;uiautomator dump&lt;/code&gt;. Sometimes in this case you will get the full dump (other times you won’t; I think that must be due to the page being re-rendered or something similar).&lt;/p&gt;

&lt;p&gt;If you want to get the full view hierarchy the only real way to go about it is to write a UI test that is run by &lt;code&gt;uiautomator&lt;/code&gt;; as long as &lt;code&gt;uiautomator&lt;/code&gt; is running you should be able to extract the full hierarchy.&lt;/p&gt;
    </description>  </item>

  <item>
    <title>Haskell Baby Steps</title>
    <link>http://benno.id.au/blog/2014/12/27/haskell-baby-steps</link>
    <guid>http://benno.id.au/blog/2014/12/27/haskell-baby-steps</guid>
    <pubDate>Sat, 27 Dec 2014 11:25:28 +0000</pubDate>
    <description>&lt;p&gt;I haven’t really written any Haskell in a serious manner since first-year computing which doesn’t seem that long ago, but actually is.&lt;/p&gt;

&lt;p&gt;Recently I’ve tried to pick it up again, which has been challenging to be honest. Haskell is much more widely used these days and can do a lot more. At the same time there is a lot more to learn and the language seems to be an awful lot &lt;em&gt;bigger&lt;/em&gt; now than it did 16 years ago. My overall goal right now is to rewrite some Python code that performs &lt;a href="http://en.wikipedia.org/wiki/Abstract_interpretation"&gt;abstract interpretation&lt;/a&gt; in Haskell, but I’m taking baby steps to start with.&lt;/p&gt;

&lt;p&gt;One of the biggest things with Haskell (as compared to Python) is that it really forces you to know exactly what you are trying to do before writing code. This is a lot different to how I usually start writing code. The way I usually design code is by trying to write some code that does some small part of what I’m trying to achieve, and then iterating from that. For me at least this is easier to do in Python than in Haskell, but this is most likely a property of my familiarity with the language rather than any inherent property of either language. In any case, for me, doing an initial sketch in Python and then converting that to Haskell seems to be a useful approach.&lt;/p&gt;

&lt;p&gt;The first module I’ve written on this journey is &lt;code&gt;FiniteType.hs&lt;/code&gt;, which provides a type-class &lt;code&gt;FiniteType&lt;/code&gt; with a single method &lt;code&gt;cardinality&lt;/code&gt;. To be honest I was surprised that this wasn’t some kind of built-in! At the end of the day the module is fairly simple, but along the way there was a &lt;em&gt;lot&lt;/em&gt; to learn: language extensions (in general), the &lt;code&gt;DefaultSignatures&lt;/code&gt; extension in particular, and the &lt;code&gt;Data.Proxy&lt;/code&gt; module.&lt;/p&gt;

&lt;p&gt;In the end I’m fairly happy with the solution I came up with, but there is a certain amount of code duplication in the implementation that I’m still not happy about. I think the only fix for this is template-haskell, but that seems overkill for such a simple module.&lt;/p&gt;

&lt;p&gt;There are still some things that I’m not sure about. In particular, when writing a type-class such as &lt;code&gt;FiniteType&lt;/code&gt;, should the module create instances of the type-class, or should that be up to the user of the module?&lt;/p&gt;

&lt;p&gt;Code is available on &lt;a href="https://github.com/bennoleslie/hsutil/blob/3e2b6c3e33bf1bd945384c83eaba03198c765524/FiniteType.hs"&gt;github&lt;/a&gt;. I’d welcome any comments or suggestions for improvements.&lt;/p&gt;
    </description>  </item>

  <item>
    <title>namedfields create performance</title>
    <link>http://benno.id.au/blog/2014/12/01/namedfields_create_performance</link>
    <guid>http://benno.id.au/blog/2014/12/01/namedfields_create_performance</guid>
    <pubDate>Mon, 01 Dec 2014 20:35:37 +0000</pubDate>
    <description>&lt;p&gt;This is a follow-up to yesterday’s &lt;a href="http://benno.id.au/blog/2014/11/30/a-better-namedtuple"&gt;post&lt;/a&gt; about &lt;code&gt;namedtuple&lt;/code&gt;s.&lt;/p&gt;

&lt;p&gt;Yesterday I mostly focussed on the performance of accessing
attributes on a named tuple object, and the &lt;code&gt;namedfields&lt;/code&gt; decorator approach
that I showed ended up with the same performance as the standard library namedtuple.
One operation that I didn’t consider, but which is actually reasonably common, is the
creation of a new object.&lt;/p&gt;

&lt;p&gt;My implementation relied on a generic &lt;code&gt;__new__&lt;/code&gt; that used the underlying
&lt;code&gt;_fields&lt;/code&gt; to work out the actual arguments to pass to the &lt;code&gt;tuple.__new__&lt;/code&gt;
constructor:&lt;/p&gt;

&lt;pre&gt;
    def __new__(_cls, *args, **kwargs):
        if len(args) &gt; len(_cls._fields):
            raise TypeError("__new__ takes {} positional arguments but {} were given".format(len(_cls._fields) + 1, len(args) + 1))

        missing_args = tuple(fld for fld in _cls._fields[len(args):] if fld not in kwargs)
        if len(missing_args):
            raise TypeError("__new__ missing {} required positional arguments".format(len(missing_args)))
        extra_args = tuple(kwargs.pop(fld) for fld in _cls._fields[len(args):] if fld in kwargs)
        if len(kwargs) &gt; 0:
            raise TypeError("__new__ got an unexpected keyword argument '{}'".format(list(kwargs.keys())[0]))

        return tuple.__new__(_cls, tuple(args + extra_args))
&lt;/pre&gt;


&lt;p&gt;This seems to work (in my limited testing), but the code is pretty
nasty (I’m far from confident that it is correct), and it is also
slow. About 10x slower than a class created with the
&lt;code&gt;namedtuple&lt;/code&gt; factory function, which is just:&lt;/p&gt;

&lt;pre&gt;
    def __new__(_cls, bar, baz):
        'Create new instance of Foo2(bar, baz)'
        return _tuple.__new__(_cls, (bar, baz))
&lt;/pre&gt;

&lt;p&gt;As a result of this finding, I’ve changed my constructor approach, and now generate a custom constructor for each new class using &lt;code&gt;eval&lt;/code&gt;. It looks something like:&lt;/p&gt;

&lt;pre&gt;
        str_fields = ", ".join(fields)
        new_method = eval("lambda cls, {}: tuple.__new__(cls, ({}))".format(str_fields, str_fields), {}, {})
&lt;/pre&gt;

&lt;p&gt;With this change constructor performance is on-par with the &lt;code&gt;namedtuple&lt;/code&gt; approach, and I’m much more confident that the code is actually correct!&lt;/p&gt;
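&lt;p&gt;As a rough illustration of the idea, here is a simplified sketch of generating such a class, combining the &lt;code&gt;eval&lt;/code&gt;-built constructor with per-field properties; this is my own reconstruction, not the actual &lt;code&gt;namedfields&lt;/code&gt; implementation:&lt;/p&gt;

```python
from operator import itemgetter

def make_point_class(name, fields):
    # Generate a specialised __new__ with an exact signature via eval,
    # mirroring what the namedtuple factory does with exec.
    str_fields = ", ".join(fields)
    new_method = eval(
        "lambda cls, {}: tuple.__new__(cls, ({},))".format(str_fields, str_fields),
        {}, {})
    attrs = {'__slots__': (), '_fields': tuple(fields), '__new__': new_method}
    # Attribute access via per-field properties, as in the optimised class.
    for idx, fld in enumerate(fields):
        attrs[fld] = property(itemgetter(idx))
    return type(name, (tuple,), attrs)

Point = make_point_class('Point', ('x', 'y'))
p = Point(3, 4)
print(p.x * p.y)
```

&lt;p&gt;Calling &lt;code&gt;Point(3, 4)&lt;/code&gt; now goes through a constructor whose signature exactly matches the fields, just like the factory-generated version.&lt;/p&gt;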

&lt;p&gt;I’ve cleaned up the &lt;code&gt;namedfields&lt;/code&gt; code a little, and made it available as part of my &lt;a href="https://github.com/bennoleslie/pyutil"&gt;pyutil&lt;/a&gt; repo.&lt;/p&gt;
    </description>  </item>

  <item>
    <title>A better namedtuple for Python (?)</title>
    <link>http://benno.id.au/blog/2014/11/30/a-better-namedtuple</link>
    <guid>http://benno.id.au/blog/2014/11/30/a-better-namedtuple</guid>
    <pubDate>Sun, 30 Nov 2014 16:56:23 +0000</pubDate>
    <description>&lt;p&gt;Python's &lt;code&gt;&lt;a
href="https://docs.python.org/3/library/collections.html#collections.namedtuple"&gt;namedtuple&lt;/a&gt;&lt;/code&gt;
factory function is, perhaps, the most under utilised feature in
Python. Classes created using the &lt;code&gt;namedtuple&lt;/code&gt; are great
because they lead to immutable object (as compared to normal classes which lead to
mutable objects). The goal of this blog post isn’t to convince you that immutable objetcs are a good
thing, but take a bit of a deep dive into the &lt;code&gt;namedtuple&lt;/code&gt; construct and explore some
of its shortcomings and some of the ways in which it can be improved.&lt;/p&gt;

&lt;p&gt;NOTE: All the code here is available in this &lt;a href="https://gist.github.com/bennoleslie/27aeb9065e81199f8af1"&gt;gist&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As a quick primer, let’s have a look at a very simple example:&lt;/p&gt;

&lt;pre&gt;
from collections import namedtuple

Foo = namedtuple('Foo', 'bar baz')
foo = Foo(5, 10)
print(foo)
&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; All my examples target Python 3.3. They will not necessarily
work in earlier versions; in particular, at least some of the later ones will not work in
Python 2.7.&lt;/p&gt;

&lt;p&gt;For simple things like this example, the existing &lt;code&gt;namedtuple&lt;/code&gt; works pretty well,
however the repetition of &lt;code&gt;Foo&lt;/code&gt; is a bit of a &lt;a href="http://c2.com/cgi/wiki?CodeSmell"&gt;code smell&lt;/a&gt;. Also, even though &lt;code&gt;Foo&lt;/code&gt; is a class, it is created significantly differently to a normal class (which I guess could be considered an advantage).&lt;/p&gt;

&lt;p&gt;Let’s try and do something a little more complex. If we find ourselves needing to know the product
of &lt;code&gt;bar&lt;/code&gt; and &lt;code&gt;baz&lt;/code&gt; a bunch of times we probably want to encapsulate rather than
writing &lt;code&gt;foo.bar * foo.baz&lt;/code&gt; everywhere. (OK this example isn’t the greatest in the history
of technical writing but just bear with me on this). So, as a first example we can just write a function:&lt;/p&gt;

&lt;pre&gt;
def foo_bar_times_baz(foo):
    return foo.bar * foo.baz

print(foo_bar_times_baz(foo))
&lt;/pre&gt;

&lt;p&gt;There isn’t anything wrong with this, functions are great! But the normal pattern in Python
is that when we have a function that does stuff on a class instance we make it a method, rather
than a function. We could debate the aesthetic and practical tradeoffs between methods and functions,
but let’s just go with the norm here and assume we’d prefer to do &lt;code&gt;foo.bar_times_baz()&lt;/code&gt;.
The canonical approach to this is:&lt;/p&gt;

&lt;pre&gt;
class Foo(namedtuple('Foo', 'bar baz')):
    def bar_times_baz(self):
        return self.bar * self.baz


foo = Foo(5, 10)
print(foo.bar_times_baz())
&lt;/pre&gt;

&lt;p&gt;Now this works but, again, we have the repetition of &lt;code&gt;Foo&lt;/code&gt;, which bugs me. Coupled
with this we are calling a function inside the class definition which is a bit ugly. There are
some minor practical concerns because there are now multiple &lt;code&gt;Foo&lt;/code&gt; classes in existence, which
can at times cause confusion. This can be avoided by appending an underscore to the name of the
super-class. E.g.: &lt;code&gt;class Foo(namedtuple('Foo_', 'bar baz'))&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;An alternative to this which I’ve used from time to time is to directly update the &lt;code&gt;Foo&lt;/code&gt;
class:&lt;/p&gt;

&lt;pre&gt;
Foo = namedtuple('Foo', 'bar baz')
Foo.bar_times_baz = lambda self: self.bar * self.baz
foo = Foo(5, 10)
print(foo.bar_times_baz())
&lt;/pre&gt;

&lt;p&gt;I quite like this, and think it works well for methods which can be written as an expression, but
not all Python programmers are particularly fond of lambdas, and although I know that classes are
mutable, I prefer to consider them as immutable, so modifying them after creation is also a bit ugly.&lt;/p&gt;

&lt;p&gt;The way that the &lt;code&gt;namedtuple&lt;/code&gt; function works is by creating a string that matches
the class definition and then using &lt;code&gt;exec&lt;/code&gt; on that string to create the new class. You
can see the string version of that class definition by passing &lt;code&gt;verbose=True&lt;/code&gt; to the
&lt;code&gt;namedtuple&lt;/code&gt; function. For our &lt;code&gt;Foo&lt;/code&gt; class it looks a bit like:&lt;/p&gt;

&lt;pre&gt;
class Foo(tuple):
    'Foo(bar, baz)'

    __slots__ = ()

    _fields = ('bar', 'baz')

    def __new__(_cls, bar, baz):
        'Create new instance of Foo(bar, baz)'
        return _tuple.__new__(_cls, (bar, baz))

    ....

    def __repr__(self):
        'Return a nicely formatted representation string'
        return self.__class__.__name__ + '(bar=%r, baz=%r)' % self

    .....

    bar = _property(_itemgetter(0), doc='Alias for field number 0')

    baz = _property(_itemgetter(1), doc='Alias for field number 1')
&lt;/pre&gt;

&lt;p&gt;I’ve omitted some of the method implementation for brevity, but you get the general idea.
If you have a look at that you might have the same thought I did:
&lt;q&gt;why isn’t this just a sub-class?&lt;/q&gt;. It seems practical to construct:&lt;/p&gt;

&lt;pre&gt;
class NamedTuple(tuple):
    __slots__ = ()

    _fields = None  # Subclass must provide this

    def __new__(_cls, *args):
        # Error checking, and **kwargs handling omitted for brevity
        return tuple.__new__(_cls, tuple(args))

    def __repr__(self):
        'Return a nicely formatted representation string'
        fmt = '(' + ', '.join('%s=%%r' % x for x in self._fields) + ')'
        return self.__class__.__name__ + fmt % self

    def __getattr__(self, field):
        try:
            idx = self._fields.index(field)
        except ValueError:
            raise AttributeError("'{}' NamedTuple has no attribute '{}'".format(self.__class__.__name__, field))

        return self[idx]
&lt;/pre&gt;

&lt;p&gt;Again, some method implementations are omitted for brevity (see the gist for gory details). The main
difference to the generated class is making use of &lt;code&gt;_fields&lt;/code&gt; consistently rather than hard-coding
some things, and the necessity of a &lt;code&gt;__getattr__&lt;/code&gt; method, rather than hardcoded properties.&lt;/p&gt;

&lt;p&gt;This can be used something like this:&lt;/p&gt;

&lt;pre&gt;
class Foo(NamedTuple):
    _fields = ('bar', 'baz')
&lt;/pre&gt;

&lt;p&gt;This fits the normal pattern for creating classes much better than a constructor function. It is two
lines rather than the slightly more concise one-liner, but we don’t need to duplicate the &lt;code&gt;Foo&lt;/code&gt;
name so that is a win. In addition, when we want to start adding methods, we know exactly where they go.&lt;/p&gt;

&lt;p&gt;This isn’t without some drawbacks. The biggest problem is that
we’ve just inadvertently made our subclass mutable! Which is certainly
not ideal. This can be rectified by adding a &lt;code&gt;__slots__ =
()&lt;/code&gt; in our sub-class:&lt;/p&gt;

&lt;pre&gt;
class Foo(NamedTuple):
    __slots__ = ()
    _fields = ('bar', 'baz')
&lt;/pre&gt;
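&lt;p&gt;The difference is easy to demonstrate; using a minimal stand-in of my own for the &lt;code&gt;NamedTuple&lt;/code&gt; base class above, a sub-class without &lt;code&gt;__slots__&lt;/code&gt; silently accepts new attributes, while one with &lt;code&gt;__slots__ = ()&lt;/code&gt; rejects them:&lt;/p&gt;

```python
class NamedTuple(tuple):
    # Minimal stand-in for the base class described above.
    __slots__ = ()
    _fields = None

class MutableFoo(NamedTuple):
    # No __slots__: instances grow a __dict__, so they are mutable.
    _fields = ('bar', 'baz')

class ImmutableFoo(NamedTuple):
    __slots__ = ()
    _fields = ('bar', 'baz')

# No custom __new__ in this stand-in, so construct from an iterable.
m = MutableFoo((5, 10))
m.anything = 42          # silently allowed: stored in m.__dict__

i = ImmutableFoo((5, 10))
try:
    i.anything = 42
except AttributeError as e:
    print('blocked:', e)
```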

&lt;p&gt;At this point this approach no longer looks so good. Some other minor drawbacks are that
there is no checking that the field names are really valid in any way, which the &lt;code&gt;namedtuple&lt;/code&gt;
approach does correctly. The final drawback is on the performance front. We
can measure the attribute access time:&lt;/p&gt;

&lt;pre&gt;
from timeit import timeit

RUNS = 1000000

direct_idx_time = timeit('foo[0]', setup='from __main__ import foo', number=RUNS)
direct_attr_time = timeit('foo.bar', setup='from __main__ import foo', number=RUNS)
sub_class_idx_time = timeit('foo[0]', setup='from __main__ import foo_sub_class as foo', number=RUNS)
sub_class_attr_time = timeit('foo.bar', setup='from __main__ import foo_sub_class as foo', number=RUNS)


print(direct_idx_time)
print(direct_attr_time)
print(sub_class_idx_time)
print(sub_class_attr_time)

&lt;/pre&gt;

&lt;p&gt;The results are that accessing a value via direct indexing is the same for both approaches; however,
when accessing via attributes the sub-class approach is 10x slower than the original &lt;code&gt;namedtuple&lt;/code&gt;
approach (which was itself about 3x slower than direct indexing).&lt;/p&gt;

&lt;p&gt;Just for kicks, I wanted to see how far I could take this approach. So, I created a little
optimize function, which can be applied as a decorator on a namedtuple class:&lt;/p&gt;

&lt;pre&gt;
def optimize(cls):
    for idx, fld in enumerate(cls._fields):
        setattr(cls, fld, property(itemgetter(idx), doc='Alias for field number {}'.format(idx)))
    return cls


@optimize
class FooSubClassOptimized(NamedTuple):
    __slots__ = ()
    _fields = ('bar', 'baz')

    def bar_times_baz(self):
        return self.bar * self.baz
&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;optimize&lt;/code&gt; function goes through and adds a property to the class for each of its fields.
This means that when an attribute is accessed, it can short-cut through the property, rather than
using the relatively slow &lt;code&gt;__getattr__&lt;/code&gt; method. This results in a considerable speed-up,
however performance was still about 1.2x slower than the &lt;code&gt;namedtuple&lt;/code&gt; approach.&lt;/p&gt;

&lt;p&gt;If anyone can explain why this takes a performance hit I’d appreciate it, because I certainly
can’t work it out!&lt;/p&gt;

&lt;p&gt;So, at this point I thought my quest for a better
&lt;code&gt;namedtuple&lt;/code&gt; had come to an end, but I thought it might be worth pursuing the decorator
approach some more. Rather than modifying the existing class, instead we could use that as a template
and return a new class. After some iteration I came up with a class decorator called &lt;code&gt;namedfields&lt;/code&gt;,
which can be used like:&lt;/p&gt;

&lt;pre&gt;
@namedfields('bar', 'baz')
class Foo(tuple):
    pass
&lt;/pre&gt;

&lt;p&gt;This approach is slightly more verbose than sub-classing, or the &lt;code&gt;namedtuple&lt;/code&gt; factory
however, I think it should be fairly clear, and doesn’t include any redundancy. The decorator
constructs a new class based on the input class, however it adds the properties for all the fields,
ensures &lt;code&gt;__slots__&lt;/code&gt; is correctly added to the class, and attaches all the useful
&lt;code&gt;NamedTuple&lt;/code&gt; methods (such as &lt;code&gt;_replace&lt;/code&gt; and &lt;code&gt;__repr__&lt;/code&gt;) to the
new class.&lt;/p&gt;

&lt;p&gt;Classes constructed this way have identical attribute performance as &lt;code&gt;namedtuple&lt;/code&gt;
factory classes.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;namedfields&lt;/code&gt; function is:&lt;/p&gt;

&lt;pre&gt;
def namedfields(*fields):
    def inner(cls):
        if not issubclass(cls, tuple):
            raise TypeError("namefields decorated classes must be subclass of tuple")

        attrs = {
            '__slots__': (),
        }

        methods = ['__new__', '_make', '__repr__', '_asdict',
                   '__dict__', '_replace', '__getnewargs__',
                   '__getstate__']

        attrs.update({attr: getattr(NamedTuple, attr) for attr in methods})

        attrs['_fields'] = fields

        attrs.update({fld: property(itemgetter(idx), doc='Alias for field number {}'.format(idx))
                      for idx, fld in enumerate(fields)})

        attrs.update({key: val for key, val in cls.__dict__.items()
                      if key not in ('__weakref__', '__dict__')})

        return type(cls.__name__, cls.__bases__, attrs)

    return inner
&lt;/pre&gt;

&lt;p&gt;So, is the &lt;code&gt;namedfields&lt;/code&gt; decorator &lt;strong&gt;better&lt;/strong&gt; than the &lt;code&gt;namedtuple&lt;/code&gt;
factory function? Performance wise things are on-par for the common operation of accessing attributes.
I think from an aesthetic point of view things are on par for simple classes:&lt;/p&gt;

&lt;pre&gt;
Foo = namedtuple('Foo', 'bar baz')

# compared to

@namedfields('bar', 'baz')
class Foo(tuple): pass
&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;namedtuple&lt;/code&gt; wins if you prefer a one-liner (although if you are truly perverse you can always do:
&lt;code&gt;Foo = namedfields('bar', 'baz')(type('Foo', (tuple, ), {}))&lt;/code&gt;). If you don’t really care about
one vs. two lines, then it is a pretty line-ball call. However, when we compare what is required
to add a simple method, the decorator becomes a clear winner:&lt;/p&gt;

&lt;pre&gt;
class Foo(namedtuple('Foo', 'bar baz')):
    def bar_times_baz(self):
        return self.bar * self.baz

# compared to

@namedfields('bar', 'baz')
class Foo(tuple):
    def bar_times_baz(self):
        return self.bar * self.baz
&lt;/pre&gt;

&lt;p&gt;There is an extra line for the decorator in the &lt;code&gt;namedfields&lt;/code&gt; approach, however
this is a lot clearer than it being forced in as a super class.&lt;/p&gt;

&lt;p&gt;Finally, I wouldn’t recommend using this right now, because bringing in a new dependency for
something that provides only marginal gains over the built-in functionality isn’t really worth it,
and this code is really only proof-of-concept at this point in time.&lt;/p&gt;

&lt;p&gt;I will be following up with some extensions to this approach that cover some of my other
pet-hates with the &lt;code&gt;namedtuple&lt;/code&gt; approach: custom constructors and subclasses.&lt;/p&gt;
    </description>  </item>

</channel>
</rss>
