For a little personal project I’ve been working on recently, I needed to create some unit tests for a CGI script, sorry, that should be, a cloud computing application. Of course, I turned to my trusty hammer, Python, and was a little disappointed to discover it didn’t quite have the batteries I needed included.
I kind of thought that urllib2 would more or less do what I wanted out of the box, but unfortunately it didn’t and (shock, horror!) I actually needed to write some code myself! The first problem I ran into is that urllib2 only supports GET and POST out of the box. HTTP is constrained enough in the verbs it provides, so I really do want access to things like PUT, DELETE and HEAD.
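You can see the same default behaviour in urllib2's successor, Python 3's urllib.request (shown here only because it is what runs today): the method a Request reports is derived purely from whether body data is present. Python 3.3 eventually added an explicit method argument, which is exactly the knob that was missing from urllib2. A quick sketch:

```python
import urllib.request

# With no data, a Request defaults to GET...
req = urllib.request.Request("http://example.com/")
print(req.get_method())  # GET

# ...and the mere presence of data flips it to POST.
req = urllib.request.Request("http://example.com/", data=b"payload")
print(req.get_method())  # POST

# Python 3.3+ added an explicit method argument; urllib2 had no equivalent.
req = urllib.request.Request("http://example.com/", method="PUT")
print(req.get_method())  # PUT
```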
The other problem I ran into is that I didn’t want things automatically redirecting (although clearly this would be the normal use-case), because I wanted to check that I got a redirect in certain cases.
The final problem that I had is that only status code 200 is treated as success, and other 2xx codes raise exceptions. This is generally not what you want, since 201 is a perfectly valid return code, indicating that a new resource was created.
So, urllib2 is billed as “an extensible library for opening URLs using a variety of protocols”, so surely I can just extend it to do what I want? Well, it turns out that I can, but it seemed to be harder than I was expecting. Not because the final code I needed to write was difficult or involved, but because it was quite difficult to work out what the right code to write was. I want to explore for a little bit why (I think) this might be the case.
urllib2 is quite nice, because simple things (fetch a URL, follow redirects, POST data) are very easy to do:

import urllib2

ret = urllib2.urlopen("http://some_url/")
data = ret.read()
And it is definitely possible to do more complex things, but (at least for me) there is a sharp discontinuity in the API, which means that learning the easy API doesn’t help you learn the more complex one. The documentation (at least as far as I read it) also doesn’t make it apparent that there are, in effect, two modes of operation.
The completely frustrating thing is that the documentation in the source file is much better than the online documentation, since it talks about some of the things that happen in the module which are otherwise “magic”.
For example, the build_opener function is pretty magical, since it does a bunch of introspection and ends up either adding a handler or replacing a handler depending on the class. This is explained in the code as: “if one of the argument is a subclass of the default handler, the argument will be installed instead of the default”, which to me makes a lot of sense, whereas the online documentation describes it as: “Instances of the following classes will be in front of the handlers, unless the handlers contain them, instances of them or subclasses of them: <list of default handlers>”. For me the former is much clearer than the latter!
Anyway, here is the code I ended up with:
import urllib2

# Minimal reconstruction (the class wasn't shown in the post itself):
# a Request subclass whose HTTP method can be set explicitly.
class HttpRestRequest(urllib2.Request):
    def __init__(self, url, method=None, headers=None, data=None):
        urllib2.Request.__init__(self, url, data=data,
                                 headers=headers or {})
        self.method = method

    def get_method(self):
        return self.method or urllib2.Request.get_method(self)

# Only an HTTPHandler: no redirect following, no raising on non-200 codes.
opener = urllib2.OpenerDirector()
opener.add_handler(urllib2.HTTPHandler())

def restopen(url, method=None, headers=None, data=None):
    if headers is None:
        headers = {}
    if method is None:
        method = "GET" if data is None else "POST"
    return opener.open(HttpRestRequest(url, method=method,
                                       headers=headers, data=data))
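For what it’s worth, the same pattern translates almost line-for-line to Python 3’s urllib.request; this is my sketch of the port, not code from the original post. A bare OpenerDirector with just an HTTPHandler means nothing follows redirects or second-guesses 2xx codes, and since Request grew a method argument in Python 3.3, no subclass is needed at all:

```python
import urllib.request

# A bare OpenerDirector with only an HTTPHandler installed: no
# HTTPRedirectHandler (redirects come back as plain responses) and no
# HTTPErrorProcessor (non-200 codes don't raise).
opener = urllib.request.OpenerDirector()
opener.add_handler(urllib.request.HTTPHandler())

def restopen(url, method=None, headers=None, data=None):
    if headers is None:
        headers = {}
    if method is None:
        method = "GET" if data is None else "POST"
    return opener.open(urllib.request.Request(
        url, method=method, headers=headers, data=data))
```

Calling restopen(url, method="PUT", data=...) then hands back the raw response, redirects, 201s and all, and the test can assert on .status directly.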
So, conclusions: if the docs don’t make sense, read the code. If you are writing an API, try to make it easy to get from the easy case to the more complex case. (This is quite difficult to do, and I have definitely been guilty in the past of falling into the same trap in APIs I have provided!) If you can’t get the API to transition easily from easy to hard, make sure you document it well. Finally, Python is a great language for accessing services over HTTP, even if it does require a little bit of work to get the framework in place.
So, I started with something reasonably straight-forward — update my blog posts so that the <title> tag is set correctly — which quickly led me down the rabbit hole of typographically correct apostrophes, Unicode, XML, encodings, keyboards and input methods. Updating my blog software took about 15 minutes, delving down the rabbit hole took about 5 hours.
So, the apostrophe. This isn’t about the correct usage of the apostrophe. This is entirely about the correct typesetting of the apostrophe. Now there are lots of opinions on the subject. It basically comes down to a choice between ASCII character 0x27 and Unicode code point U+2019. Of course it just so happens that ASCII character 0x27 is also Unicode code point U+0027, so really, this comes down to a discussion about which Unicode code point is most appropriate for representing the apostrophe. After way too much searching, it actually turns out to be a really simple decision. Unicode provides the documentation for the code points in a series of charts. The chart C0 Controls and Basic Latin (PDF) documents the APOSTROPHE. It is described as:
0027 ' APOSTROPHE
  = apostrophe-quote (1.0)
  = APL quote
  • neutral (vertical) glyph with mixed usage
  • 2019 ’ is preferred for apostrophe
So, despite the fact that it is named APOSTROPHE, it is described as a “neutral (vertical) glyph with mixed usage”, and it notes that U+2019 is the preferred code point for apostrophe. This looks pretty conclusive, but let’s check the General Punctuation chart:
2019 ’ RIGHT SINGLE QUOTATION MARK
  = single comma quotation mark
  • this is the preferred character to use for apostrophe
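Both chart entries can be double-checked from Python, as a sketch, using the standard unicodedata module (nothing here is required by the charts themselves, it just confirms them):

```python
import unicodedata

# The names say it all: U+0027 is the "neutral" glyph...
print(unicodedata.name("\u0027"))  # APOSTROPHE

# ...while U+2019 is the typographically curly one.
print(unicodedata.name("\u2019"))  # RIGHT SINGLE QUOTATION MARK
```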
So, my conclusion is that the most appropriate character for an apostrophe is U+2019. OK, great, now I have to decide how I can actually encode this. I’m used to writing plain ASCII text documents, and U+2019 is not something I can represent in ASCII. So, since I’m mostly concerned about documents I’m publishing on the interwebs, I figured that character entity references would be the way to go. And there appears to be a relevant entity:
<!ENTITY rsquo CDATA "&#8217;" -- right single quotation mark, U+2019 ISOnum -->
Of course it seems a little odd using &rsquo; to represent an apostrophe, but so be it. Now, XML defines a new character entity, &apos;, which you might on first glance think is exactly what you want, but on second glance it isn’t, since it maps to U+0027, not U+2019. &apos; is mostly used for escaping strings which are enclosed in actual ' characters. So, &apos; is out. XML itself only defines character entities for ampersand, less-than, greater-than, quotation mark, and apostrophe. XHTML however defines the rest of the character entities that you have come to love and expect from HTML, so &rsquo; is still in, as long as it is used in an XHTML document, not a general XML document.
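The distinction is easy to demonstrate with a strict XML parser; here is a sketch using Python's xml.etree (the point is just that &apos; is one of the five built-in XML entities and &rsquo; is not):

```python
import xml.etree.ElementTree as ET

# The predefined XML entities parse fine; &apos; gives U+0027.
assert ET.fromstring("<p>&apos;</p>").text == "\u0027"

# &rsquo; is an (X)HTML entity, not an XML one, so a plain XML
# parser rejects it as undefined.
try:
    ET.fromstring("<p>&rsquo;</p>")
except ET.ParseError:
    print("undefined entity")
```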
So I was set on just using &rsquo;, and I sent my page off to the validator. This went fine, except it pedantically pointed out that I had not declared a character encoding, and really I should. Damn, now I need to think about character encoding too. OK, so what options are there? Well, IANA has a nice list of official names for character sets that may be used on the internet.
ANSI_X3.4-1968 (a.k.a. US-ASCII, a.k.a. ASCII) had to be a big first contender, since that is basically what I had been using for many a year, but to be honest, this seemed a little backwards. The idea of having to use a numeric character reference (NCR) every time I wanted an apostrophe seemed a little silly. Besides, the W3C recommends using “an encoding that allows you to represent the characters in their normal form, rather than using character entities or NCRs”.
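Just to make the NCR option concrete (a sketch; the numeric forms below are standard XML, not anything from the W3C quote): the decimal and hex references for U+2019 both decode to the same code point.

```python
import xml.etree.ElementTree as ET

# &#8217; (decimal) and &#x2019; (hex) are numeric character
# references for the same code point, U+2019.
assert ET.fromstring("<p>&#8217;</p>").text == "\u2019"
assert ET.fromstring("<p>&#x2019;</p>").text == "\u2019"
```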
OK, so since the XML spec defines that “a character reference refers to a specific character in the ISO/IEC 10646 character set”, it seems that I really should choose an encoding that can directly encode Unicode code points. (The Unicode standard and ISO/IEC 10646 track each other.) So, what options are there for encoding Unicode?
Well, it seems that one of the Unicode transformation formats would be a good choice. But there are so many to choose from: UTF-8, UTF-16, UTF-32, even UTF-9. While UTF-9 was definitely a contender, UTF-8 seems the sanest thing for me. For a start, it seems to just work™ in my editor. So, going with UTF-8, I still end up needing to let other people know my files are encoded in UTF-8. There appear to be a few options for doing this, each with its own pros and cons. In the end, I just put it in the XML prolog.
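A quick sketch of what that combination buys you (the snippet is illustrative, not from the post): U+2019 becomes a three-byte sequence in UTF-8, and an XML prolog declaring the encoding is enough for a parser to decode the raw bytes back to the right code point.

```python
import xml.etree.ElementTree as ET

# U+2019 encodes to three bytes in UTF-8.
assert "\u2019".encode("utf-8") == b"\xe2\x80\x99"

# A prolog like this lets an XML parser decode the bytes correctly.
doc = b'<?xml version="1.0" encoding="UTF-8"?><p>\xe2\x80\x99</p>'
assert ET.fromstring(doc).text == "\u2019"
```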
Of course, the final piece of the puzzle is actually inputting the characters. OS X seems to have fairly good support for this. If you poke around a bit in the internationalisation support in System Preferences and enable “Show Keyboard Viewer”, “Show Character Viewer” and “Unicode Hex Input”, you should be able to work things out.
So, I can now have lovely typographically correct apostrophes and they work great, and all is good with the world. (Except of course that this page probably renders like crap in Internet Explorer. Oh well.)
Damn, conkeror is the most awesome thing ever. It basically gives you an emacs-style interface to Firefox. This is not just emacs key bindings for text boxes and stuff, but a full-on emacs-style interface, with a status bar and entry box at the bottom, and none of this toolbar and menu stuff crufting up the page.
Instead of tabs you just have multiple buffers, which you can move between with the normal emacs key combinations. Huzzah!
I guess it isn't too surprising that this masterpiece comes from Shawn Betts, who also gave us ratpoison.
Of course it ain't perfect yet. It would be amazingly cool if it supported C-x 2 (split windows) and C-x B (list buffers). Also, if there was a way I could still use the developer toolbar, that would be good too.