urllib2 and web applications

Mon, 05 Jan 2009 18:17:52 +0000
python tech article web

For a little personal project I’ve been working on recently, I needed to create some unit tests for a CGI script, sorry, that should be, a cloud computing application. Of course, I turned to my trusty hammer, Python, and was a little disappointed to discover it didn’t quite have the batteries I needed included.

I kind of thought that urllib2 would more or less do what I wanted out of the box, but unfortunately it didn’t and (shock, horror!) I actually needed to write some code myself! The first problem I ran into is that urllib2 only supports GET and POST out of the box. HTTP is constrained enough in the verbs it provides, so I really do want access to things like PUT, DELETE and HEAD.

The other problem I ran into is that I didn’t want things automatically redirecting (although clearly this would be the normal use-case), because I wanted to check that I got a redirect in certain cases.
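A minimal sketch of one way to switch off the automatic redirects, assuming a subclass of HTTPRedirectHandler is acceptable. The name NoRedirectHandler is made up for illustration, and the import shim is only there so the sketch also runs on Python 3, where the urllib2 classes live in urllib.request.

```python
try:
    import urllib2 as urlrequest  # Python 2, as used in this post
except ImportError:
    import urllib.request as urlrequest  # Python 3 home of the same classes


class NoRedirectHandler(urlrequest.HTTPRedirectHandler):
    """Hypothetical handler that declines to follow any redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means "no follow-up request", so the 3xx
        # response falls through to the default error handling.
        return None


# Passing the subclass to build_opener installs it in place of the
# stock redirect handler.
no_redirect_opener = urlrequest.build_opener(NoRedirectHandler)
```

With an opener like this, a 3xx response should surface as an HTTPError whose code attribute holds the status, which is something a unit test can assert on.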

The final problem I had is that only status code 200 is treated as success, and other 2xx codes raise exceptions. This is generally not what you want, since 201 is a perfectly valid return code, indicating that a new resource was created.
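A sketch of how the 2xx problem could be handled, assuming it is enough to replace the stock HTTPErrorProcessor with a more lenient one. The class name Accept2xxProcessor is hypothetical. As far as I am aware, later Python releases already accept the whole 2xx range, so this mainly matters for the urllib2 of the time.

```python
try:
    import urllib2 as urlrequest  # Python 2, as used in this post
except ImportError:
    import urllib.request as urlrequest  # Python 3 home of the same classes


class Accept2xxProcessor(urlrequest.HTTPErrorProcessor):
    """Hypothetical processor that treats every 2xx status as success."""

    def http_response(self, request, response):
        code = response.code
        if 200 <= code < 300:
            # 200, 201, 204 and friends are handed back untouched.
            return response
        # Anything else goes through the opener's normal error chain.
        return self.parent.error(
            "http", request, response, code, response.msg, response.info())

    https_response = http_response
```

Because it subclasses a default handler, passing it to build_opener should install it instead of the stock processor.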

So, urllib2 is billed as “An extensible library for opening URLs using a variety of protocols”; surely I can just extend it to do what I want? Well, it turns out that I can, but it was harder than I was expecting. Not because the final code I needed to write was difficult or involved, but because it was quite difficult to work out what the right code to write was. I want to explore for a little bit why (I think) this might be the case.

urllib2 is quite nice, because simple things (fetch a URL, follow redirects, POST data), are very easy to do:

import urllib2

ret = urllib2.urlopen("http://some_url/")
data = ret.read()

And it is definitely possible to do more complex things, but (at least for me) there is a sharp discontinuity in the API, which means that learning the easy API doesn’t help you learn the more complex API. The documentation (at least as far as I read it) also doesn’t make it apparent that there are effectively two modes of operation.

The completely frustrating thing is that the documentation in the source file is much better than the online documentation, since it talks about some of the things that happen in the module which are otherwise “magic”.

For example, the build_opener function is pretty magical, since it does a bunch of introspection, and ends up either adding a handler or replacing a handler depending on the class. This is explained in the code as: “if one of the argument is a subclass of the default handler, the argument will be installed instead of the default.”, which to me makes a lot of sense, whereas the online documentation describes it as: “Instances of the following classes will be in front of the handlers, unless the handlers contain them, instances of them or subclasses of them: ....<list of default handlers>”. For me the former is much clearer than the latter!
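The add-or-replace behaviour can be seen by inspecting the opener that build_opener returns. This is a small sketch, assuming the handlers attribute of the opener is a fair thing to peek at; both handler classes are made up for illustration, and the import shim is only there so it also runs on Python 3.

```python
try:
    import urllib2 as urlrequest  # Python 2, as used in this post
except ImportError:
    import urllib.request as urlrequest  # Python 3 home of the same classes


class MyRedirectHandler(urlrequest.HTTPRedirectHandler):
    """Made-up subclass of one of the default handlers."""


class MyExtraHandler(urlrequest.BaseHandler):
    """Made-up handler unrelated to any default handler."""

    def http_request(self, request):
        # A handler needs at least one protocol method to be registered.
        return request


def redirect_handler_classes(opener):
    # Collect the class of every redirect handler installed in the opener.
    return [h.__class__ for h in opener.handlers
            if isinstance(h, urlrequest.HTTPRedirectHandler)]


# Subclass of a default handler: installed instead of the default.
replaced = urlrequest.build_opener(MyRedirectHandler)
# Unrelated handler: added alongside the untouched defaults.
added = urlrequest.build_opener(MyExtraHandler)
```

Here redirect_handler_classes(replaced) contains only MyRedirectHandler, while redirect_handler_classes(added) still contains the stock HTTPRedirectHandler.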

Anyway, here is the code I ended up with:

import urllib2

opener = urllib2.OpenerDirector()

def restopen(url, method=None, headers=None, data=None):
    if headers is None:
        headers = {}
    if method is None:
        # Default to GET, or to POST when there is a body to send.
        if data is None:
            method = "GET"
        else:
            method = "POST"
    return opener.open(HttpRestRequest(url, method=method,
                                       headers=headers, data=data))
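The definition of HttpRestRequest is not shown above, so what follows is only a guess at its shape: a Request subclass that remembers an explicit HTTP verb and reports it from get_method. The keyword arguments match the call in restopen; everything else, including the _rest_method attribute, is an assumption.

```python
try:
    import urllib2 as urlrequest  # Python 2, as used in this post
except ImportError:
    import urllib.request as urlrequest  # Python 3 home of the same classes


class HttpRestRequest(urlrequest.Request):
    """Hypothetical Request subclass that carries an explicit HTTP verb."""

    def __init__(self, url, method=None, headers=None, data=None):
        # Call the base initialiser explicitly; urllib2.Request is an
        # old-style class on Python 2, so super() is not available there.
        urlrequest.Request.__init__(self, url, data, headers or {})
        self._rest_method = method

    def get_method(self):
        # Use the explicit verb if one was given, otherwise fall back
        # to the stock choice between GET and POST.
        if self._rest_method is not None:
            return self._rest_method
        return urlrequest.Request.get_method(self)
```

For example, HttpRestRequest("http://some_url/", method="DELETE").get_method() returns "DELETE". Note also that a bare OpenerDirector() starts with no handlers at all, so the opener presumably had the relevant handlers added with add_handler before use.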

So, conclusions: if the docs don’t make sense, read the code. If you are writing an API, try to make it easier to get from the easy case to the more complex case. (This is quite difficult to do, and I have definitely been guilty in the past of falling into the same trap in APIs I have provided!) If you can’t get the API to easily transition from easy to hard, make sure you document it well. Finally, Python is a great language for accessing services over HTTP, even if it does require a little bit of work to get the framework in place.

On apostrophes, Unicode and XML

Thu, 28 Feb 2008 12:51:39 +0000
article tech unicode web

So, I started with something reasonably straight-forward — update my blog posts so that the <title> tag is set correctly — which quickly led me down the rabbit hole of typographically correct apostrophes, Unicode, XML, encodings, keyboards and input methods. Updating my blog software took about 15 minutes, delving down the rabbit hole took about 5 hours.

So, the apostrophe. This isn’t about the correct usage of the apostrophe. This is entirely about the correct typesetting of the apostrophe. Now there are lots of opinions on the subject. It basically comes down to the choice between ASCII character 0x27 and Unicode code point U+2019. Of course it just so happens that ASCII character 0x27 is also Unicode code point U+0027, so really, this comes down to a discussion about which Unicode code point is most appropriate for representing the apostrophe. After way too much searching, it actually turns out to be a really simple decision. Unicode provides the documentation for the code points in a series of charts. The chart C0 Controls and Basic Latin (pdf) documents the APOSTROPHE. It is described as:

= apostrophe-quote (1.0) 
= APL quote 
• neutral (vertical) glyph with mixed usage 
• 2019 ’  is preferred for apostrophe 

So, despite the fact that it is named APOSTROPHE, it is described as a neutral (vertical) glyph with mixed usage, and it notes that U+2019 is the preferred code point for apostrophe. This looks pretty conclusive but let’s check the General Punctuation chart:

= single comma quotation mark 
• this is the preferred character to use for apostrophe
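The two code points from the charts can also be looked up from Python's copy of the Unicode database. This is just a small check, written for Python 3; the constant names are made up for illustration.

```python
import unicodedata

TYPEWRITER_APOSTROPHE = "\u0027"   # the ASCII character 0x27
TYPOGRAPHIC_APOSTROPHE = "\u2019"  # the one the charts prefer

for ch in (TYPEWRITER_APOSTROPHE, TYPOGRAPHIC_APOSTROPHE):
    # Print the code point alongside its official character name.
    print("U+%04X %s" % (ord(ch), unicodedata.name(ch)))
```

This prints APOSTROPHE for U+0027 and RIGHT SINGLE QUOTATION MARK for U+2019.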

So, my conclusion is that the most appropriate character for an apostrophe is U+2019. OK, great, now I have to decide how I can actually encode this. I’m used to writing plain ASCII text documents, and U+2019 is not something I can represent in ASCII. Since I’m mostly concerned about documents I’m publishing on the interwebs, I figured that character entity references would be the way to go. There appears to be a relevant entity:

<!ENTITY rsquo   CDATA "’" -- right single quotation mark, U+2019 ISOnum -->

Of course it seems a little odd using &rsquo; to represent an apostrophe, but so be it. Now XML defines a new character entity, &apos;, which you might at first glance think is exactly what you want, but on second glance it isn’t, since it maps to U+0027, not U+2019. &apos; is mostly used for escaping strings which are enclosed in actual ' characters. So, &apos; is out. XML itself only defines character entities for ampersand, less-than, greater-than, quotation mark, and apostrophe. XHTML however defines the rest of the character entities that you have come to love and expect from HTML, so &rsquo; is still in, as long as it is used in an XHTML document, not a general XML document.
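The difference between the two entities can be checked with the Python 3 standard library. This is a rough sketch, assuming the stock ElementTree parser stands in for "a general XML parser"; the variable names are made up.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# &apos; is one of XML's five predefined entities and yields U+0027.
apos_text = ET.fromstring("<p>&apos;</p>").text

# rsquo comes from the HTML entity set and names U+2019.
rsquo_codepoint = name2codepoint["rsquo"]

# A plain XML parser with no DTD loaded does not know &rsquo; at all.
try:
    ET.fromstring("<p>&rsquo;</p>")
    rsquo_known_to_plain_xml = True
except ET.ParseError:
    rsquo_known_to_plain_xml = False
```

Here apos_text is the plain ' character, rsquo_codepoint is 0x2019, and rsquo_known_to_plain_xml ends up False.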

So I was set on just using &rsquo;, and I sent my page off to the validator. This went fine, except it pedantically pointed out that I had not defined a character encoding, and really I should. Damn, now I need to think about character encoding too. OK, so what options are there? Well, IANA has a nice list of official names for character sets that may be used on the internet.

ANSI_X3.4-1968 (a.k.a. US-ASCII, a.k.a. ASCII) had to be a big first contender, since that is basically what I had been doing for many a year, but to be honest, this seemed a little backwards. The idea of having to use numeric character references (NCRs) every time I wanted an apostrophe seemed a little silly. Besides, the W3C recommends using

an encoding that allows you to represent the characters in their normal form, rather than using character entities or NCRs
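To see what sticking with ASCII would mean in practice, here is a small Python 3 illustration of an NCR standing in for the apostrophe. The sample text is made up.

```python
text = "it\u2019s"

# Characters outside ASCII are replaced by numeric character references.
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")
print(ascii_bytes)  # b'it&#8217;s'
```

Every typographic apostrophe becomes the six-character sequence &#8217; in the source, which is exactly the clutter the W3C advice is trying to avoid.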

OK, so since the XML spec defines that:

A character reference refers to a specific character in the ISO/IEC 10646 character set

it seems that I really should choose an encoding that can directly encode Unicode code points. (The Unicode standard and ISO/IEC 10646 track each other.) So, what options are there for encoding Unicode? Well, it seems that one of the Unicode transformation formats would be a good choice. But there are so many to choose from: UTF-8, UTF-16, UTF-32, even UTF-9. While UTF-9 was definitely a contender, UTF-8 seems to be the sanest thing for me. For a start it seems to just-work™ in my editor. So, going with UTF-8, I still end up needing to let other people know my files are encoded in UTF-8. There appear to be a few options for doing this, and the article goes into a long explanation of the various pros and cons. In the end, I just put it into the XML prolog.
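As a quick illustration of the UTF-8 route, written for Python 3: the apostrophe is stored directly as three bytes, and a parser that reads the encoding from the XML prolog gets the character back. The sample document is made up.

```python
import xml.etree.ElementTree as ET

# U+2019 encoded directly, with no entity or NCR involved.
utf8_apostrophe = "\u2019".encode("utf-8")

# A tiny document that declares its encoding in the XML prolog.
document = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<p>it\u2019s</p>"
).encode("utf-8")

# The parser decodes the bytes according to the declared encoding.
parsed_text = ET.fromstring(document).text
```

Here utf8_apostrophe is the byte sequence E2 80 99, and parsed_text comes back with the typographic apostrophe intact.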

Of course the final piece of the puzzle is actually inputting characters. OS X seems to have fairly good support for this. If you poke around a bit in the internationalisation support in System Preferences and enable Show Keyboard Viewer, Show Character Viewer and Unicode Hex Input, you should be able to work things out.

So, I can now have lovely typographically correct apostrophes and they work great, and all is good with the world. (Except of course that this page probably renders like crap in Internet Explorer. Oh well.)


Fri, 09 Jun 2006 09:43:54 +0000
tech web

Damn, conkeror is the most awesome thing ever. It basically gives you an emacs style interface to firefox. This is not just emacs key bindings for text boxes and stuff but a full on emacs style interface, with a status bar and entry box at the bottom, and none of this toolbar and menu stuff crufting up the page.

Instead of tabs you just have multiple buffers, which you can move between with the normal emacs key combinations. Huzzah!

I guess it isn't too surprising that this masterpiece comes from Shawn Betts, who also gave us ratpoison.

Of course it ain't perfect yet. It would be amazingly cool if it supported C-x 2 (split windows) and C-x B (list buffers). Also if there was a way I could still use the developer toolbar that would be good too.