On apostrophes, Unicode and XML

Thu, 28 Feb 2008 12:51:39 +0000
article tech unicode web

So, I started with something reasonably straightforward (updating my blog posts so that the <title> tag is set correctly), which quickly led me down the rabbit hole of typographically correct apostrophes, Unicode, XML, encodings, keyboards and input methods. Updating my blog software took about 15 minutes; delving down the rabbit hole took about 5 hours.

So, the apostrophe. This isn’t about the correct usage of the apostrophe. This is entirely about the correct typesetting of the apostrophe. Now, there are lots of opinions on the subject. It basically comes down to the choice between ASCII character 0x27 and Unicode code point U+2019. Of course, it just so happens that ASCII character 0x27 is also Unicode code point U+0027, so really this comes down to a discussion about which Unicode code point is most appropriate for representing the apostrophe. After way too much searching, it actually turns out to be a really simple decision. Unicode provides the documentation for the code points in a series of charts. The chart C0 Controls and Basic Latin (pdf) documents the APOSTROPHE. It is described as:

0027 ' APOSTROPHE 
= apostrophe-quote (1.0) 
= APL quote 
• neutral (vertical) glyph with mixed usage 
• 2019 ’  is preferred for apostrophe 

So, despite the fact that it is named APOSTROPHE, it is described as a neutral (vertical) glyph with mixed usage, and it notes that U+2019 is the preferred code point for apostrophe. This looks pretty conclusive but let’s check the General Punctuation chart:

2019 ’ RIGHT SINGLE QUOTATION MARK 
= single comma quotation mark 
• this is the preferred character to use for apostrophe
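
A quick way to double-check the two code points is a small Python sketch using the standard unicodedata module. The constant names and the describe helper are made up here purely for illustration; they are not part of any blog software.

```python
import unicodedata

# Illustrative names only; neither constant comes from the charts themselves.
ASCII_APOSTROPHE = "\u0027"      # the neutral (vertical) glyph
PREFERRED_APOSTROPHE = "\u2019"  # the character the charts prefer for apostrophe


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe(ASCII_APOSTROPHE))      # U+0027 APOSTROPHE
print(describe(PREFERRED_APOSTROPHE))  # U+2019 RIGHT SINGLE QUOTATION MARK

# ASCII character 0x27 and Unicode code point U+0027 are the same thing.
print("'".encode("ascii") == b"\x27")  # True
```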

So, my conclusion is that the most appropriate character for an apostrophe is U+2019. OK, great, now I have to decide how I can actually encode this. I’m used to writing plain ASCII text documents, and U+2019 is not something I can represent in ASCII. Since I’m mostly concerned about documents I’m publishing on the interwebs, I figured that character entity references would be the way to go. There appears to be a relevant entity:

<!ENTITY rsquo   CDATA "&#8217;" -- right single quotation mark, U+2019 ISOnum -->

Of course, it seems a little odd using &rsquo; to represent an apostrophe, but so be it. Now, XML defines a new character entity, &apos;, which you might at first glance think is exactly what you want, but on second glance it isn’t, since it maps to U+0027, not U+2019. &apos; is mostly used for escaping apostrophes inside strings which are themselves enclosed in actual ' characters. So, &apos; is out. XML itself only defines character entities for ampersand, less-than, greater-than, quotation mark, and apostrophe. XHTML, however, defines the rest of the character entities that you have come to love and expect from HTML, so &rsquo; is still in, as long as it is used in an XHTML document, not a general XML document.
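
The difference between the two entities can be seen with Python's standard library. This is only a rough sketch: it uses xml.etree.ElementTree as a stand-in for "a general XML parser" and html.unescape as a stand-in for something that knows the HTML entity set, and the helper name parses_as_plain_xml is invented for the example.

```python
import html
import xml.etree.ElementTree as ET

# &apos; is one of the five entities predefined by XML, and it resolves
# to the plain U+0027 character, not to U+2019.
apos_text = ET.fromstring("<p>it&apos;s</p>").text
print(apos_text == "it\u0027s")  # True


def parses_as_plain_xml(source: str) -> bool:
    """Report whether a generic XML parser accepts the document."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False


# &rsquo; is not predefined in XML, so without the XHTML DTD the
# reference is undefined and parsing fails.
print(parses_as_plain_xml("<p>it&rsquo;s</p>"))  # False

# The HTML entity set does know rsquo, and maps it to U+2019.
rsquo_text = html.unescape("it&rsquo;s")
print(rsquo_text == "it\u2019s")  # True
```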

So I was set on just using &rsquo;, and I sent my page off to the validator. This went fine, except it pedantically pointed out that I had not defined a character encoding, and really I should. Damn, now I need to think about character encoding too. OK, so what options are there? Well, IANA has a nice list of official names for character sets that may be used on the Internet.

ANSI_X3.4-1968 (a.k.a. US-ASCII, a.k.a. ASCII) had to be a big first contender, since that is basically what I had been using for many a year, but to be honest, this seemed a little backwards. The idea of having to use numeric character references (NCRs) every time I wanted an apostrophe seemed a little silly. Besides, the W3C recommends using

an encoding that allows you to represent the characters in their normal form, rather than using character entities or NCRs
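
To make the ASCII trade-off concrete, here is a small Python sketch of what sticking with US-ASCII would mean for a single word. The variable and function names are arbitrary; the xmlcharrefreplace error handler is a standard Python codec feature that substitutes decimal NCRs for anything the target encoding cannot represent.

```python
APOSTROPHE = "\u2019"
text = "it" + APOSTROPHE + "s"  # "it’s" with the preferred apostrophe


def fits_in_ascii(s: str) -> bool:
    """Report whether every character of s can be encoded as ASCII."""
    try:
        s.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(fits_in_ascii(text))  # False

# In an ASCII document the apostrophe has to become an NCR.
ncr_bytes = text.encode("ascii", errors="xmlcharrefreplace")
print(ncr_bytes)  # b'it&#8217;s'

# The same reference in hexadecimal form, which matches the U+2019 notation.
hex_ncr = f"&#x{ord(APOSTROPHE):X};"
print(hex_ncr)  # &#x2019;
```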

OK, so since the XML spec states that:

A character reference refers to a specific character in the ISO/IEC 10646 character set

it seems that I really should choose an encoding that can directly encode Unicode code points. (The Unicode standard and ISO/IEC 10646 track each other.) So, what options are there for encoding Unicode? Well, it seems that one of the Unicode transformation formats would be a good choice. But there are so many to choose from: UTF-8, UTF-16, UTF-32, even UTF-9. While UTF-9 was definitely a contender, UTF-8 seems the sanest thing for me. For a start, it seems to just-work™ in my editor. So, going with UTF-8, I still end up needing to let other people know my files are encoded in UTF-8. There appear to be a few options for doing this, and the article goes into a long explanation of the various pros and cons. In the end, I just put it into the XML prolog.
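
For the curious, the following Python sketch shows how U+2019 comes out under the three mainstream transformation formats, and what a UTF-8 document with the encoding declared in the XML prolog might look like. The sample document is invented for illustration (it is not my actual page), and the big-endian codec variants are used only so that no byte order mark gets in the way.

```python
import xml.etree.ElementTree as ET

APOSTROPHE = "\u2019"

# The same code point under different Unicode transformation formats.
utf8 = APOSTROPHE.encode("utf-8")       # three bytes
utf16 = APOSTROPHE.encode("utf-16-be")  # two bytes
utf32 = APOSTROPHE.encode("utf-32-be")  # four bytes
print(utf8, utf16, utf32)

# A made-up document whose prolog declares the encoding, stored as UTF-8 bytes.
document = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<p>it" + APOSTROPHE + "s</p>"
).encode("utf-8")

# A parser that honours the declaration gets the apostrophe back intact.
parsed_text = ET.fromstring(document).text
print(parsed_text == "it" + APOSTROPHE + "s")  # True
```

Plain ASCII text encodes to the same bytes in UTF-8 as in ASCII, which is part of why it tends to just work in existing editors and tools.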

Of course, the final piece of the puzzle is actually inputting characters. OS X seems to have fairly good support for this. If you poke around a bit in the internationalisation support in System Preferences and enable Show Keyboard Viewer, Show Character Viewer and Unicode Hex Input, you should be able to work things out.

So, I can now have lovely typographically correct apostrophes and they work great, and all is good with the world. (Except of course that this page probably renders like crap in Internet Explorer. Oh well.)