So, I started with something reasonably straightforward (updating my blog posts so that the <title> tag is set correctly), which quickly led me down the rabbit hole of typographically correct apostrophes, Unicode, XML, encodings, keyboards and input methods. Updating my blog software took about 15 minutes; delving down the rabbit hole took about 5 hours.
So, the apostrophe. This isn’t about the correct usage of the apostrophe. This is entirely about the correct typesetting of the apostrophe. Now, there are lots of opinions on the subject. It basically comes down to a choice between ASCII character 0x27 and Unicode code point U+2019. Of course, it just so happens that ASCII character 0x27 is also Unicode code point U+0027, so really this comes down to a discussion about which Unicode code point is most appropriate for representing the apostrophe. After way too much searching, it actually turns out to be a really simple decision. Unicode provides the documentation for the code points in a series of charts. The chart C0 Controls and Basic Latin (pdf) documents the APOSTROPHE. It is described as:
0027 ' APOSTROPHE = apostrophe-quote (1.0) = APL quote • neutral (vertical) glyph with mixed usage • 2019 ’ is preferred for apostrophe
So, despite the fact that it is named APOSTROPHE, it is described as a “neutral (vertical) glyph with mixed usage”, and the entry notes that U+2019 is the preferred code point for the apostrophe. This looks pretty conclusive, but let’s check the General Punctuation chart:
2019 ’ RIGHT SINGLE QUOTATION MARK = single comma quotation mark • this is the preferred character to use for apostrophe
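The two chart entries can be double-checked without opening the PDFs. Here is a minimal sketch using Python's standard `unicodedata` module; the helper name `describe` is made up for illustration.

```python
import unicodedata


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# The ASCII apostrophe and the typographic one are distinct code points.
print(describe("'"))
print(describe("\u2019"))
```

Note that the official name of U+2019 says nothing about apostrophes; that recommendation only appears in the chart annotations quoted above.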
So, my conclusion is that the most appropriate character for an apostrophe is U+2019. OK, great, now I have to decide how I can actually encode this. I’m used to writing plain ASCII text documents, and U+2019 is not something I can represent in ASCII. Since I’m mostly concerned about documents I’m publishing on the interwebs, I figured that character entity references would be the way to go. And there appears to be a relevant entity:
<!ENTITY rsquo CDATA "&#8217;" -- right single quotation mark, U+2019 ISOnum -->
Of course, it seems a little odd using &rsquo; to represent an apostrophe, but so be it. Now, in XML a new character entity is defined, &apos;, which you might at first glance think is exactly what you want, but on second glance it isn’t, since it maps to U+0027, not U+2019. &apos; is mostly used for escaping apostrophes inside strings that are enclosed in actual ' characters. So, &apos; is out. XML itself only defines character entities for ampersand, less-than, greater-than, quotation mark, and apostrophe. XHTML, however, defines the rest of the character entities that you have come to love and expect from HTML, so &rsquo; is still in, as long as it is used in an XHTML document, not a general XML document.
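To see the difference between the two entities, and between HTML-style and generic XML handling, here is a small sketch using Python's standard `html` and `xml.etree.ElementTree` modules. The helper `parse_text` is a hypothetical name, and the element `<p>` is just a stand-in for any document content.

```python
import html
import xml.etree.ElementTree as ET

# HTML's entity table knows both names, but they map to different code points.
rsquo = html.unescape("&rsquo;")  # U+2019
apos = html.unescape("&apos;")    # U+0027


def parse_text(markup):
    """Parse markup as generic XML; return the root's text, or None if rejected."""
    try:
        return ET.fromstring(markup).text
    except ET.ParseError:
        return None


print(parse_text("<p>it&apos;s</p>"))    # predefined in XML, so this parses
print(parse_text("<p>it&rsquo;s</p>"))   # no DTD declares rsquo here: None
print(parse_text("<p>it&#x2019;s</p>"))  # a numeric reference always works
```

A plain XML parser with no DTD only knows the five predefined entities, which is why the named reference is rejected while the numeric one is accepted.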
So I was set on just using &rsquo;, and I sent my page off to the validator. This went fine, except it pedantically pointed out that I had not defined a character encoding, and really I should. Damn, now I need to think about character encoding too. OK, so what options are there? Well, IANA has a nice list of official names for character sets that may be used on the internet.
ANSI_X3.4-1968 (a.k.a. US-ASCII, a.k.a. ASCII) had to be a big first contender, since that is basically what I had been using for many a year, but to be honest, this seemed a little backwards. The idea of having to use numeric character references (NCRs) every time I wanted an apostrophe seemed a little silly. Besides, the W3C recommends using
an encoding that allows you to represent the characters in their normal form, rather than using character entities or NCRs
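What sticking with ASCII would mean in practice can be sketched with Python's built-in codecs. The sample string is made up; `xmlcharrefreplace` is the standard error handler that substitutes NCRs for anything the target encoding cannot represent.

```python
text = "it\u2019s"  # "it's" with a typographic apostrophe

# Plain ASCII simply cannot hold U+2019 ...
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# ... so every apostrophe has to become a numeric character reference.
as_ncr = text.encode("ascii", errors="xmlcharrefreplace")

print(ascii_ok)
print(as_ncr)
```

The NCR form is valid, but the source text becomes harder to read and edit, which is the point the W3C recommendation is making.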
OK, so since the XML spec states that:
A character reference refers to a specific character in the ISO/IEC 10646 character set
it seems that I really should choose an encoding that can directly encode Unicode code points. (The Unicode standard and ISO/IEC 10646 track each other.) So, what options are there for encoding Unicode? Well, it seems that one of the Unicode transformation formats would be a good choice. But there are so many to choose from: UTF-8, UTF-16, UTF-32, even UTF-9.
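For a feel of how the three serious candidates differ, here is a rough sketch that encodes U+2019 with each of them using Python's built-in codecs. The big-endian variants are chosen here only to keep byte-order marks out of the output, and the helper name `encoded_hex` is hypothetical.

```python
APOSTROPHE = "\u2019"


def encoded_hex(ch, codec):
    """Return the bytes of ch in the given codec as a hex string."""
    return ch.encode(codec).hex()


for codec in ("utf-8", "utf-16-be", "utf-32-be"):
    print(codec, encoded_hex(APOSTROPHE, codec))

# Text that is pure ASCII is byte-for-byte identical in UTF-8.
ascii_unchanged = "it's".encode("utf-8") == "it's".encode("ascii")
print(ascii_unchanged)
```

That last property is a large part of the appeal of UTF-8 for someone coming from plain ASCII files: existing documents are already valid.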
While UTF-9 was definitely a contender, UTF-8 seems the sanest thing for me. For a start, it seems to just work™ in my editor. So, going with UTF-8, I still end up needing to let other people know my files are encoded in UTF-8. There appear to be a few options for doing this, and the article goes into a long explanation of the various pros and cons. In the end, I just put it into the XML prolog.
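Declaring the encoding in the prolog looks roughly like the sketch below, checked with Python's standard `xml.etree.ElementTree`. The document content is a made-up one-element example, not the actual blog markup.

```python
import xml.etree.ElementTree as ET

# A tiny document whose prolog declares its encoding.
document = '<?xml version="1.0" encoding="UTF-8"?>\n<p>it\u2019s</p>'
raw = document.encode("utf-8")

# The parser reads the declaration and decodes the bytes accordingly.
root = ET.fromstring(raw)
print(root.text)

# What a reader sees if the same bytes are guessed to be Windows-1252.
misread = "\u2019".encode("utf-8").decode("cp1252")
print(misread)
```

The second print shows the familiar three-character mojibake that appears when UTF-8 bytes are read with the wrong encoding, which is why telling other people about the encoding matters.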
Of course, the final piece of the puzzle is actually inputting characters. OS X seems to have fairly good support for this. If you poke around a bit in the internationalisation support in System Preferences and enable “Show Keyboard Viewer”, “Show Character Viewer” and “Unicode Hex Input”, you should be able to work things out.
So, I can now have lovely typographically correct apostrophes and they work great, and all is good with the world. (Except of course that this page probably renders like crap in Internet Explorer. Oh well.)