Un-shortening twitter URLs

Sun, 08 Nov 2009 17:29:55 +0000

Introduction

I got a little annoyed at twitter the other day for gratuitously tinyurl-ing a link, even when my message was under the magic 140 characters. Anyway, as far as I can tell, there isn’t really any way to avoid this.

As it happens, there has been quite a kerfuffle about this in the blog-o-sphere of late, with Josh Schachter (founder of del.icio.us) discussing some of the problems with URL shorteners. While I agree with most of what is written there, I think some of his points are a little over the top. Personally, I don’t think that URL shorteners are evil, but they can certainly be annoying. I regularly mouse over links to see where they are going, and like the visual feedback of seeing visited links, both of which are broken by URL shorteners.

So, for my problem there is a pretty easy solution: expand the URL somehow, and show that instead of the shortened one. There are already services that solve this, but I was interested in seeing whether it could be done purely within the browser, using JavaScript and standard (or proposed-standard) DOM APIs.

Overall approach

Now, in general, these URL-shortening services do the “right thing” and provide a 301 permanent redirect response. So my basic thinking is something like:

  1. get the tweets from twitter
  2. find all the URLs in the tweets
  3. for each URL, do an HTTP HEAD request
  4. if the response code is 301, replace the link text (and href) with the response location.

.. and the plan is to try and do this on the client. Now, if you’ve done something like this before, you can probably already guess all the pitfalls!
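
To make step 3 concrete, the exchange with a shortener looks roughly like this (illustrative only; the destination shown here is a placeholder, not the real target of any particular link):

HEAD /c3zb2v HTTP/1.1
Host: tinyurl.com

HTTP/1.1 301 Moved Permanently
Location: http://www.example.com/some/long/original/url

All we care about is the 301 status code and the Location header in the response.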

Render the tweet data

Twitter quite conveniently provides its data in a variety of useful machine-readable formats. The most useful for my purposes here is the twitter XML format. Now, I’m leaving it as an exercise for the reader how to get an (authenticated) bit of twitter XML into the client at run-time. For now, I’ve downloaded my latest timeline in XML format and made it available at http://benno.id.au/twit/twit.xml.
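
As a minimal sketch of that exercise (and not necessarily the code behind the example page), assuming the XML is served from the same origin as the page, something like this would do:

var req = new XMLHttpRequest();
req.open("GET", "twit.xml", false);  // synchronous, purely to keep the sketch short
req.send(null);
var tweetDoc = req.responseXML;      // the timeline as a parsed XML Document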

Now, in these days of JSON and JavaScript, no-one much seems to like XSLT, but I’m quite partial to it, and as far as I’m concerned it is the best way to take one XML document (the tweet timeline) and convert it into another XML document (the HTML output). (Technically, I’ll be converting it into an XML document fragment.)

I’ll leave creating an XSLT processor in JavaScript as an exercise for the reader, but the XSLT script itself is of some interest. At least I think so!
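
For the curious, browsers that support XSLT expose it through the XSLTProcessor API, so the glue code is only a few lines. A sketch, assuming the stylesheet has been loaded into xslDoc the same way the timeline was loaded into tweetDoc above, and that the page has a container div with id "tweets" (both names are made up for this example):

var processor = new XSLTProcessor();
processor.importStylesheet(xslDoc);                        // the stylesheet shown below
var fragment = processor.transformToFragment(tweetDoc, document);
document.getElementById("tweets").appendChild(fragment);   // insert the generated HTML into the page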

Now, for the most part, converting the Twitter status element into appropriate <div> and <p> tags is straightforward:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" />
<xsl:template match="statuses">
<h1>Tweet!</h1>
    <xsl:apply-templates/>
</xsl:template>

<xsl:template match="status">
<div class="tweet">
  <div class="text">
     <p><xsl:value-of select="text" /></p>
  </div>
  <div class="meta">
    <span><xsl:value-of select="created_at" /> from <xsl:value-of select="source"/></span>
  </div>
</div>
</xsl:template>
</xsl:stylesheet>

What this script basically does is match the <statuses> tag and output the heading. Then, for each <status> tag, it generates a div with the actual tweet. Now this works pretty well, but the tweets are stored as plain text, without the URLs marked up in any way! E.g.:

<status>
  <created_at>Thu Apr 23 08:06:49 +0000 2009</created_at>
  <id>1592566033</id>
  <text>I can't believe the successor of the Radius protocol is the diameter protocol: http://tinyurl.com/c3zb2v</text>
  <source>web</source>
  ....
</status>

This means the script currently prints URLs as plain text, with no link markup. Unfortunately, XSLT doesn’t have very powerful built-in string-handling functions. Fortunately, it is a pretty general-purpose (and Turing-complete) language. Confusingly, functions are called templates, and the syntax to invoke one is far from convenient, but it is relatively straightforward to parse a string to extract and mark up links.

So, we write a function, err, template, called parseurls. This template takes a single parameter called text. It then outputs this text with anything that looks like a URL replaced with <a href="URL">URL</a>. For this proof-of-concept, anything that starts with http:// is going to count as a URL, which works relatively well in practice.

The basic algorithm is to split the string into three parts: before-url, url and after-url. The before-url part is simply output as-is; the url is output as a link, as described previously. If there is an after-url part, we recursively call the template on it.

<xsl:template name="parseurls">
  <xsl:param name="text"/>
  <xsl:choose>
    <xsl:when test="contains($text, 'http://')">
      <xsl:variable name="after_scheme" select="substring-after($text, 'http://')"/>
      <xsl:value-of select="substring-before($text, 'http://')"/>
      <xsl:choose>
        <xsl:when test="contains($after_scheme, ' ')">
          <xsl:variable name="url" select="concat('http://', substring-before($after_scheme, ' '))"/>
          <xsl:call-template name="linkify"><xsl:with-param name="url" select="$url"/></xsl:call-template>
          <xsl:text> </xsl:text>
          <xsl:call-template name="parseurls">
            <xsl:with-param name="text" select="substring-after($after_scheme, ' ')"/>
          </xsl:call-template>
        </xsl:when>
        <xsl:otherwise>
          <xsl:variable name="url" select="concat('http://', $after_scheme)"/>
          <xsl:call-template name="linkify"><xsl:with-param name="url" select="$url"/></xsl:call-template>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="$text"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
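
The parseurls template delegates the actual markup to a small helper template named linkify, which isn’t shown above. A minimal reconstruction of it would be something like:

<xsl:template name="linkify">
  <xsl:param name="url"/>
  <a href="{$url}"><xsl:value-of select="$url"/></a>
</xsl:template>

To wire everything up, the <p> in the status template calls parseurls on the tweet text instead of using xsl:value-of directly:

<p>
  <xsl:call-template name="parseurls">
    <xsl:with-param name="text" select="text"/>
  </xsl:call-template>
</p>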

So, through the power of functional programming we can transform the raw tweet XML into a useful HTML document. Unfortunately, making HTTP requests is beyond the scope of XSLT, so for the next step we need to use JavaScript.

Replacing URLs

OK, the next step is to go and change the links in the document. The goal is to find any links that have been shortened and replace them with the true destination.

Finding the links in the document is relatively straightforward; something like var atags = document.getElementsByTagName("body")[0].getElementsByTagName("a"); does the trick. Now the aim is to go and see if any of these URLs return a redirect, and then update the element. What I want to do is something like:

var atags = document.getElementsByTagName("body")[0].getElementsByTagName("a");
for (var i = 0; i < atags.length; i++) {
    var x = atags[i];
    var xmlhttp = new XMLHttpRequest();
    xmlhttp.update_x = x;  // remember which link this request belongs to
    xmlhttp.open("HEAD", x.attributes.getNamedItem("href").value);
    xmlhttp.onreadystatechange = function () {
        if (this.readyState == 4) {
            if (this.status == 301) {
                // replace both the link text and the href with the real destination
                this.update_x.innerHTML = this.getResponseHeader('Location');
                this.update_x.href = this.getResponseHeader('Location');
            }
        }
    };
    xmlhttp.send(null);
}

Basically, what this code does (or what I wish it did) is grab the href attribute out of each link and do a HEAD request on it. Recall that a HEAD request fetches the resource headers, but not the contents. Since we only care about the response code and the Location header, we do the network a favour and don’t download all the unnecessary data.

Unfortunately, at this stage we run up pretty hard against a fundamental limitation of XMLHttpRequest, specifically the same-origin policy. In short (and simplified) terms, the same-origin policy means you can’t make HTTP requests except to the domain the page came from. This is done for very good security reasons, but is rather frustrating at this point.

Now, I wasn’t going to let a simple thing like this stop me, so I implemented a pretty simple server-side proxy that lets me avoid this problem. (This is a pretty unsatisfactory solution, and I’m definitely looking forward to some of the new cross-origin extensions coming out.) So, basically, we change the open call to something like:

xmlhttp.open("HEAD", "proxy?" + x.attributes.getNamedItem("href").value);

OK, so the server-side proxy gets around our cross-origin restriction, but unfortunately we now hit a new problem! It turns out that XMLHttpRequest is specified so that any redirects are automatically followed by the implementation. Which means we are stuck again, because we don’t even get a chance to find out about the redirect! To get around this I hacked up my proxy so that it converts 301 response codes into a 531 response code. (There is no particular reason for choosing that number over any other unused response code.)
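
The proxy itself isn’t included here, but the idea is simple enough to sketch. Here is one possible version, written in Node.js purely for illustration (a reconstruction, not the proxy actually running behind the example page):

var http = require("http");
var url = require("url");

http.createServer(function (req, res) {
    // The client requests "proxy?<shortened-url>", so the target URL is
    // simply everything after the "?".
    var target = req.url.split("?")[1];
    if (!target) {
        res.writeHead(400);
        res.end();
        return;
    }
    var parsed = url.parse(decodeURIComponent(target));

    // Do the HEAD request on the browser's behalf.
    var upstream = http.request({
        hostname: parsed.hostname,
        port: parsed.port || 80,
        path: parsed.path,
        method: "HEAD"
    }, function (upstreamRes) {
        // Rewrite 301 into the made-up 531 so that the browser's
        // XMLHttpRequest gets to see it, rather than the redirect being
        // followed automatically.
        var status = (upstreamRes.statusCode == 301) ? 531 : upstreamRes.statusCode;
        var headers = {};
        if (upstreamRes.headers["location"]) {
            headers["Location"] = upstreamRes.headers["location"];
        }
        res.writeHead(status, headers);
        res.end();
    });
    upstream.on("error", function () {
        res.writeHead(502);
        res.end();
    });
    upstream.end();
}).listen(8000);

With the proxy in place, the client-side check naturally becomes this.status == 531 rather than 301.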

Putting all this together gives us a solution for URL elongating (mostly) on the client side. I’ve put an example up at http://benno.id.au/twit/.

Conclusions and further work

As you can see from the example, modifying the links after the page has already rendered can be somewhat distracting. An alternative user interface would be to hook the mouse-over event and, in that case, display the real URL in the status bar.
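
A rough sketch of that alternative (not implemented on the example page; the helper name is made up, and note that many browsers restrict scripts from changing the status bar): instead of rewriting innerHTML in the readystatechange handler, we would call something like:

function attachExpandedUrl(link, expandedUrl) {
    // Show the real destination only while the pointer is over the link.
    link.onmouseover = function () {
        window.status = expandedUrl;
        return true;   // returning true is needed for the status-bar override to take effect
    };
    link.onmouseout = function () {
        window.status = "";
        return true;
    };
}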

Obviously this has the potential to create a large number of network requests, but that would be no different from a desktop application or browser plugin, so there isn’t really much that can be done there. It might be possible to batch a number of requests to the proxy, and also have the proxy cache responses, but I’d prefer to find a solution that avoids needing the proxy in the first place!

The W3C Cross-Origin Resource Sharing working draft provides a mechanism that allows holes to be punched in the same-origin restriction. If URL-shortening services allowed their resources to be shared cross-origin, by sending the Access-Control-Allow-Origin HTTP header, the need for the proxying mechanism would go away.
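
For example (purely illustrative; no shortener actually sent this at the time, as far as I know), a 301 response that additionally carried

Access-Control-Allow-Origin: *

would let the page’s script see the response, although the automatic following of redirects discussed next would still need to be dealt with.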

Finally, I would propose that the XMLHttpRequest API be updated to provide a mechanism to avoid following redirects. The W3C working draft notes that a “property to disable following of redirects” is being considered for a future version of the specification. I would be in favour of this.

Unfortunately, the conclusion has to be that, given the current API limitations and security models, it is not possible to write this kind of application in a purely client-side manner with today’s web technologies.
