node.js semi-asynchronous functions

Mon, 08 Aug 2011 19:19:28 +0000
tech node.js

tl;dr

Last time I wrote about some of the idiosyncrasies in the way in which you deal with exceptions in node.js. This time, I’m looking at a phenomenon I’m calling semi-asynchronous functions.

Let’s start with a simple synchronous function. We have a function x which sets the value of two global variables. Of course global variables are bad, so if it makes you feel better you could imagine that x is a method updating some fields on the current object. Of course some will argue that any mutable state is bad, but now we are getting side-tracked!

var a = 0
var b = 0

function x(new_a, new_b) {
    a = new_a
    b = new_b
}

So, here we have a pretty simple function, and it is pretty easy to state the post-condition that we expect: when x returns, a will have the value of the first argument and b will have the value of the second argument.

So, let’s just write some code to quickly test our expectations:

x(5, 6)
console.log(a, b)

As expected this will print 5 6 to the console.

Now, if x is changed to be an asynchronous function things get a little bit more interesting. We’ll make x asynchronous by doing the work on the next tick:

function x(new_a, new_b, callback) {
    function doIt() {
        a = new_a
        b = new_b
        callback()
    }
    process.nextTick(doIt)
}

Now, we can guarantee something about the values of a and b when the callback is executed, but what about immediately after calling? Well, with this particular implementation, we can guarantee that a and b will be unchanged.

function done() {
    console.log("Done", a, b)
}

x(5, 6, done)
console.log("Called", a, b)

Running this we see that our expectations hold. a and b are 0 after x is called, but are 5 and 6 by the time the callback is executed.

Of course, another valid implementation of x could really mess up some of these assumptions. We could instead implement it like so:

function x(new_a, new_b, callback) {
    a = new_a
    function doIt() {
        b = new_b
        callback()
    }
    process.nextTick(doIt)
}

Now we get quite a different result. After x is called a has been modified, but b remains unchanged. This is what I call a semi-asynchronous function; part of the work is done synchronously, while the remainder happens some time later.
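To make that concrete, here is a small sketch that runs the same test as before against the semi-asynchronous x. The observe helper and the observations array are my own additions for recording what is seen at each point; they are not part of the original example.

```javascript
// Shared state, as in the earlier examples.
var a = 0
var b = 0

// The semi-asynchronous x: a is set immediately, b on the next tick.
function x(new_a, new_b, callback) {
    a = new_a
    function doIt() {
        b = new_b
        callback()
    }
    process.nextTick(doIt)
}

// Hypothetical helper: record and print the state at a given point.
var observations = []
function observe(label) {
    observations.push([label, a, b])
    console.log(label, a, b)
}

x(5, 6, function done() {
    observe("Done")
})
observe("Called")
// prints:
// Called 5 0
// Done 5 6
```

Immediately after the call a is already 5 while b is still 0, which is exactly the half-updated state a caller probably did not expect.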

Just in case you are thinking at this point that this is slightly academic, there are real functions in the node.js library that are implemented in this semi-asynchronous fashion.

Now as a caller, faced with one of these semi-asynchronous functions, how exactly should you use it? If it is clearly documented which parts happen asynchronously and which parts happen synchronously, and that is part of the interface, then it is relatively simple. However, most functions are not documented this way, so we can only make assumptions.

If we are conservative, then we really need to assume that anything modified by the function must be in an undefined state until the callback is executed. Hopefully the documentation makes it clear what is being mutated so we don’t have to assume the state of the entire program is undefined.

Put another way, after calling x we should not rely on the values of a and b in any way until the callback runs, and the implementer of x should feel free to change when in the program flow a and/or b is updated.

So can we rely on anything? Well, it might be nice to rely on the order in which some code is executed. With both asynchronous implementations of x so far, we have been able to guarantee that the code immediately following the call to x executes before the asynchronous callback executes. Well, that would be nice, but what if x is implemented like so:

function x(new_a, new_b, callback) {
    a = new_a
    b = new_b
    callback()
}

In this case, the callback will be executed before the code following the call to x. So, there are two questions to think about. Is the current formulation of x a valid approach? And secondly, is it valid to rely on the code ordering?
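Here is a quick sketch of that ordering, using the same done/Called test as before. The order array is just something I have added to record which line ran first.

```javascript
var a = 0
var b = 0

// The fully synchronous x: the callback runs before x returns.
function x(new_a, new_b, callback) {
    a = new_a
    b = new_b
    callback()
}

// Hypothetical record of the order in which things happened.
var order = []

x(5, 6, function done() {
    order.push("Done")
    console.log("Done", a, b)
})
order.push("Called")
console.log("Called", a, b)
// prints:
// Done 5 6
// Called 5 6
```

The output is the reverse of what we saw with the process.nextTick versions, even though the calling code has not changed at all.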

While you think about that, let me introduce another interesting issue. Let’s say we want to execute x many times in series (i.e. don’t start the next x operation until the previous one has finished and executed its callback). Well, of course, you can’t just use something as simple as a for loop; that would be far too easy, and it would be difficult to prove how cool you are at programming if you could just use a for loop. No, instead you need to do something like this:

var length = 100000;
function repeater(i) {
    if (i < length) {
        x(i, i, function () {
            repeater(i + 1)
        })
    }
}
repeater(0)

This appears to be the most widely used approach. Well there is at least one blog post about this technique, and it has been tied up into a nice library. Now, this works great with our original implementations of x. But try it with the latest one (i.e: the one that does the callback immediately). What happens? Stack overflow happens:

node.js:134
        throw e; // process.nextTick error, or 'error' event on first tick
        ^
RangeError: Maximum call stack size exceeded

So now the question isn’t just about whether the code ordering is a reasonable assumption to make; we also need to work out whether it is reasonable to assume that the callback gets a new stack each time it is called! Once again, if it is clearly documented it isn’t that much of a problem, but none of the standard library functions document whether they create a new stack or not.

The problem here is that common usage is conflicting. There is a lot of advice, and there are existing libraries, that make the assumption that a callback implies a new stack. At the same time there is existing code within the standard library that does not create a new stack each time. To make matters worse, this is not always consistent either: whether a new stack is created, or the callback is executed on the existing stack, can depend on the actual arguments passed to the function!
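As an illustration of argument-dependent behaviour, here is a made-up example. The lookup function and its cache are entirely hypothetical and are not taken from the node.js library; they only show how the same function can call back on the existing stack for one argument and on a new stack for another.

```javascript
// Hypothetical cache-backed lookup; not a real node.js API.
var cache = { known: 42 }

function lookup(key, callback) {
    if (key in cache) {
        // Cache hit: the callback runs on the caller's stack.
        callback(cache[key])
        return
    }
    // Cache miss: pretend to fetch the value, then call back on a new stack.
    process.nextTick(function () {
        cache[key] = key.length
        callback(cache[key])
    })
}

// Record the order in which things happen.
var trace = []

lookup("known", function (value) {
    trace.push("hit callback")
})
trace.push("after hit")

lookup("unknown", function (value) {
    trace.push("miss callback")
})
trace.push("after miss")

process.nextTick(function () {
    console.log(trace.join(", "))
})
// prints: hit callback, after hit, after miss, miss callback
```

For the cached key the callback runs before the code following the call; for the uncached key it runs after. A caller cannot tell which it will get without knowing what is in the cache.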

What then can we make of this mess? Well, once again, as a caller you need to make sure you understand when the state is going to be mutated by the function, and also exactly when, and on which stack your callback will be executed.
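If you can't find out which stack your callback will run on, one defensive option for the caller is to stop relying on x for a new stack and create one yourself. This is only a sketch of that idea, not something taken from any library: the repeater below defers its own next step with process.nextTick, and the completed counter is just there so we can see that every call ran.

```javascript
var a = 0
var b = 0

// The synchronous-callback x that caused the stack overflow.
function x(new_a, new_b, callback) {
    a = new_a
    b = new_b
    callback()
}

var length = 100000
var completed = 0  // hypothetical counter, for demonstration only

function repeater(i) {
    if (i < length) {
        x(i, i, function () {
            completed++
            // Defer the next step ourselves rather than trusting x to do it.
            process.nextTick(function () {
                repeater(i + 1)
            })
        })
    } else {
        console.log("Completed", completed, "calls")
    }
}
repeater(0)
// prints: Completed 100000 calls
```

The cost is an extra tick per iteration even when x would have deferred the callback anyway, but the stack depth stays constant whichever implementation of x is in use.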

As an API provider, as always, you need to document this stuff, but let’s try to stick to some common ground: a callback should always be executed on a new stack, not on the existing one.
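From the provider's side, a minimal sketch of that rule might look like the following. It is the synchronous x from earlier with one change: the callback is handed to process.nextTick instead of being called directly. The finished flag is my own addition so the end of the run is visible.

```javascript
var a = 0
var b = 0

// The state is still updated synchronously, but the callback is always
// deferred, so it never runs on the caller's stack.
function x(new_a, new_b, callback) {
    a = new_a
    b = new_b
    process.nextTick(callback)
}

var length = 100000
var finished = false  // hypothetical flag, for demonstration only

function repeater(i) {
    if (i < length) {
        x(i, i, function () {
            repeater(i + 1)
        })
    } else {
        finished = true
        console.log("Finished", a, b)
    }
}
repeater(0)
// prints: Finished 99999 99999
```

With this version the repeater runs to completion, and the code following a call to x always executes before the callback. Note that the state is still mutated before the callback fires, so that part still needs documenting.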
