Determining Dependencies

Tue, 15 Mar 2011 18:58:48 +0000
tech build-system

One of the best ways to make a build system fast is to avoid the unnecessary rebuilding of files. Build tools have a variety ways of achieving this. To better discuss this, let’s first define some terms. One way to look at a build system is that it takes some set of inputs in to some set of outputs. Usually these inputs and outputs are files in the file-system, but could potentially be something else, like tables in a database or anything else. Usually the build system consists of a set of build rules; each build rule has some set of inputs, and produces a set of outputs, by running a given build command. You’ll have to forgive the abstruse nature of these definitions, but I’m attempting to keep the design space as open as possible!

So to improve the speed of the build system we want to avoid executing unnecessary build commands. To do this in any reasonable way requires making some assumptions about the individual build rules, specifically that the output for a build rule only depends on the build rule’s inputs. With such an assumption there is no need to rerun a build command if the inputs of a build rule have not changed.

This in itself is an interesting restriction as the inputs to a build rule may not be entirely obvious. For example, the output of a build rule may depend on the time, or the user name of the hostname. The other problem is build commands that have inputs which are also outputs (i.e: commands that modify files such as ar). And of course the given command for a build rule may also potentially change. For example, imagine a build system that support a --ndebug argument, which causes compile command to have an extra -DNDEBUG argument.

So the aim of this article is to explore the design space of how build tools handle the specification of the inputs to a given build rule.

Explicit

Now the easiest approach is that the build system explicitly lists the inputs for each build rule. This is the base line kind of approach for something like make. The difficulty with this approach is that it can be error prone. A prototypical extract from a Makefile might look something like:


foo.o: foo.c
     gcc foo.c -o foo.o

Now if foo.c includes foo.h then there is a problem. Since foo.h is not captured as one of the inputs to the build rule, if foo.h changes, then the build command will not be re-run.

Of course, it is quite simple to include foo.h as one of the inputs for this specific build rule, but that is pretty brittle. C already sucks enough having to both declare and define public functions, without making it even more annoying by requiring you to update the build system every time you add a header to a source file. (And of course it should be removed from the build system when it is removed from the C file, however forgetting to do this will just affect performance, not correctness of the build system).

Another possibility is to treat each of the include path directories as an input to the build rule. There are some other issues with determining if a directory has changed from one build to the next, but we’ll ignore that for now. This approach should be relatively easy to use, and should be correct most of the time, but has a performance drawback if the include path has many include files, and most source files only include one or two headers.

Rule Specific Scanners

Rather than having to manually define each and every input file another approach is to have some rule specific process that can determine the correct inputs for a given rule. gcc has a -MM option which can be used to determine which header files would be used to compile a given file. This can be used in conjuction with make to automatically determine the inputs for any compile rules, which is a great improvement over manually managing the dependencies for any given source file, however there are some drawbacks.

The first is that the overall compile time is affected, as each time a file is compiled it must also generate the dependencies. In practise this isn't too bad; scanning for headers isn’t particularly CPU intensive, and since the compile will touch the same files doesn’t result in any extra I/O (and if the files weren’t cached, it primes the cache for the compile anyway).

The second problem, is that gcc -M is a gcc specific thing that isn’t going to work with other compilers, or other tools more generally. Of course, it would be possible to write a generate dependencies script for each type of build rule, but this is potentially a lot of work, and can have accuracy problem as the actual build command may in fact work slightly differently to how dependency script works, which risks having incorrect builds.

The next problem is to do with generated header files. If a source file includes a generated header file, and that generated header file does not yet exist, then gcc -MM will result in an error. gcc provides a -MG option to help account for this problem, however it is far from perfect. It assumes that the include path of the generated header is the current working directory which may not actually be the case. Generated files are not necessarily a problem, depending on some other design decisions it is possible to ensure that this dependency scanning occurs at the same time as compilation, so missing includes would be an error.

Another way to avoid the generated header file problem if the scanning operation is aware of the rest of the build rules. For example, when searching for a specific include file, the scanning tool could check not just for specific files in the file system itself, but also check for known outputs from other build rules in the build system. This approach has the drawback that the accuracy of scanning might not be adequate. For example, the SCons build tool uses this approach but can generate incorrect set of inputs when include files are conditionally included, or when headers are included through a macro. E.g:


#define ARCH mips
#define ARCH_INC(x) 
#include ARCH_INC(foo.h)

Of course you can argue that such a construct is probably less than ideal, however any approach like this is going to be prone to the same class of errors.

Depending on the exact model for determining the execution order of build commands in the system (the subject of a later article) the time at which the scanning occurs can have a major impact on performance.

A final problem with this approach in practise is that it can actually miss some dependencies. Consider the a command such as gcc -Iinc1 -Iinc2 foo.c. If foo.c includes foo.h and foo.h resides in the inc2 directory then this approach will generally report that foo.c depends on inc2/foo.h. However, this misses an important bit of information; the command is dependent on the non-existence of foo.h in the inc1 directory. If foo.h in added to inc1 then the output of the command will be different but an incremental build would miss this and not cause a rebuild. In theory there should be no reason why such a tool can’t report that a build rule depends on the non-existence of files as well. And indeed why explicit noting of inputs can’t do the same thing.

Tracing command execution

The only other approach (that I know of) is to track what files the build command actually touches when it executes. There are a couple of ways in which this can be done, but all approaches conceptually track when the build command opens files. One example of this is the memoize.py build tool, which uses strace to trace which files are touched by a given build command.

This approach has the very large advantage of being pretty accurate and capturing all the files that a build command touches and of not needing any build rule specific logic to determine the input files.

This approach can also easily capture files that were attempted to be opened

There are of course a few disadvantages. Firstly there is no standard API for tracing, so this part of any build tool ends up being OS specific, which is not ideal. Also tracing can add some significant overhead to the execution of build commands. Benchmarks are required to see what the difference in performance is between simply tracing build commands and running some rule specific scanner / dependency generation.

A potential disadvantage of this approach is false dependencies. If a build command opens a file but doesn’t actually consider the contents of the file during the execution of the command it will still be marked as a dependency, although there would be no need to rebuild if that file changed. This is is not a correctness problem, but could cause excessive unnecessary rebuilds.

Probably the biggest disadvantage of this approach is that all the input files for a build rule are not known until after the build command has executed. This has some pretty significant impacts on the different approaches that can be used for choosing the build rules execution order.

Summary

There are different approaches that can used to determine the set of input files for a given build rule; each has pros and cons, there is no clear winner.

My preference is the automatic detection of inputs for and build rule using a tracing approach of some kind. It wins in terms of correctness and also ease-of-use. Performance is a little bit unknown, however it should approach the performance of using a secondary script for determining the inputs.

Disagree with my analysis? Can you suggest some alternatives? Please leave a comment.

blog comments powered by Disqus