Some UNIX tricks for programmers

Mon, 08 Dec 2008 21:27:50 +0000
tech article

Introduction

The UNIX® shell is one of the tools (along with Emacs) that I never feel like I’m using to the fullest. This morning was another opportunity to expand my shell repertoire. Most of my source code is stored in Mercurial repositories, and I was sick of doing % find . | grep -v \.hg to strip out the meta-data. Surely there is a better way? Note: I’m using zsh and GNU tools on Mac OS X, although I’m pretty sure that most of the examples should work fine in Bourne-like shells on other UNIX-like operating systems.

find and -prune

So the thing that sucks about the find . | grep -v <foo> type pattern is that find will do all the work of descending into the foo directory and scanning it. It also means you can’t take advantage of other find niceties such as the -exec flag. So, if only there was a way to prune our directory tree search. Now, the find expression language is a mighty strange language to code in, or at least it is to me. The language consists of a set of primaries and a set of operators; an expression is made up of primaries linked together with operators. Now, there are a few things that make this less than obvious, especially if you have been naïvely using find. The first is that in almost all cases, the find utility automatically adds a -print primary to your expression (the exception is when the expression already contains an action such as -print, -exec or -ok). The other big one (for me at least) is that writing two expressions side by side, expression expression, is short-hand for expression -and expression. To me, grokking this made a huge difference to my understanding of how find actually works.

The other part to grokking find is that some expressions are what I would call filters, that is, the expression returns true if the file matches some property, e.g: -name, -newer, etc. Then there is a class of command expressions that have some type of external side-effect. Examples are: -print, -print0, -exec, -ok. Next is the -prune command, which is in a special category all on its own because it modifies find's internal search process. Finally there are what I would call global expressions, such as -maxdepth and -mindepth. These are global because each applies to the entire expression even if it would not normally be evaluated. This is pretty strange behaviour if you ask me; surely they would have been better as options rather than expressions!
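To see the "global" behaviour concretely, here is a small sketch. The directory layout and the name nomatch are made up for the demo. The -maxdepth 1 sits on a branch that short-circuit evaluation never reaches, yet it still limits the whole search. GNU find may warn on stderr about where the option is placed, so the sketch discards stderr.

```shell
# Throwaway tree: one file at depth 1, one at depth 2.
demo=$(mktemp -d)
mkdir -p "$demo/sub"
touch "$demo/top.txt" "$demo/sub/deep.txt"

# -name nomatch is false for every path, so the -maxdepth to its right is
# never reached by normal left-to-right evaluation. It applies anyway.
limited=$(cd "$demo" && find . \( -name nomatch -maxdepth 1 \) -o -print 2>/dev/null | sort)

# Same expression without -maxdepth, for comparison.
unlimited=$(cd "$demo" && find . -name nomatch -o -print | sort)

echo "with -maxdepth 1:";  echo "$limited"
echo "without -maxdepth:"; echo "$unlimited"

rm -rf "$demo"
```

The first listing stops at depth 1, while the second also includes sub/deep.txt.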

So with this in mind, let's build up an expression to do a find, but ignoring my .hg. We start with an expression that will just print out the .hg directory. We need something like: find . -type d -name .hg. -type d returns true if the current path is a directory, and -name .hg returns true if the last element of the path is .hg. Since our -and is implicit, this expression only returns true for directories named .hg. And, since we do not specify any command-like expression, find automatically assumes we want to print the path names, so the full expression is really something like: find . \( -type d -and -name .hg \) -print.
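A quick way to convince yourself of this is to run both spellings against the same tree and compare. This is only a sketch: the demo directory and file names are invented, and -a is the POSIX spelling of -and.

```shell
# Build a throwaway tree so the comparison is reproducible.
demo=$(mktemp -d)
mkdir -p "$demo/src/.hg/store"
touch "$demo/src/main.c" "$demo/src/.hg/store/data"

# What I type: two primaries, no operator, no action.
implicit=$(cd "$demo" && find . -type d -name .hg)

# What find evaluates: an explicit -a between them and a trailing -print.
explicit=$(cd "$demo" && find . \( -type d -a -name .hg \) -print)

echo "$implicit"
[ "$implicit" = "$explicit" ] && echo "same output"

rm -rf "$demo"
```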

So far, this does the exact opposite of what we want: it simply prints the .hg directory. So instead let's use -prune, which tells find to avoid descending into the directory.

Note: find usually does a pre-order traversal (i.e: the directory path comes before the directory content); if you use the -d option (spelled -depth in POSIX), the behaviour is changed to a post-order traversal (i.e: the directory path comes after all the directory content). It should be clear that the -prune expression is only going to work with pre-order traversal. find doesn’t have any way to do a breadth-first traversal.
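The difference in ordering, and the fact that -prune stops doing anything useful in post-order, can be seen with a tiny tree. This is a sketch with made-up file names; it uses -depth, the POSIX spelling, rather than -d.

```shell
# A single chain of entries keeps the output order deterministic.
demo=$(mktemp -d)
mkdir -p "$demo/.hg"
touch "$demo/.hg/data"

# Pre-order (default): each directory is listed before its contents.
preorder=$(cd "$demo" && find .)

# Post-order: contents first, then the directory that holds them.
postorder=$(cd "$demo" && find . -depth)

# With pre-order, -prune stops the descent into .hg.
pruned_pre=$(cd "$demo" && find . -name .hg -prune -o -print)

# With post-order, .hg has already been walked by the time it is tested,
# so its contents still show up.
pruned_post=$(cd "$demo" && find . -depth -name .hg -prune -o -print)

echo "pre-order:";         echo "$preorder"
echo "post-order:";        echo "$postorder"
echo "prune, pre-order:";  echo "$pruned_pre"
echo "prune, post-order:"; echo "$pruned_post"

rm -rf "$demo"
```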

Since -prune doesn’t count as one of the find commands, find will still auto-magically mess with our expression to add -print, so our expression becomes: find . \( -type d -and -name .hg -and -prune \) -print. Unsurprisingly it continues to only output the .hg directory. To get the result we want, we need to do something in the other case, when the path is not a directory named .hg. So, in the other case let's do a -print, and we end up with a command line: find . -type d -name .hg -prune -o -print. Since there is a command in the expression, find doesn’t auto-magically add a -print. This now does pretty much what we want. Of course typing out find . -type d -name .hg -prune -o -print is a lot more work than find . | grep -v .hg, so it would be nice to make an alias for this command.
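Here is a small side-by-side of the two approaches on an invented tree (the src directory and file names are just for the demo). Both produce the same listing here, but only the -prune version avoids walking .hg in the first place.

```shell
demo=$(mktemp -d)
mkdir -p "$demo/src/.hg/store"
touch "$demo/src/main.c" "$demo/src/.hg/store/data"

# The grep version: find still walks all of .hg, and the pattern would
# also drop any other path that happens to contain ".hg".
with_grep=$(cd "$demo" && find . | grep -v '\.hg' | sort)

# The -prune version: find never descends into .hg at all.
with_prune=$(cd "$demo" && find . -type d -name .hg -prune -o -print | sort)

echo "$with_prune"
[ "$with_prune" = "$with_grep" ] && echo "same listing"

rm -rf "$demo"
```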

I’m going to have an alias called sfind (short for source find; not the best name, but short and easy for the muscle memory). The first attempt would be to just alias the command we had last, but I’d really like the alias to work mostly like find, so that I can add new arguments and do things like sfind -name "*.c" and have it just work™. Simply aliasing the above command would end up with the final expression being find . -type d -name .hg -prune -o -print -name "*.c", which will do all the printing before it tests whether the file name ends in .c. Another alternative would be to alias find . -type d -name .hg -prune -o, but this forces me to always type an expression; just sfind by itself won’t work.

What I really need is a fully formed expression which still works when adding new constraints. What would be ideal is something like: find . -type d -name .hg -prune -o -true, so the other case always evaluates to true. In the case where no arguments are added this will end up being the equivalent of find . \( -type d -name .hg -prune -o -true \) -print, and in the case where further expressions are added it will end up being: find . \( -type d -name .hg -prune -o -true <expression> \) -print. Unfortunately, there is no portable -true expression (GNU find offers -true and -false as extensions, but POSIX find does not define them). So, how can we have an expression that always returns true? The best I’ve come up with is -name "*". Another possibility might be -depth +0.

This works pretty well, but you might notice that now the .hg directory is being printed! Why? Because now, instead of just printing files in the other case, find is printing whenever the entire expression is true, and since the last part of the left-hand side is -prune, and -prune always returns true, the pruned directory evaluates to true. So how can we stop that? Well, something like find . -type d -name .hg -prune -false -o -name "*" should do the trick. But of course, just as there is no portable unconditional true, there isn’t a portable unconditional false either. Thankfully, it is trivial to construct one: -not -name "*". So, what we end up with is: find . -type d -name .hg -prune -not -name "*" -o -name "*". Easy!
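Putting that together, a minimal sketch of sfind might look like the following. I've written it as a shell function rather than an alias so that it also works in scripts and non-interactive shells; that is a substitution on my part, and the demo tree is invented.

```shell
# Function stand-in for the alias: extra arguments are appended to the
# always-true right-hand side of the -o.
sfind() {
    find . -type d -name .hg -prune -not -name "*" -o -name "*" "$@"
}

demo=$(mktemp -d)
mkdir -p "$demo/.hg"
touch "$demo/a.c" "$demo/b.h" "$demo/.hg/x.c"

# No arguments: everything except .hg and its contents.
all=$(cd "$demo" && sfind | sort)

# Extra constraints narrow the listing, just like plain find.
c_only=$(cd "$demo" && sfind -name "*.c")

echo "$all"
echo "$c_only"

rm -rf "$demo"
```

Note that an expression containing -o still needs brackets when passed to sfind, for the usual precedence reasons.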

Grepping code and shell functions

Now the next thing I often do is find . | xargs grep "foo" or grep -r "foo" . (which really should be find -print0 | xargs -0, but hey, I’m lazy and my file names don’t contain white-space). Since I was learning more about find I figured I should work out how to do this without piping to xargs, especially since my new-found power of -o means I might want to run different commands on different files, which wouldn’t work with a simple pipe to stdout. So the standard approach is using the cryptic expression: -exec grep "foo" {} \;. It is actually pretty straightforward. The braces are replaced by the current pathname and the command needs to be terminated with a semi-colon (which needs to be escaped in most interactive shells). Now this isn’t quite as efficient as using xargs, since xargs will run the command on multiple pathnames at once, whereas -exec runs the command once per pathname. Thankfully modern find has an alternative form: -exec grep "foo" {} +. The plus-sign makes the behaviour of -exec similar to that of xargs, processing multiple pathnames per invocation.
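The difference between the two terminators is easy to observe by using echo as the command and counting output lines, since each invocation of echo prints exactly one line. A sketch with three made-up files:

```shell
demo=$(mktemp -d)
touch "$demo/a.c" "$demo/b.c" "$demo/c.c"

# \; runs the command once per path: three files, three echo lines.
per_file=$(cd "$demo" && find . -type f -exec echo {} \; | wc -l | tr -d ' ')

# + batches the paths: one echo gets all three names and prints one line.
batched=$(cd "$demo" && find . -type f -exec echo {} + | wc -l | tr -d ' ')

echo "invocations with ';' terminator: $per_file"
echo "invocations with '+' terminator: $batched"

rm -rf "$demo"
```

With a very large number of files the + form may still split the work over several invocations, exactly as xargs does.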

So what I really want to do is grep for a given pattern in a certain subset of my files. And I’d like to do this without needing to type out the verbose syntax each time. Something like srcgrep <regexp> [find-expression] would be ideal. Now a simple alias isn’t going to work for this, and I thought I was up for writing yet another tiny shell script. Luckily, I was clued on to shell functions. I feel very ignorant for not knowing about these beforehand, but hey, I’m a systems programmer, not a shell-scripting sysadmin guru. Anyway, long story short, adding an appropriate shell function, such as:

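srcgrep() {
    # Need at least the regexp; anything after it is a find expression.
    if [ $# -eq 0 ]; then
        echo "Usage: srcgrep <regexp> [expression]" >&2
        return 1
    fi
    local re=$1
    shift
    if [ $# -gt 0 ]; then
        # Bracket the user's expression so an -o inside it cannot
        # split it away from -type f and -exec.
        sfind -type f \( "$@" \) -exec grep "$re" {} +
    else
        sfind -type f -exec grep "$re" {} +
    fi
}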

to my .zshrc does the trick. This shell function ended up being more complex than I initially thought it would. Briefly, it checks there is the correct number of arguments, printing a usage message if necessary. Next it checks if there are any find expressions. When there are, it is necessary to enclose them in brackets, or else the precedence rules break expressions that might have an -or in them. Finally it runs the previously defined sfind alias. The -type f avoids running grep on directories.
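The precedence problem is easiest to see with plain find. In this sketch (invented file names, and -print standing in for the -exec grep action), the unbracketed form silently loses the .c file:

```shell
demo=$(mktemp -d)
touch "$demo/a.c" "$demo/b.h"

# Without brackets this parses as
#   ( -type f -a -name "*.c" ) -o ( -name "*.h" -a -print )
# so the action only runs for the right-hand side of the -o.
unbracketed=$(cd "$demo" && find . -type f -name "*.c" -o -name "*.h" -print | sort)

# With brackets the -o stays inside the group and the action sees both.
bracketed=$(cd "$demo" && find . -type f \( -name "*.c" -o -name "*.h" \) -print | sort)

echo "unbracketed:"; echo "$unbracketed"
echo "bracketed:";   echo "$bracketed"

rm -rf "$demo"
```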

Colourised grep

Now grep is nice and all, but it can sometimes be hard to pick out the actual search result from the results. This is where colour grep comes in. Grep has an option --color, which will use ANSI escape codes to markup the output. E.g:

bodyjar:/Users/benno/work/3.2/okl4-mainline% grep --color  SORT coverage.sh 
declare SORT=""
    F)  SORT="-F"
    tools/coverage.py -f fifo $SORT -s -S $SCOPE $IMAGE &
    tools/coverage.py -f fifo $SORT -s -S $SCOPE $IMAGE 2>&1 | tee $OUTFILE &

Now this is good as far as it goes, but when you start having more complex regular expressions, you get to a point where you only want to colourise part of the match, not the whole thing. For example: ^[A-Za-z_][A-Za-z0-9_]*\( matches lines with function declarations (at least when they follow the coding standard), but what I would really like highlighted is only the function name, not the function name plus opening bracket. This is where I ended up hitting the limits of shell and standard UNIX tools, and ended up back with good old familiar Python. I created a simple grep replacement, colgrep, that by default will colourise in the same manner as grep, but if you use groups, it will only colour the matched groups (and will give each group a different colour). Groups are part of the Python regular expression syntax (and probably Perl too). Basically, if a portion of the regular expression is enclosed in brackets, that part of the regexp forms a group. E.g: In SO(R)T the 'R' will be matched as group #1. So for the previous example, if I write instead: ^([A-Za-z_][A-Za-z0-9_]*)\(, only the function name will be highlighted. The other neat thing about Python regular expressions is that you can name groups with the syntax (?P<name>expr). I take advantage of this in colgrep to colour groups with the name "err*" with a background of red to highlight errors.
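To give a feel for what "colour only the group" means, here is a rough shell approximation using sed. To be clear, this is not colgrep (which is written in Python and handles arbitrary patterns and multiple groups); colour_fn_name is a made-up helper with the function-declaration regexp hard-wired, and it assumes a sed that accepts -E for extended regular expressions.

```shell
# Hypothetical helper, not the author's colgrep: print only the lines that
# look like function declarations, with group 1 (the name) in red.
colour_fn_name() {
    esc=$(printf '\033')   # the ANSI escape character
    # -n plus the p flag prints only lines where the substitution matched,
    # which gives grep-like filtering.
    sed -n -E "s/^([A-Za-z_][A-Za-z0-9_]*)\(/${esc}[31m\1${esc}[0m(/p"
}

printf 'main(void)\n  indented(call)\nint x;\n' | colour_fn_name
```

Only the first input line is printed, with main wrapped in the red and reset escape sequences and the opening bracket left uncoloured.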

Conclusion

So in conclusion, we’ve added some powerful commands to our toolbox which make it much quicker to browse and search large revision-controlled code bases.
