The UNIX® shell is one of the tools (along with Emacs) that I never
feel like I’m using to the fullest. This morning was another opportunity
to expand my shell repertoire. Most of my source code is stored in
mercurial repositories, and I was sick of doing % find . | grep -v \.hg
to strip out the meta-data. Surely there is a better way? Note: I’m
using zsh and GNU tools on Mac OS X, although I’m
pretty sure that most of the examples should work fine in Bourne-like shells on other
UNIX-like operating systems.
find and -prune
So the thing that sucks about the find . | grep -v
<foo> type pattern is that find will do all
the work of descending into the foo directory and
scanning it. It also means you can’t take advantage of other
find niceties such as the -exec flag. So, if
only there was a way to prune our directory tree search. Now,
the find expression language is a mighty strange language
to code in, or at least it is to me. The language consists of a set of
primaries and a set of operators; an expression is
made up of primaries linked together with operators. Now, there are
a few things that make this less than obvious, especially if you have
been naïvely using find. The first is that in almost all cases, the
find utility automatically adds a -print
primary to your expression. The other big one (for me at least) is
that expression expression is short-hand for
expression -and expression. To me, grokking this made
a huge difference to my understanding of how find actually
works.
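As a minimal sketch of those two rules, the snippet below runs the same search twice: once in the short form, and once with the operator and -print written out. The scratch directory and file names are made up for illustration, and -a is used as the POSIX spelling of -and.

```shell
# Scratch tree used only for illustration (names are hypothetical).
demo=$(mktemp -d)
mkdir -p "$demo/src"
touch "$demo/src/main.c" "$demo/src/notes.txt"

# Short form: adjacent primaries are ANDed and -print is assumed.
implicit=$(cd "$demo" && find . -type f -name "*.c")

# Same expression with the AND operator and -print spelled out.
explicit=$(cd "$demo" && find . \( -type f -a -name "*.c" \) -print)

echo "$implicit"
echo "$explicit"
```

Both commands should list the same single path, which is the whole point: the second is just what find makes of the first.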
The other part to grokking find is that some expressions are what I
would call filters; that is, the expression returns true if
the file matches some property, e.g: -name,
-newer, etc. Then there is a class of command
expressions that have some type of external side-effect. Examples are:
-print, -print0, -exec,
-ok. Next is the -prune command,
which is in a special category all on its own because it modifies
the find internal search process. Finally there are what I
would call global expressions, such as -maxdepth
and -mindepth. These are global because each applies
to the entire expression even if it would not normally be
evaluated.
This is pretty strange behaviour if you ask me, surely
they would have been better as options rather than expressions!
So with this in mind, let's build up an expression to do a find, but
ignoring my .hg directories. So we start with an expression that will
just print out the .hg directory. We need something like:
find . -type d -name .hg. -type d returns
true if the current path is a directory, and -name .hg
returns true if the last element of the path is
.hg. Since our -and is implicit, this
expression only returns true for directories named
.hg. And, since we do not specify any
command-like expression, find automatically assumes we
want to print the path names, so the full expression is really
something like: find . \( -type d -and -name .hg \)
-print.
So far, this does the exact opposite of what we want:
it simply prints the .hg directory. So instead let's use
-prune, which tells find to avoid descending
into the directory.
Note: find usually
does a pre-order traversal (i.e: directory path comes before directory
content); if you use the -d option, the behaviour is
changed to a post-order traversal (i.e: directory path comes after all
the directory content). It should be clear that the
-prune expression is only going to work with pre-order
traversal. find doesn’t have any way to do a breadth-first
traversal.
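The difference between the two orders is easy to see on a tiny tree. This is only a sketch: the directory names are invented, and it uses -depth, the POSIX spelling of the option, in place of -d.

```shell
# Scratch tree used only for illustration (names are hypothetical).
demo=$(mktemp -d)
mkdir -p "$demo/dir"
touch "$demo/dir/file"

# Default pre-order: each directory is listed before its contents.
pre=$(cd "$demo" && find . -print)

# Post-order with -depth: each directory is listed after its contents.
post=$(cd "$demo" && find . -depth -print)

echo "pre-order:";  echo "$pre"
echo "post-order:"; echo "$post"
```

In post-order the directory is only reached once its contents have already been visited, which is why -prune has nothing left to prune there.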
Since -prune doesn’t count as one of the
find commands, find will still auto-magically
mess with our expression to add -print, so our expression becomes:
find . \( -type d -and -name .hg -and -prune \) -print. Unsurprisingly
it continues to only output the .hg directory. To get the result we want, we need to
do something in the other case, when the path is not a directory named
.hg. So, in the other case let's do a -print, so we end
up with a command line: find . -type d -name .hg -prune -o -print.
Since there is a command in the expression, find doesn’t auto-magically add a
-print. This now does pretty much what we want. Of course
typing out find . -type d -name .hg -prune -o -print is a lot
more work than find . | grep -v .hg, so it would be nice
to make an alias for this command.
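The three stages of that build-up can be replayed on a throwaway tree. The layout below (a .hg directory next to a src directory) is an assumption made purely for the demonstration.

```shell
# Scratch tree used only for illustration (names are hypothetical).
demo=$(mktemp -d)
mkdir -p "$demo/.hg/store" "$demo/src"
touch "$demo/.hg/store/data" "$demo/src/main.c"

# Stage 1: only the .hg directory matches; the implicit -print outputs it.
step1=$(cd "$demo" && find . -type d -name .hg)

# Stage 2: -prune stops the descent, but the implicit -print still fires.
step2=$(cd "$demo" && find . -type d -name .hg -prune)

# Stage 3: prune .hg, otherwise (-o) print. The explicit -print
# suppresses the implicit one, so .hg itself is no longer listed.
step3=$(cd "$demo" && find . -type d -name .hg -prune -o -print | LC_ALL=C sort)

echo "$step3"
```

The sort is only there to make the output order predictable; find itself lists entries in whatever order the directory returns them.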
I’m going to have an alias called sfind (short for
source find, not the best name, but short and easy for the muscle
memory). So first attempt would be to just alias the command we had
last, but I’d really like the alias to work mostly like
find, so that I can add new arguments and do things like
sfind -name "*.c" and have it just work™. Simply
aliasing the above command would end up with the final expression
being find . -type d -name .hg -prune -o -print -name
"*.c", which will do all the printing before it tests if the
filename ends in .c. Another alternative would be to alias
find . -type d -name .hg -prune -o, but this forces me to
always have to type an expression; just sfind by itself
won’t work. What I really need is a fully formed expression, which still
works when adding new constraints. What would be ideal is something
like: find . -type d -name .hg -prune -o
-true, so the other case always
evaluates to true. In the case where no arguments are added this will
end up being the equivalent of find . \( -type d -name .hg
-prune -o -true \) -print, and in the case where further
expressions are added it will end up being: find . \( -type d
-name .hg -prune -o -true <expression> \) -print.
Unfortunately, there is no portable -true expression (GNU find has -true and -false, but POSIX does not specify them). So, how can
we have an expression that always returns true? The best I’ve come up
with is -name "*". Other possibilities might be
depth +0. This works pretty well, but you might notice
that now the .hg directory is being printed! Why?
Because now instead of just printing files in the other case,
find is printing whenever the entire expression is true, and since the
last part of the left-hand-side is -prune, and
-prune always returns true, the pruned directory
evaluates to true. So how can we stop that? Well, something like
find . -type d -name .hg -prune -false -o -name
"*" should do the trick. But of course, just as there is no
unconditional true, there isn’t an unconditional false either.
Thankfully, it is trivial to construct one: -not -name "*".
So, what we end up with is: find . -type d -name .hg -prune -not
-name "*" -o -name "*". Easy!
Now the next thing I often do is find . | xargs grep
"foo" or grep -r "foo" .. (Which really should be
find -print0 | xargs -0, but hey I’m lazy and my file
names don’t contain white-space). Since I was learning more about
find I figured I should work out how to do this without
piping to xargs, especially since my new found power of
-o means I might want to run different commands on
different files, which wouldn’t work with a simple pipe to
stdout. So the standard approach is using the cryptic
expression: -exec grep "foo" {} \;. It is actually pretty
straight forward. The braces are replaced by the current pathname and
the command needs to be terminated with a semi-colon (which needs
to be escaped in most interactive shells). Now this isn’t quite as
efficient as using xargs since xargs will run the command
on multiple pathnames at once, whereas -exec runs the
command once per pathname. Thankfully modern find seems to have an
alternative representation: -exec grep "foo" {}
+. The plus-sign makes the behaviour of
-exec similar to that of xargs, processing
multiple pathnames per invocation.
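One way to see the difference is to substitute echo for grep and count the output lines, since each invocation of echo prints exactly one line. The three file names are invented for the sketch.

```shell
# Scratch tree used only for illustration (names are hypothetical).
demo=$(mktemp -d)
touch "$demo/a.txt" "$demo/b.txt" "$demo/c.txt"

# "\;" runs the command once per pathname: three echo calls, three lines.
per_file=$(cd "$demo" && find . -type f -exec echo {} \; | wc -l | tr -d ' ')

# "+" batches pathnames like xargs: one echo call, one line.
batched=$(cd "$demo" && find . -type f -exec echo {} + | wc -l | tr -d ' ')

echo "invocations with \\;: $per_file"
echo "invocations with +: $batched"
```

With only three short paths everything fits in a single batch; on a large tree the + form may still split the work over several invocations, just far fewer than one per file.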
So what I really want to do is grep for a given
pattern, in a certain subset of my files. And I’d like to do this
without needing to type-out the verbose syntax each time. Something
like srcgrep <regexp> [find-expression] would be
ideal. Now a simple alias isn’t going to work for this,
and I thought I was up for writing yet another tiny shell
script. Luckily, I was clued on to shell functions. I feel
very ignorant for not knowing about these beforehand,
but hey, I’m a systems programmer, not a shell-scripting sysadmin
guru. Anyway, long story short, adding an appropriate shell function,
such as:
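srcgrep() {
    # Require at least the regular expression.
    if [ $# -eq 0 ]; then
        echo "Usage: srcgrep <regexp> [find-expression]" >&2
        return 1
    fi
    local re=$1
    shift
    if [ $# -gt 0 ]; then
        # Bracket the extra find expression so an -o inside it binds correctly.
        sfind -type f \( "$@" \) -exec grep "$re" {} +
    else
        sfind -type f -exec grep "$re" {} +
    fi
}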
to my .zshrc does the trick. This shell function ended up being more
complex than I initially thought it would. Briefly, it checks there is the
correct number of arguments, printing a usage message if necessary. Next
it checks if there are any find expressions. When there are, it is
necessary to enclose them in brackets, or else the precedence rules
break expressions that might have an -or in them. Finally
it runs the previously defined sfind alias. The -type f
avoids running grep on directories.
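The precedence problem that the brackets guard against can be reproduced with plain find. In this sketch (file names invented) the user-supplied expression is -name "*.c" -o -name "*.h", and -print stands in for the -exec that srcgrep appends.

```shell
# Scratch tree used only for illustration (names are hypothetical).
demo=$(mktemp -d)
touch "$demo/a.c" "$demo/b.h"

# Without brackets, -o has lower precedence than the implicit AND, so this
# parses as: (-type f -name "*.c") -o (-name "*.h" -print).
# a.c satisfies the left side and the command on the right is never reached.
ungrouped=$(cd "$demo" && find . -type f -name "*.c" -o -name "*.h" -print)

# With brackets, the command applies to whichever side of the -o matched.
grouped=$(cd "$demo" && find . -type f \( -name "*.c" -o -name "*.h" \) -print | LC_ALL=C sort)

echo "ungrouped: $ungrouped"
echo "grouped:"
echo "$grouped"
```

The ungrouped form silently drops every .c file, which is exactly the sort of breakage that is hard to notice when the command is a grep that may legitimately print nothing.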
Now grep is nice and all, but it can sometimes be
hard to pick out the actual search result from the results. This is
where colour grep comes in. Grep has an option --color, which
will use ANSI escape codes to markup the output. E.g:
bodyjar:/Users/benno/work/3.2/okl4-mainline% grep --color SORT coverage.sh
declare SORT=""
F) SORT="-F"
tools/coverage.py -f fifo $SORT -s -S $SCOPE $IMAGE &
tools/coverage.py -f fifo $SORT -s -S $SCOPE $IMAGE 2>&1 | tee $OUTFILE &
Now this is good as far as it goes, but when you start having more
complex regular expressions, you get to a point where you only want to
colourise part of the match, not the whole thing. For example:
^[A-Za-z_][A-Za-z0-9_]*\( matches lines with function
declarations (at least when they follow the coding standard), but what
I would really like hi-lighted is only the function-name, not the
function-name plus opening bracket. This is where I ended up hitting
the limits of shell and standard UNIX tools, and ended up back with
good old familiar Python. I created a simple grep replacement, colgrep, that by default
will colourise in the same manner as grep, but if you use
groups, it will only colour the matched groups (and will give
each group a different colour). Groups are part of the Python regular
expression syntax (and probably Perl too). Basically if a portion of
the regular expression is enclosed in brackets, that part of the
regexp forms a group. E.g: In SO(R)T the 'R' will be
matched as group #1. So for the previous example, if I write instead:
^([A-Za-z_][A-Za-z0-9_]*)\(, only the function name will
be hi-lighted. The other neat thing about Python regular expressions
is that you can name groups with the syntax
(?P<name>expr). I take advantage of this in
colgrep to colour groups with the name "err*" with a background of red
to hi-light errors.
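The colgrep source is not shown here, so the following is not that tool. It is a rough sed approximation of the group idea only: print lines matching the function-declaration pattern and wrap just group 1 in ANSI colour codes. The function name highlight_funcs and the choice of red are made up for the sketch.

```shell
# ESC character, needed to build the ANSI colour sequences.
esc=$(printf '\033')

# Hypothetical helper, not the author's colgrep: print only lines that match
# the pattern and colour group 1 (the identifier) red, leaving the opening
# bracket uncoloured.
highlight_funcs() {
    sed -n -E "s/^([A-Za-z_][A-Za-z0-9_]*)\(/${esc}[31m\1${esc}[0m(/p"
}

# Only the first line starts with an identifier followed by a bracket.
out=$(printf 'main(void)\n    x = 1;\n' | highlight_funcs)
echo "$out"
```

It handles a single fixed pattern and one colour, whereas the Python tool described above takes an arbitrary regexp and colours each group differently.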
So in conclusion, we’ve added some powerful commands to our toolbox which make it much more effective to quickly browse and search large revision-controlled code bases.