Continuing the theme of explaining things for Benno 5 months from now, here is how I'm currently making my backups work. The basic thing I want is incremental backups onto an external hard drive.
rsync makes this really
easy to do. The latest version of rsync has --link-dest,
which will hardlink to files in an existing directory tree, rather than
creating a new file. This makes each incremental reasonably cheap, as
most files are just hard links back to the previous backup.
So basically I do:
sudo rsync -a --link-dest=../backups.0/ /Users/ backups.1/
Note that the trailing slash on /Users/ is actually important here.
A big gotcha is ensuring that the external disk is actually mounted with permissions enabled, or else the hard links won't actually be made.
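A quick way to verify that the hard links really were made is to compare inode numbers between two backups; a small sketch:

```python
import os

def is_hardlinked(path_a, path_b):
    # Hard links share a device and inode number, so comparing those
    # tells you whether rsync actually linked rather than copied.
    stat_a, stat_b = os.stat(path_a), os.stat(path_b)
    return (stat_a.st_dev, stat_a.st_ino) == (stat_b.st_dev, stat_b.st_ino)
```

Running this over a file in backups.0 and the same file in backups.1 should return True if --link-dest did its job.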
After this everything works nicely, when combined with a simple script:
#!/usr/bin/env python
"""
Use rsync to backup my /Users directory onto an external hard drive.
"""
import os

def get_latest():
    backups = [int(x.split(".")[-1]) for x in os.listdir(".") if x.startswith("backups.")]
    backups.sort()
    return backups[-1]

os.chdir("/Volumes/backup2007")
latest = get_latest()
next = latest + 1
command = "sudo rsync -a --link-dest=../backups.%d/ /Users/ backups.%d/" % (latest, next)
os.system(command)
Currently I don't do any encryption, which might be useful in the future.
Mail on UNIX is weird. I spent a few hours this week tracking down
some bugs in my mail setup, and in the process learnt a lot more about
how things interact. I'm documenting it here for Benno 5 months from
now and anyone else that cares. There is this pseudo-standard of mail
user agents (MUAs) not actually talking SMTP to a mail submission
agent (MSA), but instead simply hoping that
/usr/sbin/sendmail exists. Things are so bound this way
that you really can't avoid having a /usr/sbin/sendmail
on your machine, so as a result other mail transport agents (MTAs)
have to provide a binary that works in the same way as sendmail.
Unfortunately, as far as I can tell there is no standard for what
command line flags a sendmail-compatible program should accept; it
appears to come down to common usage.
In some ways this is quite backwards to what I would have expected, which is that an MTA would run on the localhost and programs would talk SMTP to port 25 (or the MSA port 587), so that all communication is through this documented standard. On the other hand, this would mean every program that wants to send e-mail must have TCP and SMTP libraries compiled in, which is against the UNIX philosophy.
Now I am actually quite interested to find out what other programs (that I use) actually rely on sendmail. I'm guessing (hoping) that it is mostly just MUAs such as mutt and mailx, but I really wonder what else is out there relying on sendmail.
More importantly, what command line arguments do these different
programs expect /usr/sbin/sendmail to handle correctly?
(In case I wanted to join the hundreds of others and, say, write my own
sendmail replacement.) So after putting in place a simple python program
that logs the command line arguments, my great findings are:
mailx uses the -i argument. This one is
great: by default a line with a single '.' character is treated as the
end of input; with the -i argument, end of file means end of
file. mutt on the other hand uses -oi
(which is the same as -i) and
-oem. -oem is conveniently undocumented in
Postfix's sendmail man page, but on consulting the exim sendmail man
page, I discovered that it basically means the command can't fail,
and any errors should be reported through email, rather than with an
error return code.
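The logging stand-in itself is only a few lines; a sketch (the log path and exact format here are my own guesses) might look like:

```python
#!/usr/bin/env python
"""Hypothetical stand-in for /usr/sbin/sendmail that just records its arguments."""
import sys

LOG = "/tmp/sendmail-args.log"  # arbitrary location

def format_invocation(argv):
    # One log line per invocation: everything after the program name.
    return " ".join(argv[1:])

def main():
    with open(LOG, "a") as log:
        log.write(format_invocation(sys.argv) + "\n")
    sys.stdin.read()  # swallow the message so the caller sees success

if __name__ == "__main__":
    main()
```

Drop it in place of the real sendmail binary, send some mail, and the log shows you exactly which flags each program passes.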
mutt lets you override the command it uses for mail submission by
setting the sendmail variable. This is useful to know if you
want to add extra arguments to the sendmail command line. For example,
a handy feature is being able to control the envelope from address
used; this is done with the -f command line argument.
Next up for my adventures in the wonderful world of UNIX email
is to set up my own sendmail-compatible program, which can set the
envelope address and smarthost to use based on the e-mail From:
header.
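A sketch of how that next step might look; the routing table and all the addresses here are entirely hypothetical:

```python
from email.parser import Parser
from email.utils import parseaddr

# Hypothetical routing table: From: address -> (envelope sender, smarthost).
ROUTES = {
    "benno@example.org": ("benno@example.org", "smtp.example.org"),
    "benno@work.example.com": ("benno@work.example.com", "mail.work.example.com"),
}

def route_message(text):
    """Choose the envelope sender and smarthost from the message's From: header."""
    msg = Parser().parsestr(text, headersonly=True)
    # parseaddr strips any display name, leaving the bare address.
    addr = parseaddr(msg.get("From", ""))[1]
    return ROUTES.get(addr)
```

The real program would then invoke the chosen smarthost's submission mechanism with -f set to the envelope sender.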
Reading Jeff's humorous post
got me digging into exactly why Python would be running in a select
loop when otherwise idle.
It basically comes down to some code in the module that wraps GNU readline. The code is:
while (!has_input)
{
    struct timeval timeout = {0, 100000}; /* 0.1 seconds */
    FD_SET(fileno(rl_instream), &selectset);
    /* select resets selectset if no input was available */
    has_input = select(fileno(rl_instream) + 1, &selectset,
                       NULL, NULL, &timeout);
    if (PyOS_InputHook) PyOS_InputHook();
}
So what this is basically doing is trying to read data, but 10 times a second
calling out to PyOS_InputHook(), which is a hook that can be used by C
extensions (in particular Tk) to process something when Python is otherwise idle.
Now the slightly silly thing is that it will wake up every 100 ms even if PyOS_InputHook is not actually set. So a slight change:
while (!has_input)
{
    struct timeval timeout = {0, 100000}; /* 0.1 seconds */
    FD_SET(fileno(rl_instream), &selectset);
    /* select resets selectset if no input was available */
    if (PyOS_InputHook) {
        has_input = select(fileno(rl_instream) + 1, &selectset,
                           NULL, NULL, &timeout);
    } else {
        has_input = select(fileno(rl_instream) + 1, &selectset,
                           NULL, NULL, NULL);
    }
    if (PyOS_InputHook) PyOS_InputHook();
}
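The difference between the two select calls is easy to see from Python itself; a quick sketch using a socket pair in place of rl_instream:

```python
import select
import socket

a, b = socket.socketpair()
# With a 0.1 s timeout, select returns empty when no data is pending;
# this is exactly the wakeup the readline loop does ten times a second.
readable, _, _ = select.select([a], [], [], 0.1)
assert readable == []
# With a None timeout, select blocks until the descriptor is actually
# readable, so an idle process stays asleep.
b.send(b"x")
readable, _, _ = select.select([a], [], [], None)
assert readable == [a]
```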
With this change Python is definitely ready for the enterprise!
So I recently stumbled across Jester, which is a mutation testing tool. Mutation testing is really a way of testing your tests. The basic idea is: make small changes (mutations) to your code, run your test suite against each mutant, and if the tests still pass then that mutant has survived, which points at a hole in your tests.
So in testing your code you should first aim for complete coverage, that is, ensure that your test cases exercise each line of code. Once you get there, mutation testing can help you find other holes in your test cases.
So the interesting thing here is what mutation operations (or mutagens) you should perform on the original code. Jester provides only some very simple mutations (at least according to this paper).
Unfortunately it seems that such an approach is, well, way too simplistic, at least according to Jeff Offutt, who has published papers in the area. Basically, any rudimentary testing finds these changes, and doesn't really find the holes in your test suite.
A paper on Mothra, a mutation system for FORTRAN, describes a large number of different mutation operations. (Coincidentally, my friend has a bug tracking system called Mothra.)
Anyway, while Jester has a Python port, I didn't quite like its approach (especially since I would have needed to install Java, which I didn't really feel like doing). So I decided to explore how this could be done in Python.
So the first set of mutations I wanted to play with was changing constants
to different values, e.g. changing 3 to 1 or 2. This turns out to be reasonably
easy to do in Python. I took functions as the unit I wanted to play with.
So, Python makes it easy to introspect functions. You can get a list of
constants on a function like so: function.func_code.co_consts. Now,
you can't actually modify this, but what you can do is make a new
copy of the function with a different set of constants. This is convenient because
in mutation testing we want to create mutants. So:
def f_new_consts(f, newconsts):
    """Return a copy of function f, but change its constants to newconsts."""
    co = f.func_code
    codeobj = type(co)(co.co_argcount, co.co_nlocals, co.co_stacksize,
                       co.co_flags, co.co_code, tuple(newconsts), co.co_names,
                       co.co_varnames, co.co_filename, co.co_name,
                       co.co_firstlineno, co.co_lnotab, co.co_freevars,
                       co.co_cellvars)
    new_function = type(f)(codeobj, f.func_globals, f.func_name, f.func_defaults,
                           f.func_closure)
    return new_function
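For what it's worth, on Python 3 the attribute is spelled __code__, and from 3.8 onwards code objects have a replace() method that saves spelling out every constructor argument; a rough modern equivalent:

```python
import types

def f_new_consts(f, newconsts):
    # code.replace() (Python 3.8+) copies the code object with the
    # given fields swapped out, here just the constants tuple.
    code = f.__code__.replace(co_consts=tuple(newconsts))
    return types.FunctionType(code, f.__globals__, f.__name__,
                              f.__defaults__, f.__closure__)

def three():
    return 3

# Mutate the constant 3 into 1, leaving any other constants alone.
mutant = f_new_consts(three, [1 if c == 3 else c
                              for c in three.__code__.co_consts])
```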
So that is the basic mechanism I'm using for creating mutant functions with
different constants. The other mutants I wanted to make were those where
I change comparison operators, e.g. changing < to <= or >. This is a bit trickier;
it means getting down and dirty with the Python byte code, or at least that is my
preferred approach. I guess you could also make syntactic changes and recompile.
Anyway, you can get a string representation of a function's bytecode like so:
function.func_code.co_code. To do something useful with this you
really want to convert it to a list of integers, which is easy:
[ord(x) for x in function.func_code.co_code].
So far, so good. Next you need to be able to iterate through the individual bytecodes, which is a little tricky as they can be either 1 or 3 bytes long. A loop like this:
from opcode import opname, HAVE_ARGUMENT

opcodes = [ord(x) for x in co.co_code]
i = 0
while i < len(opcodes):
    opcode = opcodes[i]
    print opname[opcode]
    i += 1
    if opcode >= HAVE_ARGUMENT:
        i += 2
will iterate through the opcodes, printing each one. Basically all bytecodes greater than or equal to the constant HAVE_ARGUMENT are three bytes long.
From here it is possible to find all the COMPARE_OP
byte codes. The first argument byte specifies the type of compare operation;
these can be decoded using the opcode.cmp_op tuple.
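On current Pythons the raw layout has changed (every instruction is two bytes, and inline caches complicate the argument decoding), so it's worth noting that the dis module can do this decoding for you; a sketch:

```python
import dis

def compare_ops(f):
    # dis.get_instructions hides the bytecode layout; argrepr gives the
    # decoded comparison for each COMPARE_OP instruction.
    return [ins.argrepr for ins in dis.get_instructions(f)
            if ins.opname == "COMPARE_OP"]

def lt(a, b):
    return a < b
```

Here compare_ops(lt) yields one entry whose decoded representation contains "<", which is exactly the operator a mutation would want to swap for "<=" or ">".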
So, at this point I have some basic proof-of-concept code for creating some very simple mutants. The next step is to try and integrate this with my unit testing and coverage testing framework, and then my code will never have bugs in it again!
The ssh ProxyCommand option is just insanely useful. The reason I want to use it is that it makes it easy to tunnel ssh through a firewall. So for example, you have a machine on your corporate network that is sitting behind a firewall, but you have a login to the firewall. Now of course you could just do:
laptop$ ssh gateway
gateway$ ssh internal
internal$
or something like:
laptop$ ssh -t gateway ssh internal
internal$
But that really doesn't work very well if you use scp or sftp or some other service like revision control that runs on top of it. This is where you can use the ProxyCommand option to make your life easier. Basically you want to put something like this into your .ssh/config:
Host internal
ProxyCommand ssh gw nc -w 1 internal 22
For this to work you will need netcat (nc) installed on the gw, but that generally isn't too much of a problem. Now you can transparently ssh, scp or whatever to internal, and ssh deals with it.
Now of course if you can't get into the network full stop, you need some reverse tunneling tricks and some other host on the interweb which you can use to tunnel through. So something like:
ssh -f -nNT -R 1100:localhost:22 somehost
will set up a remote tunnel, which basically means that if you connect
to port 1100 on somehost, the connection will be forwarded to port 22
on the machine on which you ran ssh -R. Unfortunately,
unless you have full control over the host you are creating the
reverse tunnel to, you will find that port 1100 will only be bound to
localhost, which means you will probably still need to use the trick
mentioned above to get seamless access to it.
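Putting the two together: once the reverse tunnel is up, a ProxyCommand hop through the tunnel host gives seamless access again. A sketch for .ssh/config (the host names here are placeholders):

```
Host internal-via-tunnel
    ProxyCommand ssh somehost nc -w 1 localhost 1100
```

As before this needs netcat on somehost, and then ssh, scp and friends all work against internal-via-tunnel transparently.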
So I was inspired (distracted) by the python functional programming module, and got to thinking: couldn't things like partial application and function composition be made a bit more natural in the syntax?
So it turns out that with decorators and a bit of evil introspection I came up with something that works (at least for functions that don't have keyword arguments). So you can do something like:
@Functional
def add(a, b): return a + b
add3 = add.add
assert(add3(1)(2)(3) == add3(1, 2, 3) == add3(1, 2)(3) == add3(1)(2, 3) == 6)
So what is the magic (or evilness)? Well first we (ab)use decorators to mark things as Functional, which, as you probably know, is just a shortcut for function = Functional(function).
The evil is in the Functional class:
import inspect

class Functional:
    def __init__(self, fn, nargs=None):
        self.fn = fn
        if not nargs:
            _args, _, _, _ = inspect.getargspec(self.fn)
            self.nargs = len(_args)
        else:
            self.nargs = nargs

    def __call__(self, *args, **kargs):
        if len(args) > self.nargs:
            res = self.fn(*args[:self.nargs], **kargs)
            if res.__class__ == Functional:
                return res(*args[self.nargs:])
            if type(res) != type((0,)):
                res = (res, )
            return res + args[self.nargs:]
        elif len(args) == self.nargs:
            return self.fn(*args, **kargs)
        else:
            def newfunc(*fargs, **fkeyw):
                return self.fn(*(args + fargs))
            newfunc.func = self.fn
            newfunc.args = args
            return Functional(newfunc, self.nargs - len(args))

    def __getattr__(self, name):
        if hasattr(self.fn, name):
            return getattr(self.fn, name)
        func_2 = globals()[name]
        def composition(*args, **kwargs):
            res = func_2(*args, **kwargs)
            if type(res) != type((0,)):
                res = (res, )
            return self(*res)
        return Functional(composition, func_2.nargs)
I totally abuse the __getattr__ method so that dot becomes a composition operator. This returns a new function (which is also Functional), which when called will pass on the return value from the first function to the second function. If the first function returns multiple results each of these is passed as arguments to the second function.
The real magic comes in the overloaded __call__, which is where
partial functions are hacked in. Basically, if not enough arguments are
passed to the function, it returns a new function that accumulates the
arguments passed so far and, once it has enough, calls the original function.
Of course the function returned from partial application is itself Functional. The real
evil is in supporting the case add3(1, 2, 3): we
detect that too many arguments were passed and call with only the first
arguments; then, if the called function returns another Functional, we apply the remaining
arguments to it.
Oh yeah, I wouldn't use this in any real Python code, as it is likely to confuse everyone!