
Configuration File Management

Below we partially describe a method of configuration file management. Before we dive into it, some background clarifications are in order:

  • On this page, we're talking only about the subset of configuration files which are directly influenced by externals -- those containing IP addresses, user names, and other transient or unique data. If a configuration file can instead be managed as if it were part of the executable it controls, then it should be, and it does not fall into the problem domain of this page. See the "About Environmental Data" section below for more detail about this distinction.
  • When I say "configuration management" I mean "configuration file management"; specifically those configuration files containing mutable data which is influenced by externals -- see the first point above. The convention among sysadmin tool makers these days is to use the term "configuration management" to mean the management of both executables and configuration files -- I believe this is a carryover from the world of application build CM, but it's a bad fit; it's confusing at best, and has likely set the industry back some number of years.
  • While there will always be an area of fuzzy overlap between the two, executables and configuration files in general exhibit these differences in management characteristics:
    • Executables have dependencies, side-effects, and other ordering constraints, should be tested before going into production, and are usually part of a larger package which cannot be predictably rolled back in any automated way if the deployment fails.
    • Configuration files have few ordering constraints, can rarely be adequately tested before going into production, should be automatically post-tested during deployment, and in most cases can predictably be rolled back to their previous state if the test fails and if there have been no intervening changes to executables.
  • While they must be managed differently, executables and configuration files do need to be managed in sync with each other. It would be bad, for instance, to deploy a new version of a package which requires a new configuration file format, then leave the old configuration file there if the post-test fails. There is no good answer for this; the best any tool will ever be able to do in that case is scream for human help, while notifying its peers on other machines to halt their upgrades before that point. We take this into account below.
  • It's best to keep the set of configuration-file-like data as small as possible. This might seem counterintuitive, because on the face of it, configuration files sound easier to administer in so many ways -- they can be rolled back, ordering doesn't matter, and so on. But because of the nature of the data itself -- things like gateway addresses and server names -- these things tend to be difficult or impossible to test anywhere but in production. We cover this in more detail in the next section.

About Environmental Data

A better term for "the subset of configuration files which are directly influenced by externals" might be simply "environmental data". We use that term in the isconf man page; this extract from the man page does a good job of describing how this data should be managed:

Your goal should be to keep the set of environmental data as small as
possible, via architectural decisions in both infrastructure and
applications.

You need to be able to examine each bit of environmental data to try
to predict its behavior during deployment. Your ability to do this
will always be flawed -- you cannot possibly imagine all of the
permutations that might be encountered during future operations.
Keeping the environmental data set small reduces your workload and the
risk caused by a flawed analysis.

You need to be able to test each bit of environmental data after
deployment. Any change in environmental data, by definition, cannot be
tested anywhere except in its native environment. If this environment
is production, then we can only test these changes after deploying
them to production -- this is bad, but unless you have completely
duplicate networks, down to the details of IP addresses and hostnames,
there's not much you can do about it. Keeping the environmental
dataset small reduces the variations between environments; ideally, IP
addresses and/or hostnames might be the only differences you need to
analyze and test for.

The classic case of what not to do involves hardcoding IP addresses in
executables -- we all know this is bad, but here's why:  Embedding an
IP address in a larger executable taints the entire executable,
requiring that we manage the whole file as environmental data. It's
better to move that IP address to a separate configuration file, to
shrink the size of the environmental data set.

Executables aren't the only thing that can be tainted. Embedding an IP
address into a larger configuration file of non-environmental data
also taints the rest of the configuration file. If you have ever
generated configuration files by merging IP addresses into templates
of other data, then you have experienced this case. By using
templates, you prevent taint spread.
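
As a minimal sketch of the template case described in the extract, the Python below keeps the non-environmental part of a resolv.conf in a template and the environmental part in a small separate mapping. The names RESOLV_TEMPLATE, PRODUCTION_ENV, and render_config are hypothetical, and the addresses are made-up documentation values; none of this is part of isconf.

```python
from string import Template

# Non-environmental data: the template is the same in every environment,
# so it can be versioned and tested like part of the package it configures.
RESOLV_TEMPLATE = Template(
    "search $domain\n"
    "nameserver $primary_ns\n"
    "nameserver $secondary_ns\n"
    "options timeout:2\n"
)

# Environmental data: the only values that differ between environments.
# These are illustrative placeholders, not real site data.
PRODUCTION_ENV = {
    "domain": "example.com",
    "primary_ns": "192.0.2.10",
    "secondary_ns": "192.0.2.11",
}


def render_config(template, env):
    """Merge environmental values into a template.

    Raises KeyError if the environment is missing a value the template
    needs, so an incomplete data set fails before anything is deployed.
    """
    return template.substitute(env)


if __name__ == "__main__":
    print(render_config(RESOLV_TEMPLATE, PRODUCTION_ENV), end="")
```

With this split, only the three values in the mapping need to be analyzed and post-tested per environment; the template itself stays out of the environmental data set.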

Putting the 'conf' back into ISconf

Since isconf version 4 development started, the tool has been optimized for managing executables. Versions 1-3 didn't provide the strong guarantees we wanted for reproducible deployment of executables; we described this in excruciating detail in the Turing equivalence paper. Version 4 does a better job with executables, but the work that went into that paper also showed why configuration files are equally hard in their own way; we cover this above, as well as in EnvConf.

The isconf version 4.2.8 series doesn't have any special awareness of config files at all; we punt the whole problem to the local site, as described in the man page. This is not horrible; isconf can safely manage any local executables which pull data from external sources like LDAP or SVN; these executables can then massage the data and generate config files. While this is a theoretically valid approach, it doesn't do a whole lot to help someone who's trying to get a grip on configuration file management and doesn't want to write a lot of code.

One of the reasons we punted this to the local site was that we needed more information and feedback before committing things to code which would be hard to change later. Should isconf itself be a version control tool optimized for live config files, deprecating CVS, SVN, Mercurial, et al for that use? How far should we go in terms of guarantees that we can prevent a sysadmin from bringing down an entire site by deploying, say, a bad version of resolv.conf? To address the latter, for instance, we want to be able to include the "global transaction" barrier sync/rollback described in http://mailman.terraluna.org/pipermail/infrastructures/2005-March/001535.html -- and this in turn really needs the TCP mesh described in ticket #39.

Config File Management Roadmap

It's been a long time coming, but we think we've got a design now. The workflow might look like the following. In this example, we use CVS for configuration file version control; this is for historical reasons only, since most readers will have some familiarity with CVS. My own favorites these days are Mercurial for distributed repositories, and Subversion for when centralized makes more sense.

Here we're logged into a target machine:

    isconf lock "adding new nameserver to resolv.conf" 

    isconf edit /etc/resolv.conf    # runs 'cvs up' in a local CVS sandbox,  
                                    # brings up the sandbox (not live) copy
                                    # of the file in $EDITOR

    isconf up                       # copies the file (and any others
                                    # edited above) into the live filesystem, 
                                    # runs any post-replication
                                    # triggers, automatically reverts
                                    # to previous version of the file
                                    # if any trigger commands fail
    
    isconf tag PRD                  # optional: tags the local copy of all 
                                    # files in the sandbox as "PRD",
                                    # releasing them to production
                                    # machines

    isconf ci                       # blows up if 'isconf up' hasn't been 
                                    # run since 'isconf edit', blows up
                                    # if any of the triggered commands 
                                    # exited with a non-zero return
                                    # code, else does a 'cvs ci' in
                                    # the sandbox
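
The two checks that 'isconf ci' performs could be tracked with a small amount of per-sandbox state. The Python below is only a sketch under that assumption; SandboxState, CommitRefused, and check_commit_allowed are hypothetical names, not part of any existing isconf release.

```python
from dataclasses import dataclass, field


class CommitRefused(Exception):
    """Raised when the sandbox is not in a state that may be checked in."""


@dataclass
class SandboxState:
    # True from 'isconf edit' until the next 'isconf up'.
    edited_since_up: bool = False
    # Exit codes of the triggers run by the most recent 'isconf up'.
    trigger_exit_codes: list = field(default_factory=list)

    def record_edit(self):
        self.edited_since_up = True

    def record_up(self, exit_codes):
        self.edited_since_up = False
        self.trigger_exit_codes = list(exit_codes)


def check_commit_allowed(state):
    """Raise CommitRefused unless the last edit was deployed and post-tested."""
    if state.edited_since_up:
        raise CommitRefused("'isconf up' has not been run since 'isconf edit'")
    failed = [code for code in state.trigger_exit_codes if code != 0]
    if failed:
        raise CommitRefused(f"{len(failed)} trigger(s) exited non-zero")
```

Only when check_commit_allowed returns without raising would the tool go on to run the underlying 'cvs ci'.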

Later, on any other machine which uses the PRD release level, we just run:

    isconf up                       # does a 'cvs up' in a local sandbox, 
                                    # drops a sitewide semaphore in ISFS
                                    # letting other machines know what it
                                    # is about to do, copies changed files
                                    # into the live filesystem, runs
                                    # post-replication triggers,
                                    # clears sitewide semaphore if all
                                    # triggers exit with a zero return code
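
The copy, post-test, and revert steps that 'isconf up' performs on a single file might be sketched as follows. This is an illustration only: the function names and the ".isconf-prev" backup suffix are assumptions, triggers are modeled as argv lists, and the sitewide semaphore is left out entirely.

```python
import shutil
import subprocess
from pathlib import Path


def triggers_pass(triggers):
    """Run each trigger (an argv list) in order; stop at the first failure."""
    for argv in triggers:
        try:
            if subprocess.run(argv).returncode != 0:
                return False
        except OSError:
            # A trigger that cannot be started counts as a failed post-test.
            return False
    return True


def deploy_with_rollback(sandbox_file, live_file, triggers):
    """Copy sandbox_file over live_file, then post-test it.

    Returns True if every trigger exited zero. Otherwise restores the
    previous live file (or removes the new one if there was none) and
    returns False.
    """
    sandbox_file, live_file = Path(sandbox_file), Path(live_file)
    # Hypothetical backup naming; a real tool might keep history elsewhere.
    backup = live_file.with_name(live_file.name + ".isconf-prev")
    had_previous = live_file.exists()
    if had_previous:
        shutil.copy2(live_file, backup)
    shutil.copy2(sandbox_file, live_file)

    ok = triggers_pass(triggers)
    if not ok:
        if had_previous:
            shutil.copy2(backup, live_file)
        else:
            live_file.unlink()
    if had_previous:
        backup.unlink()
    return ok
```

The copy here is not atomic, and the revert is only meaningful if no executables changed between the two versions of the file, which is the same caveat given for configuration file rollback at the top of this page.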

A key to this design is that we don't deprecate local change control tools -- we support them with a plugin which presents the isconf backend with a standard API. The local site continues to use CVS, SVN, Mercurial, or anything else that anyone cares to write a plugin for.
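
The plugin API has not been designed yet, so the following Python is only a guess at its general shape. The class and method names are hypothetical; the cvs and hg command lines are the ordinary ones for update, commit, and moving a tag, and the two tools do not have identical semantics (an hg commit is local, for example), which a real plugin would have to paper over.

```python
import subprocess
from abc import ABC, abstractmethod


class VersionControlPlugin(ABC):
    """Hypothetical interface the isconf backend would call into."""

    @abstractmethod
    def update_command(self):
        """Return the argv that refreshes the local sandbox."""

    @abstractmethod
    def commit_command(self, message):
        """Return the argv that checks in sandbox changes."""

    @abstractmethod
    def tag_command(self, tag):
        """Return the argv that moves a release tag such as PRD."""


class CvsPlugin(VersionControlPlugin):
    def update_command(self):
        return ["cvs", "update", "-d"]

    def commit_command(self, message):
        return ["cvs", "commit", "-m", message]

    def tag_command(self, tag):
        return ["cvs", "tag", "-F", tag]  # -F moves the tag if it exists


class MercurialPlugin(VersionControlPlugin):
    def update_command(self):
        return ["hg", "pull", "-u"]

    def commit_command(self, message):
        return ["hg", "commit", "-m", message]

    def tag_command(self, tag):
        return ["hg", "tag", "-f", tag]  # -f replaces an existing tag


# Registry keyed by a site-chosen name (assumed, e.g. from a config option).
PLUGINS = {"cvs": CvsPlugin, "hg": MercurialPlugin}


def run_in_sandbox(argv, sandbox_dir):
    """Run a plugin-supplied command inside the sandbox; return its exit code."""
    return subprocess.run(argv, cwd=sandbox_dir).returncode
```

Having plugins return command lines rather than run them keeps the backend in charge of execution, locking, and error handling, so a new plugin is little more than a table of commands.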

What we do with that sitewide semaphore is open to debate: we can simply freeze further updates on all machines until the semaphore is cleared, or we can use statistical techniques to allow updates in parallel at a slowly increasing rate, governed by the number of machines which have successfully cleared their semaphore. This sitewide tracking of "critical code" entry and exit also gives us the ability to support sitewide barriers and limited rollback of config files in cases where a change destroys connectivity, as in http://mailman.terraluna.org/pipermail/infrastructures/2005-March/001535.html. (A good test of this "sitewide rollback" capability might be whether it can gracefully handle a failed network renumbering attempt.)
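
To make the second option concrete, here is one very simple policy, sketched in Python. It is a plain linear ramp rather than a real statistical technique, and the function names and the ramp_divisor parameter are assumptions made for illustration: one machine may update at first, one more is admitted for every few machines that have cleared their semaphore, and any failure freezes further updates.

```python
def allowed_in_flight(cleared, failed, ramp_divisor=4):
    """How many machines may be inside 'isconf up' at the same time.

    cleared: machines that finished and cleared their semaphore
    failed:  machines whose triggers failed (semaphore never cleared)
    """
    if failed > 0:
        return 0  # freeze the site until a human has looked at it
    return 1 + cleared // ramp_divisor


def may_start_update(cleared, in_flight, failed, ramp_divisor=4):
    """Decide whether one more machine may drop its semaphore and proceed."""
    return in_flight < allowed_in_flight(cleared, failed, ramp_divisor)
```

Setting ramp_divisor very large approximates the first option, where only one machine updates at a time until its semaphore is cleared.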

Missing from the above description is more detail on how the release tags work, how we support differing configuration files on different groups of hosts, the format and management of the triggers file, how file manifests are kept, and how we can specify that a certain version of configuration file not be deployed until a certain version of executable has been, or vice versa.

Meanwhile, feedback on the above would be greatly appreciated, before we commit too much to code. See SupportHelp -- the IRC channel is probably the best bet, since we're actively working on this right now. As always, feel free to edit this page directly -- see LoginHelp.