Thursday, 16 August 2012

Principles of Research Code

Ali Eslami has just written a terrific page on organizing your experimental code and output. I pretty much agree with everything he says. I've thought quite a bit about this and would like to add some background.

Programming for research is very different from programming for industry. There are several reasons for this, which I will call Principles of Research Code. These principles underlie all of the advice in Ali's post and in this post. They are:

  1. As a researcher, your product is not code. Your product is knowledge. Most of your research code you will completely forget once your paper is done.
  2. Unless you hit it big. If your paper takes off, and lots of people read it, then people will start asking you for a copy of your code. You should give it to them, and best to be prepared for this in advance.
  3. You need to be able to trust your results. You want to do enough testing that you do not, e.g., find a bug in your baselines after you publish. A small amount of paranoia comes in handy.
  4. You need a custom set of tools. Do not be afraid to write infrastructure and scripts to help you run new experiments quickly. But don't go overboard with this.
  5. Reproducibility. Ideally, your system should be set up so that five years from now, when someone asks you about Figure 3, you can immediately find the command line, experimental parameters, and code that you used to generate it.

Principle 1 implies that the primary thing that you need to optimise for in research code is your own time. You want to generate as much knowledge as possible as quickly as possible. Sometimes being able to write fast code gives you a competitive advantage in research, because you can run on larger problems. But don't spend time optimising unless you're in a situation like this.

Also, I have some more practical suggestions to augment what Ali has said. These are

  1. Version control: Ali doesn't mention this, probably because it is second nature to him, but you need to keep all of your experimental code under version control. Not doing so is courting disaster. Good version control systems include SVN, git, and Mercurial. I now use Mercurial, but it doesn't really matter which you use. Always commit all of your code before you run an experiment. This way you can reproduce your experimental results by checking out the version of your code from the time that you ran the experiment.
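    One lightweight way to enforce this is a pre-flight check that refuses to launch an experiment from a dirty working copy, and hands back the revision id so you can save it next to the results. This is just a sketch of the idea, not my actual setup: `hg id -i` is a real Mercurial command (it prints the working copy's short id, with a trailing `+` if there are uncommitted changes), but the wrapper functions are hypothetical.

```python
import subprocess

def working_copy_id():
    """Return the current Mercurial revision id, e.g. 'a1b2c3d4e5f6'.

    `hg id -i` appends '+' when the working copy has uncommitted changes.
    """
    out = subprocess.run(["hg", "id", "-i"], capture_output=True,
                         text=True, check=True)
    return out.stdout.strip()

def assert_committed(rev_id):
    """Refuse to run an experiment if the code is not fully committed."""
    if rev_id.endswith("+"):
        raise RuntimeError("Uncommitted changes: commit before running, "
                           "so the results can be reproduced later.")
    return rev_id

# Typical use at the top of an experiment script:
#     rev = assert_committed(working_copy_id())
#     ... run the experiment, then save `rev` alongside the results ...
```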

  2. Random seeds: Definitely take Ali's advice to take the random seed as a parameter to your methods. Usually what I do is pick a large number of random seeds, save them to disk, and use them over and over again. Otherwise debugging is a nightmare.
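    A minimal sketch of the seed-pool idea (the function names are mine, and I use Python's standard `random` module purely for illustration):

```python
import json
import random

def make_seed_pool(path, n=100, master_seed=0):
    """Draw a pool of seeds once and save it to disk for reuse."""
    rng = random.Random(master_seed)
    seeds = [rng.randrange(2**31) for _ in range(n)]
    with open(path, "w") as f:
        json.dump(seeds, f)
    return seeds

def load_seed_pool(path):
    """Load the saved pool so every experiment reuses the same seeds."""
    with open(path) as f:
        return json.load(f)

def run_trial(seed):
    """Each method takes its seed as a parameter and uses a local
    generator, so a failing run can be replayed exactly."""
    rng = random.Random(seed)
    return rng.random()  # stand-in for the real experiment
```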

  3. Parallel option sweeps: It takes some effort to get set up on a cluster like ECDF, but if you invest this, you get some nice benefits like the ability to run a parameter sweep in parallel.
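    On ECDF itself, jobs go through the cluster's batch scheduler, but the shape of a parameter sweep can be sketched on a single machine with Python's `multiprocessing` (the parameter names and the toy "experiment" here are illustrative):

```python
import itertools
from multiprocessing import Pool

def run_one(params):
    """Stand-in for a single experimental run with one parameter setting."""
    dimensions, iterations, seed = params
    return params, dimensions * iterations + seed  # placeholder result

def sweep():
    """Run every point in a small parameter grid in parallel."""
    grid = list(itertools.product([10, 20],     # dimensions
                                  [100, 300],   # iterations
                                  [0, 1, 2]))   # random seeds
    with Pool(processes=2) as pool:
        return dict(pool.map(run_one, grid))
```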

  4. Directory trees: It is good to have your working directory in a different part of the directory space from your code, because then you don't get annoying messages from your version control system asking you why you haven't committed your experimental results. So I end up with a directory structure like

        ~/hg/projects/loopy_crf/code/synth_experiment.py
        ~/results/loopy_crf/synth_experiment/dimensions_20_iterations_300
    

    Notice how I match the directory names to help me remember what script generated the results.
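    A small helper along these lines (my own sketch, not code from any real project) can build the results path from the script name and the experimental parameters, so the two always match:

```python
import os

def results_dir(base, project, script_path, **params):
    """Build e.g. <base>/loopy_crf/synth_experiment/dimensions_20_iterations_300
    from the experiment script's filename and its parameters."""
    script = os.path.splitext(os.path.basename(script_path))[0]
    tag = "_".join("%s_%s" % (k, v) for k, v in sorted(params.items()))
    path = os.path.join(base, project, script, tag)
    os.makedirs(path, exist_ok=True)
    return path
```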

  5. Figures list. The day after I submit a paper, I add enough information to my notebook to meet Principle 5. That is, for every figure in the paper, I make a note of which output directory and which data file contains the results that made that figure. Then for those output directories, I make sure to have a note of which script and options generated those results.

  6. Data preprocessing. Lots of times we have some complicated steps to do data cleaning, feature extraction, etc. It's good to save these intermediate results to disk. It's also good to use a text format rather than binary, so that you can do a quick visual check for problems. One tip that I use to make sure I keep track of what data cleaning I do is to use Makefiles to run the data cleaning step. I have a different Makefile target for each intermediate result, which gives me instant documentation.
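    For example, a Makefile for a two-stage pipeline might look like this (the script and file names are made up; the point is that each target records exactly how its intermediate file is produced):

```make
# Each intermediate result has its own target, so the Makefile doubles
# as documentation of the preprocessing pipeline.
features.txt: clean.txt extract_features.py
	python extract_features.py clean.txt > features.txt

clean.txt: raw.txt clean_data.py
	python clean_data.py raw.txt > clean.txt
```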

If you want to read even more about this, I gave a guest lecture last year on a similar topic (slides, podcast).

5 comments:

  1. I do not agree on the first point. The product is the knowledge, but helping others to reach the same knowledge via code is an improvement. Sharing code or knowledge is about the same. I share!

    Replies
    1. Thanks for your comment. I completely agree with you that researchers should wherever possible share their code to encourage reproducible research and to help people understand their algorithms. I was trying to say something different. Research code can have a lot of potential uses, e.g., as a reference implementation of an algorithm, or as a reusable toolkit that others can use, but for most code, most of the time, the primary goal is to test your ideas. For this type of code, I am saying that you want to optimize for your programmer time over maintainability. This is why research code is often "quick and dirty" in ways that production code should not be---and this is a good thing. For example, say your command line tool crashes if the user enters the arguments in the wrong order. In production code, you'd need to fix that. In research code, don't bother. Your time is more important. That's a tradeoff that sometimes students aren't used to, coming out of a CS undergraduate degree.

  2. This comment has been removed by the author.

  3. This comment has been removed by the author.

  4. I recently ran across a blog entry by Hilary Parker ( http://hilaryparker.com/2012/08/25/love-for-projecttemplate/ ) that tackles some similar infrastructure questions for data analysis using the Project Template project for R ( http://projecttemplate.net/index.html ).

    The hard issue for me is capturing all of the context at each stage of work -- from versions of files to generate specific plots, to everything I was thinking about. I rely heavily on lab notebooks, but that still gets awkward for capturing certain types of data/parameters.
