Welcome to the “Make” for Scientists tutorial! We're aiming to explain a
little bit about the useful Unix tool make: what it is, when you might
want to use it, and how it works. We're not going to be able to teach you
all the details, but hopefully we can put something new into your bag of
computational tricks.
Why do we call this “Make for Scientists”? Well, make is most often used
by computer programmers for compiling software, so most of the tutorials
that are out there focus on that use case. We'll focus on scientific workflows
instead, mostly drawn from astronomy. A few mathematical topics will be
briefly touched on, but none of them are central to understanding what's
going on. If you're not a scientist, there's nothing scary in here; the
examples will just be tilted in a direction that might be less directly useful
to you.
Note: this tutorial is a work in progress. It's not yet complete or polished.
make is a venerable Unix command-line tool. It makes things.
Before running make you specify a set of recipes: “To make A, do X,
then Y, then Z.” All make does is execute certain of those recipes for
you. The things that make makes, such as our metasyntactic A, are called
targets.
You also tell make about dependencies between targets: “A is derived
from B. If B changes, A needs to be remade.” make can keep track of
“freshness,” so that it knows which targets need to be remade and which don't.
When you ask make to make a target, it looks at all of the dependencies to
make sure that it's fully up-to-date.
On an abstract level, that's really all there is to it.
The concrete details of make's implementation are also easy to state.
Targets are files on your hard drive. Freshness is recorded in the
modification times of those files. And the recipes are short Unix shell
scripts.
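To make the freshness rule concrete, here's a minimal sketch in Python of the check just described. It's a toy model under our own assumptions, not make's actual implementation: the function name needs_rebuild is made up for illustration, and it assumes that every dependency already exists on disk.

```python
import os


def needs_rebuild(target, dependencies):
    """Return True if `target` is missing or older than any dependency.

    A simplified model of the freshness rule; `needs_rebuild` is a
    hypothetical name, and every dependency is assumed to exist.
    """
    if not os.path.exists(target):
        return True  # a missing target is never up-to-date
    target_time = os.path.getmtime(target)
    # The target is stale if any dependency was modified after it was.
    return any(os.path.getmtime(dep) > target_time for dep in dependencies)
```

For our metasyntactic files, needs_rebuild("A", ["B"]) stays True until A exists and is at least as new as B.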
This short summary might not make it clear why make can be so useful. We'll
explore its possibilities shortly. But first, a quick example to get you
grounded.
You tell make about your targets, their dependencies, and your recipes by
writing an ASCII text file called Makefile. When you run make on the
command line, it looks for Makefile in your current directory, loads up the
specifications, and then ensures that your targets are up-to-date. That is, if
any of them are insufficiently fresh, it reruns the appropriate recipes. If
they're all fresh enough, nothing happens.
Here's a basic example Makefile:
paper.pdf: paper.tex table.tex
	pdflatex paper
table.tex: extractor.py info.fits
	./extractor.py info.fits >table.tex
Each Makefile has a series of stanzas defining targets. The filename of each
target comes before the colon. The targets upon which it depends come after
the colon. Then come indented lines of shell code that, when run, will create
the target. In the above example, there are two targets: paper.pdf and
table.tex. The first target, paper.pdf, depends on paper.tex and table.tex,
and is created by running pdflatex. If paper.tex or table.tex changes,
pdflatex needs to be rerun.
The target table.tex in turn depends on extractor.py and info.fits. It
gets created by running this bit of shell code:
./extractor.py info.fits >table.tex
Imagine that you start in a directory that contains Makefile, paper.tex,
extractor.py, and info.fits. If you run make, it will try to ensure that
paper.pdf is up-to-date. paper.pdf doesn't exist at all, which is (sensibly)
interpreted as being not up-to-date. Therefore it needs to be rebuilt.
But before that can be done, we need to consider the dependencies of
paper.pdf: paper.tex and table.tex. There's no entry in the Makefile
for paper.tex. This is also treated sensibly: as long as the file exists,
it's considered up-to-date. If paper.tex did not exist, make would
signal an error, saying something like
No rule to make target 'paper.tex'.
table.tex, on the other hand, is listed in the Makefile with its own
dependencies. These need to be checked before paper.pdf can be built. make
sees that table.tex depends on extractor.py and info.fits, both of which
exist, and neither of which is listed in the Makefile. No sub-dependencies
need to be checked, so table.tex can be built. Once this is done, all of the
dependencies of paper.pdf have been dealt with, so it too can be built.
When you run make, it prints out the commands that it's running. So the
overall effect would be:
$ make
./extractor.py info.fits >table.tex
pdflatex paper
$
(Well, I've cheated in one way: each command that make runs prints its
output as normal, so you'd also get a bunch of output from pdflatex, as well
as anything that extractor.py happens to print out. This can get pretty
annoying, and I recommend using a wrapper script to drive latex
that hides its output by default.)
Imagine that we write up some amazing results in paper.tex and rerun make.
What happens? The same evaluation from before is repeated. This time, however,
table.tex exists, and none of its dependencies have been updated: the
modification time of the file table.tex is newer than the times of
extractor.py and info.fits. So we don't need to regenerate it. However,
paper.pdf is a different story. It exists, and its dependencies exist, but
it's out-of-date: paper.tex has a more recent modification time. Therefore,
the pdflatex command does need to be rerun:
$ make
pdflatex paper
$
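The two runs above can be replayed with a small Python model of the evaluation. This is only a sketch under our own assumptions: the function name plan is invented, the "files" are just a dictionary mapping names to made-up modification times, and real make handles many cases that this ignores.

```python
def plan(target, rules, mtimes, order=None):
    """Return the targets that must be rebuilt, in the order to rebuild them.

    rules:  {target: [dependencies]} for targets that have a stanza
    mtimes: {filename: modification time} for files that currently exist
    (`plan` and both dictionaries are hypothetical, for illustration only.)
    """
    if order is None:
        order = []
    if target not in rules:
        # No stanza for this file: it only needs to exist.
        if target not in mtimes:
            raise FileNotFoundError(f"No rule to make target '{target}'")
        return order
    deps = rules[target]
    for dep in deps:
        plan(dep, rules, mtimes, order)  # check the dependencies first
    stale = target not in mtimes or any(
        dep in order or mtimes[dep] > mtimes[target] for dep in deps
    )
    if stale and target not in order:
        order.append(target)
    return order


rules = {
    "paper.pdf": ["paper.tex", "table.tex"],
    "table.tex": ["extractor.py", "info.fits"],
}

# First run: only the source files exist.
first = plan("paper.pdf", rules, {"paper.tex": 1, "extractor.py": 1, "info.fits": 1})
print(first)   # ['table.tex', 'paper.pdf']

# Later: everything was built, and then paper.tex was edited (time 5).
after_edit = {"paper.tex": 5, "extractor.py": 1, "info.fits": 1,
              "table.tex": 2, "paper.pdf": 3}
second = plan("paper.pdf", rules, after_edit)
print(second)  # ['paper.pdf']
```

The first plan matches the two commands printed by the first make run, and the second plan matches the lone pdflatex command printed after editing paper.tex.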
If you give make the name of one or more targets as command-line arguments,
it will build only them. If you don't give it any arguments, as we've been
showing, it builds the first target listed in your Makefile. Generally,
then, you want your first target to be your final product.
$ make paper.pdf
pdflatex paper
$ make paper.tex
make: Nothing to be done for 'paper.tex'.
$
Our example has been very simple, but hopefully it's starting to convey a
couple of ways in which make is a very useful tool.
Firstly, there's make's raison d’être: its dependency tracking, which lets
you keep a complicated data product up-to-date without completely recomputing
it from Square One. Even in our simple example, this could be very useful if,
say, the extractor.py program takes 10 minutes to run.
Secondly, there's the fact that all of your recipes are codified in your
Makefile, and you can access them just by typing make. It's not too hard
to type pdflatex paper once in a while, but some recipes are a lot more
complicated than that, and it's important to have them stored somewhere other
than your shell history. Even better, make can be your Swiss Army knife: a
single place you go to for access to a whole suite of useful tools. If you
have a data analysis process that generates dozens of intermediate products,
make can be the single point of access for creating or updating any of them:
once you've written your Makefile, make <filename> is the only thing
you need to remember.
A good thing about make that may be less obvious is the way that it lets you
build a complicated data pipeline from small pieces that you can reason about.
Writing the Makefile stanza for a target is usually not difficult: you
generally know how the recipe should go, and you just need to think about what
its dependencies are. But by just writing a bunch of straightforward targets
in a row, you can construct a sophisticated pipeline with complicated
dependency structure. For instance, the example that we showed above could be
approximated decently with a shell script; in a C shell, you could write
something like
#! /bin/tcsh
# run as "./script.sh extract" or "./script.sh latex",
# depending on how much reprocessing you want to do.
goto $1
extract:
./extractor.py info.fits >table.tex
latex:
pdflatex paper
But imagine a data processing pipeline that implements a complicated flowchart. It can't be easily linearized into a shell script, and it's very difficult to write something that will rerun only the recipes that truly need it.
You can think of the targets in a Makefile as constructing a graph, in the mathematical sense. Each target is a node, and has links to other nodes: its dependencies. A Makefile actually defines a very well-known kind of graph, a Directed Acyclic Graph or DAG. It's directed because the links between nodes have a direction: “A depends on B” does not imply “B depends on A”. It's acyclic because you're not allowed to have cycles, that is, closed loops. Such a loop is a “circular dependency” in make, and it should be clear that loops like this are not compatible with make's processing model. There is a well-developed literature regarding DAGs, and thinking about dependency trees in terms of graph theory can be helpful if your mind has a certain mathematical bent.
Incidentally, another well-respected piece of software, Git, also has a DAG as its core data structure, even though it applies it to solve a vastly different problem.
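If you'd like to poke at the graph picture directly, here's a short Python sketch using the standard library's graphlib module (available in Python 3.9 and newer). The function name build_order is our own invention, and this illustrates the DAG idea rather than how make itself is written.

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9 or newer


def build_order(dependencies):
    """Return every node with dependencies listed before their dependents.

    `dependencies` maps each target to the set of things it depends on.
    Raises CycleError if the graph contains a closed loop.
    (`build_order` is a hypothetical helper, not part of make.)
    """
    return list(TopologicalSorter(dependencies).static_order())


paper = {
    "paper.pdf": {"paper.tex", "table.tex"},
    "table.tex": {"extractor.py", "info.fits"},
}
order = build_order(paper)
# table.tex always comes before paper.pdf; the relative order of files
# that don't depend on each other may vary.
print(order)

try:
    build_order({"A": {"B"}, "B": {"A"}})  # A needs B, and B needs A
except CycleError:
    print("circular dependency detected")
```

Any ordering that the sorter produces is a valid order in which to run the recipes, and the second call shows that a circular dependency has no valid order at all.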
Another way of putting it is that make lets you build pipelines that get
arbitrarily sophisticated, but each step along that path is straightforward
and easy to check for correctness.
We’ve talked a little bit about the cases where make is useful and given a
short example of a Makefile, the file of rules and recipes that make uses to
know what to do. Now let's delve into some finer-grained details that are
important for getting the most out of make.
make is an old program, and as such it has some warts that are just really
dumb. The most famous of these concerns how recipe lines are indented. Let's
look at another example fragment of a Makefile:
mytool: mytool.c mytool.h mytable.h
	gcc -g -O2 -Wall -o mytool mytool.c
mytable.h: maketable.py mytable.dat
	python maketable.py mytable.dat >mytable.h
We mentioned before that the recipes to make each target are indented. What we didn't mention is that this indent must be made from a single hit of the “tab” key, and nothing else. Not four spaces, not eight spaces, not spaces and a tab. This is dumb because the difference between these variations is essentially invisible, but that’s just how things are.
Most decent text editors are aware of this and make sure that the Makefiles
they generate have the right contents. But some programs won’t get this right.
Another thing to look out for is copying and pasting code from terminal
windows or webpages (such as this one!). Even if the original file uses tabs,
the conversion can sometimes change them to spaces and break the Makefile.
You can (and should) define variables in Makefiles. They look a bit
like variables in command-line shells, but you have to wrap them in
parentheses to get their values:
CC = gcc
MYCFLAGS = -g -O2 -Wall
mytool: mytool.c mytool.h mytable.h
	$(CC) $(MYCFLAGS) -o mytool mytool.c
As in many scripting-type languages, an undefined variable expands to nothing without causing a warning or an error.
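Here's a rough Python sketch of that substitution rule, including the silent treatment of undefined names. The function name expand is made up for this illustration, and it covers only the plain $(NAME) form; real make also supports other syntaxes and functions that we're ignoring here.

```python
import re


def expand(text, variables):
    """Replace each $(NAME) in `text` with its value from `variables`.

    Undefined names expand to the empty string, with no warning.
    (`expand` is a hypothetical helper covering only the $(NAME) form.)
    """
    return re.sub(r"\$\((\w+)\)", lambda m: variables.get(m.group(1), ""), text)


variables = {"CC": "gcc", "MYCFLAGS": "-g -O2 -Wall"}
print(expand("$(CC) $(MYCFLAGS) -o mytool mytool.c", variables))
# gcc -g -O2 -Wall -o mytool mytool.c
print(expand("$(CC) $(EXTRAFLAGS)-o mytool mytool.c", variables))
# gcc -o mytool mytool.c
```

The second call shows why typos in variable names can be hard to spot: the misspelled or missing variable simply vanishes from the command.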
You can override the values of variables by setting them as command-line
arguments to make:
$ make mytool MYCFLAGS="-g -O0 -DDEBUG=1"
gcc -g -O0 -DDEBUG=1 -o mytool mytool.c
$
However, you shouldn't put yourself in situations where you need to do this
with any regularity. One of the cardinal virtues of make is that it
generates its outputs repeatably, and if special variable values need to be
passed on the command line, you create many opportunities to lose that
repeatability.
Variables in make can also get their values from environment variables
exported by the program calling make (your shell, usually). Unlike the
command-line settings shown above, environment-variable settings function as
defaults, not overrides: they take effect if the variable doesn't get set
in the Makefile, but not if it does. If you use a bash shell, one way to
temporarily set an environment variable is like this:
$ VERBOSE=1 make foo
which amusingly means that the following command may be equivalent to the one
above, if the variable VERBOSE is not assigned in the Makefile:
$ make foo VERBOSE=1
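The precedence just described (command line first, then the Makefile, then the environment) can be summed up in a few lines of Python. This is a sketch of the default behavior only: the function name resolve is our own, and we're assuming that nothing changes the order, such as make's -e option or the override directive.

```python
def resolve(name, command_line, makefile, environment):
    """Look up a variable using the default precedence described in the text.

    Each argument after `name` is a dict of settings from that source.
    (`resolve` is a hypothetical helper; make's -e option and `override`
    directive, which alter this order, are deliberately ignored.)
    """
    for source in (command_line, makefile, environment):
        if name in source:
            return source[name]
    return ""  # undefined variables expand to nothing


makefile = {"CC": "gcc", "MYCFLAGS": "-g -O2 -Wall"}

# Command-line settings override the Makefile...
print(resolve("MYCFLAGS", {"MYCFLAGS": "-g -O0 -DDEBUG=1"}, makefile, {}))
# -g -O0 -DDEBUG=1

# ...but environment settings are only defaults.
print(resolve("CC", {}, makefile, {"CC": "clang"}))       # gcc
print(resolve("VERBOSE", {}, makefile, {"VERBOSE": "1"}))  # 1
print(resolve("VERBOSE", {"VERBOSE": "1"}, makefile, {}))  # 1
```

The last two calls are the two VERBOSE commands from the text: because the Makefile never assigns VERBOSE, both routes end up with the same value.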
All of the make recipes that we've shown so far have been very simple
shell-script one-liners. You can write more complicated recipes if you want,
subject to a few limitations.
# Make sure that cross-references are fully up-to-date!
paper.pdf: paper.tex table.tex
	pdflatex paper
	bibtex paper
	pdflatex paper >/dev/null
	pdflatex paper
# Can be overridden: make install prefix=/opt
prefix = /usr/local
install: myprogram doccompile
	mkdir -p $(prefix)/bin $(prefix)/share/myprogram
	cp -p myprogram $(prefix)/bin
	for f in datafiles/*.dat ; do \
	  doccompile $$f --destdir $(prefix)/share/myprogram ; \
	done
As we see, you can write multiple-line recipes. These recipes are passed off
to your shell so you can do anything that your shell supports: output
redirection, backgrounding, all sorts of crazy things. Of course, you should
keep things simple; generally speaking, most recipes in a well-written
Makefile just invoke a single command.
What's not apparent above is that each line is executed in its own separate
little shell. So, for instance, if you set a shell variable in the first line
of your recipe, its value won't carry over into the second line. The bottom
half of the example shows how to work around this: if you end a Makefile
line with a backslash, it is joined with the line that follows, and the
combination gets treated as a single command. Backslashes have
this effect in any part of a Makefile, not just the recipe lines.
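The one-shell-per-line behavior is easy to imitate from Python. The sketch below is our own illustration, not make's code: the helper run_recipe and the shell variable table_file are made-up names, and it assumes that a POSIX sh is available and that table_file isn't already set in your environment. The command strings are written as the shell would see them, so a single $ appears where a Makefile would need $$.

```python
import subprocess


def run_recipe(lines):
    """Run each line in a fresh `sh` process and collect what it prints.

    (`run_recipe` is a hypothetical helper that imitates how each recipe
    line gets its own shell; it assumes a POSIX `sh` is on the PATH.)
    """
    outputs = []
    for line in lines:
        done = subprocess.run(["sh", "-c", line], capture_output=True, text=True)
        outputs.append(done.stdout.strip())
    return outputs


# Two separate lines: the variable set by the first shell is gone in the second.
separate = run_recipe(["table_file=info.fits", 'echo "table is [$table_file]"'])
print(separate)  # ['', 'table is []']

# One joined line: both statements run in the same shell.
joined = run_recipe(['table_file=info.fits; echo "table is [$table_file]"'])
print(joined)    # ['table is [info.fits]']
```

The joined version corresponds to what you get by ending recipe lines with backslashes, as in the for loop of the install recipe.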
When make is executing recipes, it gives up with an error message if any of
the shell commands exit with an error code (that is, an exit code other than
zero). If the recipe is composed of several un-backslashed lines, as with
paper.pdf above, it will give up on the first error.
If a given line of a recipe is prefixed with a dash (-), make will ignore
any error codes, and it’ll continue on merrily even if the command yields an
error. This is generally useful when you're writing nontrivial recipe shell
scripts, since some plumbing-type utilities exit with nonzero codes in
non-error situations.
If a given line is prefixed with an at-sign (@), make won’t print the line
before executing it. This is also useful for nontrivial recipe scripts whose
commands would just clutter the screen without providing the user with any
useful information.
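To tie together the stop-on-error rule and the two prefixes, here's a small Python model. As before, this is a sketch under stated assumptions rather than a description of make's internals: parse_recipe_line and run_with_prefixes are invented names, and a POSIX sh is assumed to be available.

```python
import subprocess


def parse_recipe_line(line):
    """Split a recipe line into (command, ignore_errors, silent).

    A leading "-" means "ignore errors"; a leading "@" means "don't echo".
    (Hypothetical helper; only these two prefixes are modeled.)
    """
    ignore_errors = silent = False
    while line[:1] in ("-", "@"):
        if line[0] == "-":
            ignore_errors = True
        else:
            silent = True
        line = line[1:].lstrip()
    return line, ignore_errors, silent


def run_with_prefixes(lines):
    """Run lines in order; stop at the first failure that isn't ignored."""
    for raw in lines:
        command, ignore_errors, silent = parse_recipe_line(raw)
        if not silent:
            print(command)  # echo the command unless it started with "@"
        status = subprocess.run(["sh", "-c", command]).returncode
        if status != 0 and not ignore_errors:
            return status  # give up on the first real error
    return 0


print(parse_recipe_line("-rm -f *.o mytool"))  # ('rm -f *.o mytool', True, False)
print(parse_recipe_line("@echo done"))         # ('echo done', False, True)
```

With this model, run_with_prefixes(["@exit 3", "@exit 0"]) stops after the first line and returns 3, while run_with_prefixes(["-@exit 3", "@exit 0"]) carries on and returns 0.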
You often see special make targets that are not intended to produce actual
files. Instead, they just provide a convenient way to run little shell
scripts. For example, you might write:
clean:
	-rm -f *.o mytool
.PHONY: clean
Running make clean will simply run the command rm -f *.o mytool, which
presumably deletes some intermediate files that can be reconstructed if
needed. The .PHONY construct is a hint to make that tells it not to expect
a file named clean to get created; it's not necessary but can sometimes be
helpful and prevent strange behavior in unusual circumstances.
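One of those unusual circumstances is a file that happens to be named clean. Here's a tiny Python model of the decision for a target with no dependencies; the function name should_run is invented for this illustration, and the logic is a simplification of what make does.

```python
import os


def should_run(target, phony_targets):
    """Decide whether the recipe of a dependency-free target gets run.

    (`should_run` is a hypothetical helper; it models only targets that
    have no dependencies, like the clean example.)
    """
    if target in phony_targets:
        return True  # phony: never consult the filesystem
    # Otherwise the recipe runs only if no file with that name exists.
    return not os.path.exists(target)
```

If a file called clean ever appears in the directory, should_run("clean", set()) becomes False and the rm command is silently skipped, whereas should_run("clean", {"clean"}) is always True.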
Certain names for action targets are used nearly universally:
all: builds all of the primary targets that the Makefile can build, where
“primary” is defined subjectively. In other words, make all is generally the
“do what I want” command. The all target is often the first one so that its
functionality can be accessed even more easily, by just typing make.
clean: deletes the files that make builds, so that make clean ; make is the
“start from square one” command. You should be careful writing the clean rule
to make sure that it doesn't delete files that make didn't build!
install: installs the built files on the system. Don't hard-code in the
Makefile where to install files; at the least, make the installation
location a variable of some kind, so that people can easily override it!
It's also good form to provide a matching uninstall rule, or at least to
structure the install commands so that it's clear what's being put where.