Jan 29 2009
Simple Awk Tutorial

why awk?
awk is small, fast, and simple, unlike, say, perl. awk also has a
clean comprehensible C-like input language, unlike, say, perl. And
while it can’t do everything you can do in perl, it can do most things
that are actually text processing, and it’s much easier to work with.
what do you do?
In its simplest usage awk is meant for processing column-oriented text
data, such as tables, presented to it on standard input. The variables
$1, $2, and so forth are the contents of the first, second,
etc. column of the current input line. For example, to print the
second column of a file, you might use the following simple awk
script:
awk < file '{ print $2 }'
This means “on every line, print the second field”.
To print the second and third columns, you might use
awk < file '{ print $2, $3 }'
Input separator
By default awk splits input lines into fields based on whitespace,
that is, spaces and tabs. You can change this by using the -F option
to awk and supplying another character. For instance, to print the
home directories of all users on the system, you might do
awk < /etc/passwd -F: '{ print $6 }'
since the password file has fields delimited by colons and the home
directory is the 6th field.
Arithmetic
Awk is a weakly typed language; variables can be either strings or
numbers, depending on how they’re referenced. All numbers are
floating-point. So to implement the fahrenheit-to-celsius calculator,
you might write
awk '{ print ($1-32)*(5/9) }'
which will convert fahrenheit temperatures provided on standard input
to celsius until it gets an end-of-file.
The selection of operators is basically the same as in C, although
some of C’s wilder constructs do not work. String concatenation is
accomplished simply by writing two string expressions next to each
other. ‘+’ is always addition. Thus
echo 5 4 | awk '{ print $1 + $2 }'
prints 9, while
echo 5 4 | awk '{ print $1 $2 }'
prints 54. Note that
echo 5 4 | awk '{ print $1, $2 }'
prints “5 4″.
Variables
awk has some built-in variables that are automatically set; $1 and so
on are examples of these. The other builtin variables that are useful
for beginners are generally NF, which holds the number of fields in
the current input line ($NF gives the last field), and $0, which holds
the entire current input line.
You can make your own variables, with whatever names you like (except
for reserved words in the awk language) just by using them. You do not
have to declare variables. Variables that haven’t been explicitly set
to anything have the value “” as strings and 0 as numbers.
For example, the following code prints the average of all the numbers
on each line:
awk '{ tot=0; for (i=1; i<=NF; i++) tot += $i; print tot/NF; }'
Note the use of $i to retrieve the i’th variable, and the for loop,
which works like in C. The reason tot is explicitly
initialized at the beginning is that this code is run for every input
line, and when starting work on the second line, tot will have the
total value from the first line.
Blocks
It might seem silly to do that. Probably, you have only one set of
numbers to add up. Why not put each one on its own line? In order to
do this you need to be able to print the results when you’re done. The
way you do this is like this:
awk '{ tot += $1; n += 1; } END { print tot/n; }'
Note the use of two different block statements. The second one has
END in front of it; this means to run the block once after
all input has been processed. In fact, in general, you can put all
kinds of things in front of a block, and the block will only run if
they’re satisfied. That is, you can say
awk ' $1==0 { print $2 }'
which will print the second column for lines of input where the
first column is 0. You can also supply regular expressions to match
the whole line against:
awk ' /^test/ { print $2 }'
If you put no expression, the block is run on every line of input. If
multiple blocks have conditions that are true, they are
all run. There is no particularly clean way I know of
to get it to run exactly one of a bunch of possible blocks of code.
The block conditions BEGIN and END are special and
are run before processing any input, and after processing all input,
respectively.
Other language constructs
As hinted at above, awk supports loop and conditional statements like
in C, that is, for, while, do/while, if, and if/else.
printf
awk includes a printf statement that works essentially like C
printf. This can be used when you want to format output neatly or
combine things onto one line in more complex ways (print
implicitly adds a newline; printf doesn’t.)
Here’s how you strip the first column off:
awk '{ for (i=2; i<=NF; i++) printf "%s ", $i; printf "\n"; }'
Note the use of NF to iterate over all the fields and the use of
printf to place newlines explicitly.
Combining awk with other tools
The strength of shell scripting is the ability to combine lots of
tools together. Some things are most readily done with successive awk
processes pipelined together. awk is also often combined with the sed
utility, which does regular expression matching and substitution. You
can actually do most common sed operations in awk, but it’s usually
more convenient to use sed.
In combination with sed, sort, and some other useful shell utilities
like paste, awk is quite satisfactory for a lot of numerical data
processing, as well as the maintenance of simple databases kept in
column-tabular formats.
awk is also extremely useful in a certain style of makefile writing
for generating pieces of makefile to include.
Do not script in csh. Use sh, or if you must, ksh.
More stuff
This introduction intentionally covers only the absolute basics of
awk. awk can do lots of other useful and powerful things. The manual
page for GNU awk (gawk) is a reasonably good reference for looking
things up once you know the basics.
The only thing seriously lacking in awk that I’ve yet run into is that
there’s no way to tell awk not to do buffering on its input and
output, which makes it useless for assorted interactive or
network-oriented applications it would otherwise be ideally suited
for.
