Jan 24 2007
IBM tutorial for filtering text with Linux utilities
The Linux operating system is loaded with files: configuration files, text files, documentation files, log files, user files, and the list goes on and on. Quite often, those files contain information you need to access in order to find important data. Although you can easily dump the contents of most files to the screen with standard utilities such as cat, more, and others, there are utilities better suited for filtering and parsing out only those values that are relevant to you.
As you read this article, you can open your shell and try the examples of each utility.
Before you start, you should first understand what regular expressions are and how to use them.
In their simplest form, regular expressions are the search criteria used for locating text in a file. For example, to find all lines containing the word “admin”, you can search for “admin”. Thus, “admin” constitutes a regular expression. If you want not only to find “admin” but also to replace it with “root”, you can give the appropriate commands in a utility to substitute “root” for “admin”. In that case, “admin” is the regular expression and “root” is the replacement text.
These basic rules govern regular expressions:
- Any single character or series of characters can be used to match itself or themselves, as in the “admin” example above.
- The caret sign (^) signifies the beginning of a line; the dollar sign ($) signifies the end.
- To literally search for special characters such as the dollar sign, precede them with a backslash (\). For example, \$ searches for $ and not the end of a line.
- The period (.) represents any single character. For example, ad..n stands for five-character entries, the first two being “ad” and the last being “n”. The middle two characters can be anything, but there can be only two of them.
- Any time the regular expression is contained within slashes (for example, /re/), the search is forward through the file. When it is enclosed in question marks (for example, ?re?), the search is backward through the file.
- Square brackets ([]) signify multiple values, and a minus sign (-) indicates a range of values. For example, [0-9] is the same as [0123456789], and [a-z] is the equivalent of a search for any lowercase letter. If the first character of a list is a caret, it matches any character not in the list.
Table 1 illustrates how these matches work in practice.
| Example | Description |
|---|---|
| [abc] | Matches one of “a”, “b”, or “c” |
| [a-z] | Matches any one lowercase letter from “a” to “z” |
| [A-Z] | Matches any one uppercase letter from “A” to “Z” |
| [0-9] | Matches any one number from 0 to 9 |
| [^0-9] | Matches any character other than the numbers from 0 to 9 |
| [-0-9] | Matches any number from 0 to 9, or a dash (“-”) |
| [0-9-] | Matches any number from 0 to 9, or a dash (“-”) |
| [^-0-9] | Matches any character other than the numbers from 0 to 9 or a dash (“-”) |
| [a-zA-Z0-9] | Matches any alphabetic or numeric character |
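The rules above can be tried directly in a shell. The following is a minimal sketch, assuming a scratch file whose five lines are invented for illustration. It uses grep, which is covered in the next section, with the -c option to count the lines each expression matches.

```shell
# Scratch file with made-up lines; none of this data comes from the article.
demo=$(mktemp -d)
cat > "$demo/rules" <<'EOF'
admin
adxyn
adzzn!
cost: $5
room 101
EOF

grep -c 'ad..n' "$demo/rules"     # "ad", any two characters, then "n"
grep -c '^ad..n$' "$demo/rules"   # the same, but it must be the whole line
grep -c '\$' "$demo/rules"        # a literal dollar sign
grep -c '[0-9]' "$demo/rules"     # any digit
grep -c 'n[^a-z]' "$demo/rules"   # "n" followed by a non-lowercase character
```

With this sample file, the five counts are 3, 2, 1, 2, and 1.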
With this information under your belt, let’s look at the utilities.
The grep utility works by searching through each line of a file (or files) for the first occurrence of a given string. If that string is found, the line is printed; otherwise, the line is not printed. The following file, which I’ll name “memo,” illustrates grep’s usage and results.
To: All Employees
From: Human Resources
In order to better serve the needs of our mass market customers, ABC Publishing is integrating the groups selling to this channel for ABC General Reference and ABC Computer Publishing. This change will allow us to better coordinate our selling and marketing efforts, as well as simplify ABC’s relationships with these customers in the areas of customer service, co-op management, and credit and collection. Two national account managers, Ricky Ponting and Greeme Smith, have joined the sales team as a result of these changes.
To achieve this goal, we have also organized the new mass sales group into three distinct teams reporting to our current sales directors, Stephen Fleming and Boris Baker. I have outlined below the national account managers and their respective accounts in each of the teams. We have also hired two new national account managers and a new sales administrator to complete our account coverage. They include:
Sachin Tendulkar, who joins us from XYZ Consumer Electronics as a national account manager covering traditional mass merchants.
Brian Lara, who comes to us via PQR Company and will be responsible for managing our West Coast territory.
Shane Warne, who will become an account administrator for our warehouse clubs business and joins us from DEF division.
Effectively, we have seven new faces on board:
1. RICKY PONTING
2. GREEME SMITH
3. STEPHEN FLEMING
4. BORIS BAKER
5. SACHIN TENDULKAR
6. BRIAN LARA
7. SHANE WARNE
Please join me in welcoming each of our new team members.
As a simple example, to find the lines that have the word “welcoming”, the best approach would be to use the following command line:
If you look for the word “market”, the results are slightly different, as shown below.
Note that two matches are found: the requested “market”, and “marketing”. If the words “marketable” or “marketed” had occurred in the file, the utility would have displayed the lines containing those words as well.
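The two searches can be sketched as follows. This assumes a scratch file holding four lines abridged from the memo; the temporary directory and the line wrapping are illustrative choices, not part of the original listing.

```shell
# Scratch copy of a few memo lines (abridged and re-wrapped).
demo=$(mktemp -d)
cat > "$demo/memo" <<'EOF'
To: All Employees
In order to better serve the needs of our mass market customers,
This change will allow us to better coordinate our selling and marketing efforts,
Please join me in welcoming each of our new team members.
EOF

# Print every line that contains the string "welcoming".
grep welcoming "$demo/memo"

# "market" also matches inside "marketing", so two lines are printed.
grep market "$demo/memo"
```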
Wildcards and meta-characters can be used with grep, and I strongly recommend that you place them inside quotation marks so that the shell doesn’t interpret them before grep sees them.
To find all lines that contain a number, use the following:
To find all lines that contain “the”, use this:
As you might have noticed, the output contains the word “these”, along with exact matches of the word “the”.
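A minimal sketch of both searches, assuming a scratch file with a few abridged memo lines (the wrapping is chosen so that “the” and “these” fall on separate lines):

```shell
demo=$(mktemp -d)
cat > "$demo/memo" <<'EOF'
To: All Employees
Two national account managers have joined the sales team
as a result of these changes.
1. RICKY PONTING
2. GREEME SMITH
EOF

# Quoted bracket expression: lines containing at least one digit.
grep "[0-9]" "$demo/memo"

# Lines containing "the", including those where it is part of "these".
grep "the" "$demo/memo"
```

Each command prints two lines here: the numbered names for the first, and the two wrapped sentence lines for the second.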
The grep utility, like almost every other UNIX/Linux utility, is case-sensitive, which means that a completely different result comes from looking for “The” instead of “the”.
If you are seeking a particular word or phrase and don’t care about the case, there are two ways to proceed. The first is to look for both “The” and “the” by using square brackets, as shown below:
The second method is to use the -i option, which tells grep to ignore case.
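The difference between the approaches can be sketched with a three-line scratch file. The file contents are invented for illustration.

```shell
demo=$(mktemp -d)
cat > "$demo/case" <<'EOF'
The grep utility is case-sensitive.
Search the file for a string.
THE END.
EOF

grep -c "the" "$demo/case"      # lowercase only
grep -c "The" "$demo/case"      # capitalized only
grep -c "[tT]he" "$demo/case"   # either "the" or "The"
grep -ci "the" "$demo/case"     # any mix of case, including "THE"
```

The counts are 1, 1, 2, and 3. The bracket form covers only the two spellings you list, whereas -i also matches the all-capitals line.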
In addition to -i, there are several other command-line options to change grep’s output. The most relevant are the following:
- -c: Suppress normal output; instead, print a count of matching lines for each input file.
- -l: Suppress normal output; instead, print the name of each input file from which output would have normally been printed.
- -n: Prefix each line of output with the line number within its input file.
- -v: Invert the sense of matching; that is, select lines that don’t match the search criteria.
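As a rough illustration of these four options, the sketch below assumes two scratch files: a three-line excerpt standing in for the memo and a second file, named other here, that contains no match.

```shell
demo=$(mktemp -d)
printf '%s\n' 'To: All Employees' 'From: Human Resources' \
    'Please join me in welcoming our new team members.' > "$demo/memo"
printf '%s\n' 'Nothing relevant here.' > "$demo/other"

grep -c team "$demo/memo"                 # count of matching lines
grep -l team "$demo/memo" "$demo/other"   # names of the files that match
grep -n team "$demo/memo"                 # matching lines, prefixed by line number
grep -v team "$demo/memo"                 # lines that do not match
```

Only the third line of the memo excerpt contains “team”, so -c reports 1, -l names only the memo file, -n prefixes the line with 3:, and -v prints the first two lines.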
fgrep searches files for a string and prints all lines that contain that string. Unlike grep, fgrep searches for a literal string instead of searching for a pattern that matches an expression. The fgrep utility can be thought of as grep with a few differences:
- You can search for more than one object at a time.
- The fgrep utility is often faster than grep, because it only has to compare fixed strings.
- You can’t use fgrep to search for regular expressions with patterns.
Suppose you want to pull uppercase names from your earlier memo file. In order to find “STEPHEN” and “BRIAN”, you would have to issue two separate grep commands, as shown below:
You can accomplish the same task with just one fgrep command:
Note that a newline is required between entries. Without the newline, the search would look for “STEPHEN BRIAN” on each line. With the newline, it looks for a match to “STEPHEN” and a match to “BRIAN”.
Note also that quotation marks must be used around the targeted text. This is what differentiates the text from the filename (or filenames).
Instead of specifying search items on the command line, you can place them in a file and use the contents of that file to search other files. The -f option allows you to specify a master file containing search items for which you search frequently.
For example, imagine a file named “search_items” that contains two search items for which you intend to search:
The following command searches for “STEPHEN” and “BRIAN” in our earlier memo file:
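Both forms of the multi-string search can be sketched as follows. The sketch uses grep -F, the standard spelling of fgrep (some current grep releases print a warning when invoked as fgrep), and assumes a scratch file holding the numbered names from the memo.

```shell
demo=$(mktemp -d)
cat > "$demo/memo" <<'EOF'
1. RICKY PONTING
2. GREEME SMITH
3. STEPHEN FLEMING
4. BORIS BAKER
5. SACHIN TENDULKAR
6. BRIAN LARA
EOF

# Two fixed strings in one quoted argument, separated by a newline.
grep -F 'STEPHEN
BRIAN' "$demo/memo"

# The same two strings kept in a file and passed with -f.
printf '%s\n' STEPHEN BRIAN > "$demo/search_items"
grep -F -f "$demo/search_items" "$demo/memo"
```

Each command prints the lines for STEPHEN FLEMING and BRIAN LARA.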
egrep is a more powerful version of grep that allows you to search for more than one object at a time. Objects being searched for are separated by newlines (as with fgrep) or by the pipe symbol (|).
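A minimal sketch of the two separators, written with grep -E (the standard spelling of egrep) and assuming a scratch file that holds three of the memo’s numbered names:

```shell
demo=$(mktemp -d)
printf '%s\n' '3. STEPHEN FLEMING' '4. BORIS BAKER' '6. BRIAN LARA' > "$demo/memo"

# Alternatives separated by a newline inside the quoted pattern.
grep -E 'STEPHEN
BRIAN' "$demo/memo"

# Alternatives separated by the pipe symbol.
grep -E 'STEPHEN|BRIAN' "$demo/memo"
```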
The two commands above do the same job.
Besides the capacity to search for multiple objects, egrep offers the ability to search for repetitions and groups:
- ? looks for zero repetitions or one repetition of the character that precedes the question mark.
- + looks for one or more repetitions of the character that precedes the plus sign.
- ( ) signifies a group.
For example, imagine that you can’t remember whether Brian’s surname is “Lara” or “Laras”.
This search produces matches to both “LARA” and “LARAS”. The following search is a bit different:
It matches “STEPHEN”, “STEPHENN”, “STEPHENNN”, and so on.
If you are looking for a word plus one of its possible derivatives, include the distinguishing characters of the derivative in parentheses.
This finds a match for both “electrons” and “electronics”.
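The three operators can be tried together. This is a sketch using grep -E, assuming a scratch file in which the LARAS and STEPHENN lines and the three lowercase words are invented test data.

```shell
demo=$(mktemp -d)
cat > "$demo/words" <<'EOF'
6. BRIAN LARA
8. BRIAN LARAS
3. STEPHEN FLEMING
9. STEPHENN FLEMING
electrons
electronics
electric
EOF

grep -E 'LARAS?' "$demo/words"           # "LARA" with an optional "S"
grep -E 'STEPHEN+' "$demo/words"         # "STEPHE" followed by one or more "N"
grep -E 'electron(ic)?s' "$demo/words"   # optional group "ic" before the final "s"
```

Each command prints two lines; “electric” is not matched by the last one.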
To summarize:
- A regular expression followed by + matches one or more occurrences of the regular expression.
- A regular expression followed by ? matches zero or one occurrence of the regular expression.
- Regular expressions separated by | or by a newline match strings that are matched by any of the expressions.
- A regular expression can be enclosed in parentheses ( ) for grouping.
- The command-line parameters you can use include -c, -f, -i, -l, -n, and -v.
The grep utilities: A real-world example
The grep family of utilities can be used with any system file in text format to find a match in a line. For example, to find the entries in the /etc/passwd file for a user named “root”, use the following:
Because it looks for a match anywhere in a line, grep finds entries for both “root” and “operator”. If you want to find only the entry with the username “root”, you can modify the command as follows:
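A sketch of both searches, run against a made-up passwd-style file instead of the real /etc/passwd so that the result is predictable. The three entries, and the choice of anchoring with a caret and a trailing colon, are assumptions for illustration.

```shell
demo=$(mktemp -d)
cat > "$demo/passwd" <<'EOF'
root:x:0:0:root:/root:/bin/bash
daemon:x:2:2:daemon:/sbin:/sbin/nologin
operator:x:11:0:operator:/root:/sbin/nologin
EOF

# Matches "root" anywhere, including operator's /root home directory.
grep root "$demo/passwd"

# Anchored to the start of the line; the colon ends the username field.
grep '^root:' "$demo/passwd"
```

The first command prints the root and operator entries; the second prints only the root entry.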
With the cut utility, you can separate columns that could constitute data fields in a file. The default delimiter is the tab, and the -f option is used to specify the desired field.
For example, imagine a text file named “sample” with three columns that look like this:
Now, apply the following command:
This will return:
If you change your command like so:
It will return the opposite:
Several command-line options are available with this command. Besides -f, you should be familiar with these two:
- -c: Allows you to specify characters instead of fields.
- -d: Allows you to specify a delimiter other than the tab.
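The three options can be compared side by side. The following sketch assumes a two-line, tab-separated scratch file; its contents are invented and are not the article’s original sample file.

```shell
demo=$(mktemp -d)
printf 'one\tA\tred\ntwo\tB\tgreen\n' > "$demo/sample"

cut -f1 "$demo/sample"           # first tab-separated field
cut -f2,3 "$demo/sample"         # second and third fields
cut -c1-2 "$demo/sample"         # first two characters of each line
printf 'a:b:c\n' | cut -d: -f2   # colon-delimited input, second field
```

When several fields are selected, cut joins them in the output with the same delimiter it used to split the input.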
The ls -l command shows the permissions, number of links, owner, group, size, date, and filenames of all the files in a directory — all separated by white space. If you’re not interested in most of the fields and want to see only the file owner, you can use the following command:
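This command displays only the file owner (the fifth field), ignoring every other field. Because cut treats every single space as a delimiter, the field number that holds the owner depends on how ls pads its columns on your system.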
If you know the exact position at which the file owner begins, you can use the -c option to display its first character. Assuming that the owner begins at the 16th character, the following command returns the 16th character, the first letter of the owner’s name.
If you further assume that most users will use eight characters or fewer for their name, you can use the following command:
It will return those entries in the name field.
Now, assume that the name of the file begins with the 55th character, but that it is impossible to determine how many characters it takes up after that because some filenames are considerably longer than others. A solution is to begin with the 55th character and not specify an ending character (meaning that the entire rest of the line is taken), as shown below:
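The character-position forms can be sketched on a synthetic fixed-width listing. The column positions used here (owner in characters 14 to 21, filename from character 41) are assumptions of this sample only; they differ from the 16 and 55 in the text, and real ls -l output varies from system to system. The row helper is hypothetical.

```shell
demo=$(mktemp -d)

# Hypothetical helper that pads each column to a fixed width.
row() { printf '%s %s %-8s %-5s %4s %s %s\n' "$@"; }
{
  row -rw-r--r-- 1 alice staff 120 'Jan 24' memo
  row -rw-r--r-- 1 bob staff 7 'Jan 25' notes.txt
} > "$demo/listing"

cut -c14 "$demo/listing"      # a single character: first letter of the owner
cut -c14-21 "$demo/listing"   # a range: the eight-character owner column
cut -c41- "$demo/listing"     # open-ended: from character 41 to end of line
```

The open-ended range is what copes with filenames of different lengths.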
Now, consider another scenario. To obtain a list of all the users on the system, you can pull only the first field from the /etc/passwd file used in an earlier example:
To collect the usernames and their corresponding home directories, you can pull the first and sixth fields:
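A sketch of both commands, assuming a two-entry passwd-style scratch file in place of the real /etc/passwd:

```shell
demo=$(mktemp -d)
cat > "$demo/passwd" <<'EOF'
root:x:0:0:root:/root:/bin/bash
daemon:x:2:2:daemon:/sbin:/sbin/nologin
EOF

# Usernames only: colon delimiter, first field.
cut -d: -f1 "$demo/passwd"

# Usernames and home directories: first and sixth fields.
cut -d: -f1,6 "$demo/passwd"
```

To run the same commands on a live system, replace the scratch path with /etc/passwd.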
The paste utility combines fields from files. It takes one line from one source and combines it with another line from another source.
For example, imagine that the content of a file named “fileone” is:
In addition, you have “filetwo” with this content:
The following command combines the contents of these files, as shown below:
If there were more lines in fileone than filetwo, then the pasting would continue, with blank entries following the tab.
The tab character is the default delimiter, but you can change it to anything else with the -d option.
You can also use the -s option to output all of fileone on one line, followed by a newline and then all of filetwo.
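The three behaviors of paste can be sketched as follows. The contents of fileone and filetwo below are stand-ins chosen for illustration.

```shell
demo=$(mktemp -d)
printf '%s\n' IBM Global Services > "$demo/fileone"
printf '%s\n' 'United States' 'United Kingdom' India > "$demo/filetwo"

# Line-by-line merge, joined by a tab.
paste "$demo/fileone" "$demo/filetwo"

# The same merge, joined by a comma.
paste -d, "$demo/fileone" "$demo/filetwo"

# Serial mode: each file becomes one tab-separated output line.
paste -s "$demo/fileone" "$demo/filetwo"
```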
join is a greatly enhanced version of paste. join works only if the files being joined share a common field.
For example, consider the two files you were using with the paste command previously. Here’s what happens when you try to combine them with join:
Note that there is nothing to display. The join utility must find a common field between the files in question, and by default it expects that common field to be the first.
To see how this works, try adding some new content. Assume that fileone now contains these entries:
And filetwo now contains the following:
Now, try that command again:
The commonality of the first field was identified, and the matching entries were combined. Whereas paste blindly took from each file to create the output, join combines only lines that match, and the match must be exact. For example, imagine you added a line to filetwo:
Now, your command will produce this output:
As soon as the files no longer match, no further operations can be carried out. join reads both files in order and expects them to be sorted on the join field; a line from the first file is paired with a line from the second only when their join fields are identical. If matches are found, they are incorporated into the output; otherwise they are not.
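A minimal sketch of the default behavior, assuming two small scratch files that are already sorted on their first field. The IDs and names are invented.

```shell
demo=$(mktemp -d)
printf '%s\n' '101 alice' '102 bob' '103 carol' > "$demo/fileone"
printf '%s\n' '101 sales' '103 support' > "$demo/filetwo"

# Join on the first field of each file; 102 has no partner and is dropped.
join "$demo/fileone" "$demo/filetwo"
```

The output contains one line for 101 and one for 103, each carrying the remaining fields of both files.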
By default, join looks only at the first fields for matches and outputs all columns, but you can change this behavior. The -1 option lets you specify which field to use as the matching field in fileone, and the -2 option lets you specify which field to use as the matching field in filetwo.
For example, to match the second field of fileone to the third field of filetwo, use the following syntax:
The -o option specifies output in the format {file.field}. Thus, to print the second field of fileone and the third field of filetwo on matching lines, the syntax is:
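The field-selection options can be sketched like this. The sample files are invented, and the -o list picks the first field of each file rather than the fields named in the text, purely to keep the output short.

```shell
demo=$(mktemp -d)
printf '%s\n' 'alice 101' 'bob 102' > "$demo/fileone"
printf '%s\n' 'sales east 101' 'support west 102' > "$demo/filetwo"

# Match field 2 of fileone against field 3 of filetwo.
join -1 2 -2 3 "$demo/fileone" "$demo/filetwo"

# Same match, but print only fileone's field 1 and filetwo's field 1.
join -1 2 -2 3 -o 1.1,2.1 "$demo/fileone" "$demo/filetwo"
```

Without -o, join prints the join field first, followed by the other fields of each file in turn.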
The most obvious way you could use join in the real world would be to pull the username and the corresponding home directory from the /etc/passwd file and the group name from the /etc/group file. Groups appear in the fourth field in numerical format in the /etc/passwd file. Similarly, they appear in the third field in the /etc/group file.
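One way this could look is sketched below. It assumes scratch copies of the two files with invented entries, and it sorts each file on its group ID field first, because join expects sorted input. Setting LC_ALL=C keeps sort and join in agreement about ordering.

```shell
demo=$(mktemp -d)
cat > "$demo/passwd" <<'EOF'
root:x:0:0:root:/root:/bin/bash
daemon:x:2:2:daemon:/sbin:/sbin/nologin
alice:x:1000:100:Alice:/home/alice:/bin/bash
EOF
cat > "$demo/group" <<'EOF'
root:x:0:
daemon:x:2:
users:x:100:alice
EOF

# Sort each file on the field that holds the numeric group ID.
LC_ALL=C sort -t: -k4,4 "$demo/passwd" > "$demo/passwd.sorted"
LC_ALL=C sort -t: -k3,3 "$demo/group" > "$demo/group.sorted"

# Print username, home directory, and group name for each matching pair.
LC_ALL=C join -t: -1 4 -2 3 -o 1.1,1.6,2.1 \
    "$demo/passwd.sorted" "$demo/group.sorted"
```

The sort is lexical rather than numeric, so group 100 sorts before group 2; that is the order join expects.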
awk is one of the most powerful utilities in Linux. It is actually a programming language in and of itself and can be used with complex logic statements, as well as to simply pull out snippets of text. We’ll skip the details, but let’s quickly review the syntax and then walk through some real-world examples.
An awk command consists of a pattern and an action composed of one or more statements, as shown in the syntax below:
Notice that:
- awk tests every record in the specified file (or files) for a pattern match. If a match is found, the specified action is performed.
- awk can act as a filter in a pipeline or take input from the keyboard (standard input) if no file or files are specified.
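The pattern-and-action form and the pipeline form can be sketched as follows, assuming a scratch file with three of the memo’s numbered names:

```shell
demo=$(mktemp -d)
printf '%s\n' '3. STEPHEN FLEMING' '4. BORIS BAKER' '6. BRIAN LARA' > "$demo/memo"

# Pattern plus action: print each record that matches /BRIAN/.
awk '/BRIAN/ { print $0 }' "$demo/memo"

# No file named, so awk filters standard input; with no pattern,
# the action runs on every record.
printf 'alpha beta\ngamma delta\n' | awk '{ print $2 }'
```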
One useful action is to print the data! Here is how to reference fields in a record.
- $0: The entire record
- $1: The first field in the record
- $2: The second field in the record
You can also pull multiple fields in a record, separating each field by a comma.
For example, to pull the sixth field from the /etc/passwd file, the command is:
Note that -F sets the input field separator, which awk stores in the predefined variable FS. By default it is white space; for /etc/passwd, it needs to be a colon.
To pull the first and sixth fields from the /etc/passwd file, the command is:
To print the file using a dash in place of the colon delimiter between fields, the command is:
To print the file using a dash between fields, and print only the first and sixth fields in reverse order, the command is:
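The four awk commands described in this section might look like the sketch below. It assumes a two-entry passwd-style scratch file, and it sets the output separator through the predefined variable OFS in a BEGIN block, which is one of several ways to do it.

```shell
demo=$(mktemp -d)
cat > "$demo/passwd" <<'EOF'
root:x:0:0:root:/root:/bin/bash
daemon:x:2:2:daemon:/sbin:/sbin/nologin
EOF

# Sixth field only.
awk -F: '{ print $6 }' "$demo/passwd"

# First and sixth fields; the comma inserts the output separator (a space).
awk -F: '{ print $1, $6 }' "$demo/passwd"

# Whole record with dashes; assigning $1 makes awk rebuild $0 using OFS.
awk -F: 'BEGIN { OFS = "-" } { $1 = $1; print }' "$demo/passwd"

# Sixth and first fields, in that order, separated by a dash.
awk -F: 'BEGIN { OFS = "-" } { print $6, $1 }' "$demo/passwd"
```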
The head utility prints the first part of each file (10 lines by default). It reads from standard input if no files are given, or if given a filename of -.
For example, if you want to extract the first two lines from your memo file, the command is:
You can specify the number of bytes to display using the -c option. For example, if you want to read the first two bytes from the memo file, the command is:
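Both forms of head can be sketched on a short scratch excerpt of the memo. The -n spelling of the line count is used here.

```shell
demo=$(mktemp -d)
printf '%s\n' 'To: All Employees' 'From: Human Resources' \
    'Please join me in welcoming each of our new team members.' > "$demo/memo"

head -n 2 "$demo/memo"   # first two lines
head -c 2 "$demo/memo"   # first two bytes: "To"
```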
The tail utility prints the last part of each file (10 lines by default). It reads from standard input if no files are given, or if given a filename of -.
For example, if you want to extract the last two lines from your earlier memo, the command is:
You can specify the number of bytes to display using the -c option. For example, if you want to read the last five bytes from the memo file, the command is:
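The equivalent tail commands, sketched on a scratch file holding the last three lines of the memo:

```shell
demo=$(mktemp -d)
printf '%s\n' '6. BRIAN LARA' '7. SHANE WARNE' \
    'Please join me in welcoming each of our new team members.' > "$demo/memo"

tail -n 2 "$demo/memo"   # last two lines
tail -c 5 "$demo/memo"   # last five bytes: "ers." plus the final newline
```

Note that the trailing newline counts as one of the five bytes, so only four visible characters are printed.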
Now you know how to use various utilities to extract data from standard Linux files. Once extracted, that data can be manipulated for viewing and printing or directed into other files or databases. Knowing how to use just this handful of tools can help you spend less time on mundane tasks and become a more efficient administrator.
Harsha Adiga works in the IBM Software Group in Bangalore, India, and is heavily involved in various Linux and open source communities and working groups. His primary focus areas include Linux and UNIX internals, porting, compilers, and code optimization. He has been involved in software development and testing on Linux and UNIX platforms for more than six years.