Mar 07 2006

Guide to Troubleshooting Linux Problems

Tag:tepezcuintle @ 18:52

This is a guide to basic, and not so basic, troubleshooting and
debugging on Linux systems. Goals include description
and usage of common tools, how to find information, and
what to do with that information. Emphasis will be on
software issues, but might include hardware as well.


Something isn’t working, what do you do?

It happens to the best of us. At some point, our
perfectly configured and optimized systems decide to
make our lives interesting. A process suddenly will
not start up, the database is returning bogus results,
or the installation of that new app is being more
troublesome than anyone would like. There is a
problem.

So what’s the next step? Troubleshooting a
problem is always an interesting challenge. Generally,
it’s not obvious what the problem is, so the first
task is to start figuring out what is going on. The
more information you can gather about what is happening,
the easier it is to work out why the thing you
expect to happen isn’t.


Tools

Efficient debugging and troubleshooting is often a matter
of knowing the right tools for the job, and how to use
them.


strace

Strace is one of the most powerful tools available
for troubleshooting. It allows you to see what an application
is doing, to some degree.

`strace` displays all the system calls that an application
is making, what arguments it passes to them, and what the
return code is. A system call is generally something that
requires the kernel to do something. This generally means
I/O of all sorts, process management, shared memory and
IPC usage, memory allocation, and network usage.


examples

The simplest example of using strace is as follows:

      strace ls -al

This starts the strace process, which then starts `ls -al`
and shows every system call. For `ls -al` this is mostly
I/O related calls. You can see it calling stat() on files, opening
config files, opening the libs it is linked against, allocating
memory, and calling write() to output the contents to the screen.


What files are trying to be opened

A common troubleshooting technique is to see
what files an app is reading. You might want to make
sure it’s reading the proper config file, or looking
at the correct cache, etc. `strace` by default shows
all file I/O operations.

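But to make it a bit easier, you can filter
strace output. To see just file open() calls (on newer
systems, where the C library opens files with openat(),
use `-e trace=open,openat` instead):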

      strace -eopen ls -al


What is this thing doing to the network?

To see all network related system calls (name
resolution, opening sockets, writing/reading to sockets, etc)

   strace -e trace=network curl --head http://www.redhat.com


Rudimentary profiling

One thing that strace can be used for that is useful for
debugging performance problems is some simple profiling.

    strace -c  ls -la

Invoking strace with ‘-c’ will cause a cumulative report of
system call usage to be printed. This includes the approximate
amount of time spent in each call, and how many times each
system call is made.

This can sometimes help pinpoint performance issues, especially
if an app is doing something like repeatedly opening/closing
the same files.

    strace -tt ls -al

The -tt option causes strace to prefix each line with the time of
day at which the call was made, with microsecond resolution.

    strace -r ls -al

The -r option causes strace to print out the time elapsed since the
start of the previous system call. This can be used to spot where a process
is spending large amounts of time in user space, or especially
slow syscalls.


Following forks and attaching to running processes

Often it is difficult or impossible to run a command under
strace (an Apache httpd, for instance). In this case, it’s
possible to attach to an already running process.

    strace -p 12345

where 12345 is the PID of the process. This is very handy
for trying to determine why a process has stalled. Many
times a process might be blocking while waiting for I/O.
With strace -p, this is easy to detect.

Lots of processes start other processes. It is often desirable
to see a trace of all of them.

   strace -f /etc/init.d/httpd start

will strace not just the bash process that runs the script, but
any helper utilities executed by the script, and httpd itself.

Since strace output is often a handy way to help a developer
solve a problem, it’s useful to be able to write it to a file.
The easiest way to do this is with the -o option.

   strace -o /tmp/strace.out program

Being somewhat familiar with the common syscalls for Linux
is helpful in understanding strace output. But most of the common
ones are simple enough to figure out from context.

A line in strace output is essentially the system call name,
the arguments to the call in parentheses (sometimes truncated…), and then
the return status. A return status for error is typically -1, but varies
sometimes. For more information about the return status of a particular
system call, see `man 2 syscallname`. Usually the return status will
be documented in the “RETURN VALUE” section.

Another thing to note about strace is that it often shows “errno”
status. If you’re not familiar with Unix system programming, errno is a global
variable that gets set to a specific value when a system call fails. The
value depends on what went wrong.
More info on this can be found in `man errno`. Typically, strace will
show the brief description for any errno values it gets, e.g.:

    open("/foo/bar", O_RDONLY) = -1 ENOENT (No such file or directory)

   strace -s X

The -s option tells strace to show the first X characters of strings.
The default is 32 characters, which sometimes is not enough. Raising it
shows more of each string argument.
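
Once a trace has been saved with -o, the failed calls are usually the interesting part. A minimal sketch for pulling them out, assuming the trace is in the usual one-call-per-line format shown above. The function names `failed_calls` and `errno_summary` are made up for this example:

```shell
# Show only the calls that failed, i.e. returned -1 with an errno name.
# Usage: failed_calls /tmp/strace.out
failed_calls() {
    grep -E ' = -1 E[A-Z0-9]+' "$1"
}

# Count failures per errno name, least frequent first.
# Usage: errno_summary /tmp/strace.out
errno_summary() {
    grep -oE ' = -1 E[A-Z0-9]+' "$1" | awk '{ print $3 }' | sort | uniq -c | sort -n
}
```

A pile of ENOENT results is often normal (library and config search paths), so it tends to be more useful to look at the failure closest to where the app gives up.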


More info

Overview of linux system calls (http://www.quepublishing.com/articles/article.asp?p=23618&rl=1)

PDF version of Advanced Linux Programming (http://www.advancedlinuxprogramming.com/alp-folder)


ltrace

ltrace is very similar to strace, except ltrace focuses on
tracing library calls.

For apps that use a lot of libs, this can be a very powerful
debugging tool. However, because most modern apps use libraries
very heavily, the output from ltrace can sometimes be
painfully verbose.

There is a distinction between a system call
and a call to a library function. Sometimes the line between the
two is blurry, but the basic difference is that system calls
communicate with the kernel, while library calls just
run more userland code. System calls are usually required for
things like I/O, process control, memory management,
and other kernel things.

Library calls are, in bulk, generally calls to the standard
C library (glibc), but can of course be calls to any library,
for example Gtk, libjpeg, libnss, etc. Luckily most glibc functions
are well documented and have either man or info pages. Documentation
for other libraries varies greatly.

ltrace supports the -r, -tt, -p, and -c options the same
as strace. In addition it supports the -S option which
tells it to print out system calls as well as library
calls.

One of the more useful options is “-n 2”, which will
indent 2 spaces for each nested call. This can make the output
much easier to read.

Another useful option is the “-l” option, which
allows you to specify a specific library to trace, potentially
cutting down on the rather verbose output.


gdb

`gdb` is the GNU debugger. A debugger is typically used by developers
to debug applications in development. It allows for a very detailed
examination of exactly what a program is doing.

That said, gdb isn’t as useful as strace/ltrace for troubleshooting/sysadmin
types of issues, but occasionally it comes in handy.

For troubleshooting, it’s useful for determining what application created a
core file (`file core` will typically show you this information too).
But gdb can also show you “where” the program crashed. Once you determine
the name of the app that caused the failure, you can start gdb with:

    gdb filename corefile

then at the prompt type

    where

The unfortunate thing is that all the binaries are typically
stripped of debugging symbols to make them smaller, so this often returns
less than useful information. However, starting in Red Hat Enterprise Linux
3, and included in Fedora, there are “debuginfo” packages. These
packages include all the debugging symbols. You can install them the
same as any other rpm, so `rpm`, `up2date`, and `yum` all work.

The only difficult part about debuginfo rpms is figuring out
which ones you need. Generally, you want the debuginfo package
for the src rpm of the package that’s crashing.

    rpm -qif /path/to/app

will tell you the info for the binary package the app is part of.
Part of that info includes the src.rpm. Just use the package name
of the src rpm plus “-debuginfo”.

  FIXME: insert info about debug packages for other systems


top

`top` is a simple text based system monitoring tool. It packs
a lot of information onto the screen, which can be helpful when
troubleshooting problems, particularly performance related
problems.

The top of the “top” output includes a basic summary of the system.
The top line is current time, uptime since the last reboot, users
logged in, and the load average. The load average values here are the
load for the last 1, 5, and 15 minutes. On a single-CPU machine, a load of 1.0
is considered 100% utilization, so loads over one typically mean stuff
is having to wait. There is a lot of leeway and approximation in
these load values however.
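
Whether a given load is a problem depends on how many CPUs there are to share it. A rough sketch that reads the same three load values from /proc/loadavg and compares the 1 minute figure with the CPU count reported by `nproc` (a coreutils tool, assumed here to be installed):

```shell
# The first three fields of /proc/loadavg are the 1, 5 and 15 minute load averages.
read -r load1 load5 load15 _ < /proc/loadavg
cpus=$(nproc)
echo "load (1/5/15 min): $load1 $load5 $load15 on $cpus CPU(s)"

# awk does the floating point comparison that the shell cannot.
if awk -v l="$load1" -v c="$cpus" 'BEGIN { exit !(l > c) }'; then
    echo "1 minute load is above the CPU count; processes are probably waiting"
else
    echo "1 minute load is within the CPU count"
fi
```

This is only a rule of thumb. On Linux the load average also counts processes blocked on disk I/O, so a high load does not always mean the CPUs are busy.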

The memory line shows the total physical ram available
on the system, how much of it is used, how much is free, and how
much is shared, along with the amount of ram in buffers. These
buffers are typically file system caching, but can be other things.
On a system with a significant uptime, expect the buffer value to
take up all free physical ram not in use by a process.
The swap line is similar.

Each process entry in the list contains several
fields by default. The most interesting are RSS, %CPU, and
time. RSS shows the amount of physical ram the process is consuming.
%CPU shows the percentage of the available processor time a process
is taking, and time shows the total amount of processor time the process
has had. A processor intensive program can easily have more “time”
in just a few seconds than a long running low cpu process.


Sorting the output

  • M  : sorts the output by memory usage. Pretty handy for figuring out which version of openoffice.org to kill.
  • P  : sorts the processes by the percentage of cpu time they are using.
  • T  : sorts by cumulative cpu time used
  • A  : sorts by age of the process, newest process first


Command line options

The only really useful command line option is:

  • b [batch mode] writes the standard top output to stdout. Useful for a quick “system monitoring hack”.

For example:

        top -b -d 360 >>  foo.output

to get a snapshot of the system appended to foo.output every six minutes.


ps

`ps` can be thought of as a one shot `top`. But it’s a bit
more flexible in its output than top.

As far as `ps` command line options go, it can get pretty
hairy. The Linux version of `ps` inherits ideas from both
the BSD version and the SYSV version. So be warned.

The `ps` man page does a pretty good job of explaining
this, so look there for more examples.

One thing to be aware of is that ps behaves differently
depending on whether a - is prepended to the options:

 ps ef

and

 ps -ef

are two very different things.


examples

   ps aux

shows all the processes on the system in a “user” oriented
format. In this case meaning the username of the owner
of the process is shown in the first column.

   ps auxww

the “w” option, when used twice, allows the output to be
of unlimited width. For apps started with lots of commandline
options, this will allow you to see all the options.

   ps auxf

The “f” option, for “forest”, tries to present the list
of processes in a tree format. This is a quick and easy
way to see which processes are child processes of what.

   ps -eo pid,%cpu,vsz,args,wchan

This is an interesting example of the -eo option. This
allows you to customize the output of `ps`. In this
case, the interesting bit is the “wchan” field, which
attempts to show the kernel function in which the process
is sleeping at the moment `ps` checks.

For things like Apache httpds, this can be useful
to get an idea of what all the processes are doing
at one time. See the strace section for more on
understanding system calls.


sysstat/sar

Sysstat works in two steps: a daemon process that
collects information, and a “monitoring” tool.

The daemon is called “sysstat”, and the monitoring
tool is called `sar`.

To start it, start the sysstat daemon:

                /etc/init.d/sysstat start

To see a list of `sar` options, just try `sar --help`.


examples

Things to note: there are lots of command line options. The trailing number is the interval, meaning the time in seconds between updates; an optional count of reports may follow it.

    sar 3

Will run the default sar invocation every three seconds.

For a complete summary, try:

  sar -A

This generates a very large pile of info ;->

To get a good idea of disk i/o activity:

  sar -b 3

For something like a heavily used web server, you may want to get a good idea how many processes are being created per second:

  sar -c 2

Kind of surprising to see how many processes can be created.

There’s also some degree of hardware monitoring built in. Monitoring how many times an IRQ is triggered can also provide good hints at what’s causing system performance problems.

Show the total number of system interrupts

  sar -I SUM 3

Watch the standard IDE controller IRQ every two seconds.

  sar -I 14 2

Network monitoring is in here too.
Show the number of packets sent/received, number of bytes transferred, etc.:

  sar -n DEV 2

Show stats on network errors.

   sar -n EDEV 2

Memory usage can be monitored with something like:

   sar -r 2

This is similar to the output from `free`, except more easily
parsed.

For SMP machines, you can monitor per CPU stats with:

    sar -U 0

where 0 is the first processor. The keyword ALL will show all of them. (Newer versions of sysstat use -P instead of -U.)

A really useful one on web servers and other configurations that use lots and lots of open files is:

   sar -v

This will show the number of used file handles, the percentage of
available file handles in use, and the same for inodes.

To show the number of context switches (a good indication
of how much time the system is spending switching between processes):

    sar -w 2


vmstat

This util is part of the procps package, and can provide lots of useful
information when diagnosing performance problems.

Here’s a sample vmstat output on a lightly used desktop:



   procs                      memory    swap          io     system  cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy id
 1  0  0   5416   2200   1856  34612   0   1     2     1  140   194   2   1 97

And here’s some sample output on a heavily used server:



   procs                      memory    swap          io     system  cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy id
16  0  0   2360 264400  96672   9400   0   0     0     1   53    24   3   1 96
24  0  0   2360 257284  96672   9400   0   0     0     6 3063 17713  64  36 0
15  0  0   2360 250024  96672   9400   0   0     0     3 3039 16811  66  34 0

The most interesting number here is the first one, the number of
processes on the run queue. This value shows how many processes are
ready to be executed, but cannot be run at the moment because other processes
are using the CPU. For lightly loaded systems, this is almost never above 1-3,
and numbers consistently higher than 10 indicate the machine is getting
pounded.

Other interesting values include the “system” numbers for in and cs. The
in value is the number of interrupts per second a system is getting. A system
doing a lot of network or disk I/O will have high values here, as interrupts
are generated every time something is read or written to the disk or network.

The cs value is the number of context switches per second. A context
switch is when the kernel has to stop running one process, save its
state, and switch in another. It’s actually _way_ more complicated
than that, but that’s the basic idea. Lots of context switches are bad, since it
takes some fairly large number of cycles to perform a context switch, so if
you are doing lots of them, you are spending all your time changing jobs and
not actually doing any work. I think we can all understand that concept.
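
The cs figure can be cross-checked by hand. The kernel keeps a running total of context switches since boot on the `ctxt` line of /proc/stat, so sampling it twice, one second apart, gives roughly the per-second number that vmstat reports. A small sketch (the helper name `ctxt_total` is just for this example):

```shell
# Total context switches since boot, from the "ctxt" line of /proc/stat.
ctxt_total() {
    awk '$1 == "ctxt" { print $2 }' /proc/stat
}

before=$(ctxt_total)
sleep 1
after=$(ctxt_total)
echo "context switches in the last second: $((after - before))"
```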


tcpdump/ethereal

Ethereal will display all the connections it traced during the capture. There are a couple ways to look for bandwidth hogs.

The “Statistics” menu has a couple useful options. The “Protocol Hierarchy” shows what % of packets in the trace is from each type of protocol. In the case of a bandwidth hog, at least what protocol is the culprit should be easy to spot here.

The “Conversations” screen is also helpful for looking for bandwidth hogs. Since you can sort the “conversations” by number of packets, the culprit is likely to hop to the top. This isn’t always the case, as it could easily be many small connections killing the bandwidth, not one big heavy connection.

As far as tcpdump goes, the best way to spot bandwidth hogs is just to start it up. Since it pretty much dumps all traffic to the screen in a text format, just keep your eyes peeled for what seems to be coming up a lot.

tcpdump can also be used to see if a given service may be unresponsive because your packets are simply not reaching the remote machine. Since tcpdump is a commandline tool, you’ll very probably need to add filters - especially when you’re firing tcpdump up on a remote machine, where you’re logged in via ssh. Otherwise you’ll get lots of packet dumps of ssh packets that are telling you of packets dumped that belong to ssh telling you of packets dumped…

  tcpdump -l -i eth0 port 25

This will dump all packets aimed at, or originating from, tcp or udp port 25. The ‘-l’ is to do line buffering, so we’ll actually see each packet as it crosses the wire.

The tcpdump filter syntax is actually surprisingly powerful - take 5 minutes and grab your nearest manpage on tcpdump if you need a better filter.


netstat

Netstat is an app for getting general information about the status of network connections to the machine.

   netstat

will just show all the current open sockets on the machine. This will include unix domain sockets, tcp sockets, udp sockets, etc.

One of the more useful options is:

    netstat -pa

The `-p` option tells it to try to determine what program has the socket open, which is often very useful info. For example, someone nmaps their system and wants to know what is using port 666. Running netstat -pa will show you it’s satand running on that tcp port.

One of the most twisted, but useful invocations is:

    netstat -t -a -n | cut -c 68- | sort | uniq -c | sort -n

This will show you a sorted list of how many sockets are in each connection state. For example:



   9  LISTEN
   21  ESTABLISHED
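
The `cut -c 68-` in that pipeline depends on the exact column widths netstat prints, which can differ between versions. A variant that picks the state by field instead, as a sketch: it assumes the usual layout of `netstat -tan` output (two header lines, then the state in the sixth column), and `count_states` is a name made up here.

```shell
# Read netstat output on stdin and count sockets per TCP state.
# Usage: netstat -tan | count_states
count_states() {
    awk 'NR > 2 { print $6 }' | sort | uniq -c | sort -n
}
```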

In short, netstat is useful for finding out:

  • what process is doing what and to whom over the network
  • number of sockets open
  • socket status

A quick and dirty way to see what daemons are running and accepting connections on your machine is

  netstat -tlpn

for tcp services and

  netstat -ulpn

for udp services. Unix domain sockets are usually more abundant than either of these two and a lot less interesting.


lsof

/usr/sbin/lsof is a utility that lists all the open files on the system. There’s a ton of options, almost none of which you ever need.

This is mostly useful for seeing what processes have what file open. Useful in cases where you need to umount a partition, or perhaps you have deleted some file, but its space wasn’t reclaimed and you want to know why.

The EXAMPLES section of the lsof man page includes many useful examples.


fuser

Displays PIDs of processes that are using some filesystem object. Kind of like the little brother of lsof.

The most frequent use will be the ‘-m’ option when you’re trying to umount a filesystem and you get an error message telling you that the specified device is busy:


 turing:/home/sr# umount /usr
 umount: /usr: device is busy

 turing:/home/sr# fuser -m /usr
 /usr:  2522e  2604e  2646e  2652e  2662e  2761e  2764e  2775e  2798e  2804e  2843e  2846e  2849e
        2988m  3018m  3740e  3741e  3759m  3772m  3773e  3776e  3779e  3782e  3785e  3789e  3791e
        3793e  3828e  3832e  3833m  3869e  3893e  3907e  3908m  3915e  3999e  4124m  4125m  4127m

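This list shows all the PIDs that are working within the ‘/usr’ mountpoint and keeping you from umounting the filesystem. The letter after each PID shows how the process is using it: ‘e’ means it is running an executable from there, ‘m’ means it has a file mapped or a shared library loaded from there. Check who’s what with ‘ps ax | grep [PID]’ and kill them gently.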


ldd

ldd prints out shared library dependencies.

For apps that are reporting missing libraries, this is a handy utility. It shows all the libraries a given app or library is linked to.

For most cases, what you will be looking for is missing libs. In the ldd output, they will show something like:

    libpng.so.3 => (file not found)

In this case, you need to figure out why libpng.so.3 isn’t being found. It might not be in the standard lib paths, or perhaps not in a path in /etc/ld.so.conf. Or you need to run `ldconfig` again to update the ld cache.

ldd can also be useful when tracking down cases where an app is finding a library, but it is finding the wrong library. This can happen if there are two libraries with the same name installed on a system in different paths.

Since the `ldd` output includes the full path to the lib, you can see if anything is pointing at a wrong path. One thing to look for when scanning for this is one lib that’s in a different lib path than the rest. If an app uses libs from /usr/lib, except for one from /usr/local/lib, there’s a good chance that’s your culprit.
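
Checking a binary for missing libraries is easy to script. A minimal sketch; `missing_libs` is a hypothetical helper name, and it simply matches on the words “not found” in the ldd output, which covers the message shown above:

```shell
# Print the unresolved libraries of a binary.
# The exit status is non-zero when nothing is reported missing.
# Usage: missing_libs /usr/bin/someapp
missing_libs() {
    ldd "$1" 2>/dev/null | grep 'not found'
}
```

Run over a directory of binaries in a loop, this is a quick way to see which apps a library upgrade or removal has broken.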


nm

`nm` is a utility that lists the symbols in a binary or library, including the undefined symbols an application expects a library to provide. It can
be used in combination with `ldd` and `ldconfig` to try to track down library linking problems.

A common case would be a binary that was compiled against a newer version of a library, and so needs symbols that the version of the library it is dynamically linking against does not have.


file

`file` is a simple utility that tries to figure out what kind of file a given file is. It does this by magic(5).

Where this sometimes comes in handy for troubleshooting is looking for rogue files. A .jpg file that is actually a .html file. A tar.gz that’s not actually compressed. Cases like those can sometimes cause apps to behave very strangely.


netcat

to see network stuff

   FIXME


md5sum

`md5sum` is a utility that calculates a checksum of a file. For troubleshooting purposes, you can assume every unique file will have a unique checksum.


verifying files

Since a md5sum will change if any part of a file changes, it can also be used to verify that a file has not changed. Systems like `tripwire` use this to detect if a file has been compromised in a security breach.

This can be used to see if a file has been modified or corrupted if you know what the md5sum is supposed to be.

You can also use it to see if two files are exactly the same or not. A common case is to check to see if a config file has been modified, or if it’s different from what’s in a config management system.


verifying iso’s

Linux distributions are often distributed as cd images or isos. An md5sum of these images is usually provided to verify the integrity of the downloaded isos. A few bits missing here and there is enough to make an install a painful experience.

Check the location the isos were downloaded for a text file containing the md5sums of the isos. It will
typically look something like:



 2af10158545bc24477381e80412ff209  bar.iso
 9761d6ce118a1230bc48b0a59f7b5639  foo.iso

You can run the md5sum directly on the isos:

    bash# md5sum bar.iso
    2af10158545bc24477381e80412ff209  bar.iso

Or you can often use the md5sums text file as input to `md5sum` to tell it what to check and
to verify. If the above example was in a file called “iso.md5s”:

   md5sum -c iso.md5s

That command will check both isos, and check the computed checksum against what the file lists as correct.
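
To see what a pass and a failure look like without downloading anything, here is a small self-contained sketch. The two “isos” are just tiny stand-in files created in a temporary directory, and the file names are the placeholder ones from the example above:

```shell
workdir=$(mktemp -d)
printf 'first image\n'  > "$workdir/foo.iso"
printf 'second image\n' > "$workdir/bar.iso"

# Record the checksums while the files are known to be good.
(cd "$workdir" && md5sum foo.iso bar.iso > iso.md5s)

# Verify: each file is reported as OK.
(cd "$workdir" && md5sum -c iso.md5s)

# Simulate a corrupted download by appending one byte, then verify again.
printf 'x' >> "$workdir/bar.iso"
if (cd "$workdir" && md5sum -c iso.md5s); then
    echo "all files verified"
else
    echo "at least one file failed verification"
fi
```

The exit status of `md5sum -c` is non-zero when any file fails, which makes it easy to use in scripts.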

md5sum is also a good way to verify a burned cd. Something like:

   find /mnt/cdrom -type f -exec md5sum {} \;

will run a md5sum on all the files on the cd mounted at /mnt/cdrom. Since md5sum checks every bit (literally..) of a file, if the cd is bad, there’s a good chance this will find it. If the above command causes any errors about the media, chances are the cd is bad. Better to find it now than later.

For recent Red Hat and Fedora based distros, the installer includes an option to perform a mediacheck. This
is essentially the same as verifying the iso md5sum by hand. If you have already done that, you can skip the
media check.


diff

diff compares two files, and shows the difference between the two.

For troubleshooting, this is most often used on config files. If one version of a config file works, but another does not, a `diff` of the two files can often be enlightening. Since it can be very easy to miss a small difference in a file, being able to see just the differences is useful.
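
As an illustration of how small the difference can be, the sketch below builds two throwaway config files (the file names and contents are invented for the example) that differ by a single character, then compares them with `diff -u`, the unified format most people find easiest to read:

```shell
workdir=$(mktemp -d)

# A made-up "known good" config file.
cat > "$workdir/httpd.conf.good" <<'EOF'
Listen 80
ServerName www.example.com
MaxClients 150
EOF

# The "bad" copy has a letter O where the digit 0 should be.
sed 's/^Listen 80$/Listen 8O/' "$workdir/httpd.conf.good" > "$workdir/httpd.conf.bad"

# diff exits 0 when the files match and 1 when they differ.
if diff -u "$workdir/httpd.conf.good" "$workdir/httpd.conf.bad"; then
    echo "files are identical"
else
    echo "files differ"
fi
```

Lines starting with “-” come from the first file and lines starting with “+” from the second, so the one changed line shows up as a pair.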

For debugging during development, diff (especially the versions built into revision control systems like cvs) is invaluable. Seeing exactly what changed between two versions is a great help.

For example, if foo-2.2 is acting weird, where foo-2.1 worked fine, it’s not uncommon to `diff` the source code between the two versions to see if anything related to your problem changed.


find

For troubleshooting a system that seems to have suddenly stopped working, find has a few tricks up its sleeve.

When a system stops working suddenly, the first question to ask is “what changed?”.

   find / -mtime -1

That command will recursively list all the
files from / that have changed in the last
day.

To list all the files in /usr/lib that
changed in the last 30 minutes:

   find /usr/lib -mmin -30

Similar options exist for ctime and atime.
To show all the files in /tmp that have been accessed in the last 30 minutes.

    find /tmp -amin -30

The -atime/-amin options are useful when trying to determine if an app is actually reading the files it is supposed to. If you run the app, then run that command where the files are, and nothing has been accessed, something is wrong.

If no “+” or “-” is given for the time value, find will match only exactly that time. This is handy in several cases. You can determine what files were modified/created at the same time.

A good example of this is cleaning up from a tar package that was unpacked into the wrong directory. Since all the files will have the same access time, you can use find and -exec to delete them all.

`find` can also find files with particular permissions set.
To find all world writable files from / down:

    find / -perm -0002
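
A note on the mode argument: with a leading “-”, find matches files that have at least the given permission bits set, and “world writable” is the other-write bit, 0002. A mode of -0777 would only match files that are readable, writable, and executable by everyone. A small sketch wrapping this up (`world_writable` is a name made up for the example):

```shell
# List regular files under a directory that anyone on the system can write to.
# Usage: world_writable /var/www
world_writable() {
    find "${1:-.}" -type f -perm -0002
}
```

Restricting the search to regular files with -type f keeps symlinks, which always show up as mode 777, out of the results.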

To find all files in /tmp owned by “alikins”:

    find /tmp -user alikins 


Using find in combo with grep to find markers (errors, filename, etc)

When troubleshooting, there are plenty of cases where you want to find all instances of a filename, or a hostname, etc.

To recursively grep a large number of files, you can use find and its -exec option.
This will grep for “foo” in all files down from the current working directory, printing the name of each file that matches:

    find . -type f -exec grep -H foo {} \;

Note that in many cases, you can also use `grep -r` to do this as well.
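
When there are a lot of files, it helps to know which file and which line each match came from. A sketch of a small wrapper, with `find_marker` being a hypothetical name; it relies on the -H and -n options of grep, and uses the `{} +` form of -exec so grep is started once per batch of files instead of once per file:

```shell
# Print file name and line number for every match of a marker below a directory.
# $1 is the text to look for, $2 the starting directory (defaults to .).
# Usage: find_marker "connection refused" /var/log
find_marker() {
    find "${2:-.}" -type f -exec grep -Hn -- "$1" {} +
}
```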


ls/stat

While `ls` is one of the first commands Linux users learn, do not overlook its utility in troubleshooting. It’s the easiest way to see what’s on the file system.


finding sym links and hard links

A simple `ls -al` will show the contents of a directory. But it will also indicate what files are symlinks.

Normally, having a file being a symlink is fine, but some apps, especially security sensitive apps, are picky about what can and can not be a symlink.

The other thing to look for is dangling, or broken symlinks. Some apps don’t expect to get handed a symlink that doesn’t go anywhere.
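
Spotting dangling symlinks by eye in `ls` output gets tedious in large trees. A sketch using find instead, assuming GNU find, whose `-xtype l` test is true for symlinks whose target does not exist (`broken_links` is a name made up for the example):

```shell
# List symlinks under a directory that point at something that is not there.
# Usage: broken_links /etc
broken_links() {
    find "${1:-.}" -xtype l
}
```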


file system usage

Some simple `ls` invocations useful for troubleshooting.

Show a detailed view of all files, sorted by the last modified time. Quick, easy way to see if an app is modifying files:

    ls -lart

Show a detailed view of all files in the current directory, sorted by file size. Quick, easy way to see what files are consuming all of your precious disk space.

    ls -larS

Show some basic info about what type of file each file is. Maybe that directory the app is looking for is a file or vice versa?

    ls -F


df

Running out of disk space causes so many apps to fail in weird and bizarre ways that a quick `df -h` is a pretty good troubleshooting starting point.

Use is easy: look for any volume that is 100% full. Or, in the case of apps that might be writing lots of data at once, reasonably close to being filled.

It’s pretty common to spend more time than anyone would like to admit debugging a problem, only to suddenly hear someone yell “Damnit! It’s out of disk space!”.

A quick check avoids that problem.

In addition to running out of space, it’s possible to run out of file system inodes. A `df -h` will not show this, but a `df -i` will show the number of inodes available on each filesystem.

Being out of inodes can cause even more obscure failures than being out of space, so something to keep in mind.
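
Both checks can be rolled into one quick function. This is a sketch, not a monitoring system: `nearly_full` is a made-up name, the default threshold of 90 percent is arbitrary, and it assumes a df that accepts -P together with -i (GNU df does). Mount points with spaces in their names would confuse the field splitting.

```shell
# Print mount points whose space or inode usage is at or above a threshold (percent).
# Usage: nearly_full        (default threshold 90)
#        nearly_full 75
nearly_full() {
    limit=${1:-90}
    # -P forces one line per filesystem; column 5 is the usage percentage.
    df -P  | awk -v limit="$limit" 'NR > 1 && $5 + 0 >= limit { print "space:  " $6, $5 }'
    df -Pi | awk -v limit="$limit" 'NR > 1 && $5 + 0 >= limit { print "inodes: " $6, $5 }'
}
```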


watch

`watch` is a command that executes another command, displays its output, then repeats. This can be
used to repeatedly watch a reporting process. There is also a “-d” option that will highlight any output that
changes between each invocation of the command.

For an example, to watch disk space usage:

     watch -d df

Another example, is to simply watch a `ls -al` output, to look for any tmp files that get created:

     watch -d "ls -al"

Note that the above example only runs `ls -al` every two seconds, so will not catch all file creations.

“watch” is often used in combo with commands like “ls”, “df”, “netstat”, “ps”.


ipcs/ipcrm

  • anything that uses shm/ipc
  • oracle/apache/etc

A lot of apps make fairly extensive usage of SysV shm and ipc (oracle, apache, gimp, etc). Most of the
time, on current Linux systems, this works pretty well. But it’s occasionally useful to be able to
take a look at what shm is being used, and how it’s being used. `ipcs` is the tool for that.
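
A few basic invocations to start with, as a sketch. These are the common options of the util-linux `ipcs`; other implementations may differ, and SHMID in the last comment is a placeholder, not a real id:

```shell
ipcs -m        # shared memory segments: key, shmid, owner, perms, bytes, nattch
ipcs -s        # semaphore arrays
ipcs -q        # message queues
ipcs -m -p     # PIDs of the creator and last operator of each shared memory segment

# To remove a stale segment left behind by a crashed app
# (SHMID is a placeholder for an id taken from the ipcs -m list):
#   ipcrm -m SHMID
```

A segment with an nattch of 0 whose owning app is no longer running is the usual sign of a leftover that can be cleaned up.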

FIXME: Need some real examples here.


Searching the web for error messages

A pretty common and often very effective approach to tracking down the cause of errors or problems is searching the web. Using search engines like Google or Yahoo can find documentation, FAQs, web forum posts, mailing list archives, Usenet posts, and other useful resources.

Start by searching for the entire error message exactly as it appears. Be sure to put the message in double quotes. If it’s a common problem, there’s a good chance you will get some hits. Anything that looks like a FAQ is a good start; mailing list archives can also be a good source. Just be sure to check the archive indexes for other messages in the discussion.

If you are using a commercial distribution, you could also consider looking up their knowledgebase. Both Red Hat and SUSE have useful documents for assisting in troubleshooting in their knowledgebases.


source code

For most Linux distros, you have the source code,
so it can often be useful to search through
the code for error messages, filenames, or
other markers related to the problem. In many
cases, you don’t really need to be able to
understand the programming language to
get some useful info.

Kernel drivers are a great example of this, since they
often include very detailed info about which hardware
is supported, what’s likely to break, etc.

On rpm based systems, to install the source code, you want to install
the src rpm. To see which src rpm corresponds to
a given file or utility, use the command:

    rpm -qif /path/to/file

There will be a Source RPM field with the name
of the source rpm. If you have the src cd,
you can install it from there.

Alternatively, you can use up2date or other package tools to
get the source rpm.

   up2date --get-source packagename

will download the src rpm to /var/spool/up2date.

To install a src rpm, just issue the command:

   rpm -Uvh /path/to/package.src.rpm

The source will get installed in /usr/src/redhat/SOURCES, with a spec file in /usr/src/redhat/SPECS,
on Red Hat linux systems. Other distros will be similar.

The easiest way to extract the source is:

   rpmbuild -bp /usr/src/redhat/SPECS/package.spec

where package.spec is the spec file for the src package installed.

`find` and `grep` are good tools for searching for the markers of interest.
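
As a minimal sketch of that kind of search, the two helpers below look for a literal error message in an unpacked source tree. The function names are made up for this example; the options used (`grep -r`, `-n`, `-F`, `-H` and `find -exec ... {} +`) are the GNU ones found on linux systems.

```shell
# Print file:line:text for every literal occurrence of a message in a tree.
#   $1 = directory to search, $2 = exact text to look for
find_marker() {
    grep -rnF -- "$2" "$1"
}

# Same search, but only in files whose name matches a pattern such as '*.c'.
#   $1 = directory, $2 = exact text, $3 = file name pattern
find_marker_in() {
    find "$1" -type f -name "$3" -exec grep -HnF -- "$2" {} +
}
```

For example, `find_marker_in /usr/src/redhat/BUILD "cannot open config file" '*.c'` would show which C files contain that message, and on which line.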


strings

`strings` is a utility that will search through a file and try to find text strings. For troubleshooting, it is sometimes handy to be able to look for strings in an executable.

For example, you can run strings on a binary to see if it has any hard coded paths to helper utilities. If those utils are in the wrong place, that app may fail.

Searching for error messages can help as well, especially in cases where you are not sure which binary is reporting an error message.

In some ways, it’s a bit like grepping through source code for error messages, but a bit easier. Unfortunately, it also provides far less info.
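
Both uses can be sketched in a few lines of shell. The function names below are made up for this example, and the path test is deliberately crude: it only catches strings that begin with a slash.

```shell
# Print strings embedded in a binary that look like absolute paths.
# -a scans the whole file rather than only selected sections.
embedded_paths() {
    strings -a "$1" | grep '^/'
}

# Print the names of the files that contain a given message, for example
# to find out which binary emits an error.
#   $1 = exact text, remaining arguments = files to check
which_binary_says() {
    msg=$1
    shift
    for f in "$@"; do
        if strings -a "$f" | grep -qF -- "$msg"; then
            echo "$f"
        fi
    done
}
```

Something like `which_binary_says "some error text" /usr/bin/*` narrows down the source of a message when several programs are involved.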


syslog/log levels

Syslog is a daemon that mutated out of a sendmail debugging aid into a logfile-catchall for unix. A lot of applications send their log output to syslog, but they have to send it there explicitly, otherwise syslog won’t know about the stuff that is to be logged. To keep logs apart, facilities (nothing more than “categories” in syslog-speak) and severities were introduced during the evolution of syslog. The actual filtering of what gets output where is defined in syslog’s /etc/syslog.conf file (see syslog.conf(5)).


Getting stuff into Syslog

Syslog generally can receive messages in three ways:
- Through the syslog() function most languages provide (after an appropriate call to openlog())
- Through Unix domain sockets such as /dev/log, which is enabled by default on most distributions
- Via UDP on port 514, if syslogd is running with the -r option (this can be a security hole since there is no authentication or authorization implemented in the standard syslog protocol! Caveat emptor!)


Defining Filters in /etc/syslog.conf

The basic syntax of this file is easy, but it contains some subtleties that can lead you into a long, slow suffering (when using synchronous writes on logfiles, more about that below).

  1. Empty lines and everything behind a hash mark (#) are ignored
  2. Rules are of the format
 <What>    <Goes Where>


What

Your basic “what” is a specification of a facility and a severity delimited by a period:

  <facility>.<severity>

This will catch all messages belonging to the given facility, that have the given severity and higher.

If you only want to catch messages belonging to exactly the given severity, prefix the severity with an equals sign (=):

 <facility>.=<severity>

You can also negate the severity selection by prepending an exclamation sign (!):

 <facility>.!<severity>

This will select all messages belonging to the given facility that have a severity lower than the one specified. Note that this also weeds out messages belonging to the given severity - which is logical, since the opposite of >= is <.

Of course this can make things tedious if you have to list all combinations of the roughly 20 facilities and 8 severities by hand. So there are shortcuts, such as specifying an asterisk (*) as a catchall:

  <facility>.*  -> All messages belonging to <facility>
  *.<severity>  -> All messages of the given <severity>
  *.*                 -> All messages

And then, you can specify lists of “whats”, where the “whats” are delimited by semicolons (;):

 <facility>.<severity>;<facility>.<severity>

Or, if you want to process the same severities of different facilities, list the facilities using commas (,) first:

 <facility>,<facility>.<severity>

To make matters interesting, there is also a special severity called “none”, which means that no messages of the given facility are logged by this rule:

  *.*;<facility>.none  -> Log all messages except those of the given facility


Goes Where

After the “What” part with all its twists and turns, the “Where” is actually pretty simple:

 </path/to/logfile>

will log everything to the given logfile.


Asynchronously

By default this logging is done with synchronous writes, which means that after each log entry, syslog waits for the operating system kernel to acknowledge that the data has indeed been written to the disk before writing its next entry. This can slow down your system 10-fold for services with extensive logging (especially mail servers!). This factor has been verified in the wild. The price of asynchronous writes is that the last few entries can be lost if the machine crashes, so switch only if you can afford that.

To indicate to syslog that you want log entries to be written asynchronously, prepend a minus (-) to the logfile:

 -</path/to/logfile>

This is basically what is needed in 99% of everyday life.

Note that you can specify the same “What” multiple times pointing to different “wheres” for each. The messages will then be logged to all “wheres” given.


Goes Where Again?

Ok, the “Where” part isn’t actually all that simple. You have a couple of other choices:
- Remote machines:

 @<hostname>

- Named Pipes:

 |<path to fifo>

- Terminals by giving their device files as logfiles
- Specific users (if they’re logged on) using write:

 <user>,<user>

- All users logged on:

 *

But again, these are things you don’t need that often, and if you do, you’d better read up on them in the manpage first!
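
As a rough illustration of how the “What” and “Where” parts combine, here is a sketch of a small /etc/syslog.conf. The log file names, the fifo path and the host name loghost.example.com are made up for the example, so adapt them to your system. Some older syslogd versions insist on tabs rather than spaces between the two columns.

```conf
# Everything at severity info or higher, except mail and authpriv messages
*.info;mail.none;authpriv.none      /var/log/messages

# All mail messages, written asynchronously
mail.*                              -/var/log/maillog

# Kernel messages of exactly severity debug
kern.=debug                         /var/log/kern-debug

# cron and daemon messages below severity warning
cron,daemon.!warning                /var/log/chatter

# auth messages go to a remote machine and to a named pipe
auth,authpriv.*                     @loghost.example.com
auth,authpriv.*                     |/var/run/auth.fifo

# Emergencies go to root's terminal and to everybody logged on
*.emerg                             root
*.emerg                             *
```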


RPM

RPM is the RPM Package Manager. It’s a package tool widely used on many linux distributions including
Red Hat Enterprise Linux, Fedora, Novell, and Mandriva.

It’s commonly used to install, update, and remove software and to keep track of software dependencies.
The rpm database also includes a lot of information about the software currently installed, and can
often be a useful resource for troubleshooting.


using rpm to verify package contents

`rpm` includes support for verifying a file’s contents, size, permissions, mtime, user and group ownership,
and selinux context.

If you are having problems with “gaim”, you might want to verify all of the files are correct:

    rpm -V gaim

That command will check the ondisk files against the expected values in the rpm database. If a file has
been modified, it will show up. See the rpm man page for info on decoding the string of chars at the
left of the output. But, if the file shows up at all, rpm thinks something has changed about the file,
which is often enough to know, without decoding the info.
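
If you do want the flags decoded, a small helper can do it. The sketch below assumes the classic flag letters documented in the rpm man page (S, M, 5, D, L, U, G, T) and ignores anything else, including the extra letters newer rpm versions may print. The function name decode_rpm_verify is made up for this example.

```shell
# Translate the flag string at the start of an `rpm -V` line into words.
decode_rpm_verify() {
    flags=$1
    if [ "$flags" = "missing" ]; then
        echo "file is missing"
        return 0
    fi
    out=""
    while [ -n "$flags" ]; do
        rest=${flags#?}         # everything after the first character
        c=${flags%"$rest"}      # the first character itself
        flags=$rest
        case $c in
            S) w="size" ;;
            M) w="mode" ;;
            5) w="md5sum" ;;
            D) w="device" ;;
            L) w="symlink" ;;
            U) w="user" ;;
            G) w="group" ;;
            T) w="mtime" ;;
            *) continue ;;      # "." means the test passed, "?" that it could not run
        esac
        out="${out:+$out }$w"
    done
    echo "${out:-unchanged}"
}
```

To annotate a whole report, feed it line by line: `rpm -V gaim | while read -r flags rest; do echo "$rest: $(decode_rpm_verify "$flags")"; done`.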

Also useful is verifying all packages. Sometimes you just don’t know what’s changed, and want an overview of
files that have been edited or modified from the original:

   rpm -Va

That will take a while on most systems, but it will print out a list of all files rpm thinks have been modified.
Note that on most systems, there will be some files that show up and are perfectly acceptable.


using rpm to find config files

A good place to start looking when some software is having trouble is the config files. To see a list of
the config files for package “up2date”:

   rpm -q --configfiles up2date


using rpm to see what was installed recently

One of the bits of information rpm keeps track of is when a package was installed. Since most software problems
originate when software is updated or installed, this is useful information.

To get a list of all rpm packages installed, in order, with the installation date:

   rpm -qa --last

The list is sorted so that the newest packages are at the top of the list. If you are troubleshooting a
problem that recently appeared, that’s a good place to start looking for clues.


resetting file permissions and user/group info

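If you think a file from a package has had its perms or ownership changed, an easy way to
resolve this is:

   rpm --setperms packagename
   rpm --setugids packagename

The first resets the permissions, the second the user and group ownership.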


ksymoops

To quote from the ksymoops web page, “The Linux kernel produces error messages that contain machine specific numbers which are meaningless for debugging. ‘ksymoops’ reads machine specific files and the error log and does its best to converts the code to instructions and map addresses to kernel symbols. ”

See the man page for more info.


Kernel core dumps (netdump, diskdump and crash)

Netdump and diskdump are utilities for logging kernel crashes. netdump sends the core image of
the kernel (vmcore) across the network to a netdump server, while diskdump writes it to disk. The image
can be examined with the `crash` utility.

Netdump and diskdump create a vmcore. A vmcore is a representation of what was in the system’s memory
when the crash occurred. The crash utility is a modified version of gdb, which automates the basic steps required to analyse a vmcore.

At the time of writing, netdump does not work on Itanium or Itanium II architecture systems.


Netdump

Netdump requires another machine to capture the crash from the crashing kernel. The machine that is crashing is the netdump client, and the machine that is going to host the core is the netdump server. One netdump server can capture crashes from multiple clients.


Server Side Configuration

The netdump server does not have to use any specific network card. It must be on the same subnet as the netdump client, and there must be a clear path (no Network Address Translation or packet modification) between the server and the client.

Start the service with the command

FIXME: the command that belongs here is missing.

The server saves the vmcore file in /var/crash. Ensure that there is enough space there for the server to store the file.
There is a formula that can be applied to find the amount of space necessary.

(RAM on client + SWAP on client * 1.1).

The next step is to set the password for the netdump user. Do so with the command

FIXME: the command that belongs here is missing.

Be sure to set a strong password for this user.


Client Side Configuration

Only a limited set of hardware is currently able to send a vmcore to a netdump server. You can find out which
network drivers have support in your current kernel by issuing the command.

FIXME: the command that belongs here is missing.

The machine running the netdump client will have to be using one of these drivers (3c59x, e100, e1000, eepro100, pcnet32, tg3, tlan, and tulip) configured as eth0 to be able to send the vmcore to a remote server.

The next step is to modify /etc/sysconfig/netdump and add the following line:

    NETDUMPADDR=10.0.0.222

The address 10.0.0.222 should be the IP address of the machine configured as the netdump server.

The netdump client now needs to connect to the netdump server and create a set of public/private ssh keys. Enter the command:

FIXME: the command that belongs here is missing.

You should be prompted for a password. Enter the password of the netdump user on the netdump server.

The next step is to start the netdump service. Run the command

FIXME: the command that belongs here is missing.

And then... crash your machine.

The vmcore should end up in /var/crash on the netdump server.


Diskdump

- Yep, I’m working on this.

Supported cards

Cross platform:

   * aic7xxx
   * aic79xx
   * megaraid2
   * mpt fusion
   * sata_promise
   * sym53c8xx

i386, AMD64 and Intel® EM64T only:

   * ata_piix

i386 only:

   * dpt_i2o


Crash

The crash package can be used to investigate live systems as well as kernel core
dumps created by netdump or diskdump.


xev

`xev` is a small utility that can be used to debug problems with X11. In particular, odd behaviour related to keypresses and mouse clicks can be tracked down.

`xev` just shows all the X11 “events” that get passed to it. For example, if a keypress doesn’t seem to be doing what it is supposed to do, you can check to see if X11 is actually getting the keypress, and if so, what value it is getting. For basic troubleshooting, no knowledge of X11 is needed, but `xev` can present a ton of information that only the most diehard X11 hacker cares about.

Related, but more low-level, are the files in /proc/bus/input. If you’re having trouble getting an input device to be accepted by X, you can check if you’re giving the correct device file/protocols in xorg.conf/XF86Config by cross referencing your config file with the information in these proc files.


pmap

`pmap` is part of the “procps” suite of tools. It can be used to display the memory map of
a process. It is essentially a wrapper for reading from /proc/PID/maps.

It’s useful to be able to see what libraries and modules an app has loaded. `ldd` can show
the list of libraries an executable is linked against, but it doesn’t know anything about
dynamically loaded modules. A variety of large applications make significant use of dynamically
loaded modules, as do most scripting languages, so `pmap` can come in handy when trying
to diagnose issues that might be related to modules.
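
Since the information comes from /proc/PID/maps anyway, a short sketch can pull out just the list of files mapped into a process, which includes dynamically loaded modules. The function name mapped_files is made up for this example; it relies on the path being the sixth field of each line, which anonymous mappings lack.

```shell
# List the distinct files mapped into a process, given its maps file
# (normally /proc/PID/maps). Entries such as [stack] and anonymous
# mappings are skipped because they have no absolute path in field 6.
mapped_files() {
    awk '$6 ~ /^\// { print $6 }' "$1" | LC_ALL=C sort -u
}
```

For instance, `mapped_files /proc/$$/maps` shows what the current shell has mapped, and `pmap $$` gives the same picture with sizes.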


Scripting languages and shell programming

For more information, see Scripting Languages

Shell scripting and scripting languages are what make unix and linux
work. They are everywhere, so knowing how to track down problems with
scripts is a handy skill.



Logs

The key to troubleshooting is knowing what is going on. For core system services, there is a significant amount of logging turned on by default, especially for error cases. The trick is knowing where to look.

For more info, see Log Files


Environment settings


Allowing Core Files

“core” files are dumps of a process’s memory. When a program crashes it can leave behind a core file, which can be loaded in a debugger to help determine the cause of the crash.

By default, most linuxes turn off core file support by setting the maximum allowed core file size to 0.

In order to allow a segfaulting application to leave a core, you need to raise this limit. This is done via `ulimit`. To allow core files to be of an unlimited size, issue:

       ulimit -c unlimited

See the section on GDB for more information on what to do with core files.
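
The limit applies to the current shell and everything started from it, so it is worth checking before reproducing a crash. The sketch below uses two made-up helper names: one reports the current state, the other runs a single command with the soft limit raised, without changing the limit of your interactive shell.

```shell
# Report whether the current shell (and anything it starts) may write core files.
core_dumps_status() {
    if [ "$(ulimit -c)" = "0" ]; then
        echo "disabled"
    else
        echo "enabled (limit: $(ulimit -c))"
    fi
}

# Run one command with the soft core size limit raised. The subshell keeps
# the change from leaking into the calling shell.
run_with_cores() {
    (
        ulimit -S -c unlimited 2>/dev/null ||
            echo "could not raise the core file limit" >&2
        "$@"
    )
}
```

Usage would be along the lines of `run_with_cores ./some-app`. If the hard limit is 0 as well, raising the soft limit fails and the helper only prints a warning.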


LD_ASSUME_KERNEL

LD_ASSUME_KERNEL is an environment variable used by the dynamic
linker to decide which implementation of a library is used. For
most cases, the most important lib is the C library, or “libc” or
“glibc”.

The reason “glibc” is important is that it contains the thread implementation for a system.

The values you can set LD_ASSUME_KERNEL to equate to linux kernel versions. Since glibc and the kernel are tightly bound, it’s necessary for glibc to change its behaviour based on what kernel version is installed.

For properly written apps, there should be no reason to use
this setting. However, for some legacy apps that depend
on a particular thread implementation in glibc, LD_ASSUME_KERNEL
can be used to force the app to use an older implementation.

The primary values are:

  • LD_ASSUME_KERNEL=2.4.20 uses the NPTL thread library
  • LD_ASSUME_KERNEL=2.4.1 uses the implementation in /lib/i686 (newer LinuxThreads)
  • LD_ASSUME_KERNEL=2.2.5 or older uses the implementation in /lib (old LinuxThreads)

For an app that requires the old thread implementation, it can be launched as:

   LD_ASSUME_KERNEL=2.2.5 ./some-old-app

See http://people.redhat.com/drepper/assumekernel.html for more details.


glibc environment variables

There’s a wide variety of environment variables that glibc uses to alter its behaviour, many of which are useful for debugging or troubleshooting purposes.

A good reference on these variables is at http://www.scratchbox.org/documentation/general/tutorials/glibcenv.html

Some interesting ones:


LANG and LANGUAGE

LANG sets the default locale for all categories, the individual LC_* variables override it for one
category each, and LC_ALL overrides all of them. LANGUAGE is a GNU extension that only selects which
message catalogs (translations) are used. Together these control the locale specific parts of glibc.

Lots of programs are written expecting to be run in one locale, and can break in other locales. Since
locale settings can change things like sort order (LC_COLLATE), and the time formats (LC_TIME), shell scripts are particularly prone to problems from this.

A script that assumes the sort order of something is a good example.

A common way to test this is to try running the troublesome app with the locale set to “C”, the default locale.

     LC_ALL=C ls -al

If the app starts behaving when run that way, there is probably something in the code that is assuming the “C” locale (sorted lists and time formats are strong candidates).
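
The sort order difference is easy to see for yourself. In the “C” locale, sort compares raw bytes, so all uppercase letters come before all lowercase ones. The wrapper below (the name sort_c is made up for this example) pins that behaviour no matter what the calling environment has set.

```shell
# Sort using plain byte order, regardless of the caller's locale.
# Any arguments are passed through to sort.
sort_c() {
    LC_ALL=C sort "$@"
}
```

`printf 'b\nA\na\nB\n' | sort_c` prints A, B, a, b on separate lines. Running the same input through plain `sort` under a locale such as en_US.UTF-8 typically interleaves upper and lower case instead, which is exactly the kind of difference that trips up scripts.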


glibc malloc stuff

Recent (>5.4.23 for libc/>2.0 for glibc) libc implementations offer a small scale malloc debugger by way of the MALLOC_CHECK_ environment variable. MALLOC_CHECK_ can be set to 3 different values:

 - 0: ignores any heap corruptions
 - 1: prints diagnostics on STDERR
 - 2: calls abort(3) as soon as memory corruption is detected

This will help with the kind of memory corruption that can’t be found with the tried and proven software engineering method of “staring at the code”, but where electric fence/valgrind would be overkill.
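
Because it is only an environment variable, the check can be switched on for a single run without touching the rest of the session. A minimal sketch, with a made-up wrapper name:

```shell
# Run a command with glibc's malloc checking set to the given level.
#   $1 = 0, 1 or 2 as described above, remaining arguments = the command
run_malloc_check() {
    level=$1
    shift
    env MALLOC_CHECK_="$level" "$@"
}
```

For example, `run_malloc_check 2 ./some-app` makes the app abort at the first detected corruption, which combines nicely with core files and a debugger.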


Types Of Problems

Software is complicated, and there can be a wide variety of problems
that occur. But there are categories of problems that come up often,
and it’s useful to have tools and techniques for solving them.

For more info, see Types Of Problems


App specific troubleshooting info


apache


mod_status

mod_status is an apache module that can show an html page representing various information
about the internal status of apache. This includes the number of httpds, their current status,
network connections, amount of traffic, etc.

Very useful when trying to track down performance related issues.


module debugging

Some apache httpd modules include options to enable extra debugging info. Unfortunately,
which options exist depends on the module.


log files

log files, the httpd error logs in particular (typically in /var/log/httpd/error_log) are
often the best place to look when troubleshooting. It’s also where any module debugging information
will log to.


Testing the configuration file for Syntax Errors

Apache comes with an executable called apachectl(8). This program can run a configuration check on apache’s configuration files by issuing the command

  apachectl configtest

Some distros (like Red Hat/Fedora) also expose this check through apache’s init script, which invokes apachectl behind the scenes.


-X debug mode

One of the biggest problems with trying to track down problems with apache httpd is the multiprocess
nature of it. It makes it difficult to strace or to attach gdb.

To force httpd to run in a single process mode start it with:

       httpd -X

Note that on Red Hat linux boxes you probably need to include the commandline arguments that the init
scripts start httpd with. The easiest way to do this is to start httpd normally, then run `ps auxwwww` and
cut and paste one of the httpd commandline lines.


php

The following assumes that you know php coding.

The most informative (but also most disruptive in a visual sense) thing to do is set

  error_reporting = E_ALL

in your php.ini (under debian: /etc/php/<calling entity>/php.ini). Remember to restart your webserver/calling entity after changing this setting. If you come from the C corner of things, you’ll know that good programming style dictates that you treat warnings and notices as errors. So off you go, clean up that code!

Back and still not working? Ok, now it gets ugly. PHP doesn’t come with a debugger like gdb. Such things exist, but usually they will be embedded in an IDE that also emulates a web server and cost $$$. So basically you get to do stuff just as in regular shell scripts: debug echos. Echo early, echo often. Hand in hand with echo statements comes the print_r function, which will print arrays/hashes (same thing in PHP) recursively. Drawback here: print_r formats in plain ASCII, not HTML. So you’ll either have to look at the page source to see a clean version of the output, or do something ugly like

 // passing true makes print_r return the text instead of printing it
 echo "<pre>" . htmlspecialchars(print_r($myarray, true)) . "</pre>";

 FIXME: can you turn on warnings about variables only used once, like in perl? One of my most
 frequent errors....


X apps

  FIXME


nosync stuff


X log


ssh

Most problems occur here when you’re trying to set up logins via RSA/DSA keys (and probably without passwords too…). It’s usually down to basics: Make sure that your ~/.ssh is owned by your user and set to mode 700. ~/.ssh/authorized_keys has to be set to 0600. If these basic conditions aren’t met, sshd will refuse to even look at your authorized_keys file and drop you back to password logins.
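
Resetting the modes is a two-liner. The sketch below wraps it in a function (the name fix_ssh_perms is made up for this example) that takes the directory as an optional argument and defaults to ~/.ssh. It uses 700 for the directory, since a directory needs the execute bit to be entered, and 600 for the key file.

```shell
# Apply the permissions sshd expects to an ssh directory.
#   $1 = directory, defaults to ~/.ssh
fix_ssh_perms() {
    dir=${1:-"$HOME/.ssh"}
    chmod 700 "$dir" || return 1
    if [ -f "$dir/authorized_keys" ]; then
        chmod 600 "$dir/authorized_keys"
    fi
}
```

Run `fix_ssh_perms` as the affected user on the server side, then check the result with `ls -ld ~/.ssh ~/.ssh/authorized_keys`.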

Another word about the format of the authorized_keys file: it’s one key per row. Make sure that your added keys are in a single row! vi is notorious for adding linebreaks if you have ‘tw’ set in your ~/.vimrc and use copy and paste to add a new key to the file. Use cat or ssh-copy-id instead.

You can run

  ssh -v fred@godot

to see what ssh is up to and where things start hiccuping. You can go all the way up to

  ssh -vvv fred@godot

if you really want to know about how modulo groups are being prodded. Usually -vv suffices.


I just updated my openssh packages and now I can’t login

If the error message is something like “Unsupported Protocol - Remote host closed the connection”, it’s probably due to an incompatibility between OpenSSH 4.2 and anything pre-4.2. If you have the server under your control, the solution is easy: Update the server to the 4.2 version as well (recommended as there are some nasty zlib buffer overruns in pre-4.2 anyway).

 FIXME: What other solutions are there?


sshd -d -D


pam/auth/nss

  FIXME
  • logging options?
  • getent


LDAP

 FIXME


Kerberos

When something goes wrong with kerberos, it’s usually down to a few things:
- Something in the network topology changed, mandating that you re-check your /etc/krb5.conf
- Your kerberos server is unreachable
- You entered a wrong password while generating a keytab file, or the associated user/service name is not known to the server.

Tools like kinit(1) do have a -v option for verbose output, but unfortunately it only starts outputting useful information after they acquire a TGT from the KDC. It’s more useful to watch the logs of the KDC and see what (if anything) actually happens there.


/etc/krb5.conf

This configuration file is read and used by the kerberos libraries, so any settings here affect everything on your system that uses kerberos.
The most important setting is

 [realms]
 <YOUR DEFAULT REALM> = {
   kdc = <IP of your KDC>
 }

There may be several realm definitions within the [realms] section. Be sure that you set the correct IP here. Otherwise your kerberos requests will just hang there and time out after a while.

The second most important setting is

 [libdefaults]
 default_realm = <YOUR DEFAULT REALM>

This specifies what realm kerberos tools will use if no explicit realm is given for a request.

Finally, if you’re fooling around with a KDC that resides on a Windows2003 server, be sure that you’ve enabled arcfour-hmac-md5 and des-cbc-crc as crypto algorithms for the settings default_tgs_enctypes, default_tkt_enctypes and permitted_enctypes in the [libdefaults] section. Otherwise your keytab files will be unreadable.
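
For the Windows2003 case, the relevant part of /etc/krb5.conf might look like the sketch below. EXAMPLE.COM is a placeholder realm, and the enctype lists contain only the two algorithms named above; your setup may need more.

```conf
[libdefaults]
 default_realm = EXAMPLE.COM
 default_tgs_enctypes = arcfour-hmac-md5 des-cbc-crc
 default_tkt_enctypes = arcfour-hmac-md5 des-cbc-crc
 permitted_enctypes = arcfour-hmac-md5 des-cbc-crc
```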


OpenSwan/IPSEC

 FIXME


sendmail

  FIXME


Desktop Environments


Gnome


Links


Credits

Comments, suggestions, hints, ideas, criticisms, pointers, and other useful info from various folks were used
to create the original version of this document. Check the history for more.

  • Adrian Likins
  • Mihai Ibanescu
  • Chip Turner
  • Chris MacLeod
  • Todd Warner
  • Nicholas Hansen
  • Sven Riedel
  • Jacob Frelinger
  • James Clark
  • Brian Naylor
  • Drew Puch


License

This work is licensed under a Creative Commons Attribution 2.5 License (http://creativecommons.org/licenses/by/2.5/)

If folks are interested in also applying other licenses (GNU FDL, etc), let Adrian know.


How to Help

See How To Help for more info.