Problem determination is a very important activity for any system administrator. There are many tools available on AIX V3.2 to help the system administrator in this activity. In this chapter, an overview is provided for software problem determination and reporting. Reference is made to the published documents, wherever possible. The procedures given in this chapter are by no means complete or exhaustive. This chapter is written to help the customers who do not have access to a CD-ROM or other documents and who want to understand the basics of AIX V3.2 problem determination techniques.
The discussion in this chapter mainly concerns diagnosing problems on an inactive system. These procedures should allow the user to fix problems on a totally inactive or hung system. Procedures for TCP/IP, XStation* or SNA systems are quite different and are not covered in this chapter.
Use the following symptom index as your problem solving starting point.
In the following table, select the symptom that most closely matches the problem with the system. If more than one symptom matches, select the first matching symptom.
Table: Symptom Table
Some error messages give instructions on how to resolve the error incurred. If the message describes the cause of the problem, attempt to correct it. If you are not given enough information to correct the problem, refer to the Message Index in AIX/6000 Problem Solving Guide and Reference to determine the nature and scope of the message. If you are able to correct the problem, then you have completed the problem determination procedure.
If you were unable to correct the problem with these steps, some messages may have been recorded in the system error log. Go to Retrieving an Error Log Report.
Do the following:
Refer to the device documentation for the correct procedures on cabling and configuring your device. If the documentation contains a troubleshooting or other problem determination section, refer to it for further guidance.
Are you able to resolve the problem with these steps?
Use the command lsdev -C | pg to list all the devices on the system. Check whether the device under test is listed in the output and marked as available. Note that logical volumes appear in this output with a status of defined; make no effort to correct that status.
If the device is not listed or has a status of defined, use the command mkdev to create the device or the command chdev to make the device available. Alternatively, SMIT menu functions can be used for creating devices or changing their characteristics. If this step has fixed the problem, then you have completed the procedure. If these steps have not corrected the problem, then you may have a hardware problem. Please see step Hardware Problem Determination. At this time, you may contact the IBM Support Center, either for hardware support or for software support.
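The status check in the steps above can be sketched as a small shell filter. This is only an illustration: it assumes that lsdev -C prints the device name in the first column and the status in the second, and the helper name device_status is hypothetical, not an AIX command.

```shell
#!/bin/sh
# Sketch only: report the status of one device from `lsdev -C` output.
# Assumption: column 1 is the device name and column 2 is its status.
# device_status is a hypothetical helper name, not part of AIX.
device_status() {
    # $1 = device name; the lsdev -C listing is read from standard input.
    awk -v dev="$1" '
        $1 == dev { print $2; found = 1; exit }
        END { if (!found) print "missing" }
    '
}

# Possible use on a live system (the device name rmt0 is only an example):
#   lsdev -C | device_status rmt0
```

A result of "missing" or of the defined status corresponds to the case in which mkdev, chdev or SMIT is used as described above.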
Before going into the details of problem determination of an inactive system, please look for obvious errors such as:
Check the three digit LED display in the front panel. Is the display blank?
Check if the system console is operational. If it is operational then go to step Running System has Become Inactive for freeing up other terminals. If the system console is not operational, then look for obvious power, cable or baud rate problems. If the system is inactive during a power on restart or a cold boot then go to step System Inactive During Restart. If the system was running till recently and has become inactive without any attempts to restart the system then go to step Running System has Become Inactive.
Check that the mode switch in the front panel is in the Normal mode. Is the switch in the normal mode?
Note: The system will not boot in the Secure mode.
You have probably reached this step because the system is hung and there is no LED display. A hung system means a hung terminal, an inactive console, or both. This section outlines a procedure to free the console or the terminal from the hung process.
Try to reactivate the system by performing the following actions:
You have reached this step because the console and the terminals connected to the system are not operational and a value is displayed in the front panel LED. Observe the LED and take the following actions:
A flashing 888 in the LED display indicates that there is a message encoded as a string of LED display values. Record the display values by performing the following procedures:
For unexpected system halts, the string of LED values has the following format:
888 102 mmm ddd
An initial value of 102 indicates that an unexpected system halt occurred during normal operations. The value of the mmm variable indicates the cause of the halt, which is explained in Cause of the System Halt. The ddd value indicates whether your system completed a system dump, which is explained in Dump Status.
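Once the values have been written down, the format described above can be split into its fields with a short shell function. This is a minimal sketch: decode_888 is a hypothetical helper name, and it only separates the fields; it does not interpret the mmm or ddd values, which must still be looked up in the lists that follow.

```shell
#!/bin/sh
# Sketch only: split a recorded flashing-888 sequence into its fields.
# decode_888 is a hypothetical helper name, not an AIX command.
decode_888() {
    # Expected arguments: 888 <initial value> [mmm] [ddd]
    if [ "$#" -lt 2 ] || [ "$1" != "888" ]; then
        echo "not a flashing-888 sequence"
        return 1
    fi
    case "$2" in
        102)
            # Unexpected system halt: the next two values are mmm and ddd.
            echo "type=unexpected-halt cause=${3:-unknown} dump=${4:-unknown}"
            ;;
        103|104|105)
            # Diagnostic message shown in the LED display.
            echo "type=diagnostic-message"
            ;;
        *)
            echo "type=unrecognized initial=$2"
            ;;
    esac
}

# Possible use, with the values recorded from the display:
#   decode_888 888 102 mmm ddd
```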
The following list gives the possible values of mmm, the second value following the 888, and the cause of the system halt invoking that value:
The value of ddd, the third value following the 888, indicates the dump status. If any of the ddd value descriptions leads you to another step, return afterward to the bottom of this step, Unexpected System Halts. The possible values and meanings of ddd are:
You have reached the end of the list at this step. Run diagnostics on your system. For further information, refer to Hardware Problem Determination. If the diagnostics programs return a Service Request Number (SRN), record this number and contact the IBM Support Center.
In some cases, the dump space may not be large enough to capture the kernel image during the system crash, or it may be configured incorrectly and point to a non-existent device. Use the SMIT fast path command smit dump to reach the SMIT options for handling the system dump. It is possible to define a larger dump space and assign it as the primary dump device without rebooting. Normally a dump space of 10MB is sufficient. You can assign a temporary logical volume or a tape drive as the primary or secondary dump device. Once a successful dump is made, this temporary dump device may be released for normal operations.
An initial value of 103, 104, or 105 indicates that diagnostic messages are displayed in the LED digit display when the console display is either not present or is unavailable because of a display or adapter failure, or when a failure that prevents a system restart is detected.
To interpret diagnostic messages, refer to Hardware Problem Determination. You will find a list of suggested document references, which will help you in hardware problem determination.
Record the Service Request Number returned from the diagnostics programs along with the location codes and contact the IBM Support Center. You have completed the procedure.
If the system has halted with a steady value in the range of 201 through 299, refer to the section "Interpreting Values in Three Digit Display" in the AIX/6000 Problem Solving Guide and Reference for the meaning of the values. LED values in the range of 201 to 299 are progress indicators, which are normally not displayed for more than 30 seconds.
An LED display that alternates between values normally occurs when you are trying to reboot the system in either the normal mode or the service mode. The system is in a loop while attempting to restart from the devices indicated by the values in the LED display. Determine if the boot device is ready by checking the following:
The error log is a report of system errors and can be used to diagnose these errors.
# errpt -a | pg
It generates a detailed report starting with the most current errors.
Does the error log contain error information relevant to your problem?
The error report contains the following sections:
The predefined name for the event.
The numerical identifier for the event.
The date and time of the event.
The unique number for the event.
The identification number of your system processor unit.
The hostname of your system.
The general source of the error. The possible error classes are:
The severity of the error that has occurred. The possible error types are:
The name of the failing resource (for example, a device name of hdisk0).
The general class of the failing resource (for example, a device class of disk).
The type of the failing resource (for example, a device type of 355MB).
The path from the adapter in the processor to the device. There may be up to four fields, which refer to drawer, slot, connector, and port, respectively.
(Vital Product Data) The contents of this field, if any, vary. Error log entries for devices typically return information concerning the device manufacturer, serial number, Engineering Change levels, and Read Only Storage levels. Software entries usually include the version, release, and level.
A summary of the error.
A listing of some of the possible sources of the error.
A list of possible reasons for errors due to user mistakes. An improperly inserted disk, and external devices (such as modems and printers) not being turned on are examples of user caused errors.
A description of actions for correcting a user caused error.
A list of possible reasons for errors due to incorrect installation or configuration procedures. Examples of this type of error include hardware/software mismatches, incorrect installation of cables or cable connections becoming loose, and improperly configured systems.
A description of actions for correcting an installation caused error.
A list of possible defects in hardware or software.
A description of actions for correcting the failure. For hardware errors, "Perform Problem Determination Procedures" is one of the recommended actions listed. This recommended action means that you should follow the procedures in this guide. For hardware errors, this will lead to running the diagnostics programs.
Failure data that is unique for each error log entry, such as device sense data.
Are there any errors reported that might be related to the problem?
Check the error log for errors that do not seem directly related to the problem. The problem you encountered may be an unexpected consequence of another problem.
Are there any errors reported that might be somewhat related to the problem?
Perform the recommended actions for any failure related to the problem that is reported in the error log. Attempt to resolve your system problems by following the items listed under User Causes and Installation Causes in the error report before trying those listed under Failure Causes.
If these actions resolve the problem then you have completed these procedures. If not, go to step Classifying Errors.
The procedure for resolving hardware errors differs from those for software and operator errors. Is the error class value for any seemingly related error equal to H (hardware)?
When you receive a hardware error, refer to the OPERATOR GUIDE for information about performing diagnostics on the problem device or other piece of equipment. The diagnostics programs test the device and analyze the error log entries related to it to determine the state of the device. Run diagnostics on each entry that has an error class with the value of H.
Since the diagnostics programs did not reveal that you have a defective device, one of the following is true:
Are there any software error log entries that may be related to the problem?
The error log entry may not have been important at this time. Check the error log periodically to determine if the problem is ongoing. If related errors occur, repeat the procedures in this chapter. If you suspect an undetected hardware problem, report it to your hardware service organization, and stop. You have completed these procedures.
If there is a Failure Causes section in a software error log entry, it usually indicates a software defect. Entries that list User or Installation Causes, or both, but not Failure Causes, usually indicate that a software defect is not the source of the problem.
If you suspect a software defect or are unable to correct User or Installation Causes, report the problem to your software service organization, and stop. You have completed these procedures.
All or some of the error logging facility may be turned off. Use the command:
# ps -eaf | grep errdemon
This should display the process ID of the error daemon. If this command does not produce any output, then the error daemon is inactive, which could be one of the causes for the empty error logs. Use the command /usr/lib/errdemon to start the error daemon. You need root authority to perform this command.
Return to the step in the problem solving procedures that brought you to step Retrieving an Error Log Report. If those procedures did not resolve the problem, report the problem to the IBM Support Center.
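The check for the error daemon described above can be wrapped in a small shell function. This is a sketch under stated assumptions: errdemon_running is a hypothetical helper name, and it simply filters a process listing for the string errdemon while ignoring the grep process itself, which can otherwise appear in the output and give a false positive.

```shell
#!/bin/sh
# Sketch only: decide from a process listing whether errdemon is running.
# errdemon_running is a hypothetical helper name; it reads `ps -eaf`
# output on standard input and returns 0 when the daemon is listed.
errdemon_running() {
    # Drop the grep process itself, which also contains "errdemon".
    grep errdemon | grep -v grep >/dev/null
}

# Possible use on a live system:
#   if ps -eaf | errdemon_running; then
#       echo "error daemon is active"
#   else
#       echo "error daemon is inactive; start /usr/lib/errdemon as root"
#   fi
```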
A system dump acts as a picture of working memory when a severe error occurs. Generally, reading the contents of a system dump will enable a person trained to read system dumps to determine what went wrong with the system and why. This set of problem determination procedures may require you to do the following:
Did your system stop with a flashing 888 displayed in the LED display?
If you initiated the system dump yourself, you will immediately see 0c2 in the LED display. Within one minute, this should change to indicate whether the dump was successful. See Problem Determination for Inactive System for the description of the LED codes.
To force a system dump to the primary dump device, here are the methods:
Turn the key to the Service position, and press the reset button once. This method works for all system configurations.
Turn the key to the Service position, hold down the left Ctrl and Alt keys, and press the 1 key on the numeric keypad. This method can only be used on a natively attached hft keyboard; it cannot be used on a serially attached console.
While logged in as root, enter:
# sysdumpstart -p
This method works for all system configurations, but it assumes that the system is responding to commands.
To force a system dump to the secondary dump device, here are the methods:
Turn the key to the Service position, hold down the left Ctrl and Alt keys, and press the 2 key on the numeric keypad. This method can only be used with a natively attached hft keyboard; it cannot be used with a serially attached console.
While logged in as root, enter:
# sysdumpstart -s
This method works for all system configurations, but it assumes that the system is responding to commands.
Continue with Determining the Status of a System Dump.
Find the value currently displayed in the LED display in the following list and take the prescribed action:
You have reached this step because your system stopped with a flashing 888 and produced either a successful or a partial dump. Normally an unsuccessful or partial dump is useless. However, you may still decide to copy the dump to removable media and send it to the IBM Support Center for analysis. You could also have reached this step after forcing a dump.
The following files are needed to properly analyze a system dump:
Are you able to boot the system in normal mode?
You have reached this step to gather the files needed for the dump analysis.
To copy the system dump information to tape:
# /usr/sbin/snap -gfkD -o /dev/rmt0
# tar -tvf /dev/rmt0
-rw-r--r-- 0 0 2933929 Dec 14 15:11:35 1992 ./dump/dump.Z
-rw-r--r-- 0 0 791604 Dec 14 15:09:49 1992 ./dump/unix.Z
-rw-r--r-- 0 0 4327 Dec 14 15:08:51 1992 ./general/errpt.out
Note that the size of dump.Z in this example is 2933929 bytes and the size of unix.Z is 791604 bytes. It is important that the dump.Z and unix.Z files are not 0 bytes long. If they are, then something is wrong. In that case, you should not send this information to the IBM Support Center because there is nothing to analyze. Double check your steps, and write down any error messages that were displayed on the screen. Call the IBM Support Center if you continue to get 0 byte dump.Z or unix.Z files.
# lsattr -E -l rmt0 | grep block_size
block_size 512 BLOCK size (0=variable length) True
To copy the dump information to diskettes:
# /usr/sbin/snap -gfkD -o /dev/rfd0
# tar -tvf /dev/rfd0
-rw-r--r-- 0 0 2933929 Dec 14 15:11:35 1992 ./dump/dump.Z
-rw-r--r-- 0 0 791604 Dec 14 15:09:49 1992 ./dump/unix.Z
-rw-r--r-- 0 0 4327 Dec 14 15:08:51 1992 ./general/errpt.out
Check that none of these files is of 0 byte length.
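The zero-length check on the tape or diskette listing can also be done mechanically. The following sketch assumes the listing layout shown in the examples above, with the file size in the fourth column and the file name in the last column; zero_length_members is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: flag zero-length members in a `tar -tvf` listing.
# Assumption: the size is in column 4 and the member name is in the last
# column, as in the example listings in this section.
# zero_length_members is a hypothetical helper name.
zero_length_members() {
    # Prints the name of every 0-byte member; exit status 1 if any is found.
    awk '$4 == 0 { print $NF; bad = 1 } END { exit bad + 0 }'
}

# Possible use after writing the snap output to tape:
#   tar -tvf /dev/rmt0 | zero_length_members
```

If the function prints ./dump/dump.Z or ./dump/unix.Z, the dump information is not usable and should not be sent.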
See Problem Determination for Inactive System on actions to take in case of a system crash. See Flashing 888 LED Display, for the steps to record the LED values in case of a flashing 888 LED display. These values are explained in the same section.
Follow the steps below to verify if the dump was successful.
Are you able to boot the system in the normal mode?
# sysdumpdev -l
primary /dev/hd7
secondary /dev/sysdumpnull
Note the primary dump device name, and use it where you see /dev/hd7 in the following examples. The default primary dump device is /dev/hd7. You could define another logical volume as the primary dump device.
# crash /dev/hd7
Using /unix as the default namelist file.
Reading in symbols........................
If no warning messages are displayed, go to step 4. Check if the following message appears:
# crash /dev/hd7
Using /unix as the default namelist file.
WARNING: dumpfile does not appear to match namelist
If this warning message appears after you enter the crash command, either the crash did not take place or the /unix file does not match the information that was in the dump device. In this case, type q at the > prompt to quit out of the crash command. If you receive messages that the name list does not match, your /unix file does not match the kernel you are running. Please follow up with the IBM Support Center for assistance. Do not continue with the remaining steps and do not send the dump information to the IBM Support Center because there is nothing to analyze.
> stat
sysname: AIX
nodename: bruce
release: 2
version: 3
machine: 000001873400
time of crash: Sun Nov 14 16:09:28 1993
age of system: 18 day, 8 hr., 31 min.
If the time of the crash in the output approximately matches the time the system crashed, then the dump information may be sufficient for analysis. If you have any other lines of output from the stat command other than those listed above, please write them down. Continue with step Copying the System Dump Information to Tape or Diskette.
If the time of the crash does not approximately match the time the system crashed, type q to exit the crash command. In this case, do not continue with these steps and do not send the crash information to the IBM Support Center, because it is not sufficient for analysis.
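The first step of this procedure, noting the primary dump device, can be sketched as a shell filter so that the device name does not have to be copied by hand. It assumes the two-column sysdumpdev -l output shown above; primary_dump_device is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: extract the primary dump device from `sysdumpdev -l` output.
# Assumption: the output has the form "primary <device>" as shown above.
# primary_dump_device is a hypothetical helper name.
primary_dump_device() {
    awk '$1 == "primary" { print $2; exit }'
}

# Possible use on a live system:
#   dev=$(sysdumpdev -l | primary_dump_device)
#   crash "$dev"
```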
The system dump cannot be retrieved until your system can be started again. Go to step Problem Determination: Symptom Index to begin diagnosing why the system cannot be restarted. After you have completed other procedures, return to this step.
Were you able to correct the problem and restart your system?
To restart the system, turn the Mode Switch to the Normal position and turn on your system, then stop. You have completed these procedures.
Hardware problem determination is a large and involved topic. You have reached this step if any of the previous steps asked you to test the hardware. If AIX V3.2 is running at least on the console, then there are many ways to test the remaining hardware. Describing such procedures is beyond the scope of this book because, depending on the symptom, the steps vary for each machine.
However, the approach for hardware problem determination would be as follows:
Related Publications:
If your machine has frozen or hung while you are using X Window (X) and the numbers 888 are displayed on your LED, then you are experiencing a system dump. You can gather the system dump and send it to the IBM Support Center.
If your machine has frozen or hung while you are using X Window (X) and there is no display on the LED, then we need to determine whether X has crashed or is in a hang state. To determine the current state of X, do the following:
# ps -ef | grep X
/usr/lpp/X11/bin/X -D /usr/lib/X11/En_US/rgb
Yes: The X process is still running, so X is in a hang state (see X Hang)
No: The X process is no longer running so X has crashed (see X Crash)
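The Yes/No decision above can be sketched as a shell function that inspects a process listing. This is an illustration only: x_state is a hypothetical helper name, and it assumes that the X server appears in the ps -ef output under the path /usr/lpp/X11/bin/X, as in the example line above.

```shell
#!/bin/sh
# Sketch only: classify the state of X from `ps -ef` output on stdin.
# Assumption: the server is listed as /usr/lpp/X11/bin/X, as in the
# example above. x_state is a hypothetical helper name.
x_state() {
    # Match the server binary exactly, followed by a space or end of line,
    # so that the "grep X" process and other X11 programs are not counted.
    if grep -E '/usr/lpp/X11/bin/X( |$)' >/dev/null; then
        echo "hang"     # process still running: see X Hang
    else
        echo "crash"    # process gone: see X Crash
    fi
}

# Possible use from another terminal or a remote login:
#   ps -ef | x_state
```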
If you can come up with a repeatable process that will cause X to crash, you can give these steps to the IBM Support Center so they can try to recreate the crash in their labs (you do not need to perform the steps outlined below).
The steps outlined below are attempts to gather information if the IBM Support Center cannot recreate the hang in their labs.
If X has frozen and you were able to determine that the X process is no longer running, you have experienced an X crash. When X crashes, there should be a core file in the directory from which you started xinit or startx.
If a core file does not exist in the directory from which you started xinit or startx, one of two things has occurred:
You do not have write permission in the directory from which you started xinit or startx. If you do not have write permission, a core file cannot be created in the current directory. Restart X from a directory where you have write permission and try to recreate the X crash (continue reading the problem determination steps for an X crash with a core file; it may save you some time).
You could be running out of paging space and the operating system is killing X because it is the largest process or the process using the most paging space. To determine if your system is running out of paging space you can do the following:
Create a script that will display the output of the lsps -a command periodically. Remember, after X has been killed, this command will report the paging space that X was using as freed. In other words, the output of this command is not useful after X has died; it is only useful right before X is killed.
Create a window and execute the sar command to watch the paging slots used. For example, run sar -r 1 1000 and watch the slots column, which represents the paging slots available. If this number drops to around 200, then you will most likely run out of paging space.
If you are running out of paging space and suspect that one of your processes is using up paging space and not returning it, you can verify this by creating a small script that executes the ps command periodically and watching all processes to determine if any are growing continuously.
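The periodic sampling scripts suggested above can share one small helper. The following is a minimal sketch, not a definitive tool: sample_every is a hypothetical function name, and the intervals, counts and log file names in the usage comments are only examples.

```shell
#!/bin/sh
# Sketch only: run a command repeatedly, printing a timestamp before
# each sample. sample_every is a hypothetical helper name.
# Usage: sample_every SECONDS COUNT COMMAND [ARGS...]
sample_every() {
    interval=$1
    count=$2
    shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "=== $(date) ==="
        "$@"
        i=$((i + 1))
        # Do not sleep after the final sample.
        [ "$i" -lt "$count" ] && sleep "$interval"
    done
    return 0
}

# Possible uses (values and file names are examples only). Writing to a
# file keeps the samples taken just before X is killed:
#   sample_every 10 360 lsps -a >> /tmp/lsps.log
#   sample_every 60 60 ps -el >> /tmp/ps.log
```

Comparing successive samples in the log shows whether paging space is shrinking steadily and whether any single process keeps growing.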
If a core file exists in the directory from which you started xinit or startx, then the IBM Support Center may be able to use this core file for problem determination.
With this information you can usually tell what pointer is corrupt or incorrect. Unfortunately, the information telling you what trashed the pointer is no longer available. In other words, in most cases with an X crash, you will need to come up with some repeatable process that will cause X to crash.
If you can come up with a repeatable process that will cause X to crash, give these steps to the IBM Support Center so they can try to recreate the crash in their labs. You do not need to do all of the steps below if you have a repeatable process.
As you are trying to come up with a repeatable process that will cause X to crash, you can send the IBM Support Center the core file that was created and they may be able to tell something from it. Please send the following to the IBM Support Center for problem determination:
Verify that the core file is current (you can use the ls -al command to check its date and compare it to the time of the X crash).
Also verify that it was X that core dumped. Run the command:
# od -c core +0x730 | head -1
Although the output of this command can be confusing, what we are interested in is the first item after the offset 0000730; this should be a capital X (the rest of the output looks like garbage). If you see an X right after 0000730, then the core file is for the X crash.
adapter1 = colorgda Color Graphics Display Adapter (Skyway)
adapter2 = hiprf3d High Performance 3D Color Graphics processor (Sabine)
adapter3 = hispd3d High Speed 3D Graphics Accelerator (730 or GTO)
adapter4 = POWER_Gt3 Midrange Entry Graphics Adapter (Gt3) or POWER_Gt4 Midrange Graphics Adapter (Gt4) or POWER_Gt4x Midrange Graphics Adapter (Gt4x)
adapter5 = POWER_Gt1 Frame Buffer Adapter (Gt1)
/usr/lpp/gai/adapter#.r4/loadrms
If you have a core file in the directory from which you started xinit or startx and you can keep your system or X down long enough for IBM to perform some additional problem determination, the IBM Support Center may be able to gather some useful information:
# ps -ef | grep X
# dbx -c /tmp/dbx.ignore -a <processid_for_X>
If X has frozen and you were able to determine that the X process is still running, you are experiencing an X hang.
# dbx -a <processid_for_X> | tee <output_file>
> where
> X
> listi
> cont
> ^c (hit the control key with c)
> where
Is it close to the same place as in step 4 above? (In other words, are the addresses close?) If so, X is in a loop.
> detach
# crash | tee <output_filename>
> p
This will list all the processes currently running; one of these will be X. Jot down the number that appears in the leftmost column of the line that contains the X process. This is the slot number for X.
> t <X's_slot_number>
> quit
To trace a system which is experiencing unpredictable X hang problems:
gtraced:2:once:trace -a -l
# trcrpt >/tmp/trcrpt.out
This section shows an example of the file /tmp/dbx.ignore. This file contains dbx subcommands. When the dbx command is run with the -c /tmp/dbx.ignore flag, the subcommands in this file are run before dbx reads from standard input.
ignore HUP
ignore ALRM
ignore URG
ignore STOP
ignore TSTP
ignore CONT
ignore CHLD
ignore TTIN
ignore TTOU
ignore IO
ignore XCPU
ignore XFSZ
ignore MSG
ignore WINCH
ignore PWR
ignore USR1
ignore USR2
ignore PROF
ignore DANGER
ignore VTALRM
ignore MIGRATE
ignore PRE
ignore GRANT
ignore RETRACT
ignore SOUND
ignore SAK