AIX 3.2 Problem Determination and Diagnostics

Problem determination is an important activity for any system administrator, and AIX V3.2 provides many tools to help with it. This chapter gives an overview of software problem determination and reporting, and refers to the published documents wherever possible. The procedures given here are by no means complete or exhaustive. The chapter is written for customers who do not have access to a CD-ROM or other documents and who want to understand the basics of AIX V3.2 problem determination techniques.

The discussion in this chapter is mainly about diagnosing problems on an inactive system. These procedures should allow the user to fix problems on a totally inactive system or a hung system. Procedures for TCP/IP, XStation or SNA systems are quite different and are not covered in this chapter.

Problem Determination: Symptom Index

Use the following symptom index as the starting point for problem solving. In the following table, select the symptom that most closely matches the problem with the system. If more than one symptom matches, select the first matching symptom.


Table: Symptom Table

Encountering System Messages

Some error messages give instructions on how to resolve the error incurred. If the message describes the cause of the problem, attempt to correct it. If you are not given enough information to correct the problem, refer to the Message Index in AIX/6000 Problem Solving Guide and Reference to determine the nature and scope of the message. If you are able to correct the problem, then you have completed the problem determination procedure.

If you were unable to correct the problem with these steps, some messages may have been recorded in the system error log. Go to Retrieving an Error Log Report.

Encountering a Malfunctioning Device

Do the following:

Refer to the device documentation for the correct procedures on cabling and configuring your device. If the documentation contains a troubleshooting or other problem determination section, refer to it for further guidance.

Are you able to resolve the problem with these steps?

Yes
You have completed the problem determination procedure.
No
Go to the next step, Problem Determination for Devices.

Problem Determination for Devices

Try to determine whether the device is in a ready state with the following steps:

Are you able to make the device ready and correct the problem?
Yes
You have completed the procedure.
No
There could be errors reported in the error log for the device, the adapter, or the application using the device. See Retrieving an Error Log Report for how to retrieve an error log. If you are still unable to correct the problem after analyzing the error log, go to the next step, Procedure for Software Checking of Devices.

Procedure for Software Checking of Devices

Use the command lsdev -C | pg to list all the devices on the system. Check whether the device under test is listed in the output and marked as Available. Note that logical volumes have a status of Defined in this output; do not attempt to correct that status.

If the device is not listed or has a status of Defined, use the mkdev command to define the device or to make a defined device available, and the chdev command if its characteristics need to be changed. Alternatively, the SMIT menu functions can be used to create devices or change their characteristics. If this step has fixed the problem, you have completed the procedure. If not, you may have a hardware problem; see step Hardware Problem Determination. At this time, you may contact the IBM Support Center for either hardware or software support.
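
Once the lsdev -C output has been captured to a file or a string, the check above can be scripted. The following Python sketch is illustrative only: the helper name devices_not_available and the sample listing are hypothetical, and the code assumes that each line starts with the device name followed by its status, with lsdev printing the status words capitalized. Logical volumes are not special-cased, so they will appear in the result and can be ignored.

```python
def devices_not_available(lsdev_output):
    """Return {name: status} for every listed device whose status is not Available.

    Assumption of this sketch: each non-empty line begins with the
    device name followed by its status.
    """
    result = {}
    for line in lsdev_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, status = fields[0], fields[1]
        if status != "Available":
            result[name] = status
    return result


# Hypothetical sample listing, for illustration only.
SAMPLE = """\
hdisk0 Available 00-08-00-00 670 MB SCSI Disk Drive
rmt0   Defined   00-08-00-10 2.3 GB 8mm Tape Drive
tty0   Available 00-00-S1-00 Asynchronous Terminal
"""

print(devices_not_available(SAMPLE))  # {'rmt0': 'Defined'}
```

A device that is missing from the listing altogether will not show up in the result, so the device under test should also be looked up by name.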

Problem Determination for Inactive System

Before going into the details of problem determination of an inactive system, please look for obvious errors such as:

Check the three digit LED display in the front panel. Is the display blank?

Yes
Go to step Determining the State of the Console.
No
If the display is showing some values then go to step LED Display on Inactive System.

Determining the State of the Console

Check whether the system console is operational. If it is, go to step Running System has Become Inactive to free up other terminals. If the system console is not operational, look for obvious power, cable or baud rate problems. If the system became inactive during a power-on restart or a cold boot, go to step System Inactive During Restart. If the system was running until recently and became inactive without any attempt to restart it, go to step Running System has Become Inactive.

System Inactive During Restart

Check that the mode switch in the front panel is in the Normal mode. Is the switch in the normal mode?

Yes
Go to step Determining the Status of Your Boot Device for problems during booting. This is not the only option. See AIX Boot Process for further information on boot failures.
No
The switch is in the Service mode. In this position you can load the hardware diagnostics or load AIX V3.2 in maintenance mode. See step Hardware Problem Determination for loading the hardware diagnostics. See Procedure to Run getrootfs in Maintenance Mode in AIX Boot Process for maintenance mode booting.

Note: The system will not boot in the Secure mode.

Running System has Become Inactive

You have probably reached this step because of a hung system with no LED display. A hung system means a hung terminal, an inactive console, or both. This section outlines a procedure to free the console or the terminal from the hung process.

Try to reactivate the system by performing the following actions:

  1. If you are using a mouse, make sure that the mouse pointer is in the desired window.
  2. Press the Ctrl-Q key sequence to make sure that scrolling is enabled. The Ctrl-S key sequence disables scrolling, and the Ctrl-Q key sequence enables scrolling. Ctrl-Q is ignored if scrolling is already enabled.
  3. Try to end the currently running process by pressing the Ctrl-C key sequence.
  4. Try to switch to another virtual terminal by pressing the Alt-Action or Shift-Action key sequence (possible only on hft). If the console changes to another virtual terminal, attempt to end the process that is causing the system to be inactive. In the case of an ASCII console, try to log in as root from any other working terminal, or try to telnet from another host on the same network as this system. Then do the following:
  5. Try to log off your system by pressing the Ctrl-D key sequence. If this action is successful, you will need to log in again.
If you are able to correct the error with these steps, you have completed the procedure. If the system is still inactive after these steps, that may mean the following. At this point, the system may have to be rebooted. Before rebooting, take a system dump to help locate the cause of the hang. Go to step Retrieving a System Dump. You may decide to contact the IBM Support Center with this information.

LED Display on Inactive System

You have reached this step because the console and the terminals connected to the system are not operational and a value is displayed in the front panel LED. Observe the LED and take the following actions:

Flashing 888 LED Display

A flashing 888 in the LED display indicates that there is a message encoded as a string of LED display values. Record the display values by performing the following procedure:

  1. Turn the Mode Switch to the Normal position, if it is not already in that position.
  2. Press the Reset button to display the next value in the string.
    **** NOTE: **** Every time you press the Reset button, hold it for about one second to allow the system to sense the change.
  3. Record this value.
  4. Press the Reset button and note the LED value each time, until the display shows 888 again. Up to 30 values could be included in the string. To display the entire string of values again, repeat this procedure.

The first value following the 888 indicates the type of information contained in the remainder of the string. Find that value in the following list and take the action given:

102
Go to step Unexpected System Halts.
103
Go to step Encountering Diagnostic Messages.
104
Go to step Encountering Diagnostic Messages.
105
Go to step Encountering Diagnostic Messages.
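
The lookup in the list above can be expressed as a small table. The Python sketch below is only an illustration; the function name next_step and the convention of passing the recorded values as a list of strings are assumptions made for this example.

```python
# Type code (first value after 888) mapped to the step named in this chapter.
NEXT_STEP = {
    "102": "Unexpected System Halts",
    "103": "Encountering Diagnostic Messages",
    "104": "Encountering Diagnostic Messages",
    "105": "Encountering Diagnostic Messages",
}


def next_step(recorded_values):
    """Return the step to go to for a recorded LED string, or None if unlisted.

    recorded_values is the list of values as read from the display,
    starting with the flashing 888.
    """
    if len(recorded_values) < 2 or recorded_values[0] != "888":
        raise ValueError("expected a string of values starting with 888")
    return NEXT_STEP.get(recorded_values[1])


print(next_step(["888", "102", "300", "0c0"]))  # Unexpected System Halts
```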

Unexpected System Halts

For unexpected system halts, the string of LED values has the following format:

888 102 mmm ddd

An initial value of 102 indicates that an unexpected system halt occurred during normal operations. The value of the mmm variable indicates the cause of the halt, which is explained in Cause of the System Halt. The ddd value indicates whether your system completed a system dump, which is explained in Dump Status.

Cause of the System Halt

The following list gives the possible values of mmm, the second value following the 888, and the cause of the system halt that each value indicates:

000
Unexpected system interrupt.
200
Machine check due to memory bus error (RAS/CAS Parity).
201
Machine check due to memory time out.
202
Machine check due to memory card failure.
203
Machine check due to address exception: address out of range.
204
Machine check due to attempted store into ROM.
205
Machine check due to uncorrectable Error Correction Code due to address parity.
206
Machine check due to uncorrectable Error Correction Code.
207
Machine check due to undefined error.
300
Data storage interrupt: processor type. This appears to be the most common code seen with a flashing 888 on most machines.
32x
Data storage interrupt: input/output exception-input/output channel controller. The number represented by x is the bus unit identification.
38x
Data storage interrupt: input/output exception-serial link adapter. The number represented by x is the bus unit identification.
400
Instruction storage interrupt.
500
External interrupt: scrub-memory bus error (RAS/CAS Parity).
501
External interrupt: undefined error.
51x
External interrupt: direct memory access (DMA)-memory bus error (RAS/CAS Parity).
52x
External interrupt: input/output channel controller type-channel check.
53x
External interrupt: input/output channel controller type-bus time out.
54x
External interrupt: input/output channel controller type-keyboard external.
For 52x, 53x and 54x, the number represented by x is the input/output channel controller number.
700
Program interrupt. This is another common failure code.
800
Floating point unavailable.
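
Codes such as 32x and 38x carry a unit number in the last digit, so a lookup has to match on the first two digits for those families. The Python sketch below shows one way to do that. It is deliberately partial: it covers only some of the values listed above, and the function name describe_halt_cause is hypothetical.

```python
# Exact codes (a subset of the list in this section).
EXACT_CAUSES = {
    "000": "Unexpected system interrupt",
    "300": "Data storage interrupt: processor type",
    "400": "Instruction storage interrupt",
    "700": "Program interrupt",
    "800": "Floating point unavailable",
}

# Families in which the last digit is the bus unit identification.
FAMILY_CAUSES = {
    "32": "Data storage interrupt: I/O exception, input/output channel controller",
    "38": "Data storage interrupt: I/O exception, serial link adapter",
}


def describe_halt_cause(mmm):
    """Describe the mmm value of an 888 102 mmm ddd string (partial table)."""
    if mmm in EXACT_CAUSES:
        return EXACT_CAUSES[mmm]
    if len(mmm) == 3 and mmm[:2] in FAMILY_CAUSES:
        return f"{FAMILY_CAUSES[mmm[:2]]} (bus unit identification {mmm[2]})"
    if len(mmm) == 3 and mmm.startswith("20") and mmm[2] in "01234567":
        return "Machine check (see the list for the specific cause)"
    return "Not in this partial table; consult the full list"


print(describe_halt_cause("325"))
```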

Dump Status

The value of ddd, the third value following the 888, indicates the dump status. If any of the ddd value descriptions leads you to another step, return afterwards to the bottom of this step (Unexpected System Halts). The possible values and meanings of ddd are:

0c0
The dump completed successfully. Go to Step Retrieving a System Dump.
0c3
The dump is inhibited.
0c4
The dump did not complete successfully. A partial dump was written to the dump device. There is not enough space on the dump device to contain the entire dump. Normally this dump would be useless. You have to increase the size of the dump device. See Changing the Dump Device Characteristics. Go to step Retrieving a System Dump.
0c5
The dump failed to start. An unexpected error occurred while the system was attempting to write to the dump media. Complete the procedures at the end of this step.
0c8
The dump device has been disabled. The current system configuration does not designate a device for the requested dump. See Changing the Dump Device Characteristics. Complete the procedures at the end of this step.
0c9
The dump did not complete, but a partial dump may be present. Go to Step Retrieving a System Dump.
c20
The kernel debugger exited without a request for a system dump. Complete the procedures at the end of this step.

You have reached the end of the list at this step. Run diagnostics on your system. For further information, refer to Hardware Problem Determination. If the diagnostics programs return a Service Request Number (SRN), record this number and contact the IBM Support Center.
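
As a worked example of reading a complete halt string, the following Python sketch splits a recorded sequence in the 888 102 mmm ddd format and looks up the dump status. The short status descriptions paraphrase the list above; the function name parse_halt_string and the returned dictionary layout are assumptions of this sketch. Values may be separated by spaces or hyphens.

```python
# ddd value -> (short meaning, step to go to, or None when this chapter
# says to complete the procedures at the end of the step instead)
DUMP_STATUS = {
    "0c0": ("dump completed successfully", "Retrieving a System Dump"),
    "0c3": ("dump is inhibited", None),
    "0c4": ("partial dump; dump device too small", "Retrieving a System Dump"),
    "0c5": ("dump failed to start", None),
    "0c8": ("dump device disabled", None),
    "0c9": ("dump did not complete; partial dump may be present",
            "Retrieving a System Dump"),
    "c20": ("kernel debugger exited without a dump request", None),
}


def parse_halt_string(text):
    """Split an '888 102 mmm ddd' string into cause and dump status."""
    parts = text.lower().replace("-", " ").split()
    if len(parts) != 4 or parts[0] != "888" or parts[1] != "102":
        raise ValueError("not an 888 102 mmm ddd string")
    mmm, ddd = parts[2], parts[3]
    meaning, step = DUMP_STATUS.get(ddd, ("unknown dump status", None))
    return {"cause": mmm, "dump": ddd, "meaning": meaning, "next_step": step}


print(parse_halt_string("888 102 300 0c0"))
```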

Changing the Dump Device Characteristics

In some cases, the dump space may not be large enough to capture the kernel image during a system crash, or the dump device may be configured incorrectly and point to a non-existent device. Use the SMIT fast path command smit dump to reach the SMIT options for handling the system dump. It is possible to define a larger dump space and assign it as the primary dump device without rebooting. Normally a dump space of 10MB is sufficient. You can assign a temporary logical volume or a tape drive as the primary or secondary dump device. Once a successful dump has been made, this temporary dump device may be released for normal operations.

Encountering Diagnostic Messages

An initial value of 103, 104, or 105 indicates that diagnostic messages are displayed in the LED digit display when the console display is either not present or is unavailable because of a display or adapter failure, or when a failure that prevents a system restart is detected.

To interpret diagnostic messages, refer to Hardware Problem Determination. You will find a list of suggested document references, which will help you in hardware problem determination.

Record the Service Request Number returned from the diagnostics programs along with the location codes and contact the IBM Support Center. You have completed the procedure.

Steady Value in LED Display

If the system has halted with a steady value in the range of 201 through 299, refer to the section "Interpreting Values in Three Digit Display" of AIX/6000 Problem Solving Guide and Reference for the meaning of the value. LED values in the range of 201 to 299 are progress indicators, which are normally not displayed for more than 30 seconds.

Determining the Status of Your Boot Device

When the LED display alternates between several numbers, the system is in a loop while attempting to restart from the devices indicated by the values in the LED display. This normally happens when you are trying to reboot the system in either the Normal mode or the Service mode. Determine if the boot device is ready by checking the following:

Have these actions solved the problem?
Yes
You have completed the procedure.
No
Go to step Hardware Problem Determination.

Retrieving an Error Log Report

The error log is a report of system errors and can be used to diagnose these errors.

Important Error Logging Files and Commands

Here is an explanation of the most important files and commands related to the error log:

Does the error log contain error information relevant to your problem?

Yes
Go to step Understanding the Error Log Report.
No
Go to step Encountering an Empty Error Log.

Understanding the Error Log Report

The error report contains the following sections:

Are there any errors reported that might be related to the problem?

Yes
Go to Step Using the Error Report to Solve Errors.
No
Go to Step Rechecking the Error Log.

Rechecking the Error Log

Check the error log for errors that do not seem directly related to the problem. The problem you encountered may be an unexpected consequence of another problem.

Are there any errors reported that might be somewhat related to the problem?

Yes
Go to step Using the Error Report to Solve Errors.
No
Return to the step in the problem solving procedure that brought you to step Retrieving an Error Log Report.
Check the error log periodically to see if any new entries are created that are related to your problem. At this point, you may contact the IBM Support Center to discuss the nature of your problem.

Using the Error Report to Solve Errors

Perform the recommended actions for any failure reported in the error log that is related to the problem. Attempt to resolve your system problems by following the items listed under User Causes and Installation Causes in the error report before trying those listed under Failure Causes.

If these actions resolve the problem then you have completed these procedures. If not, go to step Classifying Errors.

Classifying Errors

The procedure for resolving hardware errors differs from those for software and operator errors. Is the error class value for any seemingly related error equal to H (hardware)?

Yes
Go to step Performing Diagnostics on a Device.
No
Go to step Encountering Software Errors.

Performing Diagnostics on a Device

When you receive a hardware error, refer to the OPERATOR GUIDE for information about performing diagnostics on the problem device or other piece of equipment. The diagnostics programs test the device and analyze the error log entries related to it to determine the state of the device. Run diagnostics on each entry that has an error class with the value of H.


**** NOTE: **** Error types with the value PERM are usually the most severe errors and are more likely to mean that you have a defective device. Error types other than PERM usually do not indicate a defective device but are recorded so that they can be analyzed by the diagnostics programs.
Has running diagnostics revealed that you have a defective device?
Yes
Report the problem to the IBM Support Center. Then stop. You have completed these procedures.
No
Go to step Looking for Other Possible Error Causes.
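
The selection rule in this step (run diagnostics on every entry of class H, and treat PERM entries as the most likely sign of a defective device) can be sketched as follows. This Python fragment is illustrative only: the LogEntry record, the function name diagnostics_queue and the sample entries are hypothetical, and the entries are assumed to have been transcribed by hand from the error report.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    resource: str     # device or resource named in the error log entry
    error_class: str  # "H" means hardware, as described in this chapter
    error_type: str   # "PERM" is usually the most severe type


def diagnostics_queue(entries):
    """Return the hardware-class entries, with PERM entries first.

    sorted() is stable, so entries keep their original relative order
    within the PERM group and within the non-PERM group.
    """
    hardware = [e for e in entries if e.error_class == "H"]
    return sorted(hardware, key=lambda e: e.error_type != "PERM")


# Hypothetical sample entries, for illustration only.
entries = [
    LogEntry("hdisk0", "H", "TEMP"),
    LogEntry("rmt0", "H", "PERM"),
    LogEntry("lpd", "S", "PERM"),
]
print([e.resource for e in diagnostics_queue(entries)])  # ['rmt0', 'hdisk0']
```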

Looking for Other Possible Error Causes

Since the diagnostics programs did not reveal that you have a defective device, one of the following is true:

Are there any software error log entries that may be related to the problem?

Yes
Go to step Encountering Software Errors.
No
Go to step Encountering Unresolved Hardware Errors.

Encountering Unresolved Hardware Errors

The error log entry may not have been important at this time. Check the error log periodically to determine if the problem is ongoing. If related errors occur, repeat the procedures in this chapter. If you suspect an undetected hardware problem, report it to your hardware service organization, and stop. You have completed these procedures.

Encountering Software Errors

If there is a Failure Causes section in a software error log entry, it usually indicates a software defect. Entries that list User or Installation Causes, or both, but not Failure Causes, usually indicate that a software defect is not the source of the problem.

If you suspect a software defect or are unable to correct User or Installation Causes, report the problem to your software service organization, and stop. You have completed these procedures.

Encountering an Empty Error Log

All or some of the error logging facility may be turned off. Use the command:

# ps -eaf | grep errdemon
This should display the process ID of the error daemon. If the command produces no output, the error daemon is inactive, which could be one cause of the empty error log. Use the command /usr/lib/errdemon to start the error daemon. You need root authority to run this command.
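
One caveat with the pipeline above is that the grep process itself may appear in the listing, so a line of output is not by itself proof that the daemon is running. The Python sketch below filters captured ps output accordingly. The function name errdemon_running and the sample lines are hypothetical, and the simple substring test is an assumption that suits this illustration rather than a robust process check.

```python
def errdemon_running(ps_output):
    """Return True if a process line mentions errdemon, ignoring the grep itself."""
    for line in ps_output.splitlines():
        if "errdemon" in line and "grep" not in line:
            return True
    return False


# Hypothetical captured output, for illustration only.
ONLY_GREP = "root 7012 6500 1 10:30:02 pts/1 0:00 grep errdemon\n"
WITH_DAEMON = (
    "root 2345 1 0 Dec 14 - 0:02 /usr/lib/errdemon\n"
    "root 7012 6500 1 10:30:02 pts/1 0:00 grep errdemon\n"
)

print(errdemon_running(ONLY_GREP))    # False
print(errdemon_running(WITH_DAEMON))  # True
```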

Return to the step in the problem solving procedures that brought you to step Retrieving an Error Log Report. If those procedures did not resolve the problem, report the problem to the IBM Support Center.

Retrieving a System Dump

A system dump acts as a picture of working memory when a severe error occurs. Generally, reading the contents of a system dump will enable a person trained to read system dumps to determine what went wrong with the system and why. This set of problem determination procedures may require you to do the following:


**** NOTE: **** Before you begin with the procedures of copying the system dump you must have six formatted 1.44MB diskettes available. Your system dump will be copied onto these diskettes enabling your service organization to determine and correct your system problems. Alternatively, you may copy the system dump on tapes as well.

Did your system stop with a flashing 888 displayed in the LED display?

Yes
Your system already has a dump saved on a dump device. Go to step Files Needed for Dump Analysis.
No
Go to step Forcing a System Dump.

Forcing a System Dump

Use the steps in this section to force a dump on a hung system or when you are requested to do so.

If you initiated the system dump yourself, you will immediately see 0c2 in the LED display. Within one minute, this should change to indicate whether the dump was successful. See Problem Determination for Inactive System for the description of the LED codes.


**** Caution **** Do not initiate a system dump if 888 is shown on the LED display. This means the system has already initiated a system dump automatically. If you initiate another system dump, you will overwrite the information in the primary dump device. Initiating a dump on a working system will crash it, and the system will need to be rebooted. Take this into account before following the steps in this section.

The methods for forcing a system dump to the primary dump device are:

Continue with Determining the Status of a System Dump.

Determining the Status of a System Dump

Find the value currently displayed in the LED display in the following list and take the prescribed action:

0c0
The dump completed successfully. Go to step Files Needed for Dump Analysis.
0c2
The dump is in progress. Wait for the dump to complete and for the LED display value to change. If the LED display value changes, find the new value on this list. If the value does not change, then the dump did not complete. Go to step Files Needed for Dump Analysis.
0c4
The dump did not complete successfully. A partial dump was written to the dump device, but there is not enough space on the dump device to contain the entire dump. In some cases, the dump may be readable. In some cases, though the dump device was large enough, the kernel crashed completely, without any chance of dumping. To prevent this problem from occurring again, you must increase the size of your dump media. See Changing the Dump Device Characteristics for changing dump space parameters. Go to step Files Needed for Dump Analysis for analyzing partial dumps.
0c5
The dump failed to start. An unexpected error occurred while the system was attempting to write to the dump media. Record this detail and report the problem to the IBM Support Center, and stop. You have completed these procedures.
0c6
This prompts you to make the secondary dump device available.
0c7
A network dump is in progress, and the host is waiting for the server to respond. The value in the LED display should alternate between 0c2 and 0c7. If the value does not change, then the dump did not complete. Go to step Files Needed for Dump Analysis.
0c8
The dump device has been disabled. The current system configuration does not designate a device for the requested dump. See Changing the Dump Device Characteristics for changing dump device parameters. Record this detail and report the problem to the IBM Support Center if you want further assistance, and stop. You have completed these procedures.
0c9
A system initiated dump started but did not complete. Wait one minute for the dump to complete and for the three-digit display value to change. If the LED display value changes, find the new value on this list. If the value does not change, then the dump did not complete. Go to step Files Needed for Dump Analysis.
000
The Kernel debugger is invoked. If there is an ASCII terminal attached to one of the native ports, you will find the Kernel debugger running on that. Type q dump at the debugger prompt (>) and follow the steps as in Problem Determination for Inactive System.
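
Several entries in the list above depend on whether the displayed value changes while you wait. As an illustration, the Python sketch below takes the LED values noted over the waiting period, oldest first, and reports the outcome. The function name dump_outcome is hypothetical, and the sketch assumes the readings cover the full waiting period, so that a final in-progress value means the display never moved on.

```python
# Values that this chapter describes as "in progress" or "started".
IN_PROGRESS = {"0c2", "0c7", "0c9"}


def dump_outcome(readings):
    """Classify a dump from the LED values noted over time (oldest first)."""
    if not readings:
        raise ValueError("no LED readings recorded")
    last = readings[-1].lower()
    if last in IN_PROGRESS:
        # The display never changed to a final status value.
        return "did not complete"
    if last == "0c0":
        return "completed"
    return "see list entry for " + last


print(dump_outcome(["0c2", "0c2", "0c0"]))  # completed
```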

Files Needed for Dump Analysis

You have reached this step because your system stopped with a flashing 888 and wrote either a complete or a partial dump, or because you forced a dump. Normally an unsuccessful or partial dump is useless. However, you may still decide to copy the dump to removable media and send it to the IBM Support Center for analysis.

The following files are needed to properly analyze a system dump:

A copy of the system dump, without the associated /unix file, is useless. In order to get this information, the system has to be working either in Service mode or in Normal mode. At this point, record carefully the LED code sequence after the 888. For example it could be 102-300-0c0 or 102-700-0c0. This code sequence is essential for proper analysis of the system dump. Try to reboot the system in the normal mode.

Are you able to boot the system in normal mode?

Yes
Go to Step Copying the System Dump Information to Tape or Diskette.
No
Go to Step Encountering a System that will not Restart.

Copying the System Dump Information to Tape or Diskette

You have reached this step to gather the files needed for the dump analysis.


**** NOTE: **** In the following instructions, replace /dev/rmt0 with the device name for your tape drive.

To copy the system dump information to tape:

  1. Login as root.
  2. Place a blank tape into the tape drive.
  3. Copy the appropriate information to the tape with:
    # /usr/sbin/snap -gfkD -o /dev/rmt0
    
  4. To verify the tape, use the following command, which should display an output similar to the following screen:
    # tar -tvf /dev/rmt0
    -rw-r--r-- 0 0 2933929 Dec 14 15:11:35 1992 ./dump/dump.Z
    -rw-r--r-- 0 0 791604 Dec 14 15:09:49 1992 ./dump/unix.Z
    -rw-r--r-- 0 0 4327 Dec 14 15:08:51 1992 ./general/errpt.out

    Note that the size of dump.Z in this example is 2933929 bytes and the size of unix.Z is 791604 bytes. It is important that the dump.Z and unix.Z files are not 0 bytes long. If either file is empty, something is wrong. In that case, do not send this information to the IBM Support Center, because there is nothing to analyze. Double check your steps, and write down any error messages that were displayed on the screen. Call the IBM Support Center if you continue to get 0 byte dump.Z or unix.Z files.
  5. With the following command, determine the block size used to back up the tape. The command should display an output similar to the following screen. In this example 512 is the block size.
    # lsattr -E -l rmt0 | grep block_size
    block_size 512 BLOCK size (0=variable length) True

  6. Label the tape with the following information.
  7. Send the labelled tape to IBM Support Center for analysis.
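
The block size reported in step 5 can be extracted from the captured lsattr output with a few lines of code, which is convenient when the value has to be written on the tape label. This Python sketch is illustrative only; the function name tape_block_size is hypothetical, and the parsing assumes the attribute name and its value are the first two fields of the line, as in the sample output above.

```python
def tape_block_size(lsattr_output):
    """Return the block_size value as an int, or None if the attribute is absent.

    Per the lsattr description text, a value of 0 means variable length.
    """
    for line in lsattr_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "block_size":
            return int(fields[1])
    return None


sample = "block_size 512 BLOCK size (0=variable length) True"
print(tape_block_size(sample))  # 512
```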

To copy the dump information to diskettes:

  1. You need about six formatted 1.44MB high density diskettes.
  2. Use the following command to copy the dump information:
    # /usr/sbin/snap -gfkD -o /dev/rfd0
    
  3. To verify the diskettes, use the following command, which should display an output similar to the following screen:
    # tar -tvf /dev/rfd0
    -rw-r--r-- 0 0 2933929 Dec 14 15:11:35 1992 ./dump/dump.Z
    -rw-r--r-- 0 0 791604 Dec 14 15:09:49 1992 ./dump/unix.Z
    -rw-r--r-- 0 0 4327 Dec 14 15:08:51 1992 ./general/errpt.out

    Check that none of these files is of 0 byte length.
  4. Label the diskettes properly (don't forget the command used) and send them to the IBM Support Center for analysis.
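
The zero-length check on the tar listing, which applies to both the tape and the diskette procedure, can be automated on captured output. The Python sketch below is an illustration under stated assumptions: the function name empty_files is hypothetical, and the size is assumed to be the fourth field and the file name the last field of each line, as in the listings shown in this section.

```python
def empty_files(tar_listing):
    """Return the names of listed files whose size field is 0."""
    empty = []
    for line in tar_listing.splitlines():
        fields = line.split()
        # Skip lines that do not look like a listing entry.
        if len(fields) < 5 or not fields[3].isdigit():
            continue
        size, name = int(fields[3]), fields[-1]
        if size == 0:
            empty.append(name)
    return empty


# First listing is the example from this section; the second is a
# hypothetical bad result with an empty dump file.
GOOD = """\
-rw-r--r-- 0 0 2933929 Dec 14 15:11:35 1992 ./dump/dump.Z
-rw-r--r-- 0 0 791604 Dec 14 15:09:49 1992 ./dump/unix.Z
-rw-r--r-- 0 0 4327 Dec 14 15:08:51 1992 ./general/errpt.out
"""
BAD = GOOD.replace("2933929", "0")

print(empty_files(GOOD))  # []
print(empty_files(BAD))   # ['./dump/dump.Z']
```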

Verifying if the Dump was Successful

See Problem Determination for Inactive System on actions to take in case of a system crash. See Flashing 888 LED Display, for the steps to record the LED values in case of a flashing 888 LED display. These values are explained in the same section.

Follow the steps below to verify if the dump was successful.

  1. Record the LED values after the system crash.
  2. Now, power the system off and back on, to boot the system in the normal mode.

    Are you able to boot the system in the normal mode?

    Yes
    Continue with the steps in this section.
    No
    Go to step Encountering a System that will not Restart.
  3. Log in as root.
  4. Verify the dump device with the following command, which should display an output similar to the following screen:
    # sysdumpdev -l
    primary /dev/hd7
    secondary /dev/sysdumpnull

    Note the primary dump device name, and use it where you see /dev/hd7 in the following examples. The default primary dump device is /dev/hd7. You could define another logical volume as the primary dump device.
  5. Verify the usability of the information in the dump device with the crash command, which should display an output similar to the following screen:
    # crash /dev/hd7
    Using /unix as the default namelist file.
    Reading in symbols........................

    If no warning messages are displayed, go to step 6. Otherwise, check whether the following message appears:


    # crash /dev/hd7
    Using /unix as the default namelist file.
    WARNING: dumpfile does not appear to match namelist

    If this warning message appears after you enter the crash command, either the crash did not take place or the /unix file does not match the information in the dump device. In this case, type q at the > prompt to quit the crash command. If you receive messages that the name list does not match, your /unix file does not match the kernel you are running. Please follow up with the IBM Support Center for assistance. Do not continue with the remaining steps and do not send the dump information to the IBM Support Center, because there is nothing to analyze.
  6. When you see the > prompt, use the command stat, which should display an output similar to the following screen:
    > stat
    sysname: AIX
    nodename: bruce
    release: 2
    version: 3
    machine: 000001873400
    time of crash: Sun Nov 14 16:09:28 1993
    age of system: 18 day, 8 hr., 31 min.

    If the time of the crash in the output approximately matches the time the system crashed, then the dump information may be sufficient for analysis. If you have any other lines of output from the stat command other than those listed above, please write them down. Continue with step Copying the System Dump Information to Tape or Diskette.

    If the time of the crash does not approximately match the time the system crashed, type q to exit the crash command. In this case, do not continue with these steps and do not send the crash information to the IBM Support Center, because it is not sufficient for analysis.
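
The comparison in step 6 between the reported time of crash and the time the system actually crashed can be sketched in code. The Python fragment below is illustrative only: the function name crash_time_matches is hypothetical, and the 30 minute tolerance is an arbitrary assumption standing in for "approximately". The date format string assumes English day and month names, as in the sample output.

```python
from datetime import datetime, timedelta


def crash_time_matches(stat_output, observed, tolerance=timedelta(minutes=30)):
    """Return True if the 'time of crash' line is within tolerance of observed.

    Returns False when the output has no 'time of crash' line.
    """
    prefix = "time of crash:"
    for line in stat_output.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            stamp = line[len(prefix):].strip()
            reported = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
            return abs(reported - observed) <= tolerance
    return False


STAT = """\
sysname: AIX
nodename: bruce
time of crash: Sun Nov 14 16:09:28 1993
age of system: 18 day, 8 hr., 31 min.
"""

print(crash_time_matches(STAT, datetime(1993, 11, 14, 16, 15)))  # True
```

The observed time is whatever the administrator noted when the system went down, so the tolerance should be chosen to reflect how precisely that was recorded.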

Encountering a System that will not Restart

The system dump cannot be retrieved until your system can be started again. Go to step Problem Determination: Symptom Index to begin diagnosing why the system cannot be restarted. After you have completed other procedures, return to this step.

Were you able to correct the problem and restart your system?

Yes
Return to the step from where you came to this step.
No
If a hardware problem was found by the problem determination procedures, report the problem to the IBM Support Center, and stop. You have completed these procedures.

Returning to Normal Operations

To restart the system, turn the Mode Switch to the Normal position and turn on your system, then stop. You have completed these procedures.

Hardware Problem Determination

Hardware problem determination is a large and involved topic. You have reached this step if any of the previous steps asked you to test the hardware. If AIX V3.2 is running at least on the console, there are many ways to test the remaining hardware. Describing such procedures is beyond the scope of this book because, depending on the symptom, the steps vary for each machine.

However the approach for hardware problem determination would be as follows:

If the answer was yes to any of the above questions, then you need to run the hardware diagnostics on the RISC System/6000. You may use the following publications to assist you in hardware problem determination.

Related Publications:

  • InfoExplorer
  • Common Diagnostics & Service Guide
  • RISC System/6000 Diagnostic Programs Version 1.2
  • SRN Cross Reference
  • AIX/6000 Problem Solving Guide and Reference

Problem Determination for X Window Hang/Crash

    If your machine has frozen or hung while you are using X Window (X) and the number 888 is displayed on your LED, then you are experiencing a system dump. You can gather the system dump and send it to the IBM Support Center.

    If your machine has frozen or hung while you are using X Window (X) and there is no display on the LED, then we need to determine whether X has crashed or is in a hang state. To determine the current state of X, do the following:

    1. Do not hot-key. Hot-keying to another shell will change the state of X if it was in a hang state. If you have already hot-keyed, you will have to recreate the situation to continue problem determination.
    2. rlogin or telnet into the system from another machine (remotely)
    3. Execute the command:
      # ps -ef | grep X
      
    4. Is X running? Do you have a process running that looks something like:
      /usr/lpp/X11/bin/X -D /usr/lib/X11/En_US/rgb
      

      Yes: The X process is still running, so X is in a hang state (see X Hang)

      No: The X process is no longer running so X has crashed (see X Crash)
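
The check in steps 3 and 4 can be sketched as a small shell function. This is only an illustration and not part of the original procedure: the name x_state is hypothetical, and the sketch assumes the X server shows up in the ps -ef output with a command path ending in /X, as in the example above. It reads ps output on standard input and matches the path rather than the bare letter X, so the grep process itself is never mistaken for the server.

```shell
#!/bin/sh
# x_state: read "ps -ef" output on stdin and print "hang" if an X server
# process is listed, "crash" otherwise. (Hypothetical helper name.)
x_state() {
    # A line counts as the X server if any field is a path ending in "/X".
    if awk '{ for (i = 1; i <= NF; i++) if ($i ~ /\/X$/) found = 1 }
            END { exit found ? 0 : 1 }'; then
        echo hang
    else
        echo crash
    fi
}

# Usage from the remote login session:
#   ps -ef | x_state
```

A result of "hang" corresponds to the Yes branch above and "crash" to the No branch.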

    X Crash


    **** Important **** The best possible way to solve an X crash problem is to be able to reproduce or recreate the problem. In other words, in most cases with an X crash, you will need to come up with some repeatable process that will cause X to crash.

    If you can come up with a repeatable process that will cause X to crash, report these steps to the IBM Support Center so they can try to recreate the crash in their labs (you do not need to perform the steps outlined below).

    The steps outlined below are attempts to gather information if the IBM Support Center cannot recreate the crash in their labs.

    If X has frozen and you were able to determine that the X process is no longer running, you have experienced an X crash. When X crashes, there should be a core file in the directory from which you started xinit or startx.

    Core File does not Exist in the Current Directory

    If a core file does not exist in the directory from which you started xinit or startx, one of two things has occurred:

    1. Write permission in current directory

      You do not have write permission in the directory from which you started xinit or startx. Without write permission, a core file cannot be created in the current directory. Restart X from a directory where you have write permission and try to recreate the X crash (continue reading the problem determination steps for an X crash with a core file; it may save you some time).

    2. Running out of paging space

      You could be running out of paging space and the operating system is killing X because it is the largest process or the process using the most paging space. To determine if your system is running out of paging space you can do the following:

      1. lsps -a command

        Create a script that will display the output of the lsps -a command periodically. Remember that after X has been killed, this command will report the paging space that X was using as freed. In other words, the output of this command is not useful after X has died; it is only useful right before X is killed.

      2. sar command

        Create a window and execute the sar command to watch the paging slots used. For example, run sar -r 1 1000 and watch the slots column, which represents the paging slots available. If this number drops to around 200, you will most likely run out of paging space.

      If you are running out of paging space and suspect that one of your processes is using up paging space and not returning it, you can verify this with a small script that runs the ps command periodically. Watch all processes to determine whether any are growing continuously.
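
The periodic monitoring described above could look like the following sketch. It is an illustration only: log_periodically is a hypothetical helper name, and the log file names and sampling intervals in the usage comments are assumptions, not values from the original procedure. The command to run is passed as an argument, so the same loop can record lsps -a output and ps output.

```shell
#!/bin/sh
# log_periodically INTERVAL COUNT LOGFILE COMMAND [ARGS...]
# Run COMMAND every INTERVAL seconds, COUNT times, appending a timestamp
# and the command output to LOGFILE. (Hypothetical helper name.)
log_periodically() {
    interval=$1
    count=$2
    logfile=$3
    shift 3
    i=0
    while [ "$i" -lt "$count" ]; do
        {
            echo "=== $(date) ==="
            "$@"
        } >> "$logfile" 2>&1
        i=$((i + 1))
        # Do not sleep after the final sample.
        [ "$i" -lt "$count" ] && sleep "$interval"
    done
    return 0
}

# Usage: sample paging space and process sizes every 10 seconds for an hour.
#   log_periodically 10 360 /tmp/lsps.log lsps -a &
#   log_periodically 10 360 /tmp/ps.log ps av &
```

Because each sample is appended to a file with a timestamp, the last entries written before X was killed remain available afterwards, which is the point at which the lsps -a output is still meaningful.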

    Core File Exists in the Current Directory

    If a core file exists in the directory from which you started xinit or startx, then the IBM Support Center may be able to use this core file for problem determination.


    **** Important **** Please keep in mind that a core file alone is usually not enough information to solve most of the X crash problems. As we mentioned above, the best possible way to solve X crash problems is to be able to reproduce or recreate the problem in the IBM Support Center's labs. The core file will tell them the last set of function calls that were executed before the crash.

    With this information you can usually tell what pointer is corrupt or incorrect. Unfortunately, the information telling you what trashed the pointer is no longer available. In other words, in most cases with an X crash, you will need to come up with some repeatable process that will cause X to crash.

    If you can come up with a repeatable process that will cause X to crash, report these steps to the IBM Support Center so they can try to recreate the crash in their labs. You do not need to do all of the steps below if you have a repeatable process.

    As you are trying to come up with a repeatable process that will cause X to crash, you can send the IBM Support Center the core file that was created and they may be able to tell something from it. Please send the following to the IBM Support Center for problem determination:

    1. The core file

      Verify that the core file is current (use the ls -al command to check the file's date and compare it with the time of the X crash).

      Also verify that it was X that core dumped. Run the command:

      # od -c core +0x730 | head -1
      

      Although the output of this command can be confusing, we are interested only in the first item after the leading 0000730, which should be a capital X (the rest of the output looks like garbage). If you see an X right after 0000730, then the core file is for the X crash.

    2. The X executable (/usr/lpp/X11/bin/X)
    3. The load2d file for the graphics adapter you are using: /usr/lpp/gai/adapter#.r4/load2d
      or
      (for GT4) /usr/lpp/gai/adapter4.r4/loadxx
      or
      (for GT1) /usr/lpp/gai/adapter5.r5/loadxx
      where:
      adapter# corresponds to the type of graphics adapter you are using.
      To determine the type of adapter you are using, do the following:
      1. # lsdisp < /dev/hft
      2. adapter# corresponds to the output you see from the command run in step 1.

        adapter1 = colorgda Color Graphics Display Adapter (Skyway)

        adapter2 = hiprf3d High Performance 3D Color Graphics processor (Sabine)

        adapter3 = hispd3d High Speed 3D Graphics Accelerator (730 or GTO)

        adapter4 = POWER_Gt3 Midrange Entry Graphics Adapter (Gt3) or POWER_Gt4 Midrange Graphics Adapter (Gt4) or POWER_Gt4x Midrange Graphics Adapter (Gt4x)

        adapter5 = POWER_Gt1 Frame Buffer Adapter (Gt1)

    4. The loadrms file for the graphics adapter you are using:

      /usr/lpp/gai/adapter#.r4/loadrms


      **** NOTE: **** See step 3 to determine adapter#.
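
Gathering the four items listed above into one archive could be sketched as follows. This is an illustration, not an IBM-supplied procedure: collect_x_crash_files is a hypothetical helper name, the archive name is an assumption, and the usage comment assumes adapter1 purely as an example. The function archives whichever of the named files exist and reports the ones that are missing.

```shell
#!/bin/sh
# collect_x_crash_files OUTFILE FILE...
# Create a tar archive OUTFILE containing every listed FILE that exists,
# and report any that are missing. (Hypothetical helper name.)
collect_x_crash_files() {
    outfile=$1
    shift
    present=""
    for f in "$@"; do
        if [ -f "$f" ]; then
            present="$present $f"
        else
            echo "missing: $f" >&2
        fi
    done
    # Nothing to archive: signal failure to the caller.
    [ -n "$present" ] || return 1
    # Paths are assumed to contain no whitespace, as in the list above,
    # so the unquoted expansion splits into one argument per file.
    tar -cf "$outfile" $present
}

# Usage, taking adapter1 as the example adapter type:
#   collect_x_crash_files /tmp/xcrash.tar ./core /usr/lpp/X11/bin/X \
#       /usr/lpp/gai/adapter1.r4/load2d /usr/lpp/gai/adapter1.r4/loadrms
```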

    Core File Exists in the Current Directory and You Can Keep X Down

    If you have a core file in the directory from which you started xinit or startx and you can keep your system or X down long enough for IBM to perform some additional problem determination, the IBM Support Center may be able to gather some useful information:

    1. Restart X
    2. Use rlogin or telnet command from another machine to login to the machine running X
    3. Determine the process ID of X:
      # ps -ef | grep X
      
    4. Create a file called /tmp/dbx.ignore. This file should look like the file described in dbx Ignore File.
    5. From the remote machine, attach dbx to the X process id:
      # dbx -c /tmp/dbx.ignore -a <processid_for_X>
      
    6. Type cont from dbx. This will allow you to use X on the system that saw the crash.
    7. Once X crashes, call the IBM Support Center to set up a time for more problem determination.
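
Steps 3 and 5 above can be combined in a short sketch. The helper name x_pid is hypothetical, and the sketch assumes that the process ID is the second column of the ps -ef output and that the X server's command path ends in /X, as in the example process line shown earlier in this chapter.

```shell
#!/bin/sh
# x_pid: read "ps -ef" output on stdin and print the process ID (second
# column) of the first process whose command path ends in "/X".
# (Hypothetical helper name.)
x_pid() {
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /\/X$/) { print $2; exit } }'
}

# Usage from the remote session, attaching dbx only if X was found:
#   pid=$(ps -ef | x_pid)
#   [ -n "$pid" ] && dbx -c /tmp/dbx.ignore -a "$pid"
```

Matching on the path instead of the letter X avoids picking up the grep process or other commands that merely contain an X.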

    X Hang

    If X has frozen and you were able to determine that the X process is still running, you are experiencing an X hang.

    Is the Process Still Using CPU?

    1. rlogin or telnet into the system that has the X hang
    2. Run ps av | grep X
    3. Watch the CPU utilization of X. Is it growing?
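
One way to answer the question in step 3 is to compare two samples of the accumulated CPU time taken some seconds apart. The sketch below is an illustration under stated assumptions: the helper names cpu_seconds and cpu_is_growing are hypothetical, the TIME value is assumed to be the fourth column of the ps av output, and only the minutes:seconds and hours:minutes:seconds forms are handled.

```shell
#!/bin/sh
# cpu_seconds: convert a ps TIME value such as "12:34" or "1:02:03"
# into a number of seconds. (Hypothetical helper name.)
cpu_seconds() {
    echo "$1" | awk -F: '{ s = 0
                           for (i = 1; i <= NF; i++) s = s * 60 + $i
                           print s }'
}

# cpu_is_growing BEFORE AFTER: succeed if the second TIME value is larger
# than the first. (Hypothetical helper name.)
cpu_is_growing() {
    [ "$(cpu_seconds "$2")" -gt "$(cpu_seconds "$1")" ]
}

# Usage, with 4321 standing in for the process ID of X and TIME assumed
# to be column 4 of "ps av":
#   t1=$(ps av | awk '$1 == 4321 { print $4 }')
#   sleep 30
#   t2=$(ps av | awk '$1 == 4321 { print $4 }')
#   cpu_is_growing "$t1" "$t2" && echo "X is still using CPU"
```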

    Attach dbx to the Running X Process for Problem Determination

    1. rlogin or telnet into the system that has the X hang
    2. Determine the process id of X: ps -ef | grep X
    3. Attach dbx to this process and pipe the output into the tee command. The tee command will build a file with all the information gathered. You may call this output file anything you want.
      # dbx -a <processid_for_X> | tee <output_file>
      
    4. Within dbx obtain the stack trace:
      > where
      
    5. Make a note of the last function executed, which is the function at the top of the where output. Is this a system call?
    6. Within dbx obtain the contents of the registers:
      > X
      
    7. Within dbx obtain the last 10 assembler instructions executed:
      > listi
      
    8. Determine if X is in a loop:
      1. Try to have X continue for 50 steps:
        > cont
        > ^c   (hit the control key with c)
        
      2. When X hangs again, get another stack trace:
        > where
        

        Is it close to the same place as in step 4 above? (In other words, are the addresses close?) If so, X is in a loop.

    9. Detach from dbx:
      > detach
      
    10. If the last function in the stack trace is a system call (step 5 above), then do the following:
      1. Run crash and gather the stack trace, pipe the output to the tee command. Do not overwrite the file created in steps 1-9 above.
        # crash | tee <output_filename>
        
      2. Within crash obtain the slot number for X:
        > p
        

        This will list all the processes currently running; one of these will be X. Note the number that appears in the leftmost column of the line that contains the X process. This is the slot number for X.

      3. Within crash get the traceback for X:
        > t <X's_slot_number>
        
      4. Quit crash
        > quit
        
    11. Gather the tee output files from steps 1-10 and send them to the IBM Support Center for analysis.

    Tracing

    To trace a system which is experiencing unpredictable X hang problems:

    dbx Ignore File

    This section shows an example of the file /tmp/dbx.ignore. This file contains dbx subcommands. When the dbx command is run with the -c /tmp/dbx.ignore flag, the subcommands in this file are run before dbx reads from standard input.


    ignore HUP
    ignore ALRM
    ignore URG
    ignore STOP
    ignore TSTP
    ignore CONT
    ignore CHLD
    ignore TTIN
    ignore TTOU
    ignore IO
    ignore XCPU
    ignore XFSZ
    ignore MSG
    ignore WINCH
    ignore PWR
    ignore USR1
    ignore USR2
    ignore PROF
    ignore DANGER
    ignore VTALRM
    ignore MIGRATE
    ignore PRE
    ignore GRANT
    ignore RETRACT
    ignore SOUND
    ignore SAK
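
Rather than typing the file by hand, it could be generated with a short shell function. This is a minimal sketch: the name write_dbx_ignore is hypothetical, and the signal names are copied from the example above in the same order.

```shell
#!/bin/sh
# write_dbx_ignore FILE: write one "ignore <SIGNAL>" dbx subcommand per
# signal name to FILE. (Hypothetical helper name; the signal names are
# those listed in the example file above.)
write_dbx_ignore() {
    for sig in HUP ALRM URG STOP TSTP CONT CHLD TTIN TTOU IO XCPU XFSZ \
               MSG WINCH PWR USR1 USR2 PROF DANGER VTALRM MIGRATE PRE \
               GRANT RETRACT SOUND SAK; do
        echo "ignore $sig"
    done > "$1"
}

# Usage:
#   write_dbx_ignore /tmp/dbx.ignore
```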