DECsystem 5900
------------------------------------------------------------
Service Guide

Order Number: EK-D590A-PS. C01

Digital Equipment Corporation
Maynard, Massachusetts
------------------------------------------------------------

First Printing, December 1991
Revised June 1992, April 1993

The information in this document is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation. Digital Equipment Corporation assumes no responsibility for any errors that may appear in this document.

Restricted Rights: Use, duplication, or disclosure by the U.S. Government is subject to restrictions as set forth in subparagraph (c) (1) (ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.227-7013.

© Digital Equipment Corporation 1993. All Rights Reserved. Printed in U.S.A.

The following are trademarks of Digital Equipment Corporation: CI, CompacTape, DEC, DECconnect, DECnet, DECserver, DECsystem 5900, DECwindows, RRD40, RRD50, RX, ThinWire, TK, TS, TU, TURBOchannel, ULTRIX, VAX, VAX DOCUMENT, VMS, VT, and the Digital logo.

Prestoserve is a trademark of Legato Systems Inc.

All other trademarks and registered trademarks are the property of their respective holders.

FCC Notice: The equipment described in this manual generates, uses, and may emit radio frequency energy. The equipment has been type tested and found to comply with the limits for a Class A computing device pursuant to Subpart J of Part 15 of FCC Rules, which are designed to provide reasonable protection against such radio frequency interference when operated in a commercial environment. Operation of this equipment in a residential area may cause interference, in which case the user at his own expense may be required to take measures to correct the interference.

S2106

This document was prepared using VAX DOCUMENT, Version 2.1.

------------------------------------------------------------
Contents

Preface
1 DECsystem 5900 Hardware
  1.1 System Power Controller
  1.2 Power Control Sequence
  1.3 CPU Drawer Switches
  1.4 CPU Drawer Main Components
  1.5 CPU Drawer Rear Panel
2 DECsystem 5900 Logic
3 System Module Main Components
  3.1 CPU Daughter Card Physical Fit
  3.2 Memory Modules
  3.3 NVRAM Module
  3.4 TURBOchannel and TURBOchannel Extender
4 Mass Storage Drawers
  4.1 ULTRIX Requirements for Mass Storage
  4.2 SCSI Interface
  4.3 SCSI Bus Cables
  4.4 SCSI Bus IDs
  4.5 Ethernet Interface
  4.6 Communications Interface
5 Console Commands
6 Console Error Messages
7 Environment Variables
8 Console Password Security
  8.1 Entering a Password
  8.2 Clearing a Password
  8.3 Clearing a Forgotten Password
9 Powering Up the System
  9.1 Powering Up with Two CPU Drawers
10 Troubleshooting and Diagnostics
  10.1 Checking for Power Problems
  10.2 Power-Up Self-Tests
  10.3 Configuration Utility (cnfg)
  10.4 CPU Daughter Card Diagnostic LEDs
11 Extended Testing
  11.1 Individual Tests
  11.2 Diagnosing the NVRAM
  11.3 Arming and Disarming the Battery
  11.4 Clearing the NVRAM
  11.5 System Module ROM Diagnostics
  11.6 ULTRIX System Exercisers
12 System Software Management
  12.1 Booting System Software
    12.1.1 Booting from Disk or Tape
    12.1.2 Booting Over the Network
  12.2 Shutting Down System Software
13 Using Error Logs
  13.1 erl
  13.2 Examining Error Logs
  13.3 Distinguishing Event Types
  13.4 Troubleshooting Mass Storage Devices Using uerf Error Logs
  13.5 Error and Status Register Error Logs
  13.6 System Overheat Error Messages
14 Major FRU Replacement
  14.1 Pulling Out a Drawer
  14.2 Mass Storage Drawer Power Supply
  14.3 Removing the CPU Power Supply
  14.4 Removing the System Skirts
  14.5 Mass Storage Drawer SCSI Cables
  14.6 CPU Drawer
  14.7 Replacing the System Module
  14.8 Replacing the CPU Daughter Card
15 Adding or Replacing Mass Storage Devices
  15.1 Installing a Drive
  15.2 Configuring Disks in the Mass Storage Drawer
  15.3 Removing the Faulty Device
16 Field-Replaceable Units

Figures
1 DECsystem 5900 Cabinet, Front View
2 Cabling from CPU
3 Clear NVR and Other System Module Jumpers
4 Power Controller Switches
5 Rear Bracket Screws
6 Front Hex Nuts
7 Media in the Mass Storage Drawer
8 CPU Drawer Subsystems, Side View

Tables
1 SCSI ID Factory Configuration
2 Console Commands
3 Error Messages and Their Meanings
4 Console Commands for Environment Variables
5 Environment Variables Set by the User
6 Console Privileges
7 CPU Daughter Card LED Diagnostics
8 Individual Tests
9 Error Log to Physical Drive Conversion
10 Field-Replaceable Units

------------------------------------------------------------
Preface

This guide is intended for Digital services personnel and self-maintenance customers. It contains the information needed to diagnose the DECsystem 5900 product and correct problems on site.

Your system uses one of two CPU daughter cards, either the R3000A or the R4400. The R3000A is referred to in the screen displays as KN03-AA. The R4400 is referred to in the screen displays as KN05. See Section 16 for their respective part numbers.

Related Documents

The following documents are related to servicing the DECsystem 5900 product:

ULTRIX Pocket Service Guide, EK-ULT32-PG
DECsystem 5900 CPU System Manual, EK-D590A-SM
DECsystem 5900/260 CPU Card Installation, EK-D5960-IN
DECsystem 5900 Enclosure Maintenance Manual, EK-D590A-EN
DECsystem 5900 Owner's Manual, EK-D590A-OG
DECsystem 5900 Installation Guide, EK-D590A-IN
DECsystem 5900 Site Preparation Guide, EK-D590A-SP
DECsystem 5900 Illustrated Parts Breakdown, EK-D590A-IP
RZxx Disk Drive Subsystem Pocket Service Guide, EK-RZxxD-PS

Conventions

The following conventions are used in this manual:

------------------------------------------------------------
Convention   Meaning
------------------------------------------------------------
bold         Boldface type indicates user input.
Note         Notes provide general information about the current topic.
Caution      Cautions provide information to prevent damage to equipment or software. Read these carefully.
Warning      Warnings provide information to prevent personal injury. Read these carefully.
------------------------------------------------------------

1 DECsystem 5900 Hardware

The DECsystem 5900 ULTRIX RISC system is intended for file-server or network-server applications. The DECsystem 5900 is a data-center system enclosed in a 67-inch-high H9A00 cabinet with six standard 19-inch rack-mounted modular drawers, four of which are for mass storage.

------------------------------------------------------------
Warning
------------------------------------------------------------
The stabilizer bar must be extended from beneath the front of the cabinet whenever any drawer is pulled out. Pull out only one drawer at a time.
------------------------------------------------------------

The cabinet drawers are shown in Figure 1.

(1) Mass storage drawer (slot 6)
(2) Mass storage drawer (slot 5)
(3) Mass storage drawer (slot 4)
(4) CPU drawer (slot 3)
(5) Second CPU drawer, optional (slot 2)
(6) Mass storage drawer or StorageServer 100 (slot 1)
(7) Power controller in bottom rear
(8) Stabilizer bar

------------------------------------------------------------
Note
------------------------------------------------------------
A second CPU drawer may be installed in the cabinet to create a dual DECsystem 5900 configuration.
------------------------------------------------------------

Figure 1 DECsystem 5900 Cabinet, Front View

1.1 System Power Controller

AC power enters the system through the power controller, which distributes it to the power supply in each drawer. The power controller is located at the very bottom of the enclosure. It is normally configured as a single, switch-controlled point of system power. See Section 1.3.

1.2 Power Control Sequence

A power sequence cable with three-pin connectors connects the power controller to the CPU drawer and provides a single point of system power control using the upper front switch of the CPU drawer. This assumes that the Remote/Local switch on the power controller is in the remote (up) position.

1.3 CPU Drawer Switches

The CPU drawer has two switches on the front.

The upper switch is connected to the power controller. If the power controller is in the remote position, the upper front switch powers the entire system and all drawers, as long as the drawers themselves are on. In the case of dual CPU drawers, only one needs to be cabled to the power controller and only that one needs to be switched on.

The lower CPU front switch powers only the CPU drawer. The lower switch controls the CPU drawer whether the power controller is in the remote or the local position. The lower switch does not affect other drawers.

When the system is a dual DECsystem 5900 cabinet with two CPU drawers, either CPU upper switch controls the power for all drawers in the enclosure.

1.4 CPU Drawer Main Components

The main components in the CPU drawer are as follows:

· Power supply and cabling; a 244 W power supply is in the front of the drawer.
· System module
· CPU daughter card
· Single in-line memory modules (SIMMs)
· NVRAM (nonvolatile RAM) module
· TURBOchannel Extender
· TURBOchannel option slots

Figure 2 Cabling from CPU

1.5 CPU Drawer Rear Panel

The CPU drawer rear panel has the controls and indicators shown in Figure 2.

(1) Not used
(2) TURBOchannel Extender option slots
(3) TURBOchannel Extender I/O (connected to 7)
(4) Power cable inlet
(5) Remote power sequence connector
(6) System module SCSI port
(7) TURBOchannel Extender adapter (connected to 3)
(8) Standard Ethernet
(9) Diagnostic LEDs
(10) Halt switch
(11) System console port
(12) TURBOchannel slot 1
(13) Communications port
(14) TURBOchannel option slot 2
(15) Alternate console

2 DECsystem 5900 Logic

The DECsystem 5900 logic is made up of five main subsystems: memory, SCSI, Ethernet, TURBOchannel, and the serial communication lines. In addition, the clocking and interrupt handling logic coordinate the subsystems.

The main Application Specific Integrated Circuits (ASICs) are the Memory Buffer (MB), Memory TURBOchannel (MT), Memory Subsystem (MS), and I/O Control (IOCTL). (In the R4400, MB and MT are not separate.)

The CPU daughter card is either the R3000A or the R4400. Both mount identically on the system module and provide the following functions:

· MIPS processor, CPU/FPU
· Processor interface
· Memory and TURBOchannel interface
· Clock logic

MB Logic

The Memory Buffer ASIC interfaces the CPU to the rest of the system.

MT Logic

The Memory TURBOchannel ASIC interfaces the CPU to memory and the TURBOchannel I/O interface. For example, data can come from the SCSI bus and be transferred by DMA into memory through the MT ASIC without CPU intervention.

MS ASIC

The Memory Subsystem (MS) ASIC is located on the system module and provides the interface to the memory SIMMs in the memory slots. Address and data pass through this ASIC. The MS ASIC also provides memory timing signals. Also on the system module are the SCSI, Ethernet, Serial Communication Chip, and Real-Time Clock, together with the supporting hardware for these devices.

IOCTL ASIC

I/O devices are controlled by the I/O Control (IOCTL) ASIC. The IOCTL ASIC is the interface to memory and other parts of the system for TURBOchannel slots 0, 1, and 2. Because all I/O data passes through the IOCTL ASIC, which also provides the control signals, you must specify the TURBOchannel path when booting or testing a device on that slot. For example, the command cnfg 1 shows the devices on TURBOchannel slot 1.

For the system to perform its power-up tests, it must first read the ROM by means of the IOCTL, MT, and MB ASICs.
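
The slot-qualified form that the console expects, as in cnfg 1, can be illustrated with a short sketch. The Python below is only an illustration and is not part of the console firmware: the function names slot_path and console_command are hypothetical, and the slot numbering (0 through 2 for the TURBOchannel slots, 3 for the base system) is taken from the cnfg, t, and sh examples in Table 2 of this guide.

```python
# Illustrative sketch only: composes console command strings in the
# "#/path" form used in this guide. The helper names are hypothetical.

# Slot numbers as described in this guide (assumption: no other values).
VALID_SLOTS = {
    0: "TURBOchannel slot 0 (TURBOchannel Extender adapter)",
    1: "TURBOchannel slot 1",
    2: "TURBOchannel slot 2",
    3: "base system (system module)",
}


def slot_path(slot, *parts):
    """Return a slot-qualified path such as '3/scsi/target'."""
    if slot not in VALID_SLOTS:
        raise ValueError(f"slot must be one of {sorted(VALID_SLOTS)}, got {slot!r}")
    return "/".join([str(slot), *parts])


def console_command(verb, slot, *parts, args=()):
    """Compose a command line, for example 't 3/scsi/target 5'."""
    target = slot_path(slot, *parts)
    return " ".join([verb, target, *map(str, args)])


if __name__ == "__main__":
    print(console_command("cnfg", 1))                           # cnfg 1
    print(console_command("t", 3, "scsi", "target", args=[5]))  # t 3/scsi/target 5
```

The point of the sketch is that the slot number always leads the path, so a command aimed at the wrong slot addresses a different module entirely.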

3 System Module Main Components

The system module is the largest printed circuit board in the CPU drawer and is screwed directly to the floor of the CPU drawer. The system module supports the CPU daughter card, Ethernet, SCSI bus, communication ports, memory modules, and up to three TURBOchannel options. Major elements on the system board include:

· 256-Kbyte power-up self-test and bootstrap ROM
· System control and status registers and LEDs
· RTC-based system clock and 50-byte (five-year) battery backed-up RAM
· SCC-based serial lines
· Two RS232 asynchronous serial communications ports
· Error address status register
· ECC error check/syndrome status register
· LANCE-based network interface for Ethernet
· Disk/tape interface for SCSI peripherals
· Three TURBOchannel I/O option connectors
· DMA for SCSI, Ethernet, and two communications ports
· Halt switch

3.1 CPU Daughter Card Physical Fit

The CPU card is a daughter card held to the system module by four standoffs and a dual card-edge connector. The R3000A CPU daughter card and the R4400 CPU daughter card fit into the system module identically.

3.2 Memory Modules

Up to fourteen 32-MB MS02-CA single in-line memory modules (SIMMs) may be connected to the system module, for a maximum total of 448 MB. The standard DECsystem 5900 has a minimum of two 32-MB MS02-CAs. Install SIMMs in memory slots 0-13 only.

3.3 NVRAM Module

A 1-MB NVRAM (nonvolatile RAM) module can be installed in the last memory slot, slot 14. This memory slot is dedicated to the NVRAM and is the only one in which the NVRAM can be used. The NVRAM module is the hardware that supports the Prestoserve NFS accelerator. The NVRAM has two LEDs that indicate the condition of the on-board battery. See Section 11.2.

3.4 TURBOchannel and TURBOchannel Extender

TURBOchannel is an I/O interface. The system module contains three TURBOchannel option slots. One of these slots (0) is preconfigured with an adapter module that is used to connect a TURBOchannel Extender (TCE). The remaining two slots (1 and 2) may be used for one dual or two single TURBOchannel option modules.

The TURBOchannel Extender is a standard feature of the DECsystem 5900 product. The TCE is mounted in the hinged metal cover of the CPU drawer.

The TURBOchannel Extender allows a two- or three-slot TURBOchannel option module to be connected while physically taking up only one TURBOchannel slot. This leaves two slots available for other TURBOchannel options. Without the TURBOchannel Extender, certain TURBOchannel options could use up all three system TURBOchannel slots, preventing the installation of other options.

TURBOchannel enables connections of 100 MB per second to TURBOchannel options such as:

PMAZ (single-ended SCSI)
PMAD (Ethernet)
DEFZA (FDDI)

As shown in Figure 2, there are three openings in the rear panel of the CPU drawer to allow connection to TURBOchannel I/O devices. These are the three TURBOchannel controller ports. Slot 0, on the left as viewed from the rear of the drawer, contains a TURBOchannel adapter module that is connected to the TURBOchannel Extender module (TCE). The TCE provides a place to mount a one-, two-, or three-slot option without using all of the system TURBOchannel slots.

------------------------------------------------------------
Note
------------------------------------------------------------
Only one TURBOchannel option can be placed on the TCE module. This module does not provide three additional slots but rather extends one slot outside of the system board.
------------------------------------------------------------

4 Mass Storage Drawers

The mass storage drawers have the following characteristics:

· 8.75-inch high mass storage drawers
· Up to 7 full- and/or half-height SCSI devices per mass storage drawer
· Up to four mass storage drawers per cabinet, for a total of up to 28 SCSI mass storage devices
· SCSI bus in each mass storage drawer that can be split into two separate buses (see Section 4.3)
· Separate power supplies and cooling fans, allowing mass storage drawers to be turned off independently of the whole system
· Six 5/16-inch hex nuts that secure the drawer to the front of the cabinet frame (see Figure 6)
· Rear power switch
· Rear bracket screws external to the drawer

4.1 ULTRIX Requirements for Mass Storage

ULTRIX does not support powering down while ULTRIX is running. Run the shutdown utility before powering down the system or individual mass storage or CPU drawers.

4.2 SCSI Interface

The system module (see Figure 8, callout 11) contains a 53C94-based DMA SCSI interface similar to that of previous Digital RISC systems. Each SCSI controller controls up to 7 SCSI devices.

There can be three optional TURBOchannel (PMAZ) controllers on the CPU board. A TURBOchannel Extender is connected to one of these TURBOchannel slots. The maximum number of mass storage devices is 28 (four SCSI controllers, the one on the system module plus three PMAZ options, with seven devices each).

The SCSI controller on the system module and the TURBOchannel (PMAZ) options are connected by means of SCSI cabling to the mass storage drawers and/or remote SCSI options. A terminator must be located at the end of each SCSI bus.

4.3 SCSI Bus Cables

Each mass storage drawer contains two separate bus cables. Each bus cable has five SCSI connectors. These two SCSI buses can be joined to connect up to seven SCSI devices to the controller.

Alternatively, the SCSI bus cables can be kept separate and terminated (not joined together).
In this configuration, performance is increased because there are two SCSI buses with fewer drives on each controller. However, the system as a whole cannot then reach 28 SCSI devices, because each of the available controllers has fewer than the maximum number of drives.

4.4 SCSI Bus IDs

On each SCSI bus, the individual disks are tagged with SCSI address numbers. No two SCSI disks on the same bus may have the same number. If SCSI devices are added or replaced, they must have the proper address stickers attached. See Table 1.

Table 1 SCSI ID Factory Configuration
------------------------------------------------------------
Device                         ID
------------------------------------------------------------
CPU SCSI adapter               7
First removable device         5
Boot disk or first disk        0
Remaining disks, in order      1, 2, 3, 4, 6, 5
Remaining removable devices    6, 4
------------------------------------------------------------

4.5 Ethernet Interface

The Ethernet hardware design is similar to that used in previous Digital products.

4.6 Communications Interface

Two RS-232 ports are provided on the system module. One of these ports (see Figure 2, callout 11) is normally used for the system console. These ports provide full modem support in the hardware.

5 Console Commands

The console commands in Table 2 are used for system and module diagnostics.

Table 2 Console Commands
------------------------------------------------------------
Command    Arguments
------------------------------------------------------------
?          --
           Displays console commands.
------------------------------------------------------------
boot       [[-z #] [-n] #/path [arg...]]
           Boots the system software.
           Example: R>boot uses the current boot environment variable.
------------------------------------------------------------
cat        SCRPT
           Displays the contents of an individual script.
           Example: cat 3/test-rtc displays the individual self-tests contained in the test-rtc script in the base system.
------------------------------------------------------------
cnfg       [#]
           Displays system or specific module configuration information.
           Example: cnfg 3 displays the base system configuration. When [#] is 0, 1, or 2, the listing shows what is in that TURBOchannel slot.
------------------------------------------------------------
d          [-bhw] [-S #] RNG val
           Deposits data to a specified location in memory.
           Example: d -b -2 0x1A0000000 deposits bytes twice into a system memory location.
------------------------------------------------------------
e          [-bhwcdoux] [-S #] RNG
           Examines memory.
           Example: e -b 0x1E000000-0x1E7FFFFF examines as hexadecimal bytes the contents of the TURBOchannel slot 0 memory location.
------------------------------------------------------------
erl        [-c]
           Displays or clears the console error log.
           Example: erl -c clears the console error log.
------------------------------------------------------------
go         [ADR]
           Transfers control to a specified address in memory.
           Example: go 0x1FC00000 transfers control to the first byte in the Boot ROM memory.
------------------------------------------------------------
init       [#] [-m] [ARG...]
           Resets system(s).
           Example: init 3 resets SCSI devices attached to a base system SCSI controller.
------------------------------------------------------------
ls         [#]
           Displays a list of all scripts available for TC slot #.
           Example: ls 3 gives the scripts available to test the system module and memory.
------------------------------------------------------------
passwd     [-c] [-s]
           Sets or clears a console password.
           Example: passwd -c clears an existing password.
------------------------------------------------------------
printenv   [ENV]
           Displays the environment variable table.
           Example: printenv displays the entire table.
------------------------------------------------------------
restart    --
           Restarts software.
           Example: restart.
------------------------------------------------------------
script     SCRPT
           Creates a custom test script.
           Example: script fun creates a script called fun.
------------------------------------------------------------
setenv     EVN STR
           Sets an environment variable.
           Example: setenv more 0 specifies scrolling to the end.
------------------------------------------------------------
sh         [-belvS] [SCRPT] [ARG...]
           Invokes a specified script.
           Example: sh 2/pst-t runs through the power-up self-test script for the SCSI controller and drives in slot 2.
------------------------------------------------------------
t          [-l] [#/STR] [ARG...]
           Invokes specific module diagnostics.
           Example: t 3/scsi/target 5 tests unit 5 on the system (3) SCSI controller.
------------------------------------------------------------
test       --
           Runs diagnostics for the whole configuration.
------------------------------------------------------------
unsetenv   EVN
           Deletes an environment variable until power cycle or reset.
           Example: unsetenv testaction disables the previous setting for the testaction environment variable.
------------------------------------------------------------

6 Console Error Messages

A typical console error message has the following syntax:

?TFL:#/test_name (n:description) [module]

where

#            is the slot number of the failing module.
test_name    is the name of the failed test.
n            is the part of the test that failed.
description  describes the failure.
[module]     is the module identification number.

Table 3 lists the error message prefixes and their meanings.
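
When error messages are captured from a console log, the syntax described above can be split into its fields mechanically. The following Python sketch is an illustration only, assuming messages follow the ?TFL form exactly as written above; the function name parse_tfl and the sample message are hypothetical and were not taken from a real console.

```python
import re

# Pattern follows the syntax given in this section:
#   ?TFL:#/test_name (n:description) [module]
TFL_PATTERN = re.compile(
    r"^\?TFL:\s*(\d+)/(\S+)"        # slot number and test name
    r"\s+\(([^:()]+):\s*([^)]*)\)"  # (n:description)
    r"\s+\[([^\]]*)\]\s*$"          # [module]
)

FIELDS = ("slot", "test_name", "part", "description", "module")


def parse_tfl(line):
    """Split a ?TFL message into its fields; return None if it does not match."""
    match = TFL_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = dict(zip(FIELDS, match.groups()))
    fields["slot"] = int(fields["slot"])  # slot is numeric; n is kept as text
    return fields


if __name__ == "__main__":
    # Made-up sample, for illustration only.
    sample = "?TFL:3/scsi/target (2:made-up failure text) [MODULE-ID]"
    print(parse_tfl(sample))
```

The slot field identifies which module to investigate first; messages with other prefixes are not matched by this sketch and return None.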

Table 3 Error Messages and Their Meanings
------------------------------------------------------------
Message            Meaning
------------------------------------------------------------
?EV:name           Specified environment variable does not exist.
?EVV:value         Specified environment variable value is invalid.
?IO:slot/device    I/O device reports error.
?PDE3:slot         Option in specified slot contains old firmware.
?SNF:script        Specified script not found.
?STF:KBD           Keyboard self-test failed (disregard).
?STF:Pntr          Pointing device self-test failed (disregard).
?STX               Syntax error.
?TFL               Test failure.
?TNF               Test not found.
------------------------------------------------------------

7 Environment Variables

Environment variables store system parameters and scripts and pass information to the operating system. All valid environment variables are stored in nonvolatile RAM (NVR) until you change them or reset the NVR.

Table 4 summarizes the console commands for setting environment variables. Table 5 lists the standard environment variables that you can set.

Table 4 Console Commands for Environment Variables
------------------------------------------------------------
Command     Description
------------------------------------------------------------
printenv    Displays the value of environment variables.
setenv      Sets the value of an environment variable.
unsetenv    Temporarily removes an environment variable. If you cycle power or reset the system, the environment variable returns to its previous state.
------------------------------------------------------------

Table 5 Environment Variables Set by the User
------------------------------------------------------------
Variable         Meaning
------------------------------------------------------------
boot [1]         Specifies arguments for the boot command.

console [1]      Chooses the system console. You normally do not set this variable. Any setting except s, including the default "blank" setting, selects autoconfiguration and makes your terminal the system console. Looking at the rear of the CPU drawer, the terminal connected to the leftmost serial port will be the console.

haltaction [1]   Specifies what happens when you press the halt button or turn on the power.
                 b  Boots the system software, as specified by the boot environment variable.
                 h  Halts the system software and displays the console prompt.
                 r  Restarts the system software. If the restart fails, it boots the software.

more             Sets the screen pagination. In console mode the text scrolls the number of lines specified. If you set the number to zero, the text scrolls continuously.

testaction [1]   Sets the default power-up test.
                 t  Specifies a thorough power-up test, which takes up to one hour depending on system memory.
                 q  Specifies a quick power-up test; default.

#                The number of the module that contains the current script. If no script is active, the system module is assumed.
------------------------------------------------------------
[1] These environment variables stay in nonvolatile RAM until you change or reset them.
------------------------------------------------------------

8 Console Password Security

The DECsystem 5900 console security features include password protection. See Table 6 for console privileges.

Table 6 Console Privileges
------------------------------------------------------------
Prompt   Privilege
------------------------------------------------------------
>>       Unrestricted use of all console commands
R>       Restricted mode; the passwd and boot commands can be used, but with no qualifiers.
------------------------------------------------------------

8.1 Entering a Password

To enter a password, proceed as follows:

1. Enter passwd -s and press Return. The system prompts you to enter a password: pwd:.
2. Enter a password of between 6 and 32 characters. The password is case sensitive. Once the password is entered, the system prompts you to enter it again.
3. Enter the password again.
   If the password matches the one previously entered, it becomes the new password. The system displays the console prompt.
4. At this prompt, enter passwd and press Return twice. The system then displays the restricted prompt: R>.
5. At this prompt, you can only boot the system software or enter the console password.
6. To enter the password at the secure prompt, enter passwd and press Return. The system prompts you for the password. If the password is correct, the system displays the unrestricted console prompt: >>. At this prompt, you can enter all the console commands described in Section 5.

8.2 Clearing a Password

To clear a password, use the -c option of the passwd command. At the unrestricted console prompt enter:

>> passwd -c

8.3 Clearing a Forgotten Password

If the system is in restricted mode and you forget the password, the contents of the nonvolatile RAM (NVR), which contains the password, must be cleared. To clear the NVR of the password, use the Clear NVR jumper pins [1] as follows:

1. Power down the CPU drawer by turning off the CPU drawer switch.
2. Pull out the CPU drawer following the procedure in Section 14.1.
3. Open the middle cover over the CPU drawer and lock it in place.
4. Locate the two Clear NVR pins to the left of the CPU daughter card. See Figure 3.
5. Short the two pins of the Clear NVR jumper. (If you do not have an NVR jumper, use a screwdriver blade.)
6. While keeping the pins shorted, turn on the system and wait for the console prompt to appear.
7. When the console prompt appears, turn off the system power.
8. Remove the Clear NVR jumper.
9. Close the CPU drawer and push it back into the cabinet.
10. Turn on the system power.
11. Use the passwd and setenv commands to set the password and environment variables according to customer specifications.

------------------------------------------------------------
[1] Using the Clear NVR jumper also clears all environment variables. After using the Clear NVR jumper, reset the customer's environment variables.

Figure 3 Clear NVR and Other System Module Jumpers

9 Powering Up the System

The power controller is configured as follows for a single point of power control at the upper CPU drawer switch. See Figure 4 and check the following:

(1) Remote/Local switch up (remote)
(2) Top power sequence cable connected
(3) Circuit breaker switch up (on)
(4) Power cable plugged into the ac source
(5) Power cables connected to each drawer

See Section 10.2 for a discussion of power-up self-tests.

Figure 4 Power Controller Switches

9.1 Powering Up with Two CPU Drawers

In the dual DECsystem 5900 configuration, with two CPU drawers, the top two power sequence connectors (Figure 4) are used to connect to the two CPU drawers. If either CPU drawer has its upper switch on, the entire system receives power. If both CPU drawer upper switches are on, both must be switched off to turn off the entire system.

10 Troubleshooting and Diagnostics

The following subsections describe, in the order in which they should be followed, the steps for detecting faults. See Section 14 to remove and replace any faulty part.

10.1 Checking for Power Problems

To power up all drawers, the following conditions must be met:

1. Both CPU drawer front switches are on.
2. Rear switches of each mass storage drawer are on. If the green LED on the front panel, lower left, is on, then the drawer has power.
3. All power cords are firmly plugged into the drawers and the power controller.
4. A 1/8-A slow-blow fuse is in place in the power controller.
5. Power controller circuit breaker is up.
6. Power controller Remote/Local switch is up (remote).

At this point, if the system and drawers do not power up, toggle the Remote/Local switch to local (down). If the system powers up, then the power sequence cable is disconnected or faulty, the upper CPU drawer switch is faulty, or the power controller is faulty.
If shorting the end pins of the three-pin remote sequence plug causes system power-up in the Remote mode of the Remote/Local switch, then the power controller is not faulty.

If an individual drawer fails to power up, and you have checked the list above, then the drawer power supply is faulty. Before replacing a drawer power supply, recheck the power connections. Make sure the drawer is receiving ac current. Drawer power supplies are one-piece modules and are easily removed and replaced. See Section 14.2.

10.2 Power-Up Self-Tests

On each power up, the system automatically runs the ROM-based self-tests. By default the environment variable testaction is set to q (quick), which limits the testing to about one minute. Extended tests, using the test command, provide more thorough testing.

Test names are displayed on the system console during the self-tests. The tests overwrite each other as they finish. The process of testing is reflected in binary values in the pair of 4 LEDs at the rear of the CPU drawer. When the tests are completed, the console prompt appears (>>).

Successful completion of the self-test indicates that the kernel system is ready for customer use. Extended self-tests will provide more thorough testing of options and devices.

10.3 Configuration Utility (cnfg)

The cnfg utility indicates whether all devices are recognized by the system, and also can be used to determine addresses for devices that are to be added to the system. The command to check the system is shown as follows, along with the system response:

>>cnfg
3: KN05 DEC V1.0a TCF0 (32 MB)
                       (enet: 08-00-2b-2d-84-c7)
                       (SCSI = 7)

10.4 CPU Daughter Card Diagnostic LEDs

There are two diagnostic LEDs on the CPU daughter card that help determine if a system failure on power-up is due to the CPU or to the system module. Two diagnostic LEDs report the progress of power-up self-tests. After power-up, the CPU will request data from the base system firmware ROM.
When this request is made, one LED turns on. When the base system recognizes this request and sends data back to the CPU, the other LED turns on. If only one LED is on, then we know the CPU is alert and requesting data, and is probably not at fault. If both LEDs are on, then we know that the CPU and base system module are communicating, and we should look elsewhere for our problem. Use Table 7 to determine which of the two modules to change.

Table 7 CPU Daughter Card LED Diagnostics
------------------------------------------------------------
LEDs Lit      Failed Module
------------------------------------------------------------
Neither       CPU daughter card
One           System module
Both          No problem found
------------------------------------------------------------

11 Extended Testing

Extended tests are activated from the base system ROM. These tests should be run to further examine the DECsystem 5900 subsystems: CPU daughter card, system module, Ethernet, memory, SCSI bus, and SCSI devices.

Table 8 lists the names of the tests, the modules tested, and the commands you must enter to activate the tests. Footnotes explain the optional command switches that you can set to alter the tests. The -l option for the test command can be used to loop tests.

A script is a collection of tests. The ls # command shows you the scripts available for slot number (#). You can also build scripts of tests that you want to run. These scripts are lost on power down.

As an example of how to use Table 8, if one or more of the cache tests fail, it would indicate a bad CPU daughter card.

11.1 Individual Tests

Table 8 includes all tests for each module or device.
Table 8 Individual Tests
------------------------------------------------------------
Individual Test                      Test Command
------------------------------------------------------------
System Module

Halt button                          t 3/misc/halt n[0] (1)
Nonvolatile RAM                      t 3/rtc/nvr [pattern] (2)
Overheat detect                      t 3/overtemp
RAM refresh                          t 3/misc/rfrsh
Real-time clock period               t 3/rtc/period
Real-time clock register             t 3/rtc/regs
Realtime                             t 3/rtc/time
SCC (3) access                       t 3/scc/access
SCC DMA                              t 3/scc/dma line[2] int/ext[I] bd[38400] pa[none] bits[8] (4)
SCC interrupt                        t 3/scc/int line[0] (4)
SCC I/O                              t 3/scc/io line[0] int/ext[I] (4)
SCC pins                             t 3/scc/pins line[2] loopback[29-24795-00] (4)
SCC xmit and receive                 t 3/scc/tx-rx line[2] int/ext[I] bd[9600] pa[none] bits[8] (4)
Translation lookaside buffer probe   t 3/tlb/prb
NVRAM                                t 3/prcache
NVRAM clear                          t 3/prcache/clear
NVRAM battery enable                 t 3/prcache/unarm
NVRAM battery disable                t 3/prcache/arm

System Module Ethernet Controller

Collision                            t 3/ni/cllsn
Cyclic redundancy code               t 3/ni/crc
Display MOP counter                  t 3/ni/ctrs
DMA registers                        t 3/ni/dma1
DMA transfer                         t 3/ni/dma2
ESAR (5)                             t 3/ni/esar
External loopback                    t 3/ni/ext-lb
Internal loopback                    t 3/ni/int-lb
Interrupt request (IRQ)              t 3/ni/int
Multicast                            t 3/ni/m-cst
Promiscuous mode                     t 3/ni/promisc
Registers                            t 3/ni/regs

SCSI Controller and Drives

SCSI controller                      t 3/scsi/cntl
SCSI send diagnostic                 t 3/scsi/sdiag [scsi_id] [d] [u] [s] (6)
SCSI target                          t 3/scsi/target [scsi_id] [w] [l #] (6)

CPU Card

Cache data test                      t 3/cache/data [cache] [address] (7)
Cache fill                           t 3/cache/fill [cache] [offset] (7)
Cache isolate                        t 3/cache/isol [cache] (7)
Cache reload                         t 3/cache/reload [cache] [offset] (7)
Cache segment                        t 3/cache/seg [cache] [address] (7)
Secondary cache (R4400 only)         t 3/scache/data (8)
CPU-type                             t 3/misc/cpu-type
Floating-point unit                  t 3/fpu
Translation lookaside buffer probe   t 3/tlb/prb
TLB reg                              t 3/tlb/reg [pattern] [pattern] (9)

Memory Modules

Floating I/O                         t 3/mem/float10 [address] (10)
Memory module                        t 3/mem [module] [threshold] [pattern] (10)
RAM board                            t 3/mem [board] [threshold] [pattern]
RAM address select lines             t 3/mem/select
Partial write                        t 3/misc/wbpart
Initialize memory                    t 3/mem/init
------------------------------------------------------------
(1) [0] = [1-9] = press halt same number of times (1-9)
(2) [pattern] 55 is default pattern
(3) Serial communications chip
(4) Conventions used in SCC tests: line is the serial line to test; 2 is rightmost from the back, 3 is leftmost. int/ext is internal or external loopback. bd is baud rate. pa is parity. bits is data bits. loopback specifies the type of loopback used in the pins test. The value in [ ] specifies the default.
(5) Ethernet station address ROM
(6) Replace scsi_id with the device ID number that you want to test. 0 is the default. [d] and [u] are device-specific parameters. Reference the device manuals for more details. Leave to default if unsure. [s] suppresses error messages (not normally set). [w], if specified, will perform a write test to the device called out in the SCSI target test. Caution: This can cause data loss. Run this command only on hard disks that have no data, or on tapes with scratch media installed.
(7) Replace [cache] with I (instruction) or D (data) to specify which cache to test. Data cache is default. Default [offset] is 80500000. You can replace it with the address you wish the test to start at. [address] is not normally entered.
(8) The scache command has the following parameters (and defaults): [pattern] (80500000), [pattern_increment] (08104225), [address] (80500000), [length] (00100000), [f | r] (flush or replace), and [c | u] (run cached or uncached). Errors reported from this diagnostic indicate a fault in the secondary cache RAMs, a fault in the interconnect between the R4400 and the secondary cache, or errors in reading or writing the memory. To eliminate the last possibility, first run the memory diagnostics.
(9) [pattern] default is 55555555. Pattern can be entered if needed.
(10) You can enter starting [address]. A0100000 is default. Module # default is 0. You can specify module [module]. A data pattern can be specified [pattern].
------------------------------------------------------------

11.2 Diagnosing the NVRAM

The >>t 3/prcache command is used to test the Prestoserve NVRAM module. If the NVRAM cache is clean, you are testing the entire data area; if dirty (meaning it contains data), you are testing the scratch area.

Using the cnfg command when NVRAM is installed, you will see the capacity and the term "prcache".

The two LEDs at the top of the NVRAM module indicate the condition and the status of the battery.
Looking from the front of the drawer, the LED on the left of the module shows the operating condition of the battery. When this LED is lit, the battery is fine. The LED on the right (viewing from the front of the drawer) indicates the status of the battery enable/disable circuit. When the right LED is lit, the battery is enabled.

11.3 Arming and Disarming the Battery

The battery can be disabled (that is, the disable circuit is armed) with the following command:

>>t 3/prcache/arm

The battery can be enabled (that is, the disable circuit is unarmed) with the following command:

>>t 3/prcache/unarm

11.4 Clearing the NVRAM

To clear the NVRAM, enter the following command:

>>t 3/prcache -c

------------------------------------------------------------
Caution
------------------------------------------------------------
If the system was not brought down using the shutdown command, NVRAM may still contain data. Do not use the -c command until this data is flushed by means of the boot and shutdown commands.
------------------------------------------------------------

11.5 System Module ROM Diagnostics

The diagnostics of the KN03-AA or the KN05 reside in the base system module. The test commands are invoked by name and not by number. A list of tests may be obtained by entering t n/?, where n is the number of the module.

The syntax for running a ROM diagnostic test is as follows:

t #/test-name

where # is the slot number and test-name is the full name of the test. TURBOchannel slots are 0, 1, and 2. The CPU/system module is slot 3.

The diagnostic tests can be run one at a time, or run serially by using the SCRPT command to generate a script.

11.6 ULTRIX System Exercisers

The ULTRIX operating system contains a set of commands called exercisers. The exercisers reside in the /usr/field directory and allow you to test all or part of your system by exercising specified parts.
See the DECsystem 5900 CPU System Technical Manual and the ULTRIX Pocket Service Guide for details on the ULTRIX system exercisers.

------------------------------------------------------------
Note
------------------------------------------------------------
The ULTRIX exercisers are not a mandatory subset and may not be installed on your system. Subset UDTEXER must be installed for the exercisers to be present.
------------------------------------------------------------

The following ULTRIX-based exercisers are currently available and can be used to exercise and test the DECsystem 5900:

· fsx = file system exerciser
· memx = memory exerciser
· shmx = shared memory exerciser
· dskx = disk exerciser
· mtx = magnetic tape exerciser
· tapex = tape exerciser program
· netx = tcp/ip network exerciser
· cmx = communications exerciser
· lpx = line printer exerciser

To run these exercisers, the operator must log in as superuser (root) and then change directory to /usr/field.

All of the exercisers can be run in either the foreground or the background and can be canceled at any time by pressing Ctrl/C in the foreground. More than one exerciser can be run at the same time.

To run more than one exerciser simultaneously, a shell script called syscript is used. The syscript command asks which exercisers are to be run, how long the exercisers will be run, and how many exercisers are to be run at one time. The syscript command can be used to exercise a device, a subsystem, or the entire system.

Each time an exerciser is invoked, a new logfile is generated in the /usr/field directory. The logfile is a record of the exerciser's results and consists of the starting and stopping times, and of error and statistical information.
12 System Software Management

The following system software (ULTRIX) operations are significant to the hardware maintenance process:

· Starting up (booting) system software
· Shutting down system software
· Accessing ULTRIX error logs

12.1 Booting System Software

There are two ways to boot system software: from disk or tape, or over the network. Each method is detailed in the following procedures.

12.1.1 Booting from Disk or Tape

The syntax for this command is:

>>boot #/rz-tz scsi_id/file_name [-a]

# is the slot number of the device.
rz-tz scsi_id is the type and SCSI address of the boot device.
file_name is the actual image you are booting, usually vmunix.
[-a] is optional, specifying a multiuser boot.

12.1.2 Booting Over the Network

The syntax for this command is:

>>boot #/protocol[/file] [-a]

# is the slot number of the NI over which you are booting.
protocol is the network protocol, either mop or tftp.
[/file] is optional, and represents a specific file used to boot.
[-a] is optional, specifying a multiuser boot.

Examples of boot commands are:

>>boot 3/rz0/vmunix -a
>>boot 3/tz5
>>boot 3/mop -a

If autoboot is not selected, and if the ROM diagnostics pass, enter the console mode by turning on the system, or after using the shutdown command to stop running ULTRIX.

Select Autoboot by entering the following console command:

>>setenv haltaction -b

If the system displays the ULTRIX prompt # before the login: prompt appears, the system has stopped at single-user mode instead of multiuser mode. To move to multiuser mode, press Ctrl/D to continue the boot operation. When the system displays the login: prompt, the system software has started successfully.

The system probably stopped at single-user mode because the bootpath is set for single-user mode or because of disk corruption. If the problem persists, the disks should be cleaned using the fsck function.

If the system displays a console prompt (>> or R>), the bootup failed. Proceed as follows:

1.
If the restricted prompt R> is displayed, refer to Section 8.1 on entering a password. Enter passwd and press Return.

2. At the pwd: prompt, enter the password and press Return. The system displays the console prompt >>.

3. At the console prompt, enter printenv and press Return to display the environment variables table.

4. If the bootpath is set incorrectly, use the setenv command to set the boot environment variable to a device or to the network that contains the system software that you want to boot.

5. Reenter the boot command to boot the system.

12.2 Shutting Down System Software

If the system is running ULTRIX software, shut down the software before you perform hardware maintenance. At the ULTRIX prompt #, enter:

/etc/shutdown -h [now / hhmm / +n]

In this case the values in brackets are not optional. A time to shut down must be entered. If not, the system will answer, "I don't know when that is, can't you wait until tomorrow".

· now shuts down the software immediately.
· hhmm shuts down the software at a specific hour and minute.
  1. Replace hh with the hour to begin the shutdown.
  2. Replace mm with the minute to begin the shutdown.
· +n shuts down the software in a specified number of minutes.

The system displays a console prompt >> or R> when shutdown is complete.

13 Using Error Logs

The system records events and errors in the ULTRIX error logs. Use the memory error, error and status register, and system overheat error logs to troubleshoot intermittent problems.

------------------------------------------------------------
Note
------------------------------------------------------------
The ULTRIX error logs are not the same as the test error logs that appear when you use the erl console command. The console error log is a record of errors reported by tests run in console mode.
------------------------------------------------------------

13.1 erl

>>erl [-c]

The system records console error messages in a special error log buffer, and the erl command displays the contents of this buffer. If you specify -c, the buffer is cleared. The system stops recording error messages when the buffer is full and resumes when the buffer is cleared.

The following paragraphs describe ULTRIX error log formats and error log parts useful in troubleshooting.

13.2 Examining Error Logs

You must be logged in to ULTRIX to examine ULTRIX error logs. At the ULTRIX prompt, enter /etc/uerf and press Return. A full display of error log entries appears on the console. There are many ways to examine the error log. See the ULTRIX Pocket Service Guide.

The first part of each error log describes the type of error and system conditions in effect when the error occurred. The last part of each log provides specific information about the error and its location.

In the error log displays:

· EVENT CLASS lists the error log's general category. Possible EVENT CLASS categories are:
  - Operational events, which are changes in system operation that are not errors.
  - Error events, which are actual errors in system operation.
· OS EVENT TYPE describes the type of error or event recorded in the log. For information about memory, error and status register, and overheat error logs, refer to the following section "Distinguishing Event Types" and to the discussion of the particular log in which you are interested.
· SEQUENCE NUMBER lists the order in which the system logged the event.
· OPERATING SYSTEM lists the system's version of ULTRIX.
· OCCURRED/LOGGED ON shows the time the error occurred.
· OCCURRED ON SYSTEM lists the individual system that reported the error.
· SYSTEM ID includes several listings:
  - The first number to the right of SYSTEM ID is the system ID.
  - HW REV lists the system hardware revision number.
  - FW REV lists the system firmware revision number.
  - CPU TYPE shows the type of CPU used in the system.
· PROCESSOR TYPE lists the type of processor chip used in the system.

The remaining error log entry is different for each error log event type. For an explanation of entries contained in memory, error and status register, and overheat error logs, refer to the next section.

13.3 Distinguishing Event Types

The second line of each error log lists the code number and name of the error log event type. The following sections describe memory, error and status, and system overheat error logs. For a detailed explanation of other error logs that involve the system unit, refer to the ULTRIX documentation.

13.4 Troubleshooting Mass Storage Devices Using uerf Error Logs

You need three pieces of information from the error log to locate a physical device: CONTROLLER NO., UNIT NO., and SCSI ID. These pieces of information are shown in the uerf error log.

Note that the ULTRIX operating system calls out the system TURBOchannel controllers and devices differently than the system firmware does. ULTRIX sees the system module as Controller 0 and then numbers the TURBOchannel slots up from there. Therefore some conversion is necessary when determining which physical device the error log is calling out. Using Table 9, determine which TURBOchannel controller slot and unit number is being called out, and then locate the failing drive.
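
The conversion in Table 9 follows a regular pattern: each controller owns a block of eight unit numbers, CONTROLLER NO. 0 is the system module (slot 3), and CONTROLLER NO. 1 through 3 correspond to TURBOchannel slots 0 through 2. The following Python sketch illustrates that arithmetic. It is not part of ULTRIX or the console firmware, and the names locate_drive and DriveLocation are hypothetical, chosen only for this illustration.

```python
from typing import NamedTuple


class DriveLocation(NamedTuple):
    """Hypothetical result type for this sketch only."""
    turbochannel_slot: int   # 0, 1, 2 = option slots; 3 = system module
    scsi_id: int             # 0-7 on that SCSI bus
    is_controller: bool      # True when the unit is the bus controller (ID 7)


def locate_drive(controller_no: int, unit_no: int) -> DriveLocation:
    """Convert uerf CONTROLLER NO. and UNIT NO. to a slot and SCSI ID.

    Mirrors the pattern of Table 9: eight unit numbers per controller,
    with the highest unit number on each bus being the controller itself.
    """
    if controller_no not in (0, 1, 2, 3):
        raise ValueError("Table 9 only covers CONTROLLER NO. 0 through 3")

    # Each controller owns units 8*n through 8*n + 7.
    scsi_id = unit_no - 8 * controller_no
    if not 0 <= scsi_id <= 7:
        raise ValueError("UNIT NO. does not belong to this CONTROLLER NO.")

    # ULTRIX numbers the system module first, then the option slots.
    slot = 3 if controller_no == 0 else controller_no - 1
    return DriveLocation(slot, scsi_id, scsi_id == 7)
```

For the sample error log in this section (CONTROLLER NO. 3, UNIT NO. 25), locate_drive(3, 25) gives TURBOchannel slot 2 and SCSI ID 1, which agrees with the SCSI ID field reported in the log.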
If you are unfamiliar with the system mass storage configuration, either follow the cable or use tests and the cnfg utility to locate the drives connected to the controller that is being called out. It will be very helpful to mark the SCSI ID numbers on the drives and know which drives are connected to which SCSI controller(s). As you are facing the rear of the system, the TURBOchannel slots are numbered 0, 1, 2 from left to right.

In the following error log example, drive 1, which is connected to the PMAZ SCSI controller on TURBOchannel slot 2, is showing errors. Use the drive conversion chart (Table 9) to calculate the physical drive from the UERF error log shown in the following error log example:

----- EVENT INFORMATION -----
EVENT CLASS                  ERROR EVENT
OS EVENT TYPE         102.   DISK ERROR
SEQUENCE NUMBER       16.
OPERATING SYSTEM             ULTRIX 32
OCCURRED/LOGGED ON           Tue Dec 10 03:23:55 1991 MET
OCCURRED ON SYSTEM
SYSTEM ID        x82040230   HW REV: x30
                             FW REV: x2
                             CPU TYPE: R2000A/R3000
PROCESSOR TYPE               KN05

----- UNIT INFORMATION -----
UNIT CLASS                   SCSI DISK
UNIT TYPE                    RZ57
CONTROLLER NO.        3.
UNIT NO.              25.

----- SCSI INFORMATION -----
REVISION              2.
ERROR TYPE            x0000  DEVICE ERROR
SUB-ERROR TYPE        0.
SCSI ID               1.

In Table 9, the subheadings in the table show the CONTROLLER NO. as listed in the error log and its related TURBOchannel slot. Within the subheadings, the UNIT NO. from the error log is listed with its corresponding SCSI ID on that module.

Table 9 Error Log to Physical Drive Conversion
------------------------------------------------------------
UNIT NO. (from error log)    SCSI ID
------------------------------------------------------------
CONTROLLER NO. 0 (from error log) = TURBOchannel slot 3 (System module)
------------------------------------------------------------
0                            0
1                            1
2                            2
3                            3
4                            4
5                            5
6                            6
7 (1)                        Controller
------------------------------------------------------------
CONTROLLER NO. 1 (from error log) = TURBOchannel slot 0 (PMAZ)
------------------------------------------------------------
8                            0
9                            1
10                           2
11                           3
12                           4
13                           5
14                           6
15 (1)                       Controller
------------------------------------------------------------
CONTROLLER NO. 2 (from error log) = TURBOchannel slot 1 (PMAZ)
------------------------------------------------------------
16                           0
17                           1
18                           2
19                           3
20                           4
21                           5
22                           6
23 (1)                       Controller
------------------------------------------------------------
CONTROLLER NO. 3 (from error log) = TURBOchannel slot 2 (PMAZ)
------------------------------------------------------------
24                           0
25                           1
26                           2
27                           3
28                           4
29                           5
30                           6
31 (1)                       Controller
------------------------------------------------------------
(1) The highest UNIT NO. on each SCSI controller (7, 15, 23, 31) is the controller for the respective SCSI bus. As an example, SCSI ID 3 on TURBOchannel slot 0 would appear in the error log as UNIT NO. 11 on CONTROLLER NO. 1.
------------------------------------------------------------

The UNIT INFORMATION section describes the type of module that reported the error.

· UNIT CLASS shows that the error occurred in a memory module.
· UNIT TYPE lists the particular type of memory module in which the error occurred.
· ERROR SYNDROME describes the nature of the error.

The ERROR & STATUS REGS section lists the error and status register contents followed by phrases that describe the register contents.

· EPC indicates that this is an exception program counter.
· KN03 or KN05 STAT REG lists the contents of the CPU status register (CSR).
· ERROR ADDR REG describes the specific error type.
· PHYSICAL ERROR ADDR is the address in the hardware where the error occurred.
· CHECK SYNDROME entries describe the actual error and the module where it occurred.
  - SYND BITS lists the bits in the check syndrome register.
  - The second line states whether this is a single-bit or a multibit error.
  - The third line shows the check bits.
  - MODULE NUM shows the slot number of the module that reported the error.
  - ERROR COUNT shows the total number of errors that have occurred in this module since the last time the software was booted.
  - The last line shows whether this is a bus error or memory interrupt error.
· The ADDITIONAL INFO section lists the controller number and total number of errors that have occurred at this address.

13.5 Error and Status Register Error Logs

Error and status register error logs record nonmemory errors. This is a sample of the error log sections that are unique to error and status register error logs.
----- ERROR & STATUS REGS -----
CAUSE                  x80002000
                                  EXCEPTION CODE EXTERNAL INTERRUPT
                                  HW INTERRUPT 3 PENDING
                                  BRANCH DELAY SET
STATUS                 x0000FE04
                                  CURRENT INTERRUPT STATE DISABLED
                                  CURRENT MODE KERNEL
                                  PREVIOUS INTERRUPT STATE ENABLED
                                  PREVIOUS MODE KERNEL
                                  OLD INTERRUPT STATE DISABLED
                                  OLD MODE KERNEL
                                  SW INTERRUPT 1 ENABLED
                                  HW INTERRUPT 0 ENABLED
                                  HW INTERRUPT 1 ENABLED
                                  HW INTERRUPT 2 ENABLED
                                  HW INTERRUPT 3 ENABLED
                                  HW INTERRUPT 4 ENABLED
                                  HW INTERRUPT 5 ENABLED
                                  CACHE STATE NORMAL
SP                     xFFFFDC58
KN05 STAT REG          x05C20001
                                  IO INT 0 PENDING
                                  19,200 BAUD
                                  8 MB MEM MODULE
                                  ECC CMD x0
                                  IO INT 1 ENABLED
                                  IO INT 6 ENABLED
                                  IO INT 7 ENABLED
                                  NORMAL MODE
                                  REFRESH ODD MEM MODULES
                                  UNSECURE
ERROR ADDR REG         xE7B00000
                                  CPU I/O WRITE TIMEOUT
PHYSICAL ERROR ADDR    x1EC00000

The ERROR & STATUS REGS section lists the error and status register contents followed by phrases explaining these values.

· CAUSE lists the event that caused the error.
· STATUS lists various system settings in effect when the error occurred.
· SP is a stack pointer that identifies where the CPU contents were sent when the error occurred.
· KN03 or KN05 STAT REG lists the contents of the CPU status register.
· ERROR ADDR REG describes the specific error.
· PHYSICAL ERROR ADDR indicates where in the hardware the error occurred.

13.6 System Overheat Error Messages

If the system overheats, ULTRIX records the error and displays the following message on the console:

"System overheating - suggest shutdown and power-off"

14 Major FRU Replacement

DECsystem 5900 field-replaceable units (FRUs) are:

· Power supplies
· Mass storage devices and cabling
· CPU drawer modules and cabling
· Power controller

The following sections cover the major FRUs, and explain how to access and remove the various FRUs.

------------------------------------------------------------
Caution
------------------------------------------------------------
The DECsystem 5900 hardware includes electrostatic-sensitive components.
Before touching any internal components, take precautions to protect against electrostatic discharge.
------------------------------------------------------------

14.1 Pulling Out a Drawer

------------------------------------------------------------
Warning
------------------------------------------------------------
A fully-populated mass storage drawer weighs 110 lbs; a CPU drawer weighs 65 lbs. Because of the weight of these drawers, pull out the stabilizer bar between the front two feet of the system before pulling out a drawer. In addition, OSHA rules state that one person may lift only 35 lbs.
------------------------------------------------------------

Drawers are secured to the cabinet frame with screws in the front and in the rear. To free a drawer to be pulled out, proceed as follows:

1. Pull out the stabilizer bar between the two front feet of the system. Check that the stabilizer bar is touching the floor.

2. To pull out a mass storage drawer, remove the retainer brackets in the rear (if still in place).

   a. Loosen the eight slotted captive screws (1) holding the lower rear plate of the drawer, and remove the plate.

   b. Loosen and remove the two hex slotted screws (2) on each inside wall as shown in Figure 5. Leave the brackets in place on the drawer slides. Bracket screws do not have to be replaced.

Figure 5 Rear Bracket Screws

3. Remove the six hex nuts (5/16-inch) on the front of the drawer. See Figure 6, (1). These hex nuts do not have to be replaced.

4. Pull the drawer out until the locking tabs secure the drawer.

Figure 6 Front Hex Nuts

14.2 Mass Storage Drawer Power Supply

The mass storage power supply is in the rear of the drawer. Remove a mass storage drawer power supply using the following procedure:

1. Turn off the rear or front power switch to the mass storage drawer, and pull out the power cord from the power supply.

2. Follow the procedure in Section 14.1 and pull the drawer out from the front.

3.
Open the drawer lid and unplug the SCSI bus power cable and the switch cable at the rear of the drawer.

4. Push the drawer in and move to the rear of the system.

5. At the rear of the drawer, unlatch the rear panel and remove the two top screws holding the power supply assembly.

6. Lift the power supply assembly up and out.

14.3 Removing the CPU Power Supply

The CPU power supply is in the front of the drawer. Remove the power supply as follows:

1. Power off the system by turning either the upper (system) CPU switch or the lower (CPU drawer) switch off.

2. Remove the six hex nuts that hold the drawer to the cabinet frame and pull out the CPU drawer.

3. Remove the four Phillips screws holding the front plate of the drawer.

4. Release the Phillips captive screws on the front lid, and unplug the blower from the power supply.

5. Remove the lid/blower.

6. Disconnect the three cables from the power supply to the power distribution module mounted to the inside wall of the front section of the drawer. (Remember that the red cable is plugged into the top socket in the power distribution module.)

7. Disconnect the ac power cable from the power supply.

8. Loosen and disconnect the screws holding the plenum over the power supply and remove the plenum.

9. Remove the power supply assembly.

10. Reverse the previous steps to replace the power supply.

14.4 Removing the System Skirts

Skirts have to be removed if you want to move the cabinet on its wheels. On each side of the system, a skirt assembly is held to the frame by means of two quarter-turn Phillips captive screws. The side skirts must be removed to reach and loosen the lock nuts holding the feet in place. Move the feet up above the wheels.

If the system is to be rolled up or down a ramp, the front and rear skirts must be removed also.
14.5 Mass Storage Drawer SCSI Cables

Signals are distributed to the mass storage devices by means of an internal SCSI cable that connects these signals to the two SCSI bus ports at the rear of each mass storage drawer. The two SCSI buses can be connected or used as separate buses. Each bus cable has five 50-pin keyed connectors. If not connected, each harness must be terminated at any point in the harness. These separate, or split-SCSI buses, allow more throughput.

In Figure 7 the disks and the power supply for the mass storage drawer are shown.

Figure 7 Media in the Mass Storage Drawer

(1) Hard disk drives
(2) Removable medium, full-height
(3) Removable media, half-height
(4) Power supply module

14.6 CPU Drawer

The DECsystem 5900 CPU is in an 8.75-inch CPU drawer, which is the third drawer in the cabinet. (Drawer 2 is reserved for the second CPU drawer in a dual DECsystem 5900 configuration.) The CPU drawer contains the following components, as shown in the side view of Figure 8:

Figure 8 CPU Drawer Subsystems, Side View

(1) Power supply module fan
(2) 1-MB NVRAM (SIMM in slot 14 of memory array)
(3) CPU drawer cover
(4) MS02-CA 32-MB memories (two shown)
(5) TURBOchannel extender module
(6) 244 W power supply
(7) Power cable
(8) Cable to front panel LED
(9) Power distribution module
(10) CPU daughter card
(11) System module (base printed-circuit board)
(12) TURBOchannel option module

14.7 Replacing the System Module

The system module is the main printed circuit board on the floor of the CPU drawer. It is held in place by fifteen 10/32-inch screws and six 3/16-inch metal standoffs. To access and remove the system module, use the following procedure:

1. Shut down the ULTRIX operating system.

2. Power off the system and/or the CPU drawer.

3. Label and disconnect the cables at the rear of the CPU drawer.

4. Pull out the CPU drawer.

5. Open the middle compartment lid and hold it in place with the latch.

6.
Remove the CPU daughter card, the memory and NVRAM SIMM modules, TURBOchannel options, and the power cables.
7. Remove the fifteen 10/32-inch screws and six 3/16-inch standoffs.
8. Gently pull the system module up and out of the CPU drawer. Please do not bend the metal gasket at the rear of the system module.
9. Remove the new module from the stiffener before installing it in the drawer. Also remove the system jumper, located in the TURBOchannel option slot 0 area (see Figure 3). Install the replaced module, with all the hardware, in the stiffener before returning it. This stiffener must be used in order to receive credit for the returned system module. Please do not bend the metal gasket at the rear of the system module.
10. Transfer the ESAR chip to the new system module. When you replace the system module, the ESAR chip from the replaced system module must be transferred to the new system module to maintain the Ethernet address. The ESAR chip is the socketed DIP chip on the system module beneath the option board in option slot 1 (the middle option slot).

14.8 Replacing the CPU Daughter Card

The CPU daughter card is held in place by four standoffs and two card-edge connectors. To access and remove the CPU daughter card, follow this procedure:

1. Shut down the ULTRIX operating system.
2. Power off the system and/or the CPU drawer.
3. Pull out the CPU drawer.
4. Open the middle compartment lid and hold it in place with the latch.
5. Using small pliers, release the CPU daughter card from its standoff locks.
6. Gently pull the CPU daughter card up from the system module card-edge connectors.
7. Carefully align the new card with the four standoffs. Push the module down, seating it in the connector and locking the module onto the standoffs.

15 Adding or Replacing Mass Storage Devices

Use the following procedures to install, configure, or replace a mass storage device.

15.1 Installing a Drive

Use the following procedure to install a disk drive at the customer site:

1.
Shut down the ULTRIX operating system.
2. Get the current system and drive configuration. At the console prompt, enter the cnfg command.
   a. Enter cnfg x for the system module (3) and for each PMAZ (x = the slot number, 0-3).
   b. Write down the drive numbers on each PMAZ. Remember that you can have a maximum of 7 SCSI devices on each SCSI bus (up to 4 SCSI buses).
   c. Decide which SCSI bus(es) you will add the device(s) to and what addresses they will be.

------------------------------------------------------------
Note
------------------------------------------------------------
On any one SCSI bus you cannot have duplicate addresses.
------------------------------------------------------------

3. Power down the system and/or the mass storage drawer(s). To power down one mass storage drawer, turn off either the rear or the front power switch of the mass storage drawer.
4. Set the address of each device to a value from 0 to 6 that does not duplicate any other address on the same SCSI bus. Put the appropriate address label on the drive bracket.
5. Remove the empty bracket(s) from the location(s) where you will install the device(s), and install the device(s) with the screws supplied in the accessories kit.
6. Power up the system/mass storage drawers and verify that the cnfg command (>>cnfg x) shows the drive(s) you installed as well as the other drives.
7. Run the SCSI send diagnostics and target tests on the newly installed device(s) to verify that they work.

------------------------------------------------------------
Caution
------------------------------------------------------------
Be careful not to write on other drives if you use the w option with the target test. Do not use the [w] option on any device that has data on it unless that data is backed up. To be safe, use scratch media.
------------------------------------------------------------

8.
Boot the system and have the system manager set up the system to recognize the device by editing the configuration file and running doconfig.
9. Report your time via the procedure in the Field Test Support Plan.

15.2 Configuring Disks in the Mass Storage Drawer

An ID/address label should be placed on the top surface of each device. Refer to the inside cover of the storage trays for illustrations of the jumper configurations of various SCSI devices.

15.3 Removing the Faulty Device

Once you locate the faulty device, use the following procedure to remove the device:

1. Have the ground strap in place.
2. Open the drawer and locate the device.
3. Loosen the captive screw for that device.
4. Disconnect the power cable to that device.
5. Pull the device out.
6. Disconnect the SCSI cable and remove the device.
7. Replace the device or subassembly as required.
8. Run SCSI tests to verify operation of the replacement.

16 Field-Replaceable Units

Table 10 shows the DECsystem 5900 field-replaceable parts.

Table 10 Field-Replaceable Units
------------------------------------------------------------
Part No.        Description
------------------------------------------------------------
Modules
------------------------------------------------------------
70-28348-01     System module w/ stiffener
54-20627-01     R3000A CPU daughter card
54-21872-02     R4400 CPU daughter card
54-20623-01     TCE option module
54-20625-01     TCE interface
54-21333-01     Power distribution module
MS02-CA         32-MB memory module
54-20948-01     NVRAM module
SZ29xC, -xD     Mass storage drawer
------------------------------------------------------------
Cables, CPU
------------------------------------------------------------
17-03363-01     Power supply logic
17-03364-01     +5 V CPU power harness
17-03335-02     TURBOchannel Extender interconnect
17-03365-01     TURBOchannel Extender power harness
17-03362-01     AC input harness
17-03379-01     Remote switch cable
17-00931-05     Remote sense cable to power controller
------------------------------------------------------------
Cables, Mass Storage Drawer
------------------------------------------------------------
17-03529-01     Disk power harness
17-03528-01     Internal SCSI cables
17-03360-01     Power switch cable
17-03380-01     SCSI jumper cable
------------------------------------------------------------
Cables, Miscellaneous
------------------------------------------------------------
17-03361-01     SCSI cable, dwr-to-dwr
17-00442-19     Drawer power cord
17-02641-02     PMAZ SCSI cable
12-37004-01     SCSI terminator, single-end external
12-37004-02     SCSI terminator, differential external
12-36929-01     SCSI terminator, single-end internal
12-33929-02     SCSI terminator, differential internal
------------------------------------------------------------
Power Supplies
------------------------------------------------------------
30-32506-03     CPU drawer
H7886-AA        Mass storage drawer
30-35415-01     Power controller, 120 V
30-35415-02     Power controller, 240 V
------------------------------------------------------------
Mass Storage Devices
------------------------------------------------------------
29-28158-01     RZ57 1.0-GB (HDA)
29-28159-01     RZ57 1.0-GB (module)
RZ58-E          RZ58 1.3-GB
RX26-LF         RX26
RRD42-AA        RRD42 600-MB CD-ROM
TLZ04-AA        TLZ04 1.2-GB (embedded)
TLZ06-AA        TLZ06 4.0-GB (embedded)
TZ30-AX         TZ30 95-MB tape
TZK10-AA        TZK10 QIC tape
TZ85-BX         TZ85 2.6-GB tape
TKZ08-AA        TKZ08 2-GB 8 mm tape
TKZ09-AA        TKZ09 5-GB 8 mm tape
------------------------------------------------------------
TURBOchannel Options
------------------------------------------------------------
70-19874-01     PMAD-AB
70-19876-01     PMAZ-AB
70-26944-01     DEFZA-AA
70-22710-01     DEFZA-CA
------------------------------------------------------------
Loopbacks and Connectors
------------------------------------------------------------
12-25083-01     MMJ loopback
12-22196-02     Standard Ethernet loopback
29-24795-00     Communication modem loopback
12-33190-01     Communication line to MMJ adapter
------------------------------------------------------------
Miscellaneous
------------------------------------------------------------
12-37483-01     Blower assembly
12-24160-02     CPU switch
12-14314-00     NVR jumper
------------------------------------------------------------
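
The addressing rules in Section 15.1 (device addresses 0-6, no duplicate address on any one bus, at most 7 devices per bus, up to 4 buses) can be checked on paper before powering down the system. The following Python sketch is one way to do that. It is an illustration only, not part of the console firmware or ULTRIX. The function name check_scsi_plan and the layout of the plan (a bus label mapped to a list of addresses) are assumptions made for this example; the addresses themselves would come from what you wrote down from the cnfg displays.

```python
MAX_BUSES = 4             # "up to 4 SCSI buses" (Section 15.1)
VALID_IDS = range(0, 7)   # device addresses 0-6, so at most 7 devices per bus


def check_scsi_plan(plan):
    """Return a list of problems found in a proposed address plan.

    `plan` maps a bus label (hypothetical; for example the slot number
    used with cnfg) to the list of device addresses planned for that bus.
    An empty list means none of the Section 15.1 rules is violated.
    """
    problems = []
    if len(plan) > MAX_BUSES:
        problems.append(f"{len(plan)} buses planned; maximum is {MAX_BUSES}")
    for bus, ids in sorted(plan.items()):
        seen = set()
        for dev_id in ids:
            if dev_id not in VALID_IDS:
                problems.append(f"bus {bus}: address {dev_id} is outside 0-6")
            elif dev_id in seen:
                problems.append(f"bus {bus}: address {dev_id} is duplicated")
            seen.add(dev_id)
        if len(ids) > len(VALID_IDS):
            problems.append(f"bus {bus}: {len(ids)} devices; maximum is 7")
    return problems


if __name__ == "__main__":
    # Example plan: existing drives plus the new drive's intended address.
    proposed = {3: [0, 1, 2], 0: [0, 5, 5]}
    for problem in check_scsi_plan(proposed):
        print(problem)
```

The same address may appear on two different buses without conflict; only a repeat within one bus is reported.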