MS630 Memory Problem Determination/Resolution Guide

Order Number EK-MS630-FI-001

ABSTRACT

The objective of this guide is to clearly define the recommended memory maintenance strategy for all MS630 memory arrays. There are no new procedures defined here. These are the original maintenance procedures explained in detail with an emphasis on problem determination (that is, determine what the underlying cause of the problem is and when to replace the FRU).

Digital Equipment Corporation

June, 1991

The information in this document is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation. Digital Equipment Corporation assumes no responsibility for any errors that may appear in this document.

Possession, use, duplication, or dissemination of the software described in this documentation is authorized only pursuant to a valid written license from Digital or the third-party owner of the software copyright.

No responsibility is assumed for the use or reliability of software on equipment that is not supplied by Digital Equipment Corporation.

Copyright © Digital Equipment Corporation 1991
All Rights Reserved.
Printed in U.S.A.

The following are trademarks of Digital Equipment Corporation: MicroVAX, MicroVAX II, VMS, and the Digital logo.

This document was prepared and published by Educational Services Development and Publishing, Digital Equipment Corporation.

Contents
------------------------------------------------------------
About This Manual

1 START HERE
  1.1 Problem Symptom Determination
  1.2 FRU Replacement Procedures
  1.3 Non-Conforming Material Tag
2 HARD FAULT -- THEORY NUMBER 1
  2.1 Theory Description
  2.2 Recommended Action
3 TRANSIENT FAULT -- THEORY NUMBER 2
  3.1 Theory Description
  3.2 Recommended Action
4 MULTIPLE SYMPTOM FAULT -- THEORY NUMBER 3
  4.1 Theory Description
  4.2 Recommended Action
5 INTERMITTENT FAULT -- THEORY NUMBER 4
  5.1 Theory Description
  5.2 Recommended Action
Glossary

Figures
  1 Memory Parity Error Entry
  2 Bad Pages Indicated Under SHOW MEMORY VMS/DCL
  3 Memory Error Detected While Running MDM or POST

Tables
  1 Symptom Determination

About This Manual
------------------------------------------------------------
This document provides guidance in the event of a MicroVAX memory problem. To use the guide, start at Section 1, identify the problem symptom that you are experiencing, and then follow the procedures.

The procedures in this document apply to all MicroVAX II based systems and all supported memory products (1-Mbyte on-CPU memory, 1-Mbyte MS630-AA through 8-Mbyte MS630-CA). These procedures also take into account the recommended guidelines of FCO MS630-I001.
This guide assumes that the user has service knowledge of the MicroVAX system and appropriate service tools and procedures.

1 START HERE

1.1 Problem Symptom Determination

Use Table 1 to determine the current symptom that your system is experiencing. Once the symptom has been determined, refer to the figure indicated in column 2.

NOTE
The symptoms are listed in order of priority. If more than one of the following symptoms exists, follow the directions for the first symptom found (in order).

Table 1 Symptom Determination
---------------------------------------------------------------------------
If this Symptom Exists . . .                           Refer to this Figure
---------------------------------------------------------------------------
Fatal memory error in the ERRLOG                       Figure 1
Bad pages shown under the VMS/DCL SHOW MEMORY command  Figure 2
Power-on self-test (POST) or the MDM diagnostic fails  Figure 3
---------------------------------------------------------------------------

Figure 1 Memory Parity Error Entry

Figure 2 Bad Pages Indicated Under SHOW MEMORY VMS/DCL

Figure 3 Memory Error Detected While Running MDM or POST

1.2 FRU Replacement Procedures

Follow these recommended procedures when FRU replacement of the memory module is necessary.

1. Identify/verify the FRU type (module number) and location (slot number).
2. Physically remove the FRU and install a spare FRU (same type) in its place. If the FRU is an M7609 (8-Mbyte MS630-C), verify that the spare is either revision A2 or C1. If not, then find one that is.
3. If a second memory array module is present in the system and it is an M7609 (8-Mbyte MS630-C), verify that it is either revision A2 or C1.
   If not, then acquire and install FCO MS630-I001, which involves the replacement of this module as well.
4. Power up the system. Verify that POST passes. Run one pass of the MDM diagnostics. Reboot the operating system and verify that the problem symptom does not recur.

1.3 Non-Conforming Material Tag

After replacing an FRU, tag the removed module before returning it to Logistics. Include the following information on the repair tag to aid in module repair and tracking:

· Indicate whether the FRU problem was:
  - Hard (easily reproducible)
  - Intermittent (comes and goes)
· Indicate the method used to diagnose the failed FRU:
  - POST failure
  - MDM failure
  - VMS bad pages
  - Parity error (in the ERRLOG)

2 HARD FAULT -- THEORY NUMBER 1

2.1 Theory Description

This theory is valid if the fault is hard (reproducible). The underlying cause of such a fault is typically a physical component failure. If this class of fault is present, then it is quite likely that the memory array exhibits one or more of the following symptoms:

· MDM diagnostics fail
· VMS maps out bad pages when booted
· POST fails
· The system cannot boot successfully

2.2 Recommended Action

Replace the failed FRU. In the comments field on the repair tag, indicate "HARD FAULT, XXX FAILURE", where XXX is the primary symptom (the first item you encounter from the following list):

1. POST -- if POST failed
2. MDM -- if MDM failed (also fill out the diagnostic section on the repair tag)
3. VMS bad pages -- if bad pages mapped out
4. Parity error -- if the console/ERRLOG indicated this

3 TRANSIENT FAULT -- THEORY NUMBER 2

3.1 Theory Description

This theory is valid when the parity error has been categorized as a transient event. In other words, the parity error happened only once and appears to be an isolated incident. The most probable source of this failure is an alpha particle.
An alpha particle is a minute, one-shot disturbance which inverts the contents of a single DRAM cell (in other words, changes a "1" to a "0" or vice versa). Once a cell is impacted by an alpha particle, it remains in the "inverted" state until the cell is re-written. Once the cell is re-written, the fault is no longer present.

The alpha particle phenomenon is well known and documented and is experienced by all DRAM systems of all vendors. This failure mode is the most common cause of MicroVAX memory parity errors: its rate of occurrence is 100 times that of hard (reproducible) DRAM faults.

From past experience and field data, a MicroVAX II system (with a fully populated memory subsystem) may experience a transient memory fault as often as once every 3 to 6 months (worst case). The actual rate is dependent upon system load (usage), memory access rates, and application.

3.2 Recommended Action

As alpha particles inflict no permanent damage on a DRAM, repair is not necessary. Do not replace the FRU. Simply record the symptoms in the appropriate place (for example, the site management guide or the customer site log). The pertinent information recorded should include:

· Date/time of error
· Error description (for example, fatal memory error)
· FRU isolation information (slot 2 or 3)
· Diagnosis/theory (for example, transient as only one error)

4 MULTIPLE SYMPTOM FAULT -- THEORY NUMBER 3

4.1 Theory Description

This theory is valid if one or both of the following conditions exist:

· There is more than one problem symptom evident
· Multiple FRUs have failed

Due to the underlying complexity, it is not possible (and would be inaccurate) to identify the failed FRU(s).
However, the following are some guidelines for further diagnosis:

· If multiple FRUs fail, all exhibiting memory parity errors, then suspect a common component (for example, a cable or CPU module).
· If other problem symptoms are exhibited, focus on the earliest and/or common symptom.
· If something has recently been changed/installed in the system, consider that component.

4.2 Recommended Action

Perform additional manual diagnosis of all problem symptoms and/or contact the next level of support/service.

5 INTERMITTENT FAULT -- THEORY NUMBER 4

5.1 Theory Description

This theory is valid if the fault is recurring but not easily reproducible. The underlying cause of such a fault is typically a marginal physical component failure. If this class of fault is present, then it is quite likely that the memory array exhibits one or both of the following symptoms:

· System crashes periodically due to a memory parity error
· MDM diagnostics probably do NOT fail

5.2 Recommended Action

Replace the failed FRU. In the comments field on the repair tag, indicate "INTERMITTENT FAULT, XXX FAILURE", where XXX is the primary symptom (the first item you encounter from the following list):

1. POST -- if POST failed
2. MDM -- if MDM failed (also fill out the diagnostic section on the repair tag)
3. VMS bad pages -- if bad pages mapped out
4. Parity error -- if the console/ERRLOG indicated this

Glossary
------------------------------------------------------------
The terms below are defined as they pertain to memory systems, faults, and errors.

MEMORY SYSTEM TERMS

Alpha particle
   An alpha particle is a minute, one-shot disturbance which inverts the contents of a single DRAM cell (in other words, changes a "1" to a "0" or vice versa). The physical source of alpha particles is the DRAM packaging material.

Cell
   The basic unit of a DRAM. This element corresponds to one bit of storage.
   For example, a 1-Megabit DRAM contains 1,048,576 cells.

DRAM
   Dynamic Random Access Memory. This is the basic physical component (IC) of a memory array module. For example, there are 288 DRAMs on the M7609 MS630-CA 8-Mbyte memory array module.

Error
   An error occurs when the actual state deviates from the expected state. For example, if a parity check is made on a byte of information fetched from memory, and even parity is computed (we expect odd parity), then a parity error is the result.

Fault
   The term fault is used to describe the underlying cause (or source) of an error. For example, if a parity error occurs, the underlying cause may be a physical component fault.

Parity
   Refers to a technique used to protect data storage. As implemented in the MicroVAX system, a spare (parity) bit is stored with every eight bits of data to aid in the detection of errors.

FAULT TERMS

The following definitions are all considered attributes of errors or faults. As such, the definitions of these adjectives are given as they apply to the terms "error" and "fault".

Faults can be categorized into three distinct groups. The nature of the group relates to the "period" of the fault (that is, how long the fault is present).

Hard fault
   Permanent. The fault is always present. Any access to or use of the faulty component results in an error. An example of a hard fault is a permanently damaged DRAM (which results in a parity error upon every access to the DRAM).

Transient fault
   This class of fault refers to a fault that occurs for only a brief period of time, then disappears forever. In other words, the fault occurs only once. Examples of transient faults include power line disturbances or alpha particle events.

Intermittent fault
   This class of fault refers to a fault which occurs periodically. The fault is not easily reproducible but does recur over some period of time. Sources of this class of fault may include marginal components or infrequently accessed logic.
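
The three fault classes above, together with the actions recommended in Sections 2.2, 3.2, and 5.2, can be summarized as a small decision sketch. This is only an illustration of the definitions, not a procedure from the original guide: the names `FaultClass`, `RECOMMENDED_ACTION`, and `classify_fault`, and the two inputs (`reproducible`, `occurrences`), are hypothetical and chosen for this example. The multiple-symptom case (Theory Number 3) is deliberately left out, since it calls for further manual diagnosis.

```python
from enum import Enum


class FaultClass(Enum):
    HARD = "hard"
    TRANSIENT = "transient"
    INTERMITTENT = "intermittent"


# Actions paraphrased from Sections 2.2, 3.2, and 5.2 of this guide.
RECOMMENDED_ACTION = {
    FaultClass.HARD: "Replace the failed FRU",
    FaultClass.TRANSIENT: "Do not replace the FRU; record the symptoms in the site log",
    FaultClass.INTERMITTENT: "Replace the failed FRU",
}


def classify_fault(reproducible: bool, occurrences: int) -> FaultClass:
    """Hypothetical helper: map field observations to a fault class.

    reproducible -- True if the error can be reproduced on demand
    occurrences  -- number of times the error has been observed (>= 1)
    """
    if occurrences < 1:
        raise ValueError("at least one error must have been observed")
    if reproducible:
        # Always present: every access to the faulty component fails.
        return FaultClass.HARD
    if occurrences == 1:
        # Happened once and appears to be an isolated incident.
        return FaultClass.TRANSIENT
    # Recurs over time but cannot be reproduced on demand.
    return FaultClass.INTERMITTENT
```

For example, a single parity error that cannot be reproduced maps to `FaultClass.TRANSIENT`, for which the guide recommends recording the event rather than replacing the module.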

ERROR TERMS

Once a fault occurs and the faulted "component" is accessed, an error results. In one sense, the error "inherits" the attributes of the fault (for example, a permanent fault results in a permanent error). However, errors are more appropriately defined in terms of how they impact the system. In this sense, there are two main classes of errors.

Recoverable
   This attribute means that the error condition can be corrected. An example of error recovery is ECC single-bit correction. This class of error has little or no impact on system operation. (Note that this class of error is sometimes referred to as a soft error.)

Unrecoverable
   This attribute means that the error condition cannot be corrected and the operation fails to complete. An example of an unrecoverable error is a memory parity error while in kernel mode.
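
The parity check described under "Parity" and "Error" in this glossary can be illustrated with a short sketch. It assumes one odd-parity bit per eight data bits, as the glossary states; the function names `odd_parity_bit` and `parity_error` and the sample byte are hypothetical, and the sketch makes no claim about how the MS630 hardware is actually wired.

```python
def odd_parity_bit(data: int) -> int:
    """Return the parity bit that gives data + parity an odd number of 1s."""
    if not 0 <= data <= 0xFF:
        raise ValueError("data must be one byte (0..255)")
    ones = bin(data).count("1")
    # If the byte already holds an odd number of 1s, the parity bit is 0.
    return 0 if ones % 2 == 1 else 1


def parity_error(data: int, stored_parity: int) -> bool:
    """True if the byte read back no longer agrees with its stored parity bit."""
    return odd_parity_bit(data) != stored_parity


# Write a byte, then model an alpha particle inverting one cell (bit 3).
written = 0b10110010
parity = odd_parity_bit(written)   # stored alongside the data
flipped = written ^ (1 << 3)       # single inverted cell

clean_read_fails = parity_error(written, parity)    # False: data intact
flipped_read_fails = parity_error(flipped, parity)  # True: parity error
```

Parity only detects the inverted cell; it cannot say which bit changed, so the error cannot be corrected in place. Re-writing the cell with good data clears the condition, which is consistent with the guidance in Section 3.2 that a single transient event needs no repair.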