Hardware Error Report And Decode Tool
errors. This chapter has the following sections:

    Downloading HERD
    About HERD
    Installing HERD
    Starting the HERD Daemon
    Using HERD
    Known Problems and Limitations
    Identifying CPU and DIMMs With mcelog MCEs
    Software Error Report and Decode (SERD)

Downloading HERD

You can download HERD from the Tools and Drivers CD, if available, or from the Tools and Drivers CD image, downloadable from the product web page. The utility resides in the /tools/linux/herd directory.

About HERD

HERD is a tool for monitoring, decoding, and reporting correctable hardware errors. These correctable hardware errors are also known as Machine Check Exceptions (MCEs). Linux x86_64 kernels since version 2.6.4 do not print recoverable MCEs to the kernel log. Instead, they are saved into a special kernel buffer that is accessible through /dev/mcelog.

HERD monitors and collects data from /dev/mcelog and reports the corresponding errors to the system log and, if the resource is available, to the system Service Processor (SP) Event Log through the local IPMI interface.

During error decoding, HERD attempts to provide as much information as possible from the data supplied by the AMD CPU. In particular, physical addresses obtained from correctable ECC memory errors are matched to the corresponding CPU slot and DIMM number.

HERD is supported on Sun servers with AMD processors.

Installing HERD

RPMs are provided for the following Linux distributions:

TABLE 7-1 RPM Linux Distributions

    Release                    RPM Designation
    Red Hat RHEL4 (64-bit)     herd-1.x-x.rh4.x86_64.rpm
    Red Hat RHEL5 (64-bit)     herd-1.x-x.rh5.x86_64.rpm
    Novell SLES9 (64-bit)      herd-1.x-x.sl9.x86_64.rpm
    Novell SLES10 (64-bit)     herd-1.x-x.sl10.x86_64.rpm

To install the RPM, run the following command:

    rpm -Uhv herd-1.x-1.rh4.x86_64.rpm

Each RPM has a set of run-time dependencies that are enforced by RPM. These dependencies include the openssl libraries or the OpenIPMI scripts. If one of these dependencies is missing, RPM reports an error and you must install the missing package manually.

With SLES, use the yast utility. For example, type:

    yast2 -i OpenIPMI

With RHEL, use up2date or system-config-packages. For example, type:

    up2date -i openssl

HERD is designed to be backward compatible with the mcelog utility. It supports the same command-line options and uses the same format to report errors to the system log. As such, HERD acts as a replacement for mcelog (the two cannot be used at the same time). Note that this conflict information is encoded into
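Because HERD and mcelog cannot be used at the same time, it can be useful to check whether an mcelog package is already present before installing the HERD RPM. The following is a minimal Python sketch, not part of HERD itself. It assumes the conflicting package is named mcelog and relies only on `rpm -q`, which exits with a non-zero status when the queried package is not installed. The helper names `is_installed` and `mcelog_conflicts_with_herd` are hypothetical, and the `run` parameter exists only so the command runner can be swapped out.

```python
import subprocess


def is_installed(package, run=subprocess.run):
    """Return True if `rpm -q <package>` reports the package as installed."""
    # rpm -q exits with status 0 when the package is found, non-zero otherwise.
    result = run(["rpm", "-q", package], capture_output=True, text=True)
    return result.returncode == 0


def mcelog_conflicts_with_herd(run=subprocess.run):
    """Return True when an mcelog package (assumed name) is installed.

    HERD replaces mcelog, so an installed mcelog package would have to be
    removed before HERD can be used.
    """
    return is_installed("mcelog", run=run)
```

On an RPM-based system, calling `mcelog_conflicts_with_herd()` before `rpm -Uhv` gives an early warning instead of a failed install.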
https://docs.oracle.com/cd/E21916_01/html/820-1120-22/chapter7.html

(these are hardware faults registered by the CPU). The utility discussed in the post (mcelog) is pretty sweet, and provides a portion of the capabilities that are currently available in the Solaris FMA architecture. The mcelog utility ships with several distributions, and can also be installed from various network repositories:

    $ yum install mcelog
    $ rpm -q -a | grep mcelog
    mcelog-0.7-1.22.fc6

The mcelog package will add an hourly cron job to /etc/cron.hourly to check for new MCEs. If mcelog locates an MCE, an entry similar to the following will be written to /var/log/mcelog:

    $ less /var/log/mcelog
    MCE 0
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    CPU 2 4 northbridge TSC 1157b0af355f7d
    MISC c008064f00000000 ADDR 40db12ae0
      Northbridge Chipkill ECC error
      Chipkill ECC syndrome = 7273
           bit46 = corrected ecc error
           bit59 = misc error valid
      bus error 'local node response, request didn't time out
          generic read mem transaction
          memory access, level generic'
    STATUS 9c39c00072080a13 MCGSTATUS 0

If you would prefer to route fault messages to a central location for processing, you can add the "--syslog" option to the mcelog cron job. This is an awesome utility, and should simplify locating hardware errors (especially if this gets combined with memtest86+) on my various Linux hosts.

matty on June 11, 2009 | Filed Under Linux Utilities
http://prefetch.net/blog/index.php/2009/06/11/locating-hardware-faults-on-linux-servers/

4 Comments

daybringer on June 11th, 2009
Thanks, this rocks. I have one server that is acting bad but I can't find any problems with it. It is saying that a fan is bad, tried swapping some out and nothing, this should help a lot.

Scott Davenport on June 19th, 2009
Sun also puts out a Hardware Error Report & Decode (HERD) tool that does some additional processing on the mcelog. I'm not overly familiar with the tool, and I expect any thresholds it uses are tuned to Sun systems, but figured it's worth noting on this thread.
https://cds.sun.com/is-bin/INTERSHOP.enfinity/WFS/CDS-CDS_SMI-Site/en_US/-/USD/ViewProductDetail-Start?ProductRef=HERD-2.0-M-G-F@CDS-CDS_SMI

dumbilom on July 28th, 2009
/clapping well done Sun.. Doesn't support Intel .. why even bother. Their latest release herd-3.0-1.blahblah.x86_64.rpm doesn't even handle it.. pfff

dumbilom on July 29th, 2009
Actually… research would show it's developed purely for AMD so I guess my bad.. not Sun's fault.. Although they should at least indicate what architecture it works on.. still clowns I guess.
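The sample record quoted in the post can also be picked apart programmatically. The sketch below is a rough illustration in Python, not a replacement for mcelog's own decoding. It extracts the numeric fields from that one record and checks the two STATUS bits that mcelog itself named (bit 46 and bit 59). The function names `parse_record` and `bit_is_set` are hypothetical, and reading the second number on the CPU line as the bank number is an assumption about this output format.

```python
import re

# The record quoted in the post, flattened into one string.
# The regular expressions below do not depend on line breaks.
SAMPLE = (
    "MCE 0 HARDWARE ERROR. This is *NOT* a software problem! "
    "Please contact your hardware vendor "
    "CPU 2 4 northbridge TSC 1157b0af355f7d "
    "MISC c008064f00000000 ADDR 40db12ae0 "
    "Northbridge Chipkill ECC error Chipkill ECC syndrome = 7273 "
    "bit46 = corrected ecc error bit59 = misc error valid "
    "bus error 'local node response, request didn't time out "
    "generic read mem transaction memory access, level generic' "
    "STATUS 9c39c00072080a13 MCGSTATUS 0"
)

# Upper-case field labels whose values are printed in hexadecimal.
HEX_FIELDS = ("TSC", "MISC", "ADDR", "STATUS", "MCGSTATUS")


def parse_record(text):
    """Extract the CPU line and the hexadecimal fields from one record."""
    record = {}
    cpu = re.search(r"\bCPU (\d+) (\d+) (\S+)", text)
    if cpu:
        record["cpu"] = int(cpu.group(1))
        record["bank"] = int(cpu.group(2))  # assumption: second number is the bank
        record["unit"] = cpu.group(3)
    for name in HEX_FIELDS:
        # \b keeps "STATUS" from matching inside "MCGSTATUS".
        match = re.search(rf"\b{name} ([0-9a-fA-F]+)\b", text)
        if match:
            record[name.lower()] = int(match.group(1), 16)
    return record


def bit_is_set(value, position):
    """Return True if the given bit (0 = least significant) is set."""
    return ((value >> position) & 1) == 1
```

For the sample record, `parse_record(SAMPLE)` reports CPU 2 and the address 0x40db12ae0, and `bit_is_set` confirms that bits 46 and 59 of STATUS are set, which agrees with the "corrected ecc error" and "misc error valid" lines in the mcelog output.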
mcelog (http://www.mcelog.org/) logs and accounts for machine checks (in particular memory, IO, and CPU hardware errors) on modern x86 Linux systems.

mcelog is required by both 32-bit x86 Linux kernels (since 2.6.30) and 64-bit Linux kernels (since early 2.6 kernel releases) to log machine checks, and it should run on all Linux systems that need error handling.

The mcelog daemon accounts for memory errors and some other errors in various ways. mcelog --client can be used to query a running daemon. The daemon can also execute triggers when configurable error thresholds are exceeded. This is used to implement a range of automatic predictive failure analysis algorithms, including bad page offlining and automatic cache error handling. User-defined actions can also be configured.

All errors are logged to /var/log/mcelog, to syslog, or to the journal.

For memory errors it supports modern x86 systems with integrated memory controllers; for CPU errors all modern x86 systems are supported.

Traditionally mcelog was run as a cron job, but this usage is now deprecated. The modern way to run it is to start it at boot time and keep it running as a daemon. In addition, it can be used to decode fatal machine checks on the command line (although this is usually no longer needed on modern kernels, which log those automatically after reboot).

For installation information and how to set up an mcelog package (if you're a distributor) please see the README.
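The idea of running a trigger once an error threshold is exceeded can be illustrated with a small sliding-window counter. This is a conceptual Python sketch only: it is not mcelog's implementation, and the class name `ErrorThreshold` and its parameters are hypothetical. It fires a callback when more than `limit` errors are recorded within `window` seconds, then starts counting afresh.

```python
from collections import deque


class ErrorThreshold:
    """Call `trigger` when more than `limit` errors fall within `window` seconds."""

    def __init__(self, limit, window, trigger):
        self.limit = limit
        self.window = window
        self.trigger = trigger
        self.events = deque()  # timestamps of recent errors, oldest first

    def record(self, timestamp):
        """Record one error; return True if the trigger fired."""
        self.events.append(timestamp)
        # Forget errors that have aged out of the window.
        while self.events and self.events[0] <= timestamp - self.window:
            self.events.popleft()
        if len(self.events) > self.limit:
            self.trigger(len(self.events))
            self.events.clear()  # start a fresh count after firing
            return True
        return False
```

A caller might create `ErrorThreshold(limit=10, window=24 * 3600, trigger=notify)` per memory page or DIMM, where `notify` is whatever action should run, and call `record(time.time())` for each corrected error.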