Edac Error Linux
A DIMM has some corrected errors; how do you identify it? I have another article that lists memory testing tools on Linux. This time I use the EDAC error report utility. Here is an example that shows how to identify a defective DIMM on an AMD x64 architecture machine where syslog reported kernel errors from EDAC (the Error Detection and Correction kernel module).

Here is a typical piece of error output from EDAC:

  kernel: [Hardware Error]: MC4 Error (node 1): DRAM ECC error detected on the NB.
  kernel: EDAC amd64 MC1: CE ERROR_ADDRESS= 0xf075b2410
  kernel: EDAC MC1: CE page 0xf075b2, offset 0x410, grain 0, syndrome 0xa082, row 6, channel 0, label "": amd64_edac
  kernel: [Hardware Error]: Error Status: Corrected error, no action required.
  kernel: [Hardware Error]: CPU:6 (10:8:0) MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c414000a0080813
  kernel: [Hardware Error]: MC4_ADDR: 0x0000000f075b2410
  kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)

The message above can be confusing, so here is a quick breakdown of the "EDAC MC1: CE" line:

  Memory controller: MC1
  Error type: CE (corrected error)
  Memory page: 0xf075b2
  Offset in the page: 0x410
  Byte granularity: grain 0
  Error syndrome: 0xa082
  Memory row: row 6
  Memory channel: channel 0
  DIMM label: not given
  Module name: amd64_edac

More about the information given

EDAC is composed of a "core" module (edac_core.ko) and several Memory Controller (MC) driver modules. On a given system, the core is loaded along with one MC driver. Both the core and the MC driver (or edac_device driver) have individual versions that reflect the current release level of their respective modules. Thus, to report which version a system is running, one must report both the core's and the MC driver's versions.

The example server used in this article has these EDAC modules loaded:

  # lsmod | grep -i edac
  amd64_edac_mod         21913  0
  edac_core              46645  4 amd64_edac_mod
  edac_mce_amd           15615  1 amd64_edac_mod

Memory Controller (mc) model: EDAC abstracts each memory controller as an 'mc' device. Each 'mc' device controls a set of DIMM memory modules. These modules are laid out in a Chip-Select Row (csrowX) and Channel (chX) table. There can be multiple csrows and multiple channels. Memory controllers allow for several csrows, with 8 csrows being a typical value.

Channel: each channel represents a DIMM mo
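The field breakdown of the "EDAC MC1: CE" line can be sketched in code. The following is a minimal Python sketch, not part of any EDAC tool: the names `CE_PATTERN`, `SAMPLE`, and `parse_edac_line` are hypothetical, the regular expression is modeled only on the single sample line shown in this article (other kernel versions may format the message differently), and the address reconstruction assumes 4 KiB pages.

```python
import re

# Pattern modeled on the one sample line from this article (an assumption;
# other kernels may print the fields differently).
CE_PATTERN = re.compile(
    r'EDAC MC(?P<mc>\d+): (?P<type>CE|UE) page (?P<page>0x[0-9a-fA-F]+), '
    r'offset (?P<offset>0x[0-9a-fA-F]+), grain (?P<grain>\d+), '
    r'syndrome (?P<syndrome>0x[0-9a-fA-F]+), row (?P<row>\d+), '
    r'channel (?P<channel>\d+), label "(?P<label>[^"]*)"'
)

SAMPLE = ('kernel: EDAC MC1: CE page 0xf075b2, offset 0x410, grain 0, '
          'syndrome 0xa082, row 6, channel 0, label "": amd64_edac')


def parse_edac_line(line):
    """Return the fields of an EDAC CE/UE line as a dict, or None if no match."""
    match = CE_PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("mc", "grain", "row", "channel"):
        fields[key] = int(fields[key])
    for key in ("page", "offset", "syndrome"):
        fields[key] = int(fields[key], 16)
    # Assumes 4 KiB pages: physical address = page number * 4096 + offset.
    fields["address"] = (fields["page"] << 12) | fields["offset"]
    return fields


if __name__ == "__main__":
    info = parse_edac_line(SAMPLE)
    print("controller MC%d, row %d, channel %d" %
          (info["mc"], info["row"], info["channel"]))
    print("address", hex(info["address"]))
```

For the sample line, the reconstructed address equals the ERROR_ADDRESS value 0xf075b2410 reported by the amd64 driver, and the row and channel (6 and 0) are the coordinates to look up when locating the DIMM.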
Source for the EDAC walkthrough above: http://fibrevillage.com/sysadmin/240-how-to-identify-defective-dimm-from-edac-error-on-linux-2

From the libedac manual page (https://linux.die.net/man/3/edac):

  unsigned int edac_mc_count (edac_handle *edac);
  int edac_handle_reset (edac_handle *edac);
  int edac_error_totals (edac_handle *edac, struct edac_totals *totals);
  edac_mc * edac_next_mc (edac_handle *edac);
  int edac_mc_get_info (edac_mc *mc, struct edac_mc_info *info);
  edac_mc * edac_next_mc_info (edac_handle *edac, struct edac_mc_info *info);
  int edac_mc_reset (struct edac_mc *mc);
  edac_csrow * edac_next_csrow (struct edac_mc *mc);
  int edac_csrow_get_info (edac_csrow *csrow, struct edac_csrow_info *info);
  edac_csrow * edac_next_csrow_info (edac_mc *mc, struct edac_csrow_info *info);
  const char * edac_strerror (edac_handle *edac);
  edac_for_each_mc_info (edac_handle *edac, edac_mc *mc, struct edac_csrow_info *info) { ... }
  edac_for_each_csrow_info (edac_mc *mc, edac_csrow *csrow, struct edac_csrow_info *info) { ... }

Description

The libedac library offers a very simple programming interface to the information exported from in-kernel EDAC (Error Detection and Correction) drivers in sysfs. The edac-util(8) utility uses libedac to report errors in a user-friendly manner from the command line.

EDAC errors for most systems are recorded in sysfs on a per memory controller (MC) basis. Memory controllers are further subdivided by csrow and channel. The libedac library provides a method to loop through multiple MCs, and their corresponding csrows, obtaining information about each component from sysfs along the way. There is also a simple single call to retrieve the total error counts for a given machine.

In order to use libedac, an edac_handle must first be opened via the call edac_handle_create(). Once the handle is created, sysfs data can be loaded into the handle with edac_handle_init(). A final call to edac_handle_destroy() will free all memory and open files associated with the edac handle.
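The per-controller, per-csrow traversal that the description attributes to libedac can be illustrated without the library. The following is a rough Python sketch that reads the counters directly from sysfs; it does not use libedac. It assumes the legacy csrow layout `/sys/devices/system/edac/mc/mcN/csrowN/ce_count` and `ue_count`, and the names `EDAC_ROOT`, `read_count`, `csrow_errors`, and `error_totals` are hypothetical.

```python
from pathlib import Path

# Assumed location of the EDAC memory-controller tree in sysfs.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_count(path):
    """Read an integer counter file; treat a missing or odd file as 0."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return 0


def csrow_errors(root=EDAC_ROOT):
    """Return (mc, csrow, corrected, uncorrected) for every csrow found."""
    if not root.is_dir():
        return []  # EDAC driver not loaded, or a different sysfs layout
    rows = []
    for mc in sorted(root.glob("mc[0-9]*")):
        for csrow in sorted(mc.glob("csrow[0-9]*")):
            rows.append((mc.name, csrow.name,
                         read_count(csrow / "ce_count"),
                         read_count(csrow / "ue_count")))
    return rows


def error_totals(root=EDAC_ROOT):
    """Sum corrected and uncorrected counts over all csrows."""
    rows = csrow_errors(root)
    return sum(r[2] for r in rows), sum(r[3] for r in rows)


if __name__ == "__main__":
    for mc, csrow, ce, ue in csrow_errors():
        print(f"{mc} {csrow}: {ce} corrected, {ue} uncorrected")
    print("totals (CE, UE):", error_totals())
```

On a machine with no EDAC driver loaded, the sketch simply prints zero totals. A csrow with a growing corrected count is the one to map back to a physical DIMM.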
edac_handle_create() will return NULL on failure to allocate memory. The edac_strerror function will return a descriptive string representation of the last error for the libedac handle edac. The edac_error_totals(
Error Detection and Correction
Jeff Layton
Source: http://www.admin-magazine.com/Articles/Monitoring-Memory-Errors

Data protection and checking take place in various places throughout a system. Some of it is in hardware and some of it is in software. The goal is to ensure that data is not corrupted (changed), either coming from or going to the hardware or in the software stack. One key technology is ECC memory (error-correcting code memory).

The standard ECC memory used in systems today can detect and correct what are called single-bit errors, and although it can detect double-bit errors, it cannot correct them. A simple flip of one bit in a byte can make a drastic difference in the value of the byte. For example, a byte (8 bits) with a value of 156 (10011100) that is read from a file on disk suddenly acquires a value of 220 if the second bit from the left is flipped from a 0 to a 1 (11011100) for some reason.

ECC memory can detect the problem and correct it with the user unaware. Notice, however, that only one bit in the byte has been changed and then corrected. If two bits change (say, the second and seventh from the left), the byte is now 11011110 (i.e., 222); typical ECC memory can detect that the “double-bit” error occurred, but it cannot correct it. In fact, when a double-bit error happens, memory should cause what is called a “machine check exception” (MCE), which should cause the system to crash. After all, you are using ECC memory, so ensuring the data is correct is important; if an uncorrectable memory error occurs, you would probably want the system to stop.

Bit flips usually originate in some sort of electrical or magnetic interference inside the system. This interference can cause a bit to flip at seemingly random times, depending on the circumstances.
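The byte arithmetic above, and the difference between a correctable single-bit error and a detectable-only double-bit error, can be reproduced with a short Python sketch. This is a toy illustration only: it uses an extended Hamming code over 4 data bits, whereas real ECC DIMMs protect much wider words in hardware, and the function names `flip_bits`, `secded_encode`, and `secded_check` are hypothetical.

```python
def flip_bits(value, positions_from_left, width=8):
    """Flip the listed bits of a width-bit value (position 1 = leftmost)."""
    for pos in positions_from_left:
        value ^= 1 << (width - pos)
    return value


def secded_encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (bit list).

    Index 0 holds the overall parity; indexes 1..7 are Hamming positions,
    with check bits at 1, 2, 4 and data bits at 3, 5, 6, 7.
    """
    c = [0] * 8
    c[3], c[5], c[6], c[7] = [(nibble >> i) & 1 for i in (3, 2, 1, 0)]
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = sum(c[1:]) % 2  # makes the parity of the whole word even
    return c


def secded_check(code):
    """Return (status, word); status is 'ok', 'corrected' or 'uncorrectable'."""
    c = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall = sum(c) % 2
    if syndrome == 0 and overall == 0:
        return "ok", c
    if overall == 1:
        # Odd parity means one bit flipped; the syndrome is its position
        # (0 means the overall parity bit itself was hit).
        c[syndrome] ^= 1
        return "corrected", c
    # Even parity with a nonzero syndrome: two bits flipped.
    return "uncorrectable", c


if __name__ == "__main__":
    print(flip_bits(156, [2]), flip_bits(156, [2, 7]))
    word = secded_encode(0b1001)
    single = list(word)
    single[5] ^= 1
    double = list(single)
    double[6] ^= 1
    print(secded_check(single)[0], secded_check(double)[0])
```

The overall parity bit is what separates the two cases: one flipped bit leaves the word with odd parity and a syndrome that names the bad position, while two flipped bits restore even parity but leave a nonzero syndrome, so the error is seen but cannot be located.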
According to the Wikipedia article and a paper on single-event upsets in RAM, most single-bit flips are the result of background radiation, primarily neutrons from cosmic rays. The same Wikipedia article reports that the error rate