Mcelog Fallback Socket Memory Error Count
Source: Oracle Linux Fault Management Architecture Software User's Guide, section "Notification of Faults and Defects" (updated October 2015)
Mcelog Corrected Memory Errors On Page
Transaction Memory Scrubbing Error
Notification of Faults and Defects

When the mcelog daemon encounters an error, it triggers a configurable response and logs information to the mcelog file. For example, assume that physical address location 0x45a3b50c0 generates a correctable memory read error. When this happens, the mcelog daemon adds an entry to /var/log/mcelog. For example:

    CPU 8 BANK 3 TSC 0
    RIP 00:0
    MISC 0x85 ADDR 0x45a3b50c0   <------ address that had the correctable read error
    STATUS 0x9c000000f00c009f MCGSTATUS 0x7
    PROCESSOR 0:0x306f1 TIME 1389814624 SOCKETID 0 APICID 18
    MCGCAP 0x7000c16

A message is also sent to the system log (/var/log/messages) describing the problem (error count exceeded threshold) and what was done (offlining the page), such as:

    Jan 15 14:37:04 testserver16 kernel: Machine check poll done on CPU 8
    Jan 15 14:37:04 testserver16 mcelog: Family 6 Model 3f CPU: only decoding architectural errors
    Jan 15 14:37:04 testserver16 mcelog:
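The fields of the /var/log/mcelog entry quoted above can be pulled apart with a few lines of Python. This is only a sketch: the helper name parse_record is hypothetical, it assumes the record is a strict sequence of "KEY value" pairs (true for this sample, not guaranteed for every mcelog record), and the 4 KiB page size is an assumption. The STATUS bit positions (63 = valid, 61 = uncorrected, 58 = address valid) follow the Intel machine-check architecture.

```python
from datetime import datetime, timezone

# The sample record from the text, without the "<------" annotation.
RECORD = """CPU 8 BANK 3 TSC 0
RIP 00:0
MISC 0x85 ADDR 0x45a3b50c0
STATUS 0x9c000000f00c009f MCGSTATUS 0x7
PROCESSOR 0:0x306f1 TIME 1389814624 SOCKETID 0 APICID 18
MCGCAP 0x7000c16"""


def parse_record(text):
    """Hypothetical helper: map each KEY to its value string.

    Assumes the record alternates strictly between keys and values.
    """
    tokens = text.split()
    return dict(zip(tokens[0::2], tokens[1::2]))


fields = parse_record(RECORD)

addr = int(fields["ADDR"], 16)      # physical address of the error
page_frame = addr >> 12             # assumes 4 KiB pages
status = int(fields["STATUS"], 16)

valid = bool((status >> 63) & 1)        # VAL: the record is valid
uncorrected = bool((status >> 61) & 1)  # UC: clear for a corrected error
addr_valid = bool((status >> 58) & 1)   # ADDRV: ADDR holds a real address

# TIME is seconds since the Unix epoch.
when = datetime.fromtimestamp(int(fields["TIME"]), tz=timezone.utc)

print(hex(addr), hex(page_frame), valid, uncorrected, addr_valid)
print(when.isoformat())
```

For this record the UC bit is clear, which agrees with the text calling the error correctable, and TIME decodes to 19:37:04 UTC on 15 January 2014, which lines up with the 14:37:04 syslog timestamps if the server clock was five hours behind UTC.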
leaky bucket (andikleen/mcelog, issue #11, open)
https://github.com/andikleen/mcelog/issues/11

ejones71 opened this issue on Oct 21, 2013:

The recent leaky bucket update looks wrong to me. Testing with mce-inject shows that the threshold is exceeded on every event up to the bucket capacity. This is because __bucket_account() changed from >= to < in its comparison. I have a simple fix, but I'm not clear on the correct repeated threshold behavior if the bucket fills faster than it ages. What is the purpose of "excess"?

andikleen (owner) commented on Nov 14, 2013:

Sorry for the late answer. I would like the fix please. excess is just to show how many errors exceeded the bucket.

dimaslv commented on Jan 27, 2014:

Encountered same bug. I have a threshold of 100 in 24h, but get messages from every error:

    corrected Socket memory error count exceeded threshold: 1 in 24h
    Fallback Socket memory error count 1 exceeded threshold: 2 in 24h
    corrected Socket memory error count exceeded threshold: 3 in 24h

Errors are printed until the threshold (100) is really exceeded and then stop. Any news on a fix? Thanks in advance!

Related: https://docs.oracle.com/cd/E52095_01/html/E39070/gliqr.html
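The comparison problem discussed in this thread can be illustrated with a small leaky-bucket sketch. This is not mcelog's code: the class LeakyBucket, its fields, the aging rule, and the inverted flag are all hypothetical. Resetting the bucket after it fires is also an assumption, since the thread leaves the intended repeated-threshold behavior open.

```python
from dataclasses import dataclass


@dataclass
class LeakyBucket:
    """Hypothetical leaky bucket; not mcelog's implementation."""
    capacity: int            # events per window that trip the threshold
    window: float            # aging period in seconds (24h = 86400)
    count: int = 0           # events currently in the bucket
    last_aged: float = 0.0   # timestamp of the last aging step

    def _age(self, now: float) -> None:
        # Assumed aging rule: once at least one window has elapsed,
        # drain events in proportion to the elapsed time.
        elapsed = now - self.last_aged
        if elapsed >= self.window:
            drained = int(elapsed / self.window * self.capacity)
            self.count = max(0, self.count - drained)
            self.last_aged = now

    def account(self, now: float, inc: int = 1, inverted: bool = False) -> bool:
        """Add `inc` events at time `now`; return True if the trigger fires."""
        self._age(now)
        self.count += inc
        if inverted:
            # Models the reported symptom: with '<' the trigger fires on
            # every event until the bucket reaches capacity, then goes quiet.
            return self.count < self.capacity
        if self.count >= self.capacity:
            # Intended comparison: fire once the bucket is full, then reset.
            self.count = 0
            return True
        return False


good = LeakyBucket(capacity=3, window=86400.0)
bad = LeakyBucket(capacity=3, window=86400.0)

# Five errors arriving one second apart, well inside a single window.
good_fired = [good.account(float(t)) for t in range(5)]
bad_fired = [bad.account(float(t), inverted=True) for t in range(5)]

print(good_fired)
print(bad_fired)
```

With a capacity of 3, the >= variant fires only on the third error, while the inverted variant fires on the first two and then stays silent. That is the same pattern dimaslv reports with a threshold of 100.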
ECC chipkill errors: which DIMM?
http://serverfault.com/questions/5672/ecc-chipkill-errors-which-dimm

Asked on Server Fault by markdrayton, May 7 '09 at 8:20. Tags: linux, hardware, memory, ecc.

We often get DIMMs in our servers going bad with the following errors in syslog:

    May 7 09:15:31 nolcgi303 kernel: EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
    May 7 09:15:31 nolcgi303 kernel: MC0: CE page 0xa0, offset 0x40, grain 8, syndrome 0xb50d, row 2, channel 0, label "": k8_edac
    May 7 09:15:31 nolcgi303 kernel: MC0: CE - no information available: k8_edac Error Overflow set
    May 7 09:15:31 nolcgi303 kernel: EDAC k8 MC0: extended error code: ECC chipkill x4 error

We can use the HP SmartStart CD to determine which DIMM has the error but that requires taking the server out of production. Is there a cunning way to work out which DIMM's bust while the server is up? All our servers are HP hardware running RHEL 5.

Comments:

Alex Bolotov (May 7 '09 at 9:32): memtest86+ but I suppose you can't run it while RHEL is running

Chopper3 (May 7 '09 at 10:08): Are you running the HP SIM homepage (or full SIM for that matter actually) on the box? If so that'll offer a lot more info. Otherwise I'd need to know a bit more information about the memory offset from a more detailed error.

markdrayton (May 7 '09 at 10:14): We're not running any of the HP SIM stuff on the box as we generally find it more trouble than it's worth. If we can't work out which DIMM is dead while online it's not a showstopper -- I'm just on the lookout for ways to save time :~)
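The corrected-error (CE) line in the syslog excerpt above already names a memory controller, row, and channel, which is the information that would narrow down a DIMM. A minimal sketch of extracting those fields follows. The regular expression is written against the one sample line shown and is an assumption about the format, not a general EDAC parser, and the names CE_PATTERN and parse_ce are hypothetical. How row and channel map to a physical slot depends on the motherboard and is not covered here.

```python
import re

# The CE line from the question, reproduced as a Python string.
LINE = ('May 7 09:15:31 nolcgi303 kernel: MC0: CE page 0xa0, offset 0x40, '
        'grain 8, syndrome 0xb50d, row 2, channel 0, label "": k8_edac')

# Pattern derived from the sample line only (assumed format).
CE_PATTERN = re.compile(
    r'MC(?P<mc>\d+): CE page (?P<page>0x[0-9a-fA-F]+), '
    r'offset (?P<offset>0x[0-9a-fA-F]+), grain (?P<grain>\d+), '
    r'syndrome (?P<syndrome>0x[0-9a-fA-F]+), '
    r'row (?P<row>\d+), channel (?P<channel>\d+), label "(?P<label>[^"]*)"'
)


def parse_ce(line):
    """Return the CE fields as a dict, or None if the line does not match."""
    match = CE_PATTERN.search(line)
    if match is None:
        return None
    raw = match.groupdict()
    return {
        "mc": int(raw["mc"]),
        "page": int(raw["page"], 16),
        "offset": int(raw["offset"], 16),
        "grain": int(raw["grain"]),
        "syndrome": int(raw["syndrome"], 16),
        "row": int(raw["row"]),
        "channel": int(raw["channel"]),
        "label": raw["label"],
    }


info = parse_ce(LINE)
print(info)
```

The label field is empty in the sample, so the log itself does not name a slot; only the row and channel numbers are available from this line.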