Memory Single-bit Failure Error Rate Exceeded
Multibit error encountered on Dell Server Memory

Question (AXE-Labs, Sep 5 '13): Dell OpenManage reported the following:

Memory device status is critical
Memory device location: DIMM_B2
Possible memory module event cause: Multi bit error encountered

What does this mean? How bad is it?

Comment (Tom O'Connor, Sep 5 '13): Call Dell Support, send it back as faulty.

Answer (AXE-Labs, Sep 5 '13): The event message reference for this was 1404. It indicates a faulty DIMM that should be replaced. From what I read on blogs, however, the alert often clears and does not come back after a reboot. Since it only tripped once for me, I cleared the memory errors using OMSA (dcicfg32.exe), and so far so good.

Comment (JimNim, Sep 6 '13): This was a good move. Replacement typically isn't warranted after a single occurrence, though I'd seriously consider it if the problem ever returns on that particular DIMM.

Comment (AXE-Labs, Mar 6 '14): Similarly, I was seeing "Single bit warning error rate exceeded" and "Single bit failure error rate exceeded" on a Linux host. These can be cleared as well, but with omconfig: 'omconfig system alertlog action=clear' and 'omconfig system esmlog action=clear'. Let's hope they don't come back, or it's trash for the DIMMs.

Comment: Make sure you've got the latest firmware/BIOS too; I have seen cases where these sorts …
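The gap between "single-bit" warnings and a "multi bit error" comes from how ECC memory works. Standard server ECC is a SECDED code (single-error-correct, double-error-detect). One flipped bit in a protected word is corrected on the fly, while two flipped bits can only be detected, which is why the multi-bit event is reported as critical.

The sketch below is a toy extended Hamming(8,4) code, chosen only for readability. Real DIMMs protect 64 data bits with 8 check bits, and some platforms use stronger chipkill-style codes. It is not Dell's or any memory controller's actual implementation.

```python
from typing import List, Tuple


def encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword.

    Positions 1..7 hold a Hamming(7,4) code with parity bits at 1, 2 and 4;
    position 0 holds an overall parity bit used for double-error detection.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    word = [0, p1, p2, d1, p4, d2, d3, d4]
    word[0] = sum(word[1:]) % 2  # make total parity of all 8 bits even
    return word


def decode(word: List[int]) -> Tuple[str, List[int]]:
    """Return ("ok" | "corrected" | "uncorrectable", data bits)."""
    w = list(word)  # work on a copy
    syndrome = 0
    for pos in range(1, 8):
        if w[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    overall = sum(w) % 2  # 1 means an odd number of bits flipped
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Assume a single flip; syndrome 0 means the overall parity bit itself flipped.
        w[syndrome] ^= 1
        status = "corrected"
    else:
        # Syndrome is non-zero but total parity is even: two bits flipped.
        status = "uncorrectable"
    return status, [w[3], w[5], w[6], w[7]]


if __name__ == "__main__":
    cw = encode([1, 0, 1, 1])
    one = list(cw)
    one[5] ^= 1  # single-bit error
    two = list(cw)
    two[3] ^= 1
    two[6] ^= 1  # double-bit error
    print(decode(cw))      # ('ok', [1, 0, 1, 1])
    print(decode(one))     # ('corrected', [1, 0, 1, 1])
    print(decode(two)[0])  # uncorrectable
```

In the last case the returned data bits cannot be trusted. That is the situation the "Multi bit error encountered" event describes.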
… saw that one of the memory chips (B3) had a parity error: "single-bit failure error rate exceeded". Since the server wasn't in production yet, I was able to run all the available updates. After a reboot, the error vanished, so problem solved, right? Wrong! A week later, the B3 memory chip was showing a parity error again. Having no updates to run, I tried a reboot. This cleared the error again, but I was very uncomfortable putting a server into production with a hardware problem.

Dell suggested swapping memory chips to verify whether the error would follow the chip. The technician suspected either a bad chip or a bad slot on the motherboard. After a trek to the data center, I swapped the suspect B3 chip with the nearby chip B5, and the problem resurfaced on the B5 chip. It looked like the problem was a faulty chip. Not a problem: Dell shipped all new RAM. After I swapped out all the chips, however, the same problem showed up a day later, this time on chip B4. The earlier error following the chip now seemed nothing more than a coincidence.

This was getting odd. The technician asked me which memory chips I had swapped and which one was now showing the error. Since the issue stayed within the same bank (B), this technician believed the CPU to be at fault and asked me to swap the CPUs. After I did this, sure enough, the error happened again: on the same bank of memory chips, but in a different slot this time, B2. Dell replaced the motherboard, suspecting the DIMM slots were bad, and the CPUs for good measure. No go. The problem happened yet again a few days later.

My server's problem was escalated. This was getting very odd. The next technician looked at the DSET report I had sent earlier and asked me to change a BIOS setting involving power-saving features for the CPU. The reasoning went like this. The server was not under a production load, so one of the CPUs was put into a low-power state when it was idle overnight. The memory controller on this server is built into the CPUs. If a network request or anything else came in while the CPU was in that state, any request involving the CPU would be held in the memory controller until the CPU could 'wake up'. Possibly the request could not be serviced quickly enough, and the memory controller reported an error.

That was it! After I made that one change, the error never resurfaced, and I went on to install the server in the production cluster. Keep this in mind the next time you're chasing a memory issue: the problem may not be the hardware after all.

Posted by HARDWARE at 21:48
Source: http://hardwarmotherbord.blogspot.com/2011/03/memory-error-isnt-ram.html
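The swap-and-observe method in the post (move a suspect DIMM, or the CPUs, and see whether the error follows it) amounts to asking which component is common to every error report. Below is a minimal sketch of that bookkeeping. It assumes each report records the hypothetical fields `dimm`, `slot`, `bank` and `cpu`, and it uses made-up serials modelled on the sequence of events in the post. These fields are illustrative only, not an OpenManage log format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ErrorReport:
    dimm: str  # hypothetical serial of the physical module in the slot
    slot: str  # slot the error was reported on, e.g. "B3"
    bank: str  # slot bank, tied to one CPU socket's memory controller
    cpu: str   # hypothetical serial of the physical CPU driving that bank


def common_suspects(reports: List[ErrorReport]) -> List[str]:
    """Return 'field=value' for every field that is identical across all reports."""
    if not reports:
        return []
    suspects = []
    for field in ("dimm", "slot", "bank", "cpu"):
        values = {getattr(r, field) for r in reports}
        if len(values) == 1:
            suspects.append(f"{field}={values.pop()}")
    return suspects


# Illustrative data following the post: B3, then B5 after swapping B3/B5,
# then B4 on all-new RAM, then B2 after swapping the CPUs.
reports = [
    ErrorReport("DIMM-1", "B3", "B", "CPU-A"),
    ErrorReport("DIMM-1", "B5", "B", "CPU-A"),
    ErrorReport("DIMM-9", "B4", "B", "CPU-A"),
    ErrorReport("DIMM-7", "B2", "B", "CPU-B"),
]

if __name__ == "__main__":
    print(common_suspects(reports[:2]))  # ['dimm=DIMM-1', 'bank=B', 'cpu=CPU-A']
    print(common_suspects(reports))      # ['bank=B']
```

After the first swap the DIMM looks guilty. By the end only the bank position is shared, which is why the board was replaced next. As the post shows, a shared factor is a clue rather than proof: the eventual cause was a BIOS power setting that none of these fields captures.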
Source: https://community.spiceworks.com/topic/966395-vmware-esxi-says-ram-healthy-dell-osa-says-ram-critical

Question (Matthew4416): Hey guys & gals, I've been running VMware ESXi on a Dell PowerEdge 2950 as my personal server for years without a hitch, other than throwing in the occasional hard drive after vSphere shows one has failed. Well, I recently picked up an MD1000/PERC 5/E combo for dirt cheap, so I figured I'd play around with it. This forced me to finally install Dell OpenManage Server Administrator onto the ESXi host. As soon as I logged in for the first time, it said my system was in CRITICAL status and that 3 of my 8 RAM sticks were critical with failures (Single bit warning error rate exceeded, Single-bit failure error rate exceeded). All this while ESXi is saying everything is hunky dory and the RAM is healthy.

I guess I have two questions:
1) Is VMware usually more perceptive at detecting possible hardware issues like this?
2) Even though Dell OSA says it's critical, I've never had a crash or lockup on any of the VMs. Is it being a little overconservative with single-bit failures? Or should I drop the $300+ and replace the sticks?

Reply (Luc23, May 22, 2015 at 3:57 UTC):
1) I'd say that OMSA is more perceptive than VMware.
2) I wouldn't drop that kind of money at this point, especially since it's your personal server. I'd try swapping …
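The two messages in this thread ("Single bit warning error rate exceeded" and "Single-bit failure error rate exceeded") suggest a two-level threshold on the rate of corrected errors. OpenManage's actual thresholds and time window are not given here. The sketch below is a generic sliding-window monitor with made-up placeholder numbers, not Dell's algorithm.

```python
from collections import deque
from typing import Deque


class CorrectableErrorMonitor:
    """Classify a DIMM by how many corrected (single-bit) errors it logged recently.

    window_s, warn_at and fail_at are hypothetical placeholders, not OpenManage values.
    Events must be recorded in non-decreasing time order.
    """

    def __init__(self, window_s: float = 3600.0, warn_at: int = 10, fail_at: int = 100):
        self.window_s = window_s
        self.warn_at = warn_at
        self.fail_at = fail_at
        self.events: Deque[float] = deque()

    def record(self, t: float) -> str:
        """Record one corrected error at time t (seconds) and return the current status."""
        self.events.append(t)
        # Forget errors that are window_s seconds old or older.
        while self.events and self.events[0] <= t - self.window_s:
            self.events.popleft()
        n = len(self.events)
        if n >= self.fail_at:
            return "failure rate exceeded"
        if n >= self.warn_at:
            return "warning rate exceeded"
        return "ok"


if __name__ == "__main__":
    mon = CorrectableErrorMonitor(window_s=60, warn_at=3, fail_at=5)
    for t in [0, 1, 2, 3, 4]:
        print(t, mon.record(t))
    print(200, mon.record(200))  # earlier bursts have aged out of the window
```

Each single-bit error in this model has already been corrected. Crossing either threshold therefore generally signals a module that may be trending toward failure, rather than data that has already been lost. That is consistent with the poster seeing no crashes.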
Tracking Down a Single-Bit RAM Error
Posted by timothy on Thursday June 24, 2010 @07:10PM from the you'll-need-a-nice-microscope dept.

Hanji writes: "We have discussed here before the potential effects of and protections against cosmic ray radiation, but for the average computer user, it's an obscure threat that doesn't affect them in any real way. Well, here's a blog post that describes a strange segfault and, after extensive debugging, traces it down to a single bit flip, probably caused by a stray cosmic ray. Lots of helpful descriptions of Linux debugging techniques in this one, and a pretty clear demonstration that this can be a real problem. I know I'm never buying a desktop without ECC RAM ever again!"

The author acknowledges that it might not have been a cosmic-ray-induced error, but the troubleshooting steps are interesting no matter what the cause.
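Tracing a fault "down to a single bit flip" ultimately means showing that a bad value differs from the expected one in exactly one bit position. A minimal way to check that is to XOR the expected and observed bytes and list the set bits. The sketch below does this. The byte values are illustrative and are not taken from the linked blog post.

```python
from typing import List, Tuple


def bit_differences(expected: bytes, observed: bytes) -> List[Tuple[int, int]]:
    """Return (byte_offset, bit_index) for every differing bit; bit 0 is the least significant."""
    if len(expected) != len(observed):
        raise ValueError("buffers must be the same length")
    diffs = []
    for offset, (a, b) in enumerate(zip(expected, observed)):
        x = a ^ b  # set bits mark positions that differ
        for bit in range(8):
            if (x >> bit) & 1:
                diffs.append((offset, bit))
    return diffs


def is_single_bit_flip(expected: bytes, observed: bytes) -> bool:
    """True if the buffers differ in exactly one bit."""
    return len(bit_differences(expected, observed)) == 1


if __name__ == "__main__":
    good = bytes([0x10, 0x20, 0x30])
    bad = bytes([0x10, 0x28, 0x30])  # 0x20 ^ 0x28 == 0x08
    print(bit_differences(good, bad))     # [(1, 3)]
    print(is_single_bit_flip(good, bad))  # True
```

On a machine with ECC RAM, a flip like this would typically have been corrected and logged as a single-bit event, like the ones in the threads above, instead of surfacing as a segfault.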