Error: "No Local Heartbeat. Forcing Shutdown"
FAQ (from the Linux-HA wiki)

No Local Heartbeat

Q: I got the message "ERROR: No local heartbeat. Forcing shutdown" and then Heartbeat shut itself down for no reason at all!

A: First of all, Heartbeat never shuts itself down for no reason at all. This message indicates that Heartbeat is not working properly, which in our experience is caused by one of two things: the system being under heavy I/O load, or a kernel bug. For how to deal with the first cause (heavy load), please read the answer to the next FAQ item. If your system was not under moderate to heavy load when it got this message, you probably hit the kernel bug: the 2.4.18-2.4.20 Linux kernels had a bug that could leave Heartbeat unscheduled for very long periods when the system was idle, or nearly so. If this is the case, you need to get a kernel that isn't broken.

Heavy Load

Q: How do I tune Heartbeat on a heavily loaded system to avoid split-brain?

A: "No local heartbeat" or "Cluster node ... returning after partition" under heavy load is typically caused by too small a deadtime interval (the deadtime directive in ha.cf), or by an older version of Heartbeat. Make sure you're running at least version 3.0.2. Here is a suggested procedure for tuning deadtime (see the ha.cf sketch at the end of this FAQ):

  1. Set deadtime to 60 seconds or higher.
  2. Set warntime to 1/4 to 1/2 of whatever you want your deadtime to be.
  3. Run your system under heavy load for a few weeks.
  4. Look at your logs for the longest interval either node went without hearing a heartbeat. If you never saw a "late heartbeat" message, your chosen deadtime is fine - use it.
  5. Otherwise, set deadtime to 1.5-2 times that longest interval, and set warntime to keepalive*2.
  6. Continue to monitor the logs for warnings about long heartbeat times.

If you skip this tuning, you may get "Cluster node ... returning after partition", which causes Heartbeat to restart on all machines in the cluster. This will almost certainly annoy you, at a minimum. Adding memory to the machine generally helps, as does limiting its workload. Newer versions of Heartbeat handle this better than pre-3.0.x versions. Some customers report being able to set sub-second deadtimes in their applications; your mileage may vary.

TTY timeout

Q: I got the message "TTY write timeout on [/dev/ttyxxx]", but both nodes are up and I tested my serial cable.

A: If both
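The tuning steps in the Heavy Load item above map directly onto the timing directives in ha.cf. A minimal sketch of a conservative starting point, using the stock directive names (keepalive, warntime, deadtime, initdead); the exact values are illustrative:

  # Conservative starting point, per the Heavy Load item above.
  keepalive 2       # send a heartbeat every 2 seconds
  warntime  15      # 1/4 of deadtime: log "late heartbeat" warnings early
  deadtime  60      # step 1: start at 60 seconds or higher
  initdead  120     # commonly set to at least 2x deadtime to cover boot

After a few weeks under load, revisit these per steps 4-5: raise deadtime to 1.5-2 times the longest observed heartbeat gap and set warntime to keepalive*2.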
From a thread on the linux-ha mailing list (January 2004), Mike Machado wrote:

I am getting the "No local heartbeat. Forcing shutdown." messages on my cluster, using both the 1.1.3 and 1.0.4 versions of heartbeat. Reading the archives, it is suggested this is caused either by a kernel bug or by heavy system load. I am running a stock 2.4.24 kernel on Debian, and the system is not doing anything real yet, so there is no load. But just to be sure, I set keepalive to 2 seconds, deadtime to 60 seconds, and warntime to 20 seconds; yet this still happens repeatably, about 5 minutes after I start heartbeat. This is a very simple cluster that just has a floating IP and a process that is started from /etc/init.d. Something I found curious about the logs:

  heartbeat: 2004/01/21_09:10:08 /usr/lib/heartbeat/send_arp eth0 66.81.1.198 003048425922 66.81.1.198 ffffffffffff
  heartbeat: 2004/01/21_09:13:07 WARN: node smf-syslogcore1: is dead
  heartbeat: 2004/01/21_09:13:07 ERROR: No local heartbeat. Forcing shutdown.
  heartbeat: 2004/01/21_09:13:07 info: hb_signal_giveup_resources(): current status: active
  heartbeat: 2004/01/21_09:13:07 info: Heartbeat shutdown in progress. (3101)
  heartbeat: 2004/01/21_09:13:07 info: Giving up all HA resources.

The first message is logged when it has finished acquiring its own resources; then, 3 minutes later, it gives a warning and pronounces itself dead. If warntime is 20 and deadtime is 60, shouldn't I see several WARN messages before it is pronounced dead? Does anyone know if 2.4.24 suffers from the bug that causes this error? Any other thoughts?
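As the FAQ above suggests, the first diagnostic step in a case like this is to scan the logs for late-heartbeat warnings. A minimal sketch, assuming Heartbeat writes to /var/log/ha-log (the actual destination depends on the logfile/logfacility settings in ha.cf, and may be syslog instead):

  # Show any "late heartbeat" warnings, assuming the log path below;
  # adjust for your ha.cf logfile/logfacility settings.
  grep -i 'late heartbeat' /var/log/ha-log | tail -20

  # Check whether the node ever declared a peer (or itself) dead.
  grep -E 'WARN: node .*: is dead|No local heartbeat' /var/log/ha-log

If deadtime is reached with no preceding late-heartbeat warnings at warntime, that is consistent with the local process not being scheduled at all, rather than with heartbeats merely arriving slowly.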
From a post to the lvs-users mailing list:
Hi,

I'm getting the 'no local heartbeat' error and I'm kind of stuck. The HOWTO/FAQ also says it is due to high load or a kernel bug. The problem only occurs after the primary node comes back online, about 2 minutes later. There is no high load; I'm using kernel 2.6.17.13 in combination with ldirectord for LVS. I don't know what to do anymore. I've adjusted the *time values to:

  keepalive 500ms
  warntime 1
  deadtime 3
  initdead 60

which is already too high for my liking, but okay. There is not a single 'late heartbeat' once both nodes are up. Again, this only occurs once the primary node comes back online, and only some time after that. I have attached a piece of the logfile to explain.

Thanks,
Sebastian

["log.txt" (text/plain)]
  Oct 26 09:57:18 rpzlvs05 PCI: Bridge: 0000:00:1e.0
  Oct 26 09:57:18 rpzlvs05 IO window: 1000-1fff
  Oct 26 09:57:18 rpzlvs05 MEM window: 40000000-403fffff
  Oct 26 09:57:18 rpzlvs05 PREFETCH window: 10000000-100fffff
  Oct 26 09:57:18 rpzlvs05 PCI: Setting latency timer of device 0000:00:1e.0 to 64
  Oct 26 09:57:18 rpzlvs05 NET: Registered protocol family 2
  Oct 26 09:57:18 rpzlvs05 IP route cache hash table entries: 1024 (order: 0, 4096 bytes)
  Oct 26 09:57:18 rpzlvs05 TCP established hash table entries: 4096 (order: 2, 16384 bytes)
  Oct 26 09:57:18 rpzlvs05 TCP bind hash table entries: 2048 (order: 1, 8192 bytes)
  Oct 26 09:57:18 rpzlvs05 TCP: Hash tables configured (established 4096 bind 2048)
  Oct 26 09:57:18 rpzlvs05 TCP reno registered
  Oct 26 09:57:18 rpzlvs05 Machine check exception polling timer started.
  Oct 26 09:57:18 rpzlvs05 NTFS driver 2.1.27 [Flags: R/O].
  Oct 26 09:57:18 rpzlvs05 io scheduler noop registered
  Oct 26 09:57:18 rpzlvs05 io scheduler cfq registered (default)
  Oct 26 09:57:18 rpzlvs05 Real Time Clock Driver v1.12ac
  Oct 26 09:57:18 rpzlvs05 Non-volatile memory driver v1.2
  Oct 26 09:57:18 rpzlvs05 Linux agpgart interface v0.101 (c) Dave Jones
  Oct 26 09:57:18 rpzlvs05 Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing disabled
  Oct 26 09:57:18 rpzlvs05 serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
  Oct 26 09:57:18 rpzlvs05 serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
  Oct 26 09:57:18 rpzlvs05 00:09: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
  Oct 26 09:57:18 rpzlvs05 00:0a: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
  Oct 26 09:57:18 rpzlvs05 Floppy drive(s): fd0 is 1.44M
  Oct 26 09:57:18 rpzlvs05 FDC 0 is a post-1991 82077
  Oct 26 09:57:18 rpzlvs05 RAMDISK driver i
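For reference, the timing values Sebastian reports correspond to an ha.cf fragment like the following (a sketch built only from the values in the post; the ms suffix for sub-second intervals is taken from his own keepalive setting). Note that the FAQ above cautions that deadtimes this small make false failovers under load much more likely:

  # Aggressive timings as reported in the post above
  # (values are seconds unless suffixed; 500ms = half a second).
  keepalive 500ms   # interval between heartbeat packets
  warntime  1       # warn of a late heartbeat after 1 second of silence
  deadtime  3       # declare the peer dead after 3 seconds of silence
  initdead  60      # extra-long deadtime while the cluster first starts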