OCFS2 ERROR: Device "drbd0": another node is heartbeating in our slot!
I know I am not providing a solution for your problem, but when one of the nodes in my two-node OCFS2 cluster crashed, the second node would crash or freeze all I/O activity on the OCFS2 filesystems for a couple of minutes. That behavior went away when I put them under Heartbeat control (I'm stuck with SLES 10). Oh, and OCFS2 filesystems won't mount on a cluster until you provide a STONITH config.

best regards,
enrique.

On Tue, Jan 26, 2010 at 6:25 PM, Hunny Bunny wrote:
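Enrique's point about needing fencing before cluster filesystems will run safely can be illustrated with the crm shell. This is only a sketch using the external/ssh test plugin, which is suitable for a lab but is not real fencing, and the hostnames are hypothetical, not taken from the thread:

```
# Test-only fencing: external/ssh "fences" a node by rebooting it over ssh.
# Never use this plugin in production; substitute a real power/SAN fence device.
primitive stonith-ssh stonith:external/ssh \
        params hostlist="node1 node2"
clone fencing stonith-ssh
property stonith-enabled="true"
```

With stonith-enabled="true" and no working fence device, the cluster manager will refuse to start resources, which is the mount failure enrique alludes to.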
tmphb at yahoo, Jan 26, 2010, 3:25 PM

ERROR: Device "drbd0": another node is heartbeating in our slot!

Hello folks,

I'm very puzzled by the error messages in my /var/log/warn shown below. I'm not using Pacemaker or the Heartbeat CRM at this point. I have a plain DRBD configuration which replicates /dev/sda4 as the block device /dev/drbd0 between two nodes, node1 and node2. On top of /dev/drbd0 I have an OCFS2 partition which is mounted on node1 and node2 as /data. /dev/drbd0 is also exported as an iSCSI target from node2 to node3. On node3, using iscsiadm (open-iscsi), I can log in to the iSCSI target on node2, which then shows up as /dev/sdb on node3. Then, by executing /etc/init.d/o2cb.init start and /etc/init.d/ocfs2.init start, I can also mount /dev/sdb as /data on node3. It works, and I can access all the files in /data, which is shared between these three nodes.

However, this "...another node is heartbeating in our slot!" warning on node2 just drives me crazy. Could somebody please help me either turn it off or explain what is wrong?

Many thanks in advance,
Alex

Messages in /var/log/warn:
<------------ snipped -------------->
Jan 26 13:50:31 node2 kernel: [ 2740.959848] OCFS2 Node Manager 1.5.0
Jan 26 13:50:31 node2 kernel: [ 2740.961868] OCFS2 DLM 1.5.0
Jan 26 13:50:31 node2 kernel: [ 2740.962625] ocfs2: Registered cluster interface o2cb
Jan 26 13:50:31 node2 kernel: [ 2740.974260] OCFS2 DLMFS 1.5.0
Jan 26 13:50:31 node2 kernel: [ 2740.974331] OCFS2 User DLM kernel interface loaded
Jan 26 13:50:44 node2 kernel: [ 2753.748066] OCFS2 1.5.0
Jan 26 13:50:44 node2 kernel: [ 2753.757040] ocfs2_dlm: Nodes in domain ("8B9FFC5A4F12408EA5FC14B7CD1B3E97"): 2
Jan 26 13:50:44 node2 kernel: [ 2753.821386] ocfs2: Mounting device (147,0) on (node 2, slot 0) with ordered data mode.
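The node3 side of the setup Alex describes can be sketched with open-iscsi and the init scripts he names. The portal address and target IQN below are hypothetical placeholders, since the thread does not give them:

```shell
# On node3: discover targets exported by node2, then log in
# (portal and IQN are hypothetical examples)
iscsiadm -m discovery -t sendtargets -p node2:3260
iscsiadm -m node -T iqn.2010-01.example:drbd0 -p node2:3260 --login

# The LUN now appears as a local disk (here /dev/sdb); bring up the
# O2CB cluster stack and mount the shared OCFS2 filesystem
/etc/init.d/o2cb.init start
/etc/init.d/ocfs2.init start
mount -t ocfs2 /dev/sdb /data
```

For this to work, node3 must be a full member of the same o2cb cluster as node1 and node2, since every mounting node heartbeats on the device.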
Jan 26 13:55:01 node2 kernel: [ 3011.010186] (3449,0):o2hb_do_disk_heartbeat:768 ERROR: Device "drbd0": another node is heartbeating in our slot!
Jan 26 13:55:03 node2 kernel: [ 3013.020160] (3449,0):o2hb_do_disk_heartbeat:768 ERROR: Device "drbd0": another node is heartbeating in our slot!
Jan 26 13:55:05 node2 kernel: [ 3
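This message typically means two nodes are writing disk heartbeats into the same slot, i.e. two machines believe they have the same node number. So the first thing worth checking is that /etc/ocfs2/cluster.conf is identical on all three machines and assigns every node a unique number. A minimal sketch for this three-node layout (the IP addresses are hypothetical):

```
cluster:
        node_count = 3
        name = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.1.1
        number = 0
        name = node1
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.1.2
        number = 1
        name = node2
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.1.3
        number = 2
        name = node3
        cluster = ocfs2
```

If node3 was added to only its own copy of the file, or reuses node2's number, node2 would see exactly this collision on its heartbeat slot.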
Daniel Keisling, 2008-Oct-22 16:22 UTC
[Ocfs2-users] Another node is heartbeating in our slot! errors with LUN removal/addition

Greetings,

Last night I manually unpresented and deleted a LUN (a SAN snapshot) that was presented to one node in a four-node RAC environment running OCFS2 v1.4.1-1. The system then rebooted with the following error:

Oct 21 16:45:34 ausracdb03 kernel: (27,1):o2hb_write_timeout:166 ERROR: Heartbeat write timeout to device dm-24 after 120000 milliseconds
Oct 21 16:45:34 ausracdb03 kernel: (27,1):o2hb_stop_all_regions:1873 ERROR: stopping heartbeat on all active regions.

I'm assuming that dm-24 was the LUN that was deleted. Looking back in the syslog, I see many of these errors from the time the snapshot was taken until the reboot:

Oct 21 16:42:54 ausracdb03 kernel: (6624,2):o2hb_do_disk_heartbeat:770 ERROR: Device "dm-24": another node is heartbeating in our slot!

The errors stopped when the node came back up. However, after another snapshot was taken, the errors are back, and I'm afraid a node will reboot again when the LUN snapshot gets unpresented. Here are the steps that happen to generate the errors.

After unmounting and deleting the LUN that contains the snapshot, I receive:

Oct 22 03:15:43 ausracdb03 multipathd: dm-28: umount map (uevent)
Oct 22 03:15:44 ausracdb03 kernel: ocfs2_hb_ctl[7721]: segfault at 0000000000000000 rip 0000000000428fa0 rsp 00007fff88a7efb8 error 4
Oct 22 03:15:44 ausracdb03 kernel: ocfs2: Unmounting device (253,28) on (node 2)

The kernel will then sense that all SCSI paths to the device are gone, and multipathd will then mark all paths as down, which seems correct behavior.
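One way to avoid the write-timeout/reboot sequence Daniel describes is to tear the device down on the host, in order, before unpresenting the LUN on the array, so that neither o2hb nor multipathd is still touching it. A sketch, where the mount point, map alias, and path device names are all hypothetical:

```shell
# 1. Unmount the snapshot filesystem first, so o2hb stops
#    heartbeating on the region before the device disappears
umount /snapmount            # hypothetical mount point

# 2. Flush and remove the multipath map (dm-28 in the log above)
multipath -f snapmap         # 'snapmap' is a hypothetical map alias

# 3. Delete each underlying SCSI path so the kernel forgets the device
#    before the LUN is unpresented on the SAN side
for dev in sdcj sdck; do     # hypothetical path devices
        echo 1 > /sys/block/$dev/device/delete
done
```

Only after step 3 completes is it safe to remove the LUN presentation on the array; otherwise the heartbeat thread can hit the 120000 ms write timeout and fence the node.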
After creating and presenting a new snapshot, multipath will now see the paths reappear, which also seems normal behavior:

Oct 22 03:16:06 ausracdb03 multipathd: sdcj: tur chec