Failed To Ping @tcp Input/output Error
OSTs fails after format with error -110?

OK, I did a little more research and I found that I could increase the verbosity of the LNET debugging output by doing the following:

    echo +neterror > /proc/sys/lnet/printk

So, I did that and tried one of the failing "lctl ping" commands again:

    [root@lustre-mgs ~]# lctl ping 192.168.1.101@tcp0
    failed to ping 192.168.1.101@tcp: Input/output error
    [root@lustre-mgs ~]#

Here's what I see now in dmesg:

    [174224.584669] LNet: 3900:0:(lib-socket.c:626:lnet_sock_connect()) Error -113 connecting 0.0.0.0/1023 -> 192.168.1.101/988
    [174224.584692] LNet: 3900:0:(acceptor.c:114:lnet_connect_console_error()) Connection to 192.168.1.101@tcp at host 192.168.1.101 was unreachable: the network or that node may be down, or Lustre may be misconfigured.
    [174224.584711] LNet: 3900:0:(socklnd_cb.c:424:ksocknal_txlist_done()) Deleting packet type 2 len 0 192.168.1.100@tcp->192.168.1.101@tcp

I understand tcp is just a synonym for tcp0 so I think that's okay ...

Network configuration on each of these machines is very simple; only one interface on any of them is up and running; one port on an Intel X520 10 Gig NIC; I have LNET configured in /etc/modprobe.d/lustre.conf on i.e. the MGS as so:

    options lnet networks=tcp0(p1p2)

That's correct, yes? In this case, p1p1 and p1p2 are the two 10 Gig NIC ports ... I don't know why RHEL uses such funky names ... But very basic, no routing, not even multiple interfaces ...

Continuing to research ...

I assume error -113 in this case is just a generic "connection failure" type error although if something could be deduced from that, it would certainly be great :O

Thanks,

Sean

On Thu, Jul 2, 2015 at 4:47 PM, Sean Caron
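On the question of whether anything can be deduced from -113: Linux kernel messages report failures as negated errno values, so -113 is EHOSTUNREACH ("No route to host") and the -110 in the subject line is ETIMEDOUT. That is more specific than a generic connection failure and usually points at routing, ARP, or a firewall rejecting the connection, though the message alone does not say which. A minimal sketch for pulling the code and endpoints out of such a dmesg line follows; the function name `parse_connect_error` and the small errno table are illustrative assumptions, not part of Lustre.

```python
import re

# Linux errno numbers (asm-generic values), hardcoded so the lookup does not
# depend on the operating system this script happens to run on.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    110: ("ETIMEDOUT", "Connection timed out"),
    111: ("ECONNREFUSED", "Connection refused"),
    113: ("EHOSTUNREACH", "No route to host"),
}

# Matches the lnet_sock_connect() message format quoted in the post above.
CONNECT_ERROR_RE = re.compile(
    r"Error (-?\d+) connecting (\S+)/(\d+) -> (\S+)/(\d+)"
)


def parse_connect_error(line):
    """Return a dict describing an LNet connect error line, or None."""
    match = CONNECT_ERROR_RE.search(line)
    if match is None:
        return None
    code = abs(int(match.group(1)))  # kernel logs the negated errno
    name, text = LINUX_ERRNO.get(code, ("UNKNOWN", "not in this table"))
    return {
        "errno": code,
        "name": name,
        "text": text,
        "src": match.group(2),
        "src_port": int(match.group(3)),
        "dst": match.group(4),
        "dst_port": int(match.group(5)),
    }


if __name__ == "__main__":
    sample = (
        "[174224.584669] LNet: 3900:0:(lib-socket.c:626:lnet_sock_connect()) "
        "Error -113 connecting 0.0.0.0/1023 -> 192.168.1.101/988"
    )
    print(parse_connect_error(sample))
```

The parsed destination port (988) is the one worth checking against any host firewall on the target node.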
https://lists.01.org/pipermail/hpdd-discuss/2015-July/002458.html

EIO on freshly restarted node

https://jira.hpdd.intel.com/browse/LU-1394

Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: Lustre 2.3.0, Lustre 2.1.1
Fix Version/s: None
Labels: None
Severity: 3
Rank (Obsolete): 4035

Description

I have two nodes, "oss1" and "mds1". I can lctl ping from oss1 to mds1 with no problem, proving basic infrastructure is correct. If I kill mds1 (i.e. pull the power cord) and boot it back up and start lnet, the first lctl ping from oss1 to it will fail with an EIO. A subsequent lctl ping will succeed.

In terms of network traffic when this happens, 3 TCP sessions from oss1 to mds1 are established, serially - that is 1 is opened and then closed, 2 is opened and then closed and 3 is opened and left open. The first two exchange a few packets each and close down gracefully. The third seems to be much longer lived and is in fact the session where both the failed and then the successful lctl ping happen and it continues to live on.

Clearly this problem is not an artefact of TCP or IP and is entirely an artefact of LNET itself, so it seems that LNET ought to be able to handle this situation more gracefully.

Brian Murrell added a comment - 09/May/12 12:51 PM

I should expand a bit on this and outline the real problem we are seeing. In the above scenario, if "mds1" is the MGS, registration of a new OST can fail to reach this freshly rebooted MGS and results in:

    # mount -t lustre -o loop /var/tmp/test_ost /mnt/lustre/ost3
    mount.lustre: mount /dev/loop1 at /mnt/lustre/ost3 failed: Input/output error
    Is the MGS running?

The syslog from this:

    May 9 16:40:11 oss1 kernel: LDISKFS-fs (loop1): mounted filesystem with ordered data mode. Opts:
    May 9 16:40:11 oss1 kernel: LDISKFS-fs (loop1): mounted filesystem with ordered data mo
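Since the ticket reports that the first lctl ping after the peer restarts fails with EIO and a subsequent one succeeds, a script that depends on the peer being reachable could retry the ping before giving up. The sketch below is a workaround idea only, not a fix for the LNET behaviour described in the ticket. It assumes that `lctl ping` exits with a non-zero status on failure, which is not confirmed by the text above; the helper names and the NID in the usage line are hypothetical.

```python
import subprocess
import time


def lctl_ping(nid):
    """Run `lctl ping <nid>` once.

    Assumption: a non-zero exit status means the ping failed.
    """
    result = subprocess.run(
        ["lctl", "ping", nid], capture_output=True, text=True
    )
    return result.returncode == 0


def ping_with_retry(nid, attempts=3, delay=1.0, ping=lctl_ping, sleep=time.sleep):
    """Return the 1-based attempt number that succeeded, or None.

    `ping` and `sleep` are injectable so the logic can be exercised
    without a Lustre installation.
    """
    for attempt in range(1, attempts + 1):
        if ping(nid):
            return attempt
        if attempt < attempts:
            sleep(delay)  # brief pause before the next try
    return None


if __name__ == "__main__":
    # Hypothetical NID for the restarted node.
    print(ping_with_retry("192.168.1.250@tcp"))
```

A result of 2 would match the pattern in the ticket: one failed ping, then success on the same connection.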
Apr 13, 2007, at 10:49 PM, Scott Atchley wrote:

Hi all,

I am trying to set up Lustre using TCP. I have the following in /etc/modprobe.conf:

    options lnet networks="tcp0(eth2)"

to specify the third NIC only. I have two OSSs and one MDS. They start up and see each other fine. My XML is pasted below.

When I try to have a client start with:

    # lconf --node client lustre-fs.xml

it hangs at:

    + mount -t lustre_lite -o osc=lov1,mdc=MDC_compute-0-1.local_mds1_MNT_client lustre-fs /mnt/lustre

If I check its NIDs, I see:

    # cat /proc/sys/lnet/nis
    nid               refs peer max  tx min
    [EMAIL PROTECTED]    2    0   0   0   0
    [EMAIL PROTECTED]    2    8 256 256 255

which is the correct address for this client.

If instead of using lconf, I simply modprobe lnet and run lctl, then I try to ping the MDS. It fails:

    # lctl
    lctl > network up
    LNET configured
    lctl > network tcp
    lctl > ping 192.168.1.250
    failed to ping [EMAIL PROTECTED]: Input/output error

Yet, I can ping the node on the command line:

    # ping -s 9000 192.168.1.250
    PING 192.168.1.250 (192.168.1.250) 9000(9028) bytes of data.
    9008 bytes from 192.168.1.250: icmp_seq=0 ttl=64 time=0.142 ms

    # ping -s 9000 nas-0-0-m
    PING nas-0-0-m.local (192.168.1.250) 9000(9028) bytes of data.
    9008 bytes from nas-0-0-m.local (192.168.1.250): icmp_seq=0 ttl=64 time=0.111 ms

If I try using lctl on the MDS to ping the client, it fails as well but I can ping the two OSSs.

Any suggestions?

Thanks,

Scott

https://www.mail-archive.com/lustre-discuss@clusterfs.com/msg00855.html
http://lustre-discuss.clusterfs.narkive.com/VPPOWBg3/configuring-lustre-routring-between-two-tcp-networks
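A successful ICMP ping only shows that the host answers at the IP layer; lctl ping over a tcp network needs a TCP connection to the peer, and the LNet log line quoted earlier on this page ("connecting 0.0.0.0/1023 -> 192.168.1.101/988") shows port 988 as the destination. One way to narrow things down is to try a plain TCP connect to that port from the failing node. The sketch below assumes 988 is the port in use on your systems and uses a hypothetical helper name, `probe_tcp`. It checks TCP reachability only: it connects from an unprivileged source port and does not speak the LNet protocol, so a successful probe does not prove that lctl ping will work.

```python
import errno
import socket

# Destination port seen in the LNet connect message; assumed to be the
# port the peer's LNet acceptor listens on.
LNET_ACCEPTOR_PORT = 988


def probe_tcp(host, port=LNET_ACCEPTOR_PORT, timeout=3.0):
    """Try a TCP connect; return (True, None) or (False, errno name)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except socket.timeout:
        # No answer at all within the timeout (e.g. packets dropped).
        return False, "ETIMEDOUT"
    except OSError as exc:
        # e.g. ECONNREFUSED (nothing listening) or EHOSTUNREACH.
        return False, errno.errorcode.get(exc.errno, str(exc))


if __name__ == "__main__":
    # Address of the MDS from the post above.
    print(probe_tcp("192.168.1.250"))
```

Roughly, a refused connection suggests nothing is listening on the port, while an unreachable or timed-out result suggests filtering or a routing problem between the two interfaces.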